Add HostAllocator as the unified parent class #151431
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151431
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures as of commit aa2a8fd with merge base 107121d.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```diff
@@ -343,7 +345,7 @@ struct CachingHostAllocatorImpl {
     TORCH_CHECK_NOT_IMPLEMENTED(false, "Not implemented for copy_data");
   }

-  HostStats getStats() {
+  HostStats get_stats() {
```
I'm not sure about renaming all these functions. CamelCase is pretty standard in our C++ codebase, so I think it's best to avoid the churn of having to change call sites!
@guangyey It's not necessary to rename all these functions, right?
OK. I've changed the function names in `CachingHostAllocatorImpl` back to their original CamelCase style. However, since the APIs in `HostAllocator` will be public, I don't want users to have to remember which APIs use CamelCase and which use snake_case.

To ensure a consistent and user-friendly API experience, I suggest we keep using snake_case for `HostAllocator`, aligning with the style already used in its parent `Allocator`. This would promote better consistency and usability for users. What do you think?
```diff
-TORCH_XPU_API bool CachingHostAllocator_recordEvent(
+inline TORCH_XPU_API bool CachingHostAllocator_recordEvent(
```
Why isn't this function taking a `c10::Stream` directly? I also don't see it being used anywhere within this repo.
This is a device-specific API that will be deprecated in the next PR. We would like to recommend that users use the unified API `at::getHostAllocator(at::kXPU)->record_event(...)`, which accepts a `c10::Stream` as an input parameter.
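A minimal sketch of what that migration could look like. The header path and exact `record_event` signature here are assumptions based on this discussion, not confirmed API:

```cpp
#include <ATen/core/CachingHostAllocator.h> // assumed home of at::getHostAllocator

// Hypothetical helper: record a stream event for a pinned host block so the
// caching allocator does not recycle it before an async copy completes.
bool record_host_block(void* ptr, void* ctx, c10::Stream stream) {
  // Unified, device-agnostic call replacing the XPU-specific
  // CachingHostAllocator_recordEvent entry point.
  return at::getHostAllocator(at::kXPU)->record_event(ptr, ctx, stream);
}
```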
It is used in the torch-xpu-ops repo by CopyKernel to ensure the correctness of non-blocking copies.
```diff
@@ -18,25 +18,38 @@ namespace at::cuda {
 // call between host and device, and passed the corresponding context from the
 // allocation. This is currently invoked by at::native::copy_kernel_cuda.
 //
-TORCH_CUDA_CPP_API c10::Allocator* getCachingHostAllocator();
+inline TORCH_CUDA_CPP_API at::HostAllocator* getCachingHostAllocator() {
```
Why do we call this a "caching" host allocator in these APIs rather than just `getHostAllocator`? From what I can tell, all accesses to this just need a host allocator. Do you think we could simplify the naming here by removing the "caching" concept from the user APIs?
The code change here is just for backward compatibility. I plan to deprecate the device-specific APIs in this file in the next PR and recommend that users move to the new unified APIs instead.
```cpp
    at::HostAllocator* allocator,
    uint8_t priority = 0);

TORCH_API at::HostAllocator* getHostAllocator(at::DeviceType device_type);
```
Do we really need a per-device lookup here?
Can we rely on the fact that there is a single accelerator right now?
Sounds reasonable. @guangyey, the default accelerator should be able to tell us which device type to use now, right?
I think a per-device lookup design is safer. From an implementation perspective, I recall that devices like `MTIA` or `PrivateUse1` can co-exist with a CUDA-enabled binary, since `getAccelerator()` is resolved at runtime rather than at build time for them. In practice, users can combine `getAccelerator()` with `getHostAllocator()` to get single-accelerator behavior, as in the sketch below.
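A minimal sketch of that combination, assuming `at::getAccelerator` returns an optional device type (its current shape) and the header locations shown:

```cpp
#include <ATen/DeviceAccelerator.h>         // at::getAccelerator (assumed header)
#include <ATen/core/CachingHostAllocator.h> // at::getHostAllocator (assumed header)

// Resolve the active accelerator at runtime, then fetch its host allocator.
at::HostAllocator* currentHostAllocator() {
  // With checked=true, getAccelerator throws if no accelerator is available,
  // so calling value() on the returned optional is safe here.
  const auto device_type = at::getAccelerator(/*checked=*/true).value();
  return at::getHostAllocator(device_type);
}
```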
I guess it is, ok!
Ok ok!
Thanks for the clarifications!
@pytorchbot merge -i

"These failures are irrelevant."

Merge started. Your change will be merged while ignoring the following 4 checks: xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 2, 4, linux.idc.xpu), xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 3, 4, linux.idc.xpu), xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 1, 4, linux.idc.xpu), xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 4, 4, linux.idc.xpu). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
# Motivation

This PR aims to deprecate the legacy host allocator APIs and recommend users adopt the unified `getHostAllocator(device_type)` APIs, such as:

```cpp
at::getHostAllocator(device_type)->allocate(...);
at::getHostAllocator(device_type)->empty_cache();
at::getHostAllocator(device_type)->record_event(...);
at::getHostAllocator(device_type)->get_stats();
at::getHostAllocator(device_type)->reset_accumulated_stats();
at::getHostAllocator(device_type)->reset_peak_stats();
```

# Additional Context

TODO:
- [ ] Move `is_pinned` from `AcceleratorHooksInterface` to `HostAllocator`
- [ ] Deprecate `getPinnedMemoryAllocator` inside `AcceleratorHooksInterface` and recommend using `getHostAllocator` instead.

Pull Request resolved: #151437
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #151403, #151431
ghstack-source-id: cd02580
Pull Request resolved: pytorch/pytorch#151431
Stack from ghstack (oldest at bottom):
# Motivation

This PR introduces a unified parent class `HostAllocator`. The new interface includes:

- `at::getHostAllocator()->allocate`
- `at::getHostAllocator()->empty_cache`
- `at::getHostAllocator()->record_event`
- `at::getHostAllocator()->get_stats`
- `at::getHostAllocator()->reset_accumulated_stats`
- `at::getHostAllocator()->reset_peak_stats`
# Additional Context

We plan to deprecate legacy APIs such as `at::cuda::CachingHostAllocator_emptyCache` and recommend users migrate to the new unified API; a before/after sketch follows below. This refactor will help standardize host memory management across devices and simplify backend integration in the future.
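A minimal before/after sketch of the intended migration; the legacy call name comes from this description, while the header path for the unified accessor is an assumption:

```cpp
#include <ATen/core/CachingHostAllocator.h> // assumed home of at::getHostAllocator

void release_cached_pinned_memory() {
  // Legacy, CUDA-specific entry point (slated for deprecation):
  //   at::cuda::CachingHostAllocator_emptyCache();

  // Unified replacement, parameterized by device type:
  at::getHostAllocator(at::kCUDA)->empty_cache();
}
```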
Another key improvement I am going to make is moving the `is_pinned` functionality into the `HostAllocator` class, which enables centralized pinned-memory verification through calls like `at::getHostAllocator(at::kCUDA)->is_pinned(ptr)`. Benefits include:

- Supporting `AcceleratorHooksInterface` in a more modular way

This architecture makes the system more maintainable and extensible for future device support.
cc @albanD @EikanWang