Add HostAllocator as the unified parent class #151431

Closed
wants to merge 11 commits

Conversation

Collaborator

@guangyey guangyey commented Apr 16, 2025

Stack from ghstack (oldest at bottom):

Motivation

This PR introduces a unified parent class HostAllocator with the following goals:

  1. Enable backend-specific host allocator registration, including support for out-of-tree backends.
  2. Provide a unified and extensible API surface for host memory management across all backends, especially accelerators.

The new interface includes the following methods (see the sketch after this list):

  • at::getHostAllocator()->allocate
  • at::getHostAllocator()->empty_cache
  • at::getHostAllocator()->record_event
  • at::getHostAllocator()->get_stats
  • at::getHostAllocator()->reset_accumulated_stats
  • at::getHostAllocator()->reset_peak_stats

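For illustration, a rough sketch of what the unified parent class could look like, based on the method list above. The exact base class, export macros, and signatures here are assumptions rather than the merged code, and the registration helper name is illustrative:

```cpp
#include <cstdint>
#include <c10/core/Allocator.h>
#include <c10/core/Stream.h>

namespace at {

// Aggregate counters for the pinned-memory cache (the real HostStats carries
// more detail; this placeholder just keeps the sketch self-contained).
struct HostStats {};

struct HostAllocator : public c10::Allocator {
  // allocate(nbytes) is inherited from c10::Allocator and returns pinned host memory.

  // Mark the block at `ptr` (with allocator context `ctx`) as in use by `stream`,
  // so it is not recycled until the stream's pending work completes.
  virtual bool record_event(void* ptr, void* ctx, c10::Stream stream) = 0;

  // Release all cached pinned blocks back to the system.
  virtual void empty_cache() = 0;

  // Cache introspection and bookkeeping.
  virtual HostStats get_stats() = 0;
  virtual void reset_accumulated_stats() = 0;
  virtual void reset_peak_stats() = 0;
};

// Per-backend registration (usable by out-of-tree backends) and lookup.
void setHostAllocator(c10::DeviceType device_type, HostAllocator* allocator, uint8_t priority = 0);
HostAllocator* getHostAllocator(c10::DeviceType device_type);

} // namespace at
```

Concrete backends (CUDA, XPU, out-of-tree devices) would subclass this and register themselves, so at::getHostAllocator(device_type) returns the right implementation.
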
Additional Context

We plan to deprecate legacy APIs such as at::cuda::CachingHostAllocator_emptyCache and recommend users migrate to the new backend-specific API, for example:

at::getHostAllocator(at::kCUDA)->empty_cache();

This refactor will help standardize host memory management across devices and simplify backend integration in the future.
Another key improvement I plan to make is moving the is_pinned functionality into the HostAllocator class, which enables centralized pinned memory verification through calls like at::getHostAllocator(at::kCUDA)->is_pinned(ptr).
Benefits include:

  • Consistent host memory handling across all device backends
  • Decouples pinned memory functionality from AcceleratorHooksInterface in a more modular way
  • Clearer separation between device memory allocation and pinned host memory management

This architecture makes the system more maintainable and extensible for future device support.
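For a concrete feel, here is a minimal usage sketch on CUDA. The record_event signature and the is_pinned call reflect the interface list and the planned follow-up described above, so treat them as assumptions rather than final code:

```cpp
#include <ATen/ATen.h>
#include <c10/cuda/CUDAStream.h>

void staging_buffer_example(const at::Tensor& gpu_tensor) {
  auto* host_alloc = at::getHostAllocator(at::kCUDA);

  // Allocate pinned host memory through the unified interface.
  at::DataPtr staging = host_alloc->allocate(gpu_tensor.nbytes());

  // ... launch an asynchronous device-to-host copy into staging.get() ...

  // Tell the allocator the block is in use by the current stream, so it is
  // not recycled before the copy finishes.
  c10::Stream stream = c10::cuda::getCurrentCUDAStream().unwrap();
  host_alloc->record_event(staging.get(), staging.get_context(), stream);

  // Planned follow-up: centralized pinned-memory verification.
  // bool pinned = host_alloc->is_pinned(staging.get());

  // Later, return cached blocks to the system if host memory is tight.
  host_alloc->empty_cache();
}
```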

cc @albanD @EikanWang

pytorch-bot bot commented Apr 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151431

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures

As of commit aa2a8fd with merge base 107121d:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@guangyey guangyey added the ciflow/xpu, ciflow/trunk, and release notes: cpp labels on Apr 16, 2025
@guangyey guangyey added the module: accelerator label on Apr 16, 2025
@guangyey guangyey requested a review from albanD April 16, 2025 10:50
@@ -343,7 +345,7 @@ struct CachingHostAllocatorImpl {
TORCH_CHECK_NOT_IMPLEMENTED(false, "Not implemented for copy_data");
}

-HostStats getStats() {
+HostStats get_stats() {
Collaborator

I'm not sure about renaming all these functions. CamelCase is pretty standard in our C++ codebase, so I think it's best to avoid the churn of having to change call sites!

Collaborator

@guangyey It's not necessary to rename all these functions, right?

Collaborator Author

@guangyey guangyey Apr 17, 2025

OK. I’ve changed the function names in CachingHostAllocatorImpl back to their original CamelCase style. However, since the APIs in HostAllocator will be public, I don’t want users to have to remember which APIs use CamelCase and which use snake_case.
To ensure a consistent and user-friendly API experience, I suggest we keep using snake_case for HostAllocator, aligning with the style already used in its parent Allocator. This would promote better consistency and usability for users. What do you think?


-TORCH_XPU_API bool CachingHostAllocator_recordEvent(
+inline TORCH_XPU_API bool CachingHostAllocator_recordEvent(
Collaborator

Why isn't this function taking a c10::Stream directly? Also, I don't see it being used anywhere within this repo.

Collaborator Author

This is a device-specific API that will be deprecated in the next PR. We would like to recommend that users use the unified API at::getHostAllocator(at::kXPU)->record_event(...), which accepts a c10::Stream as an input parameter.
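As a rough before/after sketch of that migration (the ptr/ctx arguments are placeholders, and the XPU stream helper plus the exact record_event signature are assumptions here):

```cpp
#include <ATen/ATen.h>
#include <c10/xpu/XPUStream.h>

void record_host_block_in_use(void* ptr, void* ctx) {
  // Old, backend-specific entry point (slated for deprecation in a follow-up PR):
  // at::xpu::CachingHostAllocator_recordEvent(ptr, ctx, c10::xpu::getCurrentXPUStream());

  // New, unified entry point taking a generic c10::Stream:
  c10::Stream stream = c10::xpu::getCurrentXPUStream().unwrap();
  at::getHostAllocator(at::kXPU)->record_event(ptr, ctx, stream);
}
```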

Collaborator Author

It is used in the torch-xpu-ops repo for CopyKernel to ensure the correctness of non-blocking copies.

@@ -18,25 +18,38 @@ namespace at::cuda {
// call between host and device, and passed the corresponding context from the
// allocation. This is currently invoked by at::native::copy_kernel_cuda.
//
-TORCH_CUDA_CPP_API c10::Allocator* getCachingHostAllocator();
+inline TORCH_CUDA_CPP_API at::HostAllocator* getCachingHostAllocator() {
Collaborator

Why do we call this a "caching" host allocator in these APIs rather than just getHostAllocator?
From what I can tell, all accesses to this just need a host allocator.

Do you think we could simplify the naming here by removing the "caching" concept from the user APIs?

Collaborator Author

The code change here is just for backward compatibility. I plan to deprecate these device-specific APIs in this file in the next PR and recommend that users switch to the new unified APIs instead.
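For context, the backward-compatibility shim presumably just forwards to the unified lookup; a sketch of that shape (the body here is my reading of the inline declaration in the diff, not necessarily the merged code):

```cpp
#include <ATen/ATen.h>  // for at::getHostAllocator / at::HostAllocator (header path may differ)

namespace at::cuda {

// Legacy entry point kept inline for backward compatibility; new code should
// call at::getHostAllocator(at::kCUDA) directly.
inline at::HostAllocator* getCachingHostAllocator() {
  return at::getHostAllocator(at::kCUDA);
}

} // namespace at::cuda
```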

at::HostAllocator* allocator,
uint8_t priority = 0);

TORCH_API at::HostAllocator* getHostAllocator(at::DeviceType device_type);
Collaborator

Do we really need a per-device lookup here?
Can we rely on the fact that there is a single accelerator right now?

Collaborator

Sounds reasonable. @guangyey, can the default accelerator tell us which device type it should be now?

Collaborator Author

@guangyey guangyey Apr 17, 2025

I think a per-device lookup design is safer.
From an implementation perspective, I recall that devices like MTIA or PrivateUser1 can co-exist with a CUDA-enabled binary, since getAccelerator() is resolved at runtime rather than at build time for them.
In practice, users can combine getAccelerator() with getHostAllocator() to get the single-accelerator behavior, for example:
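A small sketch of that combination (assuming at::getAccelerator(bool checked) from ATen/DeviceAccelerator.h returns an optional device type, as I recall):

```cpp
#include <ATen/ATen.h>
#include <ATen/DeviceAccelerator.h>

at::HostAllocator* current_accelerator_host_allocator() {
  // With checked=true this throws if no accelerator is available; otherwise it
  // yields the device type of the single active accelerator (CUDA, XPU, MTIA, ...).
  c10::DeviceType acc = at::getAccelerator(/*checked=*/true).value();
  return at::getHostAllocator(acc);
}
```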

Collaborator

I guess it is, ok!

Collaborator

@albanD albanD left a comment

Ok ok!
Thanks for the clarifications!

@guangyey guangyey moved this to Approved in PyTorch Intel Apr 18, 2025
@guangyey
Collaborator Author

@pytorchbot merge -i

@guangyey
Collaborator Author

"These failures are irrelevant."

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 4 checks: xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 2, 4, linux.idc.xpu), xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 3, 4, linux.idc.xpu), xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 1, 4, linux.idc.xpu), xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 4, 4, linux.idc.xpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@github-project-automation github-project-automation bot moved this from Approved to Done in PyTorch Intel Apr 18, 2025
pytorchmergebot pushed a commit that referenced this pull request Apr 22, 2025
# Motivation
This PR aims to deprecate the legacy host allocator APIs and recommend that users adopt the unified `getHostAllocator(device_type)` APIs, such as:
```cpp
at::getHostAllocator(device_type)->allocate(...);
at::getHostAllocator(device_type)->empty_cache();
at::getHostAllocator(device_type)->record_event(...);
at::getHostAllocator(device_type)->get_stats();
at::getHostAllocator(device_type)->reset_accumulated_stats();
at::getHostAllocator(device_type)->reset_peak_stats();
```

# Additional Context
TODO:
- [ ] Move is_pinned from `AcceleratorHooksInterface` to `HostAllocator`
- [ ] Deprecate `getPinnedMemoryAllocator` inside `AcceleratorHooksInterface` and recommend using `getHostAllocator` instead.

Pull Request resolved: #151437
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #151403, #151431
Divigroup-RAP pushed a commit to Divigroup-RAP/PYTORCH that referenced this pull request Apr 22, 2025
@github-actions github-actions bot deleted the gh/guangyey/137/head branch May 28, 2025 02:18
Labels
ciflow/trunk · ciflow/xpu · Merged · module: accelerator · open source · release notes: cpp
Projects
Status: Done

5 participants