Skip to content

Update MAIAHooksInterface to pin host memory in MAIA device #155541

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

trajepl
Copy link
Contributor

@trajepl trajepl commented Jun 10, 2025

This pull request introduces support for the MAIA device type in the PyTorch codebase by adding hooks, detection mechanisms, and interface updates. The changes ensure that MAIA is integrated alongside other device types like CUDA and XPU. Below is a summary of the key changes:

Integration of MAIA Device Type:

  • Device Hooks and Detection:

    • Added MAIA hooks in the Context class to retrieve device-specific hooks and check device availability (hasMAIA). ([[1]](https://github.com/pytorch/pytorch/pull/155541/files#diff-468e440eb305e0a440c3d600087003dab97dc368fba20ebedb85e68b99f02e63R64-R65), [[2]](https://github.com/pytorch/pytorch/pull/155541/files#diff-468e440eb305e0a440c3d600087003dab97dc368fba20ebedb85e68b99f02e63R142-R144))
    • Removed a redundant hasMAIA implementation that used DeviceGuardImpl. ([aten/src/ATen/Context.hL182-L184](https://github.com/pytorch/pytorch/pull/155541/files#diff-468e440eb305e0a440c3d600087003dab97dc368fba20ebedb85e68b99f02e63L182-L184))
    • Updated DeviceAccelerator.cpp to detect and validate MAIA as an accelerator device. ([[1]](https://github.com/pytorch/pytorch/pull/155541/files#diff-18f0936bc5ffb161d84cf5e7fd707f633eaedf27ba4f86bd51b2adbe47e71f25R45), [[2]](https://github.com/pytorch/pytorch/pull/155541/files#diff-18f0936bc5ffb161d84cf5e7fd707f633eaedf27ba4f86bd51b2adbe47e71f25R62))
  • MAIA Hooks Interface:

    • Introduced a MAIAHooksInterface in MAIAHooksInterface.h with methods for initialization, context checks, and memory management. ([[1]](https://github.com/pytorch/pytorch/pull/155541/files#diff-441db538a658d9b86c83e8093e6116576f92cc6d7e67b3f90741169c2f68badfR3), [[2]](https://github.com/pytorch/pytorch/pull/155541/files#diff-441db538a658d9b86c83e8093e6116576f92cc6d7e67b3f90741169c2f68badfR12-R43))
    • Added a constant MAIA_HELP message for error handling and replaced hardcoded error strings with this constant. ([aten/src/ATen/detail/MAIAHooksInterface.hR12-R43](https://github.com/pytorch/pytorch/pull/155541/files#diff-441db538a658d9b86c83e8093e6116576f92cc6d7e67b3f90741169c2f68badfR12-R43))

Copy link

pytorch-bot bot commented Jun 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155541

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit e3237f0 with merge base 12b0213 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copy link

linux-foundation-easycla bot commented Jun 10, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@trajepl
Copy link
Contributor Author

trajepl commented Jun 10, 2025

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jun 10, 2025
@zou3519 zou3519 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jun 10, 2025
@zou3519 zou3519 requested a review from albanD June 10, 2025 13:35
@guangyey
Copy link
Collaborator

@albanD Would it be reasonable to move is_pinned to HostAllocator at this point, similar to what was done in #151439?

@albanD
Copy link
Collaborator

albanD commented Jun 12, 2025

I'm a bit concerned by the direction here actually.
If the intent is to make this a full fledged device and not what ORT was before, then I expect we are embarking here into a multi-month project to add all the scaffolding into the core lib.
We should discuss in more details before starting on that path, especially given that we now have fully working out-of-tree extension point where all this scaffolding is done and are ready for use today (see e2e example in https://github.com/pytorch/pytorch/tree/main/test/cpp_extensions/open_registration_extension )

Let's sync on an issue or slack about this before you start working on more of these PRs.

@trajepl
Copy link
Contributor Author

trajepl commented Jun 12, 2025

I'm a bit concerned by the direction here actually.
If the intent is to make this a full fledged device and not what ORT was before, then I expect we are embarking here into a multi-month project to add all the scaffolding into the core lib. We should discuss in more details before starting on that path, especially given that we now have fully working out-of-tree extension point where all this scaffolding is done and are ready for use today (see e2e example in https://github.com/pytorch/pytorch/tree/main/test/cpp_extensions/open_registration_extension )

Let's sync on an issue or slack about this before you start working on more of these PRs.

Thanks for raising this @albanD — I understand your concern and agree that long-term, using the open registration (PrivateUse1) mechanism is the right direction for out-of-tree backend integration.

That said, our current backend has already been deeply integrated using the older in-tree approach. We have implemented a wide range of kernels and system components (memory, distributed, collective, dispatch, etc.) based on this, and the transition to PrivateUse1 is non-trivial for us in the short term.

We’re not opposed to making this shift, but given the amount of scaffolding already built, we would need to plan and migrate incrementally. We're happy to align with the openReg approach long-term and would appreciate guidance or best practices to gradually restructure our integration around PrivateUse1.

Let’s sync up in this issue to discuss a possible migration path that minimizes disruption to our current functionality while moving toward the recommended direction.

cc @wschin

@wschin
Copy link
Collaborator

wschin commented Jun 17, 2025

@albanD, as you saw above, we built our code on previous integration approach, so switching to OpenReg will take months to make this simple change happen. From PRs to OpenReg, it seems it's still under development without clear documents. We are investigating for a feasible migration plan to OpenReg. To unblock our production code and lower our risk, could you please re-consider if this PR can be accepted?

Some concerns for OpenReg: #155864 (comment)

@trajepl
Copy link
Contributor Author

trajepl commented Jun 23, 2025

Move to use private_use1. Close this PR. More details can be found in #155864

@trajepl trajepl closed this Jun 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
open source topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants