Fix torch.export.export() GPU failure with RNN modules. #155734
base: main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155734
Note: Links to docs will display an error until the docs builds have been completed.
❌ 11 New Failures as of commit 2a5c538 with merge base 3a56237.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "release notes: nn"
torch/nn/modules/rnn.py (Outdated)

    if len(unique_data_ptrs) != len(self._flat_weights):
        return
    try:
        from torch.multiprocessing.reductions import StorageWeakRef
The StorageWeakRef implementation doesn't actually look multiprocessing-specific, so you should move it out of there. It is also not good practice to do local imports like this. @bdhirsh, what do you think about defining it in torch/__init__.py?
Here would be the right place (line 322 at commit 43a0918):

    class TensorWeakRef:
@tugsbayasgalan @albanD I've moved StorageWeakRef to torch.utils.weak as suggested, and updated all relevant imports across the codebase. Let me know if there’s anything else you'd like me to adjust.
torch/nn/modules/rnn.py (Outdated)

        }
        if len(unique_storage_refs) != len(self._flat_weights):
            return
    except Exception:
Need a test case.
Ping!
Working on it today.
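A test along the lines requested above might look like this (a sketch with an illustrative test name, not the PR's actual test; it exercises a CPU analogue of the failure path by constructing an RNN module under FakeTensorMode, where parameters have no real backing memory):

```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode


def test_flatten_parameters_with_fake_tensors():
    # torch.export.export() traces with fake tensors. Building and
    # flattening an LSTM under FakeTensorMode goes through the same
    # flatten_parameters() path; the old p.data_ptr() aliasing check
    # raised "Cannot access data pointer of Tensor" on fake tensors,
    # while the fixed check should simply skip flattening.
    with FakeTensorMode():
        lstm = nn.LSTM(input_size=4, hidden_size=8)
        lstm.flatten_parameters()  # must not raise
    return True
```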
torch/nn/modules/rnn.py (Outdated)

    except Exception:
        # Fallback for cases where StorageWeakRef is not available or fails
        # This maintains PT2 compatibility by skipping aliasing check
        pass
Why is it ok to just pass here?
Yep we should at least narrow down the exception.
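A narrowed handler could be sketched like this (the helper name and the matched message are illustrative assumptions, not the PR's final code):

```python
import torch
from torch.multiprocessing.reductions import StorageWeakRef


def weights_share_storage(weights):
    # Illustrative helper: detect whether any two weight tensors alias
    # the same underlying storage, swallowing only the specific failure
    # mode seen with fake tensors instead of a blanket `except Exception`.
    try:
        refs = {StorageWeakRef(w.untyped_storage()) for w in weights}
    except RuntimeError as e:
        # Fake/meta tensors have no accessible data pointer; treat only
        # that case as "cannot tell, assume no aliasing".
        if "data pointer" not in str(e):
            raise
        return False
    return len(refs) != len(weights)
```

Anything other than the expected RuntimeError still propagates, which addresses the concern about silently passing.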
    @@ -21,43 +21,6 @@
    pass


    class StorageWeakRef:
It is NOT ok to remove public APIs without appropriate care.
In this case, I don't think it is worth breaking existing user code by removing this altogether.
You can import StorageWeakRef from torch.utils.weak here to preserve the current public API.
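The suggested re-export might be sketched as follows (a sketch only; the try/except lets it run whether or not the move of StorageWeakRef to torch.utils.weak has landed in a given build):

```python
# Keep StorageWeakRef importable from its historical public location so
# existing `from torch.multiprocessing.reductions import StorageWeakRef`
# call sites keep working after the class moves to torch.utils.weak.
try:
    from torch.utils.weak import StorageWeakRef  # noqa: F401
except ImportError:
    # Move not yet landed in this build; the class is still defined in
    # its original home.
    from torch.multiprocessing.reductions import StorageWeakRef  # noqa: F401
```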
Imported, thanks for pointing this out.
Looks like some tests are still failing.
    @@ -336,3 +337,40 @@ def __call__(self):
        # TODO, add _fix_weakref type binding
        out._fix_weakref()  # type: ignore[attr-defined]
        return out


    class StorageWeakRef:
FYI @ezyang I forgot we had this in serialization. We should migrate this like the above when we have a minute, now that we have pyobject preservation for Storage.
Ah yes. Sounds LLM amenable, file an issue maybe?
@albanD @tugsbayasgalan Can we make sure this lands? This has definitely been a huge irritation in the past.
CI needs fixing, happy to review again after that.
@penknife6153 just wondering if you plan to work on this further?
@tugsbayasgalan been sick for the past few weeks, but I'm currently working on this.
@penknife6153 Just checking in to see if you've made any progress on this PR. Since the ONNX team plans to switch to export-based graph capture very soon, this item needs to be closed sooner, I think (cc: @titaiwangms).
We need this by 2.9.
Hi @tugsbayasgalan and @titaiwangms! I've drafted some tests for RNN, LSTM, and GRU, added error handling at RNNBase, and resolved the merge conflicts. Aiming to have everything finalized within this weekend.
        }
    except Exception as e:
        if isinstance(e, RuntimeError) and "share storage" in str(e):
            raise  # Re-raise actual aliasing errors
Sus
torch.export.export() uses fake tensors during graph tracing, which don't have actual memory storage. The RNN module's flatten_parameters() method was calling p.data_ptr() for aliasing detection, causing "Cannot access data pointer of Tensor" errors on GPU.

Fixes #155309

Changes:
- Replaced p.data_ptr() with StorageWeakRef(p.untyped_storage()) for PT2 compatibility

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @tugsbayasgalan
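The swap the PR description refers to can be illustrated with a before/after sketch (simplified from flatten_parameters(); the helper names are hypothetical):

```python
import torch
from torch.multiprocessing.reductions import StorageWeakRef


def weights_are_distinct_before(weights):
    # Before: raw data pointers. This raises "Cannot access data pointer
    # of Tensor" when `weights` are the fake tensors torch.export uses.
    return len({w.data_ptr() for w in weights}) == len(weights)


def weights_are_distinct_after(weights):
    # After: weak references to the underlying storages. This avoids
    # touching the data pointer, which is what failed during export.
    return len({StorageWeakRef(w.untyped_storage()) for w in weights}) == len(weights)
```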