Work around MPSGraph issue in backward pass of nn.ReplicationPad1d/2d #152094
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152094
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit d64f0c6 with merge base ef1d45b.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Branch updated: 97ba670 → 9256e4e
Branch updated: ebd8f3a → 786fd71
Please fix lint; it also looks like it fails on exactly the test you are trying to add.
// We break the tensor into chunks where the problematic dimension is no greater than 2**16 - 1.
// This is reported in https://github.com/pytorch/pytorch/issues/135447.
// Internal radar for MPSGraph: rdar://149853787.
const int64_t max_sub_batch_size = 65535;
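To illustrate the chunking arithmetic behind this constant, here is a minimal Python sketch (not the actual kernel code; the real workaround lives in the C++ MPS backend) of how a dimension of size n can be split into sub-batches no larger than 2**16 - 1:

```python
# Illustrative sketch only: split a dimension of size n into sub-batches,
# each no larger than 2**16 - 1, mirroring the chunking loop in the C++ code.
MAX_SUB_BATCH_SIZE = 2**16 - 1  # 65535

def sub_batch_sizes(n: int, max_size: int = MAX_SUB_BATCH_SIZE) -> list[int]:
    """Return chunk sizes covering n elements, each <= max_size."""
    sizes = []
    remaining = n
    while remaining > 0:
        step = min(remaining, max_size)
        sizes.append(step)
        remaining -= step
    return sizes

# A 65739-element dimension (as in the issue's repro) becomes two chunks.
print(sub_batch_sizes(65739))  # [65535, 204]
```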
Suggested change:
- const int64_t max_sub_batch_size = 65535;
+ constexpr auto max_sub_batch_size = 65535;
Thanks @malfet for the comments. I will follow up on these issues.
Hi @malfet, could you have another look?
I've made the following changes:
- The change you suggested here: #152094 (comment).
- Fixed lint
- Changed test_ReplicationPad*_large in test_nn.py to be marked with @expectedFailureMPSPre15 instead of @expectedFailureMPS. The fix made in this PR addresses the issue with large dimensions, but it exposed another issue on OS versions older than 15, so we use @expectedFailureMPSPre15 to allow these tests to be validated on macOS 15 and above.
- Removed the test cases added to test_mps.py because there were already test cases in test_nn.py to exercise the large dimensions.
Thank you.
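For readers unfamiliar with version-gated expected failures, here is a conceptual sketch of how a decorator like @expectedFailureMPSPre15 can behave. The names and the version check below are illustrative, not PyTorch's actual implementation:

```python
# Hedged sketch: an OS-version-gated expected-failure decorator.
# PyTorch's real @expectedFailureMPSPre15 lives in its test framework;
# this standalone version only illustrates the idea.
import unittest

def expected_failure_pre_15(os_major: int):
    """Mark a test as an expected failure only when os_major < 15."""
    def decorator(fn):
        if os_major < 15:
            return unittest.expectedFailure(fn)
        return fn  # on macOS 15+, the test must actually pass
    return decorator

class PadTests(unittest.TestCase):
    # Simulate running on macOS 14: the failure below is expected.
    @expected_failure_pre_15(os_major=14)
    def test_replication_pad_large(self):
        self.fail("known MPSGraph bug before macOS 15")
```

On a simulated macOS 15+ (os_major >= 15) the decorator is a no-op, so the test would have to pass for the suite to be green.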
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict.
Successfully rebased
Branch updated: 0c523d5 → ed573a0
Branch updated: b7e3c93 → 94564e6
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict.
Fixes pytorch#135447.

When the 3rd-from-last dimension is 2^16 or greater, MPSGraph returns 0 for the pad gradient. To work around this, we break the problematic dimension into chunks with chunk size no greater than 2^16 - 1.

Test case for nn.ReplicationPad1d:

```
shape = [65739, 2, 4]
x_cpu = torch.randn(shape, device='cpu', requires_grad=True)
x_mps = x_cpu.clone().detach().to('mps').requires_grad_(True)
model = torch.nn.ReplicationPad1d((1, 1))
out_cpu = model(x_cpu)
out_mps = model(x_mps)

# backward
g_cpu = torch.randn_like(out_cpu)
g_mps = g_cpu.clone().detach().to('mps').requires_grad_(False)
out_cpu.backward(g_cpu)
out_mps.backward(g_mps)

print(f"{((x_cpu.grad - x_mps.grad.cpu()).abs() > 1e-5).sum() = }")
# Expected Output:
# ((x_cpu.grad - x_mps.grad.cpu()).abs() > 1e-5).sum() = tensor(0)
```

Test case for nn.ReplicationPad2d:

```
shape = [2, 65739, 2, 4]
x_cpu = torch.randn(shape, device='cpu', requires_grad=True)
x_mps = x_cpu.clone().detach().to('mps').requires_grad_(True)
model = torch.nn.ReplicationPad2d((1, 1, 1, 1))
out_cpu = model(x_cpu)
out_mps = model(x_mps)

# backward
g_cpu = torch.randn_like(out_cpu)
g_mps = g_cpu.clone().detach().to('mps').requires_grad_(False)
out_cpu.backward(g_cpu)
out_mps.backward(g_mps)

print(f"{((x_cpu.grad - x_mps.grad.cpu()).abs() > 1e-5).sum() = }")
# Expected Output:
# ((x_cpu.grad - x_mps.grad.cpu()).abs() > 1e-5).sum() = tensor(0)
```

These tests produce the expected output with this workaround.
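The chunk-and-reassemble idea behind the workaround can be sketched in plain Python (no torch required; chunked_apply and fn are illustrative names, not PyTorch APIs): process the oversized dimension in sub-batches and concatenate the results, which is conceptually what the MPS backward pass does with one MPSGraph call per chunk.

```python
# Conceptual sketch of the workaround, not the actual backend code:
# apply a function over sub-batches of at most 65535 "rows" and reassemble.
MAX_SUB_BATCH = 65535

def chunked_apply(rows, fn, max_size=MAX_SUB_BATCH):
    """Apply fn to slices of at most max_size rows and concatenate results."""
    out = []
    for start in range(0, len(rows), max_size):
        out.extend(fn(rows[start:start + max_size]))
    return out

# With 65739 "rows" (as in the repro shape), fn is invoked twice.
calls = []
def fn(chunk):
    calls.append(len(chunk))
    return chunk  # identity stand-in for the per-chunk gradient computation

result = chunked_apply(list(range(65739)), fn)
print(calls)  # [65535, 204]
```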
Successfully rebased
Branch updated: 94564e6 → d64f0c6
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as