[MPS] Optimize cummin/cummax metal kernels #156794

manuelcandales · 2025-06-25T04:16:39Z

Stack from ghstack (oldest at bottom):

-> [MPS] Optimize cummin/cummax metal kernels #156794

Performance improvement (M4 Max 64GB, macOS 15.5):

                                              | Current | Previous
      cummin-dim0-32x32 (torch.float16)       |  103.4  |   102.5
      cummin-dim0-128x128 (torch.float16)     |  112.2  |   133.6
      cummin-dim0-512x512 (torch.float16)     |  146.9  |   233.1
      cummin-dim0-1024x1024 (torch.float16)   |  193.6  |   364.2
      cummin-dim1-32x32 (torch.float16)       |  102.0  |    94.4
      cummin-dim1-128x128 (torch.float16)     |  103.0  |   109.9
      cummin-dim1-512x512 (torch.float16)     |  109.1  |   227.0
      cummin-dim1-1024x1024 (torch.float16)   |  140.5  |   985.1
      cummin-1d-100 (torch.float16)           |  101.8  |   100.7
      cummin-1d-10000 (torch.float16)         |  112.8  |   805.0
      cummin-1d-1000000 (torch.float16)       | 1343.8  | 70545.6
      cummin-dim0-32x32 (torch.float32)       |  104.6  |   102.7
      cummin-dim0-128x128 (torch.float32)     |  112.3  |   137.2
      cummin-dim0-512x512 (torch.float32)     |  146.6  |   209.7
      cummin-dim0-1024x1024 (torch.float32)   |  194.0  |   340.1
      cummin-dim1-32x32 (torch.float32)       |  100.1  |    99.2
      cummin-dim1-128x128 (torch.float32)     |  101.4  |   111.9
      cummin-dim1-512x512 (torch.float32)     |  110.3  |   250.7
      cummin-dim1-1024x1024 (torch.float32)   |  141.4  |   987.9
      cummin-1d-100 (torch.float32)           |  101.0  |   100.6
      cummin-1d-10000 (torch.float32)         |  112.9  |   794.7
      cummin-1d-1000000 (torch.float32)       | 1311.7  | 71995.3
      cummin-dim0-32x32 (torch.bfloat16)      |  105.8  |   105.9
      cummin-dim0-128x128 (torch.bfloat16)    |  111.9  |   135.7
      cummin-dim0-512x512 (torch.bfloat16)    |  147.1  |   231.9
      cummin-dim0-1024x1024 (torch.bfloat16)  |  191.2  |   327.7
      cummin-dim1-32x32 (torch.bfloat16)      |  101.8  |    91.3
      cummin-dim1-128x128 (torch.bfloat16)    |  100.2  |   108.5
      cummin-dim1-512x512 (torch.bfloat16)    |  108.9  |   222.0
      cummin-dim1-1024x1024 (torch.bfloat16)  |  140.1  |   936.9
      cummin-1d-100 (torch.bfloat16)          |  103.0  |   106.6
      cummin-1d-10000 (torch.bfloat16)        |  113.1  |   795.8
      cummin-1d-1000000 (torch.bfloat16)      | 1296.8  | 68667.4
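
The PR does not state the table's units, but the ratio previous / current is unit-free, so the speedup can be read straight off the rows. A minimal sketch that does this arithmetic for a few float16 rows copied from the table; the names `ROWS` and `speedup` are illustrative and not part of the PR:

```python
# Rows copied verbatim from the table above: name -> (current, previous).
# ROWS and speedup() are hypothetical helper names used only for illustration.
ROWS = {
    "cummin-dim0-32x32 (torch.float16)": (103.4, 102.5),
    "cummin-dim0-1024x1024 (torch.float16)": (193.6, 364.2),
    "cummin-dim1-1024x1024 (torch.float16)": (140.5, 985.1),
    "cummin-1d-10000 (torch.float16)": (112.8, 805.0),
    "cummin-1d-1000000 (torch.float16)": (1343.8, 70545.6),
}


def speedup(current: float, previous: float) -> float:
    """Return previous / current; values above 1 mean the new kernel is faster."""
    return previous / current


if __name__ == "__main__":
    for name, (current, previous) in ROWS.items():
        print(f"{name:45s} {speedup(current, previous):6.2f}x")
```

On these numbers the 1-D input with 1,000,000 elements improves by a factor of roughly 52, the 1024x1024 dim1 case by roughly 7, and the 32x32 cases stay within about 10% of 1.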

[ghstack-poisoned]

ghstack-source-id: 85e2fdf Pull-Request: #156794

pytorch-bot · 2025-06-25T04:16:43Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156794

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

VolumeLimitExceeded Issue for linux.2xlarge and linux.4xlarge

⏳ No Failures, 8 Pending

As of commit 015bfc9 with merge base 0d8e4e2:
💚 Looks good so far! There are no failures yet. 💚

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

⏳ pull / cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu, unstable) (gh) (#153987)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]

aten/src/ATen/native/mps/operations/ScanKernel.mm

[ghstack-poisoned]

ghstack-source-id: ac0ce95 Pull-Request: #156794

malfet · 2025-06-26T23:28:34Z

@pytorchbot merge -f "Let's land in trunk"

pytorchmergebot · 2025-06-26T23:30:09Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Update

a3f232a

[ghstack-poisoned]

manuelcandales requested review from kulinseth and malfet as code owners June 25, 2025 04:16

manuelcandales added a commit that referenced this pull request Jun 25, 2025

[MPS] Optimize cummin/cummax metal kernels

8e0f31d

ghstack-source-id: 85e2fdf Pull-Request: #156794

pytorch-bot bot added the labels ciflow/mps ("Run MPS tests (subset of trunk)") and release notes: mps ("Release notes category") Jun 25, 2025

Update

1927a62

[ghstack-poisoned]

manuelcandales mentioned this pull request Jun 25, 2025

duplicate #156835

Closed

Update

0ef056e

[ghstack-poisoned]

manuelcandales mentioned this pull request Jun 25, 2025

[MPS] Implement logcumsumexp metal kernel #156858

Closed

Update

29b95b9

[ghstack-poisoned]

manuelcandales mentioned this pull request Jun 25, 2025

duplicate #156859

Closed

manuelcandales mentioned this pull request Jun 25, 2025

[MPS] Add benchmark for scan with indices #156860

Closed

malfet reviewed Jun 26, 2025


Update

8d469be

[ghstack-poisoned]

malfet added the topic: performance label ("topic category") Jun 26, 2025

malfet approved these changes Jun 26, 2025


Update

015bfc9

[ghstack-poisoned]

manuelcandales added a commit that referenced this pull request Jun 26, 2025

[MPS] Optimize cummin/cummax metal kernels

4b39acf

ghstack-source-id: ac0ce95 Pull-Request: #156794

pytorchmergebot added the merging label Jun 26, 2025

pytorchmergebot closed this in 1fff635 Jun 26, 2025

pytorchmergebot added the Merged label Jun 26, 2025

pytorchmergebot removed the merging label Jun 26, 2025

manuelcandales mentioned this pull request Jun 16, 2025

Most requested ops for the MPS backend #154052

Open

github-actions bot deleted the gh/manuelcandales/12/head branch July 27, 2025 02:21
