squash xblock for persistent inner reduction #102444
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/102444
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 235b88e. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge
Merge failed. Reason: This PR needs a `release notes:` label. If not, please add the `topic: not user facing` label. To add a label, you can comment to pytorchbot, for example `@pytorchbot label "topic: not user facing"`. For more information, see the wiki.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed: trunk / linux-focal-rocm5.4.2-py3.8 / test (default, 1, 3, linux.rocm.gpu).
cc @jataylo, I had to exclude ROCm from this optimization, because ROCm is on the old Triton version that doesn't have `tl.reduce`.
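For illustration, a minimal sketch of the kind of version/platform guard being described; the helper names (`has_tl_reduce`, `use_1d_persistent_reduction`) are hypothetical and this is not the actual Inductor check:

```python
# Sketch only: gate an optimization on Triton having tl.reduce and on not being ROCm.
import torch


def has_tl_reduce() -> bool:
    """Return True if the installed Triton exposes tl.reduce."""
    try:
        import triton.language as tl
        return hasattr(tl, "reduce")
    except ImportError:
        return False


def use_1d_persistent_reduction() -> bool:
    # Skip the optimization on ROCm builds that still ship the older Triton
    # without tl.reduce; torch.version.hip is None on CUDA builds.
    on_rocm = torch.version.hip is not None
    return has_tl_reduce() and not on_rocm
```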
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Thanks for the heads up @ngimel, we hit a blocker that stopped us updating Triton for a while to bring in `tl.reduce`. I'll remove the conditionalisation of this commit in the PR that bumps our Triton commit and add you as a reviewer. cc: @dllehr-amd
Revert the aten.prod explicit fallback on ROCm and enable the use of tl.reduce in Triton codegen. This PR also enables an optimisation that was previously conditionalised out for ROCm in #102444.
Pull Request resolved: #104099
Approved by: https://github.com/peterbell10, https://github.com/malfet
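As a rough illustration of the `tl.reduce` primitive referred to above, here is a hand-written row-wise product kernel. This is not Inductor's generated code; the kernel name, shapes, and block size are assumptions.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _mul(a, b):
    # Combine function passed to tl.reduce.
    return a * b


@triton.jit
def row_prod_kernel(in_ptr, out_ptr, N, RBLOCK: tl.constexpr):
    # One program reduces one row of a contiguous [M, N] tensor.
    row = tl.program_id(0)
    rindex = tl.arange(0, RBLOCK)
    mask = rindex < N
    x = tl.load(in_ptr + row * N + rindex, mask=mask, other=1.0)
    # tl.reduce applies the combine function along the given axis;
    # this is the primitive the older ROCm Triton did not provide.
    result = tl.reduce(x, 0, _mul)
    tl.store(out_ptr + row, result)


if __name__ == "__main__":
    a = torch.rand(64, 256, device="cuda") + 0.5
    out = torch.empty(64, device="cuda")
    row_prod_kernel[(64,)](a, out, a.shape[1], RBLOCK=256)
    assert torch.allclose(out, a.prod(dim=1), rtol=1e-3)
```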
no_x_dim is used to indicate that a reduction operates on a single row, and that the data loaded for the reduction is 1-dimensional.

no_x_dim was introduced in #102444, in which there was bad perf in some reductions, and using 1D tensors fixed the perf issue. However, it appears that this perf issue no longer exists in current Triton versions. #118822 checked this, and we can also check this on the H100 benchmarks linked below. Another motivation for removing this behavior is that it enables larger loads, which we observe is necessary for good performance on certain shapes on Blackwell.

H100 inference benchmarks: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a

H100 training benchmarks: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a

Overall, the benchmarks show minimal change in performance.

Differential Revision: [D79599286](https://our.internmc.facebook.com/intern/diff/D79599286)
Pull Request resolved: #159810
Approved by: https://github.com/ngimel, https://github.com/eellison
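A toy sketch of the heuristic the no_x_dim flag represents; `KernelSpec` and `want_no_x_dim` are hypothetical names, not Inductor's real interfaces:

```python
from dataclasses import dataclass


@dataclass
class KernelSpec:
    persistent_reduction: bool  # the whole reduction is handled by one program
    xblock: int                 # rows processed per program


def want_no_x_dim(spec: KernelSpec) -> bool:
    """Emit 1-D index/load/reduce code only when each program owns exactly one
    row of a persistent reduction; otherwise keep the [XBLOCK, RBLOCK] layout."""
    return spec.persistent_reduction and spec.xblock == 1


assert want_no_x_dim(KernelSpec(persistent_reduction=True, xblock=1))
assert not want_no_x_dim(KernelSpec(persistent_reduction=True, xblock=4))
```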
Currently layer norm kernel performance is pretty bad due to a Triton perf bug (https://gist.github.com/ngimel/c1e7f70f8268f038e710e835b0065f63), but since XBLOCK for a persistent reduction is 1, we can just drop this dimension and operate on 1D tensors (and then the perf of layer norm kernels improves a lot).

Perf results: http://hud.pytorch.org/benchmark/compilers?startTime=Mon%2C%2022%20May%202023%2001%3A27%3A25%20GMT&stopTime=Mon%2C%2029%20May%202023%2001%3A27%3A25%20GMT&suite=torchbench&mode=training&dtype=amp&lBranch=ngimel/persistent_1d&lCommit=1d5175f5e682f37aae15fd217bc3767e1788bacf&rBranch=main&rCommit=c9f4f01981fd73fcc7c27676cc50230cd1b5bc22 (approximately 4% on HuggingFace).
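To make the change concrete, here is a hand-written sketch (not Inductor's generated code; kernel names, shapes, and block sizes are illustrative) contrasting a persistent row reduction written with a unit XBLOCK dimension against the squashed 1D form:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def row_sum_2d(in_ptr, out_ptr, N, RBLOCK: tl.constexpr):
    # Before: XBLOCK == 1, but indices and loads are still [1, RBLOCK] tensors.
    row = tl.program_id(0)
    rindex = tl.arange(0, RBLOCK)[None, :]            # shape [1, RBLOCK]
    mask = rindex < N
    x = tl.load(in_ptr + row * N + rindex, mask=mask, other=0.0)
    acc = tl.sum(x, axis=1)                           # shape [1]
    tl.store(out_ptr + row + tl.zeros([1], tl.int32), acc)


@triton.jit
def row_sum_1d(in_ptr, out_ptr, N, RBLOCK: tl.constexpr):
    # After: the unit XBLOCK dimension is squashed and everything is 1-D.
    row = tl.program_id(0)
    rindex = tl.arange(0, RBLOCK)                     # shape [RBLOCK]
    mask = rindex < N
    x = tl.load(in_ptr + row * N + rindex, mask=mask, other=0.0)
    tl.store(out_ptr + row, tl.sum(x, axis=0))        # scalar reduce and store


if __name__ == "__main__":
    a = torch.randn(128, 1024, device="cuda")
    ref = a.sum(dim=1)
    out1 = torch.empty(128, device="cuda")
    out2 = torch.empty(128, device="cuda")
    row_sum_2d[(128,)](a, out1, a.shape[1], RBLOCK=1024)
    row_sum_1d[(128,)](a, out2, a.shape[1], RBLOCK=1024)
    assert torch.allclose(out1, ref, atol=1e-2, rtol=1e-4)
    assert torch.allclose(out2, ref, atol=1e-2, rtol=1e-4)
```

Both kernels compute the same row sums; the 1D variant simply avoids carrying a size-1 leading dimension through every index, load, and reduction.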
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10