
squash xblock for persistent inner reduction #102444

Closed
wants to merge 4 commits

Conversation

ngimel (Collaborator) commented May 28, 2023

pytorch-bot bot commented May 28, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/102444

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 235b88e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ngimel (Collaborator, Author) commented May 29, 2023

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label May 29, 2023
pytorchmergebot (Collaborator)

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team · Raised by workflow job

ngimel added the topic: not user facing label May 29, 2023
ngimel (Collaborator, Author) commented May 29, 2023

@pytorchbot merge

pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot (Collaborator)

Merge failed

Reason: 1 job has failed; the first few are: trunk / linux-focal-rocm5.4.2-py3.8 / test (default, 1, 3, linux.rocm.gpu)

Details for Dev Infra team · Raised by workflow job

ngimel (Collaborator, Author) commented May 30, 2023

cc @jataylo, I had to exclude ROCm from this optimization because ROCm is on an old Triton version that doesn't have tl.reduce, and the workaround doesn't work for 1d tensors (with some fixes it worked for me locally when I forced the workaround path in triton_helpers.py, but it was still failing on ROCm CI d97b63e). Can you update your Triton pin so that the tl.reduce workaround is no longer needed?
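
For context, a minimal sketch of the kind of reduction tl.reduce makes straightforward; the kernel and combine-function names here are illustrative, not the actual contents of triton_helpers.py:

```python
import triton
import triton.language as tl

@triton.jit
def _combine_max(a, b):
    # Combine function applied pairwise by tl.reduce along the axis.
    return tl.maximum(a, b)

@triton.jit
def max_1d_kernel(in_ptr, out_ptr, rnumel, RBLOCK: tl.constexpr):
    # Persistent 1D reduction over a single row of rnumel elements.
    rindex = tl.arange(0, RBLOCK)
    rmask = rindex < rnumel
    vals = tl.load(in_ptr + rindex, mask=rmask, other=float("-inf"))
    # tl.reduce with a custom combine function is the API missing from
    # the older Triton pin ROCm was on, which forced the workaround path.
    result = tl.reduce(vals, 0, _combine_max)
    tl.store(out_ptr, result)
```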

ngimel (Collaborator, Author) commented May 30, 2023

@pytorchbot merge

pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

jataylo (Collaborator) commented May 31, 2023

> cc @jataylo, I had to exclude ROCm from this optimization because ROCm is on an old Triton version that doesn't have tl.reduce, and the workaround doesn't work for 1d tensors (with some fixes it worked for me locally when I forced the workaround path in triton_helpers.py, but it was still failing on ROCm CI d97b63e). Can you update your Triton pin so that the tl.reduce workaround is no longer needed?

Thanks for the heads up @ngimel. We hit a blocker that stopped us from updating Triton for a while to bring in the tl.reduce change (ROCm/triton#208). We've fixed that issue and hope to move our Triton commit forward shortly, after merging the changes into our pyt2.0 branch of triton.

I'll remove the ROCm conditionalisation from this commit in the PR that bumps our Triton pin, and add you as a reviewer.

cc: @dllehr-amd

pytorchmergebot pushed a commit that referenced this pull request Jun 27, 2023
Revert the aten.prod explicit fallback on ROCm and enable the use of tl.reduce in Triton codegen. This PR also enables an optimisation that was previously conditionalised out for ROCm (#102444)

Pull Request resolved: #104099
Approved by: https://github.com/peterbell10, https://github.com/malfet
htyu added a commit that referenced this pull request Oct 28, 2023
htyu added a commit that referenced this pull request Oct 30, 2023
This basically reverts #102444
htyu added a commit that referenced this pull request Nov 3, 2023
This basically reverts #102444
@github-actions github-actions bot deleted the ngimel/persistent_1d branch November 28, 2024 02:11
pytorchmergebot pushed a commit that referenced this pull request Aug 12, 2025
no_x_dim is used to indicate that a reduction operates on a single row and that the data loaded for the reduction is 1-dimensional.

no_x_dim was introduced in #102444, where some reductions had bad perf and using 1D tensors fixed the perf issue.

However, this perf issue appears to no longer exist in current Triton versions. #118822 checked this, and we can also check it on the H100 benchmarks linked below. Another motivation for removing this behavior is that it enables larger loads, which we observe is necessary for good performance on certain shapes on Blackwell.
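
As a rough illustration of the shape difference (a sketch with made-up kernel names, not actual Inductor codegen):

```python
import triton
import triton.language as tl

@triton.jit
def row_sum_with_x_dim(in_ptr, out_ptr, rnumel, RBLOCK: tl.constexpr):
    # x dimension kept: the tile is [1, RBLOCK] even though there is
    # only a single row to reduce.
    rindex = tl.arange(0, RBLOCK)[None, :]
    rmask = rindex < rnumel
    vals = tl.load(in_ptr + rindex, mask=rmask, other=0.0)
    acc = tl.sum(vals, 1)                    # shape [1]
    tl.store(out_ptr + tl.arange(0, 1), acc)

@triton.jit
def row_sum_no_x_dim(in_ptr, out_ptr, rnumel, RBLOCK: tl.constexpr):
    # no_x_dim: the row is loaded as a flat [RBLOCK] tensor instead.
    rindex = tl.arange(0, RBLOCK)
    rmask = rindex < rnumel
    vals = tl.load(in_ptr + rindex, mask=rmask, other=0.0)
    tl.store(out_ptr, tl.sum(vals, 0))       # scalar result
```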

H100 inference benchmarks:
https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a

H100 training benchmarks:
https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a

Overall, the benchmarks show minimal change in performance.

Differential Revision: [D79599286](https://our.internmc.facebook.com/intern/diff/D79599286)
Pull Request resolved: #159810
Approved by: https://github.com/ngimel, https://github.com/eellison