Update the heuristic for AArch64 bmm/baddbmm #149122
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149122
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures — as of commit 53a6da1 with merge base a53d14d.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Can you provide a bit more description of what this consolidation is trying to do? Or is it just a better-engineering change?
If it changes the semantics of the check, it would be nice to have some sort of unit test added; if it's a performance-only change, it would be nice to have a script in the PR description that one can run to observe those perf changes.
Added some extra information to the PR description; let me know if anything more is needed.
```cpp
#if defined(__aarch64__)
  const int64_t mkldnn_acl_bmm_baddbmm_threshold = get_mkldnn_acl_bmm_baddbmm_threshold();
  // BATCH_SIZE^2 * M * N * K >= THRESHOLD
  return mat1.size(0) * mat1.size(0) * mat1.size(1) * mat1.size(2) * mat2.size(2) >= mkldnn_acl_bmm_baddbmm_threshold;
```
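For context, the check above reduces to the following self-contained predicate — a sketch, not the exact PyTorch code. The shape layout (mat1: B×M×K, mat2: B×K×N) follows the standard bmm convention, and the 1L << 22 default is taken from the review discussion below:

```cpp
#include <array>
#include <cstdint>

// Standalone sketch of the dispatch predicate: go to oneDNN/ACL only
// when B^2 * M * K * N crosses the threshold. In the real code the
// threshold comes from get_mkldnn_acl_bmm_baddbmm_threshold().
bool dispatch_to_acl(const std::array<int64_t, 3>& mat1_sizes,
                     const std::array<int64_t, 3>& mat2_sizes,
                     int64_t threshold = 1L << 22) {
  return mat1_sizes[0] * mat1_sizes[0]   // B^2
       * mat1_sizes[1]                   // M
       * mat1_sizes[2]                   // K
       * mat2_sizes[2]                   // N
       >= threshold;
}
```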
For Neoverse cores, mkldnn_acl_bmm_baddbmm_threshold == 1L<<22. Does this imply we don't want to dispatch to MKLDNN for aten::baddbmm or aten::bmm cases?
The script attached only tests float32 cases. Did we test the performance for float16 / bfloat16?
> For Neoverse cores, mkldnn_acl_bmm_baddbmm_threshold == 1L<<22. Does this imply we don't want to dispatch to MKLDNN for aten::baddbmm or aten::bmm cases?

1L<<22 is 4,194,304, which is not as large as it seems. For K, N = 1024 (which is not large by today's standards), we'll go to oneDNN/ACL with M = 4 and batch size = 1.
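A quick arithmetic check of the shapes quoted above (illustrative values only, using the B²·M·K·N product from the diff):

```cpp
#include <cassert>
#include <cstdint>

int main() {
  const int64_t threshold = 1L << 22;  // 4,194,304
  // B = 1, M = 4, K = N = 1024: 1 * 1 * 4 * 1024 * 1024 = 4,194,304,
  // exactly at the threshold, so this shape already dispatches to
  // oneDNN/ACL.
  assert(1L * 1 * 4 * 1024 * 1024 >= threshold);
  // Dropping M to 2 gives 2,097,152, which stays below the threshold
  // and keeps the non-ACL path.
  assert(1L * 1 * 2 * 1024 * 1024 < threshold);
  return 0;
}
```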
```diff
@@ -322,6 +322,42 @@ void mkldnn_matmul(
 }

 #if defined(__aarch64__)
```
This should be guarded behind USE_MKLDNN_ACL or something like that
Done
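For reference, the requested guard pattern, sketched with the AT_MKLDNN_ACL_ENABLED() macro from ATen/Config.h that the later force-push note says was ultimately used. The function body here is a placeholder, not the PR's actual implementation:

```cpp
#include <ATen/Config.h>

#if AT_MKLDNN_ACL_ENABLED()
// Only compiled when PyTorch is built with oneDNN + Arm Compute Library,
// rather than on every __aarch64__ build.
static int64_t get_mkldnn_acl_bmm_baddbmm_threshold() {
  return 1L << 22;  // placeholder default; the real lookup is not shown in this thread
}
#endif
```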
@pytorchbot label "ciflow/linux-aarch64"
Some more performance data
@pytorchbot label "ciflow/linux-aarch64"
LGTM!
LGTM
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
FYI, this change is being reverted right now, as it is causing a 50% perf regression...
@pytorchbot revert -m "breaking internal models, @malfet may you help merge this?" -c ghfirst
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit d759a51. Reverted #149122 on behalf of https://github.com/jeanschmidt due to breaking internal models ("@malfet may you help merge this?").
@michalowski-arm your PR has been successfully reverted.
This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.
@malfet @jeanschmidt where can I look for the model break / regression suite so I can fix the heuristics?
@malfet @jeanschmidt just pinging this to ask where we can see the perf results again? In my tests there was a general speedup with this heuristic for all except the smallest shapes, so to fix it I'd need to see the shapes being run.
@malfet @jeanschmidt In this CI run we don't see any regressions: TorchInductor Performance Dashboard. This was for bf16 only; I don't think we can run f32. Are there other tests we are supposed to be running? Also, presumably the regression you saw was for AArch64, or did this change somehow impact x86 performance?
Waiting for the Meta team to share details on internal regressions.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Rebase failed due to Command
Raised by https://github.com/pytorch/pytorch/actions/runs/16790309166
Replaced the defined(__aarch64__) with AT_MKLDNN_ACL_ENABLED() in the parts of Matmul.cpp relevant to the bmm heuristic (force-pushed from 4ac203f to 53a6da1).
Updates the heuristic for bmm/baddbmm and consolidates all heuristic logic in a single location, Matmul.cpp, where heuristic-based selection for mkldnn already exists. From the script we see that the cumulative mean time across all test cases (at 16 threads) is:
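The attached script and its timing table are not reproduced in this excerpt. A minimal libtorch sketch of the kind of bmm timing such a script performs (the shapes, dtype, and iteration counts here are illustrative, not those of the original script):

```cpp
#include <torch/torch.h>
#include <chrono>
#include <iostream>

int main() {
  torch::NoGradGuard no_grad;
  const int64_t B = 8, M = 64, K = 1024, N = 1024;  // illustrative shape
  auto a = torch::randn({B, M, K});
  auto b = torch::randn({B, K, N});
  for (int i = 0; i < 10; ++i) torch::bmm(a, b);  // warm-up
  const int iters = 100;
  const auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) torch::bmm(a, b);
  const auto t1 = std::chrono::steady_clock::now();
  std::cout << "mean bmm time: "
            << std::chrono::duration<double, std::milli>(t1 - t0).count() / iters
            << " ms\n";
}
```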
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01