Add SVE implementation of embedding_lookup_idx #133995
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133995
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit ba0493c with merge base b82000c.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@albanD @seemethere The PR sanity check is currently failing since this is a 7300+ LoC patch. However, the vast majority of this is the output of code generation. It seems to me that this should be a suitable exception to the large-PR rule. Could you let me know your thoughts? Thank you.
@pytorchbot label "module: arm"
@pytorchbot label "ciflow/linux-aarch64"
Hi @Sqvid, thanks for this PR. This is one of the areas where we wanted to improve PyTorch for aarch64. Please mention details about what has been tested and what performance uplift was observed.
Hi @Sqvid, thanks for this PR.
@pytorchbot label "skip-pr-sanity-checks" |
Validation was done via the embedding unit tests in pytorch.
Benchmarking was done using the caffe2 micro benchmarks. Additionally, I wrote a simple script to perform forward passes on an embedding layer.
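For readers unfamiliar with the kernel being accelerated, here is a hedged pure-Python sketch of the `embedding_lookup_idx` semantics (a bag-wise sum over rows of an embedding table, with per-bag start offsets). The function name, the offset convention, and the unweighted sum are assumptions for illustration, not the actual perfkernels API.

```python
# Illustrative reference for the embedding-lookup-by-index operation:
# for each bag b, sum the table rows selected by
# indices[offsets[b] : offsets[b + 1]] (the last bag runs to the end).
def embedding_lookup_idx(table, indices, offsets):
    dim = len(table[0])
    out = []
    for b in range(len(offsets)):
        start = offsets[b]
        end = offsets[b + 1] if b + 1 < len(offsets) else len(indices)
        row = [0.0] * dim
        for i in indices[start:end]:
            for d in range(dim):
                row[d] += table[i][d]
        out.append(row)
    return out

table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Bag 0 sums rows 0 and 2; bag 1 contains only row 1.
print(embedding_lookup_idx(table, [0, 2, 1], [0, 2]))
# → [[6.0, 8.0], [3.0, 4.0]]
```

The SVE kernels in this PR vectorize the inner accumulation loop over the embedding dimension; the scalar reference above is only meant to pin down the expected output.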
@abhishek-iitmadras Thank you for the tips. I will run the linter.
@abhishek-iitmadras The patch has been linted. Thanks for the pointers.
Out of curiosity, I would like to know why you used the _x variant of the SVE intrinsics instead of the _m or _z variants for the same operation. Is this choice related to performance considerations?
My reasoning is that in this case I am not making use of any inactive elements, so there is no reason to specify a strategy for dealing with them (which is what the _m and _z variants are for). The code should work just fine with the _z variant as well, for example, but I don't see a point in explicitly zeroing the inactive elements out.
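The distinction above can be sketched without SVE hardware. The following hedged Python model uses plain lists in place of vector registers to emulate the lane semantics of the predication suffixes (as in `svadd_f32_z` / `svadd_f32_m` / `svadd_f32_x`); the `None` lanes for `_x` stress that those values are unspecified, which is what lets the compiler pick the cheapest encoding.

```python
# Emulation of SVE predication suffixes on a per-lane basis.
# pg is the governing predicate; a and b are the vector operands.

def svadd_z(pg, a, b):
    # _z (zeroing): inactive lanes of the result are set to zero.
    return [x + y if p else 0.0 for p, x, y in zip(pg, a, b)]

def svadd_m(pg, a, b):
    # _m (merging): inactive lanes keep the first operand's value.
    return [x + y if p else x for p, x, y in zip(pg, a, b)]

def svadd_x(pg, a, b):
    # _x ("don't care"): inactive lanes are unspecified; None here
    # marks lanes whose contents real hardware leaves undefined.
    return [x + y if p else None for p, x, y in zip(pg, a, b)]

pg = [True, True, False, True]
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
print(svadd_z(pg, a, b))  # → [11.0, 22.0, 0.0, 44.0]
print(svadd_m(pg, a, b))  # → [11.0, 22.0, 3.0, 44.0]
```

When no later instruction reads the inactive lanes, as in this kernel, the three variants are observationally equivalent, which is why `_x` is the natural default.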
@snadampal I have now also run benchmarks against the
Inductor mode:
We removed the caffe2 codebase from the PyTorch repo, so could you give some details on why this is being added there? Also, even if the code is generated, it needs to be properly reviewed, tested, etc. We would need someone to sign up to do that properly.
Hi @albanD, for the aten embedding bag implementation, the embedding lookup index goes through FBGEMM where it is available. For aarch64 there is no FBGEMM support, so the caffe2 perf kernels are the path the embedding lookup index takes. However, there is only a scalar implementation of caffe2/perfkernels on aarch64. So, this PR adds the SVE implementation, which would help every PyTorch inference scenario with an embedding step. By the way, I have seen a PR earlier to remove the caffe2 Python bindings, but are you saying that you're planning to remove the perf kernels as well?
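The fallback chain described above can be summarized in a hedged sketch. The function and backend names below are illustrative, not the real PyTorch symbols; the point is only the selection order: FBGEMM where available, otherwise the perfkernels path, which on aarch64 was scalar-only before this PR.

```python
# Illustrative model of kernel selection for the embedding lookup step.
def select_embedding_kernel(has_fbgemm, has_sve):
    if has_fbgemm:
        return "fbgemm"           # fast path on x86 builds
    if has_sve:
        return "perfkernels_sve"  # path added by this PR for aarch64
    return "perfkernels_scalar"   # previous aarch64 behaviour

# aarch64 with SVE now gets a vectorized kernel instead of scalar code:
print(select_embedding_kernel(has_fbgemm=False, has_sve=True))
# → perfkernels_sve
```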
As mentioned above, caffe2/perfkernels is still used in EmbeddingBag.cpp. Is it possible to fix all lint issues in the code generator and integrate it as part of CMake (bazel, BUCK) so that the cpp file is generated only when the related CPU has been detected? |
I think @snadampal laid out the case best; there does not seem to be a more appropriate place to optimise EmbeddingBag for aarch64 as of now.
Yes that is absolutely true, I was not trying to trivialise the engineering effort. My comment was in the context of justifying why I think the PR cannot be naturally split into smaller PRs. |
I am open to generating the code at build time. However, the current form mirrors the corresponding files for AVX2.
avx2.cc came from the old days of Caffe2, and we should make an effort to generate it at build time. Meanwhile, I want newer code to avoid that behaviour from the beginning.
Initial syntax check, haven't dug into the generated code yet
Todo:
@pytorchbot merge -i |
Merge started. Your change will be merged while ignoring the following 1 check: pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 4, 5, linux.g5.4xlarge.nvidia.gpu). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@huydhn Thanks for the merge attempt. Are the failures related to my changes or just transient?
@pytorchbot merge -i |
Merge started. Your change will be merged while ignoring the following 1 check: pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 4, 5, linux.g5.4xlarge.nvidia.gpu). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "breaking internal tests, I wondering if this just needs a targets change for buck? D64360590" -c ghfirst
Copy of truncated error logs:
cc @malfet @huydhn as reviewers, but I think you guys approved just to unblock CI so I'm not sure who the actual reviewers should be.
@pytorchbot successfully started a revert job. Check the current status here. |
@Sqvid your PR has been successfully reverted. |
This reverts commit 770c134. Reverted #133995 on behalf of https://github.com/clee2000 due to breaking internal tests, I wondering if this just needs a targets change for buck? ([comment](#133995 (comment)))
Looks like all it needs is an arcane internal change to exclude _sve files from builds on x86 platforms.
@pytorchbot merge -f "I've fixed it internally" |
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use `-f` as a last resort and instead consider `-i/--ignore-current` to continue the merge ignoring current failures. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
- Instead of one monolithic docker image, build a wheel using the same builder image as upstream and install it in a smaller docker image.
  - Using the same builder image makes the dev process easier and closer to upstream.
  - Installing in a smaller image simplifies redistribution.
- Move to very recent PyTorch/oneDNN/ideep/ACL revisions from main.
- Remove torchtext example because it has been deprecated and the last release is only compatible with PyTorch 2.3.
- Add requirements.txt in case users want to use plain venv+wheel.
- Remove config.mk; it appears to be unused.
- Remove all MLCommons examples and patches; we still have standalone BERT and Resnet models, and the MLCommons examples were harder to use. We may revisit this later.
- Remove RetinaNet object detection, as this extra example was not worth the overhead from the patches required for it to work.
- Remove pytorch_embedding_lookup_sve_concept.patch because pytorch/pytorch#133995 has been merged.
- Static quantization patches have been replaced with wgetting patches in get-source.sh.

Co-authored-by: Ryo Suzuki <Ryo.Suzuki@arm.com>
Co-authored-by: Fadi Arafeh <Fadi.Arafeh@arm.com>
Co-authored-by: David Svantesson-Yeung <David.Svantesson-Yeung@arm.com>
Adds an accelerated version of the embedding_lookup_idx perfkernels. This is done via a Python codegen file, similar to
caffe2/perfkernels/hp_emblookup_codegen.py
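To illustrate the codegen approach the description mentions, here is a hedged, minimal sketch of a generator that stamps out one C++ lookup function per element type. The template text, function names, and type mapping are invented for illustration; they do not reproduce what hp_emblookup_codegen.py or this PR actually emit.

```python
# Minimal model of a perfkernels-style code generator: expand a C++
# template once per supported data type and join the results into one
# source string that would be written to a generated .cc file.
TEMPLATE = """\
void EmbeddingLookupIdx_{dtype}(
    const int64_t block_size,
    const int64_t output_size,
    const {ctype}* input,
    const int64_t* indices,
    float* out) {{
  // ... generated inner loop for {dtype} would go here ...
}}
"""

def generate(dtypes):
    # Map a short type tag to the C++ element type the kernel reads.
    ctypes = {"float": "float", "half": "uint16_t", "uint8": "uint8_t"}
    return "\n".join(
        TEMPLATE.format(dtype=d, ctype=ctypes[d]) for d in dtypes
    )

source = generate(["float", "half"])
print(source.splitlines()[0])  # → void EmbeddingLookupIdx_float(
```

Generating the text once and checking it in (as this PR does, mirroring avx2.cc) versus running such a script from the build system is exactly the trade-off discussed earlier in the thread.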
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @malfet @snadampal @milpuz01 @albanD @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec