Add SVE implementation of embedding_lookup_idx #133995

Closed · wants to merge 1 commit

Conversation

@Sqvid (Contributor) commented Aug 20, 2024

@pytorch-bot (bot) commented Aug 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133995

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ba0493c with merge base b82000c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla (bot) commented Aug 20, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: Sqvid / name: Siddhartha Menon (ba0493c)

@Sqvid (Contributor, Author) commented Aug 20, 2024

@albanD @seemethere The PR sanity check is currently failing since this is a 7300+ LoC patch. However, the vast majority of this is the output of code generation. It seems to me that this should be a suitable exception to the large-PR rule. Could you let me know your thoughts? Thank you.

@Sqvid (Contributor, Author) commented Aug 20, 2024

@pytorchbot label "module: arm"

@pytorch-bot bot added the "module: arm" label Aug 20, 2024
@Sqvid (Contributor, Author) commented Aug 20, 2024

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot bot added the "ciflow/linux-aarch64" label Aug 20, 2024
@snadampal (Collaborator)

Hi @Sqvid, thanks for this PR. This is one of the areas where we wanted to improve PyTorch for aarch64. Please share details about what has been tested and what performance uplift was observed.

@abhishek-iitmadras (Collaborator)

> @albanD @seemethere The PR sanity check is currently failing since this is a 7300+ LoC patch. However, the vast majority of this is the output of code generation. It seems to me that this should be a suitable exception to the large-PR rule. Could you let me know your thoughts? Thank you.

Hi @Sqvid, thanks for this PR.
This PR needs to have the "skip-pr-sanity-checks" label added.
Additionally, please try to resolve the linting issues with this PR: https://github.com/pytorch/pytorch/wiki/lintrunner

@Sqvid (Contributor, Author) commented Aug 21, 2024

@pytorchbot label "skip-pr-sanity-checks"

@Sqvid (Contributor, Author) commented Aug 21, 2024

> Hi @Sqvid, thanks for this PR. This is one of the areas where we wanted to improve PyTorch for aarch64. Please share details about what has been tested and what performance uplift was observed.

Validation was done via the embedding unit tests in PyTorch:

$ python test/nn/test_embedding.py
...............s.....s.....................................................................................................................................sss...
/home/user/PyTorch/torch/autograd/gradcheck.py:1153: FutureWarning: Please use `torch.vmap` instead of `torch._vmap_internals.vmap`.
  result = vmap(vjp)(torch.stack(grad_outputs))
/home/user/PyTorch/torch/autograd/gradcheck.py:1153: FutureWarning: Please use `torch.vmap` instead of `torch._vmap_internals.vmap`.
  result = vmap(vjp)(torch.stack(grad_outputs))

----------------------------------------------------------------------
Ran 161 tests in 6.137s

OK (skipped=5)

Benchmarking was done using the caffe2 microbenchmarks in benchmarks/operator_benchmark/pt/embeddingbag_test.py, run as python -m pt.embeddingbag_test --omp-num-threads 1 --mkl-num-threads 1.
The processed relative speedups are attached: micro_bench_speedups.txt

Additionally, I wrote a simple script to perform forward passes on an nn.EmbeddingBag(1000000, 64, mode="mean") with randomised data. Speedups of roughly 4x were observed.

> Hi @Sqvid, thanks for this PR. This PR needs to have the "skip-pr-sanity-checks" label added. Additionally, please try to resolve the linting issues with this PR: https://github.com/pytorch/pytorch/wiki/lintrunner

@abhishek-iitmadras Thank you for the tips. I will run lintrunner and update the PR shortly.

@Sqvid (Contributor, Author) commented Aug 21, 2024

@abhishek-iitmadras The patch has been linted. Thanks for the pointers.

@abhishek-iitmadras (Collaborator)

Out of curiosity, I would like to know why you used the _x variant of the SVE intrinsic instead of the _m or _z variants for the same operation. Is this choice related to performance considerations?

@Sqvid (Contributor, Author) commented Aug 21, 2024

> Out of curiosity, I would like to know why you used the _x variant of the SVE intrinsic instead of the _m or _z variants for the same operation. Is this choice related to performance considerations?

My reasoning is that in this case I am not making use of any inactive elements, so there is no reason to specify a strategy for dealing with them (which is what the _m and _z variants are for).

The code would work just fine with the _z variant as well, for example, but I don't see a point in explicitly zeroing out the inactive elements.
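
For reference, a minimal sketch of the three predication variants on a simple elementwise add (illustrative only, not the PR's generated code):

```c++
#include <arm_sve.h>
#include <cstdint>

// Add two float buffers; `pg` marks the active lanes for each iteration,
// including the final partial vector.
void add_f32(const float* a, const float* b, float* out, int64_t n) {
  for (int64_t i = 0; i < n; i += svcntw()) {
    svbool_t pg = svwhilelt_b32(i, n);       // lane k active iff i + k < n
    svfloat32_t va = svld1_f32(pg, a + i);
    svfloat32_t vb = svld1_f32(pg, b + i);
    // _x: inactive lanes of the result are undefined. That is fine here,
    // because the predicated store below never writes them back.
    svfloat32_t vc = svadd_f32_x(pg, va, vb);
    // _z would zero the inactive lanes:     svadd_f32_z(pg, va, vb)
    // _m would keep va's inactive lanes:    svadd_f32_m(pg, va, vb)
    svst1_f32(pg, out + i, vc);
  }
}
```

The _x form also gives the compiler the most freedom to pick the cheapest instruction, which is why it is the usual choice when inactive elements are never consumed.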

@Sqvid (Contributor, Author) commented Aug 21, 2024

@snadampal I have now also run benchmarks against the dlrm model from pytorch/benchmark. I ran the benchmark via python3 run_benchmark.py cpu --model dlrm --test eval --nwarmup 5 --niter 15, in both eager mode and inductor mode (the latter by adding the flag --torchdynamo inductor).
Eager mode:

16 threads: 20.18% speedup
8  threads: 32.16% speedup
4  threads: 47.05% speedup
2  threads: 55.70% speedup
1  thread:  62.25% speedup

Inductor mode:

16 threads: 25.24% speedup
8  threads: 47.53% speedup
4  threads: 52.99% speedup
2  threads: 66.21% speedup
1  thread:  78.84% speedup

@albanD (Collaborator) commented Aug 21, 2024

We removed the caffe2 codebase from the PyTorch repo, so could you give some details on why this is being added there?

Also, even if the code is generated, it needs to be properly reviewed, tested, etc. We would need someone to sign up to do that properly.

@snadampal (Collaborator)

Hi @albanD, for the ATen EmbeddingBag implementation in the non-FBGEMM scenario, the caffe2 perf kernels are the only option I see today:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/EmbeddingBag.cpp#L20
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/EmbeddingBag.cpp#L269

For aarch64 there is no FBGEMM support, so the caffe2 perf kernels are the path the embedding lookup takes. However, there is only a scalar implementation in caffe2/perfkernels for aarch64:
https://github.com/pytorch/pytorch/blob/main/caffe2/perfkernels/embedding_lookup_idx.cc#L20

So this PR adds the SVE implementation, which would help every PyTorch inference scenario that includes an embedding step.
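
For reference, a minimal sketch of what the scalar path computes (hypothetical, simplified names; sum pooling only; the real kernel at the links above also handles weights, means, and multiple index/data types):

```c++
#include <cstdint>

// Scalar embedding lookup: for each bag, sum the selected embedding rows.
// The inner loop over `block_size` is what an SVE version vectorizes.
void embedding_lookup_idx_scalar(
    int64_t block_size,      // embedding dimension
    int64_t output_size,     // number of bags
    int64_t index_size,      // total number of indices
    const float* input,      // embedding table, row-major
    const int64_t* indices,  // rows to gather
    const int64_t* offsets,  // offsets[b] = start of bag b in `indices`
    float* out) {            // [output_size, block_size]
  for (int64_t b = 0; b < output_size; ++b) {
    float* dst = out + b * block_size;
    for (int64_t d = 0; d < block_size; ++d) dst[d] = 0.0f;
    const int64_t end = (b + 1 < output_size) ? offsets[b + 1] : index_size;
    for (int64_t i = offsets[b]; i < end; ++i) {
      const float* src = input + indices[i] * block_size;
      for (int64_t d = 0; d < block_size; ++d) dst[d] += src[d];
    }
  }
}
```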

By the way, I saw an earlier PR to remove the caffe2 Python bindings, but are you saying that you're planning to remove the perf kernels as well? There was also a discussion about adding aarch64 embedding lookup kernels to FBGEMM sometime this year. Please let me know if that is something you're proposing here.

@cyyever (Collaborator) commented Aug 22, 2024

As mentioned above, caffe2/perfkernels is still used in EmbeddingBag.cpp. Is it possible to fix all lint issues in the code generator and integrate it into the build (CMake, Bazel, Buck) so that the .cc file is generated only when the related CPU has been detected?

@Sqvid (Contributor, Author) commented Aug 22, 2024

@albanD

> We removed the caffe2 codebase from the PyTorch repo, so could you give some details on why this is being added there?

I think @snadampal laid out the case best: there does not seem to be a more appropriate place to optimise EmbeddingBag for aarch64 as of now.

> Also, even if the code is generated, it needs to be properly reviewed, tested, etc. We would need someone to sign up to do that properly.

Yes, that is absolutely true; I was not trying to trivialise the engineering effort. My comment was in the context of justifying why I think the PR cannot naturally be split into smaller PRs.

@Sqvid (Contributor, Author) commented Aug 22, 2024

@cyyever

> As mentioned above, caffe2/perfkernels is still used in EmbeddingBag.cpp. Is it possible to fix all lint issues in the code generator and integrate it into the build (CMake, Bazel, Buck) so that the .cc file is generated only when the related CPU has been detected?

I am open to generating the code at build time. However, the current form mirrors the corresponding files for AVX2:

caffe2/perfkernels/hp_emblookup_codegen.py

caffe2/perfkernels/embedding_lookup_idx_avx2.cc

@cyyever (Collaborator) commented Aug 23, 2024

> > As mentioned above, caffe2/perfkernels is still used in EmbeddingBag.cpp. Is it possible to fix all lint issues in the code generator and integrate it into the build (CMake, Bazel, Buck) so that the .cc file is generated only when the related CPU has been detected?
>
> I am open to generating the code at build time. However, the current form mirrors the corresponding files for AVX2:
>
> caffe2/perfkernels/hp_emblookup_codegen.py
>
> caffe2/perfkernels/embedding_lookup_idx_avx2.cc

avx2.cc came from the old days of Caffe2, and we should make an effort to generate it at build time. Meanwhile, I want newer code to avoid that behaviour from the beginning.

@Ryo-not-rio (Collaborator) left a review comment:

Initial syntax check, haven't dug into the generated code yet

@Sqvid (Contributor, Author) commented Aug 29, 2024

Todo:

  • Basic build system integration:
    • CMake
    • Bazel
    • Buck
  • Generate source at build time:
    • CMake
    • Bazel
    • Buck

@huydhn (Contributor) commented Oct 12, 2024

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following check: pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 4, 5, linux.g5.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@Sqvid (Contributor, Author) commented Oct 14, 2024

@huydhn Thanks for the merge attempt. Are the failures related to my changes or just transient?

@huydhn (Contributor) commented Oct 14, 2024

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following check: pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 4, 5, linux.g5.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@clee2000 (Contributor) commented Oct 15, 2024

@pytorchbot revert -m "breaking internal tests, I wondering if this just needs a targets change for buck? D64360590" -c ghfirst

Copy of truncated error logs:

cc @malfet @huydhn as reviewers, but I think you guys approved just to unblock CI, so I'm not sure who the actual reviewers should be

Here are the last few lines of Buck log:
   ...
      42 | typedef __clang_svint8x2_t svint8x2_t;
         |         ^
   .../include/arm_sve.h:43:9: error: unknown type name '__clang_svint16x2_t'
      43 | typedef __clang_svint16x2_t svint16x2_t;
         |         ^
   .../include/arm_sve.h:44:9: error: unknown type name '__clang_svint32x2_t'
      44 | typedef __clang_svint32x2_t svint32x2_t;
         |         ^
   .../include/arm_sve.h:45:9: error: unknown type name '__clang_svint64x2_t'
      45 | typedef __clang_svint64x2_t svint64x2_t;
         |         ^
   .../include/arm_sve.h:46:9: error: unknown type name '__clang_svuint8x2_t'
      46 | typedef __clang_svuint8x2_t svuint8x2_t;
         |         ^
   .../include/arm_sve.h:47:9: error: unknown type name '__clang_svuint16x2_t'
      47 | typedef __clang_svuint16x2_t svuint16x2_t;
         |         ^
   .../include/arm_sve.h:48:9: error: unknown type name '__clang_svuint32x2_t'
      48 | typedef __clang_svuint32x2_t svuint32x2_t;
         |         ^
   fatal error: too many errors emitted, stopping now [-ferror-limit=]
   20 errors generated.

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator)

@Sqvid your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Oct 15, 2024
This reverts commit 770c134.

Reverted #133995 on behalf of https://github.com/clee2000 due to breaking internal tests, I wondering if this just needs a targets change for buck?
@malfet (Contributor) commented Oct 15, 2024

> @pytorchbot revert -m "breaking internal tests, I wondering if this just needs a targets change for buck? D64360590" -c ghfirst
>
> [truncated Buck error log omitted; it is quoted in full in the revert comment above]

Looks like all it needs is an archaic internal change to exclude the _sve files from builds on the x86 platform.
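
For context, the usual source-level alternative is to guard the SVE translation units so non-SVE targets compile them to nothing (a sketch, not the internal Buck fix that was actually applied):

```c++
// At the top of an *_sve.cc file: only emit SVE code when the target
// architecture supports it; x86 builds then see an empty translation unit.
#if defined(__aarch64__) && defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
// ... SVE kernel definitions ...
#endif
```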

@malfet (Contributor) commented Oct 15, 2024

@pytorchbot merge -f "I've fixed it internally"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report a signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

jondea added a commit to jondea/Tool-Solutions that referenced this pull request Nov 13, 2024
- Instead of one monolithic docker image, build a wheel using the same
  builder image as upstream and install it in a smaller docker image.
  - Using the same builder image makes the dev process easier and closer
    to upstream
  - Installing in a smaller image simplifies redistribution
- Move to very recent PyTorch/oneDNN/ideep/ACL revisions from main
- Remove torchtext example because it has been deprecated and the last
  release is only compatible with PyTorch 2.3
- Add requirements.txt in case users want to use plain venv+wheel
- Remove config.mk, appears to be unused
- Remove all MLCommons examples and patches, we still have standalone
  BERT and Resnet models, and the MLCommons examples were harder to use.
  We may revisit this later.
- Removed RetinaNet object detection, as this extra example was not worth
  the overhead from the patches required for it to work.
- Remove pytorch_embedding_lookup_sve_concept.patch because
  pytorch/pytorch#133995 has been merged
- Static quantization patches have been replaced by fetching the patches
  with wget in get-source.sh

Co-authored-by: Ryo Suzuki <Ryo.Suzuki@arm.com>
Co-authored-by: Fadi Arafeh <Fadi.Arafeh@arm.com>
Co-authored-by: David Svantesson-Yeung <David.Svantesson-Yeung@arm.com>
Labels: ciflow/inductor, ciflow/linux-aarch64, ciflow/trunk, Merged, module: arm, module: cpu, module: dynamo, open source, release notes: quantization, Reverted, skip-pr-sanity-checks, topic: improvements, triaged