Add SVE implementation of embedding_lookup_idx #133995

Closed · wants to merge 1 commit

Conversation

@Sqvid (Contributor) commented Aug 20, 2024

@pytorch-bot (bot) commented Aug 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133995

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ba0493c with merge base b82000c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla (bot) commented Aug 20, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: Sqvid / name: Siddhartha Menon (ba0493c)

@Sqvid (Contributor, Author) commented Aug 20, 2024

@albanD @seemethere The PR sanity check is currently failing since this is a 7300+ LoC patch. However, the vast majority of this is the output of code generation. It seems to me that this should be a suitable exception to the large-PR rule. Could you let me know your thoughts? Thank you.

@Sqvid (Contributor, Author) commented Aug 20, 2024

@pytorchbot label "module: arm"

@pytorch-bot bot added the "module: arm" label Aug 20, 2024
@Sqvid (Contributor, Author) commented Aug 20, 2024

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot bot added the "ciflow/linux-aarch64" label Aug 20, 2024
@snadampal (Collaborator)

Hi @Sqvid, thanks for this PR. This is one of the areas where we wanted to improve PyTorch for aarch64. Please share details about what has been tested and what performance uplift was observed.

@abhishek-iitmadras (Collaborator)

> @albanD @seemethere The PR sanity check is currently failing since this is a 7300+ LoC patch. However, the vast majority of this is the output of code generation. It seems to me that this should be a suitable exception to the large-PR rule. Could you let me know your thoughts? Thank you.

Hi @Sqvid, thanks for this PR.
This PR needs to have the "skip-pr-sanity-checks" label added.
Additionally, please try to resolve the linting issues with this PR: https://github.com/pytorch/pytorch/wiki/lintrunner

@Sqvid (Contributor, Author) commented Aug 21, 2024

@pytorchbot label "skip-pr-sanity-checks"

@Sqvid (Contributor, Author) commented Aug 21, 2024

> Hi @Sqvid, thanks for this PR. This is one of the areas where we wanted to improve PyTorch for aarch64. Please share details about what has been tested and what performance uplift was observed.

Validation was done via the embedding unit tests in PyTorch:

$ python test/nn/test_embedding.py
...............s.....s.....................................................................................................................................sss...
/home/user/PyTorch/torch/autograd/gradcheck.py:1153: FutureWarning: Please use `torch.vmap` instead of `torch._vmap_internals.vmap`.
  result = vmap(vjp)(torch.stack(grad_outputs))
/home/user/PyTorch/torch/autograd/gradcheck.py:1153: FutureWarning: Please use `torch.vmap` instead of `torch._vmap_internals.vmap`.
  result = vmap(vjp)(torch.stack(grad_outputs))

----------------------------------------------------------------------
Ran 161 tests in 6.137s

OK (skipped=5)

Benchmarking was done using the caffe2 microbenchmarks in benchmarks/operator_benchmark/pt/embeddingbag_test.py, run as python -m pt.embeddingbag_test --omp-num-threads 1 --mkl-num-threads 1.
The processed relative speedups are attached: micro_bench_speedups.txt

Additionally, I wrote a simple script to perform forward passes on an nn.EmbeddingBag(1000000, 64, mode="mean") with randomised data. Speedups of roughly 4x were observed.

> Hi @Sqvid, thanks for this PR. This PR needs to have the "skip-pr-sanity-checks" label added. Additionally, please try to resolve the linting issues with this PR: https://github.com/pytorch/pytorch/wiki/lintrunner

@abhishek-iitmadras Thank you for the tips. I will run lintrunner and update the PR shortly.

@Sqvid (Contributor, Author) commented Aug 21, 2024

@abhishek-iitmadras The patch has been linted. Thanks for the pointers.

@abhishek-iitmadras (Collaborator)

Out of curiosity, I would like to know why you used the _x variant of the SVE intrinsic instead of the _m or _z variants for the same operation. Is this choice related to performance considerations?

@Sqvid (Contributor, Author) commented Aug 21, 2024

> Out of curiosity, I would like to know why you used the _x variant of the SVE intrinsic instead of the _m or _z variants for the same operation. Is this choice related to performance considerations?

My reasoning is that in this case I am not making use of any inactive elements, so there is no reason to specify a strategy for dealing with them (which is what the _m and _z variants are for).

The code would work just fine with the _z variant as well, for example, but I don't see a point in explicitly zeroing out the inactive elements.
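
For reference, a minimal sketch of the three predication variants on a simple elementwise add (illustrative only, not the PR's generated code):

```c++
#include <arm_sve.h>
#include <cstdint>

// Add two float buffers; `pg` marks the active lanes for each iteration,
// including the final partial vector.
void add_f32(const float* a, const float* b, float* out, int64_t n) {
  for (int64_t i = 0; i < n; i += svcntw()) {
    svbool_t pg = svwhilelt_b32(i, n);       // lane k active iff i + k < n
    svfloat32_t va = svld1_f32(pg, a + i);
    svfloat32_t vb = svld1_f32(pg, b + i);
    // _x: inactive lanes of the result are undefined. That is fine here,
    // because the predicated store below never writes them back.
    svfloat32_t vc = svadd_f32_x(pg, va, vb);
    // _z would zero the inactive lanes:     svadd_f32_z(pg, va, vb)
    // _m would keep va's inactive lanes:    svadd_f32_m(pg, va, vb)
    svst1_f32(pg, out + i, vc);
  }
}
```

The _x form also gives the compiler the most freedom to pick the cheapest instruction, which is why it is the usual choice when inactive elements are never consumed.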

@Sqvid (Contributor, Author) commented Aug 21, 2024

@snadampal I have now also run benchmarks against the dlrm model from pytorch/benchmark. I ran the benchmark via python3 run_benchmark.py cpu --model dlrm --test eval --nwarmup 5 --niter 15, in both eager mode and inductor mode (the latter by adding the flag --torchdynamo inductor).
Eager mode:

16 threads: 20.18% speedup
8  threads: 32.16% speedup
4  threads: 47.05% speedup
2  threads: 55.70% speedup
1  thread:  62.25% speedup

Inductor mode:

16 threads: 25.24% speedup
8  threads: 47.53% speedup
4  threads: 52.99% speedup
2  threads: 66.21% speedup
1  thread:  78.84% speedup

@albanD (Collaborator) commented Aug 21, 2024

We removed the caffe2 codebase from the PyTorch repo, so could you give some details on why this is being added there?

Also, even if the code is generated, it needs to be properly reviewed, tested, etc. We would need someone to sign up to do that properly.

@snadampal (Collaborator)

Hi @albanD, for the ATen EmbeddingBag implementation in the non-FBGEMM scenario, the caffe2 perf kernels are the only option I see today:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/EmbeddingBag.cpp#L20
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/EmbeddingBag.cpp#L269

For aarch64 there is no FBGEMM support, so the caffe2 perf kernels are the path the embedding lookup takes. However, there is only a scalar implementation in caffe2/perfkernels for aarch64:
https://github.com/pytorch/pytorch/blob/main/caffe2/perfkernels/embedding_lookup_idx.cc#L20

So this PR adds the SVE implementation, which would help every PyTorch inference scenario that includes an embedding step.
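
For reference, a minimal sketch of what the scalar path computes (hypothetical, simplified names; sum pooling only; the real kernel at the links above also handles weights, means, and multiple index/data types):

```c++
#include <cstdint>

// Scalar embedding lookup: for each bag, sum the selected embedding rows.
// The inner loop over `block_size` is what an SVE version vectorizes.
void embedding_lookup_idx_scalar(
    int64_t block_size,      // embedding dimension
    int64_t output_size,     // number of bags
    int64_t index_size,      // total number of indices
    const float* input,      // embedding table, row-major
    const int64_t* indices,  // rows to gather
    const int64_t* offsets,  // offsets[b] = start of bag b in `indices`
    float* out) {            // [output_size, block_size]
  for (int64_t b = 0; b < output_size; ++b) {
    float* dst = out + b * block_size;
    for (int64_t d = 0; d < block_size; ++d) dst[d] = 0.0f;
    const int64_t end = (b + 1 < output_size) ? offsets[b + 1] : index_size;
    for (int64_t i = offsets[b]; i < end; ++i) {
      const float* src = input + indices[i] * block_size;
      for (int64_t d = 0; d < block_size; ++d) dst[d] += src[d];
    }
  }
}
```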

By the way, I saw an earlier PR to remove the caffe2 Python bindings, but are you saying that you're planning to remove the perf kernels as well? There was also a discussion about adding aarch64 embedding lookup kernels to FBGEMM sometime this year. Please let me know if that is something you're proposing here.

@cyyever (Collaborator) commented Aug 22, 2024

As mentioned above, caffe2/perfkernels is still used in EmbeddingBag.cpp. Is it possible to fix all lint issues in the code generator and integrate it into the build (CMake, Bazel, Buck) so that the .cc file is generated only when the related CPU has been detected?

@Sqvid (Contributor, Author) commented Aug 22, 2024

@albanD

> We removed the caffe2 codebase from the PyTorch repo, so could you give some details on why this is being added there?

I think @snadampal laid out the case best: there does not seem to be a more appropriate place to optimise EmbeddingBag for aarch64 as of now.

> Also, even if the code is generated, it needs to be properly reviewed, tested, etc. We would need someone to sign up to do that properly.

Yes, that is absolutely true; I was not trying to trivialise the engineering effort. My comment was in the context of justifying why I think the PR cannot naturally be split into smaller PRs.

@Sqvid (Contributor, Author) commented Aug 22, 2024

@cyyever

> As mentioned above, caffe2/perfkernels is still used in EmbeddingBag.cpp. Is it possible to fix all lint issues in the code generator and integrate it into the build (CMake, Bazel, Buck) so that the .cc file is generated only when the related CPU has been detected?

I am open to generating the code at build time. However, the current form mirrors the corresponding files for AVX2:

caffe2/perfkernels/hp_emblookup_codegen.py

caffe2/perfkernels/embedding_lookup_idx_avx2.cc

@cyyever (Collaborator) commented Aug 23, 2024

> > As mentioned above, caffe2/perfkernels is still used in EmbeddingBag.cpp. Is it possible to fix all lint issues in the code generator and integrate it into the build (CMake, Bazel, Buck) so that the .cc file is generated only when the related CPU has been detected?
>
> I am open to generating the code at build time. However, the current form mirrors the corresponding files for AVX2:
>
> caffe2/perfkernels/hp_emblookup_codegen.py
>
> caffe2/perfkernels/embedding_lookup_idx_avx2.cc

avx2.cc came from the old days of Caffe2, and we should make an effort to generate it at build time. Meanwhile, I want newer code to avoid that behaviour from the beginning.

@Ryo-not-rio (Collaborator) left a review comment:

Initial syntax check, haven't dug into the generated code yet

@Sqvid (Contributor, Author) commented Aug 29, 2024

Todo:

  • Basic build system integration:
    • CMake
    • Bazel
    • Buck
  • Generate source at build time:
    • CMake
    • Bazel
    • Buck

@huydhn (Contributor) commented Oct 12, 2024

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following check: pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 4, 5, linux.g5.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@Sqvid (Contributor, Author) commented Oct 14, 2024

@huydhn Thanks for the merge attempt. Are the failures related to my changes or just transient?

@huydhn (Contributor) commented Oct 14, 2024

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following check: pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 4, 5, linux.g5.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@clee2000 (Contributor) commented Oct 15, 2024

@pytorchbot revert -m "breaking internal tests, I wondering if this just needs a targets change for buck? D64360590" -c ghfirst

Copy of truncated error logs:

cc @malfet @huydhn as reviewers, but I think you guys approved just to unblock CI, so I'm not sure who the actual reviewers should be

Here are the last few lines of Buck log:
   ...
      42 | typedef __clang_svint8x2_t svint8x2_t;
         |         ^
   .../include/arm_sve.h:43:9: error: unknown type name '__clang_svint16x2_t'
      43 | typedef __clang_svint16x2_t svint16x2_t;
         |         ^
   .../include/arm_sve.h:44:9: error: unknown type name '__clang_svint32x2_t'
      44 | typedef __clang_svint32x2_t svint32x2_t;
         |         ^
   .../include/arm_sve.h:45:9: error: unknown type name '__clang_svint64x2_t'
      45 | typedef __clang_svint64x2_t svint64x2_t;
         |         ^
   .../include/arm_sve.h:46:9: error: unknown type name '__clang_svuint8x2_t'
      46 | typedef __clang_svuint8x2_t svuint8x2_t;
         |         ^
   .../include/arm_sve.h:47:9: error: unknown type name '__clang_svuint16x2_t'
      47 | typedef __clang_svuint16x2_t svuint16x2_t;
         |         ^
   .../include/arm_sve.h:48:9: error: unknown type name '__clang_svuint32x2_t'
      48 | typedef __clang_svuint32x2_t svuint32x2_t;
         |         ^
   fatal error: too many errors emitted, stopping now [-ferror-limit=]
   20 errors generated.

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator)

@Sqvid your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Oct 15, 2024
This reverts commit 770c134.

Reverted #133995 on behalf of https://github.com/clee2000 due to breaking internal tests, I wondering if this just needs a targets change for buck?
@malfet (Contributor) commented Oct 15, 2024

> @pytorchbot revert -m "breaking internal tests, I wondering if this just needs a targets change for buck? D64360590" -c ghfirst
>
> [truncated Buck error log omitted; it is quoted in full in the revert comment above]

Looks like all it needs is an archaic internal change to exclude the _sve files from builds on the x86 platform.
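
For context, the usual source-level alternative is to guard the SVE translation units so non-SVE targets compile them to nothing (a sketch, not the internal Buck fix that was actually applied):

```c++
// At the top of an *_sve.cc file: only emit SVE code when the target
// architecture supports it; x86 builds then see an empty translation unit.
#if defined(__aarch64__) && defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
// ... SVE kernel definitions ...
#endif
```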

@malfet (Contributor) commented Oct 15, 2024

@pytorchbot merge -f "I've fixed it internally"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report a signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

jondea added a commit to jondea/Tool-Solutions that referenced this pull request Nov 13, 2024
- Instead of one monolithic docker image, build a wheel using the same
  builder image as upstream and install it in a smaller docker image.
  - Using the same builder image makes the dev process easier and closer
    to upstream
  - Installing in a smaller image simplifies redistribution
- Move to very recent PyTorch/oneDNN/ideep/ACL revisions from main
- Remove torchtext example because it has been deprecated and the last
  release is only compatible with PyTorch 2.3
- Add requirements.txt in case users want to use plain venv+wheel
- Remove config.mk, appears to be unused
- Remove all MLCommons examples and patches, we still have standalone
  BERT and Resnet models, and the MLCommons examples were harder to use.
  We may revisit this later.
- Removed RetinaNet object detection, as this extra example was not worth
  the overhead from the patches required for it to work.
- Remove pytorch_embedding_lookup_sve_concept.patch because
  pytorch/pytorch#133995 has been merged
- Static quantization patches have been replaced by fetching the patches
  with wget in get-source.sh

Co-authored-by: Ryo Suzuki <Ryo.Suzuki@arm.com>
Co-authored-by: Fadi Arafeh <Fadi.Arafeh@arm.com>
Co-authored-by: David Svantesson-Yeung <David.Svantesson-Yeung@arm.com>
Labels: ciflow/inductor, ciflow/linux-aarch64, ciflow/trunk, Merged, module: arm, module: cpu, module: dynamo, open source, release notes: quantization, Reverted, skip-pr-sanity-checks, topic: improvements, triaged