[Flex Attn][CPU] support flash decoding for cpu #159835

Valentine233 · 2025-08-05T02:35:28Z

Description:

Support flash decoding in CppFlexAttentionTemplate. When the query length is 1, we prefer flash decoding over flash attention.
For flash decoding, we add a kernel option PARTITION_SIZE that defines the partition size used to parallelize over the KV length dimension. The default value is 128. To use flash decoding, the partition size should be a multiple of the KV cache block size.
As mentioned in "Fix large_tensor_test skipping cpu" #158617, the flex_attn UTs for the CPU backend were disabled because of their long duration. Here we re-enable the essential ones.
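
For readers unfamiliar with flash decoding, here is a minimal pure-Python sketch of the general idea described above: split the KV length into partitions, reduce each partition independently, then merge the partial results. It is not the CppFlexAttentionTemplate code. The function names (`reference_attention`, `flash_decoding`, `can_use_flash_decoding`), the `1/sqrt(d)` scaling, and the exact form of the dispatch check are assumptions made for illustration; only the query-length-1 condition, the partition size default of 128, and the multiple-of-block-size requirement come from the description.

```python
import math
import random


def reference_attention(q, keys, values):
    """Plain softmax attention for a single query vector (query length 1)."""
    scale = 1.0 / math.sqrt(len(q))  # assumed scaling, for illustration
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim_v = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim_v)]


def flash_decoding(q, keys, values, partition_size=128):
    """Partition the KV length, reduce each partition, then merge the partials."""
    scale = 1.0 / math.sqrt(len(q))
    dim_v = len(values[0])
    partials = []  # one (local_max, local_sum, local_acc) triple per partition
    # Each loop iteration is independent, so partitions could run in parallel.
    for start in range(0, len(keys), partition_size):
        k_part = keys[start:start + partition_size]
        v_part = values[start:start + partition_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_part]
        local_max = max(scores)
        weights = [math.exp(s - local_max) for s in scores]
        local_sum = sum(weights)
        local_acc = [sum(w * v[d] for w, v in zip(weights, v_part))
                     for d in range(dim_v)]
        partials.append((local_max, local_sum, local_acc))
    # Merge step: rescale every partial result to the global maximum score.
    global_max = max(p[0] for p in partials)
    total = 0.0
    out = [0.0] * dim_v
    for local_max, local_sum, local_acc in partials:
        factor = math.exp(local_max - global_max)
        total += factor * local_sum
        for d in range(dim_v):
            out[d] += factor * local_acc[d]
    return [x / total for x in out]


def can_use_flash_decoding(q_len, partition_size, kv_block_size):
    """Hypothetical dispatch check mirroring the conditions in the description."""
    return q_len == 1 and partition_size % kv_block_size == 0


# Usage: 300 KV positions, head dimension 8, default partition size of 128.
rng = random.Random(0)
q = [rng.uniform(-1, 1) for _ in range(8)]
keys = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(300)]
values = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(300)]
out_ref = reference_attention(q, keys, values)
out_fd = flash_decoding(q, keys, values, partition_size=128)
max_diff = max(abs(a - b) for a, b in zip(out_ref, out_fd))
```

With a single query row there is no parallelism available along the query dimension, so splitting the KV length is what gives each thread its own work. The merge step only needs the per-partition maximum, exponent sum, and weighted value accumulator, which keeps the reduction cheap.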

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

pytorch-bot · 2025-08-05T02:35:32Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159835

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 35eb3f4 with merge base 8a2f53c:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable) (gh) (#158876)
/var/lib/jenkins/workspace/xla/torch_xla/csrc/runtime/BUILD:476:14: Compiling torch_xla/csrc/runtime/xla_util_test.cpp failed: (Exit 1): gcc failed: error executing CppCompile command (from target //torch_xla/csrc/runtime:xla_util_test) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 229 arguments skipped)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Valentine233 · 2025-08-06T01:18:19Z

@jianan-gu @CaoE Please help review, thanks~

support flash decoding for cpu

17149fa

pytorch-bot bot added ciflow/inductor module: inductor labels Aug 5, 2025

Valentine233 marked this pull request as draft August 5, 2025 02:35

Valentine233 added the topic: not user facing topic category label Aug 5, 2025

pytorchbot added the open source label Aug 5, 2025

Valentine233 added 6 commits August 4, 2025 22:39

update

ac8d133

fix format

26bd1f8

fix format

8cd6336

fix format

6780265

fix format

67a07cb

fix format

35eb3f4

CaoE added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 12, 2025
