
[Draft][WIP] Enable XPU path for FlexAttention #143553


Draft · liangan1 wants to merge 107 commits into base: main

Conversation


@liangan1 liangan1 commented Dec 19, 2024

Motivation

  1. Attention has become the critical performance bottleneck in current LLM models, and FlexAttention is a good choice for covering the broad range of attention variants in the transformers model family. With FlexAttention, it is easy to enable paged attention and fused SDPA in the transformers repo on the XPU device. It also provides a candidate attention path for LLM ecosystem libraries, e.g., vLLM and SGLang, on the XPU device.
  2. FlexAttention is a good starting point for maturing the Intel Triton-based GEMM kernels. FlexAttention provides both a flex attention kernel and a flex decoding kernel, covering compute-bound and memory-bound GEMM computation, and the different shapes needed to serve LLM inference should also be supported, e.g., head_dim = 64, 96, 128, 256.

What does this PR do?

  1. Enable the XPU device type for the FlexAttention kernels and unit tests, and ensure that all important UTs pass on the XPU device.
  2. For E2E model inference, ensure that LLM inference with FlexAttention is functional on the XPU device (a usage sketch follows below).
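
As a usage-level illustration, the sketch below shows the flow this PR is meant to enable: compiling torch.nn.attention.flex_attention with a block mask on an "xpu" device. It is a minimal sketch only; the shapes, dtype, and the availability of an XPU-enabled PyTorch build with this PR applied are assumptions for illustration.

```python
# Minimal sketch of FlexAttention on XPU (assumes an XPU-enabled PyTorch build
# with this PR applied; shapes and dtype are illustrative assumptions).
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

device = "xpu"
B, H, S, D = 2, 8, 1024, 64  # head_dim=64 is one of the sizes called out above

q = torch.randn(B, H, S, D, device=device, dtype=torch.float16)
k = torch.randn(B, H, S, D, device=device, dtype=torch.float16)
v = torch.randn(B, H, S, D, device=device, dtype=torch.float16)

# A causal mask expressed as a mask_mod; torch.compile lowers the attention
# call to a Triton kernel through Inductor, which is the path enabled on XPU here.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B, H, S, S, device=device)
compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, block_mask=block_mask)
```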

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela @yf225 @ColinPeppler @desertfire


pytorch-bot bot commented Dec 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143553

Note: Links to docs will display an error until the docs builds have been completed.

❌ 9 New Failures, 1 Pending

As of commit 3de28ca with merge base 01bcf9a:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


linux-foundation-easycla bot commented Dec 19, 2024

@liangan1 liangan1 marked this pull request as draft December 19, 2024 04:39
@EikanWang EikanWang added the topic: not user facing, triaged, and ciflow/xpu labels Dec 24, 2024

pytorch-bot bot commented Dec 24, 2024

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu label Dec 24, 2024
@EikanWang EikanWang self-requested a review December 24, 2024 02:14
@EikanWang EikanWang added the ciflow/xpu label Dec 24, 2024

pytorch-bot bot commented Dec 24, 2024

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu label Dec 24, 2024
@liangan1 (Author)

@pytorchbot rebase


pytorch-bot bot commented Feb 10, 2025

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

if config.max_autotune:
    if config.max_autotune_flex_search_space == "EXHAUSTIVE":
        return self.exhaustive_flex_attn_fwd_configs
    flex_attn_fwd_configs += self.flex_attn_fwd_autotune_configs
Collaborator

Hi @hoshibara, could you define flex_attn_fwd_autotune_configs in XPUConfigHeuristic instead of using it from the base class?

It is very likely that these configs will differ for XPU; for now they can be left the same, but a separate definition will make patching easier for us. More details: intel/intel-xpu-backend-for-triton#4265 (comment)
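
For concreteness, a rough sketch of the requested override is shown below. BaseConfigHeuristic, FlexConfig, and the config field layout are assumptions inferred from the snippet above and Inductor's existing heuristics module; the concrete values are placeholders, not tuned XPU configs.

```python
# Hypothetical sketch: give XPUConfigHeuristic its own copy of the forward
# autotune configs so they can be patched/tuned for Intel GPUs independently
# of the base class. BaseConfigHeuristic and FlexConfig are assumed to be the
# existing definitions in Inductor's template-heuristics module.
class XPUConfigHeuristic(BaseConfigHeuristic):
    def __init__(self) -> None:
        super().__init__()
        self.flex_attn_fwd_autotune_configs = [
            # (BLOCK_M, BLOCK_N, num_stages, num_warps) -- assumed field order,
            # placeholder values rather than tuned XPU settings.
            FlexConfig(128, 64, 3, 4),
            FlexConfig(128, 128, 3, 4),
            FlexConfig(64, 64, 3, 4),
        ]
```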

Contributor

OK, I'll add it

Contributor

Added

@@ -461,8 +466,9 @@ def run_test(
         block_mask = create_block_mask(
             noop_mask, Q_B, Q_H, Q_S, KV_S, device=self.device
         )
+        gold_dtype = torch.float64 if not HAS_XPU else torch.float32
Collaborator

Why does Intel GPU need to downgrade to FP32?

Contributor

This has been removed.

Collaborator

@chuanqi129, have you submitted a PR to update the infra?

@@ -3803,7 +3875,7 @@ def forward(self, arg0_1: "f64[]", arg1_1: "i32[]", arg2_1: "i32[]", arg3_1: "i3

 class mask_graph0(torch.nn.Module):
     def forward(self, arg0_1: "i32[]", arg1_1: "i32[]", arg2_1: "i32[]", arg3_1: "i32[]"):
-        full_default: "b8[]" = torch.ops.aten.full.default([], True, dtype = torch.bool, layout = torch.strided, device = device(type='cuda', index=0), pin_memory = False)
+        full_default: "b8[]" = torch.ops.aten.full.default([], True, dtype = torch.bool, layout = torch.strided, device = device(type='GPU_TYPE', index=0), pin_memory = False)
Collaborator

Suggested change
-        full_default: "b8[]" = torch.ops.aten.full.default([], True, dtype = torch.bool, layout = torch.strided, device = device(type='GPU_TYPE', index=0), pin_memory = False)
+        full_default: "b8[]" = torch.ops.aten.full.default([], True, dtype = torch.bool, layout = torch.strided, device = device(type=GPU_TYPE, index=0), pin_memory = False)

Contributor

GPU_TYPE is a placeholder to be replaced by the actual running device type, so it should not be modified.
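
For readers outside the test suite, the placeholder mechanism works roughly as sketched below: the expected graph text carries a GPU_TYPE token that the harness swaps for the device actually under test before comparing. This is an illustrative sketch under that assumption, not the harness's actual code.

```python
# Illustrative sketch of the GPU_TYPE placeholder idea (not the actual test-harness code).
expected_graph = (
    'full_default: "b8[]" = torch.ops.aten.full.default([], True, '
    "dtype = torch.bool, layout = torch.strided, "
    "device = device(type='GPU_TYPE', index=0), pin_memory = False)"
)
running_device = "xpu"  # would be "cuda" on NVIDIA runners; detected at test time in practice
expected_graph = expected_graph.replace("GPU_TYPE", running_device)
```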

Collaborator

Does FlexAttention depend on torch-xpu-ops?

Contributor

We need to test Yutao's fix PR. This is just a temp modification.

Collaborator

@chuanqi129 , ditto

Contributor

This change is to test the CI pass rate on the rolling driver. It will be removed before merging.

Contributor

Removed


pytorch-bot bot commented Aug 7, 2025

To add the ciflow label ciflow/inductor please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu label Aug 8, 2025
@etaf etaf added the ciflow/xpu label Aug 12, 2025
Labels
ciflow/xpu (Run XPU CI tasks)
keep-going (Don't stop on first failure, keep running tests until the end)
module: dynamo
module: inductor
open source
topic: not user facing (topic category)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Projects
Status: In Progress