[Intel GPU] Support SDPA backend selection and priority setting on XPU #159464
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159464
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 3d6a1cd with merge base b602ea9. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@chunhuanMeng @guangyey Could you help take a look?
I am wondering if we support `with sdpa_kernel(order, set_priority=True):`, i.e. whether `set_priority=True` is honored here (see the usage sketch after the quoted lines below).
sdp::SDPBackend::flash_attention,
sdp::SDPBackend::math,
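For reference, a minimal sketch (not part of this PR) of the usage being asked about, assuming the torch.nn.attention API and an XPU-enabled build; with `set_priority=True` the backend list is interpreted as a priority order rather than a plain allow-list:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Hypothetical priority order: try the overrideable (XPU) kernel first, then fall back to math.
order = [SDPBackend.OVERRIDEABLE, SDPBackend.MATH]

# Illustrative shapes and dtype; requires a PyTorch build with XPU support.
q, k, v = (torch.randn(2, 4, 128, 64, device="xpu", dtype=torch.half) for _ in range(3))

with sdpa_kernel(order, set_priority=True):
    out = F.scaled_dot_product_attention(q, k, v)
```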
I guess we currently don't support `set_priority=True`, since our `priority_order` is a constant value. For PyTorch, see pytorch/aten/src/ATen/SDPBackend.h, lines 7 to 14 (at 1465757). Do we really want to change this behavior for XPU?
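For readers following along, a rough Python-side view of the same set of backends (a sketch only; it assumes torch.nn.attention mirrors the C++ SDPBackend enum, and member availability may vary by version):

```python
from torch.nn.attention import SDPBackend

# Print the backends that can appear in an SDPA priority order.
# Members are expected to include MATH, FLASH_ATTENTION, EFFICIENT_ATTENTION,
# CUDNN_ATTENTION and OVERRIDEABLE (the hook used by XPU / PrivateUse1 devices).
print(list(SDPBackend))
```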
@guangyey To support this feature, XPU needs to use the `priority_order` from the global context instead of its own constant one.
As you know, the overrideable backend has the lowest priority in the default order; see pytorch/aten/src/ATen/Context.h, lines 432 to 437 (at d7a5ec9).
Now we have added a flash_attention entry that falls back to overrideable, so we can use the logic in pytorch/aten/src/ATen/native/mkldnn/xpu/Attention.cpp, lines 110 to 116 (at d7a5ec9).
However, in the near future we will add a cutlass-sycl version of SDPA to the FlashAttention backend; see pytorch/aten/src/ATen/Context.h, lines 432 to 437 (at d7a5ec9).
Let's have a discussion offline.
Add a UT to ensure `overrideable` by default has higher priority than `math`; see the sketch below for one possible shape of such a test.
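A rough sketch of such a test (hedged): it relies on torch._fused_sdp_choice, the private helper that test_transformers.py already uses to inspect backend selection, and assumes SDPBackend.OVERRIDEABLE is exposed; shapes and dtype are illustrative.

```python
import torch
from torch.nn.attention import SDPBackend

def test_overrideable_preferred_over_math_by_default():
    # No sdpa_kernel context manager here: we check the default priority order on XPU.
    q, k, v = (torch.randn(2, 4, 128, 64, device="xpu", dtype=torch.half) for _ in range(3))
    chosen = torch._fused_sdp_choice(q, k, v)
    # With the new default order, the overrideable backend should win before math.
    assert chosen == SDPBackend.OVERRIDEABLE.value
    assert chosen != SDPBackend.MATH.value
```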
Thanks for your update!
@drisspg May I know if this PR looks reasonable to you? It doesn't change CUDA behavior, and it ensures all non-CUDA backends align with the CUDA SDPA priority behavior.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
Force-pushed from e5490c7 to 6506ea3.
Does this change the global order for all privateuse1 backends?
@drisspg I think privateuse1 will also benefit from this PR. It doesn't change the ordering between "flash_attention", "efficient_attention", and "math".
aten/src/ATen/Context.h (outdated)
@@ -432,9 +432,9 @@ class TORCH_API Context {
  std::array<at::SDPBackend, at::num_sdp_backends> sdp_priority_order = {
      at::SDPBackend::flash_attention,
      at::SDPBackend::efficient_attention,
      at::SDPBackend::overrideable,
Can you undo this change? This code in pytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp, lines 60 to 100 (at 731ee31):
// tracks whether we've set the default priority order once, to avoid setting
// it redundantly or overwriting a user-specified priority order
// when the priority order context manager is used before the default priority
// order is initialized the following happens:
// (1) the current priority order is queried
// (2) priority_order() is called, which initializes it to the default as init_ is false
// (3) the user-specified priority order is set
// (3.1) we are in the priority context...
// (3.2) we exit the priority context...
// (4) the previous priority order (default) is restored
bool priority_order_init_ = false;
// TODO(eqy): more benchmarking to determine whether this should include sm86/89
// Needs to be kept in-sync with test_fused_chocie in test_transformers.py
bool check_prefer_cudnn_attention() {
  static const bool prefer_cudnn = c10::utils::check_env("TORCH_CUDNN_SDPA_PREFERRED") == true;
  if (!prefer_cudnn) {
    return false;
  }
#if (defined(CUDNN_VERSION) && (CUDNN_VERSION > 90000))
  auto dprops = at::cuda::getCurrentDeviceProperties();
  return dprops->major >= 9 && !dprops->minor;
#else
  return false;
#endif
}

// flash_attention V2 is universally faster than efficient_attention and Math
std::array<SDPBackend, num_backends> priority_order(sdp_params const& params) {
  if (!priority_order_init_) {
    priority_order_init_ = true;
    if (check_prefer_cudnn_attention()) {
      const std::vector<int64_t> cudnn_order = {
          static_cast<int64_t>(at::SDPBackend::cudnn_attention),
          static_cast<int64_t>(at::SDPBackend::flash_attention),
          static_cast<int64_t>(at::SDPBackend::efficient_attention),
          static_cast<int64_t>(at::SDPBackend::math)};
      at::globalContext().setSDPPriorityOrder(cudnn_order);
    }
  }
  return at::globalContext().sDPPriorityOrder();
}
shows how to set an order for a specific backend.
Thanks, it looks good. We will follow this code to set the XPU priority order. cc @LuFinch
@drisspg May I know if we have addressed your comments?
Force-pushed from 6506ea3 to f4cde98.
@guangyey I undid the changes on CUDA and updated the XPU priority order setting code. Please help review.
Looks better.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Successfully rebased.
Force-pushed from f4cde98 to 3d6a1cd.
Currently SDPA on XPU uses its own `priority_order` instead of the one from the global context, so it does not support `with sdpa_kernel(order, set_priority=True)`. This PR enables this feature. To make the default `priority_order` from the global context work for XPU, I also move the MATH backend to the lowest priority; otherwise `cudnn attention` and `overrideable attention` would never be selected.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168
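A usage sketch of what this enables on XPU (assumptions: an XPU build, and that torch.nn.attention exposes OVERRIDEABLE); the list passed with `set_priority=True` is tried in the given order, and the previous priority order is restored when the block exits:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (torch.randn(2, 8, 256, 64, device="xpu", dtype=torch.half) for _ in range(3))

# Temporarily prefer the math reference implementation, e.g. for an accuracy comparison.
with sdpa_kernel([SDPBackend.MATH, SDPBackend.OVERRIDEABLE], set_priority=True):
    ref = F.scaled_dot_product_attention(q, k, v)

# Back to the default order: with MATH now at the lowest priority, the overrideable
# (oneDNN) path is preferred over math for eligible inputs.
out = F.scaled_dot_product_attention(q, k, v)
```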