blog edit #1995

Merged

merged 1 commit into from Apr 24, 2025
8 changes: 4 additions & 4 deletions _posts/2025-04-23-pytorch-2-7.md
@@ -41,13 +41,13 @@ This release is composed of 3262 commits from 457 contributors since PyTorch 2.6
<tr>
<td>
</td>
-<td>FlexAttention LLM <span style="text-decoration:underline;">first token processing</span> on X86 CPUs
+<td>FlexAttention LLM <span style="text-decoration:underline;">first token processing</span> on x86 CPUs
</td>
</tr>
<tr>
<td>
</td>
-<td>FlexAttention LLM <span style="text-decoration:underline;">throughput mode optimization</span> on X86 CPUs
+<td>FlexAttention LLM <span style="text-decoration:underline;">throughput mode optimization</span> on x86 CPUs
</td>
</tr>
<tr>
@@ -135,9 +135,9 @@ For more information regarding Intel GPU support, please refer to [Getting Start
See also the tutorials [here](https://pytorch.org/tutorials/prototype/inductor_windows.html) and [here](https://pytorch.org/tutorials/prototype/pt2e_quant_xpu_inductor.html).


-### [Prototype] FlexAttention LLM first token processing on X86 CPUs
+### [Prototype] FlexAttention LLM first token processing on x86 CPUs

-FlexAttention X86 CPU support was first introduced in PyTorch 2.6, offering optimized implementations — such as PageAttention, which is critical for LLM inference—via the TorchInductor C++ backend. In PyTorch 2.7, more attention variants for first token processing of LLMs are supported. With this feature, users can have a smoother experience running FlexAttention on x86 CPUs, replacing specific *scaled_dot_product_attention* operators with a unified FlexAttention API, and benefiting from general support and good performance when using torch.compile.
+FlexAttention x86 CPU support was first introduced in PyTorch 2.6, offering optimized implementations — such as PageAttention, which is critical for LLM inference—via the TorchInductor C++ backend. In PyTorch 2.7, more attention variants for first token processing of LLMs are supported. With this feature, users can have a smoother experience running FlexAttention on x86 CPUs, replacing specific *scaled_dot_product_attention* operators with a unified FlexAttention API, and benefiting from general support and good performance when using torch.compile.


### [Prototype] FlexAttention LLM throughput mode optimization
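For context on the paragraph edited above, here is a minimal sketch (not part of this PR or the blog post) of what calling the unified FlexAttention API under torch.compile looks like on CPU. The tensor shapes and the causal `score_mod` are illustrative assumptions; only the general call pattern is taken from the text.

```python
# Minimal sketch, not part of this PR: the unified FlexAttention API with
# torch.compile on CPU. Shapes and the causal score_mod are assumptions
# chosen for illustration.
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 1, 8, 1024, 64  # hypothetical batch, heads, sequence, head dim
q = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)
v = torch.randn(B, H, S, D)

def causal(score, b, h, q_idx, kv_idx):
    # score_mod hook: mask out attention to future positions.
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

# Compiling routes flex_attention through TorchInductor, which supplies the
# optimized CPU lowering the post describes.
compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, score_mod=causal)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```

The same call stands in for a hand-rolled *scaled_dot_product_attention* variant with a custom mask, which is the substitution the edited paragraph describes.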