[inductor][cpu] Fix double-offset issue in GEMM_TEMPLATE #159233
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159233
Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job, 2 Unrelated Failures as of commit 674a976 with merge base f636736.

NEW FAILURES - The following jobs have failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
@CaoE could you help take a look at this fix?
hi @Phoslight, thanks for the fix. I'd like to understand more about why the offset causes an out-of-bounds access. AFAIK, the offset should come from the original view node.
Thank you for your reply, Leslie. Without this patch, the cpp template and the example inputs both add an offset, which causes the double-offset issue. Hope the chart above helps clarify the issue.
Gentle ping. Any other reviews? Thanks in advance.
not familiar with this code, but approving workflows to run
# GEMM_TEMPLATE emits code like:
#   W.data_ptr[offset + ...]
# but the data_ptr already includes the offset.
This makes it sound like the correct fix is to remove the offset from the index calculation in the emitted code rather than copying the tensor. I assume that would break something else?
Fixes #158076

Basically, the gemm template generates code of the form `W.data_ptr[offset + ...]` (see the review thread above). However, when the input tensor `W` has a storage offset, this results in a double-offset issue: the resulting pointer ends up `2 * 200704000LL` away from `W.storage().data_ptr()`, which causes an out-of-bounds access.
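To make the failure mode concrete, here is a small, hedged illustration (my own, not from the PR) of why adding the offset on top of the data pointer double-counts it:

```python
import torch

base = torch.randn(8, 4)
w = base[4:]  # a view starting 16 elements (4 rows x 4 cols) into the storage
assert w.storage_offset() == 16

# data_ptr() already points past the storage offset...
assert w.data_ptr() == base.data_ptr() + w.storage_offset() * w.element_size()

# ...so emitted code of the form `W.data_ptr[offset + i]` applies the
# offset twice, reading up to 2 * offset past the storage base.
```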
The storage offset of `W` is introduced by this patch, but I think it's a reasonable fix, so `cpp_gemm_template.py` should handle input matrices with storage offsets properly. I think a good way to fix this issue is to create a new matrix that has no storage offset, as sketched below.
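A minimal sketch of that idea, assuming a plain clone suffices (illustrative only; the actual fix lives in `cpp_gemm_template.py`):

```python
import torch

base = torch.randn(8, 4)
w = base[4:]                    # a view with storage_offset() == 16
assert w.storage_offset() == 16

# Cloning materializes the view into fresh storage, so the result has
# storage offset 0 and can be indexed safely from its data_ptr().
w_clean = w.clone(memory_format=torch.contiguous_format)
assert w_clean.storage_offset() == 0
assert torch.equal(w_clean, w)
```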
When `should_block_weights` is true, `block_weight()` creates a clean new matrix, so that branch is not affected by this issue.

BTW, I've also examined the FX IRs generated by `torch.compile()`, as well as the generated Python module, and they are correct.
The newly added test in `test_cpu_select_algorithm.py` can reproduce the issue. With this patch, the crash is fixed; it also resolves the crash reported in #158076. A hedged sketch of the input pattern appears below.
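For reference, a sketch of the kind of input such a test would exercise (names and shapes are mine, not the actual test, and hitting the GEMM template additionally requires the CPU max-autotune path):

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        big = torch.randn(2, 64, 32)
        self.w = big[1]  # weight is a view with storage_offset() == 64 * 32

    def forward(self, x):
        return x @ self.w.t()

m = M().eval()
x = torch.randn(16, 32)
with torch.no_grad():
    compiled = torch.compile(m, mode="max-autotune")
    torch.testing.assert_close(compiled(x), m(x))
```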
I ran the CPU tests in `test_cpu_select_algorithm.py`, but many of them are skipped due to MKL and AMX requirements. I'd appreciate it if someone could help verify the test.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben