
Update upstream opinfo to generate appropriately scaled sample inputs #158018

Open: wants to merge 5 commits into main

Conversation

matthewhagraphcore (Collaborator)

Currently, opinfo generates random inputs for _scaled_mm but does not enforce type saturation, unlike the upstream test implementation, which explicitly saturates both fp8 data types.

Problem:
The current random input generation in sample_inputs_scaled_mm may produce values outside the representable range of the fp8 input types, missing the overflow/saturation edge cases that the CUDA tests intentionally cover.
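For context, the fp8 formats accepted by _scaled_mm have narrow representable ranges. A small sketch of their maximum finite values, derived from the bit layouts (these equal torch.finfo(dtype).max in PyTorch):

```python
# Maximum finite values of the fp8 types used by _scaled_mm, from the bit layouts:
#   float8_e4m3fn: 4 exponent bits (bias 7), 3 mantissa bits, no inf encoding
#                  -> max finite = 1.75 * 2**8  = 448.0
#   float8_e5m2:   5 exponent bits (bias 15), 2 mantissa bits (IEEE-style)
#                  -> max finite = 1.75 * 2**15 = 57344.0
E4M3_MAX = 1.75 * 2 ** 8
E5M2_MAX = 1.75 * 2 ** 15
print(E4M3_MAX, E5M2_MAX)  # -> 448.0 57344.0
```

Any randomly generated value above these bounds overflows on conversion unless it is saturated first.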

Solution:
Modify the sample_inputs_scaled_mm implementation to:

  1. Apply the same input saturation logic used in the CUDA _scaled_mm tests:

     def test_scaled_mm_vs_emulated(self, base_dtype):
         torch.manual_seed(42)
         input_dtype = e4m3_type
         output_dtype = base_dtype
         compare_type = torch.float32

         x = torch.randn(16, 16, device="cuda", dtype=base_dtype)
         y = torch.randn(32, 16, device="cuda", dtype=base_dtype).t()

         x_scale = tensor_to_scale(x, input_dtype).float()
         y_scale = tensor_to_scale(y, input_dtype).float()

         x_fp8 = to_fp8_saturated(x * x_scale, input_dtype)
         y_fp8 = to_fp8_saturated(y * y_scale, input_dtype)

         # Calculate actual F8 mm
         out_scaled_mm = mm_float8(
             x_fp8,
             y_fp8,
             a_scale=x_scale,
             b_scale=y_scale,
             output_dtype=output_dtype,
         )

         # Calculate emulated F8 mm
         out_emulated = mm_float8_emulated(
             x_fp8,
             x_scale,
             y_fp8,
             y_scale,
             output_dtype,
         )

         if output_dtype != base_dtype:
             out_scaled_mm = out_scaled_mm.to(compare_type)
             out_scaled_mm = out_scaled_mm / tensor_to_scale(out_scaled_mm, input_dtype)

             out_emulated = out_emulated.to(compare_type)
             out_emulated = out_emulated / tensor_to_scale(out_emulated, input_dtype)

         if base_dtype in {torch.bfloat16, torch.float16}:
             atol, rtol = 7e-2, 7e-2
         else:
             atol, rtol = 3e-3, 3e-3

         torch.testing.assert_close(out_scaled_mm, out_emulated, atol=atol, rtol=rtol)
  2. Maintain the existing random generation approach but clamp values to type-appropriate ranges.
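A minimal, framework-free sketch of the clamping described in step 2 (the helper name and list-based signature are illustrative only; the actual opinfo change operates on tensors):

```python
E4M3_MAX = 448.0  # max finite value of float8_e4m3fn (torch.finfo(torch.float8_e4m3fn).max)

def saturate(values, fp8_max=E4M3_MAX):
    """Clamp each value into [-fp8_max, fp8_max] so a subsequent fp8
    conversion saturates instead of overflowing to inf/NaN. Mirrors the
    intent of to_fp8_saturated in the CUDA _scaled_mm tests."""
    return [max(-fp8_max, min(v, fp8_max)) for v in values]

print(saturate([1000.0, -600.0, 3.5]))  # -> [448.0, -448.0, 3.5]
```

Values already inside the range pass through unchanged, so the existing random generation is preserved except at the extremes.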

This required a bit of alteration to adapt to the late realization of the inputs. I think I have done this correctly, but am open to suggestions.

@pytorch-bot pytorch-bot bot added the release notes: python_frontend python frontend release notes category label Jul 10, 2025

pytorch-bot bot commented Jul 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158018

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit a2174ed with merge base 178515d:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@mikaylagawarecki mikaylagawarecki requested a review from drisspg July 14, 2025 15:02
@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jul 14, 2025
@drisspg
Contributor

drisspg commented Jul 15, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased PYT-996-update-upstream-opinfo-to-generate-appropriately-scaled-sample-inputs onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout PYT-996-update-upstream-opinfo-to-generate-appropriately-scaled-sample-inputs && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the PYT-996-update-upstream-opinfo-to-generate-appropriately-scaled-sample-inputs branch from 5c2a209 to 5192752 Compare July 15, 2025 23:36
@matthewhagraphcore
Collaborator Author

REGRESSION: benchmark ('basic_modules_ListOfLinears_inductor_gpu_force_shape_pad', 'compile_time_instruction_count') failed; actual result 17947513212 is 1.80% higher than expected 17630000000 ±1.50%. If this is an expected regression, please update the expected results.

This seems unrelated? I've made no changes to inductor.
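For what it's worth, the reported percentage does check out against the raw instruction counts, so the failure is a genuine threshold breach rather than a reporting glitch:

```python
# Sanity-check the benchmark bot's arithmetic using the numbers it reported.
expected = 17_630_000_000
actual = 17_947_513_212
overshoot = (actual - expected) / expected * 100
print(f"{overshoot:.2f}% over expected (allowed noise band: ±1.50%)")  # -> 1.80%
```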

@matthewhagraphcore
Collaborator Author

@drisspg Could I get a re-review?

Labels: open source · release notes: python_frontend (python frontend release notes category) · topic: not user facing (topic category) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
5 participants