
foreach CUDA tests flaky on CUDA 12.6+ due to flaky profiler results #148681

@atalman

Description

While updating the eager tests to CUDA 12.6 in PR #148602, we see the following test failure.

Failing workflow: https://github.com/pytorch/pytorch/actions/runs/13690790469/job/38285054097#step:22:4164

_ TestForeachCUDA.test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_bfloat16 _
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 1159, in test_wrapper
    return test(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/mock.py", line 1833, in _inner
    return f(*args, **kw)
  File "/var/lib/jenkins/workspace/test/test_foreach.py", line 386, in test_pointwise_op_with_tensor_of_scalarlist_overload
    self._pointwise_test(
  File "/var/lib/jenkins/workspace/test/test_foreach.py", line 510, in _pointwise_test
    actual = op(inputs, self.is_cuda, is_fastpath, **kwargs)
  File "/var/lib/jenkins/workspace/test/test_foreach.py", line 90, in __call__
    assert mta_called == (expect_fastpath and (not zero_size)), (
AssertionError: mta_called=False, expect_fastpath=True, zero_size=False, self.func.__name__='_foreach_addcmul', keys=('aten::_foreach_addcmul', 'Unrecognized', 'aten::result_type', 'aten::empty_strided', 'cudaLaunchKernel', 'cudaDeviceSynchronize')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3150, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3150, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 454, in instantiated_test
    result = test(self, **param_kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 1612, in wrapper
    fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 1171, in test_wrapper
    raise e_tracked from e
Exception: Caused by sample input at index 9: SampleInput(input=TensorList[Tensor[size=(0,), device="cuda:0", dtype=torch.bfloat16], Tensor[size=(19, 19), device="cuda:0", dtype=torch.bfloat16], Tensor[size=(18, 18), device="cuda:0", dtype=torch.bfloat16], Tensor[size=(0,), device="cuda:0", dtype=torch.bfloat16], Tensor[size=(16, 16), device="cuda:0", dtype=torch.bfloat16], Tensor[size=(15, 15), device="cuda:0", dtype=torch.bfloat16], Tensor[size=(0,), device="cuda:0", dtype=torch.bfloat16], Tensor[size=(13, 13), device="cuda:0", dtype=torch.bfloat16], Tensor[size=(12, 12), device="cuda:0", dtype=torch.bfloat16], Tensor[size=(0,), device="cuda:0", dtype=torch.bfloat16], Tensor[size=(10, 10), device="cuda:0", dtype=torch.bfloat16], Tensor[size=(9, 9), device="cuda:0", dtype=torch.bfloat16], Tensor[size=(0,), device="cuda:0", dtype=torch.bfloat16], Tensor[size=(7, 7), device="cuda:0", dtype=torch.bfloat16], Tensor[size=(6, 6), device="cuda:0", 
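For context, the failing assertion at test_foreach.py line 90 checks whether the fused multi-tensor-apply fastpath actually ran by profiling the op and inspecting the recorded event keys; in the failure above the profiler reports an "Unrecognized" key instead of the expected fused kernel event. A minimal sketch of that profiler-based check (names and structure assumed, not the exact test code; CPU tensors used here so it runs without a GPU):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Run a foreach op under the profiler, as the test harness does.
tensors = [torch.ones(4), torch.ones(4)]
with profile(activities=[ProfilerActivity.CPU]) as prof:
    torch._foreach_addcmul(tensors, tensors, tensors, value=1.0)

# The test inspects the recorded event keys; on CUDA it asserts that a
# fused multi_tensor_apply kernel event is present when the fastpath is
# expected. On CUDA 12.6+ that kernel event is sometimes reported as
# "Unrecognized", which makes the assertion flaky.
keys = {evt.key for evt in prof.key_averages()}
print("aten::_foreach_addcmul" in keys)
```

The flakiness is thus in the profiler's kernel-name resolution, not in the foreach kernels themselves: the fastpath kernel launches, but its event key is not always recognizable in the trace.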

cc @ptrblck @msaroufim @eqy @janeyx99 @crcrpar @tinglvv @nWEIdia

Versions

2.7.0 nightly

Metadata


Labels

module: cuda (Related to torch.cuda, and CUDA support in general)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
