
[Profiler] Fix Empty C Call Queue #150370

Closed · wants to merge 2 commits

Conversation

sraikund16
Contributor

@sraikund16 sraikund16 commented Apr 1, 2025

Summary:
My commandeer of #150102

Based on the description of that PR, it seems we need to add a C call for each starting Python event with a callable, so that when tracing exits there is a matching enter for any given exit. At worst this adds some unnecessary events, but it prevents segfaults/failures. My PR just cleans up some refcount implementation and logging.
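The enter/exit balancing described above can be sketched with a toy model (hypothetical simplified Python, not PyTorch's actual C++ tracer; `replay` and the event tuples are illustrative names): seed one synthetic enter event per Python frame that is already live when tracing starts, so every exit seen later has a matching enter.

```python
# Toy model of the balancing idea: seed a synthetic enter for every
# frame already on the stack when tracing begins.

def replay(events, frames_live_at_start):
    stack = []
    for frame in frames_live_at_start:
        # Synthetic enter: adds an unnecessary event at worst, but
        # guarantees the stack is non-empty when this frame's exit arrives.
        stack.append(("synthetic_enter", frame))
    for kind, name in events:
        if kind == "enter":
            stack.append((kind, name))
        elif kind == "exit":
            # Without seeding, an exit for a pre-existing frame would
            # underflow here ("Python replay stack is empty").
            assert stack, "Python replay stack is empty"
            stack.pop()
    return stack

# An exit whose enter happened before tracing began is now matched:
leftover = replay([("exit", "f"), ("enter", "g"), ("exit", "g")], ["f"])
```

The seeded entries pair off against the otherwise-unmatched exits, which is why the worst case is only a few extra events rather than a stack underflow.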

Contributors: @arjun-choudhry

Test Plan: Ran resnet test internally. Will check CI and ask reviewers to make sure it resolves their issues.

Differential Revision: D72207570


pytorch-bot bot commented Apr 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150370

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit f7b4515 with merge base 783f045:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D72207570

@sraikund16
Contributor Author

@pramodk @oraluben do you mind checking if this PR fixes the issue?

@sraikund16 sraikund16 requested a review from briancoutinho April 1, 2025 00:07
@sraikund16 sraikund16 assigned ngimel and unassigned ngimel Apr 1, 2025
@sraikund16 sraikund16 requested a review from ngimel April 1, 2025 00:07
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D72207570

Summary:
Pull Request resolved: pytorch#150370

My commandeer of pytorch#150102

Based on description of PR it seems that we need to add C calls for each starting python event with a callable such that when the tracing exits we will have a matching enter for any given exit. My diff just cleans up some refcount impl and logging.

Test Plan: Ran resnet test internally. Will check CI and ask reviewers to make sure it resolves their issues.

Differential Revision: D72207570
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D72207570

@sraikund16 sraikund16 added the `release notes: profiler` and `topic: bug fixes` labels Apr 1, 2025
@pytorch-bot pytorch-bot bot added the `ciflow/trunk` (Trigger trunk jobs on your pull request) label Apr 1, 2025
@pramodk

pramodk commented Apr 1, 2025

@sraikund16, thanks! I am able to build with the latest change, but I won't be able to test this at the moment / today. I am attaching another simple test I saw failing with the nemo:25.02 container, in case you could test it with a local build (it's standalone):

import torch

def get_profiler():
    return torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
        with_stack=True,
    )

def profile_tensor_ops():
    device = "cuda"
    with get_profiler() as prof:
        for _ in range(5):
            x = torch.randn(1000, 1000, device=device)
            y = torch.randn(1000, 1000, device=device)
            z = torch.matmul(x, y)
            torch.cuda.synchronize()
            prof.step()

    print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=10))


if __name__ == "__main__":
    profile_tensor_ops()

and running as:

WORKSPACE_PATH=$(pwd)
docker run \
  --gpus all \
  -it \
  --rm \
  --ipc=host \
  --network=host \
  -v $WORKSPACE_PATH:$WORKSPACE_PATH \
  nvcr.io/nvidia/nemo:25.02 \
  python $WORKSPACE_PATH/test.py

It was producing:

   ....
  File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 777, in __exit__
    self.stop()
  File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 793, in stop
    self._transit_action(self.current_action, None)
  File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 836, in _transit_action
    action()
  File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 239, in stop_trace
    self.profiler.__exit__(None, None, None)
  File "/usr/local/lib/python3.12/dist-packages/torch/autograd/profiler.py", line 378, in __exit__
    self.kineto_results = _disable_profiler()
                          ^^^^^^^^^^^^^^^^^^^
RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/torch/csrc/autograd/profiler_python.cpp":982, please report a bug to PyTorch. Python replay stack is empty.

Just one note: if you test in a local build, it would be good to verify that this test fails without this PR, just for cross-checking, as I was wondering whether the behavior changes a bit based on the different build configs. But I didn't get time to verify this thoroughly...

Edit 1: By the way, the reason for the above note is that I was further cross-checking the above simple example with PyTorch containers:

  • nvcr.io/nvidia/pytorch:25.01-py3, 2.6.0a0+ecf3bae40a.nv25.0
  • nvcr.io/nvidia/pytorch:25.02-py3, 2.7.0a0+ecf3bae40a.nv25.02
  • nvcr.io/nvidia/pytorch:25.02-py3, 2.7.0a0+7c8ec84dab.nv25.03

they all use Python 3.12.3, but only 25.01-py3 (torch.__version__ -> 2.6.0a0+ecf3bae40a.nv25.01) fails with the assert. Hence, the initial conclusion that it's "only" Python-version related no longer holds (?).

Edit 2: one common thing between the failing pytorch:25.01 and nemo:25.02 is that sys.version is identical for both, '3.12.3 (main, Nov 6 2024, 18:32:19) [GCC 13.2.0]', i.e. GCC 13.2, whereas the newer PyTorch containers report 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0]!
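The container comparison above hinges on two fields, sys.version and torch.__version__. A small hypothetical helper (the `env_report` name is illustrative, not part of any library) can collect both in one place for cross-checking:

```python
import sys

def env_report():
    """Collect the fields compared above: the Python build string
    (which includes the compiler version) and the torch version."""
    info = {"python": sys.version}
    try:
        import torch  # may be absent outside the containers discussed
        info["torch"] = torch.__version__
    except ImportError:
        info["torch"] = None
    return info

print(env_report())
```

Running this inside each container makes it easy to line up the GCC build strings against which images reproduce the assert.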

@sraikund16
Contributor Author

> (quoting @pramodk's comment above)

I don't have a machine with those virtual environments set up, so I won't be able to test it myself. Since this PR is supposed to fix multiple issues, let's get it in and then follow up on these other potential issues later.

@sraikund16
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@arjun-choudhry

arjun-choudhry commented Apr 2, 2025

@sraikund16 Can you please add me as a contributor in the PR? Thanks

@clee2000
Contributor

clee2000 commented Apr 2, 2025

@pytorchbot revert -m "broke some profiler tests when building with debug asserts profiler/test_memory_profiler.py::TestMemoryProfiler::test_config_check GH job link HUD commit link" -c nosignal

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Apr 2, 2025
This reverts commit 5734909.

Reverted #150370 on behalf of https://github.com/clee2000 due to broke some profiler tests when building with debug asserts profiler/test_memory_profiler.py::TestMemoryProfiler::test_config_check [GH job link](https://github.com/pytorch/pytorch/actions/runs/14211763078/job/39822158330) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/3ac5a499ddac701f607a9f7206f9bec8871e1cbb) ([comment](#150370 (comment)))
@pytorchmergebot
Collaborator

@sraikund16 your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Apr 2, 2025
@facebook-github-bot
Contributor

@sraikund16 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@sraikund16
Contributor Author

> @sraikund16 Can you please add me as a contributor in the PR? Thanks

Added you to the PR description. I think you need to push to the branch itself to be considered a contributor on GH, though. Let me know if there is another way.

@sraikund16
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/rexporting the PR!

Details for Dev Infra team · Raised by workflow job

@sraikund16
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
(same commit message as the PR summary above)

Pull Request resolved: pytorch#150370
Approved by: https://github.com/aaronenyeshi
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
This reverts commit 5734909.

(same revert message as above)
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
(same commit message as the PR summary above)
pytorchmergebot pushed a commit that referenced this pull request Jul 23, 2025
…55446)

Hi team,

Please help review this patch.

This PR #150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.

I found that the root cause is not that we cannot get C function frames via `PyFrame_GetBack` while PythonTracer is filling start frames, but a C call event loss bug in Python 3.12.0-3.12.4, which was fixed by python/cpython@257c413 in 3.12.5.

So I think #150370 cannot fix the problem; this patch reverts its change.

There are solutions that would fix the problem correctly, such as adding a new monitoring callback to compensate for call events of methods with C functions, or overriding the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain.

~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~
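Given the affected range stated above (C call events lost on Python 3.12.0-3.12.4, fixed in 3.12.5), a minimal check for whether the running interpreter falls in that range might look like the following sketch (`in_affected_range` is an illustrative name, not a PyTorch or CPython API):

```python
import sys

def in_affected_range(version_info=sys.version_info):
    """True if this Python is 3.12.0-3.12.4, the range where C call
    events can be lost (fixed by cpython commit 257c413 in 3.12.5)."""
    return tuple(version_info[:2]) == (3, 12) and version_info[2] <= 4

print(in_affected_range((3, 12, 3)))  # -> True (affected)
print(in_affected_range((3, 12, 5)))  # -> False (fixed)
```

Such a check only identifies the micro-version range; it does not work around the event loss itself.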

Pull Request resolved: #155446
Approved by: https://github.com/sraikund16, https://github.com/cyyever
pytorchmergebot pushed a commit that referenced this pull request Jul 25, 2025
(same commit message as the Jul 23 entry above)
pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
(same commit message as the Jul 23 entry above)
yangw-dev pushed a commit that referenced this pull request Aug 1, 2025
(same commit message as the Jul 23 entry above)
yangw-dev pushed a commit that referenced this pull request Aug 1, 2025
(same commit message as the Jul 23 entry above)
Labels
ci-no-td (Do not run TD on this PR) · ciflow/trunk (Trigger trunk jobs on your pull request) · fb-exported · Merged · release notes: profiler (release notes category) · Reverted · topic: bug fixes (topic category)
8 participants