
gh-135953: Implement sampling tool under profile.sample #135998


Merged: 41 commits from sampling-profiler into python:main, Jul 10, 2025

Conversation

@lkollar (Contributor) commented Jun 26, 2025

No description provided.

@lkollar lkollar changed the title Implement sampling tool under profile.sample gh-135953: Implement sampling tool under profile.sample Jun 26, 2025
@lkollar lkollar force-pushed the sampling-profiler branch 4 times, most recently from 0828aa3 to 57e3152 Compare June 27, 2025 20:09
@lkollar lkollar changed the title gh-135953: Implement sampling tool under profile.sample gh-135953: Implement sampling tool under profile.sample Jul 3, 2025
lkollar added 5 commits July 3, 2025 21:35
This allows adding a new 'sample' submodule and enables invoking the
sampling profiler through 'python -m profile.sample', while retaining
backwards compatibility when using 'profile'.
Implement a statistical sampling profiler that can profile external
Python processes by PID. Uses the _remote_debugging module and converts
the results to pstats-compatible format for analysis.
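
In rough pseudocode, the sampling loop described above could look like the sketch below; read_stack is a stand-in for whatever the _remote_debugging module exposes, and the names and signatures here are illustrative, not the PR's actual API:

import time
from collections import Counter

def sample(read_stack, duration=10.0, interval=0.001):
    # read_stack() is a placeholder for the remote unwinder: it should return
    # the target process's current Python call stack as a sequence of
    # (filename, lineno, funcname) tuples.
    counts = Counter()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for frame in read_stack():
            counts[frame] += 1
        time.sleep(interval)
    # pstats entry layout is {func: (cc, nc, tt, ct, callers)}; here the call
    # count slots carry sample counts instead.
    return {func: (n, n, 0.0, 0.0, {}) for func, n in counts.items()}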
This variant overrides how column headers are printed to avoid
conflating call counts with sample counts.

The SampledStats results are stored in exactly the same format as Stats, but because the results represent sample counts rather than call counts, the column headers differ to reflect this.
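
A minimal sketch of that idea, assuming only that pstats.Stats.print_title is the method that emits the header row; the actual SampledStats class in the PR may label its columns differently:

import pstats

class SampledStats(pstats.Stats):
    # Data layout is identical to Stats, but the count columns hold sample
    # counts, so relabel the header to avoid suggesting call counts.
    def print_title(self):
        print('  nsamples  tottime  persample  cumtime  persample', end=' ', file=self.stream)
        print('filename:lineno(function)', file=self.stream)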

To ensure that the pstats browser instantiates the right object to handle the correct columns, add a factory function which instantiates the correct class. As the Stats class can only handle either a filename or an object which provides the 'stats' attribute in a pre-parsed format, this provides a StatsLoaderShim to avoid marshalling the data twice (once to check the marker and once again in the Stats class if we were to pass the file name).
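
A sketch of the mechanism this describes: besides a filename, pstats.Stats accepts any object exposing create_stats() and a stats attribute, so a shim can hand over already-unmarshalled data and a factory can pick the class based on a marker. The marker handling and factory below are illustrative, not the PR's actual on-disk format:

import marshal
import pstats

SampledStats = pstats.Stats  # placeholder; see the SampledStats sketch above

class StatsLoaderShim:
    """Present pre-parsed profile data the way pstats.Stats expects it."""
    def __init__(self, raw_stats):
        self.stats = raw_stats
    def create_stats(self):
        # The data is already in its final form; Stats only reads self.stats.
        pass

def load_stats(filename):
    with open(filename, 'rb') as f:
        data = marshal.load(f)              # unmarshal exactly once
    sampled = isinstance(data, dict) and data.pop('__sampled__', False)  # hypothetical marker
    cls = SampledStats if sampled else pstats.Stats
    return cls(StatsLoaderShim(data))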
Implements collapsed stack trace format output for the sampling
profiler. This format represents complete call stacks as semicolon-
delimited strings with sample counts, making it compatible with
external flamegraph generation tools like flamegraph.pl.

The format uses filename:function:line notation and stores call trees
during sampling for efficient post-processing into the collapsed format.
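
For illustration only, turning an aggregated mapping of call stacks to sample counts into that collapsed format might look like this (not the PR's code):

def to_collapsed(stack_counts):
    # stack_counts maps a tuple of (filename, funcname, lineno) frames, ordered
    # from outermost to innermost, to the number of samples observed for it.
    lines = []
    for stack, count in stack_counts.items():
        frames = ';'.join(f'{fname}:{func}:{lineno}' for fname, func, lineno in stack)
        lines.append(f'{frames} {count}')
    return '\n'.join(lines)

# The output can be piped into flamegraph.pl to render a flame graph.
print(to_collapsed({(('app.py', 'main', 3), ('app.py', 'work', 10)): 42}))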
@lkollar lkollar force-pushed the sampling-profiler branch 2 times, most recently from a019e6a to a0be753 Compare July 3, 2025 21:09
@lkollar lkollar force-pushed the sampling-profiler branch from a0be753 to aeca768 Compare July 3, 2025 21:58
@pablogsal pablogsal requested a review from ambv July 6, 2025 17:38
@pablogsal pablogsal self-assigned this Jul 6, 2025
@lkollar lkollar requested a review from erlend-aasland as a code owner July 10, 2025 08:28
@pablogsal pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 10, 2025
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit a33d166 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F135998%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 10, 2025
@pablogsal pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 10, 2025
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 0235127 🤖


@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 10, 2025
@pablogsal pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 10, 2025
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 90260a6 🤖


@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 10, 2025
@pablogsal pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 10, 2025
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 5683b76 🤖


@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 10, 2025
@pablogsal
Member

!buildbot android

@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 5a83439 🤖


The command will test the builders whose names match the following regular expression: android

The builders matched are:

  • aarch64 Android PR
  • AMD64 Android PR

@pablogsal pablogsal merged commit 59acdba into python:main Jul 10, 2025
41 checks passed
@pablogsal
Member

Congrats @lkollar 🎉

@lkollar lkollar deleted the sampling-profiler branch July 10, 2025 18:26
AndPuQing pushed a commit to AndPuQing/cpython that referenced this pull request Jul 11, 2025
…#135998)

Implement a statistical sampling profiler that can profile external
Python processes by PID. Uses the _remote_debugging module and converts
the results to pstats-compatible format for analysis.


Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
Pranjal095 pushed a commit to Pranjal095/cpython that referenced this pull request Jul 12, 2025
picnixz pushed a commit to picnixz/cpython that referenced this pull request Jul 13, 2025
Contributor

Apologies if these comments are not 100% accurate, but I just had a quick scan of this source and the implementation of the unwinder. My initial reaction is that it is not clear what the collected data actually represents.

Clearly this is not a CPU-time profile (the profiler is just counting stacks, although with a rough conversion factor that depends on the sampling rate one could turn those into physical time estimates, if one really wants to) because we don't know if the collected stacks were on CPU or not. In the build of CPython with the GIL, it looks like we're only sampling the thread that is holding the GIL. One may assume that a thread that holds the GIL is on CPU, but this does not need to be the case (indeed the application might have a "bug" whereby it's holding the GIL when it could actually release it), so the profiles one gets are on-GIL profiles, which are neither wall-time nor CPU-time profiles in general.

I think it would be beneficial if the unwinder returned extra information, such as whether the stack was (likely) on CPU and whether its thread was (likely) holding the GIL. And if one wants a wall-time mode for the profiler, sampling just the on-GIL thread won't provide an accurate picture of where each thread is spending its wall-time.

@pablogsal (Member) commented Aug 3, 2025

Thanks for the detailed feedback - you raise very important points about the semantic clarity of what we’re actually measuring here. You’re absolutely right that the current state creates “on-GIL profiles” rather than true CPU-time or wall-time profiles, and that holding the GIL doesn’t guarantee CPU execution.

The missing point here is that the profiler is not finished yet and there are plenty of things we still need to finalize (we are also waiting for the PEP to be approved to put it into its final namespace). Right now we're focused on getting the base infrastructure in place and working reliably across platforms (we don't even have the mode to run scripts or modules yet, only attaching).

What we had in mind here is something close to what you're hinting at - in GIL mode, we want to avoid re-sampling threads that aren't actually moving (i.e., would produce identical stack traces), so the idea is that the profiler only samples the thread with the GIL and signals to the frontend that the other stacks are the same. The frontend will then use the last samples to calculate the stats. For now we don't have the signaling infrastructure for that, but it's on the roadmap.

I'm particularly intrigued by your point about CPU detection. Do you have any concrete plan in mind for what you propose? Unless I am missing something, there's no good portable way to reliably determine if a thread is actually on-CPU from a remote process without sacrificing significant performance. On Linux we could theoretically examine kernel stacks from /proc or check the stat pseudohandle, but that's racy, likely slow, and doesn't work on macOS or Windows. And I don't look forward to starting to call NtQuerySystemInformation all over the place.

Do you have thoughts on practical approaches for this? What’s your take on the best path forward for providing more semantic clarity about what the samples actually represent?

Also, PRs welcome! 😉

Contributor

The missing point here is that the profiler is not finished yet and there are plenty of things we still need to finalize (we are also waiting for the PEP to be approved to put it into its final namespace).

Ah fair enough.

in GIL mode, we want to avoid re-sampling threads that aren’t actually moving

That's an interesting idea, but I struggle to see how this could work 🤔 If a thread is essentially always off-GIL then it will never be sampled, or it could switch between idle functions but the stack would never get re-sampled by the profiler. For example, consider this case:

from threading import Thread

def foo():
    a()  # on-CPU
    b()  # I/O-bound, off-GIL

thread = Thread(target=foo)

The profiler will likely see a on the stack, but might miss the samples where b is on the stack because the thread will be off-GIL. Then in wall-time profiles it will look as if a was the only function running on that thread.

I'm particularly intrigued by your point about CPU detection. Do you have any concrete plan in mind for what you propose? Unless I am missing something, there's no good portable way to reliably determine if a thread is actually on-CPU from a remote process without sacrificing significant performance. On Linux we could theoretically examine kernel stacks from /proc or check the stat pseudohandle, but that's racy, likely slow, and doesn't work on macOS or Windows. And I don't look forward to starting to call NtQuerySystemInformation all over the place.

Well this approach is already racy by nature since threads are not being stopped before taking samples so there could be all sorts of inconsistencies already. In Austin we have platform-dependent implementations of py_thread__is_idle that determine whether a thread is idle or not (and yes it uses NtQuerySystemInformation on Windows). Surely it is racy, but I don't know of other ways of finding out about the thread status. In all my experiments the accuracy is pretty good, and the overhead not too bad. With simple stacks Austin can still sample at over 100 kHz, with the main overhead coming from the remote reads of datastack chunks.

Do you have thoughts on practical approaches for this? What’s your take on the best path forward for providing more semantic clarity about what the samples actually represent?

I think samples would have to include the CPU state of threads at the very least, to provide both wall- and CPU-time modes, which are pretty common for profilers. The GIL state might be an added bonus, e.g. for GIL contention analysis, figuring out if there is a lot of idle time spent with the GIL held, etc.

Also, PRs welcome! 😉

I'm currently short on bandwidth so I don't think I'll be able to contribute much in the short term, but I'm more than happy to review PRs if needed and share more of my experience developing Austin. Also, I wonder if there isn't a better place to take this discussion, so as to have all the details in one place?

Member

The profiler will likely see a on the stack, but might miss the samples where b is on the stack because the thread will be off-GIL. Then in wall-time profiles it will look as if a was the only function running on that thread.

No, the condition would look like:

  • Go over all the threads
  • For the threads that don't have the GIL check the current frame
  • If the current frame is the same (as in 'the same pointer' to the frame) then assume the stack hasn't changed.
  • Return something that signals that the stack is the same instead of resolving the full stack.

Technically it is possible that at low sample rates, if you have A -> B -> C, then the stack may have moved to A -> B2 -> C where C somehow has the same pointer. I think for this we can introduce some monotonically increasing version numbers somewhere, but I still need to explore this. In any case the chances of that happening are very low.
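
Roughly, the per-thread check could look something like this (all names are illustrative, nothing of this exists in the PR yet):

last_top_frame = {}  # thread_id -> frame pointer observed in the previous sample

def sample_thread(thread_id, top_frame_addr, unwind):
    # For threads that don't hold the GIL: if the top frame pointer hasn't
    # moved since the last sample, assume the whole stack is unchanged and
    # skip the expensive remote unwind.
    if last_top_frame.get(thread_id) == top_frame_addr:
        return "SAME_AS_LAST_SAMPLE"
    last_top_frame[thread_id] = top_frame_addr
    return unwind(thread_id)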

Well this approach is already racy by nature since threads are not being stopped before taking samples so there could be all sorts of inconsistencies already.

Yes, but this is worse. It is one thing that we are sampling different threads at different times, but that's not too bad as long as we can identify when a single sample is consistent or likely consistent. But if we sample the "on-CPU" info and the actual stack separately, that is worse because it is possible that the thread has moved or the syscall is exiting or whatever. I think I need to benchmark a bit to know how bad this is, but I will trust your experience if you say it is not too bad.

In Austin we have platform-dependent implementations of py_thread__is_idle that determine whether a thread is idle or not (and yes it uses NtQuerySystemInformation on Windows). Surely it is racy, but I don't know of other ways of finding out about the thread status. In all my experiments the accuracy is pretty good, and the overhead not too bad. With simple stacks Austin can still sample at over 100 kHz, with the main overhead coming from the remote reads of datastack chunks.

Well, if the overhead is not too bad maybe we could make it opt-in? So the user can somehow switch between on-CPU and wall time?

I think samples would have to include the CPU state of threads at the very least, to provide both wall- and CPU-time modes, which are pretty common for profilers. The GIL state might be an added bonus, e.g. for GIL contention analysis, figuring out if there is a lot of idle time spent with the GIL held, etc.

Ok, I am happy to explore this once we finish the rest of the stuff we have pending (which is still a lot).

I'm currently short on bandwidth so I don't think I'll be able to contribute much in the short term, but I'm more than happy to review PRs if needed and share more of my experience developing Austin. Also, I wonder if there isn't a better place to take this discussion, so as to have all the details in one place?

Fair enough. For design we are mostly communicating in chat or email to not saturate the issues, because that tends to not work fantastically. If you want I could add you to the chain, or alternatively you can wait until we can go on an issue-by-issue basis, but that won't happen until the PEP is accepted and we finish the other parts.

Contributor

No, the condition would look like:

  • Go over all the threads
  • For the threads that don't have the GIL check the current frame
  • If the current frame is the same (as in 'the same pointer' to the frame) then assume the stack hasn't changed.
  • Return something that signals that the stack is the same instead of resolving the full stack.

Ah I see, but this still feels risky, even more so considering that interpreter frames live in datastack chunks, which foster slot re-use. Maybe checking the executable object and the frame address could mitigate the problem a bit, but the risk is probably still there 🤔

But if we sample the "on-CPU" info and the actual stack separately, that is worse because it is possible that the thread has moved

Very true, but the same applies to, say, getting the top frame from PyThreadInfo, and then unwinding that. By the time you unwind the frame the thread could have moved to a different top frame. Unfortunately that's the nature of the process that is being implemented. Unless the stacks vary very wildly I think this approach is still OK (this paper from Emery Berger has some accuracy figures in this regard, so that you don't just have to take my word for it 🙂)

Well, if the overhead is not too bad maybe we could make it opt-in? So the user can somehow switch between on-CPU and wall time?

Yes I think it would be very useful to have an on-CPU switch.

If you want I could add you to the chain, or alternatively you can wait until we can go on an issue-by-issue basis, but that won't happen until the PEP is accepted and we finish the other parts.

Sure, I'm happy to wait 👍

@pablogsal (Member) commented Aug 4, 2025

Ah I see, but this still feels risky, even more so considering that interpreter frames live in datastack chunks, which foster slot re-use. Maybe checking the executable object and the frame address could mitigate the problem a bit, but the risk is probably still there 🤔

Yeah, that's where having frame version numbers can help. I need to think about how we can have those without impacting the runtime though... In any case the reuse is not that bad, since the data stack chunks also contain the locals, so unless you have EXACTLY the same locals and exactly the same layout it is super unlikely you will reuse.

I just ran an experiment on the entire CPython test suite and I got 1e-4 % of frames that reused addresses with the code object being the same. Pretty sure you can have a custom example that will trigger a much higher percentage, but at least we know this is not super common.

Very true, but the same applies to, say, getting the top frame from PyThreadInfo, and then unwinding that. By the time you unwind the frame the thread could have moved to a different top frame. Unfortunately that's the nature of the process that is being implemented.

Yes, but this is an order of magnitude less likely because you cache the entire frame storage, so within the sample you get as much atomicity as possible. This doesn't apply to generators though. But in any case the argument does matter, because saying "well, everything is racy" doesn't remove the fact that some things are more racy than others, and that adding more racy stuff is worse than having less racy stuff. Indeed, in my measurements accessing /proc/PID/stat adds 1-3 us of latency, which is not prohibitive but is still a lot. Maybe it's fine when there aren't a lot of threads...

I think we can add both modes and coordinate with the docs working group to explain this to users in a way that's not a "too many knobs" situation.


No, the condition would look like:

Go over all the threads
For the threads that don't have the GIL check the current frame
If the current frame is the same (as in 'the same pointer' to the frame) then assume the stack hasn't changed.
Return something that signals that the stack is the same instead of resolving the full stack.

This sounds like an interesting idea to try from our end too, but then you'd need to keep track of a mapping from Python thread id to the last frame pointer, right? As Gab noted above, this could be risky if the frame goes out of scope, even though you can check for such cases.

Yes, but this is an order of magnitude less likely because you cache the entire frame storage, so within the sample you get as much atomicity as possible. This doesn't apply to generators though. But in any case the argument does matter, because saying "well, everything is racy" doesn't remove the fact that some things are more racy than others, and that adding more racy stuff is worse than having less racy stuff. Indeed, in my measurements accessing /proc/PID/stat adds 1-3 us of latency, which is not prohibitive but is still a lot. Maybe it's fine when there aren't a lot of threads...

It depends on how often you query /proc/PID/stat and for how many threads, but it could be quite expensive to use /proc/PID/stat. For a continuous profiler it was too much, using a few seconds of CPU time per minute. It might be OK for ad-hoc use cases.

I think we can add both modes and coordinate with the docs working group to explain this to users in a way that's not a "too many knobs" situation.

Are you implying that users or docs WG are generally against having too many knobs?

Member

This sounds like an interesting idea to try from our end too, but then you'd need to keep track of a mapping from Python thread id to the last frame pointer, right? As Gab noted above, this could be risky if the frame goes out of scope, even though you can check for such cases.

I can't see why it is risky. It is comparing a memory address, not dereferencing the pointer.

Are you implying that users or docs WG are generally against having too many knobs?

I am saying that normally users tend to be more confused the more knobs there are, and that will make the docs harder, so I want the docs WG to recommend how to approach whatever we decide to do in the best way for users, or even whether they recommend having fewer knobs.

taegyunkim pushed a commit to taegyunkim/cpython that referenced this pull request Aug 4, 2025