Better mapping of `sys.monitoring` branches back to source code #122762

jaltmayerpizzorno · 2024-08-06T22:10:50Z

Feature or enhancement

Proposal:

sys.monitoring allows us to record branches taken during execution, reporting them in terms of their source and destination offsets in bytecode. These offsets are meaningless to developers, however, unless they take the time to generate and study the bytecode disassembly. Code objects provide co_positions() to map bytecode locations to the source code that originated it, but the branches don't always map to locations that make sense to someone only looking at the source code.

For example, when executed,

def foo(x):
    if x: print(x)
foo(0)

shows a branch from offset 4, which maps to the if keyword, to offset 30, which also maps to that if.

  1           0 RESUME                   0

  2           2 LOAD_FAST                0 (x)
              4 POP_JUMP_IF_FALSE       12 (to 30)
              6 LOAD_GLOBAL              1 (NULL + print)
             16 LOAD_CONST               1 ('x')
             18 CALL                     1
             26 POP_TOP
             28 RETURN_CONST             0 (None)
        >>   30 RETURN_CONST             0 (None)

ex1.py 2:7-2:8 (foo@4) -> 2:7-2:8 (foo@30)

How can/should a coverage tool attempting to report this branch recognize that it is a branch out of the function?
The branch does go to a RETURN_... opcode, but is that reliable?

Branches from for loops also show a bit obfuscated. In the example

def foo():
    for i in range(3):
        print(i)

foo()

the branch taken into the block containing the print call actually shows as a branch to the i in line 2:

  1           0 RESUME                   0

  2           2 LOAD_GLOBAL              1 (NULL + range)
             12 LOAD_CONST               1 (3)
             14 CALL                     1
             22 GET_ITER
        >>   24 FOR_ITER                13 (to 54)
             28 STORE_FAST               0 (i)

  3          30 LOAD_GLOBAL              3 (NULL + print)
             40 LOAD_FAST                0 (i)
             42 CALL                     1
             50 POP_TOP
             52 JUMP_BACKWARD           15 (to 24)

  2     >>   54 END_FOR
             56 RETURN_CONST             0 (None)

ex2.py 2:4-3:16 (foo@24) -> 2:8-2:9 (foo@28)

The final branch once the loop is done also shows a bit funny, not to offset 54, as expected, but to offset 56:

ex2.py 2:4-3:16 (foo@24) -> 2:4-3:16 (foo@56)

I've seen unexpected behavior with while loops as well... I can probably recreate it if desired.

@markshannon, @nedbat and I spoke about this issue previously; Mark asked me to open this issue. It should arguably be a bug report rather than a feature request... it can be your call, Mark.

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

I spoke briefly about these issues in my PyCon'24 talk: https://www.youtube.com/watch?v=X9aXWeLC_C0

The text was updated successfully, but these errors were encountered:

markshannon · 2024-08-07T14:35:31Z

Making locations "make sense to someone only looking at the source code" is hard as it is somewhat subjective, but I'll do my best 🙂

It seems that the locations that don't "make sense ..." are branches that exit the scope.
The False branch in:

def foo(x):
    if x: print(x)

and exiting the loop in:

def foo():
    for i in range(3):
        print(i)

The problem is that there is no location for the exit from the function, but we need to provide a location for the destination of the branch.

PEP 626 says that the line events must reflect the lines executed, so we cannot introduce spurious line changes.
PEP 657 refines lines to locations, but the same reasoning applies. Therefore any branch that exits the scope must have destination_location == source_location to avoid introducing extra events. Any branch that does not exit the scope should have destination_location != source_location. If you find either of statements to not be true, please submit an issue for that case.

Not all locations have been correct in earlier versions and the locations for some parts of compounds statements were overly broad, but that has been fixed now. I believe the locations are all good in 3.13 and 3.14 @iritkatriel is that correct?

The final branch once the loop is done also shows a bit funny, not to offset 54, as expected, but to offset 56

The actual offsets should be regarded as an implementation detail. It is the locations that matter. The runtime may skip instructions that it can tell it doesn't need to execute them.
It is probably best to treat an offset as just an index into co_positions for the actual location, not as anything meaningful in itself.

iritkatriel · 2024-08-07T15:33:59Z

I believe the locations are all good in 3.13 and 3.14 @iritkatriel is that correct?

3.13 and 3.14 are awesome.

The only open issue I am aware of re location is #96999 .

jaltmayerpizzorno · 2024-08-07T15:59:38Z

Thanks, Mark. Here's an example with while (where I added a return None just to avoid the branch-to-exit):

def foo(n):
    while n<3:
        print(n)
        n += 1

    return None

foo(0)

  1           0 RESUME                   0

  2           2 LOAD_FAST                0 (n)
              4 LOAD_CONST               1 (3)
              6 COMPARE_OP               2 (<)
             10 POP_JUMP_IF_FALSE       22 (to 56)

  3     >>   12 LOAD_GLOBAL              1 (NULL + print)
             22 LOAD_FAST                0 (n)
             24 CALL                     1
             32 POP_TOP

  4          34 LOAD_FAST                0 (n)
             36 LOAD_CONST               2 (1)
             38 BINARY_OP               13 (+=)
             42 STORE_FAST               0 (n)

  2          44 LOAD_FAST                0 (n)
             46 LOAD_CONST               1 (3)
             48 COMPARE_OP               2 (<)
             52 POP_JUMP_IF_FALSE        1 (to 56)
             54 JUMP_BACKWARD           22 (to 12)

  6     >>   56 RETURN_CONST             0 (None)

ex3.py 2:10-2:13 (foo@10) -> 3:8-3:13 (foo@12)
ex3.py 2:10-2:13 (foo@52) -> 2:10-2:13 (foo@54)
ex3.py 2:10-2:13 (foo@52) -> 2:10-2:13 (foo@54)
ex3.py 2:10-2:13 (foo@52) -> 6:11-6:15 (foo@56)

In the first iteration, the branch is shown as going into the block, but
the next two show as going from the condition to the condition.

jaltmayerpizzorno · 2024-08-07T16:04:06Z

Coverage tools need to be able to enumerate all possible branches... Is it reasonable to do so
by using dis to go through the bytecode, recording what every jump opcode's offsets and
their destinations map to? I.e., see if the dis.Instruction.opcode is in hasjrel or hasjabs
and interpret Instruction.arg as either a delta or absolute jump destination, then use co_positions
on those values?

markshannon · 2024-08-08T09:05:37Z

I've created an issue for the inconsistency in while statement locations.
@jaltmayerpizzorno could you make individual issues for each problem you find. That way we can track them more easily and make sure that none get overlooked.

markshannon · 2024-08-08T17:12:44Z

Coverage tools need to be able to enumerate all possible branches...

I'd be careful using dis output. It is mainly a tool for the curious and for debugging the bytecode compiler.
Having said that, what you suggest should work and I don't have a better suggestion that you can use immediately.

Maybe we should add something like co_branches() which would list all the branches. It should be straightforward to implement as we already need to be able to quickly find all branches for instrumentation.

jaltmayerpizzorno · 2024-08-08T19:24:30Z

Maybe we should add something like co_branches() which would list all the branches. It should be straightforward to implement as we already need to be able to quickly find all branches for instrumentation.

I would love that :) Coverage tools also need to know about all lines that contain code, which co_lines() provides.

markshannon · 2024-08-12T09:55:52Z

I've added co_branches() which returns an iterator of 3-tuples, (start, not_taken, taken) byte offsets to the draft PR

jaltmayerpizzorno · 2024-08-16T19:01:12Z

I've been experimenting with using the AST to identify what the branches do in the code (e.g., from an if to its block, or from one clause to another clause in an expression). This would allow us to filter out certain branches, such as those only relevant for clause coverage, or other branches (such as the one checking for a pending exception after a with context handler) that do not correspond to source code branches.

I am mapping each branch source and destination to the source code using co_positions() and comparing these to AST elements lineno, col_offset, end_lineno and end_col_offset. I then decide the function of the branch based on its source and destination, e.g.:

if the source equals the destination, the branch is going out of scope (returning from the function, etc.);
if both the source and destination of a branch lie in the test of an ast.If or ast.While, these are branches from one clause to another;
if a branch source is in a ast.For's iter and its destination is in its target, then it's a branch into the block (and I can use the block's first node's line/column for the source code mapping);
if it's ast.match_case, branches within its pattern and itself, or pattern and guard would be clause branches;

@markshannon @iritkatriel and possibly others: does this make sense? Does this way of interpreting the branches have a good chance of staying stable across Python releases?

Thank you!

iritkatriel · 2024-08-19T14:40:00Z

The structure of the AST is not stable, so something like ast.For may not even exist in the next version.

nedbat · 2024-08-31T20:18:22Z

The structure of the AST is not stable, so something like ast.For may not even exist in the next version.

It's hard to imagine the AST not having a For node while the language has a for statement.

But in any case, we'd really appreciate help with how to correctly interpret branches. co_branches() is very helpful, but what is the right way to determine that a branch is exiting the function?

def foo(x):
    if x: print(x)

produces this dis output:

1:0-1:0             0       RESUME                   0

2:7-2:8             2       LOAD_FAST                0 (x)
2:7-2:8             4       TO_BOOL
2:7-2:8            12       POP_JUMP_IF_FALSE       13 (to L1)
2:7-2:8            16       NOT_TAKEN
2:10-2:15          18       LOAD_GLOBAL              1 (print + NULL)
2:16-2:17          28       LOAD_FAST                0 (x)
2:10-2:18          30       CALL                     1
2:10-2:18          38       POP_TOP
2:10-2:18          40       RETURN_CONST             0 (None)
2:7-2:8      L1:   42       RETURN_CONST             0 (None)

and co_branches() says there is one branch starting from offset 12, with 18/42 as the taken/not-taken. What should we do to reliably say that a branch is exiting the function?

markshannon · 2024-12-31T13:56:16Z

What does exiting the function mean? Unless there is an infinite loop, all functions will exit.
I'm guessing that what you want to know is "will exit the function without any further BRANCH or LINE events.

To determine that, you'll need to project the execution until the next branch or exit is reached.
If we get to a new line, or a branch, then it doesn't exit. If we reach a RETURN_VALUE or RAISE, then it does exit.
Does that make sense?

I don't think that is something we want to support directly in CPython.

For prototyping, you could use dis.get_instructions() to find out whether a path will exit.
If that works, we could add a more streamlined way to get instructions from a code object as dis.get_instructions() is rather slow and cumbersome.

nedbat · 2025-01-01T21:24:58Z

I'm experimenting with get_instructions. I don't see a way to know which opcode are "maybe jumps" and which are "definitely jump" other than having a list of the three(?) current "definitely jump" opcodes. The "maybe jumps" are where branches happen, if I understand correctly.

nedbat · 2025-01-02T11:26:26Z

Doing some more experiments with simple code.. I'm trying to understand what offsets will be reported for BRANCH_LEFT and BRANCH_RIGHT.

foo.py:

def foo():
    for v in [1, 2]:
        if v > 0:
            x = v

    return x

foo()

Disassembled:

   1          0       RESUME                   0

   2          2       LOAD_CONST               0 ((1, 2))
              4       GET_ITER
       L1:    6       FOR_ITER                16 (to L3)
             10       NOT_TAKEN
             12       STORE_FAST               0 (v)

   3         14       LOAD_FAST                0 (v)
             16       LOAD_SMALL_INT           0
             18       COMPARE_OP             148 (bool(>))
             22       POP_JUMP_IF_TRUE         3 (to L2)
             26       NOT_TAKEN
             28       JUMP_BACKWARD           13 (to L1)

  --   L2:   32       NOP

   4         34       LOAD_FAST                0 (v)
             36       STORE_FAST               1 (x)
             38       JUMP_BACKWARD           18 (to L1)

   2   L3:   42       END_FOR
             44       POP_TOP

   6         46       LOAD_FAST_CHECK          1 (x)
             48       RETURN_VALUE

Here is the data reported by sys.monitoring as shown by run_sysmon.py. Ignore the line numbers (#2->2), they aren't important here. I'm looking at the instruction offsets (@6->12):

3.14.0a3+ (heads/main:d903b17499b, Jan  1 2025, 07:12:49) [Clang 16.0.0 (clang-1600.0.26.6)]
PY_START: foo.py@0 #0
LINE: foo.py #1
LINE: foo.py #8
PY_START: foo.py@0 #1
LINE: foo.py #2
BRANCH_LEFT: foo.py@6->12 #2->2
LINE: foo.py #3
BRANCH_RIGHT: foo.py@22->32 #3->None
LINE: foo.py #4
JUMP: foo.py@38->6 #4->2
LINE: foo.py #2
BRANCH_RIGHT: foo.py@6->46 #2->6
LINE: foo.py #6
PY_RETURN: foo.py@48 #6
PY_RETURN: foo.py@24 #8

The two branches from offset 6 are confusing:

Why is the left branch @6->12? What happened to the instruction at 10? Why isn't it reported as @6->10?
Why is the right branch @6->46? Shouldn't it be @6->42?

What are the rules for how instruction offsets get reported? How can I see the bytecode that will make the most sense as those offsets?

markshannon · 2025-01-06T15:31:42Z

For mapping back to source code, try using locations not offsets.
Branches can appear to skip instructions, especially if it is faster to do so.

When tracking which branches have executed, offsets will be best for speed.
But when it comes to mapping back to source code, locations are better.
If there is no location for the destination or it is the same as the source, then you'll need to look at the next instruction to find a new location.

All BRANCH event offsets should match the offsets in co_branches(), if that helps.

nedbat · 2025-01-06T16:52:40Z

try using locations not offsets.

I don't understand what you mean by locations here? Can you give me an example of mapping the events I showed above?

markshannon · 2025-01-07T17:10:51Z

I don't understand what you mean by locations here?

The locations provided by co_positions: The start-line, start-column, end-line, end-column quad.

Can you give me an example of mapping the events I showed above?

Suppose we have three instructions, at offsets O1, O2 and O3 at locations L1, L2 and L2 respectively. Note that the second and third locations have the same location.
It shouldn't matter whether a branch is from O1 -> O2 or from O1 -> O3 as both as from L1 -> L2.

In more concrete terms using the example above: why is the branch from @6->12 not @6->10?
It is because @10 and @12 have the same location, so @6->12 and @6->10 are effectively equivalent.

Running the example on latest main we have:

  1          0       RESUME                   0

  2          2       LOAD_CONST               0 ((1, 2))
             4       GET_ITER
      L1:    6       FOR_ITER                14 (to L3)
            10       STORE_FAST               0 (v)

  3         12       LOAD_FAST                0 (v)
            14       LOAD_SMALL_INT           0
            16       COMPARE_OP             148 (bool(>))
            20       POP_JUMP_IF_TRUE         3 (to L2)
            24       NOT_TAKEN
            26       JUMP_BACKWARD           12 (to L1)

  4   L2:   30       LOAD_FAST                0 (v)
            32       STORE_FAST               1 (x)
            34       JUMP_BACKWARD           16 (to L1)

  2   L3:   38       END_FOR
            40       POP_ITER

  6         42       LOAD_FAST_CHECK          1 (x)
            44       RETURN_VALUE

PY_START: foo.py@0 #0
LINE: foo.py #1
LINE: foo.py #8
PY_START: foo.py@0 #1
LINE: foo.py #2
BRANCH_LEFT: foo.py@6->10 #2->2
LINE: foo.py #3
BRANCH_RIGHT: foo.py@20->30 #3->4
LINE: foo.py #4
JUMP: foo.py@34->6 #4->2
LINE: foo.py #2
BRANCH_RIGHT: foo.py@6->42 #2->6
LINE: foo.py #6
PY_RETURN: foo.py@44 #6
PY_RETURN: foo.py@24 #8

nedbat · 2025-01-08T00:01:17Z

I see what you are saying. I've adjusted my code to walk the trails from each branch destination:
Starting from @6: @8-> 10, 12; (lines 2->3) or @38-> 38, 40, 42; (lines 2->6)
Starting from @20: @22-> 24, 26, 6; (lines 3->2) or @30-> 30; (lines 3->4)

This gives me a way to look up the offsets reported in the events and map them to lines. Thanks.

jaltmayerpizzorno added the type-feature A feature request or enhancement label Aug 6, 2024

markshannon mentioned this issue Aug 8, 2024

Inconsistent locations for conditional branches in while statements #122821

Open

jaltmayerpizzorno mentioned this issue Aug 15, 2024

sys.monitoring branches for match cases showing incorrectly #123044

Open

picnixz added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Aug 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better mapping of `sys.monitoring` branches back to source code #122762

Better mapping of `sys.monitoring` branches back to source code #122762

jaltmayerpizzorno commented Aug 6, 2024 •

edited by github-actions bot

Loading

markshannon commented Aug 7, 2024 •

edited

Loading

iritkatriel commented Aug 7, 2024

jaltmayerpizzorno commented Aug 7, 2024

jaltmayerpizzorno commented Aug 7, 2024 •

edited

Loading

markshannon commented Aug 8, 2024

markshannon commented Aug 8, 2024

jaltmayerpizzorno commented Aug 8, 2024

markshannon commented Aug 12, 2024

jaltmayerpizzorno commented Aug 16, 2024 •

edited

Loading

iritkatriel commented Aug 19, 2024

nedbat commented Aug 31, 2024

markshannon commented Dec 31, 2024

nedbat commented Jan 1, 2025

nedbat commented Jan 2, 2025 •

edited

Loading

markshannon commented Jan 6, 2025 •

edited

Loading

nedbat commented Jan 6, 2025

markshannon commented Jan 7, 2025

nedbat commented Jan 8, 2025

Better mapping of sys.monitoring branches back to source code #122762

Better mapping of sys.monitoring branches back to source code #122762

Comments

jaltmayerpizzorno commented Aug 6, 2024 • edited by github-actions bot Loading

Feature or enhancement

Proposal:

Has this already been discussed elsewhere?

Links to previous discussion of this feature:

markshannon commented Aug 7, 2024 • edited Loading

iritkatriel commented Aug 7, 2024

jaltmayerpizzorno commented Aug 7, 2024

jaltmayerpizzorno commented Aug 7, 2024 • edited Loading

markshannon commented Aug 8, 2024

markshannon commented Aug 8, 2024

jaltmayerpizzorno commented Aug 8, 2024

markshannon commented Aug 12, 2024

jaltmayerpizzorno commented Aug 16, 2024 • edited Loading

iritkatriel commented Aug 19, 2024

nedbat commented Aug 31, 2024

markshannon commented Dec 31, 2024

nedbat commented Jan 1, 2025

nedbat commented Jan 2, 2025 • edited Loading

markshannon commented Jan 6, 2025 • edited Loading

nedbat commented Jan 6, 2025

markshannon commented Jan 7, 2025

nedbat commented Jan 8, 2025

Better mapping of `sys.monitoring` branches back to source code #122762

Better mapping of `sys.monitoring` branches back to source code #122762

jaltmayerpizzorno commented Aug 6, 2024 •

edited by github-actions bot

Loading

markshannon commented Aug 7, 2024 •

edited

Loading

jaltmayerpizzorno commented Aug 7, 2024 •

edited

Loading

jaltmayerpizzorno commented Aug 16, 2024 •

edited

Loading

nedbat commented Jan 2, 2025 •

edited

Loading

markshannon commented Jan 6, 2025 •

edited

Loading