Skip to content

Better mapping of sys.monitoring branches back to source code #122762

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jaltmayerpizzorno opened this issue Aug 6, 2024 · 18 comments
Open

Better mapping of sys.monitoring branches back to source code #122762

jaltmayerpizzorno opened this issue Aug 6, 2024 · 18 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@jaltmayerpizzorno
Copy link

jaltmayerpizzorno commented Aug 6, 2024

Feature or enhancement

Proposal:

sys.monitoring allows us to record branches taken during execution, reporting them in terms of their source and destination offsets in bytecode. These offsets are meaningless to developers, however, unless they take the time to generate and study the bytecode disassembly. Code objects provide co_positions() to map bytecode locations to the source code that originated it, but the branches don't always map to locations that make sense to someone only looking at the source code.

For example, when executed,

def foo(x):
    if x: print(x)
foo(0)

shows a branch from offset 4, which maps to the if keyword, to offset 30, which also maps to that if.

  1           0 RESUME                   0

  2           2 LOAD_FAST                0 (x)
              4 POP_JUMP_IF_FALSE       12 (to 30)
              6 LOAD_GLOBAL              1 (NULL + print)
             16 LOAD_CONST               1 ('x')
             18 CALL                     1
             26 POP_TOP
             28 RETURN_CONST             0 (None)
        >>   30 RETURN_CONST             0 (None)

ex1.py 2:7-2:8 (foo@4) -> 2:7-2:8 (foo@30)

How can/should a coverage tool attempting to report this branch recognize that it is a branch out of the function?
The branch does go to a RETURN_... opcode, but is that reliable?

Branches from for loops also show a bit obfuscated. In the example

def foo():
    for i in range(3):
        print(i)

foo()

the branch taken into the block containing the print call actually shows as a branch to the i in line 2:

  1           0 RESUME                   0

  2           2 LOAD_GLOBAL              1 (NULL + range)
             12 LOAD_CONST               1 (3)
             14 CALL                     1
             22 GET_ITER
        >>   24 FOR_ITER                13 (to 54)
             28 STORE_FAST               0 (i)

  3          30 LOAD_GLOBAL              3 (NULL + print)
             40 LOAD_FAST                0 (i)
             42 CALL                     1
             50 POP_TOP
             52 JUMP_BACKWARD           15 (to 24)

  2     >>   54 END_FOR
             56 RETURN_CONST             0 (None)

ex2.py 2:4-3:16 (foo@24) -> 2:8-2:9 (foo@28)

The final branch once the loop is done also shows a bit funny, not to offset 54, as expected, but to offset 56:

ex2.py 2:4-3:16 (foo@24) -> 2:4-3:16 (foo@56)

I've seen unexpected behavior with while loops as well... I can probably recreate it if desired.

@markshannon, @nedbat and I spoke about this issue previously; Mark asked me to open this issue. It should arguably be a bug report rather than a feature request... it can be your call, Mark.

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

I spoke briefly about these issues in my PyCon'24 talk: https://www.youtube.com/watch?v=X9aXWeLC_C0

@jaltmayerpizzorno jaltmayerpizzorno added the type-feature A feature request or enhancement label Aug 6, 2024
@markshannon
Copy link
Member

markshannon commented Aug 7, 2024

Making locations "make sense to someone only looking at the source code" is hard as it is somewhat subjective, but I'll do my best 🙂

It seems that the locations that don't "make sense ..." are branches that exit the scope.
The False branch in:

def foo(x):
    if x: print(x)

and exiting the loop in:

def foo():
    for i in range(3):
        print(i)

The problem is that there is no location for the exit from the function, but we need to provide a location for the destination of the branch.

PEP 626 says that the line events must reflect the lines executed, so we cannot introduce spurious line changes.
PEP 657 refines lines to locations, but the same reasoning applies. Therefore any branch that exits the scope must have destination_location == source_location to avoid introducing extra events. Any branch that does not exit the scope should have destination_location != source_location. If you find either of statements to not be true, please submit an issue for that case.

Not all locations have been correct in earlier versions and the locations for some parts of compounds statements were overly broad, but that has been fixed now. I believe the locations are all good in 3.13 and 3.14 @iritkatriel is that correct?

The final branch once the loop is done also shows a bit funny, not to offset 54, as expected, but to offset 56

The actual offsets should be regarded as an implementation detail. It is the locations that matter. The runtime may skip instructions that it can tell it doesn't need to execute them.
It is probably best to treat an offset as just an index into co_positions for the actual location, not as anything meaningful in itself.

@iritkatriel
Copy link
Member

I believe the locations are all good in 3.13 and 3.14 @iritkatriel is that correct?

3.13 and 3.14 are awesome.

The only open issue I am aware of re location is #96999 .

@jaltmayerpizzorno
Copy link
Author

Thanks, Mark. Here's an example with while (where I added a return None just to avoid the branch-to-exit):

def foo(n):
    while n<3:
        print(n)
        n += 1

    return None

foo(0)
  1           0 RESUME                   0

  2           2 LOAD_FAST                0 (n)
              4 LOAD_CONST               1 (3)
              6 COMPARE_OP               2 (<)
             10 POP_JUMP_IF_FALSE       22 (to 56)

  3     >>   12 LOAD_GLOBAL              1 (NULL + print)
             22 LOAD_FAST                0 (n)
             24 CALL                     1
             32 POP_TOP

  4          34 LOAD_FAST                0 (n)
             36 LOAD_CONST               2 (1)
             38 BINARY_OP               13 (+=)
             42 STORE_FAST               0 (n)

  2          44 LOAD_FAST                0 (n)
             46 LOAD_CONST               1 (3)
             48 COMPARE_OP               2 (<)
             52 POP_JUMP_IF_FALSE        1 (to 56)
             54 JUMP_BACKWARD           22 (to 12)

  6     >>   56 RETURN_CONST             0 (None)

ex3.py 2:10-2:13 (foo@10) -> 3:8-3:13 (foo@12)
ex3.py 2:10-2:13 (foo@52) -> 2:10-2:13 (foo@54)
ex3.py 2:10-2:13 (foo@52) -> 2:10-2:13 (foo@54)
ex3.py 2:10-2:13 (foo@52) -> 6:11-6:15 (foo@56)

In the first iteration, the branch is shown as going into the block, but
the next two show as going from the condition to the condition.

@jaltmayerpizzorno
Copy link
Author

jaltmayerpizzorno commented Aug 7, 2024

Coverage tools need to be able to enumerate all possible branches... Is it reasonable to do so
by using dis to go through the bytecode, recording what every jump opcode's offsets and
their destinations map to? I.e., see if the dis.Instruction.opcode is in hasjrel or hasjabs
and interpret Instruction.arg as either a delta or absolute jump destination, then use co_positions
on those values?

@markshannon
Copy link
Member

I've created an issue for the inconsistency in while statement locations.
@jaltmayerpizzorno could you make individual issues for each problem you find. That way we can track them more easily and make sure that none get overlooked.

@markshannon
Copy link
Member

Coverage tools need to be able to enumerate all possible branches...

I'd be careful using dis output. It is mainly a tool for the curious and for debugging the bytecode compiler.
Having said that, what you suggest should work and I don't have a better suggestion that you can use immediately.

Maybe we should add something like co_branches() which would list all the branches. It should be straightforward to implement as we already need to be able to quickly find all branches for instrumentation.

@jaltmayerpizzorno
Copy link
Author

Maybe we should add something like co_branches() which would list all the branches. It should be straightforward to implement as we already need to be able to quickly find all branches for instrumentation.

I would love that :) Coverage tools also need to know about all lines that contain code, which co_lines() provides.

@markshannon
Copy link
Member

I've added co_branches() which returns an iterator of 3-tuples, (start, not_taken, taken) byte offsets to the draft PR

@jaltmayerpizzorno
Copy link
Author

jaltmayerpizzorno commented Aug 16, 2024

I've been experimenting with using the AST to identify what the branches do in the code (e.g., from an if to its block, or from one clause to another clause in an expression). This would allow us to filter out certain branches, such as those only relevant for clause coverage, or other branches (such as the one checking for a pending exception after a with context handler) that do not correspond to source code branches.

I am mapping each branch source and destination to the source code using co_positions() and comparing these to AST elements lineno, col_offset, end_lineno and end_col_offset. I then decide the function of the branch based on its source and destination, e.g.:

  • if the source equals the destination, the branch is going out of scope (returning from the function, etc.);
  • if both the source and destination of a branch lie in the test of an ast.If or ast.While, these are branches from one clause to another;
  • if a branch source is in a ast.For's iter and its destination is in its target, then it's a branch into the block (and I can use the block's first node's line/column for the source code mapping);
  • if it's ast.match_case, branches within its pattern and itself, or pattern and guard would be clause branches;

@markshannon @iritkatriel and possibly others: does this make sense? Does this way of interpreting the branches have a good chance of staying stable across Python releases?

Thank you!

@picnixz picnixz added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Aug 17, 2024
@iritkatriel
Copy link
Member

The structure of the AST is not stable, so something like ast.For may not even exist in the next version.

@nedbat
Copy link
Member

nedbat commented Aug 31, 2024

The structure of the AST is not stable, so something like ast.For may not even exist in the next version.

It's hard to imagine the AST not having a For node while the language has a for statement.

But in any case, we'd really appreciate help with how to correctly interpret branches. co_branches() is very helpful, but what is the right way to determine that a branch is exiting the function?

def foo(x):
    if x: print(x)

produces this dis output:

1:0-1:0             0       RESUME                   0

2:7-2:8             2       LOAD_FAST                0 (x)
2:7-2:8             4       TO_BOOL
2:7-2:8            12       POP_JUMP_IF_FALSE       13 (to L1)
2:7-2:8            16       NOT_TAKEN
2:10-2:15          18       LOAD_GLOBAL              1 (print + NULL)
2:16-2:17          28       LOAD_FAST                0 (x)
2:10-2:18          30       CALL                     1
2:10-2:18          38       POP_TOP
2:10-2:18          40       RETURN_CONST             0 (None)
2:7-2:8      L1:   42       RETURN_CONST             0 (None)

and co_branches() says there is one branch starting from offset 12, with 18/42 as the taken/not-taken. What should we do to reliably say that a branch is exiting the function?

@markshannon
Copy link
Member

What does exiting the function mean? Unless there is an infinite loop, all functions will exit.
I'm guessing that what you want to know is "will exit the function without any further BRANCH or LINE events.

To determine that, you'll need to project the execution until the next branch or exit is reached.
If we get to a new line, or a branch, then it doesn't exit. If we reach a RETURN_VALUE or RAISE, then it does exit.
Does that make sense?

I don't think that is something we want to support directly in CPython.

For prototyping, you could use dis.get_instructions() to find out whether a path will exit.
If that works, we could add a more streamlined way to get instructions from a code object as dis.get_instructions() is rather slow and cumbersome.

@nedbat
Copy link
Member

nedbat commented Jan 1, 2025

I'm experimenting with get_instructions. I don't see a way to know which opcode are "maybe jumps" and which are "definitely jump" other than having a list of the three(?) current "definitely jump" opcodes. The "maybe jumps" are where branches happen, if I understand correctly.

@nedbat
Copy link
Member

nedbat commented Jan 2, 2025

Doing some more experiments with simple code.. I'm trying to understand what offsets will be reported for BRANCH_LEFT and BRANCH_RIGHT.

foo.py:

def foo():
    for v in [1, 2]:
        if v > 0:
            x = v

    return x

foo()

Disassembled:

   1          0       RESUME                   0

   2          2       LOAD_CONST               0 ((1, 2))
              4       GET_ITER
       L1:    6       FOR_ITER                16 (to L3)
             10       NOT_TAKEN
             12       STORE_FAST               0 (v)

   3         14       LOAD_FAST                0 (v)
             16       LOAD_SMALL_INT           0
             18       COMPARE_OP             148 (bool(>))
             22       POP_JUMP_IF_TRUE         3 (to L2)
             26       NOT_TAKEN
             28       JUMP_BACKWARD           13 (to L1)

  --   L2:   32       NOP

   4         34       LOAD_FAST                0 (v)
             36       STORE_FAST               1 (x)
             38       JUMP_BACKWARD           18 (to L1)

   2   L3:   42       END_FOR
             44       POP_TOP

   6         46       LOAD_FAST_CHECK          1 (x)
             48       RETURN_VALUE

Here is the data reported by sys.monitoring as shown by run_sysmon.py. Ignore the line numbers (#2->2), they aren't important here. I'm looking at the instruction offsets (@6->12):

3.14.0a3+ (heads/main:d903b17499b, Jan  1 2025, 07:12:49) [Clang 16.0.0 (clang-1600.0.26.6)]
PY_START: foo.py@0 #0
LINE: foo.py #1
LINE: foo.py #8
PY_START: foo.py@0 #1
LINE: foo.py #2
BRANCH_LEFT: foo.py@6->12 #2->2
LINE: foo.py #3
BRANCH_RIGHT: foo.py@22->32 #3->None
LINE: foo.py #4
JUMP: foo.py@38->6 #4->2
LINE: foo.py #2
BRANCH_RIGHT: foo.py@6->46 #2->6
LINE: foo.py #6
PY_RETURN: foo.py@48 #6
PY_RETURN: foo.py@24 #8

The two branches from offset 6 are confusing:

  • Why is the left branch @6->12? What happened to the instruction at 10? Why isn't it reported as @6->10?
  • Why is the right branch @6->46? Shouldn't it be @6->42?

What are the rules for how instruction offsets get reported? How can I see the bytecode that will make the most sense as those offsets?

@markshannon
Copy link
Member

markshannon commented Jan 6, 2025

For mapping back to source code, try using locations not offsets.
Branches can appear to skip instructions, especially if it is faster to do so.

When tracking which branches have executed, offsets will be best for speed.
But when it comes to mapping back to source code, locations are better.
If there is no location for the destination or it is the same as the source, then you'll need to look at the next instruction to find a new location.

All BRANCH event offsets should match the offsets in co_branches(), if that helps.

@nedbat
Copy link
Member

nedbat commented Jan 6, 2025

try using locations not offsets.

I don't understand what you mean by locations here? Can you give me an example of mapping the events I showed above?

@markshannon
Copy link
Member

I don't understand what you mean by locations here?

The locations provided by co_positions: The start-line, start-column, end-line, end-column quad.

Can you give me an example of mapping the events I showed above?

Suppose we have three instructions, at offsets O1, O2 and O3 at locations L1, L2 and L2 respectively. Note that the second and third locations have the same location.
It shouldn't matter whether a branch is from O1 -> O2 or from O1 -> O3 as both as from L1 -> L2.

In more concrete terms using the example above: why is the branch from @6->12 not @6->10?
It is because @10 and @12 have the same location, so @6->12 and @6->10 are effectively equivalent.

Running the example on latest main we have:

  1          0       RESUME                   0

  2          2       LOAD_CONST               0 ((1, 2))
             4       GET_ITER
      L1:    6       FOR_ITER                14 (to L3)
            10       STORE_FAST               0 (v)

  3         12       LOAD_FAST                0 (v)
            14       LOAD_SMALL_INT           0
            16       COMPARE_OP             148 (bool(>))
            20       POP_JUMP_IF_TRUE         3 (to L2)
            24       NOT_TAKEN
            26       JUMP_BACKWARD           12 (to L1)

  4   L2:   30       LOAD_FAST                0 (v)
            32       STORE_FAST               1 (x)
            34       JUMP_BACKWARD           16 (to L1)

  2   L3:   38       END_FOR
            40       POP_ITER

  6         42       LOAD_FAST_CHECK          1 (x)
            44       RETURN_VALUE
PY_START: foo.py@0 #0
LINE: foo.py #1
LINE: foo.py #8
PY_START: foo.py@0 #1
LINE: foo.py #2
BRANCH_LEFT: foo.py@6->10 #2->2
LINE: foo.py #3
BRANCH_RIGHT: foo.py@20->30 #3->4
LINE: foo.py #4
JUMP: foo.py@34->6 #4->2
LINE: foo.py #2
BRANCH_RIGHT: foo.py@6->42 #2->6
LINE: foo.py #6
PY_RETURN: foo.py@44 #6
PY_RETURN: foo.py@24 #8

@nedbat
Copy link
Member

nedbat commented Jan 8, 2025

I see what you are saying. I've adjusted my code to walk the trails from each branch destination:
Starting from @6: @8-> 10, 12; (lines 2->3) or @38-> 38, 40, 42; (lines 2->6)
Starting from @20: @22-> 24, 26, 6; (lines 3->2) or @30-> 30; (lines 3->4)

This gives me a way to look up the offsets reported in the events and map them to lines. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

5 participants