Skip to content

gh-119786: improve internal docs on co_linetable #123198

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
87 changes: 61 additions & 26 deletions InternalDocs/code_objects.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@

# Code objects

A `CodeObject` is a builtin Python type that represents a compiled executable,
Expand Down Expand Up @@ -43,7 +42,7 @@ so a compact format is very important.
Note that traceback objects don't store all this information -- they store the start line
number, for backward compatibility, and the "last instruction" value.
The rest can be computed from the last instruction (`tb_lasti`) with the help of the
locations table. For Python code, there is a convenience method
locations table. For Python code, there is a convenience method
(`codeobject.co_positions`)[https://docs.python.org/dev/reference/datamodel.html#codeobject.co_positions]
which returns an iterator of `({line}, {endline}, {column}, {endcolumn})` tuples,
one per instruction.
Expand Down Expand Up @@ -75,9 +74,11 @@ returned by the `co_positions()` iterator.
> See [`Objects/lnotab_notes.txt`](../Objects/lnotab_notes.txt) for more details.

`co_linetable` consists of a sequence of location entries.
Each entry starts with a byte with the most significant bit set, followed by zero or more bytes with the most significant bit unset.
Each entry starts with a byte with the most significant bit set, followed by
zero or more bytes with the most significant bit unset.

Each entry contains the following information:

* The number of code units covered by this entry (length)
* The start line
* The end line
Expand All @@ -86,54 +87,88 @@ Each entry contains the following information:

The first byte has the following format:

Bit 7 | Bits 3-6 | Bits 0-2
---- | ---- | ----
1 | Code | Length (in code units) - 1
| Bit 7 | Bits 3-6 | Bits 0-2 |
|-------|----------|----------------------------|
| 1 | Code | Length (in code units) - 1 |

The codes are enumerated in the `_PyCodeLocationInfoKind` enum.

## Variable-length integer encodings
### Variable-length integer encodings

Integers are often encoded using a variable-length integer encoding
Integers are often encoded using a variable length integer encoding

### Unsigned integers (`varint`)
#### Unsigned integers (`varint`)

Unsigned integers are encoded in 6-bit chunks, least significant first.
Each chunk but the last has bit 6 set.
For example:

* 63 is encoded as `0x3f`
* 200 is encoded as `0x48`, `0x03`
* 200 is encoded as `0x48`, `0x03` since ``200 = (0x03 << 6) | 0x48``.

The following helper can be used to convert an integer into a `varint`:

```py
def encode_varint(s):
ret = []
while s >= 64:
ret.append(((s & 0x3F) | 0x40) & 0x3F)
s >>= 6
ret.append(s & 0x3F)
return bytes(ret)
```

To convert a `varint` into an unsigned integer:

```py
def decode_varint(chunks):
ret = 0
for chunk in reversed(chunks):
ret = (ret << 6) | chunk
return ret
```

### Signed integers (`svarint`)
#### Signed integers (`svarint`)

Signed integers are encoded by converting them to unsigned integers, using the following function:
```Python
def convert(s):

```py
def svarint_to_varint(s):
if s < 0:
return ((-s)<<1) | 1
return ((-s) << 1) | 1
else:
return (s<<1)
return s << 1
```

To convert a `varint` into a signed integer:

```py
def varint_to_svarint(uval):
return -(uval >> 1) if uval & 1 else (uval >> 1)
```

*Location entries*
### Location entries

The meaning of the codes and the following bytes are as follows:

Code | Meaning | Start line | End line | Start column | End column
---- | ---- | ---- | ---- | ---- | ----
0-9 | Short form | Δ 0 | Δ 0 | See below | See below
10-12 | One line form | Δ (code - 10) | Δ 0 | unsigned byte | unsigned byte
13 | No column info | Δ svarint | Δ 0 | None | None
14 | Long form | Δ svarint | Δ varint | varint | varint
15 | No location | None | None | None | None
| Code | Meaning | Start line | End line | Start column | End column |
|-------|----------------|---------------|----------|---------------|---------------|
| 0-9 | Short form | Δ 0 | Δ 0 | See below | See below |
| 10-12 | One line form | Δ (code - 10) | Δ 0 | unsigned byte | unsigned byte |
| 13 | No column info | Δ svarint | Δ 0 | None | None |
| 14 | Long form | Δ svarint | Δ varint | varint | varint |
| 15 | No location | None | None | None | None |

The Δ means the value is encoded as a delta from another value:

* Start line: Delta from the previous start line, or `co_firstlineno` for the first entry.
* End line: Delta from the start line
* End line: Delta from the start line.

### The short forms

*The short forms*
Codes 0-9 are the short forms. The short form consists of two bytes,
the second byte holding additional column information. The code is the
start column divided by 8 (and rounded down).

Codes 0-9 are the short forms. The short form consists of two bytes, the second byte holding additional column information. The code is the start column divided by 8 (and rounded down).
* Start column: `(code*8) + ((second_byte>>4)&7)`
* End column: `start_column + (second_byte&15)`
Loading