From 4fb04cfc6b4ce9df381791cb88462952e04769f7 Mon Sep 17 00:00:00 2001
From: Mark Shannon <mark@hotpy.org>
Date: Tue, 29 Jun 2021 14:00:16 +0100
Subject: [PATCH 1/3] Add file describing how to add or modify specialized
 families of instructions

---
 Python/adaptive.md | 97 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 97 insertions(+)
 create mode 100644 Python/adaptive.md
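
Note on the "Performance analysis" section added by this patch: the
`Tadaptive` formula can be sanity-checked with a small standalone C sketch.
The names `family_stats`, `t_adaptive` and `speedup` are hypothetical and
exist only for this illustration; they are not part of CPython.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-family measurements; not a CPython structure. */
typedef struct {
    const double *t;   /* t[i]: mean time of the i-th family member (Ti) */
    const double *n;   /* n[i]: execution count of that member (Ni) */
    size_t members;    /* number of family members */
    double t_miss;     /* Tmiss: de-optimization plus the base instruction */
    double n_miss;     /* Nmiss: number of misses */
} family_stats;

/* Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss) / (sum(Ni) + Nmiss) */
static double
t_adaptive(const family_stats *s)
{
    double weighted = s->t_miss * s->n_miss;
    double total = s->n_miss;
    for (size_t i = 0; i < s->members; i++) {
        weighted += s->t[i] * s->n[i];
        total += s->n[i];
    }
    assert(total > 0.0);  /* at least one execution must be recorded */
    return weighted / total;
}

/* Benefit of the specialization: Tbase / Tadaptive (above 1.0 is a win). */
static double
speedup(double t_base, const family_stats *s)
{
    return t_base / t_adaptive(s);
}
```

With made-up inputs `Ti = {2, 3}`, `Ni = {600, 300}`, `Tmiss = 20`,
`Nmiss = 100` and `Tbase = 10`, this gives
`Tadaptive = (1200 + 900 + 2000)/1000 = 4.1`, a benefit of roughly 2.4.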

diff --git a/Python/adaptive.md b/Python/adaptive.md
new file mode 100644
index 00000000000000..ab50ecd2cb6915
--- /dev/null
+++ b/Python/adaptive.md
@@ -0,0 +1,97 @@
+# Adding or extending a family of adaptive instructions
+
+## Families of instructions
+
+The core part of PEP 659 (specializing adaptive interpreter) is the families of instructions that perform the adaptive specialization.
+
+A family of instructions has the following fundamental properties:
+
+* It corresponds to a single instruction in the code generated by the bytecode compiler.
+* It has a single adaptive instruction that records an execution count and,
+   at regular intervals, attempts to specialize itself. If not specializing, it executes
+  the non-adaptive instruction.
+* It has at least one specialized form of the instruction that is tailored for a particular value or set of values at runtime.
+* All members of the family have access to same number of cache entries.
+  Individual family members do not need to use all of the entries.
+
+The current implementation also requires the following, although these are not fundamental and may change:
+
+* If a family uses one or more entries, then the first entry must be a `_PyAdaptiveEntry` entry.
+* If a family uses no cache entries, then the `oparg` is used as the counter for the adaptive instruction.
+* All instruction names should start with the name of the non-adaptive instruction.
+* The adaptive instruction should end in `_ADAPTIVE`.
+* Specialized forms should have names describing their specialization.
+
+## Performance analysis
+
+The benefit of a specialization can be assessed with the following formula:
+`Tbase/Tadaptive`.
+
+Where `Tbase` is the mean time to execute the base instruction,
+and `Tadaptive` is the mean time to execute the specialized and adaptive forms.
+
+`Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss)/(sum(Ni)+Nmiss)`
+
+`Ti` is the time to execute the `i`th instruction in the family and `Ni` is the number of times that instruction is executed.
+`Tmiss` is the time to process a miss, including de-optimzation and the time to execute the base instruction.
+
+The ideal situation is where misses are rare and the specialized forms are much faster than the base instruction.
+`LOAD_GLOBAL` is near ideal, `Nmiss/sum(Ni) ≈ 0`.
+In which case we have `Tadaptive ≈ sum(Ti*Ni)`.
+Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and `LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction, we would expect the specialization of `LOAD_GLOBAL` to be profitable.
+
+## Design considerations
+
+While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and `CALL_FUNCTION` are not.
+For maximum performance we want to keep `Ti` low for all specialized instructions and `Nmiss` as low as possible.
+
+Keeping `Nmiss` low means that there should be specializations for almost
+all values seen by the base instruction. Keeping `sum(Ti*Ni)` low means keeping `Ti`
+low which means minimizing branches and dependent memory accesses (pointer chasing).
+These two objectives may be in conflict, requiring judgement and experimentation to
+design the family of instructions.
+
+### Gathering data
+
+Before choosing how to specialize an instruction, it is important to gather some data. What are the pattern of usage of the base instruction?
+Data can best be gathered by instrumenting the interpreter.
+Since a specialization function and adaptive instruction are going to be required,
+instrumentation can most easily be added in the specialization function.
+
+### Choice of specializations
+
+The performance of the specializing adaptive interpreter relies on the quality of
+specialization and keeping the overhead of specialization low.
+
+Specialized instructions must be fast. In order to be fast, specialized instructions should be tailored 
+for a particular set of values that allows them to:
+1. Verify that incoming value is part of that set with low overhead.
+2. Perform the operation quickly.
+
+This requires that the set of values is chosen such that membership can be tested quickly and
+that membership is sufficient to allow the operation to performed quickly.
+
+For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()` dictionaries that have a keys with the expected version.
+
+This can be tested quickly:
+* `globals->keys->dk_version == expected_version`
+
+and the operation can be performed quickly:
+* `value = globals->keys->entries[index].value`.
+
+Because it is impossible to measure the performance of an instruction without also
+measuring unrelated factors, the assessment of the quality of a specialization will require some judgement.
+
+As a general rule, specialized instructions should be much faster than the base instruction.
+
+### Implementation of specialized instructions
+
+In general, specialized instructions should be implemented in two parts:
+1. A sequence of guards, each of the form `DEOPT_IF(guard-condition-is-false, BASE_NAME)`,
+  followed by a `record_cache_hit()`.
+2. The operation, which should ideally have no branches and a minimum number of dependent memory accesses.
+
+In practice, the parts may overlap, as data required for guards can be re-used in the operation.
+
+If there are branches in the operation, then consider further specialization to eliminate
+the branches.
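
The two-part structure described under "Implementation of specialized
instructions" (guards first, then a branch-free operation) can be sketched
in standalone C. Everything prefixed `toy_`/`TOY_` is a hypothetical
stand-in: the real `DEOPT_IF` macro, dictionary layout and cache entries in
CPython differ, and values here are plain `int`s rather than objects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; CPython's real dict and cache layouts differ. */
typedef struct { const char *name; int value; } toy_entry;
typedef struct { uint32_t dk_version; size_t size; toy_entry *entries; } toy_keys;
typedef struct { toy_keys *keys; } toy_dict;
typedef struct { uint32_t expected_version; size_t index; unsigned hits; } toy_cache;

/* Modelled loosely on DEOPT_IF: leave the specialized path if a guard fails. */
#define TOY_DEOPT_IF(cond) do { if (cond) { return 0; } } while (0)

/* Returns 1 and stores the value on a hit; returns 0 to tell the caller
 * to fall back to the base instruction. */
static int
toy_load_global_module(const toy_dict *globals, toy_cache *cache, int *value)
{
    /* Part 1: the guards, followed by recording the cache hit. */
    TOY_DEOPT_IF(globals->keys->dk_version != cache->expected_version);
    cache->hits++;

    /* Part 2: the operation, a single indexed load.
     * The assert is a debugging aid only and vanishes under NDEBUG. */
    assert(cache->index < globals->keys->size);
    *value = globals->keys->entries[cache->index].value;
    return 1;
}
```

In this sketch the guard and the operation both read `globals->keys`, which
illustrates how data fetched for a guard can be re-used by the operation.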

From b0c9e51408c801487ea1ef75ae7b6d54bbf9af3f Mon Sep 17 00:00:00 2001
From: Mark Shannon <mark@hotpy.org>
Date: Wed, 30 Jun 2021 16:00:05 +0100
Subject: [PATCH 2/3] Stick to 80 char limit and add reference.

---
 Python/adaptive.md  | 96 ++++++++++++++++++++++++++++-----------------
 Python/specialize.c |  4 ++
 2 files changed, 63 insertions(+), 37 deletions(-)

diff --git a/Python/adaptive.md b/Python/adaptive.md
index ab50ecd2cb6915..f7b3a65f8cf98a 100644
--- a/Python/adaptive.md
+++ b/Python/adaptive.md
@@ -2,23 +2,30 @@
 
 ## Families of instructions
 
-The core part of PEP 659 (specializing adaptive interpreter) is the families of instructions that perform the adaptive specialization.
+The core part of PEP 659 (specializing adaptive interpreter) is the families
+of instructions that perform the adaptive specialization.
 
 A family of instructions has the following fundamental properties:
 
-* It corresponds to a single instruction in the code generated by the bytecode compiler.
+* It corresponds to a single instruction in the code
+  generated by the bytecode compiler.
 * It has a single adaptive instruction that records an execution count and,
-   at regular intervals, attempts to specialize itself. If not specializing, it executes
-  the non-adaptive instruction.
-* It has at least one specialized form of the instruction that is tailored for a particular value or set of values at runtime.
+  at regular intervals, attempts to specialize itself. If not specializing,
+  it executes the non-adaptive instruction.
+* It has at least one specialized form of the instruction that is tailored 
+  for a particular value or set of values at runtime.
 * All members of the family have access to same number of cache entries.
   Individual family members do not need to use all of the entries.
 
-The current implementation also requires the following, although these are not fundamental and may change:
+The current implementation also requires the following,
+although these are not fundamental and may change:
 
-* If a family uses one or more entries, then the first entry must be a `_PyAdaptiveEntry` entry.
-* If a family uses no cache entries, then the `oparg` is used as the counter for the adaptive instruction.
-* All instruction names should start with the name of the non-adaptive instruction.
+* If a family uses one or more entries, then the first entry must be a
+  `_PyAdaptiveEntry` entry.
+* If a family uses no cache entries, then the `oparg` is used as the
+  counter for the adaptive instruction.
+* All instruction names should start with the name of the non-adaptive
+  instruction.
 * The adaptive instruction should end in `_ADAPTIVE`.
 * Specialized forms should have names describing their specialization.
 
@@ -32,46 +39,56 @@ and `Tadaptive` is the mean time to execute the specialized and adaptive forms.
 
 `Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss)/(sum(Ni)+Nmiss)`
 
-`Ti` is the time to execute the `i`th instruction in the family and `Ni` is the number of times that instruction is executed.
-`Tmiss` is the time to process a miss, including de-optimzation and the time to execute the base instruction.
+`Ti` is the time to execute the `i`th instruction in the family and `Ni` is
+the number of times that instruction is executed.
+`Tmiss` is the time to process a miss, including de-optimization
+and the time to execute the base instruction.
 
-The ideal situation is where misses are rare and the specialized forms are much faster than the base instruction.
+The ideal situation is where misses are rare and the specialized
+forms are much faster than the base instruction.
 `LOAD_GLOBAL` is near ideal, `Nmiss/sum(Ni) ≈ 0`.
 In which case we have `Tadaptive ≈ sum(Ti*Ni)`.
-Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and `LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction, we would expect the specialization of `LOAD_GLOBAL` to be profitable.
+Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and
+`LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction,
+we would expect the specialization of `LOAD_GLOBAL` to be profitable.
 
 ## Design considerations
 
-While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and `CALL_FUNCTION` are not.
-For maximum performance we want to keep `Ti` low for all specialized instructions and `Nmiss` as low as possible.
+While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and
+`CALL_FUNCTION` are not. For maximum performance we want to keep `Ti`
+low for all specialized instructions and `Nmiss` as low as possible.
 
 Keeping `Nmiss` low means that there should be specializations for almost
-all values seen by the base instruction. Keeping `sum(Ti*Ni)` low means keeping `Ti`
-low which means minimizing branches and dependent memory accesses (pointer chasing).
-These two objectives may be in conflict, requiring judgement and experimentation to
-design the family of instructions.
+all values seen by the base instruction. Keeping `sum(Ti*Ni)` low means
+keeping `Ti` low which means minimizing branches and dependent memory
+accesses (pointer chasing). These two objectives may be in conflict,
+requiring judgement and experimentation to design the family of instructions.
 
 ### Gathering data
 
-Before choosing how to specialize an instruction, it is important to gather some data. What are the pattern of usage of the base instruction?
-Data can best be gathered by instrumenting the interpreter.
-Since a specialization function and adaptive instruction are going to be required,
+Before choosing how to specialize an instruction, it is important to gather
+some data. What are the pattern of usage of the base instruction?
+Data can best be gathered by instrumenting the interpreter. Since a 
+specialization function and adaptive instruction are going to be required,
 instrumentation can most easily be added in the specialization function.
 
 ### Choice of specializations
 
-The performance of the specializing adaptive interpreter relies on the quality of
-specialization and keeping the overhead of specialization low.
+The performance of the specializing adaptive interpreter relies on the
+quality of specialization and keeping the overhead of specialization low.
 
-Specialized instructions must be fast. In order to be fast, specialized instructions should be tailored 
-for a particular set of values that allows them to:
+Specialized instructions must be fast. In order to be fast,
+specialized instructions should be tailored for a particular
+set of values that allows them to:
 1. Verify that incoming value is part of that set with low overhead.
 2. Perform the operation quickly.
 
-This requires that the set of values is chosen such that membership can be tested quickly and
-that membership is sufficient to allow the operation to performed quickly.
+This requires that the set of values is chosen such that membership can be
+tested quickly and that membership is sufficient to allow the operation to
+be performed quickly.
 
-For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()` dictionaries that have a keys with the expected version.
+For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()`
+dictionaries whose keys object has the expected version.
 
 This can be tested quickly:
 * `globals->keys->dk_version == expected_version`
@@ -79,19 +96,24 @@ This can be tested quickly:
 and the operation can be performed quickly:
 * `value = globals->keys->entries[index].value`.
 
-Because it is impossible to measure the performance of an instruction without also
-measuring unrelated factors, the assessment of the quality of a specialization will require some judgement.
+Because it is impossible to measure the performance of an instruction without
+also measuring unrelated factors, the assessment of the quality of a
+specialization will require some judgement.
 
-As a general rule, specialized instructions should be much faster than the base instruction.
+As a general rule, specialized instructions should be much faster than the
+base instruction.
 
 ### Implementation of specialized instructions
 
 In general, specialized instructions should be implemented in two parts:
-1. A sequence of guards, each of the form `DEOPT_IF(guard-condition-is-false, BASE_NAME)`,
+1. A sequence of guards, each of the form
+  `DEOPT_IF(guard-condition-is-false, BASE_NAME)`,
   followed by a `record_cache_hit()`.
-2. The operation, which should ideally have no branches and a minimum number of dependent memory accesses.
+2. The operation, which should ideally have no branches and
+  a minimum number of dependent memory accesses.
 
-In practice, the parts may overlap, as data required for guards can be re-used in the operation.
+In practice, the parts may overlap, as data required for guards
+can be re-used in the operation.
 
-If there are branches in the operation, then consider further specialization to eliminate
-the branches.
+If there are branches in the operation, then consider further specialization
+to eliminate the branches.
diff --git a/Python/specialize.c b/Python/specialize.c
index a8ae09ff0e3839..3277c6bc9e4894 100644
--- a/Python/specialize.c
+++ b/Python/specialize.c
@@ -7,6 +7,10 @@
 #include "opcode.h"
 #include "structmember.h"         // struct PyMemberDef, T_OFFSET_EX
 
+/* For guidance on adding or extending families of instructions see
+ * ./adaptive.md
+ */
+
 
 /* We layout the quickened data as a bi-directional array:
  * Instructions upwards, cache entries downwards.

From ca2a36a6187421d561456de9cf585298c0ddf0f3 Mon Sep 17 00:00:00 2001
From: Mark Shannon <mark@hotpy.org>
Date: Wed, 30 Jun 2021 16:09:11 +0100
Subject: [PATCH 3/3] Add example and fix a couple of grammatical errors.

---
 Python/adaptive.md | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)
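
Note on the "Example family" section added by this patch: the behaviour of
an adaptive instruction (count down, attempt specialization at zero,
otherwise do the base work) can be sketched in standalone C. All `toy_`/`TOY_`
names and the retry delay are hypothetical; the real
`LOAD_GLOBAL_ADAPTIVE` and `_Py_Specialize_LoadGlobal()` are more involved.

```c
#include <assert.h>

/* Hypothetical opcodes and instruction record; not CPython's definitions. */
enum toy_opcode { TOY_LOAD_GLOBAL_ADAPTIVE, TOY_LOAD_GLOBAL_MODULE };

typedef struct {
    enum toy_opcode opcode;  /* current form of the instruction */
    unsigned counter;        /* executions left before trying to specialize */
} toy_instr;

#define TOY_RETRY_DELAY 8    /* illustrative back-off, not CPython's value */

/* Stand-in for a specialization function: rewrites the instruction on
 * success, otherwise resets the counter so a later attempt can be made. */
static int
toy_specialize(toy_instr *instr, int specializable)
{
    if (specializable) {
        instr->opcode = TOY_LOAD_GLOBAL_MODULE;
        return 1;
    }
    instr->counter = TOY_RETRY_DELAY;
    return 0;
}

/* One execution of the adaptive instruction. Returns 1 if it specialized
 * itself, 0 if it fell through to the non-adaptive (base) behaviour. */
static int
toy_execute_adaptive(toy_instr *instr, int specializable)
{
    assert(instr->opcode == TOY_LOAD_GLOBAL_ADAPTIVE);
    if (instr->counter == 0) {
        return toy_specialize(instr, specializable);
    }
    instr->counter--;
    return 0;
}
```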

diff --git a/Python/adaptive.md b/Python/adaptive.md
index f7b3a65f8cf98a..66b80a17464bfb 100644
--- a/Python/adaptive.md
+++ b/Python/adaptive.md
@@ -14,7 +14,7 @@ A family of instructions has the following fundamental properties:
   it executes the non-adaptive instruction.
 * It has at least one specialized form of the instruction that is tailored 
   for a particular value or set of values at runtime.
-* All members of the family have access to same number of cache entries.
+* All members of the family have access to the same number of cache entries.
   Individual family members do not need to use all of the entries.
 
 The current implementation also requires the following,
@@ -29,6 +29,18 @@ although these are not fundamental and may change:
 * The adaptive instruction should end in `_ADAPTIVE`.
 * Specialized forms should have names describing their specialization.
 
+## Example family
+
+The `LOAD_GLOBAL` instruction (in Python/ceval.c) already has an adaptive
+family that serves as a relatively simple example.
+
+The `LOAD_GLOBAL_ADAPTIVE` instruction performs adaptive specialization,
+calling `_Py_Specialize_LoadGlobal()` when the counter reaches zero.
+
+There are two specialized instructions in the family: `LOAD_GLOBAL_MODULE`,
+which is specialized for global variables in the module, and
+`LOAD_GLOBAL_BUILTIN`, which is specialized for builtin variables.
+
 ## Performance analysis
 
 The benefit of a specialization can be assessed with the following formula:
@@ -67,7 +79,7 @@ requiring judgement and experimentation to design the family of instructions.
 ### Gathering data
 
 Before choosing how to specialize an instruction, it is important to gather
-some data. What are the pattern of usage of the base instruction?
+some data. What are the patterns of usage of the base instruction?
 Data can best be gathered by instrumenting the interpreter. Since a 
 specialization function and adaptive instruction are going to be required,
 instrumentation can most easily be added in the specialization function.
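
The "Gathering data" advice above, that instrumentation is most easily added
in the specialization function, can be sketched as follows. This is a
minimal standalone illustration under assumed names: the outcome categories,
`toy_stats` and the `toy_` functions are hypothetical and do not correspond
to any existing CPython statistics API.

```c
#include <assert.h>

/* Hypothetical outcome categories for one family of instructions. */
enum toy_outcome {
    TOY_HIT_MODULE,
    TOY_HIT_BUILTIN,
    TOY_UNSPECIALIZABLE,
    TOY_OUTCOME_COUNT
};

/* One counter per outcome, bumped on every specialization attempt. */
static unsigned long toy_stats[TOY_OUTCOME_COUNT];

static void
toy_record(enum toy_outcome outcome)
{
    assert(outcome < TOY_OUTCOME_COUNT);
    toy_stats[outcome]++;
}

/* Stand-in for a specialization function: classify the value seen by the
 * base instruction and record which pattern of usage it falls into. */
static enum toy_outcome
toy_specialize_load_global(int in_module_globals, int in_builtins)
{
    enum toy_outcome outcome = TOY_UNSPECIALIZABLE;
    if (in_module_globals) {
        outcome = TOY_HIT_MODULE;
    }
    else if (in_builtins) {
        outcome = TOY_HIT_BUILTIN;
    }
    toy_record(outcome);
    return outcome;
}

/* Fraction of attempts that no specialization covered; a high value
 * suggests that the family is missing a common case. */
static double
toy_unspecializable_fraction(void)
{
    unsigned long total = 0;
    for (int i = 0; i < TOY_OUTCOME_COUNT; i++) {
        total += toy_stats[i];
    }
    return total == 0 ? 0.0 : (double)toy_stats[TOY_UNSPECIALIZABLE] / total;
}
```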