gh-96538: Move some type-checking out of bisect loops #96539


Merged 5 commits into python:main on Sep 5, 2022

Conversation

@sweeneyde (Member) commented Sep 3, 2022

Here's a benchmark program:

from pyperf import Runner, perf_counter
from itertools import repeat
from bisect import bisect
from decimal import Decimal

def bench(loops, size, tp):
    a = sorted(map(tp, range(size)))
    k = tp(size//3)

    t0 = perf_counter()
    for _ in repeat(None, loops):
        bisect(a, k); bisect(a, k); bisect(a, k); bisect(a, k); bisect(a, k)  # 5 calls per loop, matching inner_loops=5
    t1 = perf_counter()

    return t1 - t0


btf = Runner().bench_time_func
for size in (10, 100, 1_000, 10_000, 100_000, 1_000_000):
    for tp in (int, float, Decimal, str):
        btf(f"[{tp.__name__}]*{size:_}", bench, size, tp, inner_loops=5)

And the results:

- [Decimal]*10: 102 ns +- 1 ns -> 89.7 ns +- 1.1 ns: 1.14x faster
- [Decimal]*100: 169 ns +- 17 ns -> 139 ns +- 2 ns: 1.21x faster
- [Decimal]*1_000: 249 ns +- 2 ns -> 202 ns +- 4 ns: 1.23x faster
- [Decimal]*10_000: 312 ns +- 6 ns -> 255 ns +- 2 ns: 1.23x faster
- [Decimal]*100_000: 396 ns +- 5 ns -> 321 ns +- 5 ns: 1.23x faster
- [Decimal]*1_000_000: 459 ns +- 10 ns -> 369 ns +- 4 ns: 1.24x faster

- [float]*10: 62.4 ns +- 0.4 ns -> 50.4 ns +- 0.5 ns: 1.24x faster
- [float]*100: 93.4 ns +- 0.8 ns -> 71.1 ns +- 0.8 ns: 1.31x faster
- [float]*1_000: 149 ns +- 1 ns -> 104 ns +- 2 ns: 1.43x faster
- [float]*10_000: 174 ns +- 2 ns -> 121 ns +- 3 ns: 1.44x faster
- [float]*100_000: 223 ns +- 3 ns -> 162 ns +- 15 ns: 1.37x faster
- [float]*1_000_000: 249 ns +- 5 ns -> 180 ns +- 10 ns: 1.38x faster

- [int]*10: 62.8 ns +- 1.5 ns -> 49.0 ns +- 0.6 ns: 1.28x faster
- [int]*100: 96.0 ns +- 5.7 ns -> 69.8 ns +- 3.6 ns: 1.37x faster
- [int]*1_000: 137 ns +- 1 ns -> 102 ns +- 4 ns: 1.35x faster
- [int]*10_000: 183 ns +- 1 ns -> 122 ns +- 7 ns: 1.51x faster
- [int]*100_000: 218 ns +- 2 ns -> 157 ns +- 3 ns: 1.39x faster
- [int]*1_000_000: 251 ns +- 6 ns -> 179 ns +- 2 ns: 1.41x faster

- [str]*10: 76.0 ns +- 1.0 ns -> 64.7 ns +- 4.4 ns: 1.18x faster
- [str]*100: 119 ns +- 1 ns -> 97.2 ns +- 1.9 ns: 1.22x faster
- [str]*1_000: 173 ns +- 5 ns -> 143 ns +- 4 ns: 1.20x faster
- [str]*10_000: 238 ns +- 10 ns -> 187 ns +- 2 ns: 1.27x faster
- [str]*100_000: 322 ns +- 6 ns -> 222 ns +- 6 ns: 1.45x faster
- [str]*1_000_000: 365 ns +- 7 ns -> 252 ns +- 2 ns: 1.45x faster

Geometric mean: 1.31x faster

if (res_obj == Py_NotImplemented) {
Py_DECREF(res_obj);
compare = NULL;
res = PyObject_RichCompareBool(item, litem, Py_LT);
Inline comment from @sweeneyde (Member, Author):
Remark: this could technically call __lt__, __lt__, __gt__ instead of the current behavior, calling __lt__, __gt__.

This is similar to what unsafe_object_compare does though.
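The fallback being discussed can be observed from pure Python: when `__lt__` returns `NotImplemented`, the interpreter retries with the right operand's reflected `__gt__`, which is why an extra call can appear. A hypothetical illustration (the `Stubborn` class is invented for this demo, not part of the patch):

```python
# Hypothetical demo class: __lt__ always punts, forcing Python to fall
# back to the reflected comparison, i.e. the other operand's __gt__.
class Stubborn:
    def __init__(self, v):
        self.v = v
        self.calls = []

    def __lt__(self, other):
        self.calls.append("__lt__")
        return NotImplemented  # decline; interpreter tries other.__gt__

    def __gt__(self, other):
        self.calls.append("__gt__")
        return self.v > other.v

a, b = Stubborn(1), Stubborn(2)
result = a < b  # a.__lt__(b) punts, then b.__gt__(a) answers
# result is True; a.calls == ["__lt__"], b.calls == ["__gt__"]
```

This mirrors what `PyObject_RichCompareBool(item, litem, Py_LT)` does at the C level when the first slot call declines.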

@sweeneyde added labels: performance (Performance or resource usage), stdlib (Python modules in the Lib dir) on Sep 3, 2022
@rhettinger (Contributor) commented:

With sorting, which is potentially O(n log n), there is a large payoff for all this added complexity. With bisect, which is O(log n), the payoff is very low. The granularity is different. We call sorting on a full dataset while bisect calls are one per lookup. You would get a better payoff by type specializing min() and max() than for the handful of comparisons made in a typical call to bisect().

I recognize that there has been a shift in core developer attitudes towards complexity and that many things we would never have considered before are now fair game if some clock cycles can be saved. With the interpreter, there was a high payoff that compensated for fewer people being able to understand and maintain the code. There is less of a case to be made for complexifying code on the periphery. It would be nice if some simple things remained simple. At one time, the whole module was basically just this code:

        mid = (lo + hi) // 2
        if x < a[mid]:
            hi = mid
        else:
            lo = mid + 1
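For context, that historical fragment completes into a runnable function along these lines (a minimal pure-Python bisect_right sketch; the real module additionally validates lo and supports a key argument):

```python
def bisect_right(a, x, lo=0, hi=None):
    """Return the rightmost insertion point for x in sorted list a."""
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if x < a[mid]:
            hi = mid        # x belongs in the left half
        else:
            lo = mid + 1    # x belongs at or after mid + 1
    return lo
```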

Tim, what are your thoughts?

@rhettinger rhettinger requested a review from tim-one September 4, 2022 00:28
@tim-one (Member) commented Sep 4, 2022

Note that sort()'s type specialization is much more involved, writing its own comparison functions for a number of common types. This, in contrast, is more just applying strength reduction to the single "don't know what the type is and don't care - only care whether comparands are the same type, in which case the type-specific comparison function doesn't need to be deduced again" case. This is done on the fly, as needed; no pre-scan required, so no change to O() behavior.
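That strength reduction can be sketched in pure Python (a hypothetical illustration of the idea only, not the actual C patch, which caches the type's comparison slot rather than a bound method):

```python
def bisect_right_specialized(a, x, lo=0, hi=None):
    # Sketch of the strength reduction: look up the type-specific __lt__
    # once, use it whenever both comparands share that type, and fall
    # back to the generic < protocol otherwise.  Hypothetical code, not
    # the real C implementation.
    if hi is None:
        hi = len(a)
    fast_lt = None
    if a and type(x) is type(a[0]):
        fast_lt = type(x).__lt__  # e.g. float.__lt__, deduced one time
    while lo < hi:
        mid = (lo + hi) // 2
        item = a[mid]
        if fast_lt is not None and type(item) is type(x):
            r = fast_lt(x, item)
            lt = (x < item) if r is NotImplemented else bool(r)
        else:
            lt = x < item  # generic path: full rich-comparison protocol
        if lt:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

Because the specialization is decided per comparison, a list with mixed types simply falls through to the generic path, with no pre-scan and no change to O() behavior.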

The code seems pretty straightforward to me, and Dennis's timing results do show significant speedups for typical type-homogeneous cases, even for short lists. I think it would help if comments were added to point out which parts are logically necessary, and which are there just to speed typical cases.

@rhettinger (Contributor) commented:

Tim, thanks for looking at this. It is not always clear how far we should go.

@tim-one (Member) commented Sep 5, 2022

> It is not always clear how far we should go.

No argument here! IMO, "too many" changes go in that add too much complexity for too little gain. The changes here are conceptually simple, though, and the "1.31x faster" overall result looks credible, well-motivated, and more robust than the kinds of changes that make seemingly random changes to C spellings that yield seemingly random "1.04x faster" results on some platforms, but also minor slowdowns on others 😉. We're actually taking code off critical paths in this one.

@sweeneyde sweeneyde merged commit 9e35d05 into python:main Sep 5, 2022
@sweeneyde (Member, Author) commented:
Thanks for the reviews!

@sweeneyde sweeneyde deleted the bisect_faster branch September 5, 2022 05:02
@pablogsal (Member) commented Sep 6, 2022

Looks like this PR introduced some reference leaks. The likely source is here:

9e35d05

In line 122 and maybe in line 301.

Please check it out; otherwise, per our buildbot policy, we will need to revert it if it is not fixed within 24 hours (assuming it is confirmed that the refleaks come from this PR).
