
bpo-28685: Optimize sorted() list.sort() with type-specialized comparisons #582


Merged: 50 commits merged into python:master on Jan 29, 2018

Conversation

embg
Contributor

@embg embg commented Mar 9, 2017

Description of the optimization (see also this poster)

The idea is simple: in practice, it's very uncommon to sort type-heterogeneous lists. This is because lists in general tend to be used in a homogeneous way (if you're iterating and the type is changing, your code may break, depending on what you're doing), and because comparison is often not defined in the heterogeneous context ("apples and oranges").

So, instead of checking types during every single compare in the sort (dynamic dispatch), we can simply iterate once in a pre-sort check and see if the list is type-homogeneous. If it is, we can replace PyObject_RichCompareBool with whatever compare function would have ended up being dispatched for that type. Since this check is cheap and very unlikely to fail, and checking types every time we compare is expensive, this is a reasonable optimization to consider.
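Functionally, the pre-sort check can be sketched in a few lines of Python (a simplification of the C code; `specialized_compares` here is a hypothetical dispatch table standing in for the patch's compare selection):

```python
def choose_compare(keys, specialized_compares, default_compare):
    """One cheap pass over the keys: if every key has exactly the
    same type, dispatch once up front instead of on every compare."""
    first_type = type(keys[0])
    if all(type(k) is first_type for k in keys):
        # Homogeneous list: use the type-specialized compare if one exists.
        return specialized_compares.get(first_type, default_compare)
    # Heterogeneous list: fall back to full dynamic dispatch.
    return default_compare
```

The check is O(n) pointer comparisons, which is why the worst-case cost stays small.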

This is, however, only the beginning of what's possible. There are many safety checks that have to be performed during every compare for the common cases (string, int, float, tuple) that one encounters in practice. For example, character width has to be checked for both strings every time two strings are compared. Since these checks almost never fail in practice (non-latin strings, for example, are relatively uncommon), we can move them out of the comparison function and into the pre-sort check as well. We then write special-case compare functions (I implemented one for each of the four types mentioned above) that are selected iff the assumptions necessary to use them are verified for each list element.
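As an illustration, the string special case can be sketched in Python (not the C implementation): once the pre-sort check has verified that every string uses one byte per character, comparing the raw byte buffers agrees with ordinary string comparison:

```python
def all_latin(strings):
    # The pre-sort check: every code point fits in one byte.
    return all(ord(c) < 256 for s in strings for c in s)

def latin_compare(a, b):
    # Stand-in for the C-level memcmp of the one-byte buffers;
    # for latin-1 data, byte order coincides with code-point order.
    return a.encode("latin-1") < b.encode("latin-1")
```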

Benchmarks

I considered two sets of benchmarks: one organized by type (random lists of that type), and one organized by structure. Full benchmark scripts can be found here. The results are below (standard deviations were less than 0.3% of the mean for all measurements):

By type

| Type | Percent improvement on random lists of [type] (1-patched/unpatched) |
| --- | --- |
| heterogeneous (lots of float with an int at the end; worst case) | -1.5% |
| float | 48% |
| bounded int (magnitude smaller than 2^32) | 48.4% |
| latin string (all characters in [0,255]) | 32.7% |
| general int (reasonably uncommon?) | 17.2% |
| general string (reasonably uncommon?) | 9.2% |
| tuples of float | 63.2% |
| tuples of bounded int | 64.8% |
| tuples of latin string | 55.8% |
| tuples of general int | 50.3% |
| tuples of general string | 44.1% |
| tuples of heterogeneous | 41.5% |

By structure

These are just the benchmarks described in Objects/listsort.txt. The first table shows the loss we experience when sorting structured heterogeneous lists. The worst case: the list is already sorted, so we go all the way through doing n type-checks and then only end up doing n comparisons. Tragic, but extremely unlikely in practice; usually we would find the first heterogeneous element early on and break out of the check. Here, however, the single lonely float is hiding at the very end of the list of ints, so we don't find it until we've done all n checks:

| Benchmark (for heterogeneous lists, worst case) | Percent improvement (1-patched/unpatched) |
| --- | --- |
| \sort | -17.2% |
| /sort | -19.8% |
| 3sort | -18.0% |
| +sort | -18.8% |
| %sort | -10.0% |
| ~sort | -2.1% |
| =sort | -21.3% |
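The worst case described above is easy to construct (a hypothetical sketch; `homogeneous_prefix` mimics the pre-sort scan):

```python
def homogeneous_prefix(lst):
    """Return how many elements the pre-sort scan examines before it
    finds a type mismatch (len(lst) if the list is homogeneous)."""
    first_type = type(lst[0])
    for i, x in enumerate(lst):
        if type(x) is not first_type:
            return i
    return len(lst)

# Already sorted; the lone float hides at the very end,
# so the scan does n-1 successful checks before giving up.
worst = list(range(9)) + [9.0]
```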

The second table is the same benchmark, but on homogeneous lists (int):

| Benchmark (for homogeneous lists) | Percent improvement (1-patched/unpatched) |
| --- | --- |
| \sort | 54.6% |
| /sort | 56.5% |
| 3sort | 53.5% |
| +sort | 55.3% |
| %sort | 52.4% |
| ~sort | 48.0% |
| =sort | 45.2% |

Patch summary

Here we describe at a high level what each section of the patch does:

| Line numbers in Objects/listobject.c | What the lines do |
| --- | --- |
| 1053-1069 | Define a struct to hold the function pointers we will select in the pre-sort check. This struct then has to be passed in to every function that performs comparison (to keep things in local scope). |
| 1075-1080 | Compare function for heterogeneous lists; just a wrapper for PyObject_RichCompareBool. To be selected if all of our pre-checks fail. |
| 1086-1108 | Compare function for general homogeneous lists; just a wrapper for ob_type->tp_richcompare, which is stored by the pre-sort check at compare_funcs.key_richcompare. This yields modest optimization (neighbourhood of 10%), but we generally hope we can do better. |
| 1111-1127 | Compare function for lists of latin string. During the pre-sort check, we verify that every string in the list uses one character per byte; otherwise, we default to the general homogeneous compare. If this check is even somewhat likely to pass, it's worth it, because the payoff is large, as can be seen in the Benchmarks section. The compare function basically directly accesses the data buffers of the two strings and memcmps them. |
| 1130-1154 | Compare function for lists of bounded long. During the pre-sort check, we verify that every int in the list fits in a single machine word. If that check passes, we can use this optimized compare function, which basically directly compares the machine words representing the two ints (taking sign into account). This is faster than the general comparison, which has to figure out which word is most significant for both inputs, etc., in addition to all the type-checking. |
| 1157-1166 | Compare function for lists of float. Doesn't assume anything; just directly compares the two floats, skipping all the unnecessary type-checking. Because PyFloat_Type->tp_richcompare does a lot of typechecking that we want to move out of the sort loop, it pays to have this optimized compare available. |
| 1173-1233 | Compare function for lists of non-empty tuple. Tuple comparison is optimized on two levels. Namely, after selecting compare_funcs.key_compare in the pre-sort check, we run the pre-sort check again on the list T = [x[0] for x in L] (we don't actually run the check twice, but we do something functionally equivalent to this). If T is type-homogeneous, or even better, satisfies the requirements for one of our special-case compares, we can replace the call to PyObject_RichCompareBool for the first tuple element with a call to compare_funcs.tuple_elem_compare. This allows us to bypass two levels of wasteful safety checks. If the first elements of the two tuples are equal, of course, we have to call PyObject_RichCompareBool on subsequent elements; the idea is that this is uncommon in practice. |
| 2168-2212 | First part of the pre-sort check: we set the variables key_type, keys_are_all_same_type, ints_are_bounded, strings_are_latin, and keys_are_in_tuples (which is 1 iff every list element is a non-empty tuple, in which case all the other variables refer to the list [x[0] for x in L]). |
| 2215-2243 | Second part of the pre-sort check: given values for those variables, select the appropriate compare function. If keys_are_in_tuples and key_type != &PyTuple_Type, then use the other variables to select compare_funcs.tuple_elem_compare, and set compare_funcs.key_compare = unsafe_tuple_compare. |
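Putting the two parts of the pre-sort check together, the selection logic is roughly the following (a Python paraphrase; the string names mirror the compare functions in the patch, but the control flow here is an illustrative simplification):

```python
def select_compare(keys_are_all_same_type, key_type,
                   ints_are_bounded, strings_are_latin):
    # Paraphrase of the second part of the pre-sort check.
    if not keys_are_all_same_type:
        return "safe_object_compare"    # wrapper for PyObject_RichCompareBool
    if key_type is int and ints_are_bounded:
        return "unsafe_long_compare"    # ints fit in one machine word
    if key_type is str and strings_are_latin:
        return "unsafe_latin_compare"   # one byte per character
    if key_type is float:
        return "unsafe_float_compare"
    return "unsafe_object_compare"      # cached tp_richcompare
```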

Selected quotes from the python-ideas thread

Terry Reedy:

Do reference this thread, and quote Tim's approval in principle, if he did not post on the tracker.

Tim Peters:

Would someone please move the patch along? I expect it's my fault it's languished so long, since I'm probably the natural person to review it, but I've been buried under other stuff.

But the patch doesn't change anything about the sorting algorithm itself - even shallow knowledge of how timsort works is irrelevant. It's just plugging in a different bottom-level object comparison function when that appears valuable.

I've said from the start that it's obvious (to me ;-) ) that it's an excellent tradeoff. At worst it adds one simple (pre)pass over the list doing C-level pointer equality comparisons. That's cheap. The worst-case damage is obviously small, the best-case gain is obviously large, and the best cases are almost certainly far more common than the worst cases in most code.

Later in that message, Tim also pointed out a bug, which has been fixed in this version of the patch.

https://bugs.python.org/issue28685

@mention-bot

@embg, thanks for your PR! By analyzing the history of the files in this pull request, we identified @DanielStutzbach, @tim-one, @rhettinger, @loewis and @Yhg1s to be potential reviewers.

@vstinner vstinner added the performance Performance or resource usage label Mar 9, 2017
@@ -1044,6 +1045,193 @@ sortslice_advance(sortslice *slice, Py_ssize_t n)
#define IFLT(X, Y) if ((k = ISLT(X, Y)) < 0) goto fail; \
if (k)

/* Here we define custom comparison functions to optimize for the cases one commonly
* in practice: homogeneous lists, often of one of the basic types. */


Isn't there an "encounters" or "finds" missing here, between "commonly" and "in practice"?

Contributor Author

yup, thanks for pointing this out.

Member

@tim-one tim-one left a comment

C is a "call by value" language. In particular, when you pass a struct, the language has to act as if a physical copy of the struct is passed. The easiest way is for the implementation to actually make a copy every time a struct is passed; while a smart compiler can arrange to do it more efficiently in most cases, it's poor practice to rely on that:

http://www.cs.utsa.edu/~wagner/CS2213/structs/structs_pass.html

That's why you almost never see non-trivial structs passed in C. Instead a pointer is passed.

The uses of the MergeState struct in the existing sort code are typical examples of that. Indeed, I believe we'd be better off if you added the new comparison-function members directly to the MergeState struct, and pass around pointers to ms (the one MergeState instance) instead. The top layers of functions already pass pointers to that struct around, so their signatures wouldn't need to change at all.

@embg
Contributor Author

embg commented Mar 11, 2017

@tim-one I'm aware that the struct would be copied; my reasoning for wanting to pass by value was rather nonsensical. I realize now that I was completely wrong.

The reason I chose to define a new struct instead of adding to MergeState was that gallop_left and gallop_right don't take in MergeState, and I thought it would be confusing to add it as a parameter. Your call: should I add MergeState to the gallop_* signatures, or just change CompareFuncs to CompareFuncs*? The former would be much less invasive, actually.

Last thing: since I've got you here, can I ask you another question? Namely, this entire project was based on the following quote from the documentation:

Like its identity, an object’s type is also unchangeable. [1]

[1] It is possible in some cases to change an object’s type, under certain controlled conditions. It generally isn’t a good idea though, since it can lead to some very strange behavior if it is handled incorrectly.

So I figured I can assume the types will stay the same. I mean, what's the point of checking the types if they're going to change on you? However, ppperry supplied the following counterexample on the tracker:

    @functools.total_ordering
    class ClassAssignmentCanBreakChecks():
        def __init__(self, i):
            self._i = i
        def __lt__(self, other):
            last.__class__ = OrdinaryOldInteger
            return self._i < (other._i if hasattr(other, '_i') else other)

    @functools.total_ordering
    class OrdinaryOldInteger:
        def __init__(self, i):
            self._i = i
        def __lt__(self, other):
            return self._i < (other._i if hasattr(other, '_i') else other)

    lst = [ClassAssignmentCanBreakChecks(i) for i in range(10)]
    random.shuffle(lst)
    last = lst[-1]
    lst.sort()

My code would work correctly here, since the compare functions are equivalent sans side effects, but in general something like this could break unsafe_object_compare. Do you think this is something worth worrying about?

Thank you so much for reviewing my code!

@tim-one
Member

tim-one commented Mar 11, 2017

Yes, I believe it would be best to add the new stuff to the MergeState struct.

Why that's not confusing: conceptually, we're building an object to do the sort. All the sort-related helper functions are really methods of that object, and the MergeState struct is that object's private data. Indeed, that's why the functions that do currently take a MergeState argument list it first: in analogy with a Python method taking self as its first argument. That some (conceptual) methods do not currently take a MergeState argument first is just an accidental micro-optimization. Now that some need access to the (conceptual) object's private data, the obvious way to do that is to give them an initial MergeState argument.

If CPython were coded in C++ instead, it really would be obvious ;-)

About the __class__-changing example, I don't care provided it doesn't blow up. But someone else should weigh in on that. I hope we can get, e.g., Raymond Hettinger to stare at this issue too.

@embg
Contributor Author

embg commented Mar 11, 2017

@tim-one Now that you explained it, I'm totally on board; I'll rewrite everything to use MergeState.

About the __class__-changing example: there's no way it could segfault if your classes are defined in pure Python, because the pure-Python operators/functions will raise exceptions if they get bad types (correct me if I'm wrong). So you would just get an exception, which is what is appropriate here anyway IMO. However, if your comparison function were defined using the C API, I'm pretty sure you could set it up to get a segfault.

@tim-one
Member

tim-one commented Mar 11, 2017

I'm curious about what the patched Python prints for this program:

def mkn():
    return float("nan")
n = mkn()
xs = [(n, i) for i in range(4)]
ys = [(mkn(), i) for i in range(4)]
print(sorted(reversed(xs)))
print(sorted(reversed(ys)))

What it prints now:

[(nan, 0), (nan, 1), (nan, 2), (nan, 3)]
[(nan, 3), (nan, 2), (nan, 1), (nan, 0)]

@embg
Contributor Author

embg commented Mar 11, 2017

@tim-one Just ran it; it printed this:

[(nan, 0), (nan, 1), (nan, 2), (nan, 3)]
[(nan, 3), (nan, 2), (nan, 1), (nan, 0)]

@embg
Contributor Author

embg commented Mar 11, 2017

@tim-one Actually, I'm getting weirded out here.

First of all. The test you just gave should not have worked. What you were trying to get at is that PyObject_RichCompareBool compares equal if references are equal, while the nans don't compare equal. Right? Well, unsafe_tuple_compare, which calls unsafe_float_compare, doesn't look at the addresses -- it just compares the floats. So it should have given a different answer. I even made it print out which compares it was using, just to be sure, and it was using those two. So why did it work?

Another example. ppperry proposed the following test:

class A(int):
    def __lt__(self, other):
        print("zebra")
        A.__lt__ = A.__false_lt__
        return int.__lt__(self, other)
    __true_lt__ = __lt__
    def __false_lt__(self, other):
        print("gizmo")
        A.__lt__ = A.__true_lt__
        return int.__lt__(self, other)

The idea is that if you cache the richcompare, you're going to mess up the print statements. But my patched interpreter printed the same as the current implementation!

Finally, I decided to run the original counterexample ppperry proposed, putting in print statements to see what's going on:

from random import *
import functools

@functools.total_ordering
class ClassAssignmentCanBreakChecks():
    def __init__(self, i):
        self._i = i
    def __lt__ (self, other):
        print('gizmo')
        last.__class__ = OrdinaryOldInteger
        return self._i < (other._i if hasattr(other, '_i') else other)

@functools.total_ordering
class OrdinaryOldInteger:
    def __init__(self, i):
        self._i = i
    def __lt__(self, other):
        print('rocket')
        return self._i < (other._i if hasattr(other, '_i') else other)
lst = [ClassAssignmentCanBreakChecks(i) for i in range(10)]
shuffle(lst)
last = lst[-1]
lst.sort()

Again, it matched up with the current implementation!

So my question is: why is my patch working when it shouldn't be working??? I have no idea; can you figure this out?

The only explanation I can think of is that maybe heap types all use the same ob_type->tp_richcompare, which looks up the appropriate __lt__ method. Is this correct?
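That hypothesis can be checked from pure Python: for heap types, tp_richcompare is a slot function that looks up __lt__ on the object's current type, so reassigning __class__ redirects dispatch automatically. A minimal demonstration (hypothetical classes, not from the patch):

```python
class A:
    def __lt__(self, other):
        return True

class B:
    def __lt__(self, other):
        return False

x = A()
before = x < A()   # dispatches through A.__lt__
x.__class__ = B    # change the object's type mid-flight
after = x < A()    # the same object now dispatches through B.__lt__
```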

@tim-one
Member

tim-one commented Mar 11, 2017

Slamming 3 distinct mysteries into one message doesn't really help ;-)

I'll address the one I started: I expect you'd get different results (in the NaN test) if your tuple compare loop:

for (i = 0; i < vlen && i < wlen; i++) {

started at i = 1 instead. At this point you already checked v[0] against w[0] twice, and since the "unsafe" float compare said "nope, not less than" both times you decided they must be equal. But the loop goes on to compare them a third time, using PyObject_RichCompareBool which does special-case object identity. So the program I gave yields the same results with or without the patch.

Which is a really weird "reason" to keep that loop starting at i = 0. Which I didn't realize it did at first, instead believing the comment:

We need to look at v[1:] and w[1:]

@embg
Contributor Author

embg commented Mar 11, 2017

@tim-one Ah! That makes sense. What happened was, I figured, if you're already at the loop, it's going to be slow anyway, so I may as well exactly copy-paste from the current implementation to reduce the probability of error. Turns out that saved me! Anyway, your call: should I change it to start at 1 (adding a pointer compare to make it safe)?

As for the other mysteries: I'm convinced it must be that ob_type->tp_richcompare is just a lookup method for __lt__, so when the type changes it automatically switches the compare.

So, basically, the patch has withstood all attempts to break it! Awesome!

assert(v->ob_type == w->ob_type &&
v->ob_type->tp_richcompare != NULL);
#endif
int ok;
Member

Current style guide allows declarations not at the start of the block.

if (res == NULL)
return -1;

if (PyBool_Check(res)){
Member

Add a space before {.

#endif
int ok;

if (v == w) return 0;
Member

Split the line before return.

Contributor Author

This was actually removed -- it turns out you only need to look at addresses when you're doing equality testing.


/* Latin string compare: safe for any two latin (one byte per char) strings. */
static int
unsafe_latin_compare(PyObject* v, PyObject* w, MergeState* ms){
Member

Move { to separate line.

/* This function is used by unsafe_object_compare to optimize comparisons
* when we know our list is type-homogeneous but we can't assume anything else.
* In the pre-sort check it is set equal to key->ob_type->tp_richcompare */
PyObject* (*key_richcompare)(PyObject*, PyObject*, int);
Member

It is more common to use a space before * rather than after it.

The same is everywhere in the new code.

Py_SIZE(lo.keys[0]) > 0);

PyTypeObject* key_type = (keys_are_in_tuples ?
PyTuple_GET_ITEM(lo.keys[0],0)->ob_type :
Member

Add a space after ,.

* lo.keys[i], not lo.keys[i] itself! We verify type-homogeneity
* for lists of tuples in the if-statement directly above. */
PyObject* key = (keys_are_in_tuples ?
PyTuple_GET_ITEM(lo.keys[i],0) :
Member

Add a space after ,.

break;
}

else if (key_type == &PyLong_Type && ints_are_bounded &&
Member

else is not needed since an alternate branch is ended by break.

If ints_are_bounded or Py_ABS(Py_SIZE(key)) > 1 are false, key_type == &PyUnicode_Type still is checked.

I would write:

if (key_type == &PyLong_Type) {
    if (ints_are_bounded && Py_ABS(Py_SIZE(key)) > 1)
        ints_are_bounded = 0;
}
else if (key_type == &PyUnicode_Type ...

Contributor Author

OK!

/* Choose the best compare, given what we now know about the keys. */
if (keys_are_all_same_type) {

if (key_type == &PyUnicode_Type && strings_are_latin)
Member

Since in any case you use empty lines for readability, it would be better to add braces.

Contributor Author

Sure

else if ((ms.key_richcompare = key_type->tp_richcompare) != NULL)
ms.key_compare = unsafe_object_compare;

} else {
Member

else should be at the start of new line.

@serhiy-storchaka
Member

Sorry, I missed that this PR is closed. Is the proposition rejected?

@embg
Contributor Author

embg commented Mar 14, 2017

@serhiy-storchaka Thanks so much for your review! Tim asked me to write tests in Lib/tests/sort_test.py and then closed it; I think he wants me to open a new PR once I've written the tests?

@tim-one
Member

tim-one commented Mar 14, 2017

Ack! This is the first time I've used github, and I didn't intend to close it. I must have hit a wrong button :-( Here, I'll try to open it again. Sorry!

@tim-one tim-one reopened this Mar 14, 2017
@embg
Contributor Author

embg commented Mar 14, 2017

@tim-one Haha no problem, it's going to be a couple days till my tests are up to snuff :)

Comparing bytes and strings yields a warning, not an error, so assertRaises fails.
@embg
Contributor Author

embg commented Mar 23, 2017

@serhiy-storchaka Don't know if you saw this or not; I fixed all the issues you raised. As far as I can tell, the patch is 100% finished.

L = [WackyList1([WackyComparator(i), i]) for i in range(10)]
elem = L[-1]
self.assertRaises(ValueError, L.sort)
self.assertRaises(ValueError, [(x,) for x in L].sort)
@pppery pppery Dec 4, 2017

The test on line 325 doesn't actually work; the first sort of the non-tupled list already did the __class__ modification, so L now contains a WackyList2 already and thus the sort would fail even if there was a bug that made sorting one-tuples not handle mid-sort tp_richcompare changes.

@rhettinger rhettinger changed the title bpo-28685: Optimize list.sort() by performing safety checks in advance bpo-28685: Optimize sorted() list.sort() with type-specialized comparisons Jan 28, 2018
* Use the with-statement form of self.assertRaises.

* Make the two assertions independent of one another.
  The second test was invalid because the first list
  was already sorted, making the ValueError inevitable.
@rhettinger rhettinger merged commit 1e34da4 into python:master Jan 29, 2018
@bedevere-bot

@rhettinger: Please replace # with GH- in the commit message next time. Thanks!
