bpo-36218: Fix handling of heterogeneous values in list.sort #12209

remilapeyre · 2019-03-07T09:42:50Z

https://bugs.python.org/issue36218

tirkarthi · 2019-03-07T12:00:41Z

Please add a NEWS entry since this fixes a segfault in a released version.

remilapeyre · 2019-03-07T12:07:07Z

Thanks @tirkarthi!

embg · 2019-03-08T08:19:28Z

Author of the bug here: apologies for the error, this patch is absolutely correct.

remilapeyre · 2019-03-08T13:24:50Z

To be fair, it was a clever patch and did pass the review from Tim Peters ;) (and the bug took quite a long time to get noticed so I guess lists are actually homogenous in general)

remilapeyre · 2019-03-10T23:52:50Z

Hi @embg, thinking back about this bug, I'm not sure this is the best fix.

For inputs like [(1, False), (2.0, True)] the current pre-sort check will use ms.key_compare = safe_object_compare while it could use ms.key_compare = unsafe_tuple_compare and ms.tuple_elem_compare = safe_object_compare, don't you think?

embg · 2019-03-11T03:14:59Z

@remilapeyre You're correct, and according to my measurements from the original PR, the performance gain would be about 40% for the case you mention.

How about like this? (See changes below)

embg · 2019-03-11T03:18:06Z

Objects/listobject.c

@@ -2305,6 +2305,7 @@ list_sort_impl(PyListObject *self, PyObject *keyfunc, int reverse)
                             lo.keys[i]);

            if (key->ob_type != key_type) {
+                keys_are_in_tuples = 0;


Suggested change

keys_are_in_tuples = 0;

We remove this line, which over-compensates in the case you mentioned.

embg · 2019-03-11T03:19:36Z

Objects/listobject.c

+            if (keys_are_in_tuples) {
+                /* Make sure we're not dealing with tuples of tuples
+                * (remember: here, key_type refers list [key[0] for key in keys]) */
+                if (key_type == &PyTuple_Type) {


Suggested change

if (key_type == &PyTuple_Type) {

if ( (keys_are_all_same_type == 0) || (key_type == &PyTuple_Type) ) {

And instead modify this check, which sets tuple_elem_compare to safe_object_compare in the case of nested tuples, to also fall back when the tuple first-elements are heterogeneous.

I don't think that's going to work because of the early exit at https://github.com/python/cpython/pull/12209/files#diff-5037da2a492e5903f100107e0506f818R2310 you can't be sure that all keys are tuples anyway.

I think the simplest way forward would be to use two variables key_type and inner_key_type to know if it's item[0] or item that changed type.

Does this make sense?

embg · 2019-03-11T14:26:02Z

@remilapeyre On second thought, I'm not clear how the early exit would impact the change proposed in my review, since the check is outside the loop. Could you walk me through that?

Although I agree that two variables is a good solution if we can't patch it by just modifying a single line.

remilapeyre · 2019-03-11T15:06:42Z

Unless I missed something, you are suggesting to use this code:

    /* The pre-sort check: here's where we decide which compare function to use.
     * How much optimization is safe? We test for homogeneity with respect to
     * several properties that are expensive to check at compare-time, and
     * set ms appropriately. */
    if (saved_ob_size > 1) {
        /* Assume the first element is representative of the whole list. */
        int keys_are_in_tuples = (lo.keys[0]->ob_type == &PyTuple_Type &&
                                  Py_SIZE(lo.keys[0]) > 0);

        PyTypeObject* key_type = (keys_are_in_tuples ?
                                  PyTuple_GET_ITEM(lo.keys[0], 0)->ob_type :
                                  lo.keys[0]->ob_type);

        int keys_are_all_same_type = 1;
        int strings_are_latin = 1;
        int ints_are_bounded = 1;

        /* Prove that assumption by checking every key. */
        for (i=0; i < saved_ob_size; i++) {

            if (keys_are_in_tuples &&
                !(lo.keys[i]->ob_type == &PyTuple_Type && Py_SIZE(lo.keys[i]) != 0)) {
                keys_are_in_tuples = 0;
                keys_are_all_same_type = 0;
                break;
            }

            /* Note: for lists of tuples, key is the first element of the tuple
             * lo.keys[i], not lo.keys[i] itself! We verify type-homogeneity
             * for lists of tuples in the if-statement directly above. */
            PyObject *key = (keys_are_in_tuples ?
                             PyTuple_GET_ITEM(lo.keys[i], 0) :
                             lo.keys[i]);

            if (key->ob_type != key_type) {
                keys_are_all_same_type = 0;
                break;
            }

            if (key_type == &PyLong_Type) {
                if (ints_are_bounded && Py_ABS(Py_SIZE(key)) > 1)
                    ints_are_bounded = 0;
            }
            else if (key_type == &PyUnicode_Type){
                if (strings_are_latin &&
                    PyUnicode_KIND(key) != PyUnicode_1BYTE_KIND)
                strings_are_latin = 0;
            }
        }

        /* Choose the best compare, given what we now know about the keys. */
        if (keys_are_all_same_type) {

            if (key_type == &PyUnicode_Type && strings_are_latin) {
                ms.key_compare = unsafe_latin_compare;
            }
            else if (key_type == &PyLong_Type && ints_are_bounded) {
                ms.key_compare = unsafe_long_compare;
            }
            else if (key_type == &PyFloat_Type) {
                ms.key_compare = unsafe_float_compare;
            }
            else if ((ms.key_richcompare = key_type->tp_richcompare) != NULL) {
                ms.key_compare = unsafe_object_compare;
            }
            else {
                ms.key_compare = safe_object_compare;
            }
            
        }
        else {
            ms.key_compare = safe_object_compare;
        }

        if (keys_are_in_tuples) {
            /* Make sure we're not dealing with tuples of tuples
            * (remember: here, key_type refers list [key[0] for key in keys]) */
            if  ((keys_are_all_same_type == 0) || (key_type == &PyTuple_Type)) {
                ms.tuple_elem_compare = safe_object_compare;
            }
            else {
                ms.tuple_elem_compare = ms.key_compare;
            }

            ms.key_compare = unsafe_tuple_compare;
        }

    }
    /* End of pre-sort check: ms is now set properly! */

Then for input like [(1, True), (2.4, False), 6] the loop will break on the second element ((2.4, False)) because type(1) != type(2.4) with keys_are_in_tuples = 1 and keys_are_all_same_type = 0. This will set ms.key_compare = unsafe_tuple_compare which will segfault when comparing something with the third element (6).

If you break from the list before the third element, how can you be sure it will be a tuple?

embg · 2019-03-11T18:22:58Z

OK, I see what you mean now. Couldn't we fix it like this?

if (key->ob_type != key_type) {
    keys_are_all_same_type = 0;
    if (!keys_are_in_tuples) break;
}

That way we break early for the non-tuple case, but for tuples we check all the way through. The full diff would then be two lines.

ned-deily · 2019-03-12T04:21:44Z

@tim-one and @rhettinger, since you were involved in the original change, could you review this please?

remilapeyre · 2019-03-12T09:12:10Z

@embg yes, this seems to be the correct condition to exit the loop.

tim-one · 2019-03-13T05:18:14Z

@ned-deily I haven't chimed in here because it appears to be a shallow bug that's already getting first-class attention. The kinds of bugs we were looking for at the start were far subtler 😉 .

rhettinger · 2019-03-21T06:34:12Z

Serhiy, I would appreciate it if you could go through this as well.

Objects/listobject.c

Lib/test/test_sort.py

Objects/listobject.c

Co-Authored-By: remilapeyre <remi.lapeyre@henki.fr>

serhiy-storchaka

Oh, you have changed too much. This change adds a pessimization. I suggested to fix an indentation, not to change a code.

remilapeyre · 2019-03-22T13:55:42Z

Sorry, I will fix it.

embg · 2019-03-22T15:37:54Z

@remilapeyre Since sort() could conceivably be run on untrusted input, you should consider submitting the segfault you found to a bug bounty program: https://hackerone.com/ibb-python

remilapeyre · 2019-03-22T15:41:52Z

Thanks for invitation but it was Lyn Levenick that reported the issue on https://bugs.python.org/issue36218#msg337341

embg · 2019-03-22T15:45:58Z

Ah, I will post on the bugtracker thread then

rhettinger · 2019-03-24T05:53:32Z

Let's see if this can get wrapped up in time for the alpha release this weekend.

miss-islington · 2019-03-25T07:25:39Z

Thanks @remilapeyre for the PR, and @rhettinger for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7.
🐍🍒⛏🤖

bedevere-bot · 2019-03-25T07:25:52Z

GH-12532 is a backport of this pull request to the 3.7 branch.

…H-12209) (cherry picked from commit dd5417a) Co-authored-by: Rémi Lapeyre <remi.lapeyre@henki.fr>

GH-12532) (cherry picked from commit dd5417a) Co-authored-by: Rémi Lapeyre <remi.lapeyre@henki.fr>

Fix sorting of heterogeneous values in list.sort

e75d9fb

the-knights-who-say-ni added the CLA signed label Mar 7, 2019

bedevere-bot added the awaiting review label Mar 7, 2019

Add blurb

8498a21

embg reviewed Mar 11, 2019

View reviewed changes

serhiy-storchaka requested review from tim-one and rhettinger March 12, 2019 08:20

serhiy-storchaka added type-bug An unexpected behavior, bug, or error needs backport to 3.7 labels Mar 12, 2019

Remove early exit when keys are tuples

60aca1e

Rémi Lapeyre added 2 commits March 12, 2019 16:47

Fix bug found by test_json

eaa7c50

Remove empty line

3b02a05

embg approved these changes Mar 13, 2019

View reviewed changes

bedevere-bot added awaiting core review and removed awaiting review labels Mar 13, 2019

tim-one removed their request for review March 13, 2019 05:18

rhettinger approved these changes Mar 21, 2019

View reviewed changes

bedevere-bot added awaiting merge and removed awaiting core review labels Mar 21, 2019

rhettinger requested a review from serhiy-storchaka March 21, 2019 06:32

serhiy-storchaka reviewed Mar 21, 2019

View reviewed changes

Objects/listobject.c Outdated Show resolved Hide resolved

serhiy-storchaka reviewed Mar 21, 2019

View reviewed changes

Lib/test/test_sort.py Show resolved Hide resolved

Rémi Lapeyre added 2 commits March 21, 2019 12:06

Simplify logic in pre-sort check

47a0688

Add test

467cd16

serhiy-storchaka reviewed Mar 21, 2019

View reviewed changes

Objects/listobject.c Outdated Show resolved Hide resolved

Objects/listobject.c Outdated Show resolved Hide resolved

serhiy-storchaka and others added 2 commits March 21, 2019 14:03

Fix coding style

8b2224f

Co-Authored-By: remilapeyre <remi.lapeyre@henki.fr>

Fix indentation

6286a4b

serhiy-storchaka reviewed Mar 22, 2019

View reviewed changes

rhettinger merged commit dd5417a into python:master Mar 25, 2019

bedevere-bot removed the awaiting merge label Mar 25, 2019

bedevere-bot removed the needs backport to 3.7 label Mar 25, 2019

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Mar 25, 2019

bpo-36218: Fix handling of heterogeneous values in list.sort (pythonG…

54fed03

…H-12209) (cherry picked from commit dd5417a) Co-authored-by: Rémi Lapeyre <remi.lapeyre@henki.fr>

rhettinger pushed a commit that referenced this pull request Mar 25, 2019

bpo-36218: Fix handling of heterogeneous values in list.sort (GH-12209)

9dbb09f

GH-12532) (cherry picked from commit dd5417a) Co-authored-by: Rémi Lapeyre <remi.lapeyre@henki.fr>

	if (key_type == &PyTuple_Type) {
	if ( (keys_are_all_same_type == 0) \|\| (key_type == &PyTuple_Type) ) {

Uh oh!

bpo-36218: Fix handling of heterogeneous values in list.sort #12209

bpo-36218: Fix handling of heterogeneous values in list.sort #12209

Uh oh!

Conversation

remilapeyre commented Mar 7, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tirkarthi commented Mar 7, 2019

Uh oh!

remilapeyre commented Mar 7, 2019

Uh oh!

embg commented Mar 8, 2019

Uh oh!

remilapeyre commented Mar 8, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

remilapeyre commented Mar 10, 2019

Uh oh!

embg commented Mar 11, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

embg Mar 11, 2019

Choose a reason for hiding this comment

Uh oh!

embg Mar 11, 2019

Choose a reason for hiding this comment

Uh oh!

embg Mar 11, 2019

Choose a reason for hiding this comment

Uh oh!

embg Mar 11, 2019

Choose a reason for hiding this comment

Uh oh!

remilapeyre Mar 11, 2019

Choose a reason for hiding this comment

Uh oh!

embg commented Mar 11, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

remilapeyre commented Mar 11, 2019

Uh oh!

embg commented Mar 11, 2019

Uh oh!

ned-deily commented Mar 12, 2019

Uh oh!

remilapeyre commented Mar 12, 2019

Uh oh!

tim-one commented Mar 13, 2019

Uh oh!

rhettinger commented Mar 21, 2019

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

serhiy-storchaka left a comment

Choose a reason for hiding this comment

Uh oh!

remilapeyre commented Mar 22, 2019

Uh oh!

embg commented Mar 22, 2019

Uh oh!

remilapeyre commented Mar 22, 2019

Uh oh!

embg commented Mar 22, 2019

Uh oh!

rhettinger commented Mar 24, 2019

Uh oh!

miss-islington commented Mar 25, 2019

Uh oh!

bedevere-bot commented Mar 25, 2019

Uh oh!

Uh oh!

remilapeyre commented Mar 7, 2019 •

edited

Loading

remilapeyre commented Mar 8, 2019 •

edited

Loading

embg commented Mar 11, 2019 •

edited

Loading

embg commented Mar 11, 2019 •

edited

Loading