Skip to content

bpo-36218: Fix handling of heterogeneous values in list.sort #12209

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

remilapeyre
Copy link
Contributor

@remilapeyre remilapeyre commented Mar 7, 2019

@tirkarthi
Copy link
Member

Please add a NEWS entry since this fixes a segfault in a released version.

@remilapeyre
Copy link
Contributor Author

Thanks @tirkarthi!

@embg
Copy link
Contributor

embg commented Mar 8, 2019

Author of the bug here: apologies for the error, this patch is absolutely correct.

@remilapeyre
Copy link
Contributor Author

remilapeyre commented Mar 8, 2019

To be fair, it was a clever patch and did pass the review from Tim Peters ;) (and the bug took quite a long time to get noticed so I guess lists are actually homogenous in general)

@remilapeyre
Copy link
Contributor Author

Hi @embg, thinking back about this bug, I'm not sure this is the best fix.

For inputs like [(1, False), (2.0, True)] the current pre-sort check will use ms.key_compare = safe_object_compare while it could use ms.key_compare = unsafe_tuple_compare and ms.tuple_elem_compare = safe_object_compare, don't you think?

@embg
Copy link
Contributor

embg commented Mar 11, 2019

@remilapeyre You're correct, and according to my measurements from the original PR, the performance gain would be about 40% for the case you mention.

How about like this? (See changes below)

@@ -2305,6 +2305,7 @@ list_sort_impl(PyListObject *self, PyObject *keyfunc, int reverse)
lo.keys[i]);

if (key->ob_type != key_type) {
keys_are_in_tuples = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
keys_are_in_tuples = 0;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We remove this line, which over-compensates in the case you mentioned.

if (keys_are_in_tuples) {
/* Make sure we're not dealing with tuples of tuples
* (remember: here, key_type refers list [key[0] for key in keys]) */
if (key_type == &PyTuple_Type) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (key_type == &PyTuple_Type) {
if ( (keys_are_all_same_type == 0) || (key_type == &PyTuple_Type) ) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And instead modify this check, which sets tuple_elem_compare to safe_object_compare in the case of nested tuples, to also fall back when the tuple first-elements are heterogeneous.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's going to work because of the early exit at https://github.com/python/cpython/pull/12209/files#diff-5037da2a492e5903f100107e0506f818R2310 you can't be sure that all keys are tuples anyway.

I think the simplest way forward would be to use two variables key_type and inner_key_type to know if it's item[0] or item that changed type.

Does this make sense?

@embg
Copy link
Contributor

embg commented Mar 11, 2019

@remilapeyre On second thought, I'm not clear how the early exit would impact the change proposed in my review, since the check is outside the loop. Could you walk me through that?

Although I agree that two variables is a good solution if we can't patch it by just modifying a single line.

@remilapeyre
Copy link
Contributor Author

Unless I missed something, you are suggesting to use this code:

    /* The pre-sort check: here's where we decide which compare function to use.
     * How much optimization is safe? We test for homogeneity with respect to
     * several properties that are expensive to check at compare-time, and
     * set ms appropriately. */
    if (saved_ob_size > 1) {
        /* Assume the first element is representative of the whole list. */
        int keys_are_in_tuples = (lo.keys[0]->ob_type == &PyTuple_Type &&
                                  Py_SIZE(lo.keys[0]) > 0);

        PyTypeObject* key_type = (keys_are_in_tuples ?
                                  PyTuple_GET_ITEM(lo.keys[0], 0)->ob_type :
                                  lo.keys[0]->ob_type);

        int keys_are_all_same_type = 1;
        int strings_are_latin = 1;
        int ints_are_bounded = 1;

        /* Prove that assumption by checking every key. */
        for (i=0; i < saved_ob_size; i++) {

            if (keys_are_in_tuples &&
                !(lo.keys[i]->ob_type == &PyTuple_Type && Py_SIZE(lo.keys[i]) != 0)) {
                keys_are_in_tuples = 0;
                keys_are_all_same_type = 0;
                break;
            }

            /* Note: for lists of tuples, key is the first element of the tuple
             * lo.keys[i], not lo.keys[i] itself! We verify type-homogeneity
             * for lists of tuples in the if-statement directly above. */
            PyObject *key = (keys_are_in_tuples ?
                             PyTuple_GET_ITEM(lo.keys[i], 0) :
                             lo.keys[i]);

            if (key->ob_type != key_type) {
                keys_are_all_same_type = 0;
                break;
            }

            if (key_type == &PyLong_Type) {
                if (ints_are_bounded && Py_ABS(Py_SIZE(key)) > 1)
                    ints_are_bounded = 0;
            }
            else if (key_type == &PyUnicode_Type){
                if (strings_are_latin &&
                    PyUnicode_KIND(key) != PyUnicode_1BYTE_KIND)
                strings_are_latin = 0;
            }
        }

        /* Choose the best compare, given what we now know about the keys. */
        if (keys_are_all_same_type) {

            if (key_type == &PyUnicode_Type && strings_are_latin) {
                ms.key_compare = unsafe_latin_compare;
            }
            else if (key_type == &PyLong_Type && ints_are_bounded) {
                ms.key_compare = unsafe_long_compare;
            }
            else if (key_type == &PyFloat_Type) {
                ms.key_compare = unsafe_float_compare;
            }
            else if ((ms.key_richcompare = key_type->tp_richcompare) != NULL) {
                ms.key_compare = unsafe_object_compare;
            }
            else {
                ms.key_compare = safe_object_compare;
            }
            
        }
        else {
            ms.key_compare = safe_object_compare;
        }

        if (keys_are_in_tuples) {
            /* Make sure we're not dealing with tuples of tuples
            * (remember: here, key_type refers list [key[0] for key in keys]) */
            if  ((keys_are_all_same_type == 0) || (key_type == &PyTuple_Type)) {
                ms.tuple_elem_compare = safe_object_compare;
            }
            else {
                ms.tuple_elem_compare = ms.key_compare;
            }

            ms.key_compare = unsafe_tuple_compare;
        }

    }
    /* End of pre-sort check: ms is now set properly! */

Then for input like [(1, True), (2.4, False), 6] the loop will break on the second element ((2.4, False)) because type(1) != type(2.4) with keys_are_in_tuples = 1 and keys_are_all_same_type = 0. This will set ms.key_compare = unsafe_tuple_compare which will segfault when comparing something with the third element (6).

If you break from the list before the third element, how can you be sure it will be a tuple?

@embg
Copy link
Contributor

embg commented Mar 11, 2019

OK, I see what you mean now. Couldn't we fix it like this?

if (key->ob_type != key_type) {
    keys_are_all_same_type = 0;
    if (!keys_are_in_tuples) break;
}

That way we break early for the non-tuple case, but for tuples we check all the way through. The full diff would then be two lines.

@ned-deily
Copy link
Member

@tim-one and @rhettinger, since you were involved in the original change, could you review this please?

@serhiy-storchaka serhiy-storchaka added type-bug An unexpected behavior, bug, or error needs backport to 3.7 labels Mar 12, 2019
@remilapeyre
Copy link
Contributor Author

@embg yes, this seems to be the correct condition to exit the loop.

@tim-one
Copy link
Member

tim-one commented Mar 13, 2019

@ned-deily I haven't chimed in here because it appears to be a shallow bug that's already getting first-class attention. The kinds of bugs we were looking for at the start were far subtler 😉 .

@tim-one tim-one removed their request for review March 13, 2019 05:18
@rhettinger
Copy link
Contributor

Serhiy, I would appreciate it if you could go through this as well.

serhiy-storchaka and others added 2 commits March 21, 2019 14:03
Co-Authored-By: remilapeyre <remi.lapeyre@henki.fr>
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you have changed too much. This change adds a pessimization. I suggested to fix an indentation, not to change a code.

@remilapeyre
Copy link
Contributor Author

Sorry, I will fix it.

@embg
Copy link
Contributor

embg commented Mar 22, 2019

@remilapeyre Since sort() could conceivably be run on untrusted input, you should consider submitting the segfault you found to a bug bounty program: https://hackerone.com/ibb-python

@remilapeyre
Copy link
Contributor Author

Thanks for invitation but it was Lyn Levenick that reported the issue on https://bugs.python.org/issue36218#msg337341

@embg
Copy link
Contributor

embg commented Mar 22, 2019

Ah, I will post on the bugtracker thread then

@rhettinger
Copy link
Contributor

Let's see if this can get wrapped up in time for the alpha release this weekend.

@rhettinger rhettinger merged commit dd5417a into python:master Mar 25, 2019
@miss-islington
Copy link
Contributor

Thanks @remilapeyre for the PR, and @rhettinger for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7.
🐍🍒⛏🤖

@bedevere-bot
Copy link

GH-12532 is a backport of this pull request to the 3.7 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Mar 25, 2019
…H-12209)

(cherry picked from commit dd5417a)

Co-authored-by: Rémi Lapeyre <remi.lapeyre@henki.fr>
rhettinger pushed a commit that referenced this pull request Mar 25, 2019
GH-12532)

(cherry picked from commit dd5417a)

Co-authored-by: Rémi Lapeyre <remi.lapeyre@henki.fr>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants