Skip to content

MAINT fix the way to call stats.mode #23633

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Aug 5, 2022

Conversation

glemaitre
Copy link
Member

Related to #23626

stats.mode cannot be unpacked as before with the latest SciPy.

@glemaitre
Copy link
Member Author

We should as well check with upstream if the change is intended. It will break older scikit-learn versions with the newest SciPy version without a deprecation warning.

@Micky774
Copy link
Contributor

This was a rather surprising change. Similar problem w/ utils.tests.test_extmath::test_uniform_weights

@ogrisel
Copy link
Member

ogrisel commented Jun 16, 2022

For reference here is a snippet of the test failure caused by this breaking change in scipy.stats.mode:

    def _most_frequent(array, extra_value, n_repeat):
        """Compute the most frequent value in a 1d array extended with
        [extra_value] * n_repeat, where extra_value is assumed to be not part
        of the array."""
        # Compute the most frequent value in array only
        if array.size > 0:
            if array.dtype == object:
                # scipy.stats.mode is slow with object dtype array.
                # Python Counter is more efficient
                counter = Counter(array)
                most_frequent_count = counter.most_common(1)[0][1]
                # tie breaking similarly to scipy.stats.mode
                most_frequent_value = min(
                    value
                    for value, count in counter.items()
                    if count == most_frequent_count
                )
            else:
                mode = stats.mode(array)
>               most_frequent_value = mode[0][0]
E               IndexError: invalid index to scalar variable.

array      = array([1., 1., 1.])
extra_value = nan
mode       = ModeResult(mode=1.0, count=3)
n_repeat   = 0

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The remaining failures in pylatest_pip_scipy_dev are unrelated.

@ogrisel ogrisel added this to the 1.1.2 milestone Jun 16, 2022
@ogrisel
Copy link
Member

ogrisel commented Jun 16, 2022

In scipy/scipy#16418 it is argued that this change of behavior is considered a bugfix. So if scipy 1.9 gets released without a deprecation cycle, we will need to quickly release scikit-learn 1.1.2 with a backport of this fix and #23640.

@lesteve
Copy link
Member

lesteve commented Jun 16, 2022

I am not sure how this fix works, since depending on the scipy version, most_frequent_value and most_frequent_count will be a numpy scalar (scipy >= 1.9) or a numpy 1d array with a single value (scipy < 1.9).

Should we not do a sklearn.util.fixes.mode so that all the call of scipy.stats.mode gets fixed in a uniform manner?

@ogrisel
Copy link
Member

ogrisel commented Jun 16, 2022

Should we not do a sklearn.util.fixes.mode so that all the call of scipy.stats.mode gets fixed in a uniform manner?

This is a good suggestion. #23640 will need to be updated accordingly. Maybe let's do both in the same PR.

@glemaitre
Copy link
Member Author

Indeed, this is quite funny that the current CIs do not fail. I assume that they should kind of ravel the 1-element array.

@glemaitre
Copy link
Member Author

I need the change introduced by @Micky774 in #23640 as well to adopt the test.
One question is: do we want to modify weighted_mode to make is as well consistent with mode from SciPy? We would need to deprecate the behaviour so it could be part of another PR anyway.

@ogrisel
Copy link
Member

ogrisel commented Jun 16, 2022

One question is: do we want to modify weighted_mode to make is as well consistent with mode from SciPy? We would need to deprecate the behaviour so it could be part of another PR anyway.

I believe so. +1 for separate the PR to make backporting the basic compat fix to 1.1.2 easier if needed.

Comment on lines 74 to 88
Examples
--------
>>> import numpy as np
>>> a = np.array([[6, 8, 3, 0],
... [3, 2, 1, 7],
... [8, 1, 8, 4],
... [5, 3, 0, 5],
... [4, 7, 5, 9]])
>>> from sklearn.utils.fixes import mode
>>> mode(a)
ModeResult(mode=array([3, 1, 0, 0]), count=array([1, 1, 1, 1]))
To get mode of whole array, specify ``axis=None``:
>>> mode(a, axis=None)
ModeResult(mode=3, count=3)
"""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm no. The idea is that sklearn.utils.fixes.mode will always behave like scipy>=1.9. So we either import it or redefine it and reduce the dimension.

glemaitre and others added 2 commits June 20, 2022 16:35
Co-authored-by: Meekail Zain <34613774+Micky774@users.noreply.github.com>
@glemaitre
Copy link
Member Author

It should be ready for reviews.

@ogrisel
Copy link
Member

ogrisel commented Jun 21, 2022

Note that this PR will need to be adapted depending on the result of the discussion with upstream:

@glemaitre
Copy link
Member Author

I put it WIP until we know what happens upstream.

@glemaitre glemaitre marked this pull request as draft June 21, 2022 13:28
@thomasjpfan
Copy link
Member

SciPy moved forward with a solution here: scipy/scipy#16429

@glemaitre
Copy link
Member Author

Cool. I will remove the backport and just make the change in the code then.

@glemaitre glemaitre marked this pull request as ready for review July 26, 2022 14:12
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to adjust the code here?

mode, _ = stats.mode(_y[neigh_ind, k], axis=1)

mode = stats.mode(array)
most_frequent_value = mode[0][0]
most_frequent_count = mode[1][0]
most_frequent_value, most_frequent_count = stats.mode(array)
Copy link
Member

@thomasjpfan thomasjpfan Aug 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be specific, I think we need to work around SciPy's deprecation warning by placing this into utils.fixes and adjust the code that uses mode.

def _mode(a, axis=0):
    if sp_version >= parse_version("1.9.0"):
        return stats.mode(a, keepdims=False)
    
    # unpack for SciPy version < 1.9
    results = stats.mode(a)
    return results[0][0], results[1][0]

When our min supported SciPy version is 1.9 we can call stats.mode directly with keepdims=False.

We do not use mode in too many places:

sklearn/impute/_base.py:            mode = stats.mode(array)
sklearn/neighbors/_classification.py:                mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
sklearn/utils/tests/test_extmath.py:        mode, score = stats.mode(x, axis)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened a PR to your fork with this idea: glemaitre#12

@thomasjpfan thomasjpfan added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Aug 4, 2022
@thomasjpfan thomasjpfan mentioned this pull request Aug 4, 2022
12 tasks
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@glemaitre glemaitre merged commit 02a4b34 into scikit-learn:main Aug 5, 2022
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Aug 5, 2022
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Meekail Zain <34613774+Micky774@users.noreply.github.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
glemaitre added a commit that referenced this pull request Aug 5, 2022
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Meekail Zain <34613774+Micky774@users.noreply.github.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
PGijsbers added a commit to openml-labs/gama that referenced this pull request Aug 25, 2022
The scipy warning is caused by scikit-learn internal usage of scipy.
See scikit-learn/scikit-learn#23633
PGijsbers added a commit to openml-labs/gama that referenced this pull request Aug 25, 2022
* Fix code export for new scikit-learn

* Show warnings, but dont error, ignore scipy warning

The scipy warning is caused by scikit-learn internal usage of scipy.
See scikit-learn/scikit-learn#23633

* Explicitly add whiten to avoid deprecation warning

* Cast array to list to avoid ambiguous comparison

The previous statement was ambiguous as  the `not in` operation could also
interpreted to be used in element-wise fashion.

* Allow to ignore terminals in search space for Individual.from_string

This allows you to reconstruct an individual if additional hyperparameters
have been added to the search space.

* Add test for code export
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Sep 12, 2022
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Meekail Zain <34613774+Micky774@users.noreply.github.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:impute No Changelog Needed To backport PR merged in master that need a backport to a release branch defined based on the milestone.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants