Skip to content

FIX wminkowski removed in 1.8.0.dev0 #21741

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

ogrisel
Copy link
Member

@ogrisel ogrisel commented Nov 22, 2021

This should fix the scipy-dev nightly build failure:

https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=35203&view=logs&j=dfe99b15-50db-5d7b-b1e9-4105c42527cf&t=ef785ae2-496b-5b02-9f0e-07a6c3ab3081

________________________ test_cdist[X10-X20-wminkowski] ________________________
[gw0] linux -- Python 3.10.0 /usr/share/miniconda/envs/testvenv/bin/python

[...]
        elif isinstance(metric, str):
            mstr = metric.lower()
            metric_info = _METRIC_ALIAS.get(mstr, None)
            if metric_info is not None:
                cdist_fn = metric_info.cdist_func
                return cdist_fn(XA, XB, out=out, **kwargs)
            elif mstr.startswith("test_"):
                metric_info = _TEST_METRICS.get(mstr, None)
                if metric_info is None:
                    raise ValueError(f'Unknown "Test" Distance Metric: {mstr[5:]}')
                XA, XB, typ, kwargs = _validate_cdist_input(
                    XA, XB, mA, mB, n, metric_info, **kwargs)
                return _cdist_callable(
                    XA, XB, metric=metric_info.dist_func, out=out, **kwargs)
            else:
>               raise ValueError('Unknown Distance Metric: %s' % mstr)
E               ValueError: Unknown Distance Metric: wminkowski

I tested locally and it seems to work as expected.

EDIT: to be consistent with scipy 1.8 overall, it was also necessary to make it possible to pass a w parameter when metric="minkowski" in which case the weights are to be raised to the power p to be consistent with scipy 1.8.

@ogrisel ogrisel added this to the 1.0.2 milestone Nov 22, 2021
@ogrisel ogrisel added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Nov 22, 2021
@ogrisel ogrisel removed this from the 1.0.2 milestone Nov 22, 2021
@ogrisel ogrisel removed the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Nov 22, 2021
@ogrisel
Copy link
Member Author

ogrisel commented Nov 22, 2021

There is no need for backport because it will only impact the nightly build before the release of 1.8.0 final, and only the scikit-learn main branch.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Does it mean that parse_version does not parse dev and rc numbers?

@ogrisel
Copy link
Member Author

ogrisel commented Nov 22, 2021

LGTM. Does it mean that parse_version does not parse dev and rc numbers?

parse_version works correctly, it's just than:

parse_version("1.7.2") < parse_version("1.8.0.dev0") < parse_version("1.8.0.dev0.whatever") < parse_version("1.8.0")

as expected. It's just our code that was wrong since the wminkowski has been removed in the dev version, after any 1.7.x but before 1.8.0 and it crashes the nightly build that runs against 1.8.0.dev0.something.

@glemaitre
Copy link
Member

as expected. It's just our code that was wrong since the wminkowski has been removed in the dev version, after any 1.7.x but before 1.8.0 and it crashes the nightly build that runs against 1.8.0.dev0.something.

Right. Is parse_version doing some loose checking parse_version("1.8")?

@adrinjalali
Copy link
Member

It'd be nice to do a scipy dev commit before merging

@ogrisel
Copy link
Member Author

ogrisel commented Nov 22, 2021

It'd be nice to do a scipy dev commit before merging

Forgot we could do this ;)

@ogrisel
Copy link
Member Author

ogrisel commented Nov 22, 2021

Hum I realize that there are plenty of other failures in the original log caused by a new DeprecationWarning in np.percentile:

E DeprecationWarning the `interpolation=` argument to percentile was renamed to `method=`, which has additional options.
E       Users of the modes 'nearest', 'lower', 'higher', or 'midpoint' are encouraged to review the method they. (Deprecated NumPy 1.22)
`np.percentile`, 

@ogrisel
Copy link
Member Author

ogrisel commented Nov 22, 2021

There is another failure in test_neighbors_metrics related to wminkowski's removal. I will fix it as part of this PR. The other failures should best be tackled in a dedicated PR.

@ogrisel
Copy link
Member Author

ogrisel commented Nov 23, 2021

I pushed a new commit to make the code work consistently with scipy 1.8 while still being backward compatible with the scipy 1.7 and earlier parametrization. I think we need to backport this 1.0.2 to make this release compatible with 1.8 that might be released approximately at the same time and very probably before we release 1.1.

@ogrisel ogrisel added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Nov 23, 2021
@ogrisel ogrisel added this to the 1.0.2 milestone Nov 23, 2021
@ogrisel ogrisel changed the title MAINT wminkowski removed in 1.8.0.dev0 FIX wminkowski removed in 1.8.0.dev0 Nov 23, 2021
@ogrisel ogrisel requested a review from adrinjalali November 23, 2021 11:02
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM.

This error is making it hard to see if there's any relevant error in the logs though:

E       DeprecationWarning: the `interpolation=` argument to percentile was renamed to `method=`, which has additional options.
E       Users of the modes 'nearest', 'lower', 'higher', or 'midpoint' are encouraged to review the method they. (Deprecated NumPy 1.22)

# WMinkowskiDistance in sklearn implements the weighting
# scheme of the old 'wminkowski' in scipy < 1.8 and the
# following adaptation:
return WMinkowskiDistance(p, w ** (1/p), **kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue then is that the user can create a WMinkowskiDistance(p, q, **kwargs) themself, which would be different than get_metric("wminkowski", p, w), and kinda similar to get_metric("minkowski", p, w), but with a transformation on w. I think this can cause quite a bit of confusion, and I'm not sure how to solve it.

If we remove "wminkowski", then it'd be more consistent I guess

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can rewrite MinkowskiDistance and WMinkowskiDistance to implement the new scheme in MinkowskiDistance directly and make WMinkowskiDistance inherit from MinkowskiDistance with the transformation of the weight in its constructor and leave DistanceMetric.get_metric to be more straightforward.

And then later we can deprecate "wminkowski" (but probably after a year or two, once scipy 1.8 is commonly installed).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will be easier to do this change once we no longer support versions of scipy (pre-1.6) where wminkowski is not deprecated.

I think this PR is good enough for now. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR is good enough for now. WDYT?

I agree with you and believe the steps you gave in the first comment are proper ones to take in another PR.

Copy link
Member

@adrinjalali adrinjalali Nov 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking at the metric across the board, and here's what I see:

_dist_metrics.pyx:558-594

# W-Minkowski Distance
#  d = sum(w_i^p * (x_i^p - y_i^p)) ^ (1/p)
cdef class WMinkowskiDistance(DistanceMetric):
    r"""Weighted Minkowski Distance

    .. math::
       D(x, y) = [\sum_i |w_i * (x_i - y_i)|^p] ^ (1/p)

the comment and the docstring are inconsistent, but the docstring seems to be consistent with the implementation:

        for j in range(size):
            d += pow(self.vec[j] * fabs(x1[j] - x2[j]), self.p)

Now this PR is making this change (on the minkowski part):

    "minkowski"     MinkowskiDistance     p, w      ``sum(w * |x - y|^p)^(1/p)``
    "wminkowski"    WMinkowskiDistance    p, w      ``sum(|w * (x - y)|^p)^(1/p)``

and the two are just different and not consistent with one another.

Since scipy has deprecated wminkowski already for a while, and now it's removed, I'd say it makes sense for this PR to deprecate WMinkowskiDistance, and make everything consistent with scipy? Especially since from what I see, it seems to be the case that we also have been a little bit flaky on how we incorporate w there, and now is a good time to just stick to one model.

Copy link
Member

@jjerphan jjerphan Nov 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, if one of the component of w is negative, WMinkowskiDistance's call to pow might crash.

Edit: actually, this gives nan on main:

from sklearn.neighbors import DistanceMetric
import numpy as np

rng = np.random.RandomState(1)
X = rng.rand(10, 100)
w = rng.rand(X.shape[1],)
w[0] = -1337

dist = DistanceMetric.get_metric("wminkowski", p=3, w=w)

# This produces nan
dm = dist.pairwise(X)
print(dm)
[[ 0. nan nan nan nan nan nan nan nan nan]
 [nan  0. nan nan nan nan nan nan nan nan]
 [nan nan  0. nan nan nan nan nan nan nan]
 [nan nan nan  0. nan nan nan nan nan nan]
 [nan nan nan nan  0. nan nan nan nan nan]
 [nan nan nan nan nan  0. nan nan nan nan]
 [nan nan nan nan nan nan  0. nan nan nan]
 [nan nan nan nan nan nan nan  0. nan nan]
 [nan nan nan nan nan nan nan nan  0. nan]
 [nan nan nan nan nan nan nan nan nan  0.]]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since scipy has deprecated wminkowski already for a while, and now it's removed, I'd say it makes sense for this PR to deprecate WMinkowskiDistance, and make everything consistent with scipy? Especially since from what I see, it seems to be the case that we also have been a little bit flaky on how we incorporate w there, and now is a good time to just stick to one model.

I intended this PR to be minimal so as to be backportable to 1.0.2.

I would rather not deprecate a scikit-learn component in a scikit-learn minor release, even if that component maps more or less directly to a deprecated components in a dependency.

I can do the proposed change but I think it should target 1.1 and then 1.0.2 will have 2 broken tests under scipy 1.8 but maybe we don't care. Or maybe we care because it can make the release process automation fail...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, if one of the component of w is negative, WMinkowskiDistance's call to pow might crash.

I don't think we care about negative weights. We could raise a better error message but this out of the scope of this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comment and the docstring are inconsistent, but the docstring seems to be consistent with the implementation

The inline comment is just plain wrong. Let me delete it, it's just useless.

Now this PR is making this change (on the minkowski part):
"minkowski" MinkowskiDistance p, w sum(w * |x - y|^p)^(1/p)
"wminkowski" WMinkowskiDistance p, w sum(|w * (x - y)|^p)^(1/p)
and the two are just different and not consistent with one another.

This is correct. DistanceMetric.get_metric("minkowski", p, w) is not the same as DistanceMetric.get_metric("minkowski", p, w) as you need to take the p-root or p-power of one w to get the equivalent w of the other. This is all consistent with what scipy does.

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you, @ogrisel.

Some minor suggestions.

# WMinkowskiDistance in sklearn implements the weighting
# scheme of the old 'wminkowski' in scipy < 1.8 and the
# following adaptation:
return WMinkowskiDistance(p, w ** (1/p), **kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR is good enough for now. WDYT?

I agree with you and believe the steps you gave in the first comment are proper ones to take in another PR.

Comment on lines +93 to +104
def test_cdist(metric_param_grid, X1, X2):
metric, param_grid = metric_param_grid
keys = param_grid.keys()
for vals in itertools.product(*param_grid.values()):
kwargs = dict(zip(keys, vals))
if metric == "mahalanobis":
# See: https://github.com/scipy/scipy/issues/13861
pytest.xfail("scipy#13861: cdist with 'mahalanobis' fails onmemmap data")
elif metric == "wminkowski":
if sp_version >= parse_version("1.8.0"):
pytest.skip("wminkowski will be removed in SciPy 1.8.0")

# wminkoski is deprecated in SciPy 1.6.0 and removed in 1.8.0
ExceptionToAssert = None
if sp_version >= parse_version("1.6.0"):
ExceptionToAssert = DeprecationWarning
with pytest.warns(ExceptionToAssert):
D_true = cdist(X1, X2, metric, **kwargs)
else:
D_true = cdist(X1, X2, metric, **kwargs)

check_cdist(metric, kwargs, D_true)
# Possibly caused by: https://github.com/joblib/joblib/issues/563
pytest.xfail(
"scipy#13861: cdist with 'mahalanobis' fails on joblib memmap data"
)
check_cdist(metric, kwargs, X1, X2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is way clearer now: thanks 👍

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is suitable for the minor release, but it'd be nice if you could do a follow up PR for 1.1 :)

@ogrisel
Copy link
Member Author

ogrisel commented Nov 23, 2021

I agree that this is suitable for the minor release, but it'd be nice if you could do a follow up PR for 1.1 :)

I opened #21765 ;)

@jjerphan jjerphan merged commit f924bc8 into scikit-learn:main Nov 23, 2021
@ogrisel ogrisel deleted the main-skip-wminkowski-in-test_cdist branch November 24, 2021 13:20
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Nov 29, 2021
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 24, 2021
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
glemaitre pushed a commit that referenced this pull request Dec 25, 2021
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Jan 18, 2022
jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Jan 18, 2022
jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Jan 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Build / CI cython module:metrics To backport PR merged in master that need a backport to a release branch defined based on the milestone.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants