Skip to content

[MRG] Use fast pairwise distance calculations for metric='sqeuclidean' #12601

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

rth
Copy link
Member

@rth rth commented Nov 15, 2018

Closes #12600

For pairwise_distance(.., metric='sqeuclidean'), this uses the fast euclidean_distance(..., squared=True) function to be consistent with metric='euclidean', instead of the slower (but more accurate) pdist from scipy (cf parent issue for benchmarks).

metric='sqeuclidean' is not used in the code base, so it's not really critical, but it may make some users applications faster.

@amueller
Copy link
Member

Not sure if this makes it better or worse ;) But I think probably better. +0?

@rth rth changed the title [MRG] Use fast pariwise distance calculations for metric='sqeuclidean' [MRG] Use fast pairwise distance calculations for metric='sqeuclidean' Nov 15, 2018
@rth
Copy link
Member Author

rth commented Nov 15, 2018

Yeah, it's not too critical, but independently of the issues we may have with the euclidean metric, I feel that squared euclidean should use the same implementation. I certainly assumed it did so before this PR. Having,

NearestNeighbors(algorithm='brute', metric='euclidean')

much faster than,

NearestNeighbors(algorithm='brute', metric='sqeuclidean')

is very counter-intuitive (while it should logically be the opposite).

Copy link
Member

@amueller amueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I think this is less surprising.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

May we could document that metric="sqeuclidean" is a shortcut for metric="euclidean", metric_params={"squared": True}

@@ -62,10 +62,13 @@ def test_pairwise_distances():
S = pairwise_distances(X, Y, metric="euclidean")
S2 = euclidean_distances(X, Y)
assert_array_almost_equal(S, S2)
# Test sqeuclidean distance
S = pairwise_distances(X, Y, metric="sqeuclidean")
assert_allclose(S, S2**2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use assert_array_almost_equal here too for consistency

@ogrisel
Copy link
Member

ogrisel commented Nov 16, 2018

Maybe we should have more explicit names for:

  • "euclidean_fast": use sklearn euclidean_distances with squared=False
  • "euclidean_accurate": use a numerically stable implementation, e.g. scipy.pdist / scipy.cdist
  • "sqeuclidean_fast": use sklearn euclidean_distances with squared=True
  • "sqeuclidean_accurate": use a numerically stable implementation, e.g. scipy.pdist / scipy.cdist

I am not sure what the default for "euclidean" and "sqeuclidean" should be. As far as I understand we now have: euclidean == euclidean_fast and sqeuclidean == sqeuclidean_accurate.

@ogrisel
Copy link
Member

ogrisel commented Nov 16, 2018

I don't understand: this PR is just a fix on the documentation but does not change what is currently implemented in scikit-learn, right?

@jeremiedbb
Copy link
Member

Not exactly. This line switch from scipy to sklearn implementation.

'sqeuclidean': partial(euclidean_distances, squared=True)

However it's does not change the current implementation of sklearn.

@jeremiedbb
Copy link
Member

This will change the result of the pairwise_distances. Doesn't it break backward compatibility ? (only very rarely but still).
Besides, when the results will be different, it will always be a wrong result.

I guess to avoid that do as Olivier proposes and default to the accurate one.

@rth
Copy link
Member Author

rth commented Nov 16, 2018

Maybe we should have more explicit names for:
"euclidean_accurate": use a numerically stable implementation, e.g

There is a PR to add this via a global config option #12136 (because a new metric name would not address all use cases cf #12136 (comment))

I am not sure what the default for "euclidean" and "sqeuclidean" should be.

IMO, in the ideal case, it should be a fast euclidean (with correction for inaccurate points), same as in Julia (cf #9354 (comment)) preferably somewhere in scipy, that we backport. While this still doesn't exist, I agree that manually switching between accurate and fast implementation is a start.

This will change the result of the pairwise_distances. Doesn't it break backward compatibility?

That's a valid concern. I'm fine postponing this PR until we come up with a better solution for accuracy in Euclidean distances.

- from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2',
'manhattan']
- from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'sqeuclidean',
'l1', 'l2', 'manhattan']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must be ordered alphabetically, is not? Same idea for the above lists.

Suggested change
'l1', 'l2', 'manhattan']
'l1', 'l2', 'sqeuclidean', 'manhattan']

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a _check_metric utility that applies to both the sklearn.metrics and the sklearn.neighbors metrics with logic like:

def _check_metric(metric, **kw):
    if metric == 'minkowski':
        if kw.get('p') in (None, 2):
            kw.pop('p', None)
            metric = 'euclidean'
        elif kw['p'] == 1:
            metric = 'manhattan'
    if metric == 'sqeuclidean'
        metric = 'euclidean'
        kw['squared'] = True
    return metric, kw

should fast vs accurate then be a kwarg rather than a name?

@rth
Copy link
Member Author

rth commented Nov 19, 2018

Should we have a _check_metric utility that applies to both the sklearn.metrics and the sklearn.neighbors metrics with logic like:

That would be quite nice, yes.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we good with this now?

Base automatically changed from master to main January 22, 2021 10:50
@jjerphan
Copy link
Member

jjerphan commented Apr 7, 2021

@rth: are you interested in updating this PR?

@rth
Copy link
Member Author

rth commented Apr 7, 2021

@jjerphan if you are interested in continuing please feel free to do so in a separate PR. Might be easier to start from scratch and rather than trying to fix merge conflicts.

@jjerphan
Copy link
Member

#20254 extends the scope of this PR.

@lesteve
Copy link
Member

lesteve commented Feb 24, 2022

I think we can close this one since it has been superseded by @jjerphan's work

@lesteve lesteve closed this Feb 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sqeuclidean metric is much slower than euclidean in pairwise_distances
8 participants