-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] Use fast pairwise distance calculations for metric='sqeuclidean' #12601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Not sure if this makes it better or worse ;) But I think probably better. +0? |
Yeah, it's not too critical, but independently of the issues we may have with the NearestNeighbors(algorithm='brute', metric='euclidean') much faster than, NearestNeighbors(algorithm='brute', metric='sqeuclidean') is very counter-intuitive (while it should logically be the opposite). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I think this is less surprising.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
May we could document that metric="sqeuclidean"
is a shortcut for metric="euclidean", metric_params={"squared": True}
@@ -62,10 +62,13 @@ def test_pairwise_distances(): | |||
S = pairwise_distances(X, Y, metric="euclidean") | |||
S2 = euclidean_distances(X, Y) | |||
assert_array_almost_equal(S, S2) | |||
# Test sqeuclidean distance | |||
S = pairwise_distances(X, Y, metric="sqeuclidean") | |||
assert_allclose(S, S2**2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use assert_array_almost_equal
here too for consistency
Maybe we should have more explicit names for:
I am not sure what the default for |
I don't understand: this PR is just a fix on the documentation but does not change what is currently implemented in scikit-learn, right? |
Not exactly. This line switch from scipy to sklearn implementation. 'sqeuclidean': partial(euclidean_distances, squared=True) However it's does not change the current implementation of sklearn. |
This will change the result of the pairwise_distances. Doesn't it break backward compatibility ? (only very rarely but still). I guess to avoid that do as Olivier proposes and default to the accurate one. |
There is a PR to add this via a global config option #12136 (because a new metric name would not address all use cases cf #12136 (comment))
IMO, in the ideal case, it should be a fast euclidean (with correction for inaccurate points), same as in Julia (cf #9354 (comment)) preferably somewhere in scipy, that we backport. While this still doesn't exist, I agree that manually switching between accurate and fast implementation is a start.
That's a valid concern. I'm fine postponing this PR until we come up with a better solution for accuracy in Euclidean distances. |
sklearn/neighbors/unsupervised.py
Outdated
- from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', | ||
'manhattan'] | ||
- from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'sqeuclidean', | ||
'l1', 'l2', 'manhattan'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This must be ordered alphabetically, is not? Same idea for the above lists.
'l1', 'l2', 'manhattan'] | |
'l1', 'l2', 'sqeuclidean', 'manhattan'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have a _check_metric
utility that applies to both the sklearn.metrics and the sklearn.neighbors metrics with logic like:
def _check_metric(metric, **kw):
if metric == 'minkowski':
if kw.get('p') in (None, 2):
kw.pop('p', None)
metric = 'euclidean'
elif kw['p'] == 1:
metric = 'manhattan'
if metric == 'sqeuclidean'
metric = 'euclidean'
kw['squared'] = True
return metric, kw
should fast vs accurate then be a kwarg rather than a name?
That would be quite nice, yes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we good with this now?
@rth: are you interested in updating this PR? |
@jjerphan if you are interested in continuing please feel free to do so in a separate PR. Might be easier to start from scratch and rather than trying to fix merge conflicts. |
#20254 extends the scope of this PR. |
I think we can close this one since it has been superseded by @jjerphan's work |
Closes #12600
For
pairwise_distance(.., metric='sqeuclidean')
, this uses the fasteuclidean_distance(..., squared=True)
function to be consistent withmetric='euclidean'
, instead of the slower (but more accurate)pdist
from scipy (cf parent issue for benchmarks).metric='sqeuclidean'
is not used in the code base, so it's not really critical, but it may make some users applications faster.