[MRG] Use fast pairwise distance calculations for metric='sqeuclidean' #12601

rth · 2018-11-15T22:16:14Z

For pairwise_distances(..., metric='sqeuclidean'), this uses the fast euclidean_distances(..., squared=True) function, to be consistent with metric='euclidean', instead of the slower (but more accurate) pdist from scipy (cf. the parent issue for benchmarks).

metric='sqeuclidean' is not used in the code base, so this is not really critical, but it may make some users' applications faster.
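
For context, here is a minimal sketch of the two approaches being compared, assuming NumPy is available. The function names `sqeuclidean_expanded` and `sqeuclidean_direct` are illustrative only; this is neither the scikit-learn nor the scipy source. The first relies on the expansion ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, which reduces most of the work to one matrix product; the second sums squared coordinate differences directly.

```python
import numpy as np


def sqeuclidean_expanded(X, Y):
    """Squared distances via ||x||^2 - 2 x.y + ||y||^2 (one matrix product)."""
    XX = np.einsum("ij,ij->i", X, X)[:, np.newaxis]  # squared row norms, shape (n, 1)
    YY = np.einsum("ij,ij->i", Y, Y)[np.newaxis, :]  # squared row norms, shape (1, m)
    D = XX - 2.0 * (X @ Y.T) + YY
    # Rounding can leave tiny negative values; clip them to zero.
    np.maximum(D, 0, out=D)
    return D


def sqeuclidean_direct(X, Y):
    """Squared distances by explicitly summing squared coordinate differences."""
    diff = X[:, np.newaxis, :] - Y[np.newaxis, :, :]  # shape (n, m, d)
    return np.sum(diff ** 2, axis=-1)


rng = np.random.RandomState(0)
X = rng.random_sample((5, 4))
Y = rng.random_sample((3, 4))

D_fast = sqeuclidean_expanded(X, Y)
D_direct = sqeuclidean_direct(X, Y)
print(np.allclose(D_fast, D_direct))
```

On well-scaled data like this the two agree to within rounding error; the trade-off discussed in this thread only shows up for points that are close to each other relative to their norms.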

amueller · 2018-11-15T22:18:17Z

Not sure if this makes it better or worse ;) But I think probably better. +0?

rth · 2018-11-15T22:30:06Z

Yeah, it's not too critical, but independently of the issues we may have with the euclidean metric, I feel that squared euclidean should use the same implementation. I certainly assumed it did so before this PR. Having,

NearestNeighbors(algorithm='brute', metric='euclidean')

much faster than,

NearestNeighbors(algorithm='brute', metric='sqeuclidean')

is very counter-intuitive (while it should logically be the opposite).

amueller

You're right, I think this is less surprising.

jeremiedbb

lgtm

Maybe we could document that metric="sqeuclidean" is a shortcut for metric="euclidean", metric_params={"squared": True}.

jeremiedbb · 2018-11-15T22:55:32Z

sklearn/metrics/tests/test_pairwise.py

@@ -62,10 +62,13 @@ def test_pairwise_distances():
    S = pairwise_distances(X, Y, metric="euclidean")
    S2 = euclidean_distances(X, Y)
    assert_array_almost_equal(S, S2)
+    # Test sqeuclidean distance
+    S = pairwise_distances(X, Y, metric="sqeuclidean")
+    assert_allclose(S, S2**2)


Maybe use assert_array_almost_equal here too, for consistency.

ogrisel · 2018-11-16T08:47:28Z

Maybe we should have more explicit names for:

"euclidean_fast": use sklearn euclidean_distances with squared=False
"euclidean_accurate": use a numerically stable implementation, e.g. scipy.pdist / scipy.cdist
"sqeuclidean_fast": use sklearn euclidean_distances with squared=True
"sqeuclidean_accurate": use a numerically stable implementation, e.g. scipy.pdist / scipy.cdist

I am not sure what the default for "euclidean" and "sqeuclidean" should be. As far as I understand we now have: euclidean == euclidean_fast and sqeuclidean == sqeuclidean_accurate.

ogrisel · 2018-11-16T08:52:49Z

I don't understand: this PR is just a fix on the documentation but does not change what is currently implemented in scikit-learn, right?

jeremiedbb · 2018-11-16T08:56:08Z

Not exactly. This line switches from the scipy implementation to the sklearn one:

'sqeuclidean': partial(euclidean_distances, squared=True)

However, it does not change the current sklearn implementation.

jeremiedbb · 2018-11-16T09:03:16Z

This will change the result of pairwise_distances. Doesn't it break backward compatibility? (Only very rarely, but still.)
Besides, whenever the results differ, the new result will always be the wrong one.

I guess that, to avoid that, we could do as Olivier proposes and default to the accurate one.
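
To make the accuracy concern concrete, here is a small pure-Python sketch (double precision, with hypothetical helper names). It is not the scikit-learn code path; it only shows how the expanded formula can lose the answer to cancellation when two points are close to each other but far from the origin. The magnitudes are chosen to make the effect visible in float64.

```python
def sq_dist_direct(x, y):
    """Sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def sq_dist_expanded(x, y):
    """Same quantity through the expansion ||x||^2 - 2 x.y + ||y||^2."""
    xx = sum(a * a for a in x)
    yy = sum(b * b for b in y)
    xy = sum(a * b for a, b in zip(x, y))
    return xx - 2.0 * xy + yy


# Two nearby points far from the origin: the true squared distance is 1.
x = [1e8 + 1.0]
y = [1e8]

direct = sq_dist_direct(x, y)
expanded = sq_dist_expanded(x, y)
print(direct, expanded)
```

The direct form returns exactly 1.0 here, while the expanded form does not, because the squared norms are rounded before they are subtracted.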

rth · 2018-11-16T09:59:23Z

> Maybe we should have more explicit names for:
> "euclidean_accurate": use a numerically stable implementation, e.g

There is a PR to add this via a global config option, #12136 (because a new metric name would not address all use cases, cf. #12136 (comment)).

> I am not sure what the default for "euclidean" and "sqeuclidean" should be.

IMO, in the ideal case, it should be a fast Euclidean implementation (with a correction for inaccurate points), the same as in Julia (cf. #9354 (comment)), preferably somewhere in scipy, that we backport. While this still doesn't exist, I agree that manually switching between the accurate and fast implementations is a start.

> This will change the result of the pairwise_distances. Doesn't it break backward compatibility?

That's a valid concern. I'm fine with postponing this PR until we come up with a better solution for accuracy in Euclidean distances.

eamanu · 2018-11-16T10:05:31Z

sklearn/neighbors/unsupervised.py

-        - from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2',
-          'manhattan']
+        - from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'sqeuclidean',
+          'l1', 'l2', 'manhattan']


This should be ordered alphabetically, shouldn't it? The same applies to the lists above.

Suggested change

-          'l1', 'l2', 'manhattan']
+          'l1', 'l2', 'sqeuclidean', 'manhattan']

jnothman

Should we have a _check_metric utility that applies to both the sklearn.metrics and the sklearn.neighbors metrics with logic like:

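def _check_metric(metric, **kw):
    # Map equivalent metric spellings onto one canonical (metric, kwargs) pair.
    if metric == 'minkowski':
        if kw.get('p') in (None, 2):
            # Minkowski with p=2 (or no p given) is the Euclidean distance.
            kw.pop('p', None)
            metric = 'euclidean'
        elif kw['p'] == 1:
            # Minkowski with p=1 is the Manhattan distance.
            metric = 'manhattan'
    if metric == 'sqeuclidean':
        # Squared Euclidean is Euclidean with squared=True.
        metric = 'euclidean'
        kw['squared'] = True
    return metric, kw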

Should fast vs. accurate then be a kwarg rather than a name?
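
One way to picture the "kwarg rather than name" idea is the sketch below. It is purely hypothetical: the `*_fast` / `*_accurate` names are the ones proposed earlier in this thread, while `resolve_metric` and the `accurate` keyword are assumptions made for illustration. None of them exist in scikit-learn.

```python
# Hypothetical sketch: the "*_fast" / "*_accurate" names come from the
# proposal in this thread, and the `accurate` keyword is an assumption made
# for illustration. Neither exists in scikit-learn.
_EXPLICIT_NAMES = {
    "euclidean_fast": {"squared": False, "accurate": False},
    "euclidean_accurate": {"squared": False, "accurate": True},
    "sqeuclidean_fast": {"squared": True, "accurate": False},
    "sqeuclidean_accurate": {"squared": True, "accurate": True},
}


def resolve_metric(metric, **kw):
    """Turn an explicit metric name into ('euclidean', kwargs).

    Unknown names are returned unchanged, together with the caller's kwargs.
    """
    if metric in _EXPLICIT_NAMES:
        # Caller-supplied keyword arguments override the defaults of the name.
        return "euclidean", {**_EXPLICIT_NAMES[metric], **kw}
    return metric, kw


print(resolve_metric("sqeuclidean_accurate"))
print(resolve_metric("manhattan"))
```

With this shape, the explicit names become thin aliases and the speed/accuracy choice travels as an ordinary keyword argument.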

rth · 2018-11-19T10:28:24Z

> Should we have a _check_metric utility that applies to both the sklearn.metrics and the sklearn.neighbors metrics with logic like:

That would be quite nice, yes.

jnothman

Are we good with this now?

jjerphan · 2021-04-07T12:46:27Z

@rth: are you interested in updating this PR?

rth · 2021-04-07T12:50:51Z

@jjerphan if you are interested in continuing, please feel free to do so in a separate PR. It might be easier to start from scratch rather than trying to fix merge conflicts.

jjerphan · 2021-08-10T11:50:23Z

#20254 extends the scope of this PR.

lesteve · 2022-02-24T05:05:38Z

I think we can close this one since it has been superseded by @jjerphan's work.

Use fast pairwise_distances for metric='sqeuclidean'

afc93b4

rth changed the title ~~[MRG] Use fast pariwise distance calculations for metric='sqeuclidean'~~ [MRG] Use fast pairwise distance calculations for metric='sqeuclidean' Nov 15, 2018

amueller approved these changes Nov 15, 2018

jeremiedbb approved these changes Nov 15, 2018

eamanu reviewed Nov 16, 2018

Lint

cbdcde7

jnothman reviewed Nov 18, 2018

rth mentioned this pull request Nov 19, 2018

Estimators with metric='euclidean' default but supporting metric_params and p should instead have metric='minkowski' #12437

Open

rth mentioned this pull request Nov 20, 2018

Numerical precision of euclidean_distances with float32 #9354

Closed

amueller added Performance Needs Decision Requires decision labels Aug 6, 2019

jnothman reviewed Dec 16, 2019

github-actions bot added module:cluster module:metrics module:neighbors labels Mar 2, 2020

Base automatically changed from master to main January 22, 2021 10:50

lesteve closed this Feb 24, 2022
