[MRG] Allow exact Euclidean distance calculations #12136


Closed
rth wants to merge 7 commits from the euclidean-dist-accuracy-v2 branch

Conversation

@rth (Member) commented Sep 23, 2018

In an attempt to somewhat mitigate #9354, this PR optionally allows computing the Euclidean distances with the exact formulation used by scipy.spatial.distance.cdist, instead of the quadratic expansion formula. This is much slower but more numerically accurate.

This adds:

  • an algorithm parameter to metrics.euclidean_distances
  • a euclidean_distance_algorithm global config option to allow overriding this setting globally (i.e. to avoid having to pass it through everywhere):
    with sklearn.config_context(euclidean_distance_algorithm='exact'):
        knn = KNeighborsClassifier(algorithm='brute', metric='euclidean')
        # or
        cl = AgglomerativeClustering(linkage='ward')  # uses pairwise_distances internally

At the very least, this would make it possible to check whether the precision issue affects the results for real datasets (e.g. with some clustering algorithms in 32 bit), which is currently hard to tell.
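
For reference, a minimal sketch of what the two formulations compute (the names follow this PR's proposed algorithm values):

import numpy as np

def euclidean_exact(x, y):
    # 'exact' formulation, as in scipy's cdist: sqrt(sum_k (x_k - y_k)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def euclidean_expanded(x, y):
    # 'quadratic-expansion': sqrt(||x||^2 + ||y||^2 - 2 x.y); for the
    # pairwise case this reduces to a single fast matrix product, but the
    # subtraction of large, nearly equal terms can cancel catastrophically
    return np.sqrt(np.dot(x, x) + np.dot(y, y) - 2 * np.dot(x, y))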

If we can get this into 0.20, it would be great. If not, I'll change the versionadded...

@rth force-pushed the euclidean-dist-accuracy-v2 branch from 1ff3482 to 7855314 on September 23, 2018 14:01
@rth (Member, Author) commented Sep 23, 2018

Note that this will be useful even assuming #11271 is merged for 32 bit, because even in 64 bit there are precision issues as illustrated in #9354 (comment).
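
To make the 64 bit failure mode concrete, a minimal example (numbers chosen to trigger cancellation; the exact deviation is platform dependent):

import numpy as np

x = np.array([10000.0, 10000.0])
y = np.array([10000.0, 10000.001])

# exact formulation: correct to full precision
print(np.sqrt(np.sum((x - y) ** 2)))                            # 0.001
# quadratic expansion: recovers ~1e-6 by subtracting terms of order 4e8,
# so most of the float64 mantissa is lost to cancellation
print(np.sqrt(np.dot(x, x) + np.dot(y, y) - 2 * np.dot(x, y)))  # off in the leading digits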

@rth (Member, Author) commented Sep 24, 2018

Given that the precision of 32 bit euclidean distances is a blocker for 0.21, this PR could be an easy workaround for 0.20 in the meantime. Otherwise, in the current situation there is simply no way to mitigate the problem, and it can have practical impact, as illustrated e.g. in #9354 (comment).

cc @jnothman @TomDLT @jeremiedbb

@qinhanmin2014 (Member) commented

I'm not sure this is the right solution, but:
(1) We allow Y to be None.
(2) Why add the parameter to _global_config?
(3) The test is not strong enough (it also passes on master). Also, we're calculating the distance with cdist, so it seems strange to check the result using the same function.

@rth (Member, Author) commented Sep 24, 2018

Thanks for the feedback!

(1) We allow Y to be None.

Yes, good catch. Will fix it.

(2) Why add the parameter to _global_config?

Consider for instance Birch, an algorithm that uses euclidean_distances in _split_node. Without a global option, we would need to add a new parameter to Birch (where it is a bit out of place), then pass it through everywhere inside the estimator, which is also not very helpful. I agree that for estimators that have a metric_params parameter (e.g. DBSCAN) it's not necessary, but euclidean_distances is just so widespread in the code base (it is also used, with default parameters, in all the pairwise_distances{,_chunked,_argmax,_argmin_min, etc.} functions) that this seemed like a less tedious way of doing it.

Also, I like the idea that a computational backend can be changed globally in a context manager (just as you can change e.g. the joblib thread/process parallelism).
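
For illustration, the lookup inside euclidean_distances would presumably follow the pattern of the existing config options; a sketch, where euclidean_distance_algorithm is the key this PR proposes:

from sklearn import get_config

def _resolve_algorithm(algorithm=None):
    # an explicit keyword argument wins; otherwise fall back to the
    # global configuration set via sklearn.config_context / set_config
    if algorithm is None:
        algorithm = get_config().get('euclidean_distance_algorithm',
                                     'quadratic-expansion')
    return algorithm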

The test is not strong enough (also passes on master).

It did indeed for 64 bit, but not for 32 bit. The following shouldn't pass:

import numpy as np
from scipy.spatial.distance import cdist
from numpy.testing import assert_allclose

from sklearn.metrics.pairwise import euclidean_distances

dtype = 'float32'

# The offset needed to expose the loss of precision depends on the dtype:
# float64 only breaks down once the values are far from the origin, while
# float32 already loses precision at unit scale
if dtype == 'float64':
    offset = 10000
elif dtype == 'float32':
    offset = 0

XA = np.random.RandomState(42).rand(100, 10).astype(dtype) + offset
XB = np.random.RandomState(41).rand(200, 10).astype(dtype) + offset

dist_exact = cdist(XA, XB, 'euclidean')

assert_allclose(dist_exact, euclidean_distances(XA, XB))

I will adapt the tests accordingly.

Also, we're calculating the distance with cdist, so it seems strange to check the result using the same function.

I just want to check that euclidean_distances(X, Y, algorithm='exact') produces the same results as cdist: a direct comparison between the two seems the most straightforward way to achieve it.
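
Roughly (runnable only on this branch, since algorithm is the parameter this PR adds):

import numpy as np
from numpy.testing import assert_allclose
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import euclidean_distances

XA = np.random.RandomState(42).rand(100, 10)
XB = np.random.RandomState(41).rand(200, 10)

# both sides now use the direct formulation, so they should agree
# to within floating point round-off
assert_allclose(euclidean_distances(XA, XB, algorithm='exact'),
                cdist(XA, XB, 'euclidean'))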

@jnothman (Member) left a comment

I agree that a global config is not the worst way to handle this, as long as it is somehow visible, or known through warnings. But some kind of auto method should be available if possible?

@rth (Member, Author) commented Sep 26, 2018

But some kind of auto method should be available if possible?

#12142 may end up with some kind of "auto" method, but that would require more work, and that's why I thought it would be better to make this a separate PR.
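
For concreteness, one hypothetical shape such an "auto" dispatch could take (a sketch only; nothing like this is implemented in this PR or in #12142):

import numpy as np

def choose_algorithm(X, Y):
    # hypothetical policy: use the slower exact path where the quadratic
    # expansion is known to be fragile, and the fast path otherwise
    if X.dtype == np.float32 or Y.dtype == np.float32:
        return 'exact'
    return 'quadratic-expansion'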

Also changed versionadded to 0.21.

@qinhanmin2014 (Member) commented

I somehow begin to accept a global config: with it, users can switch to an alternative algorithm without modifying their code. Though it still seems strange to solve the problem this way.
@rth We need to consider Y_norm_squared and X_norm_squared. When they are not None, maybe we should switch back to "quadratic-expansion" (or raise an error when users request "exact"; see the sketch below).
Maybe we can hurry this into 0.20.1, since it's a temporary fix for the issue.
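
A minimal sketch of such a guard (hypothetical; algorithm='exact' is this PR's proposed value):

def _check_norms_compatible(algorithm, X_norm_squared=None,
                            Y_norm_squared=None):
    # precomputed norms only make sense for the expansion-based path,
    # so reject them when the exact formulation is requested
    if algorithm == 'exact' and (X_norm_squared is not None
                                 or Y_norm_squared is not None):
        raise ValueError("X_norm_squared and Y_norm_squared cannot be "
                         "used with algorithm='exact'")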

@jnothman (Member) commented Oct 14, 2018 via email

@amueller (Member) commented

Has anyone looked at the scipy implementation?
It's:

static NPY_INLINE int
cdist_seuclidean(const double *XA, const double *XB, const double *var,
                 double *dm, const npy_intp num_rowsA, const npy_intp num_rowsB,
                 const npy_intp num_cols)
{
    npy_intp i, j;

    for (i = 0; i < num_rowsA; ++i) {
        const double *u = XA + (num_cols * i);
        for (j = 0; j < num_rowsB; ++j, ++dm) {
            const double *v = XB + (num_cols * j);
            *dm = seuclidean_distance(var, u, v, num_cols);
        }
    }
    return 0;
}

which doesn't seem very clever to me. Would it be worth doing something more clever?
How much does a compiler optimize this?

@amueller (Member) commented

I think doing the thing Julia does and controlling the threshold with a global option might be nicer. Should be faster, right?
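
A rough sketch of that threshold idea (the cutoff and the recomputation loop here are illustrative, not what Julia's Distances.jl actually ships):

import numpy as np

def euclidean_with_fixup(X, Y, rtol=1e-4):
    # fast path: quadratic expansion via a single matrix product
    XX = np.einsum('ij,ij->i', X, X)[:, np.newaxis]
    YY = np.einsum('ij,ij->i', Y, Y)[np.newaxis, :]
    D2 = np.maximum(XX + YY - 2.0 * (X @ Y.T), 0.0)
    # entries where the squared distance is tiny relative to the squared
    # norms are the ones dominated by cancellation: recompute them exactly
    suspect = D2 < rtol * np.maximum(XX, YY)
    for i, j in zip(*np.nonzero(suspect)):
        diff = X[i] - Y[j]
        D2[i, j] = diff @ diff
    return np.sqrt(D2)

Only the few suspect entries pay for the exact recomputation, which is why this should stay close to the fast path in speed.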

@rth (Member, Author) commented Nov 20, 2018

Has anyone looked at the scipy implementation?

I haven't -- might be worth investigating in any case.

I think doing the thing Julia does and controlling the threshold with a global option might be nicer. Should be faster, right?

Yes, but also more complicated. And this seems to work well enough for low dimensional data, according to @jeremiedbb's latest analysis. Though cdist will upcast 32 bit to 64 bit, so it's not a solution there.

@jnothman (Member) commented

So I think we've decided this is not the solution for now.

@jnothman closed this Feb 28, 2019