[MRG] Allow exact Euclidean distance calculations #12136


Closed
rth wants to merge 7 commits from the euclidean-dist-accuracy-v2 branch

Conversation

@rth (Member) commented Sep 23, 2018

In an attempt to somewhat mitigate #9354, this PR optionally allows computing the Euclidean distances with the exact formulation used by scipy.spatial.distance.cdist, instead of the quadratic expansion formula. This is much slower but more numerically accurate.

This adds:

  • an algorithm parameter to metrics.euclidean_distances
  • a euclidean_distance_algorithm global config option to allow overriding this setting globally (i.e. to avoid having to pass it through everywhere):
    with sklearn.config_context(euclidean_distance_algorithm='exact'):
        knn = KNeighborsClassifier(algorithm='brute', metric='euclidean')
        # or
        cl = AgglomerativeClustering(linkage='ward')  # uses pairwise_distances internally

At the very least, this would make it possible to check whether the precision issue affects the results for real datasets (e.g. with some clustering algorithms in 32 bit), which is currently hard to tell.
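
For reference, a minimal sketch of what the two formulations compute (the names follow this PR's proposed algorithm values):

import numpy as np

def euclidean_exact(x, y):
    # 'exact' formulation, as in scipy's cdist: sqrt(sum_k (x_k - y_k)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def euclidean_expanded(x, y):
    # 'quadratic-expansion': sqrt(||x||^2 + ||y||^2 - 2 x.y); for the
    # pairwise case this reduces to a single fast matrix product, but the
    # subtraction of large, nearly equal terms can cancel catastrophically
    return np.sqrt(np.dot(x, x) + np.dot(y, y) - 2 * np.dot(x, y))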

If we can get this into 0.20, it would be great. If not, I'll change the versionadded...

@rth force-pushed the euclidean-dist-accuracy-v2 branch from 1ff3482 to 7855314 on September 23, 2018 14:01
@rth (Member, Author) commented Sep 23, 2018

Note that this will be useful even assuming #11271 is merged for 32 bit, because even in 64 bit there are precision issues as illustrated in #9354 (comment).
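
To make the 64 bit failure mode concrete, a minimal example (numbers chosen to trigger cancellation; the exact deviation is platform dependent):

import numpy as np

x = np.array([10000.0, 10000.0])
y = np.array([10000.0, 10000.001])

# exact formulation: correct to full precision
print(np.sqrt(np.sum((x - y) ** 2)))                            # 0.001
# quadratic expansion: recovers ~1e-6 by subtracting terms of order 4e8,
# so most of the float64 mantissa is lost to cancellation
print(np.sqrt(np.dot(x, x) + np.dot(y, y) - 2 * np.dot(x, y)))  # off in the leading digits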

@rth (Member, Author) commented Sep 24, 2018

Given that the precision of 32 bit euclidean distances is a blocker for 0.21, this PR could be an easy workaround for 0.20 in the meantime. Otherwise, in the current situation there is simply no way to mitigate the problem, and it can have practical impact, as illustrated e.g. in #9354 (comment).

cc @jnothman @TomDLT @jeremiedbb

@qinhanmin2014 (Member) commented

I'm not sure this is the right solution, but:
(1) We allow Y to be None.
(2) Why add the parameter to _global_config?
(3) The test is not strong enough (it also passes on master). Also, we're calculating the distance with cdist, so it seems strange to check the result using the same function.

@rth (Member, Author) commented Sep 24, 2018

Thanks for the feedback!

(1) We allow Y to be None.

Yes, good catch. Will fix it.

(2) Why add the parameter to _global_config?

Consider for instance Birch, an algorithm that uses euclidean_distances in _split_node. Without a global option, we would need to add a new parameter to Birch (where it is a bit out of place), then pass it through everywhere inside the estimator, which is also not very helpful. I agree that for estimators that have a metric_params parameter (e.g. DBSCAN) it's not necessary, but euclidean_distances is just so widespread in the code base (it is also used, with default parameters, in all the pairwise_distances{,_chunked,_argmax,_argmin_min, etc.} functions) that this seemed like a less tedious way of doing it.

Also, I like the idea that a computational backend can be changed globally in a context manager (just as you can change e.g. the joblib thread/process parallelism).
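
For illustration, the lookup inside euclidean_distances would presumably follow the pattern of the existing config options; a sketch, where euclidean_distance_algorithm is the key this PR proposes:

from sklearn import get_config

def _resolve_algorithm(algorithm=None):
    # an explicit keyword argument wins; otherwise fall back to the
    # global configuration set via sklearn.config_context / set_config
    if algorithm is None:
        algorithm = get_config().get('euclidean_distance_algorithm',
                                     'quadratic-expansion')
    return algorithm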

The test is not strong enough (also passes on master).

It did indeed for 64 bit, but not for 32 bit. The following shouldn't pass:

import numpy as np
from scipy.spatial.distance import cdist
from numpy.testing import assert_allclose

from sklearn.metrics.pairwise import euclidean_distances

dtype = 'float32'

# The offset needed to expose the loss of precision depends on the dtype:
# float64 only breaks down once the values are far from the origin, while
# float32 already loses precision at unit scale
if dtype == 'float64':
    offset = 10000
elif dtype == 'float32':
    offset = 0

XA = np.random.RandomState(42).rand(100, 10).astype(dtype) + offset
XB = np.random.RandomState(41).rand(200, 10).astype(dtype) + offset

dist_exact = cdist(XA, XB, 'euclidean')

assert_allclose(dist_exact, euclidean_distances(XA, XB))

I will adapt the tests accordingly.

Also, we're calculating the distance with cdist, so it seems strange to check the result using the same function.

I just want to check that euclidean_distances(X, Y, algorithm='exact') produces the same results as cdist: a direct comparison between the two seems the most straightforward way to achieve it.
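
Roughly (runnable only on this branch, since algorithm is the parameter this PR adds):

import numpy as np
from numpy.testing import assert_allclose
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import euclidean_distances

XA = np.random.RandomState(42).rand(100, 10)
XB = np.random.RandomState(41).rand(200, 10)

# both sides now use the direct formulation, so they should agree
# to within floating point round-off
assert_allclose(euclidean_distances(XA, XB, algorithm='exact'),
                cdist(XA, XB, 'euclidean'))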

@jnothman (Member) left a comment

I agree that a global config is not the worst way to handle this, as long as it is somehow visible, or known through warnings. But some kind of auto method should be available if possible?

@rth (Member, Author) commented Sep 26, 2018

But some kind of auto method should be available if possible?

#12142 may end up with some kind of "auto" method, but that would require more work, and that's why I thought it would be better to make this a separate PR.
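
For concreteness, one hypothetical shape such an "auto" dispatch could take (a sketch only; nothing like this is implemented in this PR or in #12142):

import numpy as np

def choose_algorithm(X, Y):
    # hypothetical policy: use the slower exact path where the quadratic
    # expansion is known to be fragile, and the fast path otherwise
    if X.dtype == np.float32 or Y.dtype == np.float32:
        return 'exact'
    return 'quadratic-expansion'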

Also changed versionadded to 0.21.

@qinhanmin2014 (Member) commented

I somehow begin to accept a global config: with it, users can switch to an alternative algorithm without modifying their code. Though it still seems strange to solve the problem this way.
@rth We need to consider Y_norm_squared and X_norm_squared. When they are not None, maybe we should switch back to "quadratic-expansion" (or raise an error when users request "exact"; see the sketch below).
Maybe we can hurry this into 0.20.1, since it's a temporary fix for the issue.
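
A minimal sketch of such a guard (hypothetical; algorithm='exact' is this PR's proposed value):

def _check_norms_compatible(algorithm, X_norm_squared=None,
                            Y_norm_squared=None):
    # precomputed norms only make sense for the expansion-based path,
    # so reject them when the exact formulation is requested
    if algorithm == 'exact' and (X_norm_squared is not None
                                 or Y_norm_squared is not None):
        raise ValueError("X_norm_squared and Y_norm_squared cannot be "
                         "used with algorithm='exact'")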

@jnothman (Member) commented Oct 14, 2018 via email

@amueller (Member) commented

Has anyone looked at the scipy implementation?
It's:

static NPY_INLINE int
cdist_seuclidean(const double *XA, const double *XB, const double *var,
                 double *dm, const npy_intp num_rowsA, const npy_intp num_rowsB,
                 const npy_intp num_cols)
{
    npy_intp i, j;

    for (i = 0; i < num_rowsA; ++i) {
        const double *u = XA + (num_cols * i);
        for (j = 0; j < num_rowsB; ++j, ++dm) {
            const double *v = XB + (num_cols * j);
            *dm = seuclidean_distance(var, u, v, num_cols);
        }
    }
    return 0;
}

which doesn't seem very clever to me. Would it be worth doing something more clever?
How much does a compiler optimize this?

@amueller (Member) commented

I think doing the thing Julia does and controlling the threshold with a global option might be nicer. Should be faster, right?
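
A rough sketch of that threshold idea (the cutoff and the recomputation loop here are illustrative, not what Julia's Distances.jl actually ships):

import numpy as np

def euclidean_with_fixup(X, Y, rtol=1e-4):
    # fast path: quadratic expansion via a single matrix product
    XX = np.einsum('ij,ij->i', X, X)[:, np.newaxis]
    YY = np.einsum('ij,ij->i', Y, Y)[np.newaxis, :]
    D2 = np.maximum(XX + YY - 2.0 * (X @ Y.T), 0.0)
    # entries where the squared distance is tiny relative to the squared
    # norms are the ones dominated by cancellation: recompute them exactly
    suspect = D2 < rtol * np.maximum(XX, YY)
    for i, j in zip(*np.nonzero(suspect)):
        diff = X[i] - Y[j]
        D2[i, j] = diff @ diff
    return np.sqrt(D2)

Only the few suspect entries pay for the exact recomputation, which is why this should stay close to the fast path in speed.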

@rth (Member, Author) commented Nov 20, 2018

Has anyone looked at the scipy implementation?

I haven't -- might be worth investigating in any case.

I think doing the thing Julia does and controlling the threshold with a global option might be nicer. Should be faster, right?

Yes, but also more complicated. And this seems to work well enough for low dimensional data, according to @jeremiedbb's latest analysis. Though cdist will upcast 32 bit to 64 bit, so it's not a solution there.

@jnothman (Member) commented

So I think we've decided this is not the solution for now.

@jnothman closed this Feb 28, 2019