[MRG] Allow exact Euclidean distance calculations #12136
Conversation
Force-pushed 1ff3482 to 7855314.
Note that this will be useful even assuming #11271 is merged for 32 bit, because even in 64 bit there are precision issues as illustrated in #9354 (comment).
Given that the precision of 32 bit Euclidean distances is a blocker for 0.21, this PR could be an easy workaround for 0.20 in the meantime. Otherwise, in the current situation, there is just no way to mitigate it, and this can have practical impact as illustrated e.g. in #9354 (comment).
I'm not sure this is the right solution but:
Thanks for the feedback!
Yes, good catch. Will fix it.
Consider for instance ...

Also, I like the idea that a computational backend can be changed globally in a context manager (same as you can change it, e.g., for joblib regarding thread/process parallelism).
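A rough sketch of what that could look like, reusing scikit-learn's existing `set_config` / `config_context` mechanism; the `euclidean_distance_algorithm` key and the `'exact'` value are the ones proposed in this PR, so this would only run against this branch, not a released scikit-learn:

```python
import numpy as np
import sklearn
from sklearn.metrics.pairwise import euclidean_distances

X = np.random.RandomState(0).rand(100, 10).astype('float32')

# Proposed in this PR (not a released API): switch the backend globally,
# without touching any calling code...
sklearn.set_config(euclidean_distance_algorithm='exact')

# ...or only within a block, much like joblib's parallel_backend context manager.
with sklearn.config_context(euclidean_distance_algorithm='exact'):
    D = euclidean_distances(X)
```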
It did indeed for 64 bit, but not 32 bit. The following shouldn't pass:

```python
import numpy as np
from scipy.spatial.distance import cdist
from numpy.testing import assert_allclose
from sklearn.metrics.pairwise import euclidean_distances

dtype = 'float32'
# An offset is needed to expose the precision loss in 64 bit; in 32 bit it
# already shows up without one.
if dtype == 'float64':
    offset = 10000
elif dtype == 'float32':
    offset = 0

XA = np.random.RandomState(42).rand(100, 10).astype(dtype) + offset
XB = np.random.RandomState(41).rand(200, 10).astype(dtype) + offset

# Compare the exact cdist result with the expansion-based euclidean_distances.
dist_exact = cdist(XA, XB, 'euclidean')
assert_allclose(dist_exact, euclidean_distances(XA, XB))
```

Will adapt the tests.
I just want to check that ...
I agree that a global config is not the worst way to handle this, as long as it is somehow visible or made known through warnings. But some kind of "auto" method should be available if possible?
#12142 may end up with some kind of "auto" method, but that would require more work, and that's why I thought it would be better to make this a separate PR. Also changed ...
I somehow begin to accept a global config. With it, users can then use an alternative algorithm without modifying their code. Though it seems strange to solve the problem in this way.
To me a temporary fix would make sure we never do the fast calculation in 32 bit, and would hold off on a further solution.
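A minimal sketch of that temporary fix, under the assumption that it lives in a small wrapper (`safe_euclidean_distances` is a made-up name, not a scikit-learn function): on float32 inputs, skip the fast expansion entirely and fall back to the exact `cdist` formulation, accepting the slowdown:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import euclidean_distances

def safe_euclidean_distances(X, Y):
    # Hypothetical wrapper illustrating the idea above, not scikit-learn code:
    # never take the fast expansion path on float32 inputs, where the
    # cancellation errors are worst.
    if X.dtype == np.float32 or Y.dtype == np.float32:
        # cdist computes in float64 internally; cast back to the input dtype.
        return cdist(X, Y, 'euclidean').astype(np.float32)
    return euclidean_distances(X, Y)

X = np.random.RandomState(42).rand(100, 10).astype(np.float32)
Y = np.random.RandomState(41).rand(200, 10).astype(np.float32)
print(safe_euclidean_distances(X, Y).dtype)  # float32, but computed exactly
```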
Has anyone looked at the scipy implementation?

```c
static NPY_INLINE int
cdist_seuclidean(const double *XA, const double *XB, const double *var,
                 double *dm, const npy_intp num_rowsA, const npy_intp num_rowsB,
                 const npy_intp num_cols)
{
    npy_intp i, j;

    for (i = 0; i < num_rowsA; ++i) {
        const double *u = XA + (num_cols * i);
        for (j = 0; j < num_rowsB; ++j, ++dm) {
            const double *v = XB + (num_cols * j);
            *dm = seuclidean_distance(var, u, v, num_cols);
        }
    }
    return 0;
}
```

It doesn't seem very clever to me. Would it be worth doing something more clever?
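For contrast, a rough NumPy sketch of the "clever" fast path (not the exact scikit-learn code): the pairwise squared distances come out of a single matrix product, which is why it is much faster than scipy's explicit double loop above, and also why it can cancel catastrophically when the squared norms dwarf the squared distances:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
X, Y = rng.rand(100, 10), rng.rand(200, 10)

# ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, evaluated with one BLAS matmul.
XX = (X * X).sum(axis=1)[:, np.newaxis]
YY = (Y * Y).sum(axis=1)[np.newaxis, :]
D_fast = np.sqrt(np.maximum(XX - 2 * X @ Y.T + YY, 0))

# In 64 bit with well-scaled data the two approaches agree closely.
print(np.abs(D_fast - cdist(X, Y, 'euclidean')).max())
```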
I think doing the thing Julia does and controlling the threshold with a global option might be nicer. Should be faster, right?
I haven't -- might be worth investigating in any case.
Yes, but also more complicated. And this seems to work well enough for low-dimensional data according to the latest analysis by @jeremiedbb. Though cdist will upcast 32 bit to 64 bit, so it's not a solution there.
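A rough sketch of how a threshold-controlled hybrid could look. This is one reading of the idea above, not the Julia implementation and not what this PR does, and the `threshold` default here is arbitrary: run the fast expansion, then recompute exactly only the entries that are small relative to the norms and therefore most exposed to cancellation:

```python
import numpy as np

def expansion_with_recompute(X, Y, threshold=1e-4):
    # Fast path: quadratic expansion for all pairs.
    XX = (X * X).sum(axis=1)[:, np.newaxis]
    YY = (Y * Y).sum(axis=1)[np.newaxis, :]
    D2 = np.maximum(XX - 2 * X @ Y.T + YY, 0)
    # Entries much smaller than the norms are the ones that may have lost
    # most of their significant digits; recompute those exactly.
    suspect = D2 < threshold * (XX + YY)
    if np.any(suspect):
        i, j = np.nonzero(suspect)
        diff = X[i] - Y[j]
        D2[i, j] = (diff * diff).sum(axis=1)
    return np.sqrt(D2)
```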
So I think we've decided this is not the solution for now.
In an attempt to somewhat mitigate #9354, this optionally allows computing the Euclidean distances using the exact formulation from `scipy.spatial.distance.cdist`, without the quadratic expansion formula. This is much slower but more numerically accurate.

This adds:

- an `algorithm` parameter to `metrics.euclidean_distances` (see the usage sketch below)
- a `euclidean_distance_algorithm` global config option to allow overriding this setting globally (i.e. to avoid having to pass it through everywhere)

At the very least this would make it possible to check whether the precision issues affect results for real datasets (e.g. with some clustering algorithms in 32 bit), which is currently hard to tell.
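An illustrative call with the proposed parameter. The keyword comes from this PR and the `'exact'` value is a guess at its spelling, so this is a sketch against this branch rather than something that runs on a released scikit-learn:

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

X = np.random.RandomState(0).rand(1000, 50).astype('float32')

# Default fast path: quadratic expansion, quick but imprecise in 32 bit.
D_fast = euclidean_distances(X)

# Proposed in this PR: exact, cdist-style formulation, slower but accurate.
D_exact = euclidean_distances(X, algorithm='exact')
```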
If we can get this into 0.20, it would be great. If not, I'll change the `versionadded` ...