[MRG] Fix euclidean_distances numerical instabilities #13410
Conversation
Thanks for this!! I think our main priority here is numerical stability, while ideally maintaining good performance where we can. Let us know when you want review.
```python
distances = euclidean_distances(X, Y)

assert_allclose(distances, expected, rtol=1e-6)
assert distances.dtype == dtype
```
The test doesn't pass with rtol=1e-7 (the default) in float32, probably because scipy upcasts to float64 to compute the distances, which can cause differences in the last significant digit.
I think an rtol of 1e-6 is enough for float32. What do you think?
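For context, a tiny illustrative check (not part of the PR's tests) of why 1e-6 is a comfortable tolerance for float32:

```python
import numpy as np

# float32 carries ~7 significant decimal digits, so a discrepancy confined to
# the last digit is on the order of the machine epsilon, just below rtol=1e-6.
print(np.finfo(np.float32).eps)   # ~1.1920929e-07
print(np.finfo(np.float64).eps)   # ~2.220446049250313e-16
```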
The ci/circleci: doc-min-dependencies failure may be real? Could you have generated infinite distances, for instance?
Nice work! Please add a what's new entry.
```
    return np.asarray(D)


cdef floating _euclidean_dense_dense_exact_1d(floating *x,
```
Is this much better than using a vectorized implementation?
Pairwise distances can't simply be vectorized.

The numpy way would be to do:

```python
D = ((X[:, :, np.newaxis] - Y.T[np.newaxis, :, :]) ** 2).sum(axis=1)
```

but it takes a lot more memory and is 10x slower.

Another possibility is:

```python
for i in range(n_samples_X):
    for j in range(n_samples_Y):
        D[i, j] = ((X[i] - Y[j]) ** 2).sum()
```

But calling the numpy API in a tight C loop is catastrophic: it's 1000x slower.

So as far as I know, the implementation I did is the only way to stay close to cdist performance. If you know a way to use a vectorized implementation, I'll take it.
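For reference, a small self-contained script checking that the two pure-NumPy variants quoted above agree with scipy's cdist (the shapes and data here are illustrative; the 10x/1000x figures are the commenter's measurements, not re-run here):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
X = rng.rand(200, 50)
Y = rng.rand(100, 50)

# reference: squared Euclidean distances from scipy
D_ref = cdist(X, Y, "sqeuclidean")

# broadcast version: allocates an (n_samples_X, n_features, n_samples_Y) temporary
D_bcast = ((X[:, :, np.newaxis] - Y.T[np.newaxis, :, :]) ** 2).sum(axis=1)

# explicit double loop, calling the NumPy API once per pair
D_loop = np.empty((X.shape[0], Y.shape[0]))
for i in range(X.shape[0]):
    for j in range(Y.shape[0]):
        D_loop[i, j] = ((X[i] - Y[j]) ** 2).sum()

assert np.allclose(D_ref, D_bcast)
assert np.allclose(D_ref, D_loop)
```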
sklearn/metrics/pairwise.py
```python
xe = xs + (chunk_size if i < n_chunks_X - 1 else n_samples_X_rem)

X_chunk = X[xs:xe].astype(np.float64)
XX_chunk = row_norms(X_chunk, squared=True)
```
You won't use the XX and YY that were passed in?
In that special case, I won't, indeed. The reason is that I only take that path when n_features > 32 and dtype=float32 (for float64 there is no need to upcast, so no need to chunk). In that case I can't use norms computed on the float32 data: that is the main source of the loss of precision. So I need to first upcast X and then compute the norms.
Am I correct to think that you ask for XX and YY to be passed in, but don't use them?
Yes I fixed that. I do use them now if they are in float64.
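A minimal sketch of the logic described above, assuming a hypothetical helper (`_chunk_norms` is my name, not the PR's; `row_norms` is sklearn's existing utility):

```python
import numpy as np
from sklearn.utils.extmath import row_norms

def _chunk_norms(X, xs, xe, XX=None):
    """Return the float64 chunk X[xs:xe] and its squared row norms,
    reusing precomputed norms XX only when they are already float64."""
    X_chunk = X[xs:xe].astype(np.float64)
    if XX is not None and XX.dtype == np.float64:
        XX_chunk = XX[xs:xe]
    else:
        # norms computed on float32 data are the main source of precision
        # loss, so recompute them from the upcast chunk instead
        XX_chunk = row_norms(X_chunk, squared=True)
    return X_chunk, XX_chunk
```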
That's strange, the failure appears randomly. It passes for some commits and fails for others. I can't reproduce it locally. I'll investigate further.
```
     Returns
     -------
-    distances : {array, sparse matrix}, shape (n_samples_1, n_samples_2)
+    distances : array, shape (n_samples_1, n_samples_2)
```
I kept that change because the docstring was wrong: the output is never sparse.
My goal is not to optimize more than it was. It's just that in both cases, i.e. upcast or exact, the performance is lower than before. I'm just trying to keep it reasonable.

```python
def euclidean_distances(X, Y):
    return np.sqrt(np.sum((X[:, :, np.newaxis] - Y.T[np.newaxis, :, :]) ** 2, axis=1))
```

:D

I removed some special cases. I want to keep the 2 cases, exact and upcast, to cover as many numerical issue situations as possible. I think it's simple enough now.
It's true that since the number of columns is a constant in this case, the copy would take an
Yes, but some optimizations in this PR are not related to improving the accuracy. The use of `syrk`, for instance.

How about we strip down this PR to the bare minimum that fixes the accuracy issue and leave the rest for a subsequent PR? See, there are many variables that decide what to do: number of features, type of the inputs, whether the result is symmetric, the sparseness of X and Y, memory format of X and Y. That's a lot of cases, and not all of them might be common enough to be worth the effort of making an optimal algorithm for them. So we should probably think that through in another PR.
As explained, I removed the use of syrk and the forced C ordering.
I would be inclined to also postpone the check on the number of features as well as the symmetric chunks. I mean, it's also just an optimization. The chunking code is already pretty complex and would probably need some refactoring with the pre-existing chunking code.
That's just 2 lines:

```python
if X is Y and j < i:
    d = distances[ys:ye, xs:xe].T
```

I really don't think it hurts the reviewing of this PR.
No, it's also for better accuracy. Since we don't upcast float64, using the exact method in low dimension is important, especially since it's in low dimension that accuracy issues are most likely to occur.
Why do you claim it is more likely to have accuracy issues in low dimensionality? The example I gave in #9354 (comment) exposes problems that get more severe with higher dimensionality. In these cases, upcasting to float64 may eventually not be sufficient anymore.
As you wish. I would just make this PR as simple as possible so that we can get it right in a finite amount of time.
I understand that. (Although it seems to be disputed by others.)
This PR adds a lot of complexity, and I'm not convinced that all of it is justified, but I'm okay with it for now. I'm yet to check that the tests fail in master.
I think @kno10 would have us use the slow-but-sure computation in all cases because even in high dimensions the fast-but-rough calculation can give inaccurate distances (especially, if I'm not mistaken, when the norms of the operands are close).
@jeremiedbb are there aspects here you would consider changing? (@rth?)
```
Notes
-----
When ``n_features > 16``, the euclidean distance between a pair of row
vector x and y is computed as::
```
vector -> vectors
```diff
@@ -30,6 +30,8 @@
 from ..utils._joblib import effective_n_jobs

 from .pairwise_fast import _chi2_kernel_fast, _sparse_manhattan
+from .pairwise_fast import _euclidean_dense_dense_exact
+from ._safe_euclidean_sparse import _euclidean_sparse_dense_exact
```
It might be a bit surprising that these implementations are in different modules. Why did you want them separated?
I don't want them to be separated. I had to separate them because I wanted to compile pairwise_fast with the -ffast-math flag. The thing is that with this flag, gcc performs unsafe optimizations for the _safe_euclidean_sparse functions.

The reason I wanted to use this flag for pairwise_fast is that it's much faster, and it should be safe because there's no operation that gcc would optimize in an unsafe way. By the way, scipy uses this flag for its euclidean distance.
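For illustration, this is roughly how such a split could be expressed in a standalone setuptools/Cython build (an assumed configuration, not scikit-learn's actual build setup), with only `pairwise_fast` getting the fast-math flag:

```python
from setuptools import setup
from setuptools.extension import Extension
from Cython.Build import cythonize
import numpy as np

extensions = [
    Extension(
        "sklearn.metrics.pairwise_fast",
        ["sklearn/metrics/pairwise_fast.pyx"],
        include_dirs=[np.get_include()],
        extra_compile_args=["-ffast-math"],  # assumed safe here: no op gcc can reorder unsafely
    ),
    Extension(
        "sklearn.metrics._safe_euclidean_sparse",
        ["sklearn/metrics/_safe_euclidean_sparse.pyx"],
        include_dirs=[np.get_include()],
        # no -ffast-math: these routines rely on strict IEEE floating-point behaviour
    ),
]

setup(ext_modules=cythonize(extensions))
```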
This test fails on master:

```python
@pytest.mark.parametrize("dtype, x",
                         [(np.float32, 10000), (np.float64, 100000000)],
                         ids=["float32", "float64"])
@pytest.mark.parametrize("dim", [1, 100])
def test_euclidean_distances_extreme_values(dtype, x, dim):
    # check that euclidean distances is correct where 'fast' method wouldn't be
    X = np.array([[x + 1] + [0] * (dim - 1)], dtype=dtype)
    Y = np.array([[x] + [0] * (dim - 1)], dtype=dtype)
    expected = [[1]]

    distances = euclidean_distances(X, Y)

    if dtype == np.float64 and dim == 100:
        # This is expected to fail for float64 and dimension > 32 due to lack
        # of precision of the fast method of euclidean distances.
        with pytest.raises(AssertionError, match='Not equal to tolerance'):
            assert_allclose(distances, expected)
    else:
        assert_allclose(distances, expected)
```

Doing the upcasting of float32 only fixes the float32 part (low dim and high dim).
If I understand correctly, @Celelibi would only keep the upcasting of float32, and @kno10 would keep the exact computation and extend it to all cases. If I wanted to keep only one, I think I'd go for the upcasting. The reason is that it's enough to fix all the float32 instabilities, and I still think that the float64 instabilities should happen really rarely in a machine learning context. If we decide to only keep the upcasting, wouldn't it make more sense to review @Celelibi's PR then?
But that "should happen really rarely" is just speculation, isn't it? I have shown that the issues can arise much more quickly with increasing dimensionality. Because of that, I am no longer sure that the upcasting is even enough to resolve this completely for float32. Upcasting solves this for a single variable, but with 1000 dimensions we probably still lose around 3 digits of precision in the worst case, i.e., have 4 digits correct, whereas the "actually-not-that-slow-but-sure" approach should give all 7.
No it's not. With 16 digits of precision you'll keep more precision when squaring than if you had 8 digits of precision. This is illustrated by the example.

The loss of precision comes from the squaring of the coordinates. If you have 8 digits of precision, then you need 16 digits of precision to store the square. This is exactly what upcasting provides.
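To make the squaring argument concrete, a small illustrative example (values are mine, not from the thread): the expanded formula cancels catastrophically in float32, but is fine once the operands are upcast to float64.

```python
import numpy as np

x = np.float32(10001.0)
y = np.float32(10000.0)

# exact method: (x - y)^2 == 1
print((x - y) ** 2)                              # 1.0

# expanded formula x^2 - 2xy + y^2 in float32: the squares need ~9 digits,
# float32 only keeps ~7, and the final subtraction cancels what is left
print(x * x - 2 * x * y + y * y)                 # 0.0

# same expanded formula after upcasting: the squares fit in float64's ~16 digits
x64, y64 = np.float64(x), np.float64(y)
print(x64 * x64 - 2 * x64 * y64 + y64 * y64)     # 1.0
```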
Define "happen a lot more". How often does the latter happen? It may matter to more people than you think. This is still just speculation. And this is just the toy example that was easy to find. There probably are worse cases.
No, that is not sufficient, because you work with the sum of squares.

That leaves 1 digit of precision with fp32, because it cost us 6 digits due to the high dimensionality. Now, one million variables is of course an extreme case. Consider that word vectors have 1000 dimensions, and do the same experiment with 1000 dimensions.

Oops. The "bad" equation produced a negative difference, and hence a nan. sklearn 0.19.1 and 0.20.3 have a safeguard for this (scikit-learn/sklearn/metrics/pairwise.py, line 252 at 5ff64d3); it is reasonable to have this safeguard, but it can also hide such problems. So assume someone is working with word vectors of length 1000: are you certain this is "really rarely" causing problems? I am not... Zero values that shouldn't be 0 can cause all kinds of trouble, and judging from a quick look at Google Scholar, there are people that apply k-means to word2vec vectors.
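An illustrative snippet of the safeguard point (values are mine, not kno10's original example): in float32 the expanded formula can return 0 or even a slightly negative squared distance, and clipping at 0 before the square root avoids NaNs but silently turns small distances into exact zeros.

```python
import numpy as np

x = np.float32(10000.0)
y = np.float32(10000.001)

# expanded formula in float32; the true squared distance is ~9.5e-7
d2 = x * x - 2 * x * y + y * y
print(d2)                          # 0.0 here; with other inputs it can even come out negative

# clipping at 0 before the square root prevents NaNs,
# but also hides distances that collapsed to zero
print(np.sqrt(np.maximum(d2, 0)))  # 0.0
```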
Your examples are not complete. You only show the issue with float32, but you don't show what happens when you upcast.

Maybe 1000000 is too small.

Oops, better precision than the exact method...
FWIW, I get more accurate results for
1000.00024 is still more precise than 1000.03
For fairness, you should then compare to upcasting the `(a-b)**2` method.

But the point is that such problems can arise in ways that you do not seem to have considered. As I am not an expert on numerics, I'd rather be safe when there already is a known problem, and we don't know if our workaround captures all the cases.
No, the point is precisely not to do that! We're comparing the exact method on float32 and the fast method with upcast.
I've considered all the cases you mention. But as I said multiple times, I'm not proposing a solution which fixes all numerical precision issues. And the example I gave shows that even with the exact method you can get precision issues. As long as we don't work at arbitrary precision, there will be precision issues (but numpy/sklearn are not Mathematica). I'm just proposing a reasonable effort in that direction, which is what has been agreed on.
It had many optimizations too. And I've seen a bug in my old PR because of a symmetry optimization.

I would just kindly note that your example uses Python's sum instead of numpy's sum. The former uses whatever C

Just for information, the place where we lose most of the accuracy with the exact formula is in the summation:

```python
>>> import numpy as np
>>> def neumaier_sum(X):
...     res = np.float32(0.)
...     corr = np.float32(0.)
...     for e in X:
...         tmp = res + e
...         if np.abs(res) > np.abs(e):
...             corr += (res - tmp) + e
...         else:
...             corr += (e - tmp) + res
...         res = tmp
...     return res + corr
...
>>> a = np.array([1.0] * 1000000, np.float32)
>>> b = np.array([1.1] * 1000000, np.float32)
>>> sum((a-b)**2)
10000.004433095455
>>> neumaier_sum((a-b)**2)
10000.204
>>> np.sum((a-b)**2)
9999.993
>>> np.sum((a-b)**2, dtype=np.float64)
10000.004433095455
```

@kno10, what solution are you suggesting? Upcast and always use the formula?

BTW, I confirm the statement of @jeremiedbb. On random data, the algorithm that upcasts and uses the expanded formula produces the exact result, both for few features and for many features. So that's a net improvement of the accuracy with a small performance cost and little additional code complexity.
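For reference, a minimal NumPy sketch of the "upcast, then use the expanded formula" strategy under discussion (a simplified illustration without chunking; not the PR's implementation):

```python
import numpy as np

def euclidean_distances_upcast(X, Y):
    """d(x, y)^2 = ||x||^2 - 2 x.y + ||y||^2, computed in float64 even when
    the inputs are float32, then cast back to the input dtype."""
    X64 = X.astype(np.float64)
    Y64 = Y.astype(np.float64)
    XX = np.einsum("ij,ij->i", X64, X64)[:, np.newaxis]
    YY = np.einsum("ij,ij->i", Y64, Y64)[np.newaxis, :]
    D2 = XX - 2.0 * (X64 @ Y64.T) + YY
    np.maximum(D2, 0, out=D2)              # clip tiny negatives from rounding
    return np.sqrt(D2).astype(X.dtype, copy=False)
```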
If possible I would avoid the upcast (because it uses a lot of memory, and isn't free).
You're right that the upcast, rather than just aggregating in the higher precision, is due to limitations of numpy.dot. Not sure if there's a simple way around it (einsum takes a dtype, but I'm not sure it's the right solution).
According to https://stackoverflow.com/a/47775357/1939754 it can be pretty fast, and seems to be accurate, but I don't know if this can be easily extended to work on pairwise vectors. Plus, it will still allocate n*m vectors, so it does cause some memory overhead. A well implemented vectorized C version may be much faster (and, as mentioned, the registers may have extra precision, in particular with FMA).

This would allow reusing a buffer for d easily, and should also give good precision when the values of a and b differ in magnitude, because all the computation is in fp64. But as we can see, the sum can also benefit from double precision. Maybe you can benchmark a Cythonized version of this:
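A rough NumPy sketch of that per-row, fp64-accumulation idea (illustrative only; a real version would be Cythonized as suggested, and the function name is mine):

```python
import numpy as np

def exact_rowwise_fp64(X, Y):
    """Exact pairwise Euclidean distances, one row of X at a time,
    with the differences and the accumulation both done in float64."""
    X64 = X.astype(np.float64)
    Y64 = Y.astype(np.float64)
    D = np.empty((X64.shape[0], Y64.shape[0]), dtype=np.float64)
    for i, x in enumerate(X64):
        diff = Y64 - x                                   # (n_samples_Y, n_features) fp64 buffer
        D[i] = np.sqrt(np.einsum("ij,ij->i", diff, diff))
    return D
```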
My biggest concern with that solution is that we would lose the performance benefits of using the underlying BLAS, especially with a large number of features and samples, which is precisely where the performance matters the most. Additionally, the expanded formula performs approximately 1/3 fewer operations overall (roughly 2·n_features flops per pair for the dot product versus ~3·n_features for subtract, square and add), which would be mostly visible with a large number of features. But that's just a performance bound; in reality I doubt we could get anywhere close to it in a reasonable amount of time. Are we really ready to run up to 50 times slower for a little bit of accuracy? Maybe an additional
Fixes #9354
Should close #11271 eventually
Should close #12128

This PR follows the discussion in #9354, i.e.:

- for `n_features <= 16`, use the usual exact method to compute distances: `d(x,y)² = ||x-y||²`
- for `n_features > 16`, keep the fast method `d(x,y)² = ||x||² - 2x·y + ||y||²`, but upcast to float64 if necessary.

A couple of remarks:

- When `n_features <= 32` and one of the matrices is sparse and the other is not, the performance is pretty good, but I couldn't get good performance when both are sparse. In that case I densify the smaller array. I think this is fine and won't cause a big increase of memory usage. The reason is as follows: worst case, both arrays have the same number of samples `N`. Then the ratio between the memory usage of the densified array and that of the distance matrix is `N * n_features / (N * N) = n_features / N < 32 / N`. So for a reasonable `N >~ 1000`, the increase is at most ~3%.
- To achieve good performance on the exact dense-dense case I had to add the `-Ofast` compiler flag to enable better compiler optimizations (almost a 2x speedup).

Benchmarks:

- `n_samples_X = 100000`, `n_samples_Y = 100`, `n_features = 10`
- `n_samples_X = 100000`, `n_samples_Y = 100`, `n_features = 1000`
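A minimal way to time configurations like the two listed above (a harness I'm assuming, not taken from the PR):

```python
import numpy as np
from time import perf_counter
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.RandomState(0)
for n_features in (10, 1000):
    X = rng.rand(100000, n_features).astype(np.float32)
    Y = rng.rand(100, n_features).astype(np.float32)
    tic = perf_counter()
    euclidean_distances(X, Y)
    print(f"n_samples_X=100000, n_samples_Y=100, n_features={n_features}: "
          f"{perf_counter() - tic:.3f}s")
```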