FIX euclidean_distances float32 numerical instabilities #13554
Conversation
Thank you @jeremiedbb. Some tests are failing. Could you please summarise why you are proposing this solution?
Well, this solution fixes the initial issue, i.e. precision issues in float32 (which was the most requested), while keeping good performance. For low-dimensional data, precision could also be increased for float64 at no performance cost, but the cost would be severe for high-dimensional data. Not mandatory for 0.21.
maxmem = max(
    ((x_density * n_samples_X + y_density * n_samples_Y) * n_features
     + (x_density * n_samples_X * y_density * n_samples_Y)) / 10,
    10 * 2**17)
Why this choice?
In my tests for #11271 I found that the computation time slowly increases with the memory size. It's shown in the first few plots of this comment: #11271 (comment).
I'm mostly curious about this choice. I don't think this should prevent this PR from being merged.
A fixed 10 MiB maxmem can lead to very small chunks when the number of features is very large (an extreme case, I agree). In that case the performance drop can be quite severe.
Another possibility could be a fixed maxmem with a minimum chunk_size. I think it's roughly equivalent, but in that case the memory increase can be bigger than expected, whereas in the previous case the memory increase is bounded. That's why I went for the first one.
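To illustrate with rough numbers (a hypothetical dense example, not the PR's code; when the float64 copies of the X and Y chunks dominate the budget, batch_size is roughly maxmem / ((x_density + y_density) * n_features)):

n_samples = 100_000       # hypothetical n_samples_X == n_samples_Y
n_features = 1_000_000    # very wide dense data, so both densities are 1

fixed = 10 * 2 ** 17      # fixed 10 MiB budget, counted in 8-byte blocks
scaled = (2 * n_samples * n_features + n_samples ** 2) / 10  # 10% heuristic

# approximate batch_size when the chunk copies dominate the budget
print(fixed / (2 * n_features))    # ~0.66 -> clamps to single-row chunks
print(scaled / (2 * n_features))   # ~10500 -> comfortably large chunks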
Large chunks would also decrease cache efficiency. Although I'm not a big fan of having an unbounded amount of additional memory, 10% of the memory already used is probably fair.
When I look at my old benchmarks, I think that there might be something going on under the hood. Maybe the optimal block size is actually constant. Moreover, if the number of features is large, it can also be chunked. That might deserve to be explored as well.
I think we can add to the todo-list to benchmark the influence of the memory size and chunking to get a proper understanding of the performance.
If the input data is memmapped, calculating this as a function of n_features may not be fair.
I think my comment about memmapping was incorrect in any case.
This logic seems to be more complicated than necessary, but okay.
I am fine to merge with this heuristic.
In the future, I am wondering if we could expose a chunk_size parameter which by default would apply such a heuristic and could be fine-tuned by the user. @jeremiedbb mentioned to me that it might not be that easy since it will impact pairwise_distances, where we would also need to expose this parameter, etc. Just a thought.
sklearn/metrics/pairwise.py
Outdated
n_chunks_X = n_samples_X // chunk_size + (n_samples_X % chunk_size != 0)
n_chunks_Y = n_samples_Y // chunk_size + (n_samples_Y % chunk_size != 0)

for i in range(n_chunks_X):
Can we not use utils.gen_even_slices here?
I used gen_batches instead, since we want a specific batch size, but it's cleaner indeed.
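For reference, both helpers live in sklearn.utils and differ in what they fix: gen_batches fixes the batch size, gen_even_slices fixes the number of slices. A quick illustration:

from sklearn.utils import gen_batches, gen_even_slices

print(list(gen_batches(7, batch_size=3)))
# [slice(0, 3, None), slice(3, 6, None), slice(6, 7, None)]
print(list(gen_even_slices(7, n_packs=3)))
# [slice(0, 3, None), slice(3, 5, None), slice(5, 7, None)]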
sklearn/metrics/pairwise.py
Outdated
distances = np.empty((n_samples_X, n_samples_Y), dtype=np.float32)

x_density = X.getnnz() / np.prod(X.shape) if issparse(X) else 1
y_density = Y.getnnz() / np.prod(Y.shape) if issparse(Y) else 1
Nit: X.nnz maybe?
doc/whats_new/v0.21.rst
Outdated
- |Fix| The function :func:`euclidean_distances`, and therefore the functions
  :func:`pairwise_distances` and :func:`pairwise_distances_chunked` with
  ``metric=euclidean``, suffered from numerical precision issues. Precision has
"with float32 features"
doc/whats_new/v0.21.rst
Outdated
  in version 0.21 and will be removed in version 0.23. :issue:`10580` by
  :user:`Reshama Shaikh <reshamas>` and :user:`Sandra Mitrovic <SandraMNE>`.

- |Fix| The function :func:`euclidean_distances`, and therefore the functions
maybe more relevant: "and therefore several estimators with metric='euclidean'"
doc/whats_new/v0.21.rst
Outdated
- |Fix| The function :func:`euclidean_distances`, and therefore the functions
  :func:`pairwise_distances` and :func:`pairwise_distances_chunked` with
  ``metric=euclidean``, suffered from numerical precision issues. Precision has
  been increased for float32. :issue:`13554` by :user:` <Celelibi>` and
I think the correct syntax is just :user:`Celelibi`.
sklearn/metrics/pairwise.py
Outdated
    return distances if squared else np.sqrt(distances, out=distances)


def _euclidean_distances_upcast_fast(X, XX=None, Y=None, YY=None):
"_fast" is a bit obscure
It's a residue of my previous PR, which implemented both the fast and exact methods. I removed it.
sklearn/metrics/pairwise.py
Outdated
     + (x_density * n_samples_X * y_density * n_samples_Y)) / 10,
    10 * 2**17)

# The increase amount of memory is:
add "in 8-byte blocks"... in fact where do you account for this?
maxmem is the maximum allowed number of 8-byte blocks of extra memory usage. 10 MiB is 10 * 2**20 bytes = 10 * 2**17 8-byte blocks.
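A one-line sanity check of that conversion (assuming 1 MiB = 2**20 bytes):

assert 10 * 2**20 / 8 == 10 * 2**17  # 10 MiB in bytes, expressed as 8-byte blocks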
Yes, that's true. But a fixed 10 MiB can be very bad when the number of features is high, because in that case the chunk size is very small. Do you think 10 MiB plus a minimum chunk size would be better (note that the 10 MiB is not guaranteed in that case)?
I think this is looking good enough to merge.
@@ -231,34 +231,115 @@ def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,
    elif XX.shape != (X.shape[0], 1):
        raise ValueError(
            "Incompatible dimensions for X and X_norm_squared")
    if XX.dtype == np.float32:
        XX = None
should we raise a warning that these cannot be used??
Why not, but I'm afraid it could generate a lot of warnings, for instance when pairwise_distances is called in a loop.
One thing we should certainly do is add an option to row_norms to use a float64 accumulator. That's on my todo list :)
np.maximum(distances, 0, out=distances)

# Ensure that distances between vectors and themselves are set to 0.0.
Why is this out of the if? It's only applicable if the if holds.
(Curiously, we currently appear to duplicate this logic in pairwise_distances_chunked but not in pairwise_distances.)
You mean zeroing the diagonal? It applies to both methods.
It's necessary to also do it in pairwise_distances_chunked, because we lose the diagonal info when passing chunks to pairwise_distances.
Yes, but I don't think we zero the diagonal in pairwise_distances at all for other metrics
I was looking at euclidean_distances, not pairwise_distances... That's strange indeed. Even for metric=euclidean it's not enforced when n_jobs > 1. Let me put that also on my todo list :)
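For what it's worth, the diagonal zeroing under discussion can be checked directly on euclidean_distances when Y is X (a small illustration, not part of the PR's diff):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

X = np.random.RandomState(0).rand(5, 3).astype(np.float32)
D = euclidean_distances(X)          # Y defaults to X
print(np.all(np.diag(D) == 0.0))    # True: self-distances are forced to exactly 0.0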
# - x_density * batch_size * n_features (copy of chunk of X)
# - y_density * batch_size * n_features (copy of chunk of Y)
# - batch_size * batch_size (chunk of distance matrix)
# Hence x² + (xd+yd)kx = M, where x=batch_size, k=n_features, M=maxmem
When a grade school student asks, "Why are we learning the quadratic equation?", we have an answer.
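For reference, taking the positive root of that quadratic gives the batch size; a sketch of the arithmetic, reusing the names from the quoted comment (the PR's exact expression may differ slightly):

import numpy as np

def batch_size_for(maxmem, n_features, x_density=1.0, y_density=1.0):
    # solve x**2 + (x_density + y_density) * n_features * x = maxmem for x
    # and keep the positive root, clipped to at least one row per chunk
    tmp = (x_density + y_density) * n_features
    return max(1, int((-tmp + np.sqrt(tmp ** 2 + 4 * maxmem)) / 2))

print(batch_size_for(10 * 2 ** 17, n_features=100))        # ~1049 rows per chunk
print(batch_size_for(10 * 2 ** 17, n_features=1_000_000))  # 1 (clamps to a single row)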
if YY.dtype == np.float32:
    YY = None
elif Y.dtype == np.float32:
    YY = None
At first glance, it is not obvious that None is used by _euclidean_distances_upcast.
What do you mean? XX and YY are used by _euclidean_distances_upcast no matter their value.
Setting XX or YY to None tells _euclidean_distances_upcast to call row_norms(*_chunk, squared=True)[:, np.newaxis].
Without reading _euclidean_distances_upcast, it is difficult to tell why XX or YY are set to None.
I think that we would need a small comment to clarify this part.
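Something along these lines could make the intent explicit (a hypothetical sketch; _chunk_norms, X_chunk and x_slice are illustrative names, not the PR's code):

from sklearn.utils.extmath import row_norms

def _chunk_norms(XX, X_chunk, x_slice):
    if XX is None:
        # float32 norms were discarded upstream: recompute them in float64
        # on the upcast chunk so they have full precision
        return row_norms(X_chunk, squared=True)[:, None]
    # otherwise reuse the precomputed float64 norms for this slice
    return XX[x_slice]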
                          (np.float64, 1e-8)])
@pytest.mark.parametrize("dim", [1, 1000000])
def test_euclidean_distances_extreme_values(dtype, s, dim):
    # check that euclidean distances is correct where 'fast' method wouldn't be
fast method?
I made the comment clearer.
Are we going to get this into 0.21?
Are you able to look at this, @qinhanmin2014?
Apologies, I won't be available until this weekend (will try to merge OPTICS these days).
@rth are you okay with this?
It would be good I think. Will review this in the next couple of days.
LGTM. Only nitpicks.
@@ -203,7 +203,7 @@ def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,

    Returns
    -------
    distances : {array, sparse matrix}, shape (n_samples_1, n_samples_2)
    distances : array, shape (n_samples_1, n_samples_2)
I put the comment here but I think that it should be written above.
I would either add a note (using sphinx syntax) or mention within the parameter descriptions that X_norm_squared and Y_norm_squared will be ignored if they are passed as float32, since they would lead to inaccurate distances.
sklearn/metrics/pairwise.py
Outdated
    Assumes XX and YY have float64 dtype or are None.

    X and Y are upcast to float64 by chunks, which size is chosen to limit
    memory increase by approximately 10%.
Suggested change:
    memory increase by approximately 10%.
    memory increase by approximately 10% (bounded to 10Mb).
    Y = y_array_constr(Y)
    distances = euclidean_distances(X, Y)

    assert_allclose(distances, expected, rtol=1e-6)
A small comment regarding rtol=1e-6 would be welcome.
    assert distances.dtype == dtype


@pytest.mark.parametrize("dtype, s",
Maybe we could have a better name for s
@pytest.mark.parametrize("dtype, s",
                         [(np.float32, 1e-4),
                          (np.float64, 1e-8)])
Suggested change:
    (np.float64, 1e-8)])
    pytest.param(np.float64, 1e-8, marks=pytest.mark.xfail(reason='failing due to lack of precision'))])
I messed up :/ but the diff is correct.
We should take that review as Approve, @glemaitre?
Thanks for working on this @jeremiedbb, and for all the insightful comments/discussions by reviewers.
Overall looks good to me. Below are a few comments, mostly about readability.
sklearn/metrics/pairwise.py
Outdated
d += XX_chunk
d += YY_chunk

distances[x_slice, y_slice] = d.astype(np.float32)
Add copy=False: in the case when d = distances[y_slice, x_slice].T it is already float32, and astype makes a copy by default.
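A quick illustration of astype's default copy behaviour:

import numpy as np

a = np.zeros(3, dtype=np.float32)
b = a.astype(np.float32)              # dtype already matches, but still copies
c = a.astype(np.float32, copy=False)  # no copy when the dtype already matches
print(a is b, a is c)                 # False True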
This function only computes euclidean distances in float64 and downcasts to float32 at the end, so there will always be a copy. Actually I find it clearer not to add the copy=False parameter, to emphasize that.
if X is Y and j < i:
    # when X is Y the distance matrix is symmetric so we only need
    # to compute half of it.
    d = distances[y_slice, x_slice].T
It must be correct since the corresponding tests pass, but after thinking about it for 20s I'm still not confident that when we compute distances[x_slice, y_slice], distances[y_slice, x_slice] is already computed, is not null and can be copied from.
How about just adding a continue here, then adding a second loop below:

if X is Y:
    # the result is symmetric, copy distances from the lower triangular matrix
    for i, x_slice in enumerate(x_batches):
        for j, y_slice in enumerate(y_batches):
            if j < i:
                distances[x_slice, y_slice] = distances[y_slice, x_slice].T

It shouldn't matter too much for run time I think, but it might make things a bit more explicit?
still not confident that when we compute distances[x_slice, y_slice], distances[y_slice, x_slice] is already computed

It is, since we compute the chunks with i (rows) as the outer loop. This implies that all chunks of the upper-right triangular part of the distance matrix are computed before their symmetric chunks in the lower-left triangular part.
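A toy version of that loop (illustrative only, not the PR's code) showing that with i as the outer loop, block (j, i) is always filled before block (i, j) needs it:

import numpy as np
from sklearn.utils import gen_batches

rng = np.random.RandomState(0)
X = rng.rand(7, 3)
D = np.empty((7, 7))
batches = list(gen_batches(7, 3))
for i, xs in enumerate(batches):
    for j, ys in enumerate(batches):
        if j < i:
            # symmetric block, computed during an earlier outer iteration
            D[xs, ys] = D[ys, xs].T
        else:
            D[xs, ys] = np.sqrt(((X[xs, None, :] - X[None, ys, :]) ** 2).sum(-1))
print(np.allclose(D, D.T))  # True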
sklearn/metrics/pairwise.py
Outdated
y_density = Y.nnz / np.prod(Y.shape) if issparse(Y) else 1

# Allow 10% more memory than X, Y and the distance matrix take (at least
# 10Mib)
Can we use MiB (or MB), as for all other cache sizes in scikit-learn? The Mib unit can lead to confusion when comparing to the CPU L3 cache size or to working_memory.
# 10Mib)
maxmem = max(
    ((x_density * n_samples_X + y_density * n_samples_Y) * n_features
     + (x_density * n_samples_X * y_density * n_samples_Y)) / 10,
This makes the assumption that sizeof(dtype) == 8 without making it appear in the computation, I think? Maybe we could add X.dtype.itemsize / 8, even if that is equal to 1, to make it easier to follow what is going on.
Also, not that it matters too much, but
- for dense, isn't that 5% of (X, Y, distance), as those are float32? i.e. X.nbytes == x_density * n_samples_X * 4 / 8
- for sparse, it's roughly 10% if it is CSR and we account for X.data, X.indptr in 32 bit, but again it's far from obvious for a casual reader.
Actually, maybe:

def get_array_nbytes(X):
    if issparse(X):
        if hasattr(X, 'indices'):
            # CSR or CSC
            return X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
        else:
            # some other sparse format, assume 8 bytes to index
            # one non zero element (e.g. COO)
            return X.data.nbytes + X.nnz * 8
    else:
        return X.nbytes

maxmem = (get_array_nbytes(X) + get_array_nbytes(Y) + get_array_nbytes(distances)) / 10
This makes the assumption that sizeof(dtype) == 8 without making it appear in the computation, I think?

This function is only used for float32 X and Y. I've explained it in its docstring.

for sparse it's roughly 10% if it is CSR and we account for X.data, X.indptr in 32 bit, but again it's far from obvious for a casual reader.

When you change the type of the csr matrix, only a copy of data is made. indices and indptr are still int, so I think this formula is correct.
When you change the type of the csr matrix, only a copy of data is made.
Are you sure?
>>> import scipy.sparse
>>> import numpy as np
>>> X = scipy.sparse.csr_matrix(np.ones((10, 10)))
>>> X.dtype
dtype('float64')
>>> Y = X.astype(np.float32)
>>> np.shares_memory(Y.data, Y.data)
True
>>> np.shares_memory(Y.data, X.data)
False
>>> np.shares_memory(Y.indices, X.indices)
False
That's not critical; my point here is that it might be better (not necessarily in this PR) to use ndarray.nbytes than trying to re-compute that from scratch. If we get this calculation wrong we likely won't know, since no test will fail and we will only reason with wrong memory cache sizes.
Remaining: the comments of @rth.
I made the astype(copy=False) change under the assumption that this is the only substantive change blocking merge.
Thanks, feel free to merge if this PR is blocking the release.
I consider #13739 to be the blocker.
I'm ok leaving astype(copy=False), but I want to remind that this function is only used on float32 data, so the copy will always be made, and I find it more confusing since it may give the impression that it's possible to avoid the copy.
So then remove the copy=True. Thanks. Happy to merge when you think you have addressed Roman's comments.
I think I have. @rth, do you agree with my answers to your comments?
In general a copy will be made, but when [...] which does make 1 extra unnecessary copy since [...]

Yes, hence my comments about splitting that part of the code a bit more. But it's not critical.
You're right. I put the [...]
Last comment; apart from that, LGTM.
Merging. Thanks @jeremiedbb and @Celelibi!
Fixes #9354
Closes #11271, closes #12128
This takes only the chunk upcast part of #13410: when the input is float32, compute distances on chunks of X and Y upcast to float64.
Extra memory usage is 10% of {X, Y, dist_matrix}, with a minimum of 10 MiB.