[WIP] Stable and fast float32 implementation of euclidean_distances #11271
Conversation
Force-pushed from 82f16a1 to ac5d1fb
Ping when you feel it's ready for review!
Thanks
@jnothman, now that all tests pass, I think it's ready for review.
This is certainly very precise, but perhaps not so elegant! I think it looks pretty good, but I'll have to look again later.
Could you report some benchmarks here please? Thanks
sklearn/metrics/pairwise.py
Outdated
```python
swap = False

# Maximum number of additional bytes to use to cast X and Y to float64.
maxmem = 10*1024*1024
```
please write down that this is 10MB
Why do you believe this is a reasonable limit?
See the benchmarks below that vary the memory size.
sklearn/metrics/pairwise.py
Outdated
```python
    return distances if squared else np.sqrt(distances, out=distances)


def _cast_if_needed(arr, dtype):
```
We'd usually put this sort of thing in sklearn.utils.fixes, which helps us find it for removal when our minimum dependencies change.
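For reference, a minimal sketch of what such a casting helper might look like (illustrative only, not necessarily the PR's exact implementation):

```python
import numpy as np

def _cast_if_needed(arr, dtype):
    # Return arr unchanged when it already has the requested dtype;
    # otherwise return a copy cast to that dtype.
    if arr is None or arr.dtype == dtype:
        return arr
    return arr.astype(dtype)
```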
sklearn/metrics/pairwise.py
Outdated
```python
if X.dtype != np.float64:
    XYmem += X.shape[1]
if Y.dtype != np.float64:
    XYmem += Y.shape[1]
```
Perhaps make this X.shape[1] just to reduce reader surprise since they should be equal...
On the other hand, it's under a test for `Y.dtype`. So `X.shape[1]` would be equally surprising (at least to me). And this is to account for the memory occupied by the float64 copy of `Y`. So semantically, it seems more appropriate to me to use `Y.shape[1]`.
I might add a comment or something to make it more clear.
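For instance, something along these lines (the comment wording is hypothetical):

```python
if X.dtype != np.float64:
    # Account for the float64 copy of a block of X.
    XYmem += X.shape[1]
if Y.dtype != np.float64:
    # Likewise for the float64 copy of a block of Y; Y.shape[1] equals
    # X.shape[1] here, but naming Y makes the intent explicit.
    XYmem += Y.shape[1]
```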
Force-pushed from c22b0bc to 98e88d7
Force-pushed from 98e88d7 to 0fd279a
I had to rebase my branch and force-push it anyway for the auto-merge to succeed and the tests to pass. @jnothman, wanna have a look at the benchmarks and discuss the mergeability of those commits?
yes, this got forgotten, I'll admit.
but I may not find time over the next week. ping if necessary
@rth, you may be interested in this PR, btw.
IMO, we broke euclidean distances for float32 in 0.19 (thinking that
avoiding the upcast was a good idea), and should prioritise fixing it.
Very nice benchmarks and PR @Celelibi! It would be useful to also test the net effect of this PR on the performance of e.g. KMeans / Birch, as suggested in #10069 (comment). I'm not yet sure how this would interact with
A few more comments..
```python
# - a float64 copy of a block of X_norm_squared if needed;
# - a float64 copy of a block of Y_norm_squared if needed.
# This is a quadratic equation that we solve to compute the block size that
# would use maxmem bytes.
```
What happens if we use a bit more? Does the performance decrease?
The performance decreases a bit, yes. But it's only noticeable for a maxmem several orders of magnitude larger than required. If you look at the first set of plots, you'll see a trend for larger memory sizes.
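For illustration, here is a minimal sketch of how a block size can be derived from a memory budget. The memory model and the name `_block_size` are assumptions for this example, not the PR's exact code:

```python
import math

def _block_size(n_features, maxmem=10 * 1024 * 1024):
    # Rough float64 memory model for one block pair (8 bytes per element):
    #   8 * bs**2               -> the bs x bs block of the distance matrix
    #   2 * 8 * n_features * bs -> float64 copies of the X and Y blocks
    # Solve 8*bs**2 + 16*n_features*bs - maxmem = 0 for the largest block
    # size that fits in the budget (positive root of the quadratic).
    a = 8.0
    b = 16.0 * n_features
    bs = (-b + math.sqrt(b * b + 4.0 * a * maxmem)) / (2.0 * a)
    return max(1, int(bs))

# With 100 features and the 10MB default, this gives blocks of roughly
# a thousand rows.
print(_block_size(100))
```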
```python
distances[i:ipbs, j:jpbs] = d

if X is Y and j > i:
    distances[j:jpbs, i:ipbs] = d.T
```
Line not covered by tests, probably need to add a test for this
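A hypothetical test along these lines could cover it, assuming the input is large enough that the computation spans several blocks:

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.metrics.pairwise import euclidean_distances

def test_euclidean_distances_symmetric_blocks():
    rng = np.random.RandomState(0)
    # Enough rows that X gets split into several blocks, so the
    # "X is Y and j > i" mirroring branch actually runs.
    X = rng.random_sample((2000, 50)).astype(np.float32)
    D = euclidean_distances(X)
    assert_allclose(D, D.T, rtol=1e-6)
```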
```python
if X_norm_squared is None and Y_norm_squared is not None:
    X_norm_squared = Y_norm_squared.T
if X_norm_squared is not None and Y_norm_squared is None:
    Y_norm_squared = X_norm_squared.T
```
Also not covered by tests..
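Again a hypothetical sketch of a test that passes only one of the two precomputed norms while `X is Y`:

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.metrics.pairwise import euclidean_distances

def test_euclidean_distances_one_sided_norms():
    rng = np.random.RandomState(42)
    X = rng.random_sample((40, 10))
    norm_sq = (X ** 2).sum(axis=1).reshape(-1, 1)
    D = euclidean_distances(X)
    # Giving only one of the norms should exercise the branches that
    # derive the other by transposition.
    assert_allclose(euclidean_distances(X, X, X_norm_squared=norm_sq), D)
    assert_allclose(euclidean_distances(X, X, Y_norm_squared=norm_sq.T), D)
```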
```python
D32_32 = euclidean_distances(X32, Y32)
assert_equal(D64_32.dtype, np.float64)
assert_equal(D32_64.dtype, np.float64)
assert_equal(D32_32.dtype, np.float32)
```
`assert D32_32.dtype == np.float32` etc.
With the adoption of pytest, we are phasing out use of test helpers `assert_equal`, `assert_true`, etc. Please use bare `assert` statements, e.g. `assert x == y`, `assert not x`, etc.
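Applied to the snippet above, that would read:

```python
assert D64_32.dtype == np.float64
assert D32_64.dtype == np.float64
assert D32_32.dtype == np.float32
```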
```python
                           Y_norm_squared=Y32_norm_sq)
assert_greater(np.max(np.abs(DYN - D64)), .01)
assert_greater(np.max(np.abs(DXN - D64)), .01)
assert_greater(np.max(np.abs(DXYN - D64)), .01)
```
`assert np.max(np.abs(DXYN - D64)) > .01` etc.
```python
               [0.9221151778782808, 0.9681831844652774]])
D64 = euclidean_distances(X)
D32 = euclidean_distances(X.astype(np.float32))
assert_array_almost_equal(D32, D64)
```
On master I get:

```
In [5]: D64 = euclidean_distances(X)

In [6]: D64
Out[6]:
array([[0.        , 0.00054379],
       [0.00054379, 0.        ]])

In [7]: D32 = euclidean_distances(X.astype(np.float32))

In [8]: D32
Out[8]:
array([[0.        , 0.00059802],
       [0.00059802, 0.        ]], dtype=float32)
```

and `assert_array_almost_equal` uses 6 decimals by default, so since this passes, it would increase the accuracy from 1 to 3 significant digits. Maybe use `assert_allclose` with `rtol` to make it more explicit?
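For example (the exact tolerance here is a placeholder, to be tuned to what the fix actually guarantees):

```python
from numpy.testing import assert_allclose

assert_allclose(D32, D64, rtol=1e-6)
```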
This looks great for a start (i.e. as a bug fix, rather than an enhancement). I need to review some of your optimisations and benchmarks to understand whether it's worth enhancing it further.
I didn't foresee this. Well, they might chunk twice, which may have a bad impact on performance.
It wouldn't be easy to have the chunking done in only one place in the code. The best solution I could see right now would be to have some specialized chunked implementations for some metrics.
BTW, interesting, I didn't know about that. However, it looks like
That comment was about deprecating
I think given that we seem to see that this operation works well with 10MB working mem and the default working memory is 1GB, we should consider this a negligible addition, and not use the working_memory business with it.
I also think it's important to focus upon this as a bug fix, rather than something that needs to be perfected in one shot.
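For context, the working memory budget mentioned here is the global option users control via `sklearn.set_config` (added alongside the chunking work in #10280):

```python
import sklearn

# Cap the working memory of chunked pairwise computations at 1024 MiB,
# which is also the default.
sklearn.set_config(working_memory=1024)
```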
@Celelibi Thanks for the detailed response!
@jeremiedbb, @ogrisel mentioned that you ran some benchmarks demonstrating that using a smaller working memory had higher performance on your system. Would you be able to share those results (the original benchmarks are in #10280 (comment))? Maybe in a separate issue. Thanks!
From my understanding this will take some more work. Untagging 0.20.
Reference Issues/PRs
Fixes #9354
Supersedes PR #10069
What does this implement/fix? Explain your changes.
These commits implement a block-wise casting to float64 and use the older code to compute the euclidean distance matrix on the blocks. This is done using only a fixed amount of additional (temporary) memory.
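In outline, the approach resembles the following simplified sketch (the name and blocking scheme are illustrative, not the PR's exact code; the actual implementation also mirrors blocks when `X is Y` and reuses precomputed norms):

```python
import numpy as np

def blockwise_euclidean_f32(X, Y, block_size=1024, squared=False):
    # Upcast fixed-size blocks of the float32 inputs to float64 so the
    # x^2 - 2xy + y^2 expansion stays numerically stable, while keeping
    # the temporary memory bounded by the block size.
    distances = np.empty((X.shape[0], Y.shape[0]), dtype=np.float32)
    for i in range(0, X.shape[0], block_size):
        Xb = X[i:i + block_size].astype(np.float64)
        xx = (Xb * Xb).sum(axis=1)[:, np.newaxis]
        for j in range(0, Y.shape[0], block_size):
            Yb = Y[j:j + block_size].astype(np.float64)
            yy = (Yb * Yb).sum(axis=1)[np.newaxis, :]
            d = xx - 2.0 * (Xb @ Yb.T) + yy
            np.maximum(d, 0.0, out=d)  # clip tiny negatives from rounding
            distances[i:i + block_size, j:j + block_size] = d
    return distances if squared else np.sqrt(distances, out=distances)
```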
Any other comments?
This code implements several optimizations:

- when `X is Y`, copy the blocks of the upper triangle to the lower triangle;
- cast the blocks of `{X,Y}_norm_squared` to float64;
- compute `X_norm_squared` if not given, so it gets reused through the iterations over `Y`;
- swap `X` and `Y` when `X_norm_squared` is given but not `Y_norm_squared`.

Note that all the optimizations listed here have proven useful in a benchmark. The hardcoded amount of additional memory of 10MB is also derived from a benchmark.
As a side bonus, this implementation should also support float16 out of the box, should scikit-learn support it at some point.