Relax strict floating point equality assertion #12574


Conversation

@ogrisel (Member) commented Nov 12, 2018

This is a fix for #10561: test_estimate_bandwidth_1sample fails on power / ppc architectures.

@jnothman (Member) left a comment

It looks like a good change... let's find out if it also fixes the issue.

@qinhanmin2014 (Member)

FYI @jnothman @ogrisel, I don't think it fixes the issue; see:

assert_almost_equal(2.384185791015625e-07, 0)
# AssertionError: 
# Arrays are not almost equal to 7 decimals
#  ACTUAL: 2.384185791015625e-07
#  DESIRED: 0

Though I think we can merge this one.

@ogrisel (Member, Author) commented Nov 13, 2018

Good catch, do you think we should relax this to the following?

assert_almost_equal(bandwidth, 0, decimal=6)

@ogrisel (Member, Author) commented Nov 13, 2018

It's weird because it means that pairwise_distances does not return exactly 0 for the distance of a single sample to itself.

@ogrisel (Member, Author) commented Nov 13, 2018

Also, the dtype is float64 here, so the distance should be much closer to 0.

@ogrisel (Member, Author) commented Nov 13, 2018

The value of X in my case is np.array([[11.0129137, 8.74284498]], dtype=np.float64). I don't understand how pairwise_distances(X, squared=True) could return something on the order of 2e-7 in 64-bit floats.

@qinhanmin2014 (Member) commented Nov 13, 2018

Yes, that's strange, especially considering that we're using metric='minkowski' from scipy.
Our euclidean distance implementation is known to be imprecise; maybe that's the reason.
I'll vote +1 for assert_almost_equal(bandwidth, 0, decimal=6); maybe also add a comment or leave the issue open.

@ogrisel (Member, Author) commented Nov 13, 2018

Indeed, that's probably the cause. Let me have a deeper look with different random seeds to improve the test.

@qinhanmin2014 (Member)

Why do we need to test multiple seeds?
Maybe we can use a fixed random state and set the tolerance accordingly?

@amueller (Member)

So is that related to our euclidean distance issue? What's the status of that?

@ogrisel (Member, Author) commented Nov 13, 2018

Why do we need to test multiple seeds?

Because I want to make sure that the behavior we test is not specific to a particular value of the seed. Those two tests are really fast, so it's cheap to run them for five different seeds.

Maybe we can use a fixed random state and set the tolerance accordingly?

This is what was done previously: random_state=0 by default in estimate_bandwidth.

@ogrisel (Member, Author) commented Nov 13, 2018

So is that related to our euclidean distance issue?

I suppose this could be caused by that, but I am not 100% sure: with different seeds (that is, different data points for this test) I still get a bandwidth that is exactly 0 in my environment.

What's the status of that?

#12142 is still open. The fast but inaccurate implementation based on matrix-matrix multiplication is not always problematic: for KMeans and nearest neighbors classification, for instance, it's probably safe, but whenever we actually use the distance values it might be an issue. Maybe we need a switch to choose which implementation to use based on the context, as sketched below.
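
Here is a minimal sketch of what such a switch could look like; the function name and the exact keyword are hypothetical, not an actual scikit-learn API:

import numpy as np

def squared_euclidean(X, Y, exact=False):
    # Hypothetical switch between the two implementations.
    if exact:
        # Subtract first, then square: accurate, but no BLAS GEMM speedup.
        diff = X[:, None, :] - Y[None, :, :]
        return np.einsum("ijk,ijk->ij", diff, diff)
    # GEMM expansion ||x||^2 - 2 x.y + ||y||^2: fast thanks to matrix
    # multiplication, but cancellation can leave noise when x is close to y.
    xx = np.einsum("ij,ij->i", X, X)[:, None]
    yy = np.einsum("ij,ij->i", Y, Y)[None, :]
    return np.maximum(xx - 2 * (X @ Y.T) + yy, 0.0)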

@qinhanmin2014 (Member)

So is that related to our euclidean distance issue?

Also not sure; I cannot reproduce it in my environment.

Maybe we need a switch to choose on which implementation to use based on the context.

We have #12136 but need to make a decision there.

@qinhanmin2014 (Member)

Because I want to make sure that the behavior we test is not specific to a particular value of the seed. Those 2 tests are really fast so it's cheap to run them for 5 different seeds.

Alright, but maybe we can ask contributors to try multiple seeds and use one seed in our test.

@jnothman (Member)

In our euclidean_distances we explicitly set the diagonal to 0 if X is Y or Y is None. But this would not apply with metric='minkowski', p=2, or if effective_n_jobs(n_jobs) > 1. I can't see other cases where this should occur.
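
For reference, the zeroing being referred to amounts to something like this (a simplified sketch, not the exact scikit-learn source; the helper name is illustrative):

import numpy as np

def _zero_self_distances(distances, X, Y=None):
    if Y is None or Y is X:
        # A sample's distance to itself is 0 by definition, so overwrite
        # any floating point noise on the diagonal.
        np.fill_diagonal(distances, 0.0)
    return distances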

@ogrisel (Member, Author) commented Nov 14, 2018

Because we chunk, we fit on X and find neighbors on X[chunk_slice], so the X is X[chunk_slice] identity check is false and we don't set the diagonal to zero explicitly in this case. A small illustration follows.
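
A tiny illustration of why the identity check fails on a slice (illustrative only):

import numpy as np

X = np.arange(10.0).reshape(5, 2)
chunk = X[0:2]           # slicing returns a new array object (a view)
print(chunk is X)        # False: the identity check fails
print(chunk.base is X)   # True: the data is still shared with X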

@jnothman (Member) left a comment

Perhaps, but looking at the code, this is likely to use neighbors.DistanceMetric's euclidean implementation, not pairwise_distances at all.... :|

@ogrisel (Member, Author) commented Nov 14, 2018

Alright, but maybe we can ask contributors to try multiple seeds and use one seed in our test.

We could introduce a new random_seed fixture that provides a different seed on each run to each test that wants a variable seed, while still ensuring test reproducibility and test isolation, based on the global run-dependent _random_seed that we set up in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/__init__.py#L93.

This way those tests would only need to be run once, but each time with a different seed, to make sure that what we test does not depend on a particular seed value.

This could introduce random false negatives though: imagine a unit test that depends on some permutation statistical test that is expected to reject a null hypothesis 999 times out of 1000, for instance. Instead we could have the test pick up a random seed in the [0-9] range. The contributor would be in charge of writing a test that works for all the seeds in the [0-9] range (effectively ensuring that it's robust enough not to depend on a particular seed value), and our CI infrastructure would make sure that the test stays seed-independent by selecting a different seed in the [0-9] range on each new run. This solution would:

  • ensure that those tests do not rely on a particular seed, and that this lack of seed-dependency is enforced over time (non-regression tests);
  • not introduce any computational overhead vs. fixing the seed: each run still uses a single seed;
  • not introduce any new false negatives (expected random failures), because the test is designed to work with all the seeds in the [0-9] range to begin with;
  • if a regression is ever introduced, make it easy to debug: set the SKLEARN_SEED environment variable to the failing seed and launch a debugger to understand what's going on with that particular value.

This would be an opt-in test fixture: if a test has a good reason to require a particular fixed seed, it's OK to skip the fixture and stick to a fixed seed, but that should be discouraged. A sketch follows.
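
A minimal sketch of what such a fixture could look like, assuming the seed is derived from the SKLEARN_SEED environment variable as described above (hypothetical code, not the final design):

# conftest.py (sketch)
import os

import numpy as np
import pytest

@pytest.fixture
def random_seed():
    # Run-dependent seed constrained to [0-9], so a contributor can
    # exhaustively verify every seed the test may receive; exporting
    # SKLEARN_SEED makes a failing run reproducible.
    return int(os.environ.get("SKLEARN_SEED", 0)) % 10

# Usage in a test module:
def test_property_holds_for_any_seed(random_seed):
    rng = np.random.RandomState(random_seed)
    X = rng.normal(size=(100, 2))
    # The assertion must hold for every seed in [0-9].
    assert np.isfinite(X).all()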

@ogrisel (Member, Author) commented Nov 15, 2018

Perhaps, but looking at the code, this is likely to use neighbors.DistanceMetric's euclidean implementation, not pairwise_distances at all.... :|

I checked by putting a breakpoint in pairwise_distances when running test_estimate_bandwidth_1sample and I got the following traceback:

pairwise_distances (/volatile/ogrisel/code/scikit-learn/sklearn/metrics/pairwise.py:1397)
pairwise_distances_chunked (/volatile/ogrisel/code/scikit-learn/sklearn/metrics/pairwise.py:1273)
kneighbors (/volatile/ogrisel/code/scikit-learn/sklearn/neighbors/base.py:435)
estimate_bandwidth (/volatile/ogrisel/code/scikit-learn/sklearn/cluster/mean_shift_.py:82)
test_estimate_bandwidth_1sample (/volatile/ogrisel/code/scikit-learn/sklearn/cluster/tests/test_mean_shift.py:39)

and func is set to euclidean_distances, which uses the BLAS GEMM trick for dense numpy arrays.
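
For context, the GEMM trick expands ||x - y||^2 into ||x||^2 - 2 x.y + ||y||^2, and the large terms can cancel catastrophically when x and y are close. A small illustration (not scikit-learn code; the values are made up to expose the effect):

import numpy as np

x = np.array([10000.0, 0.001])
y = np.array([10000.0, 0.002])

# Exact formulation: subtract first, then square.
exact = np.sum((x - y) ** 2)            # 1e-06

# Expanded formulation used by the GEMM trick: the ~1e8-sized terms
# cancel, and rounding in the dot products can leave an absolute error
# around ||x||^2 * eps ~ 1e-8, which is large relative to the true 1e-6.
expanded = x @ x - 2 * (x @ y) + y @ y
print(exact, expanded)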

@amueller (Member)

I think we should have the discussion re seeds in a separate issue.
You could add a note here to make the test more strict and/or check on it once we "fixed" euclidean distances.

@jnothman (Member) commented Nov 17, 2018 via email

@jnothman (Member) commented Nov 17, 2018 via email

@amueller amueller added this to the 0.20.1 milestone Nov 18, 2018
@amueller (Member)

only the former. Seems to be fine with a single chunk:

from sklearn.metrics.pairwise import pairwise_distances_chunked
import numpy as np

X = np.random.normal(size=(10000, 10), scale=1e10)

# A tiny working memory forces many chunks; self-distances on the
# diagonal are then not explicitly zeroed.
chunks = list(pairwise_distances_chunked(X, working_memory=1))
print(len(chunks))
print(np.diag(np.vstack(chunks)).max())

770
1448.1546878700494

# A large working memory yields a single chunk, so X is Y holds inside
# euclidean_distances and the diagonal is zeroed.
chunks = list(pairwise_distances_chunked(X, working_memory=10000))
print(len(chunks))
print(np.diag(np.vstack(chunks)).max())

1
0.0

@amueller (Member)

The easiest is to do this in pairwise_distances_chunked as post-processing, if the metric is actually a metric, right? We've only done it in euclidean_distances before, from what I can see. Does scipy do it in pdist (and/or cdist)?

@amueller (Member) commented Nov 18, 2018

It's hard to robustly detect that someone is using euclidean_distances, though, since they could be passing the function and/or any of its aliases. So doing the post-processing in pairwise_distances_chunked is fragile. But otherwise we have to pass the slicing down all the way to the metric function, which would mean adding it in many places in the API :-/

The brittle fix would be

        if (X is Y or Y is None) and metric == "euclidean":
            D_chunk[np.arange(D_chunk.shape[0]), sl] = 0

Maybe we could do

if PAIRWISE_DISTANCE_FUNCTIONS.get(metric, metric) is euclidean_distances:

which is slightly better?

Whoops, I forgot that passing euclidean_distances wasn't allowed, so

if PAIRWISE_DISTANCE_FUNCTIONS.get(metric, None) is euclidean_distances:

@ogrisel (Member, Author) commented Nov 20, 2018

Closing in favor of #12612. The discussion on better handling of seeds in tests is more general and will happen elsewhere.

@ogrisel ogrisel closed this Nov 20, 2018
@ogrisel ogrisel deleted the fix-strict-estimate_bandwidth_1sample branch November 20, 2018 21:19