Relax strict floating point equality assertion #12574


Conversation

@ogrisel (Member) commented Nov 12, 2018

This is a fix for #10561: test_estimate_bandwidth_1sample fails on power / ppc architectures.

@jnothman (Member) left a comment

It looks like a good change... let's find out if it also fixes the issue.

@qinhanmin2014 (Member)

FYI @jnothman @ogrisel, I don't think it fixes the issue; see:

assert_almost_equal(2.384185791015625e-07, 0)
# AssertionError: 
# Arrays are not almost equal to 7 decimals
#  ACTUAL: 2.384185791015625e-07
#  DESIRED: 0

Though I think we can merge this one.

@ogrisel (Member, Author) commented Nov 13, 2018

Good catch, do you think we should relax this to the following?

assert_almost_equal(bandwidth, 0, decimal=6)

@ogrisel (Member, Author) commented Nov 13, 2018

It's weird because it means that pairwise_distances does not return exactly 0 for the distance of a single sample to itself.

@ogrisel (Member, Author) commented Nov 13, 2018

Also, the dtype is float64 here, so the distance should be much closer to 0.

@ogrisel (Member, Author) commented Nov 13, 2018

The value of X in my case is np.array([[11.0129137, 8.74284498]], dtype=np.float64). I don't understand how pairwise_distances(X, squared=True) could return something on the order of 2e-7 in 64-bit floats.

@qinhanmin2014 (Member) commented Nov 13, 2018

Yes, that's strange, especially considering that we're using metric='minkowski' from scipy.
Our euclidean distance implementation is known to be imprecise; maybe that's the reason.
I'll vote +1 for assert_almost_equal(bandwidth, 0, decimal=6); maybe also add a comment or leave the issue open.

@ogrisel (Member, Author) commented Nov 13, 2018

Indeed, that's probably the cause. Let me have a deeper look with different random seeds to improve the test.

@qinhanmin2014 (Member)

Why do we need to test multiple seeds?
Maybe we can use a fixed random state and set the tolerance accordingly?

@amueller (Member)

So is that related to our euclidean distance issue? What's the status of that?

@ogrisel (Member, Author) commented Nov 13, 2018

Why do we need to test multiple seeds?

Because I want to make sure that the behavior we test is not specific to a particular value of the seed. Those two tests are really fast, so it's cheap to run them for five different seeds.

Maybe we can use a fixed random state and set the tolerance accordingly?

This is what was done previously: random_state=0 by default in estimate_bandwidth.

@ogrisel (Member, Author) commented Nov 13, 2018

So is that related to our euclidean distance issue?

I suppose this could be caused by that, but I am not 100% sure: with different seeds (that is, different data points for this test) I still get a bandwidth that is exactly 0 in my environment.

What's the status of that?

#12142 is still open. The fast but inaccurate implementation based on matrix-matrix multiplication is not always problematic: for KMeans and nearest neighbors classification, for instance, it's probably safe, but whenever we actually use the distance values it might be an issue. Maybe we need a switch to choose which implementation to use based on the context, as sketched below.
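
Here is a minimal sketch of what such a switch could look like; the function name and the exact keyword are hypothetical, not an actual scikit-learn API:

import numpy as np

def squared_euclidean(X, Y, exact=False):
    # Hypothetical switch between the two implementations.
    if exact:
        # Subtract first, then square: accurate, but no BLAS GEMM speedup.
        diff = X[:, None, :] - Y[None, :, :]
        return np.einsum("ijk,ijk->ij", diff, diff)
    # GEMM expansion ||x||^2 - 2 x.y + ||y||^2: fast thanks to matrix
    # multiplication, but cancellation can leave noise when x is close to y.
    xx = np.einsum("ij,ij->i", X, X)[:, None]
    yy = np.einsum("ij,ij->i", Y, Y)[None, :]
    return np.maximum(xx - 2 * (X @ Y.T) + yy, 0.0)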

@qinhanmin2014 (Member)

So is that related to our euclidean distance issue?

Also not sure; I cannot reproduce it in my environment.

Maybe we need a switch to choose on which implementation to use based on the context.

We have #12136 but need to make a decision there.

@qinhanmin2014 (Member)

Because I want to make sure that the behavior we test is not specific to a particular value of the seed. Those 2 tests are really fast so it's cheap to run them for 5 different seeds.

Alright, but maybe we can ask contributors to try multiple seeds and use one seed in our test.

@jnothman (Member)

In our euclidean_distances we explicitly set the diagonal to 0 if X is Y or Y is None. But this would not apply with metric='minkowski', p=2, or if effective_n_jobs(n_jobs) > 1. I can't see other cases where this should occur.
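
For reference, the zeroing being referred to amounts to something like this (a simplified sketch, not the exact scikit-learn source; the helper name is illustrative):

import numpy as np

def _zero_self_distances(distances, X, Y=None):
    if Y is None or Y is X:
        # A sample's distance to itself is 0 by definition, so overwrite
        # any floating point noise on the diagonal.
        np.fill_diagonal(distances, 0.0)
    return distances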

@ogrisel (Member, Author) commented Nov 14, 2018

Because we chunk, we fit on X and find neighbors on X[chunk_slice], so the X is X[chunk_slice] identity check is false and we don't set the diagonal to zero explicitly in this case. A small illustration follows.
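
A tiny illustration of why the identity check fails on a slice (illustrative only):

import numpy as np

X = np.arange(10.0).reshape(5, 2)
chunk = X[0:2]           # slicing returns a new array object (a view)
print(chunk is X)        # False: the identity check fails
print(chunk.base is X)   # True: the data is still shared with X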

@jnothman (Member) left a comment

Perhaps, but looking at the code, this is likely to use neighbors.DistanceMetric's euclidean implementation, not pairwise_distances at all.... :|

@ogrisel (Member, Author) commented Nov 14, 2018

Alright, but maybe we can ask contributors to try multiple seeds and use one seed in our test.

We could introduce a new random_seed fixture that provides a different seed on each run to each test that wants a variable seed, while still ensuring test reproducibility and test isolation, based on the global run-dependent _random_seed that we set up in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/__init__.py#L93.

This way those tests would only need to be run once, but each time with a different seed, to make sure that what we test does not depend on a particular seed value.

This could introduce random false negatives though: imagine a unit test that depends on some permutation statistical test that is expected to reject a null hypothesis 999 times out of 1000, for instance. Instead we could have the test pick up a random seed in the [0-9] range. The contributor would be in charge of writing a test that works for all the seeds in the [0-9] range (effectively ensuring that it's robust enough not to depend on a particular seed value), and our CI infrastructure would make sure that the test stays seed-independent by selecting a different seed in the [0-9] range on each new run. This solution would:

  • ensure that those tests do not rely on a particular seed, and that this lack of seed-dependency is enforced over time (non-regression tests);
  • not introduce any computational overhead vs. fixing the seed: each run still uses a single seed;
  • not introduce any new false negatives (expected random failures), because the test is designed to work with all the seeds in the [0-9] range to begin with;
  • if a regression is ever introduced, make it easy to debug: set the SKLEARN_SEED environment variable to the failing seed and launch a debugger to understand what's going on with that particular value.

This would be an opt-in test fixture: if a test has a good reason to require a particular fixed seed, it's OK to skip the fixture and stick to a fixed seed, but that should be discouraged. A sketch follows.
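
A minimal sketch of what such a fixture could look like, assuming the seed is derived from the SKLEARN_SEED environment variable as described above (hypothetical code, not the final design):

# conftest.py (sketch)
import os

import numpy as np
import pytest

@pytest.fixture
def random_seed():
    # Run-dependent seed constrained to [0-9], so a contributor can
    # exhaustively verify every seed the test may receive; exporting
    # SKLEARN_SEED makes a failing run reproducible.
    return int(os.environ.get("SKLEARN_SEED", 0)) % 10

# Usage in a test module:
def test_property_holds_for_any_seed(random_seed):
    rng = np.random.RandomState(random_seed)
    X = rng.normal(size=(100, 2))
    # The assertion must hold for every seed in [0-9].
    assert np.isfinite(X).all()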

@ogrisel (Member, Author) commented Nov 15, 2018

Perhaps, but looking at the code, this is likely to use neighbors.DistanceMetric's euclidean implementation, not pairwise_distances at all.... :|

I checked by putting a breakpoint in pairwise_distances when running test_estimate_bandwidth_1sample and I got the following traceback:

pairwise_distances (/volatile/ogrisel/code/scikit-learn/sklearn/metrics/pairwise.py:1397)
pairwise_distances_chunked (/volatile/ogrisel/code/scikit-learn/sklearn/metrics/pairwise.py:1273)
kneighbors (/volatile/ogrisel/code/scikit-learn/sklearn/neighbors/base.py:435)
estimate_bandwidth (/volatile/ogrisel/code/scikit-learn/sklearn/cluster/mean_shift_.py:82)
test_estimate_bandwidth_1sample (/volatile/ogrisel/code/scikit-learn/sklearn/cluster/tests/test_mean_shift.py:39)

and func is set to euclidean_distances, which uses the BLAS GEMM trick for dense numpy arrays.
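
For context, the GEMM trick expands ||x - y||^2 into ||x||^2 - 2 x.y + ||y||^2, and the large terms can cancel catastrophically when x and y are close. A small illustration (not scikit-learn code; the values are made up to expose the effect):

import numpy as np

x = np.array([10000.0, 0.001])
y = np.array([10000.0, 0.002])

# Exact formulation: subtract first, then square.
exact = np.sum((x - y) ** 2)            # 1e-06

# Expanded formulation used by the GEMM trick: the ~1e8-sized terms
# cancel, and rounding in the dot products can leave an absolute error
# around ||x||^2 * eps ~ 1e-8, which is large relative to the true 1e-6.
expanded = x @ x - 2 * (x @ y) + y @ y
print(exact, expanded)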

@amueller (Member)

I think we should have the discussion re seeds in a separate issue.
You could add a note here to make the test more strict and/or check on it once we "fixed" euclidean distances.

@jnothman (Member) commented Nov 17, 2018 via email

@jnothman (Member) commented Nov 17, 2018 via email

@amueller amueller added this to the 0.20.1 milestone Nov 18, 2018
@amueller (Member)

only the former. Seems to be fine with a single chunk:

from sklearn.metrics.pairwise import pairwise_distances_chunked
import numpy as np

X = np.random.normal(size=(10000, 10), scale=1e10)

# A tiny working memory forces many chunks; self-distances on the
# diagonal are then not explicitly zeroed.
chunks = list(pairwise_distances_chunked(X, working_memory=1))
print(len(chunks))
print(np.diag(np.vstack(chunks)).max())

770
1448.1546878700494

# A large working memory yields a single chunk, so X is Y holds inside
# euclidean_distances and the diagonal is zeroed.
chunks = list(pairwise_distances_chunked(X, working_memory=10000))
print(len(chunks))
print(np.diag(np.vstack(chunks)).max())

1
0.0

@amueller (Member)

The easiest is to do this in pairwise_distances_chunked as post-processing, if the metric is actually a metric, right? We've only done it in euclidean_distances before, from what I can see. Does scipy do it in pdist (and/or cdist)?

@amueller (Member) commented Nov 18, 2018

It's hard to robustly detect that someone is using euclidean_distances, though, since they could be passing the function and/or any of its aliases. So doing the post-processing in pairwise_distances_chunked is fragile. But otherwise we have to pass the slicing down all the way to the metric function, which would mean adding it in many places in the API :-/

The brittle fix would be

        if (X is Y or Y is None) and metric == "euclidean":
            D_chunk[np.arange(D_chunk.shape[0]), sl] = 0

Maybe we could do

if PAIRWISE_DISTANCE_FUNCTIONS.get(metric, metric) is euclidean_distances:

which is slightly better?

Whoops, I forgot that passing euclidean_distances wasn't allowed, so

if PAIRWISE_DISTANCE_FUNCTIONS.get(metric, None) is euclidean_distances:

@ogrisel (Member, Author) commented Nov 20, 2018

Closing in favor of #12612. The discussion on better handling of seeds in tests is more general and will happen elsewhere.

@ogrisel ogrisel closed this Nov 20, 2018
@ogrisel ogrisel deleted the fix-strict-estimate_bandwidth_1sample branch November 20, 2018 21:19