Relax strict floating point equality assertion #12574
Conversation
It looks like a good change... let's find out if it also fixes the issue.
Good catch. Do you think we should relax this to the following? `assert_almost_equal(bandwidth, 0, decimal=6)`
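For context, the relaxed assertion tolerates tiny rounding noise that strict equality would reject. A minimal sketch (the bandwidth value below is illustrative, not taken from the PR):

```python
import numpy as np
from numpy.testing import assert_almost_equal

# Illustrative value: a bandwidth that is zero up to floating point
# rounding, as can happen with inexact euclidean distance computations.
bandwidth = 1e-9

# Strict equality is brittle across platforms / BLAS implementations:
# assert bandwidth == 0  # would fail here

# assert_almost_equal with decimal=6 accepts any difference below
# 1.5 * 10**-6, so rounding noise at the 1e-9 level passes:
assert_almost_equal(bandwidth, 0, decimal=6)
```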
It's weird because it means that …
Also the …
The value if …
Indeed, that's probably the cause. Let me take a deeper look with different random seeds to improve the test.
…precision of our Euclidean distances
Why do we need to test multiple seeds?
So is that related to our euclidean distance issue? What's the status of that?
Because I want to make sure that the behavior we test is not specific to a particular value of the seed. Those 2 tests are really fast so it's cheap to run them for 5 different seeds.
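A sketch of what running a fast check under several seeds could look like with pytest (the test body is a placeholder property, not the actual mean-shift test from this PR):

```python
import numpy as np
import pytest

# Placeholder check run under 5 seeds; the real tests in this PR would
# exercise estimate_bandwidth / MeanShift instead.
@pytest.mark.parametrize("seed", range(5))
def test_property_holds_for_all_seeds(seed):
    rng = np.random.RandomState(seed)
    X = rng.normal(size=(20, 3))
    # Pairwise self-distances should never be negative for any seed:
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    assert (D >= 0).all()
```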
This is what was done previously: …
I suppose that this could be caused by that, but I am not 100% sure: with different seeds (that is, different data points for this test) I still get a bandwidth that is strictly exactly 0 in my environment.
#12142 is still open. The non-accurate but fast implementation based on matrix-matrix multiplication is not necessarily always problematic: for instance, for KMeans and nearest neighbors classification it's probably safe, but whenever we actually use the distance values it might be problematic. Maybe we need a switch to choose which implementation to use based on the context.
Also not sure, cannot reproduce on my environment.
We have #12136 but need to make a decision there.
Alright, but maybe we can ask contributors to try multiple seeds and use one seed in our test.
In our `euclidean_distances` we explicitly set the diagonal to 0 if `X is Y` or `Y is None`. But this would not apply if it was …
Because we chunk, we fit on …
Perhaps, but looking at the code, this is likely to use `neighbors.DistanceMetric`'s euclidean implementation, not `pairwise_distances` at all… :|
We could introduce a new … This way those tests would need to be run only once, but each time with a different seed, so as to make sure that what we test is not dependent on a particular seed value. This could introduce random false negatives though: imagine a unit test that depends on some permutation statistical test that is expected to reject a null hypothesis 999 times out of 1000, for instance. Instead we could have the test pick a random seed in the [0-9] range. The contributor would be in charge of writing a test that works for all the seeds in the [0-9] range (effectively ensuring that it's robust enough not to depend on a particular seed value), and our CI infrastructure would make sure that the test stays seed-independent by selecting a different seed in the [0-9] range at each new run. This solution would:
This would be an opt-in test fixture: if a test has a good reason to require a particular fixed seed, it's ok not to use that fixture and stick to a fixed seed, but doing so should be discouraged.
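A hypothetical sketch of such an opt-in fixture: the CI would export a seed in the [0-9] range and tests requesting the fixture must pass for every value. The fixture name and the `TEST_RANDOM_SEED` environment variable below are assumptions for illustration, not an existing scikit-learn or pytest API:

```python
import os
import numpy as np
import pytest

# Assumed convention: the CI sets TEST_RANDOM_SEED to a value in [0-9];
# locally it defaults to 0.
@pytest.fixture
def random_seed():
    return int(os.environ.get("TEST_RANDOM_SEED", "0"))

def test_seed_independent_behavior(random_seed):
    rng = np.random.RandomState(random_seed)
    X = rng.rand(10, 2)
    # The asserted property must hold for every seed in [0-9]:
    assert X.min() >= 0 and X.max() <= 1
```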
I checked by putting a breakpoint in `pairwise_distances` when running … and …
I think we should have the discussion re seeds in a separate issue.
Now I see, @ogrisel... So this may be a regression in 0.20 because `pairwise_distances_chunked` is used, which doesn't do the zeroing that `euclidean_distances` does... Perhaps it should!
Or is the problem also that `_chunked` is copying when it should be passing `X is Y` to euclidean?
Only the former. Seems to be fine with a single chunk:

```python
from sklearn.metrics.pairwise import pairwise_distances_chunked
import numpy as np

X = np.random.normal(size=(10000, 10), scale=1e10)

chunks = list(pairwise_distances_chunked(X, working_memory=1))
print(len(chunks))
print(np.diag(np.vstack(chunks)).max())

chunks = list(pairwise_distances_chunked(X, working_memory=10000))
print(len(chunks))
print(np.diag(np.vstack(chunks)).max())
```
The easiest is to do this in …
It's hard to robustly detect that someone is using … The brittle fix would be:

```python
if (X is Y or Y is None) and metric == "euclidean":
    D_chunk[np.arange(D_chunk.shape[0]), sl] = 0
```
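To make the intent concrete, here is a pure-NumPy sketch of the same idea with the chunk's row offset tracked explicitly (inside `pairwise_distances_chunked` the equivalent offset is the slice `sl`). The quadratic-expansion formula below reproduces the fast-but-inexact distance computation discussed in this thread; it is an illustration, not scikit-learn's actual code:

```python
import numpy as np

def chunked_euclidean_self_distances(X, chunk_size):
    """Chunked euclidean distances of X against itself, with the
    self-distance diagonal of each chunk forced to zero."""
    sq_norms = (X ** 2).sum(axis=1)
    chunks = []
    for start in range(0, X.shape[0], chunk_size):
        X_chunk = X[start:start + chunk_size]
        # Fast-but-inexact expansion ||a-b||^2 = ||a||^2 - 2 a.b + ||b||^2,
        # the source of the tiny nonzero self-distances discussed above:
        sq = (sq_norms[start:start + chunk_size, None]
              - 2 * X_chunk @ X.T + sq_norms[None, :])
        D_chunk = np.sqrt(np.maximum(sq, 0))
        # The fix: entry (i, start + i) of this chunk is a self-distance.
        n_rows = X_chunk.shape[0]
        D_chunk[np.arange(n_rows), start + np.arange(n_rows)] = 0
        chunks.append(D_chunk)
    return np.vstack(chunks)

# With the diagonal zeroed, the max self-distance is exactly 0 even for
# data where the quadratic expansion loses precision:
X = np.random.RandomState(0).normal(size=(100, 10), scale=1e10)
D = chunked_euclidean_self_distances(X, chunk_size=16)
assert np.diag(D).max() == 0
```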
`if PAIRWISE_DISTANCE_FUNCTIONS.get(metric, metric) is euclidean_distances`
Whoops, I forgot that passing … `if PAIRWISE_DISTANCE_FUNCTIONS.get(metric, None) is euclidean_distances`
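The difference between the two lookups can be shown with a stand-in mapping (the dict and placeholder function below are toys, not the real `PAIRWISE_DISTANCE_FUNCTIONS` or `euclidean_distances`):

```python
# Toy stand-in for sklearn's PAIRWISE_DISTANCE_FUNCTIONS registry:
def euclidean_distances(X, Y=None):
    """Placeholder for sklearn.metrics.pairwise.euclidean_distances."""

PAIRWISE_DISTANCE_FUNCTIONS = {"euclidean": euclidean_distances}

# Both variants detect the string metric name:
assert PAIRWISE_DISTANCE_FUNCTIONS.get("euclidean", None) is euclidean_distances

# But callers may also pass the callable itself as `metric`.
# get(metric, metric) falls through to the callable and still matches:
metric = euclidean_distances
assert PAIRWISE_DISTANCE_FUNCTIONS.get(metric, metric) is euclidean_distances
# get(metric, None) misses that case:
assert PAIRWISE_DISTANCE_FUNCTIONS.get(metric, None) is None
```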
Closing in favor of #12612. The discussion on better handling of seeds in tests is more general and will happen elsewhere.
This is a fix for #10561: `test_estimate_bandwidth_1sample` fails on power / ppc architectures.