MAINT Adapt DistanceMetric in prevision for fused sparse-dense support #24223
Conversation
This allows future callers to use a more efficient implementation. See: #15
The failures are weird. Here is a summary of what I have seen so far:
    D_sklearn = dm.pairwise(X_csr, Y)
    assert D_sklearn.flags.c_contiguous
    > assert_allclose(D_sklearn, D_scipy_cdist, **rtol_dict)

    D_scipy_cdist = array([[1.21796151, 0.99804582, 0.64796144, 1.09320258, 1.49914561,
            1.12392043, 0.47718256, 1.020455 , 0.9278..., 2.10374725, 2.20617501, 1.51727776, 1.13402 ,
            1.92153794, 1.95399778, 0.78790286, 0.78723968, 1.36559244]])
    D_sklearn = array([[1.21796151, 0.99804582, 0.64796144, 1.09320258, 1.49914561,
            1.12392043, 0.47718256, 1.020455 , 0.9278..., 2.10374725, 2.20617501, 1.51727776, 1.13402 ,
            0. , 0. , 0. , 0. , 0. ]])

    E AssertionError:
    E Not equal to tolerance rtol=1e-07, atol=0
    E
    E Mismatched elements: 100 / 500 (20%)
    E Max absolute difference: 2.44808142
    E Max relative difference: 1.

The relative error is often, but not exactly, 1.0, meaning that there are 100 out of 500 exact zeros in the scikit-learn CSR/dense distance computation. So it looks like a pointer arithmetic bug to me.
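As a worked check on that deduction (the reference values below are simply taken from the array dump above): when an entry of the computed matrix is exactly 0.0 while the reference value is non-zero, the relative difference |actual - desired| / |desired| equals 1, which is why relative errors of about 1 point to exact zeros rather than to small numerical noise.

```python
import numpy as np

# Relative difference in the sense reported by numpy.testing.assert_allclose:
# |actual - desired| / |desired|.
desired = np.array([1.92153794, 1.95399778, 0.78790286])  # reference distances
actual = np.zeros_like(desired)                           # spurious zeros
print(np.abs(actual - desired) / np.abs(desired))         # [1. 1. 1.]
```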
Good catch. But why doesn't it fail on Linux?
Honestly, I do not see why tests were passing on configurations using Linux. Shall we chalk it up to chance? I think this problem motivates the need to add … On a side note, you mentioned in a review that …
Skipped jobs on the CI are due to an unusual way of raising awareness of the upcoming removal of Ubuntu 18.04. See: actions/runner-images#6002
It cannot be chance: it was deterministically happening for different Linux builds with different base images and different numpy / scipy versions. I am pretty sure we would have had the same problem with a different random seed. The distance values were non-zero and identical on all the platforms.
Were those tests skipped for some reason on the Linux builds? Answer: no, that wasn't the case. I just checked that, for instance, …
It probably has to do with the malloc for the result array reusing a recently freed buffer: the distance values of the previous assertion were not reinitialized and were therefore left at the correctly computed values of the …
Indeed, we are using …
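A hypothetical sketch of the buffer-reuse effect described above (this is not scikit-learn code; whether the freed memory is actually reused is allocator-dependent, which could also explain why only some CI configurations were affected):

```python
import numpy as np

# An array allocated without initialization may alias memory from a recently
# freed array, so stale (here: previously correct) values can survive into the
# next allocation. Nothing guarantees this, which makes such bugs easy to miss
# on some platforms.
previous = np.full(500, 42.0)   # stand-in for correctly computed distances
del previous                    # the buffer goes back to the allocator
reused = np.empty(500)          # uninitialized contents, possibly the old ones
print(reused[:5])               # may or may not print 42.0
```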
I tried to make the test stronger as follows:

    diff --git a/sklearn/metrics/tests/test_dist_metrics.py b/sklearn/metrics/tests/test_dist_metrics.py
    index bb95681eb..c895143fe 100644
    --- a/sklearn/metrics/tests/test_dist_metrics.py
    +++ b/sklearn/metrics/tests/test_dist_metrics.py
    @@ -113,14 +113,23 @@ def test_cdist(metric_param_grid, X, Y):
             assert D_sklearn.flags.c_contiguous
             assert_allclose(D_sklearn, D_scipy_cdist, **rtol_dict)

    +        # Erase values in this results buffer to avoid successive free/malloc
    +        # reusing a buffer with the correct distances in the next calls,
    +        # potentially silently hiding problems.
    +        D_sklearn.fill(0.)
    +
             D_sklearn = dm.pairwise(X_csr, Y_csr)
             assert D_sklearn.flags.c_contiguous
             assert_allclose(D_sklearn, D_scipy_cdist, **rtol_dict)

    +        D_sklearn.fill(0.)
    +
             D_sklearn = dm.pairwise(X_csr, Y)
             assert D_sklearn.flags.c_contiguous
             assert_allclose(D_sklearn, D_scipy_cdist, **rtol_dict)

    +        D_sklearn.fill(0.)
    +
             D_sklearn = dm.pairwise(X, Y_csr)
             assert D_sklearn.flags.c_contiguous
             assert_allclose(D_sklearn, D_scipy_cdist, **rtol_dict)

but I cannot make it fail on 1c7bd26 (last commit prior to your fix)... Not sure what's going on.
Thanks very much @jjerphan. LGTM.
Maybe @thomasjpfan can give a second review for this as the original author of this strategy.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Fix typo that I forgot
LGTM
MAINT Adapt DistanceMetric in prevision for fused sparse-dense support (scikit-learn#24223)
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Reference Issues/PRs
Extracts parts of jjerphan#15
Precedes #23585
What does this implement/fix? Explain your changes.
This proposes a modulo-less alternative which is less costly, in order to support fused sparse and dense datasets.
The _pairwise_sparse_dense and _pairwise_dense_sparse methods, currently used for testing purposes, have been reworked so as not to duplicate the logic of cdist_csr and pdist_csr.
This also fixes a bug in _pairwise_dense_sparse, which is used to test the sparse-dense case.
See: jjerphan#15
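To make the "modulo-less" wording above concrete, here is a hypothetical Python sketch (not the PR's actual Cython code; the function names are made up for illustration): when a dense row is routed through a CSR-style kernel, its extent in the flattened data can be read from precomputed indptr-like offsets instead of being recovered with modulo arithmetic in the hot loop.

```python
import numpy as np

# Hypothetical illustration of the general idea, not the PR's implementation.

def row_with_modulo(flat_data, k, n_features):
    # Recover the row containing flat element k via a modulo in the hot path.
    start = k - (k % n_features)
    return flat_data[start:start + n_features]

def row_with_indptr(flat_data, indptr, i):
    # Read the row extent from CSR-style offsets: no modulo, and the very same
    # access pattern also works for the data array of an actual CSR matrix.
    return flat_data[indptr[i]:indptr[i + 1]]

X = np.arange(12, dtype=np.float64).reshape(3, 4)
flat = X.ravel()
indptr = np.arange(0, X.size + 1, X.shape[1])  # [0, 4, 8, 12]

assert np.array_equal(row_with_modulo(flat, 9, X.shape[1]), X[2])
assert np.array_equal(row_with_indptr(flat, indptr, 2), X[2])
```

The offset-based form spares an integer modulo per access and lets dense and CSR inputs share one access pattern, which is presumably what makes the alternative less costly.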
Any other comments?