Skip to content

MAINT Adapt DistanceMetric in prevision for fused sparse-dense support #24223

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jjerphan
Copy link
Member

@jjerphan jjerphan commented Aug 22, 2022

Reference Issues/PRs

Extracts of parts of jjerphan#15

Precedes #23585

What does this implement/fix? Explain your changes.

This proposes a modulo-less alternative which is less costly to support
sparse and dense fused datasets.

The _pairwise_sparse_dense and _pairwise_dense_sparse methods currently
used for testing purposes have been reworked not to duplicate the logic
of cdist_csr and pdist_csr.

This also fix a bug in _pairwise_dense_sparse use to test the
sparse-dense case.

See: jjerphan#15

Any other comments?

To use the a more efficient implementation for
future callers implementations.

See: #15
@ogrisel
Copy link
Member

ogrisel commented Aug 22, 2022

The failures are weird. Here is a summary of what I have seen so far:

  • all tests pass on Linux,
  • some test_cdist parametrization (but not all) fail on Windows and Mac,
  • each time there is a failure we have something like:
            D_sklearn = dm.pairwise(X_csr, Y)
            assert D_sklearn.flags.c_contiguous
>           assert_allclose(D_sklearn, D_scipy_cdist, **rtol_dict)

D_scipy_cdist = array([[1.21796151, 0.99804582, 0.64796144, 1.09320258, 1.49914561,
        1.12392043, 0.47718256, 1.020455  , 0.9278..., 2.10374725, 2.20617501, 1.51727776, 1.13402   ,
        1.92153794, 1.95399778, 0.78790286, 0.78723968, 1.36559244]])
D_sklearn  = array([[1.21796151, 0.99804582, 0.64796144, 1.09320258, 1.49914561,
        1.12392043, 0.47718256, 1.020455  , 0.9278..., 2.10374725, 2.20617501, 1.51727776, 1.13402   ,
        0.        , 0.        , 0.        , 0.        , 0.        ]])
E       AssertionError: 
E       Not equal to tolerance rtol=1e-07, atol=0
E       
E       Mismatched elements: 100 / 500 (20%)
E       Max absolute difference: 2.44808142
E       Max relative difference: 1.

the relative error is often, but not exactly 1.0. Meaning that there are 100 out of 500 exact zeros in the scikit-learn CSR/dense distance computation.

So it looks like a pointer arithmetic bug to me.

@ogrisel
Copy link
Member

ogrisel commented Aug 22, 2022

Good catch. But why doesn't it fail on Linux?

@jjerphan
Copy link
Member Author

Good catch. But why doesn't it fail on Linux?

Honestly, I do not see why tests were passing on configuration using Linux. Let's put it on hazard? I think this problem motivates the need to add global_random to those tests. What do you think?

On a side note, you mentioned in a review that n_features could be used instead of size for DistanceMetrics: do you think this PR is a good opportunity for this change?

@jjerphan jjerphan marked this pull request as ready for review August 22, 2022 15:36
@jjerphan jjerphan changed the title MAINT Adapt DistanceMetric for fused sparse-dense MAINT Adapt DistanceMetric in prevision for fused sparse-dense support Aug 22, 2022
@jjerphan
Copy link
Member Author

jjerphan commented Aug 22, 2022

Skipped jobs on the CI are due to an unusual way of raising awareness of Ubuntu 18.04 upcoming removal. See: actions/runner-images#6002

@ogrisel
Copy link
Member

ogrisel commented Aug 22, 2022

Honestly, I do not see why tests were passing on configuration using Linux. Let's put it on hazard?

It cannot be chance: it was deterministically happening for different linux builds with different base images and different numpy / scipy versions.

I think this problem motivates the need to add global_random to those tests. What do you think?

I am pretty sure we would have had the same problem with a different random seed. The distance values where non-zeros and the same on all the platforms.

Were those tests skipped for some reason on the linux builds?

Answer: not it wasn't the case. I just checked that for instance test_cdist[X2-Y2-minkowski0] did run and pass on the Linux CI using the filtering capabilities of the Azure Web UI:

https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=45755&view=ms.vss-test-web.build-test-results-tab

@ogrisel
Copy link
Member

ogrisel commented Aug 22, 2022

It probably has to do with a malloc happening after a recently freed buffer for the result array: the distances values of the previous assertion were not reinitialized and therefore left to the correctly computed values of the dm.pairwise(X_csr, Y_csr) case while on Windows and Mac malloc would not reuse recently freed memory in the same way.

@jjerphan
Copy link
Member Author

It probably has to do with a malloc of a recently freed buffer for the result array: the distances values of the previous assertion were not reinitialized and therefore left to the correctly computed values of the dm.pairwise(X_csr, Y_csr) case while on Windows and Mac malloc would not reuse recently freed memory in the same way.

Indeed, we are using np.empty which is returning an uninitialised buffer allocated with malloc.

@ogrisel
Copy link
Member

ogrisel commented Aug 22, 2022

Indeed, we are using np.empty which is returning an uninitialised buffer allocated with malloc.

I tried to make the test stronger as following:

diff --git a/sklearn/metrics/tests/test_dist_metrics.py b/sklearn/metrics/tests/test_dist_metrics.py
index bb95681eb..c895143fe 100644
--- a/sklearn/metrics/tests/test_dist_metrics.py
+++ b/sklearn/metrics/tests/test_dist_metrics.py
@@ -113,14 +113,23 @@ def test_cdist(metric_param_grid, X, Y):
         assert D_sklearn.flags.c_contiguous
         assert_allclose(D_sklearn, D_scipy_cdist, **rtol_dict)
 
+        # Erase values in this results buffer to avoid successive free/malloc
+        # reusing a buffer with the correct distances in the next calls,
+        # potentially silently hiding problems.
+        D_sklearn.fill(0.)
+
         D_sklearn = dm.pairwise(X_csr, Y_csr)
         assert D_sklearn.flags.c_contiguous
         assert_allclose(D_sklearn, D_scipy_cdist, **rtol_dict)
 
+        D_sklearn.fill(0.)
+
         D_sklearn = dm.pairwise(X_csr, Y)
         assert D_sklearn.flags.c_contiguous
         assert_allclose(D_sklearn, D_scipy_cdist, **rtol_dict)
 
+        D_sklearn.fill(0.)
+
         D_sklearn = dm.pairwise(X, Y_csr)
         assert D_sklearn.flags.c_contiguous
         assert_allclose(D_sklearn, D_scipy_cdist, **rtol_dict)

but I cannot make it fail on 1c7bd26 (last commit prior to your fix)... Not sure what's going on.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much @jjerphan. LGTM.

Maybe @thomasjpfan can give a second review for this as the original author of this strategy.

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
jjerphan and others added 2 commits August 25, 2022 15:31
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Fix typo that I forgot
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thomasjpfan thomasjpfan merged commit a34a70d into scikit-learn:main Aug 25, 2022
@jjerphan jjerphan deleted the maint/dist-metrics-sparse-support-follow-up branch August 25, 2022 15:22
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Sep 12, 2022
…ort (scikit-learn#24223)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants