MAINT Adapt `DistanceMetric` in prevision for fused sparse-dense support #24223

jjerphan · 2022-08-22T13:49:36Z

Reference Issues/PRs

Extracts of parts of jjerphan#15

Precedes #23585

What does this implement/fix? Explain your changes.

This proposes a modulo-less alternative which is less costly to support
sparse and dense fused datasets.

The _pairwise_sparse_dense and _pairwise_dense_sparse methods currently
used for testing purposes have been reworked not to duplicate the logic
of cdist_csr and pdist_csr.

This also fix a bug in _pairwise_dense_sparse use to test the
sparse-dense case.

See: jjerphan#15

Any other comments?

To use the a more efficient implementation for future callers implementations. See: #15

ogrisel · 2022-08-22T14:58:28Z

The failures are weird. Here is a summary of what I have seen so far:

all tests pass on Linux,
some test_cdist parametrization (but not all) fail on Windows and Mac,
each time there is a failure we have something like:

            D_sklearn = dm.pairwise(X_csr, Y)
            assert D_sklearn.flags.c_contiguous
>           assert_allclose(D_sklearn, D_scipy_cdist, **rtol_dict)

D_scipy_cdist = array([[1.21796151, 0.99804582, 0.64796144, 1.09320258, 1.49914561,
        1.12392043, 0.47718256, 1.020455  , 0.9278..., 2.10374725, 2.20617501, 1.51727776, 1.13402   ,
        1.92153794, 1.95399778, 0.78790286, 0.78723968, 1.36559244]])
D_sklearn  = array([[1.21796151, 0.99804582, 0.64796144, 1.09320258, 1.49914561,
        1.12392043, 0.47718256, 1.020455  , 0.9278..., 2.10374725, 2.20617501, 1.51727776, 1.13402   ,
        0.        , 0.        , 0.        , 0.        , 0.        ]])

E       AssertionError: 
E       Not equal to tolerance rtol=1e-07, atol=0
E       
E       Mismatched elements: 100 / 500 (20%)
E       Max absolute difference: 2.44808142
E       Max relative difference: 1.

the relative error is often, but not exactly 1.0. Meaning that there are 100 out of 500 exact zeros in the scikit-learn CSR/dense distance computation.

So it looks like a pointer arithmetic bug to me.

ogrisel · 2022-08-22T15:10:26Z

Good catch. But why doesn't it fail on Linux?

jjerphan · 2022-08-22T15:15:17Z

Good catch. But why doesn't it fail on Linux?

Honestly, I do not see why tests were passing on configuration using Linux. Let's put it on hazard? I think this problem motivates the need to add global_random to those tests. What do you think?

On a side note, you mentioned in a review that n_features could be used instead of size for DistanceMetrics: do you think this PR is a good opportunity for this change?

jjerphan · 2022-08-22T15:50:48Z

Skipped jobs on the CI are due to an unusual way of raising awareness of Ubuntu 18.04 upcoming removal. See: actions/runner-images#6002

ogrisel · 2022-08-22T16:30:33Z

Honestly, I do not see why tests were passing on configuration using Linux. Let's put it on hazard?

It cannot be chance: it was deterministically happening for different linux builds with different base images and different numpy / scipy versions.

I think this problem motivates the need to add global_random to those tests. What do you think?

I am pretty sure we would have had the same problem with a different random seed. The distance values where non-zeros and the same on all the platforms.

Were those tests skipped for some reason on the linux builds?

Answer: not it wasn't the case. I just checked that for instance test_cdist[X2-Y2-minkowski0] did run and pass on the Linux CI using the filtering capabilities of the Azure Web UI:

https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=45755&view=ms.vss-test-web.build-test-results-tab

ogrisel · 2022-08-22T16:35:19Z

It probably has to do with a malloc happening after a recently freed buffer for the result array: the distances values of the previous assertion were not reinitialized and therefore left to the correctly computed values of the dm.pairwise(X_csr, Y_csr) case while on Windows and Mac malloc would not reuse recently freed memory in the same way.

jjerphan · 2022-08-22T17:01:20Z

It probably has to do with a malloc of a recently freed buffer for the result array: the distances values of the previous assertion were not reinitialized and therefore left to the correctly computed values of the dm.pairwise(X_csr, Y_csr) case while on Windows and Mac malloc would not reuse recently freed memory in the same way.

Indeed, we are using np.empty which is returning an uninitialised buffer allocated with malloc.

sklearn/metrics/_dist_metrics.pyx.tp

ogrisel · 2022-08-22T17:16:41Z

Indeed, we are using np.empty which is returning an uninitialised buffer allocated with malloc.

I tried to make the test stronger as following:

diff --git a/sklearn/metrics/tests/test_dist_metrics.py b/sklearn/metrics/tests/test_dist_metrics.py
index bb95681eb..c895143fe 100644
--- a/sklearn/metrics/tests/test_dist_metrics.py
+++ b/sklearn/metrics/tests/test_dist_metrics.py
@@ -113,14 +113,23 @@ def test_cdist(metric_param_grid, X, Y):
         assert D_sklearn.flags.c_contiguous
         assert_allclose(D_sklearn, D_scipy_cdist, **rtol_dict)
 
+        # Erase values in this results buffer to avoid successive free/malloc
+        # reusing a buffer with the correct distances in the next calls,
+        # potentially silently hiding problems.
+        D_sklearn.fill(0.)
+
         D_sklearn = dm.pairwise(X_csr, Y_csr)
         assert D_sklearn.flags.c_contiguous
         assert_allclose(D_sklearn, D_scipy_cdist, **rtol_dict)
 
+        D_sklearn.fill(0.)
+
         D_sklearn = dm.pairwise(X_csr, Y)
         assert D_sklearn.flags.c_contiguous
         assert_allclose(D_sklearn, D_scipy_cdist, **rtol_dict)
 
+        D_sklearn.fill(0.)
+
         D_sklearn = dm.pairwise(X, Y_csr)
         assert D_sklearn.flags.c_contiguous
         assert_allclose(D_sklearn, D_scipy_cdist, **rtol_dict)

but I cannot make it fail on 1c7bd26 (last commit prior to your fix)... Not sure what's going on.

ogrisel

Thanks very much @jjerphan. LGTM.

Maybe @thomasjpfan can give a second review for this as the original author of this strategy.

sklearn/metrics/_dist_metrics.pyx.tp

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

sklearn/metrics/_dist_metrics.pyx.tp

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

Fix typo that I forgot

thomasjpfan

LGTM

…ort (scikit-learn#24223) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

MAINT Adapt DistanceMetric for fused sparse-dense

79f6db9

To use the a more efficient implementation for future callers implementations. See: #15

github-actions bot added cython module:metrics labels Aug 22, 2022

jjerphan mentioned this pull request Aug 22, 2022

FEA Fused sparse-dense support for PairwiseDistancesReduction #23585

Merged

4 tasks

jjerphan added the No Changelog Needed label Aug 22, 2022

DOC Adapt docstring of dist_csr

1c7bd26

FIX Correctly create Y_indptr for the sparse-dense case

8e30ef1

jjerphan marked this pull request as ready for review August 22, 2022 15:36

jjerphan changed the title ~~MAINT Adapt DistanceMetric for fused sparse-dense~~ MAINT Adapt DistanceMetric in prevision for fused sparse-dense support Aug 22, 2022

jjerphan mentioned this pull request Aug 22, 2022

CI Use Ubuntu 22.04 instead of Ubuntu 18.04 #24225

Merged

ogrisel reviewed Aug 22, 2022

View reviewed changes

sklearn/metrics/_dist_metrics.pyx.tp Show resolved Hide resolved

jjerphan added 2 commits August 22, 2022 20:24

Merge branch 'main' into maint/dist-metrics-sparse-support-follow-up

e181d94

Mimic SparseDenseDatasetsPair's frugal adaptation

732b464

jjerphan added the Waiting for Reviewer label Aug 22, 2022

ogrisel approved these changes Aug 23, 2022

View reviewed changes

sklearn/metrics/_dist_metrics.pyx.tp Outdated Show resolved Hide resolved

sklearn/metrics/_dist_metrics.pyx.tp Show resolved Hide resolved

DOC Clarify methods' purpose

0db551d

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

thomasjpfan reviewed Aug 25, 2022

View reviewed changes

sklearn/metrics/_dist_metrics.pyx.tp Outdated Show resolved Hide resolved

sklearn/metrics/_dist_metrics.pyx.tp Outdated Show resolved Hide resolved

jjerphan and others added 2 commits August 25, 2022 15:31

Make validation consistent

77e7f09

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

fixup! Make validation consistent

d13687a

Fix typo that I forgot

thomasjpfan approved these changes Aug 25, 2022

View reviewed changes

thomasjpfan merged commit a34a70d into scikit-learn:main Aug 25, 2022

jjerphan deleted the maint/dist-metrics-sparse-support-follow-up branch August 25, 2022 15:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

MAINT Adapt `DistanceMetric` in prevision for fused sparse-dense support #24223

MAINT Adapt `DistanceMetric` in prevision for fused sparse-dense support #24223

Uh oh!

jjerphan commented Aug 22, 2022 •

edited

Loading

Uh oh!

ogrisel commented Aug 22, 2022 •

edited

Loading

Uh oh!

ogrisel commented Aug 22, 2022

Uh oh!

jjerphan commented Aug 22, 2022

Uh oh!

jjerphan commented Aug 22, 2022 •

edited

Loading

Uh oh!

ogrisel commented Aug 22, 2022

Uh oh!

ogrisel commented Aug 22, 2022 •

edited

Loading

Uh oh!

jjerphan commented Aug 22, 2022

Uh oh!

Uh oh!

ogrisel commented Aug 22, 2022 •

edited

Loading

Uh oh!

ogrisel left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

thomasjpfan left a comment

Uh oh!

Uh oh!

Uh oh!

MAINT Adapt DistanceMetric in prevision for fused sparse-dense support #24223

MAINT Adapt DistanceMetric in prevision for fused sparse-dense support #24223

Uh oh!

Conversation

jjerphan commented Aug 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

Uh oh!

ogrisel commented Aug 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ogrisel commented Aug 22, 2022

Uh oh!

jjerphan commented Aug 22, 2022

Uh oh!

jjerphan commented Aug 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ogrisel commented Aug 22, 2022

Uh oh!

ogrisel commented Aug 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jjerphan commented Aug 22, 2022

Uh oh!

Uh oh!

ogrisel commented Aug 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ogrisel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

thomasjpfan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

MAINT Adapt `DistanceMetric` in prevision for fused sparse-dense support #24223

MAINT Adapt `DistanceMetric` in prevision for fused sparse-dense support #24223

jjerphan commented Aug 22, 2022 •

edited

Loading

ogrisel commented Aug 22, 2022 •

edited

Loading

jjerphan commented Aug 22, 2022 •

edited

Loading

ogrisel commented Aug 22, 2022 •

edited

Loading

ogrisel commented Aug 22, 2022 •

edited

Loading