MAINT Adapt `PairwiseDistancesReduction` heuristic for `strategy="auto"` #24043

jjerphan · 2022-07-29T09:02:22Z

Reference Issues/PRs

Part of #22587.

What does this implement/fix? Explain your changes.

Currently, the heuristic to choose between parallel_on_X or parallel_on_Y is sub-optimal and does not take the number of samples of Y into account.

This changes the heuristic to add a comparison between the number of samples of the two datasets.

The radius definition is from Olivier: scikit-learn#22320 (review) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

jjerphan · 2022-08-05T19:38:25Z

With the current heuristic, we get using the time_PairwiseDistancesArgKmin benchmark the following results (:tada:):

       before           after         ratio
     [06e733e1]       [fec7843c]
     <maint/pdr-adapt-heuristic~1>       <maint/pdr-adapt-heuristic>
-        109±20ms         82.2±1ms     0.75  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_PairwiseDistancesArgKmin(1000, 1000, 100, 'manhattan', 'auto')                                                                          
-       7.76±0.3s          954±9ms     0.12  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_PairwiseDistancesArgKmin(10000, 100000, 100, 'manhattan', 'auto')                                                                       
-       4.08±0.1s          165±5ms     0.04  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_PairwiseDistancesArgKmin(10000, 100000, 100, 'euclidean', 'auto')                                                                       
-       967±100ms         37.1±2ms     0.04  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_PairwiseDistancesArgKmin(1000, 10000, 100, 'manhattan', 'auto')                                                                         
-        760±40ms       21.4±0.4ms     0.03  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_PairwiseDistancesArgKmin(1000, 10000, 100, 'euclidean', 'auto')                                                                         
-       7.76±0.2s         32.9±1ms     0.00  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_PairwiseDistancesArgKmin(1000, 100000, 100, 'euclidean', 'auto')                                                                        

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Full asv trace

# Having commented all benchmarks but `time_PairwiseDistancesArgKmin` 
$ asv continuous -b PairwiseDistancesReductions -e `git merge-base --fork-point upstream/main maint/pdr-adapt-heuristic` maint/pdr-adapt-heuristic
· Creating environments
· Discovering benchmarks

· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For scikit-learn commit fec7843c <maint/pdr-adapt-heuristic> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ··· pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_PairwiseDistancesArgKmin                                                                                                                                     8/54 failed
[ 50.00%] ··· ========== ======== ============ ================== =========================== =========================== ================== =========================== ===========================
              --                                                                                                 metric / strategy
              -------------------------------- -----------------------------------------------------------------------------------------------------------------------------------------------------
               n_train    n_test   n_features   euclidean / auto   euclidean / parallel_on_X   euclidean / parallel_on_Y   manhattan / auto   manhattan / parallel_on_X   manhattan / parallel_on_Y
              ========== ======== ============ ================== =========================== =========================== ================== =========================== ===========================
                 1000      1000       100          67.2±10ms               16.5±0.2ms                  67.6±0.3ms              82.2±1ms               35.1±0.3ms                  82.4±20ms
                 1000     10000       100          21.4±0.4ms              18.8±0.3ms                   794±40ms               37.1±2ms                34.6±2ms                    976±60ms
                 1000     100000      100           32.9±1ms               33.0±0.6ms                  7.88±0.2s              106±0.4ms               106±0.5ms                   9.35±0.1s
                10000      1000       100           68.3±2ms                41.8±4ms                    69.8±3ms               81.8±3ms                243±4ms                     76.0±1ms
                10000     10000       100           442±10ms                45.4±3ms                    414±20ms               697±30ms                237±1ms                     740±60ms
                10000     100000      100           165±5ms                 169±3ms                    4.05±0.2s               954±9ms                 994±50ms                   6.99±0.2s
               10000000    1000       100          1.47±0.1s               24.6±0.04s                  1.58±0.01s             7.81±0.2s                 failed                    7.79±0.1s
               10000000   10000       100          13.9±0.5s               26.7±0.03s                  14.5±0.4s              1.29±0.01m                failed                     1.29±0m
               10000000   100000      100            failed                  failed                      failed                 failed                  failed                      failed
              ========== ======== ============ ================== =========================== =========================== ================== =========================== ===========================

[ 50.00%] ···· For parameters: 10000000, 1000, 100, 'manhattan', 'parallel_on_X'


               asv: benchmark timed out (timeout 500s)

               For parameters: 10000000, 10000, 100, 'manhattan', 'parallel_on_X'


               asv: benchmark timed out (timeout 500s)

               For parameters: 10000000, 100000, 100, 'euclidean', 'auto'


               asv: benchmark timed out (timeout 500s)

               For parameters: 10000000, 100000, 100, 'euclidean', 'parallel_on_X'


               asv: benchmark timed out (timeout 500s)

               For parameters: 10000000, 100000, 100, 'euclidean', 'parallel_on_Y'


               asv: benchmark timed out (timeout 500s)

               For parameters: 10000000, 100000, 100, 'manhattan', 'auto'


               asv: benchmark timed out (timeout 500s)

               For parameters: 10000000, 100000, 100, 'manhattan', 'parallel_on_X'


               asv: benchmark timed out (timeout 500s)

               For parameters: 10000000, 100000, 100, 'manhattan', 'parallel_on_Y'


               asv: benchmark timed out (timeout 500s)

[ 50.00%] · For scikit-learn commit 06e733e1 <maint/pdr-adapt-heuristic~1> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl.....................................
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[100.00%] ··· pairwise_distances_reductions.PairwiseDistancesReductionsBenchmar                                                                                                                                     8/54 failed
[100.00%] ··· ========== ======== ============ ================== =========================== =========================== ================== =========================== ===========================
              --                                                                                                 metric / strategy
              -------------------------------- -----------------------------------------------------------------------------------------------------------------------------------------------------
               n_train    n_test   n_features   euclidean / auto   euclidean / parallel_on_X   euclidean / parallel_on_Y   manhattan / auto   manhattan / parallel_on_X   manhattan / parallel_on_Y
              ========== ======== ============ ================== =========================== =========================== ================== =========================== ===========================
                 1000      1000       100          67.4±0.3ms              16.5±0.2ms                   67.1±1ms               109±20ms               35.3±0.4ms                  82.7±20ms
                 1000     10000       100           760±40ms                18.4±2ms                   749±100ms              967±100ms                37.9±2ms                    984±20ms
                 1000     100000      100          7.76±0.2s                32.9±2ms                   7.10±0.2s              9.56±0.04s              106±0.4ms                   9.35±0.06s
                10000      1000       100           68.8±2ms                41.1±8ms                    68.1±2ms               76.1±1ms                239±3ms                     81.4±2ms
                10000     10000       100           431±20ms                48.8±4ms                    455±20ms               667±20ms                236±3ms                     701±30ms
                10000     100000      100          4.08±0.1s                165±3ms                    4.39±0.3s              7.76±0.3s                953±3ms                    6.73±0.2s
               10000000    1000       100          1.55±0.02s              24.8±0.2s                   1.37±0.1s              7.56±0.1s                 failed                    7.53±0.08s
               10000000   10000       100          12.6±0.5s               26.3±0.02s                  14.7±0.2s              1.29±0.01m                failed                    1.31±0.01m
               10000000   100000      100            failed                  failed                      failed                 failed                  failed                      failed
              ========== ======== ============ ================== =========================== =========================== ================== =========================== ===========================

[100.00%] ···· For parameters: 10000000, 1000, 100, 'manhattan', 'parallel_on_X'


               asv: benchmark timed out (timeout 500s)

               For parameters: 10000000, 10000, 100, 'manhattan', 'parallel_on_X'


               asv: benchmark timed out (timeout 500s)

               For parameters: 10000000, 100000, 100, 'euclidean', 'auto'


               asv: benchmark timed out (timeout 500s)

               For parameters: 10000000, 100000, 100, 'euclidean', 'parallel_on_X'


               asv: benchmark timed out (timeout 500s)

               For parameters: 10000000, 100000, 100, 'euclidean', 'parallel_on_Y'


               asv: benchmark timed out (timeout 500s)

               For parameters: 10000000, 100000, 100, 'manhattan', 'auto'


               asv: benchmark timed out (timeout 500s)

               For parameters: 10000000, 100000, 100, 'manhattan', 'parallel_on_X'


               asv: benchmark timed out (timeout 500s)

               For parameters: 10000000, 100000, 100, 'manhattan', 'parallel_on_Y'


               asv: benchmark timed out (timeout 500s)

       before           after         ratio
     [06e733e1]       [fec7843c]
     <maint/pdr-adapt-heuristic~1>       <maint/pdr-adapt-heuristic>
-        109±20ms         82.2±1ms     0.75  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_PairwiseDistancesArgKmin(1000, 1000, 100, 'manhattan', 'auto')
-       7.76±0.3s          954±9ms     0.12  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_PairwiseDistancesArgKmin(10000, 100000, 100, 'manhattan', 'auto')
-       4.08±0.1s          165±5ms     0.04  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_PairwiseDistancesArgKmin(10000, 100000, 100, 'euclidean', 'auto')
-       967±100ms         37.1±2ms     0.04  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_PairwiseDistancesArgKmin(1000, 10000, 100, 'manhattan', 'auto')
-        760±40ms       21.4±0.4ms     0.03  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_PairwiseDistancesArgKmin(1000, 10000, 100, 'euclidean', 'auto')
-       7.76±0.2s         32.9±1ms     0.00  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_PairwiseDistancesArgKmin(1000, 100000, 100, 'euclidean', 'auto')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

ogrisel

Very nice. Do you confirm that this fixes the scalability degradation observed in the plot of the description of #23865 when the number of CPU threads is large compared to the number of samples?

I am +1 for merging as is as this is already a net performance improvement but I as suspect that the variant I discuss below might even be a bit better in some cases. Can be done in a follow-up PR.

sklearn/metrics/_pairwise_distances_reduction/_base.pyx

thomasjpfan · 2022-08-09T16:45:35Z

asv_benchmarks/benchmarks/pairwise_distances_reductions.py

+
+    param_names = ["n_train", "n_test", "n_features", "metric", "strategy"]
+    params = [
+        [1000, 10_000, int(1e7)],


As stated in #24120 (review) and #24120 (comment), I do not think we can include this under asv_benchmarks.

I'm okay with only updating the heuristic in this PR and merge the benchmark separately.

jjerphan · 2022-08-10T07:53:40Z

Do you confirm that this fixes the scalability degradation observed in the plot of the description of #23865 when the number of CPU threads is large compared to the number of samples?

I can't confirm because I have not explored it as I thought it to be orthogonal to fixing the heuristic. Still, it has to be done and I guess a subsequent dedicated PR is appropriate.

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

jjerphan · 2022-08-11T07:05:40Z

I've accepted @ogrisel's suggestion which code-wise revert this branch to fec7843c, the revision for which benchmark results' (see #24043 (comment)) were reported.

I think this can be further finer-grained refined (e.g. tuning constants), but at least this fixes behaviour observed in #24076 so that @Micky774 can perform benchmark.

ogrisel · 2022-08-11T07:24:49Z

@thomasjpfan ok for merge with this updated version?

thomasjpfan

LGTM

…o"` (scikit-learn#24043) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

github-actions bot added cython module:metrics labels Jul 29, 2022

jjerphan mentioned this pull request Aug 5, 2022

PERF PairwiseDistancesReductions initial work #22587

Closed

jjerphan and others added 2 commits August 5, 2022 11:48

ASV Add benchmarks for PairwiseDistancesReductions

06e733e

The radius definition is from Olivier: scikit-learn#22320 (review) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

MAINT Adapt heuristic to take the number of samples of Y in

fec7843

jjerphan force-pushed the maint/pdr-adapt-heuristic branch from e0aad59 to fec7843 Compare August 5, 2022 10:11

jjerphan marked this pull request as ready for review August 5, 2022 19:39

jjerphan added the Quick Review For PRs that are quick to review label Aug 5, 2022

Remove old sub-optimal condition

09b7060

Micky774 mentioned this pull request Aug 5, 2022

ASV Add benchmarks for PairwiseDistancesReductions #24120

Closed

Well, Julien

9c890df

jjerphan removed the Quick Review For PRs that are quick to review label Aug 6, 2022

This was referenced Aug 9, 2022

MAINT Division-less alternative for fused sparse-dense support jjerphan/scikit-learn#15

Merged

FEA Fused sparse-dense support for PairwiseDistancesReduction #23585

Merged

ogrisel approved these changes Aug 9, 2022

View reviewed changes

sklearn/metrics/_pairwise_distances_reduction/_base.pyx Outdated Show resolved Hide resolved

sklearn/metrics/_pairwise_distances_reduction/_base.pyx Outdated Show resolved Hide resolved

thomasjpfan reviewed Aug 9, 2022

View reviewed changes

jjerphan and others added 3 commits August 11, 2022 08:56

MAINT Revert heuristic to use 2 conditions and document it

5d0a337

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

MAINT Remove ASV benchmark

7056ce2

Merge branch 'main' into maint/pdr-adapt-heuristic

9e5155e

Micky774 mentioned this pull request Aug 11, 2022

PERF Implement PairwiseDistancesReduction backend for KNeighbors.predict_proba #24076

Merged

4 tasks

thomasjpfan approved these changes Aug 11, 2022

View reviewed changes

thomasjpfan merged commit 5e1ed8b into scikit-learn:main Aug 11, 2022

jjerphan deleted the maint/pdr-adapt-heuristic branch August 11, 2022 20:40

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Sep 12, 2022

MAINT Adapt PairwiseDistancesReduction heuristic for `strategy="aut…

53f4a88

…o"` (scikit-learn#24043) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

MAINT Adapt `PairwiseDistancesReduction` heuristic for `strategy="auto"` #24043

MAINT Adapt `PairwiseDistancesReduction` heuristic for `strategy="auto"` #24043

Uh oh!

jjerphan commented Jul 29, 2022 •

edited

Loading

Uh oh!

jjerphan commented Aug 5, 2022 •

edited

Loading

Uh oh!

ogrisel left a comment

Uh oh!

Uh oh!

Uh oh!

thomasjpfan Aug 9, 2022 •

edited

Loading

Uh oh!

jjerphan commented Aug 10, 2022 •

edited

Loading

Uh oh!

jjerphan commented Aug 11, 2022

Uh oh!

ogrisel commented Aug 11, 2022

Uh oh!

thomasjpfan left a comment

Uh oh!

Uh oh!

Uh oh!

MAINT Adapt PairwiseDistancesReduction heuristic for strategy="auto" #24043

MAINT Adapt PairwiseDistancesReduction heuristic for strategy="auto" #24043

Uh oh!

Conversation

jjerphan commented Jul 29, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Uh oh!

jjerphan commented Aug 5, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ogrisel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

thomasjpfan Aug 9, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jjerphan commented Aug 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jjerphan commented Aug 11, 2022

Uh oh!

ogrisel commented Aug 11, 2022

Uh oh!

thomasjpfan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

MAINT Adapt `PairwiseDistancesReduction` heuristic for `strategy="auto"` #24043

MAINT Adapt `PairwiseDistancesReduction` heuristic for `strategy="auto"` #24043

jjerphan commented Jul 29, 2022 •

edited

Loading

jjerphan commented Aug 5, 2022 •

edited

Loading

thomasjpfan Aug 9, 2022 •

edited

Loading

jjerphan commented Aug 10, 2022 •

edited

Loading