Skip to content

ENH Pairwise Distances ArgKmin #21462

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 322 commits into from

Conversation

jjerphan
Copy link
Member

@jjerphan jjerphan commented Oct 26, 2021

Reference Issues/PRs

Supersedes #20254.

Context

Reductions over pairwise distances (argmin, argkmin, threshold filtering, cumulative sum, etc.) are one of the foundations used by many algorithms within scikit-learn.

Various strategies exist to compute them differently, including some using chunks (see sklearn/metrics/pairwise.py).

However, those are still mainly implemented at the level of Python leaving some rooms for manœuvre, especially for potential optimizations which can be made at a lower level.

Proposals of this PR

This PR proposes:

  • introducing PairwiseDistancesReduction, an abstraction allowing computing reductions efficiently in parallel
  • introducing PairwiseDistancesArgkmin, a first implementation of PairwiseDistancesReduction for k-Nearest Neighbors search
  • introduces FastEuclideanPairwiseDistancesArgKmin a specialization of PairwiseDistancesArgkmin for the euclidean and squared euclidean distances which uses the GEMM trick
  • introducing DatasetsPair, an abstraction to wrap a pair of two datasets and compute their vectors pairwise distances
  • DenseDenseDatasetsPair, a first implementation of DatasetsPair for pair of two dense datasets
  • adapting existing interfaces implementations to use PairwiseDistancesArgkmin when and where possible
  • adapting existing tests and introducing new ones for existing and for the new implementations

Results

Scalability results (I beg readers' pardon for my plotting skills)

1.0 implementation do not scale well (not even sublinearly).

This PR implementation does scales linearly up to 64 threads.

Results are obtained via this script.

This PR @ 19dd7ca

  • (n_train, n_test, k) = (1 000 000, 10 000, 10)

speed_up_21462_1000000_10000_log

  • (n_train, n_test, k) = (100 000, 100 000, 10)

speed_up_21462_100000_100000_log

Raw results
    n_threads  n_train  n_test  n_features  mean_runtime  stderr_runtime
0           1   100000  100000          50     53.319188               0
1           2   100000  100000          50     27.118464               0
2           4   100000  100000          50     13.899768               0
3           8   100000  100000          50      6.993012               0
4          16   100000  100000          50      3.801222               0
5          32   100000  100000          50      2.243082               0
6          64   100000  100000          50      1.370284               0
7         128   100000  100000          50      2.007565               0
8           1   100000  100000         100     77.488982               0
9           2   100000  100000         100     39.340334               0
10          4   100000  100000         100     20.142855               0
11          8   100000  100000         100     10.352673               0
12         16   100000  100000         100      5.556708               0
13         32   100000  100000         100      3.059276               0
14         64   100000  100000         100      2.183617               0
15        128   100000  100000         100      3.138607               0
16          1   100000  100000         500    265.572455               0
17          2   100000  100000         500    133.960014               0
18          4   100000  100000         500     68.229143               0
19          8   100000  100000         500     35.094786               0
20         16   100000  100000         500     18.780773               0
21         32   100000  100000         500     10.413006               0
22         64   100000  100000         500      6.948780               0
23        128   100000  100000         500     11.336498               0
    n_threads  n_train  n_test  n_features  mean_runtime  stderr_runtime
0           1  1000000   10000          50     54.181024               0
1           2  1000000   10000          50     27.885327               0
2           4  1000000   10000          50     14.291922               0
3           8  1000000   10000          50      7.112637               0
4          16  1000000   10000          50      3.886164               0
5          32  1000000   10000          50      2.156768               0
6          64  1000000   10000          50      1.375637               0
7         128  1000000   10000          50      1.563360               0
8           1  1000000   10000         100     77.440228               0
9           2  1000000   10000         100     39.710930               0
10          4  1000000   10000         100     20.238788               0
11          8  1000000   10000         100     10.565105               0
12         16  1000000   10000         100      5.549846               0
13         32  1000000   10000         100      3.070918               0
14         64  1000000   10000         100      1.955702               0
15        128  1000000   10000         100      2.145065               0
16          1  1000000   10000         500    266.191423               0
17          2  1000000   10000         500    136.932209               0
18          4  1000000   10000         500     70.186863               0
19          8  1000000   10000         500     35.348363               0
20         16  1000000   10000         500     18.673352               0
21         32  1000000   10000         500     10.232006               0
22         64  1000000   10000         500      7.026281               0
23        128  1000000   10000         500      7.118796               0

1.0 (9b03375)

  • (n_train, n_test, k) = (1 000 000, 10 000, 10)

speed_up_1 0_1000000_10000_log

  • (n_train, n_test, k) = (100 000, 100 000, 10)

speed_up_1 0_100000_100000_log

Raw results
    n_threads  n_train  n_test  n_features  mean_runtime  stderr_runtime
0           1   100000  100000          50    155.998673               0
1           2   100000  100000          50    178.260460               0
2           4   100000  100000          50    162.936344               0
3           8   100000  100000          50    157.636250               0
4          16   100000  100000          50    159.273397               0
5          32   100000  100000          50    160.395220               0
6          64   100000  100000          50    163.525406               0
7         128   100000  100000          50    197.601314               0
8           1   100000  100000         100    176.557523               0
9           2   100000  100000         100    186.601761               0
10          4   100000  100000         100    168.457660               0
11          8   100000  100000         100    158.684159               0
12         16   100000  100000         100    160.464695               0
13         32   100000  100000         100    162.835237               0
14         64   100000  100000         100    167.569254               0
15        128   100000  100000         100    181.630751               0
16          1   100000  100000         500    348.425143               0
17          2   100000  100000         500    277.428042               0
18          4   100000  100000         500    215.084920               0
19          8   100000  100000         500    187.254540               0
20         16   100000  100000         500    179.741570               0
21         32   100000  100000         500    174.659939               0
22         64   100000  100000         500    190.489453               0
23        128   100000  100000         500    227.535006               0
    n_threads  n_train  n_test  n_features  mean_runtime  stderr_runtime
0           1  1000000   10000          50    155.084480               0
1           2  1000000   10000          50    189.289008               0
2           4  1000000   10000          50    174.326572               0
3           8  1000000   10000          50    169.130710               0
4          16  1000000   10000          50    169.489740               0
5          32  1000000   10000          50    169.501073               0
6          64  1000000   10000          50    176.410910               0
7         128  1000000   10000          50    188.036263               0
8           1  1000000   10000         100    198.885173               0
9           2  1000000   10000         100    217.164905               0
10          4  1000000   10000         100    194.695082               0
11          8  1000000   10000         100    184.083839               0
12         16  1000000   10000         100    184.801623               0
13         32  1000000   10000         100    178.922246               0
14         64  1000000   10000         100    189.154418               0
15        128  1000000   10000         100    201.639427               0
16          1  1000000   10000         500    425.073250               0
17          2  1000000   10000         500    331.314108               0
18          4  1000000   10000         500    254.123434               0
19          8  1000000   10000         500    203.032605               0
20         16  1000000   10000         500    194.956694               0
21         32  1000000   10000         500    194.276188               0
22         64  1000000   10000         500    190.638631               0
23        128  1000000   10000         500    191.943951               0
Benchmarks information
Machine specification
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7742 64-Core Processor
Stepping:            0
CPU MHz:             3388.360
BogoMIPS:            4491.59
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-63,128-191
NUMA node1 CPU(s):   64-127,192-255
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

Benchmarks results between 1.0 (e7fb5b8) and this PR @ 19dd7ca (via 5f3f389)

Between ×1.2 and ×20 speed-ups in normal configurations.
Small regression when using 1 core on small datasets.

1 core
       before           after         ratio
     [e7fb5b8c]       [19dd7cab]
     <main>           <benchmarks/pairwise-distances-argkmin~6>
+         286±5ms          374±5ms     1.31  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_locally_linear_embedding(1000, 1000, 100)                                                                                                     
-         1.22±0s       1.10±0.01s     0.90  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 10000, 100)                                                                                                                      
-      9.59±0.03s       8.28±0.06s     0.86  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_locally_linear_embedding(1000, 100000, 100)                                                                                                   
-      20.4±0.04s       16.6±0.05s     0.82  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(10000, 100000, 100)                                                                                                                     
-      6.40±0.02s       5.21±0.02s     0.81  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 100000, 100)                                                                                                                     
-         241±2ms          195±3ms     0.81  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(1000, 10000, 100)                                                                                                                       
-       124±0.9ms         90.9±1ms     0.73  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 1000, 100)                                                                                                   
-      1.22±0.01s         874±10ms     0.72  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 10000, 100)                                                                                              
-      12.1±0.03s       8.64±0.01s     0.71  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 100000, 100)                                                                                             
-      12.2±0.02s       8.68±0.01s     0.71  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 100000, 100)                                                                                                 
-      1.44±0.01s       1.02±0.01s     0.71  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(1000, 100000, 100)                                                                                                                      
-      1.23±0.01s          868±8ms     0.71  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 10000, 100)                                                                                                  
-         126±1ms         87.7±2ms     0.70  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 1000, 100)                                                                                               
-         133±1ms         90.6±1ms     0.68  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 10000, 100)                                                                                               
-         133±1ms         90.5±1ms     0.68  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 10000, 100)                                                                                                   
-      1.34±0.01s          905±7ms     0.68  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 100000, 100)                                                                                              
-      1.34±0.01s          894±7ms     0.67  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 100000, 100)                                                                                                  
-      22.9±0.4ms       12.6±0.4ms     0.55  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)                                                                                                            
-         2.35±0s       1.11±0.01s     0.47  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)                                                                                                          
-         232±2ms          109±2ms     0.47  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)                                                                                                           
-         230±2ms         92.2±2ms     0.40  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)                                                                                                           
-      22.5±0.02s       8.96±0.03s     0.40  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 100000, 100)                                                                                                         
-      2.25±0.01s          896±8ms     0.40  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
2 cores
       before           after         ratio
     [e7fb5b8c]       [19dd7cab]
     <main>           <pairwise-distances-argkmin>
-      1.51±0.01s       1.36±0.01s     0.90  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 10000, 100)
-      10.5±0.01s       9.35±0.02s     0.89  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(10000, 10000, 100)
-         17.0±0s          14.9±0s     0.88  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_locally_linear_embedding(1000, 100000, 100)
-      9.21±0.01s       7.65±0.04s     0.83  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 100000, 100)
-         317±1ms        241±0.7ms     0.76  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(1000, 10000, 100)
-     12.6±0.04ms      9.33±0.08ms     0.74  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 1000, 100)
-     12.7±0.05ms      9.32±0.08ms     0.74  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 1000, 100)
-      21.9±0.01s       15.5±0.01s     0.71  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(10000, 100000, 100)
-         1.52±0s          880±1ms     0.58  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(1000, 100000, 100)
-         1.27±0s          704±7ms     0.55  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 10000, 100)
-       133±0.4ms       72.9±0.8ms     0.55  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 1000, 100)
-         1.27±0s          699±8ms     0.55  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 10000, 100)
-       133±0.1ms       72.7±0.8ms     0.55  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 1000, 100)
-      12.7±0.01s          6.92±0s     0.54  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 100000, 100)
-         12.7±0s       6.85±0.01s     0.54  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 100000, 100)
-       138±0.2ms       74.1±0.4ms     0.54  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 10000, 100)
-       139±0.2ms       74.0±0.1ms     0.53  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 10000, 100)
-         1.35±0s          704±2ms     0.52  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 100000, 100)
-         1.35±0s        702±0.8ms     0.52  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 100000, 100)
-     21.6±0.04ms      11.2±0.06ms     0.52  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-         2.24±0s          805±3ms     0.36  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)
-       229±0.9ms      81.7±0.08ms     0.36  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-       234±0.1ms       76.3±0.4ms     0.33  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         2.22±0s          717±8ms     0.32  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
-         22.1±0s          7.00±0s     0.32  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
4 cores

To come soon.

8 cores
       before           after         ratio
     [e7fb5b8c]       [19dd7cab]
     <main>           <pairwise-distances-argkmin>
-         1.47±0s       1.31±0.01s     0.89  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 10000, 100)
-         16.6±0s          14.6±0s     0.88  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_locally_linear_embedding(1000, 100000, 100)
-         9.62±0s       8.28±0.01s     0.86  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(10000, 10000, 100)
-      8.63±0.01s          7.07±0s     0.82  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 100000, 100)
-       283±0.8ms        202±0.3ms     0.71  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(1000, 10000, 100)
-    10.00±0.01ms      6.50±0.01ms     0.65  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 1000, 100)
-    10.00±0.01ms      6.40±0.01ms     0.64  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 1000, 100)
-      17.4±0.01s          9.97±0s     0.57  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(10000, 100000, 100)
-      19.9±0.2ms      8.07±0.03ms     0.41  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-         1.12±0s          371±1ms     0.33  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(1000, 100000, 100)
-      97.9±0.2ms      22.8±0.02ms     0.23  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 1000, 100)
-      97.4±0.2ms      22.6±0.02ms     0.23  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 1000, 100)
-       870±0.6ms       197±0.05ms     0.23  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 10000, 100)
-       869±0.7ms          195±1ms     0.22  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 10000, 100)
-       106±0.1ms      22.7±0.04ms     0.21  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 10000, 100)
-       106±0.1ms      22.5±0.03ms     0.21  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 10000, 100)
-       963±0.9ms        203±0.5ms     0.21  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 100000, 100)
-       963±0.6ms        203±0.6ms     0.21  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 100000, 100)
-       203±0.2ms      25.4±0.03ms     0.13  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-         1.85±0s        232±0.1ms     0.13  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)
-       206±0.2ms      25.3±0.06ms     0.12  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         1.81±0s       201±0.05ms     0.11  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
16 cores
       before           after         ratio
     [e7fb5b8c]       [19dd7cab]
     <main>           <pairwise-distances-argkmin>
-         16.4±0s       14.5±0.01s     0.88  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_locally_linear_embedding(1000, 100000, 100)
-      9.57±0.02s       8.13±0.01s     0.85  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(10000, 10000, 100)
-       2.84±0.3s       2.39±0.01s     0.84  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_mean_shift(1000, 10000, 100)
-      8.63±0.02s       7.05±0.03s     0.82  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 100000, 100)
-      2.73±0.03s        2.18±0.1s     0.80  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_tsne(1000, 1000, 100)
-         285±4ms        224±0.6ms     0.79  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(1000, 10000, 100)
-       111±0.7ms        85.0±20ms     0.77  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 10000, 100)
-      2.72±0.03s       2.07±0.03s     0.76  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_tsne(1000, 10000, 100)
-       2.78±0.3s       2.10±0.02s     0.75  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_tsne(1000, 100000, 100)
-     10.3±0.01ms      6.84±0.02ms     0.66  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 1000, 100)
-       17.2±0.1s       9.04±0.02s     0.53  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(10000, 100000, 100)
-         105±3ms       45.5±0.2ms     0.43  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 10000, 100)
-     20.4±0.07ms      8.38±0.03ms     0.41  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-         206±3ms       58.8±0.2ms     0.29  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-       101±0.5ms       27.0±0.1ms     0.27  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 1000, 100)
-      1.09±0.01s        278±0.4ms     0.25  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(1000, 100000, 100)
-         101±1ms      15.4±0.05ms     0.15  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 1000, 100)
-         849±7ms        123±0.1ms     0.15  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 10000, 100)
-        919±40ms        124±0.4ms     0.13  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 10000, 100)
-         924±4ms        112±0.6ms     0.12  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 100000, 100)
-        931±10ms        111±0.4ms     0.12  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 100000, 100)
-         209±1ms      17.7±0.05ms     0.08  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-      1.79±0.02s        142±0.3ms     0.08  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
-      1.85±0.05s        129±0.1ms     0.07  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
20 cores
       before           after         ratio
     [e7fb5b8c]       [19dd7cab]
     <main>           <pairwise-distances-argkmin>
-      16.3±0.02s          14.4±0s     0.88  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_locally_linear_embedding(1000, 100000, 100)
-      9.13±0.02s       8.05±0.03s     0.88  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(10000, 1000, 100)
-         8.59±0s       6.96±0.02s     0.81  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_isomap(1000, 100000, 100)
-       279±0.7ms        225±0.2ms     0.81  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(1000, 10000, 100)
-      10.0±0.05s       8.05±0.01s     0.80  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(10000, 10000, 100)
-     9.38±0.02ms      6.92±0.02ms     0.74  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 1000, 100)
-     9.42±0.02ms       6.65±0.2ms     0.71  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 1000, 100)
-       17.2±0.2s       8.75±0.01s     0.51  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(10000, 100000, 100)
-       102±0.3ms       46.1±0.3ms     0.45  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 10000, 100)
-       104±0.6ms       46.3±0.1ms     0.45  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 10000, 100)
-      19.7±0.1ms      8.47±0.04ms     0.43  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-       200±0.8ms       59.3±0.2ms     0.30  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-         1.09±0s        264±0.5ms     0.24  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_birch(1000, 100000, 100)
-      93.5±0.7ms      11.5±0.04ms     0.12  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 1000, 100)
-      94.2±0.4ms      11.2±0.05ms     0.12  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 1000, 100)
-        844±10ms       88.1±0.4ms     0.10  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(10000, 10000, 100)
-        864±20ms       88.4±0.2ms     0.10  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(10000, 10000, 100)
-        963±20ms       94.9±0.7ms     0.10  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin_min(1000, 100000, 100)
-         932±2ms       90.7±0.2ms     0.10  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_pairwise_distances_argmin(1000, 100000, 100)
-       202±0.3ms      13.8±0.06ms     0.07  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-      1.87±0.02s        108±0.3ms     0.06  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)
-      1.84±0.04s        105±0.5ms     0.06  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
Benchmarks information
Machine specification
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  12
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
Stepping:            4
CPU MHz:             1000.223
BogoMIPS:            4600.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            16896K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsave

Any other comments?

#20254 originally proposed more changes.

Those changes have been removed from the branch (in fa424a4 and 5678666) but will be introduced in follow-up PRs.

jjerphan and others added 30 commits July 9, 2021 17:23
Co-authored-by: Jérémie du Boisberranger
<jeremiedbb@users.noreply.github.com>
Introduce PairwiseDistancesReduction.is_usable_for encapsulating
the test for branching logic.

Also removed unused code due to new implementation.
StdVectorSentinel makes a proper life-cycle
management for std::vectors' buffers possible.

Duplication seems needed as fused types
can't be used as attributes.

It's possible to obtain a missing symbol
(`_ZSt28__throw_bad_array_new_lengthv`) at
runtime.

This is unrelated to the implementation here,
and there are issues reporting the problem, e.g.:
cython/cython#4218.

A temporary workaround:
stan-dev/pystan#294 (comment)
Also removes _finalise_compute.
Add RadiusNeighborhood as a PairwiseDistancesReduction
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Probably this should get its own test.
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Dec 24, 2021
…cikit-learn#21501)

This modify the test configuration so that it
makes sense for when a sole sample is provided
for MeanShift.

This test was passing previously for this
configuration but was not supposed to.

The new implementation strategy for kneighbors
which uses PairwiseDistancesArgKmin (see scikit-learn#21462)
is numerically stabler for this case,
motivating this modication.

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
glemaitre added a commit that referenced this pull request Dec 25, 2021
…21501)

This modify the test configuration so that it
makes sense for when a sole sample is provided
for MeanShift.

This test was passing previously for this
configuration but was not supposed to.

The new implementation strategy for kneighbors
which uses PairwiseDistancesArgKmin (see #21462)
is numerically stabler for this case,
motivating this modication.

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
@ogrisel
Copy link
Member

ogrisel commented Feb 11, 2022

I did some quick benchmark with a k-NN classifier on my Apple M1 laptop:

main

  • single thread (OMP_NUM_THREADS=1):
n_samples_train: 1000000, n_samples_test: 1000, acc: 0.994, duration: 18.120 s
  • all available cores (default)
n_samples_train: 1000000, n_samples_test: 1000, acc: 0.994, duration: 14.243 s

#21462

  • single thread (OMP_NUM_THREADS=1):
n_samples_train: 1000000, n_samples_test: 1000, acc: 0.994, duration: 5.670 s
  • all available cores (default):
n_samples_train: 1000000, n_samples_test: 1000, acc: 0.994, duration: 1.196 s

This machine just has 4 performance cores and 4 efficiency cores, so a speed up of ~12x is a nice surprise :) As expected this PR both improves both the hardware scalability (more efficient use of the cores) and the CPU cache (to avoid io bottlenecks).

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the merged conflicts but I realized I made a mistake in the ordering of the sections of the changelog. Let me fix it quickly.

@lorentzenchr
Copy link
Member

#22134 is merged

@ogrisel
Copy link
Member

ogrisel commented Feb 17, 2022

Ah sorry for commenting here. I mixed it up with #22134.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants