
ENH Introduce PairwiseDistancesRadiusNeighborhood #22320


Merged

Conversation


@jjerphan jjerphan commented Jan 28, 2022

Reference Issues/PRs

Comes after #22134.

What does this implement/fix? Explain your changes.

This introduces two new concrete PairwiseDistancesReductions for radius-neighborhood querying.

Any other comments?

It uses resizable buffers, as provided by std::vector, with some adaptation to return them safely as numpy arrays. However, it suffers from concurrent reallocation across threads (namely when the vectors’ buffers are being reallocated).
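The wrapping trick can be pictured in pure Python (a loose analogy, not the actual Cython sentinel code; `owner` here stands in for a std::vector's heap buffer): a numpy array built over an externally owned buffer keeps that owner alive through its `base`, so no copy is needed and the memory is freed only once the array itself is garbage-collected.

```python
import numpy as np

# Pure-Python analogy (not the actual Cython StdVectorSentinel) of
# exposing an externally owned buffer as a numpy array without copying:
# the resulting array holds a reference to the buffer's owner, so the
# memory outlives any intermediate scope and writes go straight into it.
owner = bytearray(8 * np.dtype(np.float64).itemsize)  # stand-in for a std::vector<double> buffer
arr = np.frombuffer(owner, dtype=np.float64)

arr[:] = np.arange(8)  # zero-copy: this writes directly into `owner`'s memory
```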

This concurrent reallocation causes some drops in performance because calls to malloc(3) (used under the hood for the buffers’ reallocations) lock by default in the compilers’ standard library implementations. Concretely, the new PairwiseDistancesRadiusNeighborhood only reaches a speed-up of about 4×, compared to 20× for PairwiseDistancesArgKmin, which does not suffer from this problem.

A possible solution to alleviate this problem would be to use another implementation of malloc(3), such as mimalloc's.

Finally, contrarily to k-NN querying, which is static (the result size is known in advance), radius-neighbours querying is dynamic (the number of neighbours per query point is unknown upfront), hence we might want to adapt the scheduling of prange loops with dynamic or guided. IMO, this should be done in another PR.
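As a toy illustration of why this matters (made-up workload numbers, not scikit-learn code): with variable neighbourhood sizes, equal-sized contiguous chunks of queries carry unequal amounts of work, which is exactly what dynamic or guided scheduling mitigates.

```python
import numpy as np

# Toy model of the load imbalance motivating dynamic scheduling:
# each radius query returns a variable number of neighbours, so
# static scheduling (one contiguous, equal-sized chunk of queries
# per thread) can assign very different amounts of work per thread.
rng = np.random.default_rng(0)
neighbours_per_query = rng.integers(1, 1000, size=1024)

# 8 threads, each taking one contiguous chunk of 128 queries.
static_work = neighbours_per_query.reshape(8, 128).sum(axis=1)
imbalance = static_work.max() / static_work.min()
print(f"work imbalance across threads: {imbalance:.2f}x")
```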

Hardware Scalability on 1.0 (9b03375) using this script

The current implementation on 1.0 does not scale because it:

  • is tied to the CPython GIL due to many small PyObjects (this is especially the case when n_features is small)
  • reallocates a lot of temporary numpy arrays in various threads (which is costly and comes with concurrency overhead)
  • performs a lot of back-and-forth movement of data between the RAM and the caches

(figure: speed_up_100000_100000_log_1.0)

Hardware Scalability for this PR @ aeb79ca using this script

With mimalloc, this PR's implementation scales linearly up to 8 threads and sublinearly up to 64 threads.

Without mimalloc, performance is worse and more variable.

Without mimalloc:

(figure: speed_up_100000_100000_log_without_mimalloc)

Raw results
    n_threads  n_train  n_test  n_features  mean_runtime  stderr_runtime
0           1    10000   10000          50      1.858132               0
1           2    10000   10000          50      0.943718               0
2           4    10000   10000          50      0.595347               0
3           8    10000   10000          50      0.350731               0
4          16    10000   10000          50      0.522424               0
5          32    10000   10000          50      0.444270               0
6          64    10000   10000          50      0.559822               0
7         128    10000   10000          50      0.396564               0
8           1    10000   10000         100      2.063410               0
9           2    10000   10000         100      1.191579               0
10          4    10000   10000         100      0.623694               0
11          8    10000   10000         100      0.359428               0
12         16    10000   10000         100      0.353618               0
13         32    10000   10000         100      0.283564               0
14         64    10000   10000         100      0.616601               0
15        128    10000   10000         100      0.439544               0
16          1    10000   10000         500      4.059608               0
17          2    10000   10000         500      2.084081               0
18          4    10000   10000         500      1.088404               0
19          8    10000   10000         500      0.607854               0
20         16    10000   10000         500      0.585447               0
21         32    10000   10000         500      0.539778               0
22         64    10000   10000         500      0.728474               0
23        128    10000   10000         500      0.497490               0

With mimalloc:

(figure: speed_up_100000_100000_log_with_mimalloc)

Raw results
    n_threads  n_train  n_test  n_features  mean_runtime  stderr_runtime
0           1    10000   10000          50      1.843863               0
1           2    10000   10000          50      0.878143               0
2           4    10000   10000          50      0.477720               0
3           8    10000   10000          50      0.248385               0
4          16    10000   10000          50      0.202744               0
5          32    10000   10000          50      0.155233               0
6          64    10000   10000          50      0.123474               0
7         128    10000   10000          50      0.243421               0
8           1    10000   10000         100      2.043319               0
9           2    10000   10000         100      1.030469               0
10          4    10000   10000         100      0.528292               0
11          8    10000   10000         100      0.308843               0
12         16    10000   10000         100      0.219315               0
13         32    10000   10000         100      0.165821               0
14         64    10000   10000         100      0.144672               0
15        128    10000   10000         100      0.268027               0
16          1    10000   10000         500      3.942212               0
17          2    10000   10000         500      2.009313               0
18          4    10000   10000         500      1.010291               0
19          8    10000   10000         500      0.537854               0
20         16    10000   10000         500      0.387427               0
21         32    10000   10000         500      0.314703               0
22         64    10000   10000         500      0.254417               0
23        128    10000   10000         500      0.378934               0

Speed-ups between 1.0 (998e8f2) and this PR @ c4d8c4a (via 5f3f389a67a2fe914959ee3a0abc7ad3c6e220bd)

Speed-ups from ×2 up to ×20 in normal configurations.
Small regressions in some cases.

(I need to check if mimalloc can be LD_PRELOAD-ed when using asv to report results).
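As an aside on verifying the preload, one can check from within the benchmarked process whether mimalloc was actually picked up (a hypothetical Linux-only helper, not part of this PR; it merely scans the shared objects mapped into the process):

```python
def mimalloc_preloaded() -> bool:
    """Return True if mimalloc appears among the shared libraries
    mapped into the current process (Linux only)."""
    try:
        with open("/proc/self/maps") as f:
            return "mimalloc" in f.read()
    except OSError:  # e.g. no /proc filesystem on this platform
        return False

print(f"mimalloc preloaded: {mimalloc_preloaded()}")
```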

Without mimalloc:

1 core
       before           after         ratio
     [998e8f20]       [c4d8c4a4]
     <main>           <pairwise-distances-radius-neighborhood>
-      86.9±0.6ms       17.8±0.2ms     0.20  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 1000, 100)
-         835±2ms        161±0.2ms     0.19  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 10000, 100)
-         8.37±0s          1.58±0s     0.19  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 100000, 100)
-      13.0±0.1ms      2.41±0.04ms     0.18  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 1000, 100)
-       148±0.6ms       21.4±0.1ms     0.14  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 10000, 100)
-         1.47±0s        203±0.3ms     0.14  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
2 cores
       before           after         ratio
     [998e8f20]       [c4d8c4a4]
     <main>           <pairwise-distances-radius-neighborhood>
-       595±0.7ms        318±0.8ms     0.53  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 10000, 100)
-         5.98±0s       3.11±0.01s     0.52  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 100000, 100)
-      63.4±0.3ms       32.4±0.2ms     0.51  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 1000, 100)
-      11.0±0.2ms      3.98±0.06ms     0.36  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 1000, 100)
-         1.24±0s          373±1ms     0.30  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 100000, 100)
-       125±0.6ms       37.4±0.2ms     0.30  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 10000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
4 cores
       before           after         ratio
     [998e8f20]       [c4d8c4a4]
     <main>           <pairwise-distances-radius-neighborhood>
-      52.2±0.4ms       18.4±0.3ms     0.35  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 1000, 100)
-         481±1ms          166±1ms     0.35  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 10000, 100)
-      4.83±0.01s       1.62±0.02s     0.34  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 100000, 100)
-     10.0±0.09ms      2.79±0.08ms     0.28  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 1000, 100)
-       113±0.6ms       23.2±0.3ms     0.20  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 10000, 100)
-         1.11±0s          221±2ms     0.20  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
8 cores
       before           after         ratio
     [998e8f20]       [c4d8c4a4]
     <main>           <pairwise-distances-radius-neighborhood>
-      10.8±0.2ms       3.20±0.1ms     0.30  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 1000, 100)
-      58.2±0.6ms       11.8±0.2ms     0.20  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 1000, 100)
-         535±2ms       91.9±0.5ms     0.17  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 10000, 100)
-         5.35±0s          886±6ms     0.17  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 100000, 100)
-       116±0.5ms       17.1±0.2ms     0.15  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 10000, 100)
-         1.11±0s          151±1ms     0.14  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
16 cores
       before           after         ratio
     [998e8f20]       [c4d8c4a4]
     <main>           <pairwise-distances-radius-neighborhood>
-      10.9±0.2ms       3.92±0.2ms     0.36  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 1000, 100)
-       123±0.9ms       28.7±0.3ms     0.23  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 10000, 100)
-      55.5±0.6ms       8.37±0.1ms     0.15  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 1000, 100)
-         495±8ms       71.0±0.8ms     0.14  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 10000, 100)
-      4.93±0.04s          508±3ms     0.10  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 100000, 100)
-      1.15±0.02s        117±0.9ms     0.10  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
32 cores
       before           after         ratio
     [998e8f20]       [c4d8c4a4]
     <main>           <pairwise-distances-radius-neighborhood>
-     10.9±0.05ms       5.25±0.3ms     0.48  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 1000, 100)
-         127±1ms       30.1±0.3ms     0.24  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 10000, 100)
-      54.3±0.6ms       9.03±0.8ms     0.17  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 1000, 100)
-         480±4ms       63.2±0.2ms     0.13  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 10000, 100)
-         1.17±0s        105±0.6ms     0.09  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 100000, 100)
-      4.80±0.01s          326±1ms     0.07  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
64 cores
       before           after         ratio
     [998e8f20]       [c4d8c4a4]
     <main>           <pairwise-distances-radius-neighborhood>
-     11.9±0.05ms         8.54±5ms     0.72  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 1000, 100)
-         152±2ms       34.1±0.2ms     0.22  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 10000, 100)
-        70.5±2ms         12.4±2ms     0.18  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 1000, 100)
-         581±5ms       56.5±0.4ms     0.10  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 10000, 100)
-      1.30±0.01s        102±0.7ms     0.08  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 100000, 100)
-      5.76±0.01s          263±1ms     0.05  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
128 cores
       before           after         ratio
     [998e8f20]       [c4d8c4a4]
     <main>           <pairwise-distances-radius-neighborhood>
+      12.3±0.2ms          146±7ms    11.88  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 1000, 100)
+        83.8±4ms          135±4ms     1.61  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 1000, 100)
-      1.39±0.02s         434±10ms     0.31  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(1000, 100000, 100)
-         643±4ms          188±5ms     0.29  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 10000, 100)
-      6.31±0.03s         624±20ms     0.10  pdr.PairwiseDistancesRadiusNeighborhoodBenchmark.time_radius_neighbors(10000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

Benchmarks information

Machine specification
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7742 64-Core Processor
Stepping:            0
CPU MHz:             3392.969
BogoMIPS:            4491.83
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-63,128-191
NUMA node1 CPU(s):   64-127,192-255
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca 

jjerphan and others added 7 commits January 5, 2022 15:19
@jjerphan

No memory leaks.

Tested with this script
from sklearn.metrics._pairwise_distances_reduction import (
    PairwiseDistancesRadiusNeighborhood,
)
import numpy as np
import psutil
import pandas as pd
import gc
import matplotlib.pyplot as plt

if __name__ == "__main__":
    memory_info = []
    mem_info = psutil.Process().memory_full_info()
    memory_info.append(mem_info)
    rng = np.random.RandomState(0)
    X = rng.rand(10000, 100)
    Y = rng.rand(10, 100)
    mem_info = psutil.Process().memory_full_info()
    memory_info.append(mem_info)

    n_calls = 100
    radius = 10000
    return_distance = True

    for i in range(n_calls):
        PairwiseDistancesRadiusNeighborhood.compute(
                X, Y,
                radius=radius,
                return_distance=return_distance,
                sort_results=True,
                strategy="parallel_on_Y",
        )
        mem_info = psutil.Process().memory_full_info()
        print(f"Iteration {i} / {n_calls}", mem_info)
        memory_info.append(mem_info)

    gc.collect()
    mem_info = psutil.Process().memory_full_info()
    memory_info.append(mem_info)

    title = "Memory usage on calls to PairwiseDistancesRadiusNeighborhood.compute"
    title += f"\n X.shape={X.shape} -- Y.shape={Y.shape} " \
             f"radius={radius} -- return_distance={return_distance}"

    df = pd.DataFrame(data=memory_info)
    df[["data", "rss", "uss"]].plot(title=title,
                                    xlabel="Number of calls",
                                    ylabel="Size of segment (in bytes)",
                                    figsize=(16, 9), fontsize=10)
    plt.savefig(f"/tmp/n_calls={n_calls}_radius={radius}_return_distance={return_distance}.png")
    plt.show()

@jjerphan jjerphan marked this pull request as ready for review February 23, 2022 10:54

ogrisel commented Feb 23, 2022

I think we can consider reviewing this PR for inclusion and merging it without requiring the integration of mimalloc. Integrating mimalloc (as a vendored dynamic lib in the PyPI wheels or as an explicit dependency for the conda-forge package) might be doable with manageable maintenance overhead, but that can be explored in a follow-up PR (with [cd build] runs to check that it works).

@ogrisel ogrisel left a comment

This looks good to me (pending the full [cd build] being green)!

I did some tests on my local Apple M1 laptop (without mimalloc), and I typically observe 3x to 5x speed-ups over the main branch along with slightly improved memory usage.

In particular I tried with the following script:

from time import perf_counter
import numpy as np
from joblib import Memory
from scipy.spatial.distance import cdist
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import RadiusNeighborsClassifier


n_samples_train = 1_000_000
n_samples_test = 1000
n_features = 100

m = Memory(location="/tmp/joblib")
make_blobs = m.cache(make_blobs)

X, y = make_blobs(
    n_samples=n_samples_train + n_samples_test,
    centers=30,
    n_features=n_features,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=n_samples_test, random_state=0
)
radius = np.quantile(a=cdist(X_train[:10000], X_test[:10]).ravel(), q=0.001)
print(f"{radius=:.3f}")

rnc = RadiusNeighborsClassifier(radius=radius, outlier_label="most_frequent").fit(
    X_train, y_train
)
for i in range(10):
    t0 = perf_counter()
    accuracy = rnc.score(X_test, y_test)
    t1 = perf_counter()
    print(
        f"{n_samples_train=}, {n_samples_test=}, {n_features=}, "
        f"{accuracy=}, duration={t1 - t0:.3f}s"
    )

and launched it with mprof run bench_radius_neighbors.py / mprof plot, yielding the following results:

  • on main:
radius=12.098
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=5.216s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=4.714s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=4.996s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=4.937s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=5.077s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=4.868s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=4.766s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=4.959s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=4.835s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=4.938s

(figure: mem_bench_radius_main)

  • on this branch:
radius=12.098
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.193s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.227s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.221s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.253s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.264s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.364s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.271s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.321s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.323s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.291s

(figure: mem_bench_radius_pr)

So it is both faster and more memory efficient. More importantly, the memory management seems to be correct, because memory remains stable after several iterations (no memory leak)! I also tried moving the constructor under the for loop (the result is the same) and running for a larger number of iterations (memory usage remained stable).

I also tried viztracer --open bench_radius_neighbors.py and I did not see anything suspicious.

Also, all scikit-learn tests pass on macos/arm64, which is currently not tested on the CI.

Some minor comments and suggestions below:

@jjerphan

Thank you for this thoughtful review, @ogrisel: I highly appreciate it. 🤝

I will address your comments tomorrow.

@thomasjpfan thomasjpfan left a comment

I opened jjerphan#13 and jjerphan#12 on your fork with ideas about using vectors and reducing dereferences in the StdVectorSentinel.

At some point, I feel like StdVectorSentinel can be useful in other parts of the codebase. I am okay to keep in pairwise for now.

Otherwise, the implementation looks solid, and I see the same performance benefits when running the benchmarks from #22320 (review)

@jjerphan

jjerphan commented Mar 4, 2022

At some point, I feel like StdVectorSentinel can be useful in other parts of the codebase. I am okay to keep in pairwise for now.

Yes, I do agree. This could probably be refactored together with other fixtures using C++ to provide proper common Cython interfaces.

@thomasjpfan thomasjpfan left a comment

I ran more benchmarks using #22320 (review) with other metrics, such as metric="chebyshev", to exercise the PairwiseDistancesRadiusNeighborhood code path, and I see memory and runtime improvements compared to main.

I pushed a [cd build] commit and the wheels build successfully with the adoption of vector for dist_middle_terms_chunks in FastEuclideanPairwiseDistancesRadiusNeighborhood.

LGTM

@thomasjpfan thomasjpfan merged commit fb71eef into scikit-learn:main Mar 4, 2022
@jjerphan jjerphan deleted the pairwise-distances-radius-neighborhood branch March 4, 2022 17:49
@jjerphan

jjerphan commented Mar 4, 2022

Thank you once again, @thomasjpfan and @ogrisel for the thorough reviews.
