ENH Introduce PairwiseDistancesRadiusNeighborhood
#22320
Conversation
…it-learn#22064) Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Remove duplicated Bunch
Also move the error message upfront if results have to be sorted without the distances being returned.
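As a hedged illustration of that constraint (the exact error message is assumed, not quoted from the code), requesting sorted results without returning the distances is expected to be rejected upfront:

import numpy as np
from sklearn.metrics._pairwise_distances_reduction import (
    PairwiseDistancesRadiusNeighborhood,
)

X = np.random.rand(100, 10)
Y = np.random.rand(50, 10)
try:
    # sort_results=True needs the distances to sort by, so combining it with
    # return_distance=False should raise an error (exact message assumed).
    PairwiseDistancesRadiusNeighborhood.compute(
        X, Y, radius=1.0, return_distance=False, sort_results=True
    )
except ValueError as e:
    print(e)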
No memory leaks. Tested with this script:

from sklearn.metrics._pairwise_distances_reduction import (
    PairwiseDistancesRadiusNeighborhood,
)

import numpy as np
import psutil
import pandas as pd
import gc
import matplotlib.pyplot as plt

if __name__ == "__main__":
    memory_info = []
    mem_info = psutil.Process().memory_full_info()
    memory_info.append(mem_info)

    rng = np.random.RandomState(0)
    X = rng.rand(10000, 100)
    Y = rng.rand(10, 100)

    mem_info = psutil.Process().memory_full_info()
    memory_info.append(mem_info)

    n_calls = 100
    radius = 10000
    return_distance = True

    for i in range(n_calls):
        PairwiseDistancesRadiusNeighborhood.compute(
            X, Y,
            radius=radius,
            return_distance=return_distance,
            sort_results=True,
            strategy="parallel_on_Y",
        )
        mem_info = psutil.Process().memory_full_info()
        print(f"Iteration {i} / {n_calls}", mem_info)
        memory_info.append(mem_info)

    gc.collect()
    mem_info = psutil.Process().memory_full_info()
    memory_info.append(mem_info)

    title = "Memory usage on calls to PairwiseDistancesRadiusNeighborhood.compute"
    title += f"\n X.shape={X.shape} -- Y.shape={Y.shape} " \
             f"radius={radius} -- return_distance={return_distance}"

    df = pd.DataFrame(data=memory_info)
    df[["data", "rss", "uss"]].plot(title=title,
                                    xlabel="Number of calls",
                                    ylabel="Size of segment (in bytes)",
                                    figsize=(16, 9), fontsize=10)
    plt.savefig(f"/tmp/n_calls={n_calls}_radius={radius}_return_distance={return_distance}.png")
    plt.show()
I think we can consider reviewing this PR for inclusion and merging it without requiring the integration of mimalloc. Integrating mimalloc (in a vendored dynamic lib in the PyPI wheels or as an explicit dependency for the conda-forge package) might be doable with manageable maintenance overhead, but can be explored in a follow-up PR.
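For reference, a minimal sketch of how mimalloc could be tried out without any packaging change, assuming a locally installed libmimalloc (the shared-library path below is an assumption and must be adapted to your system):

import os
import subprocess

# Hypothetical path to the mimalloc shared library; adjust for your platform.
MIMALLOC = "/usr/lib/x86_64-linux-gnu/libmimalloc.so"

# LD_PRELOAD-ing mimalloc in the child process replaces the default malloc(3)
# for the whole benchmark run.
env = dict(os.environ, LD_PRELOAD=MIMALLOC)
subprocess.run(["python", "bench_radius_neighbors.py"], env=env, check=True)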
This looks good to me (pending that the full [cd build] is green)!
I did some tests on my local Apple M1 laptop (without mimalloc) and I typically observe 3x to 5x speed-ups over the main branch and slightly improved memory usage.
In particular I tried with the following script:
from time import perf_counter
import numpy as np
from joblib import Memory
from scipy.spatial.distance import cdist
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import RadiusNeighborsClassifier
n_samples_train = 1_000_000
n_samples_test = 1000
n_features = 100
m = Memory(location="/tmp/joblib")
make_blobs = m.cache(make_blobs)
X, y = make_blobs(
n_samples=n_samples_train + n_samples_test,
centers=30,
n_features=n_features,
random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=n_samples_test, random_state=0
)
radius = np.quantile(a=cdist(X_train[:10000], X_test[:10]).ravel(), q=0.001)
print(f"{radius=:.3f}")
rnc = RadiusNeighborsClassifier(radius=radius, outlier_label="most_frequent").fit(
X_train, y_train
)
for i in range(10):
    t0 = perf_counter()
    accuracy = rnc.score(X_test, y_test)
    t1 = perf_counter()
    print(
        f"{n_samples_train=}, {n_samples_test=}, {n_features=}, "
        f"{accuracy=}, duration={t1 - t0:.3f}s"
    )
and launched it with mprof run bench_radius_neighbors.py / mprof plot, which gave the following results:
- on main:
radius=12.098
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=5.216s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=4.714s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=4.996s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=4.937s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=5.077s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=4.868s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=4.766s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=4.959s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=4.835s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=4.938s
- on this branch:
radius=12.098
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.193s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.227s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.221s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.253s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.264s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.364s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.271s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.321s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.323s
n_samples_train=1000000, n_samples_test=1000, n_features=100, accuracy=0.997, duration=1.291s
So this branch is both faster and more memory efficient. More importantly, the memory management seems to be correct: memory remains stable after several iterations (no memory leak)! I also tried moving the constructor under the for loop (same result) and running for a larger number of iterations (memory usage remained stable).
I also tried viztracer --open bench_radius_neighbors.py and did not see anything suspicious.
Also, all scikit-learn tests pass on macos/arm64, which is currently not tested on the CI.
Some minor comments and suggestions below:
Thank you for this thoughtful review, @ogrisel: I highly appreciate it. 🤝 I will address your comments tomorrow.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
BLD Add compiler args for C++11
I opened jjerphan#13 and jjerphan#12 on your fork with ideas about using vectors and reducing derefs in the StdVectorSentinel.
At some point, I feel like StdVectorSentinel can be useful in other parts of the codebase. I am okay with keeping it in pairwise for now.
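For intuition only, here is a pure-Python analogue of what the sentinel achieves (this is not the Cython StdVectorSentinel itself): the returned numpy array keeps a reference to the object that owns its memory, so the buffer outlives the scope that created it:

import numpy as np

owner = bytearray(8 * 5)  # stands in for the C++ std::vector's heap buffer
arr = np.frombuffer(owner, dtype=np.float64)

# The array keeps a reference to its memory owner, which plays the sentinel
# role: the buffer stays alive as long as the array does.
print(arr.base)
del owner
print(arr)  # still valid, since the array holds the last reference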
Otherwise, the implementation looks solid and I see the same performance benefits when running the benchmarks from #22320 (review)
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Yes, I do agree. Probably this can be refactored with other fixtures using C++ to have proper common Cython interfaces.
I ran more benchmarks using #22320 (review) with other metrics such as metric="chebyshev" to go down the PairwiseDistancesRadiusNeighborhood code path, and I see memory and runtime improvements compared to main.
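For completeness, a hedged sketch of that kind of run on small synthetic data (sizes and the radius are arbitrary here, chosen only to exercise the code path quickly):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import RadiusNeighborsClassifier

X, y = make_blobs(n_samples=21_000, centers=30, n_features=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1000, random_state=0
)

# metric="chebyshev" avoids the specialized Euclidean implementation and goes
# down the generic PairwiseDistancesRadiusNeighborhood code path.
rnc = RadiusNeighborsClassifier(
    radius=5.0, metric="chebyshev", outlier_label="most_frequent"
).fit(X_train, y_train)
print(rnc.score(X_test, y_test))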
I pushed a [cd build] commit and the wheels build successfully with the adoption of vector for dist_middle_terms_chunks in FastEuclideanPairwiseDistancesRadiusNeighborhood.
LGTM
Thank you once again, @thomasjpfan and @ogrisel, for the thorough reviews.
The radius definition is from Olivier: scikit-learn#22320 (review) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Reference Issues/PRs
Comes after #22134.
What does this implement/fix? Explain your changes.
This introduces two new concrete PairwiseDistancesReductions for radius-neighborhood querying.
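For illustration, a hedged sketch of how such a reduction is used; the exact return types are assumed to mirror the existing radius_neighbors API (one array of indices and one of distances per query row of X):

import numpy as np
from sklearn.metrics._pairwise_distances_reduction import (
    PairwiseDistancesRadiusNeighborhood,
)

rng = np.random.RandomState(0)
X = rng.rand(1000, 20)  # query vectors
Y = rng.rand(2000, 20)  # database vectors

neigh_dist, neigh_ind = PairwiseDistancesRadiusNeighborhood.compute(
    X, Y, radius=1.5, return_distance=True, sort_results=True
)
# Radius queries are "dynamic": each query gets a different number of
# neighbors, hence the need for resizable buffers internally.
print([ind.shape[0] for ind in neigh_ind[:5]])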
Any other comments?
It uses resizable buffers as provided via std::vector, with some adaptation to return them as numpy arrays safely. However, it suffers from concurrent reallocation across threads (namely when the vectors' buffers are being reallocated).
This concurrent reallocation causes some drops in performance because calls to malloc(3) (used under the hood for the buffers' reallocations) lock by default in the compilers' standard library implementations. Concretely, the speed-up for the new PairwiseDistancesRadiusNeighborhood only gets up to 4, contrary to 20 for PairwiseDistancesArgKmin, which does not suffer from this problem.
A possible solution to alleviate this problem would be to use another implementation of malloc(3), such as mimalloc's.
Finally, contrary to k-nn querying which is static, radius-neighbors querying is dynamic, hence we might want to adapt the scheduling of the prange loops with dynamic or guided. IMO, this should be done in another PR.
Hardware Scalability on 1.0 (9b03375) using this script
The current implementation on 1.0 does not scale, notably because of PyObjects (this is especially the case when n_features is small).
Hardware Scalability for this PR @ aeb79ca using this script
With mimalloc, this PR's implementation scales linearly up to 8 threads and sublinearly up to 64 threads.
Without mimalloc, performance is worse and variable.
Without mimalloc:
Raw results
With mimalloc:
Raw results
Speed-ups between 1.0 (998e8f2) and this PR @ c4d8c4a (via 5f3f389a67a2fe914959ee3a0abc7ad3c6e220bd)
From ×2 up to ×20 speed-ups in normal configurations.
Small regression in some cases.
(I need to check if mimalloc can be LD_PRELOAD-ed when using asv to report results.)
Without mimalloc:
1 core
2 cores
4 cores
8 cores
16 cores
32 cores
64 cores
128 cores
Benchmarks information
Machine specification