-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ENH Pairwise Distances ArgKmin #21462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
Introduce PairwiseDistancesReduction.is_usable_for encapsulating the test for branching logic. Also removed unused code due to new implementation.
StdVectorSentinel makes a proper life-cycle management for std::vectors' buffers possible. Duplication seems needed as fused types can't be used as attributes. It's possible to obtain a missing symbol (`_ZSt28__throw_bad_array_new_lengthv`) at runtime. This is unrelated to the implementation here, and there are issues reporting the problem, e.g.: cython/cython#4218. A temporary workaround: stan-dev/pystan#294 (comment)
Also removes _finalise_compute.
Add RadiusNeighborhood as a PairwiseDistancesReduction
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Probably this should get its own test.
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
This was mentionned by one of Oliviers' review but I forgot it. Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
…cikit-learn#21501) This modify the test configuration so that it makes sense for when a sole sample is provided for MeanShift. This test was passing previously for this configuration but was not supposed to. The new implementation strategy for kneighbors which uses PairwiseDistancesArgKmin (see scikit-learn#21462) is numerically stabler for this case, motivating this modication. Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
…21501) This modify the test configuration so that it makes sense for when a sole sample is provided for MeanShift. This test was passing previously for this configuration but was not supposed to. The new implementation strategy for kneighbors which uses PairwiseDistancesArgKmin (see #21462) is numerically stabler for this case, motivating this modication. Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
I did some quick benchmark with a k-NN classifier on my Apple M1 laptop:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed the merged conflicts but I realized I made a mistake in the ordering of the sections of the changelog. Let me fix it quickly.
#22134 is merged |
Ah sorry for commenting here. I mixed it up with #22134. |
Reference Issues/PRs
Supersedes #20254.
Context
Reductions over pairwise distances (argmin, argkmin, threshold filtering, cumulative sum, etc.) are one of the foundations used by many algorithms within scikit-learn.
Various strategies exist to compute them differently, including some using chunks (see
sklearn/metrics/pairwise.py
).However, those are still mainly implemented at the level of Python leaving some rooms for manœuvre, especially for potential optimizations which can be made at a lower level.
Proposals of this PR
This PR proposes:
PairwiseDistancesReduction
, an abstraction allowing computing reductions efficiently in parallelPairwiseDistancesArgkmin
, a first implementation ofPairwiseDistancesReduction
for k-Nearest Neighbors searchFastEuclideanPairwiseDistancesArgKmin
a specialization ofPairwiseDistancesArgkmin
for the euclidean and squared euclidean distances which uses the GEMM trickDatasetsPair
, an abstraction to wrap a pair of two datasets and compute their vectors pairwise distancesDenseDenseDatasetsPair
, a first implementation ofDatasetsPair
for pair of two dense datasetsPairwiseDistancesArgkmin
when and where possibleResults
Scalability results (I beg readers' pardon for my plotting skills)
1.0 implementation do not scale well (not even sublinearly).
This PR implementation does scales linearly up to 64 threads.
Results are obtained via this script.
This PR @ 19dd7ca
(n_train, n_test, k) = (1 000 000, 10 000, 10)
(n_train, n_test, k) = (100 000, 100 000, 10)
Raw results
1.0 (9b03375)
(n_train, n_test, k) = (1 000 000, 10 000, 10)
(n_train, n_test, k) = (100 000, 100 000, 10)
Raw results
Benchmarks information
Machine specification
Benchmarks results between 1.0 (e7fb5b8) and this PR @ 19dd7ca (via 5f3f389)
Between ×1.2 and ×20 speed-ups in normal configurations.
Small regression when using 1 core on small datasets.
1 core
2 cores
4 cores
To come soon.
8 cores
16 cores
20 cores
Benchmarks information
Machine specification
Any other comments?
#20254 originally proposed more changes.
Those changes have been removed from the branch (in fa424a4 and 5678666) but will be introduced in follow-up PRs.