ENH Improve performance of KNeighborsClassifier.predict #23721


Closed
wants to merge 22 commits

Conversation

Micky774
Contributor

@Micky774 Micky774 commented Jun 21, 2022

Reference Issues/PRs

Fixes #13783
Resolves #14543 (stalled/draft)

What does this implement/fix? Explain your changes.

Leverages csr_matrix to quickly compute the mode in KNeighborsClassifier.predict (replacing scipy.stats.mode) for uniform weights.

Any other comments?

Theoretically, this is a faster operation even in the weighted case. However, csr_matrix.argmax sums duplicate entries (the very behavior we aim to exploit) in place, mutating the underlying data array. This is very problematic since it leads to incorrect results in the multi-output loop. We could "fix" this by passing copies of the weights when creating the csr_matrix, but that defeats the whole point.

Hence, this is currently only a meaningful speedup for weights in {None, "uniform"}, since we can cheaply compute an ndarray of ones on each loop iteration to feed to the csr_matrix without worrying about it being mutated.
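
For illustration, a minimal sketch of the trick (the helper name and signature are hypothetical, not the PR's exact code), assuming integer-encoded labels:

import numpy as np
from scipy.sparse import csr_matrix

def csr_mode(neigh_labels, n_classes):
    """Row-wise mode of integer-encoded neighbor labels via CSR summation."""
    n_queries, n_neighbors = neigh_labels.shape
    # A fresh array of ones per call, so in-place duplicate summation
    # cannot corrupt any shared weight data.
    data = np.ones(neigh_labels.size)
    rows = np.repeat(np.arange(n_queries), n_neighbors)
    # Duplicate (row, class) entries are summed during CSR construction,
    # leaving one vote count per (query, class) pair.
    votes = csr_matrix(
        (data, (rows, neigh_labels.ravel())),
        shape=(n_queries, n_classes),
    )
    # The column index of the largest vote count in each row is the mode.
    return np.asarray(votes.argmax(axis=1)).ravel()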

To Do

  • Memory benchmarks

@Micky774
Contributor Author

Initial benchmarks generated w/ this script. Note that the implementation labeled "PR" uses the sparse-matrix argmax for all methods, while "hybrid" uses it only for uniform weights. This PR currently implements the so-called "hybrid" option.

[Plot: benchmark results comparing the "main", "PR", and "hybrid" implementations]

@Micky774
Contributor Author

No significant differences in memory profile either. Tested in a Jupyter notebook with:

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

rng = np.random.RandomState(0)
X = rng.random(size=(10_000, 200))
y = rng.randint(1, 6, size=(10_000, 3))
neigh = KNeighborsClassifier(n_neighbors=10, weights="uniform")
neigh.fit(X, y)

%load_ext memory_profiler
%memit neigh.predict(X)

Both implementations came in at ~23-25 MiB for weights in {"uniform", "distance"}

@jjerphan
Member

Thank you for the follow-up, @Micky774.

Note that we might then introduce dedicated PairwiseDistancesReductions as back-ends for KNeighbors{Classifier,Regressor}.{predict,predict_proba}. For more details, see: #22587.

If you are interested and want to implement some of them, feel free to! 🙂

@ogrisel
Member

ogrisel commented Jun 24, 2022

@Micky774 it would be great if you could review #23604 in particular which is the first step to accelerate the k-NN queries on sparse data using the new Cython infrastructure.

@jjerphan
Member

Everybody's welcome on the PairwiseDistancesReductions-:boat:.

@Micky774
Contributor Author

@ogrisel @jjerphan Even though it'll be replaced w/ the new Cython back-end, since this is simple enough and demonstrably better than what we have right now, may we move forward w/ a review just to provide a performance gain in the meantime?

@jjerphan
Member

Yes, sure. I do not see those two tasks as mutually exclusive.

@ogrisel
Member

ogrisel commented Jun 29, 2022

Indeed, I had not realized that this was happening after the neighbors computation. This code can probably be further Cythonized to use OpenMP / prange parallelism, but in the meantime this looks like a quick way to improve the code. Let me do a proper review now.

@Micky774 Micky774 added the Quick Review label Jul 6, 2022
jjerphan previously approved these changes Jul 7, 2022
Member

@jjerphan jjerphan left a comment


LGTM. Thank you @Micky774 and @ogrisel.

Member

@thomasjpfan thomasjpfan left a comment


There is a significant memory overhead when n_samples gets bigger:

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

rng = np.random.RandomState(0)
n_samples = 100_000
X = rng.random(size=(n_samples, 200))
y = rng.randint(1, 6, size=(n_samples, 3))
neigh = KNeighborsClassifier(n_neighbors=10, weights="uniform")
neigh.fit(X, y)

On main, I get:

%memit neigh.predict(X)
# peak memory: 319.08 MiB, increment: 33.17 MiB

and with this PR:

%memit neigh.predict(X)
# peak memory: 546.52 MiB, increment: 37.00 MiB

I think most of the memory overhead is from constructing the sparse matrix itself.
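
(Illustrative sketch, not from the PR: a rough way to isolate the construction cost in the same notebook setup, with memory_profiler loaded as above. The sizes mirror the benchmark, and the class count is assumed.)

from scipy.sparse import csr_matrix
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_neighbors, n_classes = 100_000, 10, 5
# COO-style inputs of the same shape predict would build per output.
rows = np.repeat(np.arange(n_samples), n_neighbors)
cols = rng.randint(0, n_classes, size=n_samples * n_neighbors)
data = np.ones(n_samples * n_neighbors)

%memit csr_matrix((data, (rows, cols)), shape=(n_samples, n_classes))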

@Micky774
Contributor Author

There is a significant memory overhead when n_samples gets bigger:
...
I think most of the memory overhead is from constructing the sparse matrix itself.

Yeah, that's actually very significant. In that case, it may be better to just go for the PairwiseDistancesReductions back-end instead of using this as an intermediate speed-up. We could potentially repurpose some of the machinery that scipy.sparse uses, but I honestly don't think it's worth the effort.

@thomasjpfan
Member

With the memory overhead, I am -1 overall on this PR with CSR matrices. We likely need another approach altogether to resolve #13783, either through Cython (as suggested in #23721 (comment)) or something more efficient in Python.
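
(Illustrative sketch, not proposed in this thread: one "more efficient in Python" direction could accumulate votes into a dense count array with np.add.at, skipping sparse-matrix construction entirely. The helper name is hypothetical and integer-encoded labels are assumed.)

import numpy as np

def mode_via_add_at(neigh_labels, n_classes):
    """Row-wise mode via unbuffered vote accumulation; no sparse matrix."""
    n_queries = neigh_labels.shape[0]
    counts = np.zeros((n_queries, n_classes), dtype=np.intp)
    rows = np.arange(n_queries)[:, None]  # broadcasts against neigh_labels
    # np.add.at is unbuffered, so repeated (row, class) pairs all count.
    np.add.at(counts, (rows, neigh_labels), 1)
    return counts.argmax(axis=1)

The dense counts array is only n_queries x n_classes, which stays small when the number of classes is modest.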

@jjerphan jjerphan dismissed their stale review July 21, 2022 13:54

Disapproval due to performance regressions

@Micky774 Micky774 closed this Jul 21, 2022
@Micky774 Micky774 deleted the knn_predict_performance branch July 25, 2022 21:31
Labels: module:neighbors, Performance, Quick Review
Development

Successfully merging this pull request may close these issues.

knn predict unreasonably slow b/c of use of scipy.stats.mode
5 participants