ENH Improve performance of KNeighborsClassifier.predict
#23721
Conversation
Initial benchmarks generated w/ this script. Note that the implementation labeled "PR" uses the sparse matrix argmax for all methods, while "hybrid" uses it only for uniform weights. This PR currently implements the so-called "hybrid" option.

[benchmark plot attached]
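For context, here is a minimal sketch of the CSR-based vote counting the benchmarks above refer to. It is not the PR's actual diff; the helper name `csr_mode` and its signature are illustrative only:

```python
import numpy as np
from scipy.sparse import csr_matrix


def csr_mode(neighbor_labels, weights=None):
    """Row-wise (weighted) mode of integer labels via a sparse vote matrix.

    neighbor_labels : int array of shape (n_queries, n_neighbors), class indices
    weights         : float array of the same shape, or None for uniform weights
    """
    n_queries, n_neighbors = neighbor_labels.shape
    n_classes = int(neighbor_labels.max()) + 1
    if weights is None:
        # A fresh array of ones per call, so nothing caller-owned can be mutated.
        weights = np.ones(neighbor_labels.size)
    rows = np.repeat(np.arange(n_queries), n_neighbors)
    cols = neighbor_labels.ravel()
    # Duplicate (row, col) pairs are summed when the COO triplets are converted
    # to CSR, which is exactly the per-class vote accumulation we want.
    votes = csr_matrix(
        (np.asarray(weights, dtype=float).ravel(), (rows, cols)),
        shape=(n_queries, n_classes),
    )
    # The argmax over each row is the most-voted class for that query point.
    return np.asarray(votes.argmax(axis=1)).ravel()


labels = np.array([[0, 2, 2], [1, 1, 3]])
print(csr_mode(labels))  # [2 1]
```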
No significant differences in memory profile either. Tested in a Jupyter notebook with:

```python
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

rng = np.random.RandomState(0)
X = rng.random(size=(10_000, 200))
y = rng.randint(1, 6, size=(10_000, 3))
neigh = KNeighborsClassifier(n_neighbors=10, weights="uniform")
neigh.fit(X, y)

%load_ext memory_profiler
%memit neigh.predict(X)
```

Both implementations were ~
Everybody's welcome on the …
Yes, sure. I do not see those two tasks as mutually exclusive.
Indeed, I had not realized that this was happening after the neighbors computation. This code can probably be further Cythonized to use OpenMP / prange parallelism, but in the meantime this looks like a quick way to improve the code. Let me do a proper review now.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
There is a significant memory overhead when `n_samples` gets bigger:

```python
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

rng = np.random.RandomState(0)
n_samples = 100_000
X = rng.random(size=(n_samples, 200))
y = rng.randint(1, 6, size=(n_samples, 3))
neigh = KNeighborsClassifier(n_neighbors=10, weights="uniform")
neigh.fit(X, y)
```
On `main`, I get:

```python
%memit neigh.predict(X)
# peak memory: 319.08 MiB, increment: 33.17 MiB
```

and with this PR:

```python
%memit neigh.predict(X)
# peak memory: 546.52 MiB, increment: 37.00 MiB
```
I think most of the memory overhead is from constructing the sparse matrix itself.
Yeah, that's actually very significant. In that case, it may be better just to go for the …
With the memory overhead, I am -1 overall on this PR with CSR matrices. We likely need another approach altogether to resolve #13783, either through Cython (as suggested in #23721 (comment)) or something more efficient in Python.
Disapproval due to performance regressions
Reference Issues/PRs
Fixes #13783
Resolves #14543 (stalled/draft)
What does this implement/fix? Explain your changes.
Leverages `csr_matrix` to compute the fast mode in `KNeighborsClassifier.predict` (replaces `scipy.stats.mode`) for uniform weights.

Any other comments?
Theoretically, this is a faster operation even in the weighted case; however, `csr_matrix.argmax` sums duplicates (which is what we aim to exploit), but this actually changes the underlying data array, which is very problematic since it leads to incorrect results in the multi-output loop. We could "fix" this by passing in copies of the weights to create the `csr_matrix`, but that defeats the whole point. A sketch of the mutation issue is shown after the To Do section below.

Hence, currently, this is only a meaningful speedup for `weights in {None, "uniform"}`, since we can easily compute an `ndarray` of ones on each loop iteration to feed to the `csr_matrix` without worrying about it being mutated.

To Do
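To make the mutation issue concrete, here is a small hedged illustration with made-up numbers. It builds the CSR matrix directly from `(data, indices, indptr)` so that the weights buffer is shared rather than copied; the exact in-place behavior may vary with the SciPy version:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Two query points, three neighbors each, five possible classes.
# Constructing from (data, indices, indptr) typically shares the arrays,
# so ``weights`` doubles as the matrix's data buffer.
weights = np.array([0.5, 0.5, 0.9, 0.2, 0.4, 0.4])  # caller-owned weights
labels = np.array([2, 2, 4, 1, 3, 3])               # duplicate votes within each row
indptr = np.array([0, 3, 6])
votes = csr_matrix((weights, labels, indptr), shape=(2, 5))

before = weights.copy()
votes.argmax(axis=1)  # as described above, duplicates are summed in place first
print(np.array_equal(weights, before))  # expected: False — the shared buffer changed
```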