WIP PERF Faster KNeighborsClassifier.predict #14543


Closed
wants to merge 4 commits

Conversation

rth
Member

@rth commented Aug 1, 2019

Closes #13783

This makes KNeighborsClassifier.predict faster by re-writing scipy.stats.mode as an argmax of a sparse array as discussed in the parent issue.

Using the example provided in #13783 (comment),

On master

%timeit knn.predict(X_grid)                                                 
2.47 s ± 37.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With this PR

%timeit knn.predict(X_grid)                                                 
793 ms ± 39.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

so in this particular case, KNeighborsClassifier.predict is 3.1x faster.

It works in a straightforward way on both weighted and unweighted data, making sklearn.utils.weighted_mode no longer necessary.
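The idea can be sketched roughly as follows (a hypothetical `sparse_mode` helper, not the PR's actual code): accumulate the (optionally weighted) class votes of each query point into a CSR matrix of shape `(n_samples, n_classes)`, then take the per-row argmax. Duplicate `(row, col)` entries are summed during CSR construction, which is what makes this work.

```python
import numpy as np
from scipy import sparse


def sparse_mode(y_neigh, weights=None, n_classes=None):
    """Row-wise (weighted) mode via a sparse vote matrix.

    y_neigh : (n_samples, n_neighbors) int array of encoded class labels.
    weights : optional (n_samples, n_neighbors) float array of vote weights.
    """
    n_samples, n_neighbors = y_neigh.shape
    if n_classes is None:
        n_classes = y_neigh.max() + 1
    if weights is None:
        weights = np.ones_like(y_neigh, dtype=np.float64)
    rows = np.repeat(np.arange(n_samples), n_neighbors)
    # Duplicate (row, class) pairs are summed when building the CSR matrix,
    # so entry (i, c) ends up holding the total vote weight of class c.
    votes = sparse.csr_matrix(
        (weights.ravel(), (rows, y_neigh.ravel())),
        shape=(n_samples, n_classes),
    )
    return np.asarray(votes.argmax(axis=1)).ravel()
```

With uniform weights this reduces to `scipy.stats.mode` along axis 1; passing distance-based weights gives the weighted mode in the same single vectorized pass.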

The downside is that it makes RadiusNeighborsClassifier.predict slower by about 30% on the following example,

In [1]: import numpy as np
   ...: from sklearn.datasets import make_blobs
   ...: from sklearn.neighbors import RadiusNeighborsClassifier
   ...:
   ...: X, y = make_blobs(centers=2, random_state=4, n_samples=30)
   ...: knn = RadiusNeighborsClassifier(algorithm='kd_tree', radius=2).fit(X, y)
   ...:
   ...: x_min, x_max = X[:, 0].min(), X[:, 0].max()
   ...: y_min, y_max = X[:, 1].min(), X[:, 1].max()
   ...:
   ...: xx = np.linspace(x_min, x_max, 100)
   ...: # change 100 to 1000 below and wait a long time
   ...: yy = np.linspace(y_min, y_max, 100)
   ...:
   ...: X1, X2 = np.meshgrid(xx, yy)
   ...: X_grid = np.c_[X1.ravel(), X2.ravel()]

In [2]: %timeit knn.predict(X_grid)                                                 
1.27 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The difference is mostly that in `RadiusNeighborsClassifier` this function is called repeatedly on 1D arrays, as opposed to once on a 2D array. In the worst case, I could revert to `scipy.stats.mode` and `sklearn.utils.weighted_mode` for `RadiusNeighborsClassifier`, but maintaining two different implementations is annoying.

TODO

  • fix the remaining 2 test failures on the comparison between the multi-output and single-output case.
  • investigate RadiusNeighborsClassifier.predict performance regression.

@jnothman
Member

The radius neighbours input is perfect for transforming into a 2d CSR matrix. Just set indices to the concatenated neighbour indices, and indptr to the cumulative sums of array([len(x) for x in neighbor_indices]) (with a leading zero).

I suspect you can efficiently use sparse matrices to compute the result for both.
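This suggestion can be sketched as follows (a hypothetical `neighbors_to_csr` helper, not code from the PR). Note that `indptr` must be the running cumulative sum of the per-query neighbor counts, prefixed with a zero, rather than the raw lengths:

```python
import numpy as np
from scipy import sparse


def neighbors_to_csr(neigh_ind, n_fit_samples):
    """Pack the ragged per-query neighbor index lists returned by a
    radius-neighbors query into a single CSR indicator matrix of shape
    (n_queries, n_fit_samples). Assumes at least one query point."""
    lengths = [len(ind) for ind in neigh_ind]
    # indptr[i]:indptr[i + 1] delimits row i, so it is the cumulative sum
    # of the row lengths with a leading zero.
    indptr = np.concatenate([[0], np.cumsum(lengths)])
    indices = np.concatenate(neigh_ind)
    data = np.ones(indptr[-1])
    return sparse.csr_matrix(
        (data, indices, indptr), shape=(len(neigh_ind), n_fit_samples)
    )
```

Once the ragged query results are in this 2D form, a single vectorized sparse operation can replace the per-query 1D calls that dominate `RadiusNeighborsClassifier.predict`.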

@rth rth marked this pull request as draft September 1, 2020 11:16
@rth
Member Author

rth commented Sep 1, 2020

Thanks for the pointer. I'll try to investigate, but first I need to fix the test failure which seems to point to a genuine bug.

Base automatically changed from master to main January 22, 2021 10:51
@rth
Member Author

rth commented Apr 22, 2021

@jjerphan In case you are interested in continuing this, it should also give an easy performance win for KNN classification :) As a first step, making this happen just for KNeighborsClassifier would already be a good start.
It would need re-running the benchmarks, merging with main, and fixing the two remaining test failures.

@jjerphan
Member

@rth: thanks for the comment!

I might have a look after #19950, which might help improve the performance of neighbors.KNeighborsClassifier.predict :)

@rth
Member Author

rth commented Apr 22, 2021

Thanks!

@jjerphan
Member

@rth: #24076 might be preferred over this PR. I'll let you close this one if you agree.


Successfully merging this pull request may close these issues.

knn predict unreasonably slow b/c of use of scipy.stats.mode