-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[WIP] Sparse output KNN #3350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Sparse output KNN #3350
Conversation
Today I wrote some more code to replace the existing predict function in KNeighborsClassifier. The code is in a very rough state but it is clear where the next improvements will come. 1) return the sparse format by convention (sparse target in means sparse target out). 2) Replace the dense |
How can I import my cython function into a module? I am writing a sparse row wise mode function |
Don't forget to rerun scikit-learn's 'python setup.py build_ext -i'. |
Thinking about it. There is an implementation of mode with sparse matrix in the imputation module. |
@arjoly the implementation for the mode looks like it loops over columns from # Most frequent
elif strategy == "most_frequent":
most_frequent = np.empty(len(columns))
for i, column in enumerate(columns):
most_frequent[i] = _most_frequent(column,
0,
n_zeros_axis[i]) |
Thank you Gael that did it |
The Cython CSR rowise mode looks to be working correctly, although I think there is room for improvement in efficiency I am going to to move on to a Cython implementation of weighted mode. |
for i in range(n_samples): | ||
nnz = indptr[i + 1] - indptr[i] | ||
if nnz > 0: | ||
nonz_mode, count = stats.mode(data[indptr[i]:indptr[i + 1]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I bench marked this Cython function vs python code of the same implementation and it preformed the same. The next step will be to do this calculation with Cython code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before working on optimized version, does it work with the simple python version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, a vectorized implementation of mode for a 1d array is:
def mode(a):
u, inv = np.unique(a, return_inverse=True)
c = np.bincount(inv, minlength=len(u))
m = np.argmax(c)
return u[m], c[m]
(and unlike scipy.stats.mode
, this doesn't change the dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note also that weighted bincount should be sufficient to support the weighted case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function as it is right now works as a mode function, it is slow however because of what you pointed out. I will focus on getting a functional PR first before coming back to optimize this using Joels comments.
|
Ping @jnothman, @arjoly, @vene Is there a way to get advanced indexing support for sparse matrices in the earlier versions of numpy? In this PR there is a need for indexing a matrix with a 2D array. My first thought was to backport the |
Does it work with numpy 1.8? |
The function in test neigbors has been renamed test_neighbors_classifier_multioutput_sparse
Reword comments concerning self.sparse_target_input to reflect that fit may not give sparse target data in all cases
y = y.tocsc() | ||
y.eliminate_zeros() | ||
nnz = np.diff(y.indptr) | ||
data = np.array([]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not using a list?
Since we go for the simple road, it would be nice to make it pass in similar way for predict_proba. |
y_pred_mo = knn_mo.predict(X_test) | ||
|
||
assert_equal(y_pred_mo.shape, y_test.shape) | ||
assert_true(sp.issparse(y_pred_mo)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those two assertion are already made in the next line.
I missed the relevant lines in the diff. |
Looks good, last comment is cosmetic. |
if not self.outputs_2d_: | ||
self.classes_ = self.classes_[0] | ||
self._y = self._y.ravel() | ||
if not sp.issparse(y): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic should probably be moved to LabelEncoding
at some point; it currently does not handle multioutput, nor sparse (but the latter had only been used for binary targets until now). The sparse implementation is not explicitly tested, and some of its conditions are only being tested because of the random number generation happening to produce entirely dense and non-dense columns.
…e up with enoding second case classes_ is not a 0 to n sequence
thanks |
Sparse Multioutput LabelEncoder for use in this PR is started here #3592 |
Marked WIP until some other PR important for this one are finalized |
happened upon this. Are you going to finish it, @hamsal? |
Superseded by #9059. |
Implement Cython CSR rowise mode function in sparsefuncs_fastImplement Cython CSR rowise weighted_mode function in sparsefuncs_fastBackport scipy sparse advanced indexingThis PR originally started out with Cython based improvements, but a middle ground has been chosen for simplicity of implementation.
This change is