[WIP] Sparse output KNN #3350


Closed
wants to merge 35 commits into from

Conversation

@hamsal (Contributor) commented Jul 8, 2014

  • Implement a Cython CSR row-wise mode function in sparsefuncs_fast
  • Implement a Cython CSR row-wise weighted_mode function in sparsefuncs_fast
  • Backport scipy sparse advanced indexing

This PR originally started out with Cython-based improvements, but a middle ground has been chosen for simplicity of implementation.



@hamsal hamsal changed the title [WIP] Sparse output support for KNN Classifiers [WIP] Sparse output KNN Jul 8, 2014
@hamsal (Contributor, Author) commented Jul 9, 2014

Today I wrote some more code to replace the existing predict function in KNeighborsClassifier. The code is in a very rough state, but it is clear where the next improvements will come: (1) return the sparse format by convention (sparse target in means sparse target out); (2) replace the dense mode code with a mode function written for a sparse matrix, and do the same for weighted_mode.

@hamsal (Contributor, Author) commented Jul 10, 2014

How can I import my Cython function into a module? I am writing a sparse row-wise mode function csr_row_mode for CSC/CSR matrices in the Cython file utils/sparsefuncs_fast.pyx, and I compile the Cython file to get a new sparsefuncs_fast.c. However, when I attempt to import the function in classification.py using from ..utils.sparsefuncs_fast import csr_row_mode it fails to find it. I suspect there is an additional step to get it ready for import.

@GaelVaroquaux (Member):

Don't forget to rerun scikit-learn's 'python setup.py build_ext -i'.

@arjoly (Member) commented Jul 10, 2014

Thinking about it: there is an implementation of mode for sparse matrices in the imputation module (pure NumPy & SciPy).

@hamsal (Contributor, Author) commented Jul 10, 2014

@arjoly The implementation of the mode there looks like it loops over columns:

from Imputer._sparse_fit

            # Most frequent
            elif strategy == "most_frequent":
                most_frequent = np.empty(len(columns))

                for i, column in enumerate(columns):
                    most_frequent[i] = _most_frequent(column,
                                                      0,
                                                      n_zeros_axis[i])

@hamsal (Contributor, Author) commented Jul 10, 2014

Thank you Gael, that did it.

@hamsal (Contributor, Author) commented Jul 11, 2014

The Cython CSR row-wise mode looks to be working correctly. Although I think there is room for improvement in efficiency, I am going to move on to a Cython implementation of weighted mode.

for i in range(n_samples):
    nnz = indptr[i + 1] - indptr[i]
    if nnz > 0:
        nonz_mode, count = stats.mode(data[indptr[i]:indptr[i + 1]])
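
The snippet above takes stats.mode over a row's stored entries only, so the implicit zeros are not counted. A minimal pure-NumPy sketch of a complete row-wise mode (using a hypothetical csr_row_mode helper, not the PR's Cython code) might look like:

```python
import numpy as np
import scipy.sparse as sp

def csr_row_mode(X):
    # Row-wise mode of a CSR matrix, counting the implicit zeros.
    # Hypothetical helper for illustration only.
    X = sp.csr_matrix(X)
    n_samples, n_features = X.shape
    modes = np.zeros(n_samples, dtype=X.dtype)
    for i in range(n_samples):
        row = X.data[X.indptr[i]:X.indptr[i + 1]]
        if len(row) == 0:
            continue  # empty row: the mode is 0
        n_zeros = n_features - len(row)
        vals, counts = np.unique(row, return_counts=True)
        best = np.argmax(counts)
        # a stored value only wins if it occurs more often than zero does
        if counts[best] > n_zeros:
            modes[i] = vals[best]
    return modes
```

Ties between a stored value and zero resolve to zero here; a real implementation would need to pick a deterministic tie-breaking rule.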
@hamsal (Contributor, Author) replied:

I see. I benchmarked this Cython function against Python code of the same implementation and it performed the same. The next step will be to do this calculation with Cython code.

Review comment from a Member:

Before working on an optimized version, does it work with the simple Python version?

Review comment from a Member:

FWIW, a vectorized implementation of mode for a 1d array is:

def mode(a):
    u, inv = np.unique(a, return_inverse=True)
    c = np.bincount(inv, minlength=len(u))
    m = np.argmax(c)
    return u[m], c[m]

(and unlike scipy.stats.mode, this doesn't change the dtype)
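
As a quick check of the dtype claim, here is the same function exercised on a small int8 array (illustrative usage only):

```python
import numpy as np

def mode(a):
    # reviewer's vectorized 1-d mode: unique values plus a bincount of codes
    u, inv = np.unique(a, return_inverse=True)
    c = np.bincount(inv, minlength=len(u))
    m = np.argmax(c)
    return u[m], c[m]

value, count = mode(np.array([3, 1, 3, 2, 3], dtype=np.int8))
# value == 3, count == 3; the int8 dtype of the input is preserved
```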

Review comment from a Member:

Note also that weighted bincount should be sufficient to support the weighted case.
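
Since np.bincount accepts a weights argument, the weighted case is a one-line change to the vectorized mode above. A sketch, with a hypothetical weighted_mode_1d name:

```python
import numpy as np

def weighted_mode_1d(a, w):
    # weighted mode of a 1-d array: sum the weights of each unique value
    # and pick the value with the largest total weight
    u, inv = np.unique(a, return_inverse=True)
    c = np.bincount(inv, weights=w, minlength=len(u))
    m = np.argmax(c)
    return u[m], c[m]
```

With unit weights this reduces to the unweighted mode; scikit-learn also ships a dense sklearn.utils.extmath.weighted_mode with similar semantics.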

@hamsal (Contributor, Author) replied:

This function as it is right now works as a mode function; it is slow, however, because of what you pointed out. I will focus on getting a functional PR first before coming back to optimize this using Joel's comments.

@hamsal (Contributor, Author) commented Jul 17, 2014

For some reason I have been unable to recreate any of the Travis errors. I have replicated the environments using the correct versions of NumPy and SciPy, and my local copy of this branch builds fine. The Python 3 Travis build makes it look like there is something wrong with building the sparsefuncs_fast.c file.

@hamsal (Contributor, Author) commented Jul 22, 2014

Ping @jnothman, @arjoly, @vene. Is there a way to get advanced indexing support for sparse matrices with the earlier versions of numpy? In this PR there is a need to index a matrix with a 2D array. My first thought was to backport the getitem code from scipy's sparse matrices. Unfortunately, that code is pretty intertwined with a number of other functions and ultimately goes down to some C-level code, so it would be a very big port from scipy. The remaining solution I have come up with is to write a Cython function to do this for me, but first I would like some input.
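
One lightweight alternative to a full backport, sketched here under the assumption that 1-D fancy row indexing is available in the older scipy releases: flatten the 2-D index, gather rows, and interpret the result in blocks of ind.shape[1] rows per sample. The take_rows_2d name is hypothetical:

```python
import numpy as np
import scipy.sparse as sp

def take_rows_2d(y, ind):
    # emulate y[ind] for a 2-D integer index array `ind`:
    # gather the indexed rows in flattened (row-major) order
    ind = np.asarray(ind)
    return y.tocsr()[ind.ravel()]

y = sp.csr_matrix(np.arange(6).reshape(3, 2))
rows = take_rows_2d(y, [[0, 2], [1, 1]])
# rows has shape (4, 2): rows 0, 2, 1, 1 of y
```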

@arjoly (Member) commented Jul 23, 2014

Does it work with numpy 1.8?

y = y.tocsc()
y.eliminate_zeros()
nnz = np.diff(y.indptr)
data = np.array([])
Review comment from a Member:

Why not use a list?

@arjoly (Member) commented Aug 18, 2014

Since we are going for the simple road, it would be nice to make it pass in a similar way for predict_proba.

y_pred_mo = knn_mo.predict(X_test)

assert_equal(y_pred_mo.shape, y_test.shape)
assert_true(sp.issparse(y_pred_mo))
Review comment from a Member:

Those two assertions are already made on the next line.

@arjoly (Member) commented Aug 18, 2014

Since we are going for the simple road, it would be nice to make it pass in a similar way for predict_proba.

I missed the relevant lines in the diff.

@arjoly (Member) commented Aug 19, 2014

Looks good; the last comment is cosmetic.

@arjoly arjoly changed the title [MRG] Sparse output KNN [MRG+1] Sparse output KNN Aug 19, 2014
@coveralls (Coverage Status)

Coverage increased (+0.0%) when pulling ccae2ae on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

if not self.outputs_2d_:
    self.classes_ = self.classes_[0]
    self._y = self._y.ravel()
if not sp.issparse(y):
Review comment from a Member:

This logic should probably be moved to LabelEncoder at some point; it currently handles neither multioutput nor sparse targets (but the latter had only been used for binary targets until now). The sparse implementation is not explicitly tested, and some of its conditions are only exercised because the random number generation happens to produce entirely dense and non-dense columns.

@coveralls (Coverage Status)

Coverage decreased (-0.0%) when pulling 2ba9f3f on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

@coveralls (Coverage Status)

Coverage increased (+0.0%) when pulling 2ba9f3f on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

@coveralls (Coverage Status)

Coverage increased (+0.02%) when pulling e640807 on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

@jnothman (Member):

thanks

@hamsal (Contributor, Author) commented Aug 26, 2014

The sparse multioutput LabelEncoder for use in this PR has been started in #3592.

@hamsal hamsal changed the title [MRG+1] Sparse output KNN [WIP+1] Sparse output KNN Aug 28, 2014
@hamsal hamsal changed the title [WIP+1] Sparse output KNN [WIP] Sparse output KNN Aug 28, 2014
@hamsal (Contributor, Author) commented Aug 28, 2014

Marked WIP until some other PRs important for this one are finalized.

@jnothman (Member):

I happened upon this. Are you going to finish it, @hamsal?

@jnothman (Member):

Superseded by #9059.

@jnothman jnothman closed this Feb 14, 2018
7 participants