[WIP] Sparse output KNN #3350


Closed
wants to merge 35 commits into from

Conversation

@hamsal (Contributor) commented Jul 8, 2014

  • Implement a Cython CSR row-wise mode function in sparsefuncs_fast
  • Implement a Cython CSR row-wise weighted_mode function in sparsefuncs_fast
  • Backport scipy sparse advanced indexing

This PR originally started out with Cython-based improvements, but a middle ground has been chosen for simplicity of implementation.



@hamsal hamsal changed the title [WIP] Sparse output support for KNN Classifiers [WIP] Sparse output KNN Jul 8, 2014
@hamsal (Contributor, Author) commented Jul 9, 2014

Today I wrote some more code to replace the existing predict function in KNeighborsClassifier. The code is in a very rough state, but it is clear where the next improvements will come: (1) return the sparse format by convention (sparse target in means sparse target out); (2) replace the dense mode code with a mode function written for a sparse matrix, and do the same for weighted_mode.

@hamsal (Contributor, Author) commented Jul 10, 2014

How can I import my Cython function into a module? I am writing a sparse row-wise mode function csr_row_mode for CSC/CSR matrices in the Cython file utils/sparsefuncs_fast.pyx, and I compile the Cython file to get a new sparsefuncs_fast.c. However, when I attempt to import the function in classification.py using from ..utils.sparsefuncs_fast import csr_row_mode it fails to find it. I suspect there is an additional step to get it ready for import.

@GaelVaroquaux (Member):

Don't forget to rerun scikit-learn's 'python setup.py build_ext -i'.

@arjoly (Member) commented Jul 10, 2014

Thinking about it: there is an implementation of mode for sparse matrices in the imputation module (pure NumPy & SciPy).

@hamsal (Contributor, Author) commented Jul 10, 2014

@arjoly The implementation of the mode there looks like it loops over columns:

from Imputer._sparse_fit

            # Most frequent
            elif strategy == "most_frequent":
                most_frequent = np.empty(len(columns))

                for i, column in enumerate(columns):
                    most_frequent[i] = _most_frequent(column,
                                                      0,
                                                      n_zeros_axis[i])

@hamsal (Contributor, Author) commented Jul 10, 2014

Thank you Gael, that did it.

@hamsal (Contributor, Author) commented Jul 11, 2014

The Cython CSR row-wise mode looks to be working correctly. Although I think there is room for improvement in efficiency, I am going to move on to a Cython implementation of weighted mode.

for i in range(n_samples):
    nnz = indptr[i + 1] - indptr[i]
    if nnz > 0:
        nonz_mode, count = stats.mode(data[indptr[i]:indptr[i + 1]])
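
The snippet above takes stats.mode over a row's stored entries only, so the implicit zeros are not counted. A minimal pure-NumPy sketch of a complete row-wise mode (using a hypothetical csr_row_mode helper, not the PR's Cython code) might look like:

```python
import numpy as np
import scipy.sparse as sp

def csr_row_mode(X):
    # Row-wise mode of a CSR matrix, counting the implicit zeros.
    # Hypothetical helper for illustration only.
    X = sp.csr_matrix(X)
    n_samples, n_features = X.shape
    modes = np.zeros(n_samples, dtype=X.dtype)
    for i in range(n_samples):
        row = X.data[X.indptr[i]:X.indptr[i + 1]]
        if len(row) == 0:
            continue  # empty row: the mode is 0
        n_zeros = n_features - len(row)
        vals, counts = np.unique(row, return_counts=True)
        best = np.argmax(counts)
        # a stored value only wins if it occurs more often than zero does
        if counts[best] > n_zeros:
            modes[i] = vals[best]
    return modes
```

Ties between a stored value and zero resolve to zero here; a real implementation would need to pick a deterministic tie-breaking rule.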
@hamsal (Contributor, Author) replied:

I see. I benchmarked this Cython function against Python code of the same implementation and it performed the same. The next step will be to do this calculation with Cython code.

Review comment from a Member:

Before working on an optimized version, does it work with the simple Python version?

Review comment from a Member:

FWIW, a vectorized implementation of mode for a 1d array is:

def mode(a):
    u, inv = np.unique(a, return_inverse=True)
    c = np.bincount(inv, minlength=len(u))
    m = np.argmax(c)
    return u[m], c[m]

(and unlike scipy.stats.mode, this doesn't change the dtype)
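
As a quick check of the dtype claim, here is the same function exercised on a small int8 array (illustrative usage only):

```python
import numpy as np

def mode(a):
    # reviewer's vectorized 1-d mode: unique values plus a bincount of codes
    u, inv = np.unique(a, return_inverse=True)
    c = np.bincount(inv, minlength=len(u))
    m = np.argmax(c)
    return u[m], c[m]

value, count = mode(np.array([3, 1, 3, 2, 3], dtype=np.int8))
# value == 3, count == 3; the int8 dtype of the input is preserved
```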

Review comment from a Member:

Note also that weighted bincount should be sufficient to support the weighted case.
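
Since np.bincount accepts a weights argument, the weighted case is a one-line change to the vectorized mode above. A sketch, with a hypothetical weighted_mode_1d name:

```python
import numpy as np

def weighted_mode_1d(a, w):
    # weighted mode of a 1-d array: sum the weights of each unique value
    # and pick the value with the largest total weight
    u, inv = np.unique(a, return_inverse=True)
    c = np.bincount(inv, weights=w, minlength=len(u))
    m = np.argmax(c)
    return u[m], c[m]
```

With unit weights this reduces to the unweighted mode; scikit-learn also ships a dense sklearn.utils.extmath.weighted_mode with similar semantics.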

@hamsal (Contributor, Author) replied:

This function as it is right now works as a mode function; it is slow, however, because of what you pointed out. I will focus on getting a functional PR first before coming back to optimize this using Joel's comments.

@hamsal (Contributor, Author) commented Jul 17, 2014

For some reason I have been unable to recreate any of the Travis errors. I have replicated the environments using the correct versions of NumPy and SciPy, and my local copy of this branch builds fine. The Python 3 Travis build makes it look like there is something wrong with building the sparsefuncs_fast.c file.

@hamsal (Contributor, Author) commented Jul 22, 2014

Ping @jnothman, @arjoly, @vene. Is there a way to get advanced indexing support for sparse matrices with the earlier versions of numpy? In this PR there is a need to index a matrix with a 2D array. My first thought was to backport the getitem code from scipy's sparse matrices. Unfortunately, that code is pretty intertwined with a number of other functions and ultimately goes down to some C-level code, so it would be a very big port from scipy. The remaining solution I have come up with is to write a Cython function to do this for me, but first I would like some input.
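
One lightweight alternative to a full backport, sketched here under the assumption that 1-D fancy row indexing is available in the older scipy releases: flatten the 2-D index, gather rows, and interpret the result in blocks of ind.shape[1] rows per sample. The take_rows_2d name is hypothetical:

```python
import numpy as np
import scipy.sparse as sp

def take_rows_2d(y, ind):
    # emulate y[ind] for a 2-D integer index array `ind`:
    # gather the indexed rows in flattened (row-major) order
    ind = np.asarray(ind)
    return y.tocsr()[ind.ravel()]

y = sp.csr_matrix(np.arange(6).reshape(3, 2))
rows = take_rows_2d(y, [[0, 2], [1, 1]])
# rows has shape (4, 2): rows 0, 2, 1, 1 of y
```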

@arjoly (Member) commented Jul 23, 2014

Does it work with numpy 1.8?

y = y.tocsc()
y.eliminate_zeros()
nnz = np.diff(y.indptr)
data = np.array([])
Review comment from a Member:

Why not use a list?

@arjoly (Member) commented Aug 18, 2014

Since we are going for the simple road, it would be nice to make it pass in a similar way for predict_proba.

y_pred_mo = knn_mo.predict(X_test)

assert_equal(y_pred_mo.shape, y_test.shape)
assert_true(sp.issparse(y_pred_mo))
Review comment from a Member:

Those two assertions are already made on the next line.

@arjoly (Member) commented Aug 18, 2014

Since we are going for the simple road, it would be nice to make it pass in a similar way for predict_proba.

I missed the relevant lines in the diff.

@arjoly (Member) commented Aug 19, 2014

Looks good; the last comment is cosmetic.

@arjoly arjoly changed the title [MRG] Sparse output KNN [MRG+1] Sparse output KNN Aug 19, 2014
@coveralls (Coverage Status)

Coverage increased (+0.0%) when pulling ccae2ae on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

if not self.outputs_2d_:
    self.classes_ = self.classes_[0]
    self._y = self._y.ravel()
if not sp.issparse(y):
Review comment from a Member:

This logic should probably be moved to LabelEncoder at some point; it currently handles neither multioutput nor sparse targets (but the latter had only been used for binary targets until now). The sparse implementation is not explicitly tested, and some of its conditions are only exercised because the random number generation happens to produce entirely dense and non-dense columns.

@coveralls (Coverage Status)

Coverage decreased (-0.0%) when pulling 2ba9f3f on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

@coveralls (Coverage Status)

Coverage increased (+0.0%) when pulling 2ba9f3f on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

@coveralls (Coverage Status)

Coverage increased (+0.02%) when pulling e640807 on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

@jnothman (Member):

thanks

@hamsal (Contributor, Author) commented Aug 26, 2014

The sparse multioutput LabelEncoder for use in this PR has been started in #3592.

@hamsal hamsal changed the title [MRG+1] Sparse output KNN [WIP+1] Sparse output KNN Aug 28, 2014
@hamsal hamsal changed the title [WIP+1] Sparse output KNN [WIP] Sparse output KNN Aug 28, 2014
@hamsal (Contributor, Author) commented Aug 28, 2014

Marked WIP until some other PRs important for this one are finalized.

@jnothman (Member):

I happened upon this. Are you going to finish it, @hamsal?

@jnothman (Member):

Superseded by #9059.

@jnothman jnothman closed this Feb 14, 2018
7 participants