[MRG] Fast PolynomialFeatures on CSR matrices #12197
Conversation
Thanks for your contribution, here is a first batch of comments.
Also, the build is broken under both Linux and Windows, but I have not investigated: the Cython module cannot be imported.
Could you also please include the results of the benchmarks in a comment, with your analysis, to help other reviewers without requiring them to locally build and run the benchmark of this PR?
```python
expanded_column = (3*D**2*i - 3*D*i**2 + i**3 + 11*i \
                   - 3*j**2 - 9*j) / 6 + i**2 \
                   - 2*D*i + D*j - D + k \
                   if interaction_only else \
```
I would find it more readable to use if / else blocks instead of a ternary expression here. Ternary expressions are great for one-line assignments, but as soon as the r.h.s. expression is too large to fit on a single line, it's more readable to use traditional if / else blocks.
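A sketch of that refactoring, in plain Python for illustration; the polynomial (non-interaction) branch is elided in the quoted diff, so it is only a placeholder here:

```python
# Sample values standing in for the surrounding Cython locals.
D, i, j, k, interaction_only = 10, 1, 2, 3, True

if interaction_only:
    expanded_column = ((3*D**2*i - 3*D*i**2 + i**3 + 11*i
                        - 3*j**2 - 9*j) / 6 + i**2
                       - 2*D*i + D*j - D + k)
else:
    # The non-interaction formula is not shown in the quoted diff.
    expanded_column = ...
```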
```python
k = indices[col_ptr_3]
expanded_column = (3*D**2*i - 3*D*i**2 + i**3 + 11*i \
                   - 3*j**2 - 9*j) / 6 + i**2 \
                   - 2*D*i + D*j - D + k \
```
style nitpick: do not add `\` at the end of lines. They are useless when there are parentheses around a multi-line expression.
```python
# A single dense row
# An empty matrix
# Multiple rows, but one is all zeros.
# The first row is all zeros. Tests that we're building indptr correctly.
```
Please remove those comments and write the unit tests instead, as they are missing.
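A minimal sketch of what one of those tests might look like, assuming the expansion is exercised through PolynomialFeatures; the test name and exact assertions are hypothetical:

```python
import numpy as np
import scipy.sparse as sparse
from numpy.testing import assert_array_almost_equal
from sklearn.preprocessing import PolynomialFeatures


def test_csr_expansion_leading_zero_row():
    # Hypothetical test: the first row is all zeros, which exercises
    # the indptr bookkeeping mentioned in the comment above.
    X = sparse.csr_matrix(np.array([[0., 0.], [1., 2.]]))
    est = PolynomialFeatures(degree=2, include_bias=False)
    result = est.fit_transform(X)
    expected = est.fit_transform(X.toarray())
    assert sparse.isspmatrix_csr(result)
    assert_array_almost_equal(result.toarray(), expected)
```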
```
np.int64_t


def csr_expansion(ndarray[DATA_T, ndim=1] data,
```
Please make it a private function with a leading `_`.
```
D : The dimensionality of the input CSR matrix.
interaction_only : 0 for a polynomial expansion, 1 for an interaction
    expansion.
degree : The degree of the expansion. This must be either 2 or 3.
```
Please follow the numpydoc conventions to document the parameters. See other docstrings in the project for more details.
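For illustration, a numpydoc-style version of those parameters might read as follows (the `int` types are assumptions based on the surrounding diff):

```
Parameters
----------
D : int
    The dimensionality of the input CSR matrix.
interaction_only : int
    0 for a polynomial expansion, 1 for an interaction expansion.
degree : int
    The degree of the expansion. This must be either 2 or 3.
```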
```
on the order of d^k where k is the degree of the expansion, assuming all
rows are of similar density. This method was laid out in the work
"Leveraging Sparsity to Speed Up Polynomial Feature Expansions of CSR
Matrices Using K-Simplex Numbers" by Andrew Nystrom and John Hughes.
```
Please put this reference in a dedicated section of the docstring:

```
References
----------
```

See the docstrings in sklearn/utils/extmath.py for an example.
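For this docstring, that section would look something like:

```
References
----------
"Leveraging Sparsity to Speed Up Polynomial Feature Expansions of CSR
Matrices Using K-Simplex Numbers" by Andrew Nystrom and John Hughes:
https://arxiv.org/abs/1803.06418
```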
```python
trials = 5
num_rows = 100
dimensionalities = [50, 100, 200, 400, 800]
densities = [0.1, 0.4, 0.7, 1.0]
```
I think it would be more interesting to explore densities such as `[0.01, 0.1, 1]`. I think sparse representations only make sense below 0.1 density.
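For what it's worth, a minimal sketch of how the benchmark could draw inputs at those densities (variable values are illustrative; `scipy.sparse.random` requires scipy >= 0.17):

```python
import time

import scipy.sparse as sparse
from sklearn.preprocessing import PolynomialFeatures

for density in [0.01, 0.1, 1.0]:
    # Random CSR input at the target density.
    X = sparse.random(100, 200, density=density, format='csr')
    t0 = time.time()
    XP = PolynomialFeatures(degree=2).fit_transform(X)
    print(density, time.time() - t0)
```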
sklearn/preprocessing/data.py (outdated)

```python
XP = sparse.hstack(columns, dtype=X.dtype).tocsc()
if sparse.isspmatrix_csr(X):
    if self.degree > 3:
        return (self.transform(X.tocsc())).tocsr()
```
I would not call `.tocsr()` back at the end in this case. Let the next step in the estimator pipeline do another conversion itself, only if necessary.
It would seem odd to me to get a csr matrix returned for some degrees, but a csc matrix returned for others. It seems this could mess up some people's pipelines. If you feel I shouldn't convert to the same format as the user's input, I'm happy to change this, but it seems a bit odd to me.
scikit-learn estimators automatically convert the sparse matrix format when they need to, so I think it's fine not to convert a priori and spare a potentially useless memory copy, although I agree this is not a big deal.
The csc PolynomialFeatures test checked that the output was also a csc matrix, so I figured I'd follow the same pattern. Removing this conversion would make that part of the test fail. If it's really OK for the output to not be in the same format as the input, I'll go ahead and remove it. I thought that some algorithms ran much slower on certain sparse formats, so it seems users might expect this consistency though. Let me know.
sklearn/preprocessing/data.py (outdated)

```diff
@@ -34,8 +34,9 @@
 from ..utils.validation import (check_is_fitted, check_random_state,
                                 FLOAT_DTYPES)

 from ._encoders import OneHotEncoder
+from ._csr_expansion import _csr_expansion
```
Please rename both the module and the function to `_csr_polynomial_expansion` to make it more explicit.
…, but enough for a lower degree.
I seem to have made tests fail with the last push. Sorry about that. I'll fix it when I get a chance.
This pull request introduces 1 alert when merging 8e450a5 into acf3bab - view on LGTM.com.

Comment posted by LGTM.com
Please delete the
There is a travis job configuration that fails with:
However, for scikit-learn 0.21 we will bump the scipy dependency up to 0.17 (#12184). We still need this old scipy version running as long as we have not released the first 0.20.1 bugfix release though (to simplify backporting efforts for the first batch of bugfixes in master for the month to come). So in the meantime, you can do:

```python
try:
    from scipy.sparse import random as sparse_random
except ImportError:
    # Avoid an import error when running the tests with scipy < 0.17
    sparse_random = None

...

# TODO: remove skipif once scipy < 0.17 is no longer supported
@pytest.mark.skipif(sparse_random is None,
                    reason="scipy.sparse.random requires scipy >= 0.17")
def test_that_needs_sparse_random():
    ...
```
The fact that the CSR expansion is still significantly faster than the dense version, even when density == 1.0 in the benchmark, makes me think that the dense case could also receive some love, but this should probably be done in a separate PR once this one is merged.
```python
degree = 2
trials = 5
num_rows = 100
dimensionalities = np.array([50, 100, 150, 200])
```
Maybe we could instead explore `dimensionalities = np.array([25, 50, 100, 200])` and then use a loglog plot, so that straight lines would reflect linear time complexities and make it easier to interpret the plots.
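Something along these lines; the timing arrays here are hypothetical placeholders for the benchmark's measurements:

```python
import numpy as np
import matplotlib.pyplot as plt

dimensionalities = np.array([25, 50, 100, 200])
# Placeholder runtimes per dimensionality; the real values would come
# from timing the benchmark.
csr_times = np.array([0.01, 0.03, 0.1, 0.4])
dense_times = np.array([0.05, 0.2, 0.9, 4.0])

# On a log-log plot, a straight line of slope s indicates O(d^s) scaling.
plt.loglog(dimensionalities, csr_times, label='csr')
plt.loglog(dimensionalities, dense_times, label='dense')
plt.xlabel('dimensionality')
plt.ylabel('time (s)')
plt.legend()
plt.show()
```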
No worries. I was just afraid I'd missed an important connection.

(And I don't know how Ben fits in the picture...!)

(I told him I'd try to weave that in, and that was my attempt.)
Thoughts on logging a warning that recommends using CSR when PolynomialFeatures is called on a CSC matrix with degree 2 or 3?
Ambivalent? We don't currently warn on CSR-CSC conversion, e.g. in random forests, but perhaps this is different. Scipy has SparseEfficiencyWarnings, but they usually indicate something different from this.
Whatever you think is best. Alternatively, we could do a `return self.transform(X.tocsr()).tocsc()` in this case. The CSC path is just so slow that the user would benefit from a warning or automatic conversion to CSR. The latter would double the memory used for holding the data, so that's an argument against it.
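A hypothetical sketch of that alternative; the helper name and exact degree check are illustrative, not the PR's actual code:

```python
import scipy.sparse as sparse

def _transform_sparse(self, X):
    # Route CSC input through the fast CSR path, then convert back;
    # `self.degree < 4` mirrors the 2-or-3 degree limit of the CSR method.
    if sparse.isspmatrix_csc(X) and self.degree < 4:
        return self.transform(X.tocsr()).tocsc()
    ...
```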
A lot of scipy.sparse methods do that kind of conversion internally, so maybe it's not so bad to similarly duplicate memory. In any case, I think it is something we can comfortably change if users complain.
This pull request introduces 1 alert and fixes 1 when merging 876ac2f into bea1eb5 - view on LGTM.com.

Comment posted by LGTM.com
Since we call the CSR method when able given a CSC input, adding cases where the CSR method has to be used because the degree's too high.
OK, I can't think of anything else that should be done. Let me know if there's anything else you'd like changed.
Actually, doing this conversion to CSR on CSC input invalidates the benchmark. We could add a parameter to the PolynomialFeatures constructor that prevents the conversion.
Does "it invalidates the benchmark" mean that it invalidates the benchmark script, or that this is no longer the most efficient?
The method in the benchmark labeled "csc" is actually the CSR method, because CSCs get converted to CSRs and use the CSR method when the degree is 2 or 3. This makes the CSC method look fast in the benchmark because it's not actually the CSC method. How about we just remove the benchmark?
Yes, remove the CSC benchmark.
And because the new benchmark will only run without CSC, we can change it to run for a larger number of features.

Or alternatively without the log scale, to make the speed-up effect look more dramatic ;)
It's been done. Plot updated at the top of this page.

Merged! Thank you very much @AWNystrom!

Thank you! It's been a pleasure contributing. Thanks for the code reviews, everyone!
This PR allows PolynomialFeatures to operate on compressed sparse row (CSR) matrices. It does so in a way that only takes products of nonzero features, so it scales with the density of the matrix to the power of the degree of the expansion.
Here's a plot that compares the runtime of this method to the dense and csc methods:

See https://arxiv.org/pdf/1803.06418.pdf for details of the algorithm.
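For readers landing here, a minimal usage sketch of what this PR enables; the shapes and density are illustrative, and `scipy.sparse.random` requires scipy >= 0.17:

```python
import scipy.sparse as sparse
from sklearn.preprocessing import PolynomialFeatures

# A sparse input: 1000 samples, 100 features, 1% density.
X = sparse.random(1000, 100, density=0.01, format='csr')

# With this PR, the expansion stays sparse and only multiplies stored
# nonzeros, so the cost scales roughly with density**degree.
poly = PolynomialFeatures(degree=2, interaction_only=False,
                          include_bias=False)
XP = poly.fit_transform(X)
print(XP.shape, XP.format)  # expanded feature count, 'csr'
```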