[MRG] Fast PolynomialFeatures on CSR matrices #12197

Merged
merged 44 commits into scikit-learn:master on Oct 19, 2018

Conversation

AWNystrom
Contributor

@AWNystrom AWNystrom commented Sep 29, 2018

This PR allows PolynomialFeatures to operate on compressed sparse row (CSR) matrices. It only takes products of nonzero features, so its runtime scales with the density of the matrix raised to the power of the degree of the expansion.
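
As a minimal usage sketch (hypothetical shape and density values, for illustration only; assumes this branch is installed):

from scipy import sparse
from sklearn.preprocessing import PolynomialFeatures

# Random CSR input with 1% density: only products of nonzero features
# are formed, so the expansion cost scales with density**degree.
X = sparse.random(100, 500, density=0.01, format='csr', random_state=0)
XP = PolynomialFeatures(degree=2).fit_transform(X)
print(type(XP), XP.shape)  # a sparse matrix holding the quadratic expansion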

Here's a plot that compares the runtime of this method to the dense and CSC methods:
[benchmark plot: runtime of the dense, CSC, and CSR methods]

See https://arxiv.org/pdf/1803.06418.pdf for details of the algorithm.

Member

@ogrisel ogrisel left a comment

Thanks for your contribution, here is a first batch of comments.

Also, the build is broken under both Linux and Windows, but I have not investigated why: the Cython module cannot be imported.

Could you also please include the benchmark results in a comment, along with your analysis, to help other reviewers without requiring them to locally build and run the benchmark for this PR?

expanded_column = (3*D**2*i - 3*D*i**2 + i**3 + 11*i \
- 3*j**2 - 9*j) / 6 + i**2 \
- 2*D*i + D*j - D + k \
if interaction_only else \
Member

I would find it more readable to use if / else blocks instead of a ternary expression here. Ternary expressions are great for one-liner assignments, but as soon as the r.h.s. expression is too large to fit on a single line, it's more readable to use traditional if / else blocks.
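
A sketch of the suggested restructuring (the non-interaction branch is elided in the quoted hunk):

if interaction_only:
    expanded_column = ((3*D**2*i - 3*D*i**2 + i**3 + 11*i
                        - 3*j**2 - 9*j) / 6 + i**2
                       - 2*D*i + D*j - D + k)
else:
    ...  # the full polynomial formula, not shown in this hunk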

k = indices[col_ptr_3]
expanded_column = (3*D**2*i - 3*D*i**2 + i**3 + 11*i \
- 3*j**2 - 9*j) / 6 + i**2 \
- 2*D*i + D*j - D + k \
Member

Style nitpick: do not add \ at the end of lines; they are unnecessary when there are parentheses around a multi-line expression.
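
With parentheses around the expression, the continuation characters can simply be dropped:

expanded_column = ((3*D**2*i - 3*D*i**2 + i**3 + 11*i
                    - 3*j**2 - 9*j) / 6 + i**2
                   - 2*D*i + D*j - D + k)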

# A single dense row
# An empty matrix
# Multiple rows, but one is all zeros.
# The first row is all zeros. Tests we're making indptr correctly.
Member

Please remove those comments and write the missing unit tests instead.

np.int64_t


def csr_expansion(ndarray[DATA_T, ndim=1] data,
Member

Please make it a private function with a leading _.

D : The dimensionality of the input CSR matrix.
interaction_only : 0 for a polynomial expansion, 1 for an interaction
expansion.
degree : The degree of the expansion. This must be either 2 or 3.
Member

Please follow the numpydoc conventions to document the parameters. See other docstrings in the project for more details.
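
For instance, a sketch of these parameters in numpydoc style (descriptions copied from the quoted docstring):

Parameters
----------
D : int
    The dimensionality of the input CSR matrix.
interaction_only : int
    0 for a polynomial expansion, 1 for an interaction expansion.
degree : int
    The degree of the expansion. This must be either 2 or 3.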

on the order of d^k where k is the degree of the expansion, assuming all
rows are of similar density. This method was laid out in the work
"Leveraging Sparsity to Speed Up Polynomial Feature Expansions of CSR
Matrices Using K-Simplex Numbers" by Andrew Nystrom and John Hughes.
Member

Please put this reference in a dedicated section of the docstring:

References
----------

See the docstrings in sklearn/utils/extmath.py for an example.
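
Using the paper cited in this PR, that section could look like:

References
----------
.. [1] A. Nystrom and J. Hughes, "Leveraging Sparsity to Speed Up
       Polynomial Feature Expansions of CSR Matrices Using K-Simplex
       Numbers", https://arxiv.org/abs/1803.06418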

trials = 5
num_rows = 100
dimensionalities = [50, 100, 200, 400, 800]
densities = [0.1, 0.4, 0.7, 1.0]
Member

I think it would be more interesting to explore densities such as [0.01, 0.1, 1]; sparse representations only really make sense below a density of 0.1.

XP = sparse.hstack(columns, dtype=X.dtype).tocsc()
if sparse.isspmatrix_csr(X):
if self.degree > 3:
return (self.transform(X.tocsc())).tocsr()
Member

I would not call .tocsr() back at the end in this case. Let the next step in the estimator pipeline do another conversion itself, only if necessary.

Contributor Author

It would seem odd to me to get a CSR matrix returned for some degrees, but a CSC matrix returned for others. It seems this could mess up some people's pipelines. If you feel I shouldn't convert back to the format the user provided, I'm happy to change this, but it seems a bit odd to me.

Member

scikit-learn estimators automatically convert sparse matrix formats when they need to, so I think it's fine not to convert a priori and to spare a potentially useless memory copy, although I agree this is not a big deal.

Contributor Author

The CSC PolynomialFeatures test checked that the output was also a CSC matrix, so I figured I'd follow the same pattern. Removing this conversion would make that part of the test fail. If it's really OK for the output not to be in the same format as the input, I'll go ahead and remove it. I thought some algorithms ran much slower on certain sparse formats, though, so users might expect this consistency. Let me know.

@@ -34,8 +34,9 @@
from ..utils.validation import (check_is_fitted, check_random_state,
FLOAT_DTYPES)

from ._encoders import OneHotEncoder
from ._csr_expansion import _csr_expansion
Member

Please rename both the module and the function to _csr_polynomial_expansion to make it more explicit.

@AWNystrom
Contributor Author

I seem to have made tests fail with the last push. Sorry about that. I'll fix it when I get a chance.

@sklearn-lgtm

This pull request introduces 1 alert when merging 8e450a5 into acf3bab - view on LGTM.com

new alerts:

  • 1 for Unused local variable

Comment posted by LGTM.com

@ogrisel
Member

ogrisel commented Oct 2, 2018

Please delete the sklearn/preprocessing/csr_expansion.pyx file that is no longer used.

@ogrisel
Member

ogrisel commented Oct 2, 2018

There is a Travis job configuration that fails with:

from scipy.sparse import random as sparse_random
ImportError: cannot import name random

However, for scikit-learn 0.21 we will bump the scipy dependency up to 0.17 (#12184). We still need this old scipy version running as long as we have not released the first 0.20.1 bugfix release, though (to simplify backporting efforts for the first batch of bugfixes in master over the month to come). So in the meantime, you can do:

try:
    from scipy.sparse import random as sparse_random
except ImportError:
    # Avoid import error when running the tests with scipy < 0.17
    sparse_random = None
...

# TODO: remove skipif once scipy < 0.17 is no longer supported
@pytest.mark.skipif(sparse_random is None,
                    reason="scipy.sparse.random requires scipy >= 0.17")
def test_that_needs_sparse_random():
    ...

@ogrisel
Member

ogrisel commented Oct 2, 2018

The fact that the CSR expansion is still significantly faster than the dense version even when density == 1.0 in the benchmark makes me think that the dense case could also receive some love but this should probably be done in a separate PR once this one is merged.

degree = 2
trials = 5
num_rows = 100
dimensionalities = np.array([50, 100, 150, 200])
Member

Maybe we could instead explore:

dimensionalities = np.array([25, 50, 100, 200])

and then use a log-log plot, so that power-law time complexities show up as straight lines and the plots become easier to interpret.
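
For instance, a plotting sketch (the timing values below are placeholders standing in for real benchmark output):

import numpy as np
import matplotlib.pyplot as plt

dimensionalities = np.array([25, 50, 100, 200])
dense_times = np.array([0.01, 0.04, 0.16, 0.64])    # placeholder data
csr_times = np.array([0.001, 0.002, 0.004, 0.008])  # placeholder data

# On a log-log plot an O(d^k) curve is a straight line with slope k,
# so the scaling of each method can be read off directly.
plt.loglog(dimensionalities, dense_times, label='dense')
plt.loglog(dimensionalities, csr_times, label='csr')
plt.xlabel('number of input features')
plt.ylabel('time (seconds)')
plt.legend()
plt.show()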

@AWNystrom
Contributor Author

> Maybe it's not so similar to CSR-CSR. Either way it was musing rather than a request for changes...

No worries. I was just afraid I'd missed an important connection.

@jnothman
Member

jnothman commented Oct 16, 2018 via email

@AWNystrom
Contributor Author

> (and I don't know how Ben fits in the picture...!)

(I told him I'd try to weave that in, and that was my attempt.)

@AWNystrom
Contributor Author

Thoughts on logging a warning, when PolynomialFeatures is called on a CSC matrix with degree 2 or 3, that recommends using CSR instead?
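
For concreteness, a hypothetical sketch of such a warning (the message wording is illustrative only):

import warnings

warnings.warn("PolynomialFeatures of degree 2 or 3 is much faster on "
              "CSR input; consider converting with X.tocsr().")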

@jnothman
Member

Ambivalent? We don't currently warn on CSR-CSC conversion e.g. in random forests, but perhaps this is different. Scipy has SparseEfficiencyWarnings, but they usually indicate something different to this.

@AWNystrom
Contributor Author

> Ambivalent? We don't currently warn on CSR-CSC conversion e.g. in random forests, but perhaps this is different. Scipy has SparseEfficiencyWarnings, but they usually indicate something different to this.

Whatever you think is best. Alternatively, we could do a return self.transform(X.tocsr()).tocsc() in this case. The CSC path is just so slow that the user would benefit from either a warning or an automatic conversion to CSR. The latter would double the memory used to hold the data, though, which is an argument against it.

@jnothman
Member

jnothman commented Oct 17, 2018 via email

@sklearn-lgtm

This pull request introduces 1 alert and fixes 1 when merging 876ac2f into bea1eb5 - view on LGTM.com

new alerts:

  • 1 for Syntax error

fixed alerts:

  • 1 for Unused local variable

Comment posted by LGTM.com

Since we call the CSR method when able given a CSC input, this adds cases where the CSC method has to be used because the degree is too high.
@AWNystrom
Contributor Author

OK, I can't think of anything else that should be done. Let me know if there's anything else you'd like changed.

@AWNystrom
Contributor Author

Actually, doing this conversion to CSR on CSC input invalidates the benchmark. We could add a parameter to the PolynomialFeatures constructor that prevents the conversion.

@jnothman
Member

jnothman commented Oct 18, 2018 via email

@AWNystrom
Contributor Author

AWNystrom commented Oct 18, 2018 via email

@jnothman
Member

jnothman commented Oct 18, 2018 via email

@ogrisel
Member

ogrisel commented Oct 18, 2018

And because the new benchmark will only run without CSC, we can change it to run for a larger number of features.

@ogrisel
Member

ogrisel commented Oct 18, 2018

Or alternatively without the log scale to make the speed-up effect look more dramatic ;)

@AWNystrom
Contributor Author

It's been done. Plot updated at the top of this page.

@ogrisel ogrisel merged commit a5fa7d3 into scikit-learn:master Oct 19, 2018
@ogrisel
Member

ogrisel commented Oct 19, 2018

Merged! Thank you very much @AWNystrom!

@AWNystrom
Contributor Author

> Merged! Thank you very much @AWNystrom!

Thank you! It's been a pleasure contributing. Thanks for the code reviews, everyone!

@AWNystrom AWNystrom deleted the csr_expansion branch April 20, 2019 21:16
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019