[MRG] Fast PolynomialFeatures on CSR matrices #12197
Conversation
Thanks for your contribution, here is a first batch of comments.
Also, the build is broken under both Linux and Windows, but I have not investigated: the Cython module cannot be imported.
Could you also please include the results of the benchmarks in a comment, with your analysis, to help other reviewers without requiring them to locally build and run the benchmark of this PR?
```python
expanded_column = (3*D**2*i - 3*D*i**2 + i**3 + 11*i \
                   - 3*j**2 - 9*j) / 6 + i**2 \
                   - 2*D*i + D*j - D + k \
                   if interaction_only else \
```
I would find it more readable to use if / else blocks instead of a ternary expression here. Ternary expressions are great for one-line assignments, but as soon as the r.h.s. expression is too large to fit on a single line, it's more readable to use traditional if / else blocks.
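A sketch of that refactoring, in plain Python for illustration; the polynomial (non-interaction) branch is elided in the quoted diff, so it is only a placeholder here:

```python
# Sample values standing in for the surrounding Cython locals.
D, i, j, k, interaction_only = 10, 1, 2, 3, True

if interaction_only:
    expanded_column = ((3*D**2*i - 3*D*i**2 + i**3 + 11*i
                        - 3*j**2 - 9*j) / 6 + i**2
                       - 2*D*i + D*j - D + k)
else:
    # The non-interaction formula is not shown in the quoted diff.
    expanded_column = ...
```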
```python
k = indices[col_ptr_3]
expanded_column = (3*D**2*i - 3*D*i**2 + i**3 + 11*i \
                   - 3*j**2 - 9*j) / 6 + i**2 \
                   - 2*D*i + D*j - D + k \
```
style nitpick: do not add `\` at the end of lines. They are useless when there are parentheses around a multi-line expression.
```python
# A single dense row
# An empty matrix
# Multiple rows, but one is all zeros.
# The first row is all zeros. Tests that we're building indptr correctly.
```
Please remove those comments and write the unit tests instead, as they are missing.
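A minimal sketch of what one of those tests might look like, assuming the expansion is exercised through PolynomialFeatures; the test name and exact assertions are hypothetical:

```python
import numpy as np
import scipy.sparse as sparse
from numpy.testing import assert_array_almost_equal
from sklearn.preprocessing import PolynomialFeatures


def test_csr_expansion_leading_zero_row():
    # Hypothetical test: the first row is all zeros, which exercises
    # the indptr bookkeeping mentioned in the comment above.
    X = sparse.csr_matrix(np.array([[0., 0.], [1., 2.]]))
    est = PolynomialFeatures(degree=2, include_bias=False)
    result = est.fit_transform(X)
    expected = est.fit_transform(X.toarray())
    assert sparse.isspmatrix_csr(result)
    assert_array_almost_equal(result.toarray(), expected)
```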
```
np.int64_t


def csr_expansion(ndarray[DATA_T, ndim=1] data,
```
Please make it a private function with a leading `_`.
```
D : The dimensionality of the input CSR matrix.
interaction_only : 0 for a polynomial expansion, 1 for an interaction
    expansion.
degree : The degree of the expansion. This must be either 2 or 3.
```
Please follow the numpydoc conventions to document the parameters. See other docstrings in the project for more details.
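For illustration, a numpydoc-style version of those parameters might read as follows (the `int` types are assumptions based on the surrounding diff):

```
Parameters
----------
D : int
    The dimensionality of the input CSR matrix.
interaction_only : int
    0 for a polynomial expansion, 1 for an interaction expansion.
degree : int
    The degree of the expansion. This must be either 2 or 3.
```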
```
on the order of d^k where k is the degree of the expansion, assuming all
rows are of similar density. This method was laid out in the work
"Leveraging Sparsity to Speed Up Polynomial Feature Expansions of CSR
Matrices Using K-Simplex Numbers" by Andrew Nystrom and John Hughes.
```
Please put this reference in a dedicated section of the docstring:

```
References
----------
```

See the docstrings in sklearn/utils/extmath.py for an example.
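For this docstring, that section would look something like:

```
References
----------
"Leveraging Sparsity to Speed Up Polynomial Feature Expansions of CSR
Matrices Using K-Simplex Numbers" by Andrew Nystrom and John Hughes:
https://arxiv.org/abs/1803.06418
```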
```python
trials = 5
num_rows = 100
dimensionalities = [50, 100, 200, 400, 800]
densities = [0.1, 0.4, 0.7, 1.0]
```
I think it would be more interesting to explore densities such as `[0.01, 0.1, 1]`. I think sparse representations only make sense below 0.1 density.
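For what it's worth, a minimal sketch of how the benchmark could draw inputs at those densities (variable values are illustrative; `scipy.sparse.random` requires scipy >= 0.17):

```python
import time

import scipy.sparse as sparse
from sklearn.preprocessing import PolynomialFeatures

for density in [0.01, 0.1, 1.0]:
    # Random CSR input at the target density.
    X = sparse.random(100, 200, density=density, format='csr')
    t0 = time.time()
    XP = PolynomialFeatures(degree=2).fit_transform(X)
    print(density, time.time() - t0)
```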
sklearn/preprocessing/data.py (outdated)

```python
XP = sparse.hstack(columns, dtype=X.dtype).tocsc()
if sparse.isspmatrix_csr(X):
    if self.degree > 3:
        return (self.transform(X.tocsc())).tocsr()
```
I would not call `.tocsr()` back at the end in this case. Let the next step in the estimator pipeline do another conversion itself, only if necessary.
It would seem odd to me to get a csr matrix returned for some degrees, but a csc matrix returned for others. It seems this could mess up some people's pipelines. If you feel I shouldn't convert to the same format as the user's input, I'm happy to change this, but it seems a bit odd to me.
scikit-learn estimators automatically convert the sparse matrix format when they need to, so I think it's fine not to convert a priori and spare a potentially useless memory copy, although I agree this is not a big deal.
The csc PolynomialFeatures test checked that the output was also a csc matrix, so I figured I'd follow the same pattern. Removing this conversion would make that part of the test fail. If it's really OK for the output to not be in the same format as the input, I'll go ahead and remove it. I thought that some algorithms ran much slower on certain sparse formats, so it seems users might expect this consistency though. Let me know.
sklearn/preprocessing/data.py (outdated)

```diff
@@ -34,8 +34,9 @@
 from ..utils.validation import (check_is_fitted, check_random_state,
                                 FLOAT_DTYPES)

 from ._encoders import OneHotEncoder
+from ._csr_expansion import _csr_expansion
```
Please rename both the module and the function to `_csr_polynomial_expansion` to make it more explicit.
…, but enough for a lower degree.
I seem to have made tests fail with the last push. Sorry about that. I'll fix it when I get a chance.
This pull request introduces 1 alert when merging 8e450a5 into acf3bab - view on LGTM.com.

Comment posted by LGTM.com
Please delete the
There is a travis job configuration that fails with:
However, for scikit-learn 0.21 we will bump the scipy dependency up to 0.17 (#12184). We still need this old scipy version running as long as we have not released the first 0.20.1 bugfix release though (to simplify backporting efforts for the first batch of bugfixes in master for the month to come). So in the meantime, you can do:

```python
try:
    from scipy.sparse import random as sparse_random
except ImportError:
    # Avoid an import error when running the tests with scipy < 0.17
    sparse_random = None

...

# TODO: remove skipif once scipy < 0.17 is no longer supported
@pytest.mark.skipif(sparse_random is None,
                    reason="scipy.sparse.random requires scipy >= 0.17")
def test_that_needs_sparse_random():
    ...
```
The fact that the CSR expansion is still significantly faster than the dense version, even when density == 1.0 in the benchmark, makes me think that the dense case could also receive some love, but this should probably be done in a separate PR once this one is merged.
```python
degree = 2
trials = 5
num_rows = 100
dimensionalities = np.array([50, 100, 150, 200])
```
Maybe we could instead explore `dimensionalities = np.array([25, 50, 100, 200])` and then use a loglog plot, so that straight lines would reflect linear time complexities and make it easier to interpret the plots.
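Something along these lines; the timing arrays here are hypothetical placeholders for the benchmark's measurements:

```python
import numpy as np
import matplotlib.pyplot as plt

dimensionalities = np.array([25, 50, 100, 200])
# Placeholder runtimes per dimensionality; the real values would come
# from timing the benchmark.
csr_times = np.array([0.01, 0.03, 0.1, 0.4])
dense_times = np.array([0.05, 0.2, 0.9, 4.0])

# On a log-log plot, a straight line of slope s indicates O(d^s) scaling.
plt.loglog(dimensionalities, csr_times, label='csr')
plt.loglog(dimensionalities, dense_times, label='dense')
plt.xlabel('dimensionality')
plt.ylabel('time (s)')
plt.legend()
plt.show()
```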
No worries. I was just afraid I'd missed an important connection.

(And I don't know how Ben fits in the picture...!)

(I told him I'd try to weave that in, and that was my attempt.)
Thoughts on logging a warning that recommends using CSR when PolynomialFeatures is called on a CSC matrix with degree 2 or 3?
Ambivalent? We don't currently warn on CSR-CSC conversion, e.g. in random forests, but perhaps this is different. Scipy has SparseEfficiencyWarnings, but they usually indicate something different from this.
Whatever you think is best. Alternatively, we could do a `return self.transform(X.tocsr()).tocsc()` in this case. The CSC path is just so slow that the user would benefit from a warning or automatic conversion to CSR. The latter would double the memory used for holding the data, so that's an argument against it.
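A hypothetical sketch of that alternative; the helper name and exact degree check are illustrative, not the PR's actual code:

```python
import scipy.sparse as sparse

def _transform_sparse(self, X):
    # Route CSC input through the fast CSR path, then convert back;
    # `self.degree < 4` mirrors the 2-or-3 degree limit of the CSR method.
    if sparse.isspmatrix_csc(X) and self.degree < 4:
        return self.transform(X.tocsr()).tocsc()
    ...
```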
A lot of scipy.sparse methods do that kind of conversion internally, so maybe it's not so bad to similarly duplicate memory. In any case, I think it is something we can comfortably change if users complain.
This pull request introduces 1 alert and fixes 1 when merging 876ac2f into bea1eb5 - view on LGTM.com.

Comment posted by LGTM.com
Since we call the CSR method when able given a CSC input, adding cases where the CSR method has to be used because the degree's too high.
OK, I can't think of anything else that should be done. Let me know if there's anything else you'd like changed.
Actually, doing this conversion to CSR on CSC input invalidates the benchmark. We could add a parameter to the PolynomialFeatures constructor that prevents the conversion.
Does "it invalidates the benchmark" mean that it invalidates the benchmark script, or that this is no longer the most efficient?
The method in the benchmark labeled "csc" is actually the CSR method, because CSCs get converted to CSRs and use the CSR method when the degree is 2 or 3. This makes the CSC method look fast in the benchmark because it's not actually the CSC method. How about we just remove the benchmark?
Yes, remove the CSC benchmark.
And because the new benchmark will only run without CSC, we can change it to run for a larger number of features.

Or alternatively without the log scale, to make the speed-up effect look more dramatic ;)
It's been done. Plot updated at the top of this page.

Merged! Thank you very much @AWNystrom!

Thank you! It's been a pleasure contributing. Thanks for the code reviews, everyone!
This PR allows PolynomialFeatures to operate on compressed sparse row (CSR) matrices. It does so in a way that only takes products of nonzero features, so it scales with the density of the matrix to the power of the degree of the expansion.
Here's a plot that compares the runtime of this method to the dense and csc methods:

See https://arxiv.org/pdf/1803.06418.pdf for details of the algorithm.
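For readers landing here, a minimal usage sketch of what this PR enables; the shapes and density are illustrative, and `scipy.sparse.random` requires scipy >= 0.17:

```python
import scipy.sparse as sparse
from sklearn.preprocessing import PolynomialFeatures

# A sparse input: 1000 samples, 100 features, 1% density.
X = sparse.random(1000, 100, density=0.01, format='csr')

# With this PR, the expansion stays sparse and only multiplies stored
# nonzeros, so the cost scales roughly with density**degree.
poly = PolynomialFeatures(degree=2, interaction_only=False,
                          include_bias=False)
XP = poly.fit_transform(X)
print(XP.shape, XP.format)  # expanded feature count, 'csr'
```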