ENH more efficient _num_combinations calculation in PolynomialFeatures #19734
Conversation
sklearn/preprocessing/_polynomial.py
@@ -113,6 +113,34 @@ def _combinations(n_features, degree, interaction_only, include_bias):
        return chain.from_iterable(comb(range(n_features), i)
                                   for i in range(start, degree + 1))

    @staticmethod
    def _num_combinations(n_features, degree, interaction_only, include_bias):
        def ncr(n, r):
Why not use scipy.special.comb?
Didn't know about this lib, thanks.
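For reference, a minimal sketch of how scipy.special.comb can stand in for a hand-rolled "n choose r" helper; `exact=True` keeps the result an exact integer rather than a float approximation:

```python
# scipy.special.comb as a drop-in replacement for a hand-rolled ncr():
from scipy.special import comb

assert comb(5, 2, exact=True) == 10            # C(5, 2)
assert comb(1000, 3, exact=True) == 166167000  # exact integer arithmetic, no float rounding
```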
    try:
        est.fit_transform(x)
    except ValueError:
        pytest.fail("possible overflow")
I'm uncomfortable classifying an overflow as a test failure, rather than an error in executing the test.
Changed.
        for b in [True, False]
    ],
)
def test_num_combinations(n_features, degree, interaction_only, include_bias):
Would it not be sufficient to just assert that the width of the transformed matrix is equal to n_features_out_ rather than testing the private API?
I changed this to test n_output_features_ instead of _num_combinations. Good call.
I'm still comparing it to sum([1 for _ in self._combinations]) though. Comparing it to the width of the transformed matrix would couple this test to the implementation of transform we happen to get, and not all of the implementations are so straightforward. I'd prefer to avoid that if possible.
If your objection is to using the private method, I would propose just copy/pasting that definition into this test.
Of course, we can just do it your way too 🤷 let me know.
I don't have any real issue about this, although I've not checked if the size of _combinations is tested elsewhere in the file.
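For illustration, a minimal sketch of the kind of check being discussed, assuming the argument values come from the parametrization above and using the private _combinations staticmethod as it appears in this diff; the actual test in the PR may differ:

```python
# Hypothetical sketch: check the fitted n_output_features_ against a direct
# enumeration of the combinations, without going through transform().
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def check_num_combinations(n_features, degree, interaction_only, include_bias):
    x = np.zeros((1, n_features))  # contents are irrelevant; only the shape matters for fit
    est = PolynomialFeatures(
        degree, interaction_only=interaction_only, include_bias=include_bias
    )
    est.fit(x)

    # Private API, signature as shown in this diff; newer versions may differ.
    combos = PolynomialFeatures._combinations(
        n_features, degree, interaction_only, include_bias
    )
    assert est.n_output_features_ == sum(1 for _ in combos)
```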
Thanks for the prompt review. I've addressed your comments. Please take another look.
Otherwise looks good
@@ -0,0 +1,51 @@
#!/usr/bin/env python

import matplotlib.pyplot as plt
I wouldn't have thought we needed a new benchmark for this. The speed gains are obvious to reason about, and there would be little benefit in trying to fine-tune it.
👍 I was mostly trying to substantiate my claim that the time to fit_transform is dominated by fit on certain data. But I got the pictures and you seem to agree, so... Removed.
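A rough sketch of the kind of measurement behind that claim; the matrix size, density, and degree here are illustrative assumptions, not values from the removed benchmark:

```python
# Hypothetical sketch: time fit() and transform() separately on a wide,
# very sparse CSR matrix to see which one dominates fit_transform().
import time
import scipy.sparse as sparse
from sklearn.preprocessing import PolynomialFeatures

X = sparse.random(1000, 1000, density=0.001, format="csr", random_state=0)
est = PolynomialFeatures(degree=2)

t0 = time.perf_counter()
est.fit(X)
t1 = time.perf_counter()
est.transform(X)
t2 = time.perf_counter()

print(f"fit: {t1 - t0:.3f}s, transform: {t2 - t1:.3f}s")
```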
@@ -552,6 +552,39 @@ def test_polynomial_features_csr_X(deg, include_bias, interaction_only, dtype):
    assert_array_almost_equal(Xt_csr.A, Xt_dense)


@pytest.mark.parametrize("columns", [1, 2, 3, 1000])
def test_polynomial_features_csr_wide(columns):
What are we trying to test here? That we don't get a crash for wide data? Do we currently get a crash in more cases than after this pull request?
This is basically a test for #16803 and, as such, maybe shouldn't be in this PR.
It also serves as a kind of performance regression test since it will run very slowly on the old version of fit. Are there actual performance tests anywhere?
We have the ASV benchmarks but no other regular performance tests.
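For illustration, a minimal sketch of what such a wide-CSR check could look like; the row count, density, and degrees here are assumptions, not necessarily the PR's exact values:

```python
# Hypothetical sketch: fit_transform on a very sparse CSR matrix with many
# columns should work (and stay fast), and the output width should match
# n_output_features_.
import scipy.sparse as sparse
from sklearn.preprocessing import PolynomialFeatures

def check_csr_wide(columns):
    x = sparse.random(100, columns, density=0.001, format="csr", random_state=0)
    for degree in (2, 3):
        est = PolynomialFeatures(degree, interaction_only=False, include_bias=False)
        xt = est.fit_transform(x)
        assert xt.shape == (x.shape[0], est.n_output_features_)
```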
    est.fit_transform(x)


@pytest.mark.parametrize(
The usual way to do this would be with multiple parametrize decorators, one for each parameter.
However, I'm not sure that we need to test quite so many (1000) combinations. How long do they take to run altogether?
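For reference, the stacked-decorator style being suggested looks like this; the parameter values are placeholders, not the PR's:

```python
# One parametrize decorator per parameter; pytest expands the cross product
# of all decorators automatically.
import pytest

@pytest.mark.parametrize("n_features", [1, 4, 100])
@pytest.mark.parametrize("degree", [1, 3, 5])
@pytest.mark.parametrize("interaction_only", [True, False])
@pytest.mark.parametrize("include_bias", [True, False])
def test_num_combinations(n_features, degree, interaction_only, include_bias):
    ...
```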
pytest test_polynomial.py -k test_num_combinations
====================================================================================== test session starts =======================================================================================
platform linux -- Python 3.9.2, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /home/frederick/Projects/scikit-learn, configfile: setup.cfg
plugins: cov-2.11.1
collected 203 items / 139 deselected / 64 selected
test_polynomial.py ................................................................ [100%]
=============================================================================== 64 passed, 139 deselected in 0.37s ===============================================================================
On my machine it takes .37s to run all 64 combinations.
Sorry, misread those ranges. It's still a fairly large runtime for a minor test, but acceptable.
Responded to all your comments. PTAL
Hey, just wanted to let you know I'll be leaving for vacation in ~48h, will be back on the 31st. If there are comments I don't respond to between those times, it's not because I've abandoned this PR.
This is looking good. Just a few comments.
I'm back! Thanks for your review @lorentzenchr. I have addressed your comments. PTAL when you have time.
Removed the weird test. PTAL.
LGTM. @frrad Thank you very much for spotting this opportunity for improvement and addressing it with this PR.
What does this implement/fix? Explain your changes.
Currently, fitting a PolynomialFeatures transformer requires iterating over all possible output features in order to calculate n_output_features_. This can be quite slow. When the input is a sparse CSR matrix and degree <= 3, the transform itself can be faster than this calculation. This change calculates the dimension directly instead.
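A rough sketch of the closed-form count this change makes possible, using scipy.special.comb; the exact code in the PR may differ slightly:

```python
# Hypothetical sketch of computing the number of output features directly,
# instead of enumerating every combination.
from scipy.special import comb

def num_combinations(n_features, degree, interaction_only, include_bias):
    if interaction_only:
        # one term per subset of distinct features of size 1..degree
        total = sum(
            comb(n_features, i, exact=True)
            for i in range(1, min(degree, n_features) + 1)
        )
    else:
        # monomials of total degree 1..degree in n_features variables
        total = comb(n_features + degree, degree, exact=True) - 1
    if include_bias:
        total += 1
    return total
```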
The provided columns = 1000 instance of test_polynomial_features_csr_wide takes almost 10 seconds on my machine before this change.

Reference Issues/PRs
Related to #16803 - after that issue is fixed, it will be possible to calculate the polynomial features for very large, very sparse matrices quickly, and this issue will start to dominate the time to fit_transform for such input. Also, if you put columns = 10000 the test will trigger #16803 and fail.

Benchmarks
Before: [benchmark plot]

After: [benchmark plot]