ENH more efficient _num_combinations calculation in PolynomialFeatures #19734


Merged: 18 commits merged into scikit-learn:main on Apr 2, 2021

Conversation

@frrad (Contributor) commented Mar 20, 2021

What does this implement/fix? Explain your changes.

Currently, fitting a PolynomialFeatures transformer requires iterating over all possible output features in order to calculate n_output_features_. This can be quite slow. When the input is a sparse CSR matrix and degree <= 3, the transform itself can be faster than this calculation.

This change calculates the dimension directly instead.

The columns = 1000 case of the added test_polynomial_features_csr_wide test takes almost 10 seconds on my machine before this change.
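
For context, the number of output features has a closed form, so it can be computed without enumerating the combinations. A rough sketch of the idea (not necessarily the exact code merged in this PR), using scipy.special.comb:

from scipy.special import comb

def num_poly_features(n_features, degree, interaction_only, include_bias):
    # Count the monomials of total degree 1..degree in n_features variables.
    if interaction_only:
        # Each feature may appear at most once: choose i distinct features.
        total = sum(comb(n_features, i, exact=True)
                    for i in range(1, min(degree, n_features) + 1))
    else:
        # Multisets of size at most `degree`: C(n_features + degree, degree),
        # minus 1 to drop the empty product (the bias term).
        total = comb(n_features + degree, degree, exact=True) - 1
    return total + 1 if include_bias else total

For example, num_poly_features(1000, 3, False, True) evaluates to 167,668,501 without materializing a single combination; enumerating counts of that order is what makes the old fit slow.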

Reference Issues/PRs

Related to #16803: once that issue is fixed, it will be possible to compute the polynomial features for very large, very sparse matrices quickly, and this calculation will start to dominate the time to fit_transform for such input.

Also, if you set columns = 10000 the test will trigger #16803 and fail.

Benchmarks

(Benchmark plots attached as images: "before" and "after".)

@@ -113,6 +113,34 @@ def _combinations(n_features, degree, interaction_only, include_bias):
        return chain.from_iterable(comb(range(n_features), i)
                                   for i in range(start, degree + 1))

    @staticmethod
    def _num_combinations(n_features, degree, interaction_only, include_bias):
        def ncr(n, r):
Member:
Why not use scipy.special.comb?

Contributor Author:
Didn't know about this lib, thanks.
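
For reference, scipy.special.comb returns a float by default; passing exact=True makes it use exact integer arithmetic, which matters once the counts get large:

>>> from scipy.special import comb
>>> comb(1000, 3)               # floating-point result
166167000.0
>>> comb(1000, 3, exact=True)   # exact Python int
166167000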

    try:
        est.fit_transform(x)
    except ValueError:
        pytest.fail("possible overflow")
Member:
I'm uncomfortable classifying an overflow as a test failure, rather than an error in executing the test.

Contributor Author:
Changed.
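
A self-contained sketch of the revised shape (the test name and data below are illustrative, not the PR's actual test): the call is made directly, so an unexpected ValueError propagates with its original traceback instead of being converted into a pytest.fail() message.

import scipy.sparse as sparse
from sklearn.preprocessing import PolynomialFeatures

def test_csr_polynomial_expansion_does_not_overflow():
    # Wide but tiny sparse input; the concern is the feature-count
    # bookkeeping, not the data itself.
    x = sparse.random(3, 1000, density=0.001, format="csr", random_state=0)
    est = PolynomialFeatures(degree=2)
    # No try/except: any exception surfaces directly in the test report.
    est.fit_transform(x)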

        for b in [True, False]
    ],
)
def test_num_combinations(n_features, degree, interaction_only, include_bias):
Member:
Would it not be sufficient to just assert that the width of the transformed matrix is equal to n_output_features_, rather than testing the private API?

@frrad (Contributor Author), Mar 21, 2021:
I changed this to test n_output_features_ instead of _num_combinations. Good call.

I'm still comparing it to sum([1 for _ in self._combinations]) though. Comparing it to the width of the transformed matrix would couple this test to the implementation of transform we happen to get and not all of the implementations are so straightforward. I'd prefer to avoid that if possible.

If your objection is to using the private method, I would propose just copy/pasting that definition into this test.

Of course, we can just do it your way too 🤷 let me know.

Member:
I don't have any real issue with this, although I've not checked whether the size of _combinations is tested elsewhere in the file.
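
A minimal sketch of the approach described above (illustrative, not the merged test): brute_force_count mirrors what the private _combinations helper enumerates, while the assertion only touches the public n_output_features_ attribute.

import itertools
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def brute_force_count(n_features, degree, interaction_only, include_bias):
    # Enumerate every output term the transformer would generate.
    comb = (itertools.combinations if interaction_only
            else itertools.combinations_with_replacement)
    start = 0 if include_bias else 1
    return sum(1 for i in range(start, degree + 1)
               for _ in comb(range(n_features), i))

def check_num_output_features(n_features, degree, interaction_only, include_bias):
    X = np.ones((2, n_features))
    est = PolynomialFeatures(degree=degree, interaction_only=interaction_only,
                             include_bias=include_bias).fit(X)
    assert est.n_output_features_ == brute_force_count(
        n_features, degree, interaction_only, include_bias)

check_num_output_features(n_features=5, degree=3,
                          interaction_only=False, include_bias=True)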

@frrad (Contributor Author) commented Mar 21, 2021

Thanks for the prompt review. I've addressed your comments. Please take another look.

@jnothman (Member) left a comment
Otherwise looks good

@@ -0,0 +1,51 @@
#!/usr/bin/env python

import matplotlib.pyplot as plt
Member:
I wouldn't have thought we needed a new benchmark for this. The speed gains are obvious to reason about, and there would be little benefit in trying to fine-tune it.

Contributor Author:
👍 I was mostly trying to substantiate my claim that the time to fit_transform is dominated by fit on certain data. But, I got the pictures and you seem to agree so... Removed.

@@ -552,6 +552,39 @@ def test_polynomial_features_csr_X(deg, include_bias, interaction_only, dtype):
    assert_array_almost_equal(Xt_csr.A, Xt_dense)


@pytest.mark.parametrize("columns", [1, 2, 3, 1000])
def test_polynomial_features_csr_wide(columns):
Member:
What are we trying to test here? That we don't get a crash for wide data? Do we currently get a crash in more cases than after this pull request?

Contributor Author:
This is basically a test for #16803 and as such, maybe shouldn't be in this PR.

It also serves as a kind of performance regression test since it will run very slowly on the old version of fit. Are there actual performance tests anywhere?

Member:
We have the ASV benchmarks, but no other regular performance tests.
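
For illustration, the wide-CSR scenario discussed in this thread can be exercised with a few lines (sizes and parameters here are illustrative):

import scipy.sparse as sparse
from sklearn.preprocessing import PolynomialFeatures

# A very wide, very sparse CSR matrix: few rows, many columns.
X = sparse.random(6, 1000, density=0.01, format="csr", random_state=0)

est = PolynomialFeatures(degree=2, include_bias=False)
Xt = est.fit_transform(X)

# With a closed-form count, fit no longer has to enumerate the
# ~500,000 output columns; only transform materializes them (sparsely).
assert Xt.shape[1] == est.n_output_features_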

    est.fit_transform(x)


@pytest.mark.parametrize(
Member:
The usual way to do this would be with multiple parametrize decorators, one for each parameter.
However, I'm not sure that we need to test quite so many (1000) combinations. How long do they take to run altogether?

Contributor Author:
pytest test_polynomial.py -k test_num_combinations
====================================================================================== test session starts =======================================================================================
platform linux -- Python 3.9.2, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /home/frederick/Projects/scikit-learn, configfile: setup.cfg
plugins: cov-2.11.1
collected 203 items / 139 deselected / 64 selected                                                                                                                                               

test_polynomial.py ................................................................                                                                                                        [100%]

=============================================================================== 64 passed, 139 deselected in 0.37s ===============================================================================

Contributor Author:
On my machine it takes 0.37s to run all 64 combinations.

Member:
Sorry, misread those ranges. It's still a fairly large runtime for a minor test, but acceptable.
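
For reference, the stacked-decorator style mentioned above looks like this; pytest runs the cross product of the stacked parameter lists (the values shown are illustrative):

import pytest

@pytest.mark.parametrize("include_bias", [True, False])
@pytest.mark.parametrize("interaction_only", [True, False])
@pytest.mark.parametrize("degree", [1, 2, 3, 4])
@pytest.mark.parametrize("n_features", [1, 2, 5, 10])
def test_num_combinations(n_features, degree, interaction_only, include_bias):
    ...  # test body elided; these values yield 4 * 4 * 2 * 2 = 64 combinations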

@frrad (Contributor Author) left a comment
Responded to all your comments. PTAL

@frrad (Contributor Author) commented Mar 22, 2021

Hey, just wanted to let you know I'll be leaving for vacation in ~48h and will be back on the 31st. If there are comments I don't respond to between those times, it's not because I've abandoned this PR.

@lorentzenchr (Member) left a comment
This is looking good. Just a few comments.

@frrad (Contributor Author) commented Mar 31, 2021

I'm back!

Thanks for your review @lorentzenchr. I have addressed your comments. PTAL when you have time.

@frrad (Contributor Author) commented Apr 2, 2021

Removed the weird test. PTAL.

@lorentzenchr (Member) left a comment
LGTM. @frrad Thank you very much for spotting this opportunity for improvement and addressing it with this PR.

@lorentzenchr changed the title from "[MRG] more efficient _num_combinations calculation" to "ENH more efficient _num_combinations calculation in PolynomialFeatures" on Apr 2, 2021
@lorentzenchr merged commit bc7cd31 into scikit-learn:main on Apr 2, 2021
@frrad deleted the overflow branch on April 2, 2021, 15:40
@glemaitre mentioned this pull request on Apr 22, 2021