ENH more efficient _num_combinations calculation in PolynomialFeatures #19734


Merged: 18 commits merged into scikit-learn:main on Apr 2, 2021

Conversation

@frrad (Contributor) commented Mar 20, 2021

What does this implement/fix? Explain your changes.

Currently, fitting a PolynomialFeatures transformer requires iterating over all possible output features in order to calculate n_output_features_. This can be quite slow. When the input is a sparse CSR matrix and degree <= 3, the transform itself can be faster than this calculation.

This change calculates the dimension directly instead.

The columns = 1000 case of the added test_polynomial_features_csr_wide test takes almost 10 seconds on my machine before this change.
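
For context, the number of output features has a closed form, so it can be computed without enumerating the combinations. A rough sketch of the idea (not necessarily the exact code merged in this PR), using scipy.special.comb:

from scipy.special import comb

def num_poly_features(n_features, degree, interaction_only, include_bias):
    # Count the monomials of total degree 1..degree in n_features variables.
    if interaction_only:
        # Each feature may appear at most once: choose i distinct features.
        total = sum(comb(n_features, i, exact=True)
                    for i in range(1, min(degree, n_features) + 1))
    else:
        # Multisets of size at most `degree`: C(n_features + degree, degree),
        # minus 1 to drop the empty product (the bias term).
        total = comb(n_features + degree, degree, exact=True) - 1
    return total + 1 if include_bias else total

For example, num_poly_features(1000, 3, False, True) evaluates to 167,668,501 without materializing a single combination; enumerating counts of that order is what makes the old fit slow.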

Reference Issues/PRs

Related to #16803: once that issue is fixed, it will be possible to compute the polynomial features for very large, very sparse matrices quickly, and this calculation will start to dominate the time to fit_transform for such input.

Also, if you set columns = 10000 the test will trigger #16803 and fail.

Benchmarks

(Benchmark plots attached as images: "before" and "after".)

@@ -113,6 +113,34 @@ def _combinations(n_features, degree, interaction_only, include_bias):
        return chain.from_iterable(comb(range(n_features), i)
                                   for i in range(start, degree + 1))

    @staticmethod
    def _num_combinations(n_features, degree, interaction_only, include_bias):
        def ncr(n, r):
Member:
Why not use scipy.special.comb?

Contributor Author:
Didn't know about this lib, thanks.
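
For reference, scipy.special.comb returns a float by default; passing exact=True makes it use exact integer arithmetic, which matters once the counts get large:

>>> from scipy.special import comb
>>> comb(1000, 3)               # floating-point result
166167000.0
>>> comb(1000, 3, exact=True)   # exact Python int
166167000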

    try:
        est.fit_transform(x)
    except ValueError:
        pytest.fail("possible overflow")
Member:
I'm uncomfortable classifying an overflow as a test failure, rather than an error in executing the test.

Contributor Author:
Changed.
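
A self-contained sketch of the revised shape (the test name and data below are illustrative, not the PR's actual test): the call is made directly, so an unexpected ValueError propagates with its original traceback instead of being converted into a pytest.fail() message.

import scipy.sparse as sparse
from sklearn.preprocessing import PolynomialFeatures

def test_csr_polynomial_expansion_does_not_overflow():
    # Wide but tiny sparse input; the concern is the feature-count
    # bookkeeping, not the data itself.
    x = sparse.random(3, 1000, density=0.001, format="csr", random_state=0)
    est = PolynomialFeatures(degree=2)
    # No try/except: any exception surfaces directly in the test report.
    est.fit_transform(x)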

        for b in [True, False]
    ],
)
def test_num_combinations(n_features, degree, interaction_only, include_bias):
Member:
Would it not be sufficient to just assert that the width of the transformed matrix is equal to n_output_features_, rather than testing the private API?

@frrad (Contributor Author), Mar 21, 2021:
I changed this to test n_output_features_ instead of _num_combinations. Good call.

I'm still comparing it to sum([1 for _ in self._combinations]) though. Comparing it to the width of the transformed matrix would couple this test to the implementation of transform we happen to get and not all of the implementations are so straightforward. I'd prefer to avoid that if possible.

If your objection is to using the private method, I would propose just copy/pasting that definition into this test.

Of course, we can just do it your way too 🤷 let me know.

Member:
I don't have any real issue with this, although I've not checked whether the size of _combinations is tested elsewhere in the file.
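
A minimal sketch of the approach described above (illustrative, not the merged test): brute_force_count mirrors what the private _combinations helper enumerates, while the assertion only touches the public n_output_features_ attribute.

import itertools
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def brute_force_count(n_features, degree, interaction_only, include_bias):
    # Enumerate every output term the transformer would generate.
    comb = (itertools.combinations if interaction_only
            else itertools.combinations_with_replacement)
    start = 0 if include_bias else 1
    return sum(1 for i in range(start, degree + 1)
               for _ in comb(range(n_features), i))

def check_num_output_features(n_features, degree, interaction_only, include_bias):
    X = np.ones((2, n_features))
    est = PolynomialFeatures(degree=degree, interaction_only=interaction_only,
                             include_bias=include_bias).fit(X)
    assert est.n_output_features_ == brute_force_count(
        n_features, degree, interaction_only, include_bias)

check_num_output_features(n_features=5, degree=3,
                          interaction_only=False, include_bias=True)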

@frrad (Contributor Author) commented Mar 21, 2021

Thanks for the prompt review. I've addressed your comments. Please take another look.

@jnothman (Member) left a comment
Otherwise looks good

@@ -0,0 +1,51 @@
#!/usr/bin/env python

import matplotlib.pyplot as plt
Member:
I wouldn't have thought we needed a new benchmark for this. The speed gains are obvious to reason about, and there would be little benefit in trying to fine-tune it.

Contributor Author:
👍 I was mostly trying to substantiate my claim that the time to fit_transform is dominated by fit on certain data. But, I got the pictures and you seem to agree so... Removed.

@@ -552,6 +552,39 @@ def test_polynomial_features_csr_X(deg, include_bias, interaction_only, dtype):
    assert_array_almost_equal(Xt_csr.A, Xt_dense)


@pytest.mark.parametrize("columns", [1, 2, 3, 1000])
def test_polynomial_features_csr_wide(columns):
Member:
What are we trying to test here? That we don't get a crash for wide data? Do we currently get a crash in more cases than after this pull request?

Contributor Author:
This is basically a test for #16803 and as such, maybe shouldn't be in this PR.

It also serves as a kind of performance regression test since it will run very slowly on the old version of fit. Are there actual performance tests anywhere?

Member:
We have the ASV benchmarks, but no other regular performance tests.
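
For illustration, the wide-CSR scenario discussed in this thread can be exercised with a few lines (sizes and parameters here are illustrative):

import scipy.sparse as sparse
from sklearn.preprocessing import PolynomialFeatures

# A very wide, very sparse CSR matrix: few rows, many columns.
X = sparse.random(6, 1000, density=0.01, format="csr", random_state=0)

est = PolynomialFeatures(degree=2, include_bias=False)
Xt = est.fit_transform(X)

# With a closed-form count, fit no longer has to enumerate the
# ~500,000 output columns; only transform materializes them (sparsely).
assert Xt.shape[1] == est.n_output_features_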

    est.fit_transform(x)


@pytest.mark.parametrize(
Member:
The usual way to do this would be with multiple parametrize decorators, one for each parameter.
However, I'm not sure that we need to test quite so many (1000) combinations. How long do they take to run altogether?

Contributor Author:
pytest test_polynomial.py -k test_num_combinations
====================================================================================== test session starts =======================================================================================
platform linux -- Python 3.9.2, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /home/frederick/Projects/scikit-learn, configfile: setup.cfg
plugins: cov-2.11.1
collected 203 items / 139 deselected / 64 selected                                                                                                                                               

test_polynomial.py ................................................................                                                                                                        [100%]

=============================================================================== 64 passed, 139 deselected in 0.37s ===============================================================================

Contributor Author:
On my machine it takes 0.37s to run all 64 combinations.

Member:
Sorry, misread those ranges. It's still a fairly large runtime for a minor test, but acceptable.
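
For reference, the stacked-decorator style mentioned above looks like this; pytest runs the cross product of the stacked parameter lists (the values shown are illustrative):

import pytest

@pytest.mark.parametrize("include_bias", [True, False])
@pytest.mark.parametrize("interaction_only", [True, False])
@pytest.mark.parametrize("degree", [1, 2, 3, 4])
@pytest.mark.parametrize("n_features", [1, 2, 5, 10])
def test_num_combinations(n_features, degree, interaction_only, include_bias):
    ...  # test body elided; these values yield 4 * 4 * 2 * 2 = 64 combinations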

@frrad (Contributor Author) left a comment
Responded to all your comments. PTAL

@frrad (Contributor Author) commented Mar 22, 2021

Hey, just wanted to let you know I'll be leaving for vacation in ~48h and will be back on the 31st. If there are comments I don't respond to between those times, it's not because I've abandoned this PR.

@lorentzenchr (Member) left a comment
This is looking good. Just a few comments.

@frrad (Contributor Author) commented Mar 31, 2021

I'm back!

Thanks for your review @lorentzenchr. I have addressed your comments. PTAL when you have time.

@frrad (Contributor Author) commented Apr 2, 2021

Removed the weird test. PTAL.

@lorentzenchr (Member) left a comment
LGTM. @frrad Thank you very much for spotting this opportunity for improvement and addressing it with this PR.

@lorentzenchr changed the title from "[MRG] more efficient _num_combinations calculation" to "ENH more efficient _num_combinations calculation in PolynomialFeatures" on Apr 2, 2021
@lorentzenchr merged commit bc7cd31 into scikit-learn:main on Apr 2, 2021
@frrad deleted the overflow branch on April 2, 2021, 15:40
@glemaitre mentioned this pull request on Apr 22, 2021