Skip to content

ENH more efficient _num_combinations calculation in PolynomialFeatures #19734

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Apr 2, 2021
Merged
4 changes: 4 additions & 0 deletions doc/whats_new/v1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -186,6 +186,10 @@ Changelog
:pr:`19426` by :user:`Alexandre Gramfort <agramfort>` and
:user:`Maria Telenczuk <maikia>`.

- |Efficiency| The implementation of `fit` for `PolynomialFeatures` transformer
is now faster. This is especially noticeable on large sparse input.
:pr:`19734` by :user:`Fred Robinson <frrad>`.

:mod:`sklearn.manifold`
.......................

Expand Down
35 changes: 29 additions & 6 deletions sklearn/preprocessing/_polynomial.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
import numpy as np
from scipy import sparse
from scipy.interpolate import BSpline
from scipy.special import comb

from ..base import BaseEstimator, TransformerMixin
from ..utils import check_array
Expand Down Expand Up @@ -113,6 +114,29 @@ def _combinations(n_features, degree, interaction_only, include_bias):
return chain.from_iterable(comb(range(n_features), i)
for i in range(start, degree + 1))

@staticmethod
def _num_combinations(n_features, degree, interaction_only, include_bias):
"""Calculate number of terms in polynomial expansion

This should be equivalent to counting the number of terms returned by
_combinations(...) but much faster.
"""

if interaction_only:
combinations = sum(
[
comb(n_features, i, exact=True)
for i in range(1, min(degree + 1, n_features + 1))
]
)
else:
combinations = comb(n_features + degree, degree, exact=True) - 1

if include_bias:
combinations += 1

return combinations

@property
def powers_(self):
check_is_fitted(self)
Expand Down Expand Up @@ -170,13 +194,12 @@ def fit(self, X, y=None):
self : object
Fitted transformer.
"""
n_samples, n_features = self._validate_data(
X, accept_sparse=True).shape
combinations = self._combinations(n_features, self.degree,
self.interaction_only,
self.include_bias)
_, n_features = self._validate_data(X, accept_sparse=True).shape
self.n_input_features_ = n_features
self.n_output_features_ = sum(1 for _ in combinations)
self.n_output_features_ = self._num_combinations(
n_features, self.degree, self.interaction_only, self.include_bias
)

return self

def transform(self, X):
Expand Down
21 changes: 21 additions & 0 deletions sklearn/preprocessing/tests/test_polynomial.py
Original file line number Diff line number Diff line change
Expand Up @@ -552,6 +552,27 @@ def test_polynomial_features_csr_X(deg, include_bias, interaction_only, dtype):
assert_array_almost_equal(Xt_csr.A, Xt_dense)


@pytest.mark.parametrize("n_features", [1, 4, 5])
@pytest.mark.parametrize("degree", range(1, 5))
@pytest.mark.parametrize("interaction_only", [True, False])
@pytest.mark.parametrize("include_bias", [True, False])
def test_num_combinations(n_features, degree, interaction_only, include_bias):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it not be sufficient to just assert that the width of the transformed matrix is equal to n_features_out_ rather than testing the private API?

Copy link
Contributor Author

@frrad frrad Mar 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this to test n_output_features_ instead of _num_combinations. Good call.

I'm still comparing it to sum([1 for _ in self._combinations]) though. Comparing it to the width of the transformed matrix would couple this test to the implementation of transform we happen to get and not all of the implementations are so straightforward. I'd prefer to avoid that if possible.

If your objection is to using the private method I would propose just copy / pasting that definition into this test.

Of course, we can just do it your way too 🤷 let me know.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any real issue about this, although I've not checked if the size of _combinations is tested elsewhere in the file.

"""
Test that n_output_features_ is calculated correctly.
"""
x = sparse.csr_matrix(([1], ([0], [n_features - 1])))
est = PolynomialFeatures(
degree, interaction_only=interaction_only, include_bias=include_bias
)
est.fit(x)
num_combos = est.n_output_features_

combos = PolynomialFeatures._combinations(
n_features, degree, interaction_only, include_bias
)
assert num_combos == sum([1 for _ in combos])


@pytest.mark.parametrize(['deg', 'include_bias', 'interaction_only', 'dtype'],
[(2, True, False, np.float32),
(2, True, False, np.float64),
Expand Down