Fixes #13173, implements faster polynomial features for dense matrices #13290

sdpython · 2019-02-26T17:10:47Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This PR implements a faster version of polynomial features for dense matrices. Instead of computing each features independently by going through all combinations, it broadcasts as much as possible the multiple by one vector. The speed up is significant (3 times faster) for matrices with less than 10.000 rows.

Any other comments?

More details about speed up can be found here: https://github.com/sdpython/mlinsights/blob/master/_doc/notebooks/sklearn/faster_polynomial_features.ipynb.

GaelVaroquaux · 2019-02-27T10:54:22Z

Can you add the benchmark script that you used in the benchmarks folder of the repo and paste here the graph that it outputs.

…into poly

sdpython · 2019-02-27T15:56:09Z

X-axis is number of observations
Y-axis is time in seconds

degree  interaction_only      n  nfeat order  time_0_20_2  time_current

1 2 True 1 10 C 0.000299 0.000100
25 2 True 10 10 C 0.000399 0.000100
49 2 True 100 10 C 0.000399 0.000100
73 2 True 1000 10 C 0.000898 0.000299
97 2 True 10000 10 C 0.007676 0.006383
degree interaction_only n nfeat order time_0_20_2 time_current
13 2 True 1 20 C 0.001596 0.000100
37 2 True 10 20 C 0.001396 0.000199
61 2 True 100 20 C 0.001496 0.000100
85 2 True 1000 20 C 0.004189 0.002495
109 2 True 10000 20 C 0.031014 0.031117

ogrisel · 2019-02-27T17:21:29Z

The benchmark script needs a couple of quick fixes:

benchmarks/bench_plot_polynomial_features.py:10:1: F401 'sys' imported but unused
import sys
^
benchmarks/bench_plot_polynomial_features.py:11:1: F401 'warnings' imported but unused
import warnings
^
benchmarks/bench_plot_polynomial_features.py:12:1: F401 'numbers' imported but unused
import numbers
^
benchmarks/bench_plot_polynomial_features.py:24:1: F401 'sklearn.utils.testing.ignore_warnings' imported but unused
from sklearn.utils.testing import ignore_warnings
^
benchmarks/bench_plot_polynomial_features.py:87:80: E501 line too long (94 > 79 characters)
                                                              degree, interaction_only, order)
                                                                               ^
benchmarks/bench_plot_polynomial_features.py:89:80: E501 line too long (87 > 79 characters)
                                                       degree, interaction_only, order)
                                                                               ^
benchmarks/bench_plot_polynomial_features.py:105:38: E261 at least two spaces before inline comment
                                break # stops if longer than a second

ogrisel · 2019-02-27T17:32:18Z

Can you please do a benchmark for a larger number of features? The new code is significantly more complex than the old one so I would like to see case where the new code is significantly faster than the old code on a case that lasts more than 10s.

ogrisel · 2019-02-27T17:38:56Z

Arguably it might still be interesting to be significantly faster on a small number of samples at prediction time to reduce the single sample prediction latency but I am not sure that a change from 0.3ms to 0.1ms is important enough to justify the increase in code complexity.

sdpython · 2019-02-28T00:51:27Z

Full results: https://github.com/sdpython/_benchmarks/tree/master/scikit-learn/results

You are probably right about the speed up. However, the new version is a better inspiration for anybody who needs to implement a runtime for fast predictions. That was also one of my objectives by proposing this change. I don't know if many people search for numerical recipes inside scikit-learn code. I do that sometimes.

ogrisel · 2019-02-28T07:45:04Z

Thank you for the additional benchmark results. So indeed for a large number of features the difference starts to be significant. Maybe this code a good candidate for cythonization but I am not saying we should wait for cythonization to merge this PR.

I would like to have other people opinions in the complexity vs performance tradeoff of this PR.

…into poly

sdpython · 2019-02-28T09:35:09Z

I just made a fix and remove one data copy (no copy of np.multiply result) and the improvment is significant even for 100.000 rows.

Basically, the implementation is almost twice faster than the current one for matrices with 100.000 rows.

jeremiedbb · 2019-02-28T10:22:06Z

sklearn/preprocessing/data.py

+                # the matrix first to multiply contiguous memory segments.
+                transpose = X.size <= 1e7 and self.order == 'C'
+                n = X.shape[1]
+


When the size is larger, is transposing slower or equivalent ?
And, when the matrix is small does the transposing bring a significant part of the speed up ?

My point is that the transpose switch makes the code twice as long, and I'm wondering if we could always do one or the other.

I remove the transpose and it seemed better. So i'll keep it. I added it because it was better before I removed one unnecessary copy this morning. transpose is inefficient when order=='F' so the only option to simplify is to remove it. Updating the PR.

GaelVaroquaux · 2019-02-28T12:23:22Z

We are trying to evaluate whether the speedups in this PR are important or not for application (because they are a constant offset, and not a proportional benefit. We came up with a potentially useful situation: in an online learning setting, where the polynomial features are applied on minibatches (typically of size 100), which are then fed into an SGDClassifier with "partial_fit". The question is then: what is the relative cost of on call to the partial_fit of SGDClassifier versus the cost of the transform of the polynomial features. Could you run a benchmark on such an example? Thanks!

sdpython · 2019-02-28T16:25:04Z

Hi,

I created a benchmark which tests the four following functions:

def fcts_model(X, y):
    
    model1 = SGDClassifier()
    model2 = make_pipeline(PolynomialFeatures(), SGDClassifier())
    model3 = make_pipeline(ExtendedFeatures(kind='poly'), SGDClassifier())
    model4 = make_pipeline(ExtendedFeatures(kind='poly-slow'), SGDClassifier())

    model1.fit(PolynomialFeatures().fit_transform(X), y)
    model2.fit(X, y)
    model3.fit(X, y)
    model4.fit(X, y)
    
    def partial_fit_model1(X, y, model=model1):
        return model.partial_fit(X, y)
    
    def partial_fit_model2(X, y, model=model2):
        X2 = model.steps[0][1].transform(X)
        return model.steps[1][1].partial_fit(X2, y)
    
    def partial_fit_model3(X, y, model=model3):
        X2 = model.steps[0][1].transform(X)
        return model.steps[1][1].partial_fit(X2, y)
    
    def partial_fit_model4(X, y, model=model4):
        X2 = model.steps[0][1].transform(X)
        return model.steps[1][1].partial_fit(X2, y)

    return partial_fit_model1, partial_fit_model2, partial_fit_model3, partial_fit_model4

It tests the combination of PolynomialFeatures + SGDClassifier. On the following graph: SGD = SGDClassifier only, SGD-SKL = SGD + POLY 0.20.2 , SGD-FAST is the new implementation, SGD-SLOW is 0.20.2 but implemented in the bench to make sure I'm not sure my branch to make the test. Is that what you expected?

ogrisel · 2019-02-28T21:49:25Z

It tests the combination of PolynomialFeatures + SGDClassifier. On the following graph: SGD = SGDClassifier only, SGD-SKL = SGD + POLY 0.20.2 , SGD-FAST is the new implementation, SGD-SLOW is 0.20.2 but implemented in the bench to make sure I'm not sure my branch to make the test.

This is really confusing. Can you please rephrase?

ogrisel

Some more comments:

sklearn/preprocessing/data.py

…into poly

sdpython · 2019-03-01T15:32:26Z

I added the second benchmark to the sources:

SGD=SGDClasifier only
SGD-SKL=PolynomialFeatures from scikit-learn (no matter what it is)
SGD-FAST=new implementation copy-pasted in the benchmark source file
SGD-SLOW=implementation of 0.20.2 copy-pasted in the benchmark source file

GaelVaroquaux · 2019-03-01T15:40:53Z

I hate to say, but I find the benchmark code very hard to read. It's very factored and generic. I would personally write much simpler code, even if it repeats a bit more.

sdpython · 2019-03-02T12:47:37Z

Both benchmarks I assume?

sdpython · 2019-03-03T15:49:51Z

I updated the second benchmark and the code. Below what I measure written in a different way.

def measure_sgd_only(X_train, y_train, Xs, ys):
    model = SGDClassifier()
    model.fit(PolynomialFeatures().fit_transform(X_train), y_train)
    Xs = [PolynomialFeatures().fit_transform(x) for x in Xs]
    st = time()
    for X, y in zip(Xs, ys):
        model.partial_fit(X, y)
    end = time()
    return end - st


def measure_sgd_poly_skl(X_train, y_train, Xs, ys):
    model = make_pipeline(PolynomialFeatures(), SGDClassifier())
    model.fit(X_train, y_train)
    st = time()
    for X, y in zip(Xs, ys):
        X2 = model.steps[0][1].transform(X)
        model.steps[1][1].partial_fit(X2, y)
    end = time()
    return end - st


def measure_sgd_poly_fast(X_train, y_train, Xs, ys):
    model = make_pipeline(CustomPolynomialFeatures(kind='poly-fast'),
                          SGDClassifier())
    model.fit(X_train, y_train)
    st = time()
    for X, y in zip(Xs, ys):
        X2 = model.steps[0][1].transform(X)
        model.steps[1][1].partial_fit(X2, y)
    end = time()
    return end - st


def measure_sgd_poly_slow(X_train, y_train, Xs, ys):
    model = make_pipeline(CustomPolynomialFeatures(kind='poly-slow'),
                          SGDClassifier())
    model.fit(X_train, y_train)
    st = time()
    for X, y in zip(Xs, ys):
        X2 = model.steps[0][1].transform(X)
        model.steps[1][1].partial_fit(X2, y)
    end = time()
    return end - st

sdpython · 2019-03-03T18:11:01Z

I was looking for a way to show how the improvment works. The following graph plots the ratio (new execution time / execution time of 0.20.2) accross all configurations. This shows how it moves against the number of features and observations.

ogrisel · 2019-04-19T10:13:05Z

In line of the benchmark results (and the relative simplicity of the code) I am +1 for merging this change but I don't like the complexity of the benchmark scripts either.

So I would suggest to just remove the benchmark script for this PR.

NicolasHug · 2019-04-24T17:10:37Z

The last benchmark #13290 (comment) seems to show that the fast version is only really much faster when the current one is already pretty fast. The improvement isn't as clear on the right side of the x axis.

@sdpython , do you have a specific use-case for this? I agree with the others that we need to make sure we are addressing an actual problem here, to justify the complexity.

…into poly

ogrisel · 2019-05-10T09:41:23Z

One use case is to decrease the latency of individual predictions in a production deployment (the red dots on the right hand side figure).

…into poly

jeremiedbb

I just made a few comment, mostly cosmetics.
Besides that, LGTM

jeremiedbb · 2019-05-17T12:53:41Z

sklearn/preprocessing/data.py

+                    else:
+                        new_index = []
+                        end = index[-1]
+                        for feature_idx in range(0, n_features):


range(n_features) is enough.

jeremiedbb · 2019-05-17T12:57:33Z

sklearn/preprocessing/data.py

+                        for feature_idx in range(0, n_features):
+                            a = index[feature_idx]
+                            new_index.append(current_col)
+                            start = a


it seems that you don't need a. Simply write start = index[feature_idx].

jeremiedbb · 2019-05-17T13:01:34Z

sklearn/preprocessing/data.py

+                            np.multiply(XP[:, start:end],
+                                        X[:, feature_idx:feature_idx + 1],
+                                        out=XP[:, current_col:next_col],
+                                        where=True, casting='no')


where=True is the default so is not necessary

jeremiedbb · 2019-05-17T13:38:34Z

sklearn/preprocessing/data.py

+                            next_col = current_col + end - start
+                            if next_col <= current_col:
+                                break
+                            np.multiply(XP[:, start:end],


I'm not a fan of breaking loops but ok.

jeremiedbb · 2019-05-17T13:54:26Z

sklearn/preprocessing/data.py

@@ -1,10 +1,12 @@
+# coding: utf-8


unrelated to this PR. Just looked at other files in sklearn and it's sometimes there and sometimes not. It seems there's no strong convention about that.

jeremiedbb · 2019-05-17T13:55:13Z

sklearn/preprocessing/data.py

 # Authors: Alexandre Gramfort <alexandre.gramfort@inria.fr>
 #          Mathieu Blondel <mathieu@mblondel.org>
 #          Olivier Grisel <olivier.grisel@ensta.org>
 #          Andreas Mueller <amueller@ais.uni-bonn.de>
 #          Eric Martin <eric@ericmart.in>
 #          Giorgio Patrini <giorgio.patrini@anu.edu.au>
 #          Eric Chang <ericchang2017@u.northwestern.edu>
+#          Xavier Dupré <xadupre@microsoft.com>


I think these lists are not maintained any more :)

sklearn/preprocessing/data.py

…into poly

sdpython · 2019-05-20T13:01:50Z

I modified the code based on the comment. The build is failing due to unrelated changes (as others PR).

…into poly

sdpython · 2019-06-07T16:32:08Z

Do I need to do new changes to the PR?

NicolasHug · 2019-06-07T16:35:02Z

sklearn/preprocessing/data.py

+
+                current_col = 1 if self.include_bias else 0
+                for d in range(0, self.degree):
+                    if d == 0:


put this before the loop?

…into poly

jnothman

I think this deserves some small readability improvements. Otherwise LGTM ;)

jnothman · 2019-06-18T03:00:16Z

sklearn/preprocessing/data.py

+                index.append(current_col)
+
+                # d >= 1
+                for d in range(1, self.degree):


Suggested change

for d in range(1, self.degree):

for _ in range(1, self.degree):

jnothman · 2019-06-18T03:00:57Z

sklearn/preprocessing/data.py

+                index = list(range(current_col,
+                                   current_col + n_features))
+                current_col += n_features
+                index.append(current_col)


I think somewhere you should describe what the index variable is for.

jnothman · 2019-06-18T03:08:02Z

sklearn/preprocessing/data.py

-                for i, comb in enumerate(combinations):
-                    XP[:, i] = X[:, comb].prod(1)
+
+                # What follows is a faster implementation of:


At the moment I don't find the algorithm easy to read. I think if you explicitly say "dynamic programming" it might be clearer that your algorithm is building subsequent columns from precomputed partial solutions.

jnothman · 2019-06-18T03:09:45Z

sklearn/preprocessing/data.py

+                    new_index = []
+                    end = index[-1]
+                    for feature_idx in range(n_features):
+                        start = index[feature_idx]


Add a comment like: # XP[start:end] are terms of degree d - 1 that exclude feature #feature_idx

…into poly

jnothman · 2019-06-27T11:32:25Z

sklearn/preprocessing/data.py

@@ -1548,6 +1548,15 @@ def transform(self, X):
                # What follows is a faster implementation of:
                # for i, comb in enumerate(combinations):
                #     XP[:, i] = X[:, comb].prod(1)
+                # This new implementation uses two optimisations.


I don't think "new" belongs in merged code :)

Suggested change

# This new implementation uses two optimisations.

# This implementation uses two optimisations.

rth · 2019-07-25T21:59:18Z

Looks like there is a +2..+3 for merging on this PR, and most of the comments were addressed (except for the last minor wording improvement)?

Is someone among the reviewers willing to merge it then?

jnothman · 2019-07-29T13:38:35Z

Thanks @sdpython!

sdpython added 5 commits February 26, 2019 18:02

Fixes #13173, implements faster polynomial features for dense matrices

670b0f1

fix a bug occuring when interation_only is True

ae847ce

fixes interactive_only

3d81e4a

pep8

e19ccb0

pep8

607dd9a

sdpython added 2 commits February 27, 2019 14:48

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

833f0d9

…into poly

Add script to benchmark implementations

ed8585d

sdpython added 2 commits February 27, 2019 22:14

pep8

12369a8

Update bench_plot_polynomial_features.py

91936cf

sdpython added 2 commits February 28, 2019 09:46

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

a36c581

…into poly

reduce the number of data copies, improves memory footprints

71ae492

jeremiedbb reviewed Feb 28, 2019

View reviewed changes

remove transpose

ee646a5

ogrisel reviewed Feb 28, 2019

View reviewed changes

sklearn/preprocessing/data.py Outdated Show resolved Hide resolved

sklearn/preprocessing/data.py Outdated Show resolved Hide resolved

sklearn/preprocessing/data.py Outdated Show resolved Hide resolved

sdpython added 2 commits March 1, 2019 15:16

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

1a970de

…into poly

Fix comment from PR + add second benchmark

d9a94d6

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

9e0b09c

…into poly

sdpython added 2 commits May 17, 2019 14:27

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

997fe74

…into poly

remove benchmarks

7805a85

jeremiedbb approved these changes May 17, 2019

View reviewed changes

NicolasHug reviewed May 17, 2019

View reviewed changes

sklearn/preprocessing/data.py Show resolved Hide resolved

sdpython added 2 commits May 20, 2019 14:28

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

9798710

…into poly

fix PR comments

91b3067

sdpython added 2 commits May 22, 2019 15:36

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

2c36900

…into poly

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

9fe1056

…into poly

NicolasHug reviewed Jun 7, 2019

View reviewed changes

sdpython added 3 commits June 11, 2019 10:26

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

e25f2db

…into poly

optimizes the code to remove one test in a loop

fa044e5

lint

60e861a

jnothman reviewed Jun 18, 2019

View reviewed changes

sdpython added 2 commits June 26, 2019 16:51

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

ec34b3a

…into poly

More comments.

c21e04e

jnothman approved these changes Jun 27, 2019

View reviewed changes

Update data.py

a7af1a1

jnothman merged commit ce869d8 into scikit-learn:master Jul 29, 2019

jnothman mentioned this pull request Oct 28, 2019

Missing from what's new v0.22 #15384

Closed

8 tasks

lorentzenchr added the module:preprocessing label Jun 9, 2021

	for d in range(1, self.degree):
	for _ in range(1, self.degree):

	# This new implementation uses two optimisations.
	# This implementation uses two optimisations.

Uh oh!

Fixes #13173, implements faster polynomial features for dense matrices #13290

Fixes #13173, implements faster polynomial features for dense matrices #13290

Uh oh!

Conversation

sdpython commented Feb 26, 2019

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

Uh oh!

GaelVaroquaux commented Feb 27, 2019

Uh oh!

sdpython commented Feb 27, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ogrisel commented Feb 27, 2019

Uh oh!

ogrisel commented Feb 27, 2019

Uh oh!

ogrisel commented Feb 27, 2019

Uh oh!

sdpython commented Feb 28, 2019

Uh oh!

ogrisel commented Feb 28, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sdpython commented Feb 28, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

GaelVaroquaux commented Feb 28, 2019 via email

Uh oh!

sdpython commented Feb 28, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ogrisel commented Feb 28, 2019

Uh oh!

ogrisel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sdpython commented Mar 1, 2019

Uh oh!

GaelVaroquaux commented Mar 1, 2019

Uh oh!

sdpython commented Mar 2, 2019

Uh oh!

sdpython commented Mar 3, 2019

Uh oh!

sdpython commented Mar 3, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ogrisel commented Apr 19, 2019

Uh oh!

NicolasHug commented Apr 24, 2019

Uh oh!

ogrisel commented May 10, 2019

Uh oh!

jeremiedbb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sdpython commented May 20, 2019

sdpython commented Feb 27, 2019 •

edited

Loading

ogrisel commented Feb 28, 2019 •

edited

Loading

sdpython commented Feb 28, 2019 •

edited

Loading

sdpython commented Feb 28, 2019 •

edited

Loading

sdpython commented Mar 3, 2019 •

edited

Loading