GridSearchCV with Pipeline without Predictor #14693

Closed
agamemnonc opened this issue Aug 20, 2019 · 5 comments
@agamemnonc
Contributor

agamemnonc commented Aug 20, 2019

Description

The Pipeline documentation states that:

All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.)

However, using GridSearchCV with a Pipeline that contains only transformers fails. This is a hypothetical scenario in which you want to select hyper-parameters for a transformer alone; it can be useful when the transformer in question is a post-processing step that takes hyper-parameters.

Steps/Code to Reproduce

For instance, the following (dumb example) will raise an error:

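import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


# Toy data: Y is Gaussian noise clipped to [0, 1]; X is Y scaled by 3.
Y = np.random.randn(1000, 2)
Y = np.clip(Y, 0., 1.)
X = 3. * Y

# The pipeline contains a single transformer and no predictor.
pipe = Pipeline(steps=[
    ('sc', MinMaxScaler())
])

param_grid = {'sc__feature_range': [(0., 1.), (0., 5.), (3., 10.)]}

gs = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring='r2',
    cv=3
)
gs.fit(X, Y)
print(gs.best_params_)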

Output:

AttributeError: 'MinMaxScaler' object has no attribute 'predict'

However, if you include a dumb estimator that simply passes its input to the output at the end of the pipeline, it will run with no issues:

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


class PassThroughEstimator(BaseEstimator):
    """Final step that returns its input unchanged as the prediction."""

    def fit(self, X, y=None):
        return self

    def predict(self, X, y=None):
        return X


# Toy data: Y is Gaussian noise clipped to [0, 1]; X is Y scaled by 3.
Y = np.random.randn(1000, 2)
Y = np.clip(Y, 0., 1.)
X = 3. * Y

pipe = Pipeline(steps=[
    ('sc', MinMaxScaler()),
    ('pse', PassThroughEstimator())
])

param_grid = {'sc__feature_range': [(0., 1.), (0., 5.), (3., 10.)]}

gs = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring='r2',
    cv=3
)
gs.fit(X, Y)
print(gs.best_params_)

Output:

{'sc__feature_range': (0.0, 1.0)}

I realise that this is not a common case (needing to only fit a transformer with hyper-parameters), but is this intended behaviour?

On a related note, is it possible to include a post-processing transformer (e.g. smoothing for time-series regression, or even a scaler in case the target has been pre-processed) at the end of a Pipeline and still be able to use GridSearchCV? According to the documentation it shouldn't be, since in that case not all steps preceding the last one are transformers. See also #4143.

Versions

System:
python: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\nak142\Miniconda3\envs\myo\python.exe
machine: Windows-10-10.0.18362-SP0

BLAS:
macros:
lib_dirs:
cblas_libs: cblas

Python deps:
pip: 19.1.1
setuptools: 41.0.1
sklearn: 0.21.2
numpy: 1.16.4
scipy: 1.2.1
Cython: 0.29.12
pandas: 0.24.2

@amueller
Member

amueller commented Aug 20, 2019

The issue is not the transformer, but using r2 scoring, which requires calling predict. If you leave out the scoring and the last step implements a score method, it will work.
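
As a rough sketch of the "leave out the scoring" route: MinMaxScaler has no score method, so this example swaps in PCA, a transformer that does implement score (the average log-likelihood of the samples). The toy data and the parameter grid are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data (an assumption): 5 observed features driven by 2 latent factors.
rng = np.random.RandomState(0)
latent = rng.randn(300, 2)
X = latent @ rng.randn(2, 5) + 0.1 * rng.randn(300, 5)

# The only step is a transformer, but PCA implements score(X),
# so GridSearchCV can fall back to it when scoring is left unset.
pipe = Pipeline(steps=[("pca", PCA())])

gs = GridSearchCV(
    estimator=pipe,
    param_grid={"pca__n_components": [1, 2, 3]},
    cv=3,
)
gs.fit(X)  # no target is passed; PCA.score ignores y
print(gs.best_params_)
```

With no scoring argument, the search uses the pipeline's own score method, which delegates to the last step.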

It seems strange to have the output of a transformer be compared to the target with r2 so I'm not sure we want to support that.
So one way is to have r2 treat the output of the transformer as the prediction, as you did; the other is to implement a scorer that uses the output of transform for scoring, something like:

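from sklearn.metrics import r2_score

def trans_r2(est, X, y):
    # Compare the transformer's output, not a prediction, with the target.
    return r2_score(y, est.transform(X))

GridSearchCV(..., scoring=trans_r2)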

ps: posting the full traceback would have revealed that ;)
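
For reference, the scorer idea can be combined with the transformer-only pipeline from the report into a self-contained sketch. The fixed random seed is an assumption added for repeatability; the data, pipeline and parameter grid follow the original example.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


def trans_r2(est, X, y):
    # Score the output of transform() against the target, bypassing predict().
    return r2_score(y, est.transform(X))


# Toy data mirroring the report (the seed is an assumption).
rng = np.random.RandomState(0)
Y = np.clip(rng.randn(1000, 2), 0.0, 1.0)
X = 3.0 * Y

# Transformer-only pipeline: no final predictor is needed with this scorer.
pipe = Pipeline(steps=[("sc", MinMaxScaler())])
param_grid = {"sc__feature_range": [(0.0, 1.0), (0.0, 5.0), (3.0, 10.0)]}

gs = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=trans_r2, cv=3)
gs.fit(X, Y)
print(gs.best_params_)
```

On this toy data the search should pick the (0.0, 1.0) range, the same result the pass-through workaround reported, because scaling X back to [0, 1] reproduces Y.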

@amueller changed the title from "GridSearchCV with Pipeline without Estimators" to "GridSearchCV with Pipeline without Predictor" on Aug 20, 2019
@amueller
Member

(renamed issue as "estimator" includes transformers)

@agamemnonc
Contributor Author

Oh, I see. Thanks @amueller.

"It seems strange to have the output of a transformer be compared to the target with r2 so I'm not sure we want to support that."

Well, I guess this would make sense if the transformer is a prediction post-processing step (e.g. smoothing for time-series regression as per my example above). But I understand this cannot be supported (or there are no plans of doing so) given that you can't have a predictor step, unless it is the last one, right?

@jnothman
Member

jnothman commented Aug 20, 2019 via email

@amueller
Member

Target post-processing is a bit tricky, unfortunately. I think you can use TransformedTargetRegressor with an identity transform and the post-processing as the inverse_transform.
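
A minimal sketch of that suggestion, under stated assumptions: the helper functions identity and clip_unit, the LinearRegression regressor and the toy data are all hypothetical choices for illustration. The post-processing step is passed as inverse_func, and check_inverse is disabled because the post-processing is not a true inverse of the identity.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV


def identity(y):
    # Forward transform: leave the training target untouched.
    return y


def clip_unit(y):
    # Hypothetical post-processing step: clip predictions to [0, 1].
    return np.clip(y, 0.0, 1.0)


# Toy data (an assumption): a target bounded in [0, 1] that a linear model overshoots.
rng = np.random.RandomState(0)
X = rng.randn(500, 2)
y = np.clip(X @ np.array([0.5, -0.3]) + 0.5, 0.0, 1.0)

ttr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=identity,
    inverse_func=clip_unit,
    check_inverse=False,  # clip_unit is not a true inverse of identity
)

# inverse_func is an ordinary constructor parameter, so it can be searched over.
gs = GridSearchCV(
    estimator=ttr,
    param_grid={"inverse_func": [identity, clip_unit]},
    scoring="r2",
    cv=3,
)
gs.fit(X, y)
print(gs.best_params_["inverse_func"].__name__)
```

Because predict applies inverse_func to the regressor's output, standard scorers such as r2 see the post-processed predictions, and hyper-parameters of the post-processing can be tuned like any others.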
