[MRG+1] FIX Pipelined fitting of Clustering algorithms, scoring of K-Means in pipelines #4064

Merged: 1 commit merged into scikit-learn:master from amueller:pipeline_scoring_y_none on Feb 6, 2015

Conversation

@amueller (Member) commented Jan 8, 2015

Fixes #4063.

Also fixes the embarrassing bug where clustering algorithms can't be used in pipelines.

@jnothman (Member) commented Jan 8, 2015

Okay, I guess that's one option; the alternative is to not pass y if it's None, as is done in GridSearchCV for fit.
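
(For context, the GridSearchCV behaviour mentioned here is roughly the pattern below; a simplified sketch with a made-up helper name, not GridSearchCV's actual code.)

def fit_maybe_supervised(estimator, X, y=None, **fit_params):
    # Only forward y when it is actually given; otherwise call fit(X) alone.
    if y is None:
        return estimator.fit(X, **fit_params)
    return estimator.fit(X, y, **fit_params)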

@amueller (Member, Author) commented Jan 8, 2015

I am a bit confused now. Pipeline always passes y=None to fit. Does that mean many clustering algorithms don't work with Pipeline?!
Currently everything but fit in some clustering algorithms and score in KMeans has an optional y=None.

@amueller (Member, Author) commented Jan 8, 2015

We should do it consistently, but I guess it is too late for that now?

@amueller (Member, Author) commented Jan 8, 2015

I actually prefer the GridSearchCV approach, but it looks like we went the other way until now.

@amueller (Member, Author) commented Jan 8, 2015

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.cluster import MeanShift

X = np.random.rand(100, 2)
p = Pipeline([('clf', MeanShift())])
p.fit(X)  # breaks: Pipeline passes y=None to fit, but MeanShift.fit did not accept a y argument

breaks and makes me sad.

@amueller (Member, Author) commented Jan 8, 2015

For the pipeline, it could be that previous steps in the pipeline use y so that it is not None, though. So you could do LDA, and then do clustering.
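
(To illustrate with a hedged sketch: in a pipeline like the one below, y is used by the supervised LDA step, and the final clustering step then receives a non-None y. LinearDiscriminantAnalysis is the current name for LDA; the data here is made up.)

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cluster import KMeans

X = np.random.rand(100, 5)
y = np.random.randint(0, 3, size=100)
# The supervised step consumes y; the clustering step still gets a y that is not None.
make_pipeline(LinearDiscriminantAnalysis(), KMeans(n_clusters=3)).fit(X, y)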

@jnothman (Member) commented Jan 8, 2015

Don't be sad. As Gaël likes to say: it's not version 1 yet, stuff's allowed to break and change :)

I can't say I have the answer here yet, but I think you're thinking about it the right way. Do we really require all estimators to take two positional args to fit, then?

@amueller (Member, Author) commented Jan 8, 2015

Yeah, I think this is the only way to make Pipeline work as expected. Testing whether y is None is not good, as my example shows. The other option would be to use introspection or try/except, but that is not the approach we have taken so far.
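
(The introspection option mentioned above would look roughly like this sketch; the helper name is made up, it assumes Python 3's inspect.signature, and it is not what this PR does.)

import inspect

def fit_step(estimator, X, y=None):
    # Hypothetical helper: forward y only if the estimator's fit signature accepts it.
    if 'y' in inspect.signature(estimator.fit).parameters:
        return estimator.fit(X, y)
    return estimator.fit(X)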

@amueller amueller changed the title [MRG] Pipeline scoring y none [MRG] FIX Pipelined fitting of Clustering algorithms, scoring of K-Means in pipelines Jan 9, 2015
@amueller amueller force-pushed the pipeline_scoring_y_none branch 2 times, most recently from 9f2ec74 to f89c72f on January 15, 2015 16:02
@amueller amueller added the Bug label Jan 16, 2015
@amueller amueller added this to the 0.16 milestone Jan 16, 2015
@amueller amueller force-pushed the pipeline_scoring_y_none branch from f89c72f to d55d272 on January 20, 2015 20:43
@amueller (Member, Author) commented:

Reviews on this one would be nice, as it makes testing somewhat simpler.

y = (X[:, 0] * 4).astype(np.int)
y = multioutput_estimator_convert_y_2d(name, y)
with warnings.catch_warnings(record=True):
    estimator = Estimator()
Review comment (Member):

You need to add a warnings.simplefilter('always') here so that the registry doesn't catch some warnings and create heisenbugs... Better, decorate the whole test with @ignore_warnings
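
(Concretely, the suggested change would look something like this, with Estimator standing for the class under test as in the snippet above.)

import warnings

with warnings.catch_warnings(record=True):
    # Without 'always', warnings already recorded in the registry may be suppressed.
    warnings.simplefilter('always')
    estimator = Estimator()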

Review comment (Member, Author):

That is not particular to this test, though, right? And I'm not sure the whole test should be decorated. These are more for the deprecation warnings.

Review comment (Member):

I think it should be reasonable to ignore all sklearn warnings, at least in common tests. Perhaps we should change ignore_warnings to only apply to sklearn modules.

Review comment (Member, Author):

Interestingly enough, if I put @ignore_warnings on the test (not the check function), no test is run :-/

Review comment (Member, Author):

I thought about adding a test for partial_fit, but I'm not sure about it. Pipelines would need to look quite a bit different to support partial_fit, and there is currently no pressing reason to enforce taking a y in that API...

@amueller amueller force-pushed the pipeline_scoring_y_none branch 3 times, most recently from 27ddff9 to f02335a on January 22, 2015 21:17
@amueller (Member, Author) commented:

This one will simplify future testing such as #4162 and #4136 by the way....

@jnothman (Member) commented:

Firstly, doc/developers/index.rst needs fixing. Secondly, I think this change should encompass partial_fit: RBM, LSHForest, and IPCA are current offenders.

@amueller (Member, Author) commented:

What is the benefit in partial_fit apart from making the tests easier to write?

@jnothman (Member) commented:

> What is the benefit in partial_fit apart from making the tests easier to write?

Consistency with fit, for one; the polymorphic design principle that Gaël expressed elsewhere; the ability to remove an if-else clause in sklearn.learning_curve; and anticipation of meta-estimators employing partial_fit in the future or in external projects where the developer may otherwise forget the unsupervised case...

@amueller (Member, Author) commented:

Fair enough.
Where would you put this in the developers doc? The current y argument of Transformers is not documented anywhere as far as I can see :-/

@amueller (Member, Author) commented:

Actually, there was a comment about fit in there, something that was not tested or enforced. I added comments about transform and score, but I'm not sure that is the best place.

@jnothman (Member) commented:

Oh, I didn't look at it closely enough. It actually gives an invocation spec, not a def signature. But maybe it should: http://scikit-learn.org/dev/developers/index.html#different-objects

@jnothman (Member) commented:

Obscurely, it shows 1-ary score as an example.

@@ -716,8 +716,11 @@ is not met, an exception of type ``ValueError`` should be raised.
 ``y`` might be ignored in the case of unsupervised learning. However, to
 make it possible to use the estimator as part of a pipeline that can
 mix both supervised and unsupervised transformers, even unsupervised
-estimators are kindly asked to accept a ``y=None`` keyword argument in
+estimators need to accept a ``y=None`` keyword argument in
Review comment (Member):

Indeed, this doc was already here. I guess I only glanced at what I thought was an issue, but it wasn't really one.
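
(A minimal sketch of the signature the quoted guideline asks for; the class here is purely illustrative.)

from sklearn.base import BaseEstimator, ClusterMixin

class IllustrativeClusterer(BaseEstimator, ClusterMixin):
    def fit(self, X, y=None):
        # y is accepted for pipeline compatibility but ignored.
        self.labels_ = [0] * len(X)
        return self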

@jnothman (Member) commented:

Great. This LGTM apart from your test failure, which I take it will be fixed before the next reviewer gives a +1 ;)

@jnothman jnothman changed the title [MRG] FIX Pipelined fitting of Clustering algorithms, scoring of K-Means in pipelines [MRG+1] FIX Pipelined fitting of Clustering algorithms, scoring of K-Means in pipelines Jan 29, 2015
@amueller (Member, Author) commented:

Thanks for the review. Damn, I thought I had just fixed Travis. Will look at it again.

@amueller amueller force-pushed the pipeline_scoring_y_none branch from e37f56a to fa184fe on January 29, 2015 03:43
@amueller (Member, Author) commented:

Fixed.

@ghost ghost mentioned this pull request Jan 30, 2015
@amueller (Member, Author) commented Feb 3, 2015

Any other reviews? @ogrisel maybe?

@ogrisel (Member) commented Feb 5, 2015

As the motivation of this change was to make it possible to use clustering models in pipelines, I would like to integrate a non-regression "integration" test in the spirit of #4064 (comment) into test_common. Note: clf is a bad name for such a step; model or estimator would be a better name. Or even, just use make_pipeline for that test.
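
(For example, the earlier MeanShift reproduction could be written with make_pipeline, avoiding the step name entirely.)

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.cluster import MeanShift

X = np.random.rand(100, 2)
make_pipeline(MeanShift()).fit(X)  # works once clustering estimators accept y=None in fit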

@ogrisel (Member) commented Feb 5, 2015

Other than that +1 on my side.

@amueller (Member, Author) commented Feb 5, 2015

Hum, I totally wanted to do that integration test with pipelines; it seems I forgot about it. Not sure what should be tested, probably "everything". This PR has tests for the "API definition" of Pipeline, not the integration.
I'll try that tomorrow.

@ogrisel (Member) commented Feb 6, 2015

I was thinking of a bunch of sanity checks such as:

for est in all_classifiers():
    p = make_pipeline(est()).fit(X, y)
    e = est().fit(X, y)
    assert_equal(e.score(X, y), p.score(X, y))
    assert_array_equal(e.predict(X), p.predict(X))

and similarly for clustering and regressors, and maybe others, as long as they implement fit and score.
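
(A rough, runnable version of that idea might look like the sketch below; it assumes all_estimators with a type_filter argument, here imported from sklearn.utils.testing, and it is not the exact test added in this PR.)

import numpy as np
from numpy.testing import assert_array_equal
from sklearn.pipeline import make_pipeline
from sklearn.utils.testing import all_estimators

rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y = (X[:, 0] > 0.5).astype(int)

for name, Est in all_estimators(type_filter='classifier'):
    try:
        est, pipe = Est(), make_pipeline(Est())
    except TypeError:
        continue  # skip estimators that require constructor arguments
    est.fit(X, y)
    pipe.fit(X, y)
    # Stochastic estimators would also need a fixed random_state for these checks to hold.
    if hasattr(est, 'score'):
        assert est.score(X, y) == pipe.score(X, y)
    assert_array_equal(est.predict(X), pipe.predict(X))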

@amueller (Member, Author) commented Feb 6, 2015

I thought about which functions we want to test. I guess this issue only exists for fit and score, so your tests should be fine (plus an if hasattr(est, "score") check). I thought about whether we want to test decision_function, predict_proba and predict, too.

@amueller (Member, Author) commented Feb 6, 2015

At least transform should also be checked... not enough coffee apparently; transform never accepts y.

@amueller amueller force-pushed the pipeline_scoring_y_none branch from cfee0a1 to a105327 on February 6, 2015 12:17
@coveralls commented:

Coverage increased (+0.01%) to 95.01% when pulling a105327 on amueller:pipeline_scoring_y_none into a168a6c on scikit-learn:master.

@amueller (Member, Author) commented Feb 6, 2015

Added the integration test and rebased. And now I'm not convinced this is the right fix any more :-/ Maybe not passing y if it is None would be the better fix? And removing the y=None everywhere? Edit: this is the right fix, just not enough coffee.

I am a bit confused.
I just did
`git grep "transform(self, X, y=None)" | grep -v "fit_transform"`
which yields

sklearn/cluster/birch.py:    def transform(self, X, y=None):
sklearn/cluster/k_means_.py:    def transform(self, X, y=None):
sklearn/decomposition/base.py:    def transform(self, X, y=None):
sklearn/decomposition/base.py:    def inverse_transform(self, X, y=None):
sklearn/decomposition/dict_learning.py:    def transform(self, X, y=None):
sklearn/decomposition/pca.py:    def transform(self, X, y=None):
sklearn/decomposition/pca.py:    def inverse_transform(self, X, y=None):
sklearn/feature_extraction/dict_vectorizer.py:    def transform(self, X, y=None):
sklearn/feature_extraction/text.py:    def transform(self, X, y=None):
sklearn/kernel_approximation.py:    def transform(self, X, y=None):
sklearn/kernel_approximation.py:    def transform(self, X, y=None):
sklearn/kernel_approximation.py:    def transform(self, X, y=None):
sklearn/neighbors/approximate.py:    def transform(self, X, y=None):
sklearn/preprocessing/data.py:    def transform(self, X, y=None):
sklearn/random_projection.py:    def transform(self, X, y=None):
sklearn/tests/test_pipeline.py:    def transform(self, X, y=None):

That y is never used anywhere and is not required by any API, is it?

@amueller amueller force-pushed the pipeline_scoring_y_none branch from a105327 to cd63acc on February 6, 2015 12:26
@amueller (Member, Author) commented Feb 6, 2015

Anyhow, this PR is good to go, I think. I removed the requirement on transform (which is not a requirement at all) from the developer docs.

@coveralls commented:

Coverage increased (+0.01%) to 95.01% when pulling cd63acc on amueller:pipeline_scoring_y_none into a168a6c on scikit-learn:master.

@ogrisel (Member) commented Feb 6, 2015

Great, merging this PR. Thanks again @amueller !

ogrisel added a commit that referenced this pull request Feb 6, 2015
[MRG+1] FIX Pipelined fitting of Clustering algorithms, scoring of K-Means in pipelines
@ogrisel ogrisel merged commit 620301b into scikit-learn:master Feb 6, 2015
@amueller (Member, Author) commented Feb 7, 2015

Thanks for the reviews @ogrisel and @jnothman :)

@jnothman (Member) commented Feb 7, 2015

> Edit: this is the right fix, just not enough coffee.

:)

Rest assured this fix is in accordance with what's been in the docs for some time...

@amueller amueller deleted the pipeline_scoring_y_none branch May 19, 2017 20:46

Successfully merging this pull request may close these issues.

Pipeline score broken for unsupervised algorithms