
[MRG] run check_estimator on meta-estimators #9741


Closed
amueller wants to merge 19 commits

Conversation

amueller (Member)

Fixes #9443.

After #9716 I think it's time to add these checks.

@amueller (Member Author)

Some rebase mess :-/ I'll fix the history once #9716 is merged.

@amueller (Member Author)

I guess we need a tag to skip the "modify init parameter" check as long as Pipeline and FeatureUnion violate the API?
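
A rough sketch of what such a skip tag could look like; the skip_checks key and the subclass are hypothetical (not an existing scikit-learn tag), and the check name is only illustrative:

from sklearn.pipeline import Pipeline

class TaggedPipeline(Pipeline):
    def _more_tags(self):
        # Hypothetical tag asking the common tests to skip the
        # "init parameters modified in fit" check, since Pipeline
        # currently fits its steps in place instead of cloning them.
        return {"skip_checks": ["check_dont_overwrite_parameters"]}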

remove outdated comment

fix also for FeatureUnion

[MRG+2] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) (scikit-learn#8742)

[MRG+1] Remove hard dependency on nose (scikit-learn#9670)

MAINT Stop vendoring sphinx-gallery (scikit-learn#9403)

CI upgrade travis to run on new numpy release (scikit-learn#9096)

CI Make it possible to run doctests in .rst files with pytest (scikit-learn#9697)

* doc/datasets/conftest.py to implement the equivalent of nose fixtures
* add conftest.py in root folder to ensure that sklearn local folder
  is used rather than the package in site-packages
* test doc with pytest in Travis
* move custom_data_home definition from nose fixture to .rst file

[MRG+1] avoid integer overflow by using floats for matthews_corrcoef (scikit-learn#9693)

* Fix bug#9622: avoid integer overflow by using floats for matthews_corrcoef

* matthews_corrcoef: cosmetic change requested by jnothman

* Add test_matthews_corrcoef_overflow for Bug#9622

* test_matthews_corrcoef_overflow: clean-up and make deterministic

* matthews_corrcoef: pass dtype=np.float64 to sum & trace instead of using astype

* test_matthews_corrcoef_overflow: add simple deterministic tests

TST Platform independent hash collision tests in FeatureHasher (scikit-learn#9710)

TST More informative error message in test_preserve_trustworthiness_approximately (scikit-learn#9738)

add some rudimentary tests for meta-estimators

fix extra whitespace in error message

add missing if_delegate_has_method in pipeline

don't test tuple pipeline for now

only copy list if not list already? doesn't seem to help?
@amueller amueller force-pushed the meta_check_estimator branch from 5cb5323 to 307c360 on September 19, 2017 18:13
@amueller (Member Author)

@jnothman do you have an opinion on what to do about the "modifying init param in fit" failure for pipelines? Should we create a work-around or start the worst deprecation cycle yet and introduce steps_?

@jnothman (Member)

jnothman commented Sep 19, 2017 via email

@amueller (Member Author)

Ok, so skip that test? I'll go back to finishing up the tags then ;)

@rth (Member)

rth commented Jul 31, 2019

do you have an opinion on what to do about the "modifying init param in fit" failure for pipelines?

It depends whether you see that as a bug or a feature. If one passes an estimator to a pipeline, I would naively expect that estimator to be fitted when the pipeline is fit, as happens now, rather than having some separate cloned instance created inside the pipeline. At least that's what happens in DL libraries, I think, and it makes more sense to me, aside from the fact that it doesn't respect the scikit-learn API contract. Was that contract designed with pipelines in view, though?

@amueller (Member Author)

I think there's a separate issue to track that, #8157. Maybe it makes sense to figure out a short-term solution here for the tests?

@amueller (Member Author)

amueller commented Sep 9, 2019

Ok, so I have no idea how to make make_pipeline(StandardScaler()) pass the tests...
We would need to define the NaN handling of pipelines, which our current tags can't express. Do we want to add a new tag like "ensure_no_missing_output" or something for imputers?
Basically, a pipeline can handle NaN if either all steps can handle NaN, or there is an imputer and all steps before it can handle NaN.
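
A minimal sketch of that rule, assuming each step exposes an allow_nan entry via the (private) _get_tags() and using a hypothetical is_imputer() helper to recognize imputers:

def pipeline_allows_nan(steps):
    # A pipeline can handle NaN if every step handles NaN, or if there is
    # an imputer and every step *before* it handles NaN (the imputer is
    # assumed to remove the missing values for the later steps).
    for name, est in steps:
        if est is None or est == "passthrough":
            continue
        if is_imputer(est):  # hypothetical helper, e.g. an isinstance check
            return True
        if not est._get_tags().get("allow_nan", False):
            return False
    return True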

@amueller amueller changed the title from "[WIP] run check_estimator on meta-estimators" to "[MRG] run check_estimator on meta-estimators" on Sep 9, 2019
@amueller (Member Author)

amueller commented Sep 9, 2019

I'm hacking around the pipeline cloning issue here.
I think we should merge this to get something in. This would make #14241 fail, I think, which is already a win.

The missing value tags are a bit of an issue and generally defining tags for pipelines will be tricky, but this is a start.

I didn't add ColumnTransformer yet, because there are too many issues.
There are also issues with None and passthrough steps.

@amueller (Member Author)

amueller commented Sep 9, 2019

ping @NicolasHug

# grid-search
GridSearchCV(LogisticRegression(), {'C': [0.1, 1]}, cv=2),
# will fail tragically
# make_pipeline(StandardScaler(), None)
Review comment (Member):

Is this failing because None or 'passthrough' does not support _get_tags?

Reply (Member Author):

there were multiple issues
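
For context, a minimal sketch of how such pre-configured meta-estimator instances would be fed to the common checks, assuming check_estimator accepts instances as this PR aims for:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.estimator_checks import check_estimator

meta_estimators = [
    GridSearchCV(LogisticRegression(), {'C': [0.1, 1]}, cv=2),
    make_pipeline(StandardScaler(), LogisticRegression()),
]
for est in meta_estimators:
    # Runs the full common-check suite against each instance.
    check_estimator(est)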

@amueller (Member Author)

amueller commented Sep 9, 2019

The validation and duck typing in pipeline is a mess.
Also, it's not clear to me whether having the last n steps be passthrough will work. (edit: it works as expected, I think)

But I think adding at least some tests is good...

@thomasjpfan (Member)

thomasjpfan commented Sep 9, 2019

The validation and duck typing in pipeline is a mess.

Agreed. if_delegate_has_method does not work properly when _final_estimator is passthrough or None. I think fixing this should be in another PR.

Base automatically changed from master to main January 22, 2021 10:49
@adrinjalali (Member)

I was just reminded that Pipeline doesn't pass check_is_fitted, which is what we need in fairlearn/fairlearn#665. I'd be very happy to have this in soon :)

@glemaitre (Member)

I would expect the following to work as intended:

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

The unfitted pipeline will error.

from sklearn.utils.validation import check_is_fitted

check_is_fitted(pipe, "n_features_in_")
NotFittedError: This Pipeline instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

And the following will not error.

pipe.fit(X, y)
check_is_fitted(pipe, "n_features_in_")

@adrinjalali (Member)

Our new check_is_fitted convention is not to call it with a second parameter though, and since n_features_in_ is a property, it's not listed among the attributes that check_is_fitted inspects. Also, as a third-party estimator, I'd like to just check whether another estimator is fitted, not myself, which means I don't know which attribute I should check. Therefore check_is_fitted(pipeline, "n_features_in_") is not a solution. I'm in the position of a meta-estimator that wants to check whether the given estimator is fitted or not (think warm_start).
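
For illustration, the only generic probe a meta-estimator has in that position is something like the following sketch (the _is_fitted helper name is illustrative; check_is_fitted and NotFittedError are the public scikit-learn utilities):

from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

def _is_fitted(estimator):
    # A meta-estimator does not know which fitted attribute the wrapped
    # estimator sets, so it can only call the attribute-less form and
    # catch the error. This is exactly what currently breaks for
    # Pipeline, which sets no fitted attribute of its own.
    try:
        check_is_fitted(estimator)
    except NotFittedError:
        return False
    return True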

@glemaitre (Member)

I am confused regarding this topic. It comes up from time to time and I don't recall what we are supposed to do (probably we don't know :)).

From what I recall, we basically don't know what to validate in a Pipeline (e.g. one estimator or all estimators, etc.). Passing a callable would delegate this issue to the user but, if I understand correctly, you would like to not make any check at all.

I was under the impression that n_features_in_ being a property indeed allows for such a "callable" check, so it would almost be OK to use it. Except if you get a Vectorizer that does not implement n_features_in_, and then you are back to square one.

@adrinjalali do you recall the latest news regarding this topic?

@adrinjalali (Member)

To me, a Pipeline should pass check_is_fitted if the user has called fit on it, independently of whether all the steps are already fitted or not. If we want to save CPU cycles and pass pre-fitted estimators to a pipeline, we should still call fit on it, but with warm_start and without calling every estimator's fit.

I'd just set an is_fitted_ attribute in Pipeline's fit and let it pass check_is_fitted like every other estimator.
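
A minimal sketch of that idea, using a hypothetical subclass for illustration rather than the actual Pipeline code:

from sklearn.pipeline import Pipeline

class FittedFlagPipeline(Pipeline):
    def fit(self, X, y=None, **fit_params):
        super().fit(X, y, **fit_params)
        # Set a trailing-underscore attribute so that the generic
        # check_is_fitted (which looks for fitted attributes) passes
        # once fit has been called on the pipeline itself.
        self.is_fitted_ = True
        return self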

@glemaitre (Member)

I'd just set an is_fitted_ attribute in Pipeline's fit and let it pass check_is_fitted like every other estimator.

Maybe is_maybe_fitted_if_you_did_it_right_ is better :). Joking aside, I think this is equivalent to the remark in the PDP code: we would expect to see a step_ attribute that would be the result of calling fit.

@adrinjalali (Member)

So should I just submit a PR setting some random attribute in pipeline for check_is_fitted to work?

@glemaitre (Member)

So should I just submit a PR setting some random attribute in pipeline for check_is_fitted to work?

I would be happy to review. It would be nice to have a review from @jnothman, who always thinks about side effects that I am not aware of.

@thomasjpfan (Member)

So should I just submit a PR setting some random attribute in pipeline for check_is_fitted to work?

Let's go for it. I run into the pipeline + check_is_fitted issue quite frequently.

check_is_fitted is really implicit. If we want the estimator to decide, I can see a protocol: __is_fitted__, which returns True if the estimator is fitted. check_is_fitted would call __is_fitted__ if it exists.
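
A sketch of that protocol as described, with hypothetical names (this is not the implementation that was eventually merged):

from sklearn.exceptions import NotFittedError

def check_is_fitted_sketch(estimator):
    # If the estimator opts into the protocol, let it decide for itself.
    if hasattr(estimator, "__is_fitted__"):
        if not estimator.__is_fitted__():
            raise NotFittedError(
                f"This {type(estimator).__name__} instance is not fitted yet.")
        return
    # Otherwise fall back to the usual heuristic: any attribute ending in
    # an underscore that was set during fit counts as evidence of fitting.
    fitted_attrs = [k for k in vars(estimator)
                    if k.endswith("_") and not k.startswith("__")]
    if not fitted_attrs:
        raise NotFittedError(
            f"This {type(estimator).__name__} instance is not fitted yet.")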

@ogrisel (Member)

ogrisel commented Aug 20, 2021

Closing in favor of #20657, which implements a general solution. If there are remaining sub-cases to be tackled, they can be addressed in dedicated PRs. #20657 adds a new common test, so most cases should already be covered.

@ogrisel ogrisel closed this Aug 20, 2021
@adrinjalali (Member)

The other PR only checks for check_is_fitted though. This PR adds a bit more than that, doesn't it @ogrisel?

@ogrisel (Member)

ogrisel commented Aug 20, 2021

Indeed, sorry. It needs an update though. Let's reopen so we don't forget.

@ogrisel ogrisel reopened this Aug 20, 2021
@amueller amueller closed this Jul 17, 2022
@amueller amueller reopened this Jul 17, 2022
@glemaitre glemaitre removed their request for review December 1, 2022 14:51
@thomasjpfan (Member)

@adrinjalali We already run the common tests on many meta-estimators:

@parametrize_with_checks(list(chain(_tested_estimators(), _generate_pipeline())))

@parametrize_with_checks(list(_generate_search_cv_instances()))

I think this PR can be closed.
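
For reference, a minimal sketch of how such instances are fed to the common tests with the public parametrize_with_checks decorator (the instance list here is only illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.estimator_checks import parametrize_with_checks

@parametrize_with_checks([
    make_pipeline(StandardScaler(), LogisticRegression()),
    GridSearchCV(LogisticRegression(), {"C": [0.1, 1]}, cv=2),
])
def test_meta_estimators(estimator, check):
    # Each (estimator, check) pair becomes its own pytest test case.
    check(estimator)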

@adrinjalali adrinjalali closed this Mar 7, 2024

Successfully merging this pull request may close these issues.

Add instance-level calls to estimator_checks for meta-estimators