
[MRG] run check_estimator on meta-estimators #9741


Closed
amueller wants to merge 19 commits

Conversation

amueller (Member)

Fixes #9443.

After #9716 I think it's time to add these checks.

@amueller (Member Author)

Some rebase mess :-/ I'll fix the history once #9716 is merged.

@amueller (Member Author)

I guess we need a tag to skip the "modify init parameter" check as long as Pipeline and FeatureUnion violate the API?
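
A rough sketch of what such a skip tag could look like; the skip_checks key and the subclass are hypothetical (not an existing scikit-learn tag), and the check name is only illustrative:

from sklearn.pipeline import Pipeline

class TaggedPipeline(Pipeline):
    def _more_tags(self):
        # Hypothetical tag asking the common tests to skip the
        # "init parameters modified in fit" check, since Pipeline
        # currently fits its steps in place instead of cloning them.
        return {"skip_checks": ["check_dont_overwrite_parameters"]}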

remove outdated comment

fix also for FeatureUnion

[MRG+2] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) (scikit-learn#8742)

[MRG+1] Remove hard dependency on nose (scikit-learn#9670)

MAINT Stop vendoring sphinx-gallery (scikit-learn#9403)

CI upgrade travis to run on new numpy release (scikit-learn#9096)

CI Make it possible to run doctests in .rst files with pytest (scikit-learn#9697)

* doc/datasets/conftest.py to implement the equivalent of nose fixtures
* add conftest.py in root folder to ensure that sklearn local folder
  is used rather than the package in site-packages
* test doc with pytest in Travis
* move custom_data_home definition from nose fixture to .rst file

[MRG+1] avoid integer overflow by using floats for matthews_corrcoef (scikit-learn#9693)

* Fix bug#9622: avoid integer overflow by using floats for matthews_corrcoef

* matthews_corrcoef: cosmetic change requested by jnothman

* Add test_matthews_corrcoef_overflow for Bug#9622

* test_matthews_corrcoef_overflow: clean-up and make deterministic

* matthews_corrcoef: pass dtype=np.float64 to sum & trace instead of using astype

* test_matthews_corrcoef_overflow: add simple deterministic tests

TST Platform independent hash collision tests in FeatureHasher (scikit-learn#9710)

TST More informative error message in test_preserve_trustworthiness_approximately (scikit-learn#9738)

add some rudimentary tests for meta-estimators

fix extra whitespace in error message

add missing if_delegate_has_method in pipeline

don't test tuple pipeline for now

only copy list if not list already? doesn't seem to help?
@amueller amueller force-pushed the meta_check_estimator branch from 5cb5323 to 307c360 on September 19, 2017 18:13
@amueller (Member Author)

@jnothman do you have an opinion on what to do about the "modifying init param in fit" failure for pipelines? Should we create a work-around or start the worst deprecation cycle yet and introduce steps_?

@jnothman (Member)

jnothman commented Sep 19, 2017 via email

@amueller (Member Author)

Ok, so skip that test? I'll go back to finishing up the tags then ;)

@rth (Member)

rth commented Jul 31, 2019

do you have an opinion on what to do about the "modifying init param in fit" failure for pipelines?

It depends whether you see that as a bug or a feature. If one passes an estimator to a pipeline, I would naively expect that estimator to be fitted when the pipeline is fit, as happens now, rather than having some separate cloned instance created inside the pipeline. At least that's what happens in DL libraries, I think, and it makes more sense to me, aside from the fact that it doesn't respect the scikit-learn API contract. Was that contract designed with pipelines in view, though?

@amueller (Member Author)

I think there's a separate issue to track that, #8157. Maybe it makes sense to figure out a short-term solution here for the tests?

@amueller (Member Author)

amueller commented Sep 9, 2019

Ok, so I have no idea how to make make_pipeline(StandardScaler()) pass the tests...
We would need to define the NaN handling of pipelines, which our current tags can't express. Do we want to add a new tag like "ensure_no_missing_output" or something for imputers?
Basically, a pipeline can handle NaN if either all steps can handle NaN, or there is an imputer and all steps before it can handle NaN.
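
A minimal sketch of that rule, assuming each step exposes an allow_nan entry via the (private) _get_tags() and using a hypothetical is_imputer() helper to recognize imputers:

def pipeline_allows_nan(steps):
    # A pipeline can handle NaN if every step handles NaN, or if there is
    # an imputer and every step *before* it handles NaN (the imputer is
    # assumed to remove the missing values for the later steps).
    for name, est in steps:
        if est is None or est == "passthrough":
            continue
        if is_imputer(est):  # hypothetical helper, e.g. an isinstance check
            return True
        if not est._get_tags().get("allow_nan", False):
            return False
    return True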

@amueller amueller changed the title from "[WIP] run check_estimator on meta-estimators" to "[MRG] run check_estimator on meta-estimators" on Sep 9, 2019
@amueller (Member Author)

amueller commented Sep 9, 2019

I'm hacking around the pipeline cloning issue here.
I think we should merge this to get something in. This would make #14241 fail, I think, which is already a win.

The missing value tags are a bit of an issue and generally defining tags for pipelines will be tricky, but this is a start.

I didn't add ColumnTransformer yet, because there are too many issues.
There are also issues with None and passthrough steps.

@amueller (Member Author)

amueller commented Sep 9, 2019

ping @NicolasHug

# grid-search
GridSearchCV(LogisticRegression(), {'C': [0.1, 1]}, cv=2),
# will fail tragically
# make_pipeline(StandardScaler(), None)
Review comment (Member):

Is this failing because None or 'passthrough' does not support _get_tags?

Reply (Member Author):

there were multiple issues
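
For context, a minimal sketch of how such pre-configured meta-estimator instances would be fed to the common checks, assuming check_estimator accepts instances as this PR aims for:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.estimator_checks import check_estimator

meta_estimators = [
    GridSearchCV(LogisticRegression(), {'C': [0.1, 1]}, cv=2),
    make_pipeline(StandardScaler(), LogisticRegression()),
]
for est in meta_estimators:
    # Runs the full common-check suite against each instance.
    check_estimator(est)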

@amueller (Member Author)

amueller commented Sep 9, 2019

The validation and duck typing in pipeline is a mess.
Also, it's not clear to me whether having the last n steps be passthrough will work. (edit: it works as expected, I think)

But I think adding at least some tests is good...

@thomasjpfan (Member)

thomasjpfan commented Sep 9, 2019

The validation and duck typing in pipeline is a mess.

Agreed. if_delegate_has_method does not work properly when _final_estimator is passthrough or None. I think fixing this should be in another PR.

Base automatically changed from master to main January 22, 2021 10:49
@adrinjalali (Member)

I was just reminded that Pipeline doesn't pass check_is_fitted, which is what we need in fairlearn/fairlearn#665. I'd be very happy to have this in soon :)

@glemaitre (Member)

I would expect the following to work as intended:

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

The unfitted pipeline will error.

from sklearn.utils.validation import check_is_fitted

check_is_fitted(pipe, "n_features_in_")
NotFittedError: This Pipeline instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

And the following will not error.

pipe.fit(X, y)
check_is_fitted(pipe, "n_features_in_")

@adrinjalali (Member)

Our new check_is_fitted convention is not to call it with a second parameter though, and since n_features_in_ is a property, it's not listed among the attributes that check_is_fitted inspects. Also, as a third-party estimator, I'd like to just check whether another estimator is fitted, not myself, which means I don't know which attribute I should check. Therefore check_is_fitted(pipeline, "n_features_in_") is not a solution. I'm in the position of a meta-estimator that wants to check whether the given estimator is fitted or not (think warm_start).
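
For illustration, the only generic probe a meta-estimator has in that position is something like the following sketch (the _is_fitted helper name is illustrative; check_is_fitted and NotFittedError are the public scikit-learn utilities):

from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

def _is_fitted(estimator):
    # A meta-estimator does not know which fitted attribute the wrapped
    # estimator sets, so it can only call the attribute-less form and
    # catch the error. This is exactly what currently breaks for
    # Pipeline, which sets no fitted attribute of its own.
    try:
        check_is_fitted(estimator)
    except NotFittedError:
        return False
    return True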

@glemaitre (Member)

I am confused regarding this topic. It comes up from time to time and I don't recall what we are supposed to do (probably we don't know :)).

From what I recall, we basically don't know what to validate in a Pipeline (e.g. one estimator or all estimators, etc.). Passing a callable would delegate this issue to the user but, if I understand correctly, you would like to not make any check at all.

I was under the impression that n_features_in_ being a property indeed allows for such a "callable" check, so it would almost be OK to use it. Except if you get a Vectorizer that does not implement n_features_in_, and then you are back to square one.

@adrinjalali do you recall the latest news regarding this topic?

@adrinjalali (Member)

To me, a Pipeline should pass check_is_fitted if the user has called fit on it, independently of whether all the steps are already fitted or not. If we want to save CPU cycles and pass pre-fitted estimators to a pipeline, we should still call fit on it, but with warm_start and without calling every estimator's fit.

I'd just set an is_fitted_ attribute in Pipeline's fit and let it pass check_is_fitted like every other estimator.
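
A minimal sketch of that idea, using a hypothetical subclass for illustration rather than the actual Pipeline code:

from sklearn.pipeline import Pipeline

class FittedFlagPipeline(Pipeline):
    def fit(self, X, y=None, **fit_params):
        super().fit(X, y, **fit_params)
        # Set a trailing-underscore attribute so that the generic
        # check_is_fitted (which looks for fitted attributes) passes
        # once fit has been called on the pipeline itself.
        self.is_fitted_ = True
        return self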

@glemaitre (Member)

I'd just set an is_fitted_ attribute in Pipeline's fit and let it pass check_is_fitted like every other estimator.

Maybe is_maybe_fitted_if_you_did_it_right_ is better :). Joking aside, I think this is equivalent to the remark in the PDP code: we would expect to see a step_ attribute that would be the result of calling fit.

@adrinjalali (Member)

So should I just submit a PR setting some random attribute in pipeline for check_is_fitted to work?

@glemaitre (Member)

So should I just submit a PR setting some random attribute in pipeline for check_is_fitted to work?

I would be happy to review. It would be nice to have a review from @jnothman, who always thinks about side effects that I am not aware of.

@thomasjpfan (Member)

So should I just submit a PR setting some random attribute in pipeline for check_is_fitted to work?

Let's go for it. I run into the pipeline + check_is_fitted issue quite frequently.

check_is_fitted is really implicit. If we want the estimator to decide, I can see a protocol: __is_fitted__, which returns True if the estimator is fitted. check_is_fitted would call __is_fitted__ if it exists.
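
A sketch of that protocol as described, with hypothetical names (this is not the implementation that was eventually merged):

from sklearn.exceptions import NotFittedError

def check_is_fitted_sketch(estimator):
    # If the estimator opts into the protocol, let it decide for itself.
    if hasattr(estimator, "__is_fitted__"):
        if not estimator.__is_fitted__():
            raise NotFittedError(
                f"This {type(estimator).__name__} instance is not fitted yet.")
        return
    # Otherwise fall back to the usual heuristic: any attribute ending in
    # an underscore that was set during fit counts as evidence of fitting.
    fitted_attrs = [k for k in vars(estimator)
                    if k.endswith("_") and not k.startswith("__")]
    if not fitted_attrs:
        raise NotFittedError(
            f"This {type(estimator).__name__} instance is not fitted yet.")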

@ogrisel (Member)

ogrisel commented Aug 20, 2021

Closing in favor of #20657, which implements a general solution. If there are remaining sub-cases to be tackled, they can be addressed in dedicated PRs. #20657 adds a new common test, so most cases should already be covered.

@ogrisel ogrisel closed this Aug 20, 2021
@adrinjalali (Member)

The other PR only checks for check_is_fitted though. This PR adds a bit more than that, doesn't it @ogrisel?

@ogrisel (Member)

ogrisel commented Aug 20, 2021

Indeed, sorry. It needs an update though. Let's reopen so we don't forget.

@ogrisel ogrisel reopened this Aug 20, 2021
@amueller amueller closed this Jul 17, 2022
@amueller amueller reopened this Jul 17, 2022
@glemaitre glemaitre removed their request for review December 1, 2022 14:51
@thomasjpfan (Member)

@adrinjalali We already run the common tests on many meta-estimators:

@parametrize_with_checks(list(chain(_tested_estimators(), _generate_pipeline())))

@parametrize_with_checks(list(_generate_search_cv_instances()))

I think this PR can be closed.
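
For reference, a minimal sketch of how such instances are fed to the common tests with the public parametrize_with_checks decorator (the instance list here is only illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.estimator_checks import parametrize_with_checks

@parametrize_with_checks([
    make_pipeline(StandardScaler(), LogisticRegression()),
    GridSearchCV(LogisticRegression(), {"C": [0.1, 1]}, cv=2),
])
def test_meta_estimators(estimator, check):
    # Each (estimator, check) pair becomes its own pytest test case.
    check(estimator)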

@adrinjalali adrinjalali closed this Mar 7, 2024

Successfully merging this pull request may close these issues.

Add instance-level calls to estimator_checks for meta-estimators