
EHN make some meta-estimators lenient towards missing values #17987


Merged · 23 commits · Aug 26, 2020

Conversation

venkyyuvy
Contributor

Reference Issues/PRs: Fixes: #15319

What does this implement/fix? Explain your changes.

Any other comments?

@ogrisel ogrisel changed the title Meta estimators leniency towards missing values [WIP] Meta estimators leniency towards missing values Jul 25, 2020
@ogrisel
Member

ogrisel commented Jul 25, 2020

I marked this PR as [WIP] to let others know that it's still a work in progress, as MultiOutputClassifier still needs to be changed to make it lenient to missing values. Feel free to update the title again once the tests pass, to implicitly ask for reviews.
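For illustration, the intended end state is that a `MultiOutputClassifier` wrapping a pipeline with an imputer accepts `NaN` in `X`, because input validation is delegated to the base estimator. A minimal sketch (the estimator choices here are illustrative, not the PR's test code):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

# X contains a NaN; the SimpleImputer inside the base estimator handles it,
# so the meta-estimator should not reject the input up front.
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])
Y = np.array([[0, 1], [1, 0], [0, 1], [1, 0]])

base = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
clf = MultiOutputClassifier(base).fit(X, Y)  # must not raise on the NaN
print(clf.predict(X).shape)  # one prediction column per output: (4, 2)
```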

@venkyyuvy
Contributor Author

@ogrisel

Do I need to add one more test for MultiOutputRegressor as well?

One more question: is there any other meta-estimator that does not allow NaN values to pass through?

@venkyyuvy venkyyuvy changed the title [WIP] Meta estimators leniency towards missing values Meta estimators leniency towards missing values Jul 25, 2020
@venkyyuvy venkyyuvy requested a review from ogrisel July 25, 2020 16:44
@jnothman
Member

jnothman commented Jul 26, 2020 via email

Member

@ogrisel ogrisel left a comment


LGTM! As @jnothman said, it would be great to do the same for MultiOutputRegressor as part of the same PR while you are at it.

@venkyyuvy
Contributor Author

venkyyuvy commented Jul 30, 2020

Status before this PR.

| estimator type | model | supports missing values | test |
| --- | --- | --- | --- |
| ensemble | BaggingRegressor | yes | test_bagging_regressor_with_missing_inputs |
| ensemble | BaggingClassifier | yes | test_bagging_classifier_with_missing_inputs |
| ensemble | VotingClassifier | yes | test_classifier_for_missing_data |
| ensemble | VotingRegressor | yes | test_regressor_for_missing_data |
| ensemble | StackingClassifier | yes | test_clf_support_missing_values |
| ensemble | StackingRegressor | yes | test_reg_support_missing_values |
| wrapper | MultiOutputRegressor | no | test_leniency_for_missing_data_regression |
| wrapper | MultiOutputClassifier | no | test_leniency_for_missing_data_classification |
| wrapper | OneVsRestClassifier | yes | test_support_missing_values |
| wrapper | OneVsOneClassifier | no | test_support_missing_values |
| pipeline | Pipeline | yes | test_missing_values_leniency |
| search | GridSearchCV | yes | test_grid_search_allows_nans |
| feature_selector | RFE | yes | test_rfe_allow_nan_inf_in_x |
| feature_selector | RFECV | yes | test_rfe_allow_nan_inf_in_x |
| feature_selector | SelectFromModel | yes | test_fit_accepts_nan_inf |
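As a concrete illustration of one of the "yes" rows, GridSearchCV passes `X` through untouched and lets the pipeline's imputer deal with the NaNs. A minimal sketch (data and parameter grid are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
X = rng.randn(50, 4)
X[rng.rand(50, 4) < 0.1] = np.nan  # sprinkle in missing values
y = rng.randint(0, 2, 50)

# The search itself does not validate X; the SimpleImputer inside the
# pipeline is what actually handles the NaNs during each fit.
pipe = make_pipeline(SimpleImputer(), DecisionTreeClassifier(random_state=0))
search = GridSearchCV(pipe, {"decisiontreeclassifier__max_depth": [1, 2]}, cv=3)
search.fit(X, y)  # must not raise despite the NaNs in X
print(search.best_params_)
```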

@jnothman
Member

Excellent. Very informative!

@jnothman
Member

Tests here are failing, however.

@jnothman
Member

There are other meta-estimators, however, such as StackingClassifier, SelectFromModel, RFE, Pipeline, GridSearchCV, etc...

@venkyyuvy
Contributor Author

venkyyuvy commented Jul 30, 2020

I have not fixed OneVsOneClassifier yet; I have just added a test for it.

> StackingClassifier, SelectFromModel, RFE, Pipeline, GridSearchCV, etc

Thank you for the pointers, I will check on these as well.

@glemaitre
Member

Out of the scope of this PR (to not create a monster), but I think we need to introduce common tests for checking NaN behaviour.
For transformers it would be quite simple; I have not yet thought about the meta-estimator case.

@venkyyuvy
Contributor Author

venkyyuvy commented Jul 31, 2020

@glemaitre,
what's your suggestion on how to build a common test for NaN behaviour?

Currently I am checking each of them manually. I am covering the meta-estimators and wrappers, but not the transformers.

@glemaitre
Member

> what's your suggestion on going about building a common test for nan behaviour?

Reviewing the code, we are repeating the same test for all meta-estimators, so we could isolate it and put it in estimator_checks.py.

Then, we need to call this test with the proper estimator. For a meta-estimator, we need to make sure that we pass a base estimator that is a pipeline with a SimpleImputer (this should be possible when building estimator instances).
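Such a common check could look like the following sketch. The helper name `check_meta_estimator_allows_nan` is hypothetical, not part of `estimator_checks.py`:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def check_meta_estimator_allows_nan(meta_estimator_cls):
    # Hypothetical common check: wrap an imputing pipeline as the base
    # estimator, then verify the meta-estimator does not reject NaN input.
    rng = np.random.RandomState(0)
    X = rng.randn(40, 3)
    X[::5, 0] = np.nan  # inject missing values
    y = rng.randint(0, 2, 40)
    base = make_pipeline(SimpleImputer(), LogisticRegression())
    est = meta_estimator_cls(base)
    est.fit(X, y)  # must not raise because of the NaNs
    return est.predict(X)

# Example usage with one meta-estimator that already passes NaNs through:
from sklearn.multiclass import OneVsRestClassifier
pred = check_meta_estimator_allows_nan(OneVsRestClassifier)
```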

@glemaitre
Member

And basically we should introduce this #9741 at the same time :)

@glemaitre
Member

> We can also change these multiple tests to one, without making an automatic common test.

Yes, we could move those to sklearn/ensemble/test_common.py, this is a good idea.

@venkyyuvy
Contributor Author

Thanks, I will move the tests for the voting and stacking meta-estimators to sklearn/ensemble/test_common.py.
Can we keep the MultiOutput and MultiClass wrapper tests where they are?
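As a sketch of what the combined test could look like (the test name, data, and estimator list here are illustrative, not the actual contents of the shared file), the duplicated per-ensemble tests can be collapsed into one parametrized test:

```python
import numpy as np
import pytest
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
X[::4, 0] = np.nan  # inject missing values
y = np.array([0, 1] * 10)

base = make_pipeline(SimpleImputer(), LogisticRegression())

@pytest.mark.parametrize(
    "ensemble",
    [
        VotingClassifier(estimators=[("lr", base)]),
        StackingClassifier(estimators=[("lr", base)], cv=2),
    ],
)
def test_ensemble_supports_missing_values(ensemble):
    # The ensemble must leave input validation to the inner pipeline,
    # whose SimpleImputer handles the NaNs.
    ensemble.fit(X, y).predict(X)
```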

@glemaitre
Member

Tests are failing. Could you also add a comment on the tests to indicate that we should remove them once we create a common test?

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@venkyyuvy
Contributor Author

> Tests are failing. Could you also add a comment on the test to indicate that we should remove them when we create a common test

I'm not sure why the tests are failing with an AttributeError (rng.random). The tests run fine on my end.
Could you please let me know what is going wrong?

@glemaitre
Member

rng.random does not exist on older versions of NumPy. Maybe use rng.randn to generate normally distributed data instead.
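For context, `RandomState.random` was added only as an alias of `random_sample` in NumPy 1.17, so on older releases it raises an AttributeError. A version-safe way to build the test data:

```python
import numpy as np

rng = np.random.RandomState(42)

# rng.randn and rng.random_sample exist on old NumPy releases,
# unlike rng.random (NumPy >= 1.17 only).
X = rng.randn(10, 3)                      # standard normal data
mask = rng.random_sample((10, 3)) < 0.2   # uniform draws for the NaN mask
X[mask] = np.nan
print(X.shape)
```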

@cmarmo cmarmo removed the Needs Info label Aug 15, 2020
@@ -302,6 +302,22 @@ Changelog
validity of the input is now delegated to the base estimator.
:pr:`17233` by :user:`Zolisa Bleki <zoj613>`.

- |Enhancement| :class:`multiclass.OneVsOneClassifier` now accpets
Contributor

Typo: accpets -> accepts


- |Enhancement| :class:`multioutput.MultiOutputClassifier` and
:class:`multioutput.MultiOutputRegressor` now accpets the inputs
Contributor

Typo: accpets -> accepts

@jnothman
Member

This is much neater. @venkyyuvy could you please resolve those nitpicks from @justmarkham, and resolve conflicts with master, so that we can merge. Thanks again!!

Member

@glemaitre glemaitre left a comment

Only added 2 additional comments. LGTM otherwise (fixing what was pointed out earlier).

@glemaitre glemaitre changed the title Meta estimators leniency towards missing values EHN make some meta-estimators lenient towards missing values Aug 21, 2020
@glemaitre glemaitre merged commit 8ba49f6 into scikit-learn:master Aug 26, 2020
@glemaitre
Member

Thanks @venkyyuvy! Let's merge this. It is super useful.

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
…learn#17987)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Successfully merging this pull request may close these issues.

Make sure meta-estimators are lenient towards missing data