-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] Parameter for stacking missing indicator into imputer #12583
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Parameter for stacking missing indicator into imputer #12583
Conversation
Sweet. Can you add an entry to the 0.21 whatsnew? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice feature ! Thanks. A few comments below.
Hum I found a case when it will fail. When one column is full of missing value, the SimpleImputer will drop that column (except for "constant" strategy) and the MissingIndicator will have a feature mismatch at transform time. First, you can add it to the test, i.e. add one column full of missing values, to make the test cover more cases. Then, to fix this, one possibility is to not fit the MissingIndicator in fit but fit_transform it instead in transform. The MissingIndicator does not hold any information that the SimpleImputer already have, so it does not even need to be an attribute of the SimpleImputer. @amuller @jnothman what do you think ? |
@jeremiedbb, good catch with fully missing column! I adjusted code and added tests. What do you think about it now? |
I think I don't like the current solution since it adds a constant 1 column which is not useful imho. |
I agree. In the case of a column full of missing values, we should also drop the column full of 1 from the output of missingIndicator. |
@amueller and @jeremiedbb, I see your point and agree. But the current solution does the same if you do it like this:
In theory, if we drop the column full of 1, people who now use |
That's a good point. We need to decide to either:
I'm somewhat leaning towards 3, though it requires a deprecation cycle. |
Seems, auto drop for a constant column not always good choice. Here's a small sample:
For I would vote for the # 2, because, to be honest, it's hard to imagine, that users will try to work with a dataset, that contains whole column of empty values. It would be unuseful even with "constant" strategy for SimpleImputer. |
@DanilBaibak I think Joel meant in another PR, once this one is done |
Ok! Just added it in hot pursuit 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good following merge of #13491
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Thanks.
Thanks @banilo!! |
Glad to help 😊 |
…ikit-learn#12583)" This reverts commit 5f94df8.
…ikit-learn#12583)" This reverts commit 5f94df8.
Fixes #11886
A new parameter
add_indicator
was added to SimpleImputer that allow simply stacking a MissingIndicator transform into the output of the imputer's transform.