
[MRG] Parameter for stacking missing indicator into imputer #12583


Conversation

DanilBaibak (Contributor)

Fixes #11886

A new parameter add_indicator was added to SimpleImputer that allows stacking a MissingIndicator transform into the output of the imputer's transform.
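The new parameter can be sketched with a minimal example (the data here is made up for illustration; the indicator columns are stacked to the right of the imputed features):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# With add_indicator=True, the output is the imputed matrix with the
# missing-indicator columns appended on the right, one per feature
# that had missing values at fit time.
imp = SimpleImputer(strategy="mean", add_indicator=True)
Xt = imp.fit_transform(X)
print(Xt.shape)  # (3, 4): 2 imputed columns + 2 indicator columns
```

This is equivalent in spirit to manually building a FeatureUnion of a SimpleImputer and a MissingIndicator, but in a single estimator.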

@amueller (Member)

Sweet. Can you add an entry to the 0.21 whatsnew? Thanks!

@jeremiedbb (Member) left a comment

This is a nice feature! Thanks. A few comments below.

@jeremiedbb (Member)

Hum, I found a case where it will fail. When one column is full of missing values, the SimpleImputer will drop that column (except for the "constant" strategy), and the MissingIndicator will have a feature mismatch at transform time.

First, you can add this to the test (i.e. add one column full of missing values) so it covers more cases.

Then, to fix this, one possibility is not to fit the MissingIndicator in fit but to fit_transform it in transform instead. The MissingIndicator does not hold any information that the SimpleImputer does not already have, so it does not even need to be an attribute of the SimpleImputer. @amueller @jnothman what do you think?
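The failure case described above can be reproduced with a minimal sketch using the two raw components (made-up data; note the differing output widths that make naive stacking inconsistent):

```python
import numpy as np
from sklearn.impute import SimpleImputer, MissingIndicator

# The third column is entirely missing: "mean" has no statistic for it,
# so SimpleImputer drops it from its output...
X = np.array([[np.nan, 2.0, np.nan],
              [3.0, np.nan, np.nan]])

imputed = SimpleImputer(strategy="mean").fit_transform(X)
print(imputed.shape)  # (2, 2): the all-missing column is gone

# ...while MissingIndicator emits one indicator column per feature that
# had any missing value at fit time, including the all-missing one.
indicator = MissingIndicator().fit_transform(X)
print(indicator.shape)  # (2, 3)
```

So the imputed block and the indicator block no longer describe the same set of features, which is exactly the mismatch at issue.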

@DanilBaibak (Contributor, Author)

@jeremiedbb, good catch with the fully missing column! I adjusted the code and added tests. What do you think now?

@amueller (Member)

I think transform shouldn't modify the estimator. Why can't we fit in fit and call transform after dropping columns?

I don't like the current solution since it adds a constant column of 1s, which is not useful imho.

@jeremiedbb (Member)

jeremiedbb commented Nov 20, 2018

> I don't like the current solution since it adds a constant column of 1s, which is not useful imho.

I agree. In the case of a column full of missing values, we should also drop the column full of 1s from the output of MissingIndicator.

@DanilBaibak (Contributor, Author)

@amueller and @jeremiedbb, I see your point and agree. But the current solution behaves the same as doing it like this:

from sklearn.pipeline import make_pipeline, make_union
from sklearn.impute import SimpleImputer, MissingIndicator

# `marker` is whatever missing-value placeholder is in use (e.g. np.nan)
pipeline = make_pipeline(
    make_union(
        SimpleImputer(missing_values=marker),
        MissingIndicator(missing_values=marker)
    )
)

pipeline.fit_transform(X_test)

In theory, if we drop the column full of 1s, people who currently use make_union could be affected. What do you think?

@amueller (Member)

That's a good point. We need to decide to either:

  1. make it inconsistent and drop the 1s here
  2. keep the redundant 1s in both places
  3. change the behavior in MissingIndicator to always drop constant columns.

I'm somewhat leaning towards 3, though it requires a deprecation cycle.
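The "redundant 1s" behind option 3 can be seen directly (a small made-up example): an all-missing feature produces an indicator column that is constantly True, carrying no information.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Second column is entirely missing.
X = np.array([[1.0, np.nan],
              [2.0, np.nan],
              [np.nan, np.nan]])

mask = MissingIndicator().fit_transform(X)
# One indicator column per feature with missing values; the second
# one is constant True, i.e. the redundant "column of 1s" discussed.
print(mask)
```

Option 3 would have MissingIndicator drop such constant columns itself, which is why it needs a deprecation cycle.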

@DanilBaibak (Contributor, Author)

It seems that automatically dropping a constant column is not always a good choice. Here's a small example:

import numpy as np
from sklearn.pipeline import make_pipeline, make_union
from sklearn.impute import SimpleImputer, MissingIndicator

X_train = np.array([
    [1, 1, np.nan],
    [np.nan, 2, 6],
    [2, np.nan, 3],
    [3, 3, 9]
])

X_test = np.array([
    [np.nan, 1, 5],
    [np.nan, 2, np.nan],
    [np.nan, np.nan, 3]
])

pipeline = make_pipeline(
    make_union(
        SimpleImputer(),
        MissingIndicator()
    )
)
pipeline.fit(X_train)

pipeline.transform(X_train)
pipeline.transform(X_test)

For X_test we get a whole column of 1s in the indicator output, but not for X_train.

I would vote for #2 because, to be honest, it's hard to imagine users working with a dataset that contains a whole column of missing values. It would not be useful even with the "constant" strategy for SimpleImputer.

@jeremiedbb (Member)

@DanilBaibak I think Joel meant in another PR, once this one is done.

@DanilBaibak (Contributor, Author)

DanilBaibak commented Apr 1, 2019

Ok! Just added it in hot pursuit 😄

@jnothman (Member) left a comment

Looking good following merge of #13491

@NicolasHug (Member) left a comment

A few comments

@jnothman (Member) left a comment

Yes! Thanks.

@jnothman jnothman merged commit 2252e1f into scikit-learn:master Apr 9, 2019
@jnothman (Member)

jnothman commented Apr 9, 2019

Thanks @banilo!!

@DanilBaibak (Contributor, Author)

Glad to help 😊

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Apr 25, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Linked issue: add_indicator switch in imputers
6 participants