
[MRG] ENH add an option to drop full missing features in MissingIndicator #13491


Conversation

@jeremiedbb (Member)

This PR should help #12583 to move forward.

Following the discussions in #12583: adding an option to SimpleImputer to stack a MissingIndicator, in a way that is consistent with the behavior of SimpleImputer and does not break backward compatibility, requires adding an option to MissingIndicator to drop the columns that are full of missing values.

To do that I added another possible value for the features parameter.

I wonder if we want to keep the 3 options, or raise a FutureWarning saying that 'missing-only' will take on the new option's behavior in 2 releases, and deprecate the new option. After all, keeping features full of missing values does not really make sense. What's your opinion about that?

@@ -1057,13 +1057,15 @@ class MissingIndicator(BaseEstimator, TransformerMixin):
`missing_values` will be indicated (True in the output array), the
other values will be marked as False.

features : str, optional
features : {"missing-only", "all", "not-constant"}, optional
Member Author

I find 'not-constant' terrible... Please help me find a better name :)!

Member

It's mostly for internal use, so don't worry! But "some-missing" might be better.

@jeremiedbb (Member Author)

It also fixes a bug: when X is sparse, the mask would contain explicit zeros (all non-missing values become explicit zeros).
The mask is not incorrect, but more values are stored than necessary.

Here's an example to illustrate this:

from sklearn.impute import MissingIndicator
from scipy.sparse import csr_matrix

X = csr_matrix([[0, 1, 2],
                [1, 2, 0],
                [2, 0, 1]])

mi = MissingIndicator(features='all', missing_values=1)
print(mi.fit_transform(X))
  (1, 0)	True
  (2, 0)	False
  (0, 1)	True
  (1, 1)	False
  (0, 2)	False
  (2, 2)	True

All the '2' values became explicit zeros (stored as explicit False entries in the boolean mask).
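To make the storage issue concrete, here is a minimal sketch (not the PR's code; it only relies on scipy's `csr_matrix` and `eliminate_zeros`) of how explicit `False` entries arise and how they can be dropped:

```python
from scipy.sparse import csr_matrix

# Same matrix as above; the value 1 plays the role of "missing".
X = csr_matrix([[0, 1, 2],
                [1, 2, 0],
                [2, 0, 1]])

# A mask computed over all stored entries keeps an explicit False
# wherever a stored entry is not missing.
mask = X.copy()
mask.data = (mask.data == 1)   # boolean mask over the 6 stored entries
print(mask.nnz)                # 6: three True and three explicit False

# Dropping the explicit zeros leaves only the True (missing) entries,
# which is the leaner storage this fix aims for.
mask.eliminate_zeros()
print(mask.nnz)                # 3
```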

@jnothman (Member)

jnothman commented Mar 23, 2019 via email

features_diff_fit_trans = np.setdiff1d(features, self.features_)
if (self.error_on_new and features_diff_fit_trans.size > 0):
raise ValueError("The features {} have missing values "
"in transform but have no missing values "
"in fit.".format(features_diff_fit_trans))

if (self.features_.size > 0 and
self.features_.size < self._n_features):
Member Author

I removed the first condition. If we want only features with missing, and there's not any, then the mask should be empty. Before, it would return the mask of all features.
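A tiny illustration (made-up arrays, not the PR's code) of the intended behavior after removing the condition: slicing with an empty index array yields an empty mask instead of falling through to "keep everything".

```python
import numpy as np

# Hypothetical: X has 3 features, none with missing values at fit.
features_ = np.array([], dtype=int)   # indices kept by 'missing-only'
n_features = 3

# Old behavior: an empty index array fell through to the full mask.
# New behavior: slicing with the empty array yields a 0-column mask.
mask = np.zeros((2, n_features), dtype=bool)
print(mask[:, features_].shape)  # (2, 0)
```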

@jeremiedbb (Member Author)

Not-constant => varying?

I'm not a fan either :/

Discussing irl with Joris, he finds 'not-constant' explicit enough. Maybe it's not too bad after all.

I put the PR on MRG and maybe we'll reach a consensus with other reviewers :)

@jeremiedbb jeremiedbb changed the title [WIP] ENH add an option to drop full missing features in MissingIndicator [MRG] ENH add an option to drop full missing features in MissingIndicator Mar 25, 2019
@jnothman (Member) left a comment

It seems a bit silly to maintain these three options. We could consider phasing out missing-only? Not important.

:user:`Jérémie du Boisberranger <jeremiedbb>`.

- |Fix| Fixed a bug in :class:`MissingIndicator` when ``X`` is sparse. All the
non-zero missing values used to become explicit False is the transformed data.
Member

is -> in


if missing_values_mask.format == 'csc'
else np.unique(missing_values_mask.indices))
if self.features in ('missing-only', 'not-constant'):
if imputer_mask.format == 'csc':
Member

this can be achieved with imputer_mask.getnnz(axis=0)
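For illustration, a minimal sketch of what `getnnz(axis=0)` returns (the mask here is made up; only the scipy call is the point):

```python
import numpy as np
from scipy.sparse import csc_matrix

# Hypothetical boolean mask in CSC format: True marks a missing entry.
imputer_mask = csc_matrix(np.array([[True, False, False],
                                    [True, False, True]]))

# Stored (True) entries per column, without walking indptr/indices by hand.
n_missing = imputer_mask.getnnz(axis=0)
print(n_missing)  # [2 0 1]
```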

Member Author

Why would you make it simple when you can make it complicated :D ?

return imputer_mask, features_with_missing
if self.features == 'all':
features_indices = np.arange(X.shape[1])
else:
Member

this would be clearer as:

elif self.features == 'missing-only':
    features_indices = np.flatnonzero(n_missing)
else:
    features_indices = np.flatnonzero(np.logical_and(n_missing < X.shape[0], n_missing > 0))
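A small sketch of what the suggested branches compute, using made-up per-column missing counts:

```python
import numpy as np

# Hypothetical per-column missing-value counts for a 4-row X.
n_samples = 4
n_missing = np.array([0, 4, 2, 0, 3])

# 'missing-only': every column with at least one missing value.
missing_only = np.flatnonzero(n_missing)

# 'not-constant': columns with some, but not all, values missing,
# so fully-missing columns are dropped as well.
not_constant = np.flatnonzero((n_missing > 0) & (n_missing < n_samples))

print(missing_only)  # [1 2 4]
print(not_constant)  # [2 4]
```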

@@ -919,7 +919,7 @@ def test_iterative_imputer_early_stopping():
'have missing values in transform but have no missing values in fit'),
(np.array([[-1, 1], [1, 2]]), np.array([[-1, 1], [1, 2]]),
{'features': 'random', 'sparse': 'auto'},
"'features' has to be either 'missing-only' or 'all'"),
"'features' has to be either 'missing-only', 'all' or 'not-constant'"),
Member

either -> one of

Xt = mi.fit_transform(X)

nnz = Xt.getnnz()

Member

You could just add an assertion elsewhere that Xt.nnz == Xt.sum(). You shouldn't need a new test, nor a specified expected value.
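A minimal sketch of the invariant behind that suggestion: for a boolean sparse matrix with no explicit zeros, every stored entry is True, so `nnz` equals the sum.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Constructed from a dense array, only the True values are stored.
Xt = csr_matrix(np.array([[True, False],
                          [False, True]]))
assert Xt.nnz == Xt.sum()  # no explicit False stored

# Turning a stored entry into an explicit False breaks the invariant
# that the suggested assertion checks.
Xt.data[0] = False
assert Xt.nnz != Xt.sum()
```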

@jnothman (Member)

I should add: otherwise LGTM

@jeremiedbb (Member Author)

It's mostly for internal use, so don't worry! But "some-missing" might be better.

Aaah, that's better indeed!

DanilBaibak added a commit to DanilBaibak/scikit-learn that referenced this pull request Mar 28, 2019
@qinhanmin2014 (Member) left a comment

How will you define error_on_new if you add the new option?
We can still implement #12583 without this PR? (e.g., use features='all' and pick the columns we need)
I doubt whether it's important to consider features containing only missing values.

@@ -1246,15 +1254,14 @@ def transform(self, X):

imputer_mask, features = self._get_missing_features_info(X)

if self.features == "missing-only":
if self.features in ("missing-only", "some-missing"):
features_diff_fit_trans = np.setdiff1d(features, self.features_)
if (self.error_on_new and features_diff_fit_trans.size > 0):
raise ValueError("The features {} have missing values "
Member

we need to update the error message?



def test_missing_indicator_sparse_no_explicit_zeros():
# Check that non missing values don't become explicit zeeros in the mask
Member

zeeros -> zeros

@jeremiedbb (Member Author)

How will you define error_on_new if you add the new option?

I wouldn't modify it: raise an error if a feature has missing values at transform but not at fit. Not keeping the full missing at fit does not interfere with that. I wouldn't raise an error if a feature has full missing at fit and only some missing at transform, to keep the same behavior as the SimpleImputer (although for this one full missing can't be fitted at all).

We can still implement #12583 without this PR? (e.g., use features='all' and pick the columns we need) I doubt whether it's important to consider features containing only missing values.

Yes we can do that. The idea was to have imputers share similar behaviors, and this behavior is justified by the fact that constant features bring absolutely no information to learn.

@qinhanmin2014 (Member)

I wouldn't modify it: raise an error if a feature has missing values at transform but not at fit. Not keeping the full missing at fit does not interfere with that. I wouldn't raise an error if a feature has full missing at fit and only some missing at transform, to keep the same behavior as the SimpleImputer (although for this one full missing can't be fitted at all).

But now, when features="missing-only", users will get an error when a feature only has missing values in fit but has some missing values in transform? The doc and the error message seem incorrect.

Yes we can do that. The idea was to have imputers share similar behaviors, and this behavior is justified by the fact that constant features bring absolutely no information to learn.

You're right that constant features are not informative, but we have things like feature_selection.VarianceThreshold and I don't think it's the duty of a transformer to do feature selection (See e.g., the Notes part of https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html).

@qinhanmin2014 (Member)

You're right that constant features are not informative, but we have things like feature_selection.VarianceThreshold and I don't think it's the duty of a transformer to do feature selection (See e.g., the Notes part of https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html).

As long as you can define error_on_new, I'm OK with the new option.

@jeremiedbb (Member Author)

But now, when features="missing-only", users will get an error when a feature only has missing values in fit but has some missing values in transform? The doc and the error message seems incorrect.

Right, I did not catch that.

@jeremiedbb (Member Author)

Finally I updated error_on_new to raise an error in both cases:

  • feature with no missing at fit and some missing at transform (applicable only if 'missing-only' or 'some-missing')
  • feature with only missing at fit and some non missing at transform (applicable only if 'some-missing')

Not raising in the second case would have added a lot more complexity to the code (keeping track of which feature was dropped for which reason).

Let me know what you think of this behavior.
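For illustration, a hypothetical sketch (invented names, not scikit-learn's internals) of why a single set difference over the 'some-missing' feature sets flags both cases at once:

```python
import numpy as np

n_samples = 3

def active_features(n_missing):
    # 'some-missing': columns with some, but not all, values missing.
    return np.flatnonzero((n_missing > 0) & (n_missing < n_samples))

# Column 0: no missing at fit, some at transform (case 1).
# Column 1: fully missing at fit, some non-missing at transform (case 2).
# Column 2: some missing in both (no error).
n_missing_fit = np.array([0, 3, 1])
n_missing_transform = np.array([1, 1, 1])

features_fit = active_features(n_missing_fit)              # [2]
features_transform = active_features(n_missing_transform)  # [0 1 2]

# One set difference flags both offending columns, which is why handling
# the two cases separately would have complicated the code.
diff = np.setdiff1d(features_transform, features_fit)
print(diff)  # [0 1]
```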

@qinhanmin2014 (Member)

Hmm, I think the new definition is difficult to understand (and the name error_on_new seems strange under the new definition). I doubt whether we should close this PR. We can still implement #12583 without this PR? (e.g., use features='all' and pick the columns we need). Maybe add a note instead.
ping @jnothman

@jeremiedbb (Member Author)

I think the idea in the long term is to only keep one option with the behavior of some-missing, i.e. deprecate the current behavior in favor of the new one. At least that's what I understood from the discussions.

@qinhanmin2014 (Member)

I think the idea in the long term is to only keep one option with the behavior of some-missing, i.e. deprecate the current behavior in favor of the new one. At least that's what I understood from the discussions.

Let's see what @jnothman thinks.

@jnothman (Member)

jnothman commented Apr 2, 2019 via email

@qinhanmin2014 (Member)

I'd be happy with that, but we should only raise a deprecation warning when
there would be a difference in output...

So you think current definition of error_on_new (changed after your approval) is acceptable? @jnothman

error_on_new : boolean, optional
        If True (default), transform will raise an error when there are either
        features with missing values in transform that have no missing values
        in fit (only applicable if
        ``features in ("missing-only", "some-missing")``), or features with non
        missing values in transform that have only missing values in fit
        (only applicable if ``features="some-missing"``).

@jnothman (Member)

jnothman commented Apr 2, 2019 via email

@qinhanmin2014 (Member)

So there's -2 on current definition of error_on_new.
I think it's difficult to define error_on_new if we support features="some-missing", and I guess it's not so important to consider features containing only missing values here.
@jnothman How about closing this PR? We can still implement #12583 without this PR (e.g., use features='all' and pick the columns we need). Maybe we can add a note instead.

@jnothman (Member)

jnothman commented Apr 2, 2019 via email

@jeremiedbb (Member Author)

Ok then. I won't close it however because it also fixes a bug. I'll rename it instead.

@scikit-learn scikit-learn deleted a comment from jeremiedbb Apr 2, 2019
@qinhanmin2014 (Member)

Ok then. I won't close it however because it also fixes a bug. I'll rename it instead.

Apart from the bug, you can also add some notes about the all-missing behaviour.

@jeremiedbb (Member Author)

Apart from the bug, you can also add some notes about the all-missing behaviour.

well there's no special behavior in that case. I don't think it's worth adding a note on a rare edge case which we don't treat differently.

Actually I'm closing this one and opening a new one to ease the review because the thread will be unrelated.

@qinhanmin2014 (Member)

well there's no special behavior in that case. I don't think it's worth adding a note on a rare edge case which we don't treat differently.

That's fine. Apologies again for the wrong comments above.

@jeremiedbb (Member Author)

opened #13562 for the fixes only

3 participants