
[MRG] ENH add an option to drop full missing features in MissingIndicator #13491


Conversation

@jeremiedbb (Member)

This PR should help #12583 to move forward.

Following the discussions in #12583: adding an option to SimpleImputer to stack a MissingIndicator, in a way that is consistent with the behavior of SimpleImputer and does not break backward compatibility, requires adding an option to MissingIndicator to drop the columns that are full of missing values.

To do that I added another possible value for the features parameter.

I wonder if we want to keep the 3 options, or raise a FutureWarning saying that 'missing-only' will take on the new option's behavior in 2 releases, and deprecate the new option. After all, keeping features full of missing values does not really make sense. What's your opinion about that?

@@ -1057,13 +1057,15 @@ class MissingIndicator(BaseEstimator, TransformerMixin):
`missing_values` will be indicated (True in the output array), the
other values will be marked as False.

features : str, optional
features : {"missing-only", "all", "not-constant"}, optional
Member Author

I find 'not-constant' terrible... Please help me find a better name :)!

Member

It's mostly for internal use, so don't worry! But "some-missing" might be better.

@jeremiedbb (Member Author)

It also fixes a bug: when X is sparse, the mask would contain explicit zeros (all non-missing values become explicit zeros).
The mask is not incorrect, but more values are stored than necessary.

Here's an example to illustrate this:

from sklearn.impute import MissingIndicator
from scipy.sparse import csr_matrix

X = csr_matrix([[0, 1, 2],
                [1, 2, 0],
                [2, 0, 1]])

mi = MissingIndicator(features='all', missing_values=1)
print(mi.fit_transform(X))
  (1, 0)	True
  (2, 0)	False
  (0, 1)	True
  (1, 1)	False
  (0, 2)	False
  (2, 2)	True

All the '2' values became explicit zeros (stored as explicit False entries in the boolean mask).
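To make the storage issue concrete, here is a minimal sketch (not the PR's code; it only relies on scipy's `csr_matrix` and `eliminate_zeros`) of how explicit `False` entries arise and how they can be dropped:

```python
from scipy.sparse import csr_matrix

# Same matrix as above; the value 1 plays the role of "missing".
X = csr_matrix([[0, 1, 2],
                [1, 2, 0],
                [2, 0, 1]])

# A mask computed over all stored entries keeps an explicit False
# wherever a stored entry is not missing.
mask = X.copy()
mask.data = (mask.data == 1)   # boolean mask over the 6 stored entries
print(mask.nnz)                # 6: three True and three explicit False

# Dropping the explicit zeros leaves only the True (missing) entries,
# which is the leaner storage this fix aims for.
mask.eliminate_zeros()
print(mask.nnz)                # 3
```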

@jnothman (Member)

jnothman commented Mar 23, 2019 via email

features_diff_fit_trans = np.setdiff1d(features, self.features_)
if (self.error_on_new and features_diff_fit_trans.size > 0):
raise ValueError("The features {} have missing values "
"in transform but have no missing values "
"in fit.".format(features_diff_fit_trans))

if (self.features_.size > 0 and
self.features_.size < self._n_features):
Member Author

I removed the first condition. If we want only features with missing, and there's not any, then the mask should be empty. Before, it would return the mask of all features.
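A tiny illustration (made-up arrays, not the PR's code) of the intended behavior after removing the condition: slicing with an empty index array yields an empty mask instead of falling through to "keep everything".

```python
import numpy as np

# Hypothetical: X has 3 features, none with missing values at fit.
features_ = np.array([], dtype=int)   # indices kept by 'missing-only'
n_features = 3

# Old behavior: an empty index array fell through to the full mask.
# New behavior: slicing with the empty array yields a 0-column mask.
mask = np.zeros((2, n_features), dtype=bool)
print(mask[:, features_].shape)  # (2, 0)
```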

@jeremiedbb (Member Author)

Not-constant => varying?

I'm not a fan either :/

Discussing irl with Joris, he finds 'not-constant' explicit enough. Maybe it's not too bad after all.

I put the PR on MRG and maybe we'll reach a consensus with other reviewers :)

@jeremiedbb jeremiedbb changed the title [WIP] ENH add an option to drop full missing features in MissingIndicator [MRG] ENH add an option to drop full missing features in MissingIndicator Mar 25, 2019
@jnothman (Member) left a comment

It seems a bit silly to maintain these three options. We could consider phasing out missing-only? Not important.

:user:`Jérémie du Boisberranger <jeremiedbb>`.

- |Fix| Fixed a bug in :class:`MissingIndicator` when ``X`` is sparse. All the
non-zero missing values used to become explicit False is the transformed data.
Member

is -> in


if missing_values_mask.format == 'csc'
else np.unique(missing_values_mask.indices))
if self.features in ('missing-only', 'not-constant'):
if imputer_mask.format == 'csc':
Member

this can be achieved with imputer_mask.getnnz(axis=0)
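For illustration, a minimal sketch of what `getnnz(axis=0)` returns (the mask here is made up; only the scipy call is the point):

```python
import numpy as np
from scipy.sparse import csc_matrix

# Hypothetical boolean mask in CSC format: True marks a missing entry.
imputer_mask = csc_matrix(np.array([[True, False, False],
                                    [True, False, True]]))

# Stored (True) entries per column, without walking indptr/indices by hand.
n_missing = imputer_mask.getnnz(axis=0)
print(n_missing)  # [2 0 1]
```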

Member Author

Why would you make it simple when you can make it complicated :D ?

return imputer_mask, features_with_missing
if self.features == 'all':
features_indices = np.arange(X.shape[1])
else:
Member

this would be clearer as:

elif self.features == 'missing-only':
    features_indices = np.flatnonzero(n_missing)
else:
    features_indices = np.flatnonzero(np.logical_and(n_missing < X.shape[0], n_missing > 0))
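A small sketch of what the suggested branches compute, using made-up per-column missing counts:

```python
import numpy as np

# Hypothetical per-column missing-value counts for a 4-row X.
n_samples = 4
n_missing = np.array([0, 4, 2, 0, 3])

# 'missing-only': every column with at least one missing value.
missing_only = np.flatnonzero(n_missing)

# 'not-constant': columns with some, but not all, values missing,
# so fully-missing columns are dropped as well.
not_constant = np.flatnonzero((n_missing > 0) & (n_missing < n_samples))

print(missing_only)  # [1 2 4]
print(not_constant)  # [2 4]
```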

@@ -919,7 +919,7 @@ def test_iterative_imputer_early_stopping():
'have missing values in transform but have no missing values in fit'),
(np.array([[-1, 1], [1, 2]]), np.array([[-1, 1], [1, 2]]),
{'features': 'random', 'sparse': 'auto'},
"'features' has to be either 'missing-only' or 'all'"),
"'features' has to be either 'missing-only', 'all' or 'not-constant'"),
Member

either -> one of

Xt = mi.fit_transform(X)

nnz = Xt.getnnz()

Member

You could just add an assertion elsewhere that Xt.nnz == Xt.sum(). You shouldn't need a new test, nor a specified expected value.
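A minimal sketch of the invariant behind that suggestion: for a boolean sparse matrix with no explicit zeros, every stored entry is True, so `nnz` equals the sum.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Constructed from a dense array, only the True values are stored.
Xt = csr_matrix(np.array([[True, False],
                          [False, True]]))
assert Xt.nnz == Xt.sum()  # no explicit False stored

# Turning a stored entry into an explicit False breaks the invariant
# that the suggested assertion checks.
Xt.data[0] = False
assert Xt.nnz != Xt.sum()
```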

@jnothman (Member)

I should add: otherwise LGTM

@jeremiedbb (Member Author)

It's mostly for internal use, so don't worry! But "some-missing" might be better.

Aaah, that's better indeed!

DanilBaibak added a commit to DanilBaibak/scikit-learn that referenced this pull request Mar 28, 2019
@qinhanmin2014 (Member) left a comment

How will you define error_on_new if you add the new option?
We can still implement #12583 without this PR? (e.g., use features='all' and pick the columns we need)
I doubt whether it's important to consider features containing only missing values.

@@ -1246,15 +1254,14 @@ def transform(self, X):

imputer_mask, features = self._get_missing_features_info(X)

if self.features == "missing-only":
if self.features in ("missing-only", "some-missing"):
features_diff_fit_trans = np.setdiff1d(features, self.features_)
if (self.error_on_new and features_diff_fit_trans.size > 0):
raise ValueError("The features {} have missing values "
Member

we need to update the error message?



def test_missing_indicator_sparse_no_explicit_zeros():
# Check that non missing values don't become explicit zeeros in the mask
Member

zeeros -> zeros

@jeremiedbb (Member Author)

How will you define error_on_new if you add the new option?

I wouldn't modify it: raise an error if a feature has missing values at transform but not at fit. Not keeping the full missing at fit does not interfere with that. I wouldn't raise an error if a feature has full missing at fit and only some missing at transform, to keep the same behavior as the SimpleImputer (although for this one full missing can't be fitted at all).

We can still implement #12583 without this PR? (e.g., use features='all' and pick the columns we need) I doubt whether it's important to consider features containing only missing values.

Yes we can do that. The idea was to have imputers share similar behaviors, and this behavior is justified by the fact that constant features bring absolutely no information to learn.

@qinhanmin2014 (Member)

I wouldn't modify it: raise an error if a feature has missing values at transform but not at fit. Not keeping the full missing at fit does not interfere with that. I wouldn't raise an error if a feature has full missing at fit and only some missing at transform, to keep the same behavior as the SimpleImputer (although for this one full missing can't be fitted at all).

But now, when features="missing-only", users will get an error when a feature only has missing values in fit but has some missing values in transform? The doc and the error message seem incorrect.

Yes we can do that. The idea was to have imputers share similar behaviors, and this behavior is justified by the fact that constant features bring absolutely no information to learn.

You're right that constant features are not informative, but we have things like feature_selection.VarianceThreshold and I don't think it's the duty of a transformer to do feature selection (See e.g., the Notes part of https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html).

@qinhanmin2014 (Member)

You're right that constant features are not informative, but we have things like feature_selection.VarianceThreshold and I don't think it's the duty of a transformer to do feature selection (See e.g., the Notes part of https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html).

As long as you can define error_on_new, I'm OK with the new option.

@jeremiedbb (Member Author)

But now, when features="missing-only", users will get an error when a feature only has missing values in fit but has some missing values in transform? The doc and the error message seems incorrect.

Right, I did not catch that.

@jeremiedbb (Member Author)

Finally I updated error_on_new to raise an error in both cases:

  • feature with no missing at fit and some missing at transform (applicable only if 'missing-only' or 'some-missing')
  • feature with only missing at fit and some non missing at transform (applicable only if 'some-missing')

Not raising in the second case would have added a lot more complexity to the code (keeping track of which feature was dropped for which reason).

Let me know what you think of this behavior.
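For illustration, a hypothetical sketch (invented names, not scikit-learn's internals) of why a single set difference over the 'some-missing' feature sets flags both cases at once:

```python
import numpy as np

n_samples = 3

def active_features(n_missing):
    # 'some-missing': columns with some, but not all, values missing.
    return np.flatnonzero((n_missing > 0) & (n_missing < n_samples))

# Column 0: no missing at fit, some at transform (case 1).
# Column 1: fully missing at fit, some non-missing at transform (case 2).
# Column 2: some missing in both (no error).
n_missing_fit = np.array([0, 3, 1])
n_missing_transform = np.array([1, 1, 1])

features_fit = active_features(n_missing_fit)              # [2]
features_transform = active_features(n_missing_transform)  # [0 1 2]

# One set difference flags both offending columns, which is why handling
# the two cases separately would have complicated the code.
diff = np.setdiff1d(features_transform, features_fit)
print(diff)  # [0 1]
```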

@qinhanmin2014 (Member)

Hmm, I think the new definition is difficult to understand (and the name error_on_new seems strange under the new definition). I doubt whether we should close this PR. We can still implement #12583 without this PR? (e.g., use features='all' and pick the columns we need). Maybe add a note instead.
ping @jnothman

@jeremiedbb (Member Author)

I think the idea in the long term is to only keep one option with the behavior of some-missing, i.e. deprecate the current behavior in favor of the new one. At least that's what I understood from the discussions.

@qinhanmin2014 (Member)

I think the idea in the long term is to only keep one option with the behavior of some-missing, i.e. deprecate the current behavior in favor of the new one. At least that's what I understood from the discussions.

Let's see what @jnothman thinks.

@jnothman (Member)

jnothman commented Apr 2, 2019 via email

@qinhanmin2014 (Member)

I'd be happy with that, but we should only raise a deprecation warning when
there would be a difference in output...

So you think current definition of error_on_new (changed after your approval) is acceptable? @jnothman

error_on_new : boolean, optional
        If True (default), transform will raise an error when there are either
        features with missing values in transform that have no missing values
        in fit (only applicable if
        ``features in ("missing-only", "some-missing")``), or features with non
        missing values in transform that have only missing values in fit
        (only applicable if ``features="some-missing"``).

@jnothman (Member)

jnothman commented Apr 2, 2019 via email

@qinhanmin2014 (Member)

So there's -2 on current definition of error_on_new.
I think it's difficult to define error_on_new if we support features="some-missing", and I guess it's not so important to consider features containing only missing values here.
@jnothman How about closing this PR? We can still implement #12583 without this PR (e.g., use features='all' and pick the columns we need). Maybe we can add a note instead.

@jnothman (Member)

jnothman commented Apr 2, 2019 via email

@jeremiedbb (Member Author)

Ok then. I won't close it however because it also fixes a bug. I'll rename it instead.

@scikit-learn scikit-learn deleted a comment from jeremiedbb Apr 2, 2019
@qinhanmin2014 (Member)

Ok then. I won't close it however because it also fixes a bug. I'll rename it instead.

Apart from the bug, you can also add some notes about the all-missing behaviour.

@jeremiedbb (Member Author)

Apart from the bug, you can also add some notes about the all-missing behaviour.

well there's no special behavior in that case. I don't think it's worth adding a note on a rare edge case which we don't treat differently.

Actually I'm closing this one and opening a new one to ease the review because the thread will be unrelated.

@qinhanmin2014 (Member)

well there's no special behavior in that case. I don't think it's worth adding a note on a rare edge case which we don't treat differently.

That's fine. Apologies again for the wrong comments above.

@jeremiedbb (Member Author)

opened #13562 for the fixes only

3 participants