[MRG] ENH add an option to drop full missing features in MissingIndicator #13491
Conversation
sklearn/impute.py
Outdated
@@ -1057,13 +1057,15 @@ class MissingIndicator(BaseEstimator, TransformerMixin):
        `missing_values` will be indicated (True in the output array), the
        other values will be marked as False.

-    features : str, optional
+    features : {"missing-only", "all", "not-constant"}, optional
I find `not-constant` terrible... Please help me find a better name :)!
It's mostly for internal use, so don't worry! But "some-missing" might be better.
It also fixes a bug: when X is sparse, the mask would contain explicit zeros (all non-missing values become explicit zeros). Here's an example to illustrate this:

    from sklearn.impute import MissingIndicator
    from scipy.sparse import csr_matrix

    X = csr_matrix([[0, 1, 2],
                    [1, 2, 0],
                    [2, 0, 1]])
    mi = MissingIndicator(features='all', missing_values=1)
    print(mi.fit_transform(X))

    (1, 0)  True
    (2, 0)  False
    (0, 1)  True
    (1, 1)  False
    (0, 2)  False
    (2, 2)  True

All the '2' entries became explicit zeros.
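As a side note, a minimal sketch (mine, not necessarily how this PR fixes it) of how explicit zeros can be dropped from a sparse boolean mask, using scipy's eliminate_zeros, which removes stored entries equal to zero/False:

    import numpy as np
    from scipy.sparse import csr_matrix

    # A 1x3 boolean mask with one explicit False stored at (0, 1).
    data = np.array([True, False, True])
    indices = np.array([0, 1, 2])
    indptr = np.array([0, 3])
    mask = csr_matrix((data, indices, indptr), shape=(1, 3))
    print(mask.nnz)         # 3 stored entries, one of them an explicit False
    mask.eliminate_zeros()  # drops the stored False (== 0) entry in place
    print(mask.nnz)         # 2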
Not-constant => varying?
            features_diff_fit_trans = np.setdiff1d(features, self.features_)
            if (self.error_on_new and features_diff_fit_trans.size > 0):
                raise ValueError("The features {} have missing values "
                                 "in transform but have no missing values "
                                 "in fit.".format(features_diff_fit_trans))

        if (self.features_.size > 0 and
                self.features_.size < self._n_features):
I removed the first condition. If we want only the features with missing values and there are none, then the mask should be empty. Before, it would return the mask of all features.
I'm not a fan either :/ Discussing IRL with Joris, he finds 'not-constant' explicit enough. Maybe it's not too bad after all. I put the PR on MRG and maybe we'll reach a consensus with the other reviewers :)
It seems a bit silly to maintain these three options. We could consider phasing out missing-only? Not important.
doc/whats_new/v0.21.rst
Outdated
    :user:`Jérémie du Boisberranger <jeremiedbb>`.

- |Fix| Fixed a bug in :class:`MissingIndicator` when ``X`` is sparse. All the
  non-zero missing values used to become explicit False is the transformed data.
is -> in
sklearn/impute.py
Outdated
                if missing_values_mask.format == 'csc'
                else np.unique(missing_values_mask.indices))
        if self.features in ('missing-only', 'not-constant'):
            if imputer_mask.format == 'csc':
this can be achieved with imputer_mask.getnnz(axis=0)
Why would you make it simple when you can make it complicated :D ?
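For illustration, a small sketch of the suggestion above (the variable name `imputer_mask` is taken from the diff; the data here is made up): on a sparse boolean mask, getnnz(axis=0) returns the number of stored, i.e. missing, entries per column.

    import numpy as np
    from scipy.sparse import csc_matrix

    imputer_mask = csc_matrix(np.array([[True, False, False],
                                        [True, False, True],
                                        [False, False, True]]))
    n_missing = imputer_mask.getnnz(axis=0)  # stored entries per column
    print(n_missing)  # [2 0 2]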
sklearn/impute.py
Outdated
        return imputer_mask, features_with_missing
        if self.features == 'all':
            features_indices = np.arange(X.shape[1])
        else:
this would be clearer as:

    elif self.features == 'missing-only':
        features_indices = np.flatnonzero(n_missing)
    else:
        features_indices = np.flatnonzero(np.logical_and(n_missing < X.shape[0], n_missing > 0))
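A quick worked example of the selection logic suggested above, assuming `n_missing` holds the per-column count of missing values (the numbers are made up):

    import numpy as np

    n_samples = 4
    n_missing = np.array([0, 2, 4, 1])  # hypothetical counts for 4 features

    # 'missing-only': features with at least one missing value
    print(np.flatnonzero(n_missing))  # [1 2 3]

    # 'not-constant': features with some, but not all, values missing
    print(np.flatnonzero(np.logical_and(n_missing < n_samples,
                                        n_missing > 0)))  # [1 3]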
sklearn/tests/test_impute.py
Outdated
@@ -919,7 +919,7 @@ def test_iterative_imputer_early_stopping():
     'have missing values in transform but have no missing values in fit'),
    (np.array([[-1, 1], [1, 2]]), np.array([[-1, 1], [1, 2]]),
     {'features': 'random', 'sparse': 'auto'},
-    "'features' has to be either 'missing-only' or 'all'"),
+    "'features' has to be either 'missing-only', 'all' or 'not-constant'"),
either -> one of
sklearn/tests/test_impute.py
Outdated
    Xt = mi.fit_transform(X)

    nnz = Xt.getnnz()
You could just add an assertion elsewhere that `Xt.nnz == Xt.sum()`. You shouldn't need a new test, nor a specified expected value.
I should add: otherwise LGTM
Aaah, that's better indeed!
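A rough sketch of what that assertion could look like, reusing the example from earlier in the thread (my illustration, not the actual test code):

    from scipy.sparse import csr_matrix
    from sklearn.impute import MissingIndicator

    X = csr_matrix([[0, 1, 2],
                    [1, 2, 0],
                    [2, 0, 1]])
    Xt = MissingIndicator(features='all', missing_values=1).fit_transform(X)
    # If no explicit zeros are stored, every stored entry is True, so the
    # number of stored entries equals the number of True values.
    assert Xt.nnz == Xt.sum()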
How will you define `error_on_new` if you add the new option?
Can we still implement #12583 without this PR? (e.g., use features='all' and pick the columns we need)
I doubt whether it's important to consider features containing only missing values.
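A small sketch of the workaround mentioned above (use features='all' and then pick the columns), with made-up data:

    import numpy as np
    from sklearn.impute import MissingIndicator

    X = np.array([[np.nan, 1.0, 2.0],
                  [3.0, np.nan, np.nan],
                  [4.0, 5.0, np.nan]])
    mask = MissingIndicator(features='all').fit_transform(X)  # one column per feature
    wanted = [0, 2]          # hypothetical subset of columns we care about
    print(mask[:, wanted])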
sklearn/impute.py
Outdated
@@ -1246,15 +1254,14 @@ def transform(self, X):

        imputer_mask, features = self._get_missing_features_info(X)

-       if self.features == "missing-only":
+       if self.features in ("missing-only", "some-missing"):
            features_diff_fit_trans = np.setdiff1d(features, self.features_)
            if (self.error_on_new and features_diff_fit_trans.size > 0):
                raise ValueError("The features {} have missing values "
we need to update the error message?
sklearn/tests/test_impute.py
Outdated
    def test_missing_indicator_sparse_no_explicit_zeros():
        # Check that non missing values don't become explicit zeeros in the mask
zeeros -> zeros
I wouldn't modify it: raise an error if a feature has missing values at transform but not at fit. Not keeping the fully-missing features at fit does not interfere with that. I wouldn't raise an error if a feature is fully missing at fit and only partially missing at transform, to keep the same behavior as the SimpleImputer (although for that one, fully-missing features can't be fitted at all).
Yes, we can do that. The idea was to have the imputers share similar behaviors, and this behavior is justified by the fact that constant features bring absolutely no information for learning.
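For context, a small sketch of the SimpleImputer behaviour being referred to, if I recall it correctly: a column that is entirely missing at fit time is dropped from the transformed output.

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[np.nan, 1.0],
                  [np.nan, 2.0],
                  [np.nan, 3.0]])
    # The first column has no observed values, so it is discarded.
    print(SimpleImputer(strategy='mean').fit_transform(X).shape)  # (3, 1)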
But now, when
You're right that constant features are not informative, but we have things like
As long as you can define
Right. I did not catch that.
Finally I updated
Not raising in the second case would have added a lot more complexity to the code (keeping track of which feature has been dropped for which reason). Let me know what you think of this behavior.
Hmm, I think the new definition is difficult to understand (and the name
I think the idea in the long term is to only keep one option with the behavior of
Let's see what @jnothman thinks.
I'd be happy with that, but we should only raise a deprecation warning when there would be a difference in output...
So you think the current definition of
I think that change is unhelpful and possibly destructive, thanks for asking. I think features that newly have present values are still not of any interest to the user.
So there's -2 on the current definition of `error_on_new`.
You're probably right. I'm quite ambivalent about the all-missing behaviour as long as it is tested.
Ok then. I won't close it, however, because it also fixes a bug. I'll rename it instead.
Apart from the bug, you can also add some notes about the all-missing behaviour.
Well, there's no special behavior in that case. I don't think it's worth adding a note on a rare edge case that we don't treat differently. Actually, I'm closing this one and opening a new one to ease the review, because the thread will be unrelated.
That's fine. Apologies again for the wrong comments above.
Opened #13562 for the fixes only.
This PR should help #12583 to move forward.
Following discussions in #12583, adding an option to SimpleImputer to stack a MissingIndicator that is consistent with the behavior of the SimpleImputer, without breaking backward compatibility, requires adding an option to the MissingIndicator to drop the columns full of missing values.
To do that I added another possible value for the `features` parameter. I wonder if we want to keep the 3 options, or raise a future warning saying that 'missing-only' will take the new option's behavior in 2 releases and deprecate the new option. After all, keeping features with full missing values does not really make sense. What's your opinion about that?
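If we go the deprecation route, a hypothetical sketch (not code from this PR) of the kind of FutureWarning that could be raised at fit time when features='missing-only' is used and its behavior is about to change:

    import warnings

    def _warn_missing_only_behaviour_change():
        # Hypothetical helper; the name and message are illustrative only.
        warnings.warn(
            "The behaviour of features='missing-only' will change in two "
            "releases: features that are entirely missing at fit time will "
            "be dropped from the indicator output. Set `features` explicitly "
            "to silence this warning.",
            FutureWarning,
        )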