[MRG+1] MissingIndicator transformer #8075
Conversation
Force-pushed 7fcdbf4 to 678f2c3 (compare)
Hi, I am not entirely aware of the …
Ok, thanks @maniteja123. Sorry for the stupid question.
No problem @tguillemot. No question is stupid. Thanks for asking. :) Probably I should add a small example to make the use case clear to understand.
Indeed, it's always useful.
#7084 relates to missing values; this helps solve missing-value problems.
Add some tests please. I'll look at the implementation soon.
sklearn/preprocessing/imputation.py
Outdated
    `missing_values` will be imputed. For missing values encoded as np.nan,
    use the string value "NaN".

axis : integer, optional (default=0)
I don't think we need this.
I understand the missing indicator is independent of axis. So would missing_indicators='train' mean it only indicates the features with missing values at fit time?
Yes
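A minimal numpy sketch of the semantics agreed on here (names like `feat_with_missing` are illustrative, not the PR's API): fit records which columns contained missing values, and transform indicates only those columns.

```python
import numpy as np

# Hypothetical sketch of the features='train' behaviour under discussion:
# fit records which columns contain missing values; transform returns the
# indicator mask restricted to those columns.
missing_values = -1

X_fit = np.array([[1, -1, 3],
                  [4, 5, -1]])
# columns that had at least one missing value at fit time
feat_with_missing = np.where((X_fit == missing_values).any(axis=0))[0]

X_new = np.array([[-1, 2, -1]])
indicator = (X_new[:, feat_with_missing] == missing_values)
```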
sklearn/preprocessing/imputation.py
Outdated
    - If `axis=0`, then impute along columns.
    - If `axis=1`, then impute along rows.

missing_indicators : [None, "all", "train", indices/mask]
I think `features` would be an adequate name, or `which_features` or something.
sklearn/preprocessing/imputation.py
Outdated
    If "train"
    If array

copy : boolean, optional (default=True)
not applicable?
sklearn/preprocessing/imputation.py
Outdated
Attributes
----------
feat_with_missing_ : array of shape (n_missing_features,)
    The features with missing values.
"in fit"
Reduce indentation.
Note that this is only stored if features == 'train'
Thanks. This is heading in the right direction.
Once it's looking good, we'll talk about integrating it into Imputer, and adding a summary feature which indicates the presence of any missing values in a row.
sklearn/preprocessing/imputation.py
Outdated
    feat_with_missing = mask_matrix.sum(axis=0).nonzero()[1]
    # ravel since nonzero returns 2d matrices for sparse in scipy 0.11
    feat_with_missing = np.asarray(feat_with_missing).ravel()
I'm pretty sure `np.ravel(feat_with_missing)` will suffice.
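A quick check of that suggestion (numpy/scipy only; `mask_matrix` stands in for the boolean mask in the PR): `np.ravel` flattens the 2-D result that `nonzero()` yields for the `np.matrix` produced by a sparse `.sum(axis=0)`.

```python
import numpy as np
from scipy import sparse

# For sparse input, .sum(axis=0) yields an np.matrix, so nonzero()
# returns 2-D results; np.ravel flattens either case to a 1-D array.
mask_matrix = sparse.csr_matrix(np.array([[1, 0, 0],
                                          [0, 0, 1]]))
feat_with_missing = mask_matrix.sum(axis=0).nonzero()[1]
feat_with_missing = np.ravel(feat_with_missing)
```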
sklearn/preprocessing/imputation.py
Outdated
if self.features == "train":
    features = np.setdiff1d(self.feat_with_missing_,
        feat_with_missing)
indent to match previous line's self
sklearn/preprocessing/imputation.py
Outdated
features = np.setdiff1d(self.feat_with_missing_,
                        feat_with_missing)
if features:
    warnings.warn("The features %s have missing "
I don't think this is a case we should warn about. The opposite (missing values in transform but none in fit), perhaps.
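A hedged sketch of the opposite check suggested here (the array names are illustrative): warn only for features that have missing values at transform time but had none at fit time.

```python
import warnings

import numpy as np

# Features with missing values observed at fit vs. transform time
fit_features = np.array([1, 3])        # had missing values in fit
transform_features = np.array([1, 2])  # have missing values in transform

# setdiff1d finds features missing only at transform time
new_features = np.setdiff1d(transform_features, fit_features)
if new_features.size:
    warnings.warn("Features %s have missing values in transform "
                  "but had none in fit." % new_features)
```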
sklearn/preprocessing/imputation.py
Outdated
elif self.features == "all":
    imputer_mask = imputer_mask

elif isinstance(self.features, (np.ndarray, list, tuple)):
tuples and lists are treated differently by numpy and I'm not sure we should support tuples.
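For illustration, how numpy treats a list versus a bare tuple as an index, which is why supporting tuples here is questionable:

```python
import numpy as np

X = np.arange(12).reshape(3, 4)

# A list is fancy indexing: selects columns 0 and 2 of every row.
cols = X[:, [0, 2]]          # shape (3, 2)

# A bare tuple is multi-axis indexing: X[(0, 2)] is X[0, 2], a scalar.
scalar = X[(0, 2)]           # element at row 0, column 2
```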
X2_tr = MI.transform(X2)
mask = X2[:, features] == -1
assert_array_equal(X2_tr, mask)

remove blanks
sklearn/preprocessing/imputation.py
Outdated
self : object
    Returns self.
"""
if (isinstance(self.features, six.string_types) and
Should probably also validate that:
- `features`, if array-like, is an integer array
- `sparse` has a valid value
sklearn/preprocessing/imputation.py
Outdated
    `missing_values` will be imputed. For missing values encoded as np.nan,
    use the string value "NaN".

features : [None, "all", "train", array(indices/mask)]
{'train' (default), 'all', array-like of int}
MI = clone(MI).set_params(features="train")
MI.fit(X1)
X2_tr = MI.transform(X2)
features = MI.feat_with_missing_
Need to check that this is correctly computed.
    [0, -1, 5, -1],
    [11, -1, 1, 1]
])

mask = X2 == -1 # can be done before loop
for X1, X2, missing_values in [(X1_orig, X2_orig, -1), (X1_orig + 1, X2_orig + 1, 0)]:
for retype in [np.array, lambda x: x.tolist(), sparse.csr_matrix, sparse.csc_matrix, sparse.lil_matrix]:
Could you explain what `retype` means. I guess it is "return type", but does it mean it should call MI.fit(retype(X1)) and MI.transform(retype(X2))?
MI.fit(X1)
X2_tr = MI.transform(X2)
mask = X2[:, features] == -1
assert_array_equal(X2_tr, mask)
Also need to assert the warning case, and assert that validation error messages are produced correctly.
Also, it might be a good idea to add something to the narrative documentation at this point, explaining the motivation for such features, and briefly describing the operation. It would be good to add an example here in the docstring too. Perhaps you should add a task list to the PR description.
retype was a bad name for "the type you want to change it to".
Whichever sparse type is most appropriate here. COO? CSC?
X2_tr = MI.transform(X2)
features = MI.feat_with_missing_
assert_array_equal(expect_feat_missing, features)
assert_array_equal(np.asarray(X2_tr), mask[:, features])
Sorry Joel for the many questions, but I am confused here. Suppose I take the case where X1 and X2 are sparse or lists: how do I check for equality between X2_tr and mask[:, features]?
`X2_tr` should be an array or sparse matrix regardless of `X2`, no?
Yeah, sorry, it is an array or sparse matrix. I was asking about equality between the sparse matrices, because X2_tr will be sparse while mask is an array. Is it okay to densify and use assert_array_equal?
In a test, yes.
I don't know much about sparse arrays, but I was thinking COO might be a good choice since there might be fewer missing values in general. Please do correct me if I am thinking about it the wrong way.
I don't get your purported motivation for COO. If you construct the matrix in a column-wise fashion, CSC might be the best choice.
Oh I see, but the matrix which is being constructed by this using … Also if we return as CSC, then the …
For auto, return it in the format it came in. Otherwise, return CSC. You don't need to calculate indices and indptr manually. Calling …
Thanks @jnothman, just one more clarification. In case of sparse matrix and …
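A sketch of the structure-reuse idea under discussion, assuming missing values encoded as -1 (this is not the PR's exact code): copy the input matrix and recompute only its `.data`, rather than building `indices` and `indptr` by hand.

```python
import numpy as np
from scipy import sparse

missing_values = -1
X = sparse.csc_matrix(np.array([[0., -1., 2.],
                                [3., 0., -1.]]))

# Reuse X's sparsity structure: the mask can only be nonzero where X
# stores an entry, so copying and rewriting .data is enough.
imputer_mask = X.copy()
imputer_mask.data = (X.data == missing_values).astype(X.dtype)
imputer_mask.eliminate_zeros()  # drop explicit zeros for non-missing entries
```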
Yes, I guess so.
…On 28 December 2016 at 15:51, Maniteja Nandana wrote:
Thanks @jnothman, just one more clarification. In case of sparse matrix and missing values = 0, currently a dense matrix is returned. Should it be the same even when sparse='auto'?
Thanks, but the LIL format wouldn't contain indices, right? So this line fails for it. Also, should the return type be a numpy array when a list is passed to transform? Sorry for so many questions, but the tests are failing for these scenarios.
Yes, return type should be a numpy array, except if it should be a sparse matrix, in which case the returned format should be specified in the docstring, but need not be the same as the input type. check_array will convert to acceptable types.
sklearn/preprocessing/imputation.py
Outdated
if sparse.issparse(X) and self.missing_values != 0:
    # sparse matrix and missing values is not zero
    imputer_mask = _get_mask(X.data, self.missing_values)
    imputer_mask = X.__class__((imputer_mask, X.indices.copy(),
                                X.indptr.copy()), shape=X.shape,
                               dtype=X.dtype)

print 'here' + str(type(X)) + str(type(imputer_mask))
debug print to be removed...?
I'll have a look over the transformation and tests once you're happy you know how to determine handling of transform output types etc.
doc/modules/preprocessing.rst
Outdated
Transformer indicating missing values
=====================================

MissingIndicator transformer is useful to transform a dataset into corresponding
use :class:`MissingIndicator`
doc/modules/preprocessing.rst
Outdated
MissingIndicator(features='train', missing_values=-1, sparse='auto')
>>> X2_tr = MI.transform(X2)
>>> X2_tr
array([[False, False, True],
I think we should probably be returning ints, not bools.
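For reference, converting a boolean mask to the suggested integer indicator is a one-liner:

```python
import numpy as np

bool_mask = np.array([[False, False, True],
                      [True, False, False]])
# 0/1 integer indicator instead of booleans
int_mask = bool_mask.astype(int)
```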
sklearn/preprocessing/imputation.py
Outdated
X : {array-like, sparse matrix}, shape (n_samples, n_features)
    Input data, where ``n_samples`` is the number of samples and
    ``n_features`` is the number of features.
Returns
blank line before this, please
sklearn/preprocessing/imputation.py
Outdated
def transform(self, X):
    """Impute all missing values in X.
    Parameters
blank line
sklearn/preprocessing/imputation.py
Outdated
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
    The input data to complete.
Returns
blank line
sklearn/preprocessing/imputation.py
Outdated
    The input data to complete.
Returns
-------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
traditionally Xt
sklearn/preprocessing/imputation.py
Outdated
Returns
-------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
    The transformerwith missing indicator.
Huh?
Non zero missing values
Zero missing values*
Hi @jnothman, sorry for the delay. Could you look at the above return types and let me know if it works? Thanks.
I think that's correct. But to be sure, we could write a test that checks that auto corresponds to the same type/format as output by the imputer in that case.
Force-pushed 5357a1b to 41c2596 (compare)
@maniteja123 any updates? :)
Does it need a review? Can you rename to MRG if so... I can look into it next week...
Hi @raghavrv, thanks for reminding me. I believe it can be reviewed, though the tests could be more comprehensive regarding the return type for sparse and dense matrices. Will look at fixing the failing tests and then ping you.
Codecov Report
@@ Coverage Diff @@
## master #8075 +/- ##
==========================================
+ Coverage 95.48% 95.48% +<.01%
==========================================
Files 342 342
Lines 60987 61096 +109
==========================================
+ Hits 58233 58339 +106
- Misses 2754 2757 +3
Continue to review full report at Codecov.
…puter_missing_values
Can you maybe add this to plot_missing_values.py or something? It would be good to have an example that uses this. Ideally we'd package that into …
Parameters
----------
missing_values : number, string, np.nan (default) or None
what does number mean and why is np.nan not a number? Maybe just move the np.nan to the end?
number means real number. It's just to fit this in one line.
I think by definition nan is not a number :)
but the dtype is also important, isn't it? I find "float or int" more natural than "number or np.nan" but I don't have a strong opinion.
I agree that "float or int" is better than number, but I think it's important to keep np.nan visible since it should be a common value for missing_values. Maybe something like int, float, string or None (default=np.nan)?
right now this is consistent with SimpleImputer and ChainedImputer in fact.
Parameters
----------
missing_values : number, string, np.nan (default) or None
right now this is consistent with SimpleImputer and ChainedImputer in fact.
@amueller I updated the example by doing a feature union of the output of the imputer and the missing indicator. This is probably the use case that each imputer should handle in the future...
I think this LGTM. Thanks for completing it @glemaitre. Given recent issues, though, I wonder if we should make sure missing_values=pickle.loads(pickle.dumps(np.nan)) works also.
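A small demonstration of why a pickled NaN needs care: NaN never compares equal to anything, including itself, so equality checks on `missing_values` silently fail and `np.isnan` must be used instead.

```python
import pickle

import numpy as np

# Round-trip NaN through pickle, as the reviewer suggests testing.
nan_roundtrip = pickle.loads(pickle.dumps(np.nan))

# Equality with NaN is always False; identity can also break after
# pickling, so only np.isnan reliably detects it.
equal_check = (nan_roundtrip == np.nan)   # False
isnan_check = np.isnan(nan_roundtrip)     # True
```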
sklearn/impute.py
Outdated
error_on_new : boolean, optional
    If True (default), transform will raise an error when there are
    features with missing values in transform but have no missing values in
but -> that
assert isinstance(X_fit_mask, np.ndarray)
assert isinstance(X_trans_mask, np.ndarray)
else:
    if sparse.issparse(X_fit):
Why is this not another `elif`?
Because it is true for all other param_sparse and missing_values combinations.
…t-learn into maniteja123-imputer_missing_values
A similar test is done in the common test now.
Merge on green?
We need to merge #11391 first.
@maniteja123 can you rebase now that #11391 is merged?
Done @agramfort
And will we be opening issues to add a …
Yes. Do we want to have it in 0.20 though?
Tests passed in Travis.
Did someone open these issues? I don't see them linked.
I don't think it's been opened. At the moment we are only offering one other imputer. (Apparently the KNN imputer is still blocking on your review @amueller.)
MissingIndicator transformer for the missing values indicator mask.
See #6556.
What does this implement/fix? Explain your changes.
The current implementation returns an indicator mask for the missing values.
Any other comments?
It is a very initial attempt and currently no tests are present. Please do have a look and give suggestions on the design. Thanks!
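As a minimal illustration of the idea described above (not the PR's implementation), an indicator mask for missing values encoded as -1 is just an elementwise comparison:

```python
import numpy as np

# Dataset with missing values encoded as -1
X = np.array([[1, -1, 3],
              [-1, 2, -1]])

# The indicator mask marks where values are missing
indicator_mask = (X == -1)
```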