
[MRG+1] MissingIndicator transformer #8075


Merged: 57 commits, Jul 16, 2018

Conversation

maniteja123
Contributor

@maniteja123 maniteja123 commented Dec 18, 2016

MissingIndicator transformer for the missing values indicator mask.
see #6556

What does this implement/fix? Explain your changes.

The current implementation returns an indicator mask for the missing values.

Any other comments?

It is a very initial attempt and currently no tests are present. Please have a look and give suggestions on the design. Thanks!

  • Implementation
  • Documentation
  • Tests

@maniteja123 maniteja123 force-pushed the imputer_missing_values branch from 7fcdbf4 to 678f2c3 Compare December 18, 2016 10:12
@tguillemot
Contributor

@jnothman @raghavrv Sorry for my stupid question, but what is the difference with #7084?

@maniteja123
Contributor Author

Hi, I am not entirely aware of the ValueDropper, but it seems to be used to drop a specific percentage of values per feature for the corresponding classes. OTOH, this MissingIndicator (name not yet finalized) will just be an indicator of the presence of missing values in the input, and the transformed output is just a binary matrix. Hope I got it right!
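A minimal sketch of the idea in plain NumPy (not the final API; the marker value and names here are illustrative):

```python
import numpy as np

# Toy data where -1 marks a missing value.
X = np.array([[1, -1, 3],
              [4, 5, -1]])

# The transformer's output is just a binary mask of where values are missing.
mask = (X == -1).astype(int)
print(mask)
# [[0 1 0]
#  [0 0 1]]
```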

@tguillemot
Contributor

Ok, thanks @maniteja123. Sorry for the stupid question.

@maniteja123
Contributor Author

No problem @tguillemot. No question is stupid. Thanks for asking. :) Probably I should add a small example to make the use case easier to understand.

@tguillemot
Contributor

Indeed, it's always useful.

@jnothman
Member

#7084 makes missing values. This helps solve missing value problems.

Member

@jnothman jnothman left a comment

Add some tests please. I'll look at the implementation soon.

`missing_values` will be imputed. For missing values encoded as np.nan,
use the string value "NaN".

axis : integer, optional (default=0)
Member

I don't think we need this.

Contributor Author

I understand the missing indicator is independent of axis. So would missing_indicators='train' mean only indicating the features that had missing values at fit time?

Member

Yes

- If `axis=0`, then impute along columns.
- If `axis=1`, then impute along rows.

missing_indicators : [None, "all", "train", indices/mask]
Member

I think features would be an adequate name, or which_features or something.

If "train"
If array

copy : boolean, optional (default=True)
Member

not applicable?

Attributes
----------
feat_with_missing_ : array of shape(n_missing_features, )
The features with missing values.
Member

"in fit"

Member

Reduce indentation.

Note that this is only stored if features == 'train'

Member

@jnothman jnothman left a comment

Thanks. This is heading the right direction.

Once it's looking good, we'll talk about integrating it into Imputer, and adding a summary feature which indicates the presence of any missing values in a row.


feat_with_missing = mask_matrix.sum(axis=0).nonzero()[1]
# ravel since nonzero returns 2d matrices for sparse in scipy 0.11
feat_with_missing = np.asarray(feat_with_missing).ravel()
Member

I'm pretty sure np.ravel(feat_with_missing) will suffice.
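For illustration, a standalone sketch of that simplification, using a hypothetical indicator mask: summing a scipy sparse matrix along axis 0 yields an np.matrix, and np.ravel flattens the nonzero column indices in one step, whatever container nonzero returns:

```python
import numpy as np
from scipy import sparse

# Hypothetical indicator mask: missing entries in columns 1 and 3.
mask_matrix = sparse.csr_matrix([[0, 1, 0, 1],
                                 [0, 1, 0, 0]])

col_counts = mask_matrix.sum(axis=0)                   # np.matrix, shape (1, 4)
feat_with_missing = np.ravel(col_counts.nonzero()[1])  # flattens in one step
print(feat_with_missing)                               # [1 3]
```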


if self.features == "train":
features = np.setdiff1d(self.feat_with_missing_,
feat_with_missing)
Member

indent to match previous line's self

features = np.setdiff1d(self.feat_with_missing_,
feat_with_missing)
if features:
warnings.warn("The features %s have missing "
Member

I don't think this is a case we should warn about. The opposite (missing values in transform but none in fit), perhaps.
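The two directions can be sketched with np.setdiff1d (variable names are illustrative):

```python
import numpy as np

feat_fit = np.array([0, 2])         # features with missing values at fit time
feat_transform = np.array([0, 1])   # features with missing values at transform time

# Harmless direction: feature 2 was missing at fit but not at transform.
# The direction worth flagging: feature 1 is missing at transform time
# but had no missing values at fit time.
new_at_transform = np.setdiff1d(feat_transform, feat_fit)
print(new_at_transform)             # [1]
```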

elif self.features == "all":
imputer_mask = imputer_mask

elif isinstance(self.features, (np.ndarray, list, tuple)):
Member

tuples and lists are treated differently by numpy and I'm not sure we should support tuples.
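A quick illustration of why they differ: NumPy treats a tuple as a multi-dimensional index, but a list as fancy indexing along the first axis:

```python
import numpy as np

X = np.arange(6).reshape(2, 3)  # [[0 1 2], [3 4 5]]

print(X[[0, 1]])   # list -> selects rows 0 and 1: [[0 1 2] [3 4 5]]
print(X[(0, 1)])   # tuple -> equivalent to X[0, 1]: 1
```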

X2_tr = MI.transform(X2)
mask = X2[:, features] == -1
assert_array_equal(X2_tr, mask)

Member

remove blanks

self : object
Returns self.
"""
if (isinstance(self.features, six.string_types) and
Member

Should probably also validate that:

  • features, if an array-like, is an integer array
  • sparse has a valid value

`missing_values` will be imputed. For missing values encoded as np.nan,
use the string value "NaN".

features : [None, "all", "train", array(indices/mask)]
Member

{'train' (default), 'all', array-like of int}

MI = clone(MI).set_params(features = "train")
MI.fit(X1)
X2_tr = MI.transform(X2)
features = MI.feat_with_missing_
Member

Need to check that this is correctly computed.

[0, -1, 5, -1],
[11, -1, 1, 1]
])

Member

mask = X2 == -1  # can be done before loop
for X1, X2, missing_values in [(X1_orig, X2_orig, -1), (X1_orig + 1, X2_orig + 1, 0)]:
    for retype in [np.array, lambda x: x.tolist(), sparse.csr_matrix, sparse.csc_matrix, sparse.lil_matrix]:

Contributor Author

Could you explain what retype means? I guess it is "return type", but does it mean it should call MI.fit(retype(X1)) and MI.transform(retype(X2))?

MI.fit(X1)
X2_tr = MI.transform(X2)
mask = X2[:, features] == -1
assert_array_equal(X2_tr, mask)
Member

Also need to assert the warning case, and assert that validation error messages are produced correctly.

@jnothman
Member

Also, it might be a good idea to add something to the narrative documentation at this point, explaining the motivation for such features, and briefly describing the operation.

It would be good to add an example here in the docstring too.

Perhaps you should add a task list to the PR description

@maniteja123
Contributor Author

Hi @jnothman, thanks a lot for the detailed review. I have tried to address most of the comments. It would be great if you could explain the meaning of retype in your comment here.

Also, if sparse=True, should transform always return sparse output irrespective of the input, and if so, which sparse type?

Thanks.

@jnothman
Member

retype was a bad name for "the type you want to change it to"

@jnothman
Member

Whichever sparse type is most appropriate here. COO? CSC?

X2_tr = MI.transform(X2)
features = MI.feat_with_missing_
assert_array_equal(expect_feat_missing, features)
assert_array_equal(np.asarray(X2_tr), mask[:, features])
Contributor Author

Sorry Joel for the many questions, but I am confused here. Suppose I take the case where X1 and X2 are sparse or lists; how do I check for equality between X2_tr and mask[:, features]?

Member

X2_tr should be an array or sparse matrix regardless of X2, no?

Contributor Author

Yeah, sorry, it is an array or sparse matrix. I was asking about the equality check because X2_tr will be sparse while mask is an array. Is it okay to densify and use assert_array_equal?

Member

In a test, yes.

@maniteja123
Contributor Author

I don't know much about sparse arrays, but I was thinking COO might be a good choice since there might be fewer missing values in general. Please do correct me if I am thinking about it the wrong way.

@jnothman
Member

I don't get your purported motivation for COO. If you construct the matrix in a column-wise fashion, CSC might be the best choice.

@maniteja123
Contributor Author

maniteja123 commented Dec 23, 2016

Oh I see, but the matrix here is being constructed using get_mask; that is not column-wise, right?

Also, if we return CSC, do the indices and indptr need to be calculated accordingly?

@jnothman
Copy link
Member

For auto, return it in the format it came in. Otherwise, return CSC. You don't need to calculate indices and indptr manually. Calling csc_matrix (or .tocsc() if already sparse) on it will suffice.
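A standalone sketch of that approach on toy data (details differ from the actual PR code): build the boolean mask over the stored entries of a sparse matrix, then convert with .tocsc() rather than computing indices and indptr by hand:

```python
import numpy as np
from scipy import sparse

X = sparse.csr_matrix(np.array([[1., -1., 0.],
                                [-1., 2., 3.]]))

# Mask the stored entries only; this works when the missing marker is
# non-zero (here -1), since true zeros are simply not stored.
mask_data = X.data == -1
imputer_mask = sparse.csr_matrix((mask_data, X.indices.copy(),
                                  X.indptr.copy()), shape=X.shape)
imputer_mask.eliminate_zeros()       # drop explicit False entries
imputer_mask = imputer_mask.tocsc()  # no manual index bookkeeping needed

print(imputer_mask.toarray().astype(int))
# [[0 1 0]
#  [1 0 0]]
```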

@maniteja123
Contributor Author

Thanks @jnothman , just one more clarification. In case of sparse matrix and missing values = 0 currently a dense matrix is returned. Should it be the same even when sparse='auto' ?

@jnothman
Member

jnothman commented Dec 28, 2016 via email

@maniteja123
Contributor Author

maniteja123 commented Dec 28, 2016

Thanks, but the lil format wouldn't contain indices, right? So this line fails for it.

Also, should the return type be a numpy array when a list is passed to transform? Sorry for so many questions, but the tests are failing for these scenarios.

Member

@jnothman jnothman left a comment

Yes, return type should be a numpy array, except if it should be a sparse matrix, in which case the returned format should be specified in the docstring, but need not be the same as the input type. check_array will convert to acceptable types.
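For example, check_array from sklearn.utils (which input validation typically goes through) normalizes list input to an ndarray and restricts which sparse formats are accepted:

```python
import numpy as np
from sklearn.utils import check_array

X_list = [[1.0, 2.0], [3.0, 4.0]]

# Lists come back as ndarrays; sparse input would be converted to an
# accepted format ('csc' or 'csr' here).
X = check_array(X_list, accept_sparse=('csc', 'csr'))
print(type(X).__name__)   # ndarray
```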

if sparse.issparse(X) and self.missing_values != 0:
# sparse matrix and missing values is not zero
imputer_mask = _get_mask(X.data, self.missing_values)
imputer_mask = X.__class__((imputer_mask, X.indices.copy(),
X.indptr.copy()), shape=X.shape,
dtype=X.dtype)

print 'here' + str(type(X)) + str(type(imputer_mask))
Member

debug print to be removed...?

Member

@jnothman jnothman left a comment

I'll have a look over the transformation and tests once you're happy you know how to determine handling of transform output types etc.

Transformer indicating missing values
=====================================

MissingIndicator transformer is useful to transform a dataset into corresponding
Member

use

:class:`MissingIndicator`

MissingIndicator(features='train', missing_values=-1, sparse='auto')
>>> X2_tr = MI.transform(X2)
>>> X2_tr
array([[False, False, True],
Member

I think we should probably be returning ints, not bools.

X : {array-like, sparse matrix}, shape (n_samples, n_features)
Input data, where ``n_samples`` is the number of samples and
``n_features`` is the number of features.
Returns
Member

blank line before this, please


def transform(self, X):
"""Impute all missing values in X.
Parameters
Member

blank line

----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
The input data to complete.
Returns
Member

blank line

The input data to complete.
Returns
-------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Member

traditionally Xt

Returns
-------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
The transformerwith missing indicator.
Member

Huh?

@maniteja123
Contributor Author

Non-zero missing values:

sparse \ input   ndarray      csr_matrix   csc_matrix   lil_matrix
True             csc_matrix   csc_matrix   csc_matrix   csc_matrix
False            ndarray      ndarray      ndarray      ndarray
auto             ndarray      csc_matrix   csc_matrix   csc_matrix

Zero missing values:

sparse \ input   ndarray      csr_matrix   csc_matrix   lil_matrix
True             csc_matrix   csc_matrix   csc_matrix   csc_matrix
False            ndarray      ndarray      ndarray      ndarray
auto             ndarray      ndarray      ndarray      ndarray

Hi @jnothman , sorry for the delay. Could you look at the above return types and let me know if it works ? Thanks.

@jnothman
Member

I think that's correct. But to be sure, we could write a test that checks that auto corresponds to the same type/format as output by imputer in that case.

@maniteja123 maniteja123 force-pushed the imputer_missing_values branch from 5357a1b to 41c2596 Compare February 18, 2017 19:26
@raghavrv
Member

raghavrv commented Mar 4, 2017

@maniteja123 any updates? :)

@raghavrv
Member

raghavrv commented Mar 4, 2017

Does it need a review? Can you rename to MRG if so... I can look into it next week...

@maniteja123
Contributor Author

maniteja123 commented Mar 4, 2017

Hi @raghavrv, thanks for reminding me. I believe it can be reviewed, though the tests could be more comprehensive regarding the return types for sparse and dense matrices. Will look at fixing the failing tests and then ping you.

@codecov

codecov bot commented Mar 4, 2017

Codecov Report

Merging #8075 into master will increase coverage by <.01%.
The diff coverage is 97.24%.

@@            Coverage Diff             @@
##           master    #8075      +/-   ##
==========================================
+ Coverage   95.48%   95.48%   +<.01%     
==========================================
  Files         342      342              
  Lines       60987    61096     +109     
==========================================
+ Hits        58233    58339     +106     
- Misses       2754     2757       +3
Impacted Files Coverage Δ
sklearn/preprocessing/__init__.py 100% <100%> (ø)
sklearn/preprocessing/tests/test_imputation.py 100% <100%> (ø)
sklearn/preprocessing/imputation.py 94.23% <94.44%> (+0.07%)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6852a03...fa0c023. Read the comment docs.

@glemaitre
Member

@jnothman @amueller any review on this feature?

@amueller
Member

can you maybe add this to plot_missing_values.py or something? It would be good to have an example that uses this. Ideally we'd package that into SimpleImputer but for now we can make it explicit and it should still be good for ChainedImputer.


Parameters
----------
missing_values : number, string, np.nan (default) or None
Member

what does number mean and why is np.nan not a number? Maybe just move the np.nan to the end?

Member

number means real number. It's just to fit this in one line.
I think by definition nan is not a number :)

Member

but the dtype is also important, isn't it? I find "float or int" more natural than "number or np.nan" but I don't have a strong opinion.

Member

I agree that "float or int" is better than number, but I think it's important to keep np.nan visible since it should be a common value for missing_values. Maybe something like
int, float, string or None (default=np.nan) ?

Member

right now this is consistent with SimpleImputer and ChainedImputer in fact.


Parameters
----------
missing_values : number, string, np.nan (default) or None
Member

right now this is consistent with SimpleImputer and ChainedImputer in fact.

@glemaitre
Member

@amueller I updated the example by doing a feature union of the output of the imputer and the missing indicator. This is probably the use case that each imputer should handle in the future...
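With the API as it eventually landed in sklearn.impute, that pattern looks roughly like this (a sketch, not the exact example script):

```python
import numpy as np
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import make_union

X = np.array([[1., np.nan],
              [3., 4.]])

# Feature union: imputed values, with the missing-indicator mask appended
# as extra columns (only for features that had missing values at fit time).
transformer = make_union(SimpleImputer(strategy='mean'),
                         MissingIndicator())
Xt = transformer.fit_transform(X)
print(Xt)
# [[1. 4. 1.]
#  [3. 4. 0.]]
```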

Member

@jnothman jnothman left a comment

I think this LGTM. Thanks for completing it @glemaitre. Given recent issues, though, I wonder if we should make sure missing_values=pickle.loads(pickle.dumps(np.nan)) works also.


error_on_new : boolean, optional
If True (default), transform will raise an error when there are
features with missing values in transform but have no missing values in
Member

but -> that

assert isinstance(X_fit_mask, np.ndarray)
assert isinstance(X_trans_mask, np.ndarray)
else:
if sparse.issparse(X_fit):
Member

Why is this not another elif?

Member

Because it is true for all other param_sparse and missing_values combinations.

@glemaitre
Member

I think this LGTM. Thanks for completing it @glemaitre. Given recent issues, though, I wonder if we should make sure missing_values=pickle.loads(pickle.dumps(np.nan)) works also.

A similar test is done in the common test now.
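The concern is easy to reproduce: a nan that round-trips through pickle is still nan, but naive equality checks miss it, so handling must go through isnan:

```python
import pickle
import numpy as np

unpickled_nan = pickle.loads(pickle.dumps(np.nan))

# nan never compares equal to anything, including itself, and unpickling
# may return a distinct float object -- so use isnan, not == or is.
print(unpickled_nan == np.nan)   # False
print(np.isnan(unpickled_nan))   # True
```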

@amueller
Member

merge on green?

@glemaitre
Member

We need to merge #11391 first

@agramfort
Member

@maniteja123 can you rebase now that #11391 is merged?

@glemaitre
Member

glemaitre commented Jul 16, 2018 via email

@jnothman
Member

And will we be opening issues to add a add_indicator param to other imputers?

@glemaitre
Member

And will we be opening issues to add a add_indicator param to other imputers?

Yes. Do we want to have it in 0.20 though?

@glemaitre
Member

Tests passed in Travis.

@amueller amueller merged commit 726fa36 into scikit-learn:master Jul 16, 2018
@amueller
Member

Did someone open these issues? I don't see them linked.

@jnothman
Member

I don't think it's been opened. Atm we are only offering one other imputer. (Apparently knn imputer is still blocking on your review @amueller)
