[MRG+1] MissingIndicator transformer #8075
Conversation
Force-pushed 7fcdbf4 to 678f2c3 (compare)
Hi, I am not entirely aware of the …
Ok, thanks @maniteja123. Sorry for the stupid question.
No problem @tguillemot. No question is stupid. Thanks for asking. :) Probably I should add a small example to make the use case clear to understand.
Indeed, it's always useful.
#7084 relates to missing values; this helps solve missing-value problems.
Add some tests please. I'll look at the implementation soon.
sklearn/preprocessing/imputation.py
Outdated
    `missing_values` will be imputed. For missing values encoded as np.nan,
    use the string value "NaN".

axis : integer, optional (default=0)
I don't think we need this.
I understand the missing indicator is independent of axis. So would missing_indicators='train' mean it only indicates the features with missing values at fit time?
Yes
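A minimal numpy sketch of the semantics agreed on here (names like `feat_with_missing` are illustrative, not the PR's API): fit records which columns contained missing values, and transform indicates only those columns.

```python
import numpy as np

# Hypothetical sketch of the features='train' behaviour under discussion:
# fit records which columns contain missing values; transform returns the
# indicator mask restricted to those columns.
missing_values = -1

X_fit = np.array([[1, -1, 3],
                  [4, 5, -1]])
# columns that had at least one missing value at fit time
feat_with_missing = np.where((X_fit == missing_values).any(axis=0))[0]

X_new = np.array([[-1, 2, -1]])
indicator = (X_new[:, feat_with_missing] == missing_values)
```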
sklearn/preprocessing/imputation.py
Outdated
    - If `axis=0`, then impute along columns.
    - If `axis=1`, then impute along rows.

missing_indicators : [None, "all", "train", indices/mask]
I think `features` would be an adequate name, or `which_features` or something.
sklearn/preprocessing/imputation.py
Outdated
    If "train"
    If array

copy : boolean, optional (default=True)
not applicable?
sklearn/preprocessing/imputation.py
Outdated
Attributes
----------
feat_with_missing_ : array of shape (n_missing_features,)
    The features with missing values.
"in fit"
Reduce indentation.
Note that this is only stored if features == 'train'
Thanks. This is heading in the right direction.
Once it's looking good, we'll talk about integrating it into Imputer, and adding a summary feature which indicates the presence of any missing values in a row.
sklearn/preprocessing/imputation.py
Outdated
    feat_with_missing = mask_matrix.sum(axis=0).nonzero()[1]
    # ravel since nonzero returns 2d matrices for sparse in scipy 0.11
    feat_with_missing = np.asarray(feat_with_missing).ravel()
I'm pretty sure `np.ravel(feat_with_missing)` will suffice.
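A quick check of that suggestion (numpy/scipy only; `mask_matrix` stands in for the boolean mask in the PR): `np.ravel` flattens the 2-D result that `nonzero()` yields for the `np.matrix` produced by a sparse `.sum(axis=0)`.

```python
import numpy as np
from scipy import sparse

# For sparse input, .sum(axis=0) yields an np.matrix, so nonzero()
# returns 2-D results; np.ravel flattens either case to a 1-D array.
mask_matrix = sparse.csr_matrix(np.array([[1, 0, 0],
                                          [0, 0, 1]]))
feat_with_missing = mask_matrix.sum(axis=0).nonzero()[1]
feat_with_missing = np.ravel(feat_with_missing)
```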
sklearn/preprocessing/imputation.py
Outdated
if self.features == "train":
    features = np.setdiff1d(self.feat_with_missing_,
        feat_with_missing)
indent to match previous line's self
sklearn/preprocessing/imputation.py
Outdated
features = np.setdiff1d(self.feat_with_missing_,
                        feat_with_missing)
if features:
    warnings.warn("The features %s have missing "
I don't think this is a case we should warn about. The opposite (missing values in transform but none in fit), perhaps.
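A hedged sketch of the opposite check suggested here (the array names are illustrative): warn only for features that have missing values at transform time but had none at fit time.

```python
import warnings

import numpy as np

# Features with missing values observed at fit vs. transform time
fit_features = np.array([1, 3])        # had missing values in fit
transform_features = np.array([1, 2])  # have missing values in transform

# setdiff1d finds features missing only at transform time
new_features = np.setdiff1d(transform_features, fit_features)
if new_features.size:
    warnings.warn("Features %s have missing values in transform "
                  "but had none in fit." % new_features)
```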
sklearn/preprocessing/imputation.py
Outdated
elif self.features == "all":
    imputer_mask = imputer_mask

elif isinstance(self.features, (np.ndarray, list, tuple)):
tuples and lists are treated differently by numpy and I'm not sure we should support tuples.
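For illustration, how numpy treats a list versus a bare tuple as an index, which is why supporting tuples here is questionable:

```python
import numpy as np

X = np.arange(12).reshape(3, 4)

# A list is fancy indexing: selects columns 0 and 2 of every row.
cols = X[:, [0, 2]]          # shape (3, 2)

# A bare tuple is multi-axis indexing: X[(0, 2)] is X[0, 2], a scalar.
scalar = X[(0, 2)]           # element at row 0, column 2
```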
X2_tr = MI.transform(X2)
mask = X2[:, features] == -1
assert_array_equal(X2_tr, mask)

remove blanks
sklearn/preprocessing/imputation.py
Outdated
self : object
    Returns self.
"""
if (isinstance(self.features, six.string_types) and
Should probably also validate that:
- `features`, if array-like, is an integer array
- `sparse` has a valid value
sklearn/preprocessing/imputation.py
Outdated
    `missing_values` will be imputed. For missing values encoded as np.nan,
    use the string value "NaN".

features : [None, "all", "train", array(indices/mask)]
{'train' (default), 'all', array-like of int}
MI = clone(MI).set_params(features="train")
MI.fit(X1)
X2_tr = MI.transform(X2)
features = MI.feat_with_missing_
Need to check that this is correctly computed.
    [0, -1, 5, -1],
    [11, -1, 1, 1]
])

mask = X2 == -1 # can be done before loop
for X1, X2, missing_values in [(X1_orig, X2_orig, -1), (X1_orig + 1, X2_orig + 1, 0)]:
for retype in [np.array, lambda x: x.tolist(), sparse.csr_matrix, sparse.csc_matrix, sparse.lil_matrix]:
Could you explain what `retype` means. I guess it is "return type", but does it mean it should call MI.fit(retype(X1)) and MI.transform(retype(X2))?
MI.fit(X1)
X2_tr = MI.transform(X2)
mask = X2[:, features] == -1
assert_array_equal(X2_tr, mask)
Also need to assert the warning case, and assert that validation error messages are produced correctly.
Also, it might be a good idea to add something to the narrative documentation at this point, explaining the motivation for such features, and briefly describing the operation. It would be good to add an example here in the docstring too. Perhaps you should add a task list to the PR description.
retype was a bad name for "the type you want to change it to".
Whichever sparse type is most appropriate here. COO? CSC?
X2_tr = MI.transform(X2)
features = MI.feat_with_missing_
assert_array_equal(expect_feat_missing, features)
assert_array_equal(np.asarray(X2_tr), mask[:, features])
Sorry Joel for the many questions, but I am confused here. Suppose I take the case where X1 and X2 are sparse or lists: how do I check for equality between X2_tr and mask[:, features]?
`X2_tr` should be an array or sparse matrix regardless of `X2`, no?
Yeah, sorry, it is an array or sparse matrix. I was asking about equality between the sparse matrices, because X2_tr will be sparse while mask is an array. Is it okay to densify and use assert_array_equal?
In a test, yes.
I don't know much about sparse arrays, but I was thinking COO might be a good choice since there might be fewer missing values in general. Please do correct me if I am thinking about it the wrong way.
I don't get your purported motivation for COO. If you construct the matrix in a column-wise fashion, CSC might be the best choice.
Oh I see, but the matrix which is being constructed by this using … Also if we return as CSC, then the …
For auto, return it in the format it came in. Otherwise, return CSC. You don't need to calculate indices and indptr manually. Calling …
Thanks @jnothman, just one more clarification. In case of sparse matrix and …
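A sketch of the structure-reuse idea under discussion, assuming missing values encoded as -1 (this is not the PR's exact code): copy the input matrix and recompute only its `.data`, rather than building `indices` and `indptr` by hand.

```python
import numpy as np
from scipy import sparse

missing_values = -1
X = sparse.csc_matrix(np.array([[0., -1., 2.],
                                [3., 0., -1.]]))

# Reuse X's sparsity structure: the mask can only be nonzero where X
# stores an entry, so copying and rewriting .data is enough.
imputer_mask = X.copy()
imputer_mask.data = (X.data == missing_values).astype(X.dtype)
imputer_mask.eliminate_zeros()  # drop explicit zeros for non-missing entries
```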
Yes, I guess so.
…On 28 December 2016 at 15:51, Maniteja Nandana wrote:
Thanks @jnothman, just one more clarification. In case of sparse matrix and missing values = 0, currently a dense matrix is returned. Should it be the same even when sparse='auto'?
Thanks, but the LIL format wouldn't contain indices, right? So this line fails for it. Also, should the return type be a numpy array when a list is passed to transform? Sorry for so many questions, but the tests are failing for these scenarios.
Yes, return type should be a numpy array, except if it should be a sparse matrix, in which case the returned format should be specified in the docstring, but need not be the same as the input type. check_array will convert to acceptable types.
sklearn/preprocessing/imputation.py
Outdated
if sparse.issparse(X) and self.missing_values != 0:
    # sparse matrix and missing values is not zero
    imputer_mask = _get_mask(X.data, self.missing_values)
    imputer_mask = X.__class__((imputer_mask, X.indices.copy(),
                                X.indptr.copy()), shape=X.shape,
                               dtype=X.dtype)

print 'here' + str(type(X)) + str(type(imputer_mask))
debug print to be removed...?
I'll have a look over the transformation and tests once you're happy you know how to determine handling of transform output types etc.
doc/modules/preprocessing.rst
Outdated
Transformer indicating missing values
=====================================

MissingIndicator transformer is useful to transform a dataset into corresponding
use :class:`MissingIndicator`
doc/modules/preprocessing.rst
Outdated
MissingIndicator(features='train', missing_values=-1, sparse='auto')
>>> X2_tr = MI.transform(X2)
>>> X2_tr
array([[False, False, True],
I think we should probably be returning ints, not bools.
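For reference, converting a boolean mask to the suggested integer indicator is a one-liner:

```python
import numpy as np

bool_mask = np.array([[False, False, True],
                      [True, False, False]])
# 0/1 integer indicator instead of booleans
int_mask = bool_mask.astype(int)
```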
sklearn/preprocessing/imputation.py
Outdated
X : {array-like, sparse matrix}, shape (n_samples, n_features)
    Input data, where ``n_samples`` is the number of samples and
    ``n_features`` is the number of features.
Returns
blank line before this, please
sklearn/preprocessing/imputation.py
Outdated
def transform(self, X):
    """Impute all missing values in X.
    Parameters
blank line
sklearn/preprocessing/imputation.py
Outdated
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
    The input data to complete.
Returns
blank line
sklearn/preprocessing/imputation.py
Outdated
    The input data to complete.
Returns
-------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
traditionally Xt
sklearn/preprocessing/imputation.py
Outdated
Returns
-------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
    The transformerwith missing indicator.
Huh?
Non zero missing values
Zero missing values*
Hi @jnothman, sorry for the delay. Could you look at the above return types and let me know if it works? Thanks.
I think that's correct. But to be sure, we could write a test that checks that auto corresponds to the same type/format as output by the imputer in that case.
Force-pushed 5357a1b to 41c2596 (compare)
@maniteja123 any updates? :)
Does it need a review? Can you rename to MRG if so... I can look into it next week...
Hi @raghavrv, thanks for reminding me. I believe it can be reviewed, though the tests could be more comprehensive regarding the return type for sparse and dense matrices. Will look at fixing the failing tests and then ping you.
Codecov Report
@@ Coverage Diff @@
## master #8075 +/- ##
==========================================
+ Coverage 95.48% 95.48% +<.01%
==========================================
Files 342 342
Lines 60987 61096 +109
==========================================
+ Hits 58233 58339 +106
- Misses 2754 2757 +3
Continue to review full report at Codecov.
…puter_missing_values
Can you maybe add this to plot_missing_values.py or something? It would be good to have an example that uses this. Ideally we'd package that into …
Parameters
----------
missing_values : number, string, np.nan (default) or None
what does number mean and why is np.nan not a number? Maybe just move the np.nan to the end?
number means real number. It's just to fit this in one line.
I think by definition nan is not a number :)
but the dtype is also important, isn't it? I find "float or int" more natural than "number or np.nan" but I don't have a strong opinion.
I agree that "float or int" is better than number, but I think it's important to keep np.nan visible since it should be a common value for missing_values. Maybe something like int, float, string or None (default=np.nan)?
right now this is consistent with SimpleImputer and ChainedImputer in fact.
Parameters
----------
missing_values : number, string, np.nan (default) or None
right now this is consistent with SimpleImputer and ChainedImputer in fact.
@amueller I updated the example by doing a feature union of the output of the imputer and the missing indicator. This is probably the use case that each imputer should handle in the future...
I think this LGTM. Thanks for completing it @glemaitre. Given recent issues, though, I wonder if we should make sure missing_values=pickle.loads(pickle.dumps(np.nan)) works also.
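A small demonstration of why a pickled NaN needs care: NaN never compares equal to anything, including itself, so equality checks on `missing_values` silently fail and `np.isnan` must be used instead.

```python
import pickle

import numpy as np

# Round-trip NaN through pickle, as the reviewer suggests testing.
nan_roundtrip = pickle.loads(pickle.dumps(np.nan))

# Equality with NaN is always False; identity can also break after
# pickling, so only np.isnan reliably detects it.
equal_check = (nan_roundtrip == np.nan)   # False
isnan_check = np.isnan(nan_roundtrip)     # True
```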
sklearn/impute.py
Outdated
error_on_new : boolean, optional
    If True (default), transform will raise an error when there are
    features with missing values in transform but have no missing values in
but -> that
assert isinstance(X_fit_mask, np.ndarray)
assert isinstance(X_trans_mask, np.ndarray)
else:
    if sparse.issparse(X_fit):
Why is this not another `elif`?
Because it is true for all other param_sparse and missing_values combinations.
…t-learn into maniteja123-imputer_missing_values
A similar test is done in the common test now.
Merge on green?
We need to merge #11391 first.
@maniteja123 can you rebase now that #11391 is merged?
Done @agramfort
And will we be opening issues to add a …
Yes. Do we want to have it in 0.20 though?
Tests passed in Travis.
Did someone open these issues? I don't see them linked.
I don't think it's been opened. At the moment we are only offering one other imputer. (Apparently the KNN imputer is still blocking on your review @amueller.)
MissingIndicator transformer for the missing values indicator mask.
See #6556.
What does this implement/fix? Explain your changes.
The current implementation returns an indicator mask for the missing values.
Any other comments?
It is a very initial attempt and currently no tests are present. Please do have a look and give suggestions on the design. Thanks!
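As a minimal illustration of the idea described above (not the PR's implementation), an indicator mask for missing values encoded as -1 is just an elementwise comparison:

```python
import numpy as np

# Dataset with missing values encoded as -1
X = np.array([[1, -1, 3],
              [-1, 2, -1]])

# The indicator mask marks where values are missing
indicator_mask = (X == -1)
```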