[MRG] Parameter for stacking missing indicator into imputer #12583

Merged
53 commits
5f8fbc4
Added the parameter to SimpleImputer that allow to stack MissingIndic…
DanilBaibak Nov 14, 2018
c59e3a4
Added tests for the new functionality
DanilBaibak Nov 14, 2018
d175c89
Updated docs
DanilBaibak Nov 14, 2018
5fd811e
Fixed code style issues
DanilBaibak Nov 14, 2018
357c881
Use sparse.hstack for stacking the sparse matrices
DanilBaibak Nov 15, 2018
2a0b17b
Added tests for the case of the sparse matrices
DanilBaibak Nov 15, 2018
c32968c
Fixed tests
DanilBaibak Nov 15, 2018
173a3dd
Added an entry to the 0.21 what's new
DanilBaibak Nov 16, 2018
e74cb1c
Resolve conflicts after merge with master
DanilBaibak Nov 16, 2018
cf2e48f
Improved tests
DanilBaibak Nov 19, 2018
40e11c1
Code review improvements
DanilBaibak Nov 19, 2018
d38abee
Fixed what's new doc
DanilBaibak Nov 19, 2018
79b454c
Moved MissingIndicator initialization into the fit method
DanilBaibak Nov 19, 2018
636c4a6
Improved tests
DanilBaibak Nov 19, 2018
c4388ea
Added fix for the case when one column is totally missing
DanilBaibak Nov 19, 2018
29401fc
Added more tests for the case with missing column
DanilBaibak Nov 19, 2018
795b57f
Improved the tests
DanilBaibak Nov 20, 2018
73a9cf8
Adjusted MissingIndicator to always drop constant columns
DanilBaibak Dec 10, 2018
e671c8e
Fixed issue with indexes for the fully missing columns
DanilBaibak Dec 21, 2018
c5fc3fc
Improved the tests
DanilBaibak Dec 21, 2018
feb0679
Improved the descriptions
DanilBaibak Dec 21, 2018
22350ce
Fixed tests
DanilBaibak Jan 8, 2019
0bbb57b
Code review improvements regarding naming and descriptions
DanilBaibak Jan 8, 2019
fce7ee1
option to drop full missing columns (dense)
jeremiedbb Mar 21, 2019
cb35f6c
deal with sparse
jeremiedbb Mar 21, 2019
90490f0
what's new
jeremiedbb Mar 21, 2019
5090cce
tst
jeremiedbb Mar 25, 2019
f65248b
Resolved conflicts after merge with master
DanilBaibak Mar 25, 2019
a8d95e5
add test for explicit zeros
jeremiedbb Mar 25, 2019
b0f4fdd
what's new
jeremiedbb Mar 25, 2019
49f63be
Improved code quality
DanilBaibak Mar 26, 2019
fba8ff4
simpler code. fix typos. wording.
jeremiedbb Mar 27, 2019
92f7e44
Reverted changes for the MissingIndicator
DanilBaibak Mar 28, 2019
455c952
Resolved conflicts after merge with #13491
DanilBaibak Mar 28, 2019
8c10635
Set the parameter 'features' for MissingIndicator as 'some-missing' by…
DanilBaibak Mar 28, 2019
afcab04
Improved code quality
DanilBaibak Mar 29, 2019
85d6c88
Improved the documentation
DanilBaibak Mar 29, 2019
16e4561
Fixed the line length
DanilBaibak Mar 29, 2019
b94e05e
Added ability to stack IterativeImputer and MissingIndicator
DanilBaibak Mar 31, 2019
f4c7328
Fixed the line length
DanilBaibak Mar 31, 2019
d57db9d
update docstring + typo
jeremiedbb Apr 1, 2019
37c1cf8
Merge remote-tracking branch 'upstream/master' into missing-indicator…
jeremiedbb Apr 1, 2019
f65055e
update error_on_new
jeremiedbb Apr 1, 2019
db10a82
add test
jeremiedbb Apr 1, 2019
7834b9f
Rollback the changes for the IterativeImputer
DanilBaibak Apr 2, 2019
5743b8d
Resolved conflicts after merge
DanilBaibak Apr 2, 2019
261a76f
Improved the description
DanilBaibak Apr 2, 2019
5481518
Added the description for the indicator_ in the Attributes section
DanilBaibak Apr 3, 2019
a226d01
Updated the description in the Attributes section
DanilBaibak Apr 4, 2019
83e7ffa
Resolved the conflicts after merging with master
DanilBaibak Apr 8, 2019
e899906
Fixed the tests after merge with master
DanilBaibak Apr 8, 2019
ec9c7f2
Fixed code style issue
DanilBaibak Apr 8, 2019
73e7a97
Code review improvements
DanilBaibak Apr 9, 2019
12 changes: 7 additions & 5 deletions doc/modules/impute.rst
@@ -45,10 +45,11 @@ that contain the missing values::
>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> imp = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]]) # doctest: +NORMALIZE_WHITESPACE
SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean', verbose=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]]) # doctest: +NORMALIZE_WHITESPACE
SimpleImputer(add_indicator=False, copy=True, fill_value=None,
missing_values=nan, strategy='mean', verbose=0)
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X)) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
>>> print(imp.transform(X)) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
[[4. 2. ]
[6. 3.666...]
[7. 6. ]]
@@ -59,9 +60,10 @@ The :class:`SimpleImputer` class also supports sparse matrices::
>>> X = sp.csc_matrix([[1, 2], [0, -1], [8, 4]])
>>> imp = SimpleImputer(missing_values=-1, strategy='mean')
>>> imp.fit(X) # doctest: +NORMALIZE_WHITESPACE
SimpleImputer(copy=True, fill_value=None, missing_values=-1, strategy='mean', verbose=0)
SimpleImputer(add_indicator=False, copy=True, fill_value=None,
missing_values=-1, strategy='mean', verbose=0)
>>> X_test = sp.csc_matrix([[-1, 2], [6, -1], [7, 6]])
>>> print(imp.transform(X_test).toarray()) # doctest: +NORMALIZE_WHITESPACE
>>> print(imp.transform(X_test).toarray()) # doctest: +NORMALIZE_WHITESPACE
[[3. 2.]
[6. 3.]
[7. 6.]]
6 changes: 6 additions & 0 deletions doc/whats_new/v0.21.rst
@@ -255,6 +255,12 @@ Support for Python 3.4 and below has been officially dropped.
used to be kept if there were no missing values at all. :issue:`13562` by
:user:`Jérémie du Boisberranger <jeremiedbb>`.

- |Feature| The :class:`impute.SimpleImputer` has a new parameter
``'add_indicator'``, which simply stacks a :class:`impute.MissingIndicator`
transform into the output of the imputer's transform. That allows a predictive
estimator to account for missingness. :issue:`12583` by
:user:`Danylo Baibak <DanilBaibak>`.
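
As a rough illustration of the entry above, the following sketch shows what the stacked output looks like on a small array. It assumes scikit-learn 0.21 or later is installed; the input data is made up for the example and is not part of the pull request.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data (an assumption): only column 0 contains a missing value.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean",
                        add_indicator=True)
X_trans = imputer.fit_transform(X)

# Columns 0-1 hold the imputed features; the appended column flags the
# rows where column 0 was missing before imputation.
print(X_trans)
```

Only features that had missing values during `fit` contribute an indicator column, so the result here has three columns rather than four.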

:mod:`sklearn.isotonic`
.......................

37 changes: 32 additions & 5 deletions sklearn/impute.py
@@ -141,13 +141,26 @@ class SimpleImputer(BaseEstimator, TransformerMixin):
a new copy will always be made, even if `copy=False`:

- If X is not an array of floating values;
- If X is encoded as a CSR matrix.
- If X is encoded as a CSR matrix;
- If add_indicator=True.

add_indicator : boolean, optional (default=False)
If True, a `MissingIndicator` transform is stacked onto the output
of the imputer's transform. This allows a predictive estimator
to account for missingness despite imputation. If a feature has no
missing values at fit/train time, the feature won't appear in
the missing indicator even if there are missing values at
transform/test time.

Attributes
----------
statistics_ : array of shape (n_features,)
The imputation fill value for each feature.

indicator_ : :class:`sklearn.impute.MissingIndicator`
Indicator used to add binary indicators for missing values.
``None`` if add_indicator is False.

See also
--------
IterativeImputer : Multivariate imputation of missing values.
@@ -159,8 +172,8 @@ class SimpleImputer(BaseEstimator, TransformerMixin):
>>> imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
... # doctest: +NORMALIZE_WHITESPACE
SimpleImputer(copy=True, fill_value=None, missing_values=nan,
strategy='mean', verbose=0)
SimpleImputer(add_indicator=False, copy=True, fill_value=None,
missing_values=nan, strategy='mean', verbose=0)
>>> X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
>>> print(imp_mean.transform(X))
... # doctest: +NORMALIZE_WHITESPACE
@@ -175,12 +188,13 @@ class SimpleImputer(BaseEstimator, TransformerMixin):

"""
def __init__(self, missing_values=np.nan, strategy="mean",
fill_value=None, verbose=0, copy=True):
fill_value=None, verbose=0, copy=True, add_indicator=False):
self.missing_values = missing_values
self.strategy = strategy
self.fill_value = fill_value
self.verbose = verbose
self.copy = copy
self.add_indicator = add_indicator

def _validate_input(self, X):
allowed_strategies = ["mean", "median", "most_frequent", "constant"]
@@ -272,6 +286,13 @@ def fit(self, X, y=None):
self.missing_values,
fill_value)

if self.add_indicator:
self.indicator_ = MissingIndicator(
missing_values=self.missing_values)
self.indicator_.fit(X)
else:
self.indicator_ = None

return self
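
The fit step above delegates to `MissingIndicator`. A small sketch of what that fitted object records, assuming scikit-learn 0.21 or later and using made-up data, might look like this:

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Illustrative data (an assumption): columns 0 and 2 contain NaN, column 1 does not.
X = np.array([[np.nan, 1.0, 3.0],
              [4.0, 0.0, np.nan],
              [8.0, 1.0, 0.0]])

indicator = MissingIndicator(missing_values=np.nan)
mask = indicator.fit_transform(X)

# features_ lists the columns that had missing values during fit;
# the mask has one boolean column per listed feature.
print(indicator.features_)
print(mask)
```

With the default `features` setting, column 1 is left out of the mask because it had no missing values when the indicator was fitted.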

def _sparse_fit(self, X, strategy, missing_values, fill_value):
@@ -285,7 +306,6 @@ def _sparse_fit(self, X, strategy, missing_values, fill_value):
# for constant strategy, self.statistics_ is used to store
# fill_value in each column
statistics.fill(fill_value)

else:
for i in range(X.shape[1]):
column = X.data[X.indptr[i]:X.indptr[i + 1]]
@@ -382,6 +402,9 @@ def transform(self, X):
raise ValueError("X has %d features per sample, expected %d"
% (X.shape[1], self.statistics_.shape[0]))

if self.add_indicator:
X_trans_indicator = self.indicator_.transform(X)

# Delete the invalid columns if strategy is not constant
if self.strategy == "constant":
valid_statistics = statistics
@@ -420,6 +443,10 @@ def transform(self, X):

X[coordinates] = values

if self.add_indicator:
hstack = sparse.hstack if sparse.issparse(X) else np.hstack
X = hstack((X, X_trans_indicator))

return X
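
The stacking step above picks `sparse.hstack` or `np.hstack` depending on the type of the imputed data. A standalone sketch of that dispatch, using a hypothetical helper name `stack_indicator` and made-up inputs, could look like this:

```python
import numpy as np
from scipy import sparse


def stack_indicator(X_imputed, X_indicator):
    """Hypothetical helper: append indicator columns to imputed data."""
    # Sparse input needs scipy's hstack; dense input uses NumPy's.
    hstack = sparse.hstack if sparse.issparse(X_imputed) else np.hstack
    return hstack((X_imputed, X_indicator))


# Dense case: two imputed features plus one indicator column.
dense_out = stack_indicator(np.array([[4.0, 2.0], [6.0, 3.0]]),
                            np.array([[1.0], [0.0]]))

# Sparse case: same values, and the result stays sparse.
sparse_out = stack_indicator(sparse.csc_matrix([[4.0, 2.0], [6.0, 3.0]]),
                             sparse.csc_matrix([[1.0], [0.0]]))
```

Choosing the function once keeps the sparse result sparse instead of forcing a dense conversion.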

def _more_tags(self):
57 changes: 54 additions & 3 deletions sklearn/tests/test_impute.py
@@ -952,15 +952,15 @@ def test_missing_indicator_error(X_fit, X_trans, params, msg_err):
])
@pytest.mark.parametrize(
"param_features, n_features, features_indices",
[('missing-only', 2, np.array([0, 1])),
[('missing-only', 3, np.array([0, 1, 2])),
('all', 3, np.array([0, 1, 2]))])
def test_missing_indicator_new(missing_values, arr_type, dtype, param_features,
n_features, features_indices):
X_fit = np.array([[missing_values, missing_values, 1],
[4, missing_values, 2]])
[4, 2, missing_values]])
X_trans = np.array([[missing_values, missing_values, 1],
[4, 12, 10]])
X_fit_expected = np.array([[1, 1, 0], [0, 1, 0]])
X_fit_expected = np.array([[1, 1, 0], [0, 0, 1]])
X_trans_expected = np.array([[1, 1, 0], [0, 0, 0]])

# convert the input to the right array format and right dtype
@@ -1144,3 +1144,54 @@ def test_missing_indicator_sparse_no_explicit_zeros():
Xt = mi.fit_transform(X)

assert Xt.getnnz() == Xt.sum()


@pytest.mark.parametrize("marker", [np.nan, -1, 0])
def test_imputation_add_indicator(marker):
X = np.array([
[marker, 1, 5, marker, 1],
[2, marker, 1, marker, 2],
[6, 3, marker, marker, 3],
[1, 2, 9, marker, 4]
])
X_true = np.array([
[3., 1., 5., 1., 1., 0., 0., 1.],
[2., 2., 1., 2., 0., 1., 0., 1.],
[6., 3., 5., 3., 0., 0., 1., 1.],
[1., 2., 9., 4., 0., 0., 0., 1.]
])

imputer = SimpleImputer(missing_values=marker, add_indicator=True)
X_trans = imputer.fit_transform(X)

assert_allclose(X_trans, X_true)
assert_array_equal(imputer.indicator_.features_, np.array([0, 1, 2, 3]))


@pytest.mark.parametrize(
"arr_type",
[
sparse.csc_matrix, sparse.csr_matrix, sparse.coo_matrix,
sparse.lil_matrix, sparse.bsr_matrix
]
)
def test_imputation_add_indicator_sparse_matrix(arr_type):
X_sparse = arr_type([
[np.nan, 1, 5],
[2, np.nan, 1],
[6, 3, np.nan],
[1, 2, 9]
])
X_true = np.array([
[3., 1., 5., 1., 0., 0.],
[2., 2., 1., 0., 1., 0.],
[6., 3., 5., 0., 0., 1.],
[1., 2., 9., 0., 0., 0.],
])

imputer = SimpleImputer(missing_values=np.nan, add_indicator=True)
X_trans = imputer.fit_transform(X_sparse)

assert sparse.issparse(X_trans)
assert X_trans.shape == X_true.shape
assert_allclose(X_trans.toarray(), X_true)
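
The expected matrix in the sparse test above can be reproduced by hand. The following NumPy-only sketch is not part of the pull request; it only shows how mean imputation followed by indicator stacking yields those six columns for the same input values.

```python
import numpy as np

# Same values as the test input above, as a dense float array.
X = np.array([[np.nan, 1, 5],
              [2, np.nan, 1],
              [6, 3, np.nan],
              [1, 2, 9]], dtype=float)

mask = np.isnan(X)                        # True where a value is missing
col_means = np.nanmean(X, axis=0)         # per-column mean ignoring NaN
X_imputed = np.where(mask, col_means, X)  # fill missing entries with the mean

# Keep indicator columns only for features that had missing values.
kept = mask.any(axis=0)
X_expected = np.hstack((X_imputed, mask[:, kept].astype(float)))
print(X_expected)
```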