Commit 726fa36

maniteja123 authored and amueller committed

[MRG+1] MissingIndicator transformer (#8075)

MissingIndicator transformer for the missing values indicator mask. See #6556.

#### What does this implement/fix? Explain your changes.

The current implementation returns an indicator mask for the missing values.

#### Any other comments?

This is a very initial attempt and no tests are present yet. Please have a look and give suggestions on the design. Thanks!

- [x] Implementation
- [x] Documentation
- [x] Tests

1 parent 211ded8 · commit 726fa36

File tree

7 files changed: +412 −19 lines changed

doc/modules/classes.rst (2 additions, 1 deletion)

@@ -656,7 +656,8 @@ Kernels:

    impute.SimpleImputer
    impute.ChainedImputer
-
+   impute.MissingIndicator
+
 .. _kernel_approximation_ref:

 :mod:`sklearn.kernel_approximation` Kernel Approximation

doc/modules/impute.rst (46 additions, 1 deletion)

@@ -121,7 +121,6 @@ Both :class:`SimpleImputer` and :class:`ChainedImputer` can be used in a Pipeline
 as a way to build a composite estimator that supports imputation.
 See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.

-
 .. _multiple_imputation:

 Multiple vs. Single Imputation
@@ -142,3 +141,49 @@ random seeds with the ``n_imputations`` parameter set to 1.
 Note that a call to the ``transform`` method of :class:`ChainedImputer` is not
 allowed to change the number of samples. Therefore multiple imputations cannot be
 achieved by a single call to ``transform``.
+
+.. _missing_indicator:
+
+Marking imputed values
+======================
+
+The :class:`MissingIndicator` transformer is useful for transforming a dataset
+into the corresponding binary matrix indicating the presence of missing values
+in the dataset. This transformation is useful in conjunction with imputation:
+when imputing, preserving the information about which values were originally
+missing can be informative.
+
+``NaN`` is usually used as the placeholder for missing values. However, it
+forces the data type to be float. The ``missing_values`` parameter allows
+specifying another placeholder, such as an integer. In the following example,
+we will use ``-1`` as the missing value::
+
+  >>> from sklearn.impute import MissingIndicator
+  >>> X = np.array([[-1, -1, 1, 3],
+  ...               [4, -1, 0, -1],
+  ...               [8, -1, 1, 0]])
+  >>> indicator = MissingIndicator(missing_values=-1)
+  >>> mask_missing_values_only = indicator.fit_transform(X)
+  >>> mask_missing_values_only
+  array([[ True,  True, False],
+         [False,  True,  True],
+         [False,  True, False]])
+
+The ``features`` parameter is used to choose the features for which the mask is
+constructed. By default it is ``'missing-only'``, which returns the imputer
+mask of the features containing missing values at ``fit`` time::
+
+  >>> indicator.features_
+  array([0, 1, 3])
+
+The ``features`` parameter can be set to ``'all'`` to return all features,
+whether or not they contain missing values::
+
+  >>> indicator = MissingIndicator(missing_values=-1, features="all")
+  >>> mask_all = indicator.fit_transform(X)
+  >>> mask_all
+  array([[ True,  True, False, False],
+         [False,  True, False,  True],
+         [False,  True, False, False]])
+  >>> indicator.features_
+  array([0, 1, 2, 3])
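For readers wanting to see exactly what ``fit_transform`` computes in the documented example, the ``'missing-only'`` mask can be reproduced with plain NumPy. This is a sketch of the behaviour on a dense array, not the estimator itself:

```python
import numpy as np

# Same toy matrix as in the documentation example, with -1 as the
# missing-value placeholder.
X = np.array([[-1, -1, 1, 3],
              [4, -1, 0, -1],
              [8, -1, 1, 0]])

# Boolean mask of missing entries.
mask = (X == -1)

# Columns containing at least one missing value, mirroring the
# ``features_`` attribute when ``features='missing-only'``.
features = np.flatnonzero(mask.any(axis=0))

# Restrict the mask to those columns, as ``transform`` does.
mask_missing_only = mask[:, features]
print(features)  # -> [0 1 3]
```

The restricted mask matches the ``mask_missing_values_only`` array shown above.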

doc/whats_new/v0.20.rst (4 additions, 0 deletions)

@@ -149,6 +149,10 @@ Preprocessing
    back to the original space via an inverse transform. :issue:`9041` by
    `Andreas Müller`_ and :user:`Guillaume Lemaitre <glemaitre>`.

+- Added :class:`MissingIndicator` which generates a binary indicator for
+  missing values. :issue:`8075` by :user:`Maniteja Nandana <maniteja123>` and
+  :user:`Guillaume Lemaitre <glemaitre>`.
+
 - Added :class:`impute.ChainedImputer`, which is a strategy for imputing missing
   values by modeling each feature with missing values as a function of
   other features in a round-robin fashion. :issue:`8478` by

examples/plot_missing_values.py (20 additions, 16 deletions)

@@ -4,15 +4,19 @@
 ====================================================

 Missing values can be replaced by the mean, the median or the most frequent
-value using the basic ``SimpleImputer``.
+value using the basic :class:`sklearn.impute.SimpleImputer`.
 The median is a more robust estimator for data with high magnitude variables
 which could dominate results (otherwise known as a 'long tail').

-Another option is the ``ChainedImputer``. This uses round-robin linear
-regression, treating every variable as an output in turn. The version
-implemented assumes Gaussian (output) variables. If your features are obviously
-non-Normal, consider transforming them to look more Normal so as to improve
-performance.
+Another option is the :class:`sklearn.impute.ChainedImputer`. This uses
+round-robin linear regression, treating every variable as an output in
+turn. The version implemented assumes Gaussian (output) variables. If your
+features are obviously non-Normal, consider transforming them to look more
+Normal so as to improve performance.
+
+In addition to using an imputation method, we can also keep an indicator of
+the missing information using :class:`sklearn.impute.MissingIndicator`, which
+might carry some information.
 """

 import numpy as np
@@ -21,8 +25,8 @@
 from sklearn.datasets import load_diabetes
 from sklearn.datasets import load_boston
 from sklearn.ensemble import RandomForestRegressor
-from sklearn.pipeline import Pipeline
-from sklearn.impute import SimpleImputer, ChainedImputer
+from sklearn.pipeline import make_pipeline, make_union
+from sklearn.impute import SimpleImputer, ChainedImputer, MissingIndicator
 from sklearn.model_selection import cross_val_score

 rng = np.random.RandomState(0)
@@ -60,18 +64,18 @@ def get_results(dataset):
     X_missing = X_full.copy()
     X_missing[np.where(missing_samples)[0], missing_features] = 0
     y_missing = y_full.copy()
-    estimator = Pipeline([("imputer", SimpleImputer(missing_values=0,
-                                                    strategy="mean")),
-                          ("forest", RandomForestRegressor(random_state=0,
-                                                           n_estimators=100))])
+    estimator = make_pipeline(
+        make_union(SimpleImputer(missing_values=0, strategy="mean"),
+                   MissingIndicator(missing_values=0)),
+        RandomForestRegressor(random_state=0, n_estimators=100))
     mean_impute_scores = cross_val_score(estimator, X_missing, y_missing,
                                          scoring='neg_mean_squared_error')

     # Estimate the score after chained imputation of the missing values
-    estimator = Pipeline([("imputer", ChainedImputer(missing_values=0,
-                                                     random_state=0)),
-                          ("forest", RandomForestRegressor(random_state=0,
-                                                           n_estimators=100))])
+    estimator = make_pipeline(
+        make_union(ChainedImputer(missing_values=0, random_state=0),
+                   MissingIndicator(missing_values=0)),
+        RandomForestRegressor(random_state=0, n_estimators=100))
     chained_impute_scores = cross_val_score(estimator, X_missing, y_missing,
                                             scoring='neg_mean_squared_error')

sklearn/impute.py (223 additions, 0 deletions)

@@ -35,6 +35,7 @@
                                            'predictor'])

 __all__ = [
+    'MissingIndicator',
     'SimpleImputer',
     'ChainedImputer',
 ]
@@ -975,3 +976,225 @@ def fit(self, X, y=None):
         """
         self.fit_transform(X)
         return self
+
+
+class MissingIndicator(BaseEstimator, TransformerMixin):
+    """Binary indicators for missing values.
+
+    Parameters
+    ----------
+    missing_values : number, string, np.nan (default) or None
+        The placeholder for the missing values. All occurrences of
+        ``missing_values`` will be indicated (True in the output mask).
+
+    features : str, optional
+        Whether the imputer mask should represent all or a subset of
+        features.
+
+        - If "missing-only" (default), the imputer mask will only represent
+          features containing missing values during fit time.
+        - If "all", the imputer mask will represent all features.
+
+    sparse : boolean or "auto", optional
+        Whether the imputer mask format should be sparse or dense.
+
+        - If "auto" (default), the imputer mask will be of same type as
+          input.
+        - If True, the imputer mask will be a sparse matrix.
+        - If False, the imputer mask will be a numpy array.
+
+    error_on_new : boolean, optional
+        If True (default), transform will raise an error when there are
+        features with missing values in transform that have no missing values
+        in fit. This is applicable only when ``features="missing-only"``.
+
+    Attributes
+    ----------
+    features_ : ndarray, shape (n_missing_features,) or (n_features,)
+        The features indices which will be returned when calling ``transform``.
+        They are computed during ``fit``. For ``features='all'``, it is equal
+        to ``range(n_features)``.
+
+    Examples
+    --------
+    >>> import numpy as np
+    >>> from sklearn.impute import MissingIndicator
+    >>> X1 = np.array([[np.nan, 1, 3],
+    ...                [4, 0, np.nan],
+    ...                [8, 1, 0]])
+    >>> X2 = np.array([[5, 1, np.nan],
+    ...                [np.nan, 2, 3],
+    ...                [2, 4, 0]])
+    >>> indicator = MissingIndicator()
+    >>> indicator.fit(X1)
+    MissingIndicator(error_on_new=True, features='missing-only',
+             missing_values=nan, sparse='auto')
+    >>> X2_tr = indicator.transform(X2)
+    >>> X2_tr
+    array([[False,  True],
+           [ True, False],
+           [False, False]])
+
+    """
+
+    def __init__(self, missing_values=np.nan, features="missing-only",
+                 sparse="auto", error_on_new=True):
+        self.missing_values = missing_values
+        self.features = features
+        self.sparse = sparse
+        self.error_on_new = error_on_new
+
+    def _get_missing_features_info(self, X):
+        """Compute the imputer mask and the indices of the features
+        containing missing values.
+
+        Parameters
+        ----------
+        X : {ndarray or sparse matrix}, shape (n_samples, n_features)
+            The input data with missing values. Note that ``X`` has been
+            checked in ``fit`` and ``transform`` before calling this function.
+
+        Returns
+        -------
+        imputer_mask : {ndarray or sparse matrix}, shape \
+(n_samples, n_features) or (n_samples, n_features_with_missing)
+            The imputer mask of the original data.
+
+        features_with_missing : ndarray, shape (n_features_with_missing,)
+            The features containing missing values.
+
+        """
+        if sparse.issparse(X) and self.missing_values != 0:
+            mask = _get_mask(X.data, self.missing_values)
+
+            # The imputer mask will be constructed with the same sparse format
+            # as X.
+            sparse_constructor = (sparse.csr_matrix if X.format == 'csr'
+                                  else sparse.csc_matrix)
+            imputer_mask = sparse_constructor(
+                (mask, X.indices.copy(), X.indptr.copy()),
+                shape=X.shape, dtype=bool)
+
+            missing_values_mask = imputer_mask.copy()
+            missing_values_mask.eliminate_zeros()
+            features_with_missing = (
+                np.flatnonzero(np.diff(missing_values_mask.indptr))
+                if missing_values_mask.format == 'csc'
+                else np.unique(missing_values_mask.indices))
+
+            if self.sparse is False:
+                imputer_mask = imputer_mask.toarray()
+            elif imputer_mask.format == 'csr':
+                imputer_mask = imputer_mask.tocsc()
+        else:
+            if sparse.issparse(X):
+                # case of sparse matrix with 0 as missing values. Implicit and
+                # explicit zeros are considered as missing values.
+                X = X.toarray()
+            imputer_mask = _get_mask(X, self.missing_values)
+            features_with_missing = np.flatnonzero(imputer_mask.sum(axis=0))
+
+            if self.sparse is True:
+                imputer_mask = sparse.csc_matrix(imputer_mask)
+
+        return imputer_mask, features_with_missing
+
+    def fit(self, X, y=None):
+        """Fit the transformer on X.
+
+        Parameters
+        ----------
+        X : {array-like, sparse matrix}, shape (n_samples, n_features)
+            Input data, where ``n_samples`` is the number of samples and
+            ``n_features`` is the number of features.
+
+        Returns
+        -------
+        self : object
+            Returns self.
+        """
+        if not is_scalar_nan(self.missing_values):
+            force_all_finite = True
+        else:
+            force_all_finite = "allow-nan"
+        X = check_array(X, accept_sparse=('csc', 'csr'),
+                        force_all_finite=force_all_finite)
+        _check_inputs_dtype(X, self.missing_values)
+
+        self._n_features = X.shape[1]
+
+        if self.features not in ('missing-only', 'all'):
+            raise ValueError("'features' has to be either 'missing-only' or "
+                             "'all'. Got {} instead.".format(self.features))
+
+        if not ((isinstance(self.sparse, six.string_types) and
+                self.sparse == "auto") or isinstance(self.sparse, bool)):
+            raise ValueError("'sparse' has to be a boolean or 'auto'. "
+                             "Got {!r} instead.".format(self.sparse))
+
+        self.features_ = (self._get_missing_features_info(X)[1]
+                          if self.features == 'missing-only'
+                          else np.arange(self._n_features))
+
+        return self
+
+    def transform(self, X):
+        """Generate missing values indicator for X.
+
+        Parameters
+        ----------
+        X : {array-like, sparse matrix}, shape (n_samples, n_features)
+            The input data to complete.
+
+        Returns
+        -------
+        Xt : {ndarray or sparse matrix}, shape (n_samples, n_features)
+            The missing indicator for input data. The data type of ``Xt``
+            will be boolean.
+
+        """
+        check_is_fitted(self, "features_")
+
+        if not is_scalar_nan(self.missing_values):
+            force_all_finite = True
+        else:
+            force_all_finite = "allow-nan"
+        X = check_array(X, accept_sparse=('csc', 'csr'),
+                        force_all_finite=force_all_finite)
+        _check_inputs_dtype(X, self.missing_values)
+
+        if X.shape[1] != self._n_features:
+            raise ValueError("X has a different number of features "
+                             "than during fitting.")
+
+        imputer_mask, features = self._get_missing_features_info(X)
+
+        if self.features == "missing-only":
+            features_diff_fit_trans = np.setdiff1d(features, self.features_)
+            if (self.error_on_new and features_diff_fit_trans.size > 0):
+                raise ValueError("The features {} have missing values "
+                                 "in transform but have no missing values "
+                                 "in fit.".format(features_diff_fit_trans))
+
+            if (self.features_.size > 0 and
+                    self.features_.size < self._n_features):
+                imputer_mask = imputer_mask[:, self.features_]
+
+        return imputer_mask
+
+    def fit_transform(self, X, y=None):
+        """Generate missing values indicator for X.
+
+        Parameters
+        ----------
+        X : {array-like, sparse matrix}, shape (n_samples, n_features)
+            The input data to complete.
+
+        Returns
+        -------
+        Xt : {ndarray or sparse matrix}, shape (n_samples, n_features)
+            The missing indicator for input data. The data type of ``Xt``
+            will be boolean.
+
+        """
+        return self.fit(X, y).transform(X)
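The `error_on_new` logic in `transform` can be illustrated in isolation. Below is a hypothetical standalone helper (for illustration only, not part of sklearn) that mimics the check on a dense array: build the mask, then raise if a feature has missing values at transform time but had none at fit time.

```python
import numpy as np

def indicator_transform(fit_features, X_new, missing_values=-1,
                        error_on_new=True):
    """Mimic the error_on_new check from MissingIndicator.transform on a
    dense array. ``fit_features`` plays the role of the fitted
    ``features_`` attribute. (Hypothetical helper for illustration.)"""
    mask = (X_new == missing_values)
    new_features = np.flatnonzero(mask.any(axis=0))
    diff = np.setdiff1d(new_features, fit_features)
    if error_on_new and diff.size > 0:
        raise ValueError("The features {} have missing values in transform "
                         "but have no missing values in fit.".format(diff))
    # Restrict the mask to the features seen as missing at fit time.
    return mask[:, fit_features]

# Feature 1 was complete at "fit" time, so a -1 there triggers the error.
fit_features = np.array([0, 2])
X_bad = np.array([[1, -1, 3],
                  [4, 5, 6]])
try:
    indicator_transform(fit_features, X_bad)
except ValueError as exc:
    print(exc)

# With error_on_new=False the mask is simply restricted to fit_features.
mask = indicator_transform(fit_features, X_bad, error_on_new=False)
```

This mirrors why the check only makes sense for `features='missing-only'`: with `features='all'`, `features_` already covers every column, so no transform-time feature can be "new".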
