[MRG+1] MissingIndicator transformer #8075


Merged 57 commits on Jul 16, 2018
Changes from all commits (57 commits)
6ff7800
Initial commit for missing values indicator
maniteja123 Dec 17, 2016
98e28a7
Change documentation, remove axis and add simple test
maniteja123 Dec 20, 2016
38c58e2
Add documentation and tests
maniteja123 Dec 22, 2016
781d07d
Add sparse option functionality
maniteja123 Dec 28, 2016
ec6d69a
Modify tests
maniteja123 Jan 3, 2017
605d189
Add comprehensive tests
maniteja123 Feb 2, 2017
ca8af65
Common tests
maniteja123 Feb 18, 2017
07c0fce
fix astype usage
maniteja123 Mar 4, 2017
f02d78a
pep fixes
maniteja123 Mar 4, 2017
2379edb
Implement fit_transform
maniteja123 Mar 6, 2017
552a2cb
modify doc [ci skip]
maniteja123 Mar 6, 2017
0d980e4
fix failing tests
maniteja123 Mar 7, 2017
a1a6982
Change default to np.NaN
maniteja123 Jun 5, 2017
91b0122
Error when transform has features with missing values while not durin…
maniteja123 Jun 6, 2017
4137ed3
Doc and test changes
maniteja123 Jul 19, 2017
fb3d55a
Documentation changes and remove duplicate code
maniteja123 Aug 17, 2017
500aa65
fix tests
maniteja123 Aug 21, 2017
3e7c4c1
fix estimator common tests
maniteja123 Aug 25, 2017
426d179
fix sparse array tests
maniteja123 Aug 25, 2017
9b0e9be
fix sparse array tests
maniteja123 Aug 25, 2017
c45ad5f
fix sparse array tests
maniteja123 Aug 25, 2017
cc23f13
fix sparse array tests
maniteja123 Aug 25, 2017
f50d649
address comments and exception tests
maniteja123 Jan 12, 2018
70e06f7
Move MissingIndicator to impute.py
maniteja123 Feb 18, 2018
37f19a3
fix flake8 comments
maniteja123 Feb 18, 2018
313a71b
docstring changes
maniteja123 Feb 18, 2018
1a064ca
Merge remote-tracking branch 'origin/master' into maniteja123-imputer…
glemaitre Apr 23, 2018
8c956c7
FIX add change in estimator checks
glemaitre Apr 23, 2018
feddbcb
FIX error during solving conflicts
glemaitre Apr 23, 2018
49ef207
EHN address code reviews
glemaitre Apr 27, 2018
712b2f4
PEP8
glemaitre Apr 27, 2018
5c495e1
DOC address comments documentation
glemaitre Apr 27, 2018
1efcd82
TST parametrize error test
glemaitre Apr 27, 2018
50bc29c
reverse useless change
glemaitre Apr 27, 2018
e3abbc6
PEP8
glemaitre Apr 27, 2018
492967c
TST parametrize test and split tests
glemaitre Apr 27, 2018
7df0d14
FIX typo in tests
glemaitre Apr 27, 2018
12103ad
FIX change default type to bool
glemaitre Apr 28, 2018
007c6e3
EHN add a not regarding the default dtype
glemaitre Apr 28, 2018
a29128c
Merge branch 'master' into imputer_missing_values
jnothman Jun 26, 2018
74679e6
Insert missing comma
jnothman Jun 26, 2018
754c4e3
update
glemaitre Jun 29, 2018
b895f7c
FIX raise error with inconsistent dtype X and missing_values
glemaitre Jun 29, 2018
da633f5
Merge remote-tracking branch 'origin/master' into maniteja123-imputer…
glemaitre Jun 29, 2018
ea6f8e8
Merge remote-tracking branch 'glemaitre/is/11390' into maniteja123-im…
glemaitre Jun 29, 2018
44dbc91
solve issue with NaN as string
glemaitre Jun 29, 2018
34fb9a3
address jeremy comments
glemaitre Jun 29, 2018
3abc695
address andy comments
glemaitre Jun 29, 2018
05226fd
PEP8
glemaitre Jun 29, 2018
7695551
DOC fix doc parameter
glemaitre Jun 29, 2018
8c199ba
Merge remote-tracking branch 'origin/master' into maniteja123-imputer…
glemaitre Jul 10, 2018
17a0caa
Merge remote-tracking branch 'glemaitre/is/11390' into maniteja123-im…
glemaitre Jul 10, 2018
d4ca8a8
EXA show an example using MissingIndicator
glemaitre Jul 12, 2018
52d1c02
Update plot_missing_values.py
glemaitre Jul 12, 2018
76558e5
DOC fix
glemaitre Jul 15, 2018
51c0aa4
Merge branch 'imputer_missing_values' of github.com:maniteja123/sciki…
glemaitre Jul 15, 2018
82d766d
Merge remote-tracking branch 'origin/master' into maniteja123-imputer…
glemaitre Jul 16, 2018
3 changes: 2 additions & 1 deletion doc/modules/classes.rst
@@ -656,7 +656,8 @@ Kernels:

impute.SimpleImputer
impute.ChainedImputer

impute.MissingIndicator

.. _kernel_approximation_ref:

:mod:`sklearn.kernel_approximation` Kernel Approximation
47 changes: 46 additions & 1 deletion doc/modules/impute.rst
@@ -121,7 +121,6 @@ Both :class:`SimpleImputer` and :class:`ChainedImputer` can be used in a Pipeline
as a way to build a composite estimator that supports imputation.
See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.


.. _multiple_imputation:

Multiple vs. Single Imputation
@@ -142,3 +141,49 @@ random seeds with the ``n_imputations`` parameter set to 1.
Note that a call to the ``transform`` method of :class:`ChainedImputer` is not
allowed to change the number of samples. Therefore multiple imputations cannot be
achieved by a single call to ``transform``.

.. _missing_indicator:

Marking imputed values
======================

The :class:`MissingIndicator` transformer transforms a dataset into a
corresponding binary matrix indicating the presence of missing values in the
dataset. This transformation is useful in conjunction with imputation: when
imputing, preserving the information about which values were missing can be
informative.

``NaN`` is usually used as the placeholder for missing values. However, it
forces the data type to be float. The ``missing_values`` parameter allows
specifying a different placeholder, such as an integer. In the following
example, we will use ``-1`` as the missing value::

>>> from sklearn.impute import MissingIndicator
>>> X = np.array([[-1, -1, 1, 3],
... [4, -1, 0, -1],
... [8, -1, 1, 0]])
>>> indicator = MissingIndicator(missing_values=-1)
>>> mask_missing_values_only = indicator.fit_transform(X)
>>> mask_missing_values_only
array([[ True, True, False],
[False, True, True],
[False, True, False]])

The ``features`` parameter is used to choose the features for which the mask is
constructed. By default, it is ``'missing-only'`` which returns the imputer
mask of the features containing missing values at ``fit`` time::

>>> indicator.features_
array([0, 1, 3])

The ``features`` parameter can be set to ``'all'`` to return all features,
whether or not they contain missing values::

>>> indicator = MissingIndicator(missing_values=-1, features="all")
>>> mask_all = indicator.fit_transform(X)
>>> mask_all
array([[ True, True, False, False],
[False, True, False, True],
[False, True, False, False]])
>>> indicator.features_
array([0, 1, 2, 3])
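The ``'missing-only'`` selection boils down to a NumPy comparison and a
column reduction; a minimal sketch (not the actual implementation) that
reproduces the example above:

```python
import numpy as np

X = np.array([[-1, -1, 1, 3],
              [4, -1, 0, -1],
              [8, -1, 1, 0]])

# True wherever the placeholder value appears
mask_all = (X == -1)
# indices of features with at least one missing value ("missing-only")
features_ = np.flatnonzero(mask_all.any(axis=0))
mask_missing_only = mask_all[:, features_]
```

The full mask corresponds to ``features='all'``; slicing its columns by
``features_`` yields the ``'missing-only'`` output shown in the doctest.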
4 changes: 4 additions & 0 deletions doc/whats_new/v0.20.rst
@@ -149,6 +149,10 @@ Preprocessing
back to the original space via an inverse transform. :issue:`9041` by
`Andreas Müller`_ and :user:`Guillaume Lemaitre <glemaitre>`.

- Added :class:`MissingIndicator` which generates a binary indicator for
missing values. :issue:`8075` by :user:`Maniteja Nandana <maniteja123>` and
:user:`Guillaume Lemaitre <glemaitre>`.

- Added :class:`impute.ChainedImputer`, which is a strategy for imputing missing
values by modeling each feature with missing values as a function of
other features in a round-robin fashion. :issue:`8478` by
36 changes: 20 additions & 16 deletions examples/plot_missing_values.py
@@ -4,15 +4,19 @@
====================================================

Missing values can be replaced by the mean, the median or the most frequent
value using the basic ``SimpleImputer``.
value using the basic :class:`sklearn.impute.SimpleImputer`.
The median is a more robust estimator for data with high magnitude variables
which could dominate results (otherwise known as a 'long tail').

Another option is the ``ChainedImputer``. This uses round-robin linear
regression, treating every variable as an output in turn. The version
implemented assumes Gaussian (output) variables. If your features are obviously
non-Normal, consider transforming them to look more Normal so as to improve
performance.
Another option is the :class:`sklearn.impute.ChainedImputer`. This uses
round-robin linear regression, treating every variable as an output in
turn. The version implemented assumes Gaussian (output) variables. If your
features are obviously non-Normal, consider transforming them to look more
Normal so as to improve performance.

In addition to using an imputation method, we can also keep an indicator of
which values were missing using :class:`sklearn.impute.MissingIndicator`,
which may carry some information.
"""

import numpy as np
@@ -21,8 +25,8 @@
from sklearn.datasets import load_diabetes
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, ChainedImputer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.impute import SimpleImputer, ChainedImputer, MissingIndicator
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
@@ -60,18 +64,18 @@ def get_results(dataset):
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
estimator = Pipeline([("imputer", SimpleImputer(missing_values=0,
strategy="mean")),
("forest", RandomForestRegressor(random_state=0,
n_estimators=100))])
estimator = make_pipeline(
make_union(SimpleImputer(missing_values=0, strategy="mean"),
MissingIndicator(missing_values=0)),
RandomForestRegressor(random_state=0, n_estimators=100))
mean_impute_scores = cross_val_score(estimator, X_missing, y_missing,
scoring='neg_mean_squared_error')

# Estimate the score after chained imputation of the missing values
estimator = Pipeline([("imputer", ChainedImputer(missing_values=0,
random_state=0)),
("forest", RandomForestRegressor(random_state=0,
n_estimators=100))])
estimator = make_pipeline(
make_union(ChainedImputer(missing_values=0, random_state=0),
MissingIndicator(missing_values=0)),
RandomForestRegressor(random_state=0, n_estimators=100))
chained_impute_scores = cross_val_score(estimator, X_missing, y_missing,
scoring='neg_mean_squared_error')

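The ``make_union`` step in this example concatenates the imputer's output
with the indicator mask column-wise. A minimal NumPy sketch of that
augmentation (illustrative only: the toy data and the mean imputation here
are assumptions, not the example's actual datasets or pipeline objects):

```python
import numpy as np

# toy data where 0 marks a missing entry
X = np.array([[0., 2., 3.],
              [4., 0., 6.],
              [7., 8., 0.]])
mask = (X == 0)

# mean-impute each column from its observed entries only
X_hidden = np.where(mask, np.nan, X)        # hide the missing entries
col_means = np.nanmean(X_hidden, axis=0)    # per-column observed mean
X_imp = np.where(mask, col_means, X)        # fill missing with the mean

# FeatureUnion-style concatenation: imputed values + indicator mask
X_aug = np.hstack([X_imp, mask.astype(float)])
```

The forest then trains on ``X_aug``, so it can condition on *whether* a value
was imputed, not only on the imputed value itself.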
223 changes: 223 additions & 0 deletions sklearn/impute.py
@@ -35,6 +35,7 @@
'predictor'])

__all__ = [
'MissingIndicator',
'SimpleImputer',
'ChainedImputer',
]
@@ -975,3 +976,225 @@ def fit(self, X, y=None):
"""
self.fit_transform(X)
return self


class MissingIndicator(BaseEstimator, TransformerMixin):
"""Binary indicators for missing values.

Parameters
----------
missing_values : number, string, np.nan (default) or None
[inline review]
Member: what does number mean and why is np.nan not a number? Maybe just move the np.nan to the end?
Member: number means real number. It's just to fit this in one line. I think by definition nan is not a number :)
Member: but the dtype is also important, isn't it? I find "float or int" more natural than "number or np.nan" but I don't have a strong opinion.
Member: I agree that "float or int" is better than number, but I think it's important to keep np.nan visible since it should be a common value for missing_values. Maybe something like int, float, string or None (default=np.nan)?
Member: right now this is consistent with SimpleImputer and ChainedImputer in fact.
The placeholder for the missing values. All occurrences of
``missing_values`` will be indicated (``True`` in the output array).

features : str, optional
Whether the imputer mask should represent all or a subset of
features.

- If "missing-only" (default), the imputer mask will only represent
features containing missing values during fit time.
- If "all", the imputer mask will represent all features.

sparse : boolean or "auto", optional
Whether the imputer mask format should be sparse or dense.

- If "auto" (default), the imputer mask will be of same type as
input.
- If True, the imputer mask will be a sparse matrix.
- If False, the imputer mask will be a numpy array.

error_on_new : boolean, optional
If True (default), transform will raise an error when there are
features with missing values in transform that have no missing values
in fit. This is applicable only when ``features="missing-only"``.

Attributes
----------
features_ : ndarray, shape (n_missing_features,) or (n_features,)
The features indices which will be returned when calling ``transform``.
They are computed during ``fit``. For ``features='all'``, it is equal
to ``range(n_features)``.

Examples
--------
>>> import numpy as np
>>> from sklearn.impute import MissingIndicator
>>> X1 = np.array([[np.nan, 1, 3],
... [4, 0, np.nan],
... [8, 1, 0]])
>>> X2 = np.array([[5, 1, np.nan],
... [np.nan, 2, 3],
... [2, 4, 0]])
>>> indicator = MissingIndicator()
>>> indicator.fit(X1)
MissingIndicator(error_on_new=True, features='missing-only',
missing_values=nan, sparse='auto')
>>> X2_tr = indicator.transform(X2)
>>> X2_tr
array([[False, True],
[ True, False],
[False, False]])

"""

def __init__(self, missing_values=np.nan, features="missing-only",
sparse="auto", error_on_new=True):
self.missing_values = missing_values
self.features = features
self.sparse = sparse
self.error_on_new = error_on_new

def _get_missing_features_info(self, X):
"""Compute the imputer mask and the indices of the features
containing missing values.

Parameters
----------
X : {ndarray or sparse matrix}, shape (n_samples, n_features)
The input data with missing values. Note that ``X`` has been
checked in ``fit`` and ``transform`` before calling this function.

Returns
-------
imputer_mask : {ndarray or sparse matrix}, shape \
(n_samples, n_features) or (n_samples, n_features_with_missing)
The imputer mask of the original data.

features_with_missing : ndarray, shape (n_features_with_missing)
The features containing missing values.

"""
if sparse.issparse(X) and self.missing_values != 0:
mask = _get_mask(X.data, self.missing_values)

# The imputer mask will be constructed with the same sparse format
# as X.
sparse_constructor = (sparse.csr_matrix if X.format == 'csr'
else sparse.csc_matrix)
imputer_mask = sparse_constructor(
(mask, X.indices.copy(), X.indptr.copy()),
shape=X.shape, dtype=bool)

missing_values_mask = imputer_mask.copy()
missing_values_mask.eliminate_zeros()
features_with_missing = (
np.flatnonzero(np.diff(missing_values_mask.indptr))
if missing_values_mask.format == 'csc'
else np.unique(missing_values_mask.indices))

if self.sparse is False:
imputer_mask = imputer_mask.toarray()
elif imputer_mask.format == 'csr':
imputer_mask = imputer_mask.tocsc()
else:
if sparse.issparse(X):
# case of sparse matrix with 0 as missing values. Implicit and
# explicit zeros are considered as missing values.
X = X.toarray()
imputer_mask = _get_mask(X, self.missing_values)
features_with_missing = np.flatnonzero(imputer_mask.sum(axis=0))

if self.sparse is True:
imputer_mask = sparse.csc_matrix(imputer_mask)

return imputer_mask, features_with_missing

def fit(self, X, y=None):
"""Fit the transformer on X.

Parameters
----------
X : {array-like, sparse matrix}, shape (n_samples, n_features)
Input data, where ``n_samples`` is the number of samples and
``n_features`` is the number of features.

Returns
-------
self : object
Returns self.
"""
if not is_scalar_nan(self.missing_values):
force_all_finite = True
else:
force_all_finite = "allow-nan"
X = check_array(X, accept_sparse=('csc', 'csr'),
force_all_finite=force_all_finite)
_check_inputs_dtype(X, self.missing_values)

self._n_features = X.shape[1]

if self.features not in ('missing-only', 'all'):
raise ValueError("'features' has to be either 'missing-only' or "
"'all'. Got {} instead.".format(self.features))

if not ((isinstance(self.sparse, six.string_types) and
self.sparse == "auto") or isinstance(self.sparse, bool)):
raise ValueError("'sparse' has to be a boolean or 'auto'. "
"Got {!r} instead.".format(self.sparse))

self.features_ = (self._get_missing_features_info(X)[1]
if self.features == 'missing-only'
else np.arange(self._n_features))

return self

def transform(self, X):
"""Generate missing values indicator for X.

Parameters
----------
X : {array-like, sparse matrix}, shape (n_samples, n_features)
The input data to complete.

Returns
-------
Xt : {ndarray or sparse matrix}, shape (n_samples, n_features)
The missing indicator for input data. The data type of ``Xt``
will be boolean.

"""
check_is_fitted(self, "features_")

if not is_scalar_nan(self.missing_values):
force_all_finite = True
else:
force_all_finite = "allow-nan"
X = check_array(X, accept_sparse=('csc', 'csr'),
force_all_finite=force_all_finite)
_check_inputs_dtype(X, self.missing_values)

if X.shape[1] != self._n_features:
raise ValueError("X has a different number of features "
"than during fitting.")

imputer_mask, features = self._get_missing_features_info(X)

if self.features == "missing-only":
features_diff_fit_trans = np.setdiff1d(features, self.features_)
if (self.error_on_new and features_diff_fit_trans.size > 0):
raise ValueError("The features {} have missing values "
"in transform but have no missing values "
"in fit.".format(features_diff_fit_trans))

if (self.features_.size > 0 and
self.features_.size < self._n_features):
imputer_mask = imputer_mask[:, self.features_]

return imputer_mask

def fit_transform(self, X, y=None):
"""Generate missing values indicator for X.

Parameters
----------
X : {array-like, sparse matrix}, shape (n_samples, n_features)
The input data to complete.

Returns
-------
Xt : {ndarray or sparse matrix}, shape (n_samples, n_features)
The missing indicator for input data. The data type of ``Xt``
will be boolean.

"""
return self.fit(X, y).transform(X)
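Stepping back from the diff, the ``error_on_new`` check in ``transform``
reduces to a set difference between the feature indices that had missing
values at ``fit`` time and those that do at ``transform`` time. A standalone
sketch (hypothetical helper name, not part of the PR's API):

```python
import numpy as np

def check_error_on_new(features_fit, features_transform, error_on_new=True):
    """Mimic the error_on_new logic: flag features that acquired
    missing values only at transform time (illustrative helper)."""
    new = np.setdiff1d(features_transform, features_fit)
    if error_on_new and new.size > 0:
        raise ValueError("The features {} have missing values "
                         "in transform but have no missing values "
                         "in fit.".format(new.tolist()))
    return new
```

With ``error_on_new=False`` the new features are simply ignored, since the
output mask is always sliced by the ``features_`` learned at ``fit`` time.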