[MRG] BUG: add support for non numeric values in MissingIndicator #13046


Merged · 24 commits · Feb 3, 2019
Commits
1176d66
EHN: add support for non numeric values in MissingIndicator
glemaitre Jan 26, 2019
00d7cb5
DOC: add whats new entry
glemaitre Jan 26, 2019
21eff01
FIX/TST: estimator supporting string should not raise error with dtyp…
glemaitre Jan 26, 2019
42d5b36
DOC: fix the comment on the reason of not raising an error
glemaitre Jan 28, 2019
507ed3a
FIX: change error message
glemaitre Jan 29, 2019
eb4765c
TST: fix the error message match
glemaitre Jan 29, 2019
d4e7778
Update doc/whats_new/v0.20.rst
jorisvandenbossche Jan 30, 2019
7fb7599
TST: additional test with mixed type string/nan/None
glemaitre Jan 30, 2019
ec0ab84
[MRG] Configure lgtm.yml for CPP (#13044)
thomasjpfan Jan 26, 2019
3721a61
FIX float16 overflow on accumulator operations in StandardScaler (#13…
baluyotraf Jan 26, 2019
f38c8d1
TST Use random state to initialize MLPClassifier. (#12892)
xhan7279 Jan 27, 2019
4078baa
API Deprecate externals.six (#12916)
qinhanmin2014 Jan 27, 2019
a0a99d8
DOC Remove outdated doc in KBinsDiscretizer (#13047)
qinhanmin2014 Jan 27, 2019
e92eb1f
DOC Remove outdated doc in KBinsDiscretizer
qinhanmin2014 Jan 27, 2019
dd73bab
EXA Improve example plot_svm_anova.py (#11731)
qinhanmin2014 Jan 28, 2019
de9630f
DOC Correct TF-IDF formula in TfidfTransformer comments. (#13054)
vishaalkapoor Jan 29, 2019
1d8c534
FIX an issue w/ large sparse matrix indices in CountVectorizer (#11295)
gvacaliuc Jan 30, 2019
f483247
DOC More details about the attributes in MinMaxScaler (#13029)
qinhanmin2014 Jan 30, 2019
42ed142
DOC Clean up the advanced installation doc to remove python < 3.5 par…
jeremiedbb Jan 30, 2019
9445eb1
API NMF and non_negative_factorization have inconsistent default init…
zjpoh Jan 30, 2019
a95b08e
MAINT: pin flake8 to stable version (#13066)
glemaitre Jan 30, 2019
193ed4e
EXA: fix xlabel and ylabel in plot_cv_digits.py (#13067)
qinhanmin2014 Jan 30, 2019
85fe5f6
MAINT: remove flake8 pinning in circle ci (#13071)
glemaitre Jan 30, 2019
0b5a96d
Merge remote-tracking branch 'origin/master' into is/13035
glemaitre Jan 30, 2019
9 changes: 9 additions & 0 deletions doc/whats_new/v0.20.rst
@@ -36,6 +36,15 @@ Changelog
  threaded when `n_jobs > 1` or `n_jobs = -1`.
  :issue:`13005` by :user:`Prabakaran Kumaresshan <nixphix>`.

:mod:`sklearn.impute`
.....................

- |Fix| add support for non-numeric data in
  :class:`sklearn.impute.MissingIndicator`, which previously rejected such
  data even though :class:`sklearn.impute.SimpleImputer` already supports it
  for some imputation strategies.
  :issue:`13046` by :user:`Guillaume Lemaitre <glemaitre>`.
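For context, the behavior this entry describes can be exercised directly through the public API. A short sketch; the expected mask mirrors the assertion in the new tests added by this PR:

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Object-dtype string data; 'a' plays the role of the missing value.
X = np.array([['a', 'b', 'c'],
              ['b', 'c', 'a']], dtype=object)

# features='all' returns one boolean indicator column per input column.
indicator = MissingIndicator(missing_values='a', features='all')
mask = indicator.fit_transform(X)
print(mask)
# [[ True False False]
#  [False False  True]]
```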

:mod:`sklearn.linear_model`
...........................

35 changes: 19 additions & 16 deletions sklearn/impute.py
@@ -533,6 +533,23 @@ def _get_missing_features_info(self, X):

        return imputer_mask, features_with_missing

    def _validate_input(self, X):
        if not is_scalar_nan(self.missing_values):
            force_all_finite = True
        else:
            force_all_finite = "allow-nan"
        X = check_array(X, accept_sparse=('csc', 'csr'), dtype=None,
                        force_all_finite=force_all_finite)
        _check_inputs_dtype(X, self.missing_values)
        if X.dtype.kind not in ("i", "u", "f", "O"):
            raise ValueError("MissingIndicator does not support data with "
                             "dtype {0}. Please provide either a numeric array"
                             " (with a floating point or integer dtype) or "
                             "categorical data represented either as an array "
                             "with integer dtype or an array of string values "
                             "with an object dtype.".format(X.dtype))
        return X
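A side note on the `dtype.kind` check above: NumPy fixed-width string arrays have dtype kind `'U'` (or `'S'`), which this check deliberately rejects, while the same strings stored in an object array pass. A minimal sketch of the distinction, using plain NumPy:

```python
import numpy as np

# Fixed-width unicode strings: dtype kind 'U' -- rejected by the check.
X_str = np.array([['a', 'b'], ['c', 'a']], dtype=str)
# The same values boxed as Python objects: dtype kind 'O' -- accepted.
X_obj = np.array([['a', 'b'], ['c', 'a']], dtype=object)

print(X_str.dtype.kind)  # 'U'
print(X_obj.dtype.kind)  # 'O'
```

This is why the new test passes `dtype=str` to trigger the error, and the string-support tests pass `dtype=object`.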

    def fit(self, X, y=None):
        """Fit the transformer on X.

@@ -547,14 +564,7 @@ def fit(self, X, y=None):
        self : object
            Returns self.
        """
        if not is_scalar_nan(self.missing_values):
            force_all_finite = True
        else:
            force_all_finite = "allow-nan"
        X = check_array(X, accept_sparse=('csc', 'csr'),
                        force_all_finite=force_all_finite)
        _check_inputs_dtype(X, self.missing_values)

        X = self._validate_input(X)
        self._n_features = X.shape[1]

        if self.features not in ('missing-only', 'all'):
@@ -588,14 +598,7 @@ def transform(self, X):

"""
        check_is_fitted(self, "features_")

        if not is_scalar_nan(self.missing_values):
            force_all_finite = True
        else:
            force_all_finite = "allow-nan"
        X = check_array(X, accept_sparse=('csc', 'csr'),
                        force_all_finite=force_all_finite)
        _check_inputs_dtype(X, self.missing_values)
        X = self._validate_input(X)

        if X.shape[1] != self._n_features:
            raise ValueError("X has a different number of features "
37 changes: 36 additions & 1 deletion sklearn/tests/test_impute.py
@@ -13,6 +13,7 @@
from sklearn.impute import MissingIndicator
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_union
from sklearn.model_selection import GridSearchCV
from sklearn import tree
from sklearn.random_projection import sparse_random_matrix
@@ -509,7 +510,10 @@ def test_imputation_copy():
     "'features' has to be either 'missing-only' or 'all'"),
    (np.array([[-1, 1], [1, 2]]), np.array([[-1, 1], [1, 2]]),
     {'features': 'all', 'sparse': 'random'},
     "'sparse' has to be a boolean or 'auto'"),
    (np.array([['a', 'b'], ['c', 'a']], dtype=str),
     np.array([['a', 'b'], ['c', 'a']], dtype=str),
     {}, "MissingIndicator does not support data with dtype")]
)
def test_missing_indicator_error(X_fit, X_trans, params, msg_err):
    indicator = MissingIndicator(missing_values=-1)
@@ -614,6 +618,37 @@ def test_missing_indicator_sparse_param(arr_type, missing_values,
    assert isinstance(X_trans_mask, np.ndarray)


def test_missing_indicator_string():
    X = np.array([['a', 'b', 'c'], ['b', 'c', 'a']], dtype=object)
    indicator = MissingIndicator(missing_values='a', features='all')
    X_trans = indicator.fit_transform(X)
    assert_array_equal(X_trans, np.array([[True, False, False],
                                          [False, False, True]]))


@pytest.mark.parametrize(
    "X, missing_values, X_trans_exp",
    [(np.array([['a', 'b'], ['b', 'a']], dtype=object), 'a',
      np.array([['b', 'b', True, False], ['b', 'b', False, True]],
               dtype=object)),
     (np.array([[np.nan, 1.], [1., np.nan]]), np.nan,
      np.array([[1., 1., True, False], [1., 1., False, True]])),
     (np.array([[np.nan, 'b'], ['b', np.nan]], dtype=object), np.nan,
      np.array([['b', 'b', True, False], ['b', 'b', False, True]],
               dtype=object)),
     (np.array([[None, 'b'], ['b', None]], dtype=object), None,
      np.array([['b', 'b', True, False], ['b', 'b', False, True]],
               dtype=object))]
)
def test_missing_indicator_with_imputer(X, missing_values, X_trans_exp):
    trans = make_union(
        SimpleImputer(missing_values=missing_values, strategy='most_frequent'),
        MissingIndicator(missing_values=missing_values)
    )
    X_trans = trans.fit_transform(X)
    assert_array_equal(X_trans, X_trans_exp)
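The pattern exercised above — imputing values while appending indicator columns — is usable directly. A sketch with NaN data, mirroring the second parametrized case (the stacked output is float, so the boolean indicator columns become 0/1):

```python
import numpy as np
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import make_union

X = np.array([[np.nan, 1.], [1., np.nan]])

# Impute with the most frequent value and, side by side, append a boolean
# missingness indicator column for each feature that had missing values.
trans = make_union(
    SimpleImputer(missing_values=np.nan, strategy='most_frequent'),
    MissingIndicator(missing_values=np.nan),
)
X_trans = trans.fit_transform(X)
print(X_trans)
# [[1. 1. 1. 0.]
#  [1. 1. 0. 1.]]
```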


@pytest.mark.parametrize("imputer_constructor",
                         [SimpleImputer])
@pytest.mark.parametrize(
15 changes: 11 additions & 4 deletions sklearn/utils/estimator_checks.py
@@ -74,10 +74,10 @@
'OrthogonalMatchingPursuit', 'PLSCanonical', 'PLSRegression',
'RANSACRegressor', 'RadiusNeighborsRegressor',
'RandomForestRegressor', 'Ridge', 'RidgeCV']

ALLOW_NAN = ['Imputer', 'SimpleImputer', 'MissingIndicator',
             'MaxAbsScaler', 'MinMaxScaler', 'RobustScaler', 'StandardScaler',
             'PowerTransformer', 'QuantileTransformer']
SUPPORT_STRING = ['SimpleImputer', 'MissingIndicator']
Review comment (Member): surely OneHotEncoder, OrdinalEncoder and meta-estimators will belong here too.

Reply (glemaitre, author): Can we do that in another PR? I think we should also factorize the input validation, since it is quite redundant, and we could then have a common test.

Reply (Member): Yes, it probably belongs in estimator tags... (perhaps post-#8022)


def _yield_non_meta_checks(name, estimator):
@@ -625,9 +625,16 @@ def check_dtype_object(name, estimator_orig):
        if "Unknown label type" not in str(e):
            raise

    X[0, 0] = {'foo': 'bar'}
    msg = "argument must be a string or a number"
    assert_raises_regex(TypeError, msg, estimator.fit, X, y)
    if name not in SUPPORT_STRING:
        X[0, 0] = {'foo': 'bar'}
        msg = "argument must be a string or a number"
        assert_raises_regex(TypeError, msg, estimator.fit, X, y)
    else:
        # Estimators supporting string will not call np.asarray to convert the
        # data to numeric and therefore, the error will not be raised.
        # Checking for each element dtype in the input array will be costly.
        # Refer to #11401 for full discussion.
        estimator.fit(X, y)
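The distinction this check relies on can be seen with `check_array` itself: requesting a numeric dtype forces a per-element float conversion that fails on a dict, while `dtype=None` (what `MissingIndicator._validate_input` now passes) leaves an object array untouched. A minimal sketch:

```python
import numpy as np
from sklearn.utils import check_array

X = np.array([['a', 'b'], ['c', 'd']], dtype=object)
X[0, 0] = {'foo': 'bar'}  # a non-string, non-numeric element

# dtype=None keeps the object dtype, so no element-wise conversion happens.
X_checked = check_array(X, dtype=None)
print(X_checked.dtype.kind)  # 'O'

# dtype='numeric' attempts a float conversion and raises a TypeError
# ("float() argument must be a string or a number ...") on the dict.
try:
    check_array(X, dtype='numeric')
except TypeError as exc:
    print(type(exc).__name__)
```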


def check_complex_data(name, estimator_orig):
Expand Down