Skip to content

[MRG] BUG: add support for non numeric values in MissingIndicator #13046

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
Feb 3, 2019

Conversation

glemaitre
Copy link
Member

@glemaitre glemaitre commented Jan 26, 2019

Reference Issues/PRs

closes #13035

What does this implement/fix? Explain your changes.

Add support to non-numeric values in MissingIndicator. I flag this PR as BUG since SimpleImputer is already supporting non-numeric data and that we cannot use the MissingIndicator with the SimpleImputer in 0.20. I think that it should go in 0.20.3

I added 3 tests:

  • check that we are raising an error when the numpy string type is raising an error;
  • check that we can use string using object dtype;
  • check that the missing indicator is working in conjunction with the SimpleImputer.

@glemaitre glemaitre changed the title EHN: add support for non numeric values in MissingIndicator BUG: add support for non numeric values in MissingIndicator Jan 26, 2019
@glemaitre glemaitre changed the title BUG: add support for non numeric values in MissingIndicator [MRG] BUG: add support for non numeric values in MissingIndicator Jan 26, 2019
msg = "argument must be a string or a number"
assert_raises_regex(TypeError, msg, estimator.fit, X, y)
else:
# If the estimator support strings passed as object dtype,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually X[0, 0] = {'foo': 'bar'} is not a string entry, but a dict entry. So I think the test was correct. (Otherwise it would have been needed to modify it for the support of strings in the SimpleImputer).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error is also not raised for SimpleImputer with the strategy supporting mean. The difference is that the common tests are using default parameter which is the mean strategy for the SimpleImputer.

In [9]: imputer = SimpleImputer(strategy='most_frequent') 
In [11]: X = np.array([[{'foo': 'bar'}, 'b', 'c']], dtype=object)
In [12]: imputer.fit(X)                                                              
Out[12]: 
SimpleImputer(copy=True, fill_value=None, missing_values=nan,
              strategy='most_frequent', verbose=0)

I have to check what is going wrong in check array which is the one that should raise this error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error is raised by np.asarray only when we make the conversion. So it means that we neeed to raise such error when dtype=object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think check_array needs an update to properly deal with non-numeric input

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think check_array needs an update to properly deal with non-numeric input

See also #12148

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be testing the error produced when string dtype is passed? Otherwise looks good.

force_all_finite=force_all_finite)
_check_inputs_dtype(X, self.missing_values)
if X.dtype.kind not in ("i", "u", "f", "O"):
raise ValueError("Missing indicator does not support data with "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing indicator -> MissingIndicator

ALLOW_NAN = ['Imputer', 'SimpleImputer', 'MissingIndicator',
'MaxAbsScaler', 'MinMaxScaler', 'RobustScaler', 'StandardScaler',
'PowerTransformer', 'QuantileTransformer']
SUPPORT_STRING = ['SimpleImputer', 'MissingIndicator']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

surely OneHotEncoder, OrdinalEncoder and meta-estimators will belong here too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do that in another PR. I think that we should also factorize the input validation as well. It is quite redundant and we could have a common test then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it probably belongs in estimator tags... (perhaps post-#8022)

@glemaitre
Copy link
Member Author

Should we be testing the error produced when string dtype is passed? Otherwise looks good.

It is covered there:
https://github.com/scikit-learn/scikit-learn/pull/13046/files#diff-065d21984eda92d7583f963f704a9876R514

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine. If you use "MissingIndicator" instead of "missing indicator" this is good to merge by me.

"'sparse' has to be a boolean or 'auto'"),
(np.array([['a', 'b'], ['c', 'a']], dtype=str),
np.array([['a', 'b'], ['c', 'a']], dtype=str),
{}, "Missing indicator does not support data with dtype")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, looks good to me

@@ -55,7 +56,6 @@ def _check_statistics(X, X_true,
err_msg=err_msg.format(True))
assert_ae(X_trans, X_true, err_msg=err_msg.format(True))


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pep8, does not need to be removed

@jorisvandenbossche
Copy link
Member

Just need to merge master for the pep8 failure

@glemaitre
Copy link
Member Author

and I mess up my merging (basically I rebase again) but this should be fine with the final squashing.

Copy link
Member

@qinhanmin2014 qinhanmin2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to refactor some input validations and some common tests, but I agree that we can do these things in another PR.

@qinhanmin2014 qinhanmin2014 added this to the 0.20.3 milestone Feb 3, 2019
@qinhanmin2014 qinhanmin2014 merged commit 1ae1f1d into scikit-learn:master Feb 3, 2019
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 6, 2019
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 7, 2019
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Feb 19, 2019
@jnothman jnothman mentioned this pull request Feb 19, 2019
17 tasks
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MissingIndicator failed with non-numeric inputs