-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] BUG: add support for non numeric values in MissingIndicator #13046
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sklearn/utils/estimator_checks.py
Outdated
msg = "argument must be a string or a number" | ||
assert_raises_regex(TypeError, msg, estimator.fit, X, y) | ||
else: | ||
# If the estimator support strings passed as object dtype, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually X[0, 0] = {'foo': 'bar'}
is not a string entry, but a dict entry. So I think the test was correct. (Otherwise it would have been needed to modify it for the support of strings in the SimpleImputer).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error is also not raised for SimpleImputer with the strategy supporting mean. The difference is that the common tests are using default parameter which is the mean strategy for the SimpleImputer.
In [9]: imputer = SimpleImputer(strategy='most_frequent')
In [11]: X = np.array([[{'foo': 'bar'}, 'b', 'c']], dtype=object)
In [12]: imputer.fit(X)
Out[12]:
SimpleImputer(copy=True, fill_value=None, missing_values=nan,
strategy='most_frequent', verbose=0)
I have to check what is going wrong in check array which is the one that should raise this error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error is raised by np.asarray
only when we make the conversion. So it means that we neeed to raise such error when dtype=object
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think check_array
needs an update to properly deal with non-numeric input
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think
check_array
needs an update to properly deal with non-numeric input
See also #12148
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be testing the error produced when string dtype is passed? Otherwise looks good.
sklearn/impute.py
Outdated
force_all_finite=force_all_finite) | ||
_check_inputs_dtype(X, self.missing_values) | ||
if X.dtype.kind not in ("i", "u", "f", "O"): | ||
raise ValueError("Missing indicator does not support data with " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing indicator -> MissingIndicator
ALLOW_NAN = ['Imputer', 'SimpleImputer', 'MissingIndicator', | ||
'MaxAbsScaler', 'MinMaxScaler', 'RobustScaler', 'StandardScaler', | ||
'PowerTransformer', 'QuantileTransformer'] | ||
SUPPORT_STRING = ['SimpleImputer', 'MissingIndicator'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
surely OneHotEncoder, OrdinalEncoder and meta-estimators will belong here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do that in another PR. I think that we should also factorize the input validation as well. It is quite redundant and we could have a common test then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it probably belongs in estimator tags... (perhaps post-#8022)
It is covered there: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine. If you use "MissingIndicator" instead of "missing indicator" this is good to merge by me.
sklearn/tests/test_impute.py
Outdated
"'sparse' has to be a boolean or 'auto'"), | ||
(np.array([['a', 'b'], ['c', 'a']], dtype=str), | ||
np.array([['a', 'b'], ['c', 'a']], dtype=str), | ||
{}, "Missing indicator does not support data with dtype")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will break
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments, looks good to me
sklearn/tests/test_impute.py
Outdated
@@ -55,7 +56,6 @@ def _check_statistics(X, X_true, | |||
err_msg=err_msg.format(True)) | |||
assert_ae(X_trans, X_true, err_msg=err_msg.format(True)) | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pep8, does not need to be removed
Co-Authored-By: glemaitre <g.lemaitre58@gmail.com>
Just need to merge master for the pep8 failure |
and I mess up my merging (basically I rebase again) but this should be fine with the final squashing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to refactor some input validations and some common tests, but I agree that we can do these things in another PR.
…cikit-learn#13046)" This reverts commit 29c51c4.
…cikit-learn#13046)" This reverts commit 29c51c4.
Reference Issues/PRs
closes #13035
What does this implement/fix? Explain your changes.
Add support to non-numeric values in MissingIndicator. I flag this PR as BUG since SimpleImputer is already supporting non-numeric data and that we cannot use the MissingIndicator with the SimpleImputer in 0.20. I think that it should go in 0.20.3
I added 3 tests: