-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] FIX: raise error with inconsistent dtype X and missing_values #11391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sklearn/impute.py
Outdated
not isinstance(missing_values, numbers.Real)): | ||
raise TypeError("The data type of 'missing_values' and 'X' are " | ||
"not compatible. 'missing_values' data type is " | ||
"{} and 'X' is {}." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the rest of the code, we raise a ValueError for this kind of error, there is a discussion in #11211.
sklearn/tests/test_impute.py
Outdated
# for X and missing_values was not raising a proper error. | ||
X = np.random.randn(1000, 10) | ||
X[0, 0] = X_missing_value | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure 1000 is necessary here :)
sklearn/tests/test_impute.py
Outdated
[SimpleImputer, ChainedImputer]) | ||
@pytest.mark.parametrize( | ||
"missing_values, X_missing_value, err_msg", | ||
[("NaN", np.nan, "contains"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give more context for the contains error message? I don't know what the expected error is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the error raised by check_array
when X
contains NaN
or inf
and `force_all_finite=True:
Input contains NaN, infinity or a value too large for dtype('float64').
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right. would be easier to follow if it was "Input contains NaN" ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But aren't we still supporting missing_values='NaN'?
No, in |
Oh, yes, I'd forgotten that we resolved that...
|
@qinhanmin2014 @jnothman DO you have some remarks in order to move on on the MissingIndicator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM apart for a few minor comments.
sklearn/impute.py
Outdated
@@ -40,6 +40,17 @@ | |||
] | |||
|
|||
|
|||
def _check_inputs_dtype(X, missing_values): | |||
"""Check that the dtype of X is in accordance with the one of | |||
missing_values.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would phrase this in the other order
"""
Check that the type of missing_values is in accordance with the dtype of X.
"""
as that's what's we are checking right? I mean when we get this error, the user should change the value of missing_value
not the other way around, right?
sklearn/impute.py
Outdated
def _check_inputs_dtype(X, missing_values): | ||
"""Check that the dtype of X is in accordance with the one of | ||
missing_values.""" | ||
if (X.dtype.kind in ("f", "i", "u") and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice! I didn't know about this.
sklearn/tests/test_impute.py
Outdated
err_msg): | ||
# regression test for issue #11390. Comparison between incoherent dtype | ||
# for X and missing_values was not raising a proper error. | ||
X = np.random.randn(10, 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use a random state (even if we don't care about the actual values)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments.
sklearn/impute.py
Outdated
@@ -40,6 +40,18 @@ | |||
] | |||
|
|||
|
|||
def _check_inputs_dtype(X, missing_values): | |||
"""Check that the type of missing_values is in accordance with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove this docstring, doesn't add much information or document it fully if you insist this function needs to be documented.
sklearn/impute.py
Outdated
""" | ||
if (X.dtype.kind in ("f", "i", "u") and | ||
not isinstance(missing_values, numbers.Real)): | ||
raise ValueError("The data type of 'missing_values' and 'X' are " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add what is expected? In general it's better when the error message says something like: "this is what was expected. Got this instead."
sklearn/tests/test_impute.py
Outdated
@pytest.mark.parametrize("imputer_constructor", | ||
[SimpleImputer, ChainedImputer]) | ||
@pytest.mark.parametrize( | ||
"missing_values, X_missing_value, err_msg", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
X_missing_value may not be a great name since it seems to be a number
thx @glemaitre |
closes #11390
Implement an additional check to avoid raising
NotImplemented
fromnp.equal
when the dtype ofX
andmissing_values
are inconsistent for a comparison.