-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[MRG+1] MAINT dissociate nan and inf in check_array #10459
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@ashimb9 do you want to take from here? |
@glemaitre Hmm I would not mind working on it but I also think it might be less efficient for me to jump into it midway through. So, please free to complete it if you wish but if you prefer that I work on it, then that would be fine by me too. |
@jnothman I think that it is ready for a first review. I am not sure if we should support |
@ashimb9 You can have a look as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this style of testing quite hard to follow, because related tests are not grouped together.
You could have one test for valid and one for invalid, or one for both. You can stack parametrize decorations to test the cartesian product of inputs:
pytest.mark.parametrize('value', 'force_all_finite', [(np.inf, False), (np.inf, 'allow-nan'), (np.nan, False)]
pytest.mark.parametrize('retype', [np.asarray, sparse.csr_matrix])
def test_force_all_finite_valid(value, force_all_finite, retype):
X = retype(...)
...
) | ||
def test_check_array_inf_error(X_inf, accept_sparse): | ||
X_inf[0, 0] = np.inf | ||
with pytest.raises(ValueError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use match
to check the error message?
sklearn/utils/validation.py
Outdated
@@ -425,6 +462,15 @@ def check_array(array, accept_sparse=False, dtype="numeric", order=None, | |||
# list of accepted types. | |||
dtype = dtype[0] | |||
|
|||
if isinstance(force_all_finite, six.string_types): | |||
if force_all_finite != 'allow-nan': | |||
raise ValueError('When force_all_finite is a string, it should be ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Untested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the same error message as below.
sklearn/utils/validation.py
Outdated
'equal to "allow-nan". Got "{}" instead.'.format( | ||
force_all_finite)) | ||
elif not isinstance(force_all_finite, bool): | ||
raise ValueError('force_all_finite should be a bool or a string. Got ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a string -> 'allow-nan'
not tested
"retype", | ||
[np.asarray, sp.csr_matrix] | ||
) | ||
def test_check_array_valid(value, force_all_finite, retype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I might have made a mistake: this name should mention finiteness
sklearn/utils/validation.py
Outdated
"""Like assert_all_finite, but only for ndarray.""" | ||
if _get_config()['assume_finite']: | ||
return | ||
X = np.asanyarray(X) | ||
# First try an O(n) time, O(1) space solution for the common case that | ||
# everything is finite; fall back to O(n) space np.isfinite to prevent | ||
# false positives from overflow in sum method. | ||
if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum()) | ||
and not np.isfinite(X).all()): | ||
if (not allow_nan and X.dtype.char in np.typecodes['AllFloat'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be easier to read if you have:
if allow_nan:
...
else:
...
sklearn/utils/validation.py
Outdated
raise ValueError("Input contains NaN, infinity" | ||
" or a value too large for %r." % X.dtype) | ||
elif (allow_nan and X.dtype.char in np.typecodes['AllFloat'] | ||
and not np.isfinite(X[~np.isnan(X)].sum()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.isnan(X) already spends the memory that .sum() is trying to avoid. You do not need both the .sum() and the .all() conditions if you're using np.isnan(X). if np.isnan().any()
is sufficient and, without chunking, optimal I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not really sure what you mean. we would like to test for infinity only, therefore shall I do the following:
if ... and not np.isfinite(x[~np.isnan(x)]).all():
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant to say np.isinf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth leaving in the O(1) memory case actually, as if all is finite, you can move on. Sorry for my mistakes above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok I see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from cosmetic things, LGTM
sklearn/utils/validation.py
Outdated
raise ValueError("Input contains NaN, infinity" | ||
" or a value too large for %r." % X.dtype) | ||
if allow_nan: | ||
if (X.dtype.char in np.typecodes['AllFloat'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I look again I think these common conditions can be pulled to the outer if:
if finite sum:
pass
else:
func, msg = (np.isinf, 'Infinity') if allow_nan else ...
if func(X).any(): ...
sklearn/utils/validation.py
Outdated
@@ -70,8 +80,17 @@ def as_float_array(X, copy=True, force_all_finite=True): | |||
If True, a copy of X will be created. If False, a copy may still be | |||
returned if X's dtype is not a floating point type. | |||
|
|||
force_all_finite : boolean (default=True) | |||
Whether to raise an error on np.inf and np.nan in X. | |||
force_all_finite : boolean or str {'allow-nan'}, (default=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop str and {}. Just Boolean or '...'
sklearn/utils/validation.py
Outdated
@@ -304,7 +332,10 @@ def _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, | |||
warnings.warn("Can't check %s sparse matrix for nan or inf." | |||
% spmatrix.format) | |||
else: | |||
_assert_all_finite(spmatrix.data) | |||
if force_all_finite == 'allow-nan': | |||
_assert_all_finite(spmatrix.data, allow_nan=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put the comparison direct into the call rather than using if unnecessarily.
sklearn/utils/validation.py
Outdated
@@ -425,6 +465,14 @@ def check_array(array, accept_sparse=False, dtype="numeric", order=None, | |||
# list of accepted types. | |||
dtype = dtype[0] | |||
|
|||
if isinstance(force_all_finite, six.string_types): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about if force_all_finite not in (...): raise
sklearn/utils/validation.py
Outdated
@@ -482,7 +530,9 @@ def check_array(array, accept_sparse=False, dtype="numeric", order=None, | |||
if not allow_nd and array.ndim >= 3: | |||
raise ValueError("Found array with dim %d. %s expected <= 2." | |||
% (array.ndim, estimator_name)) | |||
if force_all_finite: | |||
if force_all_finite == 'allow-nan': | |||
_assert_all_finite(array, allow_nan=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Condition here, no if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lesteve, do you mind checking this looks good to you? We'd like to use it :)
sklearn/utils/validation.py
Outdated
|
||
|
||
def assert_all_finite(X): | ||
is_float = X.dtype.char in np.typecodes['AllFloat'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(FWIW, we can just be doing X.dtype.kind in 'fc'
) here.
sklearn/utils/validation.py
Outdated
pass | ||
elif is_float: | ||
msg_err = "Input contains {} or a value too large for {!r}." | ||
cond_err, type_err = ((np.isinf(X).any(), 'infinity') if allow_nan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. I see I forgot to negate isfinite in my suggestion. Putting the conditions in like this is not very readable. Sorry.
ping @lesteve ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
… On 18 Jan 2018 11:56 pm, "Tom Dupré la Tour" ***@***.***> wrote:
Merged #10459 <#10459>.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#10459 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz6z3rj1fgom8AnD09pd4l5RSXsRadks5tLz-LgaJpZM4RcC5Y>
.
|
I'm probably not following this 100%, but both Imputer and MICEImputer currently have |
Because you want to block false and it was a bug in the imputerrwhich was removing the column containing inf instead of raising an error
|
Gotcha! Should I merge to master and then change it to |
@sergeyf merge and change it |
@glemaitre Will do. Just to confirm, we want some more complex logic now:
Right? |
Yes, I suppose that's right.
…On 20 January 2018 at 07:06, Sergey Feldman ***@***.***> wrote:
@glemaitre <https://github.com/glemaitre> Will do.
Just to confirm, we want some more complex logic now:
if self.missing_values == "NaN":
X = check_array(X, force_all_finite='allow-nan')
else:
X = check_array(X, force_all_finite=True)
Right?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#10459 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz60nyQYIfT0R_HQhLwGrW5W0rcq5Dks5tMPXVgaJpZM4RcC5Y>
.
|
Or you can use an inline if: `check_array(X, force_all_finite='allow-nan'
if self.missing_values == 'NaN' else True)`
|
Yes, I think that's what it means. We should not support NaN in a feature
matrix if missing_values != NaN. Sorry if that's a pain to change!
|
I think it's all good once the check_array has the flag in it?
On Jan 22, 2018 3:19 PM, "Joel Nothman" <notifications@github.com> wrote:
Yes, I think that's what it means. We should not support NaN in a feature
matrix if missing_values != NaN. Sorry if that's a pain to change!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#10459 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABya7DrKqOCEskuxQrZOx9RFCLF1m6N8ks5tNReQgaJpZM4RcC5Y>
.
|
Reference Issues/PRs
closes #10455
What does this implement/fix? Explain your changes.
Any other comments?