Skip to content

[MRG+1] MAINT dissociate nan and inf in check_array #10459

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jan 18, 2018

Conversation

glemaitre
Copy link
Member

Reference Issues/PRs

closes #10455

What does this implement/fix? Explain your changes.

Any other comments?

@glemaitre
Copy link
Member Author

@ashimb9 do you want to take from here?

@ashimb9
Copy link
Contributor

ashimb9 commented Jan 12, 2018

@glemaitre Hmm I would not mind working on it but I also think it might be less efficient for me to jump into it midway through. So, please free to complete it if you wish but if you prefer that I work on it, then that would be fine by me too.

@glemaitre glemaitre changed the title [WIP] MAINT dissociate nan and inf in check_array [MRG] MAINT dissociate nan and inf in check_array Jan 12, 2018
@glemaitre
Copy link
Member Author

glemaitre commented Jan 12, 2018

@jnothman I think that it is ready for a first review. I am not sure if we should support 'allow-inf'. Personally, I never had the use to it.

@glemaitre
Copy link
Member Author

@ashimb9 You can have a look as well.

@glemaitre glemaitre mentioned this pull request Jan 13, 2018
4 tasks
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this style of testing quite hard to follow, because related tests are not grouped together.

You could have one test for valid and one for invalid, or one for both. You can stack parametrize decorations to test the cartesian product of inputs:

pytest.mark.parametrize('value', 'force_all_finite', [(np.inf, False), (np.inf, 'allow-nan'), (np.nan, False)]
pytest.mark.parametrize('retype', [np.asarray, sparse.csr_matrix])
def test_force_all_finite_valid(value, force_all_finite, retype):
    X = retype(...)
    ...

)
def test_check_array_inf_error(X_inf, accept_sparse):
X_inf[0, 0] = np.inf
with pytest.raises(ValueError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use match to check the error message?

@@ -425,6 +462,15 @@ def check_array(array, accept_sparse=False, dtype="numeric", order=None,
# list of accepted types.
dtype = dtype[0]

if isinstance(force_all_finite, six.string_types):
if force_all_finite != 'allow-nan':
raise ValueError('When force_all_finite is a string, it should be '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Untested

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the same error message as below.

'equal to "allow-nan". Got "{}" instead.'.format(
force_all_finite))
elif not isinstance(force_all_finite, bool):
raise ValueError('force_all_finite should be a bool or a string. Got '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a string -> 'allow-nan'

not tested

"retype",
[np.asarray, sp.csr_matrix]
)
def test_check_array_valid(value, force_all_finite, retype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I might have made a mistake: this name should mention finiteness

"""Like assert_all_finite, but only for ndarray."""
if _get_config()['assume_finite']:
return
X = np.asanyarray(X)
# First try an O(n) time, O(1) space solution for the common case that
# everything is finite; fall back to O(n) space np.isfinite to prevent
# false positives from overflow in sum method.
if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
and not np.isfinite(X).all()):
if (not allow_nan and X.dtype.char in np.typecodes['AllFloat']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be easier to read if you have:

if allow_nan:
    ...
else:
    ...

raise ValueError("Input contains NaN, infinity"
" or a value too large for %r." % X.dtype)
elif (allow_nan and X.dtype.char in np.typecodes['AllFloat']
and not np.isfinite(X[~np.isnan(X)].sum())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.isnan(X) already spends the memory that .sum() is trying to avoid. You do not need both the .sum() and the .all() conditions if you're using np.isnan(X). if np.isnan().any() is sufficient and, without chunking, optimal I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not really sure what you mean. we would like to test for infinity only, therefore shall I do the following:

if ... and not np.isfinite(x[~np.isnan(x)]).all():

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant to say np.isinf

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth leaving in the O(1) memory case actually, as if all is finite, you can move on. Sorry for my mistakes above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh ok I see.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from cosmetic things, LGTM

raise ValueError("Input contains NaN, infinity"
" or a value too large for %r." % X.dtype)
if allow_nan:
if (X.dtype.char in np.typecodes['AllFloat']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I look again I think these common conditions can be pulled to the outer if:

if finite sum:
    pass
else:
    func, msg = (np.isinf, 'Infinity') if allow_nan else ...
    if func(X).any(): ...

@@ -70,8 +80,17 @@ def as_float_array(X, copy=True, force_all_finite=True):
If True, a copy of X will be created. If False, a copy may still be
returned if X's dtype is not a floating point type.

force_all_finite : boolean (default=True)
Whether to raise an error on np.inf and np.nan in X.
force_all_finite : boolean or str {'allow-nan'}, (default=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop str and {}. Just Boolean or '...'

@@ -304,7 +332,10 @@ def _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy,
warnings.warn("Can't check %s sparse matrix for nan or inf."
% spmatrix.format)
else:
_assert_all_finite(spmatrix.data)
if force_all_finite == 'allow-nan':
_assert_all_finite(spmatrix.data, allow_nan=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put the comparison direct into the call rather than using if unnecessarily.

@@ -425,6 +465,14 @@ def check_array(array, accept_sparse=False, dtype="numeric", order=None,
# list of accepted types.
dtype = dtype[0]

if isinstance(force_all_finite, six.string_types):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about if force_all_finite not in (...): raise

@@ -482,7 +530,9 @@ def check_array(array, accept_sparse=False, dtype="numeric", order=None,
if not allow_nd and array.ndim >= 3:
raise ValueError("Found array with dim %d. %s expected <= 2."
% (array.ndim, estimator_name))
if force_all_finite:
if force_all_finite == 'allow-nan':
_assert_all_finite(array, allow_nan=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Condition here, no if

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lesteve, do you mind checking this looks good to you? We'd like to use it :)



def assert_all_finite(X):
is_float = X.dtype.char in np.typecodes['AllFloat']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(FWIW, we can just be doing X.dtype.kind in 'fc') here.

pass
elif is_float:
msg_err = "Input contains {} or a value too large for {!r}."
cond_err, type_err = ((np.isinf(X).any(), 'infinity') if allow_nan
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. I see I forgot to negate isfinite in my suggestion. Putting the conditions in like this is not very readable. Sorry.

@jnothman jnothman changed the title [MRG] MAINT dissociate nan and inf in check_array [MRG+1] MAINT dissociate nan and inf in check_array Jan 16, 2018
@glemaitre
Copy link
Member Author

ping @lesteve ;)

Copy link
Member

@TomDLT TomDLT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@TomDLT TomDLT merged commit 4c99cd4 into scikit-learn:master Jan 18, 2018
@jnothman
Copy link
Member

jnothman commented Jan 19, 2018 via email

@sergeyf
Copy link
Contributor

sergeyf commented Jan 19, 2018

I'm probably not following this 100%, but both Imputer and MICEImputer currently have force_all_finite=False. I assume we want it to be changed to 'allow-nan' everywhere there? Why not just leave it as False?

@glemaitre
Copy link
Member Author

glemaitre commented Jan 19, 2018 via email

@sergeyf
Copy link
Contributor

sergeyf commented Jan 19, 2018

Gotcha! Should I merge to master and then change it to allow-nan now or make a separate PR after MICE is merged?

@glemaitre
Copy link
Member Author

@sergeyf merge and change it

@sergeyf
Copy link
Contributor

sergeyf commented Jan 19, 2018

@glemaitre Will do.

Just to confirm, we want some more complex logic now:

if self.missing_values == "NaN":
    X = check_array(X, force_all_finite='allow-nan')
else:
    X = check_array(X, force_all_finite=True)

Right?

@jnothman
Copy link
Member

jnothman commented Jan 22, 2018 via email

@jnothman
Copy link
Member

jnothman commented Jan 22, 2018 via email

sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Jan 22, 2018
@jnothman
Copy link
Member

jnothman commented Jan 22, 2018 via email

@sergeyf
Copy link
Contributor

sergeyf commented Jan 23, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RFC] Dissociate NaN and Inf when considering force_all_finite in check_array
5 participants