[MRG+1] MAINT dissociate nan and inf in check_array #10459

glemaitre · 2018-01-12T09:47:57Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

glemaitre · 2018-01-12T09:49:08Z

@ashimb9 do you want to take from here?

ashimb9 · 2018-01-12T10:02:29Z

@glemaitre Hmm, I would not mind working on it, but I also think it might be less efficient for me to jump in midway through. So please feel free to complete it if you wish, but if you prefer that I work on it, that would be fine by me too.

glemaitre · 2018-01-12T16:17:09Z

@jnothman I think it is ready for a first review. I am not sure whether we should support 'allow-inf'. Personally, I have never had a use for it.

glemaitre · 2018-01-13T18:10:54Z

@ashimb9 You can have a look as well.

jnothman

I find this style of testing quite hard to follow, because related tests are not grouped together.

You could have one test for valid and one for invalid, or one for both. You can stack parametrize decorations to test the cartesian product of inputs:

@pytest.mark.parametrize('value, force_all_finite', [(np.inf, False), (np.inf, 'allow-nan'), (np.nan, False)])
@pytest.mark.parametrize('retype', [np.asarray, sparse.csr_matrix])
def test_force_all_finite_valid(value, force_all_finite, retype):
    X = retype(...)
    ...
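
A minimal runnable sketch of the stacked-decorator pattern suggested above. `has_non_finite` and the test name are hypothetical stand-ins rather than scikit-learn functions, and the parameter values are chosen only for illustration:

```python
import numpy as np
import pytest


def has_non_finite(X, allow_nan=False):
    """Hypothetical stand-in for the finiteness check under test."""
    X = np.asarray(X, dtype=float)
    # With allow_nan only infinities count as invalid; otherwise NaN does too.
    bad = np.isinf(X) if allow_nan else ~np.isfinite(X)
    return bool(bad.any())


# Stacked decorators: pytest runs every (value, allow_nan) pair with every
# container, i.e. the cartesian product of the two parameter lists (3 x 2).
@pytest.mark.parametrize(
    "value, allow_nan", [(np.nan, True), (1.0, False), (1.0, True)]
)
@pytest.mark.parametrize("container", [np.asarray, list])
def test_accepts_valid_input(value, allow_nan, container):
    X = container([[0.0, value], [2.0, 3.0]])
    assert not has_non_finite(X, allow_nan=allow_nan)
```

Keeping the valid and invalid cases in two such tests groups related inputs together, which is the readability concern raised in the review.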

jnothman · 2018-01-14T03:04:05Z

sklearn/utils/tests/test_validation.py

+)
+def test_check_array_inf_error(X_inf, accept_sparse):
+    X_inf[0, 0] = np.inf
+    with pytest.raises(ValueError):


should we use match to check the error message?
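
For illustration, a small sketch of what checking the message with `match` could look like. `reject_inf` and its message are hypothetical stand-ins for the function under test; `match` is treated by pytest as a regular expression searched in the string form of the exception:

```python
import numpy as np
import pytest


def reject_inf(X):
    """Hypothetical stand-in: raise if X contains an infinite value."""
    X = np.asarray(X, dtype=float)
    if np.isinf(X).any():
        raise ValueError("Input contains infinity")
    return X


def test_inf_error_message():
    X = np.arange(4, dtype=float).reshape(2, 2)
    X[0, 0] = np.inf
    # The test fails if no ValueError is raised or if the message
    # does not contain text matching the pattern.
    with pytest.raises(ValueError, match="Input contains infinity"):
        reject_inf(X)
```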

jnothman · 2018-01-14T03:08:09Z

sklearn/utils/validation.py

@@ -425,6 +462,15 @@ def check_array(array, accept_sparse=False, dtype="numeric", order=None,
            # list of accepted types.
            dtype = dtype[0]

+    if isinstance(force_all_finite, six.string_types):
+        if force_all_finite != 'allow-nan':
+            raise ValueError('When force_all_finite is a string, it should be '


Use the same error message as below.

jnothman · 2018-01-14T03:08:39Z

sklearn/utils/validation.py

+                             'equal to "allow-nan". Got "{}" instead.'.format(
+                                 force_all_finite))
+    elif not isinstance(force_all_finite, bool):
+        raise ValueError('force_all_finite should be a bool or a string. Got '


a string -> 'allow-nan'

not tested

jnothman · 2018-01-14T22:53:47Z

sklearn/utils/tests/test_validation.py

+    "retype",
+    [np.asarray, sp.csr_matrix]
+)
+def test_check_array_valid(value, force_all_finite, retype):


sorry, I might have made a mistake: this name should mention finiteness

jnothman · 2018-01-14T22:55:50Z

sklearn/utils/validation.py

    """Like assert_all_finite, but only for ndarray."""
    if _get_config()['assume_finite']:
        return
    X = np.asanyarray(X)
    # First try an O(n) time, O(1) space solution for the common case that
    # everything is finite; fall back to O(n) space np.isfinite to prevent
    # false positives from overflow in sum method.
-    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
-            and not np.isfinite(X).all()):
+    if (not allow_nan and X.dtype.char in np.typecodes['AllFloat']


I think it would be easier to read if you have:

if allow_nan: ... else: ...

jnothman · 2018-01-14T22:57:57Z

sklearn/utils/validation.py

        raise ValueError("Input contains NaN, infinity"
                         " or a value too large for %r." % X.dtype)
+    elif (allow_nan and X.dtype.char in np.typecodes['AllFloat']
+          and not np.isfinite(X[~np.isnan(X)].sum())


np.isnan(X) already spends the memory that .sum() is trying to avoid. You do not need both the .sum() and the .all() conditions if you're using np.isnan(X). if np.isnan().any() is sufficient and, without chunking, optimal I think.

I am not really sure what you mean. We would like to test for infinity only, so should I do the following:

if ... and not np.isfinite(x[~np.isnan(x)]).all():

Sorry, I meant to say np.isinf

It might be worth leaving in the O(1) memory case actually, as if all is finite, you can move on. Sorry for my mistakes above.

Oh ok I see.
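
To make the outcome of this exchange concrete, here is a sketch of a check that keeps the O(1)-memory sum shortcut and only falls back to an element-wise test when the sum is not finite. The function name is hypothetical and this is an illustration under those assumptions, not the code that was merged:

```python
import numpy as np


def assert_finite_sketch(X, allow_nan=False):
    """Raise ValueError on inf (and on NaN unless allow_nan is True)."""
    X = np.asanyarray(X)
    if X.dtype.kind not in 'fc':
        return  # integer and boolean arrays cannot hold NaN or inf
    # Fast path: a finite sum implies that every entry is finite, and it
    # needs no temporary array. Warnings from overflow are silenced here.
    with np.errstate(over='ignore', invalid='ignore'):
        if np.isfinite(X.sum()):
            return
    # Slow path: the sum was NaN or inf, possibly only because of overflow,
    # so inspect the elements (this allocates a boolean array).
    if allow_nan:
        if np.isinf(X).any():
            raise ValueError("Input contains infinity or a value too "
                             "large for %r." % X.dtype)
    elif not np.isfinite(X).all():
        raise ValueError("Input contains NaN, infinity or a value too "
                         "large for %r." % X.dtype)
```

The fast path is safe because any NaN makes the sum NaN and any infinity makes it infinite or NaN; the reverse does not hold, which is why the element-wise fallback is still needed.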

jnothman

Apart from cosmetic things, LGTM

jnothman · 2018-01-15T20:04:05Z

sklearn/utils/validation.py

-        raise ValueError("Input contains NaN, infinity"
-                         " or a value too large for %r." % X.dtype)
+    if allow_nan:
+        if (X.dtype.char in np.typecodes['AllFloat']


Now I look again I think these common conditions can be pulled to the outer if:

if finite sum:
    pass
else:
    func, msg = (np.isinf, 'Infinity') if allow_nan else ...
    if func(X).any():
        ...

jnothman · 2018-01-15T20:04:44Z

sklearn/utils/validation.py

@@ -70,8 +80,17 @@ def as_float_array(X, copy=True, force_all_finite=True):
        If True, a copy of X will be created. If False, a copy may still be
        returned if X's dtype is not a floating point type.

-    force_all_finite : boolean (default=True)
-        Whether to raise an error on np.inf and np.nan in X.
+    force_all_finite : boolean or str {'allow-nan'}, (default=True)


Drop str and {}. Just Boolean or '...'

jnothman · 2018-01-15T20:05:46Z

sklearn/utils/validation.py

@@ -304,7 +332,10 @@ def _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy,
            warnings.warn("Can't check %s sparse matrix for nan or inf."
                          % spmatrix.format)
        else:
-            _assert_all_finite(spmatrix.data)
+            if force_all_finite == 'allow-nan':
+                _assert_all_finite(spmatrix.data, allow_nan=True)


Put the comparison direct into the call rather than using if unnecessarily.

jnothman · 2018-01-15T20:06:51Z

sklearn/utils/validation.py

@@ -425,6 +465,14 @@ def check_array(array, accept_sparse=False, dtype="numeric", order=None,
            # list of accepted types.
            dtype = dtype[0]

+    if isinstance(force_all_finite, six.string_types):


How about if force_all_finite not in (...): raise
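
A sketch of the membership-test form of this validation. The helper name and the exact message wording are assumptions made for illustration:

```python
def validate_force_all_finite(force_all_finite):
    """Return the value unchanged if it is True, False or 'allow-nan'."""
    # Tuple membership compares with ==, so 1 and 0 are also accepted
    # because 1 == True and 0 == False in Python.
    if force_all_finite not in (True, False, 'allow-nan'):
        raise ValueError(
            'force_all_finite should be a bool or "allow-nan". '
            'Got {!r} instead'.format(force_all_finite)
        )
    return force_all_finite
```

A single condition with a single message replaces the separate string and non-bool branches, at the cost of also accepting the integers 0 and 1.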

jnothman · 2018-01-15T20:07:09Z

sklearn/utils/validation.py

@@ -482,7 +530,9 @@ def check_array(array, accept_sparse=False, dtype="numeric", order=None,
        if not allow_nd and array.ndim >= 3:
            raise ValueError("Found array with dim %d. %s expected <= 2."
                             % (array.ndim, estimator_name))
-        if force_all_finite:
+        if force_all_finite == 'allow-nan':
+            _assert_all_finite(array, allow_nan=True)


Condition here, no if
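
A standard-library-only sketch of passing the comparison straight into the call instead of branching. Both function names are hypothetical and the element-wise check is deliberately simplified:

```python
import math


def assert_all_finite_sketch(values, allow_nan=False):
    """Simplified check over a flat iterable of floats."""
    for v in values:
        if math.isinf(v) or (math.isnan(v) and not allow_nan):
            raise ValueError("Input contains a non-finite value: %r" % v)


def validate(values, force_all_finite=True):
    # 'allow-nan' is a non-empty string and therefore truthy, so the check
    # runs for both True and 'allow-nan'; the comparison picks the mode.
    if force_all_finite:
        assert_all_finite_sketch(
            values, allow_nan=force_all_finite == 'allow-nan'
        )
    return values
```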

jnothman

@lesteve, do you mind checking this looks good to you? We'd like to use it :)

jnothman · 2018-01-16T11:56:48Z

sklearn/utils/validation.py

-
-
-def assert_all_finite(X):
+    is_float = X.dtype.char in np.typecodes['AllFloat']


(FWIW, we can just be doing X.dtype.kind in 'fc') here.
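
A quick check that the two spellings agree, at least for the dtypes listed here (other dtypes are not covered by this sketch; the helper names are made up for the comparison):

```python
import numpy as np


def is_float_by_char(dtype):
    # np.typecodes['AllFloat'] lists the type characters of the real and
    # complex floating-point dtypes.
    return np.dtype(dtype).char in np.typecodes['AllFloat']


def is_float_by_kind(dtype):
    # 'f' is the kind of real floats, 'c' the kind of complex floats.
    return np.dtype(dtype).kind in 'fc'


for dt in (np.float16, np.float32, np.float64, np.complex64,
           np.complex128, np.int32, np.int64, np.bool_):
    assert is_float_by_char(dt) == is_float_by_kind(dt)
```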

jnothman · 2018-01-16T12:00:13Z

sklearn/utils/validation.py

+        pass
+    elif is_float:
+        msg_err = "Input contains {} or a value too large for {!r}."
+        cond_err, type_err = ((np.isinf(X).any(), 'infinity') if allow_nan


Oh. I see I forgot to negate isfinite in my suggestion. Putting the conditions in like this is not very readable. Sorry.

glemaitre · 2018-01-18T10:13:48Z

ping @lesteve ;)

TomDLT

LGTM

jnothman · 2018-01-19T02:09:46Z

we should be using this in Imputer after the module is moved, yeah? ping @ashimb9, @sergeyf


sergeyf · 2018-01-19T03:50:21Z

I'm probably not following this 100%, but both Imputer and MICEImputer currently have force_all_finite=False. I assume we want it to be changed to 'allow-nan' everywhere there? Why not just leave it as False?

glemaitre · 2018-01-19T07:34:37Z

Because you want to block inf: it was a bug in the Imputer, which was removing the column containing inf instead of raising an error.

sergeyf · 2018-01-19T16:10:12Z

Gotcha! Should I merge to master and then change it to allow-nan now or make a separate PR after MICE is merged?

glemaitre · 2018-01-19T17:12:10Z

@sergeyf merge and change it

sergeyf · 2018-01-19T20:06:43Z

@glemaitre Will do.

Just to confirm, we want some more complex logic now:

if self.missing_values == "NaN":
    X = check_array(X, force_all_finite='allow-nan')
else:
    X = check_array(X, force_all_finite=True)

Right?

jnothman · 2018-01-22T12:38:11Z

Yes, I suppose that's right.


jnothman · 2018-01-22T12:38:44Z

Or you can use an inline if: `check_array(X, force_all_finite='allow-nan' if self.missing_values == 'NaN' else True)`

jnothman · 2018-01-22T23:19:12Z

Yes, I think that's what it means. We should not support NaN in a feature matrix if missing_values != NaN. Sorry if that's a pain to change!

sergeyf · 2018-01-23T00:05:56Z

I think it's all good once the check_array has the flag in it?

iter

e761702

TST add tests for the diffrent functions

6df0d95

glemaitre changed the title ~~[WIP] MAINT dissociate nan and inf in check_array~~ [MRG] MAINT dissociate nan and inf in check_array Jan 12, 2018

glemaitre mentioned this pull request Jan 13, 2018

[WIP] Disregard nan fix #10457

Closed

4 tasks

jnothman requested changes Jan 14, 2018

View reviewed changes

adress joel comments

6b5d753

jnothman reviewed Jan 14, 2018

View reviewed changes

glemaitre added 2 commits January 15, 2018 11:47

address joel comments

89655d4

FIX boolean operation

bf6e50e

jnothman reviewed Jan 15, 2018

View reviewed changes

glemaitre added 5 commits January 16, 2018 00:15

joel comments

dd078e0

iter

7a7e81c

iter

2a8a744

iter

2e9ea5d

iter

9453c66

jnothman reviewed Jan 16, 2018

View reviewed changes

jnothman changed the title ~~[MRG] MAINT dissociate nan and inf in check_array~~ [MRG+1] MAINT dissociate nan and inf in check_array Jan 16, 2018

jnothman approved these changes Jan 16, 2018

View reviewed changes

glemaitre added 2 commits January 16, 2018 13:28

iter

69d2676

DOC fix docstring

b2de79c

TomDLT approved these changes Jan 18, 2018

View reviewed changes

TomDLT merged commit 4c99cd4 into scikit-learn:master Jan 18, 2018

TomDLT mentioned this pull request Jan 18, 2018

[MRG] ENH Permit NaN while allowing to filter out inf in validation tools. #7892

Closed

4 tasks

sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Jan 22, 2018

check_array change to comply with chages scikit-learn#10459

135bd86


