[MRG] ENH Permit NaN while allowing to filter out inf in validation tools. #7892

raghavrv · 2016-11-16T14:32:14Z

Allows passing nan but not inf in validation tools. This will be required as we introduce more methods / imputers permitting missing values as 'NaN' / np.nan.
Refactors a tiny bit out of [MRG] ENH Add support for missing values to Tree based Classifiers #5974

For instance, the imputer currently sets force_all_finite to False on this line, as that is the only way to permit NaN. But this also means that inf passes through without any checks. There is currently no test for this, but the small sample below illustrates the point. (It is also added as a non-regression test, NRT.)

In master

>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> X = [[np.inf, 8, 9, np.nan], [np.nan, 10, 10, 0], [10, 11, 9, 11]]
# For axis=0, it silently discards column 0, assuming it contains only missing values. An error would be more appropriate in this case.
>>> Imputer(axis=0).fit_transform(X)
array([[  8. ,   9. ,   5.5],
       [ 10. ,  10. ,   0. ],
       [ 11. ,   9. ,  11. ]])

# This error points to the right row but conveys an incorrect message
>>> Imputer(axis=1).fit_transform(X)
ValueError: Some rows only contain missing values: [0]

In this branch

>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> X = [[np.inf, 8, 9, np.nan], [np.nan, 10, 10, 0], [10, 11, 9, 11]]
>>> Imputer(axis=0).fit_transform(X)
ValueError: Input contains infinity or a value too large for dtype('float64').
>>> Imputer(axis=1).fit_transform(X)
ValueError: Input contains infinity or a value too large for dtype('float64').

TODO

Add allow_nan to selectively permit nan without allowing inf values.
Fix imputer to use allow_nan
Add NRT to check raised errors.
Handle the case when missing_values is +/-np.inf
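
The TODO items above can be sketched with plain NumPy. This is only an illustration under stated assumptions, not the code in this branch: check_no_inf and its missing_values parameter are hypothetical names, and tolerating an infinite missing_values marker is one possible reading of the last item.

```python
import numpy as np


def check_no_inf(X, missing_values=np.nan):
    """Hypothetical helper: permit NaN but reject inf.

    Assumption: if `missing_values` is itself +/-inf, that marker is
    tolerated and only the opposite-signed infinity is rejected.
    """
    X = np.asarray(X, dtype=np.float64)
    inf_mask = np.isinf(X)  # True for +inf and -inf, False for NaN
    if isinstance(missing_values, float) and np.isinf(missing_values):
        # The infinite value is the missing-value marker: do not flag it.
        inf_mask &= X != missing_values
    if inf_mask.any():
        raise ValueError("Input contains infinity or a value too large "
                         "for %r." % X.dtype)
    return X
```

With the X from the examples above, check_no_inf(X) raises the ValueError, while an input containing only NaN and finite values passes through.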

@tguillemot @agramfort @amueller @jnothman Reviews please? :)

Whether to disallow np.nan when missing_values is not NaN is debatable, as doing so seems to break backward compatibility.

amueller

I think this is alright. It adds more complexity to an already complex part of sklearn, though. We might consider not exposing this in check_array and just calling assert_all_finite() in the imputers.

amueller · 2016-11-16T22:17:36Z

sklearn/utils/tests/test_validation.py

@@ -141,6 +141,11 @@ def test_check_array():
    X_nan[0, 0] = np.nan
    assert_raises(ValueError, check_array, X_nan)
    check_array(X_inf, force_all_finite=False)  # no raise
+    # allow_nan check
+    check_array(X_nan, force_all_finite=True, allow_nan=True)  # no raise


this is a bit hard to read, but I guess it's too late to rename force_all_finite.

amueller · 2016-11-16T22:21:29Z

sklearn/utils/validation.py

-                copy=False, force_all_finite=True, ensure_2d=True,
-                allow_nd=False, ensure_min_samples=1, ensure_min_features=1,
-                warn_on_dtype=False, estimator=None):
+                copy=False, force_all_finite=True, allow_nan=False,


this should go to the end in case someone used non-kwargs, right?

Ah yes indeed! Thanks!

I don't think we'd usually worry about this... We've previously taken the approach of assuming that kwargs are required after an unspecified small number of required args...
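
To make the concern concrete, here is a small sketch of how inserting a parameter in the middle of a signature changes what a positional call binds to. The two functions are hypothetical stand-ins with abbreviated signatures, not the real check_array.

```python
import inspect


def check_array_old(array, copy=False, force_all_finite=True, ensure_2d=True):
    """Stand-in for the signature before the change (abbreviated)."""


def check_array_new(array, copy=False, force_all_finite=True,
                    allow_nan=False, ensure_2d=True):
    """Stand-in with allow_nan inserted before ensure_2d."""


def bound_arguments(func, *args):
    """Return the name -> value mapping a positional call would produce."""
    return dict(inspect.signature(func).bind(*args).arguments)


# A caller who passed ensure_2d=False positionally:
old = bound_arguments(check_array_old, "X", False, True, False)
new = bound_arguments(check_array_new, "X", False, True, False)
print(old)  # the fourth value lands on ensure_2d
print(new)  # the same value now lands on allow_nan
```

Appending the new parameter at the end, or making it keyword-only, avoids this silent rebinding.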

jnothman · 2016-11-16T22:36:23Z

sklearn/utils/validation.py

@@ -83,16 +90,19 @@ def as_float_array(X, copy=True, force_all_finite=True):
    force_all_finite : boolean (default=True)
        Whether to raise an error on np.inf and np.nan in X.

+    allow_nan : boolean (default=False)
+        Whether to allow nan values in X.


But I think you don't currently check in the case that force_all_finite=False and allow_nan=False. Do one of:

handle this case

document this behaviour

make 'allow_nan' a value for force_all_finite rather than an additional parameter.
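
The third option can be sketched as a single three-valued parameter. This is an illustration only: assert_finite is a hypothetical helper, and the string 'allow-nan' is an assumed spelling, not something this PR defines.

```python
import numpy as np


def assert_finite(X, force_all_finite=True):
    """Hypothetical helper with a three-valued force_all_finite.

    True        -> reject both NaN and inf
    'allow-nan' -> reject inf only
    False       -> accept everything
    """
    if force_all_finite not in (True, False, "allow-nan"):
        raise ValueError("force_all_finite should be a bool or 'allow-nan', "
                         "got %r" % (force_all_finite,))
    X = np.asarray(X, dtype=np.float64)
    if not force_all_finite:
        return X  # no checks at all
    if force_all_finite == "allow-nan":
        bad = np.isinf(X).any()
        what = "infinity"
    else:
        bad = not np.isfinite(X).all()
        what = "NaN, infinity"
    if bad:
        raise ValueError("Input contains %s or a value too large for %r."
                         % (what, X.dtype))
    return X
```

A single parameter has no equivalent of force_all_finite=False combined with allow_nan=False, so that unchecked case does not arise.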

jnothman · 2016-11-16T22:36:26Z

sklearn/utils/validation.py

-        raise ValueError("Input contains NaN, infinity"
-                         " or a value too large for %r." % X.dtype)
+    if allow_nan:
+        def any_not_isfinite(X): return np.isinf(X).any()


It seems to me there's a "not" missing here.
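
For reference, the two NumPy predicates involved treat NaN differently. A quick check, not taken from the PR:

```python
import numpy as np

values = np.array([1.0, np.nan, np.inf, -np.inf])

# isfinite is False for NaN and for both infinities.
print(np.isfinite(values))
# isinf is True only for the infinities, so NaN is not flagged.
print(np.isinf(values))

has_non_finite = not np.isfinite(values).all()  # NaN or inf present
has_inf = np.isinf(values).any()                # inf present, NaN ignored
```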

TomDLT · 2018-01-18T12:58:44Z

closed by #10459

raghavrv mentioned this pull request Nov 16, 2016

[MRG+2-1] ENH add a ValueDropper to artificially insert missing values (NMAR or MCAR) to the dataset #7084

Closed

4 tasks

amueller reviewed Nov 16, 2016

jnothman requested changes Nov 16, 2016

raghavrv mentioned this pull request Dec 18, 2016

[MRG] ENH Allow handling nan during input validation #8074

Closed

raghavrv added 2 commits December 18, 2016 14:48

ENH Permit NaN in check_array

d46694c

Validate convert to np.nan; Check for inf

ecc5b47

raghavrv force-pushed the permit_nan_in_check_array branch from 3a0a7ab to ecc5b47 Compare December 18, 2016 13:48

glemaitre mentioned this pull request Jan 11, 2018

[RFC] Dissociate NaN and Inf when considering force_all_finite in check_array #10455

Closed

TomDLT closed this Jan 18, 2018
