[MRG + 1] FIX be robust to columns name dtype, robust dtype checking #4541

amueller · 2015-04-07T19:01:56Z

Fixes #4540 and makes checks robust to things holding dtype object data.

amueller · 2015-04-07T19:03:30Z

This would be nice to have in 0.16.1

coveralls · 2015-04-07T19:16:19Z

Coverage increased (+0.0%) to 95.15% when pulling da106d5 on amueller:robust_input_dtype_check into 2305bdc on scikit-learn:master.

GaelVaroquaux · 2015-04-07T19:33:18Z

sklearn/utils/validation.py

    if sp.issparse(array):
-        if dtype == "numeric":
+        if numeric:


I don't understand how the above can not pass. "numeric" is defined just above as "numeric", so the if will always pass.

Oops, sorry, I read the code wrong.

I suggest renaming the variable "numeric" to "is_numeric", to avoid stupid people like me misreading the code.

Will do. The point is to save the original value of dtype, since dtype may be set to None later. Maybe the logic is not that great. Will add a comment, too.

I'll call it "dtype_numeric", as "is_numeric" could also refer to the input...
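For readers following the rename, here is a minimal sketch of the flag logic being discussed. It is not the actual body of check_array: the helper name resolve_dtype is hypothetical, and the sketch only assumes what the diff shows, namely that the "numeric" request is remembered in dtype_numeric before dtype itself is overwritten.

```python
import numpy as np


def resolve_dtype(array, dtype="numeric"):
    """Hypothetical helper: return (dtype_to_pass_on, dtype_numeric)."""
    # Remember that the caller asked for "numeric" before dtype is overwritten.
    dtype_numeric = isinstance(dtype, str) and dtype == "numeric"
    if dtype_numeric:
        dtype_orig = getattr(array, "dtype", None)
        if getattr(dtype_orig, "kind", None) == "O":
            # Object input: request a conversion to float.
            dtype = np.float64
        else:
            # Anything else: keep whatever dtype the input already has.
            dtype = None
    return dtype, dtype_numeric


obj_dtype, obj_flag = resolve_dtype(np.array([1, 2], dtype=object))
int_dtype, int_flag = resolve_dtype(np.array([1, 2]))
```

Because the flag is computed once up front, later code can still ask "did the caller want numeric output?" even after dtype has become None.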

GaelVaroquaux · 2015-04-07T19:57:20Z

sklearn/utils/tests/test_validation.py

+    assert_equal(check_array(X_df, ensure_2d=False).dtype.kind, "f")
+    # smoke-test against dataframes with column named "dtype"
+    X_df.dtype = "Hans"
+    assert_equal(check_array(X_df, ensure_2d=False).dtype.kind, "f")


I am a bit worried that the above is accepting a very weird object that should not be valid: the "dtype" column is not an iterable with the right length. I would expect check_array to fail.

check_array completely ignores the attribute. The MockDataFrame doesn't try to inspect self.__dict__, so the array that is returned has nothing to do with "Hans". The issue I am smoke-testing is that check_array assumes a dtype property to be a numpy dtype and crashes if it is not.
The semantics I intended in this fix are: "if dtype is not a valid dtype, then this thing is probably not a numpy array and we just try to convert it".
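The intended semantics can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: the MockDataFrame class here is a simplified stand-in written for this example, and input_dtype_or_none is a hypothetical helper name. The only idea taken from the discussion is that a dtype attribute which does not look like a numpy dtype is ignored.

```python
import numpy as np


class MockDataFrame:
    """Simplified stand-in for a dataframe (an assumption for illustration)."""

    def __init__(self, data):
        self._data = np.asarray(data)
        # Mimics a column named "dtype" shadowing the usual attribute.
        self.dtype = "Hans"

    def __array__(self, dtype=None, copy=None):
        # Conversion only looks at the wrapped data, never at self.dtype.
        return self._data if dtype is None else self._data.astype(dtype)


def input_dtype_or_none(obj):
    """Hypothetical helper: return obj.dtype only if it looks like a numpy dtype."""
    dtype_orig = getattr(obj, "dtype", None)
    if not hasattr(dtype_orig, "kind"):
        # Not a real dtype (e.g. a column named "dtype"): ignore it.
        return None
    return dtype_orig


df = MockDataFrame([[1.0, 2.0], [3.0, 4.0]])
df_dtype = input_dtype_or_none(df)
converted = np.array(df)
```

Here the string "Hans" has no kind attribute, so it is treated as "no dtype information" and the object is simply converted.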

Fair enough. You have convinced me.

coveralls · 2015-04-07T20:10:48Z

Coverage increased (+0.0%) to 95.15% when pulling 89c1dc5 on amueller:robust_input_dtype_check into 2305bdc on scikit-learn:master.

ogrisel · 2015-04-09T11:59:52Z

sklearn/utils/validation.py

                # if input is object, convert to float.
                dtype = np.float64
            else:
                dtype = None
        array = np.array(array, dtype=dtype, order=order, copy=copy)
+        # make sure we actually converted to numeric:
+        if dtype_numeric and array.dtype.kind == "O":
+            array = array.astype(np.float64)


Use from sklearn.utils.fixes import astype / array = astype(array, np.float64, copy=False) to avoid a memory copy when not necessary.

See also a related fix I just submitted here: #4555

They are guaranteed to be of a different kind, so how can you avoid a copy?
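The point about copies can be checked with plain NumPy. The sketch below uses ndarray.astype with copy=False directly rather than the sklearn.utils.fixes helper mentioned above, on the assumption that the installed NumPy supports the copy argument. It suggests that copy=False only helps when the dtype already matches; an object array converted to float64 always ends up in a new array.

```python
import numpy as np

obj = np.array([1, 2.5, 3], dtype=object)
flt = np.array([1.0, 2.5, 3.0], dtype=np.float64)

# The dtype differs, so a new float64 array has to be allocated.
converted = obj.astype(np.float64, copy=False)
converted_is_new = converted is not obj

# The dtype already matches, so the input array itself is returned.
same = flt.astype(np.float64, copy=False)
same_is_reused = same is flt
```

Since the added branch only runs when array.dtype.kind == "O", the conversion in that branch falls into the first case.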

ogrisel · 2015-04-09T12:17:20Z

Other than my comment, +1 for merge and backport to 0.16.X.

amueller · 2015-04-13T21:40:19Z

ping @ogrisel ;)

amueller · 2015-04-14T15:32:54Z

Can I get another review so we can do 0.16.1? @GaelVaroquaux @agramfort @jnothman @arjoly ?

GaelVaroquaux · 2015-04-14T15:45:19Z

+1 from me. Merging. Thanks!

I'll let you do the backport :).

amueller · 2015-04-14T15:45:32Z

Thanks :)

amueller changed the title ~~FIX be robust to columns name dtype, robust dtype checking~~ [MRG] FIX be robust to columns name dtype, robust dtype checking Apr 7, 2015

amueller mentioned this pull request Apr 7, 2015

Pandas Validation failure #4540

Closed

amueller added this to the 0.16.1 milestone Apr 7, 2015

GaelVaroquaux reviewed Apr 7, 2015

FIX be robust to columns name dtype and also to dataframes that hold …

89c1dc5

…dtype=object.

amueller force-pushed the robust_input_dtype_check branch from da106d5 to 89c1dc5 Compare April 7, 2015 19:55

GaelVaroquaux reviewed Apr 7, 2015

ogrisel reviewed Apr 9, 2015

amueller changed the title ~~[MRG] FIX be robust to columns name dtype, robust dtype checking~~ [MRG + 1] FIX be robust to columns name dtype, robust dtype checking Apr 14, 2015

GaelVaroquaux added a commit that referenced this pull request Apr 14, 2015

Merge pull request #4541 from amueller/robust_input_dtype_check

203298e

[MRG + 1] FIX be robust to columns name dtype, robust dtype checking

GaelVaroquaux merged commit 203298e into scikit-learn:master Apr 14, 2015

amueller deleted the robust_input_dtype_check branch May 19, 2017 20:46
