Skip to content

Conversation

amueller
Copy link
Member

@amueller amueller commented Apr 7, 2015

Fixes #4540 and makes checks robust to thing holding dtype object data.

@amueller amueller changed the title FIX be robust to columns name dtype, robust dtype checking [MRG] FIX be robust to columns name dtype, robust dtype checking Apr 7, 2015
@amueller amueller added this to the 0.16.1 milestone Apr 7, 2015
@amueller
Copy link
Member Author

amueller commented Apr 7, 2015

This would be nice to have in 0.16.1

@coveralls
Copy link

Coverage Status

Coverage increased (+0.0%) to 95.15% when pulling da106d5 on amueller:robust_input_dtype_check into 2305bdc on scikit-learn:master.

if sp.issparse(array):
if dtype == "numeric":
if numeric:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how the above can not pass. "numeric" is defined just above as "numeric", so the if will always pass.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, sorry, I read the code wrong.

I suggest renaming the variable "numeric" to "is_numeric", to avoid stupid people like me misreading the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. The point is to "safe" the original value of dtype after possibly setting it to "None" later. Maybe the logic is not that great. Will add a comment, too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll call it "dtype_numeric", as "is_numeric" could also refer to the input...

@amueller amueller force-pushed the robust_input_dtype_check branch from da106d5 to 89c1dc5 Compare April 7, 2015 19:55
assert_equal(check_array(X_df, ensure_2d=False).dtype.kind, "f")
# smoke-test against dataframes with column named "dtype"
X_df.dtype = "Hans"
assert_equal(check_array(X_df, ensure_2d=False).dtype.kind, "f")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit worried that the above is accepting a very weird object that should not be valid: the "dtype" column is not an iterable with the right length. I would expect check_array to fail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_array completely ignores the attribute. The MockDataFrame doesn't try to inspect self.__dict__ so the array that is returned has nothing to do with "Hans". The issue I am smoke-testing is that check_array assumes a dtype property to be a numpy dtype and if it is not, it crashed.
The semantics I intended in this fix is that "if dtype is no vaild dtype, then this thing is probably not a numpy array and we just try to convert it".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. You have convinced me.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.0%) to 95.15% when pulling 89c1dc5 on amueller:robust_input_dtype_check into 2305bdc on scikit-learn:master.

# if input is object, convert to float.
dtype = np.float64
else:
dtype = None
array = np.array(array, dtype=dtype, order=order, copy=copy)
# make sure we actually converted to numeric:
if dtype_numeric and array.dtype.kind == "O":
array = array.astype(np.float64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use from sklearn.utils.fixes import astype / array = astype(array, np.float64, copy=False) to avoid a memory copy when not necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also a related fix I just submitted here: #4555

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are guaranteed to be of different kind, how can you avoid a copy?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed.

@ogrisel
Copy link
Member

ogrisel commented Apr 9, 2015

Other than my comment, +1 for merge and backport to 0.16.X.

@amueller
Copy link
Member Author

ping @ogrisel ;)

@amueller amueller changed the title [MRG] FIX be robust to columns name dtype, robust dtype checking [MRG + 1] FIX be robust to columns name dtype, robust dtype checking Apr 14, 2015
@amueller
Copy link
Member Author

Can I get another review so we can do 0.16.1? @GaelVaroquaux @agramfort @jnothman @arjoly ?

@GaelVaroquaux
Copy link
Member

+1 from me. Merging. Thanks!

I'll let you do the backport :).

GaelVaroquaux added a commit that referenced this pull request Apr 14, 2015
[MRG + 1] FIX be robust to columns name dtype, robust dtype checking
@GaelVaroquaux GaelVaroquaux merged commit 203298e into scikit-learn:master Apr 14, 2015
@amueller
Copy link
Member Author

Thanks :)

@amueller amueller deleted the robust_input_dtype_check branch May 19, 2017 20:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pandas Validation failure
4 participants