-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
[MRG + 1] FIX be robust to columns name dtype, robust dtype checking #4541
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG + 1] FIX be robust to columns name dtype, robust dtype checking #4541
Conversation
This would be nice to have in 0.16.1 |
if sp.issparse(array): | ||
if dtype == "numeric": | ||
if numeric: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how the above can not pass. "numeric" is defined just above as "numeric", so the if will always pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, sorry, I read the code wrong.
I suggest renaming the variable "numeric" to "is_numeric", to avoid stupid people like me misreading the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do. The point is to "safe" the original value of dtype after possibly setting it to "None" later. Maybe the logic is not that great. Will add a comment, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll call it "dtype_numeric", as "is_numeric" could also refer to the input...
da106d5
to
89c1dc5
Compare
assert_equal(check_array(X_df, ensure_2d=False).dtype.kind, "f") | ||
# smoke-test against dataframes with column named "dtype" | ||
X_df.dtype = "Hans" | ||
assert_equal(check_array(X_df, ensure_2d=False).dtype.kind, "f") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit worried that the above is accepting a very weird object that should not be valid: the "dtype" column is not an iterable with the right length. I would expect check_array to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check_array
completely ignores the attribute. The MockDataFrame doesn't try to inspect self.__dict__
so the array that is returned has nothing to do with "Hans". The issue I am smoke-testing is that check_array
assumes a dtype
property to be a numpy dtype and if it is not, it crashed.
The semantics I intended in this fix is that "if dtype is no vaild dtype, then this thing is probably not a numpy array and we just try to convert it".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. You have convinced me.
# if input is object, convert to float. | ||
dtype = np.float64 | ||
else: | ||
dtype = None | ||
array = np.array(array, dtype=dtype, order=order, copy=copy) | ||
# make sure we actually converted to numeric: | ||
if dtype_numeric and array.dtype.kind == "O": | ||
array = array.astype(np.float64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use from sklearn.utils.fixes import astype
/ array = astype(array, np.float64, copy=False)
to avoid a memory copy when not necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also a related fix I just submitted here: #4555
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they are guaranteed to be of different kind, how can you avoid a copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed.
Other than my comment, +1 for merge and backport to 0.16.X. |
ping @ogrisel ;) |
Can I get another review so we can do 0.16.1? @GaelVaroquaux @agramfort @jnothman @arjoly ? |
+1 from me. Merging. Thanks! I'll let you do the backport :). |
[MRG + 1] FIX be robust to columns name dtype, robust dtype checking
Thanks :) |
Fixes #4540 and makes checks robust to thing holding dtype object data.