[MRG+2] warn_on_dtype for DataFrames #10949
Conversation
Thanks for your PR! Overall this feature makes sense, I think. I haven't reviewed the code yet.
Please change the title to something meaningful (i.e. describe what this PR does). Also please add some unit tests that are skipped if pandas is not found (cf this example)
Will do, thanks! Should I also make some tests with …
you might be better off doing it with a real DataFrame, using importorskip to skip the test when pandas is unavailable
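A sketch of such a test using pytest's importorskip, as suggested above (hypothetical test name and data, not the actual test added in the PR):

```python
import pytest


def test_check_array_dataframe_mixed_dtypes():
    # importorskip skips this test entirely when pandas is unavailable,
    # instead of failing with an ImportError.
    pd = pytest.importorskip("pandas")

    # A real DataFrame with mixed per-column dtypes.
    df = pd.DataFrame({"ints": [1, 2], "floats": [1.0, 2.0]})
    assert len(set(df.dtypes)) == 2
```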
Done, I also changed the title of the PR to MRG, since all tests pass
sklearn/utils/validation.py
Outdated
    # check if the object contains several dtypes (typically a pandas
    # DataFrame), and store them. If not, store None.
    dtypes_orig = None
    if hasattr(array, "dtypes"):
Maybe we should be a bit more precise on the duck typing, to avoid capturing other objects that happen by chance to have a dtype attribute. For instance, we could test for __array__, which, I believe, is the method used by numpy to convert to a numpy array.
Note that the attribute here is dtypes, not dtype.
Thanks, will do
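The duck-typing concern above can be illustrated as follows (hypothetical impostor class; __array__ is the hook numpy uses when converting an object to an ndarray):

```python
import pandas as pd


class NotAFrame:
    # Happens to expose a `dtypes` attribute without being array-like.
    dtypes = None


df = pd.DataFrame({"a": [1], "b": [1.0]})

# Checking only for `dtypes` also captures the impostor:
print(hasattr(NotAFrame(), "dtypes"))  # True

# Additionally requiring __array__ excludes it while keeping DataFrames:
print(hasattr(df, "__array__") and hasattr(df, "dtypes"))  # True
print(hasattr(NotAFrame(), "__array__"))                   # False
```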
sklearn/utils/validation.py
Outdated
    # some data must have been converted
    msg = ("Data with input dtype %s were all converted to %s%s."
           % (', '.join(map(str, set(dtypes_orig))), array.dtype, context))
    warnings.warn(msg, DataConversionWarning)
Maybe add "stacklevel=2" here, or should we have 3?
Thanks, will do
+1 on merge once the small comments have been addressed.
Note that the attribute here is dtypes, not dtype.
Yes indeed. I was still thinking that we could do a stronger duck-typing.
Thanks for the review @GaelVaroquaux, I'll update the code according to your comments and resolve conflicts
sklearn/utils/validation.py
Outdated
    # (for instance in a DataFrame that can contain several dtypes) then
    # some data must have been converted
    msg = ("Data with input dtype %s were all converted to %s%s."
           % (', '.join(map(str, set(dtypes_orig))), array.dtype, context))
this does not have deterministic order... which is probably fine. I would have preferred '%s' % sorted(dtypes_orig), but this will be alright.
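The non-determinism comes from iterating a set; sorting the string forms makes the message stable (illustrative sketch, not the PR's final code):

```python
import numpy as np

dtypes_orig = [np.dtype("int64"), np.dtype("float64"), np.dtype("int64")]

# Set iteration order is not guaranteed to be stable across runs,
# so joining directly can produce messages in varying order:
unstable = ", ".join(map(str, set(dtypes_orig)))

# Sorting the string forms gives a deterministic message:
stable = ", ".join(sorted(map(str, set(dtypes_orig))))
print(stable)  # float64, int64
```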
sklearn/utils/validation.py
Outdated
    # some data must have been converted
    msg = ("Data with input dtype %s were all converted to %s%s."
           % (', '.join(map(str, set(dtypes_orig))), array.dtype, context))
    warnings.warn(msg, DataConversionWarning, stacklevel=2)
The other DataConversionWarning does not use stacklevel. Generally, I find stacklevel unhelpful. (The best way to get the stack info correct is to run again with an 'error' filter.)
I personally find correct stacklevels very useful (typically when not at the interactive interpreter). However, since this is in check_array, which is then still used in other sklearn methods (and not directly by the user), I agree this is less useful (or at least should be more than 2).
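How stacklevel shifts the location a warning is attributed to (minimal standalone sketch):

```python
import warnings


def inner():
    # stacklevel=2 attributes the warning to inner()'s caller;
    # stacklevel=3 would attribute it to the caller's caller, and so on.
    warnings.warn("data was converted", UserWarning, stacklevel=2)


def outer():
    inner()  # with stacklevel=2, this call site is the reported line


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    outer()

print(caught[0].message)  # data was converted
```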
But LGTM
Do we think this needs a what's new entry?
sklearn/utils/validation.py
Outdated
@@ -581,6 +587,15 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,
     if copy and np.may_share_memory(array, array_orig):
         array = np.array(array, dtype=dtype, order=order)

     if warn_on_dtype and dtypes_orig is not None and {array.dtype} != \
             set(dtypes_orig):
@jnothman does sklearn have some style rules on avoiding \ in new code for line continuation?
We usually avoid it, and here it can be easily avoided by bracketing ...
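The backslash can indeed be dropped by wrapping the condition in parentheses (illustrative standalone sketch with placeholder variables, not the actual check_array code):

```python
import numpy as np

warn_on_dtype = True
dtypes_orig = [np.dtype("int64"), np.dtype("float64")]
array_dtype = np.dtype("float64")

# Bracketed continuation instead of a trailing backslash:
if (warn_on_dtype and dtypes_orig is not None
        and {array_dtype} != set(dtypes_orig)):
    print("some columns were converted")
```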
sklearn/utils/validation.py
Outdated
    # some data must have been converted
    msg = ("Data with input dtype %s were all converted to %s%s."
           % (', '.join(map(str, set(dtypes_orig))), array.dtype, context))
    warnings.warn(msg, DataConversionWarning, stacklevel=2)
- sort dtypes to make output deterministic
- put parentheses instead of backslash for continuation line
- put stacklevel=3 for more targeted output
…evazelhes/scikit-learn into fix/check_array_df_conversion
Thanks for the review @jnothman and @jorisvandenbossche
sklearn/utils/validation.py
Outdated
@@ -568,14 +568,15 @@ def check_array(array, accept_sparse=False, dtype="numeric", order=None,
     if copy and np.may_share_memory(array, array_orig):
         array = np.array(array, dtype=dtype, order=order)

-    if warn_on_dtype and dtypes_orig is not None and {array.dtype} != \
-            set(dtypes_orig):
+    if (warn_on_dtype and dtypes_orig is not None and {array.dtype} !=
I would much prefer the line break next to a Boolean operator than a comparison, given order of precedence!
You're right
Done
Thanks!!
@wdevazelhes In #12104, @ogrisel is removing the warning in case the original data was of object dtype. In your original issue report, you used an object array as example: did you have a specific use case for this (specifically for object dtype)? Or was this rather a small example to reproduce the issue, and was the actual case where you encountered this not about numeric data being cast to object?
Another comment, now looking at the changes in this PR. I think a case where you have all numeric data in a DataFrame, but consisting of mixed int/float columns, is quite common. Currently that is processed fine by sklearn transformers, as it simply gets converted to a floating array. But with the change in this PR, that will start to raise a warning in e.g. the StandardScaler. Do we think it is actually worth warning in such a case?
I think this was just a small example to reproduce the issue, I could probably have put something else instead of object.
In fact, if I remember correctly, at that time I was trying to understand when …
FYI it seems that this introduces a bug (#12622); suggestions are welcome since we're going to include the fix in 0.20.1.
fix in #12625
Reference Issues/PRs
Fixes #10948
What does this implement/fix? Explain your changes.
Pandas DataFrames can have a different dtype for each Series they contain. This PR fixes #10948 while displaying a message specific to this case of DataFrames with several dtypes. However, this code just adds some "ifs" and the behaviour is maybe not what we would want. But if it is OK I will also add some tests. The following examples show the new behaviour of this PR:
Example 1 (the one in the issue):
Returns:
Example 2: some dtypes are different than the one asked
Returns:
Example 3: all dtypes are different than the one asked
Returns:
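The example snippets and their outputs were not preserved in this copy of the thread; the following is a hypothetical reconstruction of the pattern they illustrate (not the original code from the issue):

```python
import numpy as np
import pandas as pd

# A DataFrame with a different dtype per column.
df = pd.DataFrame({"ints": [1, 2], "floats": [1.0, 2.0]})

# Conversion to a single dtype collapses the per-column dtype
# information, which is what warn_on_dtype was meant to surface.
converted = np.asarray(df, dtype=np.float64)
print(sorted(map(str, set(df.dtypes))))  # ['float64', 'int64']
print(converted.dtype)                   # float64
```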