Skip to content

[MRG+2] warn_on_dtype for DataFrames #10949

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

wdevazelhes
Copy link
Contributor

Reference Issues/PRs

Fixes #10948

What does this implement/fix? Explain your changes.

Pandas DataFrames can have different dtypes for each Series they contain. This PR fixes #10948 while displaying a message specific to this case of DataFrames with several dtypes. However this code just adds some "ifs" and the behaviour is maybe not what we would want. But if it is OK I will also add some tests.

The following examples shows the new behaviour of this PR:

Example 1 (the one in the issue):

from sklearn.utils.validation import check_array
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [2, 3, 4]], dtype=object)
checked = check_array(df, warn_on_dtype=True)

Returns:

DataConversionWarning: Data with input dtype object were all converted to float64.
  warnings.warn(msg, DataConversionWarning)

Example 2: some dtypes are different than the one asked

from sklearn.utils.validation import check_array
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [2, 3, 4]], dtype=object)
df[1] = df[1].astype(float)
df[2] = df[2].astype(int)
# df3 stays object
checked = check_array(df, warn_on_dtype=True) # asks for conversion to numerical type

Returns:

DataConversionWarning: Data with input dtype int64, float64, object were all converted to float64.
  warnings.warn(msg, DataConversionWarning)

Example 3: all dtypes are different than the one asked

from sklearn.utils.validation import check_array
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [2, 3, 4]], dtype=object)
df[1] = df[1].astype(float)
checked = check_array(df, dtype=['int'], warn_on_dtype=True) # asks for conversion to int

Returns:

DataConversionWarning: Data with input dtype float64, object were all converted to int64.
  warnings.warn(msg, DataConversionWarning)

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR! Overall this feature makes sense, I think. I haven't reviewed the code yet.

Please change the title to something meaningful (i.e. describe what this PR does). Also please add some unit tests that are skipped if pandas is not found (cf this example)

@wdevazelhes
Copy link
Contributor Author

wdevazelhes commented Apr 11, 2018

Will do, thanks! Should I also make some tests with MockDataFrame (like this example) ? (I should then add a dtypes attribute to MockDataFrame)

@wdevazelhes wdevazelhes changed the title [WIP] fixes #10948 [WIP] warn_on_dtype for DataFrames Apr 11, 2018
@jnothman
Copy link
Member

jnothman commented Apr 11, 2018 via email

@wdevazelhes wdevazelhes changed the title [WIP] warn_on_dtype for DataFrames [MRG] warn_on_dtype for DataFrames Apr 11, 2018
@wdevazelhes
Copy link
Contributor Author

Done, I also changed the title of the PR to MRG, since all tests pass

# check if the object contains several dtypes (typically a pandas
# DataFrame), and store them. If not, store None.
dtypes_orig = None
if hasattr(array, "dtypes"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should be a bit more precise on the duck typing, to avoid to capture other objects that happen by chance to have a dtype attribute. For instance, we could test for "array" which, I believe is the method used by numpy to convert to a numpy array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the attribute here is dtypes, not dtype.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will do

# some data must have been converted
msg = ("Data with input dtype %s were all converted to %s%s."
% (', '.join(map(str, set(dtypes_orig))), array.dtype, context))
warnings.warn(msg, DataConversionWarning)
Copy link
Member

@GaelVaroquaux GaelVaroquaux Apr 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add "stacklevel=2" here, or should we have 3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will do

@GaelVaroquaux
Copy link
Member

+1 on merge once the small comments have been addressed.

@GaelVaroquaux GaelVaroquaux changed the title [MRG] warn_on_dtype for DataFrames [MRG+1] warn_on_dtype for DataFrames Apr 18, 2018
@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Apr 18, 2018 via email

@wdevazelhes
Copy link
Contributor Author

Thanks for the review @GaelVaroquaux , I ll update the code according to your comments and resolve conflicts

# (for instance in a DataFrame that can contain several dtypes) then
# some data must have been converted
msg = ("Data with input dtype %s were all converted to %s%s."
% (', '.join(map(str, set(dtypes_orig))), array.dtype, context))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not have deterministic order.... which is probably fine. I would have preferred '%s' % sorted(dtypes_orig) but this will be alright.

# some data must have been converted
msg = ("Data with input dtype %s were all converted to %s%s."
% (', '.join(map(str, set(dtypes_orig))), array.dtype, context))
warnings.warn(msg, DataConversionWarning, stacklevel=2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other DataConversionWarning does not use stacklevel. Generally, I find stacklevel unhelpful. (The best way to get the stack info correct is to run again with an 'error' filter.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally find correct stacklevels very useful (typically when not at the interactive interpreter), however, since this is in check_array which is then still used in other sklearn methods (and not directly by the user), I agree this is less useful (or at least should be more than 2)

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But LGTM

@jnothman jnothman changed the title [MRG+1] warn_on_dtype for DataFrames [MRG+2] warn_on_dtype for DataFrames Jun 20, 2018
@jnothman
Copy link
Member

Do we think this needs a what's new entry?

@@ -581,6 +587,15 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,
if copy and np.may_share_memory(array, array_orig):
array = np.array(array, dtype=dtype, order=order)

if warn_on_dtype and dtypes_orig is not None and {array.dtype} != \
set(dtypes_orig):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jnothman does sklearn have some style rules on avoiding \ in new code for line continuation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually avoid it, and here it can be easily avoided by bracketing ...

# some data must have been converted
msg = ("Data with input dtype %s were all converted to %s%s."
% (', '.join(map(str, set(dtypes_orig))), array.dtype, context))
warnings.warn(msg, DataConversionWarning, stacklevel=2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally find correct stacklevels very useful (typically when not at the interactive interpreter), however, since this is in check_array which is then still used in other sklearn methods (and not directly by the user), I agree this is less useful (or at least should be more than 2)

William de Vazelhes added 2 commits June 28, 2018 09:32
- sort dtypes to make output deterministic
- put parenthesis instead of backslash for continuation line
- put stacklevel=3 for more targeted output
@wdevazelhes
Copy link
Contributor Author

Thanks for the review @jnothman and @jorisvandenbossche
I just updated the code with your comments

@@ -568,14 +568,15 @@ def check_array(array, accept_sparse=False, dtype="numeric", order=None,
if copy and np.may_share_memory(array, array_orig):
array = np.array(array, dtype=dtype, order=order)

if warn_on_dtype and dtypes_orig is not None and {array.dtype} != \
set(dtypes_orig):
if (warn_on_dtype and dtypes_orig is not None and {array.dtype} !=
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would much prefer the line break next to a Boolean operators than a comparison, given order of precedence!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right
Done

@jnothman jnothman merged commit 42e6d4e into scikit-learn:master Jun 28, 2018
@jnothman
Copy link
Member

jnothman commented Jun 28, 2018 via email

@jorisvandenbossche
Copy link
Member

@wdevazelhes In #12104, @ogrisel is removing the warning in case the original data was object dtype data.

In your original issue report, you used an object array as example:

df = pd.DataFrame([[1, 2, 3], [2, 3, 4]], dtype=object)
checked = check_array(df, warn_on_dtype=True)

did you have a specific use case for this (specifically for object dtype)? Or was this rather as a small example to reproduce the issue and was the actual case where you encountered this not about numeric object being casted to object?

@jorisvandenbossche
Copy link
Member

Another comment, now looking at the changes in this PR. I think a case where you have all numeric data in a DataFrame, but which is consisting of mixed int/float columns, is quite common. Now that is processed fine by sklearn transformers as this gets simply converted to a floating array.

But with the change in this PR, that will start to raise a warning in eg the StandardScaler. Do we think it is actually worth to warn in such a case?

@wdevazelhes
Copy link
Contributor Author

@wdevazelhes In #12104, @ogrisel is removing the warning in case the original data was object dtype data.

In your original issue report, you used an object array as example:

df = pd.DataFrame([[1, 2, 3], [2, 3, 4]], dtype=object)
checked = check_array(df, warn_on_dtype=True)

did you have a specific use case for this (specifically for object dtype)? Or was this rather as a small example to reproduce the issue and was the actual case where you encountered this not about numeric object being casted to object?

I think this was just a small example to reproduce the issue, I could probably have put something else instead of object

Another comment, now looking at the changes in this PR. I think a case where you have all numeric data in a DataFrame, but which is consisting of mixed int/float columns, is quite common. Now that is processed fine by sklearn transformers as this gets simply converted to a floating array.

But with the change in this PR, that will start to raise a warning in eg the StandardScaler. Do we think it is actually worth to warn in such a case?

In fact if I remember correctly, at that time I was trying to understand when check_array would return a copy of my data or not, and I used warn_on_dtype as a proxy for that (because when there is a conversion I think there was also a copy although I'm not sure it is always the case)
I agree that it might be too much to raise a warning if ints are converted into floats for instance, inside a transformer

@qinhanmin2014
Copy link
Member

FYI seems that this introduces a bug #12622, suggestions are welcomed since we're going to include the fix in 0.20.1.

@amueller
Copy link
Member

fix in #12625

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

warn_on_dtype with DataFrame
7 participants