[MRG+2] warn_on_dtype for DataFrames #10949

wdevazelhes · 2018-04-10T15:30:56Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

A pandas DataFrame can hold a different dtype in each of its columns (Series). This PR fixes #10948 and makes the warning message specific to DataFrames with several dtypes. For now the code only adds some "ifs", and the resulting behaviour may not be exactly what we want. If the approach is OK, I will also add some tests.

The following examples show the new behaviour of this PR:

Example 1 (the one in the issue):

from sklearn.utils.validation import check_array
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [2, 3, 4]], dtype=object)
checked = check_array(df, warn_on_dtype=True)

Returns:

DataConversionWarning: Data with input dtype object were all converted to float64.
  warnings.warn(msg, DataConversionWarning)

Example 2: some dtypes differ from the requested one

from sklearn.utils.validation import check_array
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [2, 3, 4]], dtype=object)
df[1] = df[1].astype(float)
df[2] = df[2].astype(int)
# column 0 stays object
checked = check_array(df, warn_on_dtype=True) # asks for conversion to numerical type

Returns:

DataConversionWarning: Data with input dtype int64, float64, object were all converted to float64.
  warnings.warn(msg, DataConversionWarning)

Example 3: all dtypes differ from the requested one

from sklearn.utils.validation import check_array
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [2, 3, 4]], dtype=object)
df[1] = df[1].astype(float)
checked = check_array(df, dtype=['int'], warn_on_dtype=True) # asks for conversion to int

Returns:

DataConversionWarning: Data with input dtype float64, object were all converted to int64.
  warnings.warn(msg, DataConversionWarning)
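
The behaviour shown in the three examples can be sketched without scikit-learn. This is only an illustration under stated assumptions: `convert_and_warn` is a hypothetical helper, plain NumPy columns stand in for a DataFrame, and `UserWarning` stands in for `DataConversionWarning`.

```python
import warnings

import numpy as np


def convert_and_warn(columns, dtype=np.float64):
    """Hypothetical stand-in for the DataFrame branch of check_array."""
    columns = [np.asarray(col) for col in columns]
    # Remember each column's dtype before conversion (like `df.dtypes`).
    dtypes_orig = [col.dtype for col in columns]
    array = np.column_stack(columns).astype(dtype)
    # Warn if at least one column did not already have the final dtype.
    if {array.dtype} != set(dtypes_orig):
        names = ", ".join(sorted(str(d) for d in set(dtypes_orig)))
        warnings.warn("Data with input dtype %s were all converted to %s."
                      % (names, array.dtype), UserWarning)
    return array


mixed = [np.array([1, 2], dtype="int64"), np.array([1.5, 2.5])]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = convert_and_warn(mixed)
print(result.dtype, len(caught))  # float64 1
```

When every column already has the target dtype, the two sets are equal and no warning is emitted.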

rth

Thanks for your PR! Overall this feature makes sense, I think. I haven't reviewed the code yet.

Please change the title to something meaningful (i.e. describe what this PR does). Also, please add some unit tests that are skipped if pandas is not found (cf. this example).

wdevazelhes · 2018-04-11T06:02:45Z

Will do, thanks! Should I also add some tests with MockDataFrame (like this example)? (I would then need to add a dtypes attribute to MockDataFrame.)

jnothman · 2018-04-11T07:25:25Z

You might be better off doing it with a real DataFrame, using importorskip to skip the test where pandas is unavailable.
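
A minimal sketch of that pattern, assuming pytest is the test runner. The test name and its assertions are illustrative and are not the test added in this PR; it only checks that a mixed-dtype DataFrame collapses to a single dtype on conversion.

```python
import pytest


def test_mixed_dtype_frame_is_upcast():
    # Skip (rather than fail) when the optional dependencies are missing.
    np = pytest.importorskip("numpy")
    pd = pytest.importorskip("pandas")

    df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
    # One integer column and one float column: two distinct dtypes.
    assert len(set(df.dtypes)) == 2
    # Converting to an ndarray yields a single float dtype.
    assert np.asarray(df).dtype == np.float64
```

Because the import happens inside the test, the rest of the test module still runs on machines without pandas.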

wdevazelhes · 2018-04-16T16:11:16Z

Done, I also changed the title of the PR to MRG, since all tests pass

GaelVaroquaux · 2018-04-18T08:43:57Z

sklearn/utils/validation.py

+    # check if the object contains several dtypes (typically a pandas
+    # DataFrame), and store them. If not, store None.
+    dtypes_orig = None
+    if hasattr(array, "dtypes"):


Maybe we should be a bit more precise with the duck typing, to avoid capturing other objects that happen by chance to have a dtype attribute. For instance, we could test for "__array__", which I believe is the method NumPy uses to convert an object to a NumPy array.

Note that the attribute here is dtypes, not dtype.

Thanks, will do
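
One way to read the stricter duck typing discussed in this thread is to require both the `dtypes` attribute and an `__array__` method. The sketch below is illustrative only: `looks_like_dataframe` and the two stand-in classes are hypothetical names, not code from the PR.

```python
def looks_like_dataframe(obj):
    """Hypothetical helper: require both `dtypes` and `__array__`."""
    return hasattr(obj, "dtypes") and hasattr(obj, "__array__")


class HasDtypesOnly:
    # Stand-in for an unrelated object that happens to expose `dtypes`.
    dtypes = ["int64"]


class FrameLike(HasDtypesOnly):
    # Stand-in for a DataFrame: per-column dtypes plus array conversion.
    def __array__(self, dtype=None, copy=None):
        raise NotImplementedError("illustration only")


print(looks_like_dataframe(HasDtypesOnly()))  # False
print(looks_like_dataframe(FrameLike()))      # True
print(looks_like_dataframe([1, 2, 3]))        # False
```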

GaelVaroquaux · 2018-04-18T08:46:46Z

sklearn/utils/validation.py

+        # some data must have been converted
+        msg = ("Data with input dtype %s were all converted to %s%s."
+               % (', '.join(map(str, set(dtypes_orig))), array.dtype, context))
+        warnings.warn(msg, DataConversionWarning)


Maybe add "stacklevel=2" here, or should we have 3?

Thanks, will do
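
To make the 2-versus-3 question concrete, here is a small standard-library sketch of which frame a warning is attributed to at each `stacklevel`. The functions `check_input`, `fit` and `user_code` are hypothetical stand-ins for a validation helper, library code calling it, and the user's script.

```python
import warnings


def check_input(stacklevel):
    # Stand-in for a validation helper such as check_array.
    warnings.warn("data were converted", UserWarning, stacklevel=stacklevel)


def fit(stacklevel):
    # Stand-in for library code calling the helper on the user's behalf.
    check_input(stacklevel)


def user_code(stacklevel):
    # Stand-in for the user's script; returns the line the warning points at.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        fit(stacklevel)
    return caught[0].lineno


def owner_of(lineno):
    """Name of the function above whose source range contains `lineno`."""
    starts = sorted((f.__code__.co_firstlineno, f.__name__)
                    for f in (check_input, fit, user_code))
    return [name for start, name in starts if start <= lineno][-1]


for level in (1, 2, 3):
    print(level, owner_of(user_code(level)))
# 1 check_input
# 2 fit
# 3 user_code
```

With one intermediate library call, `stacklevel=2` still points inside the library; only `stacklevel=3` reaches the caller, and deeper call chains would need a larger value.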

GaelVaroquaux · 2018-04-18T08:49:06Z

+1 on merge once the small comments have been addressed.

GaelVaroquaux · 2018-04-18T14:21:30Z

Note that the attribute here is dtypes, not dtype.

Yes indeed. I was still thinking that we could do a stronger duck-typing.

wdevazelhes · 2018-05-25T09:45:28Z

Thanks for the review @GaelVaroquaux, I'll update the code according to your comments and resolve the conflicts.

jnothman · 2018-06-20T11:27:58Z

sklearn/utils/validation.py

+        # (for instance in a DataFrame that can contain several dtypes) then
+        # some data must have been converted
+        msg = ("Data with input dtype %s were all converted to %s%s."
+               % (', '.join(map(str, set(dtypes_orig))), array.dtype, context))


This does not have a deterministic order, which is probably fine. I would have preferred '%s' % sorted(dtypes_orig), but this will be alright.
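
A sketch of a deterministic variant of the message formatting. As an assumption made for this illustration, it sorts the dtype names as strings rather than sorting the dtype objects themselves; `format_dtypes` is a hypothetical helper.

```python
def format_dtypes(dtypes):
    """Hypothetical formatter: de-duplicate, then sort by name."""
    return ", ".join(sorted({str(d) for d in dtypes}))


print(format_dtypes(["object", "int64", "float64", "int64"]))
# float64, int64, object
```

Sorting after de-duplication keeps the message stable across runs, which also makes it easier to match in tests.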

jnothman · 2018-06-20T11:28:29Z

sklearn/utils/validation.py

+        # some data must have been converted
+        msg = ("Data with input dtype %s were all converted to %s%s."
+               % (', '.join(map(str, set(dtypes_orig))), array.dtype, context))
+        warnings.warn(msg, DataConversionWarning, stacklevel=2)


The other DataConversionWarning does not use stacklevel. Generally, I find stacklevel unhelpful. (The best way to get the stack info correct is to run again with an 'error' filter.)

I personally find correct stacklevels very useful (typically when not at the interactive interpreter), however, since this is in check_array which is then still used in other sklearn methods (and not directly by the user), I agree this is less useful (or at least should be more than 2)
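
The 'error' filter approach mentioned above can be sketched with the standard library alone. `check_input` and `fit` are hypothetical stand-ins; the point is that turning the warning into an exception exposes the whole call chain, whatever `stacklevel` was used.

```python
import traceback
import warnings


def check_input():
    # Stand-in for a validation helper that emits the warning.
    warnings.warn("data were converted", UserWarning)


def fit():
    # Stand-in for library code that calls the helper.
    check_input()


with warnings.catch_warnings():
    # Turn the warning into an exception so the full call chain is visible.
    warnings.simplefilter("error", UserWarning)
    try:
        fit()
    except UserWarning:
        trace = traceback.format_exc()

print(trace)
```

The same effect is available from the command line with `python -W error`, without touching the code.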

jnothman

But LGTM

jnothman · 2018-06-26T13:31:44Z

Do we think this needs a what's new entry?

jorisvandenbossche · 2018-06-27T21:04:40Z

sklearn/utils/validation.py

@@ -581,6 +587,15 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,
    if copy and np.may_share_memory(array, array_orig):
        array = np.array(array, dtype=dtype, order=order)

+    if warn_on_dtype and dtypes_orig is not None and {array.dtype} != \
+            set(dtypes_orig):


@jnothman does sklearn have some style rules on avoiding \ in new code for line continuation?

We usually avoid it, and here it can be easily avoided by bracketing ...

jorisvandenbossche · 2018-06-27T21:07:37Z

sklearn/utils/validation.py

+        # some data must have been converted
+        msg = ("Data with input dtype %s were all converted to %s%s."
+               % (', '.join(map(str, set(dtypes_orig))), array.dtype, context))
+        warnings.warn(msg, DataConversionWarning, stacklevel=2)


I personally find correct stacklevels very useful (typically when not at the interactive interpreter), however, since this is in check_array which is then still used in other sklearn methods (and not directly by the user), I agree this is less useful (or at least should be more than 2)

wdevazelhes · 2018-06-28T07:37:23Z

Thanks for the review @jnothman and @jorisvandenbossche
I just updated the code with your comments

jnothman · 2018-06-28T08:09:39Z

sklearn/utils/validation.py

@@ -568,14 +568,15 @@ def check_array(array, accept_sparse=False, dtype="numeric", order=None,
    if copy and np.may_share_memory(array, array_orig):
        array = np.array(array, dtype=dtype, order=order)

-    if warn_on_dtype and dtypes_orig is not None and {array.dtype} != \
-            set(dtypes_orig):
+    if (warn_on_dtype and dtypes_orig is not None and {array.dtype} !=


I would much prefer the line break next to a Boolean operator rather than a comparison, given the order of precedence!

You're right
Done
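
For readers following the style discussion, the two continuation styles can be compared side by side. Plain strings stand in for dtype objects, and the variable names mirror the diff only for illustration.

```python
warn_on_dtype = True
dtypes_orig = ["int64", "float64"]
converted_dtype = "float64"

# Backslash continuation, as in the first version of the diff.
with_backslash = warn_on_dtype and dtypes_orig is not None and \
    {converted_dtype} != set(dtypes_orig)

# Parentheses, with the break placed after a Boolean operator so the
# comparison stays on one line and the precedence is easy to read.
with_parens = (warn_on_dtype and dtypes_orig is not None and
               {converted_dtype} != set(dtypes_orig))

print(with_backslash, with_parens)  # True True
```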

jnothman · 2018-06-28T12:00:36Z

Thanks!!

jorisvandenbossche · 2018-09-24T15:18:58Z

@wdevazelhes In #12104, @ogrisel is removing the warning in case the original data was object dtype data.

In your original issue report, you used an object array as example:

df = pd.DataFrame([[1, 2, 3], [2, 3, 4]], dtype=object)
checked = check_array(df, warn_on_dtype=True)

Did you have a specific use case for this (specifically for object dtype)? Or was this just a small example to reproduce the issue, and the actual case where you encountered it was not about numeric object being cast to object?

jorisvandenbossche · 2018-09-24T15:28:01Z

Another comment, now that I am looking at the changes in this PR: I think having an all-numeric DataFrame that consists of mixed int/float columns is quite common. Currently that is processed fine by sklearn transformers, as it simply gets converted to a float array.

But with the change in this PR, that will start to raise a warning in e.g. the StandardScaler. Do we think it is actually worth warning in such a case?
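
The concern can be checked with NumPy alone. In this sketch `np.result_type` is used as a stand-in for the conversion (an assumption, since the real conversion goes through pandas), and the condition is the one from the diff earlier in this thread.

```python
import numpy as np

# Per-column dtypes of a hypothetical all-numeric frame.
dtypes_orig = [np.dtype("int64"), np.dtype("float64")]

# Common dtype NumPy picks when these columns are combined into one array.
converted = np.result_type(*dtypes_orig)

# Condition from the diff: true whenever any column had a different dtype.
would_warn = {converted} != set(dtypes_orig)
print(converted, would_warn)  # float64 True
```

So an ordinary int/float frame satisfies the condition, even though nothing surprising happens to the data.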

wdevazelhes · 2018-09-25T16:27:11Z

@wdevazelhes In #12104, @ogrisel is removing the warning in case the original data was object dtype data.

In your original issue report, you used an object array as example:
df = pd.DataFrame([[1, 2, 3], [2, 3, 4]], dtype=object)
checked = check_array(df, warn_on_dtype=True)
Did you have a specific use case for this (specifically for object dtype)? Or was this just a small example to reproduce the issue, and the actual case where you encountered it was not about numeric object being cast to object?

I think this was just a small example to reproduce the issue; I could probably have used something other than object.

Another comment, now that I am looking at the changes in this PR: I think having an all-numeric DataFrame that consists of mixed int/float columns is quite common. Currently that is processed fine by sklearn transformers, as it simply gets converted to a float array.

But with the change in this PR, that will start to raise a warning in e.g. the StandardScaler. Do we think it is actually worth warning in such a case?

In fact, if I remember correctly, at that time I was trying to understand when check_array would return a copy of my data, and I used warn_on_dtype as a proxy for that (when there is a conversion I think there is also a copy, although I'm not sure that is always the case).
I agree that it might be too much to raise a warning when, for instance, ints are converted to floats inside a transformer.

qinhanmin2014 · 2018-11-20T15:54:45Z

FYI, it seems that this introduced a bug (#12622). Suggestions are welcome, since we're going to include the fix in 0.20.1.

amueller · 2018-11-20T16:11:05Z

fix in #12625

[WIP] fixes scikit-learn#10948

606f9f9

rth reviewed Apr 10, 2018

wdevazelhes changed the title ~~[WIP] fixes #10948~~ [WIP] warn_on_dtype for DataFrames Apr 11, 2018

fix travis test with pandas 0.20

35bb7a1

Add test.

17de6aa

wdevazelhes changed the title ~~[WIP] warn_on_dtype for DataFrames~~ [MRG] warn_on_dtype for DataFrames Apr 11, 2018

GaelVaroquaux reviewed Apr 18, 2018

GaelVaroquaux changed the title ~~[MRG] warn_on_dtype for DataFrames~~ [MRG+1] warn_on_dtype for DataFrames Apr 18, 2018

William de Vazelhes added 4 commits May 25, 2018 11:50

Merge branch 'master' into fix/check_array_df_conversion

ea91bd3

# Conflicts: # sklearn/utils/validation.py

ENH: add stacklevel=2 for better warning log

5a3aa9e

ENH: add more precise duck typing (see comment scikit-learn#10949 (co…

f6843c7

…mment))

FIX: flake8 remove blank line

b8e0d10

jnothman reviewed Jun 20, 2018

jnothman approved these changes Jun 20, 2018

jnothman changed the title ~~[MRG+1] warn_on_dtype for DataFrames~~ [MRG+2] warn_on_dtype for DataFrames Jun 20, 2018

Merge branch 'master' into fix/check_array_df_conversion

36b3511

jorisvandenbossche approved these changes Jun 27, 2018

William de Vazelhes added 2 commits June 28, 2018 09:32

FIX take into account review comments:

2b906b0

- sort dtypes to make output deterministic - put parenthesis instead of backslash for continuation line - put stacklevel=3 for more targeted output

Merge branch 'fix/check_array_df_conversion' of https://github.com/wd…

122ade5

…evazelhes/scikit-learn into fix/check_array_df_conversion

jnothman reviewed Jun 28, 2018

REF move line break after logical operator

c37e7ba

jnothman merged commit 42e6d4e into scikit-learn:master Jun 28, 2018

jorisvandenbossche mentioned this pull request Sep 24, 2018

[MRG] Convert ColumnTransformer input list to numpy array #12104

Merged

qinhanmin2014 mentioned this pull request Nov 20, 2018

TypeError: "iteration over a 0-d array" when trying to preprocessing.scale a pandas.Series #12622

Closed

amueller mentioned this pull request Sep 25, 2019

MaxAbsScaler Upcasts Pandas to float64 #15093

Closed
