Skip to content

FIX: clean up unit conversion unpacking of data, particularly for dates and pandas series #11664

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

jklymak
Copy link
Member

@jklymak jklymak commented Jul 15, 2018

PR Summary

Closes #10022 and #11391

This does one small thing (run units conversion on the position argument of bxp/boxplot) and two big things:

  1. At unit conversion, run np.asarray on incoming objects to unpack numpy arrays from them. Previously in order to get native pandas support we were unpacking the values field, but running
    np.asarray does the same thing and allows the packed values to not necessarily be stored in values. i.e.
dates = np.arange('2005-02', '2005-05', dtype='datetime64[M]')
values = np.sin(np.array(range(len(dates))))
df = pd.DataFrame({'dates': dates, 'z': values})
print(df['dates'])
print(np.asarray(df['dates']))
print(df['dates'].values)

returns:

0   2005-02-01
1   2005-03-01
2   2005-04-01
Name: dates, dtype: datetime64[ns]
['2005-02-01T00:00:00.000000000' '2005-03-01T00:00:00.000000000'
 '2005-04-01T00:00:00.000000000']
['2005-02-01T00:00:00.000000000' '2005-03-01T00:00:00.000000000'
 '2005-04-01T00:00:00.000000000']

This re-does #10638, which unpacked pandas py checking for and using values.

  1. For the pandas tests, this deregisters the pandas converters, because its not clear we should be testing those. With the above, our native date handling should be able to strip pandas series of their date information and convert. This passes all the pandas tests, but I admit this may be a controversial step.

Immediate fix code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.plotting.deregister_matplotlib_converters()

data = np.random.rand(5,2)
years = pd.date_range('1/1/2000',
                      periods=2, freq=pd.DateOffset(years=1)).year
# Does not work
plt.figure()
plt.boxplot(data, positions=years)
plt.show()

failed on master. This fixes...

PR Checklist

  • Has Pytest style unit tests
  • Code is PEP 8 compliant
  • New features are documented, with examples if plot related
  • Documentation is sphinx and numpydoc compliant
  • Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
  • Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way

@jklymak
Copy link
Member Author

jklymak commented Jul 15, 2018

BTW I've used this fix in #10638. Not sure this normalization shouldn't be in cbook somewhere: cbook._check_1d? EDIT: obsolete comment

@jklymak jklymak changed the title FIX: let boxplot take pandas position FIX: let boxplot take pandas date as position Jul 15, 2018
@jklymak jklymak force-pushed the fix-boxplot-pandas-position branch from 12b3b7f to 445047c Compare July 15, 2018 18:21
@@ -3644,6 +3644,8 @@ def dopatch(xs, ys, **kwargs):
datashape_message = ("List of boxplot statistics and `{0}` "
"values must have same the length")
# check position
if hasattr(positions, 'values'):
positions = positions.values

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any better ways to check if something is a pandas Series?

In this case the following (admittedly very constructed) example would fail:

import numpy as np
import matplotlib.pyplot as plt

class Years(list):
    values = True

data = np.random.rand(5,2)

years = Years([1999,2003])

plt.figure()
plt.boxplot(data, positions=years)
plt.show()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, its a damned if you do, damned if you don't situation.

Other data types use the values convention as well (xarray for instance). The goal here was to unpack the pandas series w/o having to know what a pandas series is (there is no pandas import in the main library).

If someone tries to pass a class that has values meaning something else, then I guess they will have to unpack before they plot.

FWIW, my preference would be that we only support numpy arrays and basic python lists, and let users and/or downstream libraries do their own unpacking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the idea at all that only because some object has a values attribute, matplotlib now thinks that it should use it (and not the actual data in the object, e.g. provided by some iterator).

Or, if you think this is what should always happen, it should actually always happen. As it stands, matplotlib would use the iterator for usual numbers, but the values attribute for dates (#10638) or in boxplots (this PR).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, agreed. Stand by for a more holistic solution - though potentially more controversial...

@jklymak jklymak force-pushed the fix-boxplot-pandas-position branch from 445047c to fb7eb0b Compare July 15, 2018 20:55
@jklymak jklymak force-pushed the fix-boxplot-pandas-position branch 6 times, most recently from b84c4b5 to 95e4e52 Compare July 16, 2018 03:26
@jklymak jklymak changed the title FIX: let boxplot take pandas date as position FIX: clean up unit conversion unpacking of data, particularly for dates and pandas series Jul 16, 2018
@jklymak jklymak force-pushed the fix-boxplot-pandas-position branch 4 times, most recently from d87ac91 to db394cb Compare July 16, 2018 12:46
@jklymak
Copy link
Member Author

jklymak commented Jul 16, 2018

ping @jorisvandenbossche as this is a redo of a PR you helped with previously...

@tacaswell tacaswell added this to the v3.1 milestone Jul 17, 2018
@jklymak
Copy link
Member Author

jklymak commented Jul 24, 2018

Maybe closes #5896

@jklymak jklymak force-pushed the fix-boxplot-pandas-position branch from db394cb to b16c311 Compare November 30, 2018 03:37
@jklymak
Copy link
Member Author

jklymak commented Nov 30, 2018

Rebased, this is still ready to go and waiting for review.

@timhoffm
Copy link
Member

timhoffm commented Dec 4, 2018

Looks okish, but I'm not deep enough into the numpy/pandas data formats to judge the implications.

Copy link
Member

@efiring efiring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just a very minor change and some explanation or additional comments.

@@ -6048,7 +6048,7 @@ def pcolormesh(self, *args, alpha=None, norm=None, cmap=None, vmin=None,

# convert to one dimensional arrays
C = C.ravel()
coords = np.column_stack((X, Y)).astype(float, copy=False)
coords = np.column_stack((X, Y)).astype(float)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any point in removing the "copy=False" kwarg. It saves time, and it is logically what is wanted here. A copy was already made via the column_stack call, so nothing done to coords will affect X or Y.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. Changed back

if munits.ConversionInterface.is_numlike(x):
return x
if isinstance(x, list) or isinstance(x, numbers.Number):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale for returning a list as-is instead of running it through np.asarray? Just that the latter is unnecessary? Or are there cases where the conversion would be an error?

return x
if isinstance(x, list) or isinstance(x, numbers.Number):
return x
if issubclass(type(x), np.ma.MaskedArray):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this using issubclass(type(x)... instead of isinstance?

return x
if issubclass(type(x), np.ma.MaskedArray):
return np.ma.asarray(x)
return np.asarray(x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the point of this whole addition above to short-circuit some of what would otherwise happen in get_converter? I'm not opposed, just confused.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems worse, because calling np.asarray can drop information in some cases, making it unavailable in the call to convert below. Unless I'm missing something...

if ((isinstance(d, np.ndarray) and
np.issubdtype(d.dtype, np.datetime64)) or
isinstance(d, np.datetime64)):
return _dt64_to_ordinalf(d)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What case is this whole block covering? Again, I'm confused.

data = np.random.rand(5, 2)
years = pd.date_range('1/1/2000',
periods=2, freq=pd.DateOffset(years=1)).year
# Does not work
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another place I am confused. Is the comment obsolete?

@@ -1477,9 +1479,14 @@ def have_units(self):
return self.converter is not None or self.units is not None

def convert_units(self, x):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be converting anything numlike, which includes things like pint.Quantity, into an array/masked array before calling convert. That seems like it would break pint, though I will note that our unit tests seem to be passing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, I’ll check vs pint when I look at the other comments above. Thanks for the reminder...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I guess I still don't understand how all of this works. 😉 So is_numlike only returns true for things that are basic numbers (or contain them) and don't need converting. I think I'm ok with this code then. Tests passing is always a good thing too. 😄

@jklymak
Copy link
Member Author

jklymak commented Jan 28, 2019

OK, somewhat confused about everything I was thinking here, and some of the type checking has moved on since this was written. So I'm going to close. Note #13304 deregisters the pandas converters which I think was an important contribution here. Sorry for the reviewer effort on this...

@jklymak jklymak closed this Jan 28, 2019
@jklymak jklymak deleted the fix-boxplot-pandas-position branch January 28, 2019 00:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

boxplot: positions used to take Int64Index
6 participants