Draft version of MEP28: Simplification of boxplots #7282
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,326 @@ | ||
============================================= | ||
MEP 28: Remove Complexity from Axes.boxplot | ||
============================================= | ||
|
||
.. contents:: | ||
:local: | ||
|
||
Status | ||
====== | ||
|
||
.. | ||
.. MEPs go through a number of phases in their lifetime: | ||
|
||
- **Discussion** | ||
.. | ||
.. - **Progress**: Consensus was reached on the mailing list and | ||
.. implementation work has begun. | ||
.. | ||
.. - **Completed**: The implementation has been merged into master. | ||
.. | ||
.. - **Superseded**: This MEP has been abandoned in favor of another | ||
.. approach. | ||
|
||
Branches and Pull requests | ||
========================== | ||
|
||
Adding pre- & post-processing options to ``cbook.boxplot_stats``: https://github.com/phobson/matplotlib/tree/boxplot-stat-transforms | ||
Exposing ``cbook.boxplot_stats`` through ``Axes.boxplot`` kwargs: None | ||
Remove redundant statistical kwargs in ``Axes.boxplot``: None | ||
Remove redundant style options in ``Axes.boxplot``: None | ||
Remaining items that arise through discussion: None | ||
|
||
Abstract | ||
======== | ||
|
||
Over the past few releases, the ``Axes.boxplot`` method has grown in | ||
complexity to support fully customizable artist styling and statistical | ||
computation. This led to ``Axes.boxplot`` being split into multiple | ||
parts. The statistics needed to draw a boxplot are computed in | ||
``cbook.boxplot_stats``, while the actual artists are drawn by ``Axes.bxp``. | ||
The original method, ``Axes.boxplot``, remains the most public API. It | ||
handles passing the user-supplied data to ``cbook.boxplot_stats``, feeding | ||
the results to ``Axes.bxp``, and pre-processing style information for | ||
each facet of the boxplots. | ||
|
||
This MEP outlines a path forward to roll back the added complexity | ||
and simplify the API while maintaining reasonable backwards | ||
compatibility. | ||
|
||
Detailed description | ||
==================== | ||
|
||
Currently, the ``Axes.boxplot`` method accepts parameters that allow | ||
users to specify medians and confidence intervals for each box that | ||
will be drawn in the plot. These were provided so that advanced users | ||
could provide statistics computed in a different fashion than the simple | ||
method provided by matplotlib. However, handling this input requires | ||
complex logic to make sure that the forms of the data structure match what | ||
needs to be drawn. At the moment, that logic contains 9 separate if/else | ||
statements nested up to 5 levels deep with a for loop, and may raise up to 2 errors. | ||
These parameters were added prior to the creation of the ``Axes.bxp`` method, | ||
which draws boxplots from a list of dictionaries containing the relevant | ||
statistics. Matplotlib also provides a function that computes these | ||
statistics: ``cbook.boxplot_stats``. Advanced users can now | ||
either a) write their own function to compute the stats required by | ||
``Axes.bxp``, or b) modify the output returned by ``cbook.boxplot_stats`` | ||
to fully customize the position of the artists of the plots. Given this | ||
flexibility, the parameters to manually specify only the medians and their | ||
confidence intervals remain solely for backwards compatibility. | ||
|
||
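As a rough illustration of option (b), the sketch below overrides the median and its confidence interval in a list of statistics dictionaries shaped like the output of ``cbook.boxplot_stats``. The helper name ``override_medians`` and the sample numbers are assumptions made for this example and are not part of matplotlib; only the dictionary keys (``med``, ``cilo``, ``cihi``, and so on) follow what ``Axes.bxp`` reads.

```python
# Hypothetical helper for illustration only; it is not part of matplotlib.
def override_medians(stats, medians, conf_intervals=None):
    """Return copies of the stats dicts with user-supplied medians and CIs."""
    updated = []
    for i, box in enumerate(stats):
        box = dict(box)  # shallow copy so the caller's dicts are left untouched
        if medians[i] is not None:
            box['med'] = medians[i]
        if conf_intervals is not None and conf_intervals[i] is not None:
            box['cilo'], box['cihi'] = conf_intervals[i]
        updated.append(box)
    return updated


# Made-up statistics for two boxes, using the keys that Axes.bxp reads.
stats = [
    dict(label='A', med=5.0, q1=3.0, q3=7.0, whislo=1.0, whishi=9.0,
         cilo=4.5, cihi=5.5, fliers=[]),
    dict(label='B', med=6.0, q1=4.0, q3=8.0, whislo=2.0, whishi=10.0,
         cilo=5.5, cihi=6.5, fliers=[12.0]),
]

# Override only the first box; None leaves the computed values in place.
custom = override_medians(stats, medians=[5.2, None],
                          conf_intervals=[(4.8, 5.6), None])
```

The resulting list could then be handed to ``ax.bxp(custom)``, which covers the use case that ``usermedians`` and ``conf_intervals`` serve inside ``Axes.boxplot`` today.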
Around the same time that the two roles of ``Axes.boxplot`` were split into | ||
``cbook.boxplot_stats`` for computation and ``Axes.bxp`` for drawing, both | ||
``Axes.boxplot`` and ``Axes.bxp`` were written to accept parameters that | ||
individually toggle the drawing of all components of the boxplots, and | ||
parameters that individually configure the style of those artists. However, | ||
to maintain backwards compatibility, the ``sym`` parameter (previously used | ||
to specify the symbol of the fliers) was retained. This parameter itself | ||
requires fairly complex logic to reconcile the ``sym`` parameter with the | ||
newer ``flierprops`` parameter and the default style specified by ``matplotlibrc``. | ||
|
||
This MEP seeks to dramatically simplify the creation of boxplots for | ||
novice and advanced users alike. Importantly, the changes proposed here | ||
will also be available to downstream packages like seaborn, as seaborn | ||
smartly allows users to pass arbitrary dictionaries of parameters through | ||
the seaborn API to the underlying matplotlib functions. | ||
|
||
This will be achieved in the following way: | ||
|
||
1. ``cbook.boxplot_stats`` will be modified to allow pre- and post- | ||
   computation transformation functions to be passed in (e.g., ``np.log`` | ||
   and ``np.exp`` for lognormally distributed data). | ||
2. ``Axes.boxplot`` will be modified to also accept these functions and naïvely pass them | ||
   to ``cbook.boxplot_stats`` (Alt: pass the stat function and a dict | ||
   of its optional parameters). | ||
3. Outdated parameters from ``Axes.boxplot`` will be deprecated and | ||
   later removed. | ||
|
||
Implementation | ||
============== | ||
|
||
Passing transform functions to ``cbook.boxplot_stats`` | ||
------------------------------------------------------ | ||
|
||
This MEP proposes that two parameters (e.g., ``transform_in`` and | ||
``transform_out``) be added to the cookbook function that computes the | ||
I'm thinking |
||
statistics for the boxplot function. These will be optional keyword-only | ||
arguments that default to a no-op (e.g., ``lambda x: x``) when omitted | ||
by the user. The ``transform_in`` function will be applied to the data | ||
as the ``boxplot_stats`` function loops through each subset of the data | ||
passed to it. After the list of statistics dictionaries is computed, the | ||
``transform_out`` function is applied to each value in the dictionaries. | ||
Wouldn't it make more sense to apply it to the entire dictionary? All the individual things don't make much sense on their own.
I think we're saying the same thing, but AFAIK, you can't pass a dictionary to e.g.,
    for key, value in stat_dict.items():
        if key != 'label':
            stat_dict[key] = transform_out(value) |
||
|
||
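A minimal, self-contained sketch of the proposed transform hooks follows. It uses only the standard library, and ``do_stats`` is a hypothetical stand-in that computes just the quartiles; the real ``cbook.boxplot_stats`` computes more (whiskers, fliers, confidence intervals), so this shows the intended flow rather than the eventual implementation.

```python
import math
import statistics


def do_stats(values):
    """Hypothetical stand-in for the core of ``cbook.boxplot_stats``."""
    q1, med, q3 = statistics.quantiles(values, n=4, method='inclusive')
    return {'label': None, 'q1': q1, 'med': med, 'q3': q3}


def boxplot_stats(data, transform_in=None, transform_out=None):
    """Compute stats in a transformed space, then map the results back."""
    if transform_in is None:
        transform_in = lambda x: x
    if transform_out is None:
        transform_out = lambda x: x

    output = []
    for dataset in data:
        transformed = [transform_in(v) for v in dataset]
        stat_dict = do_stats(transformed)
        for key, value in stat_dict.items():
            if key != 'label':  # labels are not numeric, so leave them alone
                stat_dict[key] = transform_out(value)
        output.append(stat_dict)
    return output


# Data spanning several decades: compute quartiles in log10 space.
data = [[1.0, 10.0, 100.0, 1000.0, 10000.0]]
log_stats = boxplot_stats(data, transform_in=math.log10,
                          transform_out=lambda v: 10 ** v)
plain_stats = boxplot_stats(data)
```

Because the transforms are applied value by value, ``transform_out`` only ever needs to accept a scalar, which sidesteps the question of passing a whole dictionary to a function such as ``np.exp``. A full implementation would also have to handle list-valued entries such as ``fliers``.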
These transformations can then be added to the call signature of | ||
``Axes.boxplot`` with little impact to that method's complexity. This is | ||
because they can be directly passed to ``cbook.boxplot_stats``. | ||
Alternatively, ``Axes.boxplot`` could be modified to accept an optional | ||
statistical function kwarg and a dictionary of parameters to be directly | ||
I don't quite understand this alternative. |
||
passed to it. | ||
|
||
At this point in the implementation, users and external libraries like | ||
seaborn would have complete control via the ``Axes.boxplot`` method. More | ||
importantly, at the very least, seaborn would require no changes to its | ||
API to allow users to take advantage of these new options. | ||
|
||
Simplifications to the ``Axes.boxplot`` API and other functions | ||
--------------------------------------------------------------- | ||
|
||
Simplifying the boxplot method consists primarily of deprecating and then | ||
removing the redundant parameters. Optionally, a next step would include | ||
rectifying minor terminological inconsistencies between ``Axes.boxplot`` | ||
and ``Axes.bxp``. | ||
|
||
The parameters to be deprecated and removed include: | ||
|
||
1. ``usermedians`` - processed by 10 SLOC, 3 ``if`` blocks, a ``for`` loop | ||
2. ``conf_intervals`` - handled by 15 SLOC, 6 ``if`` blocks, a ``for`` loop | ||
3. ``sym`` - processed by 12 SLOC, 4 ``if`` blocks | ||
|
||
Removing the ``sym`` option allows all code that handles the remaining | ||
styling parameters to be moved to ``Axes.bxp``. This doesn't remove | ||
any complexity, but does reinforce the single responsibility principle | ||
among ``Axes.bxp``, ``cbook.boxplot_stats``, and ``Axes.boxplot``. | ||
|
||
Additionally, the ``notch`` parameter could be renamed ``shownotches`` | ||
to be consistent with ``Axes.bxp``. This kind of cleanup could be taken | ||
a step further: the ``whis``, ``bootstrap``, and ``autorange`` parameters could | ||
be rolled into the kwargs passed to the new ``statfxn`` parameter. | ||
What is the

I'll clarify this in the MEP but I'll explain here too. From my POV, there are two options available. The first looks like this:

    def boxplot_stats(data, ..., transform_in=None, transform_out=None):
        if transform_in is None:
            transform_in = lambda x: x
        if transform_out is None:
            transform_out = lambda x: x
        output = []
        for _d in data:
            d = transform_in(_d)
            stat_dict = do_stats(d)
            for key, value in stat_dict.items():
                if key != 'label':
                    stat_dict[key] = transform_out(value)
            output.append(stat_dict)
        return output

    class Axes(...):
        def boxplot_option1(self, data, ..., transform_in=None, transform_out=None):
            stats = cbook.boxplot_stats(data, ..., transform_in=transform_in, transform_out=transform_out)
            return self.bxp(stats, ...)

        def boxplot_option2(self, data, ..., statfxn=None, **statopts):
            if statfxn is None:
                statfxn = cbook.boxplot_stats
            stats = statfxn(data, **statopts)
            return self.bxp(stats, ...)

So in both cases you can do:

    fig, ax1 = plt.subplots()
    artists1 = ax1.boxplot_optionX(data, transform_in=np.log, transform_out=np.exp)

But Option Two lets a user write a completely custom stat function. Currently, users can do:

    fig, ax1 = plt.subplots()
    my_stats = my_box_stats(data, bootstrap_method='BCA', whisker_method='dynamic')
    ax1.bxp(my_stats)

Under Option 2, the user could do (only slightly more convenient):

    fig, ax1 = plt.subplots()
    ax1.boxplot(data, statfxn=my_box_stats, bootstrap_method='BCA', whisker_method='dynamic')

It's not that compelling of a difference, I admit. But Option 2 would let the following be valid in seaborn:

    sns.factorplot(x="day", y="total_bill", hue="sex", data=tips, kind='box',
                   palette="PRGn", shownotches=True, statfxn=my_box_stats,
                   bootstrap_method='BCA', whisker_method='dynamic')

So a lot of downstream flexibility is gained (in both approaches) with little implementation. |
||
|
||
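To make the ``statfxn`` idea concrete without depending on matplotlib, here is a small runnable sketch of the broker pattern. Every name in it is a stand-in chosen for this example: ``default_stats`` plays the role of ``cbook.boxplot_stats``, ``bxp`` plays the role of ``Axes.bxp`` (it returns tuples instead of drawing), and the meaning given to ``whisker_method='dynamic'`` is invented purely for illustration.

```python
def default_stats(data):
    """Hypothetical default: whiskers reach the data extremes."""
    stats = []
    for d in data:
        s = sorted(d)
        stats.append({'whislo': s[0], 'med': s[len(s) // 2], 'whishi': s[-1]})
    return stats


def my_box_stats(data, whisker_method='full'):
    """Hypothetical user function; 'dynamic' trims one point from each end."""
    stats = default_stats(data)
    if whisker_method == 'dynamic':
        for s, d in zip(stats, data):
            inner = sorted(d)[1:-1]
            s['whislo'], s['whishi'] = inner[0], inner[-1]
    return stats


def bxp(stats):
    """Stand-in for Axes.bxp: report what would be drawn instead of drawing."""
    return [(s['whislo'], s['med'], s['whishi']) for s in stats]


def boxplot(data, statfxn=None, **statopts):
    """Broker: pick the stat function, forward its options, draw the result."""
    if statfxn is None:
        statfxn = default_stats
    return bxp(statfxn(data, **statopts))


data = [[1, 2, 3, 4, 50]]
default_artists = boxplot(data)
custom_artists = boxplot(data, statfxn=my_box_stats, whisker_method='dynamic')
```

The broker never inspects ``statopts``; it forwards them untouched. That is the property that would let a downstream library pass user keywords through without knowing about them.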
Backward compatibility | ||
====================== | ||
|
||
Implementation of this MEP would eventually result in the backwards | ||
incompatible deprecation and then removal of the keyword parameters | ||
``usermedians``, ``conf_intervals``, and ``sym``. Cursory searches on | ||
GitHub indicated that ``usermedians`` and ``conf_intervals`` are used by | ||
few users, who all seem to have a very strong knowledge of matplotlib. | ||
A robust deprecation cycle should provide sufficient time for these | ||
Considering we are having a discussion about one release cycle versus two release cycles on the finance module right now, it might be worth being specific here. |
||
users to migrate to a new API. | ||
|
||
Deprecation of ``sym`` however, may have a much broader reach into | ||
the matplotlib userbase. | ||
|
||
Schedule | ||
-------- | ||
An accelerated timeline could look like the following: | ||
|
||
#. v2.0.1 add transforms to ``cbook.boxplot_stats``, expose in ``Axes.boxplot`` | ||
#. v2.1.0 deprecate ``usermedians``, ``conf_intervals``, ``sym`` parameters | ||
#. v2.2.0 make deprecations noisier | ||
#. v2.3.0 remove ``usermedians``, ``conf_intervals``, ``sym`` parameters | ||
#. v2.3.0 deprecate ``notch`` in favor of ``shownotches`` to be consistent with other parameters and ``Axes.bxp`` | ||
#. v2.4.0 remove ``notch`` parameter, move all style and artist toggling logic to ``Axes.bxp``. ``Axes.boxplot`` is then little more than a broker between ``Axes.bxp`` and ``cbook.boxplot_stats`` | ||
|
||
|
||
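The deprecation step in the schedule above might look roughly like the following. This is a standard-library sketch under stated assumptions: the ``boxplot`` function here is a placeholder with a simplified signature, and matplotlib would presumably route the warning through its own deprecation helpers rather than calling ``warnings.warn`` directly.

```python
import warnings

_DEPRECATED = ('usermedians', 'conf_intervals', 'sym')


def boxplot(data, usermedians=None, conf_intervals=None, sym=None, **kwargs):
    """Placeholder signature: warn about deprecated keywords, then carry on."""
    passed = {'usermedians': usermedians,
              'conf_intervals': conf_intervals,
              'sym': sym}
    for name in _DEPRECATED:
        if passed[name] is not None:
            warnings.warn(
                "The '{}' parameter is deprecated and will be removed in a "
                "future release.".format(name),
                DeprecationWarning, stacklevel=2)
    return len(data)  # placeholder for the real drawing logic


# Record warnings so the behavior can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    boxplot([[1, 2, 3]], sym='r+')   # triggers one warning
    boxplot([[1, 2, 3]])             # no deprecated keywords, no warning
```

Making the deprecation "noisier" in a later release could then be a matter of changing the warning category or its default visibility, with no change to the call sites.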
Anticipated Impacts to Users | ||
---------------------------- | ||
|
||
As described above, deprecating ``usermedians`` and ``conf_intervals`` | ||
will likely impact few users. Those who will be impacted are almost | ||
certainly advanced users who will be able to adapt to the change. | ||
|
||
Deprecating the ``sym`` option may impact more users, and effort should | ||
be taken to collect community feedback on this. | ||
|
||
Anticipated Impacts to Downstream Libraries | ||
------------------------------------------- | ||
|
||
The source code (GitHub master as of 2016-10-17) was inspected for | ||
seaborn and python-ggplot to see if these changes would impact their | ||
use. None of the parameters nominated for removal in this MEP are used by | ||
seaborn. The seaborn APIs that use matplotlib's boxplot function allow | ||
users to pass arbitrary ``**kwargs`` through to matplotlib's API. Thus | ||
seaborn users with modern matplotlib installations will be able to take | ||
full advantage of any new features added as a result of this MEP. | ||
|
||
Python-ggplot has implemented its own function to draw boxplots. Therefore, | ||
no impact can come to it as a result of implementing this MEP. | ||
|
||
Alternatives | ||
============ | ||
|
||
Variations on the theme | ||
----------------------- | ||
|
||
This MEP can be divided into a few loosely coupled components: | ||
|
||
#. Allowing pre- and post-computation transformation functions in ``cbook.boxplot_stats`` | ||
#. Exposing that transformation in the ``Axes.boxplot`` API | ||
#. Removing redundant statistical options in ``Axes.boxplot`` | ||
#. Shifting all styling parameter processing from ``Axes.boxplot`` to ``Axes.bxp``. | ||
|
||
|
||
With this approach, #2 depends on #1, and #4 depends on #3. | ||
|
||
There are two possible approaches to #2. The first and most direct would | ||
be to mirror the new ``transform_in`` and ``transform_out`` parameters of | ||
``cbook.boxplot_stats`` in ``Axes.boxplot`` and pass them directly. | ||
|
||
The second approach would be to add ``statfxn`` and ``statfxn_args`` | ||
Any chance of pseudocode-like examples of the two approaches? |
||
parameters to ``Axes.boxplot``. Under this implementation, the default | ||
value of ``statfxn`` would be ``cbook.boxplot_stats``, but users could | ||
pass their own function. ``transform_in`` and ``transform_out`` would | ||
then be passed as elements of the ``statfxn_args`` parameter. | ||
|
||
.. code-block:: python | ||
|
||
    def boxplot_stats(data, ..., transform_in=None, transform_out=None): | ||
        if transform_in is None: | ||
            transform_in = lambda x: x | ||
|
||
        if transform_out is None: | ||
            transform_out = lambda x: x | ||
|
||
        output = [] | ||
        for _d in data: | ||
            d = transform_in(_d) | ||
            stat_dict = do_stats(d) | ||
            for key, value in stat_dict.items(): | ||
                if key != 'label': | ||
                    stat_dict[key] = transform_out(value) | ||
            output.append(stat_dict) | ||
        return output | ||
|
||
|
||
    class Axes(...): | ||
        def boxplot_option1(self, data, ..., transform_in=None, transform_out=None): | ||
            stats = cbook.boxplot_stats(data, ..., | ||
                                        transform_in=transform_in, | ||
                                        transform_out=transform_out) | ||
            return self.bxp(stats, ...) | ||
|
||
        def boxplot_option2(self, data, ..., statfxn=None, **statopts): | ||
            if statfxn is None: | ||
                statfxn = cbook.boxplot_stats | ||
            stats = statfxn(data, **statopts) | ||
            return self.bxp(stats, ...) | ||
|
||
Both cases would allow users to do the following: | ||
|
||
.. code-block:: python | ||
To be very annoying, ( 😄 ) could you add an example of how you would do this with the current API?
@NelleV great suggestion. done. |
||
    fig, ax1 = plt.subplots() | ||
    artists1 = ax1.boxplot_optionX(data, transform_in=np.log, | ||
                                   transform_out=np.exp) | ||
|
||
|
||
But Option Two lets a user write a completely custom stat function | ||
(e.g., ``my_box_stats``) with fancy BCA confidence intervals and the | ||
whiskers set differently depending on some attribute of the data. | ||
|
||
This is available under the current API: | ||
|
||
.. code-block:: python | ||
|
||
    fig, ax1 = plt.subplots() | ||
    my_stats = my_box_stats(data, bootstrap_method='BCA', | ||
                            whisker_method='dynamic') | ||
    ax1.bxp(my_stats) | ||
|
||
And would be more concise with Option Two: | ||
|
||
.. code-block:: python | ||
|
||
    fig, ax = plt.subplots() | ||
    statopts = dict(transform_in=np.log, transform_out=np.exp) | ||
    ax.boxplot(data, ..., **statopts) | ||
|
||
Users could also pass their own function to compute the stats: | ||
|
||
.. code-block:: python | ||
|
||
    fig, ax1 = plt.subplots() | ||
    ax1.boxplot(data, statfxn=my_box_stats, bootstrap_method='BCA', | ||
                whisker_method='dynamic') | ||
|
||
From the examples above, Option Two seems to have only marginal benefit, | ||
but in the context of downstream libraries like seaborn, its advantage | ||
is more apparent as the following would be possible without any patches | ||
to seaborn: | ||
|
||
.. code-block:: python | ||
|
||
    import seaborn | ||
    tips = seaborn.load_dataset('tips') | ||
    g = seaborn.factorplot(x="day", y="total_bill", hue="sex", data=tips, | ||
                           kind='box', palette="PRGn", shownotches=True, | ||
                           statfxn=my_box_stats, bootstrap_method='BCA', | ||
                           whisker_method='dynamic') | ||
|
||
This type of flexibility was the intention behind splitting the overall | ||
boxplot API into the current three functions. In practice however, downstream | ||
libraries like seaborn support versions of matplotlib dating back well | ||
before the split. Thus, adding just a bit more flexibility to the | ||
``Axes.boxplot`` could expose all the functionality to users of the | ||
downstream libraries with modern matplotlib installations without intervention | ||
from the downstream library maintainers. | ||
|
||
Doing less | ||
---------- | ||
|
||
Another obvious alternative would be to omit the added pre- and post- | ||
computation transform functionality in ``cbook.boxplot_stats`` and | ||
``Axes.boxplot``, and simply remove the redundant statistical and style | ||
parameters as described above. | ||
|
||
Doing nothing | ||
------------- | ||
|
||
As with many things in life, doing nothing is an option here. This means | ||
we simply advocate for users and downstream libraries to take advantage | ||
of the split between ``cbook.boxplot_stats`` and ``Axes.bxp`` and let | ||
them decide how to provide an interface to that. |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -29,3 +29,4 @@ Matplotlib Enhancement Proposals | |
MEP25 | ||
MEP26 | ||
MEP27 | ||
MEP28 |
So, a little unclear about scope delineation here. What does seaborn do above and beyond mpl? And how is this not creeping into seaborn's territory?
My thought is that these few changes should be implemented in such a way that seaborn users have access to them without seaborn needing to change at all. Shooting for simplicity and flexibility for users and downstream libraries
Hmm, so you want to maintain downstream use, but you plan to remove **kwargs. Is there verification that seaborn doesn't use those args? Granted, this seems like it'd be a pretty small PR on seaborn if it does change things with them..
seaborn uses very few parameters and just passes everything else from the user as **kws. Also, instances of calls to matplotlib's boxplot API are below: Next step would be to check out yhat's python-ggplot
welp ggplot implements their own boxplot drawer: https://github.com/yhat/ggplot/blob/5957a4db941be1da578ecc462d1f5b99f6d776ab/ggplot/geoms/geom_boxplot.py
In some ways that's probably good in that it means you don't have to worry about downstream support with ggplot.
you're absolutely right about that. i guess it just makes me feel insecure about our implementation.