-
-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Feature: draw percentiles in violinplot #14107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for submitting this pull request! Overall it looks good to me, I've just added a few comments for small things that need changing. The other thing is test images, we are trying to not add any new images to the repository if possible. I think the one image you change (not added) is a good enough test, so I think it would be worth getting rid of the new image comparison tests you have added (sorry). Thanks again for the PR.
Hi @dstansby, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 looks good to me now, thanks for the contribution! (we wait for two approvals to merge, so hopefully someone else will have a look at it soon)
fwiw I would prefer this to use quantiles (https://docs.scipy.org/doc/numpy/reference/generated/numpy.quantile.html) rather than percentiles (https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html). |
Please rebase to remove the extra test images entirely, or please squash merge this PR when it's accepted. |
@anntzer @timhoffm @QuLogic I tried to use numpy.quantile, and I got error only in python3.6 (as the failing test result in commit a3912a9) which says "numpy does not have attribute quantile". After investigation, I found numpy.quantile is newly added in numpy version 1.15.0, but currently in So if we want to use numpy.quantile instead of numpy.percentile, I think the version number for numpy in |
This reverts commit ee5cd08.
I would use quantile in the API and implement it using np.percentile(100 * q). That said, I‘m not 100% sure which of quantile/percentile is better/more common. Anybody has experience with what‘s used in communities that actually use Violinplot? |
I updated the code following @timhoffm's suggestion - using quantile in the API and implement it using np.percentile. Is it ok to merge? |
lib/matplotlib/axes/_axes.py
Outdated
@@ -7880,14 +7880,14 @@ def matshow(self, Z, **kwargs): | |||
@_preprocess_data(replace_names=["dataset"]) | |||
def violinplot(self, dataset, positions=None, vert=True, widths=0.5, | |||
showmeans=False, showextrema=True, showmedians=False, | |||
points=100, bw_method=None): | |||
percentiles=None, points=100, bw_method=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name must be changed to quantiles
. The implementation correctly implements quantile-behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code and documentation updated. Please have a check!
Not found too much on the quantile vs. percentile topic. Single source of information is ggplot, which uses quantiles. So, I think it's right for us to prefer quantiles as well. |
lib/matplotlib/axes/_axes.py
Outdated
@@ -7953,6 +7958,11 @@ def violinplot(self, dataset, positions=None, vert=True, widths=0.5, | |||
- ``cmedians``: A `~.collections.LineCollection` instance that | |||
marks the median values of each of the violin's distribution. | |||
|
|||
- ``cquantiles``: A | |||
:class:`matplotlib.collections.LineCollection` instance created to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:class:`matplotlib.collections.LineCollection` instance created to | |
`.LineCollection` instance created to |
You can shorten this since we use the sphinx option default_role = 'obj'
See also
https://matplotlib.org/devdocs/devel/documenting_mpl.html?highlight=documenting#referring-to-other-code
lib/matplotlib/axes/_axes.py
Outdated
@@ -8043,13 +8056,20 @@ def violin(self, vpstats, positions=None, vert=True, widths=0.5, | |||
|
|||
- ``cmedians``: A `~.collections.LineCollection` instance that | |||
marks the median values of each of the violin's distribution. | |||
|
|||
- ``cquantiles``: A | |||
:class:`matplotlib.collections.LineCollection` instance created to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
lib/matplotlib/axes/_axes.py
Outdated
@@ -7997,6 +8008,8 @@ def violin(self, vpstats, positions=None, vert=True, widths=0.5, | |||
|
|||
- ``max``: The maximum value for this violin's dataset. | |||
|
|||
- ``quantiles``: The quantile values for this violin's dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be an optional key to maintain backward compatibility with existing inputs.
lib/matplotlib/axes/_axes.py
Outdated
@@ -8106,6 +8126,7 @@ def violin(self, vpstats, positions=None, vert=True, widths=0.5, | |||
mins.append(stats['min']) | |||
maxes.append(stats['max']) | |||
medians.append(stats['median']) | |||
quantiles = np.concatenate((quantiles, stats['quantiles'])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should use stats.get('quantiles')
and appropriately handle the case, when it's not there to maintain backward compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also not sure if this should be flat or 2D.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2D is not supported by the function prep_lines (typically vlines() and hlines()). In other word, these two functions requires the input at most 1D
lib/matplotlib/axes/_axes.py
Outdated
@@ -7916,6 +7916,11 @@ def violinplot(self, dataset, positions=None, vert=True, widths=0.5, | |||
showmedians : bool, default = False | |||
If `True`, will toggle rendering of the medians. | |||
|
|||
quantiles : array-like, default=None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quantiles : array-like, default=None | |
quantiles : array-like, default = None |
@@ -1469,6 +1475,7 @@ def violin_stats(X, method, points=100): | |||
- median: The median value for this column of data. | |||
- min: The minimum value for this column of data. | |||
- max: The maximum value for this column of data. | |||
- quantiles: The quantile values for this column of data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a breaking API change. I'm afraid, there's no way around that, but it at least needs an API change note.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused. If user directly use this function as API, my code only extends the results that user can get. If this extension is considered as a "break", then I don't think it is necessary to make qunatiles key optional. According to current code, the option, whether to add a Linecollection, is decided by violin().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If user never directly use this function, then it is only called by violinplot(). Should this be counted as a API though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Matplotlib is extensively used out in the wild. By experience, for every remote but possible way of using the official API, there's someone out there, who actually has code with it. Therefore, we've become quite sensitive to possible API breaks. They will pop up as bug reports sooner or later. So we have to at least document the changes.
Tuple unpacking of the result will break with the change:
coords, vals, mean, median, min, max = violin_stats()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree to have a document. But I feel it is odd to make quantiles optional since the current code in violin_stats doesn't care whether those value will be drawn out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, I misread the documentation here. It's a list of dictionaries with one dictionary per input dataset. Adding keys to these dicts is not breaking the API. So the above API change note is not necessary and should be removed again. Sorry for the noise.
lib/matplotlib/cbook/__init__.py
Outdated
# Zip x and quantiles | ||
data = zip(X, quantiles) | ||
|
||
for (x, p) in data: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be shortend to:
for (x, p) in data: | |
for (x, q) in zip(X, quantiles): |
lib/matplotlib/cbook/__init__.py
Outdated
# Dictionary of results for this distribution | ||
stats = {} | ||
|
||
# Calculate basic stats for the distribution | ||
min_val = np.min(x) | ||
max_val = np.max(x) | ||
p_val = np.percentile(x, 100 * p) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p_val = np.percentile(x, 100 * p) | |
quantiles_val = np.percentile(x, 100 * q) |
lib/matplotlib/tests/test_axes.py
Outdated
|
||
# Different size quantile list and plots | ||
with pytest.raises(ValueError): | ||
ax.violinplot(data, quantiles=[[0.1, 0.2], [0.5]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this API doesn't quite make sense. You allow quantiles=[[0.1, 0.9], [0.2, 0.8]]
but not quantiles=[[0.1, 0.2], [0.5]]
.
Either there is just a simple case that has the same quantiles for every postion. Or one can freely define the quantiles per position. Just allowing the same number of quantiles but with different values is an odd cut.
That said, I'm unclear if there is a practical need for different quantiles per position. It prevents comparability of the quantile lines between positions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is odd. I feel it perfectly makes sense to allow user draw different quantiles for different violins. I think in this implementation user still can draw same quantiles for each violin easily. But if only support same quantiles for every postion, it would be hard if users really want to draw different quantiles for different violins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if the quantiles can be different, why should you prohibit different numbers of quantiles between violins? Either values and number of quantiles must be the same for all, or values and number can vary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this example is user wants to plot only one violin, but gives two lists of quantiles. The ValueError is cause by number of violins != number of quantile lists, ie. 1 != 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you look at the updated code in example folder, my design accepts two violins have different number and value of quantiles. (ie. [0.1, 0.2] for first violin, [0.5, 0.7, 0.9] for second violin)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I misinterpreted the test. Can you please use the same number of quantiles for both lists. Makes it hopefully more obvious that the number of elements in the sublists is not the issue to be tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will update it with some other changes you mentioned/suggested.
Add next API change document for violin_stats in cbook. Make violin() more backward campatible by making quantiles key optional. Update smoke test for violin number and quantile lists number mismatch.
@timhoffm I updated the docstring, code and test cases based on your suggestions. I also added a .rst doc in next_api_change folder. Please have a check. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Please remove the API change again and it will be ready to go. Thanks for your effort!
@@ -1469,6 +1475,7 @@ def violin_stats(X, method, points=100): | |||
- median: The median value for this column of data. | |||
- min: The minimum value for this column of data. | |||
- max: The maximum value for this column of data. | |||
- quantiles: The quantile values for this column of data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, I misread the documentation here. It's a list of dictionaries with one dictionary per input dataset. Adding keys to these dicts is not breaking the API. So the above API change note is not necessary and should be removed again. Sorry for the noise.
@timhoffm API change doc removed. |
Thanks! And congratulations on your first contribution to matplotlib! |
PR Summary
Resolve #8532
Added an optional argument, a list for percentiles, to violinplot function. User now can specify the percentiles they want to draw for each violin.
I'm opening this as part of coursework from University of Toronto Scarborough's CSCD01 taught by Dr. Anya Tafliovich @atafliovich. (https://www.utsc.utoronto.ca/~atafliovich/cscd01/index.html)
PR Checklist