Skip to content

Feature: draw percentiles in violinplot #14107

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 33 commits into from
May 11, 2019

Conversation

sasoripathos
Copy link
Contributor

@sasoripathos sasoripathos commented Apr 30, 2019

PR Summary

Resolve #8532

Added an optional argument, a list for percentiles, to violinplot function. User now can specify the percentiles they want to draw for each violin.

I'm opening this as part of coursework from University of Toronto Scarborough's CSCD01 taught by Dr. Anya Tafliovich @atafliovich. (https://www.utsc.utoronto.ca/~atafliovich/cscd01/index.html)

PR Checklist

  • Has Pytest style unit tests
  • Code is Flake 8 compliant
  • New features are documented, with examples if plot related
  • Documentation is sphinx and numpydoc compliant
  • Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
  • Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way

@dstansby dstansby added this to the v3.2.0 milestone May 1, 2019
Copy link
Member

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for submitting this pull request! Overall it looks good to me, I've just added a few comments for small things that need changing. The other thing is test images, we are trying to not add any new images to the repository if possible. I think the one image you change (not added) is a good enough test, so I think it would be worth getting rid of the new image comparison tests you have added (sorry). Thanks again for the PR.

@sasoripathos
Copy link
Contributor Author

Hi @dstansby,
I have updated documentation as your suggestions and also removed the extra image-compare test cases. Please have a check!

Copy link
Member

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 looks good to me now, thanks for the contribution! (we wait for two approvals to merge, so hopefully someone else will have a look at it soon)

@anntzer
Copy link
Contributor

anntzer commented May 6, 2019

fwiw I would prefer this to use quantiles (https://docs.scipy.org/doc/numpy/reference/generated/numpy.quantile.html) rather than percentiles (https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html).

@QuLogic
Copy link
Member

QuLogic commented May 7, 2019

Please rebase to remove the extra test images entirely, or please squash merge this PR when it's accepted.

@sasoripathos
Copy link
Contributor Author

sasoripathos commented May 7, 2019

@anntzer @timhoffm @QuLogic I tried to use numpy.quantile, and I got error only in python3.6 (as the failing test result in commit a3912a9) which says "numpy does not have attribute quantile". After investigation, I found numpy.quantile is newly added in numpy version 1.15.0, but currently in requirements/testing/travis36minver.txt numpy version is specified as 1.11.0.

So if we want to use numpy.quantile instead of numpy.percentile, I think the version number for numpy in requirements/testing/travis36minver.txt should be changed. Is that acceptable? Or I can revert back to the version using numpy.percentile.

@timhoffm
Copy link
Member

timhoffm commented May 7, 2019

I would use quantile in the API and implement it using np.percentile(100 * q).

That said, I‘m not 100% sure which of quantile/percentile is better/more common. Anybody has experience with what‘s used in communities that actually use Violinplot?

@sasoripathos
Copy link
Contributor Author

sasoripathos commented May 8, 2019

I updated the code following @timhoffm's suggestion - using quantile in the API and implement it using np.percentile. Is it ok to merge?

@@ -7880,14 +7880,14 @@ def matshow(self, Z, **kwargs):
@_preprocess_data(replace_names=["dataset"])
def violinplot(self, dataset, positions=None, vert=True, widths=0.5,
showmeans=False, showextrema=True, showmedians=False,
points=100, bw_method=None):
percentiles=None, points=100, bw_method=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name must be changed to quantiles. The implementation correctly implements quantile-behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code and documentation updated. Please have a check!

@timhoffm
Copy link
Member

timhoffm commented May 8, 2019

Not found too much on the quantile vs. percentile topic. Single source of information is ggplot, which uses quantiles. So, I think it's right for us to prefer quantiles as well.

@@ -7953,6 +7958,11 @@ def violinplot(self, dataset, positions=None, vert=True, widths=0.5,
- ``cmedians``: A `~.collections.LineCollection` instance that
marks the median values of each of the violin's distribution.

- ``cquantiles``: A
:class:`matplotlib.collections.LineCollection` instance created to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:class:`matplotlib.collections.LineCollection` instance created to
`.LineCollection` instance created to

You can shorten this since we use the sphinx option default_role = 'obj' See also
https://matplotlib.org/devdocs/devel/documenting_mpl.html?highlight=documenting#referring-to-other-code

@@ -8043,13 +8056,20 @@ def violin(self, vpstats, positions=None, vert=True, widths=0.5,

- ``cmedians``: A `~.collections.LineCollection` instance that
marks the median values of each of the violin's distribution.

- ``cquantiles``: A
:class:`matplotlib.collections.LineCollection` instance created to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

@@ -7997,6 +8008,8 @@ def violin(self, vpstats, positions=None, vert=True, widths=0.5,

- ``max``: The maximum value for this violin's dataset.

- ``quantiles``: The quantile values for this violin's dataset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be an optional key to maintain backward compatibility with existing inputs.

@@ -8106,6 +8126,7 @@ def violin(self, vpstats, positions=None, vert=True, widths=0.5,
mins.append(stats['min'])
maxes.append(stats['max'])
medians.append(stats['median'])
quantiles = np.concatenate((quantiles, stats['quantiles']))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should use stats.get('quantiles') and appropriately handle the case, when it's not there to maintain backward compatibility.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also not sure if this should be flat or 2D.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2D is not supported by the function prep_lines (typically vlines() and hlines()). In other word, these two functions requires the input at most 1D

@@ -7916,6 +7916,11 @@ def violinplot(self, dataset, positions=None, vert=True, widths=0.5,
showmedians : bool, default = False
If `True`, will toggle rendering of the medians.

quantiles : array-like, default=None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
quantiles : array-like, default=None
quantiles : array-like, default = None

@@ -1469,6 +1475,7 @@ def violin_stats(X, method, points=100):
- median: The median value for this column of data.
- min: The minimum value for this column of data.
- max: The maximum value for this column of data.
- quantiles: The quantile values for this column of data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking API change. I'm afraid, there's no way around that, but it at least needs an API change note.

Copy link
Contributor Author

@sasoripathos sasoripathos May 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused. If user directly use this function as API, my code only extends the results that user can get. If this extension is considered as a "break", then I don't think it is necessary to make qunatiles key optional. According to current code, the option, whether to add a Linecollection, is decided by violin().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If user never directly use this function, then it is only called by violinplot(). Should this be counted as a API though?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matplotlib is extensively used out in the wild. By experience, for every remote but possible way of using the official API, there's someone out there, who actually has code with it. Therefore, we've become quite sensitive to possible API breaks. They will pop up as bug reports sooner or later. So we have to at least document the changes.

Tuple unpacking of the result will break with the change:

coords, vals, mean, median, min, max = violin_stats()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree to have a document. But I feel it is odd to make quantiles optional since the current code in violin_stats doesn't care whether those value will be drawn out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, I misread the documentation here. It's a list of dictionaries with one dictionary per input dataset. Adding keys to these dicts is not breaking the API. So the above API change note is not necessary and should be removed again. Sorry for the noise.

# Zip x and quantiles
data = zip(X, quantiles)

for (x, p) in data:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be shortend to:

Suggested change
for (x, p) in data:
for (x, q) in zip(X, quantiles):

# Dictionary of results for this distribution
stats = {}

# Calculate basic stats for the distribution
min_val = np.min(x)
max_val = np.max(x)
p_val = np.percentile(x, 100 * p)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
p_val = np.percentile(x, 100 * p)
quantiles_val = np.percentile(x, 100 * q)


# Different size quantile list and plots
with pytest.raises(ValueError):
ax.violinplot(data, quantiles=[[0.1, 0.2], [0.5]])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this API doesn't quite make sense. You allow quantiles=[[0.1, 0.9], [0.2, 0.8]] but not quantiles=[[0.1, 0.2], [0.5]].

Either there is just a simple case that has the same quantiles for every postion. Or one can freely define the quantiles per position. Just allowing the same number of quantiles but with different values is an odd cut.

That said, I'm unclear if there is a practical need for different quantiles per position. It prevents comparability of the quantile lines between positions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is odd. I feel it perfectly makes sense to allow user draw different quantiles for different violins. I think in this implementation user still can draw same quantiles for each violin easily. But if only support same quantiles for every postion, it would be hard if users really want to draw different quantiles for different violins.

Copy link
Member

@timhoffm timhoffm May 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if the quantiles can be different, why should you prohibit different numbers of quantiles between violins? Either values and number of quantiles must be the same for all, or values and number can vary.

Copy link
Contributor Author

@sasoripathos sasoripathos May 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this example is user wants to plot only one violin, but gives two lists of quantiles. The ValueError is cause by number of violins != number of quantile lists, ie. 1 != 2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at the updated code in example folder, my design accepts two violins have different number and value of quantiles. (ie. [0.1, 0.2] for first violin, [0.5, 0.7, 0.9] for second violin)

Copy link
Member

@timhoffm timhoffm May 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I misinterpreted the test. Can you please use the same number of quantiles for both lists. Makes it hopefully more obvious that the number of elements in the sublists is not the issue to be tested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will update it with some other changes you mentioned/suggested.

Add next API change document for violin_stats in cbook.
Make violin() more backward campatible by making quantiles key optional.
Update smoke test for violin number and quantile lists number mismatch.
@sasoripathos
Copy link
Contributor Author

@timhoffm I updated the docstring, code and test cases based on your suggestions. I also added a .rst doc in next_api_change folder. Please have a check.

Copy link
Member

@timhoffm timhoffm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Please remove the API change again and it will be ready to go. Thanks for your effort!

@@ -1469,6 +1475,7 @@ def violin_stats(X, method, points=100):
- median: The median value for this column of data.
- min: The minimum value for this column of data.
- max: The maximum value for this column of data.
- quantiles: The quantile values for this column of data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, I misread the documentation here. It's a list of dictionaries with one dictionary per input dataset. Adding keys to these dicts is not breaking the API. So the above API change note is not necessary and should be removed again. Sorry for the noise.

@timhoffm timhoffm dismissed their stale review May 11, 2019 17:11

Comments handled.

@sasoripathos
Copy link
Contributor Author

@timhoffm API change doc removed.

@timhoffm timhoffm merged commit b6d3f7c into matplotlib:master May 11, 2019
@timhoffm
Copy link
Member

Thanks! And congratulations on your first contribution to matplotlib!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: draw percentiles in violinplot
5 participants