[ENH]: Adding jitter mode in scatter #27935

Open
edoaltamura opened this issue Mar 15, 2024 · 13 comments

Comments

@edoaltamura

Problem

Is there a plan to add a 'jitter' functionality to scatter, so that markers are re-positioned to avoid overlap? An example of this idea is described here.

Proposed solution

Provided that the number of overlapping markers isn't too large, the re-positioning can be achieved by minimizing a measure of distance over the shape centers (for circles). A basic implementation could be based on this discussion.
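A minimal sketch of one way to re-position circular markers so that they do not overlap. It swaps in a simple greedy placement (each point takes the free slot closest to the category centre) rather than the distance minimization proposed above, and `swarm_offsets` and its parameters are hypothetical names, not an existing matplotlib API. It also assumes `y` and `diameter` are expressed in the same units.

```python
import numpy as np


def swarm_offsets(y, diameter):
    """Return x offsets so circles of `diameter` centred at (x, y) do not overlap.

    Hypothetical helper: greedy placement, not a global optimisation.
    """
    y = np.asarray(y, dtype=float)
    x = np.zeros_like(y)
    placed = []  # indices of points that already have a position
    min_sq = diameter**2 * (1 - 1e-9)  # small tolerance for touching circles

    for i in np.argsort(y, kind="stable"):
        # Only circles closer than one diameter in y can collide with point i.
        near = [j for j in placed if abs(y[i] - y[j]) < diameter]

        # Candidate positions: the centre, or just touching a neighbour
        # on its left or right side.
        candidates = [0.0]
        for j in near:
            dx = np.sqrt(diameter**2 - (y[i] - y[j]) ** 2)
            candidates += [x[j] - dx, x[j] + dx]

        # Try candidates from the centre outwards and keep the first free one.
        for c in sorted(candidates, key=abs):
            if all((c - x[j]) ** 2 + (y[i] - y[j]) ** 2 >= min_sq for j in near):
                x[i] = c
                break
        placed.append(i)
    return x
```

The cost grows quickly with the number of nearby points, which matches the caveat that this only suits a modest number of overlapping markers.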

@tacaswell
Member

This has come up and was rejected at least once before: #2750

If we did take some version of this functionality, it should be its own function/method rather than being tacked onto scatter. This really only makes sense in the narrow case of x being categorical, so we should only let it work for categorical data (or at least require integers). Alternatively, it could take a scalar x and a list of y, or a sequence of x and a sequence of sequences of y (e.g. the groupby), or match the API of boxplot and violinplot.
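As a rough illustration of the "sequence of sequences" shape mentioned above, here is a sketch of a standalone helper whose call signature loosely follows boxplot and violinplot. The name `stripplot` and the `width` and `seed` parameters are assumptions made for this example; no such function exists in matplotlib.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np


def stripplot(ax, dataset, positions=None, width=0.3, seed=0, **scatter_kw):
    """Hypothetical helper: one jittered scatter per group of y values."""
    rng = np.random.default_rng(seed)
    if positions is None:
        # boxplot and violinplot also default to positions 1..N
        positions = np.arange(1, len(dataset) + 1)
    if len(positions) != len(dataset):
        raise ValueError("positions and dataset must have the same length")

    artists = []
    for pos, values in zip(positions, dataset):
        values = np.asarray(values, dtype=float)
        # Uniform jitter in data units, centred on the group's position.
        x = pos + rng.uniform(-width / 2, width / 2, size=values.size)
        artists.append(ax.scatter(x, values, **scatter_kw))
    return artists


fig, ax = plt.subplots()
groups = [[1, 2, 2, 3], [2, 2.5, 3], [0.5, 1, 1.5, 2, 2.5]]
stripplot(ax, groups, s=15)
```

Keeping the jitter in a separate function leaves room for extra options without touching the signature of scatter.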

I also suspect that there is going to be a large number of knobs people will want added to control how the jitter works, and these should not be shoe-horned into the (already too complicated) signature of scatter.

I suspect that a majority of these cases are within the realm of what seaborn targets, and in that case we should point people at sns.swarmplot (https://seaborn.pydata.org/generated/seaborn.swarmplot.html).

This sort of thing could be made easier by the work in https://github.com/matplotlib/data-prototype, where this could be implemented as a step in the processing pipeline.

That said, I wonder if this can be achieved by writing a very custom Transform (https://matplotlib.org/stable/api/transformations.html#matplotlib.transforms.Transform) that is offered the full set of x, y and returns the "jittered" x, y. If you write that, then I think you can do ax.scatter(..., offset_transform=JitterTransform()) and it will "just work".
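A minimal sketch of what such a Transform subclass might look like, assuming uniform jitter applied to x in the transform's input units. The class name follows the comment above, while `width` and `seed` are assumptions for illustration. The random generator is re-seeded on every call so that repeated draws of the same points give the same positions.

```python
import numpy as np
from matplotlib.transforms import Transform


class JitterTransform(Transform):
    """Hypothetical transform that adds uniform jitter to the x coordinate."""

    input_dims = 2
    output_dims = 2

    def __init__(self, width=0.3, seed=0):
        super().__init__()
        self._width = width
        self._seed = seed

    def transform_non_affine(self, values):
        values = np.asarray(values, dtype=float)
        # Re-seed on every call so the same input always maps to the
        # same jittered output (otherwise points would move on redraw).
        rng = np.random.default_rng(self._seed)
        out = values.copy()
        half = self._width / 2
        out[:, 0] += rng.uniform(-half, half, size=len(values))
        return out


points = np.array([[1.0, 5.0], [1.0, 6.0], [2.0, 7.0]])
jittered = JitterTransform(width=0.4).transform(points)
```

The sketch only exercises the transform on an array. Composing it with `ax.transData` (for example `JitterTransform() + ax.transData`) and handing the result to scatter is not tested here, and how autoscaling and picking behave with it remains an open question.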


In summary my views on the ways this can be addressed:

  • adding API to scatter to jitter: hard no
  • adding a new method that pulls swarmplot upstream: maybe, leaning no
  • adding a Transform subclass that does the jitter + an example: maybe, leaning yes
  • new third-party library that implements hive/swarm/jitter plots: you don't need our permission, but why not just use seaborn?

@edoaltamura
Author

Thanks for the response @tacaswell, that's useful and clearly articulated. I like the Transform strategy to implement the jitter. Would you see value in exploring this?

@jklymak
Member

jklymak commented Mar 17, 2024

First, hard agree this cannot be tacked onto scatter. It's far more akin to violin or box plots, and the x-value needs to be categorical since you want to use "jitter" in x to move the points around.

> I like the Transform strategy to implement the jitter. Would you see value in exploring this?

I think that is going to be very hard, as you are going to want the jitter in screen space, and that information is not something we naturally pipe down to the Transform. Not impossible, but it's not going to be straightforward. I think you'd be better off writing a wrapper rather than trying to overload scatter.

@tacaswell
Member

We have transforms in the annotation context that do absolute offsets so it can be done. I agree not easy, but I think it will be the fun kind of hard.
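For reference, a short sketch of the kind of absolute offset referred to here, using `ScaledTranslation` composed after `ax.transData`. It shifts a whole scatter by a fixed number of points regardless of the data limits. It applies one offset to the entire collection, so it illustrates the screen-space mechanism only, not per-point jitter. The axis limits and the 6 pt shift are arbitrary values chosen for the example.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.transforms import ScaledTranslation

fig, ax = plt.subplots()
ax.set_xlim(0, 3)
ax.set_ylim(0, 10)

# The offset is given in inches and scaled by fig.dpi_scale_trans,
# so 6 points is 6 / 72 inch whatever the zoom level is.
shift_pt = 6
shifted = ax.transData + ScaledTranslation(shift_pt / 72, 0, fig.dpi_scale_trans)

y = np.array([2.0, 4.0, 6.0])
ax.scatter(np.ones_like(y), y, label="original")
ax.scatter(np.ones_like(y), y, transform=shifted, label="shifted 6 pt")

# Distance in pixels between the two placements of the same data point.
delta_px = shifted.transform((1, 2)) - ax.transData.transform((1, 2))
```

Because the translation is applied after the data transform, the shift stays the same size on screen when the axes limits change.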

@scottshambaugh
Contributor

scottshambaugh commented Jun 24, 2024

Throwing in my +1 for some sort of native jittering. It's something I reach for fairly often, only to rediscover that it's not implemented. I was reading https://nightingaledvs.com/ive-stopped-using-box-plots-should-you/ (and the somewhat related https://nightingaledvs.com/color-jitter/) this morning and am coming around to seeing it as a fairly fundamental data viz technique.

@story645
Member

story645 commented Jun 24, 2024

What about adding a jitter example to the gallery that also explicitly links out to seaborn/etc? If the example is popular enough, it can then be pulled into the library.

ETA: The thinking is that the example reminds folks that seaborn does this but also shows a custom option. And we can maybe use intersphinx w/ seaborn to maintain the links.

@jklymak
Member

jklymak commented Jun 24, 2024

I'm fine with an example or even a method - I just don't think the API should be an option to scatter.

As for boxplots, it is hard to come up with a reason why you would ever want to use those, but I'm not convinced "jitter" plots are the right way to go either. Multiple histograms or 2-D histograms seem much better than a weird extra dimension where you have to pick the density out by eye.

@story645
Member

story645 commented Jun 24, 2024

> Multiple histograms, or 2-D histograms seem much better than a weird extra dimension

Or violin plots, but swarm and stripplots can be nice for showing how the underlying data yields the distribution and showing discontinuities that may not be apparent in a continuous mapping. ETA: also when the data distribution can't be fitted to a standard distribution.

Either way, @scottshambaugh's right that they're a pretty standard technique - ggplot and seaborn implement it and Nightingale is a serious (if not quite academic) viz publication.

@tacaswell
Member

I'm softening a bit from my comment in March and am now neutral on adding a method with an API that rhymes with boxplot/violinplot that makes the "stripplot" from the Nightingale article or "bee plots" (sp? where the jitter range depends on the local density).

Anything we do needs to make sure that the jitter is only applied to "categorical like" data.

Another (maybe bad) idea would be to inject the jitter into the categorical unit handlers.

@story645
Member

> or "bee plots" (sp? where the jitter range depends on the local density).

swarmplots

> Another (maybe bad) idea would be to inject the jitter into the categorical unit handlers.

Was kinda thinking if default_units takes a dictionary (b/c jpl was injecting sorting that way), but I don't think that's any cleaner than using the transform

@story645
Member

story645 commented Jun 25, 2024

Also if something gets implemented in mpl, @mwaskom what kinda things could make the seaborn implementations cleaner?

@scottshambaugh
Contributor

scottshambaugh commented Jun 25, 2024

Scatter plots already handle categorical data seamlessly, so we already have the "stripplot" functionality.

I agree with @tacaswell that it makes sense to restrict this to categorical data. Perhaps adding on to scatter is fine but with an error raised for non-categorical axes? I'm not so clear on the reason for the hard no. (Maybe we really do need a separate function, as some categories might be numeric?). Note also that both axes might be categorical, in which case we will want to apply jitter in both directions.
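A small sketch of the "raise for non-categorical input" idea, kept outside scatter. The helper name `jittered_positions` and its parameters are hypothetical. It treats only string arrays as categorical, which is an assumption of this example and would reject numeric categories, the very case raised in the parenthesis above. Note that `np.unique` sorts the labels alphabetically, whereas matplotlib's categorical axes use order of first appearance.

```python
import numpy as np


def jittered_positions(categories, width=0.3, seed=0):
    """Hypothetical helper: map string categories to jittered positions.

    Returns the jittered numeric positions and the sorted unique labels.
    """
    categories = np.asarray(categories).ravel()
    # Assumption for this sketch: only str/bytes arrays count as categorical.
    if categories.dtype.kind not in "US":
        raise TypeError("jitter is only supported for categorical (string) data")

    # codes[i] is the integer slot of categories[i] among the sorted labels.
    labels, codes = np.unique(categories, return_inverse=True)
    codes = np.ravel(codes)

    rng = np.random.default_rng(seed)
    x = codes + rng.uniform(-width / 2, width / 2, size=codes.size)
    return x, labels


x, labels = jittered_positions(["a", "b", "a", "c", "b"])
```

The returned positions can be passed to `ax.scatter` with the labels set as tick labels. When both axes are categorical, the same helper could be applied to each axis in turn with different seeds.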

@scottshambaugh
Contributor

scottshambaugh commented Jun 26, 2024


5 participants