Scatter plots are very slow when using multiple colors #9053

Open
jzwinck opened this issue Aug 18, 2017 · 8 comments
Labels
keep (items to be ignored by the “Stale” GitHub Action), Performance

Comments

@jzwinck

jzwinck commented Aug 18, 2017

This program plots 3 million dots with random colors:

import matplotlib.pyplot as plt
import numpy as np

N = 3000000
x = np.random.random(N)
y = np.random.random(N)
c = np.random.random((N, 3)) # RGB

plt.scatter(x, y, 1, c, '.')
plt.show()

The initial display is very slow. Even more problematic: zooming is very slow. If you set c to None it will use a single color for all points and it will be fast, with zooming taking about 1 second, vs about 20 seconds with multiple colors.

If you zoom until only a few points are visible, the single-color plot will respond instantly, but the multi-color one will still take 20 seconds. It's as if all 3 million colors are being slowly remapped every time--even for points which can't be seen.

I would expect multi-color scatter plots to be only marginally slower than single-color ones. A 10x slowdown or worse makes me want to disable colors, but then I can't visualize my data properly.

#2156 (four years ago) was aimed at scatter plot performance but seems to have neglected the multi-color case, which as @ChrisBeaumont pointed out is a main use case for scatter(): #2156 (comment)

"Unfortunately, the biggest speedup in this PR (the blitting) essentially replicates what plot can do already. The compelling functionality of scatter is, IMO, the ability to map color and/or size onto data. I can envision two "medium"-hanging fruit optimizations, that might push this kind of functionality into the 10^5-6 points range [...]"

My real data has more points but only 12 distinct colors, so I'd be happy with a speedup even if it only applies when there are, say, up to 50 distinct colors. I also use a Colormap in my real application, but again I only take a few distinct choices from the map (whereas in the example above, every point has a unique color).

I'm using Matplotlib 2.0.2, NumPy 1.12.1, and Python 3.5.3 on 64-bit Linux with 128 GB of RAM.

@tacaswell
Member

In the case where everything is one color and one marker size, I think we take a shortcut and fall back to the code path used for plot, which renders the marker once and then 'stamps' it onto the canvas as many times as needed. In the more general case, every marker gets rendered, colored, and then clipped individually (since in general every marker could be a different color and size).

If you have on the order of 50 colors and are not using the continuous size variability of scatter, I suggest using a loop + plot with no line style.
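
A minimal sketch of the loop + plot suggestion, assuming the data comes with a per-point color index. The names `coloridx` and `colors`, the sizes, and the Agg backend are illustrative assumptions, not taken from the comment above:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
N, C = 100_000, 12                 # illustrative sizes
x = rng.random(N)
y = rng.random(N)
coloridx = rng.integers(0, C, N)   # hypothetical color index of each point
colors = rng.random((C, 3))        # one RGB row per distinct color

fig, ax = plt.subplots()
for k in range(C):
    mask = coloridx == k
    # linestyle="none" draws markers only, so each call is a single-color,
    # single-size marker set of the kind described above
    ax.plot(x[mask], y[mask], linestyle="none", marker=".", markersize=1,
            color=tuple(colors[k]))
fig.canvas.draw()  # force a render; use plt.show() with an interactive backend
```

Each color ends up as its own Line2D, so points are layered in color order rather than in data order.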

@anntzer
Contributor

anntzer commented Aug 18, 2017

@mdboom suggests this would not be so easy to do with Agg (#2156 (comment)). I don't know Agg's internals, but this issue is solved by https://github.com/anntzer/mpl_cairo (... if you can manage to install it :-))

@jzwinck
Author

jzwinck commented Aug 18, 2017

@tacaswell In the typical case where all markers are the same size, there is no need to render markers that cannot be inside the viewport. Also, when zooming in, there is no need to render markers which were previously not visible (zooming in can only eliminate markers from the viewport, not introduce them).
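
The culling idea can be sketched in user code with NumPy alone. This is only an illustration of the argument, not how Matplotlib's renderer works; `visible_mask` is a hypothetical helper, and it ignores marker extent (a marker centred just outside the view could still overlap it):

```python
import numpy as np

def visible_mask(x, y, xlim, ylim):
    """Boolean mask of the points inside the rectangular view limits."""
    xmin, xmax = sorted(xlim)
    ymin, ymax = sorted(ylim)
    return (x >= xmin) & (x <= xmax) & (y >= ymin) & (y <= ymax)

rng = np.random.default_rng(0)
x = rng.random(1_000_000)
y = rng.random(1_000_000)

full = visible_mask(x, y, (0.0, 1.0), (0.0, 1.0))
zoomed = visible_mask(x, y, (0.4, 0.5), (0.4, 0.5))

# Zooming in further only needs to test points that were already visible.
idx = np.flatnonzero(zoomed)
inner = idx[visible_mask(x[idx], y[idx], (0.45, 0.46), (0.45, 0.46))]
```

Only `x[inner]` and `y[inner]` (and the matching colors) would then need to be handed to the renderer.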

I tried your suggestion of calling scatter() in a loop with a single color at a time:

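import matplotlib.pyplot as plt
import numpy as np

N = 3000000
C = 10
x = np.random.random(N)
y = np.random.random(N)
coloridx = np.random.randint(0, C, N)  # color index of each point
colors = np.random.random((C, 3))      # one RGB row per distinct color

# old way: plt.scatter(x, y, 1, colors.take(coloridx, axis=0), '.')

# new way: one scatter() call per distinct color
for ii in range(C):
    mask = ii == coloridx
    # color= takes a single RGB value unambiguously (a bare RGB triple passed as c is ambiguous)
    plt.scatter(x[mask], y[mask], 1, color=colors[ii], marker='.')

plt.show()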

It is indeed much faster, though it produces a different Z order (before zooming in, all markers appear to be the same color, because they were drawn in color order).

I can live with the Z order difference since it provides so much better performance. Thank you for the suggestion. Now that we know how to do it, what do you think about making Matplotlib enable this efficient behavior in a more obvious way? For example, the API could take a new type of color map:

plt.scatter(x, y, 1, coloridx, '.', colortable=colors) # hypothetical API

The hypothetical colortable parameter would require the c parameter to be an array of integer indexes, and the colors would then be looked up. This is similar to a Colormap but discrete.
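
For reference, the lookup such a colortable would perform can already be written by hand. The sketch below shows the index-to-color mapping only, not a faster rendering path; the array names and sizes are illustrative, and using `ListedColormap` here is an assumption about how one might express the discrete table with existing classes:

```python
import numpy as np
from matplotlib.colors import ListedColormap

rng = np.random.default_rng(0)
C, N = 10, 1000
colors = rng.random((C, 3))        # the color table: one RGB row per entry
coloridx = rng.integers(0, C, N)   # integer index into the table, per point

# Plain NumPy lookup: fancy indexing picks whole rows, giving shape (N, 3).
rgb = colors[coloridx]

# A ListedColormap called with integers indexes its table directly
# and returns RGBA rows, shape (N, 4).
rgba = ListedColormap(colors)(coloridx)
```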

One could also imagine transparently accelerating the existing API by first checking if the number of distinct values in c is much smaller than the number of data points, and reusing stamps if so.
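
A rough sketch of that check, done in user code rather than inside Matplotlib. `group_by_color` and its `max_distinct` threshold are hypothetical names chosen for illustration:

```python
import numpy as np

def group_by_color(c, max_distinct=50):
    """Return (unique_colors, index_per_point), or None if c has too many distinct rows."""
    c = np.asarray(c)
    uniq, inverse = np.unique(c, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # keep the index 1-D across NumPy versions
    if len(uniq) > max_distinct:
        return None
    return uniq, inverse

rng = np.random.default_rng(0)
palette = rng.random((12, 3))
c_few = palette[rng.integers(0, 12, 100_000)]  # many points, 12 distinct colors
c_many = rng.random((1000, 3))                 # every point has its own color

few = group_by_color(c_few)
many = group_by_color(c_many)
```

When the result is not None, `inverse` plays the role of the color index and the points can be drawn one group per distinct color. Note that `np.unique` sorts the rows, so the check itself costs O(N log N).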

@anntzer
Contributor

anntzer commented Aug 18, 2017

I would quite strongly oppose adding an API that provides an approximate behavior to work around a less-than-optimal implementation.

@jzwinck
Author

jzwinck commented Aug 18, 2017

@anntzer I too would prefer that the existing API simply be made more efficient. Right now, its performance is quite terrible for one of its most common use cases.

@github-actions

This issue has been marked "inactive" because it has been 365 days since the last comment. If this issue is still present in recent Matplotlib releases, or the feature request is still wanted, please leave a comment and this label will be removed. If there are no updates in another 30 days, this issue will be automatically closed, but you are free to re-open or create a new issue if needed. We value issue reports, and this procedure is meant to help us resurface and prioritize issues that have not been addressed yet, not make them disappear. Thanks for your help!

@github-actions github-actions bot added the status: inactive Marked by the “Stale” Github Action label Apr 16, 2023
@anntzer
Contributor

anntzer commented Apr 16, 2023

Generating the stamp in monochrome (or rather, alpha-only) and coloring it on-demand should definitely help; in mplcairo this is implemented in PatternCache::mask and ultimately relies on cairo_mask (https://www.cairographics.org/manual/cairo-cairo-t.html#cairo-mask) to do the coloring of the alpha mask.
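
The idea can be illustrated with plain NumPy. This is a toy compositing sketch, not the mplcairo or Agg code; `make_alpha_stamp` and `stamp` are hypothetical helpers, and the stamp is assumed to lie fully inside the canvas (no edge clipping):

```python
import numpy as np

def make_alpha_stamp(radius):
    """Alpha-only disc marker, rendered once: 1.0 inside the disc, 0.0 outside."""
    r = np.arange(-radius, radius + 1)
    yy, xx = np.meshgrid(r, r, indexing="ij")
    return (xx**2 + yy**2 <= radius**2).astype(float)

def stamp(canvas, alpha, cx, cy, rgb):
    """Blend rgb onto canvas through the alpha mask, centred at pixel (cx, cy)."""
    h = alpha.shape[0] // 2
    region = canvas[cy - h:cy + h + 1, cx - h:cx + h + 1]  # a view into canvas
    a = alpha[..., None]
    region[...] = a * np.asarray(rgb) + (1.0 - a) * region

canvas = np.ones((64, 64, 3))   # white RGB canvas
alpha = make_alpha_stamp(2)     # the marker shape is rasterized only once
rng = np.random.default_rng(0)
for _ in range(100):
    cx, cy = rng.integers(2, 62, size=2)
    stamp(canvas, alpha, cx, cy, rng.random(3))  # a different color per stamp
```

The per-marker work is then a small blend instead of a full rasterization of the marker path.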

@github-actions github-actions bot removed the status: inactive Marked by the “Stale” Github Action label Apr 17, 2023

@github-actions github-actions bot added the status: inactive Marked by the “Stale” Github Action label May 29, 2024
@anntzer anntzer added keep Items to be ignored by the “Stale” Github Action and removed status: inactive Marked by the “Stale” Github Action labels May 29, 2024