Add new downsample method for lines #7632
Conversation
@@ -592,6 +599,59 @@ def get_markevery(self):
        """return the markevery setting"""
        return self._markevery

    def set_downsample(self, downsample):
Can you also add a property for these?
Hey @tacaswell, thanks for reviewing! This is my first time working with matplotlib code, do you mind clarifying what property you'd like me to add?
I mean add downsample as a property (e.g. downsample = property(get_downsample, set_downsample)). We are, in the mid-term future, moving away from all of the getter/setter methods.
Ok. Does it make sense to make the getter/setter private if I'm doing that?
Do we need the get_/set_? Since this is new, can we just use the @property decorator?
I'm down for whatever. I was trying to use the markevery option as a guide for how to format the downsample option, but that was out of ignorance more than anything else. Just let me know what format you want it in and I'll refactor to that.
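For reference, a minimal sketch (on a stand-in class, not the PR's actual Line2D code) of what the @property style being discussed might look like; names here are purely illustrative:

# Hedged sketch only: a plain class demonstrating the @property style
# suggested above in place of a get_/set_ pair.
class LineLike:
    def __init__(self):
        self._downsample = False

    @property
    def downsample(self):
        """Whether to keep only per-pixel-column extrema when drawing."""
        return self._downsample

    @downsample.setter
    def downsample(self, value):
        self._downsample = bool(value)

line = LineLike()
line.downsample = True
print(line.downsample)  # True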
Very interesting!
I assume the performance win here comes from effectively dropping the anti-aliasing? I wonder if just turning it off on the line object would get you an equivalent speed up? There already exists some path simplification code in the path drawing code (https://github.com/tacaswell/matplotlib/blob/master/src/path_converters.h#L528). How does this compare with that implementation?
The speed up comes from the fact that you are rendering far fewer points (e.g. on my 1080p screen, it will render at most ~8000 line segments regardless of how many points of data there are). A short test I ran showed about 3.7 seconds to render when there was no downsampling, using both antialiased and not antialiased, compared to 0.2 seconds when there was downsampling.

import matplotlib.pyplot as plt
import numpy as np
import time
x = np.random.randn(1000000)
# antialias, no downsample
fig, ax = plt.subplots()
ax.plot(x, antialiased=True, downsample=False)
tic = time.time()
fig.savefig('antialiased_no_downsample.png')
print('antialias, no downsample', time.time() - tic)
# antialias, downsample
fig, ax = plt.subplots()
ax.plot(x, antialiased=True, downsample=True)
tic = time.time()
fig.savefig('antialiased_downsample.png')
print('antialias, downsample', time.time() - tic)
# no antialias, no downsample
fig, ax = plt.subplots()
ax.plot(x, antialiased=False, downsample=False)
tic = time.time()
fig.savefig('no_antialiased_no_downsample.png')
print('no antialias, no downsample', time.time() - tic)
# no antialias, downsample
fig, ax = plt.subplots()
ax.plot(x, antialiased=False, downsample=True)
tic = time.time()
fig.savefig('no_antialiased_downsample.png')
print('no antialias, downsample', time.time() - tic)
# Prints out the following:
# antialias, no downsample 3.771287202835083
# antialias, downsample 0.2187511920928955
# no antialias, no downsample 3.7316763401031494
# no antialias, downsample 0.218766450881958
I'll look into that and reply.
It looks like these two algorithms are accomplishing different things -- in my opinion rendering would benefit from having both downsampling/simplification run (although I'm guessing the downsampling in this PR would result in the simplification not doing too much). The path simplification code linked above says

/* idea: we can skip drawing many lines: lines < 1 pixel in
   length, and we can combine sequential parallel lines into a
   single line instead of redrawing lines over the same
   points. The loop below works a bit like a state machine,
   where what it does depends on what it did in the last
   looping. To test whether sequential lines are close to
   parallel, I calculate the distance moved perpendicular to
   the last line. Once it gets too big, the lines cannot be
   combined. */

The idea of simplifying almost-parallel lines is completely separate from this PR. Removing line segments that are < 1 pixel in length is similar -- I would say this PR is kind of an extension of that idea, but instead of just looking at a single pixel you try to simplify the path along a single column of pixels. Was this simplification algorithm being used in matplotlib versions 1.5.3 and lower? I haven't tested version 2.0 or higher -- if this was introduced in 2.0 or later then perhaps just using the existing path simplification algorithm would help my rendering time. But from the sound of it, there would still be some types of data that would not benefit from the simplification algorithm as much as they would benefit from the downsampling proposed in this PR.
The simplification code has been in the code base longer than I have been around, but it won't help much with really spiky data.
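To make the parallel-combining idea in the quoted comment concrete, here is a hedged, standalone sketch (not the actual path_converters.h logic); the function name and tolerance are illustrative:

import numpy as np

def can_merge(p_last, p_new, direction, tol_px=0.5):
    # Merge the new segment into the current run if the new point stays within
    # tol_px pixels of the line along "direction" (i.e. the segments are nearly
    # parallel), mirroring the "distance moved perpendicular" test quoted above.
    d = np.asarray(direction, dtype=float)
    d = d / np.hypot(*d)
    offset = np.asarray(p_new, dtype=float) - np.asarray(p_last, dtype=float)
    perp = offset - (offset @ d) * d   # component perpendicular to the run
    return np.hypot(*perp) < tol_px

print(can_merge((0, 0), (2, 0.1), direction=(1, 0)))  # True: nearly parallel
print(can_merge((0, 0), (2, 3.0), direction=(1, 0)))  # False: clearly bends away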
# Find where the pixel column of the x data changes
split_indices = np.diff(np.floor(verts_trans[:, 0]).astype(int)) != 0
split_indices = np.where(split_indices)[0] + 1
Maybe shorten the code and reduce the number of operations inside the loop (untested!):
for i, y_pixel_col in enumerate(np.split(verts_trans[:, 1], split_indices)):
    keep_inds[i, 1:3] = np.argmin(y_pixel_col), np.argmax(y_pixel_col)
starts = np.hstack((0, split_indices))
ends = np.hstack((split_indices, verts_trans.shape[0]))
keep_inds[:, :3] += starts[:, np.newaxis]
keep_inds[:, 3] = ends - 1
Good idea. Done. Your code worked without modification for me.
Current coverage is 62.07% (diff: 90.90%)

@@            master     #7632     diff @@
==========================================
  Files          174       174
  Lines        56150     56182      +32
  Methods          0         0
  Messages         0         0
  Branches         0         0
==========================================
+ Hits         34844     34876      +32
  Misses       21306     21306
  Partials         0         0
It looks like one of the SVG generation tests failed. Is this expected, or did my code break something?
Thanks for the pull request. I think this is almost ready to be merged.
Can you add an entry to the what's new documentation?
@@ -0,0 +1,21 @@
import numpy as np
Can you add a sphinx-gallery like docstring to explain why this example has been added to the gallery?
@@ -592,6 +599,61 @@ def get_markevery(self):
        """return the markevery setting"""
        return self._markevery

    @property
    def downsample(self):
        """Set the downsample property to subsample the plotted line segments.
The current docstring is a mixture of our old and new style of docstring. Can you please stick to the new style? (I'm adding a couple notes to explain what I mean).
        Recommended *only* when plotting with a solid line (e.g. no dashes,
        no markers).

        ACCEPTS: [True | False]
Remove this (it duplicates the information in the Parameters section)
        Parameters
        ----------
        downsample: True | False
Can you add a space after the colon? Otherwise it doesn't render properly with numpydoc.
        Notes
        -----
        If the x values of the line are monotonic, then the downsampling
I think there is a dash missing between x and values.
@kbrose I suspect the failure is unrelated to your patch.
Is there more to applying downsampling than the same logic that governs whether path simplification is suitable (i.e., plain linestyle, no markers)? I may really be missing something, but if not, it seems better to use a global switch and rely on the same logic as path simplification to choose whether to downsample individual paths.
Path simplification is general, works with any sort of path and backend; downsampling is specific to line plots of things like time series, where there are a very large number of points, x is monotonic, and y jumps up and down repeatedly over intervals of x smaller than a pixel.
The current path simplification algorithm already handles the given test case reasonably well:

[code and its output elided]

so there are a few duplicates but I doubt they're the issue. Rather, it appears that the path simplification code is not called at all in such a case (which can be checked by adding a …).

Thus I'd rather be sure that the OP's use case cannot, indeed, already be covered by the path simplification algorithm by clarifying the logic of where it is enabled.
@anntzer If I understand correctly, you're saying that the existing path simplification code is not actually called during … Either way, if I understand your code correctly, then the following code should show that the proposed method is substantially different:

import matplotlib.pyplot as plt
import numpy as np

x = np.random.rand(1000000)
l, = plt.plot(x)
tpath, affine = l._get_transformed_path().get_transformed_path_and_affine()
print('raw path:', l._path.vertices[:,0].size)
print('current :', l._path.cleaned(simplify=True).vertices[:,0].size)
print('proposed:', l._downsample_path(tpath, affine).vertices[:,0].size)

which shows the number of vertices in each of the paths.
I agree with your observation. However, note that …

I would think that such an algorithm (ignore the part about lines < 1px), if properly implemented, should have an effect very similar to what you are proposing (because combining the many ups-and-downs of the line (your proposal) is the same as squashing many parallel segments together (path simplification)), but covers more cases (e.g. when the lines are all "slanted"). Obviously it is not the case right now, but why? I'd rather, once again, have path simplification fixed to handle your case properly, unless it is clear that this is not possible.
Have you seen this comment in path_converter.h?

/* NOTE: We used to skip this very short segments, but if
   you have a lot of them cumulatively, you can miss
   maxima or minima in the data. */
/* Don't render line segments less than one pixel long */
/* if (fabs(*x - m_lastx) < 1.0 && fabs(*y - m_lasty) < 1.0) */
/* { */
/*     continue; */
/* } */

However, even if this is done I do not know if it would be enough. It is quite possible for the data to be very "spiky", i.e. for consecutive y-values to be further than 1 pixel apart. The worst case scenario is probably having your y-data be repeated …
I think your point corresponds to the "skipping lines < 1px in length" part of the comment; squashing consecutive parallel or antiparallel lines is implemented separately below (…).
Do we know whether it does, or is supposed to, handle anti-parallel lines? That could be the problem. Regardless, I am satisfied that my proposal here handles my use case well, and does not cause problems elsewhere in Matplotlib. I'd not be able to say the same if I modified the C++ code (it's been quite a few years since I've touched any C-based language). I'm just looking to get something into Matplotlib that will handle my use case ASAP. If updating the C++ algorithm would mean bumping the target version of this PR to 2.2 instead of 2.1, then I'd be in favor of keeping this proposal. If you think the C++ algorithm could be updated in time for 2.1 then that'd be even better. Ultimately, I will leave the decision of whether this proposal should be scrapped in favor of updates to the C++ algorithm to the Matplotlib maintainers.
I think it should handle antiparallel lines, as parallelism is checked using projected distance. I understand your interest in getting your PR in, but I'd rather have the already-present code fixed than end up with two incomplete methods, plus an additional API that's going to take forever to deprecate if we end up fixing path simplification.

If you wish to use downsampling for your own projects, note that you can easily patch … (where …).

(I've used similar patching methods myself for other changes...)
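The patch itself is not reproduced above; purely as an illustration of the kind of monkey patching being suggested, here is a hedged sketch that wraps Line2D.draw and temporarily swaps in a per-pixel-column min/max envelope for lines tagged with a hypothetical downsample_columns attribute (assumes evenly spaced, monotonic x; intended for static rendering such as savefig):

import numpy as np
import matplotlib.lines as mlines

_original_draw = mlines.Line2D.draw

def _draw_with_downsampling(self, renderer):
    n_cols = getattr(self, "downsample_columns", None)  # hypothetical attribute
    if not n_cols:
        return _original_draw(self, renderer)
    x = np.asarray(self.get_xdata())
    y = np.asarray(self.get_ydata())
    # Assign each sample to a pixel column (assumes evenly spaced samples).
    cols = (np.arange(len(x)) * n_cols) // len(x)
    starts = np.r_[0, np.where(np.diff(cols) != 0)[0] + 1]
    ends = np.r_[starts[1:], len(x)]
    # Keep first, min, max, and last y per column; reuse the column's first x.
    yfirst = y[starts]
    ylast = y[ends - 1]
    ymin = np.minimum.reduceat(y, starts)
    ymax = np.maximum.reduceat(y, starts)
    xd = np.repeat(x[starts], 4)
    yd = np.column_stack([yfirst, ymin, ymax, ylast]).ravel()
    self.set_data(xd, yd)
    try:
        return _original_draw(self, renderer)
    finally:
        self.set_data(x, y)  # restore the full-resolution data afterwards

mlines.Line2D.draw = _draw_with_downsampling

With such a patch in place, setting, say, line.downsample_columns = 2000 on a Line2D would make subsequent savefig calls render the reduced envelope rather than every raw point.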
From what I've pieced together, it appears as if the code intentionally will not simplify anti-parallel lines. Consider the following code:

if (totdot > 0.0) {
    if (paradNorm2 > m_dnorm2Max) {
        m_lastMax = true;
        m_dnorm2Max = paradNorm2;
        m_nextX = *x;
        m_nextY = *y;
    }
} else {
    _push(&m_lastx, &m_lasty);
    _push(x, y);
    break;
}

there, … If we actually want to go the route of modifying the C++ code, it looks like the main problem will be …
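For concreteness, a tiny numeric illustration of that branch (this is my hedged reading of the C++ above: totdot behaves like the dot product of the current run direction with the candidate segment):

run = (0.0, 10.0)        # current almost-vertical run, heading up
candidate = (0.0, -9.0)  # next segment doubles straight back down (anti-parallel)
totdot = run[0] * candidate[0] + run[1] * candidate[1]
print(totdot)  # -90.0: negative, so the simplifier takes the else branch and
               # pushes the points instead of merging the spike away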
It would be a non-trivial modification to add code to essentially mirror what's in this proposal but on almost-parallel paths instead of just almost-vertical paths; to (a) keep track of the first point on the path that might get simplified, (b) find the min/max there (but in this case the min/max found by treating the path you are building as the reals), (c) keep track of the last point on the path that might get simplified.
That being said, I see the utility in having one path simplification algorithm that returns a path whose shape is relatively preserved, e.g. for saving vector graphics files (zooming in will produce a line segment that is somewhat close to what it would be in the non-simplified version) and another path simplification that will destroy shape (mainly used for rasterization). These could definitely be the same method, just parameterized.
Good catch. I went through the current path simplification algorithm and I think something like the following "should" work: …
@anntzer do you think you will be able to follow up on your ideas for achieving a similar gain via modifications to the path simplification code?
Not right now (real life has caught up, and this is a nontrivial project). Unless this turns out much easier to do than expected I don't think I'll have time to work on it before, say, April.

A quick googling pointed out that, not unexpectedly, this is a well-studied problem. See e.g. https://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93Peucker_algorithm -- there are five other algorithms listed at the bottom. Whether any of them can handle the use case of the OP, I don't know, but I'd suggest we take advantage of the existing algorithms rather than our home-cooked thing. Perhaps make this a GSOC? :) (attn @tacaswell)

I understand that the usage of the OP is a fairly common one, and the code is already there, so I won't oppose merging something like that. However, in that case I'd still suggest that control of whether the procedure is applied be folded into the path.simplify rcparam (with internal testing that the linestyle is indeed solid) rather than a kwarg on …
@anntzer The algorithm proposed here is … The Ramer–Douglas–Peucker algorithm is expected …

The first additional algorithm (Visvalingam–Whyatt) listed on the wikipedia page has a good description here. Since it depends on the area of triangles, it looks like it too is not well suited to datasets with highly negative one-step autocorrelation. Consider my prototypical example of …

I do not care strongly about the specifics of how the downsampling is turned on, or what it is called.

Also, congratulations, everyone, on releasing 2.0! Thanks for all your hard work!
GSoC -> Google Summer of Code.
Sorry to keep pestering, but are there any thoughts on what to do with this?
(I'm unable to comment on this in any depth, but wanted to clarify that your pinging on this after two weeks of silence is completely appropriate.)
As mentioned earlier I do not consider that this PR should be merged (based on technical grounds -- do not take this personally :-)). If other devs want to push for it I would appreciate it if my comment above (which I reproduce below) could be taken into account.
I agree this should be merged with the existing path simplification, or at least put on the …
Ok. I have a couple of questions before I attempt anything else. Apologies in advance for all the notifications I'm about to generate. I want to get everyone's input before putting more work into this. Is everyone in agreement that modifying the existing path simplification code is the way to go? To be clear, I'd be modifying the algorithm as @anntzer and I discussed above -- essentially removing the current constraint that in order for a vertex to be removed, it must have a non-negative dot product with the currently proposed simplification vector.
Does anyone else have any comments? I now have Linux installed once again, so at least I can build from source. But I am wondering -- I understand wanting to keep source code clean and wanting to avoid having two very similar ways to accomplish something implemented in different places. But if the C++ code is not actually used anywhere, why not just swap it out with a Python implementation? If it's found to not be fast enough it could always be reimplemented in C++. So I'm asking, how would others feel about updating the proposed algorithm here to handle more than just vertical lines (essentially the C++ algorithm but handling anti-parallel lines) in Python, and deprecating the C++ implementation? I'm willing to put work into this, but am wary of moving forward without being relatively certain my efforts will be worth it.
I doubt that a pure Python implementation can be fast enough, but would certainly be willing to switch to one instead of the current C++ implementation if it handles more cases and is actually reasonably fast.
My current implementation is Python (not pure Python -- it uses numpy of course) and has been fast enough for me as I've been using it the past few months (thanks to your monkey-patch suggestion!). It's orders of magnitude better than what currently exists...
Closing in favor of #8776. Thanks for all the comments, everyone.
Overview
This PR adds a new downsample method for lines. This is an improved version of the algorithm found here.
When the x-values of the line are monotonic, this downsampling method will render at most 4 times (the width of the parent Axes in pixels) vertices. The downsampling is done such that the rendering of the downsampled plot looks very similar to the rendering of the non-downsampled plot. In fact, if there is no anti-aliasing and plotting uses a solid line only, then the renderings will be identical.

Possible Use Case(s)
I created this PR because I plot a lot of ECG data over large timespans. Typically, I might want to plot 12 or 24 hours of ECG data sampled at 125 Hz, resulting in 125 * 12 * 60 * 60 = 5400000 points of data. This is painfully slow to render, and interaction is very slow. Further, naive downsampling (e.g. plt.plot(x[::100], y[::100])) is not sufficient in this case, as artifacts of interest in this plot might only last 1 sample. Of course, this is just a specific instance, and any plotting of large datasets could benefit.
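A minimal illustration (synthetic data, not real ECG) of why the naive decimation above can hide a one-sample artifact:

import numpy as np

fs = 125                          # Hz, as in the ECG example above
y = np.zeros(12 * 60 * 60 * fs)   # 12 hours of flat signal: 5,400,000 samples
y[1_000_001] = 5.0                # a single-sample artifact
print(y.max())                    # 5.0 -- present in the full data
print(y[::100].max())             # 0.0 -- the spike is decimated away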
Algorithm Description
This algorithm makes use of the fact that for large data-sets, lots of x-values will lie on the same column of pixels. We really only need to draw the minimum and maximum y-values from all of the x-values that get mapped to the same column of pixels. A little more book-keeping is required to ensure that the connecting lines between two columns of pixels render correctly, but this just amounts to keeping the first and last pairs that are mapped to that column of pixels.

An image will probably help:
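The image is not reproduced here; in its place, a minimal sketch of the idea described above (the function name, the n_columns parameter, and the way x is mapped to pixel columns are all illustrative assumptions, not this PR's actual code):

import numpy as np

def downsample_minmax(x, y, n_columns):
    """Keep the first, min, max, and last point of every pixel column."""
    # Map every x value to a pixel-column index (x assumed monotonic).
    cols = np.floor(np.interp(x, (x[0], x[-1]), (0, n_columns - 1))).astype(int)
    # Index where each run of equal column indices starts/ends.
    starts = np.r_[0, np.where(np.diff(cols) != 0)[0] + 1]
    ends = np.r_[starts[1:], len(x)]
    keep = []
    for s, e in zip(starts, ends):
        block = y[s:e]
        keep.extend([s,                          # first point entering the column
                     s + int(np.argmin(block)),  # column minimum
                     s + int(np.argmax(block)),  # column maximum
                     e - 1])                     # last point leaving the column
    keep = np.unique(keep)                       # sort and drop duplicate indices
    return x[keep], y[keep]

# A million noisy points collapse to roughly 4 vertices per pixel column.
x = np.arange(1_000_000)
y = np.random.randn(1_000_000)
xd, yd = downsample_minmax(x, y, n_columns=1920)
print(len(xd))  # a few thousand, independent of the original size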
Caveats
This downsampling works best for plotting large amounts of data using a solid line and no markers.
If the x-values of the line are not monotonic, then there are no guarantees that rendering will be any faster. The "less monotonic" (for some definition of "less monotonic") the x-values are, the less downsampling will be able to occur.

Questions
I have marked with a TODO in the code a question I had. I am recomputing self.axes.transData.transform(verts) each update (verts is the vertices of the path). Is that cached somewhere already?

TODO: