Automatic downsampling of images. #5602

Closed
anntzer wants to merge 2 commits

Conversation

@anntzer (Contributor) commented Dec 2, 2015

This patch adds automatic downsampling of images whose dimensions are larger
than the number of physical pixels on which they will be displayed, in order to
allow displaying very large images. The idea is to patch AxesImage to replace
calls to self._A[x0:x1, y0:y1] with calls to self._A[x0:x1:xstep, y0:y1:ystep],
with a well-chosen step.

Because this directly patches AxesImage's code, it is simpler than
e.g. @ChrisBeaumont's ModestImage, which must modify self._A to "pretend" that
the image is something other than what it really is, and which doesn't handle
the "extent" kwarg.
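
Roughly, the stride selection looks like this (a minimal standalone sketch --
the pixel counts and the helper name are made up for illustration; this is not
the actual patch):

import numpy as np

def decimate_view(A, x0, x1, y0, y1, nx_pixels, ny_pixels):
    # Pick strides so that the slice keeps at most about one data point
    # per physical pixel in each direction.
    xstep = max(1, (x1 - x0) // max(1, nx_pixels))
    ystep = max(1, (y1 - y0) // max(1, ny_pixels))
    return A[x0:x1:xstep, y0:y1:ystep]

A = np.random.rand(10000, 10000)
small = decimate_view(A, 0, 10000, 0, 10000, 800, 600)
print(small.shape)  # roughly (800, 600); here (834, 625)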

What still needs to be done:

  • agree on a kwarg/rcParam to control the behavior, and possibly the
    downsampling level (perhaps one wants e.g. up to 5 data points per pixel
    instead of a single one),
  • handle image caches: currently I just disabled rgbacache, which should be
    fine as to_rgba (which is what hung matplotlib on huge images so far) will
    only be called on reasonably-sized arrays, but perhaps there's a better option?
  • tests.

@mdboom (Member) commented Dec 2, 2015

First, a big thank you for taking this on. This is an important part of matplotlib that needs improvement.

One concern: This is effectively decimating (i.e. ignoring data), not downsampling (i.e. averaging across neighboring values), correct? I understand that might be acceptable in some domains, but probably not in others (particularly in astronomy I know this would almost never be the right thing). That's not a reason to exclude it, but it should be documented clearly that that is what this feature does.
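
(A toy example of the difference, not from the patch: a single bright pixel --
say an isolated star -- is silently dropped by decimation but still contributes
to a block average.)

import numpy as np

img = np.zeros((8, 8))
img[3, 3] = 1.0                      # one isolated bright pixel

decimated = img[::4, ::4]            # decimation keeps only rows/cols 0 and 4
averaged = img.reshape(2, 4, 2, 4).mean(axis=(1, 3))   # 4x4 block averages

print(decimated.max())   # 0.0    -- the bright pixel was skipped entirely
print(averaged.max())    # 0.0625 -- it still shows up, just diluted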

The bigger problem matplotlib has currently is that images are processed by:

  • converting floating point values to RGB (colormapping)
  • interpolating in RGB space

It's done this way historically because we're using Agg, which has a bunch of interpolation algorithms in place that were easy to grab and use, and these only supported RGB. With the latest SVN version of Agg, this is reportedly no longer the case, but I spent some time a few months ago looking at this and it seems to not work as advertised.

But assuming someone can do the work so instead we do:

  • interpolate in floating point
  • colormap floating point -> RGB

then what you've done here just becomes one of many interpolation algorithms (and probably the fastest one).

Are you interested in helping with this?

Some code cleanup is thrown in at the same time, but it should be clear which
part is the interesting one.

Compare

    import numpy as np
    from matplotlib import pyplot as plt
    x, y = np.ogrid[:10000, :10000]
    plt.imshow(np.sin(x + y), interpolation="none")

with the downsampling and averaging methods to see the difference (one
shows severe aliasing, the other doesn't).  In either case one can zoom in
interactively to check that at high zooms, individual pixels are correctly
plotted.
@anntzer (Contributor, Author) commented Dec 2, 2015

See latest commit, which allows the choice of downsampling method. The commit message includes an example where decimation vs. e.g. averaging produces significantly different results. (I wrote the decimating code first mostly for simplicity (with speed as a secondary benefit) but index_tricks comes to the rescue as usual.)
I think this issue is pretty different from interpolation (and I certainly don't intend to rewrite all of agg's interpolation methods :-)); here, instead, we're grouping multiple values into a single one.
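
The grouping can be pictured roughly like this (a simplified sketch that
assumes the shape is an exact multiple of the steps; the actual commit uses
index_tricks to build the blocked view):

import numpy as np

A = np.random.rand(1000, 1200)
ystep, xstep = 4, 3

# View the array as blocks of shape (ystep, xstep) on the two trailing axes;
# any reduction over those axes is then a downsampling method.
blocks = A.reshape(A.shape[0] // ystep, ystep, A.shape[1] // xstep, xstep)
blocks = blocks.transpose(0, 2, 1, 3)        # shape (250, 400, ystep, xstep)

decimated = blocks[:, :, 0, 0]               # keep one value per block
averaged = blocks.mean(axis=(-1, -2))        # average each block
medianed = np.median(blocks, axis=(-1, -2))  # or take the median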

@tacaswell tacaswell added this to the proposed next point release (2.1) milestone Dec 3, 2015
@mdboom (Member) commented Dec 3, 2015

I think long term we really want to do proper interpolation on the input (greyscale) image, and this is just a half measure. I've already solved the C++ side of it in matplotlib/aggravate, now we just need to make the Python side use it. I think that will ultimately be more flexible than this and also be the same speed when interpolation == nearest. Would you like to help with that effort?

@anntzer (Contributor, Author) commented Dec 3, 2015

To be honest I don't care much, or at least have very little intuition, about interpolation (I only believe in interpolation="none" :-)). More importantly, I still fail to see the connection between the two cases -- except when you have too many data points per pixel in one direction (where you need to downsample) and not enough in the other (where you need to interpolate), but then you can always do the two steps sequentially.
A quick look (... limited by the quality of the docs but heh) at Agg's resampling algorithms suggests that they do not support downsampling, only upsampling, right?

@mdboom (Member) commented Dec 3, 2015

A quick look (... limited by the quality of the docs but heh) at Agg's resampling algorithms suggests that they do not support downsampling, only upsampling, right?

Nope, they address both. It's really just generic "resampling" where each pixel in the output is projected onto the input, and the average of that rectangle taken (modulo the filtering kernel).
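
(For intuition, a crude pure-NumPy version of that projection-and-average --
box filter only, ignoring the filtering kernel, and not Agg's actual code:)

import numpy as np

def box_resample(img, out_shape):
    # Each output pixel is the mean of the input rectangle it projects onto
    # (assumes the output is no larger than the input in either direction).
    hi, wi = img.shape
    ho, wo = out_shape
    row_starts = (np.arange(ho) * hi) // ho
    col_starts = (np.arange(wo) * wi) // wo
    sums = np.add.reduceat(np.add.reduceat(img, row_starts, axis=0),
                           col_starts, axis=1)
    areas = np.outer(np.diff(np.append(row_starts, hi)),
                     np.diff(np.append(col_starts, wi)))
    return sums / areas

x, y = np.ogrid[:2000, :2000]
print(box_resample(np.sin(x + y), (400, 300)).shape)  # (400, 300)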

@anntzer (Contributor, Author) commented Dec 3, 2015

But do we always want to use averaging? I'd say this is pretty application-dependent; certainly averaging, geometric averaging, the median, and possibly decimation all have legitimate uses there.

@jenshnielsen (Member) commented:

How does this connect with the new image infrastructure in #5718?

@anntzer (Contributor, Author) commented Jan 4, 2016

@mdboom suggests that #5718 handles this issue as a special case, but I haven't really followed what is happening on that issue.

@tacaswell (Member) commented:

This is a special case of #5718 (see my pathological example).

@mdboom (Member) commented Jan 5, 2016

This is basically a performance improvement for plotting very large images through decimation (which only really does the right thing if interpolation=='nearest').

To test, I'm using the following:

from matplotlib import pyplot as plt
import numpy as np

Z = np.random.rand(10000, 10000)
plt.imshow(Z, interpolation='nearest')
plt.savefig("test.png")

(If you want to reproduce the numbers, make sure you run it twice and ignore the first run to remove the effects of regenerating the font cache).

I get:

master: 7.24s
#5718: 3.61s
#5602: 2.22s

So, #5718 is much better than master already, by virtue of downsampling first, and then colormapping something much smaller. I'm a bit surprised by the remaining advantage of this over #5718, since it should boil down to effectively the same thing computationally... I'll need to do some further investigation. I'd still like to find a more general solution than this if possible.

@mdboom (Member) commented Jan 5, 2016

As best I can tell from profiling, the difference in time is mostly in norm. This PR runs norm on the decimated data, which unfortunately isn't correct, since it could miss local maxima and minima in the data if they fall in the cracks. I think we have to norm (or at least find min and max) over the entire original data, which obviously takes considerably more memory bandwidth and is, I believe, what accounts for the difference in timing.
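
(Concretely, the order would have to be something like the following sketch --
not code from either PR: take vmin/vmax from the full array once, then run the
per-pixel normalization only on the reduced view.)

import numpy as np
import matplotlib.colors as mcolors

Z = np.random.rand(10000, 10000)
step = 16  # hypothetical decimation factor

# The limits must come from the *full* data, or decimation can miss extrema...
norm = mcolors.Normalize(vmin=Z.min(), vmax=Z.max())
# ...but the expensive per-pixel normalization can run on the small view only.
scaled = norm(Z[::step, ::step])
print(scaled.shape)  # (625, 625)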

@anntzer (Contributor, Author) commented Jan 5, 2016

Two quick notes:

  • In its current state, the patch fairly easily allows downsampling methods other than decimation:
                # This is where downsampling happens.
                A = A[:, :, 0, 0] # or A = A.mean(axis=(-1, -2)), or etc.

(image.py, lines 239-240). Not that it matters much, because of the following point:

  • You are correct that normalization needs to happen on the whole image before resampling (typically: if you use a power-law norm, the proper way to average your values is probably a geometric mean, so you may as well normalize to a log scale first and do an arithmetic mean afterwards). So only the decimation method can benefit from normalization-after-resampling. (Also, I guess the difference in timing isn't just from computing the min and max, but also from doing the normalization itself on the whole array vs. the downsampled one.)
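
A small demonstration that the order matters for a nonlinear norm (numbers made
up for illustration):

import numpy as np
import matplotlib.colors as mcolors

norm = mcolors.PowerNorm(gamma=0.5, vmin=0, vmax=1)
block = np.array([0.0, 0.04, 0.16, 0.8])  # one block of data, flattened

print(norm(block).mean())   # normalize, then average: ~0.37
print(norm(block.mean()))   # average, then normalize: 0.5 -- a different pixel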

@efiring (Member) commented Jan 5, 2016

@mdboom, @anntzer, I'm not following this in detail, but here are a couple thoughts that might be relevant to the discussion:

  1. The computation of min and max is going through ma.min and ma.max, which probably has two sources of inefficiency: first, ma operations tend to be slow for a variety of reasons; and second, it unnecessarily makes two passes instead of calculating min and max in a single pass. It's unfortunate that numpy has ptp(), but not a function that returns min and max from a single pass. We could use our own C++ code for this (handling the possible mask at the same time).
  2. I haven't checked, but in almost anything related to color handling in floating point, if we are using double precision we are wasting memory. I think there is a place where we push things toward single precision, but I don't know whether we are enforcing single precision.
    Sorry not to be able to look into this more closely for now.
    Eric

@mdboom (Member) commented Jan 5, 2016

  1. The computation of min and max is going through ma.min and ma.max, which probably has two sources of inefficiency: first, ma operations tend to be slow for a variety of reasons; and second, it unnecessarily makes two passes instead of calculating min and max in a single pass. It's unfortunate that numpy has ptp(), but not a function that returns min and max from a single pass. We could use our own C++ code for this (handling the possible mask at the same time).

Yeah -- this seems like something that Numpy really should grow. I wonder if they'd be amenable to a pull request on that. We'd still have to have our own answer in the meantime, though.
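
Something along these lines could serve as the interim answer (my sketch, not
an existing NumPy or matplotlib function; it doesn't handle masks): walk the
array in blocks so both reductions are updated while each block is in cache,
rather than making two full passes.

import numpy as np

def minmax(a, blocksize=2 ** 20):
    # Single sweep over the (flattened) data, updating min and max together.
    flat = np.asarray(a).ravel()
    vmin, vmax = np.inf, -np.inf
    for start in range(0, flat.size, blocksize):
        block = flat[start:start + blocksize]
        vmin = min(vmin, block.min())
        vmax = max(vmax, block.max())
    return vmin, vmax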

  2. I haven't checked, but in almost anything related to color handling in floating point, if we are using double precision we are wasting memory. I think there is a place where we push things toward single precision, but I don't know whether we are enforcing single precision.

In the common case of single-plane data that is colormapped, we colormap to 8-bit ints (the colormap code can also output floats, but in practice that code path is rarely used). #5718 improves things further when the input data is already color: in that case, if the input data is 8-bit int, we no longer require passing through float, but can preserve it as 8-bit all the way down.
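
(For reference, that 8-bit path is reachable from the Python side as well:
Colormap instances can return uint8 RGBA directly via bytes=True. A small
example of my own, just to show the dtype:)

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

data = np.random.rand(4, 4)
norm = mcolors.Normalize()                   # autoscales to the data limits
rgba = plt.get_cmap('viridis')(norm(data), bytes=True)
print(rgba.dtype, rgba.shape)                # uint8 (4, 4, 4)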

@WeatherGod (Member) commented:

There has been talk of an np.minmax() function for years, but it has never
moved beyond that. I think they would be amenable to such a PR.

Something else that is even worse about np.ma.min() and friends: they make
copies of the array (sans the masked parts)! I was certain there was a GitHub
issue filed for that, but I can't find it at the moment.


@efiring (Member) commented Jan 5, 2016

The following shows where some major slowdowns are coming from the masked array support. It certainly would be nice to get all that sped up within numpy...

In [1]: import numpy as np
In [2]: xx = np.random.randn(10000, 10000)
In [3]: timeit(xx.min())
10 loops, best of 3: 47 ms per loop
In [4]: timeit(xx.ptp())
10 loops, best of 3: 94.5 ms per loop
In [7]: yy = np.ma.masked_greater(xx, 2)
In [8]: timeit(yy.min())
1 loops, best of 3: 508 ms per loop
In [9]: timeit np.nanmin(yy.filled(np.nan))
1 loops, best of 3: 577 ms per loop
In [10]: timeit yy.filled(np.nan)
1 loops, best of 3: 457 ms per loop
In [12]: timeit np.ma.max(xx)
10 loops, best of 3: 47.1 ms per loop

It looks like the implementation of ptp() is not very efficient.
Masked arrays are slow only when there are masked values, so it doesn't look like min and max are responsible for the slow norming in your test case--correct?

@mdboom (Member) commented Jan 5, 2016

Masked arrays are slow only when there are masked values, so it doesn't look like min and max are responsible for the slow norming in your test case--correct?

Yeah -- in my test case, there's nothing masked out...

@mdboom (Member) commented Jan 6, 2016

(Also, I guess the difference in timing isn't just from computing the min and max, but also from doing the normalization itself on the whole vs the downsampled array.)

Yes -- there may be a way to take advantage of that in the context of #5718. The tricky bit is that normalization happens in Python/Numpy (through a nice plug-in API where different normalization functions can be written and interchanged), while resampling happens in C++, so plugging normalization into the resampling step could be awkward. #5718 already cuts the run time in half -- we know from this PR that there's probably more that could be done with regard to speed, but I'm not certain we should let that hold up #5718.

@anntzer (Contributor, Author) commented May 12, 2016

How does the new image machinery play with this? I have noticed it seems to perform quite well even with very large images. Should this PR be closed?

@mdboom (Member) commented May 12, 2016

Thanks for pinging this. Indeed, the image refactor was in part a response to this -- to try to do this faster without resorting to decimation. I think we should probably close this unless further testing reveals new problems with the new code.

@tacaswell tacaswell removed this from the 2.1 (next point release) milestone Oct 3, 2017
@anntzer anntzer restored the auto-image-downsampling branch March 20, 2019 14:20
@anntzer (Contributor, Author) commented Mar 20, 2019

Reopening per the discussion at https://gitter.im/matplotlib/matplotlib?at=5c92437e49fdaa0d81e71ec9.
(Obviously the PR needs a big update, this is just to track the issue.)

@anntzer anntzer reopened this Mar 20, 2019
@anntzer anntzer added this to the v3.2.0 milestone Mar 20, 2019
@jklymak (Member) commented Mar 21, 2019

Image taken from http://www.imagemagick.org/Usage/filter/nicolas/rings_sm_orig.gif The article is also quite useful: http://www.imagemagick.org/Usage/filter/nicolas/

import matplotlib.pyplot as plt
import numpy as np

a = plt.imread('rings_sm_orig.gif')

print(np.shape(a))
# 200x200
fig, ax = plt.subplots(figsize=(4, 4))
ax.imshow(a, cmap='gray')
ax.set_position([0, 0, 1, 1])
fig.savefig('dpi50.png', dpi=50)
fig.savefig('dpi30.png', dpi=30)


fig, ax = plt.subplots(figsize=(4, 4))
ax.imshow(a, cmap='gray', interpolation='lanczos', filterrad=(2*50/30))
ax.set_position([0, 0, 1, 1])
fig.savefig('filtdpi50.png', dpi=50)
fig.savefig('filtdpi30.png', dpi=30)

dpi50.png and dpi30.png (no filter): [images not shown]

With filter:

filtdpi50.png and filtdpi30.png (lanczos filter): [images not shown]

Conclusion:

Agg seems to do the right thing if we give it the appropriate filter. The question is whether we should apply this filter automatically when we know we are downsampling...
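
For example, "automatically" could look something like this (imshow_auto is a
hypothetical wrapper, not matplotlib API; it only looks at the axes size at the
figure's current dpi, so a later savefig at a different dpi isn't accounted
for):

import matplotlib.pyplot as plt

def imshow_auto(ax, data, **kwargs):
    # Compare data pixels to the axes size in screen pixels and add an
    # anti-aliasing filter when the image would be downsampled.
    ny, nx = data.shape[:2]
    bbox = ax.get_window_extent()
    ratio = max(nx / bbox.width, ny / bbox.height)
    if ratio > 1:
        kwargs.setdefault('interpolation', 'lanczos')
        kwargs.setdefault('filterrad', 2 * ratio)  # same heuristic as above
    return ax.imshow(data, **kwargs)

With something like this, the second figure above could call
imshow_auto(ax, a, cmap='gray') and get the lanczos filter only when the image
is actually being downsampled.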

@anntzer (Contributor, Author) commented May 2, 2019

The issue is better described in #13724 now, so I'll close this.

@anntzer anntzer closed this May 2, 2019
@anntzer anntzer deleted the auto-image-downsampling branch May 2, 2019 10:27