plt.yscale('log') after plt.scatter() behaves unpredictably in this example. #6915


Closed
davidmikolas opened this issue Aug 7, 2016 · 32 comments

@davidmikolas

davidmikolas commented Aug 7, 2016

As mentioned in this SE question a scatter plot is not autoscaling to include all of the data if plt.yscale('log') is used after plt.scatter(). This happens for the y-axis but not the x-axis in the example, and does not happen for plt.plot().

In an earlier answer by a developer, ax.set_yscale('log') is shown following ax.scatter(), so I am wondering if this may be a bug.

Using matplotlib version 1.5.1 and Python 2.7.11.

[figure: four subplots comparing scatter vs. plot with log scaling applied after or before plotting]

import matplotlib.pyplot as plt

X = [0.997, 2.643, 0.354, 0.075, 1.0, 0.03, 2.39, 0.364, 0.221, 0.437]
Y = [15.487507, 2.320735, 0.085742, 0.303032, 1.0, 0.025435, 4.436435,
     0.025435, 0.000503, 2.320735]

plt.figure()

plt.subplot(2,2,1)
plt.scatter(X, Y)
plt.xscale('log')
plt.yscale('log')
plt.title('scatter - scale last')   

plt.subplot(2,2,2)
plt.plot(X, Y)
plt.xscale('log')
plt.yscale('log')
plt.title('plot - scale last')   

plt.subplot(2,2,3)
plt.xscale('log')
plt.yscale('log')
plt.scatter(X, Y)
plt.title('scatter - scale first')   

plt.subplot(2,2,4)
plt.xscale('log')
plt.yscale('log')
plt.plot(X, Y)
plt.title('plot - scale first')   

plt.show()
@LindyBalboa
Contributor

LindyBalboa commented Aug 10, 2016

Tested using latest master 2.0.0b3 and Python 3.5. Problem still present.

I played around a bit with this. It seems to affect both the OO style and pyplot. Of note, in the latest version the ticks and tick labels disappear. Also, the automatic scaling when the scale is set beforehand does not leave enough space at the bottom of the axis; the points end up along the x-axis.

EDIT: More playing.

So it seems that this problem is data dependent. X=[1,2,3] and Y=[1,10,100] works absolutely fine so long as you don't set the x-axis to log scale. Trying to set it to log scale breaks the graph. X=[1,10,100] and Y=[1,10,100] and both axes going to log scale (after scatter) works as well.

My guess is that there is something about the sample data you are trying to plot that is messing up an autoscaling algorithm somewhere.

@davidmikolas
Author

OK the data-dependence is great to know for two reasons: 1) I was pretty sure I hadn't seen this behavior before, 2) it offers a clue that can narrow down what's going wrong.

I'll write a little script that creates a bunch of different kinds of data sets and try to narrow down which conditions are necessary to cause the problem.

@LindyBalboa
Contributor

@uhoh-SE did you have any luck tracking down what kind of data triggers the problem, and possibly where the limit of that data might lie?

@davidmikolas
Author

davidmikolas commented Aug 19, 2016

@LindyBalboa Thanks for the reminder - yes, I got to this point and then said "my brain hurts!" I can't look at this more until the weekend, so I'll post this Monte Carlo simulation here in case someone wants to look at it. It was written quickly (not intended to be shared), so it is a bit scrappy and there are no comments.

For 4000 plot simulations it took about 6 minutes on my laptop. In general the dots should be stair-casing between the two lines. There are a few things happening; something about the value -2.33 is probably an excellent clue, but while np.exp(-2.33) is close to 10^-1, it's not exact.

Is it possible to search for an occurrence of math.log() or np.log() instead of math.log10() or np.log10() somewhere in the appropriate place in matplotlib functions?

It seems to happen equally for both x and y axes, and for either axis alone if only one is log scale.

I guess there's no longer any question that it is a bug at least!

[figure: Monte Carlo results, 2x2 grid of X/Y lower and upper limits chosen by matplotlib vs. the actual data extrema]

import matplotlib.pyplot as plt
import numpy as np
import time

N = 4000

mins =  (7*np.random.random(2*N) - 5)     # -5 to +2
maxes = (5*np.random.random(2*N)) + mins  #  0 to  5 more

xmins,  ymins  = mins.reshape(2, -1)
xmaxes, ymaxes = maxes.reshape(2, -1)

n = 12

datasets = []

for xmin, xmax, ymin, ymax in zip(xmins, xmaxes, ymins, ymaxes):


    rans = np.random.random(2*n).reshape(-1, 2)
    lowers = rans.min(axis=0)
    uppers = rans.max(axis=0)
    xran, yran = ((rans-lowers) / (uppers-lowers)).T  # make sure you hit the target min/max
    x, y = xmin + (xmax-xmin)*xran, ymin + (ymax-ymin)*yran

    datasets.append((10**x, 10**y))


xlimits, ylimits = [], []


print("OK Go!")

start = time.perf_counter()  # time.clock() was removed in Python 3.8


print("N =", N)

for i, (X, Y) in enumerate(datasets):

    plt.figure()

    plt.scatter(X, Y)
    plt.yscale('log')
    plt.xscale('log')
    # plt.title('scatter - scale last')

    xlimits.append(plt.xlim())
    ylimits.append(plt.ylim())

    plt.close()

    if not i % (N//20):

        print(i, end=' ')


elapsed = time.perf_counter() - start  # don't shadow the time module

print(elapsed, elapsed/N)

xllims, xulims = np.array(xlimits).T
yllims, yulims = np.array(ylimits).T

plt.figure()

plt.subplot(2,2,1)
plt.scatter(xmins, np.log10(xllims))
plt.plot((-4.5, 2.5), (-5.5, 1.5))
plt.plot((-4.5, 1.5), (-4.5, 1.5))
plt.title("X LOWER limits")
plt.xlim(-6, 3)
plt.ylim(-6, 3)

plt.subplot(2,2,2)
plt.scatter(xmaxes, np.log10(xulims))
plt.plot((-3.5, 7.5), (-3.5, 7.5))
plt.plot((-4.5, 6.5), (-3.5, 7.5))
plt.title("X UPPER limits")
plt.xlim(-6, 8)
plt.ylim(-6, 8)

plt.subplot(2,2,3)
plt.scatter(ymins, np.log10(yllims))
plt.plot((-4.5, 2.5), (-5.5, 1.5))
plt.plot((-4.5, 1.5), (-4.5, 1.5))
plt.title("Y LOWER limits")
plt.xlim(-6, 3)
plt.ylim(-6, 3)

plt.subplot(2,2,4)
plt.scatter(ymaxes, np.log10(yulims))
plt.plot((-3.5, 7.5), (-3.5, 7.5))
plt.plot((-4.5, 6.5), (-3.5, 7.5))
plt.title("Y UPPER limits")
plt.xlim(-6, 8)
plt.ylim(-6, 8)

plt.show()

@LindyBalboa
Contributor

Ah that is fantastic! That does definitely help narrow it a bit. I'm looking forward to looking into this later 👍

@LindyBalboa
Contributor

LindyBalboa commented Aug 21, 2016

I did some tracking with pdb and things start to get a little funny inside mpl/axes/_base.py:

3142    def set_yscale(self, value, **kwargs):
3143         """
3144         Call signature::
3145 
3146           set_yscale(value)
3147 
3148         Set the scaling of the y-axis: %(scale)s
3149 
3150         ACCEPTS: [%(scale)s]
3151 
3152         Different kwargs are accepted, depending on the scale:
3153         %(scale_docs)s
3154         """
3155         # If the scale is being set to log, clip nonposy to prevent headaches
3156         # around zero
3157         if value.lower() == 'log' and 'nonposy' not in kwargs.keys():
3158             kwargs['nonposy'] = 'clip'
3159 
3160         g = self.get_shared_y_axes()
3161         for ax in g.get_siblings(self):
3162             ax.yaxis._set_scale(value, **kwargs)  <<< I think the error is located somewhere in 
3163             ax._update_transScale()               <<< these three lines. There is a visual bug
3164             ax.stale = True                       <<< upon ax.stale = True
3165         self.autoscale_view(scalex=False)

@LindyBalboa
Contributor

LindyBalboa commented Aug 22, 2016

Okay, I have narrowed this down a little more: file mpl/axes/_base.py, method autoscale_view, inner function handle_single_axis.

There is a call on line 2270, minpos = getattr(bb, minpos), where the minpos argument expands to minpos = getattr(bb, 'minposy'). The trouble is that this minpos, at least for the very first example you posted, returns the max value instead of the min value: 15.487507 instead of 0.000503.

The minpos attribute is set inside mpl/transforms.py line 919, which calls update_path_extents. I'm guessing the error is in there; it lives in the _path_wrapper.cpp file. I'm not so familiar with the mpl .cpp files, so I'm having a hard time figuring out what is going on in there.

Too tired to continue tonight. Maybe someone else can crack the cpp easily now that we know what is happening.
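For anyone who wants to poke at the same spot, here is a minimal inspection sketch (my own, assuming a headless backend; dataLim is a Bbox, and its minposx/minposy attributes are what autoscale_view ends up reading):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no window needed
import matplotlib.pyplot as plt

Y = [15.487507, 2.320735, 0.085742, 0.000503]

fig, ax = plt.subplots()
ax.scatter([1, 2, 3, 4], Y)

# minposy should be the smallest positive y value (0.000503 here);
# on affected versions it comes back wrong.
print(ax.dataLim.minposy)
```

Comparing this printed value against min(Y) makes the broken case easy to spot.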

@tacaswell tacaswell modified the milestones: 2.0.1 (next bug fix release), 2.0 (style change major release) Aug 23, 2016
@davidmikolas
Author

davidmikolas commented Aug 23, 2016

@LindyBalboa Very brave! I can't help with this part (more than a few dozen lines of Python makes me dizzy and want to start eating instant coffee right out of the jar), but I can mention that I looked more closely, ran more Monte Carlos, and still see this funny pattern. The strange effect still seems to happen when the minimum X value is below about 10**-2.3466, which is at first glance a strange number. However, if you try np.exp(-2.3466) you get 0.09569, which looks very close to an upper limit of 0.1 or so.

Is it possible to do a simple text search for 0.1, np.exp(, math.exp(, np.log(, or math.log( (in other words, natural logs and exponents instead of the base-10 varieties)?

[figure: Monte Carlo rerun showing the same vertical-band artifact near 10**-2.3466]

@tacaswell
Member

@uhoh-SE Can you explain those plots (or give them axis labels!)?

@davidmikolas
Author

@tacaswell OK sure I'll make something more self-explanatory today, thanks. Briefly each dot is the result of one of 4000 log-log plots from 4000 random data sets (of 12 points each). The X coordinate of the dot is the actual minimum x (abscissa) value of the 12 points, and the Y coordinate of the dot is the lower limit chosen by matplotlib. The staircasing between the two lines is the expected and desired (presumably) behavior of a plotting routine. The streaks to the left represent cases where the minimum x (abscissa) value is much lower than the lower limit chosen - resulting in missing data from the final plot.

Because I didn't want to actually use a log plot to accurately portray a bug with log plotting, I am plotting the base-10 logarithms on a linear scale.

The visual appearance of some kind of vertical band at around -2.5 is striking and has no simple explanation that I can think of, so it may be a useful hint.

@efiring
Member

efiring commented Sep 12, 2016

The first example is even worse on v2.x.

@LindyBalboa
Contributor

LindyBalboa commented Sep 16, 2016

Piece of the puzzle:

vmin is generated by axis.get_view_interval() which in turn comes from axes.viewLim.interval[x,y]

 /home/conner/Documents/GitHub/matplotlib/lib/matplotlib/axes/_base.py(3161)set_yscale()
-> ax.yaxis._set_scale(value, **kwargs)
(Pdb) s
--Call--
> /home/conner/Documents/GitHub/matplotlib/lib/matplotlib/axis.py(680)_set_scale()
-> def _set_scale(self, value, **kwargs):
(Pdb) n
> /home/conner/Documents/GitHub/matplotlib/lib/matplotlib/axis.py(681)_set_scale()
-> self._scale = mscale.scale_factory(value, self, **kwargs)
(Pdb) n
> /home/conner/Documents/GitHub/matplotlib/lib/matplotlib/axis.py(682)_set_scale()
-> self._scale.set_default_locators_and_formatters(self)
(Pdb) n
> /home/conner/Documents/GitHub/matplotlib/lib/matplotlib/axis.py(684)_set_scale()
-> self.isDefault_majloc = True
(Pdb) Traceback (most recent call last):
  File "/home/conner/Documents/GitHub/matplotlib/lib/matplotlib/backends/backend_qt5agg.py", line 183, in __draw_idle_agg
    FigureCanvasAgg.draw(self)
  File "/home/conner/Documents/GitHub/matplotlib/lib/matplotlib/backends/backend_agg.py", line 464, in draw
    self.figure.draw(self.renderer)
  File "/home/conner/Documents/GitHub/matplotlib/lib/matplotlib/artist.py", line 68, in draw_wrapper
    return draw(artist, renderer, *args, **kwargs)
  File "/home/conner/Documents/GitHub/matplotlib/lib/matplotlib/figure.py", line 1262, in draw
    renderer, self, dsu, self.suppressComposite)
  File "/home/conner/Documents/GitHub/matplotlib/lib/matplotlib/image.py", line 139, in _draw_list_compositing_images
    a.draw(renderer)
  File "/home/conner/Documents/GitHub/matplotlib/lib/matplotlib/artist.py", line 68, in draw_wrapper
    return draw(artist, renderer, *args, **kwargs)
  File "/home/conner/Documents/GitHub/matplotlib/lib/matplotlib/axes/_base.py", line 2381, in draw
    mimage._draw_list_compositing_images(renderer, self, dsu)
  File "/home/conner/Documents/GitHub/matplotlib/lib/matplotlib/image.py", line 139, in _draw_list_compositing_images
    a.draw(renderer)
  File "/home/conner/Documents/GitHub/matplotlib/lib/matplotlib/artist.py", line 68, in draw_wrapper
    return draw(artist, renderer, *args, **kwargs)
  File "/home/conner/Documents/GitHub/matplotlib/lib/matplotlib/axis.py", line 1111, in draw
    ticks_to_draw = self._update_ticks(renderer)
  File "/home/conner/Documents/GitHub/matplotlib/lib/matplotlib/axis.py", line 944, in _update_ticks
    tick_tups = [t for t in self.iter_ticks()]
  File "/home/conner/Documents/GitHub/matplotlib/lib/matplotlib/axis.py", line 944, in <listcomp>
    tick_tups = [t for t in self.iter_ticks()]
  File "/home/conner/Documents/GitHub/matplotlib/lib/matplotlib/axis.py", line 889, in iter_ticks
    self.major.formatter.set_locs(majorLocs)
  File "/home/conner/Documents/GitHub/matplotlib/lib/matplotlib/ticker.py", line 867, in set_locs
    vmin = math.log(vmin) / math.log(b)
ValueError: math domain error

@LindyBalboa
Contributor

LindyBalboa commented Sep 17, 2016

I haven't traced it back to the cause, but I found a place to interrupt the error chain:

Inside ticker.py, in the LogLocator class:

1998        minpos = self.axis.get_minpos()   <<< This is returning the max value instead of the min
1999
2000        if minpos <= 0 or not np.isfinite(minpos):
2001            raise ValueError(
2002                "Data has no positive values, and therefore can not be "
2003                "log-scaled.")
2004
2005        if vmin <= minpos:
2006            vmin = minpos <<< Commenting this line out stops the plot from breaking

This is really annoying because I have been trying to track down when the minpos attribute is getting messed up, but everything looks normal!

@efiring efiring self-assigned this Sep 18, 2016
@efiring
Member

efiring commented Sep 18, 2016

The first basic problem is that for collections the dataLim is a Bbox calculated when the collection is added to the Axes. Those dataLim change depending on the transforms, however, so when a transform changes they need to be updated by removing and re-adding each collection. When I do that, I get inclusive but too-large ax.dataLim. I will try to track that down tomorrow.

@davidmikolas
Author

As far as I know, natural logarithms or exp() are not likely to appear in plotting routines - most things should be base 10. However, there is that artifact at 10**-2.3466, or ~0.0045, which looks like natural logs were used. Is it possible to do a quick text search for np.exp(, math.exp(, np.log(, or math.log(, just to rule it out?
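One could script that search. Here is a hypothetical sketch (the pattern and the toy input are mine, not from the matplotlib source) that flags math.log( calls that are neither a base-conversion denominator nor followed by one:

```python
import re

# Toy input standing in for matplotlib source; the second line is the
# correct base-conversion idiom and should not be flagged.
code = (
    "vmin = math.log(vmin)\n"
    "vmax = math.log(vmax) / math.log(b)\n"
)

# math.log(...) not preceded by "/ " and not followed by "/ math.log("
pattern = re.compile(r"(?<!/ )math\.log\([^)]*\)(?!\s*/\s*math\.log\()")

for match in pattern.finditer(code):
    print(match.group(0))  # only the unconverted call is reported
```

Running this prints math.log(vmin) and skips the base-converted pair on the second line.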

@LindyBalboa
Contributor

LindyBalboa commented Sep 18, 2016

Thanks for jumping in on this, efiring. This has been bugging me for a while. I was hoping for a nice mid-difficulty fix but found quite a rabbit hole. At least I'm sharpening my debugging skills in the process!

@davidmikolas

If you look at the error trace I posted, the offending statement is vmin = math.log(vmin) / math.log(b). Despite being named log, it is in fact a natural logarithm. As a physicist this bothers me, but it is rather common in programming languages. The division by math.log(b) is there to convert it to base b (10 here). Bad notation and a roundabout solution IMO, but that is how the math library and even numpy are...

@adeak
Contributor

adeak commented Sep 18, 2016

@LindyBalboa for what it's worth, log(vmin)/log(b) should be the logarithm of vmin in base b regardless of the base of the log, at least on paper. This doesn't address numerical precision, of course, but at least I don't think it's bad notation. There's no straightforward way to compute log_b(vmin), is there (assuming that the given b-base logarithm is not implemented, of course)?
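A quick numerical check of the base-change identity mentioned above, using the smallest y value from the original example (the variable names are mine):

```python
import math

# log_b(x) = log(x) / log(b) holds regardless of the base of log;
# compare against math.log10 directly for b = 10.
vmin, b = 0.000503, 10
via_base_change = math.log(vmin) / math.log(b)
print(via_base_change, math.log10(vmin))  # the two agree to floating-point precision
```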

@LindyBalboa
Contributor

In all my years of taking math and physics, a base-10 logarithm has always been log and the natural logarithm has always been ln, without exception. Because of that I always second-guess what log in programming means.

@adeak
Contributor

adeak commented Sep 18, 2016

It is surely regional :) I was taught lg for base 10 and ln for the natural base, and a dangling log then would be meaningless. You can argue that without a specification the obvious choice would be the natural one, but then again we should resist the temptation to guess in the face of ambiguity.

@efiring efiring assigned mdboom and unassigned efiring Sep 18, 2016
@efiring
Member

efiring commented Sep 18, 2016

I'm going to admit defeat. A simpler test case is

import numpy as np
import matplotlib.pyplot as plt

X = np.linspace(0, 1, 3)
Y = np.logspace(np.log10(5e-4), np.log10(15), 3)

fig, ax = plt.subplots()
#ax.set_yscale('log')
col = ax.scatter(X, Y)
ax.set_yscale('log')
plt.show()

There is considerable updating of the collection and the axes that would need to be done for this to work whenever a collection is created before a switch to a log scale. Removing the collection, executing ax.relim(), and then re-adding the collection seems to be a step in the right direction, but it does not get everything right. As a side note, I think there is some margin hackery in scatter that should have been removed when the get_datalim() method was added to collections. And as a second side note, we are not handling margins correctly when using a log scale.
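The remove/relim/re-add dance described above can be sketched like this (a workaround sketch of mine, not a fix; on affected versions it reportedly only gets part of the way there):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import numpy as np
import matplotlib.pyplot as plt

X = np.linspace(0, 1, 3)
Y = np.logspace(np.log10(5e-4), np.log10(15), 3)

fig, ax = plt.subplots()
col = ax.scatter(X, Y)
ax.set_yscale('log')

# Force the data limits to be recomputed under the new (log) transform:
col.remove()            # detach the collection from the Axes
ax.relim()              # reset dataLim from the remaining artists
ax.add_collection(col)  # re-adding recomputes the collection's datalim
ax.autoscale_view()

print(ax.get_ylim())    # should now span roughly 5e-4 to 15
```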

@davidmikolas
Author

davidmikolas commented Sep 19, 2016

@LindyBalboa OK, so the funkiness shown in my Monte Carlo plot above is happening at about -2.305, which is what you'd get if you were trying to take log(x<~0.1)/log(10) but forgot the /log(10). So a regex search for an occurrence of log( that is not closely followed by a / and a second log( may turn up an error. Maybe not the error, but something is definitely going on with that -2.305.

Or... a situation where b stops being 10 for some reason.
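For what it's worth, the suspect constant really does match a natural log of 0.1; a two-line check (mine, just to make the hunch concrete; later in the thread the breakpoint turns out to move with marker size, so the coincidence may be a red herring, but the check is cheap):

```python
import math

# If some code computed log(0.1) with a natural log while meaning log10,
# it would produce roughly -2.3026 instead of -1.0:
print(math.log(0.1))    # about -2.302585, close to the observed ~-2.305 breakpoint
print(math.log10(0.1))  # -1.0
```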

@davidmikolas
Author

@efiring Kudos for the mcve!

@efiring
Member

efiring commented Sep 19, 2016

This problem has nothing to do with magic values; it is a matter of how transforms are made, modified, and used in the autoscaling, when dealing with collections. I'm not 100% sure, but it looks like a fairly fundamental limitation, not a simple bug. I don't expect the solution to be a one-liner.
I suspect that if you specify a much larger-than-default size for the scatter marker, you will see your -2.305 value change.
The fundamental problem with collections is that they operate with two or more transforms, so that in the scatter case, for example, they can draw objects of a given physical size at locations given in data coordinates. Autoscaling therefore requires figuring out the data range required for the objects to fit, and this depends on the data transform, and in a particularly tricky way when it is a log transform.

This is handled correctly (or at least nearly so) when the log transform is in effect at the time the collection is added to the Axes. Switching to the log transform after this, however, does not properly undo the calculation done with the original linear transform and redo it with the new log transform.

It is handled correctly by plot because that yields a Line2D object, which is much simpler than a collection and has a hook that triggers the necessary recalculations. The assumption is that marker sizes are small enough that they don't have to be taken into account in the autoscaling.

@davidmikolas
Author

davidmikolas commented Sep 19, 2016

@efiring OK, well in the Monte Carlo simulation above I did run thousands of plots with simulated data ranging over six orders of magnitude, from very narrow to very wide ranges, and that -2.305 repeated very consistently. Sorry, but I think this is meaningful.

@LindyBalboa
Contributor

LindyBalboa commented Sep 19, 2016

NOTE: This is wrong; I applied the sizes to the wrong scatter.

Monte Carlo with sizes parameter set: plt.scatter(..., sizes=[300])

I should have updated the code to add axis labels, but:
x-axis: datalim
y-axis: viewlim generated for the given datalim

[figure: Monte Carlo output]

@adeak
Contributor

adeak commented Sep 19, 2016

@LindyBalboa just to be sure: did you change those sizes for the Monte Carlo bit too?

@LindyBalboa
Contributor

LindyBalboa commented Sep 19, 2016

Yes, that is the output of the Monte Carlo code, just adding sizes=[300] to the calls to plt.scatter.

I was just following up on efiring's hunch:

I suspect that if you specify a much larger-than-default size for the scatter marker, you will see your -2.305 value change.

@adeak
Contributor

adeak commented Sep 19, 2016

Yeah, I got that. I just meant that there are two kinds of scatters in there: one in the actual Monte Carlo part (from which the data are determined), and a final call to scatter to plot said data. I would only think to change the former ones, as those might affect the data, while a call to scatter with small markers can stay to visualize said data in the original way. I just wanted to make sure that you changed the former scatters too, to avoid any confusion :)

@LindyBalboa
Contributor

LindyBalboa commented Sep 19, 2016

Oh shoot, you are totally right! I messed that up. Good catch. I ran it again with sizes=[300], but in the right spot. The breaking point has shifted to ~ -1.7. It seems the position of the breaking point depends on the marker size, which is totally in line with what efiring said.

[figure: Monte Carlo output with sizes=[300]; the breaking point has shifted to about -1.7]

@davidmikolas
Author

@efiring Amazing! Somehow my brain did not actually process the words 'marker size' properly when I read what you wrote. OK I understand better now, thanks! Also thanks @LindyBalboa for the new plots.

@NelleV NelleV modified the milestones: 2.1 (next point release), 2.0 (style change major release) Nov 11, 2016
@dstansby
Member

dstansby commented Aug 9, 2017

I'm running into this too at the moment. A workaround is to use plot to imitate scatter:

ax.plot(x, y, marker='o', linewidth=0)

will do all the auto-scaling properly, but still look like a scatter plot.
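A full, self-contained version of this workaround using the data from the original report (my assembly, not from the thread):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

X = [0.997, 2.643, 0.354, 0.075, 1.0, 0.03, 2.39, 0.364, 0.221, 0.437]
Y = [15.487507, 2.320735, 0.085742, 0.303032, 1.0, 0.025435, 4.436435,
     0.025435, 0.000503, 2.320735]

fig, ax = plt.subplots()
ax.plot(X, Y, marker='o', linewidth=0)  # markers only, no visible connecting line
ax.set_xscale('log')
ax.set_yscale('log')

# The limits now include the full data range, down to the smallest y value:
print(ax.get_ylim())
```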

@efiring
Member

efiring commented Aug 9, 2017

For suppressing the line between markers, using linestyle='None' would be more direct.
I'm going to close this, though. I think it is encompassed and superseded by #7413.
