
Inconsistent shape handling of parameter c compared to x/y in scatter() #12735


Closed
timhoffm opened this issue Nov 4, 2018 · 9 comments · Fixed by #13959

Comments

@timhoffm
Member

timhoffm commented Nov 4, 2018

Bug report

As described in #11663 (review):

Something funny is going on here (I know it was already there before, but still seems worth pointing out): we take x and y of any shapes and flatten them (they just need to have the same size), but c has to match either the shape of x or y, not just the size.

In other words the following "work" (i.e. perform an implicit ravel()):

scatter(np.arange(12).reshape((3, 4)), np.arange(12).reshape((4, 3)), c=np.arange(12).reshape((3, 4)))
scatter(np.arange(12).reshape((3, 4)), np.arange(12).reshape((4, 3)), c=np.arange(12).reshape((4, 3)))

but the following fail:

scatter(np.arange(12).reshape((3, 4)), np.arange(12).reshape((4, 3)), c=np.arange(12).reshape((6, 2)))
# and even
scatter(np.arange(12).reshape((3, 4)), np.arange(12).reshape((4, 3)), c=np.arange(12))

Of course that last one has the best error message (irony intended):

ValueError: 'c' argument has 12 elements, which is not acceptable for use with 'x' with size 12, 'y' with size 12.
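
The asymmetry reported above can be modelled without matplotlib. The sketch below is only an illustration of the rules as described in this report, not matplotlib's implementation. The helper names `xy_accepted` and `c_accepted` are hypothetical.

```python
from math import prod

def xy_accepted(x_shape, y_shape):
    """x and y only need the same number of elements (both get raveled)."""
    return prod(x_shape) == prod(y_shape)

def c_accepted(c_shape, x_shape, y_shape):
    """As reported here, c must match the *shape* of x or of y."""
    return c_shape == x_shape or c_shape == y_shape

x_shape, y_shape = (3, 4), (4, 3)

# Every candidate below has 12 elements, yet only two are accepted.
results = {
    c_shape: c_accepted(c_shape, x_shape, y_shape)
    for c_shape in [(3, 4), (4, 3), (6, 2), (12,)]
}
```

Under this model, `x` and `y` pass a size check while `c` has to pass a stricter shape check, which is why a flat `c` with the right number of elements is still rejected.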

@anntzer
Contributor

anntzer commented Nov 4, 2018

(I think we should just reject (with deprecation, yada yada) any non-1D input (and yes, that includes (n, 1) and (1, n) input).)

@efiring
Member

efiring commented Apr 14, 2019

The ravel that we do is a considerable convenience for common use cases such as working with gridded data for ocean or atmospheric fields.

@jklymak
Member

jklymak commented Apr 14, 2019

I dunno - it all seems pretty inconsistent to me, like whoever wrote scatter never looked at what plot does, or vice versa, whereas I think they are the same thing.

plt.plot(np.arange(12).reshape(3, 4), np.arange(12).reshape(3, 4))
plt.scatter(np.arange(12).reshape(3, 4), np.arange(12).reshape(3, 4))

are pretty shockingly different. I'm fine w/ them being different, but I'm somewhat against scatter just flattening the array without comment. I think I'd go with @anntzer's proposal so the difference is explicit. It's not like scatter(X.flat, Y.flat) is so hard...

@efiring
Member

efiring commented Apr 14, 2019

If they were the same thing we wouldn't need both. They are intentionally different. Where their purposes overlap, we recommend using plot.

@timhoffm
Member Author

Still, we should aim for a consistent API where possible.

I don't see an argument for different handling of multi-dimensional data

  • of parameter c vs parameters x, y in scatter()
  • of parameters x, y in scatter() vs plot()

@efiring
Member

efiring commented Apr 17, 2019

In scatter, c can be an array of color specs, including rgb or rgba, which means, for example, that if x and y are 1D, c can be 2D, with its first dimension matching the lengths of x and y. Or it could be a single color spec, which could be a string, a scalar, or an rgb or rgba sequence. Therefore, we must handle c a bit differently than 'x' and 'y'.
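
To make the cases above concrete, here is a rough sketch that classifies a numeric `c` purely by its shape for `n` points. It is an illustration under stated assumptions, not matplotlib's parsing logic: `describe_c` is a hypothetical helper, and string and scalar color specs are left out.

```python
def describe_c(c_shape, n):
    """Classify a numeric c by shape for n points (illustrative only)."""
    if c_shape == (n,):
        return "one value per point"
    if c_shape in [(n, 3), (n, 4)]:
        return "one rgb/rgba color per point"
    if c_shape in [(3,), (4,)]:
        return "a single rgb/rgba color"
    return "not matched by this sketch"

n = 12
per_point_values = describe_c((12,), n)
per_point_colors = describe_c((12, 4), n)
single_color = describe_c((3,), n)
unmatched = describe_c((6, 2), n)
```

Note that with `n = 3` the shape `(3,)` hits the first branch in this sketch, so a single rgb triple and three per-point values cannot be told apart by shape alone. That kind of ambiguity is one reason `c` cannot simply be raveled the way `x` and `y` are.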

x and y are handled differently in scatter() vs plot() because these two functions overlap, but are still fundamentally different. plot plots one or more lines (whether using an actual line for each, or a set of identical points for each, or both). scatter just plots a set--a single set--of points, but it allows the size and color of each point to be specified individually.
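
The two readings of the same 2D input can be sketched with NumPy alone. This only models how the data is interpreted (columns as separate lines for `plot`, one flat set of points for `scatter`). It does not call matplotlib, and the variable names are illustrative.

```python
import numpy as np

X = np.arange(12).reshape(3, 4)
Y = np.arange(12).reshape(3, 4)

# plot-like reading: each column of the 2D input is its own line.
lines = [(X[:, j], Y[:, j]) for j in range(X.shape[1])]

# scatter-like reading: everything is raveled into a single set of points.
points = np.column_stack([X.ravel(), Y.ravel()])
```

For the `(3, 4)` input used earlier in this thread, the first reading yields four lines of three points each, while the second yields one collection of twelve points.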

@timhoffm
Member Author

@efiring Thanks for the clarification. I forgot about the more "sophisticated" plot() variant for simultaneously plotting multiple data sets. The x, y handling makes sense, even though it's unfortunately not clear from simple cases why scatter() and plot() must behave differently sometimes.

I also see that consistent handling of c is more cumbersome.

Given that, I still support @anntzer's suggestion to deprecate any non-1D input to scatter(). There's little benefit in auto-flattening. Removing it would make for a more consistent API: no differences between x, y, c handling. And, while plot() and scatter() won't be the same, they become more similar, making it easier to guess the behavior of one knowing the other.

@efiring
Member

efiring commented Apr 17, 2019

I disagree. Relative to #13959, I don't think that removing the flattening makes anything substantially simpler or more consistent. It would break existing code for no gain that I can see, and would require adding shape checking in place of size checking. Really, what is the big gain here, to motivate a considerable break in backward compatibility? I just don't see it. (And removing the flattening does not remove differences between x, y, and c handling.)
#13959 is making things more consistent in scatter. What's wrong with it?

If we were starting from scratch, and if we took the point of view that we should keep the API as simple as possible and leave it to the user to make their data fit that API, then it might make sense to require that x and y be 1-D, since logically and in practice they are within the function. But we are not starting from scratch--we have a long history and an admittedly imperfect heritage--and our general practice in the high-level API has been to try to make things easy for the end user, to make things work if they logically can. In fact we have been adding such things, not removing them: witness the support for pandas-like data structures.

@timhoffm
Member Author

Reconsidered. Actually, it can be reasonable to support 2D arrays as scatter input. For some kinds of scatter plots x and y can be semantically the same type e.g. positions in a plane (as opposed to y = f(x)). And if these are aligned on a regular grid, a 2D shape may be the natural form of the input data.
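
As a small sketch of that use case, the grid below is built with `np.meshgrid`, so `X`, `Y`, and a per-node value `C` naturally share one 2D shape. The grid extents and the formula for `C` are made up for illustration. The explicit `ravel()` calls show the 1D form that implicit flattening would produce.

```python
import numpy as np

# Hypothetical regular grid of positions in a plane.
xs = np.linspace(0.0, 1.0, 4)
ys = np.linspace(0.0, 1.0, 3)
X, Y = np.meshgrid(xs, ys)   # both have shape (3, 4)
C = X**2 + Y**2              # one value per grid node, same shape

# Explicit flattening keeps x, y, and c aligned element by element.
x_flat, y_flat, c_flat = X.ravel(), Y.ravel(), C.ravel()

# These 1D arrays could then be passed on, e.g. scatter(x_flat, y_flat, c=c_flat).
```

Because all three arrays are raveled in the same order, each color value stays paired with its grid position.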
