Skip to content

Document and test what "array like" means to Matplotlib #22879

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
tacaswell opened this issue Apr 22, 2022 · 10 comments
Open

Document and test what "array like" means to Matplotlib #22879

tacaswell opened this issue Apr 22, 2022 · 10 comments

Comments

@tacaswell
Copy link
Member

We have always said we take "array like" input, however when we started saying that there was numarray/numeric, then numpy, and always Python lists. Later we started to have some special cases to support pandas.Series (both on straight conversion and to pick of the index). However, the the current rise of sturcutred and array-like libraries (xarray, akwaardarray, pandas recent changes, the fleet of GPU libraries, dask, the pandas-likes, ...) things have become a bit fuzzier and different parts of the library are not completely consistent with what they support.

There has been work recently (most recently #22560) to consolidate all of the attempted coercion into one place in the code and articulate exactly what we mean by "array like".

We have standardized on the chain of:

  • isinstance(obj, np.array)
  • obj.to_numpy
  • obj.values
  • np.asanyarray(obj)

We should: document this chain as what it means to "be array like" including including the expected shapes / squashing of these things (e.g. when do we expect (n,) vs (1, n) vs (n, 1) shaped arrays).

If we do this we may want to make _unpack_to_numpy and maybe our 1D and 2D coercion function public so that the datastructure libraries can test if they comply. We should, however, make clear in the documentation that the API garuntees on these functions will be a bit weaker in that we will reserve the right to make the more permissive in the future (that is input that we currently bounce may be handled in the future with no warning), but anything that we correctly get to numpy will continue to work (or go through the normal deprecation cycle).

If we do this we should also provide a set of numpy arrays and/or lists and the expected return through the conversion as a fixture.

@tacaswell tacaswell added this to the v3.6.0 milestone Apr 22, 2022
@oscargus
Copy link
Member

I would expect that chain to unify/modify/improve over time, see #22126 and #22645. Right now, I believe that there is a bit of special handling in different methods/functions so we should probably make sure that "all" methods/functions support data in the same format.

index_of behaves differently at the moment, or it should probably use _unpack_to_numpy to rely on any further improvements there.

mocquin added a commit to mocquin/physipy that referenced this issue May 7, 2022
see matplotlib/matplotlib#22973, matplotlib/matplotlib#22879, and matplotlib/matplotlib#22560
It is not clear to me as to which is the standard interface for unit handling (for eg, hist still doesn't handle unit by default)
@QuLogic QuLogic modified the milestones: v3.6.0, v3.6.1 Sep 14, 2022
@QuLogic QuLogic modified the milestones: v3.6.1, v3.6.2 Oct 6, 2022
@QuLogic QuLogic modified the milestones: v3.6.2, v3.6.3 Oct 27, 2022
@QuLogic QuLogic modified the milestones: v3.6.3, v3.7.0 Jan 11, 2023
@oscargus oscargus modified the milestones: v3.7.0, v3.8.0 Jan 12, 2023
@ksunden ksunden removed this from the v3.8.0 milestone Sep 15, 2023
@randolf-scholz
Copy link
Contributor

randolf-scholz commented Jan 11, 2024

Here's a suggestion: numpy provides two interfaces:

  1. numpy.typing.ArrayLike which covers lists and the like, represents valid input types for np.array(...)
  2. Types that support the __array__ method: https://numpy.org/doc/stable/reference/c-api/array.html
@runtime_checkable
class SupportsArray(Protocol):
    """Protocol for objects that support `__array__`."""

    @abstractmethod
    def __array__(self) -> NDArray[np.object_]:
        """Return the array of the tensor."""
        ...

Many libraries (numpy/pandas/pyarrow/polars/pytorch/...) support the __array__ method, so typically code could look like this

def func(x: SupportsArray | ArrayLike):
    if isinstance(x, SupportsArray):
        arr = x.__array__()
    else:
        try:
            arr = numpy.array(x)
        except TypeError as exc:
            raise TypeError(f"Cannot convert {x} to array.") from exc

So array_like could be interpreted as SupportsArray | numpy.typing.ArrayLike

@jonas-eschle
Copy link
Contributor

jonas-eschle commented Apr 13, 2024

(I just wanted to open an issue on the conversion and propose a PR, as I came accross this issue)

I would second the protocol method: just ran into the issue that TensorFlow and JAX arrays are not converted (but they have both the __array__ method, so the protocol would solve this), instead, they are treated as an iterable.
Plotting a hist with a few 100K+ points in it becomes minutes...

The reason for the protocol (but not excluding more): it's exactly what is wanted. Numpy defines that this is the way to convert it, so it's also an easy way for anyone to implement this. The strongest point to add this "feature" sooner than later: it's fully backwards compatible

The possible issue with the "brute-force" conversion: could work, but maybe creates arrays of objects (if lists or other objects in other lists etc). If it's well documented behavior, sure! But it could break current use-cases.

@timhoffm
Copy link
Member

Note that Jax and torch arrays will be converted in the upcoming 3.9 release. See #25887

@oscargus
Copy link
Member

And as discussed in e.g. #25887, the reason that __array__ is not "blindly" used is that units may be dropped. (This is something that should be solved though, but it is quite intricately woven into the current code base.)

@randolf-scholz
Copy link
Contributor

randolf-scholz commented Apr 14, 2024

@oscargus What units are we talking about here? I am not aware of widely used python packages that add units (like e.g. Julia's Unitful.jl does), I guess there is pint?. Are we just talking about pandas datetime/timedelta resolutions?

@jonas-eschle
Copy link
Contributor

Note that Jax and torch arrays will be converted in the upcoming 3.9 release. See #25887

Hm, that seems like a suboptimal solution? as @timhoffm also commented there, should we therefore rather strive for a general case or should I add TensorFlow separately? (-> I'll open a separate issue then)

And as discussed in e.g. #25887, the reason that array is not "blindly" used is that units may be dropped. (This is something that should be solved though, but it is quite intricately woven into the current code base.)

This argument was never convincing to me: there is a general case that libraries adhere to, standards: the __numpy__() method. And then there are a few special cases that matplotlib knows about: why not check the special cases first, pick them out, then call __array__() on the rest if available (and possibly numpy and to_numpy)? Also in regards to documentation that would be pretty straight forward (name the special cases and say exactly what happens). It also solves any "import magic" problems etc.

@oscargus
Copy link
Member

@randolf-scholz time is definitely one use case, but there is support for defining your own unit handling. Not sure if pint is supported directly though (been primarily involved as accidentally breaking unit support every now and then...). See https://matplotlib.org/stable/gallery/units/index.html for some idea how it works (not that exciting examples though, and some have disabled units I notice).

There's been a recent push to figure out where unit actually is supported #26864, and, yes, primarily time is considered in the use cases, but there should be support in general.

There's also a specific issue topic: topic: units and array ducktypes

(There are people knowing the details to a much larger extent than I do.)

@timhoffm
Copy link
Member

or should I add TensorFlow separately? (-> I'll open a separate issue then)

As a quick solution, please add a special case for tensorflow as well similar to JAX/torch if needed.

why not check the special cases first, pick them out, then call __array__() on the rest if available (and possibly numpy and to_numpy)?

If starting from scratch, this would indeed be a no-brainer. However Matplotlib's unit support predates all those standards and is deeply baked into the library. I don't have a full oversight (and not sure if anybody has), but changing the logic and implementation is likely a medium to major-scale effort, given that we need to maintain compatibility with existing code.

@jklymak
Copy link
Member

jklymak commented Apr 15, 2024

I personally think we should work with downstream to make this work properly. The only tests that fail are the jpl_unit tests, and it should be relatively trivial to refactor their mock object to do the right thing. The real jpl could update their array class, and astrophysics and pint could as well. It's a little infuriating that we are ignoring community standards on our basic functionality to support our inconsistent and half-baked units implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants