
DOC: caution against relying upon NumPy's implementation in subclasses #13633


Merged: merged 1 commit on May 30, 2019

Conversation

@shoyer (Member) commented May 26, 2019

xref #12028

I think this is an important warning to include for subclass authors.
Otherwise, we will be expanding our exposure of internal APIs as part of
`__array_function__`. All things being equal, it's great when things "just
work" for subclasses, but I don't want to guarantee it. In particular, I would
be very displeased if `__array_function__` leads to NumPy adding more
subclass-specific hacks like always calling `mean()` inside `median()` (GH-3846).
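
To make the concern concrete, here is a minimal sketch (the class name and
output are illustrative): a subclass might override only `mean()` and appear to
get a working `median()` for free, purely because NumPy's current `median()`
happens to call `mean()` internally, which is exactly the kind of
implementation detail we shouldn't promise:

    import numpy as np

    class LoudArray(np.ndarray):
        def mean(self, *args, **kwargs):
            print("subclass mean() called")
            return super().mean(*args, **kwargs)

    a = np.arange(5.0).view(LoudArray)
    np.median(a)  # whether this reaches the subclass mean() is an
                  # implementation detail, not a guarantee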

@mhvk: please take a look.

shoyer added a commit to shoyer/numpy that referenced this pull request May 26, 2019
This is a partial reprise of the optimizations from GH-13585.

The trade-offs here are about readability, performance and whether these
functions automatically work on ndarray subclasses.

You'll have to judge the readability costs for yourself, but I think this is
pretty reasonable.

Here are the performance numbers for three relevant functions with the
following IPython script:

    import numpy as np
    x = np.array([1])
    xs = [x, x, x]
    for func in [np.stack, np.vstack, np.block]:
        %timeit func(xs)

| Function | Master | This PR |
| --- | --- | --- |
| `stack` | 6.36 µs ± 175 ns | 6 µs ± 174 ns |
| `vstack` | 7.18 µs ± 186 ns | 5.43 µs ± 125 ns |
| `block` | 15.1 µs ± 141 ns | 11.3 µs ± 104 ns |

The performance benefit for `stack` is somewhat marginal (perhaps it should be
dropped), but it's much more meaningful inside `vstack`/`hstack` and `block`,
because these functions call other dispatched functions within a loop.

For automatically working on ndarray subclasses, the main concern would be that
by skipping dispatch with `concatenate`, subclasses that define `concatenate`
won't automatically get implementations for `*stack` functions. (But I don't
think we should consider ourselves obligated to guarantee these implementation
details, as I write in GH-13633.)

`block` also will not get an automatic implementation, but given that `block`
uses two different code paths depending on argument values, this is probably a
good thing, because there's no way the code path not involving `concatenate`
could automatically work (it uses `empty()`).
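
As a minimal sketch of the pattern described above (illustrative names, not
NumPy's actual internals): dispatch runs once at the public entry point, and
inner calls go straight to an undispatched implementation, assuming the
dispatched function exposes its implementation via `__wrapped__`:

    import numpy as np

    # Stand-in for an internal, undispatched implementation; dispatched
    # NumPy functions expose the underlying implementation as __wrapped__.
    _concatenate_impl = np.concatenate.__wrapped__

    def my_vstack(tup):
        arrs = [np.atleast_2d(a) for a in tup]
        # Calling the undispatched implementation here skips a second
        # round of __array_function__ dispatch, but it also means that
        # subclasses overriding concatenate are not consulted.
        return _concatenate_impl(arrs, 0)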
@hameerabbasi (Contributor) commented May 27, 2019

If we actually follow Python protocol rules:

>>> class A:
...   def __add__(self, other):
...     return (self, other)
... 
>>> class B(A):
...   def __add__(self, other):
...     return NotImplemented
... 
>>> B() + B()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'B' and 'B'

We should only allow this when there IS an ndarray present. One can always do the following:

    def __array_function__(self, func, types, args, kwargs):
        if func not in IMPLEMENTATIONS:
            # Defer to the superclass for anything not explicitly implemented.
            return super().__array_function__(func, types, args, kwargs)
        return IMPLEMENTATIONS[func](*args, **kwargs)
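
Here `IMPLEMENTATIONS` is a registry mapping NumPy functions to
subclass-specific overrides, populated with a small decorator in the spirit of
the NEP 18 examples (a sketch; names are illustrative):

    import numpy as np

    IMPLEMENTATIONS = {}

    def implements(numpy_func):
        """Register a subclass implementation of a NumPy function."""
        def decorator(func):
            IMPLEMENTATIONS[numpy_func] = func
            return func
        return decorator

    @implements(np.stack)
    def _stack(arrays, axis=0):
        # Subclass-specific behaviour goes here; coercing to plain
        # ndarrays is just a trivial placeholder.
        return np.stack([np.asarray(a) for a in arrays], axis=axis)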

@shoyer (Member, Author) commented May 27, 2019

@hameerabbasi I'm not sure I understand your example. I think `__array_function__` is already consistent with this behavior, e.g.,

    import numpy as np

    class SubArray(np.ndarray):
        def __array_function__(self, func, types, args, kwargs):
            return NotImplemented

    array = np.array([1]).view(SubArray)
    np.stack([array, array])  # TypeError: no implementation found for 'numpy.stack'

You only get the base-class behavior if at least one argument inherits from ndarray and hasn't overridden `__array_function__`.
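
For contrast, a subclass that leaves `__array_function__` alone does fall back
to NumPy's implementation (`PlainSub` is just an illustrative name):

    import numpy as np

    class PlainSub(np.ndarray):
        pass  # inherits ndarray.__array_function__ unchanged

    a = np.array([1]).view(PlainSub)
    np.stack([a, a])  # no TypeError: NumPy's own implementation runs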

@hameerabbasi (Contributor) commented:

In this case, I think the warning should be extended to everything that uses `np.ndarray.__array_function__`.

@shoyer (Member, Author) commented May 27, 2019

> In this case, I think the warning should be extended to everything that uses `np.ndarray.__array_function__`.

How else would you use `np.ndarray.__array_function__`, if not from a subclass?

OK, I suppose you could indeed call it directly, but calling special methods directly is generally not recommended in Python. The methods exist for the sake of a protocol, not arbitrary usage.

In any case, if you call `ndarray.__array_function__` with any objects that define `__array_function__` other than subclasses, you will just get `NotImplemented` back. If you intentionally misuse the protocol and put the wrong arguments in `types`, then yes, you could cause NumPy to expose implementation details for non-subclasses. But hopefully it should be obvious that we make no guarantees about such behavior.
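
For example (a sketch of the direct call being discouraged here, relying on
the NEP 18 behavior of `ndarray.__array_function__`):

    import numpy as np

    class Foreign:
        def __array_function__(self, func, types, args, kwargs):
            return "custom"

    arr = np.array([1])
    # Passing a non-subclass in `types` makes ndarray.__array_function__
    # decline rather than run NumPy's implementation.
    result = np.ndarray.__array_function__(
        arr, np.stack, (Foreign,), ([arr, arr],), {})
    print(result)  # NotImplemented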

@mhvk (Contributor) commented May 27, 2019

@shoyer - This warning is fine and indeed appropriate. The big difference from gh-3846 is that now, when an implementation changes, we can tell anybody relying on it to simply override the function; with gh-3846 there was no way to get the old behaviour back.
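
For instance, a subclass can pin down the behaviour it needs with something
like this sketch (`my_median` is a hypothetical stand-in for whatever old
behaviour is wanted):

    import numpy as np

    def my_median(a, **kwargs):
        # Hypothetical stand-in for the behaviour the subclass wants to keep.
        return np.sort(np.asarray(a).ravel())[a.size // 2]

    class MyArray(np.ndarray):
        def __array_function__(self, func, types, args, kwargs):
            if func is np.median:
                # Override np.median explicitly instead of depending on
                # details of NumPy's implementation.
                return my_median(*args, **kwargs)
            return super().__array_function__(func, types, args, kwargs)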

@mhvk (Contributor) commented May 27, 2019

p.s. Hopefully, the bugs smoked out by Quantity and MaskedColumn provide a nice counterweight to the occasional pain of having to fix regressions...

@charris (Member) commented May 30, 2019

Thanks @shoyer.

shoyer added a commit that referenced this pull request Jun 12, 2019
* MAINT: avoid nested dispatch in numpy.core.shape_base


* MAINT: only remove internal use in np.block

* MAINT: fixup comment on np.block optimization