MAINT: avoid nested dispatch in numpy.core.shape_base #13634


Merged (3 commits) on Jun 12, 2019

Conversation

@shoyer (Member) commented May 26, 2019

This is a partial reprise of the optimizations from GH-13585.

The trade-offs here are about readability, performance and whether these
functions automatically work on ndarray subclasses.

You'll have to judge the readability costs for yourself, but I think this is
pretty reasonable.

Here are the performance numbers for three relevant functions with the
following IPython script:

    import numpy as np
    x = np.array([1])
    xs = [x, x, x]
    for func in [np.stack, np.vstack, np.block]:
        %timeit func(xs)

| Function | Master | This PR |
| --- | --- | --- |
| `stack` | 6.36 µs ± 175 ns | 6 µs ± 174 ns |
| `vstack` | 7.18 µs ± 186 ns | 5.43 µs ± 125 ns |
| `block` | 15.1 µs ± 141 ns | 11.3 µs ± 104 ns |

The performance benefit for `stack` is somewhat marginal (perhaps it should be
dropped), but it's much more meaningful inside `vstack`/`hstack` and `block`,
because these functions call other dispatched functions within a loop.
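
For concreteness, here is roughly what the master implementation of `vstack` does for this example, spelled out; every one of these calls goes through the `__array_function__` dispatch machinery, which is the repeated overhead this PR avoids:

    import numpy as np

    x = np.array([1])
    tup = [x, x, x]

    # One dispatched call per input array, plus the final dispatched
    # concatenate: len(tup) + 1 trips through the dispatch machinery
    # for a single vstack call.
    arrs = [np.atleast_2d(m) for m in tup]
    out = np.concatenate(arrs, 0)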

As for automatically working on ndarray subclasses, the main concern is that by
skipping dispatch for `concatenate`, subclasses that define `concatenate` (via
`__array_function__`) won't automatically get implementations of the `*stack`
functions. (But I don't think we should consider ourselves obligated to
guarantee these implementation details, as I write in GH-13633.)
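
As a concrete illustration, here is a hypothetical subclass (not code from this PR; it assumes `__array_function__` is enabled and relies on the `_implementation` attribute that NumPy's dispatch decorator attaches to public functions):

    import numpy as np

    class Logged(np.ndarray):
        # Overrides np.concatenate and falls back to NumPy's default
        # implementation for every other dispatched function.
        def __array_function__(self, func, types, args, kwargs):
            if func is np.concatenate:
                print("concatenate override called")
            return func._implementation(*args, **kwargs)

    x = np.array([1, 2]).view(Logged)

    # If vstack's internals go through the dispatched np.concatenate, the
    # message above prints; if they call the undispatched implementation
    # instead, the override is never consulted for that inner call.
    np.vstack([x, x])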

`block` also will not get an automatic implementation, but given that `block`
uses two different code paths depending on argument values, this is probably a
good thing, because there's no way the code path not involving `concatenate`
could automatically work (it uses `empty()`).
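
A minimal sketch of why the `empty()`-based path cannot automatically produce a subclass (illustrative only, not the actual `block` implementation):

    import numpy as np

    class MyArray(np.ndarray):
        pass

    x = np.arange(4).view(MyArray)

    # A path that allocates the result with empty() and assigns into slices
    # starts from a plain ndarray; nothing about it is subclass-aware.
    out = np.empty(8, dtype=x.dtype)
    out[:4] = x
    out[4:] = x
    print(type(out))   # <class 'numpy.ndarray'>, not MyArray
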
@mhvk (Contributor) left a comment:

I'm not sure about special-casing some functions like this: for `stack`, the benefit is tiny, and for `hstack` and `vstack` it will likely be similarly small once the inefficient repeated call noted inline has been changed.

For `block`, the case is stronger, but given that we have #13186 pending with a much faster C implementation, it is not clear it is worth it. Among the overrides, the ones I mind least are the `ndim` and `size` overrides: those are functions where it is not obvious to me that they should even be captured under `__array_function__`; `getattr(a, 'ndim', 0)` would seem totally OK.
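
For reference, the duck-typed spelling mentioned above, which involves no dispatch machinery at all:

    import numpy as np

    a = np.ones((2, 3))

    print(getattr(a, 'ndim', 0))    # 2 -- read straight off the array
    print(getattr(3.0, 'ndim', 0))  # 0 -- plain Python scalars fall back to 0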

    @@ -276,7 +290,7 @@ def vstack(tup):
         if not overrides.ARRAY_FUNCTION_ENABLED:
             # raise warning if necessary
             _arrays_for_stack_dispatcher(tup, stacklevel=2)
    -    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
    +    return _concatenate([_atleast_2d(_m) for _m in tup], 0)

Contributor:

This code is strange: why loop over the arrays when `atleast_2d` does that as well? This means one can reduce the overhead in your example by a factor of 3 with zero effort or impact!

This is of course separate from the very simple change in `atleast_2d` to use `np.array(ary, copy=False, subok=True, ndmin=2)`...
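
A sketch of that zero-effort version (essentially the change later proposed in #13697): `atleast_2d` already maps over multiple arguments and returns a list, so a single dispatched call can replace the per-element list comprehension:

    import numpy as np

    x = np.array([1])
    tup = (x, x, x)

    arrs = np.atleast_2d(*tup)   # one dispatched call instead of len(tup) calls
    out = np.concatenate(arrs, 0)
    # (With a single input, atleast_2d returns an array rather than a list,
    # so real code needs to handle that case separately.)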

Contributor:

P.S. Though in my more ideal world, `atleast_2d` would call `np.array(...)` only when `.shape` and `.reshape` were not present, and otherwise just rely on those. But that's a separate issue!
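
A rough sketch of that more duck-typed `atleast_2d` (hypothetical, not NumPy's implementation):

    import numpy as np

    def atleast_2d_ducktyped(ary):
        shape = getattr(ary, 'shape', None)
        if shape is None or not hasattr(ary, 'reshape'):
            # No array-like interface: coerce, preserving subclasses.
            return np.array(ary, copy=False, subok=True, ndmin=2)
        if len(shape) >= 2:
            return ary
        # Prepend length-1 axes via the object's own reshape.
        return ary.reshape((1,) * (2 - len(shape)) + tuple(shape))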

Member Author:

If I recall correctly, the `atleast_*d` functions have different semantics than `np.array(..., ndmin=*)`.
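
For example, for 1-D input the two differ in where the new axes go (`ndmin` pads on the left, `atleast_3d` pads on both sides):

    import numpy as np

    x = np.array([1, 2, 3])
    print(np.array(x, ndmin=3).shape)   # (1, 1, 3)
    print(np.atleast_3d(x).shape)       # (1, 3, 1)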

    @@ -331,12 +345,12 @@ def hstack(tup):
             # raise warning if necessary
             _arrays_for_stack_dispatcher(tup, stacklevel=2)

    -    arrs = [atleast_1d(_m) for _m in tup]
    +    arrs = [_atleast_1d(_m) for _m in tup]

Contributor:

As above

@eric-wieser (Member) commented Jun 1, 2019

I'm still -0.5 on this whole thing. 20% performance gains for very small arrays don't seem like a strong argument to me for making it harder for subclasses to hook in. The readability cost is OK here, but it still leaves the reader asking "why are we skipping this dispatching?"

> But I don't think we should consider ourselves obligated to guarantee these implementation details

I don't see any reason not to guarantee that vstack and hstack will be implemented in terms of `concatenate` and `atleast_2d`, since that's exactly their semantics.

There are probably cases like the `mean` / `median` thing where our internals could change. Imagine a `foo(arr)` function that currently delegates to `bar(arr)`, and we change it to internally call `baz(arr)` instead. One of two things happens to downstream subclasses:

- Their code continues to work, as they already implement `baz(arr)` with sensible semantics.
- Their code does not work. They release an update to their library and say "if you want to use numpy 1.17 or newer, make sure to update to our new version". This would happen if:
  - They do not implement `baz(arr)` at all. By making this change, we just caused a downstream library to improve and now support `baz(arr)` too.
  - They implement `baz(arr)` with incompatible semantics. They should think carefully about whether this was a good idea. Either they fix `baz(arr)` (good), or they decide they need to overload `foo(arr)` specially after all.

I suppose this is all predicated on having a `__numpy_implementation__` attribute available to use inside `__array_function__`.
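
Sketch of the pattern that attribute would enable inside `__array_function__` (entirely hypothetical: `__numpy_implementation__` is the name floated here, not an attribute NumPy ships):

    import numpy as np

    HANDLED = {}  # hypothetical registry of functions this subclass reimplements

    class MySubclass(np.ndarray):
        def __array_function__(self, func, types, args, kwargs):
            if func in HANDLED:
                return HANDLED[func](*args, **kwargs)
            # Fall back to NumPy's own implementation for everything else.
            impl = getattr(func, '__numpy_implementation__', None)
            if impl is None:
                return NotImplemented
            return impl(*args, **kwargs)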

@shoyer force-pushed the shape_base_optimize branch from fca9898 to c0aae6b on June 1, 2019 at 22:30
@shoyer (Member Author) commented Jun 1, 2019

OK, the latest version only removes the internal dispatch inside `np.block`. I think there is a pretty clear case for merging this to avoid a performance regression, given that #13186 may not make it into 1.17. Also, it's just a bad idea to partially support subclasses in a shape-dependent fashion: we should either support subclasses in a function always or never. Value-dependent semantics are unlikely to be tested properly, which vastly increases the likelihood of bugs.

    @@ -416,6 +416,13 @@ def stack(arrays, axis=0, out=None):
         return _nx.concatenate(expanded_arrays, axis=axis, out=out)


    +# Internal functions to elimainate the overhead of repeated dispatch inside

Contributor:

Spelling of "eliminate". Also, maybe write "repeated dispatch in one of the two possible paths inside `np.block`".

@mhvk (Contributor) commented Jun 2, 2019

@shoyer - for np.block, I agree the case is better given that the path taken depends on the shape. Also, forcing a specific override means people will be ready for the C implementation.

P.S. I thought I should not let my suggested optimization of `hstack` and `vstack` slip by: see #13697. It's about as good as what you had by bypassing dispatch.
