ENH: use __skip_array_function__ internally in NumPy #13585


Closed
wants to merge 1 commit

Conversation

@shoyer (Member) commented May 18, 2019

This should speed up some functions considerably.

Note: I'm also intentionally using this even (especially!) in functions that do not directly call np.asarray() on their inputs. This is so that overrides of __array_function__ cannot affect the internals of NumPy functions, which would be leaking our implementation as our API.
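
Concretely, each internal call site changes by one attribute lookup. A minimal sketch (the broadcast_to example is taken from the review diff below, and assumes a development build where dispatched functions carry the __skip_array_function__ attribute):

    import numpy as np

    shape = (3, 4)

    # Before: this call dispatches on __array_function__, so a duck
    # array's override could run inside NumPy's own helper code.
    b = np.broadcast_to(0, shape)

    # After (this PR): call NumPy's implementation directly, bypassing
    # the __array_function__ protocol entirely.
    b = np.broadcast_to.__skip_array_function__(0, shape)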

@seberg (Member) left a comment

OK, went through all of them. I am wondering a bit whether helpers such as flatnonzero should actually be dispatched, since they are super-shallow wrappers around dispatched functions. But I guess the promise is to dispatch everything? And I had not followed this all that closely until now anyway.

(Mostly one or two possible quibbles, but since they are easy to miss, I am marking this as "request changes".)

@@ -194,7 +194,7 @@ def _broadcast_shape(*args):
     # ironically, np.broadcast does not properly handle np.broadcast
     # objects (it treats them as scalars)
     # use broadcasting to avoid allocating the full array
-    b = broadcast_to(0, b.shape)
+    b = broadcast_to.__skip_array_function__(0, b.shape)
Member

All good, except this one (maybe!). This gets called from inside the np.vectorize class, which is not hidden away behind __array_function__. So I think this may be wrong?

Member Author

vectorize always coerces with asarray() or asanyarray(), so I think this is fine, as a (minor) optimization.

Member

Are these optimizations really worth making?

@@ -241,7 +241,7 @@ def transpose(a):
     -------
     aT : (...,N,M) ndarray
     """
-    return swapaxes(a, -1, -2)
+    return swapaxes.__skip_array_function__(a, -1, -2)
Member

The transpose here is not itself decorated with the __array_function__ dispatcher, because it is linalg-only. So I am really unsure that this is correct. (Very likely my ignorance, though.)

Member Author

I think this is right, because this is always used inside other linalg functions, which are all wrapped with __array_function__ and also cast their inputs into NumPy arrays (see the _make_array utility function). And even if we had some functions that didn't do this, we don't want to expose the fact that they happen to use swapaxes internally. That would be leaking an implementation detail.

Member

OK, this whole thing is new anyway. The function is publicly exposed, though, so if someone actually uses it, it will fail on some array-likes (where it would not have before).

Member Author

Is transpose really publicly exposed? I guess as np.linalg.linalg.transpose, but it isn't documented anywhere.

Member Author

But to be clear, if somebody had used it on array-likes before, this function would indeed work the same. There hasn't been a release of NumPy that enables __array_function__ yet.

Member

linalg.linalg isn't public

Member

Oh, sorry, I had missed the second linalg there somehow...

Member

we don't want to expose the fact that they happen to use swapaxes internally. That would be leaking an implementation detail.

Calling it an implementation detail seems like a stretch to me. The 2d transpose is essentially defined as swapping the last two axes. If linalg.transpose were public, then I'd argue that it should remain unchanged.

As written here, it seems like a questionable but harmless micro-optimization.

@rgommers (Member)

This all looks correct to me, thanks Stephan.

Taking a step back, this PR demonstrates an issue with the way the protocol works. This makes all NumPy code that calls another function wrapped with the __array_function__ dispatcher more verbose, and harder to understand, modify, and maintain. Not much we can do about it at this point, perhaps, but at least worth thinking about a little.

It's probably also easy to forget to add __skip_array_function__ in all code changes and new functions in the future. Perhaps some kind of test that scans all .py files for direct calls to wrapped functions to guard against regressions is in order here.
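
A rough sketch of what such a scan could look like (hypothetical: the WRAPPED set would be collected from the functions actually decorated with array_function_dispatch, and a real check would also need to recognize calls already guarded by asarray()):

    import ast
    import pathlib

    # Hypothetical subset; in practice, collect this from the dispatcher.
    WRAPPED = {"concatenate", "broadcast_to", "swapaxes", "empty_like"}

    def direct_calls(root="numpy"):
        """Yield direct calls to dispatched functions that do not go
        through __skip_array_function__ (a crude approximation)."""
        for path in pathlib.Path(root).rglob("*.py"):
            tree = ast.parse(path.read_text(), filename=str(path))
            for node in ast.walk(tree):
                if (isinstance(node, ast.Call)
                        and isinstance(node.func, ast.Name)
                        and node.func.id in WRAPPED):
                    yield "%s:%d: %s" % (path, node.lineno, node.func.id)

    for hit in direct_calls():
        print(hit)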

@charris (Member) commented May 20, 2019

Taking a step back, this PR demonstrates an issue with the way the protocol works.

It does look like the beginnings of a slippery slope. The danger of the protocol is that it might encourage excessive tinkering down the line. What seems to be the immediate problem is keeping NumPy at the top of the user heap rather than in the scrum with everyone else.

@shoyer (Member Author) commented May 20, 2019

Taking a step back, this PR demonstrates an issue with the way the protocol works. This makes all NumPy code that calls another function wrapped with the __array_function__ dispatcher more verbose, and harder to understand, modify, and maintain. Not much we can do about it at this point, perhaps, but at least worth thinking about a little.

I agree, I don't love it. But on the plus side, it's still way more readable than defining functions in C :)

It's probably also easy to forget to add __skip_array_function__ in all code changes and new functions in the future. Perhaps some kind of test that scans all .py files for direct calls to wrapped functions to guard against regressions is in order here.

To be clear: this change is already probably more pervasive than strictly necessary. As long as we cast inputs with asarray() or asanyarray(), none of this should be necessary -- it's merely a small performance optimization. The problematic cases are functions that, for whatever reason, do not currently call asarray() explicitly on their inputs. These are pretty rare in the broader scheme of things, and we already have to think carefully when we write these sorts of functions.

Perhaps I should try to roll back the extent of the changes here? I'm guessing that many of these don't actually make any difference, beyond slightly obfuscating code.

As for automatic testing -- yes, this would be a good idea, though static analysis to find "unsafe" uses of unwrapped functions seems non-trivial. Perhaps a tool like mypy could be leveraged? The tricky part is recognizing cases where a function like array/asarray/asanyarray has already cast the inputs.

@mattip (Member) commented May 20, 2019

I'm guessing that many of these don't actually make any difference

It should be easy to benchmark the effect of these changes

@shoyer (Member Author) commented May 20, 2019

Unfortunately ASV gives pretty noisy results, at least when I run it on my laptop. Here are all the benchmarks that change by at least 10%:

All benchmarks:

       before           after         ratio
     [7495de47]       [dd2057ac]
     <old-master>       <use-skip-array-function>
              n/a              n/a      n/a  bench_linalg.Linalg.time_op('det', 'complex256')
              n/a              n/a      n/a  bench_linalg.Linalg.time_op('det', 'float16')
              n/a              n/a      n/a  bench_linalg.Linalg.time_op('det', 'longfloat')
              n/a              n/a      n/a  bench_linalg.Linalg.time_op('pinv', 'complex256')
              n/a              n/a      n/a  bench_linalg.Linalg.time_op('pinv', 'float16')
              n/a              n/a      n/a  bench_linalg.Linalg.time_op('pinv', 'longfloat')
              n/a              n/a      n/a  bench_linalg.Linalg.time_op('svd', 'complex256')
              n/a              n/a      n/a  bench_linalg.Linalg.time_op('svd', 'float16')
              n/a              n/a      n/a  bench_linalg.Linalg.time_op('svd', 'longfloat')
      2.38±0.04ms       3.18±0.4ms    ~1.34  bench_lib.Pad.time_pad((4, 4, 4, 4), (0, 32), 'mean')
      9.40±0.09ms         12.3±2ms    ~1.31  bench_core.Core.time_identity_3000
          345±1μs        451±100μs    ~1.31  bench_ufunc.UFunc.time_ufunc_types('floor')
      2.67±0.02ms       3.32±0.8ms    ~1.24  bench_ufunc.UFunc.time_ufunc_types('log')
+         297±1μs         365±10μs     1.23  bench_ufunc.UFunc.time_ufunc_types('frexp')
+        570±10μs          698±4μs     1.23  bench_lib.Pad.time_pad((4, 4, 4, 4), 8, 'reflect')
       9.51±0.5ms         11.5±2ms    ~1.21  bench_core.Core.time_eye_3000
          229±1μs         276±40μs    ~1.20  bench_ufunc.UFunc.time_ufunc_types('less_equal')
+     3.34±0.01μs       3.98±0.2μs     1.19  bench_shape_base.Block.time_3d(1, 'copy')
          386±4μs         454±80μs    ~1.18  bench_shape_base.Block2D.time_block2d((512, 512), 'uint64', (4, 4))
          152±3μs         177±30μs    ~1.16  bench_io.Copy.time_memcpy_large_out_of_place('int16')
          337±1μs         389±50μs    ~1.15  bench_ufunc.UFunc.time_ufunc_types('fmin')
          642±5μs        738±100μs    ~1.15  bench_lib.Nan.time_nancumprod(200000, 0.1)
      10.3±0.08μs         11.8±1μs    ~1.15  bench_io.Copy.time_strided_copy('float32')
         323±10μs         367±50μs    ~1.14  bench_function_base.Select.time_select_larger
          591±6μs        671±100μs    ~1.14  bench_lib.Nan.time_nanstd(200000, 0.1)
        142±0.7μs         161±10μs    ~1.13  bench_ufunc.UFunc.time_ufunc_types('isfinite')
+      45.5±0.1μs       51.6±0.9μs     1.13  bench_ufunc.CustomInplace.time_double_add
+     2.19±0.02μs      2.47±0.09μs     1.13  bench_ma.Indexing.time_scalar(False, 2, 10)
      2.21±0.01μs      2.49±0.07μs    ~1.13  bench_ma.Indexing.time_scalar(False, 1, 10)
          450±3μs         508±50μs    ~1.13  bench_indexing.Indexing.time_op('indexes_', 'np.ix_(I, I)', '')
      1.77±0.02ms      2.00±0.07ms    ~1.13  bench_lib.Pad.time_pad((256, 128, 1), 8, 'edge')
+     13.8±0.06μs       15.5±0.4μs     1.12  bench_ufunc.CustomScalar.time_divide_scalar2(<class 'numpy.float64'>)
          339±2μs         380±50μs    ~1.12  bench_ufunc.UFunc.time_ufunc_types('fmax')
      1.63±0.02ms       1.83±0.1ms    ~1.12  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'int'>)
          651±7μs         729±80μs    ~1.12  bench_lib.Nan.time_nancumsum(200000, 0)
      1.05±0.02ms       1.17±0.2ms    ~1.12  bench_ufunc.UFunc.time_ufunc_types('floor_divide')
      6.58±0.06μs       7.32±0.2μs    ~1.11  bench_ufunc.CustomScalar.time_divide_scalar2(<class 'numpy.float32'>)
+     29.2±0.09μs         32.5±3μs     1.11  bench_function_base.Sort.time_argsort('heap', 'int16', ('uniform',))
          206±2μs         229±20μs    ~1.11  bench_core.CorrConv.time_convolve(50, 10000, 'same')
          202±2μs         225±10μs    ~1.11  bench_core.CorrConv.time_convolve(50, 10000, 'valid')
       45.7±0.2ms         50.7±2ms    ~1.11  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('str', 10000)
       7.70±0.1μs         8.55±1μs    ~1.11  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 100, <class 'int'>)
          227±1μs         252±30μs    ~1.11  bench_core.CorrConv.time_correlate(50, 10000, 'same')
      6.90±0.06μs         7.64±1μs    ~1.11  bench_lib.Nan.time_nanmax(200, 50.0)
+     1.85±0.02μs      2.04±0.06μs     1.11  bench_core.CorrConv.time_correlate(50, 10, 'same')
          236±2μs         261±10μs    ~1.11  bench_ufunc.UFunc.time_ufunc_types('greater_equal')
          379±1μs         419±10μs    ~1.11  bench_ufunc.UFunc.time_ufunc_types('reciprocal')
      13.2±0.06μs         14.6±1μs    ~1.11  bench_ufunc.CustomScalar.time_divide_scalar2_inplace(<class 'numpy.float64'>)
      1.10±0.02ms       1.22±0.1ms    ~1.10  bench_lib.Nan.time_nanpercentile(200000, 90.0)
      5.60±0.06μs       6.18±0.5μs    ~1.10  bench_ufunc.CustomScalar.time_less_than_scalar2(<class 'numpy.float64'>)
+     2.08±0.02μs      2.29±0.05μs     1.10  bench_reduce.AnyAll.time_any_fast
          717±3μs         789±70μs    ~1.10  bench_lib.Nan.time_nanstd(200000, 2.0)
       21.0±0.2μs         23.1±1μs    ~1.10  bench_core.Core.time_diagflat_l100
          641±3μs         704±80μs     1.10  bench_lib.Nan.time_nancumsum(200000, 0.1)
       32.4±0.1ms         35.5±3ms     1.10  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'object'>)
      8.94±0.09ms       9.82±0.7ms     1.10  bench_ufunc.Broadcast.time_broadcast
...........
-      15.9±0.4μs      14.4±0.09μs     0.90  bench_lib.Nan.time_nanargmax(200, 90.0)
       2.28±0.4ms       2.06±0.1ms    ~0.90  bench_lib.Pad.time_pad((1, 1, 1, 1, 1), 8, 'constant')
       28.9±0.3μs         26.1±1μs    ~0.90  bench_shape_base.Block2D.time_block2d((128, 128), 'uint32', (2, 2))
       17.6±0.3μs         15.9±1μs    ~0.90  bench_shape_base.Block.time_block_simple_column_wise(10)
        114±0.8μs          103±3μs    ~0.90  bench_shape_base.Block2D.time_block2d((512, 512), 'uint16', (4, 4))
       3.06±0.2ms      2.76±0.01ms    ~0.90  bench_core.CorrConv.time_correlate(100000, 100, 'full')
         410±50μs          370±3μs    ~0.90  bench_lib.Nan.time_nanmean(200000, 0.1)
-      13.2±0.2μs       11.9±0.3μs     0.90  bench_shape_base.Block.time_no_lists(1)
-      8.07±0.1μs       7.28±0.2μs     0.90  bench_lib.Nan.time_nancumprod(200, 2.0)
-      16.1±0.4μs       14.5±0.3μs     0.90  bench_lib.Nan.time_nanargmax(200, 50.0)
-      9.65±0.1μs       8.69±0.2μs     0.90  bench_lib.Nan.time_nansum(200, 50.0)
-     9.52±0.04μs       8.56±0.1μs     0.90  bench_lib.Nan.time_nanprod(200, 50.0)
          148±8μs          133±4μs    ~0.90  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'str'>)
-      14.9±0.2μs       13.4±0.2μs     0.90  bench_shape_base.Block.time_no_lists(10)
       8.34±0.2μs       7.50±0.4μs    ~0.90  bench_lib.Nan.time_nancumsum(200, 0)
      1.00±0.05μs         903±40ns    ~0.90  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.), subok=True, where=True))
-         544±8ns         488±10ns     0.90  bench_indexing.IndexingStructured0D.time_array_all
       88.7±0.8μs         79.6±2μs    ~0.90  bench_shape_base.Block2D.time_block2d((256, 256), 'uint32', (4, 4))
       4.37±0.4ms       3.92±0.2ms    ~0.90  bench_reduce.AddReduceSeparate.time_reduce(0, 'complex256')
       24.2±0.1μs         21.7±4μs    ~0.90  bench_shape_base.Block2D.time_block2d((64, 64), 'uint32', (2, 2))
       13.1±0.2μs       11.7±0.2μs    ~0.90  bench_ma.Indexing.time_0d(True, 2, 100)
-      8.47±0.2μs      7.57±0.04μs     0.89  bench_lib.Nan.time_nancumprod(200, 50.0)
       15.8±0.6μs       14.1±0.1μs    ~0.89  bench_lib.Nan.time_nanargmin(200, 2.0)
       24.1±0.5μs         21.5±2μs    ~0.89  bench_shape_base.Block2D.time_block2d((64, 64), 'uint16', (2, 2))
-        447±20μs          398±1μs     0.89  bench_function_base.Sort.time_sort('quick', 'int64', ('random',))
         381±40μs          338±2μs    ~0.89  bench_lib.Pad.time_pad((4, 4, 4, 4), 8, 'edge')
-     8.07±0.02μs       7.17±0.1μs     0.89  bench_lib.Nan.time_nancumprod(200, 0.1)
       8.42±0.7μs       7.48±0.4μs    ~0.89  bench_lib.Nan.time_nancumsum(200, 90.0)
-     9.38±0.05μs       8.32±0.1μs     0.89  bench_lib.Nan.time_nanprod(200, 2.0)
-     8.28±0.09μs       7.34±0.1μs     0.89  bench_lib.Nan.time_nancumprod(200, 90.0)
       2.76±0.3ms       2.45±0.3ms    ~0.89  bench_indexing.IndexingSeparate.time_mmap_slicing
       1.89±0.1μs      1.67±0.01μs    ~0.88  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), axis=0))
       92.4±0.6μs         81.6±3μs    ~0.88  bench_shape_base.Block2D.time_block2d((512, 512), 'uint8', (4, 4))
-      9.54±0.1μs       8.40±0.2μs     0.88  bench_lib.Nan.time_nansum(200, 90.0)
       30.6±0.4μs         27.0±1μs    ~0.88  bench_shape_base.Block2D.time_block2d((256, 256), 'uint8', (2, 2))
-        757±10μs          666±9μs     0.88  bench_reduce.AddReduceSeparate.time_reduce(1, 'int16')
-      9.47±0.2μs      8.32±0.08μs     0.88  bench_lib.Nan.time_nansum(200, 0)
-      13.2±0.9μs       11.6±0.2μs     0.88  bench_ma.Indexing.time_0d(True, 2, 10)
-     9.48±0.06μs       8.33±0.1μs     0.88  bench_lib.Nan.time_nanprod(200, 0)
-      16.3±0.2μs      14.3±0.08μs     0.88  bench_lib.Nan.time_nanargmax(200, 0)
-      15.7±0.8μs      13.7±0.08μs     0.88  bench_lib.Nan.time_nanargmin(200, 0.1)
-      8.29±0.2μs       7.20±0.1μs     0.87  bench_lib.Nan.time_nancumsum(200, 0.1)
-      9.57±0.1μs       8.29±0.1μs     0.87  bench_lib.Nan.time_nansum(200, 0.1)
-      69.1±0.4μs         59.7±2μs     0.86  bench_shape_base.Block2D.time_block2d((128, 128), 'uint32', (4, 4))
-        77.4±4μs         66.6±2μs     0.86  bench_shape_base.Block2D.time_block2d((128, 128), 'uint64', (4, 4))
       25.3±0.2μs       21.7±0.7μs    ~0.86  bench_shape_base.Block2D.time_block2d((128, 128), 'uint8', (2, 2))
       26.0±0.2μs       22.3±0.5μs    ~0.86  bench_shape_base.Block2D.time_block2d((128, 128), 'uint16', (2, 2))
         10.1±1ms         8.66±2ms    ~0.86  bench_lib.Pad.time_pad((4194304,), 1, 'wrap')
-     6.04±0.05μs       5.16±0.2μs     0.85  bench_core.Core.time_dstack_l
         216±10μs          183±3μs    ~0.85  bench_lib.Nan.time_nansum(200000, 0)
         66.2±1μs         56.1±3μs    ~0.85  bench_shape_base.Block2D.time_block2d((128, 128), 'uint16', (4, 4))
       10.1±0.1ms      8.55±0.08ms    ~0.84  bench_lib.Pad.time_pad((4194304,), 1, 'linear_ramp')
-     1.86±0.04μs      1.56±0.02μs     0.84  bench_core.Core.time_ones_100
-     8.44±0.06μs       7.05±0.2μs     0.84  bench_lib.Nan.time_nancumprod(200, 0)
       78.6±0.8μs         65.5±3μs    ~0.83  bench_shape_base.Block2D.time_block2d((256, 256), 'uint16', (4, 4))
-      16.4±0.2μs       13.7±0.4μs     0.83  bench_shape_base.Block.time_block_simple_column_wise(1)
-         102±3μs         84.8±1μs     0.83  bench_lib.Nan.time_nanpercentile(200, 90.0)
-      9.90±0.6μs       8.21±0.2μs     0.83  bench_lib.Nan.time_nanprod(200, 0.1)
-      51.2±0.5μs       42.2±0.2μs     0.82  bench_shape_base.Block.time_nested(1)
-      47.7±0.6μs       39.3±0.8μs     0.82  bench_shape_base.Block.time_3d(1, 'block')
-      53.3±0.3μs       43.9±0.2μs     0.82  bench_shape_base.Block.time_nested(10)
-      11.0±0.7μs       9.06±0.1μs     0.82  bench_function_base.Sort.time_argsort('merge', 'int16', ('uniform',))
-      43.2±0.2μs       35.5±0.2μs     0.82  bench_shape_base.Block.time_block_complicated(10)
-     1.73±0.08μs      1.42±0.06μs     0.82  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), 0))
-        41.0±1μs       33.5±0.8μs     0.82  bench_shape_base.Block.time_block_complicated(1)
-      65.4±0.9μs         53.3±2μs     0.82  bench_shape_base.Block2D.time_block2d((64, 64), 'uint64', (4, 4))
-         106±3μs       86.8±0.7μs     0.82  bench_lib.Nan.time_nanpercentile(200, 50.0)
-      10.8±0.5μs       8.80±0.1μs     0.81  bench_shape_base.Block.time_block_simple_row_wise(10)
-     4.75±0.09μs      3.86±0.07μs     0.81  bench_core.Core.time_hstack_l
-      66.0±0.5μs         53.6±3μs     0.81  bench_shape_base.Block2D.time_block2d((128, 128), 'uint8', (4, 4))
-      22.6±0.2μs       18.3±0.3μs     0.81  bench_shape_base.Block2D.time_block2d((16, 16), 'uint8', (2, 2))
-     10.1±0.09μs      8.14±0.08μs     0.80  bench_shape_base.Block.time_block_simple_row_wise(1)
-        74.3±2μs         59.3±2μs     0.80  bench_shape_base.Block2D.time_block2d((256, 256), 'uint8', (4, 4))
-     23.1±0.08μs       18.5±0.3μs     0.80  bench_shape_base.Block2D.time_block2d((16, 16), 'uint64', (2, 2))
-     5.48±0.09μs      4.34±0.06μs     0.79  bench_core.Core.time_vstack_l
         884±10μs         698±70μs    ~0.79  bench_ufunc.UFunc.time_ufunc_types('sqrt')
-        24.6±2μs      19.4±0.08μs     0.79  bench_shape_base.Block2D.time_block2d((32, 32), 'uint64', (2, 2))
-      22.8±0.5μs       18.0±0.2μs     0.79  bench_shape_base.Block2D.time_block2d((16, 16), 'uint16', (2, 2))
-        64.9±1μs       51.1±0.3μs     0.79  bench_shape_base.Block2D.time_block2d((64, 64), 'uint32', (4, 4))
       63.7±0.5μs       49.8±0.7μs    ~0.78  bench_shape_base.Block2D.time_block2d((64, 64), 'uint16', (4, 4))
-      24.2±0.9μs      18.9±0.09μs     0.78  bench_shape_base.Block2D.time_block2d((32, 32), 'uint16', (2, 2))
-      61.1±0.8μs         47.5±1μs     0.78  bench_shape_base.Block2D.time_block2d((16, 16), 'uint16', (4, 4))
-      61.0±0.7μs       47.3±0.3μs     0.78  bench_shape_base.Block2D.time_block2d((16, 16), 'uint8', (4, 4))
-        24.5±2μs       19.0±0.2μs     0.77  bench_shape_base.Block2D.time_block2d((32, 32), 'uint32', (2, 2))
-        25.1±2μs       19.3±0.2μs     0.77  bench_shape_base.Block2D.time_block2d((64, 64), 'uint8', (2, 2))
-        24.1±2μs      18.5±0.08μs     0.77  bench_shape_base.Block2D.time_block2d((32, 32), 'uint8', (2, 2))
-        64.3±3μs       48.8±0.1μs     0.76  bench_shape_base.Block2D.time_block2d((32, 32), 'uint32', (4, 4))
-        62.6±5μs       47.3±0.3μs     0.76  bench_shape_base.Block2D.time_block2d((16, 16), 'uint32', (4, 4))
-        63.4±2μs       47.9±0.5μs     0.75  bench_shape_base.Block2D.time_block2d((32, 32), 'uint8', (4, 4))
-        66.3±5μs       49.3±0.7μs     0.74  bench_shape_base.Block2D.time_block2d((32, 32), 'uint64', (4, 4))
-        65.1±3μs       48.3±0.5μs     0.74  bench_shape_base.Block2D.time_block2d((32, 32), 'uint16', (4, 4))
-        67.8±6μs       50.2±0.5μs     0.74  bench_shape_base.Block2D.time_block2d((64, 64), 'uint8', (4, 4))
-        66.4±5μs       47.7±0.5μs     0.72  bench_shape_base.Block2D.time_block2d((16, 16), 'uint64', (4, 4))
-      26.1±0.8μs      18.4±0.08μs     0.70  bench_shape_base.Block2D.time_block2d((16, 16), 'uint32', (2, 2))

The performance difference for np.block is definitely real, and there seems to be a (small) marginal benefit for functions like nansum.

@mhvk (Contributor) commented May 20, 2019

Sorry to be late to this (I was on holidays), but I must admit I don't like this much at all! It makes the functions much less readable, and I feel it goes in the wrong direction from a duck-typing perspective: this seems analogous to rewriting every function that does a + b to ndarray.__add__(a, b) (something actually done a lot in MaskedArray, with unpleasant side effects).

As an admittedly selfish example, I've just started implementing __array_function__ for Quantity and (not surprisingly) found that once I implemented concatenate, most of the functions that use it internally (such as hstack) now work without me having to override them.

With this PR, such partial overrides will no longer work, thus forcing users to copy & paste implementations like those of hstack. I feel this will lead to needless code duplication and future code divergence.

My sense remains that we should aim to get to a state where functions can assume that their arguments behave like ndarray, not that they are ndarray.
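
For concreteness, a minimal sketch of the partial-override pattern described above (hypothetical names; greatly simplified relative to astropy's actual Quantity, which is an ndarray subclass, and assuming the __skip_array_function__ attribute from this PR exists):

    import numpy as np

    HANDLED = {}

    def implements(numpy_function):
        """Register a Quantity-specific implementation."""
        def decorator(func):
            HANDLED[numpy_function] = func
            return func
        return decorator

    class Quantity:
        def __init__(self, value, unit):
            self.value = np.asarray(value)
            self.unit = unit

        def __array_function__(self, func, types, args, kwargs):
            if func in HANDLED:
                return HANDLED[func](*args, **kwargs)
            # Everything else falls back to NumPy's implementation, with
            # dispatch skipped on the *outer* function only. Before this
            # PR, inner calls (e.g. hstack's concatenate) still dispatch,
            # so such wrappers can work without an explicit override.
            return func.__skip_array_function__(*args, **kwargs)

    @implements(np.concatenate)
    def concatenate(arrays, axis=0):
        # Unit checking/conversion elided for brevity.
        values = [q.value for q in arrays]
        return Quantity(np.concatenate(values, axis=axis), arrays[0].unit)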

@mhvk (Contributor) left a comment

Explicitly marking this as requesting changes, so we don't just merge based on the one approval while discussion is ongoing.

Specific request is to only do this (if at all) for functions where arguments are coerced with as*array.

@shoyer (Member Author) commented May 20, 2019

Specific request is to only do this (if at all) for functions where arguments are coerced with as*array.

This is something we discussed, but it wasn't a goal of NEP-18:
http://www.numpy.org/neps/nep-0018-array-function-protocol.html#implementations-in-terms-of-a-limited-core-api

We can definitely try to figure out how to do this in the future, but it should be done intentionally and with care, because it exposes NumPy's internal implementations as public APIs. This would be a major increase in our API surface, and would limit our ability to make future changes.

Consider the internal implementation of np.block as a good example. We currently switch between two different implementations based on a heuristic:

    if list_ndim * final_size > (2 * 512 * 512):
        return _block_slicing(arrays, list_ndim, result_ndim)
    else:
        return _block_concatenate(arrays, list_ndim, result_ndim)

However, note that the _block_slicing path (which was added later as an optimization, if I recall correctly) does not support duck typing, because it uses np.empty(). If we had exposed the implementation of np.block as a public API, then making this change would have been much more difficult. In the future, we might want to rewrite np.block in C, which would break both code paths yet again.

This PR maintains a simple policy: if you want to overload a NumPy function, you have to override it directly. This isn't the most convenient API for implementers, but it keeps our options open as NumPy maintainers.

@mhvk (Contributor) commented May 21, 2019

I guess I must have lost track during the long discussion, but I thought the main purpose of __skip_array_function__ was to make it easier for users of the numpy API to proceed step by step, as one can simply pass on to the implementation for functions that "happen to work". This PR goes completely in the opposite direction!

It doesn't seem to me that the relatively minimal speed-ups outweigh either the cost of the above or the cost in readability (let alone both).

p.s. The above does not mean I disagree that we should be free to change the implementation of np.block - to me, using __array_function__ implies that you are willing to adapt to changes in implementation. But having to adjust, e.g., hstack when one has a nice override for concatenate just seems gratuitous.

@rgommers (Member)

It doesn't seem to me that the relatively minimal speed-ups outweigh either the cost of the above or the cost in readability

Perhaps that's a fair assessment.

to me, using __array_function__ implies that you are willing to adapt to changes in implementation.

I'd more put it like: you are supplying your own implementation, hence changes in NumPy's internals are irrelevant (except if they're backwards-incompatible changes in API, to which you adapt).

Note: I'm also intentionally using this even (especially!) in functions that do not directly call np.asarray() on their inputs. This is so that overrides of __array_function__ cannot affect the internals of NumPy functions, which would be leaking our implementation as our API.

On second thought, I'm not sure this is right. Or I need a more detailed explanation of what you mean by this. Taking a random example:

@dispatcher....
def full_like(a, ...):
    ...
    np.empty_like.__skip_array_function__(a, ...)

If a has an __array_function__ attribute, under what circumstances does this "leak implementation as API"? If a.__array_function__ has a changed full_like function, then np.empty_like is never called to begin with.

@mhvk (Contributor) commented May 21, 2019

to me, using __array_function__ implies that you are willing to adapt to changes in implementation.

I'd more put it like: you are supplying your own implementation, hence changes in NumPy's internals are irrelevant (except if they're backwards-incompatible changes in API, to which you adapt).

I should have been more explicit: if you use __array_function__ and use the numpy implementation via __skip_array_function__ (either directly or after some manipulation), you signal you are willing to adapt to changes in that implementation.

Note: I'm also intentionally using this even (especially!) in functions that do not directly call np.asarray() on their inputs. This is so that overrides of __array_function__ cannot affect the internals of NumPy functions, which would be leaking our implementation as our API.

On second thought, I'm not sure this is right. Or I need a more detailed explanation of what you mean by this. Taking a random example:

@dispatcher....
def full_like(a, ...):
    ...
    np.empty_like.__skip_array_function__(a, ...)

If a has an __array_function__ attribute, under what circumstances does this "leak implementation as API"? If a.__array_function__ has a changed full_like function, then np.empty_like is never called to begin with.

Better answered by @shoyer, of course, but the reason I would like this not to happen is that if I correctly implement empty_like, it may well be that all the other *_like functions "just work", because they do not rely on the inputs strictly being ndarray. In this respect, I actually wish all the *_like functions were more obviously written as

def zeros_like(a):
    result = empty_like(a)
    result[...] = 0
    return result

rather than use copyto internally - if the assignment is less efficient for some reason, we should solve that rather than use less obvious routines!

@charris (Member) commented May 21, 2019

we should solve that rather than use less obvious routines!

This sounds a bit like @njsmith's ducktype proposal, a minimal list of functions that need to be implemented.

@mhvk (Contributor) commented May 21, 2019

This sounds a bit like @njsmith's ducktype proposal, a minimal list of functions that need to be implemented.

Yes, I liked that proposal very much and still hope that __array_function__ provides a route towards it! Certainly should try to avoid making it more difficult.

@shoyer (Member Author) commented May 21, 2019

On second thought, I'm not sure this is right. Or I need a more detailed explanation of what you mean by this. Taking a random example:

@dispatcher....
def full_like(a, ...):
    ...
    np.empty_like.__skip_array_function__(a, ...)

If a has an __array_function__ attribute, under what circumstances does this "leak implementation as API"? If a.__array_function__ has a changed full_like function, then np.empty_like is never called to begin with.

This version is fine. The version where I'm concerned about leaking implementation as API looks like this:

@dispatcher....
def full_like(a, ...):
    ...
    np.empty_like(a, ...)

Consider a library that implements full_like() inside __array_function__ by calling full_like.__skip_array_function__(). Their implementation of full_like() implicitly relies upon the details of NumPy's internal implementation, e.g., the fact that it uses empty_like() internally. If we change this, we might break them.
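
Spelled out as a sketch (hypothetical duck-array class, again assuming the __skip_array_function__ attribute exists), the reliance looks like this:

    import numpy as np

    class Duck:
        def __array_function__(self, func, types, args, kwargs):
            if func is np.empty_like:
                return Duck()  # stand-in for a real duck-array allocation
            if func is np.full_like:
                # Reuse NumPy's implementation. This produces a Duck only
                # because NumPy's full_like happens to call the *dispatched*
                # np.empty_like internally -- an implementation detail that
                # this reliance quietly turns into API.
                return np.full_like.__skip_array_function__(*args, **kwargs)
            return NotImplemented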

@shoyer (Member Author) commented May 21, 2019

I should have been more explicit: if you use __array_function__ and use the numpy implementation via __skip_array_function__ (either directly or after some manipulation), you signal you are willing to adapt to changes in that implementation.

This is a reasonable thing to want, but it was not what I was aiming for with __skip_array_function__. I drafted a few words for NEP-18 to hopefully clarify this: #13600

Let me also surface one of my previous mailing list posts on this topic:

I'm a little puzzled by why you are concerned about retaining this
flexibility to reuse the attribute I'm asking for here for a function that
works differently. What I want is a special attribute that is guaranteed to
work like the [existing] public version of a NumPy function, but without checking for
an __array_function__ attribute.

If we later decide we want to expose an attribute that provides a
non-coercing function that calls ufuncs directly instead of np.asarray,
what do we lose by giving it a new name so users don't need to worry about
changed behavior? There is plenty of room for special attributes on NumPy
functions. We can have both np.something.__skip_array_overrides__ and
np.something.__array_implementation__.

(I added "[existing]" for clarification.)

So sure, let's make __array_implementation__, too, but let's save it for a different protocol -- a more ambitious one than what I had in mind here.

Better answered by @shoyer, of course, but the reason I would like this not to happen is that if I correctly implement empty_like, it may well be that all the other *_like functions "just work", because they do not rely on the inputs strictly being ndarray. In this respect, I actually wish all the *_like functions were more obviously written as

def zeros_like(a):
    result = empty_like(a)
    result[...] = 0
    return result

rather than use copyto internally - if the assignment is less efficient for some reason, we should solve that rather than use less obvious routines!

I am very sympathetic here, but I think this exposes a typical issue: a duck-array class could implement np.copyto() but not assignment-based indexing. If NumPy exposes __skip_array_function__ without rewriting zeros_like, then we could be unable to make these types of refactors.

We could say that we make no backwards compatibility guarantees on what goes on inside __skip_array_function__, but historically we've seen that users will rely upon every detail of our API that we expose publicly, and we've been very reluctant to break them.

@rgommers (Member)

... Their implementation of full_like() implicitly relies upon the details of NumPy's internal implementation, e.g., the fact that it uses empty_like() internally. If we change this, we might break them.

You can also see this as an argument for not doing the whole __skip_array_function__....

Wouldn't we be able to implement this inside the dispatch mechanism instead? Something like

def __skip_array_function__(...):
    with disable_array_function_dispatching:
        return call_wrapped_numpy_func()

I suspect that that's doable with not too much code (and hopefully similar performance gain as this PR), and you get the skip method you want without needing the changes in this PR.

@mhvk (Contributor) commented May 21, 2019

@shoyer - it may be best to go back to a specific example rather than try to generalize, as it may clarify where we differ. Let's pick vstack, as it is particularly simple: really just a synonym for concatenate with a default axis=0, after ensuring everything has at least 2 axes via a possible reshape.

Currently, the vstack implementation will happily work for any class that correctly implements reshaping (probably all duck arrays) and also overrides concatenate.

I think we should encourage people who use __array_function__ to use our implementations rather than forcing them, as this PR does, to duplicate them. I am not sure how any worry about a frozen API can offset the huge cost of code duplication, with errors then going uncorrected down the line, etc.

p.s. We will be stuck with our present implementations for a very long time anyway, just for classes that do not define __array_function__.

@rgommers (Member) commented May 21, 2019

I should have been more explicit: if you use __array_function__ and use the numpy implementation via __skip_array_function__ (either directly or after some manipulation), you signal you are willing to adapt to changes in that implementation.

This is a reasonable thing to want, but it was not what I was aiming for with __skip_array_function__.

I don't really agree that this is a reasonable thing to want. You never want to agree to be sensitive to internal implementation details; what good does that ever do you? It's just a concession with zero upsides.

What you signal is "ignore my __array_function__ attribute, just treat me exactly as if I didn't have one". This PR gets you that behavior, it's just not very good for maintainability so a cleaner implementation would be nice.

@mhvk (Contributor) commented May 21, 2019

p.s. Sorry, np.vstack is not a good example, as it explicitly coerces to array!

@shoyer (Member Author) commented May 21, 2019

Wouldn't we be able to implement this inside the dispatch mechanism instead? Something like

def __skip_array_function__(...):
    with disable_array_function_dispatching:
        return call_wrapped_numpy_func()

I suspect that that's doable with not too much code (and hopefully similar performance gain as this PR), and you get the skip method you want without needing the changes in this PR.

Now that you mention it, this does seem like it may be possible. I'm going to look into this...

Currently, the vstack implementation will happily work for any class that correctly implements reshaping (probably all duck arrays) and also overrides concatenate.

Yes, this is true with the current implementation of __array_function__, but not with the default behavior of NumPy in any released version, because vstack calls np.concatenate(), which does not yet have an override.

This is a reasonable thing to want, but it was not what I was aiming for with __skip_array_function__.

I don't really agree that this is a reasonable thing to want. You never want to agree to be sensitive to internal implementation details; what good does that ever do you? It's just a concession with zero upsides.

I agree, so let me restate this: I do think there are cases where it is convenient to be able to use a canonical implementation of a NumPy function in terms of a fixed set of core operations. This may or may not correspond to our current implementations in NumPy.

@shoyer (Member Author) commented May 21, 2019

p.s. We will be stuck with our present implementations for a very long time anyway, just for classes that do not define __array_function__.

Examples would indeed be helpful here. At present, I don't think there are many NumPy functions that expose their internal implementation details for duck arrays. Some (inconsistently) do so for ndarray subclasses, but almost everything uses asarray() or asanyarray(), even if only at the C level via a multiarray function.

@mhvk (Contributor) commented May 21, 2019

I agree, so let me restate this: I do think there are cases where it is convenient to be able to use a canonical implementation of a NumPy function in terms of a fixed set of core operations. This may or may not correspond to our current implementations in NumPy.

I agree with this too. I simply fear that by making it impossible to use even current implementations that are close to that ideal, we make it unnecessarily harder to reach it. I still see this PR as making duck typing unnecessarily hard (it still seems to be doing the equivalent of replacing all a + b with ndarray.__add__(a, b)). Why force things?

@rgommers (Member)

I still see this PR as making duck typing unnecessarily hard (it still seems to be doing the equivalent of replacing all a + b with ndarray.__add__(a, b)). Why force things?

I don't think that's an accurate analogy. This PR is simply an (admittedly not pretty) way to preserve the current default 1.16 behaviour. It doesn't make duck typing easier, but it also should not make it harder. This whole feature is basically unrelated to duck typing, imho.

I agree, so let me restate this: I do think there are cases where it is convenient to be able to use a canonical implementation of a NumPy function in terms of a fixed set of core operations. This may or may not correspond to our current implementations in NumPy.

Yes, that would be nice. It's not something that we guarantee at all today though (AFAIK - if we do somewhere, can you point to it?). It requires, even for implementations that now happen to be just about right, tests and documentation.

@mhvk (Contributor) commented May 21, 2019

Just going down the list, the first and really obvious one is allclose, which is nothing more than bool(isclose(*args, **kwargs).all()).

The code in lib is perhaps even more clear (say, rot90).

But concatenate remains the hands-down best example from what I tried so far for Quantity - with that implemented, all of stack, column_stack, hstack, vstack, and dstack work with the numpy implementation.
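
To make the allclose case concrete, here is a sketch (hypothetical Duck class) of the composition being described; it depends on NumPy's allclose internally calling the dispatched isclose, which holds for the current pure-Python implementation:

    import numpy as np

    class Duck:
        def __init__(self, data):
            self.data = np.asarray(data)

        def __array_function__(self, func, types, args, kwargs):
            if func is np.isclose:
                a, b = (x.data if isinstance(x, Duck) else x
                        for x in args[:2])
                return np.isclose(a, b, **kwargs)
            if func is np.allclose:
                # NumPy's allclose is just bool(isclose(...).all()), so
                # reusing it picks up the isclose override above.
                return np.allclose.__skip_array_function__(*args, **kwargs)
            return NotImplemented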

@mhvk (Contributor) commented May 21, 2019

I don't think that's an accurate analogy. This PR is simply an (admittedly not pretty) way to preserve the current default 1.16 behaviour.

In what way does it preserve 1.16 behaviour? If one doesn't define __array_function__, nothing changes.

It doesn't make duck typing easier, but it also should not make it harder. This whole feature is basically unrelated to duck typing, imho.

It does, in that it removes control: instead of "treat me as if I were ndarray in this function", it becomes "treat me as if I were ndarray from here on". I don't see much difference with also ignoring __array_ufunc__, or indeed any Python operator override.

@rgommers (Member)

In what way does it preserve 1.16 behaviour?

Maybe I should say "give the option to keep 1.16 behaviour, even in the presence of __array_function__" rather than preserve.

# for x with an `__array_function__`
f1 = np.somefunc.__skip_array_function__  # from this PR
f2 = np.somefunc  # from 1.16, no env var set

assert f1(x, ...) == f2(x, ...)

@mhvk (Contributor) commented May 21, 2019

Just going down the list, the first and really obvious one is allclose, which is nothing more than bool(isclose(*args, **kwargs).all()).

That's a good example. The input to allclose is documented as array_like, and the tests only use ndarray, plus some weird memmap-related test with a subclass Foo that checks that the return type is bool. There is absolutely no guarantee that the implementation will stay exactly like that. We'd be perfectly within our rights to insert an np.asarray there. We won't, but we can today.

I agree we can change it any time. My main point is that there is no reason to change it now, when really, unlike so much of numpy, we have a nice pythonic implementation. This means that any class that defines __array_function__ and overrides isclose can use the numpy implementation for allclose. Surely that is better than forcing them to copy & paste it into their own code base?

np.concatenate seems the same: the docs say array_like, and there's even an explicit disclaimer that MaskedArray does not work; users are to use np.ma.concatenate instead. It just happens to work for Quantity right now, apparently.

For Quantity, I need to override concatenate (solving a long-standing issue), but doing that immediately also solves all those other functions, since those are just wrappers around concatenate.

@rgommers (Member)

This means that any class that defines __array_function__ and overrides isclose can use the numpy implementation for allclose. Surely that is better than forcing them to copy & paste it into their own code base?

No it's not. At least I would not state that with such certainty. Because now you have just agreed to making our internal implementation details public. That's a major decision that negatively impacts the maintainability of NumPy itself. It may be convenient for some other library, but there's a big tradeoff here. There's probably way more work in us writing tests and docs if we do make that decision than there is in copying some code around.

@mhvk (Contributor) commented May 21, 2019

@rgommers - If we want to preserve 1.16 behaviour like that, I think we would be much better off with a context manager that does the opposite of the current environment variable.

Indeed, that would have the advantage of keeping the actual code readable. In fact, I would not mind if such a "context manager" were used inside ndarray.__array_function__.

@rgommers (Member)

If we want to preserve 1.16 behaviour like that, I think we would be much better off with a context manager that does the opposite of the current environment variable.

No need. Preserving it is not a goal; I just tried to explain that that's what it does in practice, so it should not affect duck arrays one way or the other. The situation for those gets neither better nor worse. I think the fundamental issue here is that you want __array_function__ to work for duck arrays somehow, but that's a whole separate topic.

Anyway, maybe we should give this a rest until @shoyer has had a chance to try the with disable_array_function_dispatching: thing - that should make this PR unnecessary, and should address your worry that the duck-array situation gets worse.

@rgommers (Member)

Indeed, that would have the advantage of keeping the actual code readable. In fact, I would not mind if such a "context manager" were used inside ndarray.__array_function__.

Me too :)

@mhvk (Contributor) commented May 21, 2019

This means that any class that defines __array_function__ and overrides isclose can use the numpy implementation for allclose. Surely that is better than forcing them to copy & paste it into their own code base?

No it's not. At least I would not state that with such certainty. Because now you have just agreed to making our internal implementation details public.

In what way is this making it more public than it is already? I guess we may have to agree to disagree here on what constitutes a larger cost, but I think you may be underestimating the cost of copy & paste coding.

@rgommers (Member)

I guess we may have to agree to disagree here on what constitutes a larger cost, but I think you may be underestimating the cost of copy & paste coding.

I'm not assuming or underestimating anything. I'm just stating that it's a tradeoff, and one with big implications. I don't see anyone lining up to write all those tests and docs for us. Or changing implementations for those functions that aren't ideal in implementation right now. This needs a NEP at the very least, and it's not trivial.

In what way is this making it more public than it is already?

It's not public today. You just agreed we can change it if we want. So if we can't change it any more later, then we have just made it more public.

@mhvk (Contributor) commented May 21, 2019

I only meant that I don't see how the extent to which we can change it depends on what we do for __skip_array_function__ - the whole __array_function__ NEP very clearly warns that we can do this. But from previous comments it is probably clear that I worry a bit less about backwards compatibility than others.

Similarly, it should be obvious that I don't think we have to write tests for other packages. Though I do think we'd have quite a good set if we were to use __array_function__ for MaskedArray (something I was intrigued by, though as I never got around to doing __array_ufunc__, it seems unlikely I'll be the one doing it).

@mhvk (Contributor) commented May 21, 2019

p.s. And, yes, let's give this a rest to see if the context manager thingy works - code readability alone makes this PR very non-ideal.

@mhvk (Contributor) commented May 21, 2019

p.p.s. I do wonder whether we shouldn't revert __skip_array_function__. I feel this is all happening awfully close to the release date, and without much in the way of actual problems encountered implementing __array_function__.

@charris (Member) commented May 21, 2019

p.p.s. I do wonder whether we shouldn't revert __skip_array_function__

That thought has also occurred to me. All these things need implementation testing and feedback as we feel our way into what could easily become a mess. @rgommers I believe you also have some recent experience with using __array_function__, although not with __skip_array_function__.

@mattip (Member) commented May 21, 2019

Perhaps we should have a synchronous discussion, if not during the community call then a specially scheduled one to work this out. Trying to follow the discussion on github is painful.

@shoyer (Member Author) commented May 21, 2019

p.s. And, yes, let's give this a rest to see if the context manager thingy works - code readability alone makes this PR very non-ideal.

Having thought about this a bit more, I think a context manager would be very challenging:

  1. To avoid the performance hit of adding another function call in every NumPy operation without overrides, it would need to be written in C.
  2. The context manager needs to be thread-safe, because the rest of NumPy already is (and this is relied upon by tools such as dask.array).

This basically means we would need to use something like contextvars, but that's only available in Python 3.7+. We could potentially do a back-port, but even then the performance hit might be unacceptable -- contextvars would add overhead at least as large as a dictionary lookup.

What I do think would make sense is enabling a run-time check for nested invocations of __array_function__, used only when running NumPy's test suite. So at least if NumPy functions have test coverage, we could identify these cases automatically, even if the required fix is manual.
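
A minimal sketch of what such a test-time guard could look like (hypothetical names; thread-local state so parallel test runs are unaffected):

    import threading

    _guard = threading.local()

    def checked_implementation(implementation, *args, **kwargs):
        """Wrap each dispatched NumPy implementation at test time; any
        nested dispatch re-enters this wrapper and is flagged."""
        if getattr(_guard, "active", False):
            raise AssertionError(
                "nested __array_function__ dispatch inside %s"
                % implementation.__name__)
        _guard.active = True
        try:
            return implementation(*args, **kwargs)
        finally:
            _guard.active = False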

@njsmith (Member) commented May 21, 2019

Accessing thread locals is fast, especially if you use modern compiler features like __thread. contextvars is indeed a bit slower, but it has some magic to skip that dictionary lookup overhead in many cases (each ContextVar object has an internal cache with a single entry, so repeated lookups within a single thread/context are very fast) – one of the design goals was to make it usable for numpy's errstate :-). But contextvars is really only important for async code; if you don't care about that then thread locals are fine.

My concern about a context manager would be spooky action at a distance. If I do with skip_array_function: sklearn.something(...), and sklearn.something tries to use a sparse array internally, then you may get surprising results.
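
For reference, a minimal sketch (hypothetical names) of the thread-safe flag discussed above -- the ContextVar's single-entry cache is what keeps the common path cheap:

    import contextvars

    _skip = contextvars.ContextVar("skip_array_function", default=False)

    def run_skipping_dispatch(implementation, *args, **kwargs):
        """Run a NumPy implementation with internal __array_function__
        dispatch suppressed, safely across threads and async tasks."""
        token = _skip.set(True)
        try:
            return implementation(*args, **kwargs)
        finally:
            _skip.reset(token)

    def dispatch_enabled():
        # Repeated .get() calls hit the ContextVar's per-context cache,
        # so the common no-override path stays fast.
        return not _skip.get()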

@rgommers (Member)

@njsmith the idea was a private context manager, inside the __array_function__ dispatcher. Just to give an efficient way for ourselves to stay within NumPy. You'd never encounter sklearn or sparse arrays that way

@shoyer (Member Author) commented May 21, 2019

@njsmith the idea was a private context manager, inside the __array_function__ dispatcher. Just to give an efficient way for ourselves to stay within NumPy. You'd never encounter sklearn or sparse arrays that way

Yes, this would only be for cases where we're using NumPy's implementation of functions. As long as these functions don't have some API that allows for running arbitrary code, we would be in the clear.

That said, I agree with @njsmith about disliking action at a distance, even if we think it's safe. For that reason alone I would prefer explicit calls to __skip_array_function__ within NumPy.

For what it's worth, I think this is a pretty minor loss in readability. It's not pretty, but a 10% speed-up for the price of adding one magic attribute look-up seems like a pretty good deal to me. We already do lots of stuff like this, e.g., using copyto() instead of indexing within full_like(). Certainly it's better than rewriting internal NumPy functions in C, which is our other go-to alternative.

@mhvk (Contributor) commented May 21, 2019

I guess people differ in the extent to which they adhere to "readability counts" - though if I start quoting the zen of python, I should be fair and agree that it does suggest that the context manager is not that great an idea!

But if speed is of the essence, then perhaps an alternative for 1.17 is to keep the environment variable that turns __array_function__ on or off, but flip its default. That would allow further experimentation on our side without the speed penalty for pure-numpy users.

@rgommers (Member)

That said, I agree with @njsmith about disliking action at a distance, even if we think it's safe. For that reason alone I would prefer explicit calls to __skip_array_function__ within NumPy.

This doesn't make much sense to me. This would be far easier to understand than the rest of the dispatcher implementation, and it has nothing to do with action at a distance. It's about keeping something that is a property of the whole __array_function__ mechanism within the implementation of that mechanism, rather than scattered all over the code base.

I also really don't care whether it's actually a context manager or something else. If we could, in a thread-safe manner, del __array_function__ as the first thing when an external user calls __skip_array_function__, and then reattach it at the end, that would be fine too.

@rgommers (Member)

For what it's worth, I think this is a pretty minor loss in readability

Opinions on that vary, I guess. Another thing to keep in mind is that you were in favor of using __array_function__ outside of numpy (e.g. in scipy). That story also now takes a big hit: it goes from "just add a decorator" to "add a decorator, and rewrite a lot of the numpy calls in your implementation".

@rgommers (Member)

Perhaps we should have a synchronous discussion, if not during the community call then a specially scheduled one to work this out. Trying to follow the discussion on github is painful.

One call about the design may not be enough, but for talking about the proposal @mhvk just made on the mailing list, it could be useful. Scheduling this may not be easy though .....

@rgommers (Member)

a 10% speed-up for the price of ...

Keep in mind that none of this is a speedup compared to a baseline of "no __array_function__". We're adding some features, and they come at costs - performance, readability, extensibility, impact on future changes, etc.

What we gain with __skip_array_function__ is convenience for libraries that want to do partial implementations of __array_function__. And a reduction of the performance penalty of adding __array_function__ in its current form.

I'm just running the SciPy test suite with the env var on and off to get a sense of the impact. It would be useful to have a bigger comparison for other libraries, and some real-world code bases (test suites may slow down more than real-world code).

@shoyer (Member Author) commented May 22, 2019

What I do think would make sense is enabling a run-time check for nested invocations of __array_function__, used only when running NumPy's test suite. So at least if NumPy functions have test coverage, we could identify these cases automatically, even if the required fix is manual.

To follow up on this: I made an attempt to detect nested use of __array_function__ invocations at test time in shoyer#1.

Unfortunately, the result is not very satisfying. The best I can do is check whether we call another overridden function within a NumPy implementation, but the end result of that would be using __skip_array_function__ everywhere inside NumPy, which really is not a great option. The problem is that my runtime test on NumPy arrays can't detect "safe" uses where the arguments were already guarded with asarray().

If we want to detect this programmatically, we would need either:

  1. A comprehensive version of NumPy's test suite that uses duck arrays, or
  2. Some sort of more sophisticated static analysis check. This might be possible with a tool like mypy.

As is, I'm not really comfortable with any solution here. Going forward with __skip_array_function__ without any way to programmatically verify that we are not expanding our API surface seems like it would set us up for future pain.

@rgommers (Member)

Thanks for trying @shoyer. That's too bad.

@shoyer (Member Author) commented May 25, 2019

Closing this based on the consensus from the mailing list. See #13624 for a PR revising the NEP.

@shoyer shoyer closed this May 25, 2019
shoyer added a commit to shoyer/numpy that referenced this pull request May 26, 2019
This is a partial reprise of the optimizations from numpyGH-13585.

The trade-offs here are about readability, performance and whether these
functions automatically work on ndarray subclasses.

You'll have to judge the readability costs for yourself, but I think this is
pretty reasonable.

Here are the performance numbers for three relevant functions with the
following IPython script:

    import numpy as np
    x = np.array([1])
    xs = [x, x, x]
    for func in [np.stack, np.vstack, np.block]:
        %timeit func(xs)

| Function | Master | This PR |
| --- | --- | --- |
| `stack` | 6.36 µs ± 175 ns | 6 µs ± 174 ns |
| `vstack` | 7.18 µs ± 186 ns | 5.43 µs ± 125 ns |
| `block` | 15.1 µs ± 141 ns | 11.3 µs ± 104 ns |

The performance benefit for `stack` is somewhat marginal (perhaps it should be
dropped), but it's much more meaningful inside `vstack`/`hstack` and `block`,
because these functions call other dispatched functions within a loop.

For automatically working on ndarray subclasses, the main concern would be that
by skipping dispatch with `concatenate`, subclasses that define `concatenate`
won't automatically get implementations for `*stack` functions. (But I don't
think we should consider ourselves obligated to guarantee these implementation
details, as I write in numpyGH-13633.)

`block` also will not get an automatic implementation, but given that `block`
uses two different code paths depending on argument values, this is probably a
good thing, because there's no way the code path not involving `concatenate`
could automatically work (it uses `empty()`).
shoyer added a commit that referenced this pull request Jun 12, 2019
* MAINT: avoid nested dispatch in numpy.core.shape_base


* MAINT: only remove internal use in np.block

* MAINT: fixup comment on np.block optimization