MAINT: speed up `_block` by avoiding a recursive closure #11991

hmaarrfk · 2018-09-19T14:51:43Z

This is a small PR that speeds up the case of blocking small arrays by 10-20% in CPython 3.6 by avoiding closures.

This only makes PR #11971 a harder sell, but I think these optimizations are worthwhile for people blocking small arrays.

Benchmarks Python 3.6 conda(-forge?)

       before           after         ratio
     [d126ff7a]       [535d337d]
     <master>         <block_optimize_order>
-      32.8±0.3μs       28.6±0.4μs     0.87  bench_shape_base.Block2D.time_block2d((128, 128), 'uint64', (2, 2))
-        67.7±4μs       57.4±0.5μs     0.85  bench_shape_base.Block2D.time_block2d((256, 256), 'uint16', (4, 4))
-      27.9±0.1μs       23.4±0.2μs     0.84  bench_shape_base.Block2D.time_block2d((256, 256), 'uint8', (2, 2))
-        60.3±4μs       50.7±0.9μs     0.84  bench_shape_base.Block2D.time_block2d((256, 256), 'uint8', (4, 4))
-      57.9±0.6μs       48.6±0.7μs     0.84  bench_shape_base.Block2D.time_block2d((128, 128), 'uint32', (4, 4))
-        39.1±1μs         32.7±1μs     0.84  bench_shape_base.Block.time_block_complicated(10)
-      50.6±0.4μs         42.2±1μs     0.83  bench_shape_base.Block2D.time_block2d((64, 64), 'uint16', (4, 4))
-      35.7±0.8μs       29.6±0.4μs     0.83  bench_shape_base.Block.time_no_lists(100)
-      53.9±0.4μs       44.6±0.3μs     0.83  bench_shape_base.Block2D.time_block2d((128, 128), 'uint16', (4, 4))
-        65.1±2μs       53.7±0.6μs     0.82  bench_shape_base.Block2D.time_block2d((128, 128), 'uint64', (4, 4))
-        23.3±1μs       19.1±0.1μs     0.82  bench_shape_base.Block2D.time_block2d((64, 64), 'uint64', (2, 2))
-      53.6±0.6μs       43.7±0.4μs     0.82  bench_shape_base.Block2D.time_block2d((64, 64), 'uint64', (4, 4))
-      27.5±0.5μs       22.4±0.2μs     0.81  bench_shape_base.Block2D.time_block2d((128, 128), 'uint32', (2, 2))
-      41.0±0.2μs       33.2±0.3μs     0.81  bench_shape_base.Block.time_3d(1)
-        35.4±1μs       28.6±0.3μs     0.81  bench_shape_base.Block2D.time_block2d((256, 256), 'uint16', (2, 2))
-      22.1±0.5μs       17.6±0.2μs     0.80  bench_shape_base.Block2D.time_block2d((64, 64), 'uint32', (2, 2))
-        49.8±1μs       39.4±0.4μs     0.79  bench_shape_base.Block2D.time_block2d((32, 32), 'uint8', (4, 4))
-      11.7±0.5μs       9.23±0.4μs     0.79  bench_shape_base.Block.time_no_lists(10)
-        52.3±3μs       40.8±0.3μs     0.78  bench_shape_base.Block2D.time_block2d((32, 32), 'uint64', (4, 4))
-      17.5±0.5μs       13.6±0.2μs     0.78  bench_shape_base.Block.time_block_simple_column_wise(10)
-        37.3±1μs       28.9±0.2μs     0.77  bench_shape_base.Block.time_block_complicated(1)
-        23.1±1μs       17.8±0.2μs     0.77  bench_shape_base.Block2D.time_block2d((128, 128), 'uint8', (2, 2))
-      21.7±0.8μs       16.7±0.1μs     0.77  bench_shape_base.Block2D.time_block2d((64, 64), 'uint16', (2, 2))
-      11.1±0.5μs       8.48±0.1μs     0.76  bench_shape_base.Block.time_no_lists(1)
-      21.2±0.7μs      16.1±0.08μs     0.76  bench_shape_base.Block2D.time_block2d((32, 32), 'uint32', (2, 2))
-        55.5±4μs       42.1±0.5μs     0.76  bench_shape_base.Block2D.time_block2d((64, 64), 'uint32', (4, 4))
-        56.4±1μs       42.6±0.1μs     0.75  bench_shape_base.Block2D.time_block2d((128, 128), 'uint8', (4, 4))
-      26.1±0.7μs       19.7±0.2μs     0.75  bench_shape_base.Block2D.time_block2d((128, 128), 'uint16', (2, 2))
-      21.0±0.3μs       15.8±0.3μs     0.75  bench_shape_base.Block2D.time_block2d((16, 16), 'uint32', (2, 2))
-        53.2±1μs       39.7±0.7μs     0.75  bench_shape_base.Block2D.time_block2d((16, 16), 'uint32', (4, 4))
-        52.4±2μs       39.1±0.4μs     0.75  bench_shape_base.Block2D.time_block2d((16, 16), 'uint8', (4, 4))
-      21.5±0.1μs       15.9±0.2μs     0.74  bench_shape_base.Block2D.time_block2d((16, 16), 'uint64', (2, 2))
-      22.1±0.5μs       16.3±0.1μs     0.74  bench_shape_base.Block2D.time_block2d((64, 64), 'uint8', (2, 2))
-        53.4±2μs       39.4±0.3μs     0.74  bench_shape_base.Block2D.time_block2d((32, 32), 'uint16', (4, 4))
-      10.5±0.2μs       7.71±0.1μs     0.73  bench_shape_base.Block.time_block_simple_row_wise(10)
-      21.0±0.2μs       15.2±0.1μs     0.73  bench_shape_base.Block2D.time_block2d((16, 16), 'uint8', (2, 2))
-        54.4±2μs       38.9±0.4μs     0.72  bench_shape_base.Block2D.time_block2d((16, 16), 'uint16', (4, 4))
-        57.1±1μs       40.8±0.4μs     0.71  bench_shape_base.Block2D.time_block2d((64, 64), 'uint8', (4, 4))
-        55.6±3μs       39.7±0.4μs     0.71  bench_shape_base.Block2D.time_block2d((16, 16), 'uint64', (4, 4))
-      22.2±0.4μs      15.8±0.09μs     0.71  bench_shape_base.Block2D.time_block2d((32, 32), 'uint16', (2, 2))
-      23.1±0.2μs       16.4±0.1μs     0.71  bench_shape_base.Block2D.time_block2d((32, 32), 'uint64', (2, 2))
-      22.2±0.7μs       15.7±0.1μs     0.70  bench_shape_base.Block2D.time_block2d((32, 32), 'uint8', (2, 2))
-        22.2±2μs       15.4±0.6μs     0.69  bench_shape_base.Block2D.time_block2d((16, 16), 'uint16', (2, 2))
-        58.1±5μs       40.1±0.4μs     0.69  bench_shape_base.Block2D.time_block2d((32, 32), 'uint32', (4, 4))

Benchmarks python 3.5 conda(-forge?)


       before           after         ratio
     [d126ff7a]       [91391c1a]
     <master>         <block_optimize_order>
-       59.6±10μs       47.4±0.8μs     0.79  bench_shape_base.Block2D.time_block2d((512, 512), 'uint8', (2, 2))
-        78.1±1μs         60.6±2μs     0.78  bench_shape_base.Block2D.time_block2d((256, 256), 'uint16', (4, 4))
-        57.7±5μs         44.2±1μs     0.77  bench_shape_base.Block2D.time_block2d((32, 32), 'uint16', (4, 4))
-        57.7±3μs         43.8±1μs     0.76  bench_shape_base.Block2D.time_block2d((64, 64), 'uint8', (4, 4))
-        55.5±3μs       42.1±0.6μs     0.76  bench_shape_base.Block2D.time_block2d((16, 16), 'uint32', (4, 4))
-        63.0±6μs         46.0±4μs     0.73  bench_shape_base.Block2D.time_block2d((64, 64), 'uint16', (4, 4))
-        27.9±4μs         20.2±1μs     0.72  bench_shape_base.Block2D.time_block2d((64, 64), 'uint64', (2, 2))
-        111±20μs         79.7±2μs     0.72  bench_shape_base.Block2D.time_block2d((512, 512), 'uint8', (4, 4))
-        59.1±1μs       41.4±0.7μs     0.70  bench_shape_base.Block2D.time_block2d((16, 16), 'uint16', (4, 4))
-        71.3±5μs         49.6±2μs     0.70  bench_shape_base.Block2D.time_block2d((128, 128), 'uint8', (4, 4))
-        21.5±3μs      14.4±0.07μs     0.67  bench_shape_base.Block2D.time_block2d((16, 16), 'uint8', (2, 2))
-        66.3±3μs         44.1±1μs     0.67  bench_shape_base.Block2D.time_block2d((64, 64), 'uint64', (4, 4))
-        63.5±7μs       41.7±0.3μs     0.66  bench_shape_base.Block2D.time_block2d((64, 64), 'uint32', (4, 4))

Benchmarks Python 2.7 conda(-forge)

       before           after         ratio
     [d126ff7a]       [91391c1a]
     <master>         <block_optimize_order>
-      28.8±0.4μs       25.4±0.2μs     0.88  bench_shape_base.Block2D.time_block2d((256, 256), 'uint16', (2, 2))
-        68.8±2μs         60.2±1μs     0.87  bench_shape_base.Block2D.time_block2d((256, 256), 'uint32', (4, 4))
-      42.1±0.7μs       36.9±0.4μs     0.87  bench_shape_base.Block2D.time_block2d((128, 128), 'uint8', (4, 4))
-        42.2±1μs       36.4±0.7μs     0.86  bench_shape_base.Block2D.time_block2d((256, 256), 'uint32', (2, 2))
-      52.5±0.3μs       45.3±0.4μs     0.86  bench_shape_base.Block2D.time_block2d((128, 128), 'uint64', (4, 4))
-      48.1±0.4μs       41.1±0.3μs     0.86  bench_shape_base.Block2D.time_block2d((256, 256), 'uint8', (4, 4))
-      46.9±0.3μs       39.7±0.3μs     0.85  bench_shape_base.Block2D.time_block2d((128, 128), 'uint32', (4, 4))
-      22.1±0.2μs       18.6±0.4μs     0.84  bench_shape_base.Block2D.time_block2d((128, 128), 'uint32', (2, 2))
-      37.3±0.2μs       31.4±0.7μs     0.84  bench_shape_base.Block2D.time_block2d((16, 16), 'uint8', (4, 4))
-      23.9±0.3μs       20.2±0.1μs     0.84  bench_shape_base.Block2D.time_block2d((256, 256), 'uint8', (2, 2))
-      38.0±0.5μs       32.0±0.5μs     0.84  bench_shape_base.Block2D.time_block2d((32, 32), 'uint8', (4, 4))
-      42.6±0.4μs       35.9±0.4μs     0.84  bench_shape_base.Block2D.time_block2d((64, 64), 'uint64', (4, 4))
-      43.4±0.4μs       36.3±0.3μs     0.84  bench_shape_base.Block2D.time_block2d((128, 128), 'uint16', (4, 4))
-      40.8±0.2μs       34.2±0.4μs     0.84  bench_shape_base.Block2D.time_block2d((64, 64), 'uint32', (4, 4))
-      19.3±0.2μs       16.0±0.2μs     0.83  bench_shape_base.Block2D.time_block2d((128, 128), 'uint16', (2, 2))
-      17.5±0.1μs       14.5±0.1μs     0.83  bench_shape_base.Block2D.time_block2d((128, 128), 'uint8', (2, 2))
-      40.3±0.6μs       33.4±0.2μs     0.83  bench_shape_base.Block2D.time_block2d((64, 64), 'uint16', (4, 4))
-      37.8±0.2μs       31.2±0.3μs     0.83  bench_shape_base.Block2D.time_block2d((16, 16), 'uint16', (4, 4))
-      17.8±0.4μs       14.7±0.2μs     0.83  bench_shape_base.Block2D.time_block2d((64, 64), 'uint32', (2, 2))
-      18.6±0.2μs       15.3±0.1μs     0.83  bench_shape_base.Block2D.time_block2d((64, 64), 'uint64', (2, 2))
-      39.1±0.9μs       31.9±0.5μs     0.82  bench_shape_base.Block2D.time_block2d((16, 16), 'uint64', (4, 4))
-      15.8±0.4μs       12.9±0.1μs     0.81  bench_shape_base.Block2D.time_block2d((32, 32), 'uint32', (2, 2))
-      14.8±0.4μs       12.0±0.3μs     0.81  bench_shape_base.Block2D.time_block2d((16, 16), 'uint8', (2, 2))
-      16.3±0.2μs       13.2±0.4μs     0.81  bench_shape_base.Block2D.time_block2d((32, 32), 'uint16', (2, 2))
-     15.9±0.05μs      12.8±0.09μs     0.80  bench_shape_base.Block2D.time_block2d((64, 64), 'uint8', (2, 2))

Pinging @eric-wieser who helped me with #11971
and @jakirkham and @j-towns who contributed to np.block in #9667.

eric-wieser · 2018-09-20T08:27:20Z

This mostly just undoes the speed regression caused by #11910, I think

numpy/core/shape_base.py

eric-wieser · 2018-09-20T08:29:07Z

@mattip, thoughts? I'm not sure how I feel about micro-optimizations like this that come at varying clarity costs.

mattip · 2018-09-20T14:53:33Z

It seems good to me, removing that recursive closure is +1 for clarity, and other than that it just seems to juggle the same functions that were called before as far as I am concerned. I am personally not a big advocate of one-line functions, they break up the flow for me, so I would prefer to inline _atleast_nd (the comment should remain in any case), but not critical. Maybe someone will find some use for it in the future.

eric-wieser · 2018-09-20T16:59:46Z

@hmaarrfk: Regarding benchmarks - it's usually valuable to make them in a separate PR, so that it can be merged first, and given time to show up in https://pv.github.io/numpy-bench/#/summarylist

hmaarrfk · 2018-09-20T18:23:52Z

This mostly just undoes the speed regression caused by #11910, I think

@eric-wieser You are right that it was a regression caused by that commit. It is pretty clear from the ASV history.
https://pv.github.io/numpy-bench/#bench_shape_base.Block.time_nested?p-size=1&p-size=10&y-axis-scale=zoom

ASV is pretty cool!

ahaldane

I like it. No more magic involving @recursive, code is still clear.

mattip · 2018-10-01T18:24:28Z

Thanks @hmaarrfk

hmaarrfk mentioned this pull request Sep 19, 2018

MAINT: Block algorithm with a single copy per call to block #11971

Merged

5 tasks

hmaarrfk force-pushed the block_optimize_order branch from 91391c1 to 1af752a Compare September 19, 2018 15:47

eric-wieser reviewed Sep 20, 2018

View reviewed changes

numpy/core/shape_base.py Outdated Show resolved Hide resolved

hmaarrfk force-pushed the block_optimize_order branch from 1af752a to eedd445 Compare September 20, 2018 12:20

hmaarrfk added 2 commits September 20, 2018 13:20

BENCH: Add a relatable 2D benchmark for block

2ccfc32

MAINT: speed up _block by avoiding a recursive closure

9779e69

hmaarrfk force-pushed the block_optimize_order branch from eedd445 to 9779e69 Compare September 20, 2018 17:20

hmaarrfk mentioned this pull request Sep 20, 2018

BENCH: Add a relatable 2D benchmark for block #12001

Closed

charris added 03 - Maintenance component: numpy._core labels Sep 22, 2018

ahaldane approved these changes Sep 30, 2018

View reviewed changes

mattip merged commit 7511d0b into numpy:master Oct 1, 2018

hmaarrfk deleted the block_optimize_order branch November 5, 2018 03:20

charris changed the title ~~MAINT: speed up _block by avoiding a recursive closure~~ MAINT: speed up _block by avoiding a recursive closure Nov 10, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

MAINT: speed up `_block` by avoiding a recursive closure #11991

MAINT: speed up `_block` by avoiding a recursive closure #11991

Uh oh!

hmaarrfk commented Sep 19, 2018 •

edited

Loading

Uh oh!

eric-wieser commented Sep 20, 2018

Uh oh!

Uh oh!

eric-wieser commented Sep 20, 2018

Uh oh!

mattip commented Sep 20, 2018

Uh oh!

eric-wieser commented Sep 20, 2018

Uh oh!

hmaarrfk commented Sep 20, 2018

Uh oh!

ahaldane left a comment

Uh oh!

mattip commented Oct 1, 2018

Uh oh!

Uh oh!

Uh oh!

MAINT: speed up _block by avoiding a recursive closure #11991

MAINT: speed up _block by avoiding a recursive closure #11991

Uh oh!

Conversation

hmaarrfk commented Sep 19, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eric-wieser commented Sep 20, 2018

Uh oh!

Uh oh!

eric-wieser commented Sep 20, 2018

Uh oh!

mattip commented Sep 20, 2018

Uh oh!

eric-wieser commented Sep 20, 2018

Uh oh!

hmaarrfk commented Sep 20, 2018

Uh oh!

ahaldane left a comment

Choose a reason for hiding this comment

Uh oh!

mattip commented Oct 1, 2018

Uh oh!

Uh oh!

MAINT: speed up `_block` by avoiding a recursive closure #11991

MAINT: speed up `_block` by avoiding a recursive closure #11991

hmaarrfk commented Sep 19, 2018 •

edited

Loading