Skip to content

MAINT: speed up _block by avoiding a recursive closure #11991

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 1, 2018

Conversation

hmaarrfk
Copy link
Contributor

@hmaarrfk hmaarrfk commented Sep 19, 2018

This is a small PR that speeds up the case of blocking small arrays by 10-20% in CPython 3.6 by avoiding closures.

This only makes PR #11971 a harder sell, but I think these optimizations are worthwhile for people blocking small arrays.

Benchmarks Python 3.6 conda(-forge?)
       before           after         ratio
     [d126ff7a]       [535d337d]
     <master>         <block_optimize_order>
-      32.8±0.3μs       28.6±0.4μs     0.87  bench_shape_base.Block2D.time_block2d((128, 128), 'uint64', (2, 2))
-        67.7±4μs       57.4±0.5μs     0.85  bench_shape_base.Block2D.time_block2d((256, 256), 'uint16', (4, 4))
-      27.9±0.1μs       23.4±0.2μs     0.84  bench_shape_base.Block2D.time_block2d((256, 256), 'uint8', (2, 2))
-        60.3±4μs       50.7±0.9μs     0.84  bench_shape_base.Block2D.time_block2d((256, 256), 'uint8', (4, 4))
-      57.9±0.6μs       48.6±0.7μs     0.84  bench_shape_base.Block2D.time_block2d((128, 128), 'uint32', (4, 4))
-        39.1±1μs         32.7±1μs     0.84  bench_shape_base.Block.time_block_complicated(10)
-      50.6±0.4μs         42.2±1μs     0.83  bench_shape_base.Block2D.time_block2d((64, 64), 'uint16', (4, 4))
-      35.7±0.8μs       29.6±0.4μs     0.83  bench_shape_base.Block.time_no_lists(100)
-      53.9±0.4μs       44.6±0.3μs     0.83  bench_shape_base.Block2D.time_block2d((128, 128), 'uint16', (4, 4))
-        65.1±2μs       53.7±0.6μs     0.82  bench_shape_base.Block2D.time_block2d((128, 128), 'uint64', (4, 4))
-        23.3±1μs       19.1±0.1μs     0.82  bench_shape_base.Block2D.time_block2d((64, 64), 'uint64', (2, 2))
-      53.6±0.6μs       43.7±0.4μs     0.82  bench_shape_base.Block2D.time_block2d((64, 64), 'uint64', (4, 4))
-      27.5±0.5μs       22.4±0.2μs     0.81  bench_shape_base.Block2D.time_block2d((128, 128), 'uint32', (2, 2))
-      41.0±0.2μs       33.2±0.3μs     0.81  bench_shape_base.Block.time_3d(1)
-        35.4±1μs       28.6±0.3μs     0.81  bench_shape_base.Block2D.time_block2d((256, 256), 'uint16', (2, 2))
-      22.1±0.5μs       17.6±0.2μs     0.80  bench_shape_base.Block2D.time_block2d((64, 64), 'uint32', (2, 2))
-        49.8±1μs       39.4±0.4μs     0.79  bench_shape_base.Block2D.time_block2d((32, 32), 'uint8', (4, 4))
-      11.7±0.5μs       9.23±0.4μs     0.79  bench_shape_base.Block.time_no_lists(10)
-        52.3±3μs       40.8±0.3μs     0.78  bench_shape_base.Block2D.time_block2d((32, 32), 'uint64', (4, 4))
-      17.5±0.5μs       13.6±0.2μs     0.78  bench_shape_base.Block.time_block_simple_column_wise(10)
-        37.3±1μs       28.9±0.2μs     0.77  bench_shape_base.Block.time_block_complicated(1)
-        23.1±1μs       17.8±0.2μs     0.77  bench_shape_base.Block2D.time_block2d((128, 128), 'uint8', (2, 2))
-      21.7±0.8μs       16.7±0.1μs     0.77  bench_shape_base.Block2D.time_block2d((64, 64), 'uint16', (2, 2))
-      11.1±0.5μs       8.48±0.1μs     0.76  bench_shape_base.Block.time_no_lists(1)
-      21.2±0.7μs      16.1±0.08μs     0.76  bench_shape_base.Block2D.time_block2d((32, 32), 'uint32', (2, 2))
-        55.5±4μs       42.1±0.5μs     0.76  bench_shape_base.Block2D.time_block2d((64, 64), 'uint32', (4, 4))
-        56.4±1μs       42.6±0.1μs     0.75  bench_shape_base.Block2D.time_block2d((128, 128), 'uint8', (4, 4))
-      26.1±0.7μs       19.7±0.2μs     0.75  bench_shape_base.Block2D.time_block2d((128, 128), 'uint16', (2, 2))
-      21.0±0.3μs       15.8±0.3μs     0.75  bench_shape_base.Block2D.time_block2d((16, 16), 'uint32', (2, 2))
-        53.2±1μs       39.7±0.7μs     0.75  bench_shape_base.Block2D.time_block2d((16, 16), 'uint32', (4, 4))
-        52.4±2μs       39.1±0.4μs     0.75  bench_shape_base.Block2D.time_block2d((16, 16), 'uint8', (4, 4))
-      21.5±0.1μs       15.9±0.2μs     0.74  bench_shape_base.Block2D.time_block2d((16, 16), 'uint64', (2, 2))
-      22.1±0.5μs       16.3±0.1μs     0.74  bench_shape_base.Block2D.time_block2d((64, 64), 'uint8', (2, 2))
-        53.4±2μs       39.4±0.3μs     0.74  bench_shape_base.Block2D.time_block2d((32, 32), 'uint16', (4, 4))
-      10.5±0.2μs       7.71±0.1μs     0.73  bench_shape_base.Block.time_block_simple_row_wise(10)
-      21.0±0.2μs       15.2±0.1μs     0.73  bench_shape_base.Block2D.time_block2d((16, 16), 'uint8', (2, 2))
-        54.4±2μs       38.9±0.4μs     0.72  bench_shape_base.Block2D.time_block2d((16, 16), 'uint16', (4, 4))
-        57.1±1μs       40.8±0.4μs     0.71  bench_shape_base.Block2D.time_block2d((64, 64), 'uint8', (4, 4))
-        55.6±3μs       39.7±0.4μs     0.71  bench_shape_base.Block2D.time_block2d((16, 16), 'uint64', (4, 4))
-      22.2±0.4μs      15.8±0.09μs     0.71  bench_shape_base.Block2D.time_block2d((32, 32), 'uint16', (2, 2))
-      23.1±0.2μs       16.4±0.1μs     0.71  bench_shape_base.Block2D.time_block2d((32, 32), 'uint64', (2, 2))
-      22.2±0.7μs       15.7±0.1μs     0.70  bench_shape_base.Block2D.time_block2d((32, 32), 'uint8', (2, 2))
-        22.2±2μs       15.4±0.6μs     0.69  bench_shape_base.Block2D.time_block2d((16, 16), 'uint16', (2, 2))
-        58.1±5μs       40.1±0.4μs     0.69  bench_shape_base.Block2D.time_block2d((32, 32), 'uint32', (4, 4))
Benchmarks python 3.5 conda(-forge?)

       before           after         ratio
     [d126ff7a]       [91391c1a]
     <master>         <block_optimize_order>
-       59.6±10μs       47.4±0.8μs     0.79  bench_shape_base.Block2D.time_block2d((512, 512), 'uint8', (2, 2))
-        78.1±1μs         60.6±2μs     0.78  bench_shape_base.Block2D.time_block2d((256, 256), 'uint16', (4, 4))
-        57.7±5μs         44.2±1μs     0.77  bench_shape_base.Block2D.time_block2d((32, 32), 'uint16', (4, 4))
-        57.7±3μs         43.8±1μs     0.76  bench_shape_base.Block2D.time_block2d((64, 64), 'uint8', (4, 4))
-        55.5±3μs       42.1±0.6μs     0.76  bench_shape_base.Block2D.time_block2d((16, 16), 'uint32', (4, 4))
-        63.0±6μs         46.0±4μs     0.73  bench_shape_base.Block2D.time_block2d((64, 64), 'uint16', (4, 4))
-        27.9±4μs         20.2±1μs     0.72  bench_shape_base.Block2D.time_block2d((64, 64), 'uint64', (2, 2))
-        111±20μs         79.7±2μs     0.72  bench_shape_base.Block2D.time_block2d((512, 512), 'uint8', (4, 4))
-        59.1±1μs       41.4±0.7μs     0.70  bench_shape_base.Block2D.time_block2d((16, 16), 'uint16', (4, 4))
-        71.3±5μs         49.6±2μs     0.70  bench_shape_base.Block2D.time_block2d((128, 128), 'uint8', (4, 4))
-        21.5±3μs      14.4±0.07μs     0.67  bench_shape_base.Block2D.time_block2d((16, 16), 'uint8', (2, 2))
-        66.3±3μs         44.1±1μs     0.67  bench_shape_base.Block2D.time_block2d((64, 64), 'uint64', (4, 4))
-        63.5±7μs       41.7±0.3μs     0.66  bench_shape_base.Block2D.time_block2d((64, 64), 'uint32', (4, 4))
Benchmarks Python 2.7 conda(-forge)
       before           after         ratio
     [d126ff7a]       [91391c1a]
     <master>         <block_optimize_order>
-      28.8±0.4μs       25.4±0.2μs     0.88  bench_shape_base.Block2D.time_block2d((256, 256), 'uint16', (2, 2))
-        68.8±2μs         60.2±1μs     0.87  bench_shape_base.Block2D.time_block2d((256, 256), 'uint32', (4, 4))
-      42.1±0.7μs       36.9±0.4μs     0.87  bench_shape_base.Block2D.time_block2d((128, 128), 'uint8', (4, 4))
-        42.2±1μs       36.4±0.7μs     0.86  bench_shape_base.Block2D.time_block2d((256, 256), 'uint32', (2, 2))
-      52.5±0.3μs       45.3±0.4μs     0.86  bench_shape_base.Block2D.time_block2d((128, 128), 'uint64', (4, 4))
-      48.1±0.4μs       41.1±0.3μs     0.86  bench_shape_base.Block2D.time_block2d((256, 256), 'uint8', (4, 4))
-      46.9±0.3μs       39.7±0.3μs     0.85  bench_shape_base.Block2D.time_block2d((128, 128), 'uint32', (4, 4))
-      22.1±0.2μs       18.6±0.4μs     0.84  bench_shape_base.Block2D.time_block2d((128, 128), 'uint32', (2, 2))
-      37.3±0.2μs       31.4±0.7μs     0.84  bench_shape_base.Block2D.time_block2d((16, 16), 'uint8', (4, 4))
-      23.9±0.3μs       20.2±0.1μs     0.84  bench_shape_base.Block2D.time_block2d((256, 256), 'uint8', (2, 2))
-      38.0±0.5μs       32.0±0.5μs     0.84  bench_shape_base.Block2D.time_block2d((32, 32), 'uint8', (4, 4))
-      42.6±0.4μs       35.9±0.4μs     0.84  bench_shape_base.Block2D.time_block2d((64, 64), 'uint64', (4, 4))
-      43.4±0.4μs       36.3±0.3μs     0.84  bench_shape_base.Block2D.time_block2d((128, 128), 'uint16', (4, 4))
-      40.8±0.2μs       34.2±0.4μs     0.84  bench_shape_base.Block2D.time_block2d((64, 64), 'uint32', (4, 4))
-      19.3±0.2μs       16.0±0.2μs     0.83  bench_shape_base.Block2D.time_block2d((128, 128), 'uint16', (2, 2))
-      17.5±0.1μs       14.5±0.1μs     0.83  bench_shape_base.Block2D.time_block2d((128, 128), 'uint8', (2, 2))
-      40.3±0.6μs       33.4±0.2μs     0.83  bench_shape_base.Block2D.time_block2d((64, 64), 'uint16', (4, 4))
-      37.8±0.2μs       31.2±0.3μs     0.83  bench_shape_base.Block2D.time_block2d((16, 16), 'uint16', (4, 4))
-      17.8±0.4μs       14.7±0.2μs     0.83  bench_shape_base.Block2D.time_block2d((64, 64), 'uint32', (2, 2))
-      18.6±0.2μs       15.3±0.1μs     0.83  bench_shape_base.Block2D.time_block2d((64, 64), 'uint64', (2, 2))
-      39.1±0.9μs       31.9±0.5μs     0.82  bench_shape_base.Block2D.time_block2d((16, 16), 'uint64', (4, 4))
-      15.8±0.4μs       12.9±0.1μs     0.81  bench_shape_base.Block2D.time_block2d((32, 32), 'uint32', (2, 2))
-      14.8±0.4μs       12.0±0.3μs     0.81  bench_shape_base.Block2D.time_block2d((16, 16), 'uint8', (2, 2))
-      16.3±0.2μs       13.2±0.4μs     0.81  bench_shape_base.Block2D.time_block2d((32, 32), 'uint16', (2, 2))
-     15.9±0.05μs      12.8±0.09μs     0.80  bench_shape_base.Block2D.time_block2d((64, 64), 'uint8', (2, 2))

Pinging @eric-wieser who helped me with #11971
and @jakirkham and @j-towns who contributed to np.block in #9667.

@eric-wieser
Copy link
Member

This mostly just undoes the speed regression caused by #11910, I think

@eric-wieser
Copy link
Member

@mattip, thoughts? I'm not sure how I feel about micro-optimizations like this that come at varying clarity costs.

@hmaarrfk hmaarrfk force-pushed the block_optimize_order branch from 1af752a to eedd445 Compare September 20, 2018 12:20
@mattip
Copy link
Member

mattip commented Sep 20, 2018

It seems good to me, removing that recursive closure is +1 for clarity, and other than that it just seems to juggle the same functions that were called before as far as I am concerned. I am personally not a big advocate of one-line functions, they break up the flow for me, so I would prefer to inline _atleast_nd (the comment should remain in any case), but not critical. Maybe someone will find some use for it in the future.

@eric-wieser
Copy link
Member

@hmaarrfk: Regarding benchmarks - it's usually valuable to make them in a separate PR, so that it can be merged first, and given time to show up in https://pv.github.io/numpy-bench/#/summarylist

@hmaarrfk
Copy link
Contributor Author

This mostly just undoes the speed regression caused by #11910, I think

@eric-wieser You are right that it was a regression caused by that commit. It is pretty clear from the ASV history.
https://pv.github.io/numpy-bench/#bench_shape_base.Block.time_nested?p-size=1&p-size=10&y-axis-scale=zoom

ASV is pretty cool!

Copy link
Member

@ahaldane ahaldane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it. No more magic involving @recursive, code is still clear.

@mattip mattip merged commit 7511d0b into numpy:master Oct 1, 2018
@mattip
Copy link
Member

mattip commented Oct 1, 2018

Thanks @hmaarrfk

@hmaarrfk hmaarrfk deleted the block_optimize_order branch November 5, 2018 03:20
@charris charris changed the title MAINT: speed up _block by avoiding a recursive closure MAINT: speed up _block by avoiding a recursive closure Nov 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants