
MAINT: Block algorithm with a single copy per call to block #11971


Merged: 5 commits into numpy:master from block_single_concatenate_call, Oct 24, 2018

Conversation

@hmaarrfk (Contributor) commented on Sep 17, 2018

`block` used to make repeated calls to `concatenate`. For a 3D block, this copied the arrays 3 times, once per dimension of the block; that is 2 copies too many.

The proposed algorithm performs the blocking with a single memory copy.

Wisdom allows numpy to guess which algorithm performs better and to choose it automatically. We could add a switch to let the user decide (a hypothetical sketch follows).
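A hypothetical sketch of such a switch; the threshold value and all names here are made up for illustration and are not numpy's actual heuristic:

```python
SMALL_THRESHOLD = 512  # elements; illustrative value only

def block_dispatch(arrays, total_size, concatenate_path, slicing_path):
    # Small inputs: the repeated-concatenate path has lower overhead.
    if total_size < SMALL_THRESHOLD:
        return concatenate_path(arrays)
    # Large inputs: the single-copy slicing path wins.
    return slicing_path(arrays)
```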

TODO:

XREF: Other issues I found regarding concatenate that we might want to test for

The general steps are (a minimal sketch follows the list):

  1. Sanitize the input.
  2. Cast each block as a numpy array of the final dimension.
  3. Calculate the shape of each block.
  4. Concatenate the shapes.
  5. Keep track of the slice of a particular block within the larger whole.
  6. Combine the dtypes to find the dtype of the result.
  7. Create the new array.
  8. Copy the blocks in.
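A minimal sketch of these steps for the simple case of a 2-D list of 2-D arrays; the name `_block_2d` is hypothetical, and the real implementation recurses over arbitrary nesting:

```python
import numpy as np

def _block_2d(rows):
    # Steps 1-2: sanitize, casting every block to an ndarray.
    rows = [[np.asanyarray(a) for a in row] for row in rows]
    # Steps 3-4: gather the shapes; row heights and column widths.
    heights = [row[0].shape[0] for row in rows]
    widths = [a.shape[1] for a in rows[0]]
    # Step 5: track where each block lands via cumulative offsets.
    row_offsets = np.concatenate([[0], np.cumsum(heights)])
    col_offsets = np.concatenate([[0], np.cumsum(widths)])
    # Step 6: combine the dtypes to find the result dtype.
    dtype = np.result_type(*[a.dtype for row in rows for a in row])
    # Step 7: create the new array.
    result = np.empty((row_offsets[-1], col_offsets[-1]), dtype=dtype)
    # Step 8: copy each block in, exactly once.
    for i, row in enumerate(rows):
        for j, a in enumerate(row):
            result[row_offsets[i]:row_offsets[i + 1],
                   col_offsets[j]:col_offsets[j + 1]] = a
    return result

a = np.ones((2, 2))
b = np.zeros((2, 3))
_block_2d([[a, b], [a, b]])  # 4x5 result, one copy per block
```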

A benchmark was added to estimate the 2D blocking performance in a straightforward manner.

Approach details

2D results

For small arrays, this approach suffers from large overhead. Benchmarks show that "small arrays" here means anything up to 256x256 on a modern i7 from 2017.

The performance of master is optimized in PR #11991, which shaves the overhead for a 2x2 block from 20 μs to 15 μs. Benchmarks in this section are compared against that PR. The overhead of this method is 30 μs when blocking a small 2x2 array.

My general sense is that small arrays are pretty important. While the concatenate approach was "wasteful" for large arrays, it was quick for small arrays. There is probably room for both approaches.

Benchmark results compared to PR #11991
       before           after         ratio
     [1af752aa]       [0b1d866f]
     <block_optimize_order>       <block_single_assignment>
+      15.6±0.3μs       30.0±0.8μs     1.92  bench_shape_base.Block2D.time_block2d((16, 16), 'uint8', (2, 2))
+     15.8±0.08μs         30.2±1μs     1.91  bench_shape_base.Block2D.time_block2d((16, 16), 'uint32', (2, 2))
+      39.4±0.2μs         73.1±2μs     1.85  bench_shape_base.Block2D.time_block2d((32, 32), 'uint8', (4, 4))
+      16.9±0.3μs         31.1±1μs     1.84  bench_shape_base.Block2D.time_block2d((32, 32), 'uint64', (2, 2))
+      16.8±0.3μs       31.0±0.9μs     1.84  bench_shape_base.Block2D.time_block2d((32, 32), 'uint32', (2, 2))
+      40.3±0.3μs       73.7±0.9μs     1.83  bench_shape_base.Block2D.time_block2d((32, 32), 'uint16', (4, 4))
+      16.4±0.6μs       29.8±0.7μs     1.82  bench_shape_base.Block2D.time_block2d((32, 32), 'uint8', (2, 2))
+      16.9±0.3μs       30.8±0.9μs     1.82  bench_shape_base.Block2D.time_block2d((32, 32), 'uint16', (2, 2))
+      39.6±0.7μs       71.8±0.9μs     1.81  bench_shape_base.Block2D.time_block2d((16, 16), 'uint8', (4, 4))
+      16.2±0.3μs       29.3±0.5μs     1.81  bench_shape_base.Block2D.time_block2d((16, 16), 'uint16', (2, 2))
+      16.6±0.1μs       30.0±0.3μs     1.80  bench_shape_base.Block2D.time_block2d((64, 64), 'uint8', (2, 2))
+      41.4±0.3μs         74.5±2μs     1.80  bench_shape_base.Block2D.time_block2d((16, 16), 'uint32', (4, 4))
+      42.3±0.7μs         75.7±2μs     1.79  bench_shape_base.Block2D.time_block2d((32, 32), 'uint64', (4, 4))
+      16.6±0.6μs       29.7±0.4μs     1.79  bench_shape_base.Block2D.time_block2d((16, 16), 'uint64', (2, 2))
+      18.1±0.1μs       32.3±0.8μs     1.79  bench_shape_base.Block2D.time_block2d((64, 64), 'uint32', (2, 2))
+      17.3±0.6μs       30.9±0.3μs     1.78  bench_shape_base.Block2D.time_block2d((64, 64), 'uint16', (2, 2))
+      41.7±0.6μs         74.2±2μs     1.78  bench_shape_base.Block2D.time_block2d((64, 64), 'uint16', (4, 4))
+      40.8±0.9μs       72.0±0.7μs     1.77  bench_shape_base.Block2D.time_block2d((16, 16), 'uint16', (4, 4))
+      42.7±0.4μs       75.0±0.3μs     1.76  bench_shape_base.Block2D.time_block2d((64, 64), 'uint32', (4, 4))
+        41.9±1μs         72.9±2μs     1.74  bench_shape_base.Block2D.time_block2d((16, 16), 'uint64', (4, 4))
+        41.9±1μs       72.8±0.5μs     1.74  bench_shape_base.Block2D.time_block2d((32, 32), 'uint32', (4, 4))
+        46.0±1μs         79.8±2μs     1.73  bench_shape_base.Block2D.time_block2d((128, 128), 'uint16', (4, 4))
+      18.2±0.3μs       31.4±0.7μs     1.72  bench_shape_base.Block2D.time_block2d((128, 128), 'uint8', (2, 2))
+      44.1±0.9μs       75.1±0.8μs     1.70  bench_shape_base.Block2D.time_block2d((128, 128), 'uint8', (4, 4))
+        45.1±1μs         75.7±2μs     1.68  bench_shape_base.Block2D.time_block2d((64, 64), 'uint64', (4, 4))
+        44.0±1μs         73.4±2μs     1.67  bench_shape_base.Block2D.time_block2d((64, 64), 'uint8', (4, 4))
+      49.4±0.3μs       79.3±0.4μs     1.60  bench_shape_base.Block2D.time_block2d((256, 256), 'uint8', (4, 4))
+      20.6±0.4μs       32.9±0.4μs     1.60  bench_shape_base.Block2D.time_block2d((128, 128), 'uint16', (2, 2))
+      19.7±0.5μs       31.4±0.2μs     1.59  bench_shape_base.Block2D.time_block2d((64, 64), 'uint64', (2, 2))
+        50.0±2μs         78.9±3μs     1.58  bench_shape_base.Block2D.time_block2d((128, 128), 'uint32', (4, 4))
+      54.3±0.2μs         81.4±1μs     1.50  bench_shape_base.Block2D.time_block2d((128, 128), 'uint64', (4, 4))
+      23.5±0.6μs         34.6±1μs     1.47  bench_shape_base.Block2D.time_block2d((128, 128), 'uint32', (2, 2))
+        57.8±1μs         85.0±2μs     1.47  bench_shape_base.Block2D.time_block2d((256, 256), 'uint16', (4, 4))
+      24.5±0.7μs       35.4±0.4μs     1.45  bench_shape_base.Block2D.time_block2d((256, 256), 'uint8', (2, 2))
+      28.3±0.3μs         38.4±1μs     1.36  bench_shape_base.Block2D.time_block2d((128, 128), 'uint64', (2, 2))
+      29.8±0.2μs         40.2±1μs     1.35  bench_shape_base.Block2D.time_block2d((256, 256), 'uint16', (2, 2))
-       175±200μs          150±4μs     0.85  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint8', (4, 4))
-      72.0±100μs       59.6±0.8μs     0.83  bench_shape_base.Block2D.time_block2d((512, 512), 'uint16', (2, 2))
-      70.8±100μs       56.9±0.1μs     0.80  bench_shape_base.Block2D.time_block2d((256, 256), 'uint64', (2, 2))
-       167±200μs          133±1μs     0.80  bench_shape_base.Block2D.time_block2d((512, 512), 'uint32', (4, 4))
-       132±200μs         89.4±3μs     0.68  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint8', (2, 2))
-       131±200μs       82.4±0.8μs     0.63  bench_shape_base.Block2D.time_block2d((512, 512), 'uint32', (2, 2))
-       377±500μs        191±0.8μs     0.51  bench_shape_base.Block2D.time_block2d((512, 512), 'uint64', (4, 4))
-       403±500μs          203±4μs     0.51  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint16', (4, 4))
-      1.20±0.8ms         472±20μs     0.39  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint32', (4, 4))
-        6.14±2ms         985±20μs     0.16  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint64', (4, 4))
-        6.28±2ms         983±50μs     0.16  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint64', (2, 2))
-        2.94±1ms         360±20μs     0.12  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint32', (2, 2))
-      1.27±0.5ms          146±3μs     0.12  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint16', (2, 2))
-      1.32±0.5ms          126±4μs     0.10  bench_shape_base.Block2D.time_block2d((512, 512), 'uint64', (2, 2))

A few small micro optimizations

This PR builds on top of code in numpy that is already quite optimized, so getting the new path's performance close to it is hard, maybe impossible.

List comprehensions are really expensive. Unfortunately, blocking allows for really fancy block shapes, which means we can't do much but use them in Python.

A few things can be optimized (a short sketch follows the list):

  1. Generators are slow. Thanks @eric-wieser for teaching me about that.
  2. Empty lists are falsy. You can use this to test for unlikely cases with a list comprehension.
  3. Creating tuples is expensive. Create them once, reuse them.
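A hedged illustration of those three tricks; the data and variable names are made up:

```python
shapes = [(2, 3), (4, 3), (5, 3)]

# 1. A list comprehension beats the equivalent generator expression here,
#    because generators pay per-item frame-switching overhead.
mismatched = [s for s in shapes if s[1:] != shapes[0][1:]]

# 2. An empty list is falsy, so the unlikely error path costs only a
#    truthiness check.
if mismatched:
    raise ValueError('Mismatched array shapes in block.')

# 3. Build a tuple once and reuse it rather than recreating it in a loop.
zeros = (0,) * len(shapes[0])
```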

I'm unsure how these optimizations affect performance on PyPy.

3D results:

For large arrays, benchmark #11965 shows that there is now little room for improvement compared to a straight concatenate operation. This is expected, since copying the arrays probably dominates the cost of generating the slices.

The benchmarks show results for a final array of size `(5n)^3`. For `n=10`, the final array therefore has 125,000 elements of 8 bytes each (int64), so close to 1 MB; `n=100` is closer to 1 GB.
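A quick arithmetic check of those sizes:

```python
for n in (10, 100):
    elements = (5 * n) ** 3
    print(n, elements, elements * 8 / 1e6, 'MB')
# 10  ->    125000 elements,    1.0 MB
# 100 -> 125000000 elements, 1000.0 MB (~1 GB)
```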

New results

===== ========== =============
--              mode          
----- ------------------------
  n     block     concatenate 
===== ========== =============
  1    74.0±2μs    2.03±0.1μs 
  10   150±6μs      37.6±1μs  
 100   366±7ms      393±10ms  
===== ========== =============

(I'm pretty convinced that the difference in time between block and concatenate for n=100 is artificial, though I'm not sure how to prove it.)

Improvement compared to the old results

asv continuous -E conda:3.6 -b Block.time_3d master block_single_concatenate_call 
       before           after         ratio
     [b5f56572]       [a2b4b356]
     <master>         <block_single_concatenate_call>
+      40.7±0.2μs         74.0±2μs     1.82  bench_shape_base.Block.time_3d(1, 'block')
-      1.10±0.02s          366±7ms     0.33  bench_shape_base.Block.time_3d(100, 'block')
-        681±20μs          150±6μs     0.22  bench_shape_base.Block.time_3d(10, 'block')

Extensions

Unblocking

@eric-wieser discussed the possibility of using this algorithm to unblock arrays. This is rather straightforward to implement once we lock down the implementation. Details are in the comments below.

User friendly options

Finally, this probably lends itself well to offering options such as `order`, which would let the user choose the memory order of the final array, or even an `out` parameter to let them supply the result array (a hypothetical sketch follows).
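A hypothetical sketch of what that interface could look like; `block_with_options` and its parameters are made up, and `pairs` stands for the (slices, array) bookkeeping the algorithm already computes internally:

```python
import numpy as np

def block_with_options(shape, dtype, pairs, order='C', out=None):
    # 'order' and 'out' are the hypothetical user-facing options.
    if out is None:
        out = np.empty(shape, dtype=dtype, order=order)
    for slices, arr in pairs:
        out[slices] = arr
    return out
```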

Closures are avoided entirely, so I think that gh-10620 is not an issue.

@eric-wieser (Member) commented on Sep 17, 2018

I had this in mind a while ago, and was hoping someone else would give it a go :)

This algorithm opens up the option of an "unblock" operation, which does exactly the same thing, but with the assignment in the other direction. Something to explore later.
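A minimal, hypothetical illustration of that reversal, with made-up `pairs` standing in for the algorithm's (slices, array) bookkeeping:

```python
import numpy as np

result = np.arange(6).reshape(2, 3)
a = np.empty((2, 2), dtype=result.dtype)
b = np.empty((2, 1), dtype=result.dtype)
pairs = [((slice(0, 2), slice(0, 2)), a),
         ((slice(0, 2), slice(2, 3)), b)]
for slices, arr in pairs:
    arr[...] = result[slices]  # block() would do: result[slices] = arr
```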

@eric-wieser (Member)

I'm afraid I introduced a conflict by merging #11910. The upshot is you could use closures again, if they help

@hmaarrfk force-pushed the block_single_concatenate_call branch from 5ecfe33 to 2a3bd7f on September 17, 2018 06:14
@hmaarrfk (Contributor Author)

> I'm afraid I introduced a conflict by merging #11910. The upshot is you could use closures again, if they help

No worries.

def _shape(array, result_ndim):
    # array is either a scalar or an array
    # type(arrays) is not list
    # if it is a scalar, tell it we have shape==(1,)
Member:

This comment looks wrong.

Perhaps a better function would be shape_with_ndim(shape, ndim) with description "pad a shape with leading 1 to make it be of dimension ndim". I'm not super happy with that function name, but I think it would be clearer if it operated on a shape rather than an array.

Contributor Author:

Thanks!

Member:

See below - you can drop this in favor of atleast_nd

shape[concat_depth+1:] != shapes[0][concat_depth+1:]):
raise ValueError('Mismatched array shapes in block.')
shape_on_dim = sum(shape[concat_depth] for shape in shapes)
shape = list(shapes[0])
Member:

The goal here is to make a copy, I assume? Would be nice to indicate that.

I think I'd prefer

first_shape = shapes[0]
...
shape = first_shape[:concat_depth] + (shape_on_dim,) + first_shape[concat_depth+1:]

because shapes are typically tuples

Contributor Author:

Let me know if you like the improvement.

for arr in arrays]
else:
# We've 'bottomed out'
return _nx.asanyarray(arrays)
@eric-wieser (Member) on Sep 17, 2018:

I think you might as well just use atleast_nd(arrays, result_ndim) here, which also saves the need for your _shape function (it becomes just arr.shape)

Contributor Author:

Yeah you are right. I kinda didn't want to cast everything to an ndarray at first, but then I realized that it was in the tests so I ended up using it.

return block_recursion(arrays)

def _asanyarray_recursion(arrays, list_ndim, depth=0):
"""Convert all inputs to arrays."""
Member:

With the atleast_nd(arrays, result_ndim) change below, the function name should become to_uniform_dim_arrays or similar

Contributor Author:

ok.

shape = _shape(arrays, result.ndim)
slices = tuple(slice(start, start+size)
               for start, size in zip(start_indicies, shape))
result[slices] = arrays
Member:

Instead of doing the assignment, can you assemble a list of (slices, array) to return to the caller? That would let us implement unblock easily too. So this function would become _match_slices or similar

shape_on_dim = sum(shape[concat_depth] for shape in shapes)
# Take a shape, any shape
shape = shapes[0]
shape = shape[:concat_depth] + (shape_on_dim,) + shape[concat_depth+1:]
@eric-wieser (Member) on Sep 17, 2018:

Helper function for lines 457 to 465: result_shape = _concatenate_shape((shape1, shape2, ...), axis=axis), where all those shapes are just tuples

"""
def atleast_nd(a, ndim):
@eric-wieser (Member) on Sep 17, 2018:

I mean this atleast_nd function, which already does allow subclasses. I think you should keep it.

Contributor Author:

I'll keep it if you insist. I just really don't like 1 liner, single caller internal functions. We could just make this function public too

Member:

There's a PR somewhere about making it public. The nice thing about having it internal like this is that if we do decide to make it public, we already know where we can leverage it internally.

Contributor Author:

ok.

@hmaarrfk (Contributor Author) commented on Sep 17, 2018

Suggestions kept since they are being hidden by GitHub:

@eric-wieser wrt line 447:
> Instead of doing the assignment, can you assemble a list of (slices, array) to return to the caller? That would let us implement unblock easily too. So this function would become _match_slices or similar

@eric-wieser:
> Helper function for lines 457 to 465: result_shape = _concatenate_shapes((shape1, shape2, ...), axis=axis), where all those shapes are just tuples

These require thinking and it is too late now for me to do that. Goodnight!

Thank you for your help.

@eric-wieser (Member) commented on Sep 17, 2018

The latter shouldn't really require any work at all, just a copy-paste of the function contents. No rush though - thanks for the good work so far!


Regarding unblock - one of my original goals was to add python-style unpacking. Where in python we could write;

a, *b = range(5)

I was thinking of aiming for

a = np.empty(2)
b = np.empty(3)
np.b_[a, b] = np.arange(5)  # desugars to `np.unblock(np.arange(5), out=[a, b])`

@hmaarrfk force-pushed the block_single_concatenate_call branch from 1ecbdba to dfa18f2 on September 17, 2018 06:45
raise ValueError('Mismatched array shapes in block.')
shape_on_dim = sum(shape[concat_depth] for shape in shapes)
return (first_shape[:concat_depth] + (shape_on_dim,) +
        first_shape[concat_depth+1:])
@eric-wieser (Member) on Sep 17, 2018:

Helper would be:

def _concatenate_shapes(shapes, axis):
    """
    concatenate(arrs, axis).shape == _concatenate_shapes([a.shape for a in arrs], axis)
    """
    first_shape = shapes[0]
    for shape in shapes[1:]:
        if (shape[:axis] != first_shape[:axis] or
                shape[axis+1:] != first_shape[axis+1:]):
            raise ValueError('Mismatched array shapes in block.')
    shape_on_axis = sum(shape[axis] for shape in shapes)
    return first_shape[:axis] + (shape_on_axis,) + first_shape[axis+1:]
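For reference, a quick usage check of that helper (a sketch, assuming `import numpy as np` and the definition above):

```python
shapes = [(2, 3), (4, 3)]
result_shape = _concatenate_shapes(shapes, axis=0)
assert result_shape == (6, 3)
# Agrees with an actual concatenate:
assert np.concatenate([np.empty(s) for s in shapes], axis=0).shape == result_shape
```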

Contributor Author:

Done.

@@ -442,22 +442,25 @@ def _assignment_recursion(result, arrays, list_ndim, start_indicies,
return arrays.shape


def _concatenate_shapes(shapes, axis):


@hmaarrfk (Contributor Author)

I think you should make this unblock a separate issue.

We could do what you suggested, or what might also be useful is provided a list of shapes, We could then return views into the original array.

Do I need to comment the internal functions more than what I have now?

Personally, having gone through the super commented code of pad, I actually felt that the comments to internal functions were distracting. Especially since I couldn't see the signature and function body on the same screen...

@eric-wieser (Member) commented on Sep 17, 2018

> having gone through the super commented code of pad, I actually felt that the comments to internal functions were distracting.

I agree

> Do I need to comment the internal functions more than what I have now?

Maybe a little more, but what you have looks pretty good.

> I think you should make this unblock a separate issue.

unblock should indeed be a separate issue - but if you can separate the slice computation from the assignment now, then you can combine your shape recursion with the slice recursion.

start_indicies[concat_depth] += shape[concat_depth]
# lists are mutable, so we have to reset the start_indicies list
start_indicies[concat_depth:] = (0, ) * (result.ndim - concat_depth)
return shape
Member:

This looks off to me - doesn't it just return the shape of the very last array?

Contributor Author:

Feel free to comment it out, then run pytest.

It took me a while to figure out what was wrong. It is the same reason you don't write a function signature like

def foo(baz, bar=[]):   # the default list is created once, at definition time
    bar.append(baz)     # every call without `bar` then mutates that same list

I used to have a tuple-based implementation (2a3bd7f), but it looked worse IMO.

Member:

See this comment - there's a simpler tuple based implementation possible

@hmaarrfk (Contributor Author)

I'm getting better performance than concatenate on large arrays... is that a red herring?

[ 50.00%] ··· ===== ========= =============
              --              mode         
              ----- -----------------------
                n     block    concatenate 
              ===== ========= =============
                1    190±0μs     19.5±0μs  
                10   645±0μs     476±0μs   
               100   447±0ms     509±0ms   
              ===== ========= =============

htop says that assignment is single threaded :/

This is kinda consistent.

I think the poor performance on the n=10 array is due to the 2 extra unnecessary recursions (and probably the sanity-checking code that could get cleaned up).

@eric-wieser (Member) commented on Sep 17, 2018

The slice-based recursion could look something like:

def _concat_info(arrays, list_ndim, result_ndim, start_indicies, depth=0):
    """
    Returns ``shape, [(idx_1, arr_1), (idx_2, arr_2), ...]``
    """
    if depth < list_ndim:
        item_pairs = []
        item_shapes = []
        concat_depth = depth + result_ndim - list_ndim
        for arr in arrays:
            shape, slices = _concat_info(arr, list_ndim, result_ndim,
                                         start_indicies, depth=depth+1)
            start_indicies[concat_depth] += shape[concat_depth]
            item_shapes.append(shape)
            item_pairs.append(slices)
        # lists are mutable, so we have to reset the start_indicies list
        start_indicies[concat_depth:] = (0,) * (result_ndim - concat_depth)

        shape = _concatenate_shapes(item_shapes, axis=concat_depth)
        slices = [pair for pairs in item_pairs for pair in pairs]
        return shape, slices
    else:
        # We've bottomed out; record the array and the slices it lands in
        idx = tuple(
            slice(start, start+size)
            for start, size in zip(start_indicies, arrays.shape)
        )
        return arrays.shape, [(idx, arrays)]

shape, pairs = _concat_info(...)
result = np.empty(shape)
for idx, subarray in pairs:
    result[idx] = subarray

start_indicies, depth=depth+1)
start_indicies[concat_depth] += shape[concat_depth]
# lists are mutable, so we have to reset the start_indicies list
start_indicies[concat_depth:] = (0, ) * (result.ndim - concat_depth)
@eric-wieser (Member) on Sep 17, 2018:

A better way of handling this loop to avoid needing to reset the indices - initialize start_indicies=(), and do:

i = 0  # or a better name
for arr in arrays:
    shape = _assignment_recursion(result, arr, list_ndim,
                                  start_indicies + (i,), depth=depth+1)
    i += shape[concat_depth]

Member:

Also, typo: indices, not indicies

Contributor Author:

ok

Contributor Author:

That's not a typo, I just can't spell 👍

Member:

Fits into the broader category of failing to convert a sequence of phonemes in your head into a sequence of characters in memory, which I'm happy to write off as a typo :)

@hmaarrfk (Contributor Author)

@eric-wieser That final refactoring is basically what is needed to get

_to_uniform_dim_recursion
_shape_recursion
_dtype_recursion
_assignment_recursion

All in 1 recursion.

I was hoping to do it with some kind of list comprehension. I guess that won't happen though.

@eric-wieser (Member)

To get it in a more comprehension-y form, I think you'd need to eliminate the start_indices argument entirely, and do repeated addition in the recursive step.

@hmaarrfk (Contributor Author)

Honestly, list comprehension seems like a cop-out for python not to implement a JIT.

I understand generator expressions, but if they actually worked on a JIT, you wouldn't need to learn to read loops backward.

One day, I'll jump into PyPy.

@eric-wieser (Member)

> Honestly, list comprehension seems like a cop-out for python not to implement a JIT.

I've really got no idea what you're talking about. As far as I'm aware, list comprehensions are nothing but (useful) syntactic sugar

@hmaarrfk (Contributor Author)

Why was I convinced that list comprehensions were faster than loops? Eye-opening!

@eric-wieser (Member) commented on Sep 17, 2018

An OO-based approach to give you your comprehension-based recursive step:

import numpy as np

def _concatenate_offsets(shapes, axis):
    offset = 0
    offsets = []
    for shape in shapes:
        offsets.append(offset)
        offset += shape[axis]
    return offsets


class concat_info:
    __slots__ = ('shape', 'dtype', 'locs')

    def __init__(self, shape, dtype, locs):
        self.shape = shape
        self.dtype = dtype
        self.locs = locs

    @classmethod
    def concatenate(cls, infos, axis):
        dtype = np.result_type(*(info.dtype for info in infos))
        shape = _concatenate_shapes([info.shape for info in infos], axis)
        offsets = _concatenate_offsets([info.shape for info in infos], axis)
        locs = [
            (
                loc[:axis] + (loc[axis] + offset,) + loc[axis+1:],
                arr
            )
            for info, offset in zip(infos, offsets)
            for loc, arr in info.locs
        ]
        return cls(shape, dtype, locs)

    @classmethod
    def single(cls, arr):
        return cls(arr.shape, arr.dtype, [((0,)*arr.ndim, arr)])

    @property
    def regions(self):
        for loc, arr in self.locs:
            region = tuple(
                slice(start, start + size)
                for start, size in zip(loc, arr.shape)
            )
            yield region, arr

    def store(self, out=None):
        if out is None:
            out = np.empty(self.shape, self.dtype)
        for index, arr in self.regions:
            out[index] = arr
        return out

    def load(self, in_):
        for index, arr in self.regions:
            arr[...] = in_[index]


def info_recurse(arrays, list_ndim, result_ndim, depth=0):
    if depth < list_ndim:
        axis = depth + result_ndim - list_ndim
        return concat_info.concatenate([
            info_recurse(arr, list_ndim, result_ndim, depth+1)
            for arr in arrays
        ], axis=axis)
    else:
        return concat_info.single(arrays)

def block(...):
    info = info_recurse(arrays, list_ndim, result_ndim)
    return info.store()

I don't know what the cost of the class is

@ahaldane (Member)

Looks good, including the new F-order test.

I think everyone is happy with it now, so let's go ahead and merge soon. I'll wait a little bit in case other reviewers have final comments (any of you feel free to merge if you're happy too).

(There is a release note conflict to resolve, but that's easily taken care of with the online editor before merging)

@hmaarrfk (Contributor Author)

@ahaldane should we add a test for slice based blocking preserving F order?

@ahaldane (Member) commented on Oct 19, 2018

Yeah, that would be a good addition.

We're not guaranteeing users anything about order, but it would be nice to make sure we don't accidentally lose the F-order speed we get now in future updates.

@hmaarrfk force-pushed the block_single_concatenate_call branch 2 times, most recently from fe7cab0 to a4b616a on October 20, 2018 12:49
@hmaarrfk (Contributor Author)

@ahaldane I rebased (after pulling in your changes) and added some simple tests for F and C order.

The CIs are failing for some strange reason. I'll try to force push to trigger them later.

@hmaarrfk force-pushed the block_single_concatenate_call branch from a4b616a to 8e02803 on October 20, 2018 22:05
[arr_f, arr_f]]]

assert block(b_c).flags['C_CONTIGUOUS']
assert block(b_mixed).flags['C_CONTIGUOUS']
Member:

I don't think we should have a test for the mixed case, because we don't want to guarantee anything about this case either way. Future updates should feel free to change the order.

Contributor Author:

ok.

assert block(b_mixed).flags['C_CONTIGUOUS']
# ``_block_force_concatenate`` returns some mixed `C`/`F` array
if block == _block_force_slicing:
    assert block(b_f).flags['F_CONTIGUOUS']
@ahaldane (Member) on Oct 20, 2018:

It would be better to make this test cover both the concatenate and the slicing code-path. What about removing the outer nesting in the test input so that we get a case which should be in F-order no matter what.

I mean, do:

b_f = [[arr_f, arr_f],
       [arr_f, arr_f]]

Contributor Author:

That test exists above, no?

@hmaarrfk force-pushed the block_single_concatenate_call branch 2 times, most recently from 3390b68 to 22e3424 on October 21, 2018 00:33
```

These are called slice prefixes since they are used in the recursive
blocking algorithm to compute the left most slice during the
Member:

should this be left-hand slices or left-most slices, since it's more than one slice?

Contributor Author:

yes.


if any(shape[:axis] != first_shape_pre or
       shape[axis+1:] != first_shape_post for shape in shapes):
    raise ValueError('Mismatched array shapes in block.')
Member:

Would be nice to include "along axis {axis}" in this message

Contributor Author:

Thanks, I'm always annoyed when I have to find that information myself.

@hmaarrfk force-pushed the block_single_concatenate_call branch from 22e3424 to 245cb80 on October 21, 2018 01:03
[arr_c, arr_c]]

b_f = [[arr_f, arr_f],
[arr_f, arr_f]]
Member:

Indentation error here

Contributor Author:

thanks!

@hmaarrfk force-pushed the block_single_concatenate_call branch from 245cb80 to f164d2e on October 23, 2018 20:42
@ahaldane (Member)

All right, I just gave it one more readthrough, and all looks good.

Merging time! Here goes...

Thanks @hmaarrfk for the PR and @eric-wieser for reviews.

@ahaldane merged commit bde5929 into numpy:master on Oct 24, 2018
argnames = sorted(arglist[0])
metafunc.parametrize(argnames,
                     [[funcargs[name] for name in argnames]
                      for funcargs in arglist])
Member:

This seems like a lot of magic for something that can be achieved with a parametrized fixture

Member:

Off the top of my head, I think something like:

class Tests:
    @pytest.fixture(params=[block1, block2, ...])
    def block(self, request):  # fixtures defined in a class take self
        return request.param

    def method(self, block): ...

Contributor Author:

Do you have a working example of a fixture I can build off?

Member:

Does the one above not work?
