
MAINT: Block algorithm with a single copy per call to block #11971


Merged: 5 commits into numpy:master from block_single_concatenate_call, Oct 24, 2018

Conversation

@hmaarrfk (Contributor) commented on Sep 17, 2018

`block` used to make repeated calls to `concatenate`. For a 3D block, this copied the arrays 3 times, once per dimension of the block; that is 2 copies too many.

The proposed algorithm performs the blocking with a single memory copy.

Wisdom allows numpy to guess which algorithm performs better and to choose it automatically. We could add a switch to let the user decide (a hypothetical sketch follows).
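A hypothetical sketch of such a switch; the threshold value and all names here are made up for illustration and are not numpy's actual heuristic:

```python
SMALL_THRESHOLD = 512  # elements; illustrative value only

def block_dispatch(arrays, total_size, concatenate_path, slicing_path):
    # Small inputs: the repeated-concatenate path has lower overhead.
    if total_size < SMALL_THRESHOLD:
        return concatenate_path(arrays)
    # Large inputs: the single-copy slicing path wins.
    return slicing_path(arrays)
```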

TODO:

XREF: Other issues I found regarding concatenate that we might want to test for

The general steps are (a minimal sketch follows the list):

  1. Sanitize the input.
  2. Cast each block as a numpy array of the final dimension.
  3. Calculate the shape of each block.
  4. Concatenate the shapes.
  5. Keep track of the slice of a particular block within the larger whole.
  6. Combine the dtypes to find the dtype of the result.
  7. Create the new array.
  8. Copy the blocks in.
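A minimal sketch of these steps for the simple case of a 2-D list of 2-D arrays; the name `_block_2d` is hypothetical, and the real implementation recurses over arbitrary nesting:

```python
import numpy as np

def _block_2d(rows):
    # Steps 1-2: sanitize, casting every block to an ndarray.
    rows = [[np.asanyarray(a) for a in row] for row in rows]
    # Steps 3-4: gather the shapes; row heights and column widths.
    heights = [row[0].shape[0] for row in rows]
    widths = [a.shape[1] for a in rows[0]]
    # Step 5: track where each block lands via cumulative offsets.
    row_offsets = np.concatenate([[0], np.cumsum(heights)])
    col_offsets = np.concatenate([[0], np.cumsum(widths)])
    # Step 6: combine the dtypes to find the result dtype.
    dtype = np.result_type(*[a.dtype for row in rows for a in row])
    # Step 7: create the new array.
    result = np.empty((row_offsets[-1], col_offsets[-1]), dtype=dtype)
    # Step 8: copy each block in, exactly once.
    for i, row in enumerate(rows):
        for j, a in enumerate(row):
            result[row_offsets[i]:row_offsets[i + 1],
                   col_offsets[j]:col_offsets[j + 1]] = a
    return result

a = np.ones((2, 2))
b = np.zeros((2, 3))
_block_2d([[a, b], [a, b]])  # 4x5 result, one copy per block
```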

A benchmark was added to estimate the 2D blocking performance in a straightforward manner.

Approach details

2D results

For small arrays, this approach suffers from large overhead. Benchmarks show that "small arrays" here means anything up to 256x256 on a modern i7 from 2017.

The performance of master is optimized in PR #11991, which shaves the overhead for a 2x2 block from 20 μs to 15 μs. Benchmarks in this section are compared against that PR. The overhead of this method is 30 μs when blocking a small 2x2 array.

My general sense is that small arrays are pretty important. While the concatenate approach was "wasteful" for large arrays, it was quick for small arrays. There is probably room for both approaches.

Benchmark results compared to PR #11991
       before           after         ratio
     [1af752aa]       [0b1d866f]
     <block_optimize_order>       <block_single_assignment>
+      15.6±0.3μs       30.0±0.8μs     1.92  bench_shape_base.Block2D.time_block2d((16, 16), 'uint8', (2, 2))
+     15.8±0.08μs         30.2±1μs     1.91  bench_shape_base.Block2D.time_block2d((16, 16), 'uint32', (2, 2))
+      39.4±0.2μs         73.1±2μs     1.85  bench_shape_base.Block2D.time_block2d((32, 32), 'uint8', (4, 4))
+      16.9±0.3μs         31.1±1μs     1.84  bench_shape_base.Block2D.time_block2d((32, 32), 'uint64', (2, 2))
+      16.8±0.3μs       31.0±0.9μs     1.84  bench_shape_base.Block2D.time_block2d((32, 32), 'uint32', (2, 2))
+      40.3±0.3μs       73.7±0.9μs     1.83  bench_shape_base.Block2D.time_block2d((32, 32), 'uint16', (4, 4))
+      16.4±0.6μs       29.8±0.7μs     1.82  bench_shape_base.Block2D.time_block2d((32, 32), 'uint8', (2, 2))
+      16.9±0.3μs       30.8±0.9μs     1.82  bench_shape_base.Block2D.time_block2d((32, 32), 'uint16', (2, 2))
+      39.6±0.7μs       71.8±0.9μs     1.81  bench_shape_base.Block2D.time_block2d((16, 16), 'uint8', (4, 4))
+      16.2±0.3μs       29.3±0.5μs     1.81  bench_shape_base.Block2D.time_block2d((16, 16), 'uint16', (2, 2))
+      16.6±0.1μs       30.0±0.3μs     1.80  bench_shape_base.Block2D.time_block2d((64, 64), 'uint8', (2, 2))
+      41.4±0.3μs         74.5±2μs     1.80  bench_shape_base.Block2D.time_block2d((16, 16), 'uint32', (4, 4))
+      42.3±0.7μs         75.7±2μs     1.79  bench_shape_base.Block2D.time_block2d((32, 32), 'uint64', (4, 4))
+      16.6±0.6μs       29.7±0.4μs     1.79  bench_shape_base.Block2D.time_block2d((16, 16), 'uint64', (2, 2))
+      18.1±0.1μs       32.3±0.8μs     1.79  bench_shape_base.Block2D.time_block2d((64, 64), 'uint32', (2, 2))
+      17.3±0.6μs       30.9±0.3μs     1.78  bench_shape_base.Block2D.time_block2d((64, 64), 'uint16', (2, 2))
+      41.7±0.6μs         74.2±2μs     1.78  bench_shape_base.Block2D.time_block2d((64, 64), 'uint16', (4, 4))
+      40.8±0.9μs       72.0±0.7μs     1.77  bench_shape_base.Block2D.time_block2d((16, 16), 'uint16', (4, 4))
+      42.7±0.4μs       75.0±0.3μs     1.76  bench_shape_base.Block2D.time_block2d((64, 64), 'uint32', (4, 4))
+        41.9±1μs         72.9±2μs     1.74  bench_shape_base.Block2D.time_block2d((16, 16), 'uint64', (4, 4))
+        41.9±1μs       72.8±0.5μs     1.74  bench_shape_base.Block2D.time_block2d((32, 32), 'uint32', (4, 4))
+        46.0±1μs         79.8±2μs     1.73  bench_shape_base.Block2D.time_block2d((128, 128), 'uint16', (4, 4))
+      18.2±0.3μs       31.4±0.7μs     1.72  bench_shape_base.Block2D.time_block2d((128, 128), 'uint8', (2, 2))
+      44.1±0.9μs       75.1±0.8μs     1.70  bench_shape_base.Block2D.time_block2d((128, 128), 'uint8', (4, 4))
+        45.1±1μs         75.7±2μs     1.68  bench_shape_base.Block2D.time_block2d((64, 64), 'uint64', (4, 4))
+        44.0±1μs         73.4±2μs     1.67  bench_shape_base.Block2D.time_block2d((64, 64), 'uint8', (4, 4))
+      49.4±0.3μs       79.3±0.4μs     1.60  bench_shape_base.Block2D.time_block2d((256, 256), 'uint8', (4, 4))
+      20.6±0.4μs       32.9±0.4μs     1.60  bench_shape_base.Block2D.time_block2d((128, 128), 'uint16', (2, 2))
+      19.7±0.5μs       31.4±0.2μs     1.59  bench_shape_base.Block2D.time_block2d((64, 64), 'uint64', (2, 2))
+        50.0±2μs         78.9±3μs     1.58  bench_shape_base.Block2D.time_block2d((128, 128), 'uint32', (4, 4))
+      54.3±0.2μs         81.4±1μs     1.50  bench_shape_base.Block2D.time_block2d((128, 128), 'uint64', (4, 4))
+      23.5±0.6μs         34.6±1μs     1.47  bench_shape_base.Block2D.time_block2d((128, 128), 'uint32', (2, 2))
+        57.8±1μs         85.0±2μs     1.47  bench_shape_base.Block2D.time_block2d((256, 256), 'uint16', (4, 4))
+      24.5±0.7μs       35.4±0.4μs     1.45  bench_shape_base.Block2D.time_block2d((256, 256), 'uint8', (2, 2))
+      28.3±0.3μs         38.4±1μs     1.36  bench_shape_base.Block2D.time_block2d((128, 128), 'uint64', (2, 2))
+      29.8±0.2μs         40.2±1μs     1.35  bench_shape_base.Block2D.time_block2d((256, 256), 'uint16', (2, 2))
-       175±200μs          150±4μs     0.85  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint8', (4, 4))
-      72.0±100μs       59.6±0.8μs     0.83  bench_shape_base.Block2D.time_block2d((512, 512), 'uint16', (2, 2))
-      70.8±100μs       56.9±0.1μs     0.80  bench_shape_base.Block2D.time_block2d((256, 256), 'uint64', (2, 2))
-       167±200μs          133±1μs     0.80  bench_shape_base.Block2D.time_block2d((512, 512), 'uint32', (4, 4))
-       132±200μs         89.4±3μs     0.68  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint8', (2, 2))
-       131±200μs       82.4±0.8μs     0.63  bench_shape_base.Block2D.time_block2d((512, 512), 'uint32', (2, 2))
-       377±500μs        191±0.8μs     0.51  bench_shape_base.Block2D.time_block2d((512, 512), 'uint64', (4, 4))
-       403±500μs          203±4μs     0.51  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint16', (4, 4))
-      1.20±0.8ms         472±20μs     0.39  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint32', (4, 4))
-        6.14±2ms         985±20μs     0.16  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint64', (4, 4))
-        6.28±2ms         983±50μs     0.16  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint64', (2, 2))
-        2.94±1ms         360±20μs     0.12  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint32', (2, 2))
-      1.27±0.5ms          146±3μs     0.12  bench_shape_base.Block2D.time_block2d((1024, 1024), 'uint16', (2, 2))
-      1.32±0.5ms          126±4μs     0.10  bench_shape_base.Block2D.time_block2d((512, 512), 'uint64', (2, 2))

A few small micro optimizations

This PR builds on top of code in numpy that is already quite optimized, so getting the new path's performance close to it is hard, maybe impossible.

List comprehensions are really expensive. Unfortunately, blocking allows for really fancy block shapes, which means we can't do much but use them in Python.

A few things can be optimized (a short sketch follows the list):

  1. Generators are slow. Thanks @eric-wieser for teaching me about that.
  2. Empty lists are falsy. You can use this to test for unlikely cases with a list comprehension.
  3. Creating tuples is expensive. Create them once, reuse them.
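A hedged illustration of those three tricks; the data and variable names are made up:

```python
shapes = [(2, 3), (4, 3), (5, 3)]

# 1. A list comprehension beats the equivalent generator expression here,
#    because generators pay per-item frame-switching overhead.
mismatched = [s for s in shapes if s[1:] != shapes[0][1:]]

# 2. An empty list is falsy, so the unlikely error path costs only a
#    truthiness check.
if mismatched:
    raise ValueError('Mismatched array shapes in block.')

# 3. Build a tuple once and reuse it rather than recreating it in a loop.
zeros = (0,) * len(shapes[0])
```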

I'm unsure how these optimizations affect performance on PyPy.

3D results:

For large arrays, benchmark #11965 shows that there is now little room for improvement compared to a straight concatenate operation. This is expected, since copying the arrays probably dominates the cost of generating the slices.

The benchmarks show results for a final array of size `(5n)^3`. For `n=10`, the final array therefore has 125,000 elements of 8 bytes each (int64), so close to 1 MB; `n=100` is closer to 1 GB.
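A quick arithmetic check of those sizes:

```python
for n in (10, 100):
    elements = (5 * n) ** 3
    print(n, elements, elements * 8 / 1e6, 'MB')
# 10  ->    125000 elements,    1.0 MB
# 100 -> 125000000 elements, 1000.0 MB (~1 GB)
```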

New results

===== ========== =============
--              mode          
----- ------------------------
  n     block     concatenate 
===== ========== =============
  1    74.0±2μs    2.03±0.1μs 
  10   150±6μs      37.6±1μs  
 100   366±7ms      393±10ms  
===== ========== =============

(I'm pretty convinced that the difference in time between block and concatenate for n=100 is artificial, though I'm not sure how to prove it.)

Improvement compared to the old results

asv continuous -E conda:3.6 -b Block.time_3d master block_single_concatenate_call 
       before           after         ratio
     [b5f56572]       [a2b4b356]
     <master>         <block_single_concatenate_call>
+      40.7±0.2μs         74.0±2μs     1.82  bench_shape_base.Block.time_3d(1, 'block')
-      1.10±0.02s          366±7ms     0.33  bench_shape_base.Block.time_3d(100, 'block')
-        681±20μs          150±6μs     0.22  bench_shape_base.Block.time_3d(10, 'block')

Extensions

Unblocking

@eric-wieser discussed the possibility of using this algorithm to unblock arrays. This is rather straightforward to implement once we lock down the implementation. Details are in the comments below.

User friendly options

Finally, this probably lends itself well to offering options such as `order`, which would let the user choose the memory order of the final array, or even an `out` parameter to let them supply the result array (a hypothetical sketch follows).
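A hypothetical sketch of what that interface could look like; `block_with_options` and its parameters are made up, and `pairs` stands for the (slices, array) bookkeeping the algorithm already computes internally:

```python
import numpy as np

def block_with_options(shape, dtype, pairs, order='C', out=None):
    # 'order' and 'out' are the hypothetical user-facing options.
    if out is None:
        out = np.empty(shape, dtype=dtype, order=order)
    for slices, arr in pairs:
        out[slices] = arr
    return out
```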

Closures are avoided entirely, so I think that gh-10620 is not an issue.

@eric-wieser (Member) commented on Sep 17, 2018

I had this in mind a while ago, and was hoping someone else would give it a go :)

This algorithm opens up the option of an "unblock" operation, which does exactly the same thing, but with the assignment in the other direction. Something to explore later.
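A minimal, hypothetical illustration of that reversal, with made-up `pairs` standing in for the algorithm's (slices, array) bookkeeping:

```python
import numpy as np

result = np.arange(6).reshape(2, 3)
a = np.empty((2, 2), dtype=result.dtype)
b = np.empty((2, 1), dtype=result.dtype)
pairs = [((slice(0, 2), slice(0, 2)), a),
         ((slice(0, 2), slice(2, 3)), b)]
for slices, arr in pairs:
    arr[...] = result[slices]  # block() would do: result[slices] = arr
```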

@eric-wieser (Member)

I'm afraid I introduced a conflict by merging #11910. The upshot is you could use closures again, if they help

@hmaarrfk force-pushed the block_single_concatenate_call branch from 5ecfe33 to 2a3bd7f on September 17, 2018 06:14
@hmaarrfk (Contributor Author)

> I'm afraid I introduced a conflict by merging #11910. The upshot is you could use closures again, if they help

No worries.

def _shape(array, result_ndim):
    # array is either a scalar or an array
    # type(arrays) is not list
    # if it is a scalar, tell it we have shape==(1,)
Member:

This comment looks wrong.

Perhaps a better function would be shape_with_ndim(shape, ndim) with description "pad a shape with leading 1 to make it be of dimension ndim". I'm not super happy with that function name, but I think it would be clearer if it operated on a shape rather than an array.

Contributor Author:

Thanks!

Member:

See below - you can drop this in favor of atleast_nd

shape[concat_depth+1:] != shapes[0][concat_depth+1:]):
raise ValueError('Mismatched array shapes in block.')
shape_on_dim = sum(shape[concat_depth] for shape in shapes)
shape = list(shapes[0])
Member:

The goal here is to make a copy, I assume? Would be nice to indicate that.

I think I'd prefer

first_shape = shapes[0]
...
shape = first_shape[:concat_depth] + (shape_on_dim,) + first_shape[concat_depth+1:]

because shapes are typically tuples

Contributor Author:

Let me know if you like the improvement.

for arr in arrays]
else:
# We've 'bottomed out'
return _nx.asanyarray(arrays)
@eric-wieser (Member) on Sep 17, 2018:

I think you might as well just use atleast_nd(arrays, result_ndim) here, which also saves the need for your _shape function (it becomes just arr.shape)

Contributor Author:

Yeah you are right. I kinda didn't want to cast everything to an ndarray at first, but then I realized that it was in the tests so I ended up using it.

return block_recursion(arrays)

def _asanyarray_recursion(arrays, list_ndim, depth=0):
"""Convert all inputs to arrays."""
Member:

With the atleast_nd(arrays, result_ndim) change below, the function name should become to_uniform_dim_arrays or similar

Contributor Author:

ok.

shape = _shape(arrays, result.ndim)
slices = tuple(slice(start, start+size)
               for start, size in zip(start_indicies, shape))
result[slices] = arrays
Member:

Instead of doing the assignment, can you assemble a list of (slices, array) to return to the caller? That would let us implement unblock easily too. So this function would become _match_slices or similar

shape_on_dim = sum(shape[concat_depth] for shape in shapes)
# Take a shape, any shape
shape = shapes[0]
shape = shape[:concat_depth] + (shape_on_dim,) + shape[concat_depth+1:]
@eric-wieser (Member) on Sep 17, 2018:

Helper function for lines 457 to 465: result_shape = _concatenate_shape((shape1, shape2, ...), axis=axis), where all those shapes are just tuples

"""
def atleast_nd(a, ndim):
@eric-wieser (Member) on Sep 17, 2018:

I mean this atleast_nd function, which already does allow subclasses. I think you should keep it.

Contributor Author:

I'll keep it if you insist. I just really don't like 1 liner, single caller internal functions. We could just make this function public too

Member:

There's a PR somewhere about making it public. The nice thing about having it internal like this is that if we do decide to make it public, we already know where we can leverage it internally.

Contributor Author:

ok.

@hmaarrfk (Contributor Author) commented on Sep 17, 2018

Suggestions kept since they are being hidden by GitHub:

@eric-wieser wrt line 447:
> Instead of doing the assignment, can you assemble a list of (slices, array) to return to the caller? That would let us implement unblock easily too. So this function would become _match_slices or similar

@eric-wieser:
> Helper function for lines 457 to 465: result_shape = _concatenate_shapes((shape1, shape2, ...), axis=axis), where all those shapes are just tuples

These require thinking and it is too late now for me to do that. Goodnight!

Thank you for your help.

@eric-wieser (Member) commented on Sep 17, 2018

The latter shouldn't really require any work at all, just a copy-paste of the function contents. No rush though - thanks for the good work so far!


Regarding unblock - one of my original goals was to add python-style unpacking. Where in python we could write;

a, *b = range(5)

I was thinking of aiming for

a = np.empty(2)
b = np.empty(3)
np.b_[a, b] = np.arange(5)  # desugars to `np.unblock(np.arange(5), out=[a, b])`

@hmaarrfk force-pushed the block_single_concatenate_call branch from 1ecbdba to dfa18f2 on September 17, 2018 06:45
raise ValueError('Mismatched array shapes in block.')
shape_on_dim = sum(shape[concat_depth] for shape in shapes)
return (first_shape[:concat_depth] + (shape_on_dim,) +
        first_shape[concat_depth+1:])
@eric-wieser (Member) on Sep 17, 2018:

Helper would be:

def _concatenate_shapes(shapes, axis):
    """
    concatenate(arrs, axis).shape == _concatenate_shapes([a.shape for a in arrs], axis)
    """
    first_shape = shapes[0]
    for shape in shapes[1:]:
        if (shape[:axis] != first_shape[:axis] or
                shape[axis+1:] != first_shape[axis+1:]):
            raise ValueError('Mismatched array shapes in block.')
    shape_on_axis = sum(shape[axis] for shape in shapes)
    return first_shape[:axis] + (shape_on_axis,) + first_shape[axis+1:]
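For reference, a quick usage check of that helper (a sketch, assuming `import numpy as np` and the definition above):

```python
shapes = [(2, 3), (4, 3)]
result_shape = _concatenate_shapes(shapes, axis=0)
assert result_shape == (6, 3)
# Agrees with an actual concatenate:
assert np.concatenate([np.empty(s) for s in shapes], axis=0).shape == result_shape
```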

Contributor Author:

Done.

@@ -442,22 +442,25 @@ def _assignment_recursion(result, arrays, list_ndim, start_indicies,
return arrays.shape


def _concatenate_shapes(shapes, axis):


@hmaarrfk (Contributor Author)

I think you should make this unblock a separate issue.

We could do what you suggested, or what might also be useful is provided a list of shapes, We could then return views into the original array.

Do I need to comment the internal functions more than what I have now?

Personally, having gone through the super commented code of pad, I actually felt that the comments to internal functions were distracting. Especially since I couldn't see the signature and function body on the same screen...

@eric-wieser (Member) commented on Sep 17, 2018

> having gone through the super commented code of pad, I actually felt that the comments to internal functions were distracting.

I agree

> Do I need to comment the internal functions more than what I have now?

Maybe a little more, but what you have looks pretty good.

> I think you should make this unblock a separate issue.

unblock should indeed be a separate issue - but if you can separate the slice computation from the assignment now, then you can combine your shape recursion with the slice recursion.

start_indicies[concat_depth] += shape[concat_depth]
# lists are mutable, so we have to reset the start_indicies list
start_indicies[concat_depth:] = (0, ) * (result.ndim - concat_depth)
return shape
Member:

This looks off to me - doesn't it just return the shape of the very last array?

Contributor Author:

Feel free to comment it out, then run pytest.

It took me a while to figure out what was wrong. It is the same reason you don't write a function signature like

def foo(baz, bar=[]):   # the default list is created once, at definition time
    bar.append(baz)     # every call without `bar` then mutates that same list

I used to have a tuple-based implementation (2a3bd7f), but it looked worse IMO.

Member:

See this comment - there's a simpler tuple based implementation possible

@hmaarrfk (Contributor Author)

I'm getting better performance than concatenate on large arrays... is that a red herring?

[ 50.00%] ··· ===== ========= =============
              --              mode         
              ----- -----------------------
                n     block    concatenate 
              ===== ========= =============
                1    190±0μs     19.5±0μs  
                10   645±0μs     476±0μs   
               100   447±0ms     509±0ms   
              ===== ========= =============

htop says that assignment is single threaded :/

This is kinda consistent.

I think the poor performance on the n=10 array is due to the 2 extra unnecessary recursions (and probably the sanity-checking code that could get cleaned up).

@eric-wieser (Member) commented on Sep 17, 2018

The slice-based recursion could look something like:

def _concat_info(arrays, list_ndim, result_ndim, start_indicies, depth=0):
    """
    Returns ``shape, [(idx_1, arr_1), (idx_2, arr_2), ...]``
    """
    if depth < list_ndim:
        item_pairs = []
        item_shapes = []
        concat_depth = depth + result_ndim - list_ndim
        for arr in arrays:
            shape, slices = _concat_info(arr, list_ndim, result_ndim,
                                         start_indicies, depth=depth+1)
            start_indicies[concat_depth] += shape[concat_depth]
            item_shapes.append(shape)
            item_pairs.append(slices)
        # lists are mutable, so we have to reset the start_indicies list
        start_indicies[concat_depth:] = (0,) * (result_ndim - concat_depth)

        shape = _concatenate_shapes(item_shapes, axis=concat_depth)
        slices = [pair for pairs in item_pairs for pair in pairs]
        return shape, slices
    else:
        # We've bottomed out; record the array and the slices it lands in
        idx = tuple(
            slice(start, start+size)
            for start, size in zip(start_indicies, arrays.shape)
        )
        return arrays.shape, [(idx, arrays)]

shape, pairs = _concat_info(...)
result = np.empty(shape)
for idx, subarray in pairs:
    result[idx] = subarray

start_indicies, depth=depth+1)
start_indicies[concat_depth] += shape[concat_depth]
# lists are mutable, so we have to reset the start_indicies list
start_indicies[concat_depth:] = (0, ) * (result.ndim - concat_depth)
@eric-wieser (Member) on Sep 17, 2018:

A better way of handling this loop to avoid needing to reset the indices - initialize start_indicies=(), and do:

i = 0  # or a better name
for arr in arrays:
    shape = _assignment_recursion(result, arr, list_ndim,
                                  start_indicies + (i,), depth=depth+1)
    i += shape[concat_depth]

Member:

Also, typo: indices, not indicies

Contributor Author:

ok

Contributor Author:

That's not a typo, I just can't spell 👍

Member:

Fits into the broader category of failing to convert a sequence of phonemes in your head into a sequence of characters in memory, which I'm happy to write off as a typo :)

@hmaarrfk (Contributor Author)

@eric-wieser That final refactoring is basically what is needed to get

_to_uniform_dim_recursion
_shape_recursion
_dtype_recursion
_assignment_recursion

All in 1 recursion.

I was hoping to do it with some kind of list comprehension. I guess that won't happen though.

@eric-wieser (Member)

To get it in a more comprehension-y form, I think you'd need to eliminate the start_indices argument entirely, and do repeated addition in the recursive step.

@hmaarrfk (Contributor Author)

Honestly, list comprehension seems like a cop-out for python not to implement a JIT.

I understand generator expressions, but if they actually worked on a JIT, you wouldn't need to learn to read loops backward.

One day, I'll jump into PyPy.

@eric-wieser (Member)

> Honestly, list comprehension seems like a cop-out for python not to implement a JIT.

I've really got no idea what you're talking about. As far as I'm aware, list comprehensions are nothing but (useful) syntactic sugar

@hmaarrfk (Contributor Author)

Why was I convinced that list comprehensions were faster than loops? Eye-opening!

@eric-wieser (Member) commented on Sep 17, 2018

An OO-based approach to give you your comprehension-based recursive step:

import numpy as np

def _concatenate_offsets(shapes, axis):
    offset = 0
    offsets = []
    for shape in shapes:
        offsets.append(offset)
        offset += shape[axis]
    return offsets


class concat_info:
    __slots__ = ('shape', 'dtype', 'locs')

    def __init__(self, shape, dtype, locs):
        self.shape = shape
        self.dtype = dtype
        self.locs = locs

    @classmethod
    def concatenate(cls, infos, axis):
        dtype = np.result_type(*(info.dtype for info in infos))
        shape = _concatenate_shapes([info.shape for info in infos], axis)
        offsets = _concatenate_offsets([info.shape for info in infos], axis)
        locs = [
            (
                loc[:axis] + (loc[axis] + offset,) + loc[axis+1:],
                arr
            )
            for info, offset in zip(infos, offsets)
            for loc, arr in info.locs
        ]
        return cls(shape, dtype, locs)

    @classmethod
    def single(cls, arr):
        return cls(arr.shape, arr.dtype, [((0,)*arr.ndim, arr)])

    @property
    def regions(self):
        for loc, arr in self.locs:
            region = tuple(
                slice(start, start + size)
                for start, size in zip(loc, arr.shape)
            )
            yield region, arr

    def store(self, out=None):
        if out is None:
            out = np.empty(self.shape, self.dtype)
        for index, arr in self.regions:
            out[index] = arr
        return out

    def load(self, in_):
        for index, arr in self.regions:
            arr[...] = in_[index]


def info_recurse(arrays, list_ndim, result_ndim, depth=0):
    if depth < list_ndim:
        axis = depth + result_ndim - list_ndim
        return concat_info.concatenate([
            info_recurse(arr, list_ndim, result_ndim, depth+1)
            for arr in arrays
        ], axis=axis)
    else:
        return concat_info.single(arrays)

def block(...):
    info = info_recurse(arrays, list_ndim, result_ndim)
    return info.store()

I don't know what the cost of the class is

@ahaldane (Member)

Looks good, including the new F-order test.

I think everyone is happy with it now, so let's go ahead and merge soon. I'll wait a little bit in case other reviewers have final comments (any of you feel free to merge if you're happy too).

(There is a release note conflict to resolve, but that's easily taken care of with the online editor before merging)

@hmaarrfk (Contributor Author)

@ahaldane should we add a test for slice based blocking preserving F order?

@ahaldane (Member) commented on Oct 19, 2018

Yeah, that would be a good addition.

We're not guaranteeing users anything about order, but it would be nice to make sure we don't accidentally lose the F-order speed we get now in future updates.

@hmaarrfk force-pushed the block_single_concatenate_call branch 2 times, most recently from fe7cab0 to a4b616a on October 20, 2018 12:49
@hmaarrfk (Contributor Author)

@ahaldane I rebased (after pulling in your changes) and added some simple tests for F and C order.

The CIs are failing for some strange reason. I'll try to force push to trigger them later.

@hmaarrfk force-pushed the block_single_concatenate_call branch from a4b616a to 8e02803 on October 20, 2018 22:05
[arr_f, arr_f]]]

assert block(b_c).flags['C_CONTIGUOUS']
assert block(b_mixed).flags['C_CONTIGUOUS']
Member:

I don't think we should have a test for the mixed case, because we don't want to guarantee anything about this case either way. Future updates should feel free to change the order.

Contributor Author:

ok.

assert block(b_mixed).flags['C_CONTIGUOUS']
# ``_block_force_concatenate`` returns some mixed `C`/`F` array
if block == _block_force_slicing:
    assert block(b_f).flags['F_CONTIGUOUS']
@ahaldane (Member) on Oct 20, 2018:

It would be better to make this test cover both the concatenate and the slicing code-path. What about removing the outer nesting in the test input so that we get a case which should be in F-order no matter what.

I mean, do:

b_f = [[arr_f, arr_f],
       [arr_f, arr_f]]

Contributor Author:

That test exists above, no?

@hmaarrfk force-pushed the block_single_concatenate_call branch 2 times, most recently from 3390b68 to 22e3424 on October 21, 2018 00:33
```

These are called slice prefixes since they are used in the recursive
blocking algorithm to compute the left most slice during the
Member:

should this be left-hand slices or left-most slices, since it's more than one slice?

Contributor Author:

yes.


if any(shape[:axis] != first_shape_pre or
       shape[axis+1:] != first_shape_post for shape in shapes):
    raise ValueError('Mismatched array shapes in block.')
Member:

Would be nice to include "along axis {axis}" in this message

Contributor Author:

Thanks, I'm always annoyed when I have to find that information myself.

@hmaarrfk force-pushed the block_single_concatenate_call branch from 22e3424 to 245cb80 on October 21, 2018 01:03
[arr_c, arr_c]]

b_f = [[arr_f, arr_f],
[arr_f, arr_f]]
Member:

Indentation error here

Contributor Author:

thanks!

@hmaarrfk force-pushed the block_single_concatenate_call branch from 245cb80 to f164d2e on October 23, 2018 20:42
@ahaldane (Member)

All right, I just gave it one more readthrough, and all looks good.

Merging time! Here goes...

Thanks @hmaarrfk for the PR and @eric-wieser for reviews.

@ahaldane merged commit bde5929 into numpy:master on Oct 24, 2018
argnames = sorted(arglist[0])
metafunc.parametrize(argnames,
                     [[funcargs[name] for name in argnames]
                      for funcargs in arglist])
Member:

This seems like a lot of magic for something that can be achieved with a parametrized fixture

Member:

Off the top of my head, I think something like:

class Tests:
    @pytest.fixture(params=[block1, block2, ...])
    def block(self, request):  # fixtures defined in a class take self
        return request.param

    def method(self, block): ...

Contributor Author:

Do you have a working example of a fixture I can build off?

Member:

Does the one above not work?
