MAINT: Block algorithm with a single copy per call to block
#11971
Conversation
I had this in mind a while ago, and was hoping someone else would give it a go :) This algorithm opens up the option of an "unblock" operation, which does exactly the same thing, but with the assignment in the other direction. Something to explore later.
I'm afraid I introduced a conflict by merging #11910. The upshot is you could use closures again, if they help.
Force-pushed 5ecfe33 to 2a3bd7f.
No worries.
numpy/core/shape_base.py (outdated)

```python
def _shape(array, result_ndim):
    # array is either a scalar or an array
    # type(arrays) is not list
    # if it is a scalar, tell it we have shape==(1,)
```
This comment looks wrong. Perhaps a better function would be `shape_with_ndim(shape, ndim)`, with the description "pad a shape with leading 1s to make it be of dimension `ndim`". I'm not super happy with that function name, but I think it would be clearer if it operated on a shape rather than an array.
Thanks!
See below - you can drop this in favor of `atleast_nd`.
numpy/core/shape_base.py (outdated)

```python
        shape[concat_depth+1:] != shapes[0][concat_depth+1:]):
    raise ValueError('Mismatched array shapes in block.')
shape_on_dim = sum(shape[concat_depth] for shape in shapes)
shape = list(shapes[0])
```
The goal here is to make a copy, I assume? Would be nice to indicate that. I think I'd prefer

```python
first_shape = shapes[0]
...
shape = first_shape[:concat_depth] + (shape_on_dim,) + first_shape[concat_depth+1:]
```

because shapes are typically tuples.
Let me know if you like the improvement.
numpy/core/shape_base.py (outdated)

```python
            for arr in arrays]
else:
    # We've 'bottomed out'
    return _nx.asanyarray(arrays)
```
I think you might as well just use `atleast_nd(arrays, result_ndim)` here, which also saves the need for your `_shape` function (it becomes just `arr.shape`).
Yeah, you're right. I kinda didn't want to cast everything to an ndarray at first, but then I realized that it was in the tests, so I ended up using it.
numpy/core/shape_base.py (outdated)

```python
    return block_recursion(arrays)


def _asanyarray_recursion(arrays, list_ndim, depth=0):
    """Convert all inputs to arrays."""
```
With the `atleast_nd(arrays, result_ndim)` change below, the function name should become `to_uniform_dim_arrays` or similar.
ok.
numpy/core/shape_base.py (outdated)

```python
shape = _shape(arrays, result.ndim)
slices = tuple(slice(start, start+size)
               for start, size in zip(start_indicies, shape))
result[slices] = arrays
```
Instead of doing the assignment, can you assemble a list of `(slices, array)` to return to the caller? That would let us implement `unblock` easily too. So this function would become `_match_slices` or similar.
numpy/core/shape_base.py (outdated)

```python
shape_on_dim = sum(shape[concat_depth] for shape in shapes)
# Take a shape, any shape
shape = shapes[0]
shape = shape[:concat_depth] + (shape_on_dim,) + shape[concat_depth+1:]
```
Helper function for lines 457 to 465: `result_shape = _concatenate_shape((shape1, shape2, ...), axis=axis)`, where all those shapes are just tuples.
numpy/core/shape_base.py (outdated)

```python
    """
    def atleast_nd(a, ndim):
```
I mean this `atleast_nd` function, which already does allow subclasses. I think you should keep it.
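For context, a plausible sketch of the one-liner being discussed (assumed to wrap `_nx.array` with `ndmin` and `subok=True`, which is what "already does allow subclasses" implies; not necessarily the exact code):

```python
def atleast_nd(a, ndim):
    # Promote `a` to at least `ndim` dimensions; subok=True preserves
    # ndarray subclasses, copy=False avoids a copy when possible.
    return _nx.array(a, ndmin=ndim, copy=False, subok=True)
```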
I'll keep it if you insist. I just really don't like one-liner, single-caller internal functions. We could just make this function public too.
There's a PR somewhere about making it public. The nice thing about having it internal like this is that if we do decide to make it public, we already know where we can leverage it internally.
ok.
Suggestions kept since they are being hidden by GitHub: @eric-wieser, wrt lines 447. @eric-wieser, these require thinking and it is too late now for me to do that. Goodnight! Thank you for your help.
The latter shouldn't really require any work at all, just a copy-paste of the function contents. No rush though - thanks for the good work so far! Regarding
I was thinking of aiming for
Force-pushed 1ecbdba to dfa18f2.
numpy/core/shape_base.py (outdated)

```python
    raise ValueError('Mismatched array shapes in block.')
shape_on_dim = sum(shape[concat_depth] for shape in shapes)
return (first_shape[:concat_depth] + (shape_on_dim,) +
        first_shape[concat_depth+1:])
```
Helper would be:

```python
def _concatenate_shapes(shapes, axis):
    """
    concatenate(arrs, axis).shape == _concatenate_shapes([a.shape for a in arrs], axis)
    """
    first_shape = shapes[0]
    for shape in shapes[1:]:
        if (shape[:axis] != first_shape[:axis] or
                shape[axis+1:] != first_shape[axis+1:]):
            raise ValueError('Mismatched array shapes in block.')
    shape_on_axis = sum(shape[axis] for shape in shapes)
    return first_shape[:axis] + (shape_on_axis,) + first_shape[axis+1:]
```
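As an illustrative check of the contract stated in the docstring:

```python
>>> _concatenate_shapes([(2, 3), (2, 4)], axis=1)
(2, 7)
>>> _concatenate_shapes([(2, 3), (3, 3)], axis=0)
(5, 3)
```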
Done.
numpy/core/shape_base.py (outdated)

```python
@@ -442,22 +442,25 @@ def _assignment_recursion(result, arrays, list_ndim, start_indicies,
    return arrays.shape


def _concatenate_shapes(shapes, axis):
```
I think you should make this unblock a separate issue. We could do what you suggested, or, what might also be useful: provided a list of shapes, we could then return views into the original array. Do I need to comment the internal functions more than what I have now? Personally, having gone through the super-commented code of
I agree.

Maybe a little more, but what you have looks pretty good.
numpy/core/shape_base.py (outdated)

```python
start_indicies[concat_depth] += shape[concat_depth]
# lists are mutable, so we have to reset the start_indicies list
start_indicies[concat_depth:] = (0, ) * (result.ndim - concat_depth)
return shape
```
This looks off to me - doesn't it just return the shape of the very last array?
Feel free to comment it out, then run pytest. It took me a while to figure out what was wrong. It is the same reason why you don't have a function signature with

```python
def foo(baz, bar=[]):
    bar.append(baz)
```

I used to have a tuple-based implementation, but it looked worse IMO.
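A minimal demonstration of the mutable-default pitfall referenced above:

```python
def foo(baz, bar=[]):
    # `bar` is created once, at definition time, and shared across calls
    bar.append(baz)
    return bar

print(foo(1))  # [1]
print(foo(2))  # [1, 2]: the same list carries over between calls
```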
See this comment - there's a simpler tuple-based implementation possible.
I'm getting better performance than concatenate on large arrays... is that a red herring?

htop says that assignment is single-threaded :/ This is kinda consistent. I think the poor performance on the size-10 array is due to the 2 extra unnecessary recursions (and probably the sanity-checking code that could get cleaned up).
The slice-based recursion could look something like:

```python
def _concat_info(arrays, list_ndim, result_ndim, start_indicies, depth=0):
    """
    Returns ``shape, [(idx_1, arr_1), (idx_2, arr_2), ...]``
    """
    if depth < list_ndim:
        item_pairs = []
        item_shapes = []
        concat_depth = depth + result_ndim - list_ndim
        for arr in arrays:
            shape, slices = _concat_info(arr, list_ndim, result_ndim,
                                          start_indicies, depth=depth+1)
            start_indicies[concat_depth] += shape[concat_depth]
            item_shapes.append(shape)
            item_pairs.append(slices)
        # lists are mutable, so we have to reset the start_indicies list
        start_indicies[concat_depth:] = (0,) * (result_ndim - concat_depth)
        shape = _concatenate_shapes(item_shapes, axis=concat_depth)
        slices = [pair for pairs in item_pairs for pair in pairs]
        return shape, slices
    else:
        # We've bottomed out; record the slice for this array
        idx = tuple(
            slice(start, start+size)
            for start, size in zip(start_indicies, arrays.shape)
        )
        return arrays.shape, [(idx, arrays)]

shape, pairs = _concat_info(...)
result = np.empty(shape)
for idx, subarray in pairs:
    result[idx] = subarray
```
numpy/core/shape_base.py
Outdated
start_indicies, depth=depth+1) | ||
start_indicies[concat_depth] += shape[concat_depth] | ||
# lists are mutable, so we have to reset the start_indicies list | ||
start_indicies[concat_depth:] = (0, ) * (result.ndim - concat_depth) |
A better way of handling this loop to avoid needing to reset the indices - initialize `start_indicies=()`, and do:

```python
i = 0  # or a better name
for arr in arrays:
    shape = _assignment_recursion(result, arr, list_ndim,
                                  start_indicies + (i,), depth=depth+1)
    i += shape[concat_depth]
```
Also, typo: `indices`, not `indicies`.
ok
That's not a typo, I just can't spell 👍
Fits into the broader category of failing to convert a sequence of phonemes in your head into a sequence of characters in memory, which I'm happy to write off as a typo :)
@eric-wieser That final refactoring is basically what is needed to get it all in 1 recursion. I was hoping to do it with some kind of list comprehension. I guess that won't happen though.
To get it in a more comprehension-y form, I think you'd need to eliminate the
Honestly, list comprehension seems like a cop-out for Python not implementing a JIT. I understand generator expressions, but if they actually worked on a JIT, you wouldn't need to learn to read loops backward. One day, I'll jump into PyPy.
I've really got no idea what you're talking about. As far as I'm aware, list comprehensions are nothing but (useful) syntactic sugar.
Why was I convinced that list comprehensions were faster than loops? Eye-opening!
An OO-based approach to give you your comprehension-based recursive step:

```python
import numpy as np

# `_concatenate_shapes` is the helper defined earlier in this thread.

def _concatenate_offsets(shapes, axis):
    # Running offset of each shape along `axis`
    offset = 0
    offsets = []
    for shape in shapes:
        offsets.append(offset)
        offset += shape[axis]
    return offsets


class concat_info:
    __slots__ = ('shape', 'dtype', 'locs')

    def __init__(self, shape, dtype, locs):
        self.shape = shape
        self.dtype = dtype
        self.locs = locs

    @classmethod
    def concatenate(cls, infos, axis):
        dtype = np.result_type(*(info.dtype for info in infos))
        shape = _concatenate_shapes([info.shape for info in infos], axis)
        offsets = _concatenate_offsets([info.shape for info in infos], axis)
        locs = [
            (
                loc[:axis] + (loc[axis] + offset,) + loc[axis+1:],
                arr
            )
            for info, offset in zip(infos, offsets)
            for loc, arr in info.locs
        ]
        return cls(shape, dtype, locs)

    @classmethod
    def single(cls, arr):
        return cls(arr.shape, arr.dtype, [((0,)*arr.ndim, arr)])

    @property
    def regions(self):
        for loc, arr in self.locs:
            region = tuple(
                slice(start, start + size)
                for start, size in zip(loc, arr.shape)
            )
            yield region, arr

    def store(self, out=None):
        if out is None:
            out = np.empty(self.shape, self.dtype)
        for index, arr in self.regions:
            out[index] = arr
        return out

    def load(self, in_):
        for index, arr in self.regions:
            arr[...] = in_[index]


def info_recurse(arrays, list_ndim, result_ndim, depth=0):
    if depth < list_ndim:
        axis = depth + result_ndim - list_ndim
        return concat_info.concatenate([
            info_recurse(arr, list_ndim, result_ndim, depth+1)
            for arr in arrays
        ], axis=axis)
    else:
        return concat_info.single(arrays)


def block(...):
    info = info_recurse(arrays, list_ndim, result_ndim)
    return info.store()
```

I don't know what the cost of the class is.
Looks good, including the new F-order test. I think everyone is happy with it now, so let's go ahead and merge soon. I'll wait a little bit in case other reviewers have final comments (any of you feel free to merge if you're happy too). (There is a release note conflict to resolve, but that's easily taken care of with the online editor before merging.)
@ahaldane should we add a test for slice-based blocking preserving F order?
Yeah, that would be a good addition. We're not guaranteeing users anything about order, but it would be nice to make sure we don't accidentally lose the F-order speed we get now in future updates.
Force-pushed fe7cab0 to a4b616a.
@ahaldane I rebased (after pulling in your changes) and added some simple tests for F and C order. The CIs are failing for some strange reason. I'll try to force-push to trigger them later.
Force-pushed a4b616a to 8e02803.
numpy/core/tests/test_shape_base.py (outdated)

```python
          [arr_f, arr_f]]]

assert block(b_c).flags['C_CONTIGUOUS']
assert block(b_mixed).flags['C_CONTIGUOUS']
```
I don't think we should have a test for the mixed case, because we don't want to guarantee anything about this case either way. Future updates should feel free to change the order.
ok.
numpy/core/tests/test_shape_base.py (outdated)

```python
assert block(b_mixed).flags['C_CONTIGUOUS']
# ``_block_force_concatenate`` returns some mixed `C`/`F` array
if block == _block_force_slicing:
    assert block(b_f).flags['F_CONTIGUOUS']
```
It would be better to make this test cover both the concatenate and the slicing code-path. What about removing the outer nesting in the test input so that we get a case which should be in F-order no matter what. I mean, do:

```python
b_f = [[arr_f, arr_f],
       [arr_f, arr_f]]
```
That test exists above, no?
Force-pushed 3390b68 to 22e3424.
numpy/core/shape_base.py (outdated)

````python
    ```

    These are called slice prefixes since they are used in the recursive
    blocking algorithm to compute the left most slice during the
````
Should this be "left-hand slices" or "left-most slices", since it's more than one slice?
yes.
numpy/core/shape_base.py (outdated)

```python
if any(shape[:axis] != first_shape_pre or
       shape[axis+1:] != first_shape_post for shape in shapes):
    raise ValueError('Mismatched array shapes in block.')
```
Would be nice to include "along axis {axis}" in this message
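For instance, a sketch of the suggested message (exact wording hypothetical):

```python
raise ValueError(
    'Mismatched array shapes in block along axis {}.'.format(axis))
```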
Thanks, I'm always annoyed when I have to find that information myself.
Force-pushed 22e3424 to 245cb80.
numpy/core/tests/test_shape_base.py (outdated)

```python
       [arr_c, arr_c]]

b_f = [[arr_f, arr_f],
        [arr_f, arr_f]]
```
Indentation error here
thanks!
…call to ``np.block``.
Force-pushed 245cb80 to f164d2e.
All right, I just gave it one more readthrough, and all looks good. Merging time! Here goes... Thanks @hmaarrfk for the PR and @eric-wieser for reviews.
```python
argnames = sorted(arglist[0])
metafunc.parametrize(argnames,
                     [[funcargs[name] for name in argnames]
                      for funcargs in arglist])
```
This seems like a lot of magic for something that can be achieved with a parametrized fixture
Off the top of my head, I think something like:

```python
class Tests:
    @pytest.fixture(params=[block1, block2, ...])
    def block(request):  # maybe needs self, not sure
        return request.param

    def method(self, block): ...
```
Do you have a working example of a fixture I can build off?
Does the one above not work?
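For reference, a fuller self-contained version of the fixture pattern above, assuming the two internal implementations this PR references in its tests (`_block_force_concatenate`, `_block_force_slicing`; the exact import path is an assumption):

```python
import pytest
import numpy as np

from numpy.core.shape_base import (
    _block_force_concatenate, _block_force_slicing)


class TestBlock:
    @pytest.fixture(params=[np.block, _block_force_concatenate,
                            _block_force_slicing])
    def block(self, request):
        # Each test that takes `block` runs once per implementation.
        return request.param

    def test_returns_copy(self, block):
        a = np.eye(3)
        b = block(a)
        b[0, 0] = 2
        assert a[0, 0] == 1  # mutating the result must not touch the input
```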
Block used to make repeated calls to concatenate. For a 3D block, this would copy the arrays 3 times, 1 time for each dimension of the block. This is 2 times too many. This proposed algorithm does so with a single memory copy.
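To illustrate the old behavior (schematic, not the actual implementation): a 2D block was assembled with one round of concatenates per nesting level, so the data was copied once per level.

```python
import numpy as np

a = np.ones((2, 2))

# Inner level: each row is concatenated, copying all of the data once.
rows = [np.concatenate([a, a], axis=1),
        np.concatenate([a, a], axis=1)]
# Outer level: the rows are concatenated, copying all of the data again.
out = np.concatenate(rows, axis=0)
```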
Wisdom allows numpy to guess which algorithm is better performing and chooses that one automatically. We can add a switch, to let the user decide.
TODO:
XREF: Other issues I found regarding `concatenate` that we might want to test for `block`: #11971 (comment), the previous ENH: Add an out argument to concatenate, #9209 (comment), and issue BUG: Fix bug with size 1-dims in CreateSortedStridePerm, #2696.

The general steps are: recursively compute the slice for each sub-array, along with the shape and `dtype` of the result; allocate the result; then assign each sub-array into its slice.

A benchmark was added to estimate the 2D blocking performance in a straightforward manner.
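A minimal sketch of such a 2D benchmark (array sizes and repeat counts here are illustrative only):

```python
import timeit
import numpy as np

a = np.full((256, 256), 1)

block_t = timeit.timeit(lambda: np.block([[a, a], [a, a]]), number=100)
concat_t = timeit.timeit(
    lambda: np.concatenate([np.concatenate([a, a], axis=1),
                            np.concatenate([a, a], axis=1)], axis=0),
    number=100)
print(block_t, concat_t)
```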
Approach details
2D results
For small arrays, this approach seems to suffer from large overhead. Benchmarks show that "small arrays" are anything as large as 256x256 on a modern i7 from 2017.

The performance of `master` is optimized in PR #11991, which shaves the overhead time for a 2x2 block from 20 us to 15 us. Benchmarks in this section are compared to that PR. The overhead for this method is 30 us when blocking a small 2x2 array.

My general sense is that small arrays are pretty important. While the `concatenate` approach was "wasteful" for large arrays, it was quick for small arrays. There is probably room for both approaches.

Benchmark results compared to PR #11991
A few small micro-optimizations
This PR builds on top of pretty optimized code already in numpy. Getting the performance close is hard/impossible.

List comprehensions are really expensive. Unfortunately, blocking allows for really fancy block shapes, which means that we can't do much but use them in Python.

A few things can be optimized. I'm unsure how these optimizations affect the performance in PyPy.
3D results

For large arrays, the benchmark #11965 shows that there is now little room to improve compared to a straight concatenate operation. This is expected, since the cost of copying the arrays dominates the cost of generating the slices.
Benchmarks

The benchmarks show results for a final array of size `(5n)^3`. Therefore, for n=10, the final array has 125,000 elements, each 8 bytes (int64), so close to 1 MB. `n=100` is closer to 1 GB.

New results
(I'm pretty convinced that the difference in time here between block and concatenate for n=100 is artificial, though not sure how to prove it.)
Improvement compared to the old results
Extensions
Unblocking
@eric-wieser discussed the possibility of using this algorithm to `unblock` arrays. This is rather straightforward to implement once we lock down the implementation. Details are in the comments below.
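A hedged sketch of the `unblock` idea, reusing the `_concat_info` helper sketched in the comments (the signature and `layout` argument are hypothetical):

```python
def unblock(big, layout, list_ndim, result_ndim):
    # Same slice computation as block, but assign in the other direction:
    # copy each region of `big` out into the corresponding sub-array.
    shape, pairs = _concat_info(layout, list_ndim, result_ndim,
                                [0] * result_ndim)
    for idx, subarray in pairs:
        subarray[...] = big[idx]
```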
User friendly options

Finally, this probably lends itself well to offering options like `order`, which would allow one to choose the order of the final array, or even to specify the output array through an `out` parameter.

Closures are avoided entirely, so I think that gh-10620 is not an issue.