DOC: correction to numpy.pad docstring #13149

grlee77 · 2019-03-17T21:08:29Z

When a custom function is provided to numpy.pad, the docstring states that this function should return a rank 1 vector. However, it actually seems that it is required that the rank 1 vector input be modified in-place and it is not necessary to return an array.

I also updated a test case to make sure it is consistent with the modified description proposed here.

Two slight modifications to the example from the docstring illustrate this behavior.

1.) Padding fails if a copy of vector is made internally:

def pad_with(vector, pad_width, iaxis, kwargs):
    pad_value = kwargs.get('padder', 10)
    vector = vector.copy()
    vector[:pad_width[0]] = pad_value
    vector[-pad_width[1]:] = pad_value
    return vector
a = np.arange(6)
a = a.reshape((2, 3))
np.pad(a, 2, pad_with)

Out:
array([[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 2, 0, 0],
[0, 0, 3, 4, 5, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]])

2.) Padding works if modification is done in-place, even without any return value:

def pad_with(vector, pad_width, iaxis, kwargs):
    pad_value = kwargs.get('padder', 10)
    vector[:pad_width[0]] = pad_value
    vector[-pad_width[1]:] = pad_value
    return
a = np.arange(6)
a = a.reshape((2, 3))
np.pad(a, 2, pad_with)

Out:
array([[10, 10, 10, 10, 10, 10, 10],
[10, 10, 10, 10, 10, 10, 10],
[10, 10, 0, 1, 2, 10, 10],
[10, 10, 3, 4, 5, 10, 10],
[10, 10, 10, 10, 10, 10, 10],
[10, 10, 10, 10, 10, 10, 10]])

mattip · 2019-03-19T07:44:41Z

LGTM, the docstring check passes so it seems the documentation fix is correct. Is there a reason the design is to modify the vector in-place? Is it more efficient that way? Perhaps the docstring could hint at the motivation for the (in my opinion) non-standard design.

grlee77 · 2019-03-19T16:03:04Z

Is there a reason the design is to modify the vector in-place?

I am not entirely sure if that is intentional, but the relevant usage of the provided function occurs with apply_along_axis:

numpy/numpy/lib/arraypad.py

Lines 1223 to 1236 in 17b31fd

    
           newmat = np.zeros(new_shape, narray.dtype) 
        
           # Insert the original array into the padded array 
        
           newmat[offset_slices] = narray 
        
           # This is the core of pad ... 
        
           for iaxis in rank: 
        
               np.apply_along_axis(function, 
        
                                   iaxis, 
        
                                   newmat, 
        
                                   pad_width[iaxis], 
        
                                   iaxis, 
        
                                   kwargs) 
        
           return newmat

In that section of code, the output of apply_along_axis is never assigned, so it must be relying on in-place modification of newmat. I assume this is for reason of efficiency, but am not sure. That section of code does not seem to have been modified in years.

I guess now the question is whether to regard this as an error in documentation or as a bug in np.pad. I think if the output of np.apply_along_axis were assigned back to newmat within the loop above then perhaps the existing docstring could be left as-is.

I came across the issue when working on PyWavelets/pywt#478. I was surprised to find that it was necessary to avoid an internal copy, which made me suspect that the return value was not being used.

grlee77 · 2019-03-19T16:08:27Z

I started looking at np.apply_along_axis and it seems that it is still going to be allocating a new output array, even if it is never being used. In that case, I am not sure there is any benefit to relying on it doing the modification in-place.

grlee77 · 2019-03-19T16:40:21Z

I did some preliminary testing and did not see any performance difference for padding a relatively large array of size (256, 256, 256) by 2 with or without assignment to newmat in the loop, so I don't think it would be a problem to fix the behaviour to match the current docstring.

However, if we do keep the requirement for in-place modification, it is possible to get a substantial performance improvement (I saw around 30-40% for my test case) if we continue to rely on in-place modification, but replace calls to np.apply_along_axis with calls to a locally defined version of that function that assumes in-place operation.

i.e. using _apply_along_axis_inplace as defined below (basically just an excerpt from the existing apply_along_axis, omitting the buffer allocation and assignment code):

def _apply_along_axis_inplace(func1d, axis, arr, *args, **kwargs):
    """
    Apply a function to 1-D slices along the given axis.

    This is like apply_along_axis, but relies on the modification to arr being
    done in-place.
    """
    # handle negative axes
    arr = asanyarray(arr)
    nd = arr.ndim
    axis = normalize_axis_index(axis, nd)

    # arr, with the iteration axis at the end
    in_dims = list(range(nd))
    inarr_view = transpose(arr, in_dims[:axis] + in_dims[axis+1:] + [axis])

    # compute indices for the iteration axes, and append a trailing ellipsis to
    # prevent 0d arrays decaying to scalars, which fixes gh-8642
    inds = ndindex(inarr_view.shape[:-1])
    inds = (ind + (Ellipsis,) for ind in inds)

    # invoke the function on the first item
    try:
        ind0 = next(inds)
    except StopIteration:
        raise ValueError('Cannot apply_along_axis when any iteration dimensions are 0')
    func1d(inarr_view[ind0], *args, **kwargs)
    for ind in inds:
        func1d(inarr_view[ind], *args, **kwargs)
    return

I suspect use of such user-defined functions is relatively rare, though, so I'm not sure it is worth optimizing this case?

grlee77 · 2019-03-19T16:42:00Z

@lagru: I know you have been working on reducing memory allocations in numpy.pad. Do you have any opinion on whether we should update the docstring as proposed here (leaving open the possibility to optimize as suggested in the prior comment) or if we should just fix the behaviour so that it matches the current docstring?

lagru · 2019-03-20T16:20:48Z

First of, I have largely ignored that part of the function.

I strongly prefer for only one strategy to work. Either modify in place or use the returned output. I'd guess that the second approach would be a lot slower so I tend to prefer the first solution which would be more inline with general strategy in MAINT: Rewrite numpy.pad without concatenate #11358.
I dislike adding a second slightly modified version of apply_along_axis. Not sure if it would be better to add a switch to the original apply_along_axis that toggles allocating the output.

[...] whether we should update the docstring as proposed here (leaving open the possibility to optimize as suggested in the prior comment) or [...]

I prefer this solution and think it would be worth to further discuss how to avoid the unnecessary allocation in apply_along_axis.

numpy/lib/arraypad.py

grlee77 · 2019-03-20T18:32:46Z

I dislike adding a second slightly modified version of apply_along_axis.

Yeah, I agree that it would be better to support it within apply_along_axis itself. It is an easy change within the function to allow skipping the buffer creation. The tricky part is if/how to modify the function signature and whether to publicly expose this functionality in the API. It currently has the signature:
apply_along_axis(func1d, axis, arr, *args, **kwargs)
where *args and **kwargs get passed along to func1d. This makes it hard to add new optional arguments without potentially breaking existing code that could have potentially been passing arguments of the same name along to func1d.

If we don't care to make the functionality public, a hack to avoid modifying the signature could be to check for some "private" keyword such as _apply_inplace and pop it from kwargs before passing them along to func1d. That doesn't seem the most appealing, though....

grlee77 · 2019-03-20T18:42:59Z

It looks like this in-place assumption has been around since at least numpy 1.7 (that was the oldest tag I looked at). Assuming there is agreement to document it as requiring in-place operation, the modified test case here should make sure this is not inadvertently changed going forward.

We can decide about potentially optimizing apply_along_axis in a separate issue.

numpy/lib/arraypad.py

mattip · 2019-03-25T19:24:46Z

Thanks @grlee77, it would be nice to open a follow-on issue to optimize apply_along_axis

* DOC: fix mistatement in numpy.pad docstring

grlee77 added 2 commits March 17, 2019 17:04

DOC: fix mistatement in numpy.pad docstring

8b42fe6

TST: update test case to omit explicit return value

ee324d3

charris added 04 - Documentation component: numpy.lib labels Mar 17, 2019

lagru reviewed Mar 20, 2019

View reviewed changes

numpy/lib/arraypad.py Show resolved Hide resolved

DOC: variable name: pad_tuple->iaxis_pad_width

f96a52d

eric-wieser reviewed Mar 23, 2019

View reviewed changes

numpy/lib/arraypad.py Outdated Show resolved Hide resolved

TST/DOC: remove empty return statements

9e81eb7

mattip merged commit d6dcaed into numpy:master Mar 25, 2019

grlee77 mentioned this pull request Mar 26, 2019

better support for in-place modification in apply_along_axis #13192

Closed

grlee77 added a commit to grlee77/numpy that referenced this pull request Mar 26, 2019

DOC: correction to numpy.pad docstring (numpy#13149)

46e39c8

* DOC: fix mistatement in numpy.pad docstring

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DOC: correction to numpy.pad docstring #13149

DOC: correction to numpy.pad docstring #13149

grlee77 commented Mar 17, 2019

mattip commented Mar 19, 2019

grlee77 commented Mar 19, 2019

grlee77 commented Mar 19, 2019

grlee77 commented Mar 19, 2019

grlee77 commented Mar 19, 2019

lagru commented Mar 20, 2019

grlee77 commented Mar 20, 2019

grlee77 commented Mar 20, 2019

mattip commented Mar 25, 2019

DOC: correction to numpy.pad docstring #13149

DOC: correction to numpy.pad docstring #13149

Conversation

grlee77 commented Mar 17, 2019

mattip commented Mar 19, 2019

grlee77 commented Mar 19, 2019

grlee77 commented Mar 19, 2019

grlee77 commented Mar 19, 2019

grlee77 commented Mar 19, 2019

lagru commented Mar 20, 2019

grlee77 commented Mar 20, 2019

grlee77 commented Mar 20, 2019

mattip commented Mar 25, 2019