
ENH: add padding options to diff #8206


Merged

merged 1 commit into numpy:master from mattharrigan:diff-to-begin on Sep 26, 2018

Conversation

mattharrigan
Contributor

Add kwargs to_begin and to_end, allowing values to be inserted
on either end of the differences. Similar to the options for ediff1d.
Closes #8132

@shoyer
Member

shoyer commented Oct 24, 2016

I agree with @madphysicist -- this needs to be able to handle N-dimensional to_begin/to_end, with appropriate handling.

Ideally, this would be done with broadcasting, so you could write np.diff(a, to_begin=0, axis=1) even if a.ndim > 1. np.broadcast_to should be helpful here.

Also, please make sure that this works with n>1. Right now the test coverage for higher order differentiation with to_begin/to_end feels pretty sparse.
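
For illustration, a minimal sketch of the broadcasting approach described above; the helper name _expand_pad is hypothetical, not part of the PR:

import numpy as np

def _expand_pad(value, a, axis):
    # Hypothetical helper: broadcast a scalar pad value to a's shape with
    # length 1 along `axis`, so it can be concatenated onto `a`.
    value = np.asanyarray(value)
    if value.ndim == 0:
        shape = list(a.shape)
        shape[axis] = 1
        value = np.broadcast_to(value, tuple(shape))
    return value

a = np.arange(12).reshape(3, 4)
pad = _expand_pad(0, a, axis=1)                          # shape (3, 1)
out = np.diff(np.concatenate([pad, a], axis=1), axis=1)  # shape (3, 4)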

@mattharrigan
Contributor Author

Just to make sure I understand: currently a can have arbitrary dimensions, but the suggestion is to additionally allow to_begin and to_end to have arbitrary dimensions? I think to_begin and to_end shouldn't support arbitrary dimensions, but could potentially be allowed up to a.ndim dimensions. Correct?

For a.shape = (3,4,5), axis=0, to_begin.shape = (2,4), should diff error or insert like to_begin[:, :, newaxis]? I think the axis argument really complicates broadcasting.

@madphysicist
Contributor

madphysicist commented Oct 24, 2016

to_begin and to_end should just broadcast correctly. They should only support a limited number of dimensions. In your example of a.shape = (3, 4, 5), I would expect any of the following to work for to_begin or to_end:

  • scalar -> broadcast to (1, 4, 5)
  • 1D array of shape (x,) -> broadcast to (x, 4, 5)
  • 2D array of shape (x, 4)
  • 3D array of shape (x, 1, 5)
  • 3D array of shape (x, 4, 5)

I hope I did not get the broadcasting rules backwards here.

My original comment referred to the fact that the shapes are not well documented, however you choose to implement them. I expect a to have an arbitrary number of dimensions, and the endcaps to match that.

@mattharrigan
Contributor Author

mattharrigan commented Oct 24, 2016

I apologize if I'm showing my ignorance of broadcasting rules, but I think the second and third examples are not compatible with typical broadcasting rules, which align shapes from right to left and require each pair of dimensions to be equal or for one of them to be 1.

Basically the axis argument complicates things to where I can't think of a simple rule to reliably allow higher dimensions for to_begin and to_end. I think the critical step is determining the shape of the output when it is initialized. I am definitely open to suggestions though!
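
As a quick illustration of the right-to-left alignment being described (not part of the PR):

import numpy as np

a = np.zeros((3, 4, 5))
a + np.zeros(5)         # OK: (5,) aligns with the trailing axis
a + np.zeros((4, 1))    # OK: (4, 1) aligns with the last two axes
try:
    a + np.zeros(4)     # fails: (4,) must align with the last axis (length 5)
except ValueError as exc:
    print(exc)          # operands could not be broadcast together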

@madphysicist
Contributor

I think you are right and I am being the ignorant one here. My expectation is based more on intuition/wishful thinking in this case. However, I have an open PR for a function called atleast_nd (#7804), which might actually help in this situation.

@mattharrigan
Contributor Author

I think atleast_nd would help tremendously in this case. Thanks for pointing that out.

@madphysicist
Contributor

Too bad that PR is just sitting there on hold... :-)

@madphysicist
Contributor

Thanks for finally finding a legitimate use case for my pet function.

@seberg
Member

seberg commented Oct 24, 2016

array(..., ndmin=blah) might be enough here ;p.

@madphysicist
Contributor

madphysicist commented Oct 24, 2016

That's all that atleast_nd really does, anyway, at least in this case, where you are prepending dims.
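
For reference, a small sketch of what ndmin does (the copy=False / subok=True behavior shown reflects numpy semantics at the time of this discussion):

import numpy as np

x = np.array([1, 2, 3])
np.array(x, ndmin=3).shape                        # (1, 1, 3): ones are prepended
y = np.array(x, ndmin=3, copy=False, subok=True)  # no copy, subclasses preserved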

@mattharrigan
Contributor Author

The latest code has a known test failure related to returning subclasses. See this. Solutions should probably be similar. I will update once that is determined.

@mattharrigan
Contributor Author

mattharrigan commented Oct 27, 2016

I ventured into the C internals to understand what concatenate actually does. I think the applicable code is here. Two things it does that this diff implementation does not are checking the dtype and the subclass/priority of all three potential input arrays. Here I am forcing the dtype and subclass of the output to match the primary input array "a". I think that is OK and would match a typical user's expectation. It is still probably broken for certain types of subarrays, but that is a much bigger issue. Please let me know what you think.

result = np.empty(tuple(shape), dtype=a.dtype)

# wrap ndarray subclasses
wrap = getattr(a, "__array_prepare__", a.__array_wrap__)
Contributor

I think you should use __array_wrap__ directly here; as I understand it, __array_prepare__ is for ufuncs.

Contributor Author

I'm hesitant to not follow what is done in other sections of the numpy code base. Unless an expert weighs in of course

Member

@mhvk is an expert :)

Contributor Author

sorry!

Contributor

But not an expert with a good memory -- so do point me to where else __array_prepare__ is used! It may well help me make Quantity work with something it didn't work with before!!

Contributor Author

Contributor

Ha! Mine were https://github.com/numpy/numpy/blob/master/numpy/core/fromnumeric.py#L42 and https://github.com/numpy/numpy/blob/master/numpy/lib/function_base.py#L4608
Some grepping suggests __array_wrap__ is used more often than __array_prepare__, but there is no obvious conclusion.

The documentation is not terribly clear -- see https://docs.scipy.org/doc/numpy-1.11.0/reference/arrays.classes.html -- but it does suggest that by default __array_prepare__ does nothing while __array_wrap__ changes the class to be that of the instance to which it is attached. I think that makes slightly more sense here. Since it also simplifies the code, I would suggest just using a.__array_wrap__ unconditionally.
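
A minimal sketch of the __array_wrap__ behavior described above, using a bare ndarray subclass for illustration:

import numpy as np

class MyArray(np.ndarray):
    pass

a = np.arange(4).view(MyArray)
result = np.empty(3, dtype=a.dtype)   # plain ndarray scratch buffer
wrapped = a.__array_wrap__(result)    # re-views result as type(a)
type(wrapped)                         # <class '__main__.MyArray'>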

Contributor Author

array_prepare it is

Contributor

__array_wrap__ I hope you meant ;-)

Contributor Author

sorry, copy paste error

# make to_begin a 1D array
if to_begin is None:
    l_begin = 0
elif isinstance(to_begin, str) and to_begin == 'first':
Contributor

Is there a reason not to simply write if to_begin == 'first': (i.e., omit the isinstance)? Python guarantees that equality checks never fail, i.e., that this evaluates to False if to_begin is not stringy.

Contributor Author

numpy arrays have special semantics when tested for equality; without the isinstance guard it spews user warnings
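
A short illustration of those semantics (exact warning behavior varies across numpy versions):

import numpy as np

to_begin = np.array([1, 2])
# Comparing an array against a string cannot broadcast; depending on the
# numpy version this evaluates to plain False and may emit a FutureWarning.
# The isinstance guard short-circuits before any array comparison happens:
if isinstance(to_begin, str) and to_begin == 'first':
    pass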

Contributor

Duh, yes, of course, you get back an array of bool... Should have thought of that.

@mhvk
Contributor

mhvk commented Oct 27, 2016

Broader question: should first be applied repeatedly for higher-order differences (such that repeated cumsum recovers the data)?

Also, the option I would probably use myself, if it were available, would be to_begin='extrapolate' (or something similar), which would set the first item equal to the first difference (arguably, one could then have a to_end='extrapolate' as well). Though it is absolutely fine not to implement those! It is not exactly much work to do to_begin=0. and then follow the call with result[0] = result[1]. (Indeed, for the same reason perhaps first should not be implemented either.)
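
A sketch of that workaround, written with the prepend spelling that was eventually merged (to_begin='extrapolate' itself was never implemented):

import numpy as np

a = np.array([3., 5., 9., 10.])
d = np.diff(a, prepend=0.)  # pad first, then overwrite the padding
d[0] = d[1]                 # 'extrapolate': first item equals first difference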

# compute the length of the diff'd portion
# force length to be non negative
l_diff = a.shape[axis] - 1
if l_diff < 0:
Contributor

Just write l_diff = max(a.shape[axis] - 1, 0)?

Contributor Author

sure

if to_end is None:
    l_end = 0
else:
    to_end = np.atleast_1d(to_end)
Contributor

Do you need atleast_1d here? I'd write np.asanyarray(to_end) (might as well pass subclasses through here too).

Contributor Author

I think it's required for scalar inputs. For example, len(np.array(0)) errors out.

Contributor

Good point, but you can also rewrite l_end = to_end.size, which works for scalars as well. (Obviously, this doesn't matter much in the broader perspective!)
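
A two-line check of that point (illustrative only):

import numpy as np

np.asanyarray(0).size        # 1 -- works where len(np.array(0)) raises
np.asanyarray([1, 2]).size   # 2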

Contributor Author

OK, I'll make that change. Curiously, np.asanyarray(0).ndim returns 0, so I switched the == 1 to < 2.

Contributor

yes, scalars are zero-dimensional, but with your change, that should work fine.

if to_end.ndim == 1:
    l_end = len(to_end)
else:
    to_end = np.array(to_end, ndmin=nd)
Contributor

Here, add copy=False, subok=True to avoid unnecessary copy and allow subclasses.

Contributor

Potentially another use case for atleast_nd :)

Contributor

I'm actually confused why the ndmin is necessary here: all those prepended ones don't matter for the broadcasting anyway.

Contributor Author

I'll add those kwargs, I mistakenly thought that was the default.

The extra prepended ones are required for the next line, but to your point there are probably smarter ways of doing that

else:
    to_end = np.atleast_1d(to_end)
    if to_end.ndim == 1:
        l_end = len(to_end)
Contributor

For this case, I think you'd still need to reshape to_end so that its dimension with values is at the right axis, no? I.e.,

to_end.shape = (l_end,) + (1,) * (nd - axis - 1)

Contributor Author

broadcasting magic seems to take care of that

Contributor

Are you sure that is true for a 3-d array where you take the diff on axis=1 and have a 1-dimensional to_end with more than 1 element?

a = np.zeros((5,4,3))
to_end = np.array([1,2])
a[:, 2:, :] = to_end
# ValueError: could not broadcast input array from shape (2) into shape (5,2,3)
to_end.shape = (2, 1)
a[:, 2:, :] = to_end
# works

Contributor Author

Sounds like a good test case. Does this answer it sufficiently?

Contributor

No, because your x.take returns a multidimensional array which will already have a consistent shape. You'd need to test with a one-dimensional array with length > 1. But it certainly is a good place to add the test! (And I must add that I admire how thorough your tests already are.)

Contributor Author

Good catch, you're right. I missed that test case. Should work now

# copy values to end
if l_end > 0:
    end_slice = [slice(None)] * nd
    end_slice[axis] = slice(l_begin + l_diff, None)
Contributor

Maybe just

end_slice = (slice(None),) * axis + (slice(l_begin + l_diff, None),)

(or even just put that whole expression inside the square brackets in the line below)

Contributor

Or calculate end_slice in the earlier if to_end is None... else clause.

Contributor Author

I don't think that works; the slice which isn't slice(None) must be at axis, not necessarily at the end.

Contributor

My suggestion puts it at axis and leaves out all the slice(None) following it (which are not required).
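
A quick demonstration that the trailing slice(None) entries are indeed redundant (not part of the PR):

import numpy as np

a = np.arange(24).reshape(2, 3, 4)
axis = 1
full = a[(slice(None),) * axis + (slice(1, None),) + (slice(None),) * (a.ndim - axis - 1)]
short = a[(slice(None),) * axis + (slice(1, None),)]
assert (full == short).all()   # a[:, 1:, :] equals a[:, 1:]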

Contributor Author

Honestly, I'm not expert enough to comment on the reasoning, but that change causes some tests to fail.

Member

Can we convert back to tuple before indexing, so as not to add more work to #4434?
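
A sketch of the tuple-indexing form being requested, with illustrative values for nd, axis, and the lengths:

import numpy as np

nd, axis = 3, 1
l_begin, l_diff = 1, 3
result = np.zeros((2, 5, 3))
end_slice = [slice(None)] * nd
end_slice[axis] = slice(l_begin + l_diff, None)
result[tuple(end_slice)] = 7.0   # tuple indexing, not the deprecated list form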

@mattharrigan
Contributor Author

There were quite a few comments; thanks for the good feedback. I think I got them all, but let me know if I missed some. I am holding out for atleast_nd though.

@mhvk
Contributor

mhvk commented Oct 28, 2016

@mattharrigan - apart from the trivial comment (and not quite understanding why my suggestion for end_slice failed...), two more general questions:

  1. I'm not sure that treating a 1-d array as something that will just extend the result along axis is a good idea, as it breaks standard broadcasting rules. E.g., consider
np.diff([[1, 2], [4, 8]], to_begin=[1, 4])
# with your PR:
array([[1, 4, 1],
       [1, 4, 4]])
# but from regular broadcasting I would expect
array([[1, 1],
       [4, 4]])
# i.e., the same as if I did to_begin=[[1, 4]]

I think it is slightly odd to break the broadcasting expectation here, especially since the regular use case surely is just to add a single element so that one keeps the original shape. The advantage of assuming this is that you do not have to do any array shaping of to_begin and to_end (which perhaps also suggests it is the right thing to do).

  2. As I mentioned above, I think it may be worth thinking through a little what to do with higher-order differences, at least for to_begin='first'. If the goal is to ensure that with that option diff becomes the inverse of cumsum, then for higher order one should add multiple elements in front, i.e., for that case the recursive call should be
return np.diff(np.diff(a, to_begin='first'), n-1, to_begin='first')
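
For the n=1 case, the inverse-of-cumsum property is easy to see with the prepend API that was ultimately merged ('first' itself was not added):

import numpy as np

x = np.array([1, 4, 9, 16])
np.cumsum(np.diff(x, prepend=0))   # array([ 1,  4,  9, 16]) -- recovers x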

@mattharrigan
Contributor Author

Should those questions be posed to the numpy mailing list?

@mhvk
Contributor

mhvk commented Oct 28, 2016

should those questions be posed to the numpy mailing list?

Seems sensible, so I wrote a reply to the chain you started originally.

@@ -1094,6 +1094,18 @@ def diff(a, n=1, axis=-1):
axis : int, optional
The axis along which the difference is taken, default is the
last axis.
prepend : array_like
Values to prepend to the beginning of "a" along axis before
Member

@eric-wieser commented Jun 24, 2018

I think it's worth noting that this only exists as a more efficient option than np.concatenate

Edit: never mind, that's not even true.

@@ -1139,6 +1151,8 @@ def diff(a, n=1, axis=-1):
array([ 1, 2, 3, -7])
>>> np.diff(x, n=2)
array([ 1, 1, -10])
>>> np.cumsum(np.diff(x, prepend=0))
Member

@eric-wieser commented Jun 24, 2018

Can you show the equivalent np.concatenate invocation in an example too?
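
For reference, one plausible equivalence along the lines being requested (using the x from the docstring example above):

import numpy as np

x = np.array([1, 2, 4, 7, 0])
np.diff(x, prepend=0)              # array([ 1,  1,  2,  3, -7])
np.diff(np.concatenate([[0], x]))  # same result via explicit concatenation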

@mattip
Member

mattip commented Jun 24, 2018

Needs a rebase to fix merge conflicts with the release notes?

@mattharrigan
Contributor Author

@eric-wieser and @mattip: done

@@ -1094,6 +1094,12 @@ def diff(a, n=1, axis=-1):
axis : int, optional
The axis along which the difference is taken, default is the
last axis.
prepend, append : array_like, optional
Values to prepend or append to the beginning of "a" along axis
Member

Drop "to the beginning" - it's wrong for "append", and implied for "prepend"

@@ -26,6 +26,11 @@ Compatibility notes
C API changes
=============

``np.diff`` Added kwargs prepend and append
Member

This isn't relevant to the C API. Perhaps move it to "improvements"?

@mattharrigan
Contributor Author

@eric-wieser: done, sorry for the errors. I don't understand why there is a merge conflict in the release notes.

@mattharrigan
Contributor Author

ready to merge

@mattharrigan
Contributor Author

I would greatly appreciate a maintainer reviewing this PR and merging or commenting. Thank you

@charris
Member

charris commented Aug 27, 2018

@mattharrigan I fixed the merge conflict.

@mattharrigan
Contributor Author

@charris thank you.

I think it is ready to merge

performing the difference. Scalar values are expanded to
arrays with length 1 in the direction of axis and the shape
of the input array along all other axes. Otherwise the
dimension and shape must match "a" except along axis.
Member

Can we add a test verifying that ValueError is raised if the shape doesn't match?
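
A sketch of such a test (the shapes here are illustrative, not taken from the PR's test suite):

import numpy as np
from numpy.testing import assert_raises

x = np.zeros((3, 4))
# prepend must match x everywhere except along the diff axis
assert_raises(ValueError, np.diff, x, axis=0, prepend=np.zeros((1, 5)))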

@@ -1173,6 +1183,29 @@ def diff(a, n=1, axis=-1):
"order must be non-negative but got " + repr(n))

a = asanyarray(a)

combined = list()
Member

Nit: combined = [] might be a little more idiomatic here.

@mattharrigan mattharrigan force-pushed the diff-to-begin branch 4 times, most recently from 3026bed to 7d29e1c on September 26, 2018 00:47
@mattharrigan
Contributor Author

@shoyer updated per your comments. I believe this is ready to commit

@shoyer
Member

shoyer commented Sep 26, 2018

OK, I'm just waiting on CI to pass before merging

Member

@eric-wieser left a comment

This risks throwing an IndexError when it should throw an AxisError


if len(combined) > 1:
    a = np.concatenate(combined, axis)

nd = a.ndim
axis = normalize_axis_index(axis, nd)
Member

This line needs to come before your additions

Contributor Author

Can you please show me a short example? I'll make this change and also add a test.

Contributor Author

@mattharrigan commented Sep 26, 2018

I assume it's just a variation of this, from TestDiff.test_axis:
assert_raises(np.AxisError, diff, x, axis=3)
assert_raises(np.AxisError, diff, x, axis=-4)

Member

Yep, those tests, but with your new arguments
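
A sketch of what those tests might look like with the new arguments:

import numpy as np
from numpy.testing import assert_raises

x = np.zeros((10, 20, 30))
assert_raises(np.AxisError, np.diff, x, axis=3, prepend=0)
assert_raises(np.AxisError, np.diff, x, axis=-4, append=0)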

@mattharrigan
Contributor Author

@shoyer merge?

@shoyer shoyer merged commit fe1c1fb into numpy:master Sep 26, 2018
@shoyer
Member

shoyer commented Sep 26, 2018

thanks @mattharrigan, especially for your patience with us!

@mattharrigan
Contributor Author

Thanks for all your help and patience with me too. It feels good to actually finish this.
