-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH: add padding options to diff #8206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I agreed with @madphysicist -- this needs to be able to handle N-dimensional Ideally, this would be done with broadcasting, so you say Also, please make sure that this works with |
Just to make sure I understand, currently a can have arbitrary dimensions, but the suggestion is to additionally allow to_begin * to_end to have arbitrary dimensions? I think to_begin and to_end shouldn't support arbitrary dimensions, but potentially allow them to be up to a.ndim dimensional. Correct? For a.shape = (3,4,5), axis=0, to_begin.shape = (2,4), should diff error or insert like to_begin[:, :, newaxis]? I think the axis argument really complicates broadcasting. |
My original comment referred to the fact that the shapes are not will documented, however you choose to implement them. I expect |
I apologize if I'm showing my ignorance of broadcasting rules, but I think the second and third examples are not compatible with typical broadcasting rules, which goes from right to left with equal dimensions or one of them is equal to 1. Basically the axis argument complicates things to were I can't think of a simple rule to reliably allow higher dimensions for to_begin and to_end. I think the critical step is how to determine the shape of the output when that is initialized. I am definitely open to suggestions though! |
I think you are right and I am being the ignorant one here. My expectation is based more on intuition/wishful thinking in this case. However, I have a PR for a function called |
I think atleast_nd would help tremendously in this case. Thanks for pointing that out. |
Too bad that PR is just sitting there on hold... :-) |
Thanks for finally finding a legitimate use case for my pet function. |
|
That's all that |
The latest code has a known test failure related to returning subclasses. See this. Solutions should probably be similar. Will update once that is determined |
I ventured into the C internals to understand what concatenate actually does. I think the applicable code is here. Two things it does which this diff implementation does not is check the dtype and subclass/priority of all 3 potential input arrays. Here I am forcing the dtype and subclass of the output to match the primary input array "a". I think that is ok and would match a typical user's expectation. Still its probably broken for certain types of subarrays, but that is a much bigger issue. Please let me know what you think. |
numpy/lib/function_base.py
Outdated
result = np.empty(tuple(shape), dtype=a.dtype) | ||
|
||
# wrap ndarray subclasses | ||
wrap = getattr(a, "__array_prepare__", a.__array_wrap__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should use __array_wrap__
directly here; as I understand it, __array_prepare__
is for ufuncs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm hesitant to not follow what is done in other sections of the numpy code base. Unless an expert weighs in of course
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhvk is an expert :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But not an expert with good memory -- so do point me where else __array_prepare__
is used! It may well help me make Quantity
function with something it didn't function before!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/numpy/numpy/blob/master/numpy/linalg/linalg.py#L106 was my inspiration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha! Mine were https://github.com/numpy/numpy/blob/master/numpy/core/fromnumeric.py#L42 and https://github.com/numpy/numpy/blob/master/numpy/lib/function_base.py#L4608
Some grepping suggests __array_wrap__
is used more often than __array_prepare__
, but there is no obvious conclusion.
The documentation is not terribly clear -- see https://docs.scipy.org/doc/numpy-1.11.0/reference/arrays.classes.html -- but it does suggest that by default __array_prepare__
does nothing while __array_wrap__
changes the class to be that of the instance to which it is attached. I think that makes slightly more sense here. Since it also simplifies the code, I would suggest just to use a.__array_wrap__
unconditionally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
array_prepare it is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__array_wrap__
I hope you meant ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, copy paste error
numpy/lib/function_base.py
Outdated
# make to_begin a 1D array | ||
if to_begin is None: | ||
l_begin = 0 | ||
elif isinstance(to_begin, str) and to_begin == 'first': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason not to simply write if to_begin == 'first':
(i.e., omit the isinstance
). Python guarantees that equality checks never fail, i.e., that this evaluates to False
if to_begin
is not stringy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numpy arrays have special semantics when tested for equality, otherwise it spews user warnings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duh, yes, of course, you get back an array of bool... Should have thought of that.
Broader question: should Also, the option I would probably use myself if it were available were |
numpy/lib/function_base.py
Outdated
# compute the length of the diff'd portion | ||
# force length to be non negative | ||
l_diff = a.shape[axis] - 1 | ||
if l_diff < 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just write l_diff = max(a.shape[axis] - 1, 0)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
numpy/lib/function_base.py
Outdated
if to_end is None: | ||
l_end = 0 | ||
else: | ||
to_end = np.atleast_1d(to_end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need at least 1D here? I'd write np.asanyarray(to_end)
(might as well pass subclass here too).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its required for scalar inputs. For example this errors out len(np.array(0))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, but you can also rewrite l_end = to_end.size
, which works for scalars as well. (Obviously, this doesn't matter much in the broader perspective!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I'll make that change. curiously np.asanyarray(0).ndim returns 0, so i switched the == ` to < 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, scalars are zero-dimensional, but with your change, that should work fine.
numpy/lib/function_base.py
Outdated
if to_end.ndim == 1: | ||
l_end = len(to_end) | ||
else: | ||
to_end = np.array(to_end, ndmin=nd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, add copy=False, subok=True
to avoid unnecessary copy and allow subclasses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially another use case for atleast_nd
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually confused why the ndmin
is necessary here: all those prepended ones don't matter for the broadcasting anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add those kwargs, I mistakenly thought that was the default.
The extra prepended ones are required for the next line, but to your point there are probably smarter ways of doing that
numpy/lib/function_base.py
Outdated
else: | ||
to_end = np.atleast_1d(to_end) | ||
if to_end.ndim == 1: | ||
l_end = len(to_end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this case, I think you'd still need to reshape l_end
so that its dimension with values is at the right axis, no? I.e.,
to_end.shape = (l_end,) + (1,) * (nd-axis-1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
broadcasting magic seems to take care of that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure that is true for a 3-d array where you take the diff on axis=1 and have a 1-dimensional to_end
with more than 1 element?
a = np.zeros((5,4,3))
to_end = np.array([1,2])
a[:, 2:, :] = to_end
# ValueError: could not broadcast input array from shape (2) into shape (5,2,3)
to_end.shape = (2, 1)
a[:, 2:, :] = to_end
# works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds like a good test case. does this answer it sufficiently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because your x.take
returns a multidimensional array which will already have a consistent shape. You'd need to test with a one-dimensional array with length > 1. But it certainly is a good place to add the test! (and I must add that I admire how thoroughly your tests already were).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, you're right. I missed that test case. Should work now
numpy/lib/function_base.py
Outdated
# copy values to end | ||
if l_end > 0: | ||
end_slice = [slice(None)] * nd | ||
end_slice[axis] = slice(l_begin + l_diff, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just
end_slice = (slice(None),) * axis + (slice(l_begin + l_diff, None),)
(or even just put that whole expression inside the square brackets in the line below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or calculate end_slice
in the earlier if to_end is None... else
clause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that works, the slice which isn't None must be at axis, not necessarily at the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion puts it at axis
and leaves out all the slice(None)
following it (which are not required).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
honestly I'm not expert enough to comment on the reasoning, but that change causes some tests to fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we convert back to tuple
before indexing, so as not to add more work to #4434?
there were quite a few comments, thanks for the good feedback. I think I got them, but let me know if I missed some. I am holding out for atleast_nd though |
@mattharrigan - apart from the trivial comment (and not quite understanding why my suggestion for
I think it is slightly odd to break the broadcasting expectation here, especially since the regular use case surely is just to add a single element so that one keeps the original shape. The advantage of assuming this is that you do not have to do any array shaping of
|
should those questions be posed to the numpy mailing list? |
Seems sensible, so I wrote a reply to the chain you started originally. |
numpy/lib/function_base.py
Outdated
@@ -1094,6 +1094,18 @@ def diff(a, n=1, axis=-1): | |||
axis : int, optional | |||
The axis along which the difference is taken, default is the | |||
last axis. | |||
prepend : array_like | |||
Values to prepend to the beginning of "a" along axis before |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth noting that this only exists as a more efficient option than np.concatenate
Edit: nevermind, that's not even true.
numpy/lib/function_base.py
Outdated
@@ -1139,6 +1151,8 @@ def diff(a, n=1, axis=-1): | |||
array([ 1, 2, 3, -7]) | |||
>>> np.diff(x, n=2) | |||
array([ 1, 1, -10]) | |||
>>> np.cumsum(np.diff(x, prepend=0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you show the equivalent np.concatenate
invocation in an example too?
Needs a rebase to fix merge conflicts with the release notes? |
edd7552
to
e25adfc
Compare
@eric-wieser and @mattip: done |
numpy/lib/function_base.py
Outdated
@@ -1094,6 +1094,12 @@ def diff(a, n=1, axis=-1): | |||
axis : int, optional | |||
The axis along which the difference is taken, default is the | |||
last axis. | |||
prepend, append : array_like, optional | |||
Values to prepend or append to the beginning of "a" along axis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop "to the beginning" - it's wrong for "append", and implied for "prepend"
doc/release/1.16.0-notes.rst
Outdated
@@ -26,6 +26,11 @@ Compatibility notes | |||
C API changes | |||
============= | |||
|
|||
``np.diff`` Added kwargs prepend and append |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't relevant to the C API. Perhaps move it to "improvements"?
e25adfc
to
b2824e9
Compare
@eric-wieser: done, sorry for the errors. I don't understand why there is a merge conflict on the release notes. |
ready to merge |
I would greatly appreciate a maintainer reviewing this PR and merging or commenting. Thank you |
@mattharrigan I fixed the merge conflict. |
@charris thank you. I think it is ready to merge |
performing the difference. Scalar values are expanded to | ||
arrays with length 1 in the direction of axis and the shape | ||
of the input array in along all other axes. Otherwise the | ||
dimension and shape must match "a" except along axis. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a test verifying that ValueError
is raised if the shape doesn't match?
numpy/lib/function_base.py
Outdated
@@ -1173,6 +1183,29 @@ def diff(a, n=1, axis=-1): | |||
"order must be non-negative but got " + repr(n)) | |||
|
|||
a = asanyarray(a) | |||
|
|||
combined = list() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: combined = []
might be a little more idiomatic here.
3026bed
to
7d29e1c
Compare
@shoyer updated per your comments. I believe this is ready to commit |
OK, I'm just waiting on CI to pass before merging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This risks throwing an IndexError
when it should throw an AxisError
numpy/lib/function_base.py
Outdated
|
||
if len(combined) > 1: | ||
a = np.concatenate(combined, axis) | ||
|
||
nd = a.ndim | ||
axis = normalize_axis_index(axis, nd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line needs to come before your additions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please show me a short example? I'll make this change and also add a test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i assume its just a variation of this, from TestDiff.test_axis:
assert_raises(np.AxisError, diff, x, axis=3)
assert_raises(np.AxisError, diff, x, axis=-4)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, those tests, but with your new arguments
7d29e1c
to
8ec391a
Compare
@shoyer merge? |
thanks @mattharrigan, especially for your patience with us! |
thanks for all you help and patience with me too. It feels good to actually finish this. |
Add kwargs to_begin and to_end, allowing for values to be inserted
on either end of the differences. Similiar to options for ediff1d.
Closes #8132