MAINT: Rewrite numpy.pad without concatenate #11358

lagru · 2018-06-16T08:54:49Z

Description / Motivation

As suggested in #11033 (comment) and #11126 the current implementation of numpy.pad uses copies more than necessary. Currently most of the pad modes use numpy.concatenate under the hood to create the new array. This has to happen twice for each padded axis. It might be faster to pre-allocate the returned array once with the correct final shape and just set the appropriate edge values instead of concatenating them.

Roadmap

Support all existing pad modes (pass existing unit tests)
Add documentation and comments of new code
Profile, benchmark & optimize code
Ensure test suite is complete for old and new implementation
Adress feedback & reviews...

Creating the linear ramp as an array with 1-sized dimensions except for the one given by `axis` allows implicit broadcasting to the needed shape. This seems to be even a little bit faster that doing this by hand and allows the simplicifaction of the algorithm. Note: Profiling and optimization will be done again at a later stage.

Addresses feedback raised in PR.

This completes the first draft of the complete rewrite meaning all unit tests should pass from this commit onwards.

The set functions were nearly the same, apart from some index offsets. Merging them reduces code duplication.

The rewrite in past commits fixed this bug.

And include test to protect against regression.

Major changes & goals: Don't deal with pad area in the front and back separately. This modularity isn't needed and makes handling of the right edge more awkward. All modes now deal with the left and right side at the same time. Move the creation of the linear ramps fully to its own function which behaves like a vectorized version of linspace. Separate calculation and application of the pad area where possible. This means that _get_edges can be reused for _get_linear_ramps. Combine _normalize_shape and _validate_lengths in a single function which should handles common cases faster. Add new mode "empty" which leaves the padded areas undefined. Add documentation where it was missing.

numpy/lib/arraypad.py

numpy/lib/tests/test_arraypad.py

numpy/lib/arraypad.py

Reduce branching and try to make the calculation of the various indices easier to understand.

lagru · 2019-02-19T11:34:59Z

Not sure what the failure in Shippable is about.

@seberg, @eric-wieser, @mattip I think that I have addressed all your suggestions and feel like this is close to being acceptable. Do you want to do a final iteration or do you agree?

numpy/lib/tests/test_arraypad.py

Keep parametrization ordered, otherwise pytest-xdist might believe that different tests were collected during parallelization causing test failures.

lagru · 2019-02-20T08:21:20Z

Should I add this PR to the release notes for 1.17? If so, how detailed should I explain the changes? The fix for #11216 and the new mode "empty" are obvious but what is the convention on describing internal optimizations, especially if their effect is heavily dependent on the chosen input?

charris · 2019-02-20T13:19:05Z

There is an Improvements section, I'd just say that performance has been improved by filling in pre-allocated arrays instead of using concatenate, more detail isn't needed.

as well as the new mode "empty" in release notes (see numpygh-11358).

mattip · 2019-03-06T07:32:39Z

This seems ready to go in.

lagru · 2019-03-16T11:32:47Z

I resolved the merge conflict with the recent changes in #4808. Is there anything else blocking this? Should I perhaps send a small note on the mailing list?

seberg · 2019-03-16T17:31:52Z

No, nothing blocking it, except I want to make one more careful pass and was sick and then tied down in long experiments and other things :/ for a while. Thanks for fixing the conflict and sorry about the delay when it probably could simply have been been merged for a while.

lagru · 2019-03-16T17:43:56Z

@seberg Oh, no worries. I hope you're feeling better! I'm aware that changes of this scope do need careful consideration and that we are all doing this in our free time. There is no immediate reason to rush this. So if that wasn't clear, take my comment above as a curious query and not as a request. Please don't feel obligated to rush this! 🙂

seberg · 2019-03-24T14:40:12Z

OK, read through it once more and of course found no issues, so I could just squash and merge (probably will just do that tomorrow or so). Frankly, the biggest "issue" I saw is one of the Notes that I did not quite understand, at least not quickly (the one about "The modes 'reflect', 'symmetric', ..., lest the indexing tricks in non-integer...").

Some things to maybe open an issue for:

That singleton dimension extension for 'reflect' indeed seems ugly legacy behaviour.
stat_length should probably be forced integer, but is not. This is unchanged from the previous version, so I think we should leave it as is for now.

I bet both were already discussed previously ;).

lagru · 2019-03-24T20:13:08Z

I bet both were already discussed previously ;).

Yep, I think both points came up already with the same conclusions.

I saw is one of the Notes that I did not quite understand, at least not quickly (the one about "The modes 'reflect', 'symmetric', and 'wrap' must be padded with a single function, lest the indexing tricks in non-integer multiples of the original shape would violate repetition in the final iteration.").

Reading this now, I'm a little bit confused myself. I'm not sure anymore about the first part "must be padded with a single function". I guess it's a note meant to remind me of an earlier failed attempt. I remember that I found implementing these modes the most challenging.

I'll explain the second part about "non-integer multiples" with an example:

>>> a = np.arange(4)
>>> np.pad(a, (0, 6), "wrap")
array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1])

In this case, we perform 2 iterations to pad the dimension. First we pad by 4 elements and then only by 2 elements. With non-integer multiples of the original shape, I think I meant that we can only pad by a fraction of the original shape (in this case 2).
Considering this now, this is factually wrong or misleading: the area considered as "original" grows with each iteration. If we were to pad an array of shape (4,) on one side by 12, the implementation would only need two iterations (1st: pad by 4, 2nd: pad by 8).

I guess these notes are obsolete and in any case worded very badly. I'll remove them as they aren't useful now.

These notes are badly worded or actually misleading. For a better explanation on how these functions work have a look at the context and comments just above the lines calling these functions.

seberg · 2019-03-25T09:23:17Z

@lagru and everyone else, thanks a lot. This is all a bit improvement. Sorry if it took a bit long to merge.

lagru · 2019-03-25T14:04:56Z

Big thanks from me as well! I feel like the improvement was substantial thanks to everyone involved (especially @eric-wieser and @seberg).

mattip · 2019-03-25T14:13:16Z

While this is still fresh in people's memory cache, maybe some of the open issues that mention pad (not all are actually about np.pad) could be closed?

* ENH: Add support for constant, edge, linear_ramp to new numpy.pad Passes unit tests: - TestConstant - TestEdge - TestZeroPadWidth - TestLegacyVectorFunction - TestNdarrayPadWidth - TestUnicodeInput - TestLinearRamp * MAINT: Simplify diff / change order of functions * MAINT: Revert to old handling of keyword-only arguments * ENH: Add support for stat modes * ENH: Add support for "reflect" mode * MAINT: Remove _slice_column * ENH: Add support for "symmetric" mode * MAINT: Simplify mode "linear_ramp" Creating the linear ramp as an array with 1-sized dimensions except for the one given by `axis` allows implicit broadcasting to the needed shape. This seems to be even a little bit faster that doing this by hand and allows the simplicifaction of the algorithm. Note: Profiling and optimization will be done again at a later stage. * MAINT: Reorder arguments of a sum and fix typo Addresses feedback raised in PR. * ENH: Add support for "wrap" mode This completes the first draft of the complete rewrite meaning all unit tests should pass from this commit onwards. * MAINT: Merge functions for "reflect" and "symmetric" mode The set functions were nearly the same, apart from some index offsets. Merging them reduces code duplication. * TST: Add regression test for numpygh-11216 The rewrite in past commits fixed this bug. * BUG: Fix edge case for _set_wrap_both when pad_amt contains 0. And include test to protect against regression. * MAINT: Simplify and optimize pad modes Major changes & goals: Don't deal with pad area in the front and back separately. This modularity isn't needed and makes handling of the right edge more awkward. All modes now deal with the left and right side at the same time. Move the creation of the linear ramps fully to its own function which behaves like a vectorized version of linspace. Separate calculation and application of the pad area where possible. This means that _get_edges can be reused for _get_linear_ramps. Combine _normalize_shape and _validate_lengths in a single function which should handles common cases faster. Add new mode "empty" which leaves the padded areas undefined. Add documentation where it was missing. * TST: Don't use np.empty in unit tests * MAINT: Reorder workflow in numpy.pad and deal with empty dimensions Only modes "constant" and "empty" can extend dimensions of size 0. Deal with this edge case gracefully for all other modes either fail or return empty array with padded non-zero dimensions. Handle default values closer to their actual usage. And validate keyword arguments that must be numbers. * MAINT: Add small tweaks to control flow and documentation * BUG: Ensure wrap mode works if right_pad is 0 * ENH: Use reduced region of interest for iterative padding When padding multiple dimensions iteratively corner values are unnecessarily overwritten multiple times. This function reduces the working area for the first dimensions so that corners are excluded. * MAINT: Restore original argument order in _slice_at_axis * MAINT: Keep original error message of broadcast_to * MAINT: Restore old behavior for non-number end_values. * BENCH: Make the pad benchmark pagefault in setup * ENH/TST: Preserve memory layout (order) of the input array and add appropriate unit test. * STY: Revert cosmetical changes to reduce diff * MAINT: Pin dtype to float64 for np.pad's benchmarks * MAINT: Remove redundant code path in _view_roi * MAINT/TST: Provide proper error message for unsupported modes and add appropriate unit test. * STY: Keep docstrings consistent and fix typo. * MAINT: Simplify logical workflow in pad * MAINT: Remove dtype argument from _linear_ramp The responsibility of rounding (but without type conversion) is not really need in _linear_ramp and only makes it a little bit harder to reason about. * DOC: Add version tag to new argument "empty" * MAINT: Default to C-order for padded arrays unless the input is F-contiguous. * MAINT: Name slice of original area consistently for all arguments describing the same thing. * STY: Reduce vertical space * MAINT: Remove shape argument from _slice_at_axis Simplifies calls to this function and the function itself. Using `(...,)` instead should keep this unambiguous. This change is not compatible with Python 2.7 which doesn't support this syntax outside sequence slicing. If that is wanted one could use `(Ellipsis,)` instead. * TST: Test if end_values of linear_ramp are exact which was not given in the old implementation `_arange_ndarray`. * DOC: Improve comments and wrap long line * MAINT: Refactor index_pair to width_pair Calling the right value an index is just plain wrong as it can't be used as such. * MAINT: Make _linear_ramp compatible with size=0 * MAINT: Don't rely on negative indices for slicing Calculating the proper positive index of the start of the right pad area makes it possible to omit the extra code paths for a width of 0. This should make the code easier to reason about. * MAINT: Skip calculation of right_stat if identical If the input area for both sides is the same we don't need to calculate it twice. * TST: Adapt tests from numpygh-12789 to rewrite of pad * TST: Add tests for mode "empty" * TST: Test dtype persistence for all modes * TST: Test exception for unsupported modes * TST: Test repeated wrapping for each side individually. Reaches some only partially covered if-statments in _set_wrap_both. * TST: Test padding of empty dimension with constant * TST: Test if end_values of linear_ramp are exact which was not given in the old implementation `_arange_ndarray`. (Was accidentally overwritten during the last merge). * TST: Test persistence of memory layout Adapted from an older commit 3ac4d2a which was accidentally overwritten during the last merge. * MAINT: Simplify branching in _set_reflect_both Reduce branching and try to make the calculation of the various indices easier to understand. * TST: Parametrize TestConditionalShortcuts class * TST: Test empty dimension padding for all modes * TST: Keep test parametrization ordered Keep parametrization ordered, otherwise pytest-xdist might believe that different tests were collected during parallelization causing test failures. * DOC: Describe performance improvement of np.pad as well as the new mode "empty" in release notes (see numpygh-11358). * DOC: Remove outdated / misleading notes These notes are badly worded or actually misleading. For a better explanation on how these functions work have a look at the context and comments just above the lines calling these functions.

ENH: Add support for constant, edge, linear_ramp to new numpy.pad

a1d33d2

Passes unit tests: - TestConstant - TestEdge - TestZeroPadWidth - TestLegacyVectorFunction - TestNdarrayPadWidth - TestUnicodeInput - TestLinearRamp

charris changed the title ~~WIP: ENH: Rewrite numpy.pad without concatenate~~ WIP: MAINT: Rewrite numpy.pad without concatenate Jun 16, 2018

charris added component: numpy.lib 03 - Maintenance 51 - In progress labels Jun 16, 2018

eric-wieser reviewed Jun 16, 2018

View reviewed changes

numpy/lib/arraypad.py Outdated Show resolved Hide resolved

eric-wieser reviewed Jun 16, 2018

View reviewed changes

numpy/lib/arraypad.py Outdated Show resolved Hide resolved

eric-wieser reviewed Jun 16, 2018

View reviewed changes

numpy/lib/arraypad.py Show resolved Hide resolved

lagru added 5 commits June 17, 2018 14:01

MAINT: Simplify diff / change order of functions

df0a936

MAINT: Revert to old handling of keyword-only arguments

42930a6

ENH: Add support for stat modes

028c3a8

ENH: Add support for "reflect" mode

e58a83e

MAINT: Remove _slice_column

e2700fd

lagru force-pushed the pad branch from 2fbed41 to e2700fd Compare June 23, 2018 08:39

eric-wieser reviewed Jun 24, 2018

View reviewed changes

numpy/lib/arraypad.py Outdated Show resolved Hide resolved

eric-wieser reviewed Jun 24, 2018

View reviewed changes

numpy/lib/arraypad.py Outdated Show resolved Hide resolved

eric-wieser reviewed Jun 24, 2018

View reviewed changes

numpy/lib/arraypad.py Outdated Show resolved Hide resolved

lagru added 8 commits June 25, 2018 13:27

ENH: Add support for "symmetric" mode

5bacfdb

MAINT: Reorder arguments of a sum and fix typo

f55a79c

Addresses feedback raised in PR.

ENH: Add support for "wrap" mode

4fa818b

This completes the first draft of the complete rewrite meaning all unit tests should pass from this commit onwards.

MAINT: Merge functions for "reflect" and "symmetric" mode

a078444

The set functions were nearly the same, apart from some index offsets. Merging them reduces code duplication.

TST: Add regression test for numpygh-11216

9966caa

The rewrite in past commits fixed this bug.

BUG: Fix edge case for _set_wrap_both when pad_amt contains 0.

b0d32eb

And include test to protect against regression.

eric-wieser reviewed Jun 28, 2018

View reviewed changes

numpy/lib/arraypad.py Outdated Show resolved Hide resolved

eric-wieser reviewed Jun 28, 2018

View reviewed changes

numpy/lib/arraypad.py Outdated Show resolved Hide resolved

eric-wieser reviewed Jun 28, 2018

View reviewed changes

numpy/lib/tests/test_arraypad.py Outdated Show resolved Hide resolved

lagru commented Feb 15, 2019

View reviewed changes

numpy/lib/arraypad.py Show resolved Hide resolved

lagru added 3 commits February 18, 2019 15:22

MAINT: Simplify branching in _set_reflect_both

95e5863

Reduce branching and try to make the calculation of the various indices easier to understand.

TST: Parametrize TestConditionalShortcuts class

92ee158

TST: Test empty dimension padding for all modes

8fc3174

seberg reviewed Feb 19, 2019

View reviewed changes

numpy/lib/tests/test_arraypad.py Show resolved Hide resolved

TST: Keep test parametrization ordered

65845a3

Keep parametrization ordered, otherwise pytest-xdist might believe that different tests were collected during parallelization causing test failures.

lagru added 2 commits February 20, 2019 17:41

DOC: Describe performance improvement of np.pad

6ca26e9

as well as the new mode "empty" in release notes (see numpygh-11358).

Merge branch 'master' into pad

1e381e6

seberg self-assigned this Feb 28, 2019

Merge branch 'master' into pad

faf8716

lagru mentioned this pull request Mar 20, 2019

DOC: correction to numpy.pad docstring #13149

Merged

DOC: Remove outdated / misleading notes

d0a94a3

These notes are badly worded or actually misleading. For a better explanation on how these functions work have a look at the context and comments just above the lines calling these functions.

seberg merged commit 81f0dda into numpy:master Mar 25, 2019

lagru deleted the pad branch March 25, 2019 13:51

grlee77 mentioned this pull request Aug 13, 2019

replace cupy.pad with a heavily refactored version from NumPy 1.17 cupy/cupy#2399

Merged

2 tasks

jakirkham mentioned this pull request Nov 20, 2019

un-numpy-like behavior of da.pad with stats dask/dask#5303

Closed

grlee77 mentioned this pull request Jul 30, 2020

Add rolling ball algorithm for background substraction scikit-image/scikit-image#4851

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MAINT: Rewrite numpy.pad without concatenate #11358

MAINT: Rewrite numpy.pad without concatenate #11358

lagru commented Jun 16, 2018 •

edited

Loading

eric-wieser commented Jun 16, 2018

lagru commented Jun 17, 2018

lagru commented Feb 19, 2019

lagru commented Feb 20, 2019

charris commented Feb 20, 2019

mattip commented Mar 6, 2019

lagru commented Mar 16, 2019

seberg commented Mar 16, 2019

lagru commented Mar 16, 2019

seberg commented Mar 24, 2019

lagru commented Mar 24, 2019

seberg commented Mar 25, 2019

lagru commented Mar 25, 2019 •

edited

Loading

mattip commented Mar 25, 2019

MAINT: Rewrite numpy.pad without concatenate #11358

MAINT: Rewrite numpy.pad without concatenate #11358

Conversation

lagru commented Jun 16, 2018 • edited Loading

Description / Motivation

Roadmap

Related

eric-wieser commented Jun 16, 2018

lagru commented Jun 17, 2018

lagru commented Feb 19, 2019

lagru commented Feb 20, 2019

charris commented Feb 20, 2019

mattip commented Mar 6, 2019

lagru commented Mar 16, 2019

seberg commented Mar 16, 2019

lagru commented Mar 16, 2019

seberg commented Mar 24, 2019

lagru commented Mar 24, 2019

seberg commented Mar 25, 2019

lagru commented Mar 25, 2019 • edited Loading

mattip commented Mar 25, 2019

lagru commented Jun 16, 2018 •

edited

Loading

lagru commented Mar 25, 2019 •

edited

Loading