
How to limit cross correlation window width in Numpy? #5954


Closed
HoniSanders opened this issue Jun 9, 2015 · 15 comments

@HoniSanders

I am learning numpy/scipy, coming from a MATLAB background. The xcorr function in Matlab has an optional argument "maxlag" that limits the lag range from -maxlag to maxlag. This is very useful if you are looking at the cross-correlation between two very long time series but are only interested in the correlation within a certain time range. The performance increases are enormous considering that cross-correlation is incredibly expensive to compute.

What is troubling me is that numpy.correlate does not have a maxlag feature. This means that even if I only want to see correlations between two time series with lags between -100 and +100 ms, for example, it will still calculate the correlation for every lag between -20000 and +20000 ms (which is the length of the time series). This (theoretically) gives a 200x performance hit! Is it possible that I could contribute this feature?

I have introduced this question as a scipy issue and on the scipy-dev list. It seems the best place to start is with numpy.correlate, so that is what I am requesting. I have done a simple implementation which gives 50x speedup under my conditions.
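(For reference, a minimal sketch of the kind of limited-lag computation being discussed; this is a hypothetical pure-NumPy helper assuming equal-length, real-valued inputs, not the actual patch, but it shows how lags outside the requested range never need to be computed.)

    import numpy as np

    def correlate_maxlag(a, b, maxlag):
        # Cross-correlation of two equal-length sequences, evaluated only
        # for lags in [-maxlag, maxlag].
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        n = len(a)
        lags = np.arange(-maxlag, maxlag + 1)
        out = np.empty(lags.size)
        for i, lag in enumerate(lags):
            if lag >= 0:
                out[i] = np.dot(a[lag:], b[:n - lag])    # a shifted left by lag
            else:
                out[i] = np.dot(a[:n + lag], b[-lag:])   # a shifted right by -lag
        return lags, out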

This is my first experience with contributing to open-source software, so any pointers are appreciated.

Other issues related to correlate functions include ENH: Fold fftconvolve into convolve/correlate functions as a parameter #2651, Use FFT in np.correlate/convolve? (Trac #1260) #1858, and normalized cross-correlation (Trac #1714) #2310.

@seberg
Member

seberg commented Jun 9, 2015

Out of curiosity, how does Matlab (or maybe R?) handle rescaling effects due to different values when the max lag is not much smaller than the amount of data?
I think I often use np.correlate(a, b[-max_lag:max_lag], mode='valid'), to always use the same amount of data (and as much of the same values as possible). Maybe I am just crazy here, but I am wondering.
Not sure if I opened an issue, but I think some function hidden in matplotlib had this kind of thing, and when maxlag got larger I was annoyed by rescaling effects, IIRC.
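(A small sketch of the 'valid'-mode trick described above, under the assumption of equal-length inputs; the slice here is written as b[max_lag:-max_lag] so that exactly 2*max_lag + 1 lags come out, each computed from the same number of samples.)

    import numpy as np

    def valid_mode_xcorr(a, b, max_lag):
        # Every output lag uses len(b) - 2*max_lag products, so the
        # amount of data is the same for all lags.
        assert len(a) == len(b) and len(b) > 2 * max_lag
        return np.correlate(a, b[max_lag:-max_lag], mode='valid')

    a = np.random.randn(20000)
    b = np.random.randn(20000)
    print(valid_mode_xcorr(a, b, 100).shape)   # (201,) -> lags -100 .. +100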

@HoniSanders
Author

Not sure what you mean by rescaling effects.

Your workaround throws out huge amounts of data. You still want to know how b[-maxlag-1] correlates with entries in a that are within maxlag entries from that.

matplotlib.pyplot.xcorr has a maxlags argument, but it is just a wrapper for numpy.correlate, so it doesn't get any performance benefits because it ends up calculating the entire cross-correlation and then throwing part away before returning to the user.
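(For context, a simplified sketch of what such a wrapper effectively does, assuming equal-length inputs: the full correlation is computed first and only then trimmed, so no work is saved.)

    import numpy as np

    def xcorr_like(x, y, maxlags):
        # Compute all 2*n - 1 lags, then discard everything outside +/- maxlags.
        n = len(x)
        c = np.correlate(x, y, mode='full')
        lags = np.arange(-maxlags, maxlags + 1)
        return lags, c[n - 1 - maxlags : n + maxlags]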

@seberg
Member

seberg commented Jun 9, 2015

Maybe that was a bit silly, since I was more concerned about getting something like xcorr right, as opposed to adding a keyword to the current correlate.
I think the idea is not bad; maybe the correct question to answer is:
We have three modes, 'valid', 'same', 'full'. What are the exact lag ranges for each of these? Once you know that, adding such a keyword to the C code should be simple. This might be trivial, but at first sight I am not sure it is.
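(For concreteness, the output sizes of the three existing modes, and therefore the lag ranges a maxlag keyword would interact with, can be checked directly; here N = 10 and M = 4.)

    import numpy as np

    a = np.arange(10.0)   # length N = 10
    b = np.ones(4)        # length M = 4

    # 'full'  -> N + M - 1 values (lags -(M-1) .. N-1)
    # 'same'  -> N values (a centred subset of 'full')
    # 'valid' -> N - M + 1 values (only lags with complete overlap)
    for mode in ('full', 'same', 'valid'):
        print(mode, np.correlate(a, b, mode=mode).shape)
    # full (13,)   same (10,)   valid (7,)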

@seberg
Member

seberg commented Jun 9, 2015

On the one hand, for maxlag you could say: well, if "valid" mode gives something smaller, that is fine. But then we want the lags to be -maxlag, ..., 0, ..., maxlag, which is unclear for different-sized arrays, possibly making the keyword argument useless in "valid" mode?

@HoniSanders
Author

Yes, I plan to do that. I'm not 100% sure how to deal with vectors of different sizes. Should I center the second vector with respect to the first or right-align them?


@HoniSanders
Author

I have now implemented this functionality in numpy.correlate() and numpy.convolve(). master...bringingheavendown:maxlag. The files that were edited are:
numpy/core/src/multiarray/multiarraymodule.c
numpy/core/numeric.py
numpy/core/tests/test_numeric.py
Please look over the code, my design decisions, and the unit tests I have written. This is my first time contributing, so I am not confident about any of these and welcome feedback.

@seberg
Member

seberg commented Jun 17, 2015

Hey, it's a bit difficult to discuss things like this (can't get a diff easily, etc.). It will be a bit of hoop jumping, but maybe you can see it as an exercise...

Could you try to:

  1. Create a new branch locally, based on your current master. Then the new branch will have all your changes in it as well.
  2. Reset your master branch to the current numpy master branch (these two steps are mostly to keep your things clean; they are not always necessary, but they are a good habit).
  3. Go to your new branch and git rebase master to put it on top of master (this may not be necessary, but if there is a conflict there, we probably can't merge the changes anyway).
  4. Push the new branch to GitHub. GitHub usually automatically asks you if you want to create a pull request at this point.
  5. Create a pull request. Even if we close it in the end, everyone can see all your changes comfortably, and many people are accustomed to (or have aliases for) fetching a PR's changes to test them locally as well.

Note that while it is often said that changing history is bad in git, the opposite is true for pull requests: you can always rewrite the branch history completely when people ask you to change things (we usually do this at the end to clean up the branch history a bit).

@seberg
Member

seberg commented Jun 17, 2015

Ah, you actually have the branch already, I guess. Then you can just open a PR as well, I think, but that gives an easy way to view the diff in any case:
master...bringingheavendown:maxlag

@HoniSanders
Author

Here are the specific issues that could use attention in this implementation:

  • How should the optional lagvec output array be handled?
  • Is it OK that this implementation breaks the previous undocumented behavior where mode=0, 1, or 2 returned the correlation for mode='valid', 'same', or 'full' respectively? If not, how should the maxlag input be passed?
  • Should the checks that are included in convolve() be added to correlate()?
  • Is it OK that I used INT_MAX to represent requests for a default argument in the C code?
  • Do I need to check for "default arguments" at all, given that the calling functions should now always call _pyarray_correlate with the appropriate number of arguments? I have a feeling there is some legacy behavior I should be maintaining, but I'm not 100% sure.
  • Is the error message in _pyarray_correlate, PyErr_SetString(PyExc_ValueError, "mode must be 0, 1, 2, or 3"), still valuable? What should it be replaced with?
  • Are there other error checks I should add to the code at this or other points?
  • Could you all clean up the unit tests, or add some new ones, to be absolutely sure it deals properly with weird combinations of input array sizes and lag inputs? I couldn't figure out how to run all the same unit tests with the order of the input arrays switched.

@Eridum

Eridum commented Oct 24, 2015

Hi bringingheavendown, the correlate function modification you made is exactly what I was looking for. I think the way you handle minlag and maxlag is perfect. Any hope of seeing it in official numpy?

@njsmith
Member

njsmith commented Oct 25, 2015

The way to move forward on this would be:

  • For questions about what the user-facing API should look like, send an email to the numpy-discussion list outlining what you have and what the options you see are. (At, say, the level of detail you'd put in the function documentation.)
  • For questions about internal implementation stuff, submit whatever you have as a pull request for people to look at; that's our standard procedure for doing code review. It doesn't have to be perfect to post a PR :-).


@HoniSanders
Author

There already is a pull request: #5978. It passed all tests on all machines except for one, but I couldn't get enough help debugging it. What can I do about that? What are the next steps you are envisioning?

There also already was discussion about the user-facing API on the numpy-discussion list back in June. At this point, I feel that the code is complete and just needs to be incorporated. If more discussion is desired, we can do that. In any case, I only dropped this because it didn't seem like it was going anywhere. Let me know what the process is at this point, and I would be happy to do it.

@njsmith
Member

njsmith commented Oct 26, 2015

Sorry, I missed the pull request; I was just looking at this bug. It sounds like things are moving forward in that PR now though.


@HoniSanders
Author

The pull request is just waiting for approval now. I can squash commits and give appropriate commit messages once I know everything is ready to go forward.

@rgommers
Member

Content in this issue overlaps with the actual PR (gh-5978), which has more up-to-date comments. So I'll close this issue to limit the number of places the same maxlag enhancement is discussed.

rgommers pushed a commit to rgommers/numpy that referenced this issue May 28, 2016
replaced int with npy_intp for lag parameters in _pyarray_correlate and related functions
rgommers pushed a commit to rgommers/numpy that referenced this issue May 28, 2016
renamed lagvector input to correlate() and convolve() to be named return_lags
rgommers pushed a commit to rgommers/numpy that referenced this issue May 28, 2016
_pyarray_correlate ignores minlag, maxlag, and lagstep for modes other than 3