
How to limit cross correlation window width in Numpy? #5954


Closed
HoniSanders opened this issue Jun 9, 2015 · 15 comments

@HoniSanders

I am learning numpy/scipy, coming from a MATLAB background. The xcorr function in Matlab has an optional argument "maxlag" that limits the lag range from -maxlag to maxlag. This is very useful if you are looking at the cross-correlation between two very long time series but are only interested in the correlation within a certain time range. The performance increases are enormous considering that cross-correlation is incredibly expensive to compute.

What is troubling me is that numpy.correlate does not have a maxlag feature. This means that even if I only want to see correlations between two time series with lags between -100 and +100 ms, for example, it will still calculate the correlation for every lag between -20000 and +20000 ms (which is the length of the time series). This (theoretically) gives a 200x performance hit! Is it possible that I could contribute this feature?

I have introduced this question as a scipy issue and on the scipy-dev list. It seems the best place to start is with numpy.correlate, so that is what I am requesting. I have done a simple implementation which gives 50x speedup under my conditions.
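(For reference, a minimal sketch of the kind of limited-lag computation being discussed; this is a hypothetical pure-NumPy helper assuming equal-length, real-valued inputs, not the actual patch, but it shows how lags outside the requested range never need to be computed.)

    import numpy as np

    def correlate_maxlag(a, b, maxlag):
        # Cross-correlation of two equal-length sequences, evaluated only
        # for lags in [-maxlag, maxlag].
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        n = len(a)
        lags = np.arange(-maxlag, maxlag + 1)
        out = np.empty(lags.size)
        for i, lag in enumerate(lags):
            if lag >= 0:
                out[i] = np.dot(a[lag:], b[:n - lag])    # a shifted left by lag
            else:
                out[i] = np.dot(a[:n + lag], b[-lag:])   # a shifted right by -lag
        return lags, out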

This is my first experience with contributing to open-source software, so any pointers are appreciated.

Other issues related to correlate functions include ENH: Fold fftconvolve into convolve/correlate functions as a parameter #2651, Use FFT in np.correlate/convolve? (Trac #1260) #1858, and normalized cross-correlation (Trac #1714) #2310.

@seberg
Member

seberg commented Jun 9, 2015

Out of curiosity, how does Matlab (or maybe R?) handle rescaling effects due to different values when the max lag is not much smaller than the amount of data?
I think I often use np.correlate(a, b[-max_lag:max_lag], mode='valid'), to always use the same amount of data (and as much of the same values as possible). Maybe I am just crazy here, but I am wondering.
Not sure if I opened an issue, but I think some function hidden in matplotlib had this kind of thing, and when maxlag got larger I was annoyed by rescaling effects, IIRC.
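(A small sketch of the 'valid'-mode trick described above, under the assumption of equal-length inputs; the slice here is written as b[max_lag:-max_lag] so that exactly 2*max_lag + 1 lags come out, each computed from the same number of samples.)

    import numpy as np

    def valid_mode_xcorr(a, b, max_lag):
        # Every output lag uses len(b) - 2*max_lag products, so the
        # amount of data is the same for all lags.
        assert len(a) == len(b) and len(b) > 2 * max_lag
        return np.correlate(a, b[max_lag:-max_lag], mode='valid')

    a = np.random.randn(20000)
    b = np.random.randn(20000)
    print(valid_mode_xcorr(a, b, 100).shape)   # (201,) -> lags -100 .. +100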

@HoniSanders
Author

Not sure what you mean by rescaling effects.

Your workaround throws out huge amounts of data. You still want to know how b[-maxlag-1] correlates with entries in a that are within maxlag entries from that.

matplotlib.pyplot.xcorr has a maxlags argument, but it is just a wrapper for numpy.correlate, so it doesn't get any performance benefits because it ends up calculating the entire cross-correlation and then throwing part away before returning to the user.
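(For context, a simplified sketch of what such a wrapper effectively does, assuming equal-length inputs: the full correlation is computed first and only then trimmed, so no work is saved.)

    import numpy as np

    def xcorr_like(x, y, maxlags):
        # Compute all 2*n - 1 lags, then discard everything outside +/- maxlags.
        n = len(x)
        c = np.correlate(x, y, mode='full')
        lags = np.arange(-maxlags, maxlags + 1)
        return lags, c[n - 1 - maxlags : n + maxlags]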

@seberg
Member

seberg commented Jun 9, 2015

Maybe that was a bit silly, since I was more concerned about getting something like xcorr right, as opposed to adding a keyword to the current correlate.
I think the idea is not bad; maybe the correct question to answer is:
We have three modes, 'valid', 'same', 'full'. What are the exact lag ranges for each of these? Once you know that, adding such a keyword to the C code should be simple. This might be trivial, but at first sight I am not sure it is.
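(For concreteness, the output sizes of the three existing modes, and therefore the lag ranges a maxlag keyword would interact with, can be checked directly; here N = 10 and M = 4.)

    import numpy as np

    a = np.arange(10.0)   # length N = 10
    b = np.ones(4)        # length M = 4

    # 'full'  -> N + M - 1 values (lags -(M-1) .. N-1)
    # 'same'  -> N values (a centred subset of 'full')
    # 'valid' -> N - M + 1 values (only lags with complete overlap)
    for mode in ('full', 'same', 'valid'):
        print(mode, np.correlate(a, b, mode=mode).shape)
    # full (13,)   same (10,)   valid (7,)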

@seberg
Member

seberg commented Jun 9, 2015

On the one hand, for maxlag you could say: well, if "valid" mode gives something smaller, that is fine. But then we want the lags to be -maxlag, ..., 0, ..., maxlag, which is unclear for different-sized arrays, possibly making the keyword argument useless in "valid" mode?

@HoniSanders
Author

Yes, I plan to do that. I'm not 100% sure how to deal with vectors of different sizes. Should I center the second vector with respect to the first or right-align them?


@HoniSanders
Author

I have now implemented this functionality in numpy.correlate() and numpy.convolve(). master...bringingheavendown:maxlag. The files that were edited are:
numpy/core/src/multiarray/multiarraymodule.c
numpy/core/numeric.py
numpy/core/tests/test_numeric.py
Please look over the code, my design decisions, and the unit tests I have written. This is my first time contributing, so I am not confident about any of these and welcome feedback.

@seberg
Member

seberg commented Jun 17, 2015

Hey, it's a bit difficult to discuss things like this (can't get a diff easily, etc.). It will be a bit of hoop jumping, but maybe you can see it as an exercise...

Could you try to:

  1. Create a new branch locally, based on your current master. Then the new branch will have all your changes in it as well.
  2. Reset your master branch to the current numpy master branch (these two steps are mostly to keep your things clean; they are not always necessary, but they are a good habit).
  3. Go to your new branch and git rebase master to put it on top of master (this may not be necessary, but if there is a conflict there, we probably can't merge the changes anyway).
  4. Push the new branch to GitHub. GitHub usually automatically asks you if you want to create a pull request at this point.
  5. Create a pull request. Even if we close it in the end, everyone can see all your changes comfortably, and many people are accustomed to (or have aliases for) fetching a PR's changes to test them locally as well.

Note that while it is often said that changing history is bad in git, the opposite is true for pull requests: you can always rewrite the branch history completely when people ask you to change things (we usually do this at the end to clean up the branch history a bit).

@seberg
Member

seberg commented Jun 17, 2015

Ah, you actually have the branch already, I guess. Then you can just open a PR as well, I think, but that gives an easy way to view the diff in any case:
master...bringingheavendown:maxlag

@HoniSanders
Author

Here are the specific issues that could use attention in this implementation:

  • How should the optional lagvec output array be handled?
  • Is it OK that this implementation breaks the previous undocumented behavior where mode=0, 1, or 2 returned the correlation for mode='valid', 'same', or 'full' respectively? If not, how should the maxlag input be passed?
  • Should the checks that are included in convolve() be added to correlate()?
  • Is it OK that I used INT_MAX to represent requests for a default argument in the C code?
  • Do I need to check for "default arguments" at all, given that the calling functions should now always call _pyarray_correlate with the appropriate number of arguments? I have a feeling there is some legacy behavior I should be maintaining, but I'm not 100% sure.
  • Is the error message in _pyarray_correlate, PyErr_SetString(PyExc_ValueError, "mode must be 0, 1, 2, or 3"), still valuable? What should it be replaced with?
  • Are there other error checks I should add to the code at this or other points?
  • Could you all clean up the unit tests, or add some new ones, to be absolutely sure it deals properly with weird combinations of input array sizes and lag inputs? I couldn't figure out how to run all the same unit tests with the order of the input arrays switched.

@Eridum

Eridum commented Oct 24, 2015

Hi bringingheavendown, the correlate function modification you made is exactly what I was looking for. I think the way you handle minlag and maxlag is perfect. Any hope of seeing it in official numpy?

@njsmith
Member

njsmith commented Oct 25, 2015

The way to move forward on this would be:

  • For questions about what the user-facing API should look like, send an email to the numpy-discussion list outlining what you have and what the options you see are. (At, say, the level of detail you'd put in the function documentation.)
  • For questions about internal implementation stuff, submit whatever you have as a pull request for people to look at; that's our standard procedure for doing code review. It doesn't have to be perfect to post a PR :-).


@HoniSanders
Author

There already is a pull request: #5978. It passed all tests on all machines except for one, but I couldn't get enough help debugging it. What can I do about that? What are the next steps you are envisioning?

There also already was discussion about the user-facing API on the numpy-discussion list back in June. At this point, I feel that the code is complete and just needs to be incorporated. If more discussion is desired, we can do that. In any case, I only dropped this because it didn't seem like it was going anywhere. Let me know what the process is at this point, and I would be happy to do it.

@njsmith
Member

njsmith commented Oct 26, 2015

Sorry, I missed the pull request; I was just looking at this bug. It sounds like things are moving forward in that PR now though.


@HoniSanders
Author

The pull request is just waiting for approval now. I can squash commits and give appropriate commit messages once I know everything is ready to go forward.

@rgommers
Member

Content in this issue overlaps with the actual PR (gh-5978), which has more up-to-date comments. So I'll close this issue to limit the number of places the same maxlag enhancement is discussed.

rgommers pushed a commit to rgommers/numpy that referenced this issue May 28, 2016
replaced int with npy_intp for lag parameters in _pyarray_correlate and related functions
rgommers pushed a commit to rgommers/numpy that referenced this issue May 28, 2016
renamed lagvector input to correlate() and convolve() to be named return_lags
rgommers pushed a commit to rgommers/numpy that referenced this issue May 28, 2016
_pyarray_correlate ignores minlag, maxlag, and lagstep for modes other than 3