
ENH: adding maxlag mode to convolve and correlate #5978


Closed
HoniSanders wants to merge 7 commits.

Conversation

HoniSanders

What was troubling me is that numpy.correlate does not have a maxlag feature. This means that even if I only want to see correlations between two time series with lags between -100 and +100 ms, for example, it will still calculate the correlation for every lag between -20000 and +20000 ms (which is the length of the time series). This (theoretically) gives a 200x performance hit! Is it possible that I could contribute this feature?

I have introduced this question as a numpy issue, a scipy issue and on the scipy-dev list. It seems the best place to start is with numpy.correlate, so that is what I am requesting.

This is my first experience with contributing to open-source software, so any pointers are appreciated.

Previous discussion of this functionality can be found in an earlier mailing-list discussion on numpy correlate (and convolution). Other issues related to correlate functions include ENH: Fold fftconvolve into convolve/correlate functions as a parameter #2651, Use FFT in np.correlate/convolve? (Trac #1260) #1858, and normalized cross-correlation (Trac #1714) #2310.

I have added extra options for the "mode" parameter, namely an int or an int tuple that determine for which lags the correlation/convolution should be calculated. These correspond to the values in the vector that the same tuple would generate if used as an argument to the numpy.arange() function.
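As a concrete sketch of the proposed call (this extended mode is what this PR adds; it does not work on released NumPy):

import numpy as np

a = np.random.rand(20000)  # e.g. 20 s of samples at 1 kHz
v = np.random.rand(20000)

# Proposed in this PR: mode as an int tuple (minlag, maxlag, lagstep),
# so that only the lags np.arange(-100, 101, 1) are computed instead of
# all 2*20000 - 1 lags of mode='full'.
c = np.correlate(a, v, mode=(-100, 101, 1))
assert len(c) == 201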

Here are the specific issues that could use attention in this implementation:

1. Is it OK that this implementation breaks the previous undocumented behavior of mode=0, 1, or 2 returning the correlation for mode='valid', 'same', or 'full' respectively? If not, how should maxlag input be passed?
2. Should the checks that are included in convolve() be added to correlate()?
3. Is it OK that I used NPY_MAX_INT to represent requests for a default argument in the C code?
4. Do I need to check for "default arguments" at all, given that the calling functions should now always call _pyarray_correlate with the appropriate number of arguments? I kind of feel there is some legacy behavior I should be maintaining, but I'm not 100% sure.
5. Is the error message in _pyarray_correlate, PyErr_SetString(PyExc_ValueError, "mode must be 0, 1, 2, or 3"), still valuable? What should it be replaced with?
6. Are there other error checks I should add to the code at this or other points?
7. Could you all clean up the unit tests/add some new ones to be absolutely sure that it deals properly with weird combinations of input array sizes and lag inputs? I couldn't figure out how to run all the same unit tests with the order of the input arrays switched.

@HoniSanders
Author

Looking at the log, the build failed with errors, none of which were from lines that I changed, so I'm not sure what's going on there.

@rgommers
Member

The build didn't fail; the tests failed under Python 3.3 with the following output:

.............*** Error in `python3': free(): invalid next size (fast): 0x0a5f40a0 ***
./tools/travis-test.sh: line 90: 16691 Aborted                 $PYTHON ../tools/test-installed-numpy.py
The command "./tools/travis-test.sh" exited with 134.

Not clear from that if it's from this PR.

@rgommers
Member

Squashing a few commits and/or more informative commit messages would be helpful. The first one is fine; the rest should be DOC, STY or MAINT with a description of what changes are in that particular commit.

@rgommers rgommers changed the title Maxlag ENH: adding lagvector keyword to convolve and correlate Jun 18, 2015
-        the signal boundary have no effect.
+        the signal boundary have no effect. This corresponds with a lag tuple
+        of (0, N-M+1, 1) for N>M or (-M+N, 1, 1) for M>N.
+    lagvector : bool, optional
Member

A clearer name would be returns_lags

@rgommers
Member

Is it ok that this implementation breaks the previous undocumented behavior of mode=0, 1, or 2 returning the correlation for mode='valid', 'same', or 'full' respectively? If not, how should maxlag input be passed?

Is it really necessary to break this? Why not for example add a maxlag=None keyword?
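For concreteness, that alternative would look something like the following (a hypothetical signature, shown only as comments since maxlag is not an existing np.correlate parameter):

# Hypothetical, for discussion only -- not an existing NumPy signature:
#
#     np.correlate(a, v, mode='full', maxlag=None)
#
# Here mode would keep its documented string values and its meaning
# (boundary handling), maxlag=None would preserve current behavior,
# and an integer maxlag would restrict which lags are computed.
# Undocumented integer mode values (0, 1, 2) would remain untouched.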

@rgommers rgommers changed the title ENH: adding lagvector keyword to convolve and correlate ENH: adding maxlag mode to convolve and correlate Jun 18, 2015
@HoniSanders
Author

OK. I renamed the lagvector input to returns_lags.
How do I edit the commit history?
I don't understand how the maxlag=None keyword would help. Under the functionality that I added, int values of mode represent the maxlag that should be used, generating a (minlag, maxlag, lagstep) tuple of (-mode+1, mode, 1). The potential problem is that users may have used mode=2 to mean mode='full' even though that was not a documented allowed input. I am not sure that retaining this behavior is valuable enough to justify disallowing int values of mode as maxlag, since an int maxlag is what users would expect from a maxlag argument. I would be happy to discuss this at greater length, as I wish to make my design decisions clear so that you can properly review them.
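Concretely, the mapping from an int mode to lags is (a small sketch; the lag values follow numpy.arange):

import numpy as np

mode = 100  # an int mode value, interpreted as maxlag under this PR
lag_tuple = (-mode + 1, mode, 1)   # the implied (minlag, maxlag, lagstep)
lags = np.arange(*lag_tuple)       # -99, -98, ..., 98, 99
assert len(lags) == 2 * mode - 1   # 199 lags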
Also, what is the appropriate thing to do about the failed tests on the single build version?

@homu
Contributor

homu commented Jun 21, 2015

☔ The latest upstream changes (presumably #5992) made this pull request unmergeable. Please resolve the merge conflicts.

@HoniSanders
Author

It's actually bac2fdf that made this pull request unmergeable.

@rgommers
Member

Editing commit messages can be done via an interactive rebase. For example, to edit the last two messages, do:

$ git rebase -i HEAD^^
$ # in the editor, change "pick" to "reword", then edit each message as it comes up

It's not necessary to go back and do this retroactively though.

If you need help with the rebase that is anyway needed, just let me know.

@rgommers
Member

Regarding the Python 3.4 failure, it really looks like a bug introduced by this PR, so it needs debugging and fixing.

@rgommers
Member

(continuing on this PR tonight)

@rgommers
Member

I don't understand how the maxlag=None keyword would help. ....

I now see what was suggested on the mailing list thread that you linked to above, and why you reused integer mode. Here is my confusion:

# Convention here for simplicity: n1 > n2
n1 = 1000
n2 = 101

valid_len = n1 - n2 + 1  # 900
same_len = n1  # 1000
full_len = n1 + n2 - 1  # 1100

# Computation always does a for-loop over at least (n1 - n2 + 1) = 900 points
# Now for ``maxlag=10``, what length output am I interested in? And for ``maxlag=950``?

My intuitive understanding of modes valid, same, and full is that I care about boundary effects (i.e., the accuracy of the computed results). For maxlag I'm interested in more efficient computation. I can see wanting maxlag results with or without boundary effects. This is why I suggested a separate maxlag=None keyword. Now, I could simply be confused about the implementation details, and those modes may effectively tweak the same iteration boundaries, but I'd like to see how you think about this conceptually.

@rgommers
Member

Style comment: it would be good to put most comments above the code instead of behind it. This is what our C style guide recommends (https://github.com/numpy/numpy/blob/master/doc/C_STYLE_GUIDE.rst.txt). The current code is hardly readable with my default editor width of just over 80 chars.

@HoniSanders
Author

@sturlamolden had suggested that maxlag be a replacement for mode because both mode and maxlag arguments deal with the length of the output. I will explain the current implementation in the context of the example you have laid out.

# Convention here for simplicity: n1 > n2
n1 = len(v1)  # 1000
n2 = len(v2)  # 101

valid_len = n1 - n2 + 1  # 900
# corresponds to maxlag tuple (0, n1-n2+1, 1)
same_len = n1  # 1000
# corresponds to maxlag tuple (-(n2//2), n1 - n2//2, 1)
full_len = n1 + n2 - 1  # 1100
# corresponds to maxlag tuple (-n2+1, n1, 1)

# Computation always does a for-loop over at least (n1 - n2 + 1) = 900 points
# Now for ``maxlag=10``, what length output am I interested in? 
# Answer: 19. You receive lags starting at -9 (v2[0] 9 to the left of v1[0]; i.e., v2[9] lined up with v1[0]).
# Lags continue through +9 (v2[0] lined up with v1[9]). 19 calculations are done.
# And for ``maxlag=950``?
# Answer: 1899. The first 849 will all be 0 because there is no overlap between v2 and v1 at those lags.
# The final value will be that for v2[0] lined up with v1[949], for which there is only partial overlap.

I am not confident that this is the best behavior. It could be that, as you seem to be suggesting, we should take the lags given by the mode argument and the lags given by the maxlag argument, and use the smaller of the two. This assumes that the user has some expectation of the quality of the data they are receiving (the mode argument) and a separate expectation of the maximum amount of data they are interested in (the maxlag argument). In cases of conflict, we should choose the smaller of the two: the user is never interested in more than maxlag data points, but on the other hand they are also not interested in maxlag data points if those points are not of the quality specified by the mode argument. Should I implement this?
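For discussion, here is a minimal pure-Python sketch of that rule (resolve_lags is a hypothetical helper; lag ranges are half-open (start, stop) pairs in the arange convention used above, assuming n1 >= n2):

import numpy as np

def resolve_lags(n1, n2, mode='full', maxlag=None):
    # Lag ranges implied by the classic modes, as half-open (start, stop).
    mode_lags = {
        'valid': (0, n1 - n2 + 1),
        'same':  (-(n2 // 2), n1 - n2 // 2),
        'full':  (-n2 + 1, n1),
    }[mode]
    if maxlag is None:
        return mode_lags
    # Take the intersection: never more lags than maxlag asks for,
    # and never more than the mode allows.
    start = max(mode_lags[0], -maxlag + 1)
    stop = min(mode_lags[1], maxlag)
    return (start, stop)

start, stop = resolve_lags(1000, 101, 'full', maxlag=10)
print(np.arange(start, stop))  # 19 lags: -9 ... 9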

Unrelated questions:
1. I have never done a rebase before, but I imagine I could with some basic explanation of what the goal of the rebase is.
2. How do I debug the Python 3.4 failure?

@sturlamolden
Contributor

The purpose of rebase is to make sure your PR can be automatically merged.

@HoniSanders
Author

Is rebase also involved in removing extraneous commits and changing the commit messages?

@sturlamolden
Contributor

Yes, you can use rebase to squash commits and edit commit messages.

@juliantaylor
Contributor

there is some problem:

test_lags (test_numeric.TestCorrelate) ... ==5859== Invalid read of size 8
==5859==    at 0x4861304: small_correlate (arraytypes.c.src:3892)
==5859==    by 0x493D2C6: _pyarray_correlate (multiarraymodule.c:1273)
==5859==    by 0x493D88D: PyArray_Correlate2 (multiarraymodule.c:1402)
==5859==    by 0x4942545: array_correlate2 (multiarraymodule.c:2978)
==5859==    by 0x80BC41F: PyCFunction_Call (methodobject.c:85)
==5859==    by 0x814D6F4: call_function (ceval.c:4356)
==5859==    by 0x8148BAF: PyEval_EvalFrameEx (ceval.c:2993)
==5859==    by 0x814B460: PyEval_EvalCodeEx (ceval.c:3588)
==5859==    by 0x814DBCE: fast_function (ceval.c:4452)
==5859==    by 0x814D87C: call_function (ceval.c:4377)
==5859==    by 0x8148BAF: PyEval_EvalFrameEx (ceval.c:2993)
==5859==    by 0x814B460: PyEval_EvalCodeEx (ceval.c:3588)
==5859==  Address 0x5b4f940 is 8 bytes before a block of size 40 alloc'd
==5859==    at 0x40291CC: malloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==5859==    by 0x482ABA2: PyDataMem_NEW (alloc.c:173)
==5859==    by 0x482A908: _npy_alloc_cache (alloc.c:42)
==5859==    by 0x482A9A9: npy_alloc_cache (alloc.c:69)
==5859==    by 0x488E0C2: PyArray_NewFromDescr_int (ctors.c:1041)
==5859==    by 0x488E56D: PyArray_NewFromDescr (ctors.c:1124)
==5859==    by 0x525F87C: execute_legacy_ufunc_loop (ufunc_object.c:1608)
==5859==    by 0x52627AD: PyUFunc_GenericFunction (ufunc_object.c:2623)
==5859==    by 0x526801B: ufunc_generic_call (ufunc_object.c:4249)
==5859==    by 0x80690DA: PyObject_Call (abstract.c:2529)
==5859==    by 0x80698F3: PyObject_CallFunctionObjArgs (abstract.c:2756)
==5859==    by 0x4967C2F: PyArray_GenericBinaryFunction (number.c:321)
==5859== 
==5859== Invalid write of size 8
==5859==    at 0x486139D: small_correlate (arraytypes.c.src:3895)
==5859==    by 0x493D2C6: _pyarray_correlate (multiarraymodule.c:1273)
==5859==    by 0x493D88D: PyArray_Correlate2 (multiarraymodule.c:1402)
==5859==    by 0x4942545: array_correlate2 (multiarraymodule.c:2978)
==5859==    by 0x80BC41F: PyCFunction_Call (methodobject.c:85)
==5859==    by 0x814D6F4: call_function (ceval.c:4356)
==5859==    by 0x8148BAF: PyEval_EvalFrameEx (ceval.c:2993)
==5859==    by 0x814B460: PyEval_EvalCodeEx (ceval.c:3588)
==5859==    by 0x814DBCE: fast_function (ceval.c:4452)
==5859==    by 0x814D87C: call_function (ceval.c:4377)
==5859==    by 0x8148BAF: PyEval_EvalFrameEx (ceval.c:2993)
==5859==    by 0x814B460: PyEval_EvalCodeEx (ceval.c:3588)
==5859==  Address 0x5c96930 is 0 bytes after a block of size 8 alloc'd
==5859==    at 0x40291CC: malloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==5859==    by 0x482ABA2: PyDataMem_NEW (alloc.c:173)
==5859==    by 0x482A908: _npy_alloc_cache (alloc.c:42)
==5859==    by 0x482A9A9: npy_alloc_cache (alloc.c:69)
==5859==    by 0x488E0C2: PyArray_NewFromDescr_int (ctors.c:1041)
==5859==    by 0x488E56D: PyArray_NewFromDescr (ctors.c:1124)
==5859==    by 0x488E9B2: PyArray_New (ctors.c:1246)
==5859==    by 0x493A678: new_array_for_sum (multiarraymodule.c:800)
==5859==    by 0x493D09B: _pyarray_correlate (multiarraymodule.c:1235)
==5859==    by 0x493D88D: PyArray_Correlate2 (multiarraymodule.c:1402)
==5859==    by 0x4942545: array_correlate2 (multiarraymodule.c:2978)
==5859==    by 0x80BC41F: PyCFunction_Call (methodobject.c:85)
==5859== 

I currently haven't got my suppression files available, so I can't narrow it down further with --db-attach

@HoniSanders
Author

OK, thanks a lot for that info. Was there a change in small_correlate between 2.7 and 3.4? Or does the version number not help us very much?

@juliantaylor
Contributor

It has nothing to do with Python 2 or 3. The difference on Travis is that the failing test runs on a 32-bit platform. But the issue also appears on 64-bit; it just happens not to crash there by coincidence. valgrind complains on 64-bit as well.

@HoniSanders
Author

Hmmm, OK, I'll look into it. That is very helpful. Where do you see the warnings?

@homu
Contributor

homu commented Jul 25, 2015

☔ The latest upstream changes (presumably #6047) made this pull request unmergeable. Please resolve the merge conflicts.

@pv
Member

pv commented Oct 26, 2015

What the errors most likely mean is that there's an out-of-bounds access in some of the C code, as also indicated by valgrind above. If so, these can usually be found by reading through the code and double-checking it with this in mind.

You can also install valgrind on your own computer to reproduce what Julian did above, so that it is easier to do it yourself.

@HoniSanders
Author

Yes, I think that is almost certainly due to my use of NPY_MAX_INTP as an invalid value passed to _pyarray_correlate() in order to retain backwards compatibility of the C code with hypothetical Python code that calls the C code directly without going through the correlate() function. I am not sure what the appropriate way to deal with that is.

Here is an explanation of the context: correlate() in numeric.py calls array_correlate2() in multiarraymodule.c. It used to pass three arguments: the two vectors and an integer value (0, 1, or 2) to let the C code know which "mode" (valid, same, full) to use. This mode argument was optional; array_correlate2() had a default value for mode that it would use and then pass on to deeper C functions, most importantly _pyarray_correlate(). In my edit, mode is still passed as a single int value, but it is ignored unless the int tuple (minlag, maxlag, lagstep) is passed incorrectly, which it never is from my Python code. In the case that the "optional" arguments of mode and the lag tuple are not passed from Python to C (again, my code does not do this), it is not appropriate for array_correlate2() to generate the lag tuple itself; rather, it passes an invalid value to _pyarray_correlate(), which then figures out the appropriate lag tuple given the variable number of values it was passed. I chose NPY_MAX_INTP as the invalid value signalling that a recalculation must be performed; I couldn't think of anything else to use, since 0 and negative integers are acceptable values for the lag tuple.

I would prefer just making the lag tuple a mandatory argument to the C code, thereby sidestepping the whole issue of default arguments. The Python code I have written handles default arguments itself, and there is no other Python code in the numpy codebase that uses this C code, so it seems safe; but as a newcomer, I am not 100% sure whether this is OK for compatibility.
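For illustration, the Python-side normalization I have in mind would look roughly like this (a hypothetical sketch of what correlate() in numeric.py could do, so that the C layer never needs a sentinel):

def _normalize_lags(mode, n1, n2):
    # Hypothetical helper: turn every accepted ``mode`` value into an
    # explicit (minlag, maxlag, lagstep) tuple before calling into C.
    if isinstance(mode, str):
        return {'valid': (0, n1 - n2 + 1, 1),
                'same':  (-(n2 // 2), n1 - n2 // 2, 1),
                'full':  (-n2 + 1, n1, 1)}[mode]
    if isinstance(mode, int):
        # The PR's shorthand: an int mode means maxlag.
        return (-mode + 1, mode, 1)
    return tuple(mode)  # already an explicit lag tuple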


@rgommers
Member

I'd like to see some performance benchmarks: does your refactoring affect the speed of the other, more typical use cases? Most likely not, but it has to be tested.

I've added some benchmarks in gh-7689. Looks like this PR is several times slower for small arrays.
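For anyone wanting to reproduce this locally, a rough spot-check of the small-array overhead might look like this (a quick timeit sketch, not the asv benchmarks from gh-7689):

import timeit

setup = "import numpy as np; a = np.random.rand(32); v = np.random.rand(32)"
stmt = "np.correlate(a, v, mode='full')"
# Run once on master and once on this branch; small arrays are dominated
# by per-call overhead, so a regression shows up clearly here.
print(min(timeit.repeat(stmt, setup, number=100000, repeat=3)))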

@JLansey

JLansey commented Jun 9, 2016

Can we then make this a different version of np.correlate that includes the max_lag capability? The speed improvement on larger arrays when using a small lag is one reason I found this page. Please don't give up on this! Thanks to everyone working on this.

@homu
Contributor

homu commented Aug 5, 2016

☔ The latest upstream changes (presumably #7177) made this pull request unmergeable. Please resolve the merge conflicts.

@@ -1120,7 +1212,6 @@ def restoredot():
alterdot : `restoredot` undoes the effects of `alterdot`.

"""
# 2014-08-13, 1.10
Member

Why were these lines removed?

Author

I think they were removed in another commit on the master branch that was merged into this one.

@sturlamolden
Contributor

It is time to ping this PR since it has been dormant for a year.

@LeonardAB

Any update on this issue?

@HoniSanders
Author

HoniSanders commented Dec 19, 2017 via email

@LeonardAB

Thank you so much for developing this feature. While waiting for the PR to land, is it possible to use this maxlag feature as a standalone function in my code? What should I do? Which code should I use?

@tritemio

tritemio commented Dec 20, 2017

@LeonardAB, you can use the function ucorrelate from the pycorrelate package which has the maxlag keyword. See: https://stackoverflow.com/a/47893831/2304916
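(For reference, a minimal pure-NumPy workaround in the meantime: correlate_maxlag below is a hypothetical helper, not a NumPy or pycorrelate API; it assumes equal-length 1-D real inputs and the lag convention discussed above.)

import numpy as np

def correlate_maxlag(a, v, maxlag):
    # Cross-correlation restricted to lags -maxlag..+maxlag; only the
    # requested lags are computed, so the speed benefit carries over.
    n = len(a)
    assert len(v) == n
    out = np.empty(2 * maxlag + 1)
    for i, lag in enumerate(range(-maxlag, maxlag + 1)):
        if lag >= 0:
            # v[0] lined up with a[lag]
            out[i] = np.dot(a[lag:], v[:n - lag])
        else:
            # v[-lag] lined up with a[0]
            out[i] = np.dot(a[:n + lag], v[-lag:])
    return out

# For equal-length real inputs this matches the central slice of
# np.correlate(a, v, 'full'): indices (n-1-maxlag) through (n-1+maxlag).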

@LeonardAB

@tritemio thanks! I just tried to use matplotlib.xcorr instead, with some adjustment. It works perfectly for my purpose. https://stackoverflow.com/a/47897581/5122657

@orena1
Contributor

orena1 commented Sep 4, 2020

#5978 (comment)
3 years, this is major! It's a basic function in MATLAB and should be implemented here.

@sturlamolden
Contributor

#5978 (comment)
3 years, this is major! It's a basic function in MATLAB and should be implemented here.

Except we are not trying to copy Matlab.

@orena1
Contributor

orena1 commented Sep 5, 2020

#5978 (comment)
3 years, this is major! It's a basic function in MATLAB and should be implemented here.

Except we are not trying to copy Matlab.

It is also a basic function in R. I am not saying this to argue that we need to copy Matlab or R; I am saying it because we need to support basic functionality. Now, how do you know that a specific piece of functionality is basic? If other languages have had it for a long time, that is a good sign that it is basic :)

@seberg seberg added the triage review Issue/PR to be discussed at the next triage meeting label Sep 5, 2020
@seberg
Member

seberg commented Sep 5, 2020

I tend to agree this is useful and common enough. I am not sure why it stalled a few years back. But there seem to be a few things missing in the PR, probably including that the API is not quite hashed out (although the API is perhaps easier to modify once the core functionality is there).

SciPy also has a slightly more advanced correlation in scipy.signal.correlate (and an n-dimensional one in scipy.ndimage), but at first sight they do not seem to implement a maxlag-like feature either.

My feeling is that the addition itself is fine, but there is quite a bit of work and some thought left to be done here. I realize this is a stretch, but often the easiest way to push something like this forward is for someone new to the project and comfortable with the complexity (here, C-level code) to pick it up, so that the core devs can focus on review.

PS: We also don't support the FFT trick in NumPy, which is maybe something we prefer to leave outsourced to scipy. So that would be one reason to actually keep it in scipy; on the other hand, this feature is likely useful in matplotlib, which does not depend on scipy.

@seberg
Member

seberg commented Sep 10, 2020

I am closing this in favor of the issue gh-17286, since the PR is pretty outdated by now (although there may be a lot of useful things here if anyone picks it up).

We had discussed this briefly, and since SciPy has a more fully featured correlation/convolution, it seems that any API decision/addition should be made together with SciPy. Probably, starting this project in SciPy and later also adding it to NumPy is the easier way to approach it.

@seberg seberg closed this Sep 10, 2020