ENH: adding maxlag mode to convolve and correlate #5978
Conversation
Looking at the log, the build failed with errors, none of which were from lines that I changed, so I'm not sure what's going on there.
The build didn't fail; the tests failed under Python 3.3 with this output:
Not clear from that whether it's from this PR.
Squashing a few commits and/or more informative commit messages would be helpful. The first one is fine; the rest should be
numpy/core/numeric.py
Outdated
the signal boundary have no effect.
the signal boundary have no effect. This corresponds with a lag tuple
of (0, N-M+1, 1) for N>M or (-M+N, 1, 1) for M>N.
lagvector : bool, optional
A clearer name would be returns_lags
Is it really necessary to break this? Why not for example add a
OK. I renamed the lagvector input to returns_lags.
☔ The latest upstream changes (presumably #5992) made this pull request unmergeable. Please resolve the merge conflicts.
It's actually bac2fdf that made this pull request unmergeable.
Editing commit messages can be done via an interactive rebase. For example, to edit the last two messages, do:
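As a minimal sketch of that workflow (run in a throwaway repository; `GIT_SEQUENCE_EDITOR` scripts the rebase todo list so the example is non-interactive, whereas normally you would edit it by hand in the editor that `git rebase -i` opens):

```shell
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q
git config user.email dev@example.com   # throwaway identity for the demo
git config user.name dev
git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"
git commit -q --allow-empty -m "third"
# Interactive rebase over the last two commits; fold "third" into "second"
# by changing its todo line from "pick" to "fixup".
GIT_SEQUENCE_EDITOR='sed -i "2s/^pick/fixup/"' git rebase -i HEAD~2
git log --oneline
```

The same todo list accepts `reword` to change a commit message and `squash` to combine commits while keeping both messages.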
It's not necessary to go back and do this retroactively though. If you need help with the rebase that is anyway needed, just let me know.
Regarding the Python 3.4 failure, it really looks like a bug introduced by this PR, so it needs debugging and fixing.
(continuing on this PR tonight)
I now see what was suggested on the mailing list thread that you linked to above, and why you reused integer mode. Here is my confusion:
My intuitive understanding about modes
Style comment: it would be good to put most comments above the code instead of behind it. This is what our C style guide says (https://github.com/numpy/numpy/blob/master/doc/C_STYLE_GUIDE.rst.txt)
@sturlamolden had suggested that maxlag be a replacement for mode because both mode and maxlag arguments deal with the length of the output. I will explain the current implementation in the context of the example you have laid out.
I am not confident that this is the best behavior. It could be, as you seem to be suggesting, that we should take the lags given by the mode argument and the lags given by the maxlag argument and use the smaller of the two. This assumes that the user has some expectation about the quality of the data they are receiving (the mode argument) and also a separate expectation about the maximum amount of data they are interested in (the maxlag argument). In cases of conflict, we should choose the smaller of the two: surely they are never interested in more than maxlag data points, but on the other hand they are also not interested in maxlag data points if those data points are not of the quality specified by the mode argument. Should I implement this?

Unrelated questions:
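If it helps the discussion, here is a minimal sketch of that "min of the two" behavior (the function and its name are hypothetical, not part of this PR): the mode determines a lag range, and maxlag then clips it.

```python
def lag_range(n, m, mode, maxlag=None):
    """Lag range (inclusive lo, exclusive hi) implied by `mode` for
    sequences of length n >= m, optionally clipped by `maxlag`."""
    if mode == 'valid':
        lo, hi = 0, n - m + 1    # matches the (0, N-M+1, 1) tuple above
    elif mode == 'full':
        lo, hi = -(m - 1), n     # every overlap of the two sequences
    else:
        raise ValueError("sketch only covers 'valid' and 'full'")
    if maxlag is not None:
        # intersect with the maxlag window: never more than maxlag lags
        lo, hi = max(lo, -maxlag), min(hi, maxlag + 1)
    return lo, hi
```

For example, `lag_range(10, 3, 'full', maxlag=1)` clips the full range (-2, 10) down to (-1, 2).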
The purpose of rebase is to make sure your PR can be automatically merged. |
Is rebase also involved in removing extraneous commits and changing the commit messages? |
Yes you can use rebase to squash commits and edit commit messages. |
There is some problem:
I currently haven't got my suppression files available, so I can't narrow it down further with --db-attach
OK, thanks a lot for that info. Was there a change in small_correlate between 2.7 and 3.4? Or does the version number not help us very much?
It has nothing to do with Python 2 or 3. The difference in Travis is that the failing test runs on a 32-bit platform. But the issue also appears on 64-bit; it just doesn't crash there by coincidence. valgrind complains on 64-bit as well.
Hmmm, OK, I'll look into it; that is very helpful. Where do you see the warnings?
☔ The latest upstream changes (presumably #6047) made this pull request unmergeable. Please resolve the merge conflicts.
What the errors most likely mean is that there's an out-of-bounds access in some of the C code. You can also install valgrind on your own computer to reproduce what Julian did above, so that it is easier to do it yourself.
Yes, I think that is almost certainly due to my use of NPY_MAX_INTP as an invalid value passed to _pyarray_correlate() in order to retain backwards compatibility of the C code with hypothetical Python code that calls the C code directly without going through the correlate() function. I am not sure what the appropriate way to deal with that is. Here is an explanation of the context:

correlate() in numeric.py calls array_correlate2() in multiarraymodule.c. It used to pass three arguments: the two vectors and an integer value (0, 1, or 2) to let the C code know which "mode" (valid, same, full) to use. This mode argument was optional; array_correlate2() had a default value for mode that it would use and then pass on to deeper C functions, most importantly _pyarray_correlate().

In my edit, mode is still passed as a single int value, but it is ignored unless the int tuple (minlag, maxlag, lagstep) is passed incorrectly, which it is not in my Python code. In the case that the "optional" arguments of mode and the lag tuple are not passed from Python to C (again, my code does not do this), it is not appropriate for array_correlate2() to generate the lag tuple itself; instead it passes an invalid value to _pyarray_correlate(), which then figures out the appropriate lag tuple given the variable number of values it was passed. I chose NPY_MAX_INTP as the invalid value signalling that a recalculation must be performed; I couldn't think of anything else to use, since 0 and negative integers are all acceptable values for the lag tuple.

I would prefer just making the lag tuple a mandatory argument to the C code, thereby sidestepping the whole issue of default arguments. The Python code I have written handles default arguments itself, and there is no other Python code in the numpy codebase that uses this C code, so it seems safe; but as a newcomer, I am not 100% sure whether this is OK for compatibility.
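For reference, the sentinel pattern described above can be sketched like this (Python standing in for the C code; the function and names here are illustrative, not the PR's actual code):

```python
# In C there is no None, so an out-of-range integer marks "argument not
# given". 0 and negative values are all legal lags, so only an extreme
# value can serve as the sentinel.
LAG_UNSET = 2**63 - 1  # stand-in for NPY_MAX_INTP, assuming 64-bit intp

def resolve_lags(n, m, minlag=LAG_UNSET, maxlag=LAG_UNSET, lagstep=LAG_UNSET):
    """Mimic a C callee that computes the default ('full') lag tuple
    itself when the caller passed the sentinel instead of real lags."""
    if minlag == LAG_UNSET:
        # default to 'full' mode: lags -(M-1) .. N-1, step 1
        minlag, maxlag, lagstep = -(m - 1), n, 1
    return minlag, maxlag, lagstep
```

Making the lag tuple mandatory, as proposed above, would delete the sentinel branch entirely.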
I've added some benchmarks in gh-7689. It looks like this PR is several times slower for small arrays.
Can we then make this a different version of np.correlate that includes the max_lag capability? The speed improvement on larger arrays when using a small lag is one reason I found this page. Please don't give up on this! Thanks to everyone working on this.
☔ The latest upstream changes (presumably #7177) made this pull request unmergeable. Please resolve the merge conflicts.
numpy/core/numeric.py
Outdated
@@ -1120,7 +1212,6 @@ def restoredot():
alterdot : `restoredot` undoes the effects of `alterdot`.

"""
# 2014-08-13, 1.10
Why were these lines removed?
I think they were removed in another commit from the master branch that was merged into this one.
It is time to ping this PR since it has been dormant for a year.
Any update on this issue?
Unfortunately, my PR as it is written is not fully functional. I will work on it when I have time.
Thank you so much for developing this feature. While waiting for the PR to work, is it possible to use this maxlag feature as a standalone function in my code? What should I do? Which code should I use?
@LeonardAB, you can use the function ucorrelate from the pycorrelate package, which has the maxlag keyword. See: https://stackoverflow.com/a/47893831/2304916
@tritemio thanks! I just tried to use matplotlib.xcorr instead, with some adjustments. It works perfectly for my purpose. https://stackoverflow.com/a/47897581/5122657
#5978 (comment)
Except we are not trying to copy Matlab.
It is also a basic function in R. I am not saying this to argue that we need to copy Matlab or R; I am saying it because we need to support basic functionality. Now, how do you know that a specific functionality is basic? If other languages have had it for a long time, that is a good sign that it is basic :)
I tend to agree that this is useful and common enough. I am not sure why it stalled a few years back, but there seem to be a few things missing in the PR, probably including the API not being quite hashed out (although the API is maybe easier to modify once the core functionality is there). SciPy also has a slightly more advanced correlation.

My feeling is that the addition itself is fine, but there is quite a bit of work and some thought left to be done here. I realize this is far-fetched, but often the easiest way to push something like this forward is for someone new to the project and comfortable with the complexity (here, C-level code) to pick it up, so that the core devs can focus on review.

PS: We also don't support the FFT trick in NumPy, which is maybe something we prefer to leave outsourced to scipy. That would be one reason to actually keep it in scipy, but this feature is likely useful in matplotlib, which does not depend on scipy.
I am closing this in favor of the issue gh-17286, since the PR is pretty outdated by now (although there may be a lot of useful things here if anyone picks it up). We had discussed this briefly, and since SciPy has more full-featured correlation/convolution, it seems that any API decision/addition should be made together with SciPy. Probably, starting this project in SciPy and later adding it to NumPy as well is the easier way to approach it.
What was troubling me is that numpy.correlate does not have a maxlag feature. This means that even if I only want to see correlations between two time series with lags between -100 and +100 ms, for example, it will still calculate the correlation for every lag between -20000 and +20000 ms (which is the length of the time series). This (theoretically) gives a 200x performance hit! Is it possible that I could contribute this feature?
I have introduced this question as a numpy issue, a scipy issue and on the scipy-dev list. It seems the best place to start is with numpy.correlate, so that is what I am requesting.
This is my first experience with contributing to open-source software, so any pointers are appreciated.
Previous discussion of this functionality can be found at another discussion on numpy correlate (and convolution). Other issues related to correlate functions include ENH: Fold fftconvolve into convolve/correlate functions as a parameter #2651, Use FFT in np.correlate/convolve? (Trac #1260) #1858, and normalized cross-correlation (Trac #1714) #2310.
I have added extra options for the "mode" parameter, namely an int or an int tuple that determines for which lags the correlation/convolution should be calculated. These correspond to the values in the vector that the same tuple would generate if used as arguments to the numpy.arange() function.
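As an illustration of the proposed semantics (a pure-Python sketch with a made-up function name, not the PR's C implementation), this correlates only at the lags that np.arange would generate from the tuple, using numpy's convention that lag k sums a[n+k]*v[n]:

```python
import numpy as np

def correlate_at_lags(a, v, minlag, maxlag, lagstep=1):
    """Cross-correlate real-valued a and v only at the lags
    np.arange(minlag, maxlag, lagstep)."""
    a = np.asarray(a)
    v = np.asarray(v)
    lags = np.arange(minlag, maxlag, lagstep)
    out = np.zeros(len(lags), dtype=np.result_type(a, v))
    for i, k in enumerate(lags):
        for n in range(len(v)):
            if 0 <= n + k < len(a):     # only overlapping samples contribute
                out[i] += a[n + k] * v[n]
    return lags, out
```

With minlag=-(len(v)-1) and maxlag=len(a) this reproduces mode='full'; a narrower window simply skips the unwanted lags, which is where the speedup for long inputs with a small maxlag would come from.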
Here are the specific issues that could use attention in this implementation:
Is it ok that this implementation breaks the previous undocumented behavior of mode=0, 1, or 2 returning the correlation for mode='valid', 'same', or 'full' respectively? If not, how should maxlag input be passed?
Should the checks that are included in convolve() be added to correlate()?
Is it ok that I used NPY_MAX_INT to represent requests for a default argument in the C code?
Do I need to check for "default arguments" at all, given that the calling functions should now always call _pyarray_correlate with the appropriate number of arguments? I kind of feel there is some legacy behavior I should be maintaining, but I'm not 100% sure.
Is the error message in _pyarray_correlate "PyErr_SetString(PyExc_ValueError, "mode must be 0, 1, 2, or 3")" still valuable? What should it be replaced with?
Are there other error checks I should add to the code at this or other points?
Could you all clean up the unit tests/add some new ones to be absolutely sure that it deals properly with weird combinations of input array sizes and lag inputs? I couldn't figure out how to run all the same unit tests with the order of input arrays switched.