Discussion:
[Numpy-discussion] How to limit cross correlation window width in Numpy?
Honi Sanders
2015-06-09 01:54:18 UTC
Permalink
I am learning numpy/scipy, coming from a MATLAB background. The xcorr function in MATLAB has an optional argument "maxlag" that limits the lag range from -maxlag to maxlag. This is very useful if you are looking at the cross-correlation between two very long time series but are only interested in the correlation within a certain time range. The performance gains are enormous, considering that cross-correlation is incredibly expensive to compute.

What is troubling me is that numpy.correlate does not have a maxlag feature. This means that even if I only want to see correlations between two time series with lags between -100 and +100 ms, for example, it will still calculate the correlation for every lag between -20000 and +20000 ms (which is the length of the time series). This (theoretically) gives a 200x performance hit! Is it possible that I could contribute this feature?
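For concreteness, here is a rough sketch (not my actual implementation; the name and signature are purely illustrative) of what a lag-limited correlation could look like in pure NumPy, assuming equal-length inputs:

```python
import numpy as np

def correlate_maxlag(a, b, maxlag):
    # Illustrative sketch: compute the cross-correlation only for lags
    # in [-maxlag, maxlag], instead of all len(a) + len(b) - 1 lags
    # that numpy.correlate(a, b, mode='full') would return.
    # Assumes len(a) == len(b).
    n = len(a)
    lags = np.arange(-maxlag, maxlag + 1)
    out = np.empty(len(lags))
    for i, lag in enumerate(lags):
        if lag >= 0:
            # c[lag] = sum over t of a[t + lag] * b[t]
            out[i] = np.dot(a[lag:], b[:n - lag])
        else:
            out[i] = np.dot(a[:n + lag], b[-lag:])
    return lags, out
```

The loop runs over only 2*maxlag + 1 lags, which is where the speedup comes from when maxlag is much smaller than the series length.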

I have raised this question as a scipy issue (https://github.com/scipy/scipy/issues/4940) and on the scipy-dev list (http://mail.scipy.org/pipermail/scipy-dev/2015-June/020757.html). It seems the best place to start is with numpy.correlate, so that is what I am requesting. I have done a simple implementation (https://gist.github.com/bringingheavendown/b4ce18aa007118e4e084) which gives a 50x speedup under my conditions (https://github.com/scipy/scipy/issues/4940#issuecomment-110187847).

This is my first experience with contributing to open-source software, so any pointers are appreciated.
Honi Sanders
2015-06-17 02:38:40 UTC
Permalink
I have now implemented this functionality in numpy.correlate() and numpy.convolve(). https://github.com/bringingheavendown/numpy. The files that were edited are:
numpy/core/src/multiarray/multiarraymodule.c
numpy/core/numeric.py
numpy/core/tests/test_numeric.py
Please look over the code, my design decisions, and the unit tests I have written. This is my first time contributing, so I am not confident about any of these and welcome feedback.
Sturla Molden
2015-06-17 22:13:25 UTC
Permalink
I'll just repeat here what I already said on Github.

I think this stems from the need to compute cross-correlograms as used
in statistical signal analysis, whereas numpy.correlate and
scipy.signal.correlate are better suited for matched filtering.

I think the best solution would be to add a function called
scipy.signal.correlogram, which would return a cross-correlation and an
array of time lags. It could take minlag and maxlag as optional arguments.

Adding maxlag and minlag arguments to numpy.convolve makes very little
sense, as far as I am concerned.

Sturla
Honi Sanders
2015-06-17 22:22:33 UTC
Permalink
I will also repeat what I said in response on Github (discussions at: https://github.com/scipy/scipy/issues/4940, https://github.com/numpy/numpy/issues/5954):
I do want a function that computes cross-correlograms; however, the implementation is exactly the same for cross-correlograms as for convolution. Not only that, is the numpy.correlate() function not for computing cross-correlograms?

Maxlag and lagstep still make sense in the context of convolution. Say you have a time series (this is not the best example) of rain amounts, and you have a kernel for plant growth given rain in the recent past. Your time series is the entire year, but you are only interested in the plant growth during the months of April through August. Not only that, you do not need a daily readout of plant growth; weekly resolution is enough for your needs. You wouldn't want to compute the convolution for the entire time series; instead you would do numpy.convolve(rain, growth_kernel, (april, september, 7), lagvec) and get lagvec back with the indices of the Sundays in April through August, and a return vector with the amount of plant growth on those days.
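To make the interface I have in mind concrete, here is a rough sketch (the name, signature, and return convention are purely illustrative, not an actual NumPy API) of a convolution restricted to a (minlag, maxlag, lagstep) range:

```python
import numpy as np

def convolve_lags(x, kernel, minlag, maxlag, lagstep=1):
    # Hypothetical sketch of the (minlag, maxlag, lagstep) idea:
    # evaluate the 'full'-mode convolution only at the requested lags,
    # returning (lags, values) instead of the whole length n+m-1 result.
    n, m = len(x), len(kernel)
    k = kernel[::-1]  # convolution == correlation with the reversed kernel
    lags = np.arange(minlag, maxlag, lagstep)
    out = np.empty(len(lags))
    for i, lag in enumerate(lags):
        # full-convolution index `lag` uses the overlap of x with the
        # reversed kernel placed so that it ends at position `lag`
        lo = max(0, lag - m + 1)
        hi = min(n, lag + 1)
        out[i] = np.dot(x[lo:hi], k[lo - lag + m - 1 : hi - lag + m - 1])
    return lags, out
```

With lagstep=7 and a (minlag, maxlag) window covering April through August, this would touch only the lags that are actually needed.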

I don’t really think it would be good to add an entirely new function to scipy.signal. It was already hard enough, as a new user, to figure out which of the five seemingly identical functions in numpy, scipy, and matplotlib I should be using. Besides, if all of these functions are essentially doing the same computation, there should be a single base implementation that they all use, so that 1) the learning curve is reduced and 2) any optimizations are passed on to all of the functions instead of having to be reimplemented independently several times. So even if we do decide that scipy.signal should have a new correlogram command, it should be a wrapper for numpy.correlate. But why wouldn't one just use scipy.signal.correlate for the 1d case as well?

Also, see https://github.com/numpy/numpy/pull/5978 for the pull request with a list of specific issues in my implementation that may need attention.

Honi
_______________________________________________
NumPy-Discussion mailing list
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Mansour Moufid
2015-06-18 04:16:04 UTC
Permalink
Hello,

There is a simple solution.

The cross-correlation of two arrays of lengths m and n is of length
m + n - 1, where m is usually much larger than n.

If you need to compute the cross-correlation with a bound on the lag
of k, then truncate the longer array to length k - n + 1.

That is,

    import numpy

    def _correlate(x, y, maxlag):
        n = y.shape[0]
        return numpy.correlate(x[:maxlag - n + 1], y)

As for the lag array, it is defined as

-n + 1, ..., 0, ..., m - 1

so truncate it too,

-n + 1, ..., 0, ..., maxlag - n

By the way, you should truncate to a power of two.
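The power-of-two remark applies mainly to FFT-based implementations. For reference, here is a minimal sketch of FFT-based full cross-correlation with zero-padding to a power of two (roughly what scipy.signal.fftconvolve does internally; assumes real inputs with len(b) > 1):

```python
import numpy as np

def xcorr_fft(a, b):
    # Minimal sketch of FFT-based full cross-correlation, zero-padded
    # to the next power of two. Assumes real inputs and len(b) > 1.
    n = len(a) + len(b) - 1
    nfft = 1 << (n - 1).bit_length()  # next power of two >= n
    A = np.fft.rfft(a, nfft)
    B = np.fft.rfft(b, nfft)
    c = np.fft.irfft(A * np.conj(B), nfft)
    # reorder so lags run from -(len(b) - 1) to len(a) - 1
    return np.concatenate((c[-(len(b) - 1):], c[:len(a)]))
```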

Yours,
Mansour
Sturla Molden
2015-06-18 11:49:38 UTC
Permalink
Post by Mansour Moufid
The cross-correlation of two arrays of lengths m and n is of length
m + n - 1, where m is usually much larger than n.
He is thinking about the situation where m == n and m is much larger than
maxlag.

Truncating the input arrays would also throw away data.

This is about correlating two long signals, not about correlating a signal
with a much shorter template.

Sturla
Mansour Moufid
2015-06-21 17:53:02 UTC
Permalink
I just realized that NumPy uses the time domain algorithm for correlation.
So it would be much easier to modify the correlation functions in SciPy
than in NumPy.
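A sketch of what that could look like on the SciPy side (purely illustrative; it slices the full FFT result, so it limits the size of the output rather than the computation itself):

```python
import numpy as np
from scipy.signal import fftconvolve

def correlate_maxlag_fft(a, b, maxlag):
    # Illustrative: obtain a lag-limited correlogram through SciPy's
    # FFT path by computing the full correlation and slicing out the
    # lags in [-maxlag, maxlag]. Assumes real inputs.
    c = fftconvolve(a, b[::-1], mode='full')  # full cross-correlation
    zero = len(b) - 1  # index of lag 0 in the full output
    return c[zero - maxlag : zero + maxlag + 1]
```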
Honi Sanders
2015-06-21 22:46:08 UTC
Permalink
Did you check out my implementation? I was able to modify the Numpy correlate function just fine. https://github.com/numpy/numpy/compare/master...bringingheavendown:maxlag