Discussion:
[Numpy-discussion] Combining covariance and correlation coefficient into one numpy.cov call
Mathew S. Madhavacheril
2016-10-26 17:27:50 UTC
Hi all,

I posted a pull request:
https://github.com/numpy/numpy/pull/8211

which adds a function `numpy.covcorr` that calculates both
the covariance matrix and correlation coefficient with a single
call to `numpy.cov` (which is often an expensive call for large
data-sets). A function `numpy.covtocorr` has also been added
that converts a covariance matrix to a correlation coefficient,
and `numpy.corrcoef` has been modified to call this. The
motivation here is that one often needs the covariance for
subsequent analysis and the correlation coefficient for
visualization, so instead of forcing the user to write their own
code to convert one to the other, we want to allow both to
be obtained from `numpy` as efficiently as possible.

Best,
Mathew
Stephan Hoyer
2016-10-26 17:46:48 UTC
I wonder if the goals of this addition could be achieved by simply adding
an optional `cov` argument to np.corrcoef, which would provide a pre-computed
covariance.

Either way, `covcorr` feels like a helper function that could exist in user
code rather than numpy proper.

_______________________________________________
NumPy-Discussion mailing list
https://mail.scipy.org/mailman/listinfo/numpy-discussion
Mathew S. Madhavacheril
2016-10-26 18:03:36 UTC
Post by Stephan Hoyer
I wonder if the goals of this addition could be achieved by simply adding
an optional `cov` argument to np.corrcoef, which would provide a
pre-computed covariance.
That's a fair suggestion, which I'm happy to switch to. It eliminates the
need for two new functions. I'll add an optional `cov = False` argument to
numpy.corrcoef that, when true, makes it return a tuple (corr, cov) instead.
Post by Stephan Hoyer
Either way, `covcorr` feels like a helper function that could exist in
user code rather than numpy proper.
The user would have to re-implement the part that converts the covariance
matrix to a correlation coefficient. I made this PR to avoid that code
duplication.

Mathew
Stephan Hoyer
2016-10-26 18:13:54 UTC
With the API I was envisioning (or even your proposed API, for that
matter), this function would only be a few lines, e.g.,

def covcorr(x):
    cov = np.cov(x)
    corr = np.corrcoef(x, cov=cov)
    return (cov, corr)

Generally, functions this short should be provided as recipes (if at all)
rather than be added to numpy proper, unless the need for them is extremely
common.
Mathew S. Madhavacheril
2016-10-26 18:26:32 UTC
Ah, I see what you were suggesting now. I agree that a function like
covcorr need not be provided by numpy itself, but it would be tremendously
useful if a pre-computed covariance could be provided to np.corrcoef. I can
update this PR to just add `cov = None` to numpy.corrcoef and do an
`if cov is not None` check before calculating the covariance. Note, however,
that when `cov` is specified for np.corrcoef, the non-optional `x` argument
is redundant.
Nathaniel Smith
2016-10-26 18:56:41 UTC
IIUC, if you have a covariance matrix then you can compute the
correlation matrix directly, without looking at 'x', so corrcoef(x,
cov=cov) is a bit odd-looking. I think probably the API that makes the
most sense is just to expose something like the covtocorr function
(maybe it could have a less telegraphic name?)? And then, yeah, users
can use that to build their own covcorr or whatever if they want it.

-n
--
Nathaniel J. Smith -- https://vorpus.org
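
For reference, the conversion discussed above can be sketched in a few lines using only existing NumPy calls. This is a rough sketch, not the code from the PR: the names `cov_to_corr` and `covcorr_recipe` are hypothetical, and it assumes variables are in rows (the `np.cov` default) and that every variance is nonzero.

```python
import numpy as np


def cov_to_corr(cov):
    """Rescale a covariance matrix to a correlation matrix.

    Hypothetical helper for illustration; not part of NumPy.
    Assumes a square matrix with strictly positive diagonal.
    """
    cov = np.asarray(cov, dtype=float)
    std = np.sqrt(np.diag(cov))        # per-variable standard deviations
    return cov / np.outer(std, std)    # corr[i, j] = cov[i, j] / (std[i] * std[j])


def covcorr_recipe(x):
    """User-level recipe: one np.cov call, then a cheap rescale."""
    cov = np.cov(x)
    return cov, cov_to_corr(cov)
```

With a helper like this, `x` is never needed once the covariance is known, which is the point made above about `corrcoef(x, cov=cov)` looking odd.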
Mathew S. Madhavacheril
2016-10-26 19:11:22 UTC
Right, agreed, this is why I said `x` becomes redundant when `cov` is
specified when calling `numpy.corrcoef`. So we have two alternatives:

1) Have `np.corrcoef` accept an optional boolean argument `covmat = False`
that lets one obtain a tuple containing the covariance and the correlation
matrices in the same call.
2) Modify my original PR so that `np.covtocorr` remains (possibly with a
better name) but remove `np.covcorr`, since this is easy for the user to add.

My preference is option 2.
j***@gmail.com
2016-10-26 19:20:15 UTC
cov2corr is a useful function:
http://www.statsmodels.org/dev/generated/statsmodels.stats.moment_helpers.cov2corr.html
I also wrote the inverse function corr2cov, but AFAIR I use it only in some
test cases.

I don't think adding any of the options to corrcoef or covcorr is useful,
since there is no computational advantage to it.
What I'm missing are functions that return the intermediate results, e.g.
var and mean or cov and mean.

(For statsmodels I decided to return mean and cov or mean and var in the
related functions. Some R packages return the mean as an option.)

Josef
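
The two ideas mentioned above (an inverse conversion, and returning the mean alongside the covariance) could look roughly like the following. This is only a sketch under assumptions: `corr_to_cov` and `mean_and_cov` are hypothetical names, not the statsmodels or NumPy implementations, and variables are assumed to be in rows, as with the `np.cov` default.

```python
import numpy as np


def corr_to_cov(corr, std):
    """Inverse of the cov-to-corr rescale (hypothetical helper).

    `std` holds one standard deviation per variable.
    """
    std = np.asarray(std, dtype=float)
    return np.asarray(corr, dtype=float) * np.outer(std, std)


def mean_and_cov(x, ddof=1):
    """Return the per-variable mean together with the covariance matrix.

    The mean is an intermediate result of the covariance computation,
    so returning it costs nothing extra.
    """
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=1, keepdims=True)       # shape (n_variables, 1)
    centered = x - mean
    cov = centered @ centered.T / (x.shape[1] - ddof)
    return mean.ravel(), cov
```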
Mathew S. Madhavacheril
2016-10-26 20:12:05 UTC
Post by j***@gmail.com
I don't think adding any of the options to corrcoef or covcorr is useful,
since there is no computational advantage to it.
I'm not sure I agree with that statement. If a user wants to calculate both
a covariance and a correlation matrix, they currently have two options:
A) Call np.cov and np.corrcoef separately, which takes at least twice as
long as one call to np.cov. For the data-sets that I am used to, an np.cov
call takes 5-10 seconds.
B) Call np.cov and then separately implement their own correlation matrix
code, which means the user isn't able to take full advantage of code that
is already in numpy.

In any case, I've updated the PR:
https://github.com/numpy/numpy/pull/8211

Relative to my original PR, it:
a) removes the numpy.covcorr function, which the user can easily implement
b) exposes numpy.cov2corr in the API (previously called numpy.covtocorr in
the PR), which accepts a pre-calculated covariance matrix
c) has numpy.corrcoef call numpy.cov2corr
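
Option A versus a single `np.cov` call plus a rescale can be compared with a rough timing harness like the one below. The array shape is made up for illustration and is far smaller than the data-sets described above; the function names are hypothetical, and actual timings will depend on the machine and the BLAS build, so no numbers are claimed here.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 5000))  # 200 variables, 5000 observations (arbitrary size)


def two_calls():
    # Option A: np.corrcoef computes the covariance again internally.
    return np.cov(x), np.corrcoef(x)


def one_call_plus_rescale():
    # Single covariance computation, then divide by the outer product of
    # the standard deviations to get the correlation matrix.
    cov = np.cov(x)
    std = np.sqrt(np.diag(cov))
    return cov, cov / np.outer(std, std)


t_two = timeit.timeit(two_calls, number=5)
t_one = timeit.timeit(one_call_plus_rescale, number=5)
print(f"two calls: {t_two:.3f}s, one call plus rescale: {t_one:.3f}s")
```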