
ENH: add a weighted covariance calculation. #4960


Merged: 1 commit merged into numpy:master on May 13, 2015

Conversation

@tpoole (Contributor) commented Aug 13, 2014

Weights can now be included in a covariance calculation in a similar manner to the existing implementation of a weighted average.

@seberg (Member) commented Aug 13, 2014

Looks nice at first sight. In gh-3864 we seemed to figure that it may be better to implement two types of weights, something to do with the weights representing frequency information or accuracy information (I am not sure right now, and some people seemed to differentiate even more, but it didn't seem to matter). My guess is that the ones you implemented are frequency-type weights?

I don't have a very clear idea about what is right at the moment, so maybe you have an opinion/comment on the issue? Maybe these alone are good, too, and we should just state in the docs exactly what ours represent.

@tpoole (Contributor, Author) commented Aug 13, 2014

Having been through the other thread, it seems to me that ndawe's final summation is sensible: only handle two cases, frequencies and relative probabilities. I don't see why it would be useful to handle both concurrently, so a boolean switch indicating the type of the weights is probably the least confusing way to proceed, and this allows a user to use fractional frequencies if they wish.

@seberg (Member) commented Aug 13, 2014

Sounds good to me. I guess there really is not much point in having both at the same time, and it should be easier to understand like this (I suppose in the other issue I might have tended towards having both separately). Since you seem to plan on finishing this up, could you write a mail to the list once it is working? As a new feature, it should go over the list before we merge it.

raise RuntimeError("incompatible numbers of samples and weights")
if any(weights < 0):
raise RuntimeError("weights cannot be negative")
weights = weights.astype(np.float64)
Review comment (Member) on the diff above:

I am sometimes pedantic about such a cast (the user might use float128). But for the moment, there is a weights = np.asarray(weights) missing (at the top of the branch). The user might supply a list or a weird subclass.
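
A minimal sketch of the input handling being suggested (the helper name and exact checks are illustrative, not the code in this PR):

    import numpy as np

    def _validate_weights(weights, n_samples):
        weights = np.asarray(weights)   # accept lists and array subclasses up front
        if weights.shape != (n_samples,):
            raise RuntimeError("incompatible numbers of samples and weights")
        if np.any(weights < 0):
            raise RuntimeError("weights cannot be negative")
        return weights.astype(np.float64)   # the cast discussed above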

@MechCoder (Contributor) commented:
Anything left to bless this PR for merge? Anything that I can help with?

@charris added this to the 1.10.0 release milestone on Apr 7, 2015
@charris (Member) commented Apr 22, 2015

@MechCoder Could you rebase this?

@ndawe commented Apr 26, 2015

Thanks @tpoole. Maybe we can finally get this merged in. Is this PR still alive and should we go ahead with discussing this on the list? Sorry for dropping the ball on that with my PR.

@tpoole (Contributor, Author) commented Apr 26, 2015

It's been mentioned on the list a couple of times and there wasn't any significant feedback, though I'm not sure whether this is due to general acceptance or a lack of interest/understanding.

@tpoole changed the title from "ENH: add a weighted covariance calculation to numpy.function_base." to "ENH: add a weighted covariance calculation." on Apr 26, 2015
@charris (Member) commented Apr 27, 2015

I'm thinking of a combination of two weights, repeat and reliability, plus ddof. The weights can be combined, but not by simple multiplication, at least for the correction factor in @ndawe's formula above. The sum of squared weights needs to be replaced by ddof * sum(n * r**2), where n are the repeat weights and r are the reliability weights. I'm not sure what ddof does in the reliability case, but ddof=1 will give the unbiased estimate in all cases.
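
Written out, one reading of the factor being described (a reconstruction of this comment, not the merged code), with n_i the repeat weights and r_i the reliability weights:

    w_i = n_i r_i,   W = \sum_i w_i,   \mathrm{fact} = W - \mathrm{ddof}\,\frac{\sum_i n_i r_i^2}{W}

    C_{jk} = \frac{1}{\mathrm{fact}} \sum_i w_i (x_{ji} - \bar{x}_j)(x_{ki} - \bar{x}_k),
    \qquad \bar{x}_j = \frac{1}{W} \sum_i w_i x_{ji}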

@charris (Member) commented Apr 27, 2015

@MechCoder Could you explain precisely how scikit-learn wants to use the weights? I suspect that may be different from either of the two alternatives here.

@tpoole (Contributor, Author) commented Apr 27, 2015

It's difficult to see how mixed weights would be used. When is it appropriate to assign a relative reliability to N occurrences of a value?

@charris (Member) commented Apr 27, 2015

Frequency just seems to be shorthand for not listing everything, and it may well be that one would want to weight the repeated observations with a reliability factor. I'm also going by SAS, which has both, and they don't seem to be mutually exclusive. Since it is easy to combine the two using a formula that works in the conventional ways when either is set to 1, I think we might as well do so.

@tpoole (Contributor, Author) commented Apr 27, 2015

I've not dug particularly deep into the documentation, but it seems like SAS requires a discrete choice of correction factor in any case - see VARDEF at the bottom of http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146729.htm .

Normalizing frequencies is easy, and I'm reasonably confident of the normalization in the pure reliability weights regime (though even this is computed differently in different stats packages), but once hybrid weights appear it's not at all clear what the normalization should be.

@charris (Member) commented Apr 27, 2015

ddof=1 always gives the unbiased normalization. If probabilities are used for frequencies, set ddof=0, etc. The revised normalization factor works correctly for both pure frequencies and pure reliability weights (they are normally different). The method is: 1) use the product n*r inside the big sum, 2) use the product in the normalization factor, except use n*r**2 for the sum-of-squares term, 3) multiply the sum-of-squares term by ddof.
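
A short Python sketch of the recipe above (function and argument names are hypothetical; this is an illustration of the normalization, not the PR's code):

    import numpy as np

    def weighted_cov(x, n=None, r=None, ddof=1):
        # x: variables in rows, observations in columns
        # n: repeat (frequency) weights, r: reliability weights
        x = np.atleast_2d(np.asarray(x, dtype=np.float64))
        m = x.shape[1]
        n = np.ones(m) if n is None else np.asarray(n, dtype=np.float64)
        r = np.ones(m) if r is None else np.asarray(r, dtype=np.float64)
        w = n * r                                        # step 1: product inside the sums
        w_sum = w.sum()
        xbar = (x * w).sum(axis=1) / w_sum               # w-weighted means
        dx = x - xbar[:, None]
        fact = w_sum - ddof * (n * r**2).sum() / w_sum   # steps 2 and 3
        return np.dot(w * dx, dx.T) / fact

For pure frequencies (r = 1) the factor reduces to sum(n) - ddof, and for pure reliability weights (n = 1) to sum(r) - ddof * sum(r**2) / sum(r).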

@tpoole (Contributor, Author) commented Apr 28, 2015

I can see, algorithmically, that the new procedure reproduces sensible results in the cases of pure frequencies or probabilities, but I'm concerned about the mathematical correctness of this approach when they are mixed. The estimation of the sample size proceeds differently in each case, and, whilst this way of handling it looks nice, it's not obvious that they can be combined in this way.

@charris (Member) commented Apr 28, 2015

Think of the repeat weight as an integer number of repeats of that observation and write out the observation matrix repeating each column that number of times with the appropriate reliability weight, then compute the usual reliability factor. Note that fractional repeats can be considered scaled integers if you want to examine the repeats = probability limit.
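
As a quick numerical illustration of this "write out the repeats" view (assuming NumPy >= 1.10, where the fweights argument this PR adds is available):

    import numpy as np

    x = np.array([[1.0, 2.0, 4.0],
                  [2.0, 1.0, 9.0]])
    f = np.array([3, 1, 2])                    # integer repeat counts per column

    expanded = np.repeat(x, f, axis=1)         # each column written out f[i] times
    print(np.allclose(np.cov(expanded), np.cov(x, fweights=f)))   # True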

@charris (Member) commented Apr 29, 2015

@tpoole Essentially, the repeat weights and reliability weights are orthogonal. Does that make sense to you?

@tpoole (Contributor, Author) commented Apr 29, 2015

Yes, I've convinced myself that this approach produces meaningful results.

w = None
if frequencies is not None:
    frequencies = np.asarray(frequencies)
    if not np.issubdtype(frequencies.dtype, np.integer):
Review comment (Member) on the diff above:

They don't need to be integer, it is just the common case. The result is actually scale free.

Follow-up (Member):

NVM, the result isn't scale free with respect to frequency.

Follow-up (Member):

However, the probability case, as opposed to the integer case, can be handled by setting ddof to zero, i.e., ddof gets scaled down to zero, so I think allowing reals here is good. Hmm...

Reply (Contributor, Author):

For the normalization and ddof to work correctly the sum of the frequencies must correspond to the total number of samples. Making the frequencies integer is a way of ensuring this.

Reply (Member):

Yeah, but in the limit that they go to infinity (probability) one wants to set ddof=0 and use the probabilities. So it is certainly worth recommending the common use as integers, but not raising an error if they aren't. OTOH, one can get the same result using the reliability weights instead with ddof=0, so maybe integers are OK.

Reply (Contributor, Author):

I think frequencies should be integer, and clearly differentiated from the other, more general type of weights. That's the motivation behind labelling them frequencies, rather than fweights, so their correct usage is more obvious. The other type of weight deserves a generic name as it can correspond to a number of things depending on how you use it - reliability or inverse error bars, some measure of relative importance, or probabilities. It's particularly well suited for probabilities as the weights get normalized (sum(w)=1) before use.

Reply (Member):

OK. Maybe just check for integer values?

if not np.all(frequencies == np.around(frequencies)):

@charris (Member) commented May 3, 2015

It would be nice to finish this up. The LaTeX bits should go in the Notes section, as they clutter the main section with markup that is unreadable to most. The new weights also need a .. versionadded:: 1.10 line; see the entry for ddof as an example. There should also be an entry in doc/release/1.10.0-notes.rst in the New Features section.

@tpoole (Contributor, Author) commented May 7, 2015

I think we should keep the 'frequencies' and 'weights' arguments. They represent different types of quantity, and that should be emphasised by the difference in their names: 'frequencies' are frequencies, and 'weights' are a more general weighting that can represent relative importance, error or probabilities.

@ndawe commented May 8, 2015

I agree. That is also consistent with numpy.average(a, axis=None, weights=None, returned=False).

@seberg (Member) commented May 8, 2015

I am not sure I agree. In the sense of averaging, both types of weights (if you call them that) seem identical, so frequencies translate to weights in averaging as well.
Not sure what the best thing is, but one plus for the (a|f)weights names would be that you have to think when going from average to here. Plus, other statistics toolboxes seem to use this naming scheme?

@charris (Member) commented May 8, 2015

> Plus other statistics toolboxes seem to use this naming scheme?

I was going by Stata fweight (frequency weight) and aweight (analytic weight). Stata also has a pweight (probability weight). I'm not sure exactly how Stata uses all those, but I think we get most things using ddof in combination with the weights.

@tpoole (Contributor, Author) commented May 8, 2015

They're not the same in the sense of averaging - if the sum of the frequencies doesn't correspond to the total number of samples then the prefactor makes no sense, whereas for weights the total number of samples is estimated by the variation in the magnitudes of the weights. That being said, keeping consistent nomenclature with other packages is definitely sensible.

if any(aweights < 0):
    raise ValueError(
        "aweights cannot be negative")
aweights /= float(sum(aweights))
Review comment (Member) on the diff above:

/= can cause problems if aweights is not float.
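
A minimal illustration of the issue and one way around it (an illustrative snippet, not the PR's code):

    import numpy as np

    aweights = np.asarray([1, 2, 3])               # integer input
    # aweights /= float(aweights.sum())            # in-place: fails or truncates for integer dtypes
    aweights = aweights / float(aweights.sum())    # out-of-place division returns a float array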

@charris (Member) commented May 12, 2015

Looks almost there. Could you squash your commits using git rebase -i HEAD~8 followed by a force push?

@@ -1921,36 +1951,77 @@ def cov(m, y=None, rowvar=1, bias=0, ddof=None):
        y = np.asarray(y)
        dtype = np.result_type(m, y, np.float64)
    X = array(m, ndmin=2, dtype=dtype)
    if rowvar == 0 and X.shape[0] != 1:
Review comment (Member) on the diff above:

Hmm, I see this is inherited: X.shape[0] == 1 always overrode rowvar == 0. Doesn't seem right, but that is how it was...

@charris (Member) commented May 12, 2015

Looks good. If you want to special case ddof == 0 go ahead, otherwise let me know and I'll put this in.

'fweights' allows integer frequencies to be specified for observation vectors,
and 'aweights' provides a more general importance or probabilistic weighting.
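
For reference, a small usage illustration of the new arguments (assuming NumPy >= 1.10, where they ship; the data and weights are arbitrary):

    import numpy as np

    x = np.array([[0.0, 1.0, 2.0],
                  [2.0, 1.0, 0.0]])
    f = np.array([1, 2, 1])          # fweights: integer repeat counts per observation
    a = np.array([0.5, 1.0, 0.25])   # aweights: relative importance / reliability

    c = np.cov(x, fweights=f, aweights=a)
    print(c)
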
@tpoole (Contributor, Author) commented May 13, 2015

Done.

charris added a commit that referenced this pull request on May 13, 2015:
ENH: add a weighted covariance calculation.
@charris merged commit 9ceb5cd into numpy:master on May 13, 2015
@charris (Member) commented May 13, 2015

Great, in it goes. Thanks @tpoole for getting this done.

@ndawe commented May 14, 2015

Many thanks! Congrats @tpoole.

@MechCoder (Contributor) commented:
@charris Sorry for the late reply. For example, in certain classification problems, samples corresponding to a certain class might be over-represented; in that case those samples are given a smaller sample weight.

It corresponds to the aweights in this pull request. However, the weighted variance is calculated inside scikit-learn by

np.average((X - np.average(X, weights=sample_weights)) ** 2, weights=sample_weights)

This can be obtained from the current PR by setting ddof=0.
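
A small check of that equivalence (assuming NumPy >= 1.10; the data and weights are arbitrary):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    sample_weights = np.array([0.1, 0.4, 0.4, 0.1])

    sk_var = np.average((x - np.average(x, weights=sample_weights)) ** 2,
                        weights=sample_weights)
    cov_var = np.cov(x, aweights=sample_weights, ddof=0)
    print(np.allclose(sk_var, cov_var))   # True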
