
ENH: add a weighted covariance calculation. #4960


Merged: 1 commit merged into numpy:master on May 13, 2015

Conversation

@tpoole (Contributor) commented Aug 13, 2014

Weights can now be included in a covariance calculation in a similar manner to the existing implementation of a weighted average.

@seberg (Member) commented Aug 13, 2014

Looks nice at first sight. In gh-3864 we seemed to figure that it may be better to implement two types of weights, something to do with the weights representing frequency information or accuracy information (I am not sure right now, and some people seemed to differentiate even more, but it didn't seem to matter). My guess is that the ones you implemented are frequency-type weights?

I don't have a very clear idea about what is right at the moment, so maybe you have an opinion/comment on the issue? Maybe these alone are good, too, and we should just state in the docs exactly what ours represent.

@tpoole (Contributor, Author) commented Aug 13, 2014

Having been through the other thread, it seems to me that ndawe's final summation is sensible: only handle two cases, frequencies and relative probabilities. I don't see why it would be useful to handle both concurrently, so a boolean switch indicating the type of the weights is probably the least confusing way to proceed, and this allows a user to use fractional frequencies if they wish.

@seberg (Member) commented Aug 13, 2014

Sounds good to me. I guess there really is not much point in having both at the same time, and it should be easier to understand like this (I suppose in the other issue I might have tended towards having both separately). Since you seem to plan on finishing this up, could you write a mail to the list once it is working? As a new feature, it should go over the list before we merge it.

raise RuntimeError("incompatible numbers of samples and weights")
if any(weights < 0):
raise RuntimeError("weights cannot be negative")
weights = weights.astype(np.float64)
Review comment (Member) on the diff above:

I am sometimes pedantic about such a cast (the user might use float128). But for the moment, there is a weights = np.asarray(weights) missing (at the top of the branch). The user might supply a list or a weird subclass.
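
A minimal sketch of the input handling being suggested (the helper name and exact checks are illustrative, not the code in this PR):

    import numpy as np

    def _validate_weights(weights, n_samples):
        weights = np.asarray(weights)   # accept lists and array subclasses up front
        if weights.shape != (n_samples,):
            raise RuntimeError("incompatible numbers of samples and weights")
        if np.any(weights < 0):
            raise RuntimeError("weights cannot be negative")
        return weights.astype(np.float64)   # the cast discussed above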

@MechCoder (Contributor) commented:
Anything left to bless this PR for merge? Anything that I can help with?

@charris added this to the 1.10.0 release milestone on Apr 7, 2015
@charris (Member) commented Apr 22, 2015

@MechCoder Could you rebase this?

@ndawe commented Apr 26, 2015

Thanks @tpoole. Maybe we can finally get this merged in. Is this PR still alive and should we go ahead with discussing this on the list? Sorry for dropping the ball on that with my PR.

@tpoole (Contributor, Author) commented Apr 26, 2015

It's been mentioned on the list a couple of times and there wasn't any significant feedback, though I'm not sure whether this is due to general acceptance or a lack of interest/understanding.

@tpoole changed the title from "ENH: add a weighted covariance calculation to numpy.function_base." to "ENH: add a weighted covariance calculation." on Apr 26, 2015
@charris (Member) commented Apr 27, 2015

I'm thinking of a combination of two weights, repeat and reliability, plus ddof. The weights can be combined, but not by simple multiplication, at least for the correction factor in @ndawe's formula above. The sum of squared weights needs to be replaced by ddof * sum(n * r**2), where n are the repeat weights and r are the reliability weights. I'm not sure what ddof does in the reliability case, but ddof=1 will give the unbiased estimate in all cases.
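
Written out, one reading of the factor being described (a reconstruction of this comment, not the merged code), with n_i the repeat weights and r_i the reliability weights:

    w_i = n_i r_i,   W = \sum_i w_i,   \mathrm{fact} = W - \mathrm{ddof}\,\frac{\sum_i n_i r_i^2}{W}

    C_{jk} = \frac{1}{\mathrm{fact}} \sum_i w_i (x_{ji} - \bar{x}_j)(x_{ki} - \bar{x}_k),
    \qquad \bar{x}_j = \frac{1}{W} \sum_i w_i x_{ji}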

@charris (Member) commented Apr 27, 2015

@MechCoder Could you explain precisely how scikit-learn wants to use the weights? I suspect that may be different from either of the two alternatives here.

@tpoole (Contributor, Author) commented Apr 27, 2015

It's difficult to see how mixed weights would be used. When is it appropriate to assign a relative reliability to N occurrences of a value?

@charris (Member) commented Apr 27, 2015

Frequency just seems to be shorthand for not listing everything, and it may well be that one would want to weight the repeated observations with a reliability factor. I'm also going by SAS, which has both, and they don't seem to be mutually exclusive. Since it is easy to combine the two using a formula that works in the conventional ways when either is set to 1, I think we might as well do so.

@tpoole (Contributor, Author) commented Apr 27, 2015

I've not dug particularly deep into the documentation, but it seems like SAS requires a discrete choice of correction factor in any case - see VARDEF at the bottom of http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146729.htm .

Normalizing frequencies is easy, and I'm reasonably confident of the normalization in the pure reliability weights regime (though even this is computed differently in different stats packages), but once hybrid weights appear it's not at all clear what the normalization should be.

@charris (Member) commented Apr 27, 2015

ddof=1 always gives the unbiased normalization. If probabilities are used for frequencies, set ddof=0, etc. The revised normalization factor works correctly for both pure frequencies and pure reliability weights (they are normally different). The method is: 1) use the product n*r inside the big sum, 2) use the product in the normalization factor, except use n*r**2 for the sum-of-squares term, 3) multiply the sum-of-squares term by ddof.
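
A short Python sketch of the recipe above (function and argument names are hypothetical; this is an illustration of the normalization, not the PR's code):

    import numpy as np

    def weighted_cov(x, n=None, r=None, ddof=1):
        # x: variables in rows, observations in columns
        # n: repeat (frequency) weights, r: reliability weights
        x = np.atleast_2d(np.asarray(x, dtype=np.float64))
        m = x.shape[1]
        n = np.ones(m) if n is None else np.asarray(n, dtype=np.float64)
        r = np.ones(m) if r is None else np.asarray(r, dtype=np.float64)
        w = n * r                                        # step 1: product inside the sums
        w_sum = w.sum()
        xbar = (x * w).sum(axis=1) / w_sum               # w-weighted means
        dx = x - xbar[:, None]
        fact = w_sum - ddof * (n * r**2).sum() / w_sum   # steps 2 and 3
        return np.dot(w * dx, dx.T) / fact

For pure frequencies (r = 1) the factor reduces to sum(n) - ddof, and for pure reliability weights (n = 1) to sum(r) - ddof * sum(r**2) / sum(r).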

@tpoole (Contributor, Author) commented Apr 28, 2015

I can see, algorithmically, that the new procedure reproduces sensible results in the cases of pure frequencies or probabilities, but I'm concerned about the mathematical correctness of this approach when they are mixed. The estimation of the sample size proceeds differently in each case, and, whilst this way of handling it looks nice, it's not obvious that they can be combined in this way.

@charris (Member) commented Apr 28, 2015

Think of the repeat weight as an integer number of repeats of that observation and write out the observation matrix repeating each column that number of times with the appropriate reliability weight, then compute the usual reliability factor. Note that fractional repeats can be considered scaled integers if you want to examine the repeats = probability limit.
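
As a quick numerical illustration of this "write out the repeats" view (assuming NumPy >= 1.10, where the fweights argument this PR adds is available):

    import numpy as np

    x = np.array([[1.0, 2.0, 4.0],
                  [2.0, 1.0, 9.0]])
    f = np.array([3, 1, 2])                    # integer repeat counts per column

    expanded = np.repeat(x, f, axis=1)         # each column written out f[i] times
    print(np.allclose(np.cov(expanded), np.cov(x, fweights=f)))   # True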

@charris (Member) commented Apr 29, 2015

@tpoole Essentially, the repeat weights and reliability weights are orthogonal. Does that make sense to you?

@tpoole (Contributor, Author) commented Apr 29, 2015

Yes, I've convinced myself that this approach produces meaningful results.

w = None
if frequencies is not None:
    frequencies = np.asarray(frequencies)
    if not np.issubdtype(frequencies.dtype, np.integer):
Review comment (Member) on the diff above:

They don't need to be integer, it is just the common case. The result is actually scale free.

Follow-up (Member):

NVM, the result isn't scale free with respect to frequency.

Follow-up (Member):

However, the probability case, as opposed to the integer case, can be handled by setting ddof to zero, i.e., ddof gets scaled down to zero, so I think allowing reals here is good. Hmm...

Reply (Contributor, Author):

For the normalization and ddof to work correctly the sum of the frequencies must correspond to the total number of samples. Making the frequencies integer is a way of ensuring this.

Reply (Member):

Yeah, but in the limit that they go to infinity (probability) one wants to set ddof=0 and use the probabilities. So it is certainly worth recommending the common use as integers, but not raising an error if they aren't. OTOH, one can get the same result using the reliability weights instead with ddof=0, so maybe integers are OK.

Reply (Contributor, Author):

I think frequencies should be integer, and clearly differentiated from the other, more general type of weights. That's the motivation behind labelling them frequencies, rather than fweights, so their correct usage is more obvious. The other type of weight deserves a generic name as it can correspond to a number of things depending on how you use it - reliability or inverse error bars, some measure of relative importance, or probabilities. It's particularly well suited for probabilities as the weights get normalized (sum(w)=1) before use.

Reply (Member):

OK. Maybe just check for integer values?

if not np.all(frequencies == np.around(frequencies)):

@charris (Member) commented May 3, 2015

It would be nice to finish this up. The LaTeX bits should go in the Notes section, as they clutter the main section with markup that is unreadable to most. The new weights also need a .. versionadded:: 1.10 line; see the entry for ddof as an example. There should also be an entry in doc/release/1.10.0-notes.rst in the New Features section.

@tpoole (Contributor, Author) commented May 7, 2015

I think we should keep the 'frequencies' and 'weights' arguments. They represent different types of quantity, and that should be emphasised by the difference in their names: 'frequencies' are frequencies, and 'weights' are a more general weighting that can represent relative importance, error or probabilities.

@ndawe commented May 8, 2015

I agree. That is also consistent with numpy.average(a, axis=None, weights=None, returned=False).

@seberg (Member) commented May 8, 2015

I am not sure I agree. In the sense of averaging, both types of weights (if you call them that) seem identical, so frequencies translate to weights in averaging as well.
Not sure what the best thing is, but one plus for the (a|f)weights names would be that you have to think when going from average to here. Plus, other statistics toolboxes seem to use this naming scheme?

@charris (Member) commented May 8, 2015

> Plus other statistics toolboxes seem to use this naming scheme?

I was going by Stata fweight (frequency weight) and aweight (analytic weight). Stata also has a pweight (probability weight). I'm not sure exactly how Stata uses all those, but I think we get most things using ddof in combination with the weights.

@tpoole (Contributor, Author) commented May 8, 2015

They're not the same in the sense of averaging - if the sum of the frequencies doesn't correspond to the total number of samples then the prefactor makes no sense, whereas for weights the total number of samples is estimated by the variation in the magnitudes of the weights. That being said, keeping consistent nomenclature with other packages is definitely sensible.

if any(aweights < 0):
    raise ValueError(
        "aweights cannot be negative")
aweights /= float(sum(aweights))
Review comment (Member) on the diff above:

/= can cause problems if aweights is not float.
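
A minimal illustration of the issue and one way around it (an illustrative snippet, not the PR's code):

    import numpy as np

    aweights = np.asarray([1, 2, 3])               # integer input
    # aweights /= float(aweights.sum())            # in-place: fails or truncates for integer dtypes
    aweights = aweights / float(aweights.sum())    # out-of-place division returns a float array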

@charris (Member) commented May 12, 2015

Looks almost there. Could you squash your commits using git rebase -i HEAD~8 followed by a force push?

@@ -1921,36 +1951,77 @@ def cov(m, y=None, rowvar=1, bias=0, ddof=None):
        y = np.asarray(y)
        dtype = np.result_type(m, y, np.float64)
    X = array(m, ndmin=2, dtype=dtype)
    if rowvar == 0 and X.shape[0] != 1:
Review comment (Member) on the diff above:

Hmm, I see this is inherited: X.shape[0] == 1 always overrode rowvar == 0. Doesn't seem right, but that is how it was...

@charris (Member) commented May 12, 2015

Looks good. If you want to special case ddof == 0 go ahead, otherwise let me know and I'll put this in.

'fweights' allows integer frequencies to be specified for observation vectors,
and 'aweights' provides a more general importance or probabilistic weighting.
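
For reference, a small usage illustration of the new arguments (assuming NumPy >= 1.10, where they ship; the data and weights are arbitrary):

    import numpy as np

    x = np.array([[0.0, 1.0, 2.0],
                  [2.0, 1.0, 0.0]])
    f = np.array([1, 2, 1])          # fweights: integer repeat counts per observation
    a = np.array([0.5, 1.0, 0.25])   # aweights: relative importance / reliability

    c = np.cov(x, fweights=f, aweights=a)
    print(c)
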
@tpoole (Contributor, Author) commented May 13, 2015

Done.

charris added a commit that referenced this pull request on May 13, 2015:
ENH: add a weighted covariance calculation.
@charris merged commit 9ceb5cd into numpy:master on May 13, 2015
@charris (Member) commented May 13, 2015

Great, in it goes. Thanks @tpoole for getting this done.

@ndawe commented May 14, 2015

Many thanks! Congrats @tpoole.

@MechCoder (Contributor) commented:
@charris Sorry for the late reply. For example, in certain classification problems, samples corresponding to a certain class might be over-represented; in that case those samples are given a smaller sample weight.

It corresponds to the aweights in this pull request. However, the weighted variance is calculated inside scikit-learn by

np.average((X - np.average(X, weights=sample_weights)) ** 2, weights=sample_weights)

This can be obtained from the current PR by setting ddof=0.
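
A small check of that equivalence (assuming NumPy >= 1.10; the data and weights are arbitrary):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    sample_weights = np.array([0.1, 0.4, 0.4, 0.1])

    sk_var = np.average((x - np.average(x, weights=sample_weights)) ** 2,
                        weights=sample_weights)
    cov_var = np.cov(x, aweights=sample_weights, ddof=0)
    print(np.allclose(sk_var, cov_var))   # True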
