cov and corrcoef: add support for weights #3864
Conversation
```python
if weights is not None:
    weights = array(weights, dtype=float)
    X -= average(X, axis=1-axis, weights=weights)[tup]
    N = weights.sum()
```
This doesn't look right.
If the sum of the weights is 1 and ddof and bias are set to their default values, you always get a division by zero in the scaling, due to `fact = N - ddof`.
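To illustrate the failure mode described above, here is a minimal sketch (variable names follow the quoted diff; `ddof = 1` corresponds to the default `bias=0`): when the weights are normalized to sum to 1, the scaling factor `fact = N - ddof` is exactly zero.

```python
import numpy as np

# Reproduce the scaling from the snippet above: fact = N - ddof, where
# N = weights.sum().  With normalized weights (sum == 1) and the default
# ddof of 1, the factor is exactly zero, so the covariance would be
# divided by zero.
weights = np.array([0.2, 0.3, 0.5])  # sums to 1
ddof = 1                             # default: bias=0 -> ddof=1

N = weights.sum()
fact = N - ddof
print(fact)  # 0.0
```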
@juliantaylor, very good point. The weights should first be normalized to sum to N (the unweighted number of observations), right? I am looking at http://stats.stackexchange.com/a/61298/29084
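One way to do the normalization suggested above (a sketch, not the PR's actual code): rescale the weights so they sum to the number of observations `n`; then `fact = n - ddof` matches the unweighted case and cannot be zero for `n > ddof`, however the weights were originally scaled.

```python
import numpy as np

# Sketch: rescale the weights so they sum to the number of observations.
w = np.array([0.2, 0.3, 0.5])   # normalized weights, sum == 1
n = w.size                      # number of observations
w_scaled = w * (n / w.sum())    # now w_scaled.sum() == n

ddof = 1
fact = w_scaled.sum() - ddof
print(fact)  # 2.0 for n = 3
```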
A ValueError should also be raised if the weights sum to zero.
@juliantaylor this should now be the correct implementation. If the weights are not repeat-type, i.e. they don't represent an integer number of occurrences of each observation, then we only have the biased definition. So this PR adds two new arguments:
```
weights : array-like, optional
    A 1-D array of weights with a length equal to the number of
    observations.
repeat_weights : int, optional
```
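To make the proposal concrete, here is a hypothetical sketch of the two proposed arguments in a standalone function (the name `weighted_cov` and the exact formulas are my assumptions, not the PR's code; it also includes the zero-sum `ValueError` requested above):

```python
import numpy as np

def weighted_cov(X, weights=None, repeat_weights=0, ddof=1):
    """Hypothetical sketch of the proposed interface (not NumPy's API).

    weights        : 1-D array, one weight per observation (column of X).
    repeat_weights : if nonzero, weights are integer repeat counts and the
                     ddof-corrected (unbiased) estimate is meaningful.
    """
    X = np.asarray(X, dtype=float)
    if weights is None:
        return np.cov(X, ddof=ddof)
    w = np.asarray(weights, dtype=float)
    if w.sum() == 0:
        raise ValueError("weights must not sum to zero")
    mean = np.average(X, axis=1, weights=w)
    Xc = X - mean[:, None]           # center each row of observations
    if repeat_weights:
        fact = w.sum() - ddof        # repeat counts: N = sum of weights
    else:
        fact = w.sum()               # general weights: biased estimate only
    return (Xc * w) @ Xc.T / fact
```

With repeat-type weights this should match an explicitly expanded sample, e.g. `weights=[1, 2, 1]` versus repeating the middle observation. NumPy's eventual API (gh-4960) took a different shape, so treat this purely as an illustration of the discussion.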
Sounds to me like it should be a boolean. Before adding new keyword arguments, a short note to the mailing list would be good; someone might have some cool ideas about the API. I have never used such a weighted cov: are the actual weights integers if this is True, or do they just sum up to an integer (or, well, can be interpreted as degrees of freedom or something like that)?
I wanted to make it a boolean, but then I saw that `bias` was an int, so I assumed that for some reason numpy preferred 0/1 switches over False/True. I also put this new kwarg last so it would not interfere with existing calls to this function that assume the current argument order. I will send an email to the list asking for more input and ideas. If `repeat_weights` is True, then the elements of `weights` must be integers for the results to make sense. The weights are in general not integer frequencies, though, and in that case we only have the biased covariance. Maybe there are better names for this argument... `frequency_weights`?
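A quick illustration of the "repeat-type" semantics discussed above (my example, not from the PR): a frequency weight of 2 means the observation occurred twice, so weighting is equivalent to expanding the sample.

```python
import numpy as np

# Frequency (repeat-type) weights mean "this observation occurred w times":
x = np.array([1.0, 2.0])
w = np.array([1, 2])                    # the value 2.0 occurred twice

x_expanded = np.array([1.0, 2.0, 2.0])  # same data, written out explicitly
print(np.average(x, weights=w))         # 5/3
print(x_expanded.mean())                # 5/3

# A reliability weight like 0.5 has no such expansion, which is why the
# comment above says only the biased covariance is available then.
```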
I bet the function is just so old that when it was written Python did not have booleans.
Wow...
@juliantaylor, yes, it will conflict with gh-3882; can't be helped... one of these will have to do an annoying merge. @ndawe, this needs a ping to the mailing list and tests. I am a bit wondering about the "repeat-type" argument. It seems to me that all it does is forcing...
I don't mind doing a rebase on top of #3882 since I see that as a more important issue.
@seberg Thanks for the ping, I will look at it. In statsmodels I have only frequency weights (which I had to correct after the stackoverflow question). I don't see unit tests yet in this PR.
I think the bool... What I remember from the SAS documentation: they are just multiplied together but affect the total "sample size" differently. I never looked at the details for the non-frequency weights in this case. If those have the interpretation of the inverse standard deviation of the individual observations, then they shouldn't affect the denominator, I guess. This should be compared with other packages (and unit tested against them).
Ah, that sounds good, having...
I mixed them up: Stata has fweights (repeated observations) and aweights, but I just tried, and it doesn't seem to allow them at the same time, although I don't see a reason why it shouldn't. In Stata, scaling... Doubling fweights doubles the number of assumed observations; that's easier to understand. I haven't looked for the SAS or SPSS documentation. (SPSS has only one weight, fweights, in the formula collection, IIRC.) I have no idea what R does, since it doesn't have such nice documentation.
SAS: I don't remember which procedures I was reading, never for cov specifically. For example, SAS has FREQ and WEIGHTS; I remember reading another SAS documentation page that had f_i and w_i at the same time in the formulas.
Graah. Assuming that Stata blog post is right, I see 4 different types of weights (but they can be squashed): ...
I suppose... So, I guess having...
Sorry for the silence. I'll get back to this with comments and suggested changes as soon as possible.
@ndawe, I just pushed the other pull request into master, so you might want to fetch it before you start working on this more (since I expect it is an ugly merge).
@seberg, I rebased and resolved the conflicts. So have we decided on two arguments?
@ndawe, it sounds reasonable to me, but please write to the list (since it is a new feature, that should be done anyway). Maybe some statisticians have comments, and honestly I am not deep enough into these to make such a call.
The list discussion may be the next logical step, but this is not ready. We basically somewhat decided that another approach is probably better, and that is not in here. However, the list may disagree with that...
I think the discussion on the different types of weights is making it more complicated than it needs to be. I only see the need for...
Sorry, I probably have to think this through some more at some point. If I recall correctly, I fully agree ;). I am not sure if the way to implement it is what you've got now, or to have...
Frankly, I'd just have one weight and let the user set up the correct values, but that may be a bit much...
Yes, I quite frankly think that people are overthinking this. That's my feeling based on the discussion above.
In the meantime I've moved on with my private implementation, since it became clear that this would take more time than it's worth to get it pushed into numpy. It's a weighted covariance... not that hard. Give users the ability to specify one type of weights (or just two, with frequencies) and let them set up the weight values according to what they need. Done.
Maybe... I am fine with just one weight as long as the documentation is clear about what it is and what it is not. And possibly a link to somewhere explaining how you would get the other stuff.
Still thinking about this and researching the topic.
I'm thinking of a combination of two weights, repeat and reliability, and ddof. The weights can be combined, but not by simple multiplication, at least for the correction factor in @ndawe's formula above. The sum of squared weights needs to be replaced by...
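For reliability weights, the standard unbiased correction replaces `N - ddof` with `V1 - ddof * V2 / V1`, where `V1 = sum(w)` and `V2 = sum(w**2)` (this is the usual weighted-sample-covariance form; whether it is exactly what the comment above intends is my assumption):

```python
import numpy as np

# Reliability-weight correction factor: instead of dividing by N - ddof,
# divide by V1 - ddof * V2 / V1 with V1 = sum(w), V2 = sum(w**2).
w = np.array([0.5, 1.0, 1.5, 1.0])
V1 = w.sum()            # 4.0
V2 = (w ** 2).sum()     # 4.5

ddof = 1
fact = V1 - ddof * V2 / V1
print(fact)  # 2.875

# Sanity check: for equal unit weights this reduces to the usual n - 1.
wu = np.ones(4)
assert np.isclose(wu.sum() - ddof * (wu ** 2).sum() / wu.sum(), 3.0)
```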
Closed by #4960.
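For reference, the API that eventually landed via gh-4960 (in NumPy 1.10) split the two weight types discussed in this thread into separate `fweights` and `aweights` arguments of `np.cov`:

```python
import numpy as np

# np.cov (NumPy >= 1.10) exposes both weight types from this thread:
#   fweights -- integer repeat counts (the "repeat-type" weights above)
#   aweights -- general reliability weights
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 0.0]])

# fweights=[1, 2, 1] is equivalent to repeating the middle observation:
c_f = np.cov(X, fweights=[1, 2, 1])
X_expanded = np.array([[1.0, 2.0, 2.0, 3.0],
                       [2.0, 1.0, 1.0, 0.0]])
assert np.allclose(c_f, np.cov(X_expanded))
```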
Given that `average` supports weights [1], and that when searching around there have been requests [2] for implementations of weighted covariance, I think it would be beneficial to also support weights in `cov` and `corrcoef` [3, 4].

[1] http://docs.scipy.org/doc/numpy/reference/generated/numpy.average.html
[2] http://stackoverflow.com/questions/11442962/python-package-that-supports-weighted-covariance-computation
[3] http://stats.stackexchange.com/a/61298/29084
[4] https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_covariance
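The starting point for the request above is that `np.average` already accepts per-observation weights [1] (a minimal illustration, with made-up numbers):

```python
import numpy as np

# np.average already accepts per-observation weights:
x = np.array([1.0, 2.0, 4.0])
w = np.array([3.0, 1.0, 1.0])
print(np.average(x, weights=w))  # (3*1 + 1*2 + 1*4) / 5 = 1.8
```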