Possible bug in compute_class_weight() #4324

Hi, 

I think there might be a bug in the `compute_class_weight()` function in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/class_weight.py. Lines 50 and 51 are as follows:

```
    recip_freq = 1. / bincount(y_ind)
    weight = recip_freq[le.transform(classes)] / np.mean(recip_freq)
```

For a two-class problem with N0 data points in class 0 and N1 data points in class 1, this is equivalent to weight[0] = w0 = (1 / N0) / (0.5 \* (1 / N0 + 1 / N1)) = 2 \* N1 / (N0 + N1) and weight[1] = w1 = (1 / N1) / (0.5 \* (1 / N0 + 1 / N1)) = 2 \* N0 / (N0 + N1).
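To make this concrete, here is a minimal sketch that mimics the two quoted lines with plain NumPy. It assumes the labels are already the integers 0 and 1, so the label-encoding step (`y_ind`, `le.transform(classes)`) becomes a no-op. The helper name `current_weights` and the counts N0 = 90, N1 = 10 are mine, chosen only for illustration.

```python
import numpy as np

def current_weights(y):
    """Mimic the two quoted lines, assuming labels are already 0..K-1."""
    recip_freq = 1.0 / np.bincount(y)
    return recip_freq / np.mean(recip_freq)

# Illustrative counts (not from the library): 90 zeros and 10 ones.
N0, N1 = 90, 10
y = np.array([0] * N0 + [1] * N1)

w = current_weights(y)
closed_form = np.array([2 * N1 / (N0 + N1), 2 * N0 / (N0 + N1)])

print(w)                      # numerical weights from the quoted formula
print(closed_form)            # closed-form expressions derived above
print(w[0] * N0 + w[1] * N1)  # total weighted count, to compare with N0 + N1
```

With these counts the weights come out to roughly 0.2 and 1.8, so the total weighted count is 36 rather than N0 + N1 = 100.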

These expressions for w0 and w1 do not correspond with my intuition or with the expressions given in <a href="http://gking.harvard.edu/files/0s.pdf">this paper</a> by King and Zeng.

Intuitively, assuming that I am trying to "balance" my classes using my class weights, I would expect my class weights (w0 and w1, respectively) to satisfy w0 \* N0 + w1 \* N1 = N0 + N1 and w0 \* N0 = w1 \* N1. Solving for w0 gives (N0 + N1) / (2 \* N0) while solving for w1 gives (N0 + N1) / (2 \* N1).
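As a sanity check on the two conditions above, here is a small sketch of the weights I would expect. The helper name `expected_weights` is hypothetical, and dividing by the number of classes is my generalisation of the factor of 2 in the two-class case. Labels are again assumed to be the integers 0 and 1.

```python
import numpy as np

def expected_weights(y):
    """Weights w_k = N / (K * N_k); for K = 2 this is (N0 + N1) / (2 * N_k)."""
    counts = np.bincount(y)
    return counts.sum() / (len(counts) * counts)

# Illustrative counts: 90 zeros and 10 ones.
N0, N1 = 90, 10
y = np.array([0] * N0 + [1] * N1)

w_expected = expected_weights(y)
weighted_counts = w_expected * np.array([N0, N1])

print(w_expected)             # (N0 + N1) / (2 * N0) and (N0 + N1) / (2 * N1)
print(weighted_counts)        # both entries should be equal
print(weighted_counts.sum())  # should equal N0 + N1
```

For these counts both weighted class totals are 50, which sums to N0 + N1 = 100.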

This intuition is backed up by King and Zeng's paper, which says (on pages 144--145) that if tau is the fraction of ones in the population (i.e., 0.5 for a balanced population, which is what we're assuming here) and y is the fraction of ones in the sample (i.e., N1 / (N0 + N1)) then w1 = tau / y = 0.5 / (N1 / (N0 + N1)) = (N0 + N1) / (2 \* N1) and similarly w0 = (N0 + N1) / (2 \* N0).

Is this what you were intending to compute in lines 50 and 51? If so, then I think there is a bug.

