Description
Hi,
I think there might be a bug in the compute_class_weight() function in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/class_weight.py. Lines 50 and 51 are as follows:
recip_freq = 1. / bincount(y_ind)
weight = recip_freq[le.transform(classes)] / np.mean(recip_freq)
For a two-class problem with N0 data points in class 0 and N1 data points in class 1, this is equivalent to weight[0] = w0 = (1 / N0) / (0.5 * (1 / N0 + 1 / N1)) = 2 * N1 / (N0 + N1) and weight[1] = w1 = (1 / N1) / (0.5 * (1 / N0 + 1 / N1)) = 2 * N0 / (N0 + N1).
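As a quick numerical check of the algebra above, here is a minimal sketch that applies the same arithmetic as the two quoted lines to a toy label array. The counts N0 = 90 and N1 = 10 are made up for illustration, the labels are assumed to be already encoded as 0 and 1 (so the le.transform step is skipped), and np.bincount stands in for the bincount helper used in the file. The same simplification that gives w0 = 2 * N1 / (N0 + N1) gives w1 = 2 * N0 / (N0 + N1).

```python
import numpy as np

# Hypothetical counts chosen for illustration only.
N0, N1 = 90, 10
y_ind = np.array([0] * N0 + [1] * N1)

# Same arithmetic as the two quoted lines, with labels already encoded as 0 and 1.
recip_freq = 1.0 / np.bincount(y_ind)
weight = recip_freq / np.mean(recip_freq)

# Closed forms obtained by simplifying the expressions for w0 and w1.
closed_form = np.array([2 * N1 / (N0 + N1), 2 * N0 / (N0 + N1)])

print(weight)
print(np.allclose(weight, closed_form))
```

With these counts the computed weights agree with the closed forms, so the simplification itself is not where the discrepancy comes from.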
These expressions for w0 and w1 do not match my intuition or the expressions given in this paper by King and Zeng.
Intuitively, assuming that I am trying to "balance" my classes using class weights, I would expect w0 and w1 to satisfy w0 * N0 + w1 * N1 = N0 + N1 (the total weight is unchanged) and w0 * N0 = w1 * N1 (both classes carry equal total weight). Together these give w0 * N0 = w1 * N1 = (N0 + N1) / 2, so w0 = (N0 + N1) / (2 * N0) and w1 = (N0 + N1) / (2 * N1).
This intuition is backed up by King and Zeng's paper (pages 144--145). If tau is the fraction of ones in the population (0.5 for a balanced population, which is what we're assuming here) and y is the fraction of ones in the sample (N1 / (N0 + N1)), then w1 = tau / y = 0.5 / (N1 / (N0 + N1)) = (N0 + N1) / (2 * N1), and similarly w0 = (N0 + N1) / (2 * N0).
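To make the comparison concrete, here is a small sketch that computes the weights I would expect, the King and Zeng form, and the weights produced by the quoted lines, side by side. The counts are made up, and I am assuming the weight for the zeros class in the paper is (1 - tau) / (1 - y), the counterpart of the w1 expression above.

```python
import numpy as np

# Hypothetical counts chosen for illustration only.
N0, N1 = 90, 10
counts = np.array([N0, N1], dtype=float)

# Weights expected from the two balancing conditions: (N0 + N1) / (2 * Nk).
expected = counts.sum() / (2 * counts)

# King and Zeng form with tau = 0.5 and y = N1 / (N0 + N1).
# The w0 expression (1 - tau) / (1 - y) is assumed to mirror w1 = tau / y.
tau = 0.5
y = N1 / (N0 + N1)
kz = np.array([(1 - tau) / (1 - y), tau / y])

# Weights produced by the arithmetic in the two quoted lines.
recip_freq = 1.0 / counts
quoted = recip_freq / recip_freq.mean()

print("expected:", expected)
print("King and Zeng:", kz)
print("quoted lines:", quoted)

# Weighted class totals under each scheme.
print("expected totals:", expected * counts)
print("quoted totals:", quoted * counts)
```

In this example the expected weights and the King and Zeng form coincide. Both weight vectors satisfy w0 * N0 = w1 * N1, but only the expected weights also satisfy w0 * N0 + w1 * N1 = N0 + N1, so the difference lies in the normalization.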
Is this what you were intending to compute in lines 50 and 51? If so, then I think there is a bug.