
[MRG] FIX+TST regression in multinomial logstc reg when class_weight is auto #5420

Closed
raghavrv wants to merge 2 commits from the regression_in_logistic branch

Conversation

raghavrv
Member

Fixes #5415

@lesteve @MechCoder Please review!

@raghavrv raghavrv changed the title FIX+TST regression in multinomial logstc reg when class_weight is auto [MRG] FIX+TST regression in multinomial logstc reg when class_weight is auto Oct 16, 2015
@lesteve
Member

lesteve commented Oct 16, 2015

There are still Y_bin (note the capital Y) and y_bin used elsewhere in the file. Do you reckon their usage is intended, and if so, do you know the difference in meaning?

Using two variables that only differ by capitalization does not seem like a very good idea...

@raghavrv
Member Author

It appears that the usage is intended (though I am not sure why the variable needs to be named Y_bin for multinomial alone; maybe just to avoid confusing it with the y_bin computed in the ovr case, I think...).

I think we can safely rename Y_bin --> y_bin without side effects... (but that would be for another PR, I feel?)

@@ -875,3 +875,11 @@ def test_warm_start():
assert_greater(2.0, cum_diff, msg)
else:
assert_greater(cum_diff, 2.0, msg)


@ignore_warnings
Member

Why ignore the warnings? Which warnings are raised in that case?

Member

auto is deprecated in favor of balanced? In that case, please use assert_warns to make it explicit that we expect a DeprecationWarning.
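
For reference, a minimal sketch of the suggested pattern (the estimator and data below are placeholders, not the actual test):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.testing import assert_warns

# Hypothetical example: assert_warns fails unless fitting with the
# deprecated class_weight='auto' raises a DeprecationWarning.
X, y = make_classification(random_state=0)
clf = LogisticRegression(class_weight='auto')
assert_warns(DeprecationWarning, clf.fit, X, y)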

Member Author

Warning that auto will no longer be supported!

Member Author

okay will do so :)

Member

Is it me, or are 'balanced' and 'auto' treated the same? See this line, where the check is against 'balanced' and not 'auto'.

Member Author

The check seems to be against 'auto' only?

    if class_weight == "auto":
        class_weight_ = compute_class_weight(class_weight, mask_classes,
                                             y_bin)
        sample_weight *= class_weight_[le.fit_transform(y_bin)]

Do you mean some other line? Sorry, I don't get what you are saying!

Member

The check seems to be against 'auto' only?

Sorry, I meant against 'auto' and not 'balanced', just as you said.

@raghavrv raghavrv force-pushed the regression_in_logistic branch from be44403 to a17b70b on October 16, 2015 14:48
@raghavrv
Member Author

Or should I do that in this PR itself? (renaming Y_bin --> y_bin)

@ogrisel
Member

ogrisel commented Oct 16, 2015

The rule is Y_bin for variables with a 2D shape and y_bin for 1D arrays.

@ogrisel
Member

ogrisel commented Oct 16, 2015

Actually it's not a rule but a convention in the scikit-learn code base. Although it's not always respected, it's better to follow it for consistency.
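
For illustration, a minimal sketch of that convention (the variables below are hypothetical, not the ones in logistic.py):

import numpy as np
from sklearn.preprocessing import LabelBinarizer

y = np.array([0, 1, 2, 1])                 # 1D array of labels: lowercase y
Y_bin = LabelBinarizer().fit_transform(y)  # 2D indicator matrix: capital Y_bin
y_bin = (y == 1).astype(np.float64)        # 1D binarized target: lowercase y_bin
print(Y_bin.shape, y_bin.shape)            # (4, 3) (4,)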

@raghavrv
Member Author

Ah thanks for clarifying!! :)

@amueller
Member

Why is there a deprecation here?

@amueller
Member

Or rather: does this work correctly with "balanced"?

@amueller
Member

I guess merge this and open a new issue?

@lesteve
Member

lesteve commented Oct 16, 2015

Or rather: does this work correctly with "balanced"?

I don't think 'auto' and 'balanced' give the same result:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_classes=5, n_informative=5, random_state=42)
model_auto = LogisticRegression(multi_class='multinomial', class_weight='auto',
                                solver='lbfgs', random_state=0)
model_auto.fit(X, y)
model_balanced = LogisticRegression(multi_class='multinomial', class_weight='balanced',
                                    solver='lbfgs', random_state=0)
model_balanced.fit(X, y)

print(model_auto.coef_ - model_balanced.coef_)

Output:

[[  2.54746659e-03   7.43233265e-03   2.77261631e-03   1.41348054e-03
    1.21013318e-02  -4.56288617e-03   5.69287167e-04   3.33629140e-03
   -3.88104802e-04  -3.55079940e-03  -8.57260871e-04  -2.45097273e-04
    1.13524688e-02  -1.24548720e-03   6.89260176e-03  -9.50383491e-03
   -2.10713335e-03  -1.24984758e-03   3.73549840e-04  -5.89461179e-03]
 [  2.58017862e-03  -5.26711165e-03  -1.64755693e-03   1.42570316e-03
   -7.51183245e-04  -4.10772362e-03   4.18610654e-03   1.77434558e-03
    6.46642076e-04  -7.12319759e-03   1.72079922e-03  -7.03747457e-03
    1.13293794e-02   2.52818914e-03   2.43273106e-03  -4.66653993e-03
    6.56878824e-03   7.39691852e-03   3.33240485e-03  -5.31231989e-03]
 [ -5.66662259e-03  -1.14696637e-02  -3.77677123e-03   3.57538831e-03
   -2.71981075e-03   5.81756973e-03   2.93235232e-03   2.75617311e-03
    3.39429171e-03   7.01936893e-03  -2.84235294e-03   3.78963983e-04
   -9.90384206e-04  -1.11753851e-03  -5.80121488e-03   1.60822984e-02
    1.75337541e-03   5.06311469e-03  -1.73672061e-04   2.62750563e-03]
 [ -2.39735439e-03   4.92023929e-03   1.52716529e-03  -1.42230878e-03
   -9.35013669e-05   2.29086976e-03  -1.18793225e-03  -6.11619047e-03
   -8.21475297e-04   2.31691010e-03  -1.67619010e-03   6.32054327e-03
   -1.16387129e-02   1.80561897e-03   2.53636983e-03   4.33820296e-04
   -1.14790487e-02  -6.76347623e-03   9.24463117e-04  -1.40200438e-03]
 [  2.93633178e-03   4.38420342e-03   1.12454655e-03  -4.99226324e-03
   -8.53683648e-03   5.62170314e-04  -6.49981377e-03  -1.75061963e-03
   -2.83135369e-03   1.33771796e-03   3.65500470e-03   5.83064593e-04
   -1.00527511e-02  -1.97078240e-03  -6.06048778e-03  -2.34574388e-03
    5.26401842e-03  -4.44670941e-03  -4.45674574e-03   9.98143044e-03]]

@amueller
Member

see #5422

@ogrisel
Member

ogrisel commented Oct 16, 2015

Or rather: does this work correctly with "balanced"?

That's a good question. I would rather check that before merging. And add a test if there is no test for that case.

@amueller
Member

@lesteve they are not supposed to give the same result. But having special treatment for one and not the other seems fishy.

@ogrisel
Member

ogrisel commented Oct 16, 2015

I don't think 'auto' and 'balanced' give the same result.

This is expected. This is why we deprecated 'auto'. The results are not what people would naively expect. It was a misleading sklearn-specific heuristic. 'balanced' implements the industry standard.
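
For reference, 'balanced' weights each class by n_samples / (n_classes * np.bincount(y)); a minimal sketch (using the positional compute_class_weight signature of the scikit-learn version of the time):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0, 0, 0, 1, 1])
print(compute_class_weight('balanced', np.array([0, 1]), y))
# [ 0.83333333  1.25      ]  i.e. 5 / (2 * 3) and 5 / (2 * 2)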

@lesteve
Member

lesteve commented Oct 16, 2015

OK sorry for the noise and thanks for the details.

@amueller
Member

The fix is to do if class_weight in ["auto", "balanced"]: and pass the class_weight parameter to the compute_class_weight function. But why wasn't this caught in the tests? There is a test for class_weight="balanced" for linear models, right? Maybe only in the binary case? Hmm.
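
A minimal, self-contained sketch of that suggestion (the helper name and defaults below are hypothetical; mask_classes, le and y_bin in the real code correspond to classes, the encoder and the target here):

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.class_weight import compute_class_weight

def reweight(sample_weight, y_bin, class_weight, classes=(-1.0, 1.0)):
    # Hypothetical helper mirroring the suggested branch: handle both
    # 'auto' and 'balanced' by forwarding the mode to compute_class_weight
    # (positional signature of the scikit-learn version of the time).
    if class_weight in ("auto", "balanced"):
        le = LabelEncoder()
        class_weight_ = compute_class_weight(class_weight, np.asarray(classes),
                                             y_bin)
        sample_weight = sample_weight * class_weight_[le.fit_transform(y_bin)]
    return sample_weight

print(reweight(np.ones(3), np.array([-1.0, -1.0, 1.0]), "balanced"))
# [ 0.75  0.75  1.5 ]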

@amueller
Member

    for name, Classifier in linear_classifiers:
        if name == "LogisticRegressionCV":
            # Contrary to RidgeClassifierCV, LogisticRegressionCV uses actual
            # CV folds and fit a model for each CV iteration before averaging
            # the coef. Therefore it is expected to not behave exactly as the
            # other linear model.
            continue
        yield check_class_weight_balanced_linear_classifier, name, Classifier

@amueller
Member

Can you remove the if and see if it passes after the fix I suggested above?

@raghavrv
Member Author

I tried and it doesn't pass...!

@amueller
Member

Hm, OK. We should have some form of regression test, though. I'm not sure the comment is correct. @ogrisel added it in 07560e4. We are learning models with exactly the same parameters. Cross-validation or not, that should be deterministic, right?

@amueller
Member

Also I'm surprised by the behavior of LogisticRegressionCV. None of the other CV models does model averaging, right? This should be better documented, I think.

@MechCoder
Member

@amueller

The idea is that in LogisticRegressionCV, when class_weight is set to auto, the class weights are calculated for every fold. With StratifiedKFold, the class frequency is not guaranteed to be the same for every train fold, so computing the class weights a priori and letting LogisticRegressionCV compute them per fold will not produce the same results (see the sketch below).

In RidgeClassifierCV, since it is LOO, we can set the class weights beforehand.

We do not do model averaging; the comment in that sense is slightly wrong (unless refit is set to False). It should be "... model for each CV iteration before the final fit".
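
For illustration, a minimal sketch (toy labels; pre-0.18 cross-validation API) showing that train-fold class frequencies, and hence per-fold computed class weights, can differ between stratified folds:

import numpy as np
from sklearn.cross_validation import StratifiedKFold  # pre-0.18 CV module

y = np.array([0, 0, 0, 0, 0, 1, 1, 1])  # hypothetical imbalanced labels
for train, _ in StratifiedKFold(y, n_folds=3):
    counts = np.bincount(y[train])
    print(counts, counts / float(counts.sum()))  # frequencies vary per fold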

@amueller
Member

Ah, I overlooked the refit parameter. But that negates the argument, right? The final coef_ will have class weights computed with all samples, because refit=True by default.

@MechCoder
Member

Just a second. Trying out something.

@MechCoder
Member

Actually I do get the same coefs :|

@MechCoder
Member

If LogisticRegressionCV is given the class_weight parameter as a dict, the optimal C found is different from the one found with "auto" (especially in the case of very small data, as in the tests). Took me some time to figure that out.

@MechCoder
Member

The fix looks good to me though. (It was working before; I think it somehow broke while adding the class_weight="balanced" option.)

@amueller
Member

Why is the value of C found different if the parameters are the same?

@MechCoder
Member

Because the parameters for every fold are not the same. Minimal code to reproduce:

import numpy as np

from sklearn.cross_validation import check_cv  # pre-0.18 CV module
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

X = np.array([[-1.0, -1.0], [-1.0, 0], [-.8, -1.0],
              [1.0, 1.0], [1.0, 0.0]])
y = np.array([0, 0, 0, 1, 1])
C = [0.0001, 0.35938137]
cv = check_cv(None, X, y, classifier=True)

# Pre-assigned class weights that do not change for every fold:
# train a logreg model on each fold and average the scores.
class_weight = {1: 5.0 / (np.sum(y == 1) * 2),
                0: 5.0 / (np.sum(y == 0) * 2)}
for Cs in C:
    score = 0.0
    for train, test in cv:
        logreg = LogisticRegression(C=Cs, class_weight=class_weight, solver='lbfgs')
        logreg.fit(X[train], y[train])
        score += logreg.score(X[test], y[test])
    print("%f score for %f" % (score / 3.0, Cs))

# Compute the class weights on each fold instead.
for Cs in C:
    score = 0.0
    for train, test in cv:
        labels = [0, 1]
        t = compute_class_weight("balanced", np.array(labels), y[train])
        class_weight = dict(zip(labels, t))
        logreg = LogisticRegression(solver='lbfgs', class_weight=class_weight, C=Cs)
        logreg.fit(X[train], y[train])
        score += logreg.score(X[test], y[test])
    print("%f score for %f" % (score / 3.0, Cs))

Note that the first one chooses a C of 0.359, while the second chooses a C of 0.0001 (np.argmax picks the first C). This leads to differences in the final fit.
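
(A one-line reminder of that tie-breaking behaviour, with hypothetical tied scores:)

import numpy as np

scores = np.array([0.8, 0.8])  # tied mean CV scores for the two candidate Cs
print(np.argmax(scores))       # 0: on ties, the first (smallest) C is picked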

@TomDLT
Member

TomDLT commented Oct 17, 2015

Sorry, I introduced this bug in commit 94eb619#diff-1452d97adf1a0b794930e405aa58d64bR613

It should be solved by #5008.

@ogrisel
Member

ogrisel commented Oct 17, 2015

This should be better documented, I think.

I agree.

@amueller
Member

I have to look more closely at #5008; not sure I have time for that today. For the failing test, we could just search over a single C (or maybe better, two Cs very far apart).

@TomDLT
Member

TomDLT commented Oct 19, 2015

The idea is that in LogisticRegressionCV, when class_weight is set to auto, the class weights are calculated for every fold. With StratifiedKFold, the class frequency is not guaranteed to be the same for every train fold, so computing the class weights a priori and letting LogisticRegressionCV compute them per fold will not produce the same results.

I changed this behavior in #5008.

@raghavrv
Member Author

I am closing this in favour of #5008, which fixes a few more issues raised here by @amueller, @MechCoder, and others...

@raghavrv raghavrv closed this Oct 19, 2015
@raghavrv raghavrv deleted the regression_in_logistic branch October 19, 2015 12:29