[MRG+1] fix logistic regression class weights #5008
Conversation
force-pushed from f9fade2 to aec918f
clf2.fit(X, y)
assert_array_almost_equal(clf1.coef_, clf2.coef_, decimal=6)

# Binary case: remove 90% of class 0 and 100% of class 2
Why not also test multiclass OvR with 3 imbalanced classes?
I think I understand: the equivalent `class_weight` dict for 3 classes with OvR is no longer trivial to compute in that case.
Exactly, at least we cannot directly use `compute_class_weight`. I don't know if this is principled or not, but we should enforce consistency throughout the OvR classifiers. @hannawallach @amueller any comment on this issue?
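To see why the equivalent dict is non-trivial, here is a minimal sketch in pure Python (the helper and toy labels are hypothetical; it just applies the documented "balanced" heuristic `n_samples / (n_classes * count)`). The weight a class gets globally differs from the weight it gets after OvR binarization:

```python
from collections import Counter

def balanced_weights(y):
    # "balanced" heuristic: n_samples / (n_classes * count(class))
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Imbalanced 3-class toy labels (hypothetical data)
y = [0] * 10 + [1] * 30 + [2] * 60

global_w = balanced_weights(y)       # class 0 gets 100 / (3 * 10) ~ 3.33

# Binarize for the class-0-vs-rest OvR subproblem
y_ovr = [1 if label == 0 else -1 for label in y]
ovr_w = balanced_weights(y_ovr)      # class 0 now gets 100 / (2 * 10) = 5.0

print(global_w[0], ovr_w[1])
```

Since 3.33 differs from 5.0, no single user-supplied dict reproduces what "balanced" computes per OvR subproblem.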
force-pushed from 50e91b3 to ed5004e
I don't know which way is preferable, but we could change LogisticRegression to behave in the same way, for consistency. That being said, this PR fixes a small bug.
force-pushed from 6d3af2f to 5e7ba8f
# Test that liblinear fails when a class_weight of type dict is
# provided for a multiclass problem. However, it can handle
# binary problems.
msg = ("In LogisticRegressionCV the liblinear solver cannot handle "
Sorry I don't understand this.
It's been more than a year, but the rationale was something like this.
Liblinear does not solve for the true multinomial loss and does OvR instead; sklearn now raises an error if the loss is multinomial and the solver is liblinear.
In an OvR approach, for the other solvers, if the class weights are provided in dict format, they are converted to sample weights and then used. But liblinear does not support sample weights right now (#5274). Once that PR gets merged, we can do away with this.
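The dict-to-sample-weight conversion described above can be sketched as follows (a hypothetical helper, not the actual sklearn internals):

```python
import numpy as np

def class_weight_to_sample_weight(class_weight, y):
    # Expand per-class weights into one weight per sample; solvers that
    # accept sample_weight can then consume a class_weight dict this way.
    return np.asarray([class_weight[label] for label in y], dtype=float)

y = np.array([0, 0, 1, 2, 2, 2])
sw = class_weight_to_sample_weight({0: 2.0, 1: 1.0, 2: 0.5}, y)
print(sw)  # [2.  2.  1.  0.5 0.5 0.5]
```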
But it does support `class_weight`, right? It's odd to support `balanced` but not dicts.
Here is the problem:
- `_fit_liblinear(loss='logistic_regression')` does not handle `sample_weight`, but handles `class_weight` directly, and only for binary problems (!).
- The SAG, LBFGS and Newton-CG solvers handle `class_weight` by putting the weights into `sample_weight`.

That is why we have a different handling of `class_weight` for liblinear. Then, there are 3 cases with `solver='liblinear'`:
- multinomial case: it raises an error in `_check_solver_option`.
- binary case: the `class_weight` dictionary is reconstructed with keys -1 and 1, probably in order to match the API of `_fit_liblinear`.
- OvR with `n_classes > 2`: it raises an error (@MechCoder). The reason is that the other solvers use the class weight for each sample, but this is not possible in `_fit_liblinear` since it does not handle `sample_weight` yet. With the 'balanced' case, the class weights are computed after the one-vs-rest binarization, which can be used in the same way by all solvers, liblinear included.

Conclusion: we probably just need to merge #5274 to be consistent across all solvers in `LogisticRegression`.
The inconsistency for `class_weight='balanced'` in the OvR case (inconsistency between SGDClassifier, LinearSVC and LogisticRegression) is another topic (see #5008 (comment)).
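The binary remapping in the second case can be sketched like this (hypothetical helper and labels; the keys are re-encoded onto the -1/1 labels that `_fit_liblinear` expects):

```python
def remap_binary_class_weight(class_weight, pos_class, neg_class):
    # Re-key a user-supplied dict onto the internal {-1, 1} encoding
    return {1: class_weight[pos_class], -1: class_weight[neg_class]}

cw = {"spam": 3.0, "ham": 1.0}
print(remap_binary_class_weight(cw, pos_class="spam", neg_class="ham"))
# {1: 3.0, -1: 1.0}
```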
I think #5274 is ready, but it only adds sample weights in `LogisticRegression` and not in the other classes that use liblinear.
Sure
I was about to summarize it, but thanks @TomDLT for doing it; that was the issue. Once #5274 is merged, we can get rid of this check.
Whether or not this is the best way to do things is of course a different issue.
force-pushed from 2fad9b6 to 7f0c91b
I added the test from #5420, which tests the bug detailed in #5415.

This PR tests that these two classifiers are equivalent:

```python
class_weights = compute_class_weight("balanced", np.unique(y), y)
clf1 = LogisticRegression(multi_class="multinomial", class_weight="balanced")
clf2 = LogisticRegression(multi_class="multinomial", class_weight=class_weights)
```

[EDIT] This PR also tests that these two classifiers are equivalent (fix #5450):

```python
class_weights = compute_class_weight("balanced", np.unique(y), y)
clf1 = LogisticRegressionCV(class_weight="balanced")
clf2 = LogisticRegressionCV(class_weight=class_weights)
```

That is why we can remove this line.
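The equivalence being tested can be checked end to end with a sketch like this (illustrative data, not from the PR; note that in current scikit-learn `compute_class_weight` takes keyword-only arguments and `class_weight` expects a dict, so the returned array is converted first):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced binary toy problem (illustrative)
X, y = make_classification(n_samples=200, weights=[0.8], random_state=0)

classes = np.unique(y)
w = compute_class_weight("balanced", classes=classes, y=y)  # ndarray
cw_dict = dict(zip(classes, w))  # class_weight expects a dict

clf1 = LogisticRegression(class_weight="balanced").fit(X, y)
clf2 = LogisticRegression(class_weight=cw_dict).fit(X, y)

# Both parameterizations should produce (numerically) the same model
np.testing.assert_allclose(clf1.coef_, clf2.coef_, rtol=1e-6)
```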
force-pushed from e543ed2 to 4768bf3
force-pushed from 8dd7128 to 3a649ce
@@ -623,20 +623,20 @@ def logistic_regression_path(X, y, pos_class=None, Cs=10, fit_intercept=True,
     mask = (y == pos_class)
     y_bin = np.ones(y.shape, dtype=np.float64)
     y_bin[~mask] = -1.
     # for compute_class_weight
     if class_weight in ("auto", "balanced"):
Is there any internal comment convention for things that are deprecated? `# TODO:` looks horrible, but some kind of tag could be nice.
A comment with the version is most helpful. `# TODO` is fine. I grep for the version, so `0.19` in this case, I guess.
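The grep workflow mentioned here might look like this (the directory and file are made up for the demo):

```shell
# Create a throwaway file carrying a version-tagged deprecation TODO
mkdir -p /tmp/todo_demo
printf '# TODO: remove in 0.19\nx = 1\n' > /tmp/todo_demo/example.py

# Before the 0.19 release, grep for everything scheduled for removal
grep -rn "TODO.*0.19" /tmp/todo_demo
```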
I'll add a comment
Besides my comment/question this seems pretty solid. +1
Can you remove the line in common tests that excludes
Are you talking about this line? It has been removed.
Yes. Sorry, I overlooked that.
@@ -610,7 +610,7 @@ def logistic_regression_path(X, y, pos_class=None, Cs=10, fit_intercept=True,
              "solver cannot handle multiclass with "
              "class_weight of type dict. Use the lbfgs, "
              "newton-cg or sag solvers or set "
-             "class_weight='auto'")
+             "class_weight='balanced'")
Sorry if I'm being stupid, but I still don't understand why this is. Why can liblinear do "balanced" but not a dict? Maybe this is outside the scope of this fix, but I feel the logic here is hard to follow. Some of the class weight handling is done here, and other parts are handled further down in the `if multi_class == 'ovr'` branch.
This is not clear indeed, I am trying to understand it
I think this is related to the different handling of `class_weight` with `multi_class='ovr'`.
It's kind of unrelated to what you are fixing, I'm not sure if we should address it in this PR or not. It just seems odd.
Indeed, this is not related to this PR
force-pushed from 3a649ce to d439dc4
ok LGTM
+1 for merge
backporting
@TomDLT can you please add a whatsnew for 0.17?
I also:
- `"auto"` (deprecated) into `"balanced"`
- `y_bin` and `Y_bin`