Skip to content

[MRG+1] fix logistic regression class weights #5008

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 22, 2015

Conversation

TomDLT
Copy link
Member

@TomDLT TomDLT commented Jul 20, 2015

-This PR tests that these two classifiers are equivalent:

class_weights = compute_class_weight("balanced", np.unique(y), y)
clf1 = LogisticRegression(multi_class="multinomial", class_weight="balanced")
clf2 = LogisticRegression(multi_class="multinomial", class_weight=class_weights)

-[EDIT]This PR also tests that these two classifiers are equivalent: (fix #5450)

class_weights = compute_class_weight("balanced", np.unique(y), y)
clf1 = LogisticRegressionCV(class_weight="balanced")
clf2 = LogisticRegressionCV(class_weight=class_weights)

That is why we can remove this line

I also :

@TomDLT TomDLT force-pushed the logistic_multiclass branch 2 times, most recently from f9fade2 to aec918f Compare July 20, 2015 14:41
clf2.fit(X, y)
assert_array_almost_equal(clf1.coef_, clf2.coef_, decimal=6)

# Binary case: remove 90% of class 0 and 100% of class 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not also test multiclass OvR with 3 imbalanced classes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand the equivalent class_weight_dict for 3 classes with OvR is no longer trivial to compute in that case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly, at least we cannot used directly compute_class_weight

@ogrisel
Copy link
Member

ogrisel commented Jul 21, 2015

The way to deal with class_weight='balanced' and multiclass='ovr' here is not consistent with other estimators such as SGDClassifier for instance.

For SGDClassifier the expanded class_weight is computed on the total classification problem by calling compute_class_weight once with the full multiclass y and then for each binary subproblem i is fit with expanded_class_weight[i] for the positive class and 1. for the negative class (that is the "rest").

I don't know if this is principled or not but we should enforce consistency throughout the OvR classifiers.

@hannawallach @amueller any comment on this issue?

@TomDLT TomDLT force-pushed the logistic_multiclass branch 2 times, most recently from 50e91b3 to ed5004e Compare September 9, 2015 16:57
@amueller amueller added the Bug label Sep 9, 2015
@TomDLT
Copy link
Member Author

TomDLT commented Sep 10, 2015

For SGDClassifier the expanded class_weight is computed on the total classification problem by calling compute_class_weight once with the full multiclass y and then for each binary subproblem i is fit with expanded_class_weight[i] for the positive class and 1. for the negative class (that is the "rest").

I don't know which way is preferable, but compute_class_weight is used only in SGD, LogisticRegression, and some SVC classes.
I didn't get into every details of all SVC classes, but LinearSVC seems to behave as SGD, as you can see in this line.

We could change LogisticRegression to behave in the same way, for consistency.
However, I don't know how to deprecate the change in a clean way.


This remark being said, this PR solves a small bug about class_weight="balanced" in the multinomial case and is not correlated to this consistency issue.

# Test the liblinear fails when class_weight of type dict is
# provided, when it is multiclass. However it can handle
# binary problems.
msg = ("In LogisticRegressionCV the liblinear solver cannot handle "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I don't understand this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been more than a year but the rationale was something like this.

Liblinear does not solve for the true multinomial loss and does a OvR, it now raises an error in sklearn if the loss is multinomial and the solver is liblinear.

In a OvR approach, for the other solvers if the class_weights are provided in a dict format, the class_weight are converted to sample_weights and then used. But liblinear does not support providing sample_weights, right now (#5274). However, in the future after the PR gets merged, we can do away with this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it does support class_weight, right? It's odd to support balanced but not dicts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is the problem :

  • _fit_liblinear(loss='logistic_regression') does not handle sample_weight, but handles class_weight directly, and only with binary problems (!) .
  • SAG, LBFGS and Newton-cg solvers handles class_weight by putting the weights into the sample_weight.

That is why we have a different handling of class_weight for liblinear.


Then, there are 3 cases, with solver='liblinear':

  1. multinomial case, but it raise an error in _check_solver_option.
  2. binary case, then the class_weight dictionnary is reconstructed with keys -1 and 1, probably in order to match the API of _fit_liblinear
  3. OvR with n_classes > 2 case, and then it raises an error (@MechCoder). The reason is that the other solvers use the class weight for each sample, but it is not possible in fit_liblinear since it does not handles sample_weight yet. With the 'balanced' case, the class weights are computed after the binarization one-vs-rest, which can be used in the same way by all solvers, liblinear included.

Conclusion: We probably just need to merge #5274 to be consistent with all solvers in LogisticRegression

The inconsistency for class_weight='balanced' in the OvR case (inconsistency between SGDClassifier, LinearSVC and LogisticRegression) is another topic (see #5008 (comment))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think #5274 is ready, but it only adds samples weights in LogisticRegression and not in other classes that use liblinear.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to summarize it, but thanks @TomDLT for doing it.. that was the issue. Once 5724 is merged, we can get rid of this check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether or not is the best way to do things is of course a different issue.

@TomDLT TomDLT force-pushed the logistic_multiclass branch from 2fad9b6 to 7f0c91b Compare October 19, 2015 08:51
@TomDLT
Copy link
Member Author

TomDLT commented Oct 19, 2015

I added the test from #5420, which tests the bug detailed in #5415.
I also changed the variable names to distinguish y_bin and Y_bin


-This PR tests that these two classifiers are equivalent:

class_weights = compute_class_weight("balanced", np.unique(y), y)
clf1 = LogisticRegression(multi_class="multinomial", class_weight="balanced")
clf2 = LogisticRegression(multi_class="multinomial", class_weight=class_weights)

-[EDIT]This PR also tests that these two classifiers are equivalent: (fix #5450)

class_weights = compute_class_weight("balanced", np.unique(y), y)
clf1 = LogisticRegressionCV(class_weight="balanced")
clf2 = LogisticRegressionCV(class_weight=class_weights)

That is why we can remove this line

@TomDLT TomDLT force-pushed the logistic_multiclass branch 2 times, most recently from e543ed2 to 4768bf3 Compare October 19, 2015 12:20
@TomDLT TomDLT changed the title [MRG] fix multinomial logistic regression class weights [MRG] fix logistic regression class weights Oct 19, 2015
@TomDLT TomDLT force-pushed the logistic_multiclass branch 3 times, most recently from 8dd7128 to 3a649ce Compare October 20, 2015 08:54
@@ -623,20 +623,20 @@ def logistic_regression_path(X, y, pos_class=None, Cs=10, fit_intercept=True,
mask = (y == pos_class)
y_bin = np.ones(y.shape, dtype=np.float64)
y_bin[~mask] = -1.
# for compute_class_weight

if class_weight in ("auto", "balanced"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any internal comment convention for things that are deprecated? # TODO: looks horrible but some kind of tag could be nice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment with the version is most helpful. #TODO is fine. I grep for the version so 0.19 in this case I guess

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a comment

@kastnerkyle
Copy link
Member

Besides my comment/question this seems pretty solid +1

@amueller
Copy link
Member

Can you remove the line in common tests that excludes LogisticRegressionCV from the class_weight tests? That should work now, right?

@TomDLT
Copy link
Member Author

TomDLT commented Oct 21, 2015

Can you remove the line in common tests that excludes LogisticRegressionCV from the class_weight tests? That should work now, right?

Are you talking about this line? It has been removed.

@amueller
Copy link
Member

Yes. Sorry I overlooked that.

@@ -610,7 +610,7 @@ def logistic_regression_path(X, y, pos_class=None, Cs=10, fit_intercept=True,
"solver cannot handle multiclass with "
"class_weight of type dict. Use the lbfgs, "
"newton-cg or sag solvers or set "
"class_weight='auto'")
"class_weight='balanced'")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I'm being stupid, but I still don't understand why this is. Why can liblinear do "balanced" but not a dict? Maybe this is outside the scope of this fix, but I feel the logic here is hard to follow. Some of the class weight is handled here, and other parts are handled further down in the if mulitclass == 'ovr'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not clear indeed, I am trying to understand it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is related to the different handling of class_weight with multi_class='ovr'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of unrelated to what you are fixing, I'm not sure if we should address it in this PR or not. It just seems odd.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cf #5008 (comment)

Indeed, this is not related to this PR

@TomDLT TomDLT closed this Oct 21, 2015
@TomDLT TomDLT reopened this Oct 21, 2015
@TomDLT TomDLT force-pushed the logistic_multiclass branch from 3a649ce to d439dc4 Compare October 21, 2015 15:42
@amueller
Copy link
Member

ok LGTM

@agramfort
Copy link
Member

+1 for merge

@agramfort agramfort changed the title [MRG] fix logistic regression class weights [MRG+1] fix logistic regression class weights Oct 22, 2015
amueller added a commit that referenced this pull request Oct 22, 2015
[MRG+1] fix logistic regression class weights
@amueller amueller merged commit 5d3bb93 into scikit-learn:master Oct 22, 2015
@amueller
Copy link
Member

backporting

@amueller
Copy link
Member

@TomDLT can you please add a whatsnew for 0.17?

glouppe pushed a commit to glouppe/scikit-learn that referenced this pull request Oct 22, 2015
[MRG+1] fix logistic regression class weights
TomDLT added a commit that referenced this pull request Oct 22, 2015
@ogrisel
Copy link
Member

ogrisel commented Oct 30, 2015

Backported by @amueller to 0.17.X as 289c0a3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LogisticRegressionCV handles class weights differently
6 participants