[MRG+1] fix logistic regression class weights #5008
Conversation
force-pushed from f9fade2 to aec918f
clf2.fit(X, y)
assert_array_almost_equal(clf1.coef_, clf2.coef_, decimal=6)

# Binary case: remove 90% of class 0 and 100% of class 2
Why not also test multiclass OvR with 3 imbalanced classes?
I think I understand: the equivalent `class_weight` dict for 3 classes with OvR is no longer trivial to compute in that case.
Exactly, at least we cannot directly use `compute_class_weight`. I don't know if this is principled or not, but we should enforce consistency throughout the OvR classifiers. @hannawallach @amueller any comment on this issue?
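To see why the equivalent dict is non-trivial, here is a minimal sketch in pure Python (the helper and toy labels are hypothetical; it just applies the documented "balanced" heuristic `n_samples / (n_classes * count)`). The weight a class gets globally differs from the weight it gets after OvR binarization:

```python
from collections import Counter

def balanced_weights(y):
    # "balanced" heuristic: n_samples / (n_classes * count(class))
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Imbalanced 3-class toy labels (hypothetical data)
y = [0] * 10 + [1] * 30 + [2] * 60

global_w = balanced_weights(y)       # class 0 gets 100 / (3 * 10) ~ 3.33

# Binarize for the class-0-vs-rest OvR subproblem
y_ovr = [1 if label == 0 else -1 for label in y]
ovr_w = balanced_weights(y_ovr)      # class 0 now gets 100 / (2 * 10) = 5.0

print(global_w[0], ovr_w[1])
```

Since 3.33 differs from 5.0, no single user-supplied dict reproduces what "balanced" computes per OvR subproblem.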
force-pushed from 50e91b3 to ed5004e
I don't know which way is preferable, but we could change LogisticRegression to behave in the same way, for consistency. That being said, this PR fixes a small bug.
force-pushed from 6d3af2f to 5e7ba8f
# Test that liblinear fails when a class_weight of type dict is
# provided for a multiclass problem. However, it can handle
# binary problems.
msg = ("In LogisticRegressionCV the liblinear solver cannot handle "
Sorry I don't understand this.
It's been more than a year, but the rationale was something like this.
Liblinear does not solve for the true multinomial loss and does OvR instead; sklearn now raises an error if the loss is multinomial and the solver is liblinear.
In an OvR approach, for the other solvers, if the class weights are provided in dict format, they are converted to sample weights and then used. But liblinear does not support sample weights right now (#5274). Once that PR gets merged, we can do away with this.
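The dict-to-sample-weight conversion described above can be sketched as follows (a hypothetical helper, not the actual sklearn internals):

```python
import numpy as np

def class_weight_to_sample_weight(class_weight, y):
    # Expand per-class weights into one weight per sample; solvers that
    # accept sample_weight can then consume a class_weight dict this way.
    return np.asarray([class_weight[label] for label in y], dtype=float)

y = np.array([0, 0, 1, 2, 2, 2])
sw = class_weight_to_sample_weight({0: 2.0, 1: 1.0, 2: 0.5}, y)
print(sw)  # [2.  2.  1.  0.5 0.5 0.5]
```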
But it does support `class_weight`, right? It's odd to support `balanced` but not dicts.
Here is the problem:
- `_fit_liblinear(loss='logistic_regression')` does not handle `sample_weight`, but handles `class_weight` directly, and only for binary problems (!).
- The SAG, LBFGS and Newton-CG solvers handle `class_weight` by putting the weights into `sample_weight`.

That is why we have a different handling of `class_weight` for liblinear. Then, there are 3 cases with `solver='liblinear'`:
- multinomial case: it raises an error in `_check_solver_option`.
- binary case: the `class_weight` dictionary is reconstructed with keys -1 and 1, probably in order to match the API of `_fit_liblinear`.
- OvR with `n_classes > 2`: it raises an error (@MechCoder). The reason is that the other solvers use the class weight for each sample, but this is not possible in `_fit_liblinear` since it does not handle `sample_weight` yet. With the 'balanced' case, the class weights are computed after the one-vs-rest binarization, which can be used in the same way by all solvers, liblinear included.

Conclusion: we probably just need to merge #5274 to be consistent across all solvers in `LogisticRegression`.
The inconsistency for `class_weight='balanced'` in the OvR case (inconsistency between SGDClassifier, LinearSVC and LogisticRegression) is another topic (see #5008 (comment)).
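The binary remapping in the second case can be sketched like this (hypothetical helper and labels; the keys are re-encoded onto the -1/1 labels that `_fit_liblinear` expects):

```python
def remap_binary_class_weight(class_weight, pos_class, neg_class):
    # Re-key a user-supplied dict onto the internal {-1, 1} encoding
    return {1: class_weight[pos_class], -1: class_weight[neg_class]}

cw = {"spam": 3.0, "ham": 1.0}
print(remap_binary_class_weight(cw, pos_class="spam", neg_class="ham"))
# {1: 3.0, -1: 1.0}
```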
I think #5274 is ready, but it only adds sample weights in `LogisticRegression` and not in the other classes that use liblinear.
Sure
I was about to summarize it, but thanks @TomDLT for doing it; that was the issue. Once #5274 is merged, we can get rid of this check.
Whether or not this is the best way to do things is of course a different issue.
force-pushed from 2fad9b6 to 7f0c91b
I added the test from #5420, which tests the bug detailed in #5415.

This PR tests that these two classifiers are equivalent:

```python
class_weights = compute_class_weight("balanced", np.unique(y), y)
clf1 = LogisticRegression(multi_class="multinomial", class_weight="balanced")
clf2 = LogisticRegression(multi_class="multinomial", class_weight=class_weights)
```

[EDIT] This PR also tests that these two classifiers are equivalent (fix #5450):

```python
class_weights = compute_class_weight("balanced", np.unique(y), y)
clf1 = LogisticRegressionCV(class_weight="balanced")
clf2 = LogisticRegressionCV(class_weight=class_weights)
```

That is why we can remove this line.
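The equivalence being tested can be checked end to end with a sketch like this (illustrative data, not from the PR; note that in current scikit-learn `compute_class_weight` takes keyword-only arguments and `class_weight` expects a dict, so the returned array is converted first):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced binary toy problem (illustrative)
X, y = make_classification(n_samples=200, weights=[0.8], random_state=0)

classes = np.unique(y)
w = compute_class_weight("balanced", classes=classes, y=y)  # ndarray
cw_dict = dict(zip(classes, w))  # class_weight expects a dict

clf1 = LogisticRegression(class_weight="balanced").fit(X, y)
clf2 = LogisticRegression(class_weight=cw_dict).fit(X, y)

# Both parameterizations should produce (numerically) the same model
np.testing.assert_allclose(clf1.coef_, clf2.coef_, rtol=1e-6)
```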
force-pushed from e543ed2 to 4768bf3
force-pushed from 8dd7128 to 3a649ce
@@ -623,20 +623,20 @@ def logistic_regression_path(X, y, pos_class=None, Cs=10, fit_intercept=True,
     mask = (y == pos_class)
     y_bin = np.ones(y.shape, dtype=np.float64)
     y_bin[~mask] = -1.
     # for compute_class_weight
     if class_weight in ("auto", "balanced"):
Is there any internal comment convention for things that are deprecated? `# TODO:` looks horrible, but some kind of tag could be nice.
A comment with the version is most helpful. `# TODO` is fine. I grep for the version, so `0.19` in this case, I guess.
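The grep workflow mentioned here might look like this (the directory and file are made up for the demo):

```shell
# Create a throwaway file carrying a version-tagged deprecation TODO
mkdir -p /tmp/todo_demo
printf '# TODO: remove in 0.19\nx = 1\n' > /tmp/todo_demo/example.py

# Before the 0.19 release, grep for everything scheduled for removal
grep -rn "TODO.*0.19" /tmp/todo_demo
```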
I'll add a comment
Besides my comment/question this seems pretty solid. +1
Can you remove the line in common tests that excludes
Are you talking about this line? It has been removed.
Yes. Sorry, I overlooked that.
@@ -610,7 +610,7 @@ def logistic_regression_path(X, y, pos_class=None, Cs=10, fit_intercept=True,
              "solver cannot handle multiclass with "
              "class_weight of type dict. Use the lbfgs, "
              "newton-cg or sag solvers or set "
-             "class_weight='auto'")
+             "class_weight='balanced'")
Sorry if I'm being stupid, but I still don't understand why this is. Why can liblinear do "balanced" but not a dict? Maybe this is outside the scope of this fix, but I feel the logic here is hard to follow. Some of the class weight handling is done here, and other parts are handled further down in the `if multi_class == 'ovr'` branch.
This is not clear indeed, I am trying to understand it
I think this is related to the different handling of `class_weight` with `multi_class='ovr'`.
It's kind of unrelated to what you are fixing, I'm not sure if we should address it in this PR or not. It just seems odd.
Indeed, this is not related to this PR
force-pushed from 3a649ce to d439dc4
ok LGTM
+1 for merge
backporting
@TomDLT can you please add a whatsnew for 0.17?
I also:
- `"auto"` (deprecated) into `"balanced"`
- `y_bin` and `Y_bin`