[MRG+1] Patch liblinear for sample_weights in LogisticRegression(and CV) #5274
Conversation
(force-pushed 0aff304 → a700d3c → 2ef06ac)
ping @vstolbunov. Also @fabianp, could you have a look, since you know this part of the code very well?
(force-pushed 2ef06ac → 268b1bc)
Tests pass now.
I took a look last night and they had passed, so I wasn't sure what the problem was.
I think I forgot to build it locally, and hence I got some segmentation faults. But now it's all right.
@TomDLT you might want to review this.
# Test the above for l1 penalty and l2 penalty with dual=True,
# since the patched liblinear code is different.
clf_cw = LogisticRegression(
    solver="liblinear", fit_intercept=False, class_weight={0:1, 1:2},
{0: 1, 1: 2}
You should also compare the results with liblinear and other solvers, like in this test. And testing it reveals that liblinear does not handle …
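The kind of cross-solver check suggested here could be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual test code; the dataset, weights, and tolerance are all assumptions.

```python
# Hypothetical sketch of the suggested check: fit LogisticRegression with
# the same sample_weight under liblinear and lbfgs, and verify that the
# coefficients agree (both minimize the same L2-penalized objective).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
sw = np.random.RandomState(0).uniform(0.5, 2.0, size=y.shape[0])

coefs = {}
for solver in ("liblinear", "lbfgs"):
    clf = LogisticRegression(solver=solver, fit_intercept=False, C=1.0)
    clf.fit(X, y, sample_weight=sw)
    coefs[solver] = clf.coef_

# The two solvers should reach (nearly) the same minimizer.
np.testing.assert_allclose(coefs["liblinear"], coefs["lbfgs"], atol=5e-2)
```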
clf_sw_none.fit(X, y)
clf_sw_ones = LR(solver=solver, fit_intercept=False)
clf_sw_ones.fit(X, y, sample_weight=np.ones(y.shape[0]))
assert_array_almost_equal(clf_sw_none.coef_, clf_sw_ones.coef_, decimal=4)
line too long
It looks pretty good to me (though I am not a C++ master).
(force-pushed 6e6ec68 → 867596b)
@TomDLT Can you update the PR to MRG+1 if you are happy? (Btw, I don't know what your definition of a C++ master is, but whatever it is I'm not one either :P)
(force-pushed 867596b → 5134286)
If you are looking for a C++ master, @larsmans is a good candidate :)
@@ -10,7 +10,7 @@
 from ..base import BaseEstimator, ClassifierMixin, ChangedBehaviorWarning
 from ..preprocessing import LabelEncoder
 from ..multiclass import _ovr_decision_function
-from ..utils import check_array, check_random_state, column_or_1d
+from ..utils import check_array, check_consistent_length, check_random_state, column_or_1d
PEP8, put the check_consistent_length import on its own line.
(force-pushed 75758f6 → 84668a7)
You should also be supporting in …
I thought that was for …
I'd thought it was a general patch to liblinear, but maybe…
(force-pushed b11922c → c01faae)
@jnothman So we'll merge this and add support for other solvers later?
(force-pushed c01faae → 9d8d7e4)
Rebased. Would be great if someone could give a final +1.
Played with this a bit and it worked great. Merging.
[MRG+1] Patch liblinear for sample_weights in LogisticRegression(and CV)
The pyx file doesn't compile with Cython 0.21, 0.22, or 0.23. You used 0.20; I'll try that next. I'm pretty scared of the casting that is going on there. This was found by @arthurmensch in #5557.
Installed 0.20, still doesn't compile.
I think I just forgot to push the generated C files. Just a second.
How does the objective function change in the case of sample_weight for Logistic Regression? Can you please provide the mathematical expression? I assume the objective function changes like this: E(\mathbf{w}) = - \sum_{n=1}^{N} \{ s_n t_n \ln y_n + (1 - s_n t_n) \ln(1 - y_n) \}, where s_n is the sample_weight of the nth sample. This modifies equation 4.90 of Christopher Bishop's PRML book. (Clarification: the equation is written in LaTeX; could not post an image.)
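For reference, one common form of the weighted cross-entropy puts the weight outside the whole per-sample loss rather than only on the t_n terms, which matches the "samples repeated in proportion to their weight" interpretation. This is a sketch of that conventional form, not necessarily a statement about liblinear's internal implementation:

```latex
% Weighted cross-entropy: s_n scales the entire per-sample loss,
% equivalent to repeating sample n with multiplicity s_n.
E(\mathbf{w}) = - \sum_{n=1}^{N} s_n \left[ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \right]
```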
@niteshroyal this is not the right place to ask usage questions, see http://scikit-learn.org/dev/faq.html#what-s-the-best-way-to-get-help-on-scikit-learn-usage |
Ordinarily, weighting means solving an objective that is equivalent to having the samples repeated in proportion to their weight.
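That equivalence can be sketched with a minimal example (illustrative data, not from the PR): an integer weight k on a sample should give the same fit as repeating that sample k times.

```python
# Sketch: sample_weight=3 on one sample vs. repeating that sample 3 times.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Weight the last sample 3x.
sw = np.array([1.0, 1.0, 1.0, 3.0])
clf_w = LogisticRegression(solver="liblinear", fit_intercept=False)
clf_w.fit(X, y, sample_weight=sw)

# Same effect by repeating the last sample two extra times.
X_rep = np.vstack([X, X[3:], X[3:]])
y_rep = np.concatenate([y, y[3:], y[3:]])
clf_rep = LogisticRegression(solver="liblinear", fit_intercept=False)
clf_rep.fit(X_rep, y_rep)

# The two objectives are identical, so the fitted coefficients agree.
np.testing.assert_allclose(clf_w.coef_, clf_rep.coef_, atol=1e-3)
```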
@jnothman I have seen that when the class weights are too imbalanced, adding more degrees of freedom to the model won't always result in a lower (accordingly weighted) log_loss. So I suspect that, except for a numerical issue I'm unaware of, the objective function is not exactly the same as the one log_loss is evaluating (equivalent to "samples repeated in proportion to their weight"). Of course, this isn't a generalization problem; the loss was computed over the training set. The parameter …
Ok, I think it was a numerical issue indeed; playing with the …
This adds sample_weight to the liblinear solver for LogisticRegression and LogisticRegressionCV. It had already been added to the other solvers in another PR.
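A minimal usage sketch of what the PR enables (illustrative data): passing sample_weight to fit with solver="liblinear".

```python
# Minimal sketch: sample_weight with the liblinear solver.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
# Up-weight the last (positive) sample.
sw = np.array([1.0, 1.0, 1.0, 5.0])

clf = LogisticRegression(solver="liblinear")
clf.fit(X, y, sample_weight=sw)  # previously, liblinear ignored/rejected this
pred = clf.predict(np.array([[2.0]]))
print(pred)
```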