[MRG+2] ENH : SAGA support for LogisticRegression (+ L1 support), Ridge #8446
Conversation
In the todo, you should probably add a benchmark that would show the
benefit.
|
By the way: I am quite excited about this!
|
Indeed, quite a few benchmarks would be necessary. Added! |
sklearn/linear_model/logistic.py
Outdated
from ..externals import six
from ..metrics import SCORERS
from ..utils.optimize import newton_cg
from ..utils.validation import check_X_y
I guess my editor just put these automatically in alphabetical order, I can revert if necessary
It would be nice. If files start being reformatted by editors, we will have a lot of conflicts.
|
Codecov Report
@@            Coverage Diff             @@
##           master    #8446      +/-   ##
==========================================
+ Coverage   95.47%   95.48%    +<.01%
==========================================
  Files         342      342
  Lines       60907    61007      +100
==========================================
+ Hits        58154    58255      +101
+ Misses       2753     2752        -1
Continue to review full report at Codecov.
|
A quick benchmark:
|
I added some code to do benchmarks, which should be removed before merging. With the conservative auto step size that is used, SAGA performs better in the first epochs and SAG gets better for finer convergence (a behavior already observed in the literature). With more aggressive step sizes, SAGA performs better than SAG. We could use line search as it tends to produce better results; is it used in sgd in scikit-learn? |
We could use line search as it tends to produce better results, is it used somewhere in scikit-learn?
I don't think so. It's used in the LBFGS code, but that's not
scikit-learn.
|
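As a rough illustration of the kind of SAG-vs-SAGA timing comparison discussed above, here is a minimal, hypothetical sketch; the synthetic dataset and settings are made up and this is not the benchmark script added in this PR:

```python
from time import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a real benchmark dataset such as rcv1.
X, y = make_classification(n_samples=20000, n_features=200, random_state=0)

for solver in ('sag', 'saga'):
    # Default l2 penalty; both solvers support it.
    clf = LogisticRegression(solver=solver, C=1.0, max_iter=100, tol=1e-4)
    t0 = time()
    clf.fit(X, y)
    print("%s: %.2fs, train accuracy %.4f"
          % (solver, time() - t0, clf.score(X, y)))
```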
@arthurmensch can you show a benchmark with |
@agramfort can you recall a dataset on which SAG would beat liblinear with l2 penalty? |
rcv1 or covertype?
|
I did some benchmarks against lightning for the moment, on rcv1. For some reason we are 10x faster than lightning with L1 penalty, any thoughts @fabianp?
l2 logistic: (benchmark plot)
l1 logistic: (benchmark plot)
Credit for the shortcut in the composition of prox operators goes to @fabianp and @mblondel, but I think it looks a bit cleaner that way. I will post benchmarks against liblinear ASAP. |
This is looking great. And the fun part is that it seems that reimplementing code and comparing teaches us a lot, even when the two implementations are done by people that share so much. |
For single-class l1/l2 logistic regression liblinear is hard to beat. But for multiclass, speed improvements look great: (benchmark plot)
There is a discrepancy in the final accuracy as both models are not the same (the loss is different, and liblinear penalizes the intercept), but I reckon they perform the same with correct cross validation.
Two-class problem (there is a problem there as the final training score should not be that different): (benchmark plot)
One of the advantages of using SAGA instead of liblinear is that we can perform CV with memory-mapped data, which can be crucial when datasets are huge. |
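To make the memory-mapped cross-validation point concrete, here is a small hypothetical sketch; the file name, shapes and hyperparameters are made up for illustration:

```python
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical on-disk design matrix accessed through a memmap.
n_samples, n_features = 100000, 50
X = np.memmap('X.dat', dtype='float64', mode='w+',
              shape=(n_samples, n_features))
X[:] = np.random.randn(n_samples, n_features)
y = (X[:, 0] > 0).astype(int)

# The saga solver accepts the memmapped array like any ndarray, so
# cross-validation can run without first materializing everything in RAM.
clf = LogisticRegression(solver='saga', penalty='l1', C=1.0, tol=1e-3)
print(cross_val_score(clf, X, y, cv=3))
```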
It could be interesting to see if both implementations return more or less the same weight vectors. I remember it took @fabianp and @zermelozf several iterations to implement the lazy updates correctly. We have a naive pure Python implementation in our tests to ensure correctness: https://github.com/scikit-learn-contrib/lightning/blob/master/lightning/impl/tests/test_sag.py#L85 |
These are the right benchmarks: liblinear vs lightning vs sklearn saga, on 1/20 of rcv1. (benchmark plots)
I am currently running the benchmarks on the whole dataset, to see whether saga gets better than liblinear at some point. The benchmark file that is in this PR does not use callbacks or any hacks, so I think it can stay in the repo for future reference. |
I think that liblinear beats saga in the l1 case because coordinate descent methods are very efficient when the coefficients are sparse, if there is a good heuristic to select the coefficient.
|
Should I try to add a solver for Lasso? For the moment |
Nice work! The code looks good so far.
I still need to understand the JIT prox update.
Use minibatches ? I cannot recall whether this is actually interesting.
Mark Schmidt says it's interesting in SAG (slide 73), yet I am not sure we need it in scikit-learn.
sklearn/linear_model/logistic.py
Outdated
if penalty == 'l1':
    if solver == 'sag':
        raise ValueError("Unsupported penalty. Use `saga` instead.")
This is never reached, thanks to _check_solver_option, isn't it?
|
def get_auto_step_size(max_squared_sum, alpha_scaled, loss, fit_intercept):
def get_auto_step_size(max_squared_sum, alpha_scaled, loss, fit_intercept,
                       n_samples=None,
update docstring
Done
sklearn/linear_model/sag_fast.pyx
Outdated
cdef double lagged_update(double* weights, double wscale, int xnnz, |
update the docstring description
else:
    for class_ind in range(n_classes):
        idx = f_idx + class_ind
        if fabs(sum_gradient[idx] * cum_sum) < cum_sum_prox:
can you explain this? and maybe add some comments
This is a nice trick from lightning: instead of unrolling the whole delayed prox(prox(prox(w - grad_update) - grad_update) ...) in the loop below, we factorize it, as we do not cross the non-linearity of the prox operator. There is no academic reference for this, but we should indeed add some comments.
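To make this concrete, here is a small illustrative sketch in plain NumPy (not the Cython code): as long as a coefficient never crosses zero, k missed (gradient step + l1 prox) updates collapse into a single update driven by the accumulated step sizes, which is what the cum_sum / cum_sum_prox bookkeeping exploits. All constants below are made up.

```python
import numpy as np

def soft_threshold(x, t):
    # prox of t * |.|
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def naive_catchup(w, g, eta, beta, k):
    # Replay the k missed (gradient step + prox) updates one by one.
    for _ in range(k):
        w = soft_threshold(w - eta * g, eta * beta)
    return w

def collapsed_catchup(w, g, eta, beta, k):
    # Valid while the coefficient keeps its sign: each prox then reduces
    # to subtracting eta * beta * sign(w), so the updates accumulate.
    cum_sum = k * eta              # accumulated gradient step sizes
    cum_sum_prox = k * eta * beta  # accumulated l1 shrinkage
    return w - cum_sum * g - cum_sum_prox * np.sign(w)

w0, g, eta, beta, k = 5.0, 0.01, 0.1, 0.5, 10
print(naive_catchup(w0, g, eta, beta, k))      # 4.49
print(collapsed_catchup(w0, g, eta, beta, k))  # 4.49, identical
```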
It would be nice if there was a small blog post with the derivations for this on the web. @fabianp sent me an unfinished draft for this but I gathered it was to stay a draft.
If you do not have time I can write a small blog post with the mathematical derivation for this part, with due reference to @fabianp @zermelozf @mblondel. Then we can reference it in a comment.
That would be awesome. I'm sending you the .tex sources of that draft so you can use it as you please.
Even without the full blog post with the derivation it would be great to have an inline comment to state:
- what you said in https://github.com/scikit-learn/scikit-learn/pull/8446/files#r103928318
- give credits to the lightning implementation
sklearn/linear_model/sag_fast.pyx
Outdated
if reset:
    cumulative_sums[sample_itr - 1] = 0.0
    if prox:
        cumulative_sums_prox[sample_itr - 1] = 0.0

# reset wscale to 1.0
You should return void and return 1.0 only in scale_weights
sklearn/linear_model/sag_fast.pyx
Outdated
cdef np.ndarray[double, ndim=1] cumulative_sums_prox_array
cdef double* cumulative_sums_prox

cdef bint prox = beta > 0
it could be safe to check that saga is True
sklearn/linear_model/sag.py
Outdated
else:
    raise ValueError("Unknown loss function for SAG solver, got %s "
                     "instead of 'log' or 'squared'" % loss)
if is_saga:
    mun = min(2 * n_samples * alpha_scaled, L)
where does it come from?
SAGA original paper: the proofs of convergence require step_size < 1 / (3 L) or step_size < 1 / (2 (L + mu n)), where mu is the strong convexity modulus of the objective. We could use 1 / (3 L), but this is more optimal in the low-dimensional regime. I will add a reference. By the way, SAG uses 1 / L whereas the only step size for which proofs are available is 1 / (16 L). I think this is a sound heuristic but we should add a reference as well. 1 / L is also a good heuristic for SAGA in most cases, but it actually makes one test fail in the test suite :P We could allow the user to specify the step size, but it would complicate the API.
Ok
For SAG, I used the recommendation of Mark Schmidt: (slide 65)
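A rough Python sketch of the step-size heuristic described in this exchange; L is the Lipschitz constant of the gradient of the smooth part (loss + optional l2 term). The constants follow the discussion above and may not match the merged code line for line:

```python
def auto_step_size(max_squared_sum, alpha_scaled, loss, fit_intercept,
                   n_samples=None, is_saga=False):
    if loss == 'log':
        # The logistic loss has a 1/4-Lipschitz gradient per sample.
        L = 0.25 * (max_squared_sum + int(fit_intercept)) + alpha_scaled
    elif loss == 'squared':
        L = max_squared_sum + int(fit_intercept) + alpha_scaled
    else:
        raise ValueError("loss must be 'log' or 'squared'")

    if is_saga:
        # SAGA theory allows 1 / (3 L) or 1 / (2 (L + mu n)) with mu = alpha;
        # take whichever bound gives the larger step.
        mun = min(2 * n_samples * alpha_scaled, L)
        return 1.0 / (2 * L + mun)
    # SAG: 1 / L heuristic (theory only guarantees 1 / (16 L)).
    return 1.0 / L

print(auto_step_size(4.0, 1e-4, 'log', True, n_samples=10000, is_saga=True))
```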
@@ -261,26 +272,42 @@ def sag_solver(X, y, sample_weight=None, loss='log', alpha=1.,
    if max_squared_sum is None:
        max_squared_sum = row_norms(X, squared=True).max()
    step_size = get_auto_step_size(max_squared_sum, alpha_scaled, loss,
should the step size depend on beta_scaled in the L1 case?
No, it depends on L, which is the Lipschitz constant of the gradient of the smooth part of the objective (i.e. logistic + optional l2).
oh yes of course
Should we add an example for SAGA + some narrative documentation? I suggest multinomial logistic + l1 on rcv1, with comparison against one versus all with . I suggest we work on merging this PR before working on using SAGA for Lasso. There is also a question regarding whether we should propose both |
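A hedged sketch of what such an example could look like, using the 20 newsgroups data instead of rcv1 to keep it small; the hyperparameters are only illustrative:

```python
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.linear_model import LogisticRegression

bunch = fetch_20newsgroups_vectorized(subset='train')
X, y = bunch.data, bunch.target

# Compare multinomial vs one-vs-rest, both with l1 penalty and SAGA.
for multi_class in ('multinomial', 'ovr'):
    clf = LogisticRegression(solver='saga', penalty='l1', C=1.0,
                             multi_class=multi_class, max_iter=10, tol=1e-3)
    clf.fit(X, y)
    sparsity = 100.0 * (clf.coef_ == 0).mean()
    print("%s: train accuracy %.3f, %.1f%% zero weights"
          % (multi_class, clf.score(X, y), sparsity))
```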
ping @TomDLT @agramfort @GaelVaroquaux I think this is ready for review. Reviews welcome from @fabianp and @mblondel for the core code if you have time. A few UX questions to answer before I move on:
I reckon adding |
How does SAGA perform on small datasets (e.g. iris)? |
We will need a very clear paragraph in the documentation that explains which solver to choose when. |
@@ -0,0 +1,114 @@
"""
=====================================================
Multiclass logistic regression on newsgroups20
Please also put "sparse" in the title.
features to zero. This is good if the goal is to extract the strongly
discriminative vocabulary of each class. If the goal is to get the best
predictive accuracy, it is better to use the non sparsity-inducing
l2 penalty instead.
I think you should also mention univariate feature selection as an alternative way to extract sparse discriminative vocabularies.
Maybe you could even extend the example by adding a pipeline of a sparse univariate feature selection model + l2 penalized logistic regression, to showcase a classification model with a sparsity level similar to the l1 penalized variant.
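A minimal sketch of that suggestion, assuming a non-negative text design matrix X_train / y_train; the selector, score function and k are illustrative choices, not part of this PR:

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Keep only k features (a sparsity level comparable to the l1 model),
# then fit an l2-penalized logistic regression on the reduced matrix.
sparse_then_l2 = make_pipeline(
    SelectKBest(chi2, k=5000),
    LogisticRegression(solver='saga', penalty='l2', C=1.0, tol=1e-3),
)
# sparse_then_l2.fit(X_train, y_train)
```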
Performance of multinomial logistic regression with
L1 penalty. We use the SAGA algorithm for this purpose, which is fast. Test
accuracy reaches > 0.8, while weight vectors remain *sparse* and
*interpretable*.
Please add a note such as:
Note that this accuracy is far below what can be reached by a non-penalized linear model (I think ~0.93, but this should be checked), and even further below the accuracy of non-linear models such as a multi-layer perceptron (0.98+).
sklearn/linear_model/sag_fast.pyx
Outdated
np.ndarray[double, ndim=2, mode='c'] sum_gradient_init,
np.ndarray[double, ndim=2, mode='c'] gradient_memory_init,
np.ndarray[bint, ndim=1, mode='c'] seen_init,
int num_seen,
bint fit_intercept,
np.ndarray[double, ndim=1, mode='c'] intercept_sum_gradient_init,
double intercept_decay,
bint saga,
bint verbose):
"""Stochastic Average Gradient (SAG) solver. |
Please update this docstring to make it explicit that this function implements both SAG and SAGA.
I pushed better checks for the l1 penalty logistic regression tests. I still want to review other parts of the code / tests but maybe we should do the Lasso part in a separate PR. |
I did a review and pushed some fixes (mostly missing updates in the documentation). I think we can merge this PR without waiting for elasticnet penalty and integration in the Lasso* classes.
LGTM except for minor nitpicks
I think we can merge this PR without waiting for elasticnet penalty and integration in the Lasso* classes.
I agree
doc/modules/linear_model.rst
Outdated
Very Large dataset (`n_samples`)  "sag" or "saga"
================================= =====================================

The "saga" solver is almost always a the best choice. The "liblinear"
a
regression yields more accurate results and is faster to train on the larger
scale dataset.

Here we use the l1 sparsity that trims the weights of no to informative
-> of not informative
sklearn/linear_model/logistic.py
Outdated
@@ -967,6 +976,9 @@ class LogisticRegression(BaseEstimator, LinearClassifierMixin,
        Used to specify the norm used in the penalization. The 'newton-cg',
        'sag' and 'lbfgs' solvers support only l2 penalties.

        .. versionadded:: 0.19
           l1 penalty with SAGA solver (allowing 'multinomial + L1)
fix the '
sklearn/linear_model/sag_fast.pyx
Outdated
@@ -6,16 +6,16 @@
# Tom Dupre la Tour <tom.dupre-la-tour@m4x.org>
#
add author
@@ -860,6 +895,71 @@ def test_logreg_intercept_scaling_zero():
    assert_equal(clf.intercept_, 0.)


def test_logreg_l1():
    # Because liblinear penalizes the intercept and saga does not, we do
we do not


def test_logreg_l1_sparse_data():
    # Because liblinear penalizes the intercept and saga does not, we do
we do not
I rebased and squashed everything down to a single commit. If CI is still green, let's merge. |
Merged! Thanks @arthurmensch! |
Great!! Thanks for the reviews and edits!
|
🍻 |
Awesome!!!!!
|
Wow!
|
🍻 !!!
|
congrats @arthurmensch and co. I think this is a great example of development that started in scikit-learn-contrib and (with a lot of work and improvements) ended upstream. |
This PR proposes slight adaptations of the existing sag_fast.pyx module to be able to use SAGA in addition to the SAG algorithm. This is the way to go if we want to propose an enet/l1 penalty for fast incremental solvers for ridge and logistic regression.

A SAGA implementation is already available in lightning, which is adapted from the original paper. This one is slightly different, as it is built around understanding the SAGA update as a "corrected" version of SAG. I believe it would also be possible to slightly adapt the module to have SVRG in addition to SAGA, if this interests people.

I tried to keep the changes made to sag_fast.pyx as small as possible. I reckon that the sag_fast.pyx module could be made a little more readable using 2d memoryviews instead of strided pointers everywhere; left for further work.

For the moment I adapted the test_logistic.py file to ensure correctness of the algorithm, but the saga algorithm should be tested within test_sag.py.

SAGA paper

TODO
- sag.py directly
- Use minibatches ? I cannot recall whether this is actually interesting.
- Benchmarks against liblinear and lightning
- Add a SAGA solver for Lasso (which might imply a bit of refactoring...)
- rcv1 example with multinomial + L1
- Add elastic net l1_ratio in LogisticRegression
ping @TomDLT @agramfort you might be interested by this :)
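For reference, a minimal hypothetical check that the new solver is also exposed in Ridge, as the title mentions; the synthetic data and settings are arbitrary:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=5000, n_features=100, noise=1.0,
                       random_state=0)

# Both incremental solvers should reach a similar fit on this problem.
for solver in ('sag', 'saga'):
    reg = Ridge(alpha=1.0, solver=solver, tol=1e-3, random_state=0)
    reg.fit(X, y)
    print("%s: R^2 = %.4f" % (solver, reg.score(X, y)))
```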