@@ -233,24 +233,23 @@ Cross-Validation.
Lasso
=====

- The :class:`Lasso` is a linear model that estimates sparse coefficients.
+ The :class:`Lasso` is a linear model that estimates sparse coefficients, i.e., it is
+ able to set coefficients exactly to zero.
It is useful in some contexts due to its tendency to prefer solutions
with fewer non-zero coefficients, effectively reducing the number of
features upon which the given solution is dependent. For this reason,
Lasso and its variants are fundamental to the field of compressed sensing.
- Under certain conditions, it can recover the exact set of non-zero
- coefficients (see
+ Under certain conditions, it can recover the exact set of non-zero coefficients (see
:ref:`sphx_glr_auto_examples_applications_plot_tomography_l1_reconstruction.py`).
Mathematically, it consists of a linear model with an added regularization term.
The objective function to minimize is:

- .. math:: \min_{w} { \frac{1}{2n_{\text{samples}}} ||X w - y||_2 ^ 2 + \alpha ||w||_1}
+ .. math:: \min_{w} P(w) = { \frac{1}{2n_{\text{samples}}} ||X w - y||_2 ^ 2 + \alpha ||w||_1}

- The lasso estimate thus solves the minimization of the
- least-squares penalty with :math:`\alpha ||w||_1` added, where
- :math:`\alpha` is a constant and :math:`||w||_1` is the :math:`\ell_1`-norm of
- the coefficient vector.
+ The lasso estimate thus solves the least-squares problem with added penalty
+ :math:`\alpha ||w||_1`, where :math:`\alpha` is a constant and :math:`||w||_1` is the
+ :math:`\ell_1`-norm of the coefficient vector.
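As a quick, illustrative sketch (the data below is made up for demonstration and is not
part of this documentation change), fitting :class:`Lasso` with a moderate `alpha` on data
where only a few features matter typically sets the remaining coefficients exactly to zero::

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.RandomState(0)
    X = rng.randn(50, 5)
    # only the first two features influence the target
    y = X[:, 0] + 2 * X[:, 1] + 0.01 * rng.randn(50)

    reg = Lasso(alpha=0.1).fit(X, y)
    print(reg.coef_)  # the last three coefficients are typically exactly 0.0
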
The implementation in the class :class:`Lasso` uses coordinate descent as
the algorithm to fit the coefficients. See :ref:`least_angle_regression`
@@ -281,18 +280,86 @@ computes the coefficients along the full path of possible values.
.. dropdown:: References

- The following two references explain the iterations
- used in the coordinate descent solver of scikit-learn, as well as
- the duality gap computation used for convergence control.
+ The following references explain the origin of the Lasso as well as properties
+ of the Lasso problem and the duality gap computation used for convergence control.
- * "Regularization Path For Generalized linear Models by Coordinate Descent",
289
- Friedman, Hastie & Tibshirani, J Stat Softw, 2010 (` Paper
290
- <https://www.jstatsoft.org/article/view/v033i01/v33i01.pdf> `__).
286
+ * :doi: ` Robert Tibshirani. (1996) Regression Shrinkage and Selection Via the Lasso.
287
+ J. R. Stat. Soc. Ser. B Stat. Methodol., 58(1):267-288
288
+ <10.1111/j.2517-6161.1996.tb02080.x> `
291
289
* "An Interior-Point Method for Large-Scale L1-Regularized Least Squares,"
292
290
S. J. Kim, K. Koh, M. Lustig, S. Boyd and D. Gorinevsky,
293
291
in IEEE Journal of Selected Topics in Signal Processing, 2007
294
292
(`Paper <https://web.stanford.edu/~boyd/papers/pdf/l1_ls.pdf >`__)
295
293
294
+ .. _coordinate_descent:
+
+ Coordinate Descent with Gap Safe Screening Rules
+ ------------------------------------------------
+
+ Coordinate descent (CD) is a strategy to solve a minimization problem that considers a
+ single feature :math:`j` at a time. This way, the optimization problem is reduced to a
+ 1-dimensional problem which is easier to solve:
+
+ .. math:: \min_{w_j} {\frac{1}{2n_{\text{samples}}} ||x_j w_j + X_{-j}w_{-j} - y||_2 ^ 2 + \alpha |w_j|}
+
+ with index :math:`-j` meaning all features but :math:`j`. The solution is
+
+ .. math:: w_j = \frac{S(x_j^T (y - X_{-j}w_{-j}), n_{\text{samples}}\alpha)}{||x_j||_2^2}
+
+ with the soft-thresholding function
+ :math:`S(z, \alpha) = \operatorname{sign}(z) \max(0, |z|-\alpha)`.
+ Note that the soft-thresholding function is exactly zero whenever
+ :math:`\alpha \geq |z|`.
+ The CD solver then loops over the features either in a cycle, picking one feature after
+ the other in the order given by `X` (`selection="cyclic"`), or by randomly picking
+ features (`selection="random"`).
+ It stops if the duality gap is smaller than the provided tolerance `tol`.
+
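To make the update rule concrete, here is a minimal NumPy sketch of one cyclic variant of
this iteration. It only illustrates the formulas above; it is not scikit-learn's solver,
which is implemented in Cython and also performs the duality-gap check detailed below::

    import numpy as np

    def soft_threshold(z, t):
        # S(z, t) = sign(z) * max(0, |z| - t)
        return np.sign(z) * np.maximum(0.0, np.abs(z) - t)

    def lasso_cd(X, y, alpha, n_passes=100):
        # plain cyclic coordinate descent; assumes no all-zero columns in X
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        r = y - X @ w  # residual, updated incrementally
        col_sq_norms = (X ** 2).sum(axis=0)
        for _ in range(n_passes):
            for j in range(n_features):  # selection="cyclic"
                r += X[:, j] * w[j]  # residual without feature j: y - X_{-j} w_{-j}
                w[j] = soft_threshold(X[:, j] @ r, n_samples * alpha) / col_sq_norms[j]
                r -= X[:, j] * w[j]  # restore the full residual with the updated w_j
            # a full solver would stop here once the duality gap falls below tol
        return w
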
+ .. dropdown:: Mathematical details
+
+ The duality gap :math:`G(w, v)` is an upper bound of the difference between the
+ current primal objective function of the Lasso, :math:`P(w)`, and its minimum
+ :math:`P(w^\star)`, i.e. :math:`P(w) - P(w^\star) \leq G(w, v)`. It is given by
+ :math:`G(w, v) = P(w) - D(v)` with dual objective function
+
+ .. math:: D(v) = \frac{1}{2n_{\text{samples}}}(2y^Tv - ||v||_2^2)
+
+ subject to :math:`||X^Tv||_{\infty} \leq n_{\text{samples}}\alpha`.
+ With (scaled) dual variable :math:`v = c r`, current residual :math:`r = y - Xw` and
+ dual scaling
+
+ .. math::
+     c = \begin{cases}
+         1, & ||X^Tr||_{\infty} \leq n_{\text{samples}}\alpha, \\
+         \frac{n_{\text{samples}}\alpha}{||X^Tr||_{\infty}}, & \text{otherwise}
+     \end{cases}
+
+ the stopping criterion is
+
+ .. math:: G(w, cr) \leq \text{tol} \frac{||y||_2^2}{n_{\text{samples}}}\,.
+
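As an illustration of these formulas (a sketch only, not the library's internal
implementation), the gap for a given coefficient vector can be computed directly::

    import numpy as np

    def lasso_duality_gap(X, y, w, alpha):
        # duality gap G(w, c*r) for the Lasso objective P(w) defined above
        n_samples = X.shape[0]
        r = y - X @ w
        primal = (r @ r) / (2 * n_samples) + alpha * np.abs(w).sum()   # P(w)
        dual_norm = np.max(np.abs(X.T @ r))                            # ||X^T r||_inf
        c = 1.0 if dual_norm <= n_samples * alpha else n_samples * alpha / dual_norm
        v = c * r                                                      # feasible dual point
        dual = (2 * (y @ v) - v @ v) / (2 * n_samples)                 # D(v)
        return primal - dual

The stopping check above then amounts to comparing this value against
`tol * ||y||_2^2 / n_samples`.
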
+ A clever method to speed up the coordinate descent algorithm is to screen features
+ whose coefficient is zero at the optimum, :math:`w_j = 0`. Gap safe screening rules are
+ such a tool. At any point during the optimization, they can tell which features can be
+ safely excluded, i.e., set to zero with certainty.
+
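For intuition, here is a sketch of how such a rule can be stated in the notation above,
following the gap safe sphere construction of the references below (the exact form used
by an implementation may differ). Because the dual objective is strongly concave, any
feasible :math:`v` satisfies :math:`||v - v^\star||_2 \leq \sqrt{2 n_{\text{samples}} G(w, v)}`,
so feature :math:`j` can be safely discarded whenever

.. math:: |x_j^T v| + ||x_j||_2 \sqrt{2 n_{\text{samples}} G(w, v)} < n_{\text{samples}}\alpha,

since this guarantees :math:`|x_j^T v^\star| < n_{\text{samples}}\alpha` and therefore
:math:`w_j^\star = 0`.
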
+ .. dropdown:: References
+
+ The first reference explains the coordinate descent solver used in scikit-learn; the
+ others treat gap safe screening rules.
+
+ * :doi:`Friedman, Hastie & Tibshirani. (2010).
+ Regularization Path For Generalized Linear Models by Coordinate Descent.
+ J Stat Softw 33(1), 1-22 <10.18637/jss.v033.i01>`
+ * :arxiv:`O. Fercoq, A. Gramfort, J. Salmon. (2015).
+ Mind the duality gap: safer rules for the Lasso.
+ Proceedings of Machine Learning Research 37:333-342.
+ <1505.03410>`
+ * :arxiv:`E. Ndiaye, O. Fercoq, A. Gramfort, J. Salmon. (2017).
+ Gap Safe Screening Rules for Sparsity Enforcing Penalties.
+ Journal of Machine Learning Research 18(128):1-33.
+ <1611.05780>`
+
Setting regularization parameter
--------------------------------