[MRG] LinearRegression Optimizations #17560


Closed
wants to merge 20 commits

Conversation

@rithvikrao (Contributor) commented Jun 11, 2020

Reference Issues/PRs

Fixes #14268.

What does this implement/fix? Explain your changes.

  • Changes the scipy.linalg.lstsq call in LinearRegression's fit to pass check_finite=False, since a finiteness check is already completed in the data validation step: fit calls self._preprocess_data, which itself calls check_array (defined in utils/validation.py), and check_array has a force_all_finite flag. See the scipy.linalg.lstsq documentation at https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.lstsq.html. A sketch of this change appears after this list.
  • Factors the Ridge solver code out into _ridge_solvers.py so that Ridge(solver="cholesky", alpha=0) can be used in LinearRegression without explicitly calling the Ridge estimator.
  • Adds an optional solver parameter to LinearRegression which, if set to "cholesky", uses the Cholesky Ridge method with alpha=0 for the OLS fit.
  • Documentation and tests.
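The first bullet amounts to a single flag; here is a minimal sketch of the idea in isolation (illustrative data, not the PR's code):

    import numpy as np
    from scipy import linalg

    X = np.random.rand(100, 5)
    y = np.random.rand(100)

    # check_array(..., force_all_finite=True) has already screened X and y
    # for NaN/inf upstream, so lstsq's own finiteness scan is redundant.
    coef, residues, rank, singular_values = linalg.lstsq(
        X, y, check_finite=False
    )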

@rth (Member) left a comment

Thanks @rithvikrao ! I think it's better to keep those private functions in _ridge and not create a new file. We can't import Ridge in LinearRegression, but it's fine to import private functions from the _ridge module. It would also be easier to review, as right now it's hard to say what changed in those functions (if anything did).

Generally you don't need to address everything in this PR, smaller PRs tend to get merged faster, and then you can open follow up improvements.

@rithvikrao (Contributor, Author) commented

Thanks for the input @rth, much appreciated! Regarding moving the private functions back to _ridge, I think this creates circular import issues, since _ridge and _base would then import from each other. Specifically, I get:

ImportError: cannot import name 'LinearClassifierMixin' from partially initialized module 'sklearn.linear_model._base' (most likely due to a circular import)

Nothing changed in _ridge other than moving those functions out into _ridge_solvers. Let me know what you think would be best.

Also, totally makes sense re: smaller PRs. I'll probably stop this one here (+ add some tests). Thank you!
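To make the resulting layout concrete, here is a schematic of the import graph described above (the module roles follow the PR description; the function names are illustrative):

    # sklearn/linear_model/_ridge_solvers.py
    #     pure solver functions; imports neither _base nor _ridge

    # sklearn/linear_model/_base.py
    from ._ridge_solvers import _solve_cholesky   # no cycle: a leaf module

    # sklearn/linear_model/_ridge.py
    from ._base import LinearClassifierMixin      # _ridge depends on _base

    # Keeping the solvers inside _ridge would instead require
    # _base -> _ridge -> _base, which raises the ImportError quoted above.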

@amueller (Member) commented Jun 20, 2020

You could either move _ridge into _base, or, maybe even better, into a new file (_ridge_utils.py or something) that doesn't import from _base. Alternatively you could use lazy imports, but that might not be the right solution here.

Sorry, never mind: that's what you did, and it makes sense.
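For reference, the lazy-import alternative mentioned above would look roughly like this (a sketch; the imported helper is hypothetical):

    def fit(self, X, y):
        # Deferred ("lazy") import: the name is resolved at call time rather
        # than at module import time, so the module-level cycle never forms.
        from ._ridge import _solve_cholesky
        ...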

@amueller (Member) left a comment

good start!

@rithvikrao rithvikrao marked this pull request as ready for review June 22, 2020 22:29
@flosincapite left a comment

Nice, digging the new tests!

@rithvikrao rithvikrao changed the title [WIP] LinearRegression Optimizations [MRG] LinearRegression Optimizations Jun 22, 2020
@rithvikrao (Contributor, Author) commented

@amueller Thanks for your comments! I've implemented and pushed the changes you asked for and they should be ready for review, pending CI passing.

@rithvikrao (Contributor, Author) commented

@jnothman Thanks for the review! Just pushed the changes you suggested.

@@ -156,7 +182,9 @@ def test_linear_regression_sparse(random_state=0):

@pytest.mark.parametrize('normalize', [True, False])
@pytest.mark.parametrize('fit_intercept', [True, False])
@pytest.mark.parametrize('solver', ['lstsq'])
def test_linear_regression_sparse_equal_dense(normalize, fit_intercept, solver):
@rithvikrao (Contributor, Author) commented

FYI, this test fails if I run it with solver = "cholesky". I think that is to be expected, because the solver treats sparse and dense matrices quite differently, but I wanted to note it here in case I'm misunderstanding something.
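In isolation, the equivalence the test asserts looks like this (a sketch against the released scikit-learn API, without the PR's solver parameter):

    import numpy as np
    from scipy import sparse
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    X = rng.rand(20, 5)
    y = rng.rand(20)

    # Fitting on a dense design matrix and on its sparse copy should give
    # the same coefficients up to numerical tolerance.
    dense = LinearRegression().fit(X, y)
    sparse_fit = LinearRegression().fit(sparse.csr_matrix(X), y)
    np.testing.assert_allclose(dense.coef_, sparse_fit.coef_, rtol=1e-6)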

    try:
        ...
        coef = safe_sparse_dot(X.T, dual_coef, dense_output=True).T
    except linalg.LinAlgError:
        # use SVD solver if matrix is singular
        coef = _solve_svd(X, y, alpha)
@rithvikrao (Contributor, Author) commented

Is it ever possible for there to be a LinAlgError here? If n_features > n_samples, then I think the matrix cannot be singular. I'm having issues with test coverage because codecov is treating this like a new / untracked file, and I don't think a test could ever hit this exception.

cc @amueller
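For context, the fallback pattern in the snippet above looks roughly like this in isolation (a sketch; the actual helpers in _ridge differ in detail):

    import numpy as np
    from scipy import linalg

    def solve_dual_with_svd_fallback(X, y, alpha):
        # Dual (kernel) formulation: solve (X X' + alpha * I) dual_coef = y,
        # then map back to primal coefficients through X'.
        n_samples = X.shape[0]
        K = X @ X.T + alpha * np.eye(n_samples)
        try:
            dual_coef = linalg.solve(K, y, assume_a="pos")
        except linalg.LinAlgError:
            # With alpha == 0, K can still be singular (e.g. duplicated
            # samples), so fall back to an SVD/pseudoinverse solution.
            dual_coef = np.linalg.pinv(K) @ y
        return X.T @ dual_coef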

(Member) commented

check if coverage was already missing in master

@rithvikrao rithvikrao requested review from jnothman and amueller July 6, 2020 22:52
@rth rth self-requested a review July 10, 2020 21:46
@lorentzenchr (Member) left a comment

Just a few comments from a quick pass.

Comment on lines +83 to +84
The Cholesky solution is computed using the Cholesky factorization of
X. If X is a matrix of shape `(n_samples, n_features)` this method has
@lorentzenchr (Member) commented Jul 31, 2020

The cholesky solver solves the normal equations via a Cholesky decomposition of X'X or XX', whichever has the smaller dimension. The condition number of X'X is the square of the condition number of X, which is why the numerical solution can become less stable than with approaches that decompose X itself, as lstsq does.
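A quick numerical illustration of that point (not from the PR):

    import numpy as np

    rng = np.random.RandomState(0)
    # Give X one nearly degenerate column direction.
    X = rng.randn(50, 3) @ np.diag([1.0, 1.0, 1e-4])

    # The normal equations square the condition number, which is what the
    # Cholesky route has to work with.
    print(np.linalg.cond(X))        # roughly 1e4
    print(np.linalg.cond(X.T @ X))  # roughly 1e8, i.e. cond(X) ** 2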

(Member) commented

It would be interesting to expand the doc to include this remark, possibly as a new paragraph.

approach. The ``LinearRegression`` class has an additional, optional
``solver`` parameter, which if set to ``"cholesky"`` uses the Cholesky
factorization instead. See `these notes
<https://www.cs.ubc.ca/~schmidtm/Courses/540-F14/leastSquares.pdf>` for a
(Member) commented

It might be better to place references under a .. topic:: References section. One very good reference for least squares, in my opinion, is https://www.math.uchicago.edu/~may/REU2012/REUPapers/Lee.pdf.

(Member) commented

Suggested change:

-    <https://www.cs.ubc.ca/~schmidtm/Courses/540-F14/leastSquares.pdf>` for a
+    <https://www.cs.ubc.ca/~schmidtm/Courses/540-F14/leastSquares.pdf>`__ for a

This is the correct web link syntax, at a minimum.

@thomasjpfan (Member) left a comment

First pass

Comment on lines +488 to +489
    if solver not in ["lstsq", "cholesky"]:
        raise ValueError("Solver must be either `lstsq` or `cholesky`")
(Member) commented

By convention we validate hyperparameters in fit.

@ogrisel (Member) commented Feb 9, 2021

Furthermore, it would be great to expand the error message to report the observed invalid value of the solver parameter:

    raise ValueError(f'Solver must be either "lstsq" or "cholesky", got: {repr(solver)}')
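Putting both review points together, the check could live in fit and look roughly like this (a sketch following the PR's proposed API, which was never merged):

    def fit(self, X, y, sample_weight=None):
        # Validate hyperparameters here rather than in __init__, per the
        # scikit-learn convention, and report the offending value.
        if self.solver not in ("lstsq", "cholesky"):
            raise ValueError(
                f'Solver must be either "lstsq" or "cholesky", '
                f'got: {self.solver!r}'
            )
        ...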

    if sp.issparse(X):
        if self.solver == "cholesky":
            n_samples, n_features = X.shape
            ravel = False
(Member) commented

This would be more explicit:

Suggested change:

-    ravel = False
+    y_1d = y.ndim == 1

Comment on lines +426 to +428
Cholesky decomposition. If X is singular, then ``"cholesky"`` will
instead use an SVD-based solver. ``"cholesky"`` does not support `X`
matrices which are both singular and sparse.
(Member) commented

I see this matches the behavior of Ridge, but it is still really strange behavior.

Base automatically changed from master to main January 22, 2021 10:52
@jjerphan (Member) commented Feb 9, 2021

Hi @rithvikrao, are you still working on this PR? 🙂

@lorentzenchr (Member) commented

@jjerphan Would you like to continue?

@jjerphan (Member) commented

@lorentzenchr : probably yes, when I get some free time. But if you want to move it forward, feel free to do so.

@ogrisel (Member) commented Mar 16, 2022

This is related to #22855.

@jjerphan (Member) commented

I've just opened #22940 as a direct follow-up attempt.

@lorentzenchr (Member) commented

Superseded by #22940.

Linked issue: Add option to use different solver to LinearRegression (#14268)