[MRG] LinearRegression Optimizations #17560


Closed
wants to merge 20 commits

Conversation

@rithvikrao (Contributor) commented Jun 11, 2020

Reference Issues/PRs

Fixes #14268.

What does this implement/fix? Explain your changes.

  • Changes the scipy.linalg.lstsq call in LinearRegression's fit to pass check_finite=False, since a finiteness check is already completed in the data validation step: fit calls self._preprocess_data, which itself calls check_array (defined in utils/validation.py), and check_array has a force_all_finite flag. See the scipy.linalg.lstsq documentation at https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.lstsq.html. A sketch of this change appears after this list.
  • Factors the Ridge solver code out into _ridge_solvers.py so that Ridge(solver="cholesky", alpha=0) can be used in LinearRegression without explicitly calling the Ridge estimator.
  • Adds an optional solver parameter to LinearRegression which, if set to "cholesky", uses the Cholesky Ridge method with alpha=0 for the OLS fit.
  • Documentation and tests.
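The first bullet amounts to a single flag; here is a minimal sketch of the idea in isolation (illustrative data, not the PR's code):

    import numpy as np
    from scipy import linalg

    X = np.random.rand(100, 5)
    y = np.random.rand(100)

    # check_array(..., force_all_finite=True) has already screened X and y
    # for NaN/inf upstream, so lstsq's own finiteness scan is redundant.
    coef, residues, rank, singular_values = linalg.lstsq(
        X, y, check_finite=False
    )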

@rth (Member) left a comment

Thanks @rithvikrao ! I think it's better to keep those private functions in _ridge and not create a new file. We can't import Ridge in LinearRegression, but it's fine to import private functions from the _ridge module. It would also be easier to review, as right now it's hard to say what changed in those functions (if anything did).

Generally you don't need to address everything in this PR, smaller PRs tend to get merged faster, and then you can open follow up improvements.

@rithvikrao (Contributor, Author) commented

Thanks for the input @rth, much appreciated! Regarding moving the private functions back to _ridge, I think this creates circular import issues, since _ridge and _base would then import from each other. Specifically, I get:

ImportError: cannot import name 'LinearClassifierMixin' from partially initialized module 'sklearn.linear_model._base' (most likely due to a circular import)

Nothing changed in _ridge other than moving those functions out into _ridge_solvers. Let me know what you think would be best.

Also, totally makes sense re: smaller PRs. I'll probably stop this one here (+ add some tests). Thank you!
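To make the resulting layout concrete, here is a schematic of the import graph described above (the module roles follow the PR description; the function names are illustrative):

    # sklearn/linear_model/_ridge_solvers.py
    #     pure solver functions; imports neither _base nor _ridge

    # sklearn/linear_model/_base.py
    from ._ridge_solvers import _solve_cholesky   # no cycle: a leaf module

    # sklearn/linear_model/_ridge.py
    from ._base import LinearClassifierMixin      # _ridge depends on _base

    # Keeping the solvers inside _ridge would instead require
    # _base -> _ridge -> _base, which raises the ImportError quoted above.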

@amueller (Member) commented Jun 20, 2020

You could either move _ridge into _base, or, maybe even better, into a new file (_ridge_utils.py or something) that doesn't import from _base. Alternatively you could use lazy imports, but that might not be the right solution here.

Sorry, never mind: that's what you did, and it makes sense.
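For reference, the lazy-import alternative mentioned above would look roughly like this (a sketch; the imported helper is hypothetical):

    def fit(self, X, y):
        # Deferred ("lazy") import: the name is resolved at call time rather
        # than at module import time, so the module-level cycle never forms.
        from ._ridge import _solve_cholesky
        ...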

@amueller (Member) left a comment

good start!

@rithvikrao rithvikrao marked this pull request as ready for review June 22, 2020 22:29
@flosincapite left a comment

Nice, digging the new tests!

@rithvikrao rithvikrao changed the title [WIP] LinearRegression Optimizations [MRG] LinearRegression Optimizations Jun 22, 2020
@rithvikrao (Contributor, Author) commented

@amueller Thanks for your comments! I've implemented and pushed the changes you asked for and they should be ready for review, pending CI passing.

@rithvikrao (Contributor, Author) commented

@jnothman Thanks for the review! Just pushed the changes you suggested.

@@ -156,7 +182,9 @@ def test_linear_regression_sparse(random_state=0):

@pytest.mark.parametrize('normalize', [True, False])
@pytest.mark.parametrize('fit_intercept', [True, False])
@pytest.mark.parametrize('solver', ['lstsq'])
def test_linear_regression_sparse_equal_dense(normalize, fit_intercept, solver):
@rithvikrao (Contributor, Author) commented

FYI, this test fails if I run it with solver = "cholesky". I think that is to be expected, because the solver treats sparse and dense matrices quite differently, but I wanted to note it here in case I'm misunderstanding something.
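In isolation, the equivalence the test asserts looks like this (a sketch against the released scikit-learn API, without the PR's solver parameter):

    import numpy as np
    from scipy import sparse
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    X = rng.rand(20, 5)
    y = rng.rand(20)

    # Fitting on a dense design matrix and on its sparse copy should give
    # the same coefficients up to numerical tolerance.
    dense = LinearRegression().fit(X, y)
    sparse_fit = LinearRegression().fit(sparse.csr_matrix(X), y)
    np.testing.assert_allclose(dense.coef_, sparse_fit.coef_, rtol=1e-6)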

    try:
        ...
        coef = safe_sparse_dot(X.T, dual_coef, dense_output=True).T
    except linalg.LinAlgError:
        # use SVD solver if matrix is singular
        coef = _solve_svd(X, y, alpha)
@rithvikrao (Contributor, Author) commented

Is it ever possible for there to be a LinAlgError here? If n_features > n_samples, then I think the matrix cannot be singular. I'm having issues with test coverage because codecov is treating this like a new / untracked file, and I don't think a test could ever hit this exception.

cc @amueller
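For context, the fallback pattern in the snippet above looks roughly like this in isolation (a sketch; the actual helpers in _ridge differ in detail):

    import numpy as np
    from scipy import linalg

    def solve_dual_with_svd_fallback(X, y, alpha):
        # Dual (kernel) formulation: solve (X X' + alpha * I) dual_coef = y,
        # then map back to primal coefficients through X'.
        n_samples = X.shape[0]
        K = X @ X.T + alpha * np.eye(n_samples)
        try:
            dual_coef = linalg.solve(K, y, assume_a="pos")
        except linalg.LinAlgError:
            # With alpha == 0, K can still be singular (e.g. duplicated
            # samples), so fall back to an SVD/pseudoinverse solution.
            dual_coef = np.linalg.pinv(K) @ y
        return X.T @ dual_coef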

(Member) commented

check if coverage was already missing in master

@rithvikrao rithvikrao requested review from jnothman and amueller July 6, 2020 22:52
@rth rth self-requested a review July 10, 2020 21:46
@lorentzenchr (Member) left a comment

Just a few comments from a quick pass.

Comment on lines +83 to +84
The Cholesky solution is computed using the Cholesky factorization of
X. If X is a matrix of shape `(n_samples, n_features)` this method has
@lorentzenchr (Member) commented Jul 31, 2020

The cholesky solver solves the normal equations via a Cholesky decomposition of X'X or XX', whichever has the smaller dimension. The condition number of X'X is the square of the condition number of X, which is why the numerical solution can become less stable than with approaches that decompose X itself, as lstsq does.
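A quick numerical illustration of that point (not from the PR):

    import numpy as np

    rng = np.random.RandomState(0)
    # Give X one nearly degenerate column direction.
    X = rng.randn(50, 3) @ np.diag([1.0, 1.0, 1e-4])

    # The normal equations square the condition number, which is what the
    # Cholesky route has to work with.
    print(np.linalg.cond(X))        # roughly 1e4
    print(np.linalg.cond(X.T @ X))  # roughly 1e8, i.e. cond(X) ** 2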

(Member) commented

It would be interesting to expand the doc to include this remark, possibly as a new paragraph.

approach. The ``LinearRegression`` class has an additional, optional
``solver`` parameter, which if set to ``"cholesky"`` uses the Cholesky
factorization instead. See `these notes
<https://www.cs.ubc.ca/~schmidtm/Courses/540-F14/leastSquares.pdf>` for a
(Member) commented

It might be better to place references under a .. topic:: References section. One very good reference for least squares, in my opinion, is https://www.math.uchicago.edu/~may/REU2012/REUPapers/Lee.pdf.

(Member) commented

Suggested change:

-    <https://www.cs.ubc.ca/~schmidtm/Courses/540-F14/leastSquares.pdf>` for a
+    <https://www.cs.ubc.ca/~schmidtm/Courses/540-F14/leastSquares.pdf>`__ for a

This is the correct web link syntax, at a minimum.

@thomasjpfan (Member) left a comment

First pass

Comment on lines +488 to +489
    if solver not in ["lstsq", "cholesky"]:
        raise ValueError("Solver must be either `lstsq` or `cholesky`")
(Member) commented

By convention we validate hyperparameters in fit.

@ogrisel (Member) commented Feb 9, 2021

Furthermore, it would be great to expand the error message to report the observed invalid value of the solver parameter:

    raise ValueError(f'Solver must be either "lstsq" or "cholesky", got: {repr(solver)}')
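Putting both review points together, the check could live in fit and look roughly like this (a sketch following the PR's proposed API, which was never merged):

    def fit(self, X, y, sample_weight=None):
        # Validate hyperparameters here rather than in __init__, per the
        # scikit-learn convention, and report the offending value.
        if self.solver not in ("lstsq", "cholesky"):
            raise ValueError(
                f'Solver must be either "lstsq" or "cholesky", '
                f'got: {self.solver!r}'
            )
        ...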

    if sp.issparse(X):
        if self.solver == "cholesky":
            n_samples, n_features = X.shape
            ravel = False
(Member) commented

This would be more explicit:

Suggested change:

-    ravel = False
+    y_1d = y.ndim == 1

Comment on lines +426 to +428
Cholesky decomposition. If X is singular, then ``"cholesky"`` will
instead use an SVD-based solver. ``"cholesky"`` does not support `X`
matrices which are both singular and sparse.
(Member) commented

I see this matches the behavior of Ridge, but it is still really strange behavior.

Base automatically changed from master to main January 22, 2021 10:52
@jjerphan (Member) commented Feb 9, 2021

Hi @rithvikrao, are you still working on this PR? 🙂

@lorentzenchr (Member) commented

@jjerphan Would you like to continue?

@jjerphan (Member) commented

@lorentzenchr : probably yes, when I get some free time. But if you want to move it forward, feel free to do so.

@ogrisel (Member) commented Mar 16, 2022

This is related to #22855.

@jjerphan (Member) commented

I've just opened #22940 as a direct follow-up attempt.

@lorentzenchr (Member) commented

Superseded by #22940.

Linked issue: Add option to use different solver to LinearRegression (#14268)