[MRG] LinearRegression Optimizations #17560
Conversation
Force-pushed ("…for linear regression"): compare 7acbe37 to a9579c2
Thanks @rithvikrao! I think it's better to keep those private functions in `_ridge` and not create a new file. We can't import `Ridge` in `LinearRegression`, but it's fine to import private functions from the `_ridge` module. It would also be easier to review, as right now it's hard to say what changed in those functions (if anything did).

Generally you don't need to address everything in this PR; smaller PRs tend to get merged faster, and then you can open follow-up improvements.
Thanks for the input @rth, much appreciated! Regarding moving the private functions back to `_ridge`: nothing changed in those functions. Also, totally makes sense re: smaller PRs. I'll probably stop this one here (+ add some tests). Thank you!
Sorry, never mind, that's what you did and it makes sense.
good start!
Nice, digging the new tests!
@amueller Thanks for your comments! I've implemented and pushed the changes you asked for, and they should be ready for review, pending CI passing.

@jnothman Thanks for the review! Just pushed the changes you suggested.
```diff
@@ -156,7 +182,9 @@ def test_linear_regression_sparse(random_state=0):

 @pytest.mark.parametrize('normalize', [True, False])
 @pytest.mark.parametrize('fit_intercept', [True, False])
+@pytest.mark.parametrize('solver', ['lsqr'])
 def test_linear_regression_sparse_equal_dense(normalize, fit_intercept):
```
FYI, this test fails if I run it with `solver="cholesky"`. I think this is to be expected, because sparse and dense matrices are treated totally differently by the solver, but I wanted to note that here in case I'm misunderstanding something.
```python
    coef = safe_sparse_dot(X.T, dual_coef, dense_output=True).T
except linalg.LinAlgError:
    # use SVD solver if matrix is singular
    coef = _solve_svd(X, y, alpha)
```
Is it ever possible for there to be a `LinAlgError` here? If `n_features > n_samples`, then I think the matrix cannot be singular. I'm having issues with test coverage because codecov is treating this like a new / untracked file, and I don't think a test could ever hit this exception.

cc @amueller
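For what it's worth, the singular case can be probed directly outside scikit-learn: with duplicated samples, the kernel matrix `X @ X.T` is rank-deficient even when `n_features > n_samples`, so a Cholesky factorization does fail. A minimal NumPy/SciPy sketch (not the actual `_ridge` code path):

```python
import numpy as np
from scipy import linalg

# Two identical samples, three features: n_features > n_samples,
# yet the kernel matrix X @ X.T is rank-1 and therefore singular.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.0, 3.0]])
K = X @ X.T

try:
    linalg.cho_factor(K)  # Cholesky requires positive definiteness
    singular = False
except linalg.LinAlgError:
    singular = True

print(singular)  # True: the LinAlgError fallback is reachable
```

So the exception branch seems reachable in principle whenever samples are linearly dependent, regardless of the shape of `X`.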
Check whether coverage was already missing in master.
Just a few comments from a quick pass.
```
The Cholesky solution is computed using the Cholesky factorization of
X. If X is a matrix of shape `(n_samples, n_features)` this method has
```
The cholesky solver solves the normal equations by a Cholesky decomposition of X'X or XX', whichever has smaller dimension. That's why the condition number of the least squares problem is squared and the numerical solution can become more unstable compared to approaches that use a decomposition of X alone, as `lstsq` does.
I think it would be interesting to expand the doc to include this remark, possibly as a new paragraph.
```
approach. The ``LinearRegression`` class has an additional, optional
``solver`` parameter, which if set to ``"cholesky"`` uses the Cholesky
factorization instead. See `these notes
<https://www.cs.ubc.ca/~schmidtm/Courses/540-F14/leastSquares.pdf>` for a
```
It might be better to place references under a `.. topic:: References` section. One very good reference for least squares, in my opinion, is https://www.math.uchicago.edu/~may/REU2012/REUPapers/Lee.pdf.
```diff
- <https://www.cs.ubc.ca/~schmidtm/Courses/540-F14/leastSquares.pdf>` for a
+ <https://www.cs.ubc.ca/~schmidtm/Courses/540-F14/leastSquares.pdf>`__ for a
```

Correct web link syntax at a minimum.
First pass
```python
if solver not in ["lstsq", "cholesky"]:
    raise ValueError("Solver must be either `lstsq` or `cholesky`")
```
By convention we validate hyperparameters in `fit`.
Furthermore, it would be great to expand the error message to report the observed invalid value of the `solver` parameter:

```python
raise ValueError(f'Solver must be either "lstsq" or "cholesky", got: {repr(solver)}')
```
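Both review points together (validate in `fit`, report the bad value) can be sketched with a toy estimator. `TinyEstimator` is a hypothetical stand-in for illustration, not scikit-learn code:

```python
class TinyEstimator:
    """Hypothetical estimator following the scikit-learn convention:
    __init__ only stores parameters; fit validates them."""

    _valid_solvers = ("lstsq", "cholesky")

    def __init__(self, solver="lstsq"):
        self.solver = solver  # stored as-is, no validation here

    def fit(self, X, y):
        if self.solver not in self._valid_solvers:
            # Report the observed invalid value to ease debugging.
            raise ValueError(
                f'Solver must be either "lstsq" or "cholesky", got: {self.solver!r}'
            )
        return self


est = TinyEstimator(solver="lstsq")
est.fit(None, None)  # passes validation; real fitting omitted in this sketch
```

Deferring validation to `fit` keeps `get_params`/`set_params`-style cloning cheap and means constructing an estimator with a bad value only fails when it is actually used.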
```python
if sp.issparse(X):
    if self.solver == "cholesky":
        n_samples, n_features = X.shape
ravel = False
```
This would be more explicit:

```diff
- ravel = False
+ y_1d = y.ndim == 1
```
```
Cholesky decomposition. If X is singular, then ``"cholesky"`` will
instead use an SVD-based solver. ``"cholesky"`` does not support `X`
matrices which are both singular and sparse.
```
I see this is the behavior for `Ridge`, but this is still really strange behavior.
Hi @rithvikrao, are you still working on this PR? 🙂

@jjerphan Would you like to continue?

@lorentzenchr: probably yes, when I get some free time. But if you want to move it forward, feel free to do so.

This is related to #22855.

I've just opened #22940 as a direct follow-up attempt.

Superseded by #22940.
Reference Issues/PRs

Fixes #14268.

What does this implement/fix? Explain your changes.

- Modified the `scipy.linalg.lstsq` call in the `fit` function in `LinearRegression` to include the `check_finite=False` flag. This is because a finiteness check is already completed in the data validation step: `fit` calls `self._preprocess_data`, which itself calls `check_array` (defined in `utils/validation.py`), and `check_array` has a `force_all_finite` flag. The `scipy.linalg.lstsq` documentation can be found here.
- Moved the `Ridge` solver code into `_ridge_solvers.py` so that `Ridge(solver="cholesky", alpha=0)` can be used in `LinearRegression` without explicitly calling the `Ridge` estimator.
- Added a `solver` parameter to `LinearRegression`, which if set to `"cholesky"` uses the Cholesky Ridge method with `alpha=0` for the OLS fit.
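The `check_finite=False` change in the first bullet trades a redundant NaN/inf scan for a small speedup, which is safe only because the data has already been validated upstream. A minimal SciPy sketch on synthetic data (not the scikit-learn code path):

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
true_coef = np.arange(1.0, 6.0)
y = X @ true_coef + 0.01 * rng.standard_normal(100)

# check_finite=False skips SciPy's O(n_samples * n_features) finiteness
# scan; safe here because we constructed the data and know it is finite
# (in LinearRegression.fit, check_array has already done this check).
coef, *_ = linalg.lstsq(X, y, check_finite=False)

print(np.allclose(coef, true_coef, atol=0.1))  # True
```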