Fix `LinearRegression`'s numerical stability on rank deficient data by setting the `cond` parameter in the call to `scipy.linalg.lstsq` #30040
Conversation
test_linear_regression_sample_weight_consistency
The scipy solver indeed passes when setting `cond`.
We probably need to run a few variations of this test.
Indeed it's quite difficult to find a good robust choice for `cond`. There is some explanation on choosing this cutoff ratio for singular values in the `numpy.linalg.lstsq` documentation.
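To make the role of this cutoff concrete, here is a minimal sketch (made-up data, not this PR's code) showing that singular values below `cond * largest_singular_value` are treated as zero by `scipy.linalg.lstsq`:

```python
# Illustrative sketch (not the PR's code): the `cond` cutoff in
# scipy.linalg.lstsq treats singular values below
# cond * largest_singular_value as zero, truncating near-degenerate
# directions instead of amplifying them.
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
X = rng.randn(10, 5)
X[:, 4] = X[:, 3] + 1e-8 * rng.randn(10)  # nearly duplicated column
y = rng.randn(10)

# The default cutoff is on the order of machine precision: the tiny
# singular value of the near-duplicate column is kept (rank 5).
_, _, rank_default, _ = linalg.lstsq(X, y)

# A larger cutoff discards it, stabilising the solution (rank 4).
_, _, rank_cut, _ = linalg.lstsq(X, y, cond=1e-6)
print(rank_default, rank_cut)  # 5 4
```

The difficulty discussed above is precisely that no single `cond` value is right for all conditioning regimes.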
The test still fails for some data shapes.
I am surprised by that. Could you try to re-run it?
Note that the `cond` parameter of `scipy.linalg.lstsq` (https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.lstsq.html) is not the same as the `rcond` parameter of `numpy.linalg.lstsq` (https://numpy.org/doc/stable/reference/generated/numpy.linalg.lstsq.html).
EDIT: I misread the scipy docstring. Both definitions match even if the name of the parameter is different.
I have the feeling that only the numpy parametrization will make it possible to achieve what we are looking for.
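For what it's worth, a quick sketch with arbitrary data suggests the two parametrizations do behave the same when given the same cutoff value (both are relative cutoffs on singular values):

```python
# Sketch with arbitrary data: scipy's `cond` and numpy's `rcond` both act
# as a relative cutoff on singular values, so with the same value the two
# minimum-norm least-squares solutions should agree.
import numpy as np
from scipy import linalg

rng = np.random.RandomState(42)
X = rng.randn(6, 8)   # wide problem: n_features > n_samples
y = rng.randn(6)

cutoff = max(X.shape) * np.finfo(X.dtype).eps
coef_scipy = linalg.lstsq(X, y, cond=cutoff)[0]
coef_numpy = np.linalg.lstsq(X, y, rcond=cutoff)[0]
print(np.allclose(coef_scipy, coef_numpy))  # True
```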
Actually, the same tests pass with dense numpy arrays. So the problem appears to be specific to the use of scipy sparse datastructures. I realize that for the sparse case, we use the "lsqr" solver, which has no direct equivalent to the `cond` parameter.

So we should probably focus this particular PR on the fix for the dense case and open a dedicated issue+PR for the fix of the sparse case.
…ample_weight_bug_scipy
The added xfail tag (on csr array) is not ideal, because the test is actually passing in some cases.
It fails in others.
test_linear_regression_sample_weight_consistency
doc/whats_new/upcoming_changes/sklearn.linear_model/30040.fix.rst
                "sample_weight is not equivalent to removing/repeating samples."
            ),
        }
        return tags
The fact we can remove this while there is still a problem with sparse inputs makes me realize that we should expand `check_sample_weight_equivalence` to also test fitting with sparse inputs (when the estimator accepts sparse inputs). Let's open a dedicated PR for this (e.g. by introducing a new check named `check_sample_weight_equivalence_on_sparse_data`).
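To make the intent of such a check concrete, here is a minimal sketch of the equivalence property itself (not scikit-learn's actual check implementation), on dense data where the solver behaves well:

```python
# Minimal sketch of the property the check verifies (not scikit-learn's
# actual check code): fitting with integer sample_weight should match
# fitting on the correspondingly repeated rows.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(12, 3)            # tall, well-conditioned dense data
y = rng.randn(12)
w = rng.randint(0, 4, size=12)  # integer weights, including zeros

reg_weighted = LinearRegression().fit(X, y, sample_weight=w)
reg_repeated = LinearRegression().fit(np.repeat(X, w, axis=0), np.repeat(y, w))
print(np.allclose(reg_weighted.coef_, reg_repeated.coef_))  # True
```

A sparse variant of this check would fit the same estimator on a CSR version of the repeated data.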
@antoinebaker this PR is still marked as draft, but I have the feeling that it is now ready for review (actually, I just did). Could you confirm?
Fix LinearRegression's numerical stability on rank deficient data by setting the cond parameter in the call to scipy.linalg.lstsq
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
LGTM
Can I work on this?
Reference Issues/PRs
Same as #30030 but keeping the `scipy.linalg.lstsq` solver.

#29818 and #26164 revealed that `LinearRegression` was failing the sample weight consistency check (using weights should be equivalent to removing/repeating samples).
Related to #22947 #25948
What does this implement/fix? Explain your changes.
The `scipy.linalg.lstsq` solver can fail the sample weight consistency test, especially for wide datasets (`n_features > n_samples`) after centering `X, y` (as done when `fit_intercept=True`).

Setting the `cond` parameter (cut-off ratio on singular values) to the value recommended by the `numpy.linalg.lstsq` documentation seems to fix the bug.
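To make the recommended value concrete, here is a sketch of the scenario described above (illustrative data, not the PR's diff): the numpy docs suggest a cutoff of `max(M, N)` times the machine epsilon of the data's dtype.

```python
# Sketch of the scenario above (illustrative data, not the PR's diff):
# wide data, centered as fit_intercept=True does, solved with the cutoff
# recommended by the numpy.linalg.lstsq documentation:
#     rcond = max(M, N) * machine_epsilon
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
X = rng.randn(5, 8)        # wide: n_features > n_samples
y = rng.randn(5)
Xc = X - X.mean(axis=0)    # centering makes the rows sum to zero...
yc = y - y.mean()

cond = max(Xc.shape) * np.finfo(Xc.dtype).eps
coef, _, rank, _ = linalg.lstsq(Xc, yc, cond=cond)
print(rank)  # 4: ...so the effective rank drops from 5 to 4
```

With this cutoff, the spurious direction introduced by centering is discarded rather than amplified, which is what stabilises the solution.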