
Fix LinearRegression's numerical stability on rank deficient data by setting the cond parameter in the call to scipy.linalg.lstsq #30040


Conversation

@antoinebaker (Contributor) commented Oct 10, 2024

Reference Issues/PRs

Same as #30030 but keeping scipy.linalg.lstsq solver.

#29818 and #26164 revealed that LinearRegression was failing the sample weight consistency check (using weights should be equivalent to removing/repeating samples).
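The consistency property under test can be sketched as follows (a minimal illustration of the idea, not the actual scikit-learn check; shapes, seeds, and variable names are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
X = rng.randn(10, 20)  # wide dataset: n_features > n_samples
y = rng.randn(10)
sw = rng.randint(1, 4, size=10)  # integer sample weights

# Fitting with integer sample weights ...
reg_weighted = LinearRegression().fit(X, y, sample_weight=sw)

# ... should be equivalent to repeating each sample sw[i] times.
X_rep, y_rep = np.repeat(X, sw, axis=0), np.repeat(y, sw)
reg_repeated = LinearRegression().fit(X_rep, y_rep)

# The check asserts that the two fits agree; this is what used to fail
# on rank-deficient data before the cond fix in this PR.
```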

Related to #22947 #25948

What does this implement/fix? Explain your changes.

The scipy.linalg.lstsq solver can fail the sample weight consistency test, especially for wide datasets (n_features > n_samples) after centering X and y (as done when fit_intercept=True).

Setting the cond parameter (the cut-off ratio on singular values) to the value recommended by the numpy.linalg.lstsq documentation seems to fix the bug.
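The fix idea can be sketched as follows (illustrative shapes and values, not the exact scikit-learn code):

```python
import numpy as np
from scipy.linalg import lstsq

rng = np.random.RandomState(0)
X = rng.randn(10, 20)  # wide, hence rank deficient
y = rng.randn(10)

# Cutoff recommended by the numpy.linalg.lstsq docs: singular values
# smaller than cond * largest_singular_value are treated as zero.
cond = max(X.shape) * np.finfo(X.dtype).eps
coef, residues, rank, singular_values = lstsq(X, y, cond=cond)
```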

@antoinebaker changed the title from "use cond parameter [all random seeds]" to "Fix LinearRegression sample weight bug (scipy)" Oct 10, 2024
@antoinebaker (Contributor, Author) commented:

The scipy solver indeed passes when setting cond = max(X.shape) * np.finfo(X.dtype).eps (new numpy default) or just cond = np.finfo(X.dtype).resolution, but fails when using cond = np.finfo(X.dtype).eps.

@ogrisel (Member) commented Oct 14, 2024

The scipy solver indeed passes when setting cond = max(X.shape) * np.finfo(X.dtype).eps (new numpy default) or just cond = np.finfo(X.dtype).resolution, but fails when using cond = np.finfo(X.dtype).eps.

We probably need to run a few variations of test_linear_regression_sample_weight_consistency with different values of n_samples and n_features to make sure that our choice for cond is robust.

@antoinebaker (Contributor, Author) commented Oct 15, 2024

Indeed, it's quite difficult to find a robust choice for cond. For instance, cond=resolution will often fail for n_samples, n_features = 100, 100, but increasing cond can make it stable again. Testing locally, the numpy default max(X.shape)*eps seems robust across data shapes.

There is some explanation on choosing this cutoff ratio for singular values in:
https://numpy.org/doc/stable/reference/generated/numpy.linalg.matrix_rank.html
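The effect of the cutoff on the effective rank can be illustrated directly on the singular values (a sketch; the duplicated column is just one way to force exact rank deficiency):

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(100, 100)
A[:, -1] = A[:, 0]  # force exact rank deficiency: true rank is 99
s = np.linalg.svd(A, compute_uv=False)

eps = np.finfo(A.dtype).eps
# Count singular values above each cutoff (relative to the largest one):
rank_default = np.sum(s > max(A.shape) * eps * s[0])  # numpy's default cutoff
rank_eps = np.sum(s > eps * s[0])  # tighter cutoff, may overestimate the rank
```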

@ogrisel (Member) commented Oct 21, 2024

The test fails on data of shape (100, 100) with considerable discrepancies:

E           Max absolute difference among violations: 0.00012815
E           Max relative difference among violations: 0.02112615

I am surprised that using fit_intercept=False seems to make the problem worse.

Could you try to re-run with (r)cond=np.finfo(A.dtype).eps / 2. * np.sqrt(m + n + 1.) from the expected round-off error quoted in the docstring of the numpy.linalg.matrix_rank function? Both with numpy and scipy?

@ogrisel (Member) commented Oct 21, 2024

Note that the cond parameter of scipy is documented as:

cond float, optional

Cutoff for ‘small’ singular values; used to determine effective rank of a. Singular values smaller than cond * largest_singular_value are considered zero.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.lstsq.html

This is not the same as the rcond parameter in numpy:

rcond float, optional

Cut-off ratio for small singular values of a. For the purposes of rank determination, singular values are treated as zero if they are smaller than rcond times the largest singular value of a. The default uses the machine precision times max(M, N).

https://numpy.org/doc/stable/reference/generated/numpy.linalg.lstsq.html

The rcond parameter in numpy is a threshold relative to the largest singular value, while the cond threshold in scipy is independent of the magnitude of the largest singular value.

EDIT: I misread the scipy docstring. Both definitions match even if the name of the parameter is different.
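One quick way to convince oneself that the two parametrizations match (a sketch; both thresholds are relative to the largest singular value):

```python
import numpy as np
from scipy.linalg import lstsq as scipy_lstsq

rng = np.random.RandomState(0)
X = rng.randn(10, 20)
y = rng.randn(10)
cutoff = max(X.shape) * np.finfo(X.dtype).eps

# scipy names the cutoff `cond`, numpy names it `rcond`, but both are
# relative thresholds, so the effective ranks and minimum-norm
# solutions should agree.
coef_scipy, _, rank_scipy, _ = scipy_lstsq(X, y, cond=cutoff)
coef_numpy, _, rank_numpy, _ = np.linalg.lstsq(X, y, rcond=cutoff)
```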

@ogrisel (Member) commented Oct 21, 2024

I have the feeling that only the numpy parametrization will make it possible to achieve what we are looking for.

@ogrisel (Member) commented Oct 21, 2024

The test fails with the data with shape (100, 100) with considerable discrepancies:

Actually, the same tests pass with dense numpy arrays, so the problem appears to be specific to the use of scipy sparse data structures. I realize that for the sparse case, we use the "lsqr" solver, which has no direct equivalent to the (r)cond parameter of lstsq because it does not rely on an explicit SVD internally.

So we should probably focus this particular PR on the fix for the dense case and open a dedicated issue and PR for the sparse case.
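For reference, the sparse path goes through scipy.sparse.linalg.lsqr, whose behavior on rank-deficient problems is controlled by other knobs (a sketch, not the scikit-learn code; shapes and tolerances are illustrative):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import lsqr

rng = np.random.RandomState(0)
X = sparse_random(10, 20, density=0.5, random_state=rng, format="csr")
y = rng.randn(10)

# No `cond` parameter here: lsqr never forms an SVD. On rank-deficient
# problems its behavior is governed by `damp` (a Tikhonov term) and the
# stopping tolerances `atol`/`btol`.
coef = lsqr(X, y, damp=0.0, atol=1e-10, btol=1e-10)[0]
```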

@antoinebaker (Contributor, Author) commented:
The added xfail tag (on csr arrays) is not ideal, because the test actually passes for X_shape=(10, 5) and for 98/100 seeds for X_shape=(10, 20). It always fails for X_shape=(100, 100). But I guess it still serves as a reminder to fix #30131.

@antoinebaker (Contributor, Author) commented:
Could you try to re-run with (r)cond=np.finfo(A.dtype).eps / 2. * np.sqrt(m + n + 1.) from the expected round-off error quoted in the docstring of the numpy.linalg.matrix_rank function? Both with numpy and scipy?

It fails with (r)cond=np.finfo(A.dtype).eps / 2. * np.sqrt(m + n + 1.) for both scipy and numpy on X_shape=(100,100).

Review comment on the diff removing the xfail tag (reason: "sample_weight is not equivalent to removing/repeating samples.") from the estimator tags:

The fact we can remove this while there is still a problem with sparse inputs makes me realize that we should expand check_sample_weight_equivalence to also test fitting with sparse inputs (when the estimator accepts sparse inputs). Let's open a dedicated PR for this (e.g. by introducing a new check named check_sample_weight_equivalence_on_sparse_data).

@ogrisel (Member) commented Oct 22, 2024

@antoinebaker this PR is still marked as draft, but I have the feeling that it is now ready for review (actually, I just did). Could you confirm?

@ogrisel changed the title from "Fix LinearRegression sample weight bug (scipy)" to "Fix LinearRegression's numerical stability on rank deficient data by setting the cond parameter in the call to scipy.linalg.lstsq" Oct 22, 2024
@antoinebaker antoinebaker marked this pull request as ready for review October 22, 2024 16:34
@ogrisel added the "Quick Review" and "Waiting for Second Reviewer" labels Oct 23, 2024
@thomasjpfan (Member) left a review:
LGTM

@thomasjpfan thomasjpfan merged commit bef9d18 into scikit-learn:main Oct 24, 2024
38 checks passed
@himanshuyadav666 commented:
can I work?
