Fix `LinearRegression`'s numerical stability on rank deficient data by setting the `cond` parameter in the call to `scipy.linalg.lstsq` #30040
Conversation
test_linear_regression_sample_weight_consistency
The scipy solver indeed passes when setting `cond`.
We probably need to run a few variations of this test.
Indeed it's quite difficult to find a good robust choice for `cond`. There is some explanation on choosing this cutoff ratio for singular values in the `numpy.linalg.lstsq` documentation.
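To make the role of this cutoff concrete, here is a minimal sketch (made-up data, not this PR's code) showing that singular values below `cond * largest_singular_value` are treated as zero by `scipy.linalg.lstsq`:

```python
# Illustrative sketch (not the PR's code): the `cond` cutoff in
# scipy.linalg.lstsq treats singular values below
# cond * largest_singular_value as zero, truncating near-degenerate
# directions instead of amplifying them.
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
X = rng.randn(10, 5)
X[:, 4] = X[:, 3] + 1e-8 * rng.randn(10)  # nearly duplicated column
y = rng.randn(10)

# The default cutoff is on the order of machine precision: the tiny
# singular value of the near-duplicate column is kept (rank 5).
_, _, rank_default, _ = linalg.lstsq(X, y)

# A larger cutoff discards it, stabilising the solution (rank 4).
_, _, rank_cut, _ = linalg.lstsq(X, y, cond=1e-6)
print(rank_default, rank_cut)  # 5 4
```

The difficulty discussed above is precisely that no single `cond` value is right for all conditioning regimes.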
The test still fails for some data shapes.
I am surprised by that. Could you try to re-run it?
Note that the `cond` parameter of `scipy.linalg.lstsq` (https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.lstsq.html) is not the same as the `rcond` parameter of `numpy.linalg.lstsq` (https://numpy.org/doc/stable/reference/generated/numpy.linalg.lstsq.html).
EDIT: I misread the scipy docstring. Both definitions match even if the name of the parameter is different.
I have the feeling that only the numpy parametrization will make it possible to achieve what we are looking for.
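For what it's worth, a quick sketch with arbitrary data suggests the two parametrizations do behave the same when given the same cutoff value (both are relative cutoffs on singular values):

```python
# Sketch with arbitrary data: scipy's `cond` and numpy's `rcond` both act
# as a relative cutoff on singular values, so with the same value the two
# minimum-norm least-squares solutions should agree.
import numpy as np
from scipy import linalg

rng = np.random.RandomState(42)
X = rng.randn(6, 8)   # wide problem: n_features > n_samples
y = rng.randn(6)

cutoff = max(X.shape) * np.finfo(X.dtype).eps
coef_scipy = linalg.lstsq(X, y, cond=cutoff)[0]
coef_numpy = np.linalg.lstsq(X, y, rcond=cutoff)[0]
print(np.allclose(coef_scipy, coef_numpy))  # True
```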
Actually, the same tests pass with dense numpy arrays. So the problem appears to be specific to the use of scipy sparse datastructures. I realize that for the sparse case, we use the "lsqr" solver, which has no direct equivalent to the `cond` parameter.

So we should probably focus this particular PR on the fix for the dense case and open a dedicated issue+PR for the fix of the sparse case.
…ample_weight_bug_scipy
The added xfail tag (on csr array) is not ideal, because the test is actually passing in some cases.
It fails in others.
test_linear_regression_sample_weight_consistency
doc/whats_new/upcoming_changes/sklearn.linear_model/30040.fix.rst
                "sample_weight is not equivalent to removing/repeating samples."
            ),
        }
        return tags
The fact we can remove this while there is still a problem with sparse inputs makes me realize that we should expand `check_sample_weight_equivalence` to also test fitting with sparse inputs (when the estimator accepts sparse inputs). Let's open a dedicated PR for this (e.g. by introducing a new check named `check_sample_weight_equivalence_on_sparse_data`).
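To make the intent of such a check concrete, here is a minimal sketch of the equivalence property itself (not scikit-learn's actual check implementation), on dense data where the solver behaves well:

```python
# Minimal sketch of the property the check verifies (not scikit-learn's
# actual check code): fitting with integer sample_weight should match
# fitting on the correspondingly repeated rows.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(12, 3)            # tall, well-conditioned dense data
y = rng.randn(12)
w = rng.randint(0, 4, size=12)  # integer weights, including zeros

reg_weighted = LinearRegression().fit(X, y, sample_weight=w)
reg_repeated = LinearRegression().fit(np.repeat(X, w, axis=0), np.repeat(y, w))
print(np.allclose(reg_weighted.coef_, reg_repeated.coef_))  # True
```

A sparse variant of this check would fit the same estimator on a CSR version of the repeated data.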
@antoinebaker this PR is still marked as draft, but I have the feeling that it is now ready for review (actually, I just did). Could you confirm?
Fix LinearRegression's numerical stability on rank deficient data by setting the cond parameter in the call to scipy.linalg.lstsq
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
LGTM
Can I work on this?
Reference Issues/PRs
Same as #30030 but keeping the `scipy.linalg.lstsq` solver.

#29818 and #26164 revealed that `LinearRegression` was failing the sample weight consistency check (using weights should be equivalent to removing/repeating samples).
Related to #22947 #25948
What does this implement/fix? Explain your changes.
The `scipy.linalg.lstsq` solver can fail the sample weight consistency test, especially for wide datasets (`n_features > n_samples`) after centering `X, y` (as done when `fit_intercept=True`).

Setting the `cond` parameter (cut-off ratio on singular values) to the value recommended by the `numpy.linalg.lstsq` documentation seems to fix the bug.
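To make the recommended value concrete, here is a sketch of the scenario described above (illustrative data, not the PR's diff): the numpy docs suggest a cutoff of `max(M, N)` times the machine epsilon of the data's dtype.

```python
# Sketch of the scenario above (illustrative data, not the PR's diff):
# wide data, centered as fit_intercept=True does, solved with the cutoff
# recommended by the numpy.linalg.lstsq documentation:
#     rcond = max(M, N) * machine_epsilon
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
X = rng.randn(5, 8)        # wide: n_features > n_samples
y = rng.randn(5)
Xc = X - X.mean(axis=0)    # centering makes the rows sum to zero...
yc = y - y.mean()

cond = max(Xc.shape) * np.finfo(Xc.dtype).eps
coef, _, rank, _ = linalg.lstsq(Xc, yc, cond=cond)
print(rank)  # 4: ...so the effective rank drops from 5 to 4
```

With this cutoff, the spurious direction introduced by centering is discarded rather than amplified, which is what stabilises the solution.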