
FIX LinearRegression sample weight bug (numpy solver) #30030


Conversation

antoinebaker
Contributor

Reference Issues/PRs

#29818 and #26164 revealed that LinearRegression was failing the sample weight consistency check (using weights should be equivalent to removing/repeating samples).

Related to #22947 #25948
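
The consistency property in question can be sketched with plain NumPy (an illustration of the check, not scikit-learn's actual test): fitting with positive integer weights should give the same coefficients as fitting on a dataset where each row is repeated according to its weight.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 10, 5
X = rng.randn(n_samples, n_features)
y = rng.randn(n_samples)
sw = rng.randint(1, 4, size=n_samples)  # positive integer weights

# Weighted least squares via row rescaling by sqrt(weight)
sqrt_sw = np.sqrt(sw)
coef_w, *_ = np.linalg.lstsq(X * sqrt_sw[:, None], y * sqrt_sw, rcond=None)

# The same problem with each row repeated sw[i] times
X_rep, y_rep = np.repeat(X, sw, axis=0), np.repeat(y, sw)
coef_r, *_ = np.linalg.lstsq(X_rep, y_rep, rcond=None)

print(np.allclose(coef_w, coef_r))  # → True
```

On this long, well-conditioned problem both formulations share the same normal equations, so any correct solver should agree; the bug reported here is that this breaks down for wide, centered data.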

What does this implement/fix? Explain your changes.

The scipy.linalg.lstsq solver fails the sample weight consistency test for wide datasets (n_features > n_samples) after centering X, y (as done when fit_intercept=True).

Using the numpy.linalg.lstsq solver seems to fix the bug.

test_linear_regression_sample_weight_consistency now uses n_features > n_samples to fail more consistently.


github-actions bot commented Oct 8, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4c8d6e9.

@glemaitre
Member

Just out of curiosity, are NumPy and SciPy using different implementations of LAPACK? Is there a way to choose the right driver in SciPy that would provide the same behaviour as NumPy's, or is the issue elsewhere?

@antoinebaker
Contributor Author

antoinebaker commented Oct 8, 2024

Just out of curiosity, are NumPy and SciPy using different implementations of LAPACK? Is there a way to choose the right driver in SciPy that would provide the same behaviour as NumPy's, or is the issue elsewhere?

In scipy the test unfortunately fails for all lapack_driver options.

@glemaitre
Member

Looking a bit more, it seems that NumPy uses the GELSD driver.

https://github.com/numpy/numpy/blob/v2.0.0/numpy/linalg/umath_linalg.cpp#L4037
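
For reference, scipy.linalg.lstsq lets the caller pick the LAPACK driver explicitly, whereas NumPy hardcodes GELSD. A quick sketch of trying each driver on a wide problem like the one in the failing test:

```python
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
X, y = rng.randn(5, 8), rng.randn(5)  # wide: n_features > n_samples

# 'gelsd' is the driver NumPy hardcodes; scipy also offers 'gelsy' and 'gelss'
for driver in ("gelsd", "gelsy", "gelss"):
    coef, _, rank, _ = linalg.lstsq(X, y, lapack_driver=driver)
    print(driver, coef.shape, rank)
```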

@ogrisel
Member

ogrisel commented Oct 9, 2024

Note that before merging such a change we would need to verify whether there is any speed impact on a few synthetic datasets with various shapes and sizes. But first we need to make all the tests pass consistently.

Some of the tests fail because of a future warning for the rcond parameter, whose default value will change. Let's set it explicitly to the future default value to silence the warning and make the code behave consistently across NumPy (or SciPy) versions, e.g. rcond=max(X.shape) * np.finfo(X.dtype).resolution or rcond=max(X.shape) * np.finfo(X.dtype).eps.
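
A minimal sketch of pinning the cut-off to the future NumPy default (illustrative shapes, not the actual estimator code):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(5, 8)  # wide dataset, as in the failing check
y = rng.randn(5)

# Pin the cut-off to the future NumPy default so behaviour is identical
# across versions and the FutureWarning is silenced
rcond = max(X.shape) * np.finfo(X.dtype).eps
coef, residues, rank, singular = np.linalg.lstsq(X, y, rcond=rcond)
print(coef.shape)  # → (8,)
```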

Note that scipy.linalg.lstsq also exposes a cond parameter with a seemingly different meaning, but it's not clear what the default cond=None means and how it interacts with the choice of the lapack driver.

So maybe scipy is not buggy, and it's just our way of calling it with the default cond=None that is a bug. See also my previous attempt at using cond=1e-15 (which is the same as cond=np.finfo(X.dtype).resolution with float64 input data): #26164 (comment)
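
An equivalent sketch on the scipy side, assuming cond plays the role of numpy's rcond (the scipy docs describe it as a relative cut-off for small singular values):

```python
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
X, y = rng.randn(5, 8), rng.randn(5)

# cond is relative: singular values below cond * s_max are treated as
# zero; resolution is 1e-15 for float64
coef, residues, rank, singular = linalg.lstsq(
    X, y, cond=np.finfo(X.dtype).resolution
)
print(rank)  # → 5
```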

Once we have settled on a choice of setting scipy cond or numpy rcond, please push a commit with the following commit message to run the tests with all admissible random seed values on each of the CI workers (different versions of numpy/scipy, CPU architectures / operating systems):

Trigger CI for [all random seeds]
test_linear_regression_sample_weight_consistency

As explained in #28959.

EDIT: fixed a typo in the commit message used to trigger all random seeds tests.

@antoinebaker
Contributor Author

Indeed @ogrisel, the cut-off ratio for the singular values (cond/rcond) seems to be the real culprit!
On my machine, both numpy.linalg.lstsq and scipy.linalg.lstsq fail with cond=eps (the old numpy default value).
However, they both pass with cond=max(X.shape)*eps (the new numpy default value), whichever lapack_driver is used for scipy.
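
The effect of the cut-off is easiest to see on an exactly rank-deficient system, where it decides whether a (near-)zero singular value is discarded (a sketch, unrelated to the actual test data):

```python
import numpy as np

rng = np.random.RandomState(0)
# Fourth column is exactly zero, so one singular value is exactly 0
A = np.hstack([rng.randn(6, 3), np.zeros((6, 1))])
b = rng.randn(6)

eps = np.finfo(A.dtype).eps
# New-default cut-off: singular values below max(A.shape) * eps * s_max
# are treated as zero when computing the minimum-norm solution
coef, _, rank, sv = np.linalg.lstsq(A, b, rcond=max(A.shape) * eps)
print(rank)  # → 3
```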

I guess we need the benchmark to decide between numpy and the various options for scipy.

@antoinebaker
Contributor Author

The remaining failures are only for csr matrices (one seed, and just below tolerance); I think I made the test a bit too stringent by increasing n_features.

@ogrisel
Member

ogrisel commented Oct 9, 2024

Could you then please open a similar but concurrent draft PR that attempts to set the cond parameter for scipy.linalg.lstsq instead, and try to make the test pass for this with all random seeds?

@@ -673,7 +673,11 @@ def rmatvec(b):
)
self.coef_ = np.vstack([out[0] for out in outs])
else:
self.coef_, _, self.rank_, self.singular_ = linalg.lstsq(X, y)
Member

In case we keep scipy lstsq, it might be safe here to pass check_finite=False since we validate X and y. Or do we consider that the centering could produce inf and nan for some reason?

@antoinebaker (Contributor Author) commented Oct 10, 2024

Seems like a good idea to improve performance; I'll add the option to the benchmark. After validating X, y and checking sample_weight I think we are safe, except that _check_sample_weight with ensure_non_negative=True still allows all-zero weights, so we should probably raise an error in that case.
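
A minimal sketch of such a guard (hypothetical helper name, not the scikit-learn API):

```python
import numpy as np

def check_sample_weight_not_all_zero(sample_weight):
    """Hypothetical guard rejecting all-zero weight vectors.

    scikit-learn's _check_sample_weight(..., ensure_non_negative=True)
    rejects negative entries but accepts an all-zero vector, which makes
    the weighted least-squares problem degenerate.
    """
    sw = np.asarray(sample_weight, dtype=float)
    if np.any(sw < 0):
        raise ValueError("Negative sample weights are not allowed.")
    if not np.any(sw > 0):
        raise ValueError("At least one sample weight must be positive.")
    return sw

try:
    check_sample_weight_not_all_zero(np.zeros(5))
except ValueError as exc:
    print(exc)  # → At least one sample weight must be positive.
```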

@antoinebaker
Contributor Author

I reverted to the old test (with n_samples, n_features = 10, 5) as it was already able to detect the failure, especially when testing all random seeds. But maybe we could refactor this test once #29818 is merged, for instance with a sparse version of check_sample_weight_equivalence?

@ogrisel also mentioned a more extensive test suite (beyond just sample weights) for LinearRegression and Ridge (sparse/dense, fit_intercept=True/False, wide/long datasets, very ill-conditioned or not, etc.).

@ogrisel
Member

ogrisel commented Oct 11, 2024

The more extensive test suite would be better tackled by reviving the stalled #25948 PR.

@ogrisel
Member

ogrisel commented Oct 11, 2024

@antoinebaker could you please add a changelog entry targeting scikit-learn 1.6?

@ogrisel
Member

ogrisel commented Oct 11, 2024

Please also open an alternative PR that does the minimal fix to make the test pass with scipy.linalg.lstsq instead of numpy.linalg.lstsq.

I suspect that scipy.linalg.lstsq(..., cond=np.finfo(X.dtype).resolution) might work, but I would rather check with an [all random seeds] CI run for that option as well.

@antoinebaker
Contributor Author

antoinebaker commented Oct 11, 2024

Oh sorry @ogrisel I forgot to link to the concurrent PR #30040. All tests are passing with the scipy solver, with cond set to the numpy default or just the resolution.

Doing some naive benchmarks, scipy with lapack_driver='gelsy' seems to be a bit faster, in agreement with the documentation, but it really depends on the data shape.

[benchmark plots: fit time for the numpy and scipy lstsq solvers across data shapes]
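
A naive timing sketch along those lines (timings vary with the BLAS/LAPACK build and data shape, hence no expected output):

```python
import timeit

import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
X, y = rng.randn(200, 50), rng.randn(200)

# Compare the numpy solver against two scipy drivers on one shape;
# a real benchmark would sweep n_samples/n_features and dtypes.
t_numpy = timeit.timeit(lambda: np.linalg.lstsq(X, y, rcond=None), number=20)
t_gelsd = timeit.timeit(
    lambda: linalg.lstsq(X, y, lapack_driver="gelsd"), number=20
)
t_gelsy = timeit.timeit(
    lambda: linalg.lstsq(X, y, lapack_driver="gelsy"), number=20
)
print(sorted([("numpy", t_numpy), ("gelsd", t_gelsd), ("gelsy", t_gelsy)],
             key=lambda kv: kv[1]))
```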

@antoinebaker antoinebaker changed the title FIX LinearRegression sample weight bug FIX LinearRegression sample weight bug (numpy solver) Oct 22, 2024
@ogrisel
Member

ogrisel commented Oct 22, 2024

Closing in favor of #30040.

@ogrisel ogrisel closed this Oct 22, 2024