
FIX LinearRegression sample weight bug (numpy solver) #30030


Conversation

antoinebaker
Contributor

Reference Issues/PRs

#29818 and #26164 revealed that LinearRegression was failing the sample weight consistency check (using weights should be equivalent to removing/repeating samples).

Related to #22947 #25948
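
The consistency property in question can be sketched with plain NumPy (an illustration of the check, not scikit-learn's actual test): fitting with positive integer weights should give the same coefficients as fitting on a dataset where each row is repeated according to its weight.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 10, 5
X = rng.randn(n_samples, n_features)
y = rng.randn(n_samples)
sw = rng.randint(1, 4, size=n_samples)  # positive integer weights

# Weighted least squares via row rescaling by sqrt(weight)
sqrt_sw = np.sqrt(sw)
coef_w, *_ = np.linalg.lstsq(X * sqrt_sw[:, None], y * sqrt_sw, rcond=None)

# The same problem with each row repeated sw[i] times
X_rep, y_rep = np.repeat(X, sw, axis=0), np.repeat(y, sw)
coef_r, *_ = np.linalg.lstsq(X_rep, y_rep, rcond=None)

print(np.allclose(coef_w, coef_r))  # → True
```

On this long, well-conditioned problem both formulations share the same normal equations, so any correct solver should agree; the bug reported here is that this breaks down for wide, centered data.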

What does this implement/fix? Explain your changes.

The scipy.linalg.lstsq solver fails the sample weight consistency test for wide datasets (n_features > n_samples) after centering X, y (as done when fit_intercept=True).

Using the numpy.linalg.lstsq solver seems to fix the bug.

test_linear_regression_sample_weight_consistency now uses n_features > n_samples to fail more consistently.


github-actions bot commented Oct 8, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4c8d6e9.

@glemaitre
Member

Just out of curiosity, are NumPy and SciPy using different implementations of LAPACK? Is there a way to choose the right driver in SciPy that would provide the same behaviour as NumPy's, or is the issue elsewhere?

@antoinebaker
Contributor Author

antoinebaker commented Oct 8, 2024

Just out of curiosity, are NumPy and SciPy using different implementations of LAPACK? Is there a way to choose the right driver in SciPy that would provide the same behaviour as NumPy's, or is the issue elsewhere?

In scipy the test unfortunately fails for all lapack_driver options.

@glemaitre
Member

Looking a bit more, it seems that NumPy uses the GELSD driver.

https://github.com/numpy/numpy/blob/v2.0.0/numpy/linalg/umath_linalg.cpp#L4037
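
For reference, scipy.linalg.lstsq lets the caller pick the LAPACK driver explicitly, whereas NumPy hardcodes GELSD. A quick sketch of trying each driver on a wide problem like the one in the failing test:

```python
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
X, y = rng.randn(5, 8), rng.randn(5)  # wide: n_features > n_samples

# 'gelsd' is the driver NumPy hardcodes; scipy also offers 'gelsy' and 'gelss'
for driver in ("gelsd", "gelsy", "gelss"):
    coef, _, rank, _ = linalg.lstsq(X, y, lapack_driver=driver)
    print(driver, coef.shape, rank)
```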

@ogrisel
Member

ogrisel commented Oct 9, 2024

Note that before merging such a change we would need to verify whether there is any speed impact on a few synthetic datasets with various shapes and sizes. But first we need to make all the tests pass consistently.

Some of the tests fail because of a future warning for the rcond parameter, whose default value will change. Let's set it explicitly to the future default value to silence the warning and make the code behave consistently across NumPy (or SciPy) versions, e.g. rcond=max(X.shape) * np.finfo(X.dtype).resolution or rcond=max(X.shape) * np.finfo(X.dtype).eps.
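
A minimal sketch of pinning the cut-off to the future NumPy default (illustrative shapes, not the actual estimator code):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(5, 8)  # wide dataset, as in the failing check
y = rng.randn(5)

# Pin the cut-off to the future NumPy default so behaviour is identical
# across versions and the FutureWarning is silenced
rcond = max(X.shape) * np.finfo(X.dtype).eps
coef, residues, rank, singular = np.linalg.lstsq(X, y, rcond=rcond)
print(coef.shape)  # → (8,)
```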

Note that scipy.linalg.lstsq also exposes a cond parameter with a seemingly different meaning, but it's not clear what the default cond=None means and how it interacts with the choice of the lapack driver.

So maybe scipy is not buggy, and it's just our way of calling it with the default cond=None that is a bug. See also my previous attempt at using cond=1e-15 (which is the same as cond=np.finfo(X.dtype).resolution with float64 input data): #26164 (comment)
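
An equivalent sketch on the scipy side, assuming cond plays the role of numpy's rcond (the scipy docs describe it as a relative cut-off for small singular values):

```python
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
X, y = rng.randn(5, 8), rng.randn(5)

# cond is relative: singular values below cond * s_max are treated as
# zero; resolution is 1e-15 for float64
coef, residues, rank, singular = linalg.lstsq(
    X, y, cond=np.finfo(X.dtype).resolution
)
print(rank)  # → 5
```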

Once we have settled on a choice of setting scipy cond or numpy rcond, please push a commit with the following commit message to run the tests with all admissible random seed values on each of the CI workers (different versions of numpy/scipy, CPU architectures / operating systems):

Trigger CI for [all random seeds]
test_linear_regression_sample_weight_consistency

As explained in #28959.

EDIT: fixed a typo in the commit message used to trigger all random seeds tests.

@antoinebaker
Contributor Author

Indeed @ogrisel, the cut-off ratio for the singular values (cond/rcond) seems to be the real culprit!
On my machine, both numpy.linalg.lstsq and scipy.linalg.lstsq fail with cond=eps (the old numpy default value).
However, they both pass with cond=max(X.shape)*eps (the new numpy default value), whichever lapack_driver is used for scipy.
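
The effect of the cut-off is easiest to see on an exactly rank-deficient system, where it decides whether a (near-)zero singular value is discarded (a sketch, unrelated to the actual test data):

```python
import numpy as np

rng = np.random.RandomState(0)
# Fourth column is exactly zero, so one singular value is exactly 0
A = np.hstack([rng.randn(6, 3), np.zeros((6, 1))])
b = rng.randn(6)

eps = np.finfo(A.dtype).eps
# New-default cut-off: singular values below max(A.shape) * eps * s_max
# are treated as zero when computing the minimum-norm solution
coef, _, rank, sv = np.linalg.lstsq(A, b, rcond=max(A.shape) * eps)
print(rank)  # → 3
```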

I guess we need the benchmark to decide between numpy and the various options for scipy.

@antoinebaker
Contributor Author

The remaining failures are only for csr matrices (one seed, and just below tolerance); I think I made the test a bit too stringent by increasing n_features.

@ogrisel
Member

ogrisel commented Oct 9, 2024

Could you then please open a similar but concurrent draft PR that attempts to set the cond parameter for scipy.linalg.lstsq instead, and try to make the test pass for this with all random seeds?

@@ -673,7 +673,11 @@ def rmatvec(b):
)
self.coef_ = np.vstack([out[0] for out in outs])
else:
self.coef_, _, self.rank_, self.singular_ = linalg.lstsq(X, y)
Member

In case we keep scipy lstsq, it might be safe here to pass check_finite=False since we validate X and y. Or do we consider that the centering could produce inf and nan for some reason?

@antoinebaker (Contributor Author) commented Oct 10, 2024

Seems like a good idea to improve performance; I'll add the option to the benchmark. After validating X, y and checking sample_weight I think we are safe, except that _check_sample_weight with ensure_non_negative=True still allows all-zero weights, so we should probably raise an error in that case.
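
A minimal sketch of such a guard (hypothetical helper name, not the scikit-learn API):

```python
import numpy as np

def check_sample_weight_not_all_zero(sample_weight):
    """Hypothetical guard rejecting all-zero weight vectors.

    scikit-learn's _check_sample_weight(..., ensure_non_negative=True)
    rejects negative entries but accepts an all-zero vector, which makes
    the weighted least-squares problem degenerate.
    """
    sw = np.asarray(sample_weight, dtype=float)
    if np.any(sw < 0):
        raise ValueError("Negative sample weights are not allowed.")
    if not np.any(sw > 0):
        raise ValueError("At least one sample weight must be positive.")
    return sw

try:
    check_sample_weight_not_all_zero(np.zeros(5))
except ValueError as exc:
    print(exc)  # → At least one sample weight must be positive.
```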

@antoinebaker
Contributor Author

I reverted to the old test (with n_samples, n_features = 10, 5) as it was already able to detect the failure, especially when testing all random seeds. But maybe we could refactor this test once #29818 is merged, for instance with a sparse version of check_sample_weight_equivalence?

@ogrisel also mentioned a more extensive test suite (beyond just sample weights) for LinearRegression and Ridge (sparse/dense, fit_intercept=True/False, wide/long datasets, very ill-conditioned or not, etc.).

@ogrisel
Member

ogrisel commented Oct 11, 2024

The more extensive test suite would be better tackled by reviving the stalled #25948 PR.

@ogrisel
Member

ogrisel commented Oct 11, 2024

@antoinebaker could you please add a changelog entry targeting scikit-learn 1.6?

@ogrisel
Member

ogrisel commented Oct 11, 2024

Please also open an alternative PR that does the minimal fix to make the test pass with scipy.linalg.lstsq instead of numpy.linalg.lstsq.

I suspect that scipy.linalg.lstsq(..., cond=np.finfo(X.dtype).resolution) might work, but I would rather check with an [all random seeds] CI run for that option as well.

@antoinebaker
Contributor Author

antoinebaker commented Oct 11, 2024

Oh sorry @ogrisel I forgot to link to the concurrent PR #30040. All tests are passing with the scipy solver, with cond set to the numpy default or just the resolution.

Doing some naive benchmarks, scipy with lapack_driver='gelsy' seems to be a bit faster, in agreement with the documentation, but it really depends on the data shape.

[benchmark plots: fit time for the numpy and scipy lstsq solvers across data shapes]
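
A naive timing sketch along those lines (timings vary with the BLAS/LAPACK build and data shape, hence no expected output):

```python
import timeit

import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
X, y = rng.randn(200, 50), rng.randn(200)

# Compare the numpy solver against two scipy drivers on one shape;
# a real benchmark would sweep n_samples/n_features and dtypes.
t_numpy = timeit.timeit(lambda: np.linalg.lstsq(X, y, rcond=None), number=20)
t_gelsd = timeit.timeit(
    lambda: linalg.lstsq(X, y, lapack_driver="gelsd"), number=20
)
t_gelsy = timeit.timeit(
    lambda: linalg.lstsq(X, y, lapack_driver="gelsy"), number=20
)
print(sorted([("numpy", t_numpy), ("gelsd", t_gelsd), ("gelsy", t_gelsy)],
             key=lambda kv: kv[1]))
```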

@antoinebaker antoinebaker changed the title FIX LinearRegression sample weight bug FIX LinearRegression sample weight bug (numpy solver) Oct 22, 2024
@ogrisel
Member

ogrisel commented Oct 22, 2024

Closing in favor of #30040.

@ogrisel ogrisel closed this Oct 22, 2024