FIX LinearRegression sample weight bug (numpy solver) #30030
Conversation
Just out of curiosity, are NumPy and SciPy using different implementations of LAPACK? Is there a way to choose the right driver in SciPy that would provide the same behaviour as NumPy, or is the issue elsewhere?
In scipy the test unfortunately fails for all `lapack_driver` options.
Looking a bit more, it seems that NumPy uses the GELSD driver: https://github.com/numpy/numpy/blob/v2.0.0/numpy/linalg/umath_linalg.cpp#L4037
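To illustrate the driver question discussed above, here is a small sketch (synthetic data, shapes chosen for illustration) comparing numpy's `lstsq`, which is backed by GELSD, with scipy's `lstsq` when the GELSD driver is requested explicitly:

```python
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
X = rng.randn(10, 5)
y = rng.randn(10)

# numpy.linalg.lstsq is backed by the LAPACK GELSD routine.
coef_np, _, rank_np, _ = np.linalg.lstsq(X, y, rcond=None)

# scipy.linalg.lstsq lets you pick the driver explicitly;
# 'gelsd' should match numpy's backend routine.
coef_sp, _, rank_sp, _ = linalg.lstsq(X, y, lapack_driver="gelsd")

print(np.allclose(coef_np, coef_sp))
```

On a well-conditioned tall problem like this one, both calls should return the same solution; the divergence discussed in this PR appears in the wide (`n_features > n_samples`) centered case.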
Note that before merging such a change we would need to verify whether there is any speed impact on a few synthetic datasets with various shapes and sizes. But first we need to make all the tests pass consistently. Some of the tests fail because of a future warning for the `rcond` default.
So maybe scipy is not buggy; it may just be our way of calling it with the default settings. Once settled on a choice of settings for scipy, we can verify the speed impact.
As explained in #28959. EDIT: fixed a typo in the commit message used to trigger all random seeds tests.
Indeed @ogrisel, the cut-off ratio for the singular values differs between the two. I guess we need the benchmark to decide between numpy and the various options for scipy.
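The cut-off behaviour can be sketched with numpy's `rcond` parameter: singular values below `rcond * s_max` are treated as zero, which changes the effective rank and hence the solution (synthetic example, values chosen only to make the effect visible):

```python
import numpy as np

# A nearly rank-deficient system: the 4th column almost duplicates
# the 1st, producing one tiny singular value.
rng = np.random.RandomState(0)
X = rng.randn(6, 3)
X = np.hstack([X, X[:, :1] + 1e-12 * rng.randn(6, 1)])
y = rng.randn(6)

# rcond sets the relative cut-off: singular values below rcond * s_max
# are discarded, which changes the effective rank reported by lstsq.
_, _, rank_loose, sv = np.linalg.lstsq(X, y, rcond=1e-15)
_, _, rank_tight, _ = np.linalg.lstsq(X, y, rcond=1e-6)
print(rank_loose, rank_tight)
```

With the loose cut-off the tiny singular value is kept (rank 4); with the tighter one it is discarded (rank 3). Different defaults between numpy's `rcond` and scipy's `cond` can therefore yield different solutions on ill-conditioned data.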
The remaining failures are only for
Could you then please open a similar but concurrent draft PR that attempts to set the scipy `lstsq` options instead?
@@ -673,7 +673,11 @@ def rmatvec(b):
            )
            self.coef_ = np.vstack([out[0] for out in outs])
        else:
            self.coef_, _, self.rank_, self.singular_ = linalg.lstsq(X, y)
In case we keep scipy `lstsq`, it might be safe here to pass `check_finite=False` since we validate `X` and `y`. Or do we consider that the centering could trigger `inf` and `nan` for some reason?
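A minimal sketch of the `check_finite=False` idea, assuming the inputs were already validated upstream (as scikit-learn does before reaching `lstsq`); the flag only skips scipy's internal finiteness scan and does not change the solution:

```python
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
X = rng.randn(20, 5)
y = rng.randn(20)

# check_finite=False skips scipy's own scan of X and y for inf/nan.
# This is only safe when the caller has already validated the inputs;
# otherwise non-finite values can crash or silently corrupt the solve.
coef, _, rank, sv = linalg.lstsq(X, y, check_finite=False)
coef_checked, _, _, _ = linalg.lstsq(X, y, check_finite=True)
print(np.allclose(coef, coef_checked))
```

Skipping the scan saves one pass over the data per call, which is the performance benefit being discussed.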
Seems like a good idea to improve the performance; I'll add the option in the benchmark. After validating `X, y` and checking `sample_weight`, I think we are safe, except that `_check_sample_weight` with `ensure_non_negative=True` still allows all-zero weights, so we should probably raise an error in that case.
I reverted to the old `test_linear_regression_sample_weight_consistency`. @ogrisel was also speaking of a more extensive test suite (beyond just the sample weight) for `LinearRegression`.
The more extensive test suite would be better tackled by reviving the stalled #25948 PR.
@antoinebaker could you please add a changelog entry targeting scikit-learn 1.6?
Please also open an alternative PR that does the minimal fix to make the test pass with the scipy solver.
Oh sorry @ogrisel, I forgot to link to the concurrent PR #30040. All tests (including `test_linear_regression_sample_weight_consistency`) are passing with the scipy solver there. I am doing some naive benchmarks as well.
Closing in favor of #30040.
Reference Issues/PRs
#29818 and #26164 revealed that LinearRegression was failing the sample weight consistency check (using weights should be equivalent to removing/repeating samples).
Related to #22947 #25948
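The consistency property being checked can be sketched directly with numpy: fitting with integer weights (via the usual `sqrt(w)` rescaling of the least-squares problem) should match fitting with each sample repeated weight-many times (synthetic data, shapes illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(8, 3)
y = rng.randn(8)
w = rng.randint(1, 4, size=8)  # integer weights so repetition is exact

# Weighted least squares via rescaling rows by sqrt(w).
s = np.sqrt(w)
coef_weighted, *_ = np.linalg.lstsq(X * s[:, None], y * s, rcond=None)

# The same fit with samples explicitly repeated w_i times.
X_rep = np.repeat(X, w, axis=0)
y_rep = np.repeat(y, w)
coef_repeated, *_ = np.linalg.lstsq(X_rep, y_rep, rcond=None)

print(np.allclose(coef_weighted, coef_repeated))
```

On this tall, full-rank problem the two solutions coincide; the PR is about the wide, centered case where the scipy solver violates this equivalence.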
What does this implement/fix? Explain your changes.
The `scipy.linalg.lstsq` solver fails the sample weight consistency test for wide datasets (`n_features > n_samples`) after centering `X, y` (as done when `fit_intercept=True`). Using the `numpy.linalg.lstsq` solver seems to solve the bug. `test_linear_regression_sample_weight_consistency` now uses `n_features > n_samples` to fail more consistently.
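A sketch of the scenario described above, reimplemented with numpy's solver (which, per this PR, passes the check): a wide dataset, weighted centering as done for `fit_intercept=True`, then the weight-vs-repetition comparison. Function and variable names are illustrative, not scikit-learn internals:

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 4, 10  # wide: n_features > n_samples
X = rng.randn(n_samples, n_features)
y = rng.randn(n_samples)
w = rng.randint(1, 4, size=n_samples)  # integer weights

def fit_centered(X, y, sw):
    # Center X and y with their weighted means (the fit_intercept=True
    # preprocessing), then solve weighted least squares with numpy.
    x_mean = np.average(X, axis=0, weights=sw)
    y_mean = np.average(y, weights=sw)
    Xc, yc = X - x_mean, y - y_mean
    s = np.sqrt(sw)
    coef, *_ = np.linalg.lstsq(Xc * s[:, None], yc * s, rcond=None)
    return coef

coef_weighted = fit_centered(X, y, w.astype(float))
coef_repeated = fit_centered(np.repeat(X, w, axis=0),
                             np.repeat(y, w),
                             np.ones(int(w.sum())))
print(np.allclose(coef_weighted, coef_repeated))
```

In exact arithmetic both formulations share the same minimum-norm solution, so a good SVD-based solver should make them agree; the reported bug is that scipy's `lstsq` with default settings does not on this kind of wide, centered problem.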