LinearRegression with zero sample_weights is not the same as excluding those rows #26164
Interestingly, I cannot reproduce:
I can reproduce it on my Linux machine through:

Note that my Mac has an ARM processor.
Could you try different random seeds on your Mac? I detected this bug while working on #15554 and then running
Indeed, this is an issue with the random seed. In both configs, it fails 20% of the time.
@lorentzenchr Does it mean that the
I don't know. If so, we should report it to scipy.
I confirm it also works on my M1 Mac (similar versions for the dependencies and openblas). @lorentzenchr can you reproduce the problem with

Can you also please try to change the LAPACK driver used by
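For reference, the driver can be chosen directly via the `lapack_driver` argument of `scipy.linalg.lstsq`. A minimal sketch comparing the three available drivers on an arbitrary full-rank problem (the data here is made up for illustration, not taken from the report):

```python
import numpy as np
from scipy.linalg import lstsq

rng = np.random.RandomState(0)
A = rng.rand(10, 5)  # full-rank, well-conditioned toy problem
b = rng.rand(10)

# Compare the three LAPACK drivers supported by scipy.linalg.lstsq.
# On a well-conditioned problem they should agree closely; the bug
# discussed here only shows up in (near-)rank-deficient cases.
for driver in ("gelsd", "gelsy", "gelss"):
    coef, *_ = lstsq(A, b, lapack_driver=driver)
    print(driver, coef)
```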
If you set the weights of 4 samples to 0 instead of 5, or alternatively increase the
I edited the script in the summary to show that it fails 20% of the time across 100 seeds.
I confirm that I also reproduce 20% failures with the new script. I also confirm that using
I will try to change the LAPACK driver and maybe play with the
With the
By setting

But I am not sure how to set a good value for
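For context on that knob: the `cond` argument of `scipy.linalg.lstsq` sets the relative cutoff below which singular values are treated as zero, which controls the reported effective rank. A hedged sketch on a synthetic near-rank-deficient matrix (values chosen only for illustration):

```python
import numpy as np
from scipy.linalg import lstsq

rng = np.random.RandomState(0)
# Make the last column a near-copy of the first, so the smallest
# singular value is tiny but nonzero.
A = rng.rand(20, 4)
A = np.c_[A, A[:, 0] + 1e-10 * rng.rand(20)]
b = rng.rand(20)

# With the default cond (machine-precision based), all 5 singular
# values are kept, so the effective rank is 5.
coef_default, _, rank_default, _ = lstsq(A, b)

# cond=1e-6 discards singular values below 1e-6 * sigma_max; the
# effective rank drops to 4 and lstsq returns the minimum-norm
# solution of the truncated problem instead.
coef_trunc, _, rank_trunc, _ = lstsq(A, b, cond=1e-6)
print(rank_default, rank_trunc)
```

The difficulty mentioned above is real: any fixed `cond` draws an arbitrary line between "numerically zero" and "small but meaningful" singular values.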
My take at this issue: it's the minimum norm issue with

```python
import numpy as np
from scipy.linalg import lstsq
import scipy.sparse as sparse
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(2)  # seed dependent, np.allclose should be False
n_samples, n_features = 10, 5
X = rng.rand(n_samples, n_features)
y = rng.rand(n_samples)
reg = LinearRegression()
sample_weight = rng.uniform(low=0.01, high=2, size=X.shape[0])
sample_weight_0 = sample_weight.copy()
sample_weight_0[-5:] = 0
y[-5:] *= 1000  # to make excluding those samples important

reg.fit(X, y, sample_weight=sample_weight_0)
coef_0 = np.r_[reg.coef_.copy(), reg.intercept_]

reg.fit(X[:-5], y[:-5], sample_weight=sample_weight[:-5])
coef_1 = np.r_[reg.coef_.copy(), reg.intercept_]

# For fun, we also use the lsqr solver by converting to sparse X.
reg.fit(sparse.csr_array(X), y, sample_weight=sample_weight_0)
coef_2 = np.r_[reg.coef_.copy(), reg.intercept_]  # same as coef_1

np.allclose(coef_0, coef_1, rtol=1e-6), np.allclose(coef_1, coef_2, rtol=1e-6)
```

Should be

```python
# We solve with plain lstsq and intercept term.
X_with_intercept = np.c_[X, np.ones(10)]
coef_lstsq = lstsq(
    np.sqrt(sample_weight_0)[:, None] * X_with_intercept,
    np.sqrt(sample_weight_0) * y,
)[0]

# Norm of residues
[
    np.linalg.norm(sample_weight_0 * (X_with_intercept @ coef_0 - y)),
    np.linalg.norm(sample_weight_0 * (X_with_intercept @ coef_1 - y)),
    np.linalg.norm(sample_weight_0 * (X_with_intercept @ coef_lstsq - y)),
]
```

All around 1e-16, so all are valid solutions.

```python
# Norm of coefficients
[
    np.linalg.norm(coef_0),
    np.linalg.norm(coef_1),
    np.linalg.norm(coef_lstsq),
]
```
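The minimum-norm behavior can also be seen in isolation: when the (weighted) system is rank-deficient, infinitely many coefficient vectors achieve the same residual, and `scipy.linalg.lstsq` returns the one with the smallest norm. A small self-contained illustration on an arbitrary underdetermined system (the data is made up):

```python
import numpy as np
from scipy.linalg import lstsq

rng = np.random.RandomState(0)
# Underdetermined system: 3 equations, 5 unknowns, so there are
# infinitely many exact least-squares solutions.
A = rng.rand(3, 5)
b = rng.rand(3)

coef_min_norm = lstsq(A, b)[0]

# Adding any null-space vector of A leaves the residual unchanged.
null_vec = np.linalg.svd(A)[2][-1]  # right singular vector for a zero singular value
coef_other = coef_min_norm + null_vec

print(np.linalg.norm(A @ coef_min_norm - b))  # essentially zero
print(np.linalg.norm(A @ coef_other - b))     # also essentially zero
# lstsq picked the solution with the smallest coefficient norm:
print(np.linalg.norm(coef_min_norm) < np.linalg.norm(coef_other))
```

Zeroing out 5 of the 10 sample weights effectively leaves 5 rows for 6 unknowns (5 coefficients plus the intercept), which is exactly this situation.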
As discussed in #22947 the

So I would rather pass

Later we can also discuss other solvers for
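On the solver point: as the script above notes, sparse `X` is routed through the `lsqr` solver, i.e. `scipy.sparse.linalg.lsqr`, an iterative least-squares method that can also be called directly. A minimal sketch (the matrix here is arbitrary):

```python
import numpy as np
import scipy.sparse as sparse
from scipy.sparse.linalg import lsqr
from scipy.linalg import lstsq

rng = np.random.RandomState(0)
A_dense = rng.rand(10, 5)
A = sparse.csr_matrix(A_dense)
b = rng.rand(10)

# lsqr iteratively minimizes ||A x - b||_2; the first return value is x.
coef_lsqr = lsqr(A, b)[0]

# For a well-conditioned full-rank problem it agrees with the direct solver.
coef_direct = lstsq(A_dense, b)[0]
print(np.allclose(coef_lsqr, coef_direct, rtol=1e-4, atol=1e-6))
```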
I close this as a duplicate in favor of #22947.
Describe the bug

Excluding rows having sample_weight == 0 in LinearRegression does not give the same results.

Steps/Code to Reproduce

Expected Results

Always True.

Actual Results

The print statement gives:

Versions