MRG fix Normalize for linear models when used with sample_weight #19426
Conversation
this one should be good to go
@ogrisel let's see if it helps to understand the problem. I wrote from scratch a minimal weighted ridge with normalization. The `X_var *= X.shape[0]` line basically means you take a scale that is given by n_samples times the weighted variance:

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model._base import _preprocess_data

np.random.seed(0)
n_samples = 100
X, y = make_regression(n_samples=n_samples, n_features=3)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Check that setting the sw to 0 for the first 10 samples
# is equivalent to removing them:
sw = np.ones(X_train.shape[0])
sw[:10] = 0.

params = dict(
    fit_intercept=True,
    normalize=True,
    alpha=1,
    solver='lsqr'
)
ridge_sw_zero = Ridge(**params).fit(X_train, y_train, sample_weight=sw)
ridge_trimmed = Ridge(**params).fit(X_train[10:], y_train[10:])


def ridge(X, y, sample_weight, alpha, X_test, y_test):
    # Weighted column means and variances.
    X_mean = np.average(X, weights=sample_weight, axis=0)
    X_var = (np.average(X**2, weights=sample_weight, axis=0) -
             np.average(X, weights=sample_weight, axis=0)**2)
    X_var *= X.shape[0]  # difference between normalize and standardize
    X_scale = np.sqrt(X_var)
    X = (X - X_mean) / X_scale
    y_mean = np.average(y, weights=sample_weight, axis=0)
    y = y - y_mean
    # Fold the sample weights into the centered, scaled design and target.
    X *= np.sqrt(sample_weight)[:, None]
    y *= np.sqrt(sample_weight)
    # Closed-form ridge solution on the preprocessed data.
    coef = np.linalg.solve(
        X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y
    )
    # Undo the scaling and recover the intercept.
    coef = coef / X_scale
    intercept = y_mean - np.dot(X_mean, coef)
    y_pred = np.dot(X_test, coef) + intercept
    print(r2_score(y_test, y_pred))
    return y_pred


assert_allclose(
    ridge_sw_zero.predict(X_test),
    ridge(X_train, y_train, sw, params['alpha'], X_test, y_test),
    rtol=1e-1
)
assert_allclose(
    ridge_trimmed.predict(X_test),
    ridge(X_train[10:], y_train[10:], sw[10:], params['alpha'], X_test, y_test),
    rtol=1e-1
)
assert_allclose(
    ridge(X_train, y_train, sw, params['alpha'], X_test, y_test),
    ridge(X_train[10:], y_train[10:], sw[10:], params['alpha'], X_test, y_test),
    rtol=1e-1
)
```
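To make the `X_var *= X.shape[0]` line concrete: with uniform weights, the `normalize=True` scale is the column-wise L2 norm of the centered data, i.e. `sqrt(n_samples) * std`, whereas StandardScaler divides by the standard deviation alone. A small unweighted sketch of that relationship (illustrative data only):

```python
# Illustration only: relationship between the normalize=True scale and the
# StandardScaler scale in the unweighted case.
import numpy as np
from numpy.testing import assert_allclose

rng = np.random.RandomState(0)
n_samples = 100
X = rng.randn(n_samples, 3) * [1.0, 5.0, 0.1]

X_centered = X - X.mean(axis=0)
l2_norm_scale = np.sqrt((X_centered ** 2).sum(axis=0))  # scale used by normalize=True
std_scale = X_centered.std(axis=0)                      # scale used by StandardScaler

assert_allclose(l2_norm_scale, np.sqrt(n_samples) * std_scale)
```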
Here is what the new code, after killing `normalize`, should do: just solve the problem with L-BFGS.

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from scipy.optimize import fmin_l_bfgs_b, check_grad

np.random.seed(0)
n_samples = 100
X, y = make_regression(n_samples=n_samples, n_features=3)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Check that setting the sw to 0 for the first 10 samples
# is equivalent to removing them:
sw = 10 * np.ones(X_train.shape[0])
sw[:10] = 0.

alpha = 1


def ridge_bfgs(X, y, sample_weight, alpha, X_test):
    def f(w):
        # Weighted squared loss + L2 penalty on the coefficients
        # (the intercept is not penalized).
        coef, intercept = w[:-1], w[-1]
        return (
            0.5 * np.sum(sample_weight * (y - X @ coef - intercept) ** 2) +
            0.5 * alpha * np.sum(coef ** 2)
        )

    def fprime(w):
        coef, intercept = w[:-1], w[-1]
        residual = y - X @ coef - intercept
        grad = np.empty(X.shape[1] + 1)
        grad[:-1] = -X.T @ (sample_weight * residual) + alpha * coef
        grad[-1] = -np.sum(sample_weight * residual)
        return grad

    dim = X.shape[1] + 1
    print(check_grad(f, fprime, np.random.RandomState(42).randn(dim)))
    w0 = np.zeros(dim)
    w0[-1] = np.average(y, weights=sample_weight)
    w = fmin_l_bfgs_b(f, w0, fprime)[0]
    coef, intercept = w[:-1], w[-1]
    y_pred = X_test @ coef + intercept
    return y_pred


y_pred = ridge_bfgs(X_train, y_train, sw, alpha, X_test)
y_pred_trimmed = ridge_bfgs(X_train[10:], y_train[10:], sw[10:], alpha, X_test)

assert_allclose(
    y_pred,
    y_pred_trimmed,
    rtol=1e-5
)
```
Thanks @agramfort. I don't understand the need for the init of the intercept coef in the l-bfgs solver: `w0[-1] = np.average(y, weights=sample_weight)`. Since the problem is convex, initializing the intercept at 0 should work, no? I tried and it seems to work on my machine.

@maikia I think it would be good to include Alex's last minimal snippet as a new test and check that we can reproduce the same results with our …
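A rough sketch of what such a test could look like (the test name, data, and tolerances here are placeholders, not the actual test added in this PR): it fits the same weighted ridge objective directly with L-BFGS and compares the result against scikit-learn's `Ridge`.

```python
# Sketch of a possible test: an L-BFGS reference fit of the weighted ridge
# objective should agree with Ridge fitted with sample_weight.
import numpy as np
from numpy.testing import assert_allclose
from scipy.optimize import fmin_l_bfgs_b
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge


def test_ridge_matches_lbfgs_reference():
    rng = np.random.RandomState(42)
    X, y = make_regression(n_samples=60, n_features=4, bias=3.0, noise=1.0,
                           random_state=0)
    sw = rng.uniform(0.5, 2.0, size=X.shape[0])
    sw[:10] = 0.0  # zero weight should behave like dropping the samples
    alpha = 1.0

    def f(w):
        coef, intercept = w[:-1], w[-1]
        return (0.5 * np.sum(sw * (y - X @ coef - intercept) ** 2)
                + 0.5 * alpha * np.sum(coef ** 2))

    def fprime(w):
        coef, intercept = w[:-1], w[-1]
        residual = y - X @ coef - intercept
        return np.concatenate(
            [-X.T @ (sw * residual) + alpha * coef, [-np.sum(sw * residual)]]
        )

    w = fmin_l_bfgs_b(f, np.zeros(X.shape[1] + 1), fprime)[0]
    ridge = Ridge(alpha=alpha, fit_intercept=True).fit(X, y, sample_weight=sw)

    assert_allclose(ridge.coef_, w[:-1], rtol=1e-3, atol=1e-6)
    assert_allclose(ridge.intercept_, w[-1], rtol=1e-3, atol=1e-6)
```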
Yes, `w0[-1] = np.average(y, weights=sample_weight)` is not necessary. It's just a better init, but it should not change anything.
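A tiny sketch (toy 1-D data, not from the thread) illustrating that point: since the weighted ridge objective is convex, L-BFGS reaches the same minimizer from a zero init and from the weighted-mean intercept init.

```python
# Toy check of init-independence for a convex weighted ridge objective.
import numpy as np
from numpy.testing import assert_allclose
from scipy.optimize import fmin_l_bfgs_b

rng = np.random.RandomState(0)
x = rng.randn(50)
y = 2.0 * x + 3.0 + 0.1 * rng.randn(50)
sw = rng.uniform(0.0, 2.0, size=50)
alpha = 1.0


def f(w):
    coef, intercept = w
    return (0.5 * np.sum(sw * (y - coef * x - intercept) ** 2)
            + 0.5 * alpha * coef ** 2)


w_from_zero = fmin_l_bfgs_b(f, np.zeros(2), approx_grad=True)[0]
w0 = np.array([0.0, np.average(y, weights=sw)])
w_from_mean = fmin_l_bfgs_b(f, w0, approx_grad=True)[0]

# Both starting points converge to the same (coef, intercept).
assert_allclose(w_from_zero, w_from_mean, rtol=1e-4, atol=1e-5)
```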
@agramfort in the l-bfgs snippet the output of `check_grad` is larger than I expected, but the gradient expression seems correct to me. Do you have any idea why that is? `dim` is 4 in this case, so I would expect this value to be much smaller.
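For reference, `check_grad` returns the 2-norm of the difference between `fprime` and a finite-difference approximation, which is an absolute error; with an unnormalized squared loss and weights of 10, the objective and its gradient are large at a random point, so the reported value can look big while the relative error stays tiny. A sketch (toy data assumed) of checking the relative error instead:

```python
# Sketch: check_grad reports an absolute 2-norm error; dividing by the
# gradient norm gives a relative error, which is the meaningful quantity here.
import numpy as np
from scipy.optimize import check_grad

rng = np.random.RandomState(42)
X = rng.randn(75, 3)
y = 100 * rng.randn(75)
sw = 10 * np.ones(75)
alpha = 1.0


def f(w):
    coef, intercept = w[:-1], w[-1]
    return (0.5 * np.sum(sw * (y - X @ coef - intercept) ** 2)
            + 0.5 * alpha * np.sum(coef ** 2))


def fprime(w):
    coef, intercept = w[:-1], w[-1]
    residual = y - X @ coef - intercept
    return np.concatenate(
        [-X.T @ (sw * residual) + alpha * coef, [-np.sum(sw * residual)]]
    )


w = rng.randn(4)
abs_err = check_grad(f, fprime, w)
rel_err = abs_err / np.linalg.norm(fprime(w))
print(abs_err, rel_err)  # abs_err can look large while rel_err stays tiny
```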
I have extended the l-bfgs code snippet to compare with the scikit-learn implementation: https://gist.github.com/ogrisel/ea98add276dc9624fea8e41a5bb6c989

I have also extended it (in the same gist) to add a check for the equivalence between fitting with …
The remaining failures happen in …

Note that in #19503 I extended the test to also cover the case where we pass … I fail to see a pattern in the failures: it can either fail on float32 or float64 (but not always the same one); sometimes it's all the solvers for a given task that are not consistent with the SVD solver, sometimes it's only the sparse-cg solver...
Reading the error message more carefully, it seems that the PR causes some coefs to have …
Ok, I think I fixed the problem with the handling of near-constant features.
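For context, the failure mode with (near-)constant features is that their computed scale is ~0, so dividing by it blows the rescaled coefficients up. A minimal sketch of the kind of guard involved (an illustration of the idea only, not the actual code of this PR):

```python
import numpy as np


def weighted_mean_and_scale(X, sample_weight):
    """Illustrative helper: weighted column means and normalize-style scales."""
    X_mean = np.average(X, weights=sample_weight, axis=0)
    X_var = np.average((X - X_mean) ** 2, weights=sample_weight, axis=0)
    X_scale = np.sqrt(X_var * X.shape[0])
    # A (near-)constant feature has a scale of ~0; dividing by it yields
    # huge or non-finite coefficients, so fall back to a scale of 1 instead.
    X_scale[X_scale < np.finfo(X.dtype).eps] = 1.0
    return X_mean, X_scale


# Example: the second column is constant.
rng = np.random.RandomState(0)
X = np.column_stack([rng.randn(20), np.full(20, 3.0)])
sw = np.ones(20)
mean, scale = weighted_mean_and_scale(X, sw)
print(scale)  # the constant column gets scale 1.0 instead of 0.0
```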
I re-enabled the test for the failing edge case (non-zero constant feature on sparse data). I think we need to understand what's going on in this case. I am still not sure, but at least the error is simpler now. Then we will need to understand the discrepancy with @agramfort's minimal ridge implementations.
Ok, I put more details in #19450 (comment). I can solve the problem here, but @maikia is right and we also have it in …
LGTM, let's merge this one. The remaining problems are properly documented in the linked issues.
Thanks @ogrisel!
`_preprocess_data` in `_base.py` for linear models, when invoked with `normalize=True` and `sample_weight`, does not weight the standard deviation and therefore does not behave as expected. This PR updates it.

Currently we are working on the deprecation of `normalize`: #3020

As pointed out by @ogrisel in #17743 (comment), when testing for the same result between running lasso or ridge with `normalize` set to True and running them in a pipeline with StandardScaler and `normalize` set to False, the test fails. After thorough investigation, @agramfort identified the issue mentioned above. If `_preprocess_data` is not updated, we cannot expect the pipeline with a StandardScaler to give the same results as the linear model with `normalize` set to True.

This issue has also been noted previously by @jnothman:
scikit-learn/sklearn/linear_model/tests/test_base.py (line 473 in 1ea7905)

Possibly it might also be related to #17444.

cc @glemaitre
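For reference, a rough sketch of the weighted statistics the description refers to: the mean was already weighted, and the fix is conceptually about weighting the second moment as well, so that zero-weight samples do not influence the scale (illustrative data, not the actual `_preprocess_data` code):

```python
# Illustration: weighted mean/variance ignore zero-weight samples, exactly as
# if those rows had been removed.
import numpy as np
from numpy.testing import assert_allclose

rng = np.random.RandomState(0)
X = rng.randn(30, 2)
sw = rng.uniform(0.0, 2.0, size=30)
sw[:5] = 0.0  # give the first rows zero weight

X_mean = np.average(X, weights=sw, axis=0)
X_var = np.average((X - X_mean) ** 2, weights=sw, axis=0)

keep = sw > 0
X_mean_trim = np.average(X[keep], weights=sw[keep], axis=0)
X_var_trim = np.average((X[keep] - X_mean_trim) ** 2, weights=sw[keep], axis=0)

assert_allclose(X_mean, X_mean_trim, atol=1e-12)
assert_allclose(X_var, X_var_trim, atol=1e-12)
```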