[MRG + 1] Improve tests for sample_weight in LinearRegression and Ridge #5526
Conversation
Force-pushed from b58187f to 66633b8.
ping @sonnyhu @agramfort @ogrisel?
Or @GaelVaroquaux, who reviewed the PR together with me.
LGTM
Hmm, shouldn't this also show up in @ainafp's common tests?
👍
```python
if sample_weight is not None:
    # Sample weight can be implemented via a simple rescaling.
    X, y = _rescale_data(X, y, sample_weight)

X, y, X_mean, y_mean, X_std = self._center_data(
    X, y, self.fit_intercept, self.normalize, self.copy_X,
    sample_weight=None)
```
This is not equivalent, as now you rescale the non-centered data, which means that sample weights interact with the intercept.
Can you clarify why it's more correct this way?
or I saw #5357 (comment)
How to update scikit-learn to get this bug fixed?
@Sandy4321 we are working on solving the issue at the moment. In the meantime, if you want to use sample weights, you can instead multiply X (row by row) and y by the square root of the weights, as is done in the code above. Fitting LinearRegression without sample weights still runs without issues.
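For concreteness, here is a minimal sketch of that workaround; the data, shapes, and variable names are illustrative, not taken from the PR:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data for illustration only.
rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = rng.randn(20)
sample_weight = rng.uniform(1.0, 2.0, size=20)

# Rescale the rows of X and the entries of y by sqrt(weight), then
# fit an ordinary (unweighted) regression on the rescaled data.
sw_sqrt = np.sqrt(sample_weight)
X_w = X * sw_sqrt[:, np.newaxis]
y_w = y * sw_sqrt

# Note: with fit_intercept=True the implicit column of ones would not
# be rescaled, which is exactly the interaction discussed above. To
# get a properly weighted intercept, append a ones column to X before
# rescaling and fit without an intercept.
reg = LinearRegression(fit_intercept=False).fit(X_w, y_w)
```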
I see, but the question is: how, technically, to do this? Simply use pip to reinstall?
@agramfort We are reducing a weighted least squares problem to a non-weighted one by first performing a change of variables. In the new variables, normalization/centering are weight-independent, I believe.
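For reference, a sketch of that change of variables (this is the standard weighted least squares reduction; the notation is mine, not quoted from the PR):

```latex
% With W = diag(w) the diagonal matrix of sample weights,
\[
\min_{\beta} \sum_i w_i \,(y_i - x_i^\top \beta)^2
  = \min_{\beta} \bigl\| W^{1/2} y - W^{1/2} X \beta \bigr\|_2^2 .
\]
% So with the change of variables
\[
\tilde{X} = W^{1/2} X, \qquad \tilde{y} = W^{1/2} y,
\]
% the problem is an ordinary (unweighted) least squares problem, and
% centering/normalizing \tilde{X} and \tilde{y} no longer depends on w.
```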
I believe intercept_ has to interact with sample_weight. If we think of the intercept as a dummy column of X, then from a linear algebra point of view, the diagonal matrix of sample weights multiplies that column too.
Am I correct in thinking that it's the same picture as a weighted mean?
It is. We are minimizing a (weighted) L2 norm here. Indeed, the current implementation does the trick a priori, before the change of variables. See here.
Force-pushed from 70ea898 to c1c35d0.
I think I agree with @giorgiop. More simply, if we just take the loss function as …
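The formula was cut off in the scraped thread; assuming the standard weighted squared loss with an explicit intercept, the argument presumably runs as follows:

```latex
% Weighted squared loss with an explicit intercept b:
\[
L(\beta, b) = \sum_i w_i \,(y_i - x_i^\top \beta - b)^2 .
\]
% Setting the derivative with respect to b to zero gives
\[
\frac{\partial L}{\partial b}
  = -2 \sum_i w_i \,(y_i - x_i^\top \beta - b) = 0
\quad\Longrightarrow\quad
b = \frac{\sum_i w_i \,(y_i - x_i^\top \beta)}{\sum_i w_i} ,
\]
% i.e. the optimal intercept is a weighted mean of the residuals,
% matching the "weighted mean" picture above.
```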
```python
for n_samples, n_features in ((6, 5), (5, 10)):
    for fit_intercept in [True, False]:
```
This should just be a bug with fit_intercept=True, right? But I guess it is good to test both.
IIRC, when you do the orthogonalization against a constant (i.e. center), …
I don't know. I had a look at the test but I don't have an answer just yet.
Force-pushed from 2c2897a to c7e276a.
So, …
#5515 shouldn't have a constant sample weight vector. It is drawn uniformly.
Maybe linear regression is not among the checked estimators for some reason?
Force-pushed from b2e6dcb to 2dcad5e.
@mblondel I think that kind of test is not supposed to pass. I have removed it and used the idea based on weighted least squares that we discussed, with data augmentation by a dummy variable. It should be all right. I am amending with some cosmetics, as suggested by @agramfort in person, and squashing.
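For illustration, a minimal sketch of that kind of check, assuming a current scikit-learn; this is my paraphrase of the idea, not the PR's actual test code:

```python
import numpy as np
from numpy.testing import assert_array_almost_equal
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n_samples, n_features = 6, 5
X = rng.randn(n_samples, n_features)
y = rng.randn(n_samples)
sample_weight = 1.0 + rng.rand(n_samples)

# Fit the estimator with sample weights and an intercept.
reg = LinearRegression(fit_intercept=True)
reg.fit(X, y, sample_weight=sample_weight)

# Closed-form weighted least squares on X augmented with a dummy
# column of ones that plays the role of the intercept.
X_aug = np.hstack([np.ones((n_samples, 1)), X])
W = np.diag(sample_weight)
coefs = np.linalg.solve(X_aug.T.dot(W).dot(X_aug),
                        X_aug.T.dot(W).dot(y))

assert_array_almost_equal(coefs[0], reg.intercept_)
assert_array_almost_equal(coefs[1:], reg.coef_)
```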
@agramfort also suggested that we move the test based on weighted least squares to …
That looks like it could be done within, or in conjunction with, the other PR.
Force-pushed from 2dcad5e to b76d5a4.
Sure we can. It's going to follow a different logic, but it has the same goal.
Let me know when I shall review.
Please go ahead!
```python
assert_array_almost_equal(coefs1, coefs2)
if intercept is False:
    assert_array_almost_equal(coefs1, coefs3)
```
coefs3 -> coefs2
Also, please add a note that the same tests are needed for sparse data.
That's it for me!
There is already a comment in the code for this. I will move it to a place where it is more evident.
Force-pushed from b76d5a4 to 1d16ec4.
OK. Let me know when I can do another round.
Comments should have been addressed.
Indeed, this is a much more solid test. Thanks for that!
[MRG + 1] Improve tests for sample_weight in LinearRegression and Ridge
Thanks @giorgiop!
```diff
 rng = np.random.RandomState(0)

-for n_samples, n_features in ((6, 5), (5, 10)):
+# It would not work with under-determined systems
+for n_samples, n_features in ((6, 5), ):
```
this loop is pretty useless now, right?
Yes it is, my bad.
I noticed in #5357 that a test introduced in 0.17 was not correctly checking for the effect of sample_weight in LinearRegression. To reproduce the issue on master: