ENH Consistent checks for sample weights in linear models #15530
Conversation
@@ -262,7 +261,7 @@ def ridge_regression(X, y, alpha, sample_weight=None, solver='auto',
         assumed to be specific to the targets. Hence they must correspond in
         number.

-    sample_weight : float or numpy array of shape (n_samples,), default=None
+    sample_weight : float or array-like of shape (n_samples,), default=None
Passing float seems like a bug
It is treated as float * np.ones_like(y), so not really a bug. Any further thoughts?
Yes, that's how we treat it in _check_sample_weight. Not sure there is a use case for it; if not, we could deprecate it, but that's an issue orthogonal to this PR.
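For readers outside this thread, here is a minimal sketch of the scalar handling being discussed; expand_sample_weight is a hypothetical stand-in that only mirrors the broadcast behaviour of the real _check_sample_weight helper in sklearn.utils.validation.

```python
import numbers

import numpy as np


def expand_sample_weight(sample_weight, X, dtype=np.float64):
    # Hypothetical helper mirroring the behaviour described above; the real
    # validation lives in sklearn.utils.validation._check_sample_weight.
    n_samples = X.shape[0]
    if sample_weight is None:
        # No weights given: every sample counts equally.
        return np.ones(n_samples, dtype=dtype)
    if isinstance(sample_weight, numbers.Number):
        # A scalar is broadcast to all samples, i.e. float * np.ones_like(y).
        return np.full(n_samples, sample_weight, dtype=dtype)
    sample_weight = np.asarray(sample_weight, dtype=dtype)
    if sample_weight.shape != (n_samples,):
        raise ValueError("sample_weight.shape == {}, expected ({},)"
                         .format(sample_weight.shape, n_samples))
    return sample_weight
```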
@@ -754,7 +752,7 @@ def fit(self, X, y, sample_weight=None):
        y : array-like of shape (n_samples,) or (n_samples, n_targets)
            Target values

-        sample_weight : float or numpy array of shape [n_samples]
+        sample_weight : float or array-like of shape (n_samples,), default=None
Same here
Thanks @lorentzenchr! LGTM, aside from the 2nd comment above. We merged other such PRs without additional tests, as there are already common tests for sample weights, and this is mostly a refactoring (with a few additional checks).
        if np.any(self.alphas <= 0):
            raise ValueError(
                "alphas must be positive. Got {} containing some "
                "negative or null value instead.".format(self.alphas))

        sample_weight = _check_sample_weight(sample_weight, X, dtype=X.dtype)
        X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=[np.float64],
                         multi_output=True, y_numeric=True)
What's the motivation for moving this here? I would revert to avoid cosmetic changes. In this case it would allow failing early for wrong alphas, but that shouldn't be a very common occurrence to optimize for.
_check_sample_weight uses X.dtype; therefore, as in most other places, I'd like to first check X and y and then sample_weight.
Yes, but both check_X_y and _check_sample_weight were already used here before, in that order. At least that's what the diff on GitHub says... All this does is skip the sample weight check if sample_weight is not None, so I was wondering why we needed to move it around.
I think my motivation was to have both checks together, so either move _check_sample_weight up or check_X_y down. As np.any(self.alphas <= 0) seems to be the cheaper check, I decided on the latter. I'll revert that and do the former.
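As a concrete, hedged sketch of the ordering settled on here (assuming the private helper keeps its current signature, and not claiming to match the final diff), the checks would read roughly:

```python
import numpy as np
from sklearn.utils import check_X_y
from sklearn.utils.validation import _check_sample_weight  # private helper


def _validate_inputs(X, y, sample_weight, alphas):
    # Check X and y first so that sample_weight can reuse X.dtype.
    X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
                     dtype=[np.float64], multi_output=True, y_numeric=True)
    sample_weight = _check_sample_weight(sample_weight, X, dtype=X.dtype)
    # The cheap alpha check sits next to the other validations so that all
    # input checking stays in one place.
    if np.any(np.asarray(alphas) <= 0):
        raise ValueError(
            "alphas must be positive. Got {} containing some "
            "negative or null value instead.".format(alphas))
    return X, y, sample_weight
```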
-        check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
-                  multi_output=True)
+        X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
+                         multi_output=True, y_numeric=False)
Looks like a bug indeed.
Not really, but it is intricate. RidgeClassifierCV.fit calls _BaseRidgeCV.fit, which calls either Ridge.fit or _RidgeGCV.fit. The last two also do the usual X, y = check_X_y. Nevertheless, I prefer to have it explicitly there.
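A rough sketch of that call chain, with hypothetical *Sketch class names standing in for the real private classes in sklearn/linear_model/_ridge.py (simplified; the real classes do much more):

```python
from sklearn.utils import check_X_y


class _RidgeGCVSketch:
    def fit(self, X, y, sample_weight=None):
        # The delegated estimator performs its own X, y = check_X_y(...).
        X, y = check_X_y(X, y, multi_output=True, y_numeric=True)
        return self


class _BaseRidgeCVSketch:
    def fit(self, X, y, sample_weight=None):
        # Dispatches to either Ridge.fit or _RidgeGCV.fit; both re-validate.
        return _RidgeGCVSketch().fit(X, y, sample_weight=sample_weight)


class RidgeClassifierCVSketch(_BaseRidgeCVSketch):
    def fit(self, X, y, sample_weight=None):
        # The explicit check here is therefore redundant, but it keeps the
        # validation visible at the classifier level.
        X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
                         multi_output=True, y_numeric=False)
        return super().fit(X, y, sample_weight=sample_weight)
```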
I don't mind either way, but so far this PR does mostly refactoring and style improvements. If you are going to address #15438, I imagine that might require a change of behavior? That might be easier to review in a separate bug-fix PR with tests. In that case, removing the WIP tag might help getting a second review. BTW, if you are planning to add generic sample weight tests in the future, it might be worthwhile to add some of those directly as a
@agramfort Any other comments on this PR?
Besides my nitpick about having consistent docstrings for sample_weight, LGTM.
@@ -262,7 +261,7 @@ def ridge_regression(X, y, alpha, sample_weight=None, solver='auto',
         assumed to be specific to the targets. Hence they must correspond in
         number.

-    sample_weight : float or numpy array of shape (n_samples,), default=None
+    sample_weight : float or array-like of shape (n_samples,), default=None
         Individual weights for each sample. If sample_weight is not None and
         solver='auto', the solver will be set to 'cholesky'.
The docstring should say what sample_weight as a float means. It's not "individual weights" for each sample.
I hope you approve my short explanation on floats.
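As a small illustration of the intended semantics (my own example with assumed behaviour, not taken from the PR's tests or diff): a float sample_weight behaves like a constant weight shared by all samples.

```python
import numpy as np
from sklearn.linear_model import ridge_regression

rng = np.random.RandomState(0)
X = rng.randn(30, 4)
y = X @ rng.randn(4)

# A scalar weight is expanded to a constant array, so both calls should
# produce the same coefficients (assuming the broadcast described above).
coef_float = ridge_regression(X, y, alpha=1.0, sample_weight=2.0)
coef_array = ridge_regression(X, y, alpha=1.0, sample_weight=np.full(30, 2.0))
assert np.allclose(coef_float, coef_array)
```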
sklearn/linear_model/_ridge.py (outdated diff)

@@ -754,7 +752,7 @@ def fit(self, X, y, sample_weight=None):
        y : array-like of shape (n_samples,) or (n_samples, n_targets)
            Target values

-        sample_weight : float or numpy array of shape [n_samples]
+        sample_weight : float or array-like of shape (n_samples,), default=None
            Individual weights for each sample
Please clarify the docstring here too.
Thanks @lorentzenchr, merging!
Reference Issues/PRs

Fixes #15358 for linear_model.

What is done in this PR:

Use _check_sample_weight throughout linear models.
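As an illustration of the pattern this PR applies across the linear models (a hedged sketch, not a quote from the diff; fit_sketch is a hypothetical free function rather than an actual estimator method):

```python
import numpy as np
from sklearn.utils import check_X_y
from sklearn.utils.validation import _check_sample_weight


def fit_sketch(estimator, X, y, sample_weight=None):
    # Validate X and y first, then funnel sample_weight through the shared
    # helper instead of ad-hoc per-estimator handling.
    X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
                     multi_output=True, y_numeric=True)
    sample_weight = _check_sample_weight(sample_weight, X, dtype=X.dtype)
    # ... estimator-specific solver code would use the validated arrays ...
    return estimator
```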