ENH Consistent checks for sample weights in linear models #15530

Merged: 7 commits into scikit-learn:master on Nov 15, 2019

Conversation

@lorentzenchr (Member) commented Nov 3, 2019

Reference Issues/PRs

Fixes #15358 for linear_model.

What is done in this PR:

Use _check_sample_weight throughout linear models.
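As a rough illustration of the refactoring pattern (a schematic sketch, not the actual scikit-learn source; the "before" validation shown is representative, not a verbatim quote):

    import numpy as np
    from sklearn.utils.validation import _check_sample_weight

    def fit_before(X, y, sample_weight=None):
        # Before: ad hoc validation, duplicated across estimators.
        if sample_weight is not None:
            sample_weight = np.asarray(sample_weight, dtype=X.dtype)
            if sample_weight.ndim != 1 or sample_weight.shape[0] != X.shape[0]:
                raise ValueError("sample_weight has an incompatible shape")
        return sample_weight

    def fit_after(X, y, sample_weight=None):
        # After: one shared, consistent check that always returns a validated
        # ndarray (None becomes uniform unit weights, a scalar is broadcast).
        return _check_sample_weight(sample_weight, X, dtype=X.dtype)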

@@ -262,7 +261,7 @@ def ridge_regression(X, y, alpha, sample_weight=None, solver='auto',
         assumed to be specific to the targets. Hence they must correspond in
         number.

-    sample_weight : float or numpy array of shape (n_samples,), default=None
+    sample_weight : float or array-like of shape (n_samples,), default=None
Member

Passing float seems like a bug

Member Author

It is treated as float * np.ones_like(y). So not really a bug. Any further thoughts?

Member

Yes, that's how we treat it in _check_sample_weight. Not sure there is a use-case for it, if not we could deprecate, but that's an issue orthogonal to this PR
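For concreteness, this is how the helper treats scalars and None (a small sketch; note that _check_sample_weight is a private helper in sklearn.utils.validation):

    import numpy as np
    from sklearn.utils.validation import _check_sample_weight

    X = np.arange(8.0).reshape(4, 2)

    # A scalar is broadcast to a constant per-sample weight vector,
    # i.e. the float * np.ones_like(y) behavior discussed above.
    print(_check_sample_weight(5.0, X))   # [5. 5. 5. 5.]

    # None yields uniform unit weights.
    print(_check_sample_weight(None, X))  # [1. 1. 1. 1.]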

@@ -754,7 +752,7 @@ def fit(self, X, y, sample_weight=None):
         y : array-like of shape (n_samples,) or (n_samples, n_targets)
             Target values

-        sample_weight : float or numpy array of shape [n_samples]
+        sample_weight : float or array-like of shape (n_samples,), default=None
Member

Same here

@rth (Member) left a comment

Thanks @lorentzenchr! LGTM, aside from the 2nd comment above. We merged other such PRs without additional tests, as there are already common tests for sample weights, and this is mostly a refactoring (with a few additional checks).

        if np.any(self.alphas <= 0):
            raise ValueError(
                "alphas must be positive. Got {} containing some "
                "negative or null value instead.".format(self.alphas))

        sample_weight = _check_sample_weight(sample_weight, X, dtype=X.dtype)
        X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=[np.float64],
                         multi_output=True, y_numeric=True)
Member

What's the motivation for moving this here? I would revert to avoid cosmetic changes. In this case it would allow failing early for wrong alphas, but that shouldn't be a very common occurrence to optimize for.

Member Author

_check_sample_weight uses X.dtype; therefore, as in most other places, I'd like to first check X and y and then sample_weight.

Member

Yes, but both check_X_y and _check_sample_weight were already used here before, in that order. At least that's what the diff on GitHub says... All this does is skip the sample weight test when sample_weight is not None, so I was wondering why we needed to move it around.

Member Author

I think my motivation was to have both checks together. So either move _check_sample_weight up or check_X_y down. Since if np.any(self.alphas <= 0) seems to be the cheaper check, I decided for the latter. I'll revert that and do the former.
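Put together, the resolution amounts to roughly this ordering (a minimal sketch of my reading of the discussion, not the verbatim merged code; the function name is illustrative):

    import numpy as np
    from sklearn.utils import check_X_y
    from sklearn.utils.validation import _check_sample_weight

    def validate_fit_inputs(X, y, sample_weight, alphas):
        # check_X_y stays first, as in the original code (the move was reverted).
        X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
                         dtype=[np.float64], multi_output=True, y_numeric=True)
        if np.any(np.asarray(alphas) <= 0):
            raise ValueError("alphas must be positive. Got {} containing some "
                             "negative or null value instead.".format(alphas))
        # _check_sample_weight moved up next to the other input checks; it
        # relies on the already-converted X for n_samples and dtype.
        sample_weight = _check_sample_weight(sample_weight, X, dtype=X.dtype)
        return X, y, sample_weight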

-        check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
-                  multi_output=True)
+        X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
+                         multi_output=True, y_numeric=False)
Member

Looks like a bug indeed.

Member Author

Not really, but it is intricate. RidgeClassifierCV.fit calls _BaseRidgeCV.fit, which calls either Ridge.fit or _RidgeGCV.fit. The last two also do the usual X, y = check_X_y. Nevertheless, I prefer to have it explicitly there.
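A schematic of that call chain (heavily simplified, not the real scikit-learn classes), showing why the unassigned check_X_y was validation-only rather than a bug:

    from sklearn.utils import check_X_y

    class _RidgeGCV:
        def fit(self, X, y, sample_weight=None):
            # The inner fit re-validates and actually uses the converted arrays.
            X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
                             multi_output=True, y_numeric=True)
            # ... solver would run here ...
            return self

    class _BaseRidgeCV:
        def fit(self, X, y, sample_weight=None):
            return _RidgeGCV().fit(X, y, sample_weight)

    class RidgeClassifierCV(_BaseRidgeCV):
        def fit(self, X, y, sample_weight=None):
            # Originally the return value was discarded here; inputs were
            # still validated again downstream, hence "not really a bug".
            X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
                             multi_output=True)
            return super().fit(X, y, sample_weight)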

@lorentzenchr (Member Author) commented Nov 6, 2019

@rth You already approved the changes. So far this PR is still WIP, because I'd like to introduce additional checks on sample_weight that are not yet there. Several of them fail for sparse X (cf. #15438).
Would you prefer to add those checks in a separate PR?

@rth (Member) commented Nov 6, 2019

I don't mind either way, but so far this PR does mostly refactoring and style improvements. If you are going to address #15438, I imagine that might require a change of behavior? That might be easier to review in a separate bug-fix PR with tests. In that case, removing the WIP tag might help getting a second review.

BTW, if you are planning to add generic sample weight tests in the future, it might be worthwhile to add some of them directly as a check_* function in sklearn/utils/estimator_checks.py, even if they wouldn't pass initially for other estimators and so would have to be called manually in linear_model/tests/test_* outside of the common tests.
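For instance, such a common-test-style check might look like this (hypothetical: the function name and the repeated-rows equivalence it asserts are illustrative, not an existing check in estimator_checks.py):

    import numpy as np
    from numpy.testing import assert_allclose
    from sklearn.base import clone

    def check_sample_weight_repeated_rows(name, estimator_orig):
        # Fitting with positive integer weights should match fitting on a
        # dataset where each row is repeated that many times.
        rng = np.random.RandomState(42)
        X = rng.rand(10, 3)
        y = rng.randint(0, 2, size=10)
        sw = rng.randint(1, 4, size=10)

        est_weighted = clone(estimator_orig).fit(X, y, sample_weight=sw)
        est_repeated = clone(estimator_orig).fit(np.repeat(X, sw, axis=0),
                                                 np.repeat(y, sw))
        assert_allclose(est_weighted.coef_, est_repeated.coef_, rtol=1e-6)

Until all estimators pass, such a check could be called manually from the linear_model tests rather than registered with the common tests.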

@lorentzenchr changed the title from "[WIP] Consistent checks for sample weights in linear models" to "[MRG] Consistent checks for sample weights in linear models" on Nov 6, 2019
@rth (Member) commented Nov 7, 2019

@agramfort Any other comments on this PR?

@agramfort (Member) left a comment

Besides my nitpick about having consistent docstrings for sample_weight, LGTM.

@@ -262,7 +261,7 @@ def ridge_regression(X, y, alpha, sample_weight=None, solver='auto',
         assumed to be specific to the targets. Hence they must correspond in
         number.

-    sample_weight : float or numpy array of shape (n_samples,), default=None
+    sample_weight : float or array-like of shape (n_samples,), default=None
         Individual weights for each sample. If sample_weight is not None and
         solver='auto', the solver will be set to 'cholesky'.
Member

The docstring should say what sample_weight as a float means. It's not "individual weights" for each sample.

Member Author

I hope you approve my short explanation of floats.
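The added wording reads along these lines (an approximate quote of the merged docstring, not verbatim):

    sample_weight : float or array-like of shape (n_samples,), default=None
        Individual weights for each sample. If given a float, every sample
        will have the same weight.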

@@ -754,7 +752,7 @@ def fit(self, X, y, sample_weight=None):
         y : array-like of shape (n_samples,) or (n_samples, n_targets)
             Target values

-        sample_weight : float or numpy array of shape [n_samples]
+        sample_weight : float or array-like of shape (n_samples,), default=None
             Individual weights for each sample
Member

Please clarify the docstring here too.

@lorentzenchr changed the title from "[MRG] Consistent checks for sample weights in linear models" to "[MRG+2] Consistent checks for sample weights in linear models" on Nov 13, 2019
@rth (Member) left a comment

Thanks @lorentzenchr, merging!

@rth changed the title from "[MRG+2] Consistent checks for sample weights in linear models" to "ENH Consistent checks for sample weights in linear models" on Nov 15, 2019
@rth merged commit 004426a into scikit-learn:master on Nov 15, 2019
@lorentzenchr deleted the check_sw branch on Nov 16, 2019
adrinjalali pushed a commit to adrinjalali/scikit-learn that referenced this pull request Nov 18, 2019
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020

Successfully merging this pull request may close these issues.

Use _check_sample_weight to consistently validate sample_weight
3 participants