ENH Consistent checks for sample weights in linear models #15530

Merged: 7 commits into scikit-learn:master on Nov 15, 2019

Conversation

@lorentzenchr (Member) commented Nov 3, 2019

Reference Issues/PRs

Fixes #15358 for linear_model.

What is done in this PR:

Use _check_sample_weight throughout linear models.
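As a rough illustration of the refactoring pattern (a schematic sketch, not the actual scikit-learn source; the "before" validation shown is representative, not a verbatim quote):

    import numpy as np
    from sklearn.utils.validation import _check_sample_weight

    def fit_before(X, y, sample_weight=None):
        # Before: ad hoc validation, duplicated across estimators.
        if sample_weight is not None:
            sample_weight = np.asarray(sample_weight, dtype=X.dtype)
            if sample_weight.ndim != 1 or sample_weight.shape[0] != X.shape[0]:
                raise ValueError("sample_weight has an incompatible shape")
        return sample_weight

    def fit_after(X, y, sample_weight=None):
        # After: one shared, consistent check that always returns a validated
        # ndarray (None becomes uniform unit weights, a scalar is broadcast).
        return _check_sample_weight(sample_weight, X, dtype=X.dtype)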

@@ -262,7 +261,7 @@ def ridge_regression(X, y, alpha, sample_weight=None, solver='auto',
         assumed to be specific to the targets. Hence they must correspond in
         number.

-    sample_weight : float or numpy array of shape (n_samples,), default=None
+    sample_weight : float or array-like of shape (n_samples,), default=None
Member

Passing float seems like a bug

Member Author

It is treated as float * np.ones_like(y). So not really a bug. Any further thoughts?

Member

Yes, that's how we treat it in _check_sample_weight. Not sure there is a use-case for it, if not we could deprecate, but that's an issue orthogonal to this PR
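For concreteness, this is how the helper treats scalars and None (a small sketch; note that _check_sample_weight is a private helper in sklearn.utils.validation):

    import numpy as np
    from sklearn.utils.validation import _check_sample_weight

    X = np.arange(8.0).reshape(4, 2)

    # A scalar is broadcast to a constant per-sample weight vector,
    # i.e. the float * np.ones_like(y) behavior discussed above.
    print(_check_sample_weight(5.0, X))   # [5. 5. 5. 5.]

    # None yields uniform unit weights.
    print(_check_sample_weight(None, X))  # [1. 1. 1. 1.]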

@@ -754,7 +752,7 @@ def fit(self, X, y, sample_weight=None):
         y : array-like of shape (n_samples,) or (n_samples, n_targets)
             Target values

-        sample_weight : float or numpy array of shape [n_samples]
+        sample_weight : float or array-like of shape (n_samples,), default=None
Member

Same here

@rth (Member) left a comment

Thanks @lorentzenchr! LGTM, aside from the 2nd comment above. We merged other such PRs without additional tests, as there are already common tests for sample weights, and this is mostly a refactoring (with a few additional checks).

        if np.any(self.alphas <= 0):
            raise ValueError(
                "alphas must be positive. Got {} containing some "
                "negative or null value instead.".format(self.alphas))

        sample_weight = _check_sample_weight(sample_weight, X, dtype=X.dtype)
        X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=[np.float64],
                         multi_output=True, y_numeric=True)
Member

What's the motivation for moving this here? I would revert to avoid cosmetic changes. In this case it would allow failing early for wrong alphas, but that shouldn't be a very common occurrence to optimize for.

Member Author

_check_sample_weight uses X.dtype; therefore, as in most other places, I'd like to first check X and y and then sample_weight.

Member

Yes, but both check_X_y and _check_sample_weight were already used here before, in that order. At least that's what the diff on GitHub says... All this does is skip the sample weight test when sample_weight is not None, so I was wondering why we needed to move it around.

Member Author

I think my motivation was to have both checks together. So either move _check_sample_weight up or check_X_y down. Since if np.any(self.alphas <= 0) seems to be the cheaper check, I decided for the latter. I'll revert that and do the former.
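Put together, the resolution amounts to roughly this ordering (a minimal sketch of my reading of the discussion, not the verbatim merged code; the function name is illustrative):

    import numpy as np
    from sklearn.utils import check_X_y
    from sklearn.utils.validation import _check_sample_weight

    def validate_fit_inputs(X, y, sample_weight, alphas):
        # check_X_y stays first, as in the original code (the move was reverted).
        X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
                         dtype=[np.float64], multi_output=True, y_numeric=True)
        if np.any(np.asarray(alphas) <= 0):
            raise ValueError("alphas must be positive. Got {} containing some "
                             "negative or null value instead.".format(alphas))
        # _check_sample_weight moved up next to the other input checks; it
        # relies on the already-converted X for n_samples and dtype.
        sample_weight = _check_sample_weight(sample_weight, X, dtype=X.dtype)
        return X, y, sample_weight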

-        check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
-                  multi_output=True)
+        X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
+                         multi_output=True, y_numeric=False)
Member

Looks like a bug indeed.

Member Author

Not really, but it is intricate. RidgeClassifierCV.fit calls _BaseRidgeCV.fit, which calls either Ridge.fit or _RidgeGCV.fit. The last two also do the usual X, y = check_X_y. Nevertheless, I prefer to have it explicitly there.
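A schematic of that call chain (heavily simplified, not the real scikit-learn classes), showing why the unassigned check_X_y was validation-only rather than a bug:

    from sklearn.utils import check_X_y

    class _RidgeGCV:
        def fit(self, X, y, sample_weight=None):
            # The inner fit re-validates and actually uses the converted arrays.
            X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
                             multi_output=True, y_numeric=True)
            # ... solver would run here ...
            return self

    class _BaseRidgeCV:
        def fit(self, X, y, sample_weight=None):
            return _RidgeGCV().fit(X, y, sample_weight)

    class RidgeClassifierCV(_BaseRidgeCV):
        def fit(self, X, y, sample_weight=None):
            # Originally the return value was discarded here; inputs were
            # still validated again downstream, hence "not really a bug".
            X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
                             multi_output=True)
            return super().fit(X, y, sample_weight)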

@lorentzenchr (Member Author) commented Nov 6, 2019

@rth You already approved the changes. So far this PR is still WIP, because I'd like to introduce additional checks on sample_weight that are not yet there. Several of them fail for sparse X (cf. #15438).
Would you prefer to add those checks in a separate PR?

@rth (Member) commented Nov 6, 2019

I don't mind either way, but so far this PR does mostly refactoring and style improvements. If you are going to address #15438, I imagine that might require a change of behavior? That might be easier to review in a separate bug-fix PR with tests. In that case, removing the WIP tag might help getting a second review.

BTW, if you are planning to add generic sample weight tests in the future, it might be worthwhile to add some of them directly as a check_* function in sklearn/utils/estimator_checks.py, even if they wouldn't pass initially for other estimators and so would have to be called manually in linear_model/tests/test_* outside of the common tests.
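For instance, such a common-test-style check might look like this (hypothetical: the function name and the repeated-rows equivalence it asserts are illustrative, not an existing check in estimator_checks.py):

    import numpy as np
    from numpy.testing import assert_allclose
    from sklearn.base import clone

    def check_sample_weight_repeated_rows(name, estimator_orig):
        # Fitting with positive integer weights should match fitting on a
        # dataset where each row is repeated that many times.
        rng = np.random.RandomState(42)
        X = rng.rand(10, 3)
        y = rng.randint(0, 2, size=10)
        sw = rng.randint(1, 4, size=10)

        est_weighted = clone(estimator_orig).fit(X, y, sample_weight=sw)
        est_repeated = clone(estimator_orig).fit(np.repeat(X, sw, axis=0),
                                                 np.repeat(y, sw))
        assert_allclose(est_weighted.coef_, est_repeated.coef_, rtol=1e-6)

Until all estimators pass, such a check could be called manually from the linear_model tests rather than registered with the common tests.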

@lorentzenchr changed the title from "[WIP] Consistent checks for sample weights in linear models" to "[MRG] Consistent checks for sample weights in linear models" on Nov 6, 2019
@rth (Member) commented Nov 7, 2019

@agramfort Any other comments on this PR?

@agramfort (Member) left a comment

Besides my nitpick about having consistent docstrings for sample_weight, LGTM.

@@ -262,7 +261,7 @@ def ridge_regression(X, y, alpha, sample_weight=None, solver='auto',
         assumed to be specific to the targets. Hence they must correspond in
         number.

-    sample_weight : float or numpy array of shape (n_samples,), default=None
+    sample_weight : float or array-like of shape (n_samples,), default=None
         Individual weights for each sample. If sample_weight is not None and
         solver='auto', the solver will be set to 'cholesky'.
Member

The docstring should say what sample_weight as a float means. It's not "individual weights" for each sample.

Member Author

I hope you approve my short explanation of floats.
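The added wording reads along these lines (an approximate quote of the merged docstring, not verbatim):

    sample_weight : float or array-like of shape (n_samples,), default=None
        Individual weights for each sample. If given a float, every sample
        will have the same weight.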

@@ -754,7 +752,7 @@ def fit(self, X, y, sample_weight=None):
         y : array-like of shape (n_samples,) or (n_samples, n_targets)
             Target values

-        sample_weight : float or numpy array of shape [n_samples]
+        sample_weight : float or array-like of shape (n_samples,), default=None
             Individual weights for each sample
Member

Please clarify the docstring here too.

@lorentzenchr changed the title from "[MRG] Consistent checks for sample weights in linear models" to "[MRG+2] Consistent checks for sample weights in linear models" on Nov 13, 2019
@rth (Member) left a comment

Thanks @lorentzenchr, merging!

@rth changed the title from "[MRG+2] Consistent checks for sample weights in linear models" to "ENH Consistent checks for sample weights in linear models" on Nov 15, 2019
@rth merged commit 004426a into scikit-learn:master on Nov 15, 2019
@lorentzenchr deleted the check_sw branch on Nov 16, 2019
adrinjalali pushed a commit to adrinjalali/scikit-learn that referenced this pull request Nov 18, 2019
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020

Successfully merging this pull request may close these issues.

Use _check_sample_weight to consistently validate sample_weight
3 participants