
ENH add only_non_negative parameter to _check_sample_weight #20880


Merged
merged 25 commits on Sep 8, 2021

Conversation

simonandras
Contributor

@simonandras simonandras commented Aug 28, 2021

Reference Issues/PRs

Fixes #20036.
Partially addressed #15531

What does this implement/fix? Explain your changes.

The `LinearRegression` object's `.fit()` method gives an unclear error message if the `sample_weight` parameter of `fit()` contains a negative value. The exception occurred only at the call to `linalg.lstsq(X, y)`. I think it's better if the negative-value check is done at the beginning of the `.fit()` method. The new message is: "The weights of the samples cannot be negative".

Any other comments?

I think the check for negative values should not be done inside the `_check_sample_weight()` function but inside the given `.fit()` method. Maybe there are other cases too where this exception message and the negative-weight check could be useful.
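The fail-fast check described above can be sketched as follows (a toy stand-in with illustrative names, not the actual scikit-learn implementation):

```python
import numpy as np

def fit(X, y, sample_weight=None):
    """Toy fit() that validates sample_weight up front (sketch only)."""
    if sample_weight is not None:
        sample_weight = np.asarray(sample_weight, dtype=np.float64)
        if np.any(sample_weight < 0):
            # Fail fast with a clear message instead of letting
            # linalg.lstsq raise something cryptic later on.
            raise ValueError("The weights of the samples cannot be negative")
    # ... the actual least-squares fit would happen here ...
    return sample_weight
```

The point of the sketch is only where the check happens: before any numerical work, so the user sees a message about weights rather than a linear-algebra failure.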

This is my first PR to the open-source community and I welcome any comments or advice related to this PR!

@glemaitre
Member

@simonandras Could you check the message that I left on the original issue: #20036 (comment)?

I think that we should change _check_sample_weight instead.

…ed on with the only_positive bool parameter (backward compatible). For the check the already written check_non_negative is used.
…d rename the only_positive parameter to only_non_negative, because zeros are accepted.
@simonandras
Contributor Author

simonandras commented Sep 1, 2021

Thank you for the reply!
I changed the `_check_sample_weight` function as you suggested. Now I used the already written `check_non_negative` function for the non-negativity check. This function already builds the error message; only the `whom` parameter had to be supplied. I also added the `only_non_negative` parameter to `_check_sample_weight` (I chose this name instead of `only_positive` because zeros are allowed).

@simonandras
Contributor Author

I wasn't able to make all checks successful. I think the problem is with the `test_adaboost_negative_weight_error` function and the new error message, but I'm not sure what the problem actually is or what to do about it. Any help would be appreciated!

@glemaitre glemaitre self-requested a review September 2, 2021 08:07
@glemaitre
Member

Here is a patch that also shows how to modify the current code so that the test failures should be resolved:

diff --git a/sklearn/ensemble/_weight_boosting.py b/sklearn/ensemble/_weight_boosting.py
index 77ef449ba1..ed3b0ad853 100644
--- a/sklearn/ensemble/_weight_boosting.py
+++ b/sklearn/ensemble/_weight_boosting.py
@@ -123,10 +123,10 @@ class BaseWeightBoosting(BaseEnsemble, metaclass=ABCMeta):
             y_numeric=is_regressor(self),
         )
 
-        sample_weight = _check_sample_weight(sample_weight, X, np.float64, copy=True)
+        sample_weight = _check_sample_weight(
+            sample_weight, X, np.float64, copy=True, only_non_negative=True
+        )
         sample_weight /= sample_weight.sum()
-        if np.any(sample_weight < 0):
-            raise ValueError("sample_weight cannot contain negative weights")
 
         # Check parameters
         self._validate_estimator()
@@ -136,7 +136,7 @@ class BaseWeightBoosting(BaseEnsemble, metaclass=ABCMeta):
         self.estimator_weights_ = np.zeros(self.n_estimators, dtype=np.float64)
         self.estimator_errors_ = np.ones(self.n_estimators, dtype=np.float64)
 
-        # Initializion of the random number instance that will be used to
+        # Initialization of the random number instance that will be used to
         # generate a seed at each iteration
         random_state = check_random_state(self.random_state)
 
diff --git a/sklearn/ensemble/tests/test_weight_boosting.py b/sklearn/ensemble/tests/test_weight_boosting.py
index 6927d47c11..159f83abf2 100755
--- a/sklearn/ensemble/tests/test_weight_boosting.py
+++ b/sklearn/ensemble/tests/test_weight_boosting.py
@@ -576,6 +576,6 @@ def test_adaboost_negative_weight_error(model, X, y):
     sample_weight = np.ones_like(y)
     sample_weight[-1] = -10
 
-    err_msg = "sample_weight cannot contain negative weight"
+    err_msg = "Negative values in data passed to `sample_weight`"
     with pytest.raises(ValueError, match=err_msg):
         model.fit(X, y, sample_weight=sample_weight)
diff --git a/sklearn/linear_model/_base.py b/sklearn/linear_model/_base.py
index 6e1d6acb0c..514525a14a 100644
--- a/sklearn/linear_model/_base.py
+++ b/sklearn/linear_model/_base.py
@@ -662,7 +662,7 @@ class LinearRegression(MultiOutputMixin, RegressorMixin, LinearModel):
             X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
         )
 
-        if sample_weight is not None:  # the weights should be non-negative
+        if sample_weight is not None:
             sample_weight = _check_sample_weight(
                 sample_weight, X, only_non_negative=True, dtype=X.dtype
             )
diff --git a/sklearn/neighbors/_kde.py b/sklearn/neighbors/_kde.py
index 328a13371b..dcfc7b51f4 100644
--- a/sklearn/neighbors/_kde.py
+++ b/sklearn/neighbors/_kde.py
@@ -191,9 +191,7 @@ class KernelDensity(BaseEstimator):
         X = self._validate_data(X, order="C", dtype=DTYPE)
 
         if sample_weight is not None:
-            sample_weight = _check_sample_weight(sample_weight, X, DTYPE)
-            if sample_weight.min() <= 0:
-                raise ValueError("sample_weight must have positive values")
+            sample_weight = _check_sample_weight(sample_weight, X, DTYPE, only_non_negative=True)
 
         kwargs = self.metric_params
         if kwargs is None:
diff --git a/sklearn/neighbors/tests/test_kde.py b/sklearn/neighbors/tests/test_kde.py
index 84f7623c8d..d4fb775c44 100644
--- a/sklearn/neighbors/tests/test_kde.py
+++ b/sklearn/neighbors/tests/test_kde.py
@@ -209,7 +209,7 @@ def test_sample_weight_invalid():
     data = np.reshape([1.0, 2.0, 3.0], (-1, 1))
 
     sample_weight = [0.1, -0.2, 0.3]
-    expected_err = "sample_weight must have positive values"
+    expected_err = "Negative values in data passed to `sample_weight`"
     with pytest.raises(ValueError, match=expected_err):
         kde.fit(data, sample_weight=sample_weight)
 
diff --git a/sklearn/utils/validation.py b/sklearn/utils/validation.py
index 554f821539..eb015d287d 100644
--- a/sklearn/utils/validation.py
+++ b/sklearn/utils/validation.py
@@ -1480,7 +1480,11 @@ def _check_psd_eigenvalues(lambdas, enable_warnings=False):
 
 
 def _check_sample_weight(
-    sample_weight, X, only_non_negative=False, dtype=None, copy=False
+    sample_weight,
+    X,
+    dtype=None,
+    copy=False,
+    only_non_negative=False,
 ):
     """Validate sample weights.
 
@@ -1497,9 +1501,6 @@ def _check_sample_weight(
     X : {ndarray, list, sparse matrix}
         Input data.
 
-    only_non_negative : if True then a non negativity check for the sample_weight
-        will be done.
-
     dtype : dtype, default=None
        dtype of the validated `sample_weight`.
        If None, and the input `sample_weight` is an array, the dtype of the
@@ -1510,6 +1511,9 @@ def _check_sample_weight(
     copy : bool, default=False
         If True, a copy of sample_weight will be created.
 
+    only_non_negative : bool, default=False
+        Whether or not the weights are expected to be non-negative.
+
     Returns
     -------
     sample_weight : ndarray of shape (n_samples,)
@@ -1546,7 +1550,7 @@ def _check_sample_weight(
             )
 
     if only_non_negative:
-        check_non_negative(sample_weight, "sample weight")
+        check_non_negative(sample_weight, "`sample_weight`")
 
     return sample_weight
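Summarized outside the diff, the patched helper boils down to roughly the following (a simplified sketch without the full dtype and sparse-input handling of the real `_check_sample_weight`):

```python
import numpy as np

def check_sample_weight(sample_weight, X, dtype=None, copy=False,
                        only_non_negative=False):
    """Simplified stand-in for sklearn's private _check_sample_weight."""
    n_samples = len(X)
    dtype = np.float64 if dtype is None else dtype
    if sample_weight is None:
        # Default: uniform unit weights.
        sample_weight = np.ones(n_samples, dtype=dtype)
    elif np.isscalar(sample_weight):
        # A scalar weight is broadcast to all samples.
        sample_weight = np.full(n_samples, sample_weight, dtype=dtype)
    else:
        sample_weight = np.asarray(sample_weight, dtype=dtype)
        if sample_weight.shape != (n_samples,):
            raise ValueError(
                f"sample_weight.shape == {sample_weight.shape}, "
                f"expected ({n_samples},)!"
            )
    if only_non_negative and (sample_weight < 0).any():
        # Same wording as the message produced via check_non_negative.
        raise ValueError("Negative values in data passed to `sample_weight`")
    if copy:
        sample_weight = sample_weight.copy()
    return sample_weight
```

With `only_non_negative=False` as the default, existing callers keep their old behavior; only estimators that opt in (AdaBoost, KernelDensity, LinearRegression) get the new rejection of negative weights.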

@glemaitre
Member

Indeed, you need to:

  • move only_non_negative as the last parameter of the _check_sample_weight signature.
  • modify AdaBoost and KDE so that we no longer check by hand for negative weights.
  • change the error message in the associated test files.

Then everything should be fine :)

@glemaitre glemaitre changed the title from "Adding informative error message for the negative weight case (linear regression)" to "ENH add only_non_negative parameter to _check_sample_weight" on Sep 2, 2021
@glemaitre
Member

In addition, we will want a test in test_validation.py to check that we raise the expected error when using _check_sample_weight.
We don't need an entry in the changelog since the tool is part of the private API (probably the dev API, though).

@rth rth self-requested a review September 2, 2021 11:11
simonandras and others added 5 commits September 2, 2021 20:19
update documentation

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
delete unnecessary comment

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Add `...` to function name in the error message

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@simonandras
Contributor Author

I made the changes and wrote the test as a new case inside the already existing `test_check_sample_weight` function. Every check passes now except one, which fails because there is no new entry in the changelog. As you said we don't need an entry for this one, so can this failed check be ignored?
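Such a test could look roughly like the following sketch. It exercises a small stub rather than the real private helper, and uses a plain try/except; in scikit-learn's test suite the same intent would be expressed with `pytest.raises(..., match=...)`:

```python
import numpy as np

def _check_sample_weight_stub(sample_weight, X, only_non_negative=False):
    # Stand-in for the real helper, just for this sketch.
    sample_weight = np.asarray(sample_weight, dtype=np.float64)
    if only_non_negative and (sample_weight < 0).any():
        raise ValueError("Negative values in data passed to `sample_weight`")
    return sample_weight

def test_check_sample_weight_only_non_negative():
    X = np.ones((5, 2))
    sample_weight = np.array([1.0, 1.0, -1.0, 1.0, 1.0])
    # The default behavior is unchanged: negative weights pass through.
    assert _check_sample_weight_stub(sample_weight, X)[2] == -1.0
    # With only_non_negative=True the helper must raise the new message.
    try:
        _check_sample_weight_stub(sample_weight, X, only_non_negative=True)
    except ValueError as exc:
        assert "Negative values in data passed to `sample_weight`" in str(exc)
    else:
        raise AssertionError("expected a ValueError for negative weights")
```

Checking both branches (default pass-through and opt-in rejection) documents the backward-compatibility guarantee as well as the new error.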

…ll make the parameter order like in the function definition
@glemaitre
Member

glemaitre commented Sep 3, 2021

Please add an entry to the change log at doc/whats_new/v1.0.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:. This will solve the failure in the CI. I will shortly make another review.

Member

@glemaitre glemaitre left a comment


LGTM. Probably @rth can have a look at this one.

Member

@ogrisel ogrisel left a comment


LGTM as well!

@glemaitre
Member

I will move this item to 1.1 since it is not a blocker for the release.

@glemaitre glemaitre self-requested a review September 8, 2021 09:08
Member

@glemaitre glemaitre left a comment


I moved the changelog entry. I will merge if the CIs pass

@glemaitre glemaitre merged commit 9c84abd into scikit-learn:main Sep 8, 2021
@simonandras simonandras deleted the my-new-branch branch September 8, 2021 12:38
@glemaitre glemaitre mentioned this pull request Oct 23, 2021
10 tasks
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
…it-learn#20880)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Successfully merging this pull request may close these issues.

Improve error message for negative values in sample_weight
4 participants