[MRG+1] Make GradientBoostingClassifier error message more informative #10207


Merged
merged 15 commits into from
Dec 18, 2017

Conversation

gxyd
Contributor

@gxyd gxyd commented Nov 26, 2017

Reference Issues/PRs

Fixes #6433
Fixes #6435

What does this implement/fix? Explain your changes.

Takes commit from PR #6435 and tries to complete that PR.

else:
sample_weight = column_or_1d(sample_weight, warn=True)
n_pos = np.sum(sample_weight * y)
n_neg = np.sum(sample_weight * (1 - y))
if n_pos == 0 or n_neg == 0:
Member

do we need an approximate equality here? (np.allclose)
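For what it's worth, with nonnegative weights and a 0/1 target, each weighted sum is exactly zero only when every contributing term is zero, so the exact comparison can be well defined; a minimal sketch on toy data (not the PR's code):

```python
import numpy as np

# Toy 0/1 target; all weight sits on the positive class.
y = np.array([1, 1, 0, 0])
sample_weight = np.array([0.5, 0.5, 0.0, 0.0])

n_pos = np.sum(sample_weight * y)        # 1.0
n_neg = np.sum(sample_weight * (1 - y))  # 0.0 exactly: every term is zero

print(n_pos == 0, n_neg == 0)  # False True
```

So the question of np.allclose hinges on whether weights can be tiny but nonzero, rather than on floating-point accumulation error.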

@gxyd
Contributor Author

gxyd commented Nov 27, 2017 via email

@jnothman
Member

I think the problem is that you're testing for exactly two classes, but should be testing for at least two classes.
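To illustrate the distinction with a toy sketch (not the PR's code): with three fully weighted classes, an exact-two test would wrongly reject a valid multiclass problem, while an at-least-two test accepts it.

```python
import numpy as np

def n_surviving_classes(y, sample_weight):
    """Number of classes whose total sample weight is nonzero."""
    _, y_enc = np.unique(y, return_inverse=True)
    return np.count_nonzero(np.bincount(y_enc, weights=sample_weight))

y = np.array([0, 1, 2, 2])
w = np.ones(4)

n = n_surviving_classes(y, w)   # 3 classes survive
print(n == 2)   # False: an exact-two test rejects this valid input
print(n >= 2)   # True: the at-least-two test accepts it
```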

@gxyd
Contributor Author

gxyd commented Dec 3, 2017

I currently get errors like:

____________________________________________________ test_multi_target_sample_weights_api _____________________________________________________

    def test_multi_target_sample_weights_api():
        X = [[1, 2, 3], [4, 5, 6]]
        y = [[3.141, 2.718], [2.718, 3.141]]
        w = [0.8, 0.6]
    
        rgr = MultiOutputRegressor(Lasso())
        assert_raises_regex(ValueError, "does not support sample weights",
                            rgr.fit, X, y, w)
    
        # no exception should be raised if the base estimator supports weights
        rgr = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
>       rgr.fit(X, y, w)

X          = [[1, 2, 3], [4, 5, 6]]
rgr        = MultiOutputRegressor(estimator=GradientBoostingRegressor(alpha=0.9, criterion=...    validation_fraction=0.1, verbose=0, warm_start=False),
           n_jobs=1)
w          = [0.8, 0.6]
y          = [[3.141, 2.718], [2.718, 3.141]]

sklearn/tests/test_multioutput.py:112: 

which I can understand. But shouldn't I expect y to contain only integer values for classification problems?

@jnothman
Member

jnothman commented Dec 4, 2017

Classification targets may be strings or ints.
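Any class-counting logic therefore has to work for arbitrary label dtypes; np.unique's inverse mapping handles strings and ints alike, as a small illustration (not the PR's code):

```python
import numpy as np

y = np.array(["ham", "spam", "ham", "eggs"])
classes, y_encoded = np.unique(y, return_inverse=True)
print(classes)     # ['eggs' 'ham' 'spam']
print(y_encoded)   # [1 2 1 0]
```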

@gxyd
Contributor Author

gxyd commented Dec 4, 2017 via email

@gxyd
Contributor Author

gxyd commented Dec 5, 2017

Currently two types of errors are raised (three in total):

self = GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
    ... tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)
X = array([[-2., -1.],
       [-1., -1.],
       [-1., -2.],
       [ 1.,  1.],
       [ 1.,  2.],
       [ 2.,  1.]], dtype=float32)
y = array([ 1.,  1.,  1.,  1.,  1.,  1.]), sample_weight = array([ 1.,  1.,  1.,  1.,  1.,  1.], dtype=float32), monitor = None

        if np.count_nonzero(np.bincount(y_, weights=sample_weight)) < 2:
>           raise ValueError("y should contain atleast 2 classes after "
                             "sample_weight trims samples with zero weights.")
E           ValueError: y should contain atleast 2 classes after sample_weight trims samples with zero weights.

second types of error are

>               raise e
E               ValueError: y should contain 2 classes after sample_weight trims samples with zero weights.

X          = array([[ 1.64644051,  2.1455681 ,  1.80829013,  1.63464955,  1.2709644 ,
         1.93768234,  1.31276163,  2.675319  ,  2.89098828,  1.15032456]])
e          = ValueError('y should contain 2 classes after sample_weight trims samples with zero weights.',)
estimator  = GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
    ... tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)
estimator_orig = GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
    ... tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)
msgs       = ['1 sample', 'n_samples = 1', 'n_samples=1', 'one sample', '1 class', 'one class']
name       = 'GradientBoostingRegressor'
rnd        = <mtrand.RandomState object at 0x2b8f1bd35fa0>
y          = array([1])

I think the first type of error can be fixed (do we actually expect it to raise an error?) by implementing a fit method on GradientBoostingClassifier instead of using the superclass's method.

I don't understand the second error.
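The first traceback makes sense in hindsight: that test fits GradientBoostingRegressor on a constant target, and a constant y looks like a single "class" to the new check, so the check must not run for regressors. A toy reproduction of the condition (not the PR's code):

```python
import numpy as np

# The failing regressor test fits on a constant target:
y = np.array([1., 1., 1., 1., 1., 1.])
sample_weight = np.ones_like(y)

# The classifier-style check sees only one surviving "class":
y_int = y.astype(int)
one_class = np.count_nonzero(np.bincount(y_int, weights=sample_weight)) < 2
print(one_class)  # True, so a regressor would wrongly raise here
```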

@gxyd
Contributor Author

gxyd commented Dec 5, 2017

There also seems to be an unrelated lgtm failure.

@jnothman
Member

jnothman commented Dec 6, 2017

Well, you should be looking at test_gradient_boosting.py line 575 and estimator_checks.py line 671 to work out whether your code or the tests need modification.

Member

@jnothman jnothman left a comment

Why are you distinguishing between floats and other class dtypes?

if np.count_nonzero(np.bincount(y_, weights=sample_weight)) < 2:
raise ValueError("y should contain atleast 2 classes after "
"sample_weight trims samples with zero weights.")
if not np.issubdtype(y.dtype, np.dtype('float')):
Member

I don't get why this is relevant

Contributor Author

BaseGradientBoosting can only differentiate between a classification and a regression problem by checking the dtype of y, and the original issue describes the problem for classification, not for regression.

Contributor Author

I was thinking of implementing a separate fit method for GradientBoostingClassifier containing this ValueError check, but excluding this float dtype comparison check.

Contributor Author

WDYT?

Member

The float dtype check is not appropriate; ints can be regression targets, for instance.
Better off using isinstance(self, ...), or, if you feel it is not going to duplicate much code, yes, you can implement a separate GBC.fit.
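A quick illustration of the pitfall (assuming nothing beyond numpy): integer-dtyped y is a perfectly valid regression target, so a float-dtype test misclassifies it.

```python
import numpy as np

y_reg = np.array([10, 20, 30])   # int dtype, but a valid regression target

# A float-dtype test would misread this as a classification problem:
looks_like_classification = not np.issubdtype(y_reg.dtype, np.floating)
print(looks_like_classification)  # True, even though this is regression
```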

Contributor Author

Nope, that would mostly be duplicated code. I've refrained from doing it.

@gxyd
Contributor Author

gxyd commented Dec 8, 2017

I think this is ready for review.

if n_trim_classes < 2:
raise ValueError("y contains %d class after sample_weight "
"trimmed classes with zero weights, while a "
"minimum of 1 class is required."
Member

Shouldn't this be "two classes"?

Contributor Author

Aah, yes. Made a mistake here.

@gxyd
Contributor Author

gxyd commented Dec 9, 2017 via email

@@ -996,6 +997,17 @@ def fit(self, X, y, sample_weight=None, monitor=None):
else:
sample_weight = column_or_1d(sample_weight, warn=True)

if isinstance(self, GradientBoostingClassifier):
Member

_validate_y already does this label encoding, albeit without a LabelEncoder. It's already specialised to regression/classification. Why not just add sample_weight as a parameter to _validate_y and handle it there?

Contributor Author

@gxyd gxyd Dec 9, 2017

That sounds like a neat way of doing this.

And would this check be done for any classification algorithm with such y (i.e. in check_classification_targets), or just for GradientBoostingClassifier (i.e. just in the _validate_y method of GradientBoostingClassifier)?

I am not fully aware of how other classification algorithms are supposed to respond to a single classification target (i.e. whether to raise an error or not).

@jnothman
Member

jnothman commented Dec 9, 2017 via email

@jnothman
Member

This pull request introduces 1 alert - view on lgtm.com

new alerts:

  • 1 for Signature mismatch in overriding method

Comment posted by lgtm.com

@@ -998,7 +998,10 @@ def fit(self, X, y, sample_weight=None, monitor=None):

check_consistent_length(X, y, sample_weight)

y = self._validate_y(y)
if isinstance(self, GradientBoostingClassifier):
Member

This is now a bit silly. We have a method called _validate_y in both the regressor and the classifier to take advantage of polymorphism. Better off adding a sample_weight parameter to both.

Contributor Author

Actually, I did this initially, but I was trying not to interfere with the other classifiers' and regressors' methods.

Member

Stylistic decisions, while affecting readability and maintainability, can be hard to weigh up.

As they say:

There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.

I'm part Dutch :O)

@jnothman jnothman changed the title [MRG] Make GradientBoostingClassifier error message more informative [MRG+1] Make GradientBoostingClassifier error message more informative Dec 12, 2017
@amueller
Member

Is this a problem only in GradientBoostingClassifier? How do other classifiers behave in this case?
Otherwise LGTM

Member

@amueller amueller left a comment

Maybe we should open an issue to add a common test?

@@ -27,6 +27,7 @@
from sklearn.utils.testing import assert_less
from sklearn.utils.testing import assert_raise_message
from sklearn.utils.testing import assert_raises
from sklearn.utils.testing import assert_raise_message
Member

The same import is done two lines above

@gxyd
Contributor Author

gxyd commented Dec 12, 2017

I think this is an issue with RandomForestClassifier as well (if I am using it correctly):

In [4]: X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

In [5]: y = np.array([1, 1, 1, 2, 2, 2])

In [6]: from sklearn.ensemble import RandomForestClassifier, VotingClassifier

In [7]: RandomForestClassifier
Out[7]: sklearn.ensemble.forest.RandomForestClassifier

In [8]: weights = np.array([0, 0, 0, 1, 1, 1])

In [9]: RandomForestClassifier
Out[9]: sklearn.ensemble.forest.RandomForestClassifier

In [10]: clf = RandomForestClassifier(max_depth=2, random_state=0)

In [11]: clf.fit(X, y, sample_weight=weights)
Out[11]: 
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

It doesn't raise a ValueError.

@gxyd
Contributor Author

gxyd commented Dec 12, 2017 via email

@gxyd
Contributor Author

gxyd commented Dec 12, 2017

@amueller I raised the point of addressing this for other classifiers here: #10207 (comment).

Perhaps we should move all of this into check_classification_targets? As far as I can see, it is first used by the _validate_y_... method of RandomForestClassifier as well.

@amueller
Member

Random forests actually "work" with a single class IIRC, so not raising an error doesn't seem like a problem?
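Indeed, this can be checked directly (assuming a recent scikit-learn; the snippet mirrors the session above but with a single class):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 1, 1])          # a single class

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)                        # no error raised
print(clf.predict([[0, 0]]))         # the lone class is always predicted
```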

@gxyd
Contributor Author

gxyd commented Dec 12, 2017 via email

@amueller
Member

I don't think this has a lot to do with machine learning theory, more with the particularities of how each of our estimators handles some edge-cases.

@jnothman
Member

jnothman commented Dec 13, 2017 via email

@gxyd
Contributor Author

gxyd commented Dec 14, 2017

Are you expecting me to make those changes here? (No hurry; I just want to know whether you're waiting on me.)

@amueller
Member

I'm fine with merging this and opening another issue for adding a common test.

@jnothman jnothman merged commit c9e6d4d into scikit-learn:master Dec 18, 2017
@jnothman
Member

Thanks @gxyd

@gxyd gxyd deleted the err-message-boosting branch December 18, 2017 13:01
@gxyd
Contributor Author

gxyd commented Dec 18, 2017 via email

@gxyd
Contributor Author

gxyd commented Dec 18, 2017 via email

@jnothman
Member

jnothman commented Dec 18, 2017 via email
