[MRG+1] Make GradientBoostingClassifier error message more informative #10207


Merged
merged 15 commits into from
Dec 18, 2017

Conversation

gxyd
Contributor

@gxyd gxyd commented Nov 26, 2017

Reference Issues/PRs

Fixes #6433
Fixes #6435

What does this implement/fix? Explain your changes.

Takes commit from PR #6435 and tries to complete that PR.

else:
sample_weight = column_or_1d(sample_weight, warn=True)
n_pos = np.sum(sample_weight * y)
n_neg = np.sum(sample_weight * (1 - y))
if n_pos == 0 or n_neg == 0:
Member

do we need an approximate equality here? (np.allclose)
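For what it's worth, with nonnegative weights and a 0/1 target, each weighted sum is exactly zero only when every contributing term is zero, so the exact comparison can be well defined; a minimal sketch on toy data (not the PR's code):

```python
import numpy as np

# Toy 0/1 target; all weight sits on the positive class.
y = np.array([1, 1, 0, 0])
sample_weight = np.array([0.5, 0.5, 0.0, 0.0])

n_pos = np.sum(sample_weight * y)        # 1.0
n_neg = np.sum(sample_weight * (1 - y))  # 0.0 exactly: every term is zero

print(n_pos == 0, n_neg == 0)  # False True
```

So the question of np.allclose hinges on whether weights can be tiny but nonzero, rather than on floating-point accumulation error.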

@gxyd
Contributor Author

gxyd commented Nov 27, 2017 via email

@jnothman
Member

I think the problem is that you're testing for exactly two classes, but should be testing for at least two classes.
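To illustrate the distinction with a toy sketch (not the PR's code): with three fully weighted classes, an exact-two test would wrongly reject a valid multiclass problem, while an at-least-two test accepts it.

```python
import numpy as np

def n_surviving_classes(y, sample_weight):
    """Number of classes whose total sample weight is nonzero."""
    _, y_enc = np.unique(y, return_inverse=True)
    return np.count_nonzero(np.bincount(y_enc, weights=sample_weight))

y = np.array([0, 1, 2, 2])
w = np.ones(4)

n = n_surviving_classes(y, w)   # 3 classes survive
print(n == 2)   # False: an exact-two test rejects this valid input
print(n >= 2)   # True: the at-least-two test accepts it
```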

@gxyd
Contributor Author

gxyd commented Dec 3, 2017

I currently get errors like:

____________________________________________________ test_multi_target_sample_weights_api _____________________________________________________

    def test_multi_target_sample_weights_api():
        X = [[1, 2, 3], [4, 5, 6]]
        y = [[3.141, 2.718], [2.718, 3.141]]
        w = [0.8, 0.6]
    
        rgr = MultiOutputRegressor(Lasso())
        assert_raises_regex(ValueError, "does not support sample weights",
                            rgr.fit, X, y, w)
    
        # no exception should be raised if the base estimator supports weights
        rgr = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
>       rgr.fit(X, y, w)

X          = [[1, 2, 3], [4, 5, 6]]
rgr        = MultiOutputRegressor(estimator=GradientBoostingRegressor(alpha=0.9, criterion=...    validation_fraction=0.1, verbose=0, warm_start=False),
           n_jobs=1)
w          = [0.8, 0.6]
y          = [[3.141, 2.718], [2.718, 3.141]]

sklearn/tests/test_multioutput.py:112: 

which I can understand. But shouldn't I expect y to contain only integer values for classification problems?

@jnothman
Member

jnothman commented Dec 4, 2017

Classification targets may be strings or ints.
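Any class-counting logic therefore has to work for arbitrary label dtypes; np.unique's inverse mapping handles strings and ints alike, as a small illustration (not the PR's code):

```python
import numpy as np

y = np.array(["ham", "spam", "ham", "eggs"])
classes, y_encoded = np.unique(y, return_inverse=True)
print(classes)     # ['eggs' 'ham' 'spam']
print(y_encoded)   # [1 2 1 0]
```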

@gxyd
Contributor Author

gxyd commented Dec 4, 2017 via email

@gxyd
Contributor Author

gxyd commented Dec 5, 2017

Currently two types of errors are raised (three in total):

self = GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
    ... tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)
X = array([[-2., -1.],
       [-1., -1.],
       [-1., -2.],
       [ 1.,  1.],
       [ 1.,  2.],
       [ 2.,  1.]], dtype=float32)
y = array([ 1.,  1.,  1.,  1.,  1.,  1.]), sample_weight = array([ 1.,  1.,  1.,  1.,  1.,  1.], dtype=float32), monitor = None

        if np.count_nonzero(np.bincount(y_, weights=sample_weight)) < 2:
>           raise ValueError("y should contain atleast 2 classes after "
                             "sample_weight trims samples with zero weights.")
E           ValueError: y should contain atleast 2 classes after sample_weight trims samples with zero weights.

second types of error are

>               raise e
E               ValueError: y should contain 2 classes after sample_weight trims samples with zero weights.

X          = array([[ 1.64644051,  2.1455681 ,  1.80829013,  1.63464955,  1.2709644 ,
         1.93768234,  1.31276163,  2.675319  ,  2.89098828,  1.15032456]])
e          = ValueError('y should contain 2 classes after sample_weight trims samples with zero weights.',)
estimator  = GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
    ... tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)
estimator_orig = GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
    ... tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)
msgs       = ['1 sample', 'n_samples = 1', 'n_samples=1', 'one sample', '1 class', 'one class']
name       = 'GradientBoostingRegressor'
rnd        = <mtrand.RandomState object at 0x2b8f1bd35fa0>
y          = array([1])

I think the first type of error can be fixed (do we actually expect it to raise an error?) by implementing a fit method on GradientBoostingClassifier instead of using the superclass's method.

I don't understand the second error.
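The first traceback makes sense in hindsight: that test fits GradientBoostingRegressor on a constant target, and a constant y looks like a single "class" to the new check, so the check must not run for regressors. A toy reproduction of the condition (not the PR's code):

```python
import numpy as np

# The failing regressor test fits on a constant target:
y = np.array([1., 1., 1., 1., 1., 1.])
sample_weight = np.ones_like(y)

# The classifier-style check sees only one surviving "class":
y_int = y.astype(int)
one_class = np.count_nonzero(np.bincount(y_int, weights=sample_weight)) < 2
print(one_class)  # True, so a regressor would wrongly raise here
```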

@gxyd
Contributor Author

gxyd commented Dec 5, 2017

There also seems to be an unrelated lgtm failure.

@jnothman
Member

jnothman commented Dec 6, 2017

Well, you should be looking at test_gradient_boosting.py line 575 and estimator_checks.py line 671 to work out whether your code or the tests need modification.

Member

@jnothman jnothman left a comment

Why are you distinguishing between floats and other class dtypes?

if np.count_nonzero(np.bincount(y_, weights=sample_weight)) < 2:
raise ValueError("y should contain atleast 2 classes after "
"sample_weight trims samples with zero weights.")
if not np.issubdtype(y.dtype, np.dtype('float')):
Member

I don't get why this is relevant

Contributor Author

BaseGradientBoosting can only differentiate between a classification and a regression problem by checking the dtype of y, and the original issue describes the problem for classification, not for regression.

Contributor Author

I was thinking of implementing a separate fit method for GradientBoostingClassifier containing this ValueError check, but excluding this float dtype comparison check.

Contributor Author

WDYT?

Member

The float dtype check is not appropriate; ints can be regression targets, for instance.
Better off using isinstance(self, ...), or, if you feel it is not going to duplicate much code, yes, you can implement a separate GBC.fit.
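A quick illustration of the pitfall (assuming nothing beyond numpy): integer-dtyped y is a perfectly valid regression target, so a float-dtype test misclassifies it.

```python
import numpy as np

y_reg = np.array([10, 20, 30])   # int dtype, but a valid regression target

# A float-dtype test would misread this as a classification problem:
looks_like_classification = not np.issubdtype(y_reg.dtype, np.floating)
print(looks_like_classification)  # True, even though this is regression
```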

Contributor Author

Nope, that would mostly be duplicated code. I've refrained from doing it.

@gxyd
Contributor Author

gxyd commented Dec 8, 2017

I think this is ready for review.

if n_trim_classes < 2:
raise ValueError("y contains %d class after sample_weight "
"trimmed classes with zero weights, while a "
"minimum of 1 class is required."
Member

Shouldn't this be "two classes"?

Contributor Author

Aah, yes. Made a mistake here.

@gxyd
Contributor Author

gxyd commented Dec 9, 2017 via email

@@ -996,6 +997,17 @@ def fit(self, X, y, sample_weight=None, monitor=None):
else:
sample_weight = column_or_1d(sample_weight, warn=True)

if isinstance(self, GradientBoostingClassifier):
Member

_validate_y already does this label encoding, albeit without a LabelEncoder. It's already specialised to regression/classification. Why not just add sample_weight as a parameter to _validate_y and handle it there?

Contributor Author

@gxyd gxyd Dec 9, 2017

That sounds like a neat way of doing this.

And would this check be done for any classification algorithm with such y (i.e. in check_classification_targets), or just for GradientBoostingClassifier (i.e. just in the _validate_y method of GradientBoostingClassifier)?

I am not fully aware of how other classification algorithms are supposed to respond to a single classification target (i.e. whether to raise an error or not).

@jnothman
Member

jnothman commented Dec 9, 2017 via email

@jnothman
Member

This pull request introduces 1 alert - view on lgtm.com

new alerts:

  • 1 for Signature mismatch in overriding method

Comment posted by lgtm.com

@@ -998,7 +998,10 @@ def fit(self, X, y, sample_weight=None, monitor=None):

check_consistent_length(X, y, sample_weight)

y = self._validate_y(y)
if isinstance(self, GradientBoostingClassifier):
Member

This is now a bit silly. We have a method called _validate_y in both the regressor and the classifier to take advantage of polymorphism. Better off adding a sample_weight parameter to both.

Contributor Author

Actually, I did this initially, but I was trying not to interfere with the other classifiers' and regressors' methods.

Member

Stylistic decisions, while affecting readability and maintainability, can be hard to weigh up.

As they say:

There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.

I'm part Dutch :O)

@jnothman jnothman changed the title [MRG] Make GradientBoostingClassifier error message more informative [MRG+1] Make GradientBoostingClassifier error message more informative Dec 12, 2017
@amueller
Member

Is this a problem only in GradientBoostingClassifier? How do other classifiers behave in this case?
Otherwise LGTM

Member

@amueller amueller left a comment

Maybe we should open an issue to add a common test?

@@ -27,6 +27,7 @@
from sklearn.utils.testing import assert_less
from sklearn.utils.testing import assert_raise_message
from sklearn.utils.testing import assert_raises
from sklearn.utils.testing import assert_raise_message
Member

The same import is done two lines above

@gxyd
Contributor Author

gxyd commented Dec 12, 2017

I think this is an issue with RandomForestClassifier as well (if I am using it correctly):

In [4]: X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

In [5]: y = np.array([1, 1, 1, 2, 2, 2])

In [6]: from sklearn.ensemble import RandomForestClassifier, VotingClassifier

In [7]: RandomForestClassifier
Out[7]: sklearn.ensemble.forest.RandomForestClassifier

In [8]: weights = np.array([0, 0, 0, 1, 1, 1])

In [9]: RandomForestClassifier
Out[9]: sklearn.ensemble.forest.RandomForestClassifier

In [10]: clf = RandomForestClassifier(max_depth=2, random_state=0)

In [11]: clf.fit(X, y, sample_weight=weights)
Out[11]: 
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

It doesn't raise a ValueError.

@gxyd
Contributor Author

gxyd commented Dec 12, 2017 via email

@gxyd
Contributor Author

gxyd commented Dec 12, 2017

@amueller I raised the point of addressing this for other classifiers here: #10207 (comment).

Perhaps we should move all of this into check_classification_targets? As far as I can see, it is first used by the _validate_y_... method of RandomForestClassifier as well.

@amueller
Member

Random forests actually "work" with a single class IIRC, so not raising an error doesn't seem like a problem?
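Indeed, this can be checked directly (assuming a recent scikit-learn; the snippet mirrors the session above but with a single class):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 1, 1])          # a single class

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)                        # no error raised
print(clf.predict([[0, 0]]))         # the lone class is always predicted
```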

@gxyd
Contributor Author

gxyd commented Dec 12, 2017 via email

@amueller
Member

I don't think this has a lot to do with machine learning theory, more with the particularities of how each of our estimators handles some edge-cases.

@jnothman
Member

jnothman commented Dec 13, 2017 via email

@gxyd
Contributor Author

gxyd commented Dec 14, 2017

Are you expecting me to make those changes here? (No hurry; I just want to know whether you're waiting on me.)

@amueller
Member

I'm fine with merging this and opening another issue for adding a common test.

@jnothman jnothman merged commit c9e6d4d into scikit-learn:master Dec 18, 2017
@jnothman
Member

Thanks @gxyd

@gxyd gxyd deleted the err-message-boosting branch December 18, 2017 13:01
@gxyd
Contributor Author

gxyd commented Dec 18, 2017 via email

@gxyd
Contributor Author

gxyd commented Dec 18, 2017 via email

@jnothman
Member

jnothman commented Dec 18, 2017 via email
