[MRG+1] Make GradientBoostingClassifier error message more informative #10207
Conversation
```python
else:
    sample_weight = column_or_1d(sample_weight, warn=True)

n_pos = np.sum(sample_weight * y)
n_neg = np.sum(sample_weight * (1 - y))
if n_pos == 0 or n_neg == 0:
```
do we need an approximate equality here? (np.allclose)
I had to merge in the master branch, otherwise it was raising a 'cythonizing' error (that issue is resolved). I have used `np.isclose` instead of `np.allclose`. Tests are still failing, though.
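To illustrate the suggestion above, here is a minimal sketch (with made-up toy data) of what the `n_pos`/`n_neg` computation does and how `np.isclose` would replace the exact comparison:

```python
import numpy as np

# Toy binary labels and weights: all weight sits on the negative class.
y = np.array([0, 0, 0, 1, 1])
sample_weight = np.array([1.0, 1.0, 1.0, 0.0, 0.0])

n_pos = np.sum(sample_weight * y)        # total weight on the positive class
n_neg = np.sum(sample_weight * (1 - y))  # total weight on the negative class

# Exact equality (== 0) works in this toy case, but np.isclose also guards
# against tiny floating-point residue left over from weighted sums.
print(np.isclose(n_pos, 0.0), np.isclose(n_neg, 0.0))  # → True False
```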
I think the problem is that you're testing for exactly two classes, but should be testing for at least two classes.
I currently get errors like the following, which I can understand. But the thing is, I am supposed to expect:
Classification targets may be strings or ints.
Though the title of the original issue mentions `GradientBoostingClassifier`, Andy commented in #6435 (comment) that the check should be added in the gradient boosting base class, indicating that it should apply to both `GradientBoostingClassifier` and `GradientBoostingRegressor`. Am I right in my understanding so far?
The current error is raised because the condition for disappearing features would be different for a multioutput regressor than for a `MultiOutputClassifier`?
Currently two types of errors are raised (three in total); the second type is shown above. I think the first type of error can be fixed (do we actually expect it to raise an error?), but I don't understand the second error.
It also seems like there may be an unrelated lgtm failure.
Well, you should be looking at test_gradient_boosting.py line 575 and estimator_checks.py line 671 to work out whether your code or the tests need modification.
Why are you distinguishing between floats and other class dtypes?
```python
if np.count_nonzero(np.bincount(y_, weights=sample_weight)) < 2:
    raise ValueError("y should contain at least 2 classes after "
                     "sample_weight trims samples with zero weights.")
if not np.issubdtype(y.dtype, np.dtype('float')):
```
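For context, a small self-contained sketch (toy labels, assumed already label-encoded to `0..n_classes-1`) of what the bincount-based check computes:

```python
import numpy as np

# Toy encoded labels and weights; class 1 carries zero total weight.
y_ = np.array([0, 0, 1, 1, 2])
sample_weight = np.array([1.0, 2.0, 0.0, 0.0, 3.0])

# Total weight per class; zero-weight samples contribute nothing.
per_class_weight = np.bincount(y_, weights=sample_weight)  # → [3., 0., 3.]
n_trim_classes = np.count_nonzero(per_class_weight)        # → 2

# Class 1 was "trimmed" away by its zero weights, leaving 2 classes,
# so this toy input still passes the minimum-two-classes check.
if n_trim_classes < 2:
    raise ValueError("y should contain at least 2 classes after "
                     "sample_weight trims samples with zero weights.")
```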
I don't get why this is relevant
Because `BaseGradientBoosting` can only differentiate between a classification and a regression problem by checking the dtype of `y`. The original issue describes the problem for classification, not for regression.
I was thinking of implementing a separate `fit` method for `GradientBoostingClassifier` containing this ValueError check but excluding the float dtype comparison.
WDYT?
The float dtype check is not appropriate; ints can be regression targets, for instance. Better off using `isinstance(self, ...)`, or, if you feel it is not going to duplicate much code, yes, you can implement a separate `GBC.fit`.
Nope, that would mostly be duplicated code; I've refrained from it. I think this is ready for review.
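To illustrate the reviewer's point that a float dtype check cannot tell the two tasks apart, here is a sketch with toy arrays; the two classes at the end are stand-ins for the real scikit-learn estimators, for illustration only:

```python
import numpy as np

# Integer targets can be a perfectly valid regression problem
# (counts, years, prices in cents), so dtype alone is ambiguous.
y_regression = np.array([3, 7, 12, 5])
print(np.issubdtype(y_regression.dtype, np.floating))  # → False

# Checking the estimator's own class is unambiguous. These are stand-in
# classes, not the actual scikit-learn estimators.
class GradientBoostingRegressor:
    pass

class GradientBoostingClassifier:
    pass

est = GradientBoostingClassifier()
is_classification = isinstance(est, GradientBoostingClassifier)
print(is_classification)  # → True
```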
```python
if n_trim_classes < 2:
    raise ValueError("y contains %d class after sample_weight "
                     "trimmed classes with zero weights, while a "
                     "minimum of 1 class is required."
```
Shouldn't this be "two classes"?
Aah, yes, I made a mistake here. It should be fixed now.
```
@@ -996,6 +997,17 @@ def fit(self, X, y, sample_weight=None, monitor=None):
        else:
            sample_weight = column_or_1d(sample_weight, warn=True)

        if isinstance(self, GradientBoostingClassifier):
```
`_validate_y` already does this label encoding, albeit without a LabelEncoder, and it is already specialised to regression/classification. Why not just add `sample_weight` as a parameter to `_validate_y` and handle it there?
That sounds like a neat way of doing it. Would this check be done for any classification algorithm with such a `y` (i.e. in `check_classification_targets`), or just for `GradientBoostingClassifier` (i.e. just in its `_validate_y` method)? I am not fully aware of how other classification algorithms are supposed to respond to a single classification target (i.e. whether they should raise an error or not).
Just for GBC for now.
This pull request introduces 1 new alert - view on lgtm.com.
Comment posted by lgtm.com
```
@@ -998,7 +998,10 @@ def fit(self, X, y, sample_weight=None, monitor=None):

        check_consistent_length(X, y, sample_weight)

        y = self._validate_y(y)
        if isinstance(self, GradientBoostingClassifier):
```
This is now a bit silly. We have a method called `_validate_y` in both the regressor and the classifier precisely to take advantage of polymorphism. Better off adding a `sample_weight` parameter to both.
Actually, I initially did this, but I kept reminding myself not to interfere with the other classifiers' or regressors' methods.
Stylistic decisions, while affecting readability and maintainability, can be hard to weigh up.
As they say:
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
I'm part dutch :O)
Is this a problem only in
Maybe we should open an issue to add a common test?
```
@@ -27,6 +27,7 @@
from sklearn.utils.testing import assert_less
from sklearn.utils.testing import assert_raise_message
from sklearn.utils.testing import assert_raises
from sklearn.utils.testing import assert_raise_message
```
The same import is done two lines above
I think this is an issue with:

```python
In [4]: X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
In [5]: y = np.array([1, 1, 1, 2, 2, 2])
In [6]: from sklearn.ensemble import RandomForestClassifier, VotingClassifier
In [7]: RandomForestClassifier
Out[7]: sklearn.ensemble.forest.RandomForestClassifier
In [8]: weights = np.array([0, 0, 0, 1, 1, 1])
In [9]: RandomForestClassifier
Out[9]: sklearn.ensemble.forest.RandomForestClassifier
In [10]: clf = RandomForestClassifier(max_depth=2, random_state=0)
In [11]: clf.fit(X, y, sample_weight=weights)
Out[11]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)
```

It doesn't raise a ValueError.
I have removed the double import. Thanks @glemaitre.
@amueller I raised the point of addressing this for other classifiers here (#10207 (comment)) as well. Perhaps we should move all this to
Random forests actually "work" with a single class, IIRC, so not giving an error doesn't seem to be a problem?
Oh, never mind, it doesn't seem like a problem. I guess I should learn some more basic machine learning theory.
I don't think this has a lot to do with machine learning theory, more with the particularities of how each of our estimators handles some edge-cases.
Yes, we can test it like we test single class/sample, but in general it's
not very important. I thought it was a concern here because GBC does things
like partition the data, changing the distribution of weights per class,
etc.
Are you expecting me to make those changes here? (No hurry.) I just want to know whether you are waiting on me.
I'm fine with merging this and opening another issue for adding a common test?
Thanks @gxyd |
Are you opening an issue for the comment that Andreas made? Otherwise we might not remember it.
I've tried to write up the issue here: #10337. If you think any changes are necessary, please make them.
Thanks.
Reference Issues/PRs
Fixes #6433
Fixes #6435
What does this implement/fix? Explain your changes.
Takes commit from PR #6435 and tries to complete that PR.