ENH replace loss module Gradient boosting #26278

Conversation
* Make it similar to HGBT
* Make it more similar to HGBT
Note that in the case raw_prediction == y, we now have a different sign in AbsoluteError.gradient
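For illustration, here is a minimal sketch to inspect that gradient at the tie point. It relies on the private sklearn._loss module, so the import path and the gradient() signature are assumptions that may change between versions:

```python
import numpy as np
from sklearn._loss.loss import AbsoluteError  # private module, assumed import path

loss = AbsoluteError()
y = np.array([1.0, 2.0, 3.0])
raw = y.copy()  # raw_prediction == y, the tie case mentioned above

# The absolute error gradient is sign(raw_prediction - y); its value at the
# tie point is an implementation choice, which is why the sign convention can
# differ from the old _gb_losses code.
print(loss.gradient(y_true=y, raw_prediction=raw))
```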
A very simple benchmark for binary classification gives:
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

n_samples, n = 10, 10_000
y = np.tile(np.arange(n_samples) % 2, n)
x1 = np.minimum(y, n_samples / 2)
x2 = np.minimum(-y, -n_samples / 2)
X = np.c_[x1, x2]

%timeit GradientBoostingClassifier(n_estimators=100).fit(X, y)
```

The same for multiclass classification (10 classes) gives:
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

n_samples, n = 10, 10_000
y = np.tile(np.arange(n_samples), n)
x1 = np.minimum(y, n_samples / 2)
x2 = np.minimum(-y, -n_samples / 2)
X = np.c_[x1, x2]

%timeit GradientBoostingClassifier(n_estimators=100).fit(X, y)
```
Therefore, I provided the different
It is not entirely clear why this change is necessary. The train_score_ of the GBT inside the pipeline is exactly the same for this branch and for 1.2.2.
```python
try:
    return numerator / denominator
except FloatingPointError:
    return 0.0
```
For codecov, can we add a small test to trigger the divide by zero?
I tried to find a case with a GB estimator, but that's hard. So I simply added test_safe_divide.
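For context, a rough sketch of what such a test could look like. This is a hypothetical re-implementation of the helper shown in the diff above, not the actual scikit-learn code; in particular, wrapping the division in np.errstate is an assumption, since numpy only raises FloatingPointError when the corresponding error category is set to "raise":

```python
import numpy as np


def safe_divide(numerator, denominator):
    # Hypothetical helper mirroring the try/except pattern from the diff above.
    # np.errstate makes numpy raise FloatingPointError instead of returning
    # inf/nan with a RuntimeWarning.
    with np.errstate(divide="raise", invalid="raise"):
        try:
            return numerator / denominator
        except FloatingPointError:
            return 0.0


def test_safe_divide():
    # x / 0 ("divide") and 0 / 0 ("invalid") are both mapped to 0.0.
    assert safe_divide(np.float64(1.0), np.float64(0.0)) == 0.0
    assert safe_divide(np.float64(0.0), np.float64(0.0)) == 0.0
    assert safe_divide(np.float64(1.0), np.float64(2.0)) == 0.5


test_safe_divide()
```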
Apparently, we have a nice example that checks this one ;)
- This test is not necessary anymore. Already tested in the _loss module.
Thank you for the updates! LGTM
Dear future 2nd reviewer
Thanks for the PR @lorentzenchr. A few comments.
Thank you for the updates @lorentzenchr! Looks good now. Just a few minor comments.
LGTM. Thanks @lorentzenchr
It looks like one of the examples broke after merging this PR. From the build log:
@lesteve was faster than me to report it.
OK, so it is due to the following behavior:

```
In [5]: with np.errstate(divide="raise"):
   ...:     np.float64(0.0) / np.float64(0.0)
   ...:
<ipython-input-5-941bf0f2d1fa>:2: RuntimeWarning: invalid value encountered in scalar divide
  np.float64(0.0) / np.float64(0.0)

In [6]: with np.errstate(divide="raise"):
   ...:     np.array([1.0, 2.0]) / np.float64(0.0)
   ...:
---------------------------------------------------------------------------
FloatingPointError                        Traceback (most recent call last)
Cell In[6], line 2
      1 with np.errstate(divide="raise"):
----> 2     np.array([1.0, 2.0]) / np.float64(0.0)

FloatingPointError: divide by zero encountered in divide
```

Looking at the code, it seems that we expect the second case. If we are sure to always have scalars, then we could convert them to Python scalars, which would always raise an error.
Actually, the only case that goes sideways is when both the numerator and the denominator go to 0.0. I assume this can happen when, at the previous iteration, we return 0.0 because of the
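A minimal illustration of that distinction (plain numpy/Python behavior, not scikit-learn code): 0.0 / 0.0 falls under numpy's "invalid" error category rather than "divide", so it does not raise under divide="raise", whereas plain Python floats always raise on division by zero.

```python
import numpy as np

# Python scalars always raise, independent of np.errstate.
try:
    float(0.0) / float(0.0)
except ZeroDivisionError as exc:
    print("Python scalars:", exc)

# numpy scalars: 0.0 / 0.0 is an "invalid" operation, not a "divide" one,
# so divide="raise" only produces a RuntimeWarning and returns nan ...
with np.errstate(divide="raise"):
    print("divide='raise':", np.float64(0.0) / np.float64(0.0))

# ... while invalid="raise" turns it into a FloatingPointError.
with np.errstate(invalid="raise"):
    try:
        np.float64(0.0) / np.float64(0.0)
    except FloatingPointError as exc:
        print("invalid='raise':", exc)
```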
Introduce a warning indicating that exporting data frame analytics models as ESGradientBoostingModel subclasses is deprecated and will be removed in version 9.0.0. The implementation of ESGradientBoostingModel relies on importing undocumented private classes that were changed in 1.4 by scikit-learn/scikit-learn#26278. This dependency makes the code difficult to maintain, while the functionality is not widely used. Therefore, we will deprecate this functionality in 8.16 and remove it completely in 9.0.0.

Co-authored-by: Quentin Pradet <quentin.pradet@elastic.co>
Reference Issues/PRs

Closes #25964.

What does this implement/fix? Explain your changes.

This replaces the losses from _gb_losses.py with the ones of the common private loss submodule sklearn._loss in a very backward compatible way. Some of the losses in the two implementations differ by a constant factor of 2; those factors are accounted for here.

Any other comments?

Should we change oob_improvement_ and train_score_ w.r.t. the above mentioned constant factor of the loss? As oob_scores_ and oob_score_ are about to be introduced with 1.3, we could still change them.
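As a rough numerical check of the factor-of-2 remark (using the private sklearn._loss module, so the import path and loss() signature are assumptions that may change): HalfBinomialLoss computes the per-sample value log(1 + exp(raw)) - y * raw, i.e. half of the binomial deviance that the old _gb_losses implementation reported.

```python
import numpy as np
from sklearn._loss.loss import HalfBinomialLoss  # private module, assumed import path

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=8).astype(np.float64)
raw = rng.normal(size=8)

per_sample = HalfBinomialLoss().loss(y_true=y, raw_prediction=raw)
manual = np.logaddexp(0.0, raw) - y * raw  # log(1 + exp(raw)) - y * raw
assert np.allclose(per_sample, manual)

# The old binomial deviance corresponds to twice this value (up to averaging),
# which is the constant factor accounted for in this PR.
print(2 * manual.mean())
```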