
ENH replace loss module Gradient boosting #26278


Merged

Conversation

@lorentzenchr (Member) commented Apr 24, 2023

Reference Issues/PRs

Closes #25964.

What does this implement/fix? Explain your changes.

This replaces the losses from _gb_losses.py with the ones from the common private loss submodule sklearn._loss in a very backward compatible way. Some of the losses in the two implementations differ by a constant factor of 2; those factors are accounted for here.

Any other comments?

  1. The parallelism available in the new loss functions is not enabled yet. This is deferred to later PRs (and contributor capacity).
  2. Should we discuss the backward compatibility of the attributes oob_improvement_ and train_score_ w.r.t. the above-mentioned constant factor of the loss (see the sketch below)? As oob_scores_ and oob_score_ are about to be introduced with 1.3, we could still change them.
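
For illustration, here is a minimal NumPy sketch of the constant factor mentioned in point 2 (an assumption about the conventions, not code from this PR): the binomial deviance of the old _gb_losses.py equals twice the per-sample negative log-likelihood, i.e. twice the "half" binomial loss convention used by sklearn._loss.

import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100).astype(float)  # binary targets
raw = rng.normal(size=100)                      # raw (log-odds) predictions
p = 1.0 / (1.0 + np.exp(-raw))                  # predicted probabilities

# "half" binomial loss: mean negative log-likelihood, log(1 + exp(raw)) - y * raw
half_binomial = np.mean(np.logaddexp(0.0, raw) - y * raw)

# old binomial deviance convention: -2 * mean log-likelihood
deviance = -2.0 * np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

assert np.isclose(deviance, 2.0 * half_binomial)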

@lorentzenchr (Member, Author) commented Apr 24, 2023

A very simple benchmark for binary classification gives:

PR:   768 ms ± 9.76 ms
main: 937 ms ± 128 ms
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
n_samples, n = 10, 10_000
y = np.tile(np.arange(n_samples) % 2, n)
x1 = np.minimum(y, n_samples / 2)
x2 = np.minimum(-y, -n_samples / 2)
X = np.c_[x1, x2]
%timeit GradientBoostingClassifier(n_estimators=100).fit(X, y)

Same for multiclass classification (10 classes) gives:

PR:   9.53 s ± 76.7
main: 28.8 s ± 324 ms
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
n_samples, n = 10, 10_000
y = np.tile(np.arange(n_samples), n)
x1 = np.minimum(y, n_samples / 2)
x2 = np.minimum(-y, -n_samples / 2)
X = np.c_[x1, x2]
%timeit GradientBoostingClassifier(n_estimators=100).fit(X, y)

@lorentzenchr (Member, Author)

Given the size of this PR ...

Therefore, I provided the different test_XXX_exact_backward_compat tests.

The exact origin for the necessity of this change is unclear. The train_score_ of the GBT inside the pipeline is exactly the same for this branch and 1.2.2.
try:
    return numerator / denominator
except FloatingPointError:
    return 0.0
Member

For codecov, can we add a small test to trigger the divide by zero?

Member Author

I tried to find a case with the GB estimator, but that's hard. So I simply added test_safe_divide.
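
For reference, a sketch of what such a test could look like (hypothetical: the helper shown here is a simplified stand-in, not the merged _safe_divide or test_safe_divide):

import numpy as np

def safe_divide(numerator, denominator):
    # Simplified stand-in for the helper under test: return 0.0 instead of
    # propagating a division error.
    try:
        with np.errstate(divide="raise"):
            return numerator / denominator
    except FloatingPointError:
        return 0.0

def test_safe_divide():
    # array / 0.0 raises FloatingPointError under errstate, so 0.0 is returned
    assert safe_divide(np.array([1.0]), np.float64(0.0)) == 0.0
    # an ordinary division is passed through unchanged
    assert safe_divide(np.float64(3.0), np.float64(2.0)) == 1.5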

Member

Apparently, we have a nice example that checks this one ;)

@thomasjpfan (Member) left a comment

Thank you for the updates! LGTM

@thomasjpfan added the "Waiting for Second Reviewer" label (First reviewer is done, need a second one!) Aug 27, 2023
@lorentzenchr (Member, Author)

Dear future 2nd reviewer: this is much easier to review than it looks; just have a look at the tests.

@OmarManzoor (Contributor) left a comment
Thanks for the PR @lorentzenchr. A few comments.

@OmarManzoor (Contributor) left a comment
Thank you for the updates @lorentzenchr! Looks good now. Just a few minor comments.

@OmarManzoor (Contributor) left a comment
LGTM. Thanks @lorentzenchr

@OmarManzoor merged commit 5dbb8f5 into scikit-learn:main Sep 7, 2023
@ogrisel deleted the gradient_boosting_common_loss branch September 7, 2023 12:48
@lesteve (Member) commented Sep 7, 2023

It looks like one of the examples broke after merging this PR, namely examples/ensemble/plot_gradient_boosting_regularization.py, because of NaNs.

From build log:

Unexpected failing examples:
/home/circleci/project/examples/ensemble/plot_gradient_boosting_regularization.py failed leaving traceback:
Traceback (most recent call last):
  File "/home/circleci/project/examples/ensemble/plot_gradient_boosting_regularization.py", line 77, in <module>
    test_deviance[i] = 2 * log_loss(y_test, y_proba[:, 1])
  File "/home/circleci/project/sklearn/utils/_param_validation.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/circleci/project/sklearn/metrics/_classification.py", line 2831, in log_loss
    y_pred = check_array(
  File "/home/circleci/project/sklearn/utils/validation.py", line 958, in check_array
    _assert_all_finite(
  File "/home/circleci/project/sklearn/utils/validation.py", line 123, in _assert_all_finite
    _assert_all_finite_element_wise(
  File "/home/circleci/project/sklearn/utils/validation.py", line 172, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input contains NaN.

@glemaitre (Member)

@lesteve was faster than me to report this.
I will try to make a fix if I find out what the issue is :).

@glemaitre (Member) commented Sep 7, 2023

OK, so it is due to _safe_divide, where we expect numerator / denominator to raise an error via np.errstate. However, this does not work when dividing two np.float64 scalars:

In [5]: with np.errstate(divide="raise"):
   ...:     np.float64(0.0) / np.float64(0.0)
   ...: 
<ipython-input-5-941bf0f2d1fa>:2: RuntimeWarning: invalid value encountered in scalar divide
  np.float64(0.0) / np.float64(0.0)

In [6]: with np.errstate(divide="raise"):
   ...:     np.array([1.0, 2.0]) / np.float64(0.0)
   ...: 
---------------------------------------------------------------------------
FloatingPointError                        Traceback (most recent call last)
Cell In[6], line 2
      1 with np.errstate(divide="raise"):
----> 2     np.array([1.0, 2.0]) / np.float64(0.0)

FloatingPointError: divide by zero encountered in divide

Looking at the code, it seems we expect the second case. If we are sure to always have scalars, we could convert them to Python scalars, which would always raise an error.
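
One possible way to also cover the scalar np.float64(0.0) / np.float64(0.0) case (a sketch of an option, not necessarily the fix that was eventually merged, and assuming the installed NumPy applies errstate to scalar operations as the warning above suggests) is to raise on "invalid" operations as well, since 0/0 produces nan under the "invalid" category rather than "divide":

import numpy as np

def safe_divide(numerator, denominator):
    # Hypothetical variant: "divide" covers x / 0 for arrays, while "invalid"
    # covers the scalar 0.0 / 0.0 case, which only warns under divide="raise".
    try:
        with np.errstate(divide="raise", invalid="raise"):
            return numerator / denominator
    except FloatingPointError:
        return 0.0

print(safe_divide(np.float64(0.0), np.float64(0.0)))  # 0.0 instead of nan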

@glemaitre (Member)

Actually, the only case that goes sideways is when both the numerator and the denominator go to 0.0. I assume this can happen when the previous iteration returned 0.0 because of the FloatingPointError.

REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
iamDecode added a commit to iamDecode/sklearn-pmml-model that referenced this pull request Apr 14, 2024
valeriy42 added a commit to elastic/eland that referenced this pull request Nov 11, 2024
Introduce a warning indicating that exporting data frame analytics models as ESGradientBoostingModel subclasses is deprecated and will be removed in version 9.0.0.

The implementation of ESGradientBoostingModel relies on importing undocumented private classes that were changed in scikit-learn 1.4 by scikit-learn/scikit-learn#26278. This dependency makes the code difficult to maintain, while the functionality is not widely used. Therefore, we will deprecate this functionality in 8.16 and remove it completely in 9.0.0.

---------

Co-authored-by: Quentin Pradet <quentin.pradet@elastic.co>
Labels
module:ensemble, Waiting for Second Reviewer (First reviewer is done, need a second one!)

Successfully merging this pull request may close these issues:
Use common loss module in gradient boosting

5 participants