FEA Add Gamma deviance as loss function to HGBT #22409
Changes from all commits
```diff
@@ -17,7 +17,7 @@
 from sklearn.base import clone, BaseEstimator, TransformerMixin
 from sklearn.base import is_regressor
 from sklearn.pipeline import make_pipeline
-from sklearn.metrics import mean_poisson_deviance
+from sklearn.metrics import mean_gamma_deviance, mean_poisson_deviance
 from sklearn.dummy import DummyRegressor
 from sklearn.exceptions import NotFittedError
 from sklearn.compose import make_column_transformer
```
```diff
@@ -248,8 +248,64 @@ def test_absolute_error_sample_weight():
     gbdt.fit(X, y, sample_weight=sample_weight)


+@pytest.mark.parametrize("y", [([1.0, -2.0, 0.0]), ([0.0, 1.0, 2.0])])
+def test_gamma_y_positive(y):
+    # Test that ValueError is raised if any y_i <= 0.
+    err_msg = r"loss='gamma' requires strictly positive y."
+    gbdt = HistGradientBoostingRegressor(loss="gamma", random_state=0)
+    with pytest.raises(ValueError, match=err_msg):
+        gbdt.fit(np.zeros(shape=(len(y), 1)), y)
+
+
+def test_gamma():
```
Review comment: Currently, this test fails because the HGBT with Poisson loss out-performs all other models on the test set.

```python
from lightgbm import LGBMRegressor

lgbm_gamma = LGBMRegressor(objective="gamma")
lgbm_pois = LGBMRegressor(objective="poisson")
```

I don't know how to construct a better test, for now.

Reply: Is this true only for the small sample sizes that we typically use in our tests, or would this still be true for a very large sample size?

Is it similar to the fact that the median can be a better estimate of the expected value than the sample mean for a finite sample of a long-tailed distribution (robustness to outliers)? This was observed empirically when computing metrics on a left-out validation set here: https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_quantile.html

Reply: I played with `n_samples` and `n_features` without success. The problem that the Poisson HGBT always beats the Gamma HGBT out-of-sample prevails. I think one reason is that the log link is not canonical for the Gamma deviance, but it is canonical for the Poisson deviance. Therefore, the Poisson HGBT is better calibrated, which here also translates to better out-of-sample performance.

Maybe the features are not informative enough in this example. I can try to generate a decision tree and use it to generate a Gamma distributed sample per tree node. That should be easier to fit for tree based models than this GLM based data generation.

Reply: That would be interesting to know indeed.

Reply: I tried to create gamma distributed data with a decision tree structure for the mean value, see details.

Proposition for solutions: I'm in favor of the first point.

```python
import numpy as np

from sklearn._loss import HalfGammaLoss
from sklearn.dummy import DummyRegressor
from sklearn.datasets import make_low_rank_matrix
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_gamma_deviance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from lightgbm import LGBMRegressor

### Data Generation
rng = np.random.RandomState(42)
n_train, n_test, n_features = 500, 500, 20
X = make_low_rank_matrix(
    n_samples=n_train + n_test, n_features=n_features, random_state=rng,
)
# First, we create normally distributed data.
coef = rng.uniform(low=-10, high=20, size=n_features)
y_normal = np.abs(rng.normal(loc=np.exp(X @ coef)))
# Second, we fit a decision tree on it.
dtree = DecisionTreeRegressor().fit(X, y_normal)
# Finally, we generate gamma distributed data according to the tree:
#   mean(y) = dtree.predict(X)
#   var(y)  = dispersion * mean(y)**2
dispersion = 0.5
y = rng.gamma(shape=1 / dispersion, scale=dispersion * dtree.predict(X))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=n_test, random_state=rng
)

### Model fitting
params = {"early_stopping": False, "random_state": 123}
p_lgbt = {"min_child_weight": 0, "random_state": 123}
gbdt_gamma = HistGradientBoostingRegressor(loss="gamma", **params)
gbdt_pois = HistGradientBoostingRegressor(loss="poisson", **params)
gbdt_ls = HistGradientBoostingRegressor(loss="squared_error", **params)
lgbm_gamma = LGBMRegressor(objective="gamma", **p_lgbt)
lgbm_pois = LGBMRegressor(objective="poisson", **p_lgbt)
for model in (gbdt_gamma, gbdt_pois, gbdt_ls, lgbm_gamma, lgbm_pois):
    model.fit(X_train, y_train)
dummy = DummyRegressor(strategy="mean").fit(X_train, y_train)

for m in (dummy, gbdt_gamma, gbdt_ls, gbdt_pois, lgbm_gamma, lgbm_pois):
    message = f"training gamma deviance {m.__class__.__name__}"
    if hasattr(m, "loss"):
        message += " " + m.loss
    elif hasattr(m, "objective"):
        message += " " + m.objective
    print(
        f"{message: <68}: "
        f"{mean_gamma_deviance(y_train, np.clip(m.predict(X_train), 1e-12, None))}"
    )
for m in (dummy, gbdt_gamma, gbdt_ls, gbdt_pois, lgbm_gamma, lgbm_pois):
    message = f"test gamma deviance {m.__class__.__name__}"
    if hasattr(m, "loss"):
        message += " " + m.loss
    elif hasattr(m, "objective"):
        message += " " + m.objective
    print(
        f"{message: <68}: "
        f"{mean_gamma_deviance(y_test, np.clip(m.predict(X_test), 1e-12, None))}"
    )
```

Reply: I replaced Poisson with squared error as the model to compare against, bb234ee. Now it works. Comments on the comparison to Poisson are preserved.
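As a quick sanity check on the parametrization used in the data-generation script above (a standalone numpy sketch, not part of the PR): with shape `k = 1 / dispersion` and scale `theta = dispersion * mu`, numpy's gamma sampler indeed reproduces `mean = mu` and `variance = dispersion * mu**2`.

```python
import numpy as np

# Sanity check of the Gamma parametrization used in the data generation:
# shape k = 1 / dispersion and scale theta = dispersion * mu give
#   mean     = k * theta    = mu
#   variance = k * theta**2 = dispersion * mu**2
rng = np.random.RandomState(0)
dispersion, mu = 0.5, 3.0
y = rng.gamma(shape=1 / dispersion, scale=dispersion * mu, size=200_000)
print(y.mean())  # close to mu = 3.0
print(y.var())   # close to dispersion * mu**2 = 4.5
```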
```diff
+    # For a Gamma distributed target, we expect an HGBT trained with the Gamma deviance
+    # (loss) to give better results than an HGBT with any other loss function, measured
+    # in out-of-sample Gamma deviance as metric/score.
+    # Note that squared error could potentially predict negative values which is
+    # invalid (np.inf) for the Gamma deviance. A Poisson HGBT (having a log link)
+    # does not have that defect.
+    # Important note: It seems that a Poisson HGBT almost always has better
+    # out-of-sample performance than the Gamma HGBT, measured in Gamma deviance.
+    # LightGBM shows the same behaviour. Hence, we only compare to a squared error
+    # HGBT, but not to a Poisson deviance HGBT.
+    rng = np.random.RandomState(42)
+    n_train, n_test, n_features = 500, 100, 20
+    X = make_low_rank_matrix(
+        n_samples=n_train + n_test,
+        n_features=n_features,
+        random_state=rng,
+    )
+    # We create a log-linear Gamma model. This gives y.min ~ 1e-2, y.max ~ 1e2.
+    coef = rng.uniform(low=-10, high=20, size=n_features)
+    # Numpy parametrizes gamma(shape=k, scale=theta) with mean = k * theta and
+    # variance = k * theta^2. We parametrize it instead with mean = exp(X @ coef)
+    # and variance = dispersion * mean^2 by setting k = 1 / dispersion,
+    # theta = dispersion * mean.
+    dispersion = 0.5
+    y = rng.gamma(shape=1 / dispersion, scale=dispersion * np.exp(X @ coef))
+    X_train, X_test, y_train, y_test = train_test_split(
+        X, y, test_size=n_test, random_state=rng
+    )
+    gbdt_gamma = HistGradientBoostingRegressor(loss="gamma", random_state=123)
+    gbdt_mse = HistGradientBoostingRegressor(loss="squared_error", random_state=123)
+    dummy = DummyRegressor(strategy="mean")
+    for model in (gbdt_gamma, gbdt_mse, dummy):
+        model.fit(X_train, y_train)
+
+    for X, y in [(X_train, y_train), (X_test, y_test)]:
+        loss_gbdt_gamma = mean_gamma_deviance(y, gbdt_gamma.predict(X))
+        # We restrict the squared error HGBT to predict at least the minimum seen y at
+        # train time to make it strictly positive.
+        loss_gbdt_mse = mean_gamma_deviance(
+            y, np.maximum(np.min(y_train), gbdt_mse.predict(X))
+        )
+        loss_dummy = mean_gamma_deviance(y, dummy.predict(X))
+        assert loss_gbdt_gamma < loss_dummy
+        assert loss_gbdt_gamma < loss_gbdt_mse
+
+
 @pytest.mark.parametrize("quantile", [0.2, 0.5, 0.8])
-def test_asymmetric_error(quantile):
+def test_quantile_asymmetric_error(quantile):
     """Test quantile regression for asymmetric distributed targets."""
     n_samples = 10_000
     rng = np.random.RandomState(42)
```
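To see why the squared-error predictions are clipped above, here is a minimal numpy-only sketch of the mean Gamma deviance, d(y, mu) = 2 * (log(mu / y) + y / mu - 1) (the unit deviance also used by sklearn's `mean_gamma_deviance`): it is zero for perfect predictions but undefined for non-positive predictions, hence the `np.maximum` clipping.

```python
import numpy as np

# Minimal sketch of the mean Gamma deviance:
#   d(y, mu) = 2 * (log(mu / y) + y / mu - 1)
# It is 0 when mu == y and undefined for mu <= 0, which is why the
# squared-error model's predictions must be made strictly positive.
def mean_gamma_dev(y, mu):
    return 2 * np.mean(np.log(mu / y) + y / mu - 1)

y = np.array([1.0, 2.0, 3.0])
print(mean_gamma_dev(y, y))  # perfect predictions -> 0.0

raw_pred = np.array([-1.0, 2.0, 3.0])  # contains a negative prediction
clipped = np.maximum(0.1, raw_pred)    # clip to a strictly positive floor
print(mean_gamma_dev(y, clipped))      # finite and positive
```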
Review comment: This looks a bit lax. Any idea why we can be stricter in the case of the squared error? I wouldn't require making it stricter as a prerequisite to merge this PR, as the Poisson loss is already in `main` but the LightGBM equivalence was previously just not tested. But we could at least add a TODO comment to inform the reader that we are not 100% satisfied with the state of things :)

Reply: Maybe this is because, in the context of the squared error, we only test for the equivalence with the larger dataset with shallow trees (`max_leaf_nodes < 10 and n_samples >= 1000`): this condition might allow the averaging effect in leaves to mitigate the impact of rounding errors and maybe reduce the likelihood of ties when deciding on which features and thresholds to split on?

Reply: Maybe LightGBM's algorithm deviates more from ours for general losses (i.e. not squared error). For instance, in the Poisson case, LightGBM has a `poisson_max_delta_step` parameter which I set to `1e-12`, but I'm not 100% sure of the effective difference this causes. I'll add a comment as suggested.

Reply: And BTW, our losses have a better test suite than the ones in LightGBM :smirk: