Description
Describe the workflow you want to enable
The histogram gradient boosted decision trees usually do not fulfil the so-called balance property on the training data, i.e. sum(predictions) == sum(observations) (with predict_proba in place of predict for a classifier). A simple "post-fit" step could ensure this condition. This will usually decrease the in-sample performance, but I have observed in practice that it is quite beneficial for the out-of-sample performance, in particular for non-canonical link-loss combinations such as the Gamma deviance with log link (observed with XGBoost or LightGBM, as this loss is not yet available in scikit-learn). The main advantage, however, is better calibrated models, in-sample as well as out-of-sample.
model = HistGradientBoostingRegressor()  # could also be a classifier
model.fit(X, y)
# Post-fit recalibration to fulfil the balance property: shift the raw
# (link-scale) baseline; use predict_proba instead of predict for a classifier.
model._baseline_prediction += link_function(np.mean(y) / np.mean(model.predict(X)))
For quantiles, this would be slightly different.
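For concreteness, here is a minimal runnable sketch of the recalibration for a log-link loss that scikit-learn already ships, loss="poisson" in HistGradientBoostingRegressor. With a log link, the multiplicative correction mean(y) / mean(predictions) on the response scale becomes an additive shift on the raw scale. Note that _baseline_prediction is a private attribute, so this is illustration only:

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.poisson(lam=np.exp(0.5 * X[:, 0]))  # positive targets, log-linear mean

model = HistGradientBoostingRegressor(loss="poisson", random_state=0)
model.fit(X, y)
print(y.mean(), model.predict(X).mean())  # balance property usually violated

# With a log link, rescaling predictions by mean(y) / mean(predictions)
# equals adding log(mean(y) / mean(predictions)) to the raw baseline.
model._baseline_prediction += np.log(y.mean() / model.predict(X).mean())
print(y.mean(), model.predict(X).mean())  # now equal up to rounding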
Describe your proposed solution
Add a new option post_fit_calibration (better name?!):
model = HistGradientBoostingRegressor(post_fit_calibration=True) # could also be a classifier
model.fit(X, y)
Describe alternatives you've considered, if relevant
One could also invent a meta-estimator.
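A minimal sketch of what such a meta-estimator could look like, assuming a multiplicative correction on the response scale (appropriate for a log link); the name BalancedRegressor and its exact API are made up for illustration:

import numpy as np
from sklearn.base import BaseEstimator, MetaEstimatorMixin, RegressorMixin, clone
from sklearn.ensemble import HistGradientBoostingRegressor


class BalancedRegressor(RegressorMixin, MetaEstimatorMixin, BaseEstimator):
    """Hypothetical meta-estimator enforcing the balance property.

    Fits the wrapped estimator, then stores the multiplicative correction
    mean(y) / mean(predict(X)) and applies it to every prediction.
    """

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator_ = clone(self.estimator).fit(X, y)
        self.correction_ = np.mean(y) / np.mean(self.estimator_.predict(X))
        return self

    def predict(self, X):
        return self.correction_ * self.estimator_.predict(X)


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.poisson(lam=np.exp(0.5 * X[:, 0]))

model = BalancedRegressor(HistGradientBoostingRegressor(loss="poisson"))
model.fit(X, y)
print(np.isclose(model.predict(X).mean(), y.mean()))  # True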
Additional context
No response