ENH Adds Poisson criterion in RandomForestRegressor #19304 #19836
Conversation
Could you also add a whats_new entry in v1.0.rst? Something like "official support of the Poisson splitting criterion for RandomForestRegressor" (you can use it right now, but it's currently not documented and not tested => not officially supported).
… well as updated whats_new for scikit-learn#19304
nitpicks
Force-pushed f93ba89 to 6fd45b2
Force-pushed 6fd45b2 to fd242b7
sklearn/ensemble/_forest.py (outdated)

```rst
.. versionadded:: 0.18
   Mean Absolute Error (MAE) criterion.

.. versionadded:: 0.24
```
As info for a 2nd reviewer: "poisson" was made available when it was added to decision trees in v0.24, but it was not documented nor tested here for forests.
Because it was not documented or tested, I do not think we should consider the feature "released". I think it is better to mark this as 1.0.
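As context for that point, here is a minimal sketch (synthetic data; everything except the estimator and criterion name is hypothetical) showing that `criterion="poisson"` is indeed already accepted in 0.24, since forests forward the criterion to their underlying trees:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 4))
y = rng.poisson(lam=2.0, size=100)  # the Poisson criterion requires y >= 0

# Fits without error even though forests never documented this option.
reg = RandomForestRegressor(criterion="poisson", random_state=0).fit(X, y)
print(reg.predict(X[:5]))
```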
LGTM. @bsun94 Thank you.
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Thank you for all the pointers @lorentzenchr!
Thank you for the PR @bsun94!
We need to update `test_regression` to run a random forest with the poisson criterion. The target would need to be adjusted so that it is greater than zero.
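One way to do that adjustment, as a rough sketch (the array below is a hypothetical stand-in for the test's `y_reg` target):

```python
import numpy as np

y_reg = np.array([-2.0, 0.0, 1.5, 3.0])  # contains non-positive values
# Zero out non-positive entries, then shift by 1 so every value is > 0;
# this mirrors the mask-and-shift approach in the reworked test below.
y_pos = np.where(y_reg > 0, y_reg, 0.0) + 1
assert np.all(y_pos > 0)
```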
Do you mean similar to: …
What about …
I am thinking of doing simple integration tests, such as:

```python
assert np.all(reg.predict(X) > 0)
```

At a glance, due to the bagging nature of forests, I do not think all the conditions that held for single trees also hold for forests.
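A runnable version of that one-liner, as a sketch with synthetic data (the data-generating choices are mine, not from the PR):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 5))
y = rng.poisson(lam=np.exp(X[:, 0]))  # non-negative count target

reg = RandomForestRegressor(criterion="poisson", random_state=0).fit(X, y)
# Poisson splits reject children whose target sum is zero, so each tree's
# leaf means are strictly positive, and so is their average.
assert np.all(reg.predict(X) > 0)
```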
apologies, I'm not quite following the convo - for now, I've tried to rework:

```python
def check_regression_criterion(name, criterion):
    # Check consistency on regression dataset.
    ForestRegressor = FOREST_REGRESSORS[name]
    if criterion == 'poisson':
        mask = y_reg > 0
        y_regr = y_reg * mask + 1
    else:
        y_regr = y_reg

    reg = ForestRegressor(n_estimators=5, criterion=criterion,
                          random_state=1)
    reg.fit(X_reg, y_regr)
    score = reg.score(X_reg, y_regr)
    assert score > 0.93, ("Failed with max_features=None, criterion %s "
                          "and score = %f" % (criterion, score))

    reg = ForestRegressor(n_estimators=5, criterion=criterion,
                          max_features=6, random_state=1)
    reg.fit(X_reg, y_regr)
    score = reg.score(X_reg, y_regr)
    assert score > 0.92, ("Failed with max_features=6, criterion %s "
                          "and score = %f" % (criterion, score))
```

This might not be what we're looking for, but thought I'd put something out to try and improve my understanding.
@bsun94 The point is that we should test, on some dataset, that the poisson forest regression actually works. One possibility is to test that a poisson forest has better predictive performance on a poisson distributed target compared to a squared error forest (see …). Ideally, we have both tests.
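For reference, the comparison metric used in the script further down (via `mean_poisson_deviance`) is the mean Poisson deviance, which for strictly positive predictions $\hat{y}_i$ reads

$$
D(y, \hat{y}) = \frac{2}{n} \sum_{i=1}^{n} \left( y_i \log\frac{y_i}{\hat{y}_i} - y_i + \hat{y}_i \right),
$$

with the convention $y_i \log(y_i / \hat{y}_i) = 0$ whenever $y_i = 0$. A lower deviance means a better fit to a Poisson-distributed target.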
I see, thanks Christian. I've replicated the …
Taking

```python
forest_poi = RandomForestRegressor(
    criterion="poisson",
    n_estimators=10,
    bootstrap=False,
    min_samples_split=10,
    random_state=rng
)
forest_mse = RandomForestRegressor(
    criterion="mse",
    n_estimators=10,
    bootstrap=False,
    min_samples_split=10,
    random_state=rng
)
```

works for me. See details:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_poisson_deviance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn import datasets

rng = np.random.RandomState(42)
n_train, n_test, n_features = 500, 500, 10
X = datasets.make_low_rank_matrix(n_samples=n_train + n_test,
                                  n_features=n_features, random_state=rng)
# We create a log-linear Poisson model and downscale coef as it will get
# exponentiated.
coef = rng.uniform(low=-2, high=2, size=n_features) / np.max(X, axis=0)
y = rng.poisson(lam=np.exp(X @ coef))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=n_test,
                                                    random_state=rng)

forest_poi = RandomForestRegressor(
    criterion="poisson",
    n_estimators=10,
    bootstrap=False,
    random_state=rng
)
forest_mse = RandomForestRegressor(
    criterion="mse",
    n_estimators=10,
    bootstrap=False,
    random_state=rng
)

forest_poi.fit(X_train, y_train)
forest_mse.fit(X_train, y_train)
dummy = DummyRegressor(strategy="mean").fit(X_train, y_train)

for X, y, val in [(X_train, y_train, "train"), (X_test, y_test, "test")]:
    metric_poi = mean_poisson_deviance(y, forest_poi.predict(X))
    # squared_error might produce non-positive predictions => clip
    metric_mse = mean_poisson_deviance(y, np.clip(forest_mse.predict(X),
                                                  1e-15, None))
    metric_dummy = mean_poisson_deviance(y, dummy.predict(X))
    # As squared_error might correctly predict 0 in train set, its train
    # score can be better than Poisson. This is no longer the case for the
    # test set.
    if val == "test":
        assert metric_poi < metric_mse
        assert metric_poi < metric_dummy
```
No worries @lorentzenchr - thank you for your guidance throughout!
@bsun94 Which aspect do you mean? The loss functions aka splitting criteria, how single decision trees work, or how the random forest makes use of many single trees (with nice theoretical properties)?
@lorentzenchr On all three would be great haha
@bsun94 To advertise this nice library, why don't you have a look at the user guide:
I don't have a good reference for loss functions without knowing your background. In principle, it has components from probability theory (Poisson distribution), statistics (estimation theory, Poisson GLM), statistical learning theory and forecasting (and maybe more).
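For what it's worth, the simulation in the script above is exactly such a Poisson GLM with log link: the target is drawn as

$$
y \mid x \sim \mathrm{Poisson}(\mu(x)), \qquad \mu(x) = \exp(x^\top \beta),
$$

which is what `rng.poisson(lam=np.exp(X @ coef))` generates.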
@bsun94 Could you resolve merge conflicts by merging the main branch?
…hecks.py (scikit-learn#20138) Co-authored-by: Alihan Zihna <a.zihna@ckhgbdp.onmicrosoft.com>
I made some small fixes to the docstring. Otherwise LGTM! Thank you for working on this PR @bsun94!
ENH Adds Poisson criterion in RandomForestRegressor (scikit-learn#19836)
* TST enable test docstring params for feature extraction module (scikit-learn#20188)
* DOC fix a reference in sklearn.ensemble.GradientBoostingRegressor (scikit-learn#20198)
* FIX mcc zero divsion (scikit-learn#19977)
* TST Add TransformedTargetRegressor to test_meta_estimators_delegate_data_validation (scikit-learn#20175)
  Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
* TST enable n_feature_in_ test for feature_extraction module
* FIX Uses points instead of pixels in plot_tree (scikit-learn#20023)
* MNT n_features_in through the multiclass module (scikit-learn#20193)
* CI Removes python 3.6 builds from wheel building (scikit-learn#20184)
* FIX Fix typo in error message in `fetch_openml` (scikit-learn#20201)
* FIX Fix error when using Calibrated with Voting (scikit-learn#20087)
* FIX Fix RandomForestRegressor doesn't accept max_samples=1.0 (scikit-learn#20159)
  Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
  Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
* ENH Adds Poisson criterion in RandomForestRegressor (scikit-learn#19836)
  Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
  Co-authored-by: Alihan Zihna <alihanz@gmail.com>
  Co-authored-by: Alihan Zihna <a.zihna@ckhgbdp.onmicrosoft.com>
  Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>
  Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
  Co-authored-by: naozin555 <37050583+naozin555@users.noreply.github.com>
  Co-authored-by: Venkatachalam N <venky.yuvy@gmail.com>
  Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
* TST Replace assert_warns from decomposition/tests (scikit-learn#20214)
* TST check n_features_in_ in pipeline module (scikit-learn#20192)
  Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
  Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
  Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
* Allow `n_knots=None` if knots are explicitly specified in `SplineTransformer` (scikit-learn#20191)
  Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
* FIX make check_complex_data deterministic (scikit-learn#20221)
* TST test_fit_docstring_attributes include properties (scikit-learn#20190)
* FIX Uses the color max for colormap in ConfusionMatrixDisplay (scikit-learn#19784)
* STY Changing .format method to f-string formatting (scikit-learn#20215)
* CI Adds permissions for label action

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: tsuga <2888173+tsuga@users.noreply.github.com>
Co-authored-by: Conner Shen <connershen98@hotmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: mlondschien <61679398+mlondschien@users.noreply.github.com>
Co-authored-by: Clément Fauchereau <clement.fauchereau@ensta-bretagne.org>
Co-authored-by: murata-yu <67666318+murata-yu@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Brian Sun <52805678+bsun94@users.noreply.github.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Alihan Zihna <alihanz@gmail.com>
Co-authored-by: Alihan Zihna <a.zihna@ckhgbdp.onmicrosoft.com>
Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
Co-authored-by: naozin555 <37050583+naozin555@users.noreply.github.com>
Co-authored-by: Venkatachalam N <venky.yuvy@gmail.com>
Co-authored-by: Nanshan Li <nanshanli@dsaid.gov.sg>
Co-authored-by: solosilence <abhishekkr23rs@gmail.com>
thank you @lorentzenchr and @thomasjpfan for all your guidance in this!
@bsun94 Happy to help out. Maybe there is a next time :smirk:
I don't see why not!
Reference Issues/PRs
Helps address the issue outlined in #19304. Specifically:

* … `RandomForestRegressor`.
* … `RandomForestRegressor`.
* … `RandomForestRegressor`.

However, always happy to field suggestions and make adjustments as needed.
What does this implement/fix? Explain your changes.
Please see the checklist above.
Any other comments?
None so far.