ENH Adds Poisson criterion in RandomForestRegressor #19304 #19836
Conversation
Could you also add a whats_new entry in v1.0.rst? Something like "official support of the Poisson splitting criterion for RandomForestRegressor" (you can use it right now, but it's currently not documented and not tested => not officially supported).
… well as updated whats_new for scikit-learn#19304
nitpicks
Force-pushed f93ba89 to 6fd45b2
Force-pushed 6fd45b2 to fd242b7
sklearn/ensemble/_forest.py (outdated)

```rst
.. versionadded:: 0.18
   Mean Absolute Error (MAE) criterion.

.. versionadded:: 0.24
```
As info for a 2nd reviewer: "poisson" was made available when it was added to decision trees in v0.24, but it was not documented nor tested here for forests.
Because it was not documented or tested, I do not think we should consider the feature "released". I think it is better to mark this as 1.0.
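As context for that point, here is a minimal sketch (synthetic data; everything except the estimator and criterion name is hypothetical) showing that `criterion="poisson"` is indeed already accepted in 0.24, since forests forward the criterion to their underlying trees:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 4))
y = rng.poisson(lam=2.0, size=100)  # the Poisson criterion requires y >= 0

# Fits without error even though forests never documented this option.
reg = RandomForestRegressor(criterion="poisson", random_state=0).fit(X, y)
print(reg.predict(X[:5]))
```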
LGTM. @bsun94 Thank you.
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Thank you for all the pointers @lorentzenchr!
Thank you for the PR @bsun94!
We need to update `test_regression` to run a random forest with the poisson criterion. The target would need to be adjusted so that it is greater than zero.
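One way to do that adjustment, as a rough sketch (the array below is a hypothetical stand-in for the test's `y_reg` target):

```python
import numpy as np

y_reg = np.array([-2.0, 0.0, 1.5, 3.0])  # contains non-positive values
# Zero out non-positive entries, then shift by 1 so every value is > 0;
# this mirrors the mask-and-shift approach in the reworked test below.
y_pos = np.where(y_reg > 0, y_reg, 0.0) + 1
assert np.all(y_pos > 0)
```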
Do you mean similar to: …
What about …
I am thinking of doing simple integration tests, such as:

```python
assert np.all(reg.predict(X) > 0)
```

At a glance, due to the bagging nature of forests, I do not think all the conditions that held for single trees also hold for forests.
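A runnable version of that one-liner, as a sketch with synthetic data (the data-generating choices are mine, not from the PR):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 5))
y = rng.poisson(lam=np.exp(X[:, 0]))  # non-negative count target

reg = RandomForestRegressor(criterion="poisson", random_state=0).fit(X, y)
# Poisson splits reject children whose target sum is zero, so each tree's
# leaf means are strictly positive, and so is their average.
assert np.all(reg.predict(X) > 0)
```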
apologies, I'm not quite following the convo - for now, I've tried to rework:

```python
def check_regression_criterion(name, criterion):
    # Check consistency on regression dataset.
    ForestRegressor = FOREST_REGRESSORS[name]
    if criterion == 'poisson':
        mask = y_reg > 0
        y_regr = y_reg * mask + 1
    else:
        y_regr = y_reg

    reg = ForestRegressor(n_estimators=5, criterion=criterion,
                          random_state=1)
    reg.fit(X_reg, y_regr)
    score = reg.score(X_reg, y_regr)
    assert score > 0.93, ("Failed with max_features=None, criterion %s "
                          "and score = %f" % (criterion, score))

    reg = ForestRegressor(n_estimators=5, criterion=criterion,
                          max_features=6, random_state=1)
    reg.fit(X_reg, y_regr)
    score = reg.score(X_reg, y_regr)
    assert score > 0.92, ("Failed with max_features=6, criterion %s "
                          "and score = %f" % (criterion, score))
```

This might not be what we're looking for, but thought I'd put something out to try and improve my understanding.
@bsun94 The point is that we should test, on some dataset, that the poisson forest regression actually works. One possibility is to test that a poisson forest has better predictive performance on a poisson distributed target compared to a squared error forest (see …). Ideally, we have both tests.
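For reference, the comparison metric used in the script further down (via `mean_poisson_deviance`) is the mean Poisson deviance, which for strictly positive predictions $\hat{y}_i$ reads

$$
D(y, \hat{y}) = \frac{2}{n} \sum_{i=1}^{n} \left( y_i \log\frac{y_i}{\hat{y}_i} - y_i + \hat{y}_i \right),
$$

with the convention $y_i \log(y_i / \hat{y}_i) = 0$ whenever $y_i = 0$. A lower deviance means a better fit to a Poisson-distributed target.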
I see, thanks Christian. I've replicated the …
Taking

```python
forest_poi = RandomForestRegressor(
    criterion="poisson",
    n_estimators=10,
    bootstrap=False,
    min_samples_split=10,
    random_state=rng
)
forest_mse = RandomForestRegressor(
    criterion="mse",
    n_estimators=10,
    bootstrap=False,
    min_samples_split=10,
    random_state=rng
)
```

works for me. See details:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_poisson_deviance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn import datasets

rng = np.random.RandomState(42)
n_train, n_test, n_features = 500, 500, 10
X = datasets.make_low_rank_matrix(n_samples=n_train + n_test,
                                  n_features=n_features, random_state=rng)
# We create a log-linear Poisson model and downscale coef as it will get
# exponentiated.
coef = rng.uniform(low=-2, high=2, size=n_features) / np.max(X, axis=0)
y = rng.poisson(lam=np.exp(X @ coef))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=n_test,
                                                    random_state=rng)

forest_poi = RandomForestRegressor(
    criterion="poisson",
    n_estimators=10,
    bootstrap=False,
    random_state=rng
)
forest_mse = RandomForestRegressor(
    criterion="mse",
    n_estimators=10,
    bootstrap=False,
    random_state=rng
)

forest_poi.fit(X_train, y_train)
forest_mse.fit(X_train, y_train)
dummy = DummyRegressor(strategy="mean").fit(X_train, y_train)

for X, y, val in [(X_train, y_train, "train"), (X_test, y_test, "test")]:
    metric_poi = mean_poisson_deviance(y, forest_poi.predict(X))
    # squared_error might produce non-positive predictions => clip
    metric_mse = mean_poisson_deviance(y, np.clip(forest_mse.predict(X),
                                                  1e-15, None))
    metric_dummy = mean_poisson_deviance(y, dummy.predict(X))
    # As squared_error might correctly predict 0 in train set, its train
    # score can be better than Poisson. This is no longer the case for the
    # test set.
    if val == "test":
        assert metric_poi < metric_mse
        assert metric_poi < metric_dummy
```
No worries @lorentzenchr - thank you for your guidance throughout!
@bsun94 Which aspect do you mean? The loss functions aka splitting criteria, how single decision trees work, or how the random forest makes use of many single trees (with nice theoretical properties)?
@lorentzenchr On all three would be great haha
@bsun94 To advertise this nice library, why don't you have a look at the user guide:
I don't have a good reference for loss functions without knowing your background. In principle, it has components from probability theory (Poisson distribution), statistics (estimation theory, Poisson GLM), statistical learning theory and forecasting (and maybe more).
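For what it's worth, the simulation in the script above is exactly such a Poisson GLM with log link: the target is drawn as

$$
y \mid x \sim \mathrm{Poisson}(\mu(x)), \qquad \mu(x) = \exp(x^\top \beta),
$$

which is what `rng.poisson(lam=np.exp(X @ coef))` generates.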
@bsun94 Could you resolve merge conflicts by merging the main branch?
…hecks.py (scikit-learn#20138) Co-authored-by: Alihan Zihna <a.zihna@ckhgbdp.onmicrosoft.com>
I made some small fixes to the docstring. Otherwise LGTM! Thank you for working on this PR @bsun94!
ENH Adds Poisson criterion in RandomForestRegressor (scikit-learn#19836)
* TST enable test docstring params for feature extraction module (scikit-learn#20188)
* DOC fix a reference in sklearn.ensemble.GradientBoostingRegressor (scikit-learn#20198)
* FIX mcc zero divsion (scikit-learn#19977)
* TST Add TransformedTargetRegressor to test_meta_estimators_delegate_data_validation (scikit-learn#20175)
  Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
* TST enable n_feature_in_ test for feature_extraction module
* FIX Uses points instead of pixels in plot_tree (scikit-learn#20023)
* MNT n_features_in through the multiclass module (scikit-learn#20193)
* CI Removes python 3.6 builds from wheel building (scikit-learn#20184)
* FIX Fix typo in error message in `fetch_openml` (scikit-learn#20201)
* FIX Fix error when using Calibrated with Voting (scikit-learn#20087)
* FIX Fix RandomForestRegressor doesn't accept max_samples=1.0 (scikit-learn#20159)
  Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
  Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
* ENH Adds Poisson criterion in RandomForestRegressor (scikit-learn#19836)
  Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
  Co-authored-by: Alihan Zihna <alihanz@gmail.com>
  Co-authored-by: Alihan Zihna <a.zihna@ckhgbdp.onmicrosoft.com>
  Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>
  Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
  Co-authored-by: naozin555 <37050583+naozin555@users.noreply.github.com>
  Co-authored-by: Venkatachalam N <venky.yuvy@gmail.com>
  Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
* TST Replace assert_warns from decomposition/tests (scikit-learn#20214)
* TST check n_features_in_ in pipeline module (scikit-learn#20192)
  Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
  Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
  Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
* Allow `n_knots=None` if knots are explicitly specified in `SplineTransformer` (scikit-learn#20191)
  Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
* FIX make check_complex_data deterministic (scikit-learn#20221)
* TST test_fit_docstring_attributes include properties (scikit-learn#20190)
* FIX Uses the color max for colormap in ConfusionMatrixDisplay (scikit-learn#19784)
* STY Changing .format method to f-string formatting (scikit-learn#20215)
* CI Adds permissions for label action

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: tsuga <2888173+tsuga@users.noreply.github.com>
Co-authored-by: Conner Shen <connershen98@hotmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: mlondschien <61679398+mlondschien@users.noreply.github.com>
Co-authored-by: Clément Fauchereau <clement.fauchereau@ensta-bretagne.org>
Co-authored-by: murata-yu <67666318+murata-yu@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Brian Sun <52805678+bsun94@users.noreply.github.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Alihan Zihna <alihanz@gmail.com>
Co-authored-by: Alihan Zihna <a.zihna@ckhgbdp.onmicrosoft.com>
Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
Co-authored-by: naozin555 <37050583+naozin555@users.noreply.github.com>
Co-authored-by: Venkatachalam N <venky.yuvy@gmail.com>
Co-authored-by: Nanshan Li <nanshanli@dsaid.gov.sg>
Co-authored-by: solosilence <abhishekkr23rs@gmail.com>
thank you @lorentzenchr and @thomasjpfan for all your guidance in this!
@bsun94 Happy to help out. Maybe there is a next time :smirk:
I don't see why not!
Reference Issues/PRs
Helps address the issue outlined in #19304. Specifically:

* … `RandomForestRegressor`.
* … `RandomForestRegressor`.
* … `RandomForestRegressor`.

However, always happy to field suggestions and make adjustments as needed.
What does this implement/fix? Explain your changes.
Please see the checklist above.
Any other comments?
None so far.