ENH Adds Poisson criterion in RandomForestRegressor #19304 #19836

Merged
merged 30 commits into scikit-learn:main on Jun 6, 2021

Conversation

@bsun94 (Contributor) commented Apr 7, 2021

Reference Issues/PRs

Helps address the issue outlined in #19304. Specifically:

  • Add the poisson splitting criterion to the docstring of RandomForestRegressor.
  • Add input validation (non-negative y) to RandomForestRegressor (see the sketch at the end of this description).
  • Expand the tests for RandomForestRegressor.

However, I'm always happy to field suggestions and make adjustments as needed.

What does this implement/fix? Explain your changes.

Please see above checklist.

Any other comments?

None so far.
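As a quick illustration of the input-validation item in the checklist above (my own sketch of the expected behaviour, not the actual diff): fitting with criterion="poisson" on a target that contains negative values should raise a ValueError.

import numpy as np
import pytest
from sklearn.ensemble import RandomForestRegressor

X = np.arange(8, dtype=float).reshape(-1, 1)
y = np.array([1.0, 2.0, 0.0, 3.0, -1.0, 2.0, 1.0, 4.0])  # one negative value

reg = RandomForestRegressor(criterion="poisson", n_estimators=2, random_state=0)
# a negative target is not allowed for the Poisson criterion
with pytest.raises(ValueError):
    reg.fit(X, y)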

@lorentzenchr (Member) left a comment

Could you also add a whats_new entry in v1.0.rst? Something like "official support of the Poisson splitting criterion for RandomForestRegressor" (you can use it right now, but it's currently not documented and not tested => not officially supported).

@lorentzenchr (Member) left a comment

nitpicks

@bsun94 force-pushed the 19304_PoissonRandomForest branch from f93ba89 to 6fd45b2 on April 11, 2021 22:35
@bsun94 force-pushed the 19304_PoissonRandomForest branch from 6fd45b2 to fd242b7 on April 11, 2021 22:53

.. versionadded:: 0.18
Mean Absolute Error (MAE) criterion.

.. versionadded:: 0.24
As info for a 2nd reviewer: "poisson" was made available when it was added to decision trees in v0.24, but it was neither documented nor tested here for forests.

Because it was not documented or tested, I do not think we should consider the feature "released". I think it is better to mark this as 1.0.

@lorentzenchr (Member) left a comment

LGTM. @bsun94 Thank you.

@bsun94 (Contributor, Author) commented Apr 13, 2021

Thank you for all the pointers @lorentzenchr !

@thomasjpfan (Member) left a comment

Thank you for the PR @bsun94 !

We need to update test_regression to run a random forest with the poisson criterion. The target would need to be adjusted so that it is greater than zero.


@lorentzenchr (Member)

Quoting the above: "We need to update test_regression to run a random forest with the poisson criterion. The target would need to be adjusted so that it is greater than zero."

Do you mean similar to:

What about

@thomasjpfan (Member)

I am thinking of doing simple integration tests, such as:

assert np.all(reg.predict(X) > 0)

At a glance, due to the bagging nature of forests, I do not think all the conditions that held for single trees also hold for forests.

test_poisson_vs_mse could work, but I suspect it will need some fine-tuning.
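For illustration, a minimal sketch of such an integration test might look like this (the data and settings below are made up for the example, not the exact test added in this PR):

import numpy as np
from sklearn.datasets import make_low_rank_matrix
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = make_low_rank_matrix(n_samples=200, n_features=5, random_state=rng)
y = rng.poisson(lam=np.exp(X @ rng.uniform(low=-1, high=1, size=5)))

reg = RandomForestRegressor(criterion="poisson", n_estimators=5, random_state=0)
reg.fit(X, y)
# with the Poisson criterion, predictions should be strictly positive
assert np.all(reg.predict(X) > 0)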

@bsun94 (Contributor, Author) commented Apr 15, 2021

Apologies, I'm not quite following the conversation. For now, I've tried to rework check_regression_criterion so that it can run with the Poisson criterion:

def check_regression_criterion(name, criterion):
    # Check consistency on regression dataset.
    ForestRegressor = FOREST_REGRESSORS[name]

    if criterion == 'poisson':
        # make the target strictly positive for the Poisson criterion:
        # non-positive values become 1, positive values are shifted up by 1
        mask = y_reg > 0
        y_regr = y_reg * mask + 1
    else:
        y_regr = y_reg

    reg = ForestRegressor(n_estimators=5, criterion=criterion,
                          random_state=1)
    reg.fit(X_reg, y_regr)
    score = reg.score(X_reg, y_regr)
    assert score > 0.93, ("Failed with max_features=None, criterion %s "
                          "and score = %f" % (criterion, score))

    reg = ForestRegressor(n_estimators=5, criterion=criterion,
                          max_features=6, random_state=1)
    reg.fit(X_reg, y_regr)
    score = reg.score(X_reg, y_regr)
    assert score > 0.92, ("Failed with max_features=6, criterion %s "
                          "and score = %f" % (criterion, score))

This might not be what we're looking for, but I thought I'd put something out there to try to improve my understanding.

@lorentzenchr (Member)

@bsun94 The point is that we should test, on some dataset, that the Poisson forest regression actually works. One possibility is to test that a Poisson forest has better predictive performance on a Poisson-distributed target than a squared-error forest; see test_poisson_vs_mse in the tree module. (Note that this test required some fine-tuning for single trees.)
Another possibility is to check the balance property, i.e. sum(y_true) == sum(y_pred) on the training set. This should also hold for squared error.

Ideally, we have both tests.
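A rough sketch of the balance-property check might look like the following (the dataset and settings are illustrative, not the exact test that ended up in the PR). With bootstrap=False every tree sees the full training set and each leaf predicts the mean of its samples, so the sums should match on the training data:

import numpy as np
from sklearn.datasets import make_low_rank_matrix
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(42)
X = make_low_rank_matrix(n_samples=500, n_features=10, random_state=rng)
coef = rng.uniform(low=-2, high=2, size=10)
y = rng.poisson(lam=np.exp(X @ coef))

# "mse" was the name of the squared-error criterion at the time of this PR
# (it was later renamed to "squared_error")
for criterion in ("poisson", "mse"):
    reg = RandomForestRegressor(
        criterion=criterion,
        n_estimators=10,
        bootstrap=False,
        min_samples_split=10,
        random_state=rng,
    )
    reg.fit(X, y)
    # balance property: sum(y_pred) ~= sum(y_true) on the training set
    assert np.isclose(np.sum(reg.predict(X)), np.sum(y), rtol=1e-2)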

@bsun94 (Contributor, Author) commented Apr 16, 2021

I see, thanks Christian. I've replicated the test_poisson_vs_mse test in the test_forest module, adapted to use RandomForestRegressor. I've tried playing around with n_features and different settings in rng.uniform and rng.poisson, but no matter what I try, the test fails at metric_poi < metric_mse, with MSE having the lower error. Are there other ways for me to fine-tune this, or am I not going about it the right way?

@lorentzenchr (Member)

Taking test_poisson_vs_mse and replacing the tree models with

forest_poi = RandomForestRegressor(
    criterion="poisson",
    n_estimators=10,
    bootstrap=False,
    min_samples_split=10,
    random_state=rng
)
forest_mse = RandomForestRegressor(
    criterion="mse",
    n_estimators=10,
    bootstrap=False,
    min_samples_split=10,
    random_state=rng
)

works for me. See the details below.

import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_poisson_deviance
from sklearn.model_selection import train_test_split
from sklearn import datasets


rng = np.random.RandomState(42)
n_train, n_test, n_features = 500, 500, 10
X = datasets.make_low_rank_matrix(n_samples=n_train + n_test,
                                  n_features=n_features, random_state=rng)
# We create a log-linear Poisson model and downscale coef as it will get
# exponentiated.
coef = rng.uniform(low=-2, high=2, size=n_features) / np.max(X, axis=0)
y = rng.poisson(lam=np.exp(X @ coef))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=n_test,
                                                    random_state=rng)

forest_poi = RandomForestRegressor(
    criterion="poisson",
    n_estimators=10,
    bootstrap=False,
    random_state=rng
)
forest_mse = RandomForestRegressor(
    criterion="mse",
    n_estimators=10,
    bootstrap=False,
    random_state=rng
)
forest_poi.fit(X_train, y_train)
forest_mse.fit(X_train, y_train)
dummy = DummyRegressor(strategy="mean").fit(X_train, y_train)

for X, y, val in [(X_train, y_train, "train"), (X_test, y_test, "test")]:
    metric_poi = mean_poisson_deviance(y, forest_poi.predict(X))
    # squared_error might produce non-positive predictions => clip
    metric_mse = mean_poisson_deviance(y, np.clip(forest_mse.predict(X),
                                                  1e-15, None))
    metric_dummy = mean_poisson_deviance(y, dummy.predict(X))
    # As squared_error might correctly predict 0 in train set, its train
    # score can be better than Poisson. This is no longer the case for the
    # test set.
    if val == "test":
        assert metric_poi < metric_mse
    assert metric_poi < metric_dummy

@bsun94 (Contributor, Author) commented May 27, 2021

No worries @lorentzenchr - thank you for your guidance throughout!
To be honest, I don't think I understand all the maths here, haha (why does this work?). Would there be any online resources you could point me to so I can read up?

@lorentzenchr (Member)

@bsun94 Which aspect do you mean? The loss functions, a.k.a. splitting criteria; how single decision trees work; or how the random forest makes use of many single trees (with nice theoretical properties)?

@bsun94 (Contributor, Author) commented May 31, 2021

@bsun94 Which aspect do you mean? The loss functions, a.k.a. splitting criteria; how single decision trees work; or how the random forest makes use of many single trees (with nice theoretical properties)?

@lorentzenchr On all three would be great haha

@lorentzenchr (Member)

@bsun94 To advertise this nice library, why don't you have a look at the user guide:

I don't have a good reference for loss functions without knowing your background. In principle, the topic draws on probability theory (the Poisson distribution), statistics (estimation theory, Poisson GLMs), statistical learning theory, and forecasting (and maybe more).
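On the loss-function part in particular: the "poisson" criterion picks splits by the reduction in Poisson deviance, and the same quantity is exposed as a metric via mean_poisson_deviance. A small sketch of the formula behind it (the example values are made up); the y * log(y / y_pred) term is taken as 0 where y == 0:

import numpy as np
from sklearn.metrics import mean_poisson_deviance

y = np.array([0.0, 1.0, 2.0, 4.0])
y_pred = np.array([0.5, 1.5, 2.0, 3.0])

# mean Poisson deviance: 2/n * sum(y * log(y / y_pred) - y + y_pred)
with np.errstate(divide="ignore", invalid="ignore"):
    log_term = np.where(y > 0, y * np.log(y / y_pred), 0.0)
deviance = 2 * np.mean(log_term - y + y_pred)

assert np.isclose(deviance, mean_poisson_deviance(y, y_pred))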

@lorentzenchr (Member)

@bsun94 Could you resolve the merge conflicts by merging the main branch?

@thomasjpfan changed the title from "ENH for Poisson criterion in RandomForestRegressor #19304" to "ENH Adds Poisson criterion in RandomForestRegressor #19304" on Jun 5, 2021
@thomasjpfan (Member) left a comment

I made some small fixes to the docstring. Otherwise LGTM! Thank you for working on this PR @bsun94 !

@thomasjpfan merged commit 36915ae into scikit-learn:main on Jun 6, 2021
sakibguy added a commit to sakibguy/scikit-learn that referenced this pull request Jun 6, 2021
ENH Adds Poisson criterion in RandomForestRegressor (scikit-learn#19836)
@lorentzenchr linked an issue on Jun 6, 2021 that may be closed by this pull request
thomasjpfan added a commit to thomasjpfan/scikit-learn that referenced this pull request Jun 8, 2021
@bsun94 (Contributor, Author) commented Jun 9, 2021

Thank you @lorentzenchr and @thomasjpfan for all your guidance on this!

@lorentzenchr (Member)

@bsun94 Happy to help out. Maybe there is a next time :smirk:

@bsun94 (Contributor, Author) commented Jun 10, 2021

I don't see why not!

Successfully merging this pull request may close these issues: Poisson criterion in RandomForestRegressor.