ENH Adds support for missing values in Random Forest #26391

Merged (19 commits, Jul 27, 2023)

Conversation

thomasjpfan (Member)

Reference Issues/PRs

Follow up to #23595

What does this implement/fix? Explain your changes.

This PR enables missing-value support for random forests. I ran the same benchmarks from #23595 with Random Forest. The benchmarks confirm that there are no regressions compared to main when there are no missing values:

(benchmark results image)

Any other comments?

Implementation-wise, the forest constructs a boolean array of shape (n_features,) and passes it along to each tree in _fit. This preserves performance relative to main, because the missing-value check is performed only once rather than once per tree.
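As a rough illustration of that single up-front check (this is a sketch, not the PR's actual code; compute_missing_mask is a hypothetical name):

```python
import numpy as np

def compute_missing_mask(X):
    # Hypothetical helper: one pass over the data recording, per feature,
    # whether any value is missing (NaN).
    return np.isnan(X).any(axis=0)

X = np.array([[1.0, np.nan, 3.0],
              [2.0, 5.0, 6.0]])

# Computed once by the forest and shared with every tree's fit,
# instead of each tree re-scanning X for NaNs.
mask = compute_missing_mask(X)  # boolean, shape (n_features,)
```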

@thomasjpfan thomasjpfan added this to the 1.3 milestone May 17, 2023
@@ -632,6 +671,15 @@ def feature_importances_(self):
all_importances = np.mean(all_importances, axis=0, dtype=np.float64)
return all_importances / np.sum(all_importances)

def _more_tags(self):
# Ignore errors because the parameters are not validated
A reviewer (Member) commented:

Can you help me understand this a bit more? I was thinking that _safe_tags is the one that would raise an exception here, but as far as I can tell it would raise a ValueError. So I'm confused about where the exception comes from (set_params?) and why that means it is Ok to return an empty {} from _more_tags.

thomasjpfan (Member, Author) replied:

Much of the error catching comes from the estimator not being validated when _safe_tags(estimator) is called. Historically, this results in estimator checks failing because _safe_tags is called before fit.

> where the exception comes from

Yes, it was set_params or clone. (self.estimator is not validated at this point)

> return an empty {} from _more_tags.

In that case, an error occurred and _more_tags cannot determine the tags. Returning an empty dictionary means "the tags cannot be determined, so use the defaults".

In any case, I updated the PR to be more explicit.

@betatim (Member) commented May 24, 2023:

Trying to wrap my head around the bit of code I commented on; otherwise I think this looks good already.

@jeremiedbb (Member) left a comment:

Looks good. Maybe you could add a section in the docs about missing-value support, like the one the HistGradientBoosting* estimators have.

Comment on lines +623 to +626
if self.estimators_[0]._support_missing_values(X):
force_all_finite = "allow-nan"
else:
force_all_finite = True
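What this dispatch buys can be mimicked without scikit-learn (the real validation lives in check_array; this toy version only imitates the force_all_finite behavior):

```python
import numpy as np

def validate(X, force_all_finite=True):
    # Toy mimic of scikit-learn's input validation: True rejects any
    # non-finite value outright; "allow-nan" lets NaN through so the
    # estimator can treat it as a missing-value marker.
    X = np.asarray(X, dtype=float)
    if force_all_finite is True and not np.isfinite(X).all():
        raise ValueError("Input contains NaN or infinity")
    return X

X = [[1.0, float("nan")], [2.0, 3.0]]
validate(X, force_all_finite="allow-nan")  # accepted: NaN means missing
try:
    validate(X, force_all_finite=True)
except ValueError:
    pass  # rejected: NaN treated as invalid input
```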
A reviewer (Member) commented:

I think we need to have a discussion about whether or not NaNs should be treated automatically as missing values. The issue I see here is that if this estimator sits in a pipeline and a bugged transformer upstream outputs NaNs, they will be silently treated as missing values by the random forest.

Maybe missing value support should be enabled through a parameter, and/or through the config context. Maybe we should add an after-fit check in our transformers to ensure they did not create nans.

Anyway, this is consistent with the current behavior of HGBT so I'm fine merging it as is for now.
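The after-fit check floated above could look something like this sketch (fit_transform_checked is a hypothetical helper, not an existing scikit-learn API; BuggyScaler is a contrived example):

```python
import numpy as np

def fit_transform_checked(transformer, X):
    # Hypothetical guard: fail loudly if a transformer produces NaNs,
    # instead of letting a NaN-tolerant forest downstream silently
    # treat them as missing values.
    Xt = transformer.fit_transform(X)
    if np.isnan(Xt).any():
        raise ValueError(f"{type(transformer).__name__} produced NaN values")
    return Xt

class BuggyScaler:
    def fit_transform(self, X):
        # Bug: a constant feature has zero range, and 0/0 yields NaN.
        X = np.asarray(X, dtype=float)
        span = X.max(axis=0) - X.min(axis=0)
        with np.errstate(divide="ignore", invalid="ignore"):
            return (X - X.min(axis=0)) / span

X = np.array([[1.0, 5.0], [1.0, 7.0]])  # first feature is constant
try:
    fit_transform_checked(BuggyScaler(), X)
except ValueError as e:
    print(e)  # BuggyScaler produced NaN values
```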

A reviewer (Member) replied:

👍 to your conclusion.

To continue your line of thought: how often do we think someone actually wants NaNs to mean missing values, versus NaNs appearing because of a bug? Based on which we think is more likely, we should either make the handling automatic (with an opt-in warning/exception) or make the warning/exception the default, with an opt-out.

@jjerphan (Member) left a comment:

Thank you, @thomasjpfan.

I just have a few suggestions.

Comment on lines 1854 to 1855
assert forest.score(X_train, y_train) >= 0.80
assert forest.score(X_test, y_test) >= 0.75
A reviewer (Member) commented:

Should we also compare it against the score for a forest trained on non-missing data?

Comment on lines +674 to +678
def _more_tags(self):
# Only the criterion is required to determine if the tree supports
# missing values
estimator = type(self.estimator)(criterion=self.criterion)
return {"allow_nan": _safe_tags(estimator, key="allow_nan")}
A reviewer (Member) commented:

Should we test the support or non-support of the criteria?

github-actions bot commented Jun 23, 2023:

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: d0eb5a9.

@jeremiedbb jeremiedbb modified the milestones: 1.3, 1.4 Jul 6, 2023
@jjerphan (Member) left a comment:

LGTM. Thank you, @thomasjpfan.

To me this PR is mergeable, since the failures on CI are unrelated to the proposed changes.

Comment on lines 363 to 368
- |MajorFeature| :class:`ensemble.RandomForestClassifier` and
:class:`ensemble.RandomForestRegressor` support missing values when
the criterion is `gini`, `entropy`, or `log_loss`,
for classification or `squared_error`, `friedman_mse`, or `poisson`
for regression. :pr:`26391` by `Thomas Fan`_.

A reviewer (Member) commented:

This entry must be moved to doc/whats_new/v1.4.rst.

@jjerphan jjerphan enabled auto-merge (squash) July 27, 2023 14:36
@jjerphan jjerphan merged commit 4094851 into scikit-learn:main Jul 27, 2023
@jjerphan (Member):

I was not expecting this PR to be merged with a single approval. 🫤

Should we require two approvals for auto-merging PRs?

@thomasjpfan (Member, Author):

As a default, I think two approvals is too much, because some simple PRs, such as documentation or CI fixes, should only require one review. I'd say only click the green auto-merge button if you were going to merge anyway but want to wait for CI.

@betatim Do you have any issues with this PR? I prefer not to revert it and open a new PR just because there was not a second approval.

punndcoder28 pushed a commit to punndcoder28/scikit-learn that referenced this pull request Jul 29, 2023

Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@lorentzenchr (Member) left a comment:

Here comes a post-merge approval, with nitpicks that do not necessarily need to be addressed.

],
)
def test_missing_values_is_resilient(make_data, Forest):
"""Check that forest can deal with missing values and have decent performance."""
A reviewer (Member) suggested:

Suggested change
"""Check that forest can deal with missing values and have decent performance."""
"""Check that forest can deal with missing values and has decent performance."""


# Create dataset with missing values
X_missing = X.copy()
X_missing[rng.choice([False, True], size=X.shape, p=[0.95, 0.05])] = np.nan
A reviewer (Member) commented:

Add an assertion that X_missing has indeed np.nan in it.
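Concretely, the suggested assertion could look like this (a sketch mirroring the test's setup; the seed and array sizes are assumptions):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(300, 4)

# Create dataset with missing values, as in the test above: each entry
# is independently replaced by NaN with probability 0.05.
X_missing = X.copy()
X_missing[rng.choice([False, True], size=X.shape, p=[0.95, 0.05])] = np.nan

# Suggested guard: make sure the masking actually introduced NaNs,
# so the test cannot silently degenerate into the no-missing case.
assert np.isnan(X_missing).any()
```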

REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023

Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@mglowacki100 commented Dec 1, 2023:

@thomasjpfan Is support for missing values in ExtraTrees also included in this PR?
Edit: I've checked, and missing values are only handled for Random Forest in this PR.

6 participants