ENH Adds support for missing values in Random Forest #26391
Conversation
sklearn/ensemble/_forest.py
Outdated
@@ -632,6 +671,15 @@ def feature_importances_(self):
        all_importances = np.mean(all_importances, axis=0, dtype=np.float64)
        return all_importances / np.sum(all_importances)

    def _more_tags(self):
        # Ignore errors because the parameters are not validated
Can you help me understand this a bit more? I was thinking that `_safe_tags` is the one that would raise an exception here, but as far as I can tell it would raise a `ValueError`. So I'm confused about where the exception comes from (`set_params`?) and why that means it is OK to return an empty `{}` from `_more_tags`.
Much of the error catching comes from the estimator not being validated when `_safe_tags(estimator)` is called. Historically, this results in estimator checks failing because `_safe_tags` is called before `fit`.

> where the exception comes from

Yes, it was `set_params` or `clone` (`self.estimator` is not validated at this point).

> return an empty `{}` from `_more_tags`

In that case, an error happened and `_more_tags` cannot determine the tags. Returning an empty dictionary means "the tags cannot be determined, so use the default tags".

In any case, I updated the PR to be more explicit.
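To make the fallback discussed above concrete, here is a self-contained sketch (not the actual scikit-learn code; `safe_tags`, `ForestSketch`, and `UnvalidatedTree` are stand-ins invented for illustration): resolving the sub-estimator's tags can raise before the estimator is validated, and an empty dict from `_more_tags()` means "fall back to the defaults".

```python
DEFAULT_TAGS = {"allow_nan": False}  # stand-in for the library's default tags


def safe_tags(estimator, key):
    # Stand-in for sklearn.utils._safe_tags: delegate to _more_tags().
    return estimator._more_tags()[key]


class UnvalidatedTree:
    # Mimics a sub-estimator whose parameters have not been validated
    # yet, so tag resolution raises (via set_params/clone in the PR).
    def _more_tags(self):
        raise ValueError("parameters are not validated")


class ForestSketch:
    def __init__(self, estimator):
        self.estimator = estimator

    def _more_tags(self):
        try:
            return {"allow_nan": safe_tags(self.estimator, "allow_nan")}
        except Exception:
            # Tags cannot be determined here; an empty dict means
            # "use the default tags".
            return {}


def resolved_tags(estimator):
    tags = dict(DEFAULT_TAGS)
    tags.update(estimator._more_tags())
    return tags


print(resolved_tags(ForestSketch(UnvalidatedTree())))  # {'allow_nan': False}
```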
Trying to wrap my head around the bit of code I commented on, otherwise I think this looks good already.
Co-authored-by: Tim Head <betatim@gmail.com>
Looks good. Maybe you could add a section in the docs about missing-value support, like the one the `HistGradientBoosting*` estimators have.
        if self.estimators_[0]._support_missing_values(X):
            force_all_finite = "allow-nan"
        else:
            force_all_finite = True
I think we need to have a discussion about whether or not nans should be treated automatically as missing values. The issue I see here is that if this is in a pipeline and you have a bugged transformer before that which outputs nans, it will be silently ignored by the random forest.
Maybe missing value support should be enabled through a parameter, and/or through the config context. Maybe we should add an after-fit check in our transformers to ensure they did not create nans.
Anyway, this is consistent with the current behavior of HGBT so I'm fine merging it as is for now.
👍 to your conclusion.
To continue your line of thought: how often do we think it happens that someone wants NaNs to mean missing values vs NaNs appearing because of a bug? Based on what we think is more likely, we should either make the handling automatic (with an opt-in to get a warning/exception) or make the warning/exception the default that needs opting out of.
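A toy illustration of the concern raised here (not scikit-learn code; `buggy_scale` is a made-up transformer): once a downstream estimator treats NaN as "missing", NaNs produced by an upstream bug are indistinguishable from genuinely missing data.

```python
import numpy as np


def buggy_scale(X):
    # Bug: a constant column has std == 0, so 0/0 silently yields NaN.
    return (X - X.mean(axis=0)) / X.std(axis=0)


X = np.ones((4, 2))            # every column is constant
X_out = buggy_scale(X)
print(np.isnan(X_out).all())   # True -- NaNs from a bug, not from data
```

Nothing in this pipeline would error out; a forest that silently treats these NaNs as missing values would simply fit on garbage.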
Thank you, @thomasjpfan.
I just have a few suggestions.
        assert forest.score(X_train, y_train) >= 0.80
        assert forest.score(X_test, y_test) >= 0.75
Should we also compare it against the score for a forest trained on non-missing data?
    def _more_tags(self):
        # Only the criterion is required to determine if the tree supports
        # missing values
        estimator = type(self.estimator)(criterion=self.criterion)
        return {"allow_nan": _safe_tags(estimator, key="allow_nan")}
Should we test the support or non-support of the criteria?
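A hypothetical sketch of such a test (the classes here are invented stand-ins, not scikit-learn code; the criterion lists come from this PR's changelog entry): the forest's `allow_nan` tag should track the criterion, supporting NaN only for criteria whose trees handle missing values.

```python
# Criteria that support missing values, per this PR's changelog entry.
NAN_CRITERIA = {
    "gini", "entropy", "log_loss",               # classification
    "squared_error", "friedman_mse", "poisson",  # regression
}


class TreeSketch:
    def __init__(self, criterion="gini"):
        self.criterion = criterion

    def _more_tags(self):
        return {"allow_nan": self.criterion in NAN_CRITERIA}


class ForestSketch:
    def __init__(self, criterion="gini"):
        self.criterion = criterion
        self.estimator = TreeSketch()

    def _more_tags(self):
        # Only the criterion is needed to decide missing-value support.
        estimator = type(self.estimator)(criterion=self.criterion)
        return estimator._more_tags()


for criterion in sorted(NAN_CRITERIA):
    assert ForestSketch(criterion)._more_tags()["allow_nan"]
assert not ForestSketch("absolute_error")._more_tags()["allow_nan"]
print("ok")
```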
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
LGTM. Thank you, @thomasjpfan.
To me this PR is mergeable since the failures on the CI are unrelated to the proposed changes.
doc/whats_new/v1.3.rst
Outdated
- |MajorFeature| :class:`ensemble.RandomForestClassifier` and
  :class:`ensemble.RandomForestRegressor` support missing values when
  the criterion is `gini`, `entropy`, or `log_loss`,
  for classification or `squared_error`, `friedman_mse`, or `poisson`
  for regression. :pr:`26391` by `Thomas Fan`_.
This entry must be moved to doc/whats_new/v1.4.rst.
I was not expecting this PR to be merged with a single approval. 🫤 Should we require two approvals for auto-merging PRs?
As a default, I think two approvals is too many, because some simple PRs, such as documentation or CI fixes, should only require one review. I'd say only click the green auto-merge button if you were going to merge anyway but want to wait for the CI. @betatim Do you have any issues with this PR? I prefer not to revert this PR and open a new one just because there was not a second approval.
Here comes a post-merge approval with nitpicks that do not necessarily need to be addressed.
    ],
)
def test_missing_values_is_resilient(make_data, Forest):
    """Check that forest can deal with missing values and have decent performance."""
"""Check that forest can deal with missing values and have decent performance.""" | |
"""Check that forest can deal with missing values and has decent performance.""" |
|
    # Create dataset with missing values
    X_missing = X.copy()
    X_missing[rng.choice([False, True], size=X.shape, p=[0.95, 0.05])] = np.nan
Add an assertion that `X_missing` indeed has `np.nan` in it.
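The suggested sanity check could look like the following sketch (shapes and seed are illustrative, not from the PR): with a 5% NaN probability over a 100x10 matrix it is overwhelmingly likely that at least one entry is masked, so asserting the presence of NaNs catches a silent no-op.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 10)

# Mask ~5% of the entries, as in the test under review.
X_missing = X.copy()
X_missing[rng.choice([False, True], size=X.shape, p=[0.95, 0.05])] = np.nan

# The sanity assertion requested above: NaNs really landed in X_missing.
assert np.isnan(X_missing).any()
```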
@thomasjpfan Is support for missing values in ExtraTrees also included in this PR?
Reference Issues/PRs
Follow up to #23595
What does this implement/fix? Explain your changes.
This PR enables missing-value support for random forests. I ran the same benchmarks from #23595 with Random Forest. The benchmarks confirm that there are no regressions compared to main when there are no missing values:
Any other comments?
Implementation-wise, the forest constructs a boolean array of size `(n_features,)` and passes it along to each tree in `_fit`. This helps preserve the performance compared to `main`, because the missing value check is only performed once.
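A rough sketch of that idea (not the actual scikit-learn implementation; the helper name is invented): scan `X` once for per-feature missingness, then hand the boolean mask to every tree instead of re-checking `X` in each tree's fit.

```python
import numpy as np


def missing_values_in_feature_mask(X):
    # Boolean array of shape (n_features,): True where the feature
    # contains at least one NaN. Computed once for the whole forest.
    return np.isnan(X).any(axis=0)


X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, 6.0]])
mask = missing_values_in_feature_mask(X)
print(mask.tolist())  # [False, True, False]
```

Each tree would receive `mask` at fit time, so the full NaN scan over `X` happens once rather than once per tree.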