ENH Adds support for missing values in Random Forest #26391

Merged (19 commits, Jul 27, 2023)

Conversation

thomasjpfan (Member)

Reference Issues/PRs

Follow up to #23595

What does this implement/fix? Explain your changes.

This PR enables missing-value support for random forests. I ran the same benchmarks from #23595 with Random Forest. The benchmarks confirm that there are no regressions compared to main when there are no missing values:

(benchmark results image)

Any other comments?

Implementation-wise, the forest constructs a boolean array of shape (n_features,) and passes it along to each tree in _fit. This preserves performance relative to main, because the missing-value check is performed only once rather than once per tree.
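As a rough illustration of that single up-front check (this is a sketch, not the PR's actual code; compute_missing_mask is a hypothetical name):

```python
import numpy as np

def compute_missing_mask(X):
    # Hypothetical helper: one pass over the data recording, per feature,
    # whether any value is missing (NaN).
    return np.isnan(X).any(axis=0)

X = np.array([[1.0, np.nan, 3.0],
              [2.0, 5.0, 6.0]])

# Computed once by the forest and shared with every tree's fit,
# instead of each tree re-scanning X for NaNs.
mask = compute_missing_mask(X)  # boolean, shape (n_features,)
```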

@thomasjpfan thomasjpfan added this to the 1.3 milestone May 17, 2023
@@ -632,6 +671,15 @@ def feature_importances_(self):
all_importances = np.mean(all_importances, axis=0, dtype=np.float64)
return all_importances / np.sum(all_importances)

def _more_tags(self):
# Ignore errors because the parameters are not validated
A reviewer (Member) commented:

Can you help me understand this a bit more? I was thinking that _safe_tags is the one that would raise an exception here, but as far as I can tell it would raise a ValueError. So I'm confused about where the exception comes from (set_params?) and why that means it is Ok to return an empty {} from _more_tags.

thomasjpfan (Member, Author) replied:

Much of the error catching comes from the estimator not being validated when _safe_tags(estimator) is called. Historically, this results in estimator checks failing because _safe_tags is called before fit.

> where the exception comes from

Yes, it was set_params or clone. (self.estimator is not validated at this point)

> return an empty {} from _more_tags.

In that case, an error occurred and _more_tags cannot determine the tags. Returning an empty dictionary means "the tags cannot be determined, so use the defaults".

In any case, I updated the PR to be more explicit.

@betatim (Member) commented May 24, 2023:

Trying to wrap my head around the bit of code I commented on; otherwise I think this looks good already.

@jeremiedbb (Member) left a comment:

Looks good. Maybe you could add a section in the docs about missing-value support, like the one the HistGradientBoosting* estimators have.

Comment on lines +623 to +626
if self.estimators_[0]._support_missing_values(X):
force_all_finite = "allow-nan"
else:
force_all_finite = True
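What this dispatch buys can be mimicked without scikit-learn (the real validation lives in check_array; this toy version only imitates the force_all_finite behavior):

```python
import numpy as np

def validate(X, force_all_finite=True):
    # Toy mimic of scikit-learn's input validation: True rejects any
    # non-finite value outright; "allow-nan" lets NaN through so the
    # estimator can treat it as a missing-value marker.
    X = np.asarray(X, dtype=float)
    if force_all_finite is True and not np.isfinite(X).all():
        raise ValueError("Input contains NaN or infinity")
    return X

X = [[1.0, float("nan")], [2.0, 3.0]]
validate(X, force_all_finite="allow-nan")  # accepted: NaN means missing
try:
    validate(X, force_all_finite=True)
except ValueError:
    pass  # rejected: NaN treated as invalid input
```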
A reviewer (Member) commented:

I think we need to have a discussion about whether or not NaNs should be treated automatically as missing values. The issue I see here is that if this estimator sits in a pipeline and a bugged transformer upstream outputs NaNs, they will be silently treated as missing values by the random forest.

Maybe missing value support should be enabled through a parameter, and/or through the config context. Maybe we should add an after-fit check in our transformers to ensure they did not create nans.

Anyway, this is consistent with the current behavior of HGBT so I'm fine merging it as is for now.
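The after-fit check floated above could look something like this sketch (fit_transform_checked is a hypothetical helper, not an existing scikit-learn API; BuggyScaler is a contrived example):

```python
import numpy as np

def fit_transform_checked(transformer, X):
    # Hypothetical guard: fail loudly if a transformer produces NaNs,
    # instead of letting a NaN-tolerant forest downstream silently
    # treat them as missing values.
    Xt = transformer.fit_transform(X)
    if np.isnan(Xt).any():
        raise ValueError(f"{type(transformer).__name__} produced NaN values")
    return Xt

class BuggyScaler:
    def fit_transform(self, X):
        # Bug: a constant feature has zero range, and 0/0 yields NaN.
        X = np.asarray(X, dtype=float)
        span = X.max(axis=0) - X.min(axis=0)
        with np.errstate(divide="ignore", invalid="ignore"):
            return (X - X.min(axis=0)) / span

X = np.array([[1.0, 5.0], [1.0, 7.0]])  # first feature is constant
try:
    fit_transform_checked(BuggyScaler(), X)
except ValueError as e:
    print(e)  # BuggyScaler produced NaN values
```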

A reviewer (Member) replied:

👍 to your conclusion.

To continue your line of thought: how often do we think someone actually wants NaNs to mean missing values, versus NaNs appearing because of a bug? Based on which we think is more likely, we should either make the handling automatic (with an opt-in warning/exception) or make the warning/exception the default, with an opt-out.

@jjerphan (Member) left a comment:

Thank you, @thomasjpfan.

I just have a few suggestions.

Comment on lines 1854 to 1855
assert forest.score(X_train, y_train) >= 0.80
assert forest.score(X_test, y_test) >= 0.75
A reviewer (Member) commented:

Should we also compare it against the score for a forest trained on non-missing data?

Comment on lines +674 to +678
def _more_tags(self):
# Only the criterion is required to determine if the tree supports
# missing values
estimator = type(self.estimator)(criterion=self.criterion)
return {"allow_nan": _safe_tags(estimator, key="allow_nan")}
A reviewer (Member) commented:

Should we test the support or non-support of the criteria?

github-actions bot commented Jun 23, 2023:

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: d0eb5a9.

@jeremiedbb jeremiedbb modified the milestones: 1.3, 1.4 Jul 6, 2023
@jjerphan (Member) left a comment:

LGTM. Thank you, @thomasjpfan.

To me this PR is mergeable, since the failures on CI are unrelated to the proposed changes.

Comment on lines 363 to 368
- |MajorFeature| :class:`ensemble.RandomForestClassifier` and
:class:`ensemble.RandomForestRegressor` support missing values when
the criterion is `gini`, `entropy`, or `log_loss`,
for classification or `squared_error`, `friedman_mse`, or `poisson`
for regression. :pr:`26391` by `Thomas Fan`_.

A reviewer (Member) commented:

This entry must be moved to doc/whats_new/v1.4.rst.

@jjerphan jjerphan enabled auto-merge (squash) July 27, 2023 14:36
@jjerphan jjerphan merged commit 4094851 into scikit-learn:main Jul 27, 2023
@jjerphan (Member):

I was not expecting this PR to be merged with a single approval. 🫤

Should we require two approvals for auto-merging PRs?

@thomasjpfan (Member, Author):

As a default, I think two approvals is too much, because some simple PRs, such as documentation or CI fixes, should only require one review. I'd say only click the green auto-merge button if you were going to merge anyway but want to wait for CI.

@betatim Do you have any issues with this PR? I prefer not to revert it and open a new PR just because there was not a second approval.

punndcoder28 pushed a commit to punndcoder28/scikit-learn that referenced this pull request Jul 29, 2023

Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@lorentzenchr (Member) left a comment:

Here comes a post-merge approval, with nitpicks that do not necessarily need to be addressed.

],
)
def test_missing_values_is_resilient(make_data, Forest):
"""Check that forest can deal with missing values and have decent performance."""
A reviewer (Member) suggested:

Suggested change
"""Check that forest can deal with missing values and have decent performance."""
"""Check that forest can deal with missing values and has decent performance."""


# Create dataset with missing values
X_missing = X.copy()
X_missing[rng.choice([False, True], size=X.shape, p=[0.95, 0.05])] = np.nan
A reviewer (Member) commented:

Add an assertion that X_missing has indeed np.nan in it.
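Concretely, the suggested assertion could look like this (a sketch mirroring the test's setup; the seed and array sizes are assumptions):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(300, 4)

# Create dataset with missing values, as in the test above: each entry
# is independently replaced by NaN with probability 0.05.
X_missing = X.copy()
X_missing[rng.choice([False, True], size=X.shape, p=[0.95, 0.05])] = np.nan

# Suggested guard: make sure the masking actually introduced NaNs,
# so the test cannot silently degenerate into the no-missing case.
assert np.isnan(X_missing).any()
```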

REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023

Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@mglowacki100 commented Dec 1, 2023:

@thomasjpfan Is support for missing values in ExtraTrees also included in this PR?
Edit: I've checked, and missing values are only handled for Random Forest in this PR.

6 participants