FIX Draw indices using sample_weight in Forest #31529

antoinebaker · 2025-06-12T08:19:04Z

Part of #16298. Similar to #31414 (Bagging estimators) but for Forest estimators.

What does this implement/fix? Explain your changes.

When subsampling is enabled (bootstrap=True), the normalized sample_weight values are now used as probabilities to draw the indices. Forest estimators then pass the statistical repeated/weighted equivalence test.

Comments

This PR does not fix Forest estimators when bootstrap=False (no subsampling). In that case sample_weight is still passed to the decision trees, and Forest estimators fail the statistical repeated/weighted equivalence test because the individual trees also fail it (probably because of tied splits in decision trees, #23728).
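
To make the repeated/weighted equivalence concrete, here is a minimal sketch (not code from this PR) showing that, for integer weights, drawing uniformly from a dataset in which row i is repeated sample_weight[i] times selects each original row with the same probability as drawing from the original rows with normalized weights. The toy weights are an assumption chosen for illustration.

```python
import numpy as np

# Hypothetical integer weights for four rows (illustration only).
sample_weight = np.array([3, 1, 0, 2])
n_samples = sample_weight.shape[0]

# "Repeated" dataset: original row i appears sample_weight[i] times.
repeated_origin = np.repeat(np.arange(n_samples), sample_weight)

# A uniform draw over the repeated rows hits original row i with this probability.
p_repeated = (
    np.bincount(repeated_origin, minlength=n_samples) / repeated_origin.shape[0]
)

# A weighted draw over the original rows uses the normalized weights.
p_weighted = sample_weight / sample_weight.sum()

print(p_repeated)
print(p_weighted)
```

Both probability vectors coincide, which is the property the bootstrap draw needs in order for the weighted and repeated fits to follow the same distribution.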

TODO

- choose how to generate indices in the sample_weight=None case
- fix relative (float) max_samples as done in "FIX Draw indices using sample_weight in Bagging" #31414
- docstrings
- fix class_weight = "balanced" as done in "Fix linear svc handling sample weights under class_weight="balanced"" #30057
- fix class_weight = "balanced_subsample"
- changelog

github-actions · 2025-06-12T08:19:52Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: c612909.

antoinebaker · 2025-06-12T09:14:29Z

The forest estimators now pass the statistical repeated/weighted equivalence test, for example

antoinebaker · 2025-06-16T07:49:21Z

Relative (float) max_samples, with the new meaning of drawing max_samples * sw_sum indices as done in #31414, also passes the statistical repeated/weighted equivalence test.

antoinebaker · 2025-06-25T09:29:58Z

The class_weight="balanced" option, which now takes sample_weight into account as in #30057, passes the statistical repeated/weighted equivalence test.

antoinebaker · 2025-06-25T09:37:08Z

The class_weight="balanced_subsample" option also passes. In that case sample_weight is used to draw the indices; the class weights are then computed on the bootstrapped sample for every grown tree and passed as sample_weight to the tree fit.
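
The steps described for "balanced_subsample" can be sketched as follows. This is an illustration under stated assumptions, not the code in _forest.py: the toy data, the variable names, and the use of the usual "balanced" heuristic (total count divided by number of present classes times per-class count) on the bootstrapped counts are all assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical toy problem: imbalanced labels with non-uniform weights.
y = np.array([0, 0, 0, 0, 1, 1])
sample_weight = np.array([1.0, 1.0, 2.0, 1.0, 3.0, 1.0])
n_samples = y.shape[0]
n_classes = 2

# 1. Draw the bootstrap indices using sample_weight as probabilities.
p = sample_weight / sample_weight.sum()
indices = rng.choice(n_samples, n_samples, replace=True, p=p)

# 2. Represent the bootstrapped sample by how often each row was drawn.
counts = np.bincount(indices, minlength=n_samples).astype(float)

# 3. Compute "balanced" class weights on the bootstrapped sample.
class_counts = np.bincount(y, weights=counts, minlength=n_classes)
present = class_counts > 0  # a class may be absent from a given draw
class_weight = np.zeros(n_classes)
class_weight[present] = counts.sum() / (present.sum() * class_counts[present])

# 4. The per-tree weights combine the draw counts and the class weights.
sample_weight_tree = counts * class_weight[y]
```

With these weights every class present in the bootstrapped sample carries the same total weight in the tree fit.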

antoinebaker · 2025-06-27T10:08:39Z

sklearn/ensemble/_forest.py

+    if sample_weight is None:
+        sample_indices = random_instance.randint(0, n_samples, n_samples_bootstrap)


There are two options for the random draw of indices when sample_weight=None:

1. Convert to all ones

    if sample_weight is None:
        sample_weight = np.ones(n_samples)
    normalized_sample_weight = sample_weight / np.sum(sample_weight)
    sample_indices = random_instance.choice(
        n_samples, n_samples_bootstrap, replace=True, p=normalized_sample_weight
    )

2. Use the old code path when sample_weight=None

    if sample_weight is None:
        sample_indices = random_instance.randint(0, n_samples, n_samples_bootstrap)
    else:
        normalized_sample_weight = sample_weight / np.sum(sample_weight)
        sample_indices = random_instance.choice(
            n_samples,
            n_samples_bootstrap,
            replace=True,
            p=normalized_sample_weight,
        )

The two options use different rng functions: choice with uniform p for 1 and randint for 2. They are statistically the same but they don't give the same deterministic output for a given random state.

The benefit of 2. is that the code is backward compatible when sample_weight=None: a fit without sample_weight reproduces the same fit as main for a given random_state.

The benefit of 1. is that sample_weight=None and sample_weight=np.ones(n_samples) give the same fit for a given random_state.

Here we chose 2.
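
A small sketch of the trade-off discussed above: with the same seed, `choice` with uniform `p` and `randint` both draw valid uniform indices, but they consume the random stream differently, so the drawn sequences are not expected to match. The sizes and the seed are arbitrary values picked for illustration.

```python
import numpy as np

n_samples, n_samples_bootstrap, seed = 100, 1000, 0

# Option 1: treat sample_weight=None as all ones and use choice.
uniform_p = np.ones(n_samples) / n_samples
idx_choice = np.random.RandomState(seed).choice(
    n_samples, n_samples_bootstrap, replace=True, p=uniform_p
)

# Option 2: keep the historical randint code path.
idx_randint = np.random.RandomState(seed).randint(
    0, n_samples, n_samples_bootstrap
)

# Same seed, same distribution, but not necessarily the same indices.
print(np.array_equal(idx_choice, idx_randint))

# Each option is still reproducible on its own for a fixed seed.
idx_randint_again = np.random.RandomState(seed).randint(
    0, n_samples, n_samples_bootstrap
)
print(np.array_equal(idx_randint, idx_randint_again))
```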

ogrisel

I don't have time to finish my review today, but this looks great: I tried running the notebook of github.com/snath-xoc/sample-weight-audit-nondet/ against this branch and I confirm the statistical tests pass for RandomForestClassifier/Regressor and ExtraTreesClassifier/Regressor.

ogrisel

More feedback.

ogrisel · 2025-07-03T09:23:07Z

sklearn/ensemble/_forest.py

-            - if None, this indicates the total number of samples.
+        The maximum number of samples to draw.
+
+        - If None, then draw `n_samples` samples.


Suggested change

-        - If None, then draw `n_samples` samples.
+        - If None, then draw `n_samples` (or `sample_weight.sum()`) samples.

Currently, max_samples=None (in this helper function and in the forest estimators) always draws n_samples, whereas max_samples=1.0 draws n_samples when sample_weight is None and sample_weight.sum() otherwise.

Would you prefer max_samples=None (the default) to be equivalent to max_samples=1.0?
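
The difference between the two settings can be illustrated with a hypothetical helper. The function name, the rounding, and the floor of one sample are assumptions made for this sketch; only the None and float semantics come from the discussion above.

```python
import numpy as np


def n_samples_bootstrap_sketch(n_samples, max_samples, sample_weight=None):
    """Hypothetical helper: number of indices to draw for one tree."""
    if max_samples is None:
        # Behaviour described above: always n_samples, weights are ignored.
        return n_samples
    if isinstance(max_samples, int):
        # Absolute number of samples.
        return max_samples
    # Relative (float) max_samples: fraction of the total weight.
    sw_sum = n_samples if sample_weight is None else float(np.sum(sample_weight))
    # Rounding and the minimum of 1 are assumptions of this sketch.
    return max(int(round(max_samples * sw_sum)), 1)


weights = np.array([2.0, 2.0, 2.0, 2.0])
n_none = n_samples_bootstrap_sketch(4, None, weights)
n_one = n_samples_bootstrap_sketch(4, 1.0, weights)
n_half = n_samples_bootstrap_sketch(4, 0.5, weights)
print(n_none, n_one, n_half)
```

With every weight equal to 2, max_samples=None draws 4 indices while max_samples=1.0 draws 8, which is the discrepancy raised in the question.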

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

antoinebaker · 2025-07-18T15:18:23Z

NB: this PR uses the weighting strategy to represent the drawn indices (passing the bincount as sample_weight to tree.fit).

But for this to be correct, that is, to give the same fit as the indexing strategy (directly passing X[indices] and y[indices] to tree.fit), we will need to fix the DecisionTree estimators, which currently fail the repeated/weighted equivalence test, probably due to tied or almost tied splits (#23728).
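
The two strategies can be compared on a statistic that handles weights exactly. The sketch below uses a plain mean as a stand-in for an estimator, with toy values and a fixed seed as assumptions; it does not involve tree.fit.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 6
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Bootstrap indices, drawn uniformly here for simplicity.
indices = rng.randint(0, n_samples, n_samples)

# Indexing strategy: materialize the resampled targets.
mean_indexing = y[indices].mean()

# Weighting strategy: keep y as is and pass the draw counts as weights.
counts = np.bincount(indices, minlength=n_samples)
mean_weighting = np.average(y, weights=counts)

print(mean_indexing, mean_weighting)
```

For the mean the two strategies agree exactly. The concern raised above is that decision trees do not currently guarantee the same equivalence.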

use sample_weight in choice

9458a1c

github-actions bot added the module:ensemble label Jun 12, 2025

antoinebaker added 2 commits June 13, 2025 17:40

use old code path

2f30d7d

relative max_samples

a55643b

antoinebaker and others added 8 commits June 19, 2025 14:50

adapt tests

bcad08a

Merge branch 'main' into random_forest_sample_weight

cce2060

changelog

f77059e

add relative max_sample test

eae5d27

fix class_weight

83c45db

replace rf by gbdt

8f28c4a

cleanup

8ca62a9

comment

c311a2f

antoinebaker and others added 5 commits June 25, 2025 11:42

typo

65f3ece

docstrings

2cf1700

undo changelog

e6e083b

typo

9201e5a

Merge branch 'main' into random_forest_sample_weight

bff60ae

antoinebaker marked this pull request as ready for review June 27, 2025 10:14

ogrisel reviewed Jul 2, 2025

ogrisel reviewed Jul 3, 2025

antoinebaker and others added 2 commits July 8, 2025 17:10

Update sklearn/ensemble/_forest.py

e5282f6

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

Apply suggestions from code review

94a96e3

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

antoinebaker added 5 commits July 8, 2025 18:07

add unittest

16111e3

add comment

de9a341

dot

d7ac6b1

cast to float sample_weight_tree

0db2642

unit tests validate class weight

c612909
