
FIX Draw indices using sample_weight in Forest #31529


Draft · wants to merge 3 commits into main
Conversation

@antoinebaker (Contributor) commented Jun 12, 2025

Part of #16298. Similar to #31414 (Bagging estimators) but for Forest estimators.

What does this implement/fix? Explain your changes.

When subsampling is activated (bootstrap=True), sample_weight is now used as probabilities to draw the bootstrap indices. Forest estimators then pass the statistical repeated/weighted equivalence test.

Comments

This PR does not fix Forest estimators when bootstrap=False (no subsampling): in that case sample_weight is still passed directly to the decision trees. Forest estimators then fail the statistical repeated/weighted equivalence test because the individual trees also fail it (probably because of tied splits in decision trees, #23728).
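The weighted draw described above can be sketched as follows. This is a minimal NumPy-only sketch; `draw_bootstrap_indices` is a hypothetical helper illustrating the approach, not the actual scikit-learn function:

```python
import numpy as np

def draw_bootstrap_indices(rng, n_samples, n_samples_bootstrap, sample_weight=None):
    """Draw bootstrap indices using sample_weight as draw probabilities.

    Hypothetical helper: weights are normalized into a probability
    vector and passed to RandomState.choice(p=...).
    """
    if sample_weight is None:
        sample_weight = np.ones(n_samples)
    p = sample_weight / np.sum(sample_weight)
    return rng.choice(n_samples, n_samples_bootstrap, replace=True, p=p)

rng = np.random.RandomState(0)
# Sample 0 has twice the weight of samples 1 and 2; sample 3 has
# weight 0 and is never drawn.
idx = draw_bootstrap_indices(rng, 4, 10_000, np.array([2.0, 1.0, 1.0, 0.0]))
```

A zero-weight sample is excluded from every bootstrap, matching the behavior of fitting on a dataset from which that row was dropped.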

TODO


github-actions bot commented Jun 12, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: a55643b.

Comment on lines 131 to 136
    if sample_weight is None:
        sample_weight = np.ones(n_samples)
    normalized_sample_weight = sample_weight / np.sum(sample_weight)
    sample_indices = random_instance.choice(
        n_samples, n_samples_bootstrap, replace=True, p=normalized_sample_weight
    )
@antoinebaker (Contributor Author) commented Jun 12, 2025

I hesitate between two options for dealing with the sample_weight=None case.

  1. Convert to all ones.
    if sample_weight is None:
        sample_weight = np.ones(n_samples)
    normalized_sample_weight = sample_weight / np.sum(sample_weight)
    sample_indices = random_instance.choice(
        n_samples, n_samples_bootstrap, replace=True, p=normalized_sample_weight
    )
  2. Use the old code path when sample_weight=None
    if sample_weight is None:
        sample_indices = random_instance.randint(
            0, n_samples, n_samples_bootstrap, dtype=np.int32
        )
    else:
        normalized_sample_weight = sample_weight / np.sum(sample_weight)
        sample_indices = random_instance.choice(
            n_samples,
            n_samples_bootstrap,
            replace=True,
            p=normalized_sample_weight,
        )

The benefit of 2 is backward compatibility when sample_weight=None: this PR and main give the exact same fit for a given random_state.

The benefit of 1. is that sample_weight=None and sample_weight=np.ones(n_samples) give the exact same fit for a given random_state.

With option 1, test_set_estimator_drop, test_rfe_features_importance and test_forest_classifier_oob fail. These tests do not use sample_weight.

With option 2, test_class_weights fails, because it checks that sample_weight=None and an all-ones sample_weight give the same results.
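The RNG consequences of the two options can be checked directly with NumPy. This is a toy illustration of the trade-off, not the scikit-learn code:

```python
import numpy as np

n_samples, n_boot = 5, 20

# Option 1: sample_weight=None is converted to all ones, so both calls
# go through the same choice(p=...) path and consume the RNG identically.
w = np.ones(n_samples)
idx_none = np.random.RandomState(42).choice(
    n_samples, n_boot, replace=True, p=np.ones(n_samples) / n_samples
)
idx_ones = np.random.RandomState(42).choice(
    n_samples, n_boot, replace=True, p=w / w.sum()
)

# Option 2: sample_weight=None keeps the old randint path, which
# consumes the RNG differently from choice(p=...), so None and all-ones
# weights generally give different indices for the same random_state.
idx_old = np.random.RandomState(42).randint(0, n_samples, n_boot, dtype=np.int32)
```

This is why the two options cannot both hold at once: one fixes the None/all-ones equivalence, the other preserves bit-for-bit reproducibility with main.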

Comment on lines +874 to +878
    # NOTE: "balanced_subsample" option is ignored, treated as "balanced"
    class_weight = self.class_weight
    if class_weight == "balanced_subsample":
        class_weight = "balanced"
    expanded_class_weight = compute_sample_weight(class_weight, y_original)
@antoinebaker (Contributor Author)

Here I choose to simply ignore the "balanced_subsample" option and treat it as the "balanced" case.
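For reference, the "balanced" weights that compute_sample_weight produces follow the documented formula n_samples / (n_classes * class_count). A NumPy-only sketch with toy labels (not the scikit-learn implementation):

```python
import numpy as np

# Toy labels: class 0 appears three times, class 1 once.
y = np.array([0, 0, 0, 1])

classes, counts = np.unique(y, return_counts=True)
# "balanced" formula: n_samples / (n_classes * count(class))
class_weight = len(y) / (len(classes) * counts)
# Map each sample to the weight of its class.
sample_weight = class_weight[np.searchsorted(classes, y)]
# Minority-class samples get larger weight: [2/3, 2/3, 2/3, 2.0]
```

Treating "balanced_subsample" as "balanced" means these weights are computed once on the full dataset rather than recomputed per bootstrap subsample.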

@antoinebaker (Contributor Author)

The forest estimators now pass the statistical repeated/weighted equivalence test, for example
[figure: statistical equivalence test results]
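The intuition behind the equivalence test can be sketched with a toy check (an illustration of the repeated/weighted idea, not the actual scikit-learn test): drawing indices with integer weights should be distributed like drawing uniformly from a dataset in which each row is repeated weight-many times.

```python
import numpy as np

rng = np.random.RandomState(0)
weights = np.array([2, 1, 1])                 # row 0 counts double
repeated = np.repeat(np.arange(3), weights)   # [0, 0, 1, 2]

n_draws = 100_000
# Weighted draw: probabilities proportional to the weights.
weighted_draws = rng.choice(3, n_draws, replace=True, p=weights / weights.sum())
# Repeated draw: uniform over the duplicated dataset.
repeated_draws = repeated[rng.randint(0, len(repeated), n_draws)]

# Row 0 should appear with frequency ~0.5 under both schemes.
freq_weighted = (weighted_draws == 0).mean()
freq_repeated = (repeated_draws == 0).mean()
```

The actual test compares fitted estimators statistically, but it rests on this sampling equivalence.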
