
[Meta] consistently break ties randomly in scikit-learn estimators (with random_state) in an unbiased way #23728


Open
ogrisel opened this issue Jun 22, 2022 · 3 comments
Labels
Needs Decision Requires decision

Comments

@ogrisel
Member

ogrisel commented Jun 22, 2022

Some estimators have arbitrary ways to break ties:

  • k-nearest neighbors with data points lying on a uniform 1d grid (see BallTree.query returns inconsistent indices between scikit-learn versions 0.24.1 and 1.1.1 #23667 for the BallTree for instance);
  • tied splits on different features in histogram gradient boosted trees with redundant features;
  • decision tree splits on the same feature with equivalent thresholds: for X = [[0], [1], [2], [3]] and y = [0, 1, 1, 0], X > 0.5 and X > 2.5 are tied splits, but only X > 0.5 is ever considered;
  • possibly other models (please feel free to edit this list or suggest missing estimators in the comments).
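The decision-tree bullet above can be checked by hand. Here is a minimal sketch (pure Python, not the actual scikit-learn splitter code) that computes the weighted Gini impurity of each candidate threshold for X = [[0], [1], [2], [3]] and y = [0, 1, 1, 0]:

```python
def gini(labels):
    # Gini impurity of a binary label set.
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1 - p1) ** 2

def weighted_gini(X, y, threshold):
    # Impurity of the split "x <= threshold", weighted by child sizes.
    left = [yi for xi, yi in zip(X, y) if xi <= threshold]
    right = [yi for xi, yi in zip(X, y) if xi > threshold]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

X, y = [0, 1, 2, 3], [0, 1, 1, 0]
scores = {t: weighted_gini(X, y, t) for t in (0.5, 1.5, 2.5)}
print(scores)  # thresholds 0.5 and 2.5 both score 1/3; 1.5 scores 0.5
```

Both 0.5 and 2.5 achieve exactly the same impurity reduction, yet a splitter that scans thresholds in increasing order will always report 0.5.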

If the tie-breaking logic is deterministic, it can introduce an uncontrollable bias into a data science pipeline. For instance, when analyzing the feature importances of a histogram gradient boosting model (via permutations or SHAP values), the first feature of a group of redundant features would always be deterministically picked up by the model, which could lead a naive data scientist to believe that the other features of the group are not as predictive.

Note that this is not the case for our traditional DecisionTreeClassifier/Regressor, RandomForestClassifier/Regressor, and extra trees, because they all do feature shuffling (controllable by random_state) by default, even when max_features == 1.0. This makes it easy to run the same study many times with different seeds and check whether the results are an artifact of arbitrary tie breaking.
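One generic way to get the unbiased behavior requested in the title is to collect all candidates within floating-point tolerance of the best score and draw one uniformly with a seeded RNG. A hypothetical sketch (`pick_best_candidate` is not a scikit-learn API, just an illustration of the idea):

```python
import math
import random

def pick_best_candidate(scores, random_state=None, rel_tol=1e-12):
    # Index of a minimal score; if several scores are tied up to
    # floating-point tolerance, pick one of them uniformly at random
    # using a seed-controlled RNG for reproducibility.
    best = min(scores)
    tied = [i for i, s in enumerate(scores)
            if math.isclose(s, best, rel_tol=rel_tol)]
    rng = random.Random(random_state)
    return rng.choice(tied)

scores = [0.5, 1 / 3, 1 / 3, 0.9]  # indices 1 and 2 are exactly tied
picks = {pick_best_candidate(scores, random_state=s) for s in range(20)}
print(picks)  # which tied index wins depends on the seed
```

Rerunning the pipeline with different seeds then surfaces the tie instead of hiding it behind a fixed deterministic choice.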

@jondo
Contributor

jondo commented Oct 24, 2024

Note that feature shuffling before the call does not help with the k-NN bias: all the distance ties would still be resolved in the same deterministic way rather than independently.
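The distance-tie problem is easy to reproduce with a few lines (a sketch, not the actual BallTree code path): with a single 1d feature there is nothing to shuffle, and a stable sort always favors the lower index among equidistant neighbors.

```python
# A query point on a uniform 1d grid is exactly equidistant from its
# two nearest grid points, yet a stable sort always returns the lower
# index first -- a deterministic, biased tie-break.
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
query = 1.5
dist = [abs(x - query) for x in grid]
order = sorted(range(len(grid)), key=dist.__getitem__)  # stable sort
k = 1
print(order[:k])  # always [1], never [2], although dist[1] == dist[2]
```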

@ogrisel
Member Author

ogrisel commented Mar 19, 2025

While investigating the causes of XFAIL check_sample_weight_equivalence_on_dense_data for tree-based models (e.g. GradientBoostingRegressor), we found another case of such deterministic bias: when there are tied splits on the same feature with different threshold values, only the lowest threshold is ever selected. I updated the description to add a bullet point for this one.

Neither feature-wise shuffling nor sample-wise dataset shuffling can fix this one. Sample-wise shuffling can introduce machine-level rounding errors that break those ties seemingly at random, but still in a biased manner (always favoring smaller threshold values), which makes inference on the learned partitions of the feature space particularly difficult to interpret.
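The rounding-error effect mentioned here is easy to demonstrate: summing the same three floating-point values in different orders yields two distinct results, so reordering samples perturbs tied statistics in an uncontrolled rather than uniformly random way.

```python
import itertools

# Floating-point addition is not associative: the same three values
# accumulated in different orders land on two distinct doubles that
# differ by one ulp.
values = [0.1, 0.2, 0.3]
totals = set()
for perm in itertools.permutations(values):
    total = 0.0
    for v in perm:
        total += v
    totals.add(total)
print(sorted(totals))  # {0.6, 0.6000000000000001}
```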

@ogrisel ogrisel changed the title [Meta] consistently break ties randomly in scikit-learn estimators (with random_state) [Meta] consistently break ties randomly in scikit-learn estimators (with random_state) in an unbiased way Apr 7, 2025
@antoinebaker
Contributor

Here is a notebook to showcase some exact/close split ties that happen in small trees.
https://gist.github.com/antoinebaker/461ab61f80ea8538a52e80491319f670

To summarize, we refit the same decision tree for 100 random states and collect the splits.
There is a unique split for the root node. For the left and right children, we get (exact or close) tied splits:

  • we can get different (but allclose) impurities for the same (feature, threshold) pair, see for example node=left, feature=2.
  • we can get different (but allclose) values for the same (feature, threshold) pair, see for example node=right, feature=6.

As mentioned above, this is likely due to systematic (but uncontrolled) rounding errors, and it is not solved by shuffling the features or the samples.

| node | node_id | feature | threshold | impurity | value | size |
|------|---------|---------|-----------|----------|-------|------|
| left | 1 | 2 | -0.5120572894811630 | 0.3224197418352606 | 0.6375592996157050 | 34 |
| left | 1 | 2 | -0.5120572894811630 | 0.3224197418352607 | 0.6375592996157050 | 19 |
| left | 1 | 10 | -0.9393941909074783 | 0.3224197418352606 | 0.6375592996157050 | 28 |
| left | 1 | 10 | -0.9393941909074783 | 0.3224197418352607 | 0.6375592996157050 | 19 |
| right | 6 | 2 | -0.7026711404323578 | 0.2909537914759641 | -0.5731517908772066 | 3 |
| right | 6 | 2 | -0.7026711404323578 | 0.2909537914759643 | -0.5731517908772066 | 15 |
| right | 6 | 2 | -0.7026711404323578 | 0.2909537914759646 | -0.5731517908772066 | 3 |
| right | 6 | 11 | 1.2267193496227264 | 0.2909537914759641 | -0.5731517908772066 | 5 |
| right | 6 | 11 | 1.2267193496227264 | 0.2909537914759641 | -0.5731517908772065 | 18 |
| right | 6 | 11 | 1.2267193496227264 | 0.2909537914759643 | -0.5731517908772066 | 9 |
| right | 6 | 11 | 1.2267193496227264 | 0.2909537914759643 | -0.5731517908772065 | 28 |
| right | 6 | 11 | 1.2267193496227264 | 0.2909537914759646 | -0.5731517908772066 | 3 |
| right | 6 | 14 | -2.0491334795951843 | 0.2909537914759641 | -0.5731517908772066 | 5 |
| right | 6 | 14 | -2.0491334795951843 | 0.2909537914759643 | -0.5731517908772066 | 10 |
| right | 6 | 14 | -2.0491334795951843 | 0.2909537914759646 | -0.5731517908772066 | 1 |
| root | 0 | 11 | 0.0169256031513214 | 0.6616304844057505 | 0.1532748634185404 | 100 |
