[Meta] consistently break ties randomly in scikit-learn estimators (with random_state) in an unbiased way #23728
Note that feature shuffling before the call does not help with the k-NN bias: the distance ties would still not be resolved independently.
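Below is a minimal sketch (not from the issue) illustrating the point, assuming two training points exactly equidistant from a query: permuting the feature columns leaves the Euclidean distances unchanged, so the tie persists and, in practice, is always won by the lower sample index.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two training points exactly equidistant from the query point.
X_train = np.array([[1.0, 0.0], [0.0, 1.0]])
y_train = np.array([0, 1])
query = np.array([[0.0, 0.0]])

for seed in range(3):
    # Shuffle the feature columns: Euclidean distances are unchanged,
    # so the distance tie persists regardless of column order.
    perm = np.random.default_rng(seed).permutation(2)
    knn = KNeighborsClassifier(n_neighbors=1).fit(X_train[:, perm], y_train)
    dist, idx = knn.kneighbors(query[:, perm])
    print(idx.ravel())  # in practice always [0]: the lower index wins the tie
```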
(Found while investigating the causes of an XFAIL.) Neither feature-wise shuffling nor sample-wise dataset shuffling can fix this one. Sample-wise shuffling can introduce machine-level rounding errors that break those ties randomly, but still in a biased manner (always favoring smaller threshold values), which makes inference on the learned partitions of the feature space particularly difficult to interpret.
Here is a notebook showcasing some exact or near-exact split ties that happen in small trees. To summarize, we refit the same decision tree for 100 random states and collect the splits (a sketch of the experiment follows below).
As mentioned above, this is likely due to systematic (but uncontrolled) rounding errors, and is not solved by shuffling the features or samples.
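For reference, here is a minimal sketch of the kind of experiment the notebook runs (on a hypothetical stand-in dataset, not the notebook's own data): refit the same small tree for 100 random states and tally the `(feature, threshold)` pair chosen at the root.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in dataset; the linked notebook uses its own data.
X, y = make_classification(n_samples=50, n_features=5, random_state=0)

root_splits = Counter()
for seed in range(100):
    tree = DecisionTreeClassifier(max_depth=2, random_state=seed).fit(X, y)
    key = (int(tree.tree_.feature[0]), round(float(tree.tree_.threshold[0]), 6))
    root_splits[key] += 1

print(root_splits)  # tied root splits show up as several distinct keys
```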
Some estimators have arbitrary ways to break ties. For a decision tree fit on `X = [[0], [1], [2], [3]]` and `y = [0, 1, 1, 0]`, the splits `X > 0.5` and `X > 2.5` are tied, but only `X > 0.5` is ever considered (a sketch reproducing this follows below).

If the tie-breaking logic is deterministic, it can introduce a non-controllable bias into a data science pipeline. For instance, when analyzing the feature importances of a histogram gradient boosting model (via permutations or SHAP values), the first feature of a group of redundant features would always be picked deterministically by the model, which could lead a naive data scientist to believe that the other features of the group are not as predictive.
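A sketch reproducing the tied-split example above: both candidate splits yield the same impurity decrease, yet the learned tree always uses the smaller threshold, and shuffling the sample order does not change that for exact ties.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0], [1], [2], [3]])
y = np.array([0, 1, 1, 0])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.tree_.threshold[0])  # 0.5: the smaller tied threshold, never 2.5

# Sample-wise shuffling does not change the outcome for exact ties:
for seed in range(5):
    perm = np.random.default_rng(seed).permutation(4)
    t = DecisionTreeClassifier(random_state=0).fit(X[perm], y[perm])
    print(t.tree_.threshold[0])  # still 0.5
```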
Note that this is not the case for our traditional `DecisionTreeClassifier`/`Regressor`, `RandomForestClassifier`/`Regressor` and extra trees, because they all do feature shuffling (controllable by `random_state`) by default, even when `max_features == 1.0`. This makes it easy to conduct the same study many times with different seeds to check whether the results are an artifact of arbitrary tie breaking.
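As a sketch of such a study (on hypothetical data): with two identical, fully redundant columns, the feature chosen at the root of a `DecisionTreeClassifier` varies with `random_state` because the features are visited in a shuffled order, so rerunning with several seeds exposes the arbitrariness of the tie break.

```python
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.c_[x, x]                # feature 1 is an exact duplicate of feature 0
y = (x > 0).astype(int)

root_features = Counter()
for seed in range(50):
    tree = DecisionTreeClassifier(random_state=seed).fit(X, y)
    root_features[int(tree.tree_.feature[0])] += 1

print(root_features)  # both feature 0 and feature 1 appear across seeds
```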