Different Python version causes a different distribution of classification result #31206
Comments
Hi, I'd like to work on this issue.
If you need any more information, please let me know. If needed, I also have a script that runs the program many times and counts how often each distinct result occurs.
I didn't understand what exactly I have to do?
It would be great if you could open a PR. If you link it here, I can have a look and see what I can contribute.
@GAVARA-PRABHAS-RAM First, we need to analyze the root cause of the behavior reported by @GloC99 and assess whether this is a bug or not.
@GloC99 thanks for the report. I confirm I can reproduce locally with a fresh conda-forge based environment running Python 3.13 on macOS. But, strangely, I could not reproduce using my usual dev env running Python 3.12 and scikit-learn
Some preliminary remarks:
>>> np.unique([e.tree_.node_count for e in model.estimators_])
array([1])
I will try to investigate a bit further to understand the source of the non-deterministic behavior now that I can reproduce.
I think I understand. Because this model is fit with This can also be confirmed with the fact that class frequencies stored in the single leaf
>>> np.allclose(
...     np.vstack([e.tree_.value.squeeze() for e in model.estimators_]),
...     np.full(shape=(model.n_estimators, 3), fill_value=1/3),
... )
True
So the individual trees return identically tied
scikit-learn/sklearn/ensemble/_forest.py Lines 956 to 959 in 5943ab2
which calls into: scikit-learn/sklearn/ensemble/_forest.py Lines 723 to 736 in 5943ab2
Because floating point operations have rounding errors, the ordering of the operations matters, and it is not deterministic when As a result, the predict function which returns
I therefore think this is not a bug. If you want to get deterministic predictions, you can call
In the future, we could change the code to aggregate the parallel predictions in a deterministic order while not allocating too much memory for temporary prediction array in case the forest has a very large number of trees by using the
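To make the rounding argument above concrete, here is a minimal pure-Python sketch. It does not use scikit-learn and is not the library's implementation: `TREE_SCORES`, `accumulate`, `predict`, and `predict_fixed_order` are hypothetical names introduced for illustration, and the scores are toy values chosen so that the two columns have the same exact sum. It shows that adding the same per-tree values in a different order can flip which class has the largest total, and that sorting the per-tree results by index before summing removes the dependence on arrival order.

```python
import random

# Hypothetical per-tree scores for two classes (toy values, not real
# scikit-learn output). Both columns hold the same three numbers in a
# different order, so their exact sums are equal.
TREE_SCORES = [(0.1, 0.3), (0.2, 0.2), (0.3, 0.1)]


def accumulate(rows):
    """Add the rows one after another, like a shared accumulator would."""
    totals = [0.0, 0.0]
    for row in rows:
        for k, value in enumerate(row):
            totals[k] += value
    return totals


def predict(rows):
    """Return the index of the largest total (first index wins exact ties)."""
    totals = accumulate(rows)
    return totals.index(max(totals))


forward = predict(TREE_SCORES)         # trees finish in order 0, 1, 2
backward = predict(TREE_SCORES[::-1])  # trees finish in order 2, 1, 0
print(accumulate(TREE_SCORES), forward)         # [0.6000000000000001, 0.6] 0
print(accumulate(TREE_SCORES[::-1]), backward)  # [0.6, 0.6000000000000001] 1


def predict_fixed_order(indexed_rows):
    """Sort results by tree index before summing, so arrival order is irrelevant."""
    ordered = [row for _, row in sorted(indexed_rows)]
    return predict(ordered)


rng = random.Random(0)
arrivals = list(enumerate(TREE_SCORES))
stable = set()
for _ in range(20):
    rng.shuffle(arrivals)  # simulate workers finishing in arbitrary order
    stable.add(predict_fixed_order(arrivals))
print(stable)  # {0}
```

Note that this sketch keeps every per-tree result in memory before summing, which is exactly the cost the comment above wants to avoid for very large forests; it only illustrates why a fixed aggregation order gives a stable answer.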
Describe the bug
Running the same code using Python 3.10 and Python 3.13 with n_jobs > 1 produced a variety of results. Python 3.10 and Python 3.13 also have different distributions.
Steps/Code to Reproduce
Expected Results
If n_jobs is 1, the result is:
Actual Results
When the program is run 10,000 times:
n_jobs=255, Python 3.10 has two possible results:
n_jobs=255, Python 3.13 has three possible results:
Versions