
Different Python version causes a different distribution of classification result #31206

Closed
GloC99 opened this issue Apr 15, 2025 · 6 comments

@GloC99

GloC99 commented Apr 15, 2025

Describe the bug

Running the same code under Python 3.10 and Python 3.13 with n_jobs > 1 produces a variety of results, and Python 3.10 and Python 3.13 show different distributions of those results.

Steps/Code to Reproduce

import numpy as np
import random
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

# Control the randomness
random.seed(0)  
np.random.seed(0)

iris = load_iris()  
x, y = iris.data, iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

# Define and create a model
model = RandomForestClassifier(
    n_estimators=np.int64(101),
    criterion='gini',
    max_depth=np.int64(31),
    min_samples_split=7.291122019556396e-304,
    min_samples_leaf=np.int64(14876671),
    min_weight_fraction_leaf=0.0,
    max_features=None,
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    bootstrap=True,
    oob_score=False,
    n_jobs= np.int64(255),
    random_state=0,
    verbose=np.int64(0),
    warm_start=False,
    class_weight='balanced_subsample',
    ccp_alpha=0.0,
    max_samples=None)

model.fit(x_train, y_train)

# Evaluate model
y_pred = model.predict(x_test)
print("Accuracy: ", accuracy_score(y_test,
                                    y_pred))
print("Recall:",
    recall_score(y_test, y_pred, average='micro'))
# Print confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Expected Results

If n_jobs is 1, the result is:

    Accuracy:  0.43333333333333335
    Recall: 0.43333333333333335
    Confusion Matrix:
    [[ 0 11  0]
    [ 0 13  0]
    [ 0  6  0]]

Actual Results

When the program is run 10,000 times:
With n_jobs=255, Python 3.10 produces two distinct results:

    Group:
    Accuracy:  0.43333333333333335
    Recall: 0.43333333333333335
    Confusion Matrix:
    [[ 0 11  0]
    [ 0 13  0]
    [ 0  6  0]]
    Count: 9887
    
    Group:
    Accuracy:  0.36666666666666664
    Recall: 0.36666666666666664
    Confusion Matrix:
    [[11  0  0]
    [13  0  0]
    [ 6  0  0]]
    Count: 113

With n_jobs=255, Python 3.13 produces three distinct results:

    Group:
    Accuracy:  0.36666666666666664
    Recall: 0.36666666666666664
    Confusion Matrix:
    [[11  0  0]
    [13  0  0]
    [ 6  0  0]]
    Count: 7790
    
    Group:
    Accuracy:  0.43333333333333335
    Recall: 0.43333333333333335
    Confusion Matrix:
    [[ 0 11  0]
    [ 0 13  0]
    [ 0  6  0]]
    Count: 1965
    
    Group:
    Accuracy:  0.2
    Recall: 0.2
    Confusion Matrix:
    [[ 0  0 11]
    [ 0  0 13]
    [ 0  0  6]]
    Count: 245

Versions

System:
    python: 3.13.2 (main, Mar 27 2025, 14:05:19) [GCC 11.4.0]
executable: /opt/python/3.13.2/bin/python3.13
   machine: Linux-5.15.0-122-generic-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.6.1
          pip: 24.3.1
   setuptools: 75.6.0
        numpy: 2.2.4
        scipy: 1.15.2
       Cython: None
       pandas: 2.2.3
   matplotlib: None
       joblib: 1.4.2
threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libscipy_openblas
       filepath: /users/GloC99/.local/lib/python3.13/site-packages/numpy.libs/libscipy_openblas64_-6bb31eeb.so
        version: 0.3.28
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libscipy_openblas
       filepath: /users/GloC99/.local/lib/python3.13/site-packages/scipy.libs/libscipy_openblas-68440149.so
        version: 0.3.28
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: libgomp
       filepath: /users/GloC99/.local/lib/python3.13/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None

===============

System:
    python: 3.10.17 (main, Apr 10 2025, 12:04:30) [GCC 11.4.0]
executable: /opt/python/3.10.17/bin/python3.10
   machine: Linux-5.15.0-122-generic-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.6.1
          pip: 24.3.1
   setuptools: 75.6.0
        numpy: 2.2.0
        scipy: 1.14.1
       Cython: None
       pandas: 2.2.3
   matplotlib: None
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libscipy_openblas
       filepath: /users/GloC99/.local/lib/python3.10/site-packages/numpy.libs/libscipy_openblas64_-6bb31eeb.so
        version: 0.3.28
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libscipy_openblas
       filepath: /users/GloC99/.local/lib/python3.10/site-packages/scipy.libs/libscipy_openblas-c128ec02.so
        version: 0.3.27.dev
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: libgomp
       filepath: /users/GloC99/.local/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
GloC99 added the Bug and Needs Triage labels Apr 15, 2025
@GAVARA-PRABHAS-RAM

Hi, I'd like to work on this issue.
Can I go ahead and open a PR?

@GloC99
Author

GloC99 commented Apr 17, 2025

Hi, I'd like to work on this issue. Can I go ahead and open a PR?

If you need any more information, please let me know. I also have a script that runs the program many times and counts how often each distinct result occurs.
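
For reference, a sketch of what such a counting script could look like (the reproduce.py filename is an assumption; it stands for the reproducer in the issue description):

# Hypothetical counting script: run the reproducer many times and count how
# often each distinct output occurs.
import subprocess
from collections import Counter

counts = Counter()
for _ in range(10_000):
    result = subprocess.run(
        ["python", "reproduce.py"], capture_output=True, text=True
    )
    counts[result.stdout] += 1

for output, count in counts.most_common():
    print("Group:")
    print(output)
    print("Count:", count)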

@GAVARA-PRABHAS-RAM

I didn't understand what exactly I have to do?

@GloC99
Author

GloC99 commented Apr 23, 2025

I didn't understand what exactly I have to do?

It would be great if you could open a PR. If you link it here, I can have a look and see what I can contribute.

@ogrisel
Member

ogrisel commented Apr 24, 2025

I didn't understand what exactly I have to do?

@GAVARA-PRABHAS-RAM First, we need to analyze the root cause of the behavior reported by @GloC99 and assess whether this is a bug or not.

@GloC99 thanks for the report. I confirm I can reproduce locally with a fresh conda-forge based environment running Python 3.13 on macOS. But, strangely, I could not reproduce using my usual dev env running Python 3.12 and scikit-learn main built from source.

Some preliminary remarks:

  • fitting with n_jobs=255 on a machine with 16 cores is likely to be useless, or even detrimental, from a performance point of view.

  • setting min_samples_leaf=np.int64(14876671) when fitting on the 150 data points of the iris dataset means that the trees never perform any split, and each single-leaf tree consistently predicts the marginal class frequencies observed on its (bootstrapped) training set. This can be checked via:

>>> np.unique([e.tree_.node_count for e in model.estimators_])
array([1])

I will try to investigate a bit further to understand the source of the non-deterministic behavior now that I can reproduce.

ogrisel added the Needs Investigation label and removed the Needs Triage label Apr 24, 2025
ogrisel closed this as not planned Apr 24, 2025
@ogrisel
Member

ogrisel commented Apr 24, 2025

I think I understand. Because this model is fit with class_weight='balanced_subsample', all trees are fit with exactly balanced training sets: 1/3 weight for each of the 3 classes
(this is possible because, currently, bagging is implemented with sample_weight). You can confirm this by commenting out the line that sets class_weight='balanced_subsample': the outcome of the code becomes deterministic again.

This can also be confirmed by the fact that the class frequencies stored in each single leaf's value attribute are all 1/n_classes:

>>> np.allclose(
...     np.vstack([e.tree_.value.squeeze() for e in model.estimators_]),
...     np.full(shape=(model.n_estimators, 3), fill_value=1/3),
... )
True

So the individual trees return identically tied predict_proba values. But then, when n_jobs > 1, the forest aggregates the predict_proba returned by the trees in parallel using Python threads and accumulates the results in shared memory:

Parallel(n_jobs=n_jobs, verbose=self.verbose, require="sharedmem")(
    delayed(_accumulate_prediction)(e.predict_proba, X, all_proba, lock)
    for e in self.estimators_
)

which calls into:

def _accumulate_prediction(predict, X, out, lock):
    """
    This is a utility function for joblib's Parallel.

    It can't go locally in ForestClassifier or ForestRegressor, because joblib
    complains that it cannot pickle it when placed there.
    """
    prediction = predict(X, check_input=False)
    with lock:
        if len(out) == 1:
            out[0] += prediction
        else:
            for i in range(len(out)):
                out[i] += prediction[i]

Because floating point operations have rounding errors, the ordering of the operations matters; that ordering is not deterministic when n_jobs > 1 and depends on thread scheduling, hence the observed dependency on the Python version.

As a result, the predict function, which returns np.argmax(y_pred_proba, axis=1), is unstable: the exactly tied predictions are broken by rounding errors that are non-deterministic because of the use of threads when accumulating the predicted probabilities.
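
As a minimal standalone illustration (not scikit-learn code) of how summation order changes the floating point result and can flip an argmax between otherwise tied values:

import numpy as np

# The same three numbers summed in a different order round differently.
a = (0.1 + 0.2) + 0.3  # 0.6000000000000001
b = (0.3 + 0.2) + 0.1  # 0.6
print(a == b)  # False

# For exactly tied class probabilities, such tiny differences decide the argmax.
print(np.argmax(np.array([a, b])))  # 0
print(np.argmax(np.array([b, a])))  # 1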

I therefore think this is not a bug. If you want to get deterministic predictions, you can call model.set_params(n_jobs=1) (after fitting).
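
For example, reusing model and x_test from the reproducer above:

# Force sequential prediction to get a deterministic (though still arbitrary)
# tie-breaking of the exactly tied class probabilities.
model.set_params(n_jobs=1)
y_pred = model.predict(x_test)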

In the future, we could change the code to aggregate the parallel predictions in a deterministic order, without allocating too much memory for temporary prediction arrays when the forest has a very large number of trees, by using the return_as="generator" feature of joblib. However, this is quite a new feature of joblib (released a year ago), so I would rather not depend on it yet, to follow our minimum dependency version support guidelines.
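
For the record, a rough sketch of what such a deterministic aggregation could look like (a hypothetical helper, not the actual scikit-learn implementation, assuming a joblib version that supports return_as="generator"):

from joblib import Parallel, delayed

def predict_proba_deterministic(forest, X, n_jobs=2):
    # Results are yielded in submission order, so the accumulation order is
    # fixed regardless of which thread finishes first, while only a few
    # per-tree prediction arrays need to be held in memory at a time.
    results = Parallel(n_jobs=n_jobs, prefer="threads", return_as="generator")(
        delayed(tree.predict_proba)(X) for tree in forest.estimators_
    )
    total = None
    for proba in results:
        total = proba if total is None else total + proba
    return total / len(forest.estimators_)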

ogrisel removed the Needs Investigation label Apr 24, 2025