Inflated results on random-data with SVM #25631

Open
@CriticalValue

Description

Describe the bug

When training and evaluating a support vector machine in scikit-learn, I see some unexpected behaviour, and I am not sure whether I am doing something wrong or whether this is a bug.

It only occurs under a very specific combination of circumstances:

  • LeaveOneOut() is used as the cross-validation procedure
  • An SVM (svm.SVC) is used with probability=True and a small C, such as 0.01
  • The y labels are exactly balanced (i.e. the mean of y is 0.5)

Under these circumstances, the trained SVM scores very well on randomly generated data, where it should be near chance: the Brier score is roughly 0.10 - 0.15 and the AUC is > 0.9.
If the y labels are not exactly balanced, or the SVM is swapped out for a LogisticRegression, I get the expected results (Brier of 0.25, AUC near 0.5).
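
One property of this setup that may be relevant (offered as an observation, not as a diagnosis of the behaviour above): with exactly balanced labels and leave-one-out splitting, the class frequency in each training fold is fully determined by the held-out label. The sketch below shows this in pure Python; `loo_train_priors` is a hypothetical helper written for illustration and is not part of scikit-learn.

```python
import random


def loo_train_priors(y):
    """For each held-out sample, return (held-out label, fraction of 1s in its training fold)."""
    n = len(y)
    total = sum(y)
    # Removing the held-out sample leaves (total - label) positives among n - 1 samples.
    return [(label, (total - label) / (n - 1)) for label in y]


random.seed(0)
y = [0, 1] * 10          # 20 exactly balanced labels, as in the report
random.shuffle(y)

pairs = loo_train_priors(y)
priors_when_0 = {prior for label, prior in pairs if label == 0}
priors_when_1 = {prior for label, prior in pairs if label == 1}

print(priors_when_0)  # one value: 10/19, training fold has more 1s
print(priors_when_1)  # one value: 9/19, training fold has more 0s
```

In every fold the held-out sample belongs to the class that is in the minority in the training data, so any model whose output tracks the training class frequency carries information about the held-out label even though the features are pure noise.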

Steps/Code to Reproduce

import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import LeaveOneOut
from tqdm import tqdm


N = 20           # samples per simulated dataset
N_FEATURES = 50  # pure-noise features


scores = []
for z in tqdm(range(500)):
    # Features and labels are drawn independently, so no model should beat chance.
    X = np.random.normal(0, 1, size=(N, N_FEATURES))
    y = np.random.binomial(1, 0.5, size=N)

    # Force exactly balanced labels in the first 10 repetitions so that the
    # mean(y) == 0.5 case is guaranteed to occur.
    if z < 10:
        y = np.array([0, 1] * (N // 2))
        y = np.random.permutation(y)

    # Collect the out-of-fold predictions of a leave-one-out loop.
    y_real, y_pred = [], []
    skf_outer = LeaveOneOut()
    for train_index, test_index in skf_outer.split(X, y):
        X_train, X_test = X[train_index], X[test_index, :]
        y_train, y_test = y[train_index], y[test_index]

        # Swapping in LogisticRegression() here gives chance-level results.
        clf = svm.SVC(probability=True, C=0.01)

        clf.fit(X_train, y_train)
        predictions = clf.predict_proba(X_test)[:, 1]

        y_pred.extend(predictions)
        y_real.extend(y_test)

    scores.append([np.mean(y),
                   brier_score_loss(np.array(y_real), np.array(y_pred)),
                   roc_auc_score(np.array(y_real), np.array(y_pred))])

# Average the scores separately for exactly balanced and unbalanced label sets.
df_scores = pd.DataFrame(scores)
df_scores.columns = ['y_label', 'brier', 'auc']
df_scores['y_0.5'] = df_scores['y_label'] == 0.5
df_scores = df_scores.groupby(['y_0.5']).mean()
print(df_scores)

Expected Results

I would expect all results to be roughly similar, with a Brier score of ~0.25 and an AUC of ~0.5.
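
For reference, these chance-level values are what an uninformative predictor that always outputs 0.5 would score. The following is a minimal sketch using hand-rolled metric helpers (`brier_score` and `pairwise_auc` are hypothetical names, written here for illustration rather than taken from scikit-learn); the AUC helper uses the pairwise-comparison definition, counting ties as half a win.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def pairwise_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 1] * 10              # 20 balanced labels
constant = [0.5] * len(y_true)    # predictor with no information

chance_brier = brier_score(y_true, constant)
chance_auc = pairwise_auc(y_true, constant)
print(chance_brier, chance_auc)   # 0.25 0.5
```

Scores well below 0.25 for Brier, or far from 0.5 for AUC in either direction, therefore indicate that the predictions carry information about the held-out labels.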

Actual Results

        y_label     brier       auc
y_0.5                              
False  0.514649  0.298204  0.216884
True   0.500000  0.159728  0.999080

Here you can see that when the mean of the y labels is exactly 0.5, the results are far better than chance (AUC of 0.999, Brier of 0.16), even though the data is randomly generated in each of the 500 repetitions.

Versions

System:
    python: 3.8.15 (default, Nov 24 2022, 14:38:14) [MSC v.1916 64 bit (AMD64)]
executable: C:\ProgramData\Anaconda3\envs\test\python.exe
   machine: Windows-10-10.0.19044-SP0
Python dependencies:
      sklearn: 1.2.0
          pip: 22.2.2
   setuptools: 61.2.0
        numpy: 1.19.5
        scipy: 1.10.0
       Cython: 0.29.14
       pandas: 1.4.4
   matplotlib: 3.6.3
       joblib: 1.2.0
threadpoolctl: 2.2.0
Built with OpenMP: True
threadpoolctl info:
       filepath: C:\ProgramData\Anaconda3\envs\test\Library\bin\mkl_rt.1.dll
         prefix: mkl_rt
       user_api: blas
   internal_api: mkl
        version: 2021.4-Product
    num_threads: 8
threading_layer: intel
       filepath: C:\Users\manuser\AppData\Roaming\Python\Python38\site-packages\scipy.libs\libopenblas-802f9ed1179cb9c9b03d67ff79f48187.dll
         prefix: libopenblas
       user_api: blas
   internal_api: openblas
        version: 0.3.18
    num_threads: 16
threading_layer: pthreads
   architecture: Prescott
       filepath: C:\ProgramData\Anaconda3\envs\test\Lib\site-packages\sklearn\.libs\vcomp140.dll
         prefix: vcomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 8
       filepath: C:\ProgramData\Anaconda3\envs\test\Library\bin\libiomp5md.dll
         prefix: libiomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 8
       filepath: C:\Users\manuser\AppData\Roaming\Python\Python38\site-packages\mxnet\libopenblas.dll
         prefix: libopenblas
       user_api: blas
   internal_api: openblas
        version: None
    num_threads: 16
threading_layer: pthreads
   architecture: Prescott
       filepath: C:\ProgramData\Anaconda3\envs\test\Lib\site-packages\torch\lib\libiomp5md.dll
         prefix: libiomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 16
       filepath: C:\ProgramData\Anaconda3\envs\test\Lib\site-packages\torch\lib\libiompstubs5md.dll
         prefix: libiomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 1
