
GridSearchCV does not weight the score by the size of the fold when providing a custom split for CV #28575

Closed as not planned
@jackred

Description

Describe the bug

When providing an iterable for the cv argument of GridSearchCV, if the splits have different sizes (as can be the case when doing "leave one group out"), the "best" score computed at the end is a plain average of the per-fold scores, without weighting them by the number of samples in each fold.
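
For illustration, a minimal sketch (with made-up group sizes) of how LeaveOneGroupOut naturally produces test folds of very different sizes:

from sklearn.model_selection import LeaveOneGroupOut
import numpy as np

X = np.arange(100).reshape(-1, 1)
y = np.arange(100) >= 99
# hypothetical groups of 50, 49 and 1 samples
groups = np.r_[np.zeros(50), np.ones(49), np.full(1, 2)]

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    print(len(train_idx), len(test_idx))  # test folds of size 50, 49 and 1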

Consequently, the "best" estimator that gets selected is not necessarily the actual best one.

For example (as shown in the reproduction below), with 3 splits of 50, 49 and 1 samples, if the split with 1 sample gets a score of 0.0 while every other sample is predicted correctly (score of 1.0), the final score will be 0.666 instead of 0.99.
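
The discrepancy is simply the difference between an unweighted and a sample-weighted mean of the per-fold scores. A minimal sketch using the fold sizes and scores of the example above:

import numpy as np

scores = np.array([1.0, 1.0, 0.0])  # per-fold test scores
sizes = np.array([50, 49, 1])       # number of test samples in each fold

print(scores.mean())                      # 0.666..., the unweighted mean currently reported
print(np.average(scores, weights=sizes))  # 0.99, the sample-weighted mean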

Steps/Code to Reproduce

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import numpy as np


data = np.arange(100)
label = data >= 99  # only the last sample belongs to the positive class
data = data.reshape(-1, 1)


# three contiguous test folds of 50, 49 and 1 samples
indices_break = [0, 50, 99, 100]
split = [np.arange(indices_break[i], indices_break[i+1])
         for i in range(len(indices_break)-1)]


# for each test fold, train on the concatenation of the remaining folds
splits = []
for i, test in enumerate(split):
    train = split[:i]
    train.extend(split[i+1:])
    train = np.concatenate(train)
    splits.append((train, test))


model = KNeighborsClassifier()
parameters = {'n_neighbors': [1, 2]}

clf = GridSearchCV(model, parameters, cv=splits)
clf.fit(data, label)


len_split = [len(i) for i in split]
res_n_1 = []
res_n_2 = []
for i in range(len(split)):
    res_n_1.append(clf.cv_results_[f'split{i}_test_score'][0])
    res_n_2.append(clf.cv_results_[f'split{i}_test_score'][1])


sum_res_n_1_weighted = sum([res_n_1[i] * len_split[i]
                            for i in range(len(split))])
sum_res_n_2_weighted = sum([res_n_2[i] * len_split[i]
                            for i in range(len(split))])

# weighted average of the split scores
print(clf.cv_results_['mean_test_score'][0] == sum_res_n_1_weighted/100)
print(clf.cv_results_['mean_test_score'][1] == sum_res_n_2_weighted/100)

# direct (unweighted) average of the split scores
print(clf.cv_results_['mean_test_score'][0] == np.average(res_n_1))
print(clf.cv_results_['mean_test_score'][1] == np.average(res_n_2))


print('average of the metrics for n_neighbors 1 computed as average of split', np.average(res_n_1))
print('average of the metrics for n_neighbors 1 reported by clf.cv_results_', clf.cv_results_['mean_test_score'][0])
print('average of the metrics for n_neighbors 1 when weighting each split score by its length', sum_res_n_1_weighted/100)
print('score by split', res_n_1)
print('length of each split', len_split)

print('average of the metrics for n_neighbors 2 computed as average of split', np.average(res_n_2))
print('average of the metrics for n_neighbors 2 reported by clf.cv_results_', clf.cv_results_['mean_test_score'][1])
print('average of the metrics for n_neighbors 2 when weighting each split score by its length', sum_res_n_2_weighted/100)
print('score by split', res_n_2)
print('length of each split', len_split)

Expected Results

The mean scores should be equal to the per-fold scores averaged with weights proportional to the number of samples in each fold. As a result, the comparison output should be:
True
True
False
False

average of the metrics for n_neighbors 2 computed as average of split 0.6666666666666666
average of the metrics for n_neighbors 2 reported by clf.cv_results_ 0.99
average of the metrics for n_neighbors 2 when weighting each split score by its length 0.99
score by split [1.0, 1.0, 0.0]
length of each split [50, 49, 1]

Actual Results

The mean scores are equal to the unweighted average of the per-fold scores, so a fold with 1 sample has as much impact on the mean as a fold with 50 samples. The comparison output is therefore:
False
False
True
True

average of the metrics for n_neighbors 2 computed as average of split 0.6666666666666666
average of the metrics for n_neighbors 2 reported by clf.cv_results_ 0.6666666666666666
average of the metrics for n_neighbors 2 when weighting each split score by its length 0.99
score by split [1.0, 1.0, 0.0]
length of each split [50, 49, 1]
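
A possible post-hoc workaround (not part of the GridSearchCV API; this sketch assumes the clf and len_split variables from the reproduction script above) is to recompute a sample-weighted mean from cv_results_ and the known test-fold sizes, and pick the best candidate manually:

import numpy as np

# Stack per-fold scores into shape (n_splits, n_candidates).
fold_scores = np.vstack([clf.cv_results_[f'split{i}_test_score']
                         for i in range(len(len_split))])
# Sample-weighted mean per candidate, using the test-fold sizes as weights.
weighted_mean = np.average(fold_scores, axis=0, weights=len_split)

best_idx = int(np.argmax(weighted_mean))
print(weighted_mean)
print(clf.cv_results_['params'][best_idx])  # parameters of the sample-weighted best candidate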

Versions

System
python: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0]

Python dependencies:
      sklearn: 1.4.1.post1
          pip: 24.0
   setuptools: 69.1.1
        numpy: 1.26.4
        scipy: 1.12.0
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.3.2
threadpoolctl: 3.3.0

Built with OpenMP: True
