Description
Describe the bug
When an iterable of splits is passed as the cv argument of GridSearchCV and the splits have different sizes (as can happen with "leave one group out"), the mean score of each candidate is computed as a plain average of the per-fold scores, without weighting them by the number of test samples in each fold.
Consequently, the "best" estimator that gets selected is not necessarily the actual best one.
For example (see the reproduction below): with 3 splits of 50, 49 and 1 samples, if the 1-sample split gets a score of 0.0 while every other sample is predicted correctly (score of 1.0), the reported mean score is 0.666 instead of the sample-weighted 0.99.
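A minimal arithmetic sketch of that claim (the fold scores and sizes are the ones produced by the reproduction below):

import numpy as np

scores = [1.0, 1.0, 0.0]  # per-fold accuracy
sizes = [50, 49, 1]       # number of test samples per fold

print(np.mean(scores))                    # 0.666..., the unweighted mean currently reported
print(np.average(scores, weights=sizes))  # 0.99, the sample-weighted mean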
Steps/Code to Reproduce
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

data = np.arange(100)
label = data >= 99
data = data.reshape(-1, 1)

# build three folds of 50, 49 and 1 test samples
indices_break = [0, 50, 99, 100]
split = [np.arange(indices_break[i], indices_break[i + 1])
         for i in range(len(indices_break) - 1)]

splits = []
for i, test in enumerate(split):
    train = split[:i]
    train.extend(split[i + 1:])
    train = np.concatenate(train)
    splits.append((train, test))

model = KNeighborsClassifier()
parameters = {'n_neighbors': [1, 2]}
clf = GridSearchCV(model, parameters, cv=splits)
clf.fit(data, label)

len_split = [len(i) for i in split]
res_n_1 = []
res_n_2 = []
for i in range(len(split)):
    res_n_1.append(clf.cv_results_[f'split{i}_test_score'][0])
    res_n_2.append(clf.cv_results_[f'split{i}_test_score'][1])

sum_res_n_1_weighted = sum([res_n_1[i] * len_split[i]
                            for i in range(len(split))])
sum_res_n_2_weighted = sum([res_n_2[i] * len_split[i]
                            for i in range(len(split))])

# weighted average of the split scores
print(clf.cv_results_['mean_test_score'][0] == sum_res_n_1_weighted / 100)
print(clf.cv_results_['mean_test_score'][1] == sum_res_n_2_weighted / 100)
# direct average of the split scores
print(clf.cv_results_['mean_test_score'][0] == np.average(res_n_1))
print(clf.cv_results_['mean_test_score'][1] == np.average(res_n_2))

print('average of the metrics for n_neighbors 1 computed as average of split', np.average(res_n_1))
print('average of the metrics for n_neighbors 1 reported by clf.cv_results_', clf.cv_results_['mean_test_score'][0])
print('average of the metrics for n_neighbors 1 when weighting the length of each split with their score', sum_res_n_1_weighted / 100)
print('score by split', res_n_1)
print('length of each split', len_split)
print('average of the metrics for n_neighbors 2 computed as average of split', np.average(res_n_2))
print('average of the metrics for n_neighbors 2 reported by clf.cv_results_', clf.cv_results_['mean_test_score'][1])
print('average of the metrics for n_neighbors 2 when weighting the length of each split with their score', sum_res_n_2_weighted / 100)
print('score by split', res_n_2)
print('length of each split', len_split)
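For context, the same 50/49/1 folds could also be produced with LeaveOneGroupOut, which is the use case mentioned above. This is only an illustrative sketch (the groups array is hypothetical, it assumes the script above has been run, and its output is not part of the results shown below):

from sklearn.model_selection import LeaveOneGroupOut

groups = np.repeat([0, 1, 2], [50, 49, 1])  # three groups of 50, 49 and 1 samples
splits_logo = list(LeaveOneGroupOut().split(data, label, groups))
# each element is a (train_indices, test_indices) pair, matching `splits` above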
Expected Results
The average scores should equal the per-fold scores weighted by the number of test samples in each fold. As a result, the comparison prints should be:
True
True
False
False
average of the metrics for n_neighbors 2 computed as average of split 0.6666666666666666
average of the metrics for n_neighbors 2 reported by clf.cv_results_ 0.99
average of the metrics for n_neighbors 2 when weighting the length of each split with their score 0.99
score by split [1.0, 1.0, 0.0]
length of each split [50, 49, 1]
Actual Results
The average scores are equal to the unweighted average of the per-fold scores, so a fold with 1 sample has as much impact on the mean as a fold with 50 samples. The comparison prints are therefore:
False
False
True
True
average of the metrics for n_neighbors 2 computed as average of split 0.6666666666666666
average of the metrics for n_neighbors 2 reported by clf.cv_results_ 0.6666666666666666
average of the metrics for n_neighbors 2 when weighting the length of each split with their score 0.99
score by split [1.0, 1.0, 0.0]
length of each split [50, 49, 1]
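A possible interim workaround, sketched under the assumption that the reproduction script above has been run (the variable names are ad hoc): recompute the sample-weighted means from cv_results_ and refit the estimator with the best weighted parameters manually.

import numpy as np

n_candidates = len(clf.cv_results_['params'])
weighted_means = np.array([
    np.average(
        [clf.cv_results_[f'split{i}_test_score'][c] for i in range(len(split))],
        weights=len_split,
    )
    for c in range(n_candidates)
])
best_params = clf.cv_results_['params'][int(np.argmax(weighted_means))]
best_model = KNeighborsClassifier(**best_params).fit(data, label)
print('sample-weighted mean per candidate:', weighted_means)
print('best params by weighted mean:', best_params)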
Versions
System
python: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0]
Python dependencies:
sklearn: 1.4.1.post1
pip: 24.0
setuptools: 69.1.1
numpy: 1.26.4
scipy: 1.12.0
Cython: None
pandas: None
matplotlib: None
joblib: 1.3.2
threadpoolctl: 3.3.0
Built with OpenMP: True