Description
Describe the bug
When an iterable of splits is passed as the cv argument of GridSearchCV and the splits have different sizes (as can happen with "leave one group out"), the mean score of each candidate is computed as a plain average of the per-fold scores, without weighting them by the number of test samples in each fold.
Consequently, the "best" estimator that gets selected is not necessarily the actual best one.
For example (see the reproduction below): with 3 splits of 50, 49 and 1 samples, if the 1-sample split gets a score of 0.0 while every other sample is predicted correctly (score of 1.0), the reported mean score is 0.666 instead of the sample-weighted 0.99.
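A minimal arithmetic sketch of that claim (the fold scores and sizes are the ones produced by the reproduction below):

import numpy as np

scores = [1.0, 1.0, 0.0]  # per-fold accuracy
sizes = [50, 49, 1]       # number of test samples per fold

print(np.mean(scores))                    # 0.666..., the unweighted mean currently reported
print(np.average(scores, weights=sizes))  # 0.99, the sample-weighted mean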
Steps/Code to Reproduce
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

data = np.arange(100)
label = data >= 99
data = data.reshape(-1, 1)

# build three folds of 50, 49 and 1 test samples
indices_break = [0, 50, 99, 100]
split = [np.arange(indices_break[i], indices_break[i + 1])
         for i in range(len(indices_break) - 1)]

splits = []
for i, test in enumerate(split):
    train = split[:i]
    train.extend(split[i + 1:])
    train = np.concatenate(train)
    splits.append((train, test))

model = KNeighborsClassifier()
parameters = {'n_neighbors': [1, 2]}
clf = GridSearchCV(model, parameters, cv=splits)
clf.fit(data, label)

len_split = [len(i) for i in split]
res_n_1 = []
res_n_2 = []
for i in range(len(split)):
    res_n_1.append(clf.cv_results_[f'split{i}_test_score'][0])
    res_n_2.append(clf.cv_results_[f'split{i}_test_score'][1])

sum_res_n_1_weighted = sum([res_n_1[i] * len_split[i]
                            for i in range(len(split))])
sum_res_n_2_weighted = sum([res_n_2[i] * len_split[i]
                            for i in range(len(split))])

# weighted average of the split scores
print(clf.cv_results_['mean_test_score'][0] == sum_res_n_1_weighted / 100)
print(clf.cv_results_['mean_test_score'][1] == sum_res_n_2_weighted / 100)
# direct average of the split scores
print(clf.cv_results_['mean_test_score'][0] == np.average(res_n_1))
print(clf.cv_results_['mean_test_score'][1] == np.average(res_n_2))

print('average of the metrics for n_neighbors 1 computed as average of split', np.average(res_n_1))
print('average of the metrics for n_neighbors 1 reported by clf.cv_results_', clf.cv_results_['mean_test_score'][0])
print('average of the metrics for n_neighbors 1 when weighting the length of each split with their score', sum_res_n_1_weighted / 100)
print('score by split', res_n_1)
print('length of each split', len_split)
print('average of the metrics for n_neighbors 2 computed as average of split', np.average(res_n_2))
print('average of the metrics for n_neighbors 2 reported by clf.cv_results_', clf.cv_results_['mean_test_score'][1])
print('average of the metrics for n_neighbors 2 when weighting the length of each split with their score', sum_res_n_2_weighted / 100)
print('score by split', res_n_2)
print('length of each split', len_split)
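For context, the same 50/49/1 folds could also be produced with LeaveOneGroupOut, which is the use case mentioned above. This is only an illustrative sketch (the groups array is hypothetical, it assumes the script above has been run, and its output is not part of the results shown below):

from sklearn.model_selection import LeaveOneGroupOut

groups = np.repeat([0, 1, 2], [50, 49, 1])  # three groups of 50, 49 and 1 samples
splits_logo = list(LeaveOneGroupOut().split(data, label, groups))
# each element is a (train_indices, test_indices) pair, matching `splits` above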
Expected Results
The average scores should equal the per-fold scores weighted by the number of test samples in each fold. As a result, the comparison prints should be:
True
True
False
False
average of the metrics for n_neighbors 2 computed as average of split 0.6666666666666666
average of the metrics for n_neighbors 2 reported by clf.cv_results_ 0.99
average of the metrics for n_neighbors 2 when weighting the length of each split with their score 0.99
score by split [1.0, 1.0, 0.0]
length of each split [50, 49, 1]
Actual Results
The average scores are equal to the unweighted average of the per-fold scores, so a fold with 1 sample has as much impact on the mean as a fold with 50 samples. The comparison prints are therefore:
False
False
True
True
average of the metrics for n_neighbors 2 computed as average of split 0.6666666666666666
average of the metrics for n_neighbors 2 reported by clf.cv_results_ 0.6666666666666666
average of the metrics for n_neighbors 2 when weighting the length of each split with their score 0.99
score by split [1.0, 1.0, 0.0]
length of each split [50, 49, 1]
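A possible interim workaround, sketched under the assumption that the reproduction script above has been run (the variable names are ad hoc): recompute the sample-weighted means from cv_results_ and refit the estimator with the best weighted parameters manually.

import numpy as np

n_candidates = len(clf.cv_results_['params'])
weighted_means = np.array([
    np.average(
        [clf.cv_results_[f'split{i}_test_score'][c] for i in range(len(split))],
        weights=len_split,
    )
    for c in range(n_candidates)
])
best_params = clf.cv_results_['params'][int(np.argmax(weighted_means))]
best_model = KNeighborsClassifier(**best_params).fit(data, label)
print('sample-weighted mean per candidate:', weighted_means)
print('best params by weighted mean:', best_params)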
Versions
System
python: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0]
Python dependencies:
sklearn: 1.4.1.post1
pip: 24.0
setuptools: 69.1.1
numpy: 1.26.4
scipy: 1.12.0
Cython: None
pandas: None
matplotlib: None
joblib: 1.3.2
threadpoolctl: 3.3.0
Built with OpenMP: True