Description
Describe the workflow you want to enable
I would like a stable public interface for multiple scorers that libraries in the scikit-learn ecosystem can develop against.
Without this, it is difficult for libraries to provide a consistent API for evaluation with multiple scorers unless they:
- Rely exclusively on `cross_validate` for evaluation, as it is the only place where user input for multiple metrics can be funneled directly through to scikit-learn for evaluation (see the sketch after this list).
- Implement custom wrapper types.
- Refuse to support multiple metrics.
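As referenced in the first bullet, here is a minimal sketch (public API only) of what funneling multiple metrics through `cross_validate` looks like today; note that the whole evaluation protocol comes bundled with it:

```python
# Minimal sketch of today's main public path for multiple metrics:
# cross_validate's `scoring` parameter. The evaluation protocol
# (CV splitting, fitting, scoring) is fixed to the one sklearn provides.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(random_state=0)
results = cross_validate(
    LogisticRegression(max_iter=1000),
    X,
    y,
    scoring=["accuracy", "neg_log_loss"],
)
# `results` holds per-fold arrays under "test_accuracy" and "test_neg_log_loss",
# alongside "fit_time" and "score_time".
```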
Why developers may prefer a multi-metric API supported by scikit-learn itself:
- Custom evaluation protocols can be developed that evaluate multiple objectives while benefiting from scikit-learn's correctness (e.g. caching, metadata routing and response values).
- Custom multi-scoring wrappers do not have to be versioned against the installed version of scikit-learn. (See alternatives considered.)
- Users can rely on the same interface across compliant libraries in the scikit-learn ecosystem.
Context for suggestion:
In re-developing Auto-Sklearn, we perform hyperparameter optimization, which can include evaluating many metrics. We require custom evaluation protocols that are not trivially satisfied by `cross_validate` or the related family of functions scikit-learn provides. Previously, Auto-Sklearn implemented its own metrics; however, we would like to extend this to any sklearn-compliant scorer. A `_MultiMetricScorer` is ideal because of its caching and its handling of the model response values each scorer needs. Ideally we could also access this cache, but that is a secondary concern for now.
I had previous solutions which emulated `_MultiMetricScorer`, but they broke with sklearn 1.3 and 1.4 due to changes in scorers. I'm unsure how to reliably build a stable API against scikit-learn for multiple metrics.
An example use case where a user may want to evaluate against multiple scorers through a stable scikit-learn interface:
```python
# Custom evaluation class that depends on the sklearn API.
# Does not need to know anything ahead of time about the scorers.
class CustomEvaluator:

    def __init__(self, ..., scoring: dict[str, str | _Scorer]):
        self.scoring = {
            name: get_scorer(v) if isinstance(v, str) else v
            for name, v in scoring.items()
        }

    def evaluate(self, pipeline_configuration):
        model = something(pipeline_configuration)

        # MAIN API REQUEST
        scorers = public_sklearn_api_to_make_multimetric_scorer(self.scoring)
        scores = scorers(model, X, Y)
        ...


# Custom evaluation metric
def user_custom_metric(y_pred, y_true) -> float:
    ...


# Userland: users can rely on libraries accepting the following interface
# for providing multiple scorers.
scorers = {
    "acc": "accuracy",
    "custom": make_scorer(
        user_custom_metric,
        response_method="predict",
        greater_is_better=False,
    ),
}
custom_evaluator = CustomEvaluator(scorers)
```
Describe your proposed solution
My proposed solution would involve making some variant of `_MultiMetricScorer` public API. Perhaps this could be made accessible through a non-backwards-breaking change to `get_scorer`:
```python
# Before
def get_scorer(scoring: str) -> _Scorer: ...


# After
@overload
def get_scorer(scoring: str) -> _Scorer: ...

@overload
def get_scorer(scoring: Iterable[str]) -> MultiMetricScorer: ...

def get_scorer(scoring: str | Iterable[str]) -> _Scorer | MultiMetricScorer: ...
```
This would allow a user to pass in a `MultiMetricScorer` which I can act upon, or at the very least a `list[str]` that I can reliably convert to one.
```python
# Example
match scorer:
    case MultiMetricScorer():
        scores: dict[str, float] = scorer(estimator, X, y)
    case list():
        scorers = get_scorer(scorer)
        scores = scorers(estimator, X, y)
    case _:
        score: float = scorer(estimator, X, y)
```
This might cause inconsistency issues internally within sklearn, which could be problematic. One additional change that might be required would be to add a new, non-backwards-breaking default to `check_scoring(..., *, allow_multi_scoring: bool = False)`.
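For illustration only, a sketch of how the proposed flag might be used; `check_scoring` is existing public API, but `allow_multi_scoring` is the hypothetical keyword proposed here and does not exist in scikit-learn today:

```python
# Sketch of the *proposed* opt-in; `allow_multi_scoring` is hypothetical.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import check_scoring

estimator = LogisticRegression()

# Unchanged behaviour: a single metric resolves to a single scorer.
single_scorer = check_scoring(estimator, scoring="accuracy")

# Proposed: explicitly opt in to resolving several metrics into one
# multi-metric scorer (the public MultiMetricScorer suggested above).
multi_scorer = check_scoring(
    estimator,
    scoring=["accuracy", "neg_log_loss"],
    allow_multi_scoring=True,  # proposed keyword, not existing API
)
```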
**Issues with this proposal**
- There is no public `Scorer` class API; perhaps this suggestion makes no sense without a public `Scorer` API. However, I think that as long as there is a publicly advertised way to construct a multi-metric scorer with reliable usage semantics, both the `Scorer` and `_MultiMetricScorer` classes can remain hidden.
Describe alternatives you've considered, if relevant
The easiest solution in most cases is to rely on the private `_check_multimetric_scoring` and just instantiate a `_MultiMetricScorer`, relying on private functionality.
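For completeness, a rough sketch of what that private-API reliance looks like in recent releases; the module path, class name (spelled `_MultimetricScorer` in current sources), and constructor signature are all private details that have shifted between versions, which is exactly the problem this issue raises:

```python
# Approximate sketch only: these imports are private and have changed across
# scikit-learn releases, so the exact names/signatures may differ in yours.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics._scorer import _MultimetricScorer, _check_multimetric_scoring

X, y = make_classification(random_state=0)
estimator = LogisticRegression(max_iter=1000).fit(X, y)

# Resolve metric names into a {name: scorer} dict, then wrap them so that
# model responses (predict/predict_proba/...) are computed once and shared.
scorers = _check_multimetric_scoring(estimator, ["accuracy", "neg_log_loss"])
multi = _MultimetricScorer(scorers=scorers, raise_exc=True)

scores = multi(estimator, X, y)  # e.g. {"accuracy": ..., "neg_log_loss": ...}
```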
Previous solutions relied on the private `_MultiMetricScorer` and the `_BaseScorer` family with its previous sub-classes. Understandably, these private classes are subject to change; they broke with the 1.3 changes to metadata routing and the 1.4 changes to the `_Scorer` hierarchy.
I will rely on private functionality if I have to, but it makes developing a library against scikit-learn quite difficult due to versioning.
If this will not be supported, I will likely go with a wrapper class that depends on the version of scikit-learn in use.
Additional context
Currently, the only way to use multiple scorers for a model is through the interface of `cross_validate(scoring=["a", "b", "c"])` or through `permutation_importance`:
- Code search for `_check_multimetric_scoring`
- Code search for `_MultiMetricScorer`
- Code search for `check_scoring`
Further Comments
Having access to the transformed cached predictions after scoring would be useful as well, but I think that lies outside the scope for now.