Description
Describe the workflow you want to enable
I would like a stable public interface for multiple scorers that libraries in the scikit-learn ecosystem can develop against.
Without this, it is difficult for libraries to provide a consistent API for evaluation with multiple scorers unless they:
- Rely exclusively on `cross_validate` for evaluation, as it is the only place where user input for multiple metrics can be funneled directly through to scikit-learn for evaluation (see the sketch after this list).
- Implement custom wrapper types.
- Refuse to support multiple metrics.
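As referenced in the first bullet, here is a minimal sketch (public API only) of what funneling multiple metrics through `cross_validate` looks like today; note that the whole evaluation protocol comes bundled with it:

```python
# Minimal sketch of today's main public path for multiple metrics:
# cross_validate's `scoring` parameter. The evaluation protocol
# (CV splitting, fitting, scoring) is fixed to the one sklearn provides.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(random_state=0)
results = cross_validate(
    LogisticRegression(max_iter=1000),
    X,
    y,
    scoring=["accuracy", "neg_log_loss"],
)
# `results` holds per-fold arrays under "test_accuracy" and "test_neg_log_loss",
# alongside "fit_time" and "score_time".
```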
Why developers may prefer a multi-metric API supported by scikit-learn itself:
- Custom evaluation protocols can be developed that evaluate multiple objectives while benefiting from scikit-learn's correctness (e.g. caching, metadata routing and response values).
- Custom multi-scoring wrappers do not have to be versioned against the installed version of scikit-learn. (See alternatives considered.)
- Users can rely on the same interface across compliant libraries in the scikit-learn ecosystem.
Context for suggestion:
In re-developing Auto-Sklearn, we perform hyperparameter optimization, which can include evaluating many metrics. We require custom evaluation protocols that are not trivially satisfied by `cross_validate` or the related family of functions scikit-learn provides. Previously, Auto-Sklearn implemented its own metrics; however, we would like to extend this to any sklearn-compliant scorer. A `_MultiMetricScorer` is ideal because of its caching and its handling of the model response values each scorer needs. Ideally we could also access this cache, but that is a secondary concern for now.
I had previous solutions which emulated `_MultiMetricScorer`, but they broke with sklearn 1.3 and 1.4 due to changes in scorers. I'm unsure how to reliably build a stable API against scikit-learn for multiple metrics.
An example use case where a user may want to evaluate against multiple scorers through a stable scikit-learn interface:
```python
# Custom evaluation class that depends on the sklearn API.
# Does not need to know anything ahead of time about the scorers.
class CustomEvaluator:

    def __init__(self, ..., scoring: dict[str, str | _Scorer]):
        self.scoring = {
            name: get_scorer(v) if isinstance(v, str) else v
            for name, v in scoring.items()
        }

    def evaluate(self, pipeline_configuration):
        model = something(pipeline_configuration)

        # MAIN API REQUEST
        scorers = public_sklearn_api_to_make_multimetric_scorer(self.scoring)
        scores = scorers(model, X, Y)
        ...


# Custom evaluation metric
def user_custom_metric(y_pred, y_true) -> float:
    ...


# Userland: users can rely on libraries accepting the following interface
# for providing multiple scorers.
scorers = {
    "acc": "accuracy",
    "custom": make_scorer(
        user_custom_metric,
        response_method="predict",
        greater_is_better=False,
    ),
}
custom_evaluator = CustomEvaluator(scorers)
```
Describe your proposed solution
My proposed solution would involve making some variant of `_MultiMetricScorer` public API. Perhaps this could be made accessible through a non-backwards-breaking change to `get_scorer`:
```python
# Before
def get_scorer(scoring: str) -> _Scorer: ...


# After
@overload
def get_scorer(scoring: str) -> _Scorer: ...

@overload
def get_scorer(scoring: Iterable[str]) -> MultiMetricScorer: ...

def get_scorer(scoring: str | Iterable[str]) -> _Scorer | MultiMetricScorer: ...
```
This would allow a user to pass in a `MultiMetricScorer` which I can act upon, or at the very least a `list[str]` that I can reliably convert to one.
```python
# Example
match scorer:
    case MultiMetricScorer():
        scores: dict[str, float] = scorer(estimator, X, y)
    case list():
        scorers = get_scorer(scorer)
        scores = scorers(estimator, X, y)
    case _:
        score: float = scorer(estimator, X, y)
```
This might cause inconsistency issues internally within sklearn, which could be problematic. One additional change that might be required would be to add a new, non-backwards-breaking default to `check_scoring(..., *, allow_multi_scoring: bool = False)`.
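For illustration only, a sketch of how the proposed flag might be used; `check_scoring` is existing public API, but `allow_multi_scoring` is the hypothetical keyword proposed here and does not exist in scikit-learn today:

```python
# Sketch of the *proposed* opt-in; `allow_multi_scoring` is hypothetical.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import check_scoring

estimator = LogisticRegression()

# Unchanged behaviour: a single metric resolves to a single scorer.
single_scorer = check_scoring(estimator, scoring="accuracy")

# Proposed: explicitly opt in to resolving several metrics into one
# multi-metric scorer (the public MultiMetricScorer suggested above).
multi_scorer = check_scoring(
    estimator,
    scoring=["accuracy", "neg_log_loss"],
    allow_multi_scoring=True,  # proposed keyword, not existing API
)
```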
**Issues with this proposal**
- There is no public `Scorer` class API; perhaps this suggestion makes no sense without a public `Scorer` API. However, I think that as long as there is a publicly advertised way to construct a multi-metric scorer with reliable usage semantics, both the `Scorer` and `_MultiMetricScorer` classes can remain hidden.
Describe alternatives you've considered, if relevant
The easiest solution in most cases is to rely on the private `_check_multimetric_scoring` and just instantiate a `_MultiMetricScorer`, relying on private functionality.
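For completeness, a rough sketch of what that private-API reliance looks like in recent releases; the module path, class name (spelled `_MultimetricScorer` in current sources), and constructor signature are all private details that have shifted between versions, which is exactly the problem this issue raises:

```python
# Approximate sketch only: these imports are private and have changed across
# scikit-learn releases, so the exact names/signatures may differ in yours.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics._scorer import _MultimetricScorer, _check_multimetric_scoring

X, y = make_classification(random_state=0)
estimator = LogisticRegression(max_iter=1000).fit(X, y)

# Resolve metric names into a {name: scorer} dict, then wrap them so that
# model responses (predict/predict_proba/...) are computed once and shared.
scorers = _check_multimetric_scoring(estimator, ["accuracy", "neg_log_loss"])
multi = _MultimetricScorer(scorers=scorers, raise_exc=True)

scores = multi(estimator, X, y)  # e.g. {"accuracy": ..., "neg_log_loss": ...}
```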
Previous solutions relied on the private `_MultiMetricScorer` and the `_BaseScorer` family with its previous sub-classes. Understandably, these private classes are subject to change; they broke with the 1.3 changes to metadata routing and the 1.4 changes to the `_Scorer` hierarchy.
I will rely on private functionality if I have to, but it makes developing a library against scikit-learn quite difficult due to versioning.
If this will not be supported, I will likely go with a wrapper class that depends on the version of scikit-learn in use.
Additional context
Currently, the only way to use multiple scorers for a model is through the interface of `cross_validate(scoring=["a", "b", "c"])` or through `permutation_importance`:
- Code search for `_check_multimetric_scoring`
- Code search for `_MultiMetricScorer`
- Code search for `check_scoring`
Further Comments
Having access to the transformed cached predictions after scoring would be useful as well, but I think that lies outside the scope for now.