[MRG] Adds support for multimetric callable return a dictionary #15126
Conversation
The changes to enable error_score handling are a mess... Is there a nicer way we can do this? either without fit_failed, or with only fit_failed (and not error_score)??
How do we handle a callable that gives us different keys in each call? For example, if it only returned keys corresponding to classes in that data split? Do we support them? Do we attempt to catch this case and raise an explicit error?
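For illustration, the kind of callable being discussed might look like the following sketch (hypothetical code, not from the PR): its keys depend on which classes happen to be present in a given split, so two CV folds can yield dictionaries with different keys.

import numpy as np
from sklearn.metrics import recall_score

def per_class_recall(estimator, X, y):
    # Hypothetical multimetric callable: returns one entry per class
    # present in y, so splits with different class sets produce
    # dictionaries with different keys.
    y_pred = estimator.predict(X)
    labels = np.unique(y)
    recalls = recall_score(y, y_pred, labels=labels, average=None)
    return {f"recall_class_{label}": value
            for label, value in zip(labels, recalls)}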
doc/modules/model_evaluation.rst
Outdated
@@ -263,22 +263,20 @@ parameter:
Note that the dict values can either be scorer functions or one of the
This should be indented unless I'm much mistaken
sklearn/metrics/_scorer.py
Outdated
err_msg_generic = ("scoring should either be a single string or "
                   "callable or a "
                   "list/tuple of strings or a dict of scorer name "
                   "mapped to the callable for multiple metric "
Should we improve this message?
What do you think about:
"scoring should be a string, callable, list of strings, or a dict mapping scorer names to {callables, strings}. Refer to https://scikit-learn.org/stable/modules/model_evaluation.html#using-multiple-metric-evaluation for details."
How about "scoring is invalid (got {!r}). Refer to the model_evaluation documentation and scoring API reference." Or refer to the glossary... which should be updated in this PR.
return score(self.best_estimator_, X, y)
if isinstance(self.scorer_, dict):
    if self.multimetric_:
        scorer = self.scorer_[self.refit]
The multiplicity of types that scorer_ can be now makes it not very useful as public API. We could consider removing the attribute
sklearn/model_selection/_search.py
Outdated
refit_metric = self.refit

refit_metric = "score"
scoring_callable = callable(self.scoring)
I don't think we need this stored in a variable
sklearn/model_selection/_search.py
Outdated
scoring_callable = callable(self.scoring)
if scoring_callable:
    scorers = self.scoring
elif (self.scoring is None or isinstance(self.scoring, str)):
Rm parentheses
sklearn/model_selection/_search.py
Outdated
scorers = _check_multimetric_scoring(self.estimator, self.scoring)
refit_metric = self.refit

if self.refit is not False and (
Maybe we should pull this out as a function
@@ -455,29 +499,37 @@ def _fit_and_score(estimator, X, y, scorer, train, test, verbose,
return_estimator : boolean, optional, default: False
    Whether to return the fitted estimator.

return_fit_failed : bool, default=False
This is a private function. There's no need to add a parameter for this kind of change. A cleaner solution would be to mark runs with

With this PR, I would prefer to not support this and raise an error.
Otherwise, this LGTM, thanks @thomasjpfan
sklearn/model_selection/_search.py
Outdated
else:
    refit_metric = 'score'
    scorers = _check_multimetric_scoring(self.estimator, self.scoring)
    self._check_multimetric_scores_refit(scorers)
to avoid confusion with prev line, better rename to _check_refit_for_multimetric
train_scores : dict of scorer name -> float, optional
    Score on training set (for all the scorers),
    returned only if `return_train_score` is `True`.
result: dict with the following attributes
space before :, please
I'm quite excited for the potential of this. It needs a what's new. Does it need an example?
An example would be nice to have. I want to revisit this one more time to think about if there is a (nice) way to avoid using
Could we benefit from polymorphism or similar here? I feel like there are a lot of code paths that differ depending on the format of scoring. Could we use polymorphism so that after some validation step, all scoring cases were treated identically from the perspective of the code?

Maybe in a later iteration.
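One possible shape for that idea, purely as an illustrative sketch (the class name and the normalization choices here are hypothetical, not proposed code): wrap every accepted scoring format behind one object so downstream code always deals with a dict of callables.

from sklearn.metrics import get_scorer

class _DictScoringAdapter:
    # Illustrative only: always expose scoring as a dict of name -> callable.
    def __init__(self, scoring):
        if callable(scoring):
            # A single callable; whether it returns a scalar or a dict can
            # still only be known after it has been called once.
            self._scorers = {"score": scoring}
        elif isinstance(scoring, str):
            self._scorers = {"score": get_scorer(scoring)}
        elif isinstance(scoring, dict):
            self._scorers = {name: get_scorer(s) if isinstance(s, str) else s
                             for name, s in scoring.items()}
        else:  # list/tuple of metric names
            self._scorers = {name: get_scorer(name) for name in scoring}

    def __call__(self, estimator, X, y):
        return {name: scorer(estimator, X, y)
                for name, scorer in self._scorers.items()}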
result : dict with the following attributes
    train_scores : dict of scorer name -> float
        Score on training set (for all the scorers),
        returned only if `return_train_score` is `True`.
FWIW, I think blank lines between each of these would not render as a <dl> in ReST.
raise NotFittedError("All estimators failed to fit")

if isinstance(successful_score, dict):
    formatted_erorr = {name: error_score for name in successful_score}
formatted_erorr = {name: error_score for name in successful_score}
formatted_error = {name: error_score for name in successful_score}

and consequently below
that failed are set to error_score. `results` are the aggregated output
of `_fit_and_score`.
"""
successful_score = None
I think we should call this `score_names` and set it to `result["test_scores"].keys()` below.
@@ -222,45 +222,89 @@ def cross_validate(estimator, X, y=None, groups=None, scoring=None, cv=None,
X, y, groups = indexable(X, y, groups)

cv = check_cv(cv, y, classifier=is_classifier(estimator))
scorers, _ = _check_multimetric_scoring(estimator, scoring=scoring)

if callable(scoring):
Please add a comment about why we check results only in the callable scoring case. Perhaps to the effect of "We try to validate scoring early if it is not a callable", or else `check_fit_and_score_results` could be called `_handle_error_score`.
@thomasjpfan Could you rebase? I would like to review this one.
sklearn/metrics/_scorer.py
Outdated
if len(keys) != len(scoring):
    raise ValueError(err_msg + "Duplicate elements were found in"
                     " the given list. %r" % repr(scoring))
Shall we use an f-string as well?
if callable(self.scoring):
    scorers = self.scoring
elif self.scoring is None or isinstance(self.scoring, str):
    scorers = check_scoring(self.estimator, self.scoring)
else:
    refit_metric = 'score'
    scorers = _check_multimetric_scoring(self.estimator, self.scoring)
    self._check_refit_for_multimetric(scorers)
    refit_metric = self.refit
Most of the complication of this PR comes from the fact that with a callable, it is not possible to tell if it returns a scalar or a dictionary without calling it. This means that `_check_multimetric_scoring` can't check if `self.scoring` is multimetric yet. We have to use the results of `_fit_and_score` to tell if the score is a scalar or a dictionary.
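For example (illustrative only, not code from the PR), both of the callables below have the same signature, so nothing short of calling them reveals whether the result is a scalar or a dictionary:

from sklearn.metrics import accuracy_score, balanced_accuracy_score

def scalar_scorer(estimator, X, y):
    # single-metric case: a plain float comes back
    return accuracy_score(y, estimator.predict(X))

def dict_scorer(estimator, X, y):
    # multimetric case: a dict of name -> float comes back
    y_pred = estimator.predict(X)
    return {"accuracy": accuracy_score(y, y_pred),
            "balanced_accuracy": balanced_accuracy_score(y, y_pred)}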
Could you add an entry in what's new as well.
sklearn/metrics/_scorer.py
Outdated
raise ValueError(err_msg +
                 "Empty list was given. %r" % repr(scoring))
raise ValueError(err_msg +
                 "Empty list was given. %r" % repr(scoring))
raise ValueError(
    err_msg + f"Empty list was given. {repr(scoring)}"
)
sklearn/metrics/_scorer.py
Outdated
"One or more of the elements were " | ||
"callables. Use a dict of score name " | ||
"mapped to the scorer callable. " | ||
"Got %r" % repr(scoring)) |
f-string
sklearn/metrics/_scorer.py
Outdated
"Got %r" % repr(scoring)) | ||
else: | ||
raise ValueError(err_msg + | ||
"Non-string types were found in " |
f-string
sklearn/metrics/_scorer.py
Outdated
keys = set(scoring)
if not all(isinstance(k, str) for k in keys):
    raise ValueError("Non-string types were found in the keys of "
                     "the given dict. scoring=%r" % repr(scoring))
f-string
sklearn/metrics/_scorer.py
Outdated
raise ValueError("Non-string types were found in the keys of " | ||
"the given dict. scoring=%r" % repr(scoring)) | ||
if len(keys) == 0: | ||
raise ValueError("An empty dict was passed. %r" % repr(scoring)) |
f-string
sklearn/model_selection/_search.py
Outdated
"False explicitly. %r was passed." | ||
% self.refit) | ||
if self.refit is not False and ( | ||
not isinstance(self.refit, str) or |
We could break the condition into a variable with a meaningful name; it would make the if statement easier to read.
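A rough sketch of that suggestion (the full condition is abridged in the diff above, so the exact logic and message here are only assumed):

# Hypothetical refactor: name the sub-conditions instead of one long expression.
refit_is_requested = self.refit is not False
refit_names_a_scorer = isinstance(self.refit, str) and self.refit in scorers

if refit_is_requested and not refit_names_a_scorer:
    raise ValueError("For multi-metric scoring, refit must name one of the "
                     f"scorers or be False explicitly. {self.refit!r} was passed.")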
return ret


def _insert_error_scores(results, error_score):
    """Insert error in results by replacing them with `error_score`.
Could be worth mentioning that the insert is in place (even if it is almost obvious)
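As a sketch of the in-place behaviour being discussed (an illustration of the idea, not the PR's exact implementation; the result keys used here are assumptions):

from sklearn.exceptions import NotFittedError

def _insert_error_scores_sketch(results, error_score):
    # Find the score names from any successful fit (cf. the score_names
    # suggestion above), then overwrite the entries of failed fits in place.
    score_names = None
    for result in results:
        if not result.get("fit_failed"):
            score_names = result["test_scores"].keys()
            break
    if score_names is None:
        raise NotFittedError("All estimators failed to fit")
    for result in results:
        if result.get("fit_failed"):
            result["test_scores"] = {name: error_score for name in score_names}
            if "train_scores" in result:
                result["train_scores"] = {name: error_score
                                          for name in score_names}
    # Nothing is returned: `results` is modified in place.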
"""Insert error in results by replacing them with `error_score`. | |
"""Insert error in `results` by replacing them with `error_score`. |
"""Insert error in results by replacing them with `error_score`. | ||
|
||
This only applies to multimetric scores because `_fit_and_score` will | ||
handle the single metric case.""" |
handle the single metric case."""
handle the single metric case.
"""
sklearn/model_selection/_search.py
Outdated
@@ -751,16 +769,31 @@ def evaluate_candidates(candidate_params):
                .format(n_splits,
                        len(out) // n_candidates))

# For callable self.scoring, the return type is only known after
# calling. If the return type is a dictionary, the error scores
# can now be inserted with the correct key.
I would add in the comment that the type checking of out will happen in _insert_error_scores
elif isinstance(scoring, dict):
    keys = set(scoring)
    if not all(isinstance(k, str) for k in keys):
        raise ValueError("Non-string types were found in the keys of "
It seems that we don't check for this case in the test
if self.multimetric_:
    scorer = self.scorer_[self.refit]
else:
    scorer = self.scorer_
This is not covered. So it seems this is when scorer is a dict but with a single metric?
score = self.scorer_[self.refit] if self.multimetric_ else self.scorer_
return score(self.best_estimator_, X, y)
if isinstance(self.scorer_, dict):
    if self.multimetric_:
@thomasjpfan it seems that this attribute is undocumented. Should it be private instead?
I am +0.5 on making it private (which would require deprecation). Even before this PR, this attribute was defined as:
False = scoring returns a scalar
True = scoring returns a dictionary (which can be a dictionary with one item)
So if `scoring=['accuracy']`, that would be considered "multimetric", while `scoring='accuracy'` is not.
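Concretely, a small illustration of that distinction (assuming a current GridSearchCV; the dataset and estimator are only for demonstration):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
grid = {"max_depth": [2, 3]}

single = GridSearchCV(DecisionTreeClassifier(), grid,
                      scoring="accuracy").fit(X, y)
multi = GridSearchCV(DecisionTreeClassifier(), grid,
                     scoring=["accuracy"], refit="accuracy").fit(X, y)

print(single.multimetric_)  # False: scoring returns a scalar
print(multi.multimetric_)   # True: scoring is treated as a dictionary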
@thomasjpfan, @glemaitre: this seems to have missed a what's new, and the parameter documentation for scoring wasn't updated!!
Reference Issues/PRs
Fixes #15021

What does this implement/fix? Explain your changes.
- `_fit_and_score` now returns a dictionary.
- `_check_multimetric_scoring` now only handles a list of strings or a dictionary.
- `_check_fit_and_score_results` was added to normalize the output of `parallelize_fit_and_score`.
- `_aggregate_list_of_dicts` is a helper that combines a list of dicts.
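A usage sketch of what this enables (dataset, estimator, and metric names are only illustrative): a single callable returns several metrics as a dict, and refit selects one of the returned keys.

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

def multi_scorer(estimator, X, y):
    # One callable returning several metrics as a dictionary.
    y_pred = estimator.predict(X)
    return {"accuracy": accuracy_score(y, y_pred),
            "balanced_accuracy": balanced_accuracy_score(y, y_pred)}

X, y = load_iris(return_X_y=True)
search = GridSearchCV(DecisionTreeClassifier(),
                      param_grid={"max_depth": [2, 3, 5]},
                      scoring=multi_scorer,
                      refit="accuracy")  # refit on one of the returned keys
search.fit(X, y)
print(search.cv_results_["mean_test_balanced_accuracy"])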