[MRG] Adds support for multimetric callable return a dictionary #15126


Merged: 42 commits into scikit-learn:master on Jul 16, 2020

Conversation

thomasjpfan
Member

Reference Issues/PRs

Fixes #15021

What does this implement/fix? Explain your changes.

  1. Refactors _fit_and_score to return a dictionary.
  2. Refactors _check_multimetric_scoring to only handle a list of strings or a dictionary.
  3. Because the return type of the callable is only known after it is called, _check_fit_and_score_results was added to normalize the output of the parallelized _fit_and_score calls.
  4. _aggregate_list_of_dicts is a helper that combines a list of dicts (see the sketch after this list).
  5. Updates the docs, introducing a third way to do multi-metric scoring.
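
As a rough illustration of item 4, a helper could turn the per-split result dicts into arrays keyed by name. This is a minimal sketch; the exact name and signature in the PR may differ.

import numpy as np

def aggregate_list_of_dicts(list_of_dicts):
    # Combine [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
    # into {'a': array([1, 3]), 'b': array([2, 4])}.
    keys = list_of_dicts[0].keys()
    return {key: np.asarray([d[key] for d in list_of_dicts]) for key in keys}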

@jnothman jnothman self-requested a review December 3, 2019 10:23
Member

@jnothman jnothman left a comment

The changes to enable error_score handling are a mess... Is there a nicer way we can do this? Either without fit_failed, or with only fit_failed (and not error_score)?

How do we handle a callable that gives us different keys in each call? For example, if it only returned keys corresponding to classes in that data split? Do we support them? Do we attempt to catch this case and raise an explicit error?

@@ -263,22 +263,20 @@ parameter:
Note that the dict values can either be scorer functions or one of the
Member

This should be indented unless I'm much mistaken

err_msg_generic = ("scoring should either be a single string or "
"callable or a "
"list/tuple of strings or a dict of scorer name "
"mapped to the callable for multiple metric "
Member

Should we improve this message?

Member Author

What do you think about:

"scoring should be a string, callable, list of strings, or a dict mapping scorer names to {callables, strings}. Refer to https://scikit-learn.org/stable/modules/model_evaluation.html#using-multiple-metric-evaluation for details."

Member

How about "scoring is invalid (got {!r}). Refer to the model_evaluation documentation and scoring API reference." Or refer to the glossary... which should be updated in this PR.

return score(self.best_estimator_, X, y)
if isinstance(self.scorer_, dict):
if self.multimetric_:
scorer = self.scorer_[self.refit]
Member

The multiplicity of types that scorer_ can be now makes it not very useful as public API. We could consider removing the attribute

refit_metric = self.refit

refit_metric = "score"
scoring_callable = callable(self.scoring)
Member

I don't think we need this stored in a variable

scoring_callable = callable(self.scoring)
if scoring_callable:
scorers = self.scoring
elif (self.scoring is None or isinstance(self.scoring, str)):
Member

Rm parentheses

scorers = _check_multimetric_scoring(self.estimator, self.scoring)
refit_metric = self.refit

if self.refit is not False and (
Member

Maybe we should pull this out as a function

@@ -455,29 +499,37 @@ def _fit_and_score(estimator, X, y, scorer, train, test, verbose,
return_estimator : boolean, optional, default: False
Whether to return the fitted estimator.

return_fit_failed : bool, default=False
Member

This is a private function. There's no need to add a parameter for this kind of change.

@thomasjpfan
Member Author

The changes to enable error_score handling are a mess... Is there a nicer way we can do this? Either without fit_failed, or with only fit_failed (and not error_score)?

A cleaner solution would be to mark runs with fit_failed and have the caller adjust the score accordingly.

How do we handle a callable that gives us different keys in each call? For example, if it only returned keys corresponding to classes in that data split? Do we support them? Do we attempt to catch this case and raise an explicit error?

With this PR, I would prefer to not support this and raise an error.
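
For illustration, a validation step along these lines could catch mismatched keys across splits and raise an explicit error. This is a hypothetical helper, not the exact code added by the PR.

def check_consistent_score_keys(per_split_scores):
    # Raise if the dicts returned by a callable scorer differ in keys across splits.
    expected_keys = set(per_split_scores[0])
    for split_idx, scores in enumerate(per_split_scores[1:], start=1):
        if set(scores) != expected_keys:
            raise ValueError(
                f"Callable scoring returned keys {sorted(scores)} on split "
                f"{split_idx}, but {sorted(expected_keys)} on split 0."
            )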

@jnothman jnothman added this to the 0.23 milestone Jan 30, 2020
Member

@jnothman jnothman left a comment

Otherwise, this LGTM, thanks @thomasjpfan

else:
refit_metric = 'score'
scorers = _check_multimetric_scoring(self.estimator, self.scoring)
self._check_multimetric_scores_refit(scorers)
Member

to avoid confusion with prev line, better rename to _check_refit_for_multimetric

train_scores : dict of scorer name -> float, optional
Score on training set (for all the scorers),
returned only if `return_train_score` is `True`.
result: dict with the following attributes
Member

space before :, please

Member

@jnothman jnothman left a comment

I'm quite excited for the potential of this. It needs a what's new. Does it need an example?

@thomasjpfan
Member Author

An example would be nice to have.

I want to revisit this one more time to think about if there is a (nice) way to avoid using fit_failed.

Member

@jnothman jnothman left a comment

Could we benefit from polymorphism or similar here? I feel like there are a lot of code paths that differ depending on the format of scoring. Could we use polymorphism so that after some validation step, all scoring cases were treated identically from the perspective of the code?

Maybe in a later iteration.
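
To illustrate the idea, every accepted scoring format could be normalized behind one wrapper that always returns a dict of scores; the hard part, as discussed below, is that a callable returning a dict cannot be classified without calling it. This is a hypothetical sketch, not scikit-learn's actual internals.

from sklearn.metrics import check_scoring

class DictScorer:
    # Normalize a string, None, or dict-of-scorers into one callable
    # that always returns a dict of scores (illustrative only).
    def __init__(self, estimator, scoring):
        if scoring is None or isinstance(scoring, str):
            self._scorers = {"score": check_scoring(estimator, scoring)}
        else:  # dict of name -> string or callable
            self._scorers = {name: check_scoring(estimator, s)
                             for name, s in scoring.items()}

    def __call__(self, estimator, X, y):
        return {name: scorer(estimator, X, y)
                for name, scorer in self._scorers.items()}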

result : dict with the following attributes
train_scores : dict of scorer name -> float
Score on training set (for all the scorers),
returned only if `return_train_score` is `True`.

Member

FWIW, I think blank lines between each of these would not render as a <dl> in ReST

raise NotFittedError("All estimators failed to fit")

if isinstance(successful_score, dict):
formatted_erorr = {name: error_score for name in successful_score}
Member

Suggested change
formatted_erorr = {name: error_score for name in successful_score}
formatted_error = {name: error_score for name in successful_score}

and consequently below

that failed are set to error_score. `results` are the aggregated output
of `_fit_and_score`.
"""
successful_score = None
Member

I think we should call this score_names and set it to result["test_scores"].keys() below

@@ -222,45 +222,89 @@ def cross_validate(estimator, X, y=None, groups=None, scoring=None, cv=None,
X, y, groups = indexable(X, y, groups)

cv = check_cv(cv, y, classifier=is_classifier(estimator))
scorers, _ = _check_multimetric_scoring(estimator, scoring=scoring)

if callable(scoring):
Member

Please add a comment about why we check results only in the callable scoring case. Perhaps to the effect of "We try to validate scoring early if it is not a callable"

or else check_fit_and_score_results could be called _handle_error_score.

@glemaitre
Member

@thomasjpfan Could you rebase, I would like to review this one.


if len(keys) != len(scoring):
raise ValueError(err_msg + "Duplicate elements were found in"
" the given list. %r" % repr(scoring))
Member

Shall we use an f-string as well?

Comment on lines +704 to +711
if callable(self.scoring):
scorers = self.scoring
elif self.scoring is None or isinstance(self.scoring, str):
scorers = check_scoring(self.estimator, self.scoring)
else:
refit_metric = 'score'
scorers = _check_multimetric_scoring(self.estimator, self.scoring)
self._check_refit_for_multimetric(scorers)
refit_metric = self.refit
Member Author

@thomasjpfan thomasjpfan Jul 9, 2020

Most of the complication in this PR comes from the fact that, with a callable, it is not possible to tell whether it returns a scalar or a dictionary without calling it. This means that _check_multimetric_scoring cannot yet check whether self.scoring is multimetric. We have to use the results of _fit_and_score to tell whether the score is a scalar or a dictionary.
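
A minimal illustration of the point (the function name is hypothetical; the PR stores per-split results as dicts with a "test_scores" entry):

def returned_multimetric(fit_and_score_result):
    # Only after scoring do we know whether the callable produced a scalar or a dict.
    return isinstance(fit_and_score_result["test_scores"], dict)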

Member

@glemaitre glemaitre left a comment

Could you add an entry in what's new as well.

Comment on lines 483 to 484
raise ValueError(err_msg +
"Empty list was given. %r" % repr(scoring))
Member

Suggested change
raise ValueError(err_msg +
"Empty list was given. %r" % repr(scoring))
raise ValueError(
    err_msg + f"Empty list was given. {repr(scoring)}"
)

"One or more of the elements were "
"callables. Use a dict of score name "
"mapped to the scorer callable. "
"Got %r" % repr(scoring))
Member

f-string

"Got %r" % repr(scoring))
else:
raise ValueError(err_msg +
"Non-string types were found in "
Member

f-string

keys = set(scoring)
if not all(isinstance(k, str) for k in keys):
raise ValueError("Non-string types were found in the keys of "
"the given dict. scoring=%r" % repr(scoring))
Member

f-string

raise ValueError("Non-string types were found in the keys of "
"the given dict. scoring=%r" % repr(scoring))
if len(keys) == 0:
raise ValueError("An empty dict was passed. %r" % repr(scoring))
Member

f-string

"False explicitly. %r was passed."
% self.refit)
if self.refit is not False and (
not isinstance(self.refit, str) or
Member

We could break the condition into variables with meaningful names. It would make the if statement easier to read.
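
For example, something along these lines (a sketch with hypothetical variable names; the actual condition and error message in the PR may differ):

refit_requested = self.refit is not False
refit_is_valid_key = isinstance(self.refit, str) and self.refit in scorers

if refit_requested and not refit_is_valid_key:
    # Illustrative error message only.
    raise ValueError(f"refit should be a key of the scorers dict, got {self.refit!r}")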


return ret


def _insert_error_scores(results, error_score):
"""Insert error in results by replacing them with `error_score`.
Member

Could be worth mentioning that the insert is in place (even if it is almost obvious)
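
A rough sketch of such an in-place replacement, assuming each aggregated result is a dict with "fit_failed" and "test_scores" entries as discussed in this thread (not the exact implementation):

from sklearn.exceptions import NotFittedError

def insert_error_scores(results, error_score):
    # Replace the scores of failed fits in place with `error_score`.
    successful = next((r for r in results if not r["fit_failed"]), None)
    if successful is None:
        raise NotFittedError("All estimators failed to fit")
    formatted_error = {name: error_score for name in successful["test_scores"]}
    for result in results:
        if result["fit_failed"]:
            result["test_scores"] = formatted_error.copy()
            if "train_scores" in result:
                result["train_scores"] = formatted_error.copy()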


return ret


def _insert_error_scores(results, error_score):
"""Insert error in results by replacing them with `error_score`.
Member

Suggested change
"""Insert error in results by replacing them with `error_score`.
"""Insert error in `results` by replacing them with `error_score`.

"""Insert error in results by replacing them with `error_score`.

This only applies to multimetric scores because `_fit_and_score` will
handle the single metric case."""
Member

Suggested change
handle the single metric case."""
handle the single metric case.
"""

@@ -751,16 +769,31 @@ def evaluate_candidates(candidate_params):
.format(n_splits,
len(out) // n_candidates))

# For callable self.scoring, the return type is only known after
# calling. If the return type is a dictionary, the error scores
# can now be inserted with the correct key.
Member

I would add in the comment that the type checking of out will happen in _insert_error_scores

elif isinstance(scoring, dict):
keys = set(scoring)
if not all(isinstance(k, str) for k in keys):
raise ValueError("Non-string types were found in the keys of "
Member

It seems that we don't check for this case in the test

if self.multimetric_:
scorer = self.scorer_[self.refit]
else:
scorer = self.scorer_
Member

This is not covered. So does this happen when scorer is a dict but with a single metric?

Member

@glemaitre glemaitre left a comment

After the changes, LGTM. I would like to tackle #12385 because it would be key for solving the issue of passing scoring parameters in the grid-search and solving the bug in #17704

@glemaitre glemaitre merged commit 9acfaab into scikit-learn:master Jul 16, 2020
score = self.scorer_[self.refit] if self.multimetric_ else self.scorer_
return score(self.best_estimator_, X, y)
if isinstance(self.scorer_, dict):
if self.multimetric_:
Member

@thomasjpfan it seems that this attribute is undocumented. Should it be private instead?

Member Author

I am +0.5 on making it private (which would require a deprecation). Even before this PR, this attribute was defined as:

  1. False = scoring returns a scalar
  2. True = scoring returns a dictionary (Which can be a dictionary with one item)

So if scoring=['accuracy'] that would be considered "multimetric", while scoring='accuracy' is not.
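
A small usage sketch of that distinction (assuming a fitted search exposes the multimetric_ attribute, as in the diff above):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(random_state=0)
params = {"C": [0.1, 1.0]}

single = GridSearchCV(LogisticRegression(max_iter=1000), params,
                      scoring="accuracy").fit(X, y)
multi = GridSearchCV(LogisticRegression(max_iter=1000), params,
                     scoring=["accuracy"], refit="accuracy").fit(X, y)

print(single.multimetric_)  # False: scoring returns a scalar
print(multi.multimetric_)   # True: scoring returns a dict, even with one item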

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Member

@jnothman jnothman left a comment

@thomasjpfan, @glemaitre: this seems to have missed a what's new entry, and the parameter documentation for scoring wasn't updated!

Successfully merging this pull request may close these issues.

Allow scoring callable to return a dict {name: score}
5 participants