[MRG] Adds support for multimetric callable return a dictionary #15126


Merged: 42 commits into scikit-learn:master on Jul 16, 2020

Conversation

thomasjpfan
Member

Reference Issues/PRs

Fixes #15021

What does this implement/fix? Explain your changes.

  1. Refactors _fit_and_score to return a dictionary.
  2. Refactors _check_multimetric_scoring to only handle a list of strings or a dictionary.
  3. Because the return type of the callable is only known after it is called, _check_fit_and_score_results was added to normalize the output of the parallelized _fit_and_score calls.
  4. _aggregate_list_of_dicts is a helper that combines a list of dicts (see the sketch after this list).
  5. Updates the docs, introducing a third way to do multi-metric scoring.
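
As a rough illustration of item 4, a helper could turn the per-split result dicts into arrays keyed by name. This is a minimal sketch; the exact name and signature in the PR may differ.

import numpy as np

def aggregate_list_of_dicts(list_of_dicts):
    # Combine [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
    # into {'a': array([1, 3]), 'b': array([2, 4])}.
    keys = list_of_dicts[0].keys()
    return {key: np.asarray([d[key] for d in list_of_dicts]) for key in keys}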

@jnothman jnothman self-requested a review December 3, 2019 10:23
Member

@jnothman jnothman left a comment

The changes to enable error_score handling are a mess... Is there a nicer way we can do this? Either without fit_failed, or with only fit_failed (and not error_score)?

How do we handle a callable that gives us different keys in each call? For example, if it only returned keys corresponding to classes in that data split? Do we support them? Do we attempt to catch this case and raise an explicit error?

@@ -263,22 +263,20 @@ parameter:
Note that the dict values can either be scorer functions or one of the
Member

This should be indented unless I'm much mistaken

err_msg_generic = ("scoring should either be a single string or "
"callable or a "
"list/tuple of strings or a dict of scorer name "
"mapped to the callable for multiple metric "
Member

Should we improve this message?

Member Author

What do you think about:

"scoring should be a string, callable, list of strings, or a dict mapping scorer names to {callables, strings}. Refer to https://scikit-learn.org/stable/modules/model_evaluation.html#using-multiple-metric-evaluation for details."

Member

How about "scoring is invalid (got {!r}). Refer to the model_evaluation documentation and scoring API reference." Or refer to the glossary... which should be updated in this PR.

return score(self.best_estimator_, X, y)
if isinstance(self.scorer_, dict):
if self.multimetric_:
scorer = self.scorer_[self.refit]
Member

The multiplicity of types that scorer_ can be now makes it not very useful as public API. We could consider removing the attribute

refit_metric = self.refit

refit_metric = "score"
scoring_callable = callable(self.scoring)
Member

I don't think we need this stored in a variable

scoring_callable = callable(self.scoring)
if scoring_callable:
scorers = self.scoring
elif (self.scoring is None or isinstance(self.scoring, str)):
Member

Rm parentheses

scorers = _check_multimetric_scoring(self.estimator, self.scoring)
refit_metric = self.refit

if self.refit is not False and (
Member

Maybe we should pull this out as a function

@@ -455,29 +499,37 @@ def _fit_and_score(estimator, X, y, scorer, train, test, verbose,
return_estimator : boolean, optional, default: False
Whether to return the fitted estimator.

return_fit_failed : bool, default=False
Member

This is a private function. There's no need to add a parameter for this kind of change.

@thomasjpfan
Member Author

The changes to enable error_score handling are a mess... Is there a nicer way we can do this? Either without fit_failed, or with only fit_failed (and not error_score)?

A cleaner solution would be to mark runs with fit_failed and have the caller adjust the score accordingly.

How do we handle a callable that gives us different keys in each call? For example, if it only returned keys corresponding to classes in that data split? Do we support them? Do we attempt to catch this case and raise an explicit error?

With this PR, I would prefer to not support this and raise an error.
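
For illustration, a validation step along these lines could catch mismatched keys across splits and raise an explicit error. This is a hypothetical helper, not the exact code added by the PR.

def check_consistent_score_keys(per_split_scores):
    # Raise if the dicts returned by a callable scorer differ in keys across splits.
    expected_keys = set(per_split_scores[0])
    for split_idx, scores in enumerate(per_split_scores[1:], start=1):
        if set(scores) != expected_keys:
            raise ValueError(
                f"Callable scoring returned keys {sorted(scores)} on split "
                f"{split_idx}, but {sorted(expected_keys)} on split 0."
            )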

@jnothman jnothman added this to the 0.23 milestone Jan 30, 2020
Member

@jnothman jnothman left a comment

Otherwise, this LGTM, thanks @thomasjpfan

else:
refit_metric = 'score'
scorers = _check_multimetric_scoring(self.estimator, self.scoring)
self._check_multimetric_scores_refit(scorers)
Member

to avoid confusion with prev line, better rename to _check_refit_for_multimetric

train_scores : dict of scorer name -> float, optional
Score on training set (for all the scorers),
returned only if `return_train_score` is `True`.
result: dict with the following attributes
Member

space before :, please

Member

@jnothman jnothman left a comment

I'm quite excited for the potential of this. It needs a what's new. Does it need an example?

@thomasjpfan
Member Author

An example would be nice to have.

I want to revisit this one more time to think about if there is a (nice) way to avoid using fit_failed.

Member

@jnothman jnothman left a comment

Could we benefit from polymorphism or similar here? I feel like there are a lot of code paths that differ depending on the format of scoring. Could we use polymorphism so that after some validation step, all scoring cases were treated identically from the perspective of the code?

Maybe in a later iteration.
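
To illustrate the idea, every accepted scoring format could be normalized behind one wrapper that always returns a dict of scores; the hard part, as discussed below, is that a callable returning a dict cannot be classified without calling it. This is a hypothetical sketch, not scikit-learn's actual internals.

from sklearn.metrics import check_scoring

class DictScorer:
    # Normalize a string, None, or dict-of-scorers into one callable
    # that always returns a dict of scores (illustrative only).
    def __init__(self, estimator, scoring):
        if scoring is None or isinstance(scoring, str):
            self._scorers = {"score": check_scoring(estimator, scoring)}
        else:  # dict of name -> string or callable
            self._scorers = {name: check_scoring(estimator, s)
                             for name, s in scoring.items()}

    def __call__(self, estimator, X, y):
        return {name: scorer(estimator, X, y)
                for name, scorer in self._scorers.items()}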

result : dict with the following attributes
train_scores : dict of scorer name -> float
Score on training set (for all the scorers),
returned only if `return_train_score` is `True`.

Member

FWIW, I think blank lines between each of these would not render as a <dl> in ReST

raise NotFittedError("All estimators failed to fit")

if isinstance(successful_score, dict):
formatted_erorr = {name: error_score for name in successful_score}
Member

Suggested change
formatted_erorr = {name: error_score for name in successful_score}
formatted_error = {name: error_score for name in successful_score}

and consequently below

that failed are set to error_score. `results` are the aggregated output
of `_fit_and_score`.
"""
successful_score = None
Member

I think we should call this score_names and set it to result["test_scores"].keys() below

@@ -222,45 +222,89 @@ def cross_validate(estimator, X, y=None, groups=None, scoring=None, cv=None,
X, y, groups = indexable(X, y, groups)

cv = check_cv(cv, y, classifier=is_classifier(estimator))
scorers, _ = _check_multimetric_scoring(estimator, scoring=scoring)

if callable(scoring):
Member

Please add a comment about why we check results only in the callable scoring case. Perhaps to the effect of "We try to validate scoring early if it is not a callable"

or else check_fit_and_score_results could be called _handle_error_score.

@glemaitre
Member

@thomasjpfan Could you rebase, I would like to review this one.


if len(keys) != len(scoring):
raise ValueError(err_msg + "Duplicate elements were found in"
" the given list. %r" % repr(scoring))
Member

Shall we use an f-string as well?

Comment on lines +704 to +711
if callable(self.scoring):
scorers = self.scoring
elif self.scoring is None or isinstance(self.scoring, str):
scorers = check_scoring(self.estimator, self.scoring)
else:
refit_metric = 'score'
scorers = _check_multimetric_scoring(self.estimator, self.scoring)
self._check_refit_for_multimetric(scorers)
refit_metric = self.refit
Member Author

@thomasjpfan thomasjpfan Jul 9, 2020

Most of the complication in this PR comes from the fact that, with a callable, it is not possible to tell whether it returns a scalar or a dictionary without calling it. This means that _check_multimetric_scoring cannot yet check whether self.scoring is multimetric. We have to use the results of _fit_and_score to tell whether the score is a scalar or a dictionary.
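
A minimal illustration of the point (the function name is hypothetical; the PR stores per-split results as dicts with a "test_scores" entry):

def returned_multimetric(fit_and_score_result):
    # Only after scoring do we know whether the callable produced a scalar or a dict.
    return isinstance(fit_and_score_result["test_scores"], dict)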

Member

@glemaitre glemaitre left a comment

Could you add an entry in what's new as well.

Comment on lines 483 to 484
raise ValueError(err_msg +
"Empty list was given. %r" % repr(scoring))
Member

Suggested change
raise ValueError(err_msg +
"Empty list was given. %r" % repr(scoring))
raise ValueError(
    err_msg + f"Empty list was given. {repr(scoring)}"
)

"One or more of the elements were "
"callables. Use a dict of score name "
"mapped to the scorer callable. "
"Got %r" % repr(scoring))
Member

f-string

"Got %r" % repr(scoring))
else:
raise ValueError(err_msg +
"Non-string types were found in "
Member

f-string

keys = set(scoring)
if not all(isinstance(k, str) for k in keys):
raise ValueError("Non-string types were found in the keys of "
"the given dict. scoring=%r" % repr(scoring))
Member

f-string

raise ValueError("Non-string types were found in the keys of "
"the given dict. scoring=%r" % repr(scoring))
if len(keys) == 0:
raise ValueError("An empty dict was passed. %r" % repr(scoring))
Member

f-string

"False explicitly. %r was passed."
% self.refit)
if self.refit is not False and (
not isinstance(self.refit, str) or
Member

We could break the condition into variables with meaningful names. It would make the if statement easier to read.
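
For example, something along these lines (a sketch with hypothetical variable names; the actual condition and error message in the PR may differ):

refit_requested = self.refit is not False
refit_is_valid_key = isinstance(self.refit, str) and self.refit in scorers

if refit_requested and not refit_is_valid_key:
    # Illustrative error message only.
    raise ValueError(f"refit should be a key of the scorers dict, got {self.refit!r}")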


return ret


def _insert_error_scores(results, error_score):
"""Insert error in results by replacing them with `error_score`.
Member

Could be worth mentioning that the insert is in place (even if it is almost obvious)
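
A rough sketch of such an in-place replacement, assuming each aggregated result is a dict with "fit_failed" and "test_scores" entries as discussed in this thread (not the exact implementation):

from sklearn.exceptions import NotFittedError

def insert_error_scores(results, error_score):
    # Replace the scores of failed fits in place with `error_score`.
    successful = next((r for r in results if not r["fit_failed"]), None)
    if successful is None:
        raise NotFittedError("All estimators failed to fit")
    formatted_error = {name: error_score for name in successful["test_scores"]}
    for result in results:
        if result["fit_failed"]:
            result["test_scores"] = formatted_error.copy()
            if "train_scores" in result:
                result["train_scores"] = formatted_error.copy()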


return ret


def _insert_error_scores(results, error_score):
"""Insert error in results by replacing them with `error_score`.
Member

Suggested change
"""Insert error in results by replacing them with `error_score`.
"""Insert error in `results` by replacing them with `error_score`.

"""Insert error in results by replacing them with `error_score`.

This only applies to multimetric scores because `_fit_and_score` will
handle the single metric case."""
Member

Suggested change
handle the single metric case."""
handle the single metric case.
"""

@@ -751,16 +769,31 @@ def evaluate_candidates(candidate_params):
.format(n_splits,
len(out) // n_candidates))

# For callable self.scoring, the return type is only known after
# calling. If the return type is a dictionary, the error scores
# can now be inserted with the correct key.
Member

I would add in the comment that the type checking of out will happen in _insert_error_scores

elif isinstance(scoring, dict):
keys = set(scoring)
if not all(isinstance(k, str) for k in keys):
raise ValueError("Non-string types were found in the keys of "
Member

It seems that we don't check for this case in the test

if self.multimetric_:
scorer = self.scorer_[self.refit]
else:
scorer = self.scorer_
Member

This is not covered. So does this happen when scorer is a dict but with a single metric?

Member

@glemaitre glemaitre left a comment

After the changes, LGTM. I would like to tackle #12385 because it would be key for solving the issue of passing scoring parameters in the grid-search and solving the bug in #17704

@glemaitre glemaitre merged commit 9acfaab into scikit-learn:master Jul 16, 2020
score = self.scorer_[self.refit] if self.multimetric_ else self.scorer_
return score(self.best_estimator_, X, y)
if isinstance(self.scorer_, dict):
if self.multimetric_:
Member

@thomasjpfan it seems that this attribute is undocumented. Should it be private instead?

Member Author

I am +0.5 on making it private (which would require a deprecation). Even before this PR, this attribute was defined as:

  1. False = scoring returns a scalar
  2. True = scoring returns a dictionary (Which can be a dictionary with one item)

So if scoring=['accuracy'] that would be considered "multimetric", while scoring='accuracy' is not.
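
A small usage sketch of that distinction (assuming a fitted search exposes the multimetric_ attribute, as in the diff above):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(random_state=0)
params = {"C": [0.1, 1.0]}

single = GridSearchCV(LogisticRegression(max_iter=1000), params,
                      scoring="accuracy").fit(X, y)
multi = GridSearchCV(LogisticRegression(max_iter=1000), params,
                     scoring=["accuracy"], refit="accuracy").fit(X, y)

print(single.multimetric_)  # False: scoring returns a scalar
print(multi.multimetric_)   # True: scoring returns a dict, even with one item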

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Member

@jnothman jnothman left a comment

@thomasjpfan, @glemaitre: this seems to have missed a what's new entry, and the parameter documentation for scoring wasn't updated!

Successfully merging this pull request may close these issues.

Allow scoring callable to return a dict {name: score}
5 participants