[MRG] Adds _MultimetricScorer for Optimized Scoring #14593


Merged
merged 40 commits into scikit-learn:master on Sep 10, 2019

Conversation

thomasjpfan
Member

Reference Issues/PRs

Fixes #10802
Alternative to #10979

What does this implement/fix? Explain your changes.

  1. This PR creates a _MultimetricScorer, a dict subclass used to reduce the number of calls to predict, predict_proba, and decision_function.

  2. The public interface of objects and functions using scoring are unchanged.

  3. The cache is only used when it is beneficial, as determined by _MultimetricScorer._use_cache.

  4. Users cannot create a _MultimetricScorer themselves and pass it in as scoring.

Any other comments?

I do have plans to support user-provided custom callables that return dictionaries. That was not included here, to keep the scope of this PR narrowed to _MultimetricScorer.
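For intuition, here is a minimal sketch of the caching idea described above; the names (_cached_call, MultimetricScorerSketch) are illustrative stand-ins, not the code in this PR.

from functools import partial


def _cached_call(cache, estimator, method, *args, **kwargs):
    """Call estimator.<method>, reusing the result when a cache dict is given."""
    if cache is not None and method in cache:
        return cache[method]
    result = getattr(estimator, method)(*args, **kwargs)
    if cache is not None:
        cache[method] = result
    return result


class MultimetricScorerSketch:
    """Evaluate several scorers while computing each prediction method at most once.

    Each scorer is expected to accept (method_caller, estimator, X, y) and to
    obtain predictions via method_caller(estimator, "predict", X) and friends.
    """

    def __init__(self, **scorers):
        self._scorers = scorers

    def __call__(self, estimator, X, y):
        cache = {}  # lives only for the duration of this call
        method_caller = partial(_cached_call, cache)
        return {name: scorer(method_caller, estimator, X, y)
                for name, scorer in self._scorers.items()}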

Member

@NicolasHug NicolasHug left a comment

Looks good, I think this could use a few more comments to describe the logic.

I'm not a huge fan of using None as _method_cacher for it to revert to the default method cacher of _BaseScorer. (Maybe with my suggestions I'd find it clearer, IDK)

Maybe add a sanity check that makes sure that passing another X gives different results even when caching is involved.
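A sketch of that sanity check could look like the following (the import path of the private helpers is an assumption based on where they lived at the time of this PR):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.scorer import _check_multimetric_scoring, _MultimetricScorer


def test_multimetric_scorer_caching_respects_input():
    X, y = make_classification(random_state=0)
    X1, y1, X2, y2 = X[:50], y[:50], X[50:], y[50:]

    clf = LogisticRegression(random_state=0).fit(X1, y1)

    scorer_dict, _ = _check_multimetric_scoring(
        clf, ["roc_auc", "neg_log_loss", "accuracy"])
    scorer = _MultimetricScorer(**scorer_dict)

    # Different data must yield different scores even though predictions are
    # cached within each call.
    scores1 = scorer(clf, X1, y1)
    scores2 = scorer(clf, X2, y2)
    assert set(scores1) == set(scores2)
    assert any(not np.isclose(scores1[name], scores2[name]) for name in scores1)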

scorers = {"score": check_scoring(estimator, scoring=scoring)}
return scorers, False
return _MultimetricScorer(**scorers), False
Member

Why return a MultiMetricScorer when there is only one scorer?

Member Author

_check_multimetric_scoring always returned a multimetric scorer. (On master it returned a dictionary, which was the data structure used to denote "multimetric scoring".)

Member

I disagree, a dict with only one key (as here) denotes a single metric scorer. That's the reason is_multimetric is False here.

Since no caching happens with a single-metric scorer, I think we should not change this part and still return the dict, to avoid the confusion.

Member

Or else, MultiMetricScorer should have a whole different name. It doesn't make sense to return a MultiMetricScorer instance while is_multimetric is False.

Member Author

@thomasjpfan thomasjpfan Aug 15, 2019

A user can pass a dictionary to scoring with one key and is_multimetric will be true.

Member Author

I do plan on removing is_multimetric and treating "anything that returns a dictionary" as multimetric.

'll2': 'neg_log_loss',
'ra1': 'roc_auc',
'ra2': 'roc_auc'
}, 1, 1, 1), (['roc_auc', 'accuracy'], 1, 0, 1)],
Member

for readability maybe separate both cases with a line break

@@ -543,3 +544,41 @@ def test_scoring_is_not_metric():
Ridge(), r2_score)
assert_raises_regexp(ValueError, 'make_scorer', check_scoring,
KMeans(), cluster_module.adjusted_rand_score)


@pytest.mark.parametrize("scorers,predicts,predict_probas,decision_funcs",
Member

expected_predict_count, expected_predict_proba_count, ... ?

Long names, I know :/

scorer, _ = _check_multimetric_scoring(LogisticRegression(), scorers)
scores = scorer(mock_est, X, y)

assert set(scorers) == set(scores)
Member

Just because I was slightly confused at first:

Suggested change
assert set(scorers) == set(scores)
assert set(scorers) == set(scores) # compare dict keys

return True

if counter[_ThresholdScorer] > 0 and (counter[_PredictScorer] or
counter[_ThresholdScorer]):
Member

This is equivalent to

counter[_ThresholdScorer] and (counter[_PredictScorer]

(the or isn't useful)

Member Author

This should have been:

        if counter[_ThresholdScorer] and (counter[_PredictScorer] or
                                          counter[_ProbaScorer]):
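Roughly, the heuristic being corrected here could be sketched as below (simplified; the PR's _use_cache also inspects the estimator, e.g. whether it is a regressor or lacks decision_function, and the import path of the private scorer classes is an assumption):

from collections import Counter

from sklearn.metrics.scorer import (_PredictScorer, _ProbaScorer,
                                    _ThresholdScorer)


def use_cache_sketch(scorers):
    """Return True when at least one prediction method would be computed twice."""
    if len(scorers) == 1:  # a single scorer cannot share anything
        return False

    counter = Counter(type(scorer) for scorer in scorers.values())
    if any(counter[cls] > 1
           for cls in (_PredictScorer, _ProbaScorer, _ThresholdScorer)):
        return True  # the same kind of scorer appears more than once
    if counter[_ThresholdScorer] and (counter[_PredictScorer]
                                      or counter[_ProbaScorer]):
        # A _ThresholdScorer may fall back to predict (regressors) or
        # predict_proba (no decision_function), so it can share with those.
        return True
    return False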

return scores

def _use_cache(self):
"""Return True if using a cache is desired."""
Member

Short description of "desired" please ;)

return self._score(estimator, X, y_true, sample_weight=sample_weight)

def _method_cacher(self, estimator, method, *args, **kwargs):
"""Call estimator directly."""
Member

Suggested change
"""Call estimator directly."""
"""Call estimator's method directly, without caching."""

@@ -44,7 +45,54 @@
from ..base import is_regressor


class _BaseScorer(metaclass=ABCMeta):
class _MultimetricScorer(dict):
"""Callable dictionary for multimetric scoring."""
Member

Please briefly describe keys being strings and values being instances of _BaseScorer

I first thought (without looking) that this was also a _BaseScorer instance

Member Author

That would have been nice, but the passthrough scorer and custom scorers may have weird interfaces, so _MultimetricScorer.__call__ needed to be as generic as possible.

"""
return self._score(estimator, X, y_true, sample_weight=sample_weight)

def _method_cacher(self, estimator, method, *args, **kwargs):
Member

I'm confused that this is called _method_cacher. It makes me think that it overrides the MultimetricScorer's _method_cacher, but the logic is slightly different.

Call this _passthrough_method_cacher?

fit_and_score_args = [None, None, None, two_params_scorer]

scorer = _MultimetricScorer(score=two_params_scorer)
fit_and_score_args = [None, None, None, scorer]
Member

Shouldn't the original list [None, None, None, two_params_scorer] still be tested?

Member Author

This is testing a private method _score. On master, a multimetric scoring was represented with a dictionary, which _score used to call the scorers independently. This PR moves this responsibility from _score to _MultimetricScorer. Now _score only needs to call _MultimetricScorer.__call__.
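In sketch form, the simplified responsibility of _score after this PR (not the exact diff, which also includes score validation):

def _score(estimator, X_test, y_test, scorer):
    """Compute the score(s) of an estimator on a given test set.

    The scorer itself decides the return type: a _MultimetricScorer returns a
    dict of name -> float, while a single scorer returns a float.
    """
    return scorer(estimator, X_test, y_test)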

Member

@NicolasHug NicolasHug left a comment

Mostly nitpicks about comments but LGTM, thanks @thomasjpfan

The whole scoring logic is becoming quite convoluted by now... Might be worth some re-thinking one day.


`_MultimetricScorer` will return a dictionary of scores corresponding to
the scorers in the dictionary. Note `_MultimetricScorer` can be created
with a dictionary with one key.
Member

Suggested change
with a dictionary with one key.
with a dictionary with one key (i.e. only one actual scorer).

return scores

def _use_cache(self, estimator):
"""Return True if using a cache it is beneficial.
Member

Suggested change
"""Return True if using a cache it is beneficial.
"""Return True if using a cache is beneficial.

- `decision_function` and `predict_proba` is called.

"""
if len(self) == 1:
Member

Suggested change
if len(self) == 1:
if len(self) == 1: # Only one scorer

score : float
Score function applied to prediction of estimator on X.
"""
return self._score(partial(_method_caller, None), estimator, X, y_true,
Member

Suggested change
return self._score(partial(_method_caller, None), estimator, X, y_true,
return self._score(partial(_method_caller, cache=None), estimator, X, y_true,

Member Author

Since cache is a positional argument, partial needs to accept it as a positional argument.
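A tiny illustration of why the keyword form does not work here (toy names, not the PR's code):

from functools import partial


def method_caller(cache, estimator, method, *args, **kwargs):
    """Toy stand-in: `cache` occupies the first positional slot."""
    return getattr(estimator, method)(*args, **kwargs)


class Toy:
    def predict(self, X):
        return [0 for _ in X]


ok = partial(method_caller, None)           # binds `cache` positionally
ok(Toy(), "predict", [1, 2, 3])             # works

bad = partial(method_caller, cache=None)    # binds `cache` by keyword
# bad(Toy(), "predict", [1, 2, 3]) raises:
# TypeError: method_caller() got multiple values for argument 'cache'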

"""Evaluate predicted target values for X relative to y_true.

Parameters
----------
method_caller: callable
Call estimator with method and args and kwargs.
Member

Suggested change
Call estimator with method and args and kwargs.
Call estimator's method with args and kwargs, potentially caching results.

Or anything else that indicates this is used for caching

if is_multimetric:
return _multimetric_score(estimator, X_test, y_test, scorer)
def _score(estimator, X_test, y_test, scorer):
"""Compute the score(s) of an estimator on a given test set."""
Member

Let's keep the comment about what is returned.

IIUC a dict is returned iff scorer is a dict?


scorer_dict, _ = _check_multimetric_scoring(LogisticRegression(), scorers)
scorer = _MultimetricScorer(**scorer_dict)
scores = scorer(mock_est, X, y)
Member

I don't think this is possible but it'd be cool to assert that the cache only exists during __call__().

Member Author

Since it is scoped in __call__ I do not think it is possible.

Member

@amueller amueller left a comment

minor nitpicks but this looks great!

to `predict_proba`, `predict`, and `decision_function`.

`_MultimetricScorer` will return a dictionary of scores corresponding to
the scorers in the dictionary. Note `_MultimetricScorer` can be created
Member

Note that?

- `_ThresholdScorer` and `_PredictScorer` are called and
estimator is a regressor.
- `_ThresholdScorer` and `_ProbaScorer` are called and
estimator does not have `decision_function` an attribute.
Member

Suggested change
estimator does not have `decision_function` an attribute.
estimator does not have a `decision_function` attribute.

scorer = _MultimetricScorer(**scorer_dict)
scores = scorer(mock_est, X, y)

assert set(scorers) == set(scores) # compare dict keys
Member

maybe add assert set(scorers) == set(scorer)?
I find this hard to read btw because we have scorer, scorers and scores, which have very small Levenshtein distance, and scorer_dict, which is not very helpful, since the other three things are also dicts.

assert predict_proba_call_cnt == 1


def test_multimetric_scorer_calls_method_once_regressos_threshold():
Member

Suggested change
def test_multimetric_scorer_calls_method_once_regressos_threshold():
def test_multimetric_scorer_calls_method_once_regressor_threshold():

clf.fit(X, y)

# regression metric that needs "threshold" which calls predict
r2_threshold = make_scorer(r2_score, needs_threshold=True)
Member

I feel like this would be nicer with an actual ranking metric, like auc?

score1 = scorer(clf, X1, y1)
score2 = scorer(clf, X2, y2)
assert set(score1) == set(score2) # compare dict keys
assert score1 != score2
Member

what does this test? object identity?

Member Author

Bad test is bad. I redid this test to manually call scorers as suggested in your other comment.

scorer_dict, _ = _check_multimetric_scoring(clf, scorers)
scorer = _MultimetricScorer(**scorer_dict)

score1 = scorer(clf, X1, y1)
Member

can we maybe manually call the scorers in the scorer_dict and see that the results are correct for each of them?

Member Author

Updated test to do this.
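For reference, a sketch of a test along those lines (again assuming the private helpers are importable from sklearn.metrics.scorer, as in this PR):

import pytest
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.scorer import _check_multimetric_scoring, _MultimetricScorer


def test_multimetric_matches_individual_scorers():
    X, y = make_classification(random_state=0)
    clf = LogisticRegression(random_state=0).fit(X, y)

    scorer_dict, _ = _check_multimetric_scoring(
        clf, ["accuracy", "roc_auc", "neg_log_loss"])
    multi_scores = _MultimetricScorer(**scorer_dict)(clf, X, y)

    # The cached, batched evaluation must agree with each scorer called alone.
    for name, single_scorer in scorer_dict.items():
        assert multi_scores[name] == pytest.approx(single_scorer(clf, X, y))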

Member

@jnothman jnothman left a comment

I find this design of a callable dict that generates a dict uncomfortable. That duplicate use of dicts makes the documentation confusing, apart from anything else.

Is it really justified to make _MultimetricScorer a dict? What functionality of a dict is used? I understand that this may reduce the amount of code here, but I suspect it makes it a little more obfuscated.

@thomasjpfan
Member Author

The dict feature was needed when _check_multimetric_scoring returned a _MultimetricScorer. Since this was removed, it is not needed anymore.

PR was updated such that _MultimetricScorer is not a dict.

Member

@jnothman jnothman left a comment

Please add a whatsnew

Member

@jnothman jnothman left a comment

Otherwise LGTM



def test_multimetric_scorer_sanity_check():
# scoring dictionary returned is the same as calling each scroer seperately
Member

Suggested change
# scoring dictionary returned is the same as calling each scroer seperately
# scoring dictionary returned is the same as calling each scorer seperately

if not isinstance(score, numbers.Number):
raise ValueError(error_msg % (score, type(score), name))
scores[name] = score
else: # scaler
Member

Suggested change
else: # scaler
else: # scalar


error_msg = ("scoring must return a number, got %s (%s) "
"instead. (scorer=%s)")
if isinstance(scores, dict):
Member

This can return a number or a dict. Can we make all cases return a dict, and delete some code paths? we could just use:

if not isinstance(scores, dict):
    scores = {'score': scores}

Member

Okay, I've looked into this and it might be better to consider this as a separate clean-up change.

Member Author

This type of change would reduce quite a few code paths. (It would most likely also make it nicer to support custom callables that return dictionaries.)

@@ -257,6 +257,11 @@ Changelog
- |Enhancement| Allow computing averaged metrics in the case of no true positives.
:pr:`14595` by `Andreas Müller`_.

- |Enhancement| Improved performance of multimetric scoring in
Member

Can use |Efficiency|?



@amueller
Member

amueller commented Aug 26, 2019

oh nice, this is even cleaner :) still lgtm from my side.

@amueller amueller added the High Priority label Aug 26, 2019
@amueller
Member

fixes #10823, closes #9326

Member

@NicolasHug NicolasHug left a comment

@thomasjpfan please address minor typos so we can merge ;)

@amueller
Member

@thomasjpfan can you fix the merge conflicts again?
@jnothman does this still look good? I'd love to merge this.

@jnothman jnothman merged commit fbb2c7c into scikit-learn:master Sep 10, 2019
@jnothman
Member

Thank you @thomasjpfan!!

I look forward to some of the things this enables along the lines of #12385

@amueller
Member

Awesome!

Labels
High Priority (high priority issues and pull requests)
Development

Successfully merging this pull request may close these issues.

Multi-metric scoring is incredibly slow because it repeats predictions for every metric
4 participants