Cross-validation returning multiple scores #1850

Closed

jnothman opened this issue Apr 11, 2013 · 33 comments

@jnothman
Member

Scorer objects currently provide an interface that returns a scalar score given an estimator and test data. This is necessary for *SearchCV to calculate a mean score across folds, and determine the best score among parameters.

This severely limits the diagnostic information available from cross-validation or parameter exploration, which one can see by comparing against the catalogue of metrics: precision and recall alongside F-score; scores for each of multiple classes as well as an aggregate; and error distributions (i.e. PR curve or confusion matrix). @solomonm (#1837) and I (on the mailing list, and in an implementation within #1768) have independently sought to have precision and recall returned from cross-validation routines when F1 is used as the cross-validation objective; @eickenberg in #1381 (comment) raised a concern regarding arrays of scores corresponding to multiple targets.

I thought it deserved an Issue of its own to solidify the argument and its solution.

Some design options:

  1. Allow multiple scorers to be provided to cross_val_score or *SearchCV (henceforth CVEvaluator), with one specified as the objective. But since the Scorer generally calls estimator.{predict,decision_function,predict_proba}, each scorer would repeat this work.
  2. Separate the objective and non-objective metrics as parameters to CVEvaluator: the scoring parameter remains as it is and a diagnostics parameter provides a callable with similar (same?) arguments as Scorer, but returning a dict. This means that the prediction work is repeated but not necessarily as many times as there are metrics. This diagnostics callable is more flexible and perhaps could be passed the training data as well as the test data.
  3. Continue to use the scoring parameter, but allow the Scorer to return a dict with a special key for the objective score. This would need to be handled by the caller. For backwards compatibility, no existing scorers would change their behaviour of returning a float. This ensures no repeated prediction work.
  4. Add an additional method to the Scorer interface that generates a set of named outputs (as with calc_names proposed in #1837), again with a special key for the objective score. This allows users to continue using scoring='f1' but get back precision and recall for free.

Note that 3. and 4. potentially allow for any set of metrics to be composed into a scorer without redundant prediction work (and 1. allows composition with highly redundant prediction work).
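
As a rough illustration of option 3, a scorer could predict once, compute several metrics, and return a dict with a reserved key for the objective score. This is a sketch only; OBJECTIVE_KEY and multi_f1_scorer are hypothetical names, not part of scikit-learn:

```python
# Minimal sketch of option 3: predict exactly once, return several metrics,
# and mark one of them as the objective via a reserved key.
from sklearn.metrics import precision_recall_fscore_support

OBJECTIVE_KEY = "score"  # the value a CVEvaluator would average and compare

def multi_f1_scorer(estimator, X, y):
    y_pred = estimator.predict(X)  # the prediction work happens exactly once
    p, r, f, _ = precision_recall_fscore_support(y, y_pred, average="macro")
    return {OBJECTIVE_KEY: f, "precision": p, "recall": r}

# A caller would then do something like:
#     result = scorer(estimator, X_test, y_test)
#     objective = result[OBJECTIVE_KEY] if isinstance(result, dict) else result
```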

Comments, critiques and suggestions are very welcome.

@amueller
Member

+1 for addressing the problem. Have to think about your solution if I have time.

@SolomonMg

@jnothman this is awesome. I feel like option 4 might be best from the user's perspective.

@amueller
Member

Thanks for pondering this.
Just a quick remark: as the Scorers haven't been in the last release, AFAIK, we don't need to preserve backward compatibility.

This is another reason to pay some attention to this now: before the next release, we are still flexible w.r.t. the Scorer and GridSearchCV interface.

@jnothman
Member Author

A very good point. I jumped from 0.12 to HEAD so I didn't notice that. Hmm... Yes, rethinking the Scorer at the same time probably makes sense. I'll ponder this and perhaps change the options above to reflect that.

@amueller
Member

Friday is my last urgent deadline. Hopefully afterwards I can give your suggestions and pull requests some more attention :-/

@jnothman
Member Author

no worries.

@jnothman
Member Author

Hmm, given that it's okay to wreak havoc on the current interface, I'm thinking option 3 might be the way to go, allowing a scoring function to return either a scalar or a dict/structured array with a special key for the primary score. But I'm not yet certain.

I also think we should recognise that GridSearchCV and cross_val_score both currently require the score to be averageable, and the former requires that it be comparable. But really, cross_val_score and perhaps even the parameter searches should be usable even when the score is of a form that can't be used to calculate a mean.

@amueller
Member

Agreed. Also, there is the nasty greater_is_better that would be nice to get rid of ;)

@jnothman
Member Author

Well, the nasty part about greater_is_better is that it should be something used by the Scorer, not by GridSearchCV in my opinion.

I have also decided that a dict or generator of named scores is probably more expressive than we want. We really want something equivalent to a namedtuple; or rather, the Scorer may return a tuple if it is also annotated with the names (e.g. P, R, F, support) that the tuple entries correspond to.

Currently, I'm trying to imagine a fairly maximal design for a Scorer with the following attributes:

  • __call__ (or from_estimator): (estimator, X, y, weights) -> float or tuple
  • fields: list of field names if __call__ returns a tuple (and perhaps a corresponding list of dtypes would reduce work elsewhere)
  • mean (or aggregate): I'm not sure what the best interface for this is.
  • argbest: (array/list of score outputs) -> (index of best, or None)
  • argbest_mean: (array/list of mean outputs) -> (index of best, or None)

Default implementations of most methods are provided in a base class, wrapping existing metrics, arbitrary score_funcs and estimator.score where appropriate. This takes greater_is_better out of the CV operations.
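
A minimal sketch of what such a base class might look like, purely for illustration (none of these names exist in scikit-learn; it wraps precision/recall/F1 as an example):

```python
# Illustrative sketch only: a P/R/F scorer mirroring the attribute list above.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

class PRFScorer:
    fields = ("precision", "recall", "f1")   # names for the returned tuple

    def __call__(self, estimator, X, y, sample_weight=None):
        p, r, f, _ = precision_recall_fscore_support(
            y, estimator.predict(X), average="macro",
            sample_weight=sample_weight)
        return (p, r, f)

    def mean(self, fold_scores):
        # fold_scores: one tuple per fold -> per-field mean across folds
        return tuple(np.mean(fold_scores, axis=0))

    def argbest_mean(self, means):
        # Index of the parameter setting with the best mean F1. "Greater is
        # better" is a property of this scorer, not of the search code.
        return int(np.argmax([m[self.fields.index("f1")] for m in means]))
```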

I tend to agree with those on the mailing list (@ogrisel in particular) who saw unnecessary duplication in having both scorers and metrics, but I don't mind that much either way. I do think annotations on the metrics make sense; after all, they're just computer-readable docstrings. (And composite Scorers can be easily thrown together, as long as there's a straightforward interface to provide needs_threshold, which is why I saw a need for #1774.)

A couple of other small notes: when we think of applying this to training scores as well, we note that any output that doesn't depend on X, y will be duplicated, so we might need a separate callback/param for extracting diagnostics from the estimator, or for returning the fit estimator itself. This suggests to me nested records output from CV searches, i.e. searcher.fold_results_['test']['precision'] and searcher.fold_results_['train']['precision'] instead of searcher.fold_results_['test_precision'] and searcher.fold_results_['train_precision']. The procedure from the *SearchCV perspective would be as follows (a sketch of the nested records follows the list):

  • for each fold collect (train_scores, test_scores, elapsed_times, number_of_samples, estimator_diagnostics) as appropriate.
  • given a sequence of fold data (or a 2d array of params x fold), compile these into an array of records, using the field name (and perhaps dtype) information from the Scorer, and store in searcher.fold_results_.
  • calculate means using the Scorer's mean method, and store it in searcher.grid_results_ (or search_results_). Note the searcher doesn't need to care what this output looks like.
  • find the index corresponding to the best mean (whether max or min) -- if it can be determined -- using argbest_mean, and store it in searcher.best_index_.
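
For illustration, the nested-records idea above might be compiled with a NumPy structured dtype along these lines (placeholder values; fold_results_ is the attribute name proposed above, the rest is a sketch):

```python
# Illustrative only: per-fold results compiled into a nested record array so
# that fold_results_["test"]["precision"] style access works.
import numpy as np

fields = ("precision", "recall", "f1")
score_dtype = [(name, float) for name in fields]
fold_dtype = [("train", score_dtype), ("test", score_dtype),
              ("fit_time", float), ("n_test_samples", int)]

fold_results_ = np.array(
    [((0.81, 0.78, 0.79), (0.75, 0.71, 0.73), 0.12, 50),
     ((0.83, 0.80, 0.81), (0.77, 0.74, 0.75), 0.11, 50)],
    dtype=fold_dtype)

print(fold_results_["test"]["precision"])    # -> array([0.75, 0.77])
print(fold_results_["train"]["f1"].mean())   # aggregate across folds
```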

@jnothman
Member Author

I've just realised a fault in not singling out the objective score in the method proposed in my last comment: a single objective is also needed for non-predetermined parameter searches, e.g. using scipy or hyperopt. Maybe instead of argbest, the Scorer can implement to_objective, which extracts (or calculates) the necessary field and negates it if appropriate.
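
A tiny sketch of that to_objective idea, with hypothetical names and F1 as the default objective:

```python
# Hypothetical sketch: collapse a multi-metric score tuple into the single
# value an external optimiser (scipy, hyperopt, ...) would minimise.
def to_objective(score, fields=("precision", "recall", "f1"),
                 objective="f1", greater_is_better=True):
    value = score[fields.index(objective)]
    return -value if greater_is_better else value
```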

@jnothman
Member Author

A little messy, but here's the gist of that latest proposal (run for sample output): https://gist.github.com/jnothman/5396078

@arjoly
Member

arjoly commented Apr 25, 2013

> I have also decided that a dict or generator of named scores is probably more expressive than we want. We really want something equivalent to a namedtuple; or rather, the Scorer may return a tuple if it is also annotated with the names (e.g. P, R, F, support) that the tuple entries correspond to.

I would prefer a dict, since you keep more meaningfully annotated numbers. And later on, you can easily postprocess the dict {"metrics": measure, ...}. I use this feature a lot to aggregate results, make plots, ...

Furthermore, it would be consistent with your gist, where you pass dicts of metrics.

> mean (or aggregate): I'm not sure what the best interface for this is.

numpy uses the mean function name without weights and the average function name with weights.
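
For concreteness, the NumPy convention being referred to (standard NumPy functions; the weights here stand in for, e.g., test samples per fold):

```python
import numpy as np

fold_scores = np.array([0.71, 0.74, 0.69])
n_test = np.array([100, 120, 95])            # e.g. test samples per fold

np.mean(fold_scores)                         # unweighted: "mean"
np.average(fold_scores, weights=n_test)      # weighted: "average"
```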

@jnothman
Member Author

In general, though, even though scipy documentation names return values, they are elements in tuples. Obviously these can be transformed into dicts and vice-versa. I'm not sure what you mean by "annotated number".

@GaelVaroquaux
Member

> In general, though, even though scipy documentation names return values, they are elements in tuples.

In the Python standard library, more and more functions return namedtuples. I think I would go for that.

@larsmans
Member

larsmans commented Jul 1, 2013

Coming to this from the angle of #2013, I must say I also strongly dislike the greater_is_better attribute. The API is just clumsy. I tried my "return (-1, loss) or (+1, score)" proposal, but I couldn't find an elegant way to implement the relevant code in GridSearchCV. I propose we at least do the following:

  • Return -loss from a loss function and document that. Anyone who knows basic machine learning should know that minimizing a loss function means maximizing minus that loss function; it's not pretty, but plenty of toolkits do it this way. It simplifies the API to the point where a Scorer is just a simple function returning a scalar to maximize (see the sketch after this list), and we can extend that later so a scorer can return a tuple, the first value of which is to be maximized.
  • Get rid of the public Scorer class while we still can. It's going to tie us down; as more options than needs_proba and needs_threshold get implemented, __call__ is going to get heavier and heavier. If we want to split the class into ThresholdScorer, ProbaScorer, etc., we're going to have to overload __new__ and do all kinds of trickery to keep things working.
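
A minimal sketch of the sign-flipping convention from the first point above (the wrapper name is made up; only mean_squared_error is a real scikit-learn function):

```python
# Sketch: a scorer is just a callable returning a scalar to *maximize*,
# so a loss is wrapped by negating it.
from sklearn.metrics import mean_squared_error

def neg_mse_scorer(estimator, X, y):
    return -mean_squared_error(y, estimator.predict(X))

# Search code can then always pick the candidate with the largest score,
# with no greater_is_better flag to consult.
```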

I'll submit a PR with the main ideas.

@amueller
Member

amueller commented Jul 1, 2013

I take all the blame for the greater_is_better. I wanted to move forward with the scorers after the long discussion.
This is one of the reasons I didn't want to release yet.
Can you elaborate on the __new__ problem? Why couldn't we just split the class?

@larsmans
Member

larsmans commented Jul 1, 2013

Because then you have a single public class Scorer with a documented constructor. Suppose you break off the decision_function handling into ThresholdScorer. Then the user calls Scorer with needs_threshold=True. Scorer.__init__ cannot decide to return a ThresholdScorer, since it cannot return a value at all; you need a factory. __new__ can be used to implement a factory, but its protocol is pretty hairy (call __init__ manually iff you're returning a custom type, call the super constructor with or without arguments, etc.).
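
For comparison, a plain factory function sidesteps the __new__ protocol entirely. A sketch with illustrative class names (a similar pattern is what a make_scorer-style helper amounts to):

```python
# Sketch: a function, not a public class, decides which private scorer type
# to build. No __new__ overloading needed.
class _PredictScorer:
    def __init__(self, score_func):
        self._score_func = score_func

    def __call__(self, estimator, X, y):
        return self._score_func(y, estimator.predict(X))


class _ThresholdScorer:
    def __init__(self, score_func):
        self._score_func = score_func

    def __call__(self, estimator, X, y):
        return self._score_func(y, estimator.decision_function(X))


def make_scorer(score_func, needs_threshold=False):
    cls = _ThresholdScorer if needs_threshold else _PredictScorer
    return cls(score_func)
```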

@amueller
Member

amueller commented Jul 1, 2013

Ok, so you mean providing backward-compatibility is a problem? We don't need to do that, right?

@larsmans
Member

larsmans commented Jul 1, 2013

Why not? This is a public API!

@amueller
Member

amueller commented Jul 1, 2013

But not included in any release.

@larsmans
Member

larsmans commented Jul 1, 2013

Oh, sure, I meant: if we do release it, we're stuck with it. Anyway, see #2123.

@amueller
Member

@jnothman did you tag this? I'm not sure we can come up with a good solution within the next two days. First, we should review what the interface was at the last release and what we have already changed.

@jnothman
Member Author

I don't think I did.

@ogrisel
Member

ogrisel commented Jul 28, 2013

To be able to do this refactoring in 0.15 and have time to discuss it, I have issued a pull request to rename cv_scores_ back to grid_scores_ so as to have 0.14 compatible with the 0.13.1 public API: #2299 .

I will retag this PR to 0.15 now.

dwiel referenced this issue in amueller/scikit-learn on Nov 28, 2013: "… is actually a number, not an array (otherwise cross_val_score returns bogus)."
@dwiel

dwiel commented Nov 28, 2013

Any chance of this getting brought back up? It is a feature I would definitely enjoy.

@jnothman
Member Author

Me too... in fact, it was more or less the first thing that brought me to contribute to scikit-learn, and it's still something I need regularly... but I'm not sure we've come to any conclusions about a clean representation within the scorer framework.

@dengemann
Contributor

Did this go anywhere? It would be really nice to pass a list of metrics to cross_val_score and get a list of scores in the same order, or a dict with metric names as keys.

@amueller
Member

Patience ;) This is one of the highest things on my never-ending priority list. But we need to merge the model_selection refactoring before that, and before that we have to do a release.

@amueller
Member

@raghavrv this is still open, right? Is there a PR?

@jona-sassenhagen
Contributor

I have a use case where it would be cool if cross_val_score could return one score per target for a multi-target estimator. (@dengemann EEG channels :D)

@RokoMijic

RokoMijic commented Mar 2, 2017

Any progress on this front? I am busy putting an explanation in a docstring for some code, telling the reader why I am re-implementing cross-validation rather than using scikit-learn.

> jnothman opened this issue on Apr 11, 2013

Are we going to send this issue to elementary school? It's going to be 4 years old soon! ;)
Anyway, just to add: scikit-learn is awesome and I'm really grateful for the hard work that people put into it!

@jnothman
Member Author

jnothman commented Mar 3, 2017 via email

@jnothman
Member Author

jnothman commented Jul 8, 2017

Fixed in #7388
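
For later readers, a short usage sketch of the multi-metric evaluation this enabled, assuming a recent scikit-learn in which sklearn.model_selection.cross_validate accepts several scoring metrics at once:

```python
# Cross-validation returning multiple scores per fold as a dict.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    scoring=["precision_macro", "recall_macro", "f1_macro"])

print(results["test_f1_macro"])          # one value per fold
print(results["test_precision_macro"])   # precision comes along "for free"
```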

@jnothman jnothman closed this as completed Jul 8, 2017