Cross-validation returning multiple scores #1850

Closed

jnothman opened this issue Apr 11, 2013 · 33 comments

@jnothman
Member

Scorer objects currently provide an interface that returns a scalar score given an estimator and test data. This is necessary for *SearchCV to calculate a mean score across folds, and determine the best score among parameters.

This severely limits the diagnostic information available from cross-validation or parameter exploration, which one can see by comparing against the catalogue of metrics: precision and recall alongside F-score; scores for each of multiple classes as well as an aggregate; and error distributions (i.e. PR curve or confusion matrix). @solomonm (#1837) and I (on the mailing list, and in an implementation within #1768) have independently sought to have precision and recall returned from cross-validation routines when F1 is used as the cross-validation objective; @eickenberg in #1381 (comment) raised a concern regarding arrays of scores corresponding to multiple targets.

I thought it deserved an Issue of its own to solidify the argument and its solution.

Some design options:

  1. Allow multiple scorers to be provided to cross_val_score or *SearchCV (henceforth CVEvaluator), with one specified as the objective. But since the Scorer generally calls estimator.{predict,decision_function,predict_proba}, each scorer would repeat this work.
  2. Separate the objective and non-objective metrics as parameters to CVEvaluator: the scoring parameter remains as it is and a diagnostics parameter provides a callable with similar (same?) arguments as Scorer, but returning a dict. This means that the prediction work is repeated but not necessarily as many times as there are metrics. This diagnostics callable is more flexible and perhaps could be passed the training data as well as the test data.
  3. Continue to use the scoring parameter, but allow the Scorer to return a dict with a special key for the objective score. This would need to be handled by the caller. For backwards compatibility, no existing scorers would change their behaviour of returning a float. This ensures no repeated prediction work.
  4. Add an additional method to the Scorer interface that generates a set of named outputs (as with calc_names proposed in #1837), again with a special key for the objective score. This allows users to continue using scoring='f1' but get back precision and recall for free.

Note that 3. and 4. potentially allow for any set of metrics to be composed into a scorer without redundant prediction work (and 1. allows composition with highly redundant prediction work).
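
As a rough illustration of option 3, a scorer could predict once, compute several metrics, and return a dict with a reserved key for the objective score. This is a sketch only; OBJECTIVE_KEY and multi_f1_scorer are hypothetical names, not part of scikit-learn:

```python
# Minimal sketch of option 3: predict exactly once, return several metrics,
# and mark one of them as the objective via a reserved key.
from sklearn.metrics import precision_recall_fscore_support

OBJECTIVE_KEY = "score"  # the value a CVEvaluator would average and compare

def multi_f1_scorer(estimator, X, y):
    y_pred = estimator.predict(X)  # the prediction work happens exactly once
    p, r, f, _ = precision_recall_fscore_support(y, y_pred, average="macro")
    return {OBJECTIVE_KEY: f, "precision": p, "recall": r}

# A caller would then do something like:
#     result = scorer(estimator, X_test, y_test)
#     objective = result[OBJECTIVE_KEY] if isinstance(result, dict) else result
```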

Comments, critiques and suggestions are very welcome.

@amueller
Member

+1 for addressing the problem. Have to think about your solution if I have time.

@SolomonMg

@jnothman this is awesome. I feel like option 4 might be best from the user's perspective.

@amueller
Member

Thanks for pondering this.
Just a quick remark: as the Scorers haven't been in the last release, AFAIK, we don't need to preserve backward compatibility.

This is another reason to pay some attention to this now: before the next release, we are still flexible w.r.t. the Scorer and GridSearchCV interface.

@jnothman
Member Author

A very good point. I jumped from 0.12 to HEAD so I didn't notice that. Hmm... Yes, rethinking the Scorer at the same time probably makes sense. I'll ponder this and perhaps change the options above to reflect that.

@amueller
Member

Friday is my last urgent deadline. Hopefully afterwards I can give your suggestions and pull requests some more attention :-/

@jnothman
Member Author

no worries.

@jnothman
Member Author

Hmm, given that it's okay to wreak havoc on the current interface, I'm thinking option 3 might be the way to go, allowing a scoring function to return either a scalar or a dict/structured array with a special key for the primary score. But I'm not yet certain.

I also think we should recognise that GridSearchCV and cross_val_score both currently require the score to be averageable, and the former requires that it be comparable. But really, cross_val_score and perhaps even the parameter searches should be usable even when the score is of a form that can't be used to calculate a mean.

@amueller
Member

Agreed. Also, there is the nasty greater_is_better that would be nice to get rid of ;)

@jnothman
Member Author

Well, the nasty part about greater_is_better is that it should be something used by the Scorer, not by GridSearchCV in my opinion.

I have also decided that a dict or generator of named scores is probably more expressive than we want. We really want something equivalent to a namedtuple; or rather, the Scorer may return a tuple if it is also annotated with the names (e.g. P, R, F, support) that the tuple entries correspond to.

Currently, I'm trying to imagine a fairly maximal design for a Scorer with the following attributes:

  • __call__ (or from_estimator): (estimator, X, y, weights) -> float or tuple
  • fields: list of field names if __call__ returns a tuple (and perhaps a corresponding list of dtypes would reduce work elsewhere)
  • mean (or aggregate): I'm not sure what the best interface for this is.
  • argbest: (array/list of score outputs) -> (index of best, or None)
  • argbest_mean: (array/list of mean outputs) -> (index of best, or None)

Default implementations of most methods are provided in a base class, wrapping existing metrics, arbitrary score_funcs and estimator.score where appropriate. This takes greater_is_better out of the CV operations.
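
A minimal sketch of what such a base class might look like, purely for illustration (none of these names exist in scikit-learn; it wraps precision/recall/F1 as an example):

```python
# Illustrative sketch only: a P/R/F scorer mirroring the attribute list above.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

class PRFScorer:
    fields = ("precision", "recall", "f1")   # names for the returned tuple

    def __call__(self, estimator, X, y, sample_weight=None):
        p, r, f, _ = precision_recall_fscore_support(
            y, estimator.predict(X), average="macro",
            sample_weight=sample_weight)
        return (p, r, f)

    def mean(self, fold_scores):
        # fold_scores: one tuple per fold -> per-field mean across folds
        return tuple(np.mean(fold_scores, axis=0))

    def argbest_mean(self, means):
        # Index of the parameter setting with the best mean F1. "Greater is
        # better" is a property of this scorer, not of the search code.
        return int(np.argmax([m[self.fields.index("f1")] for m in means]))
```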

I tend to agree with those on the mailing list (@ogrisel in particular) who saw unnecessary duplication in having both scorers and metrics, but I don't mind that much either way. I do think annotations on the metrics make sense; after all, they're just computer-readable docstrings. (And composite Scorers can be easily thrown together, as long as there's a straightforward interface to provide needs_threshold, which is why I saw a need for #1774.)

A couple of other small notes: when we think of applying this to training scores as well, we note that any output that doesn't depend on X, y will be duplicated, so we might need a separate callback/param for extracting diagnostics from the estimator, or for returning the fit estimator itself. This suggests to me nested records output from CV searches, i.e. searcher.fold_results_['test']['precision'] and searcher.fold_results_['train']['precision'] instead of searcher.fold_results_['test_precision'] and searcher.fold_results_['train_precision']. The procedure from the *SearchCV perspective would be as follows (a sketch of the nested records follows the list):

  • for each fold collect (train_scores, test_scores, elapsed_times, number_of_samples, estimator_diagnostics) as appropriate.
  • given a sequence of fold data (or a 2d array of params x fold), compile these into an array of records, using the field name (and perhaps dtype) information from the Scorer, and store in searcher.fold_results_.
  • calculate means using the Scorer's mean method, and store it in searcher.grid_results_ (or search_results_). Note the searcher doesn't need to care what this output looks like.
  • find the index corresponding to the best mean (whether max or min) -- if it can be determined -- using argbest_mean, and store it in searcher.best_index_.
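
For illustration, the nested-records idea above might be compiled with a NumPy structured dtype along these lines (placeholder values; fold_results_ is the attribute name proposed above, the rest is a sketch):

```python
# Illustrative only: per-fold results compiled into a nested record array so
# that fold_results_["test"]["precision"] style access works.
import numpy as np

fields = ("precision", "recall", "f1")
score_dtype = [(name, float) for name in fields]
fold_dtype = [("train", score_dtype), ("test", score_dtype),
              ("fit_time", float), ("n_test_samples", int)]

fold_results_ = np.array(
    [((0.81, 0.78, 0.79), (0.75, 0.71, 0.73), 0.12, 50),
     ((0.83, 0.80, 0.81), (0.77, 0.74, 0.75), 0.11, 50)],
    dtype=fold_dtype)

print(fold_results_["test"]["precision"])    # -> array([0.75, 0.77])
print(fold_results_["train"]["f1"].mean())   # aggregate across folds
```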

@jnothman
Member Author

I've just realised a fault in not singling out the objective score in the method proposed in my last comment: a single objective is also needed for non-predetermined parameter searches, e.g. using scipy or hyperopt. Maybe instead of argbest, the Scorer can implement to_objective, which extracts (or calculates) the necessary field and negates it if appropriate.
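
A tiny sketch of that to_objective idea, with hypothetical names and F1 as the default objective:

```python
# Hypothetical sketch: collapse a multi-metric score tuple into the single
# value an external optimiser (scipy, hyperopt, ...) would minimise.
def to_objective(score, fields=("precision", "recall", "f1"),
                 objective="f1", greater_is_better=True):
    value = score[fields.index(objective)]
    return -value if greater_is_better else value
```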

@jnothman
Member Author

A little messy, but here's the gist of that latest proposal (run for sample output): https://gist.github.com/jnothman/5396078

@arjoly
Member

arjoly commented Apr 25, 2013

> I have also decided that a dict or generator of named scores is probably more expressive than we want. We really want something equivalent to a namedtuple; or rather, the Scorer may return a tuple if it is also annotated with the names (e.g. P, R, F, support) that the tuple entries correspond to.

I would prefer a dict, since you keep more meaningfully annotated numbers. And later on, you can easily postprocess the dict {"metrics": measure, ...}. I use this feature a lot to aggregate results, make plots, ...

Furthermore, it would be consistent with your gist, where you pass dicts of metrics.

> mean (or aggregate): I'm not sure what the best interface for this is.

numpy uses the mean function name without weights and the average function name with weights.
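
For concreteness, the NumPy convention being referred to (standard NumPy functions; the weights here stand in for, e.g., test samples per fold):

```python
import numpy as np

fold_scores = np.array([0.71, 0.74, 0.69])
n_test = np.array([100, 120, 95])            # e.g. test samples per fold

np.mean(fold_scores)                         # unweighted: "mean"
np.average(fold_scores, weights=n_test)      # weighted: "average"
```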

@jnothman
Member Author

In general, though, even though scipy documentation names return values, they are elements in tuples. Obviously these can be transformed into dicts and vice-versa. I'm not sure what you mean by "annotated number".

@GaelVaroquaux
Member

> In general, though, even though scipy documentation names return values, they are elements in tuples.

In the Python standard library, more and more functions return namedtuples. I think I would go for that.

@larsmans
Member

larsmans commented Jul 1, 2013

Coming to this from the angle of #2013, I must say I also strongly dislike the greater_is_better attribute. The API is just clumsy. I tried my "return (-1, loss) or (+1, score)" proposal, but I couldn't find an elegant way to implement the relevant code in GridSearchCV. I propose we at least do the following:

  • Return -loss from a loss function and document that. Anyone who knows basic machine learning should know that minimizing a loss function means maximizing minus that loss function; it's not pretty, but plenty of toolkits do it this way. It simplifies the API to the point where a Scorer is just a simple function returning a scalar to maximize (see the sketch after this list), and we can extend that later so a scorer can return a tuple, the first value of which is to be maximized.
  • Get rid of the public Scorer class while we still can. It's going to tie us down; as more options than needs_proba and needs_threshold get implemented, __call__ is going to get heavier and heavier. If we want to split the class into ThresholdScorer, ProbaScorer, etc., we're going to have to overload __new__ and do all kinds of trickery to keep things working.
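
A minimal sketch of the sign-flipping convention from the first point above (the wrapper name is made up; only mean_squared_error is a real scikit-learn function):

```python
# Sketch: a scorer is just a callable returning a scalar to *maximize*,
# so a loss is wrapped by negating it.
from sklearn.metrics import mean_squared_error

def neg_mse_scorer(estimator, X, y):
    return -mean_squared_error(y, estimator.predict(X))

# Search code can then always pick the candidate with the largest score,
# with no greater_is_better flag to consult.
```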

I'll submit a PR with the main ideas.

@amueller
Member

amueller commented Jul 1, 2013

I take all the blame for the greater_is_better. I wanted to move forward with the scorers after the long discussion.
This is one of the reasons I didn't want to release yet.
Can you elaborate on the __new__ problem? Why couldn't we just split the class?

@larsmans
Member

larsmans commented Jul 1, 2013

Because then you have a single public class Scorer with a documented constructor. Suppose you break off the decision_function handling into ThresholdScorer. Then the user calls Scorer with needs_threshold=True. Scorer.__init__ cannot decide to return a ThresholdScorer, since it cannot return a value at all; you need a factory. __new__ can be used to implement a factory, but its protocol is pretty hairy (call __init__ manually iff you're returning a custom type, call the super constructor with or without arguments, etc.).
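
For comparison, a plain factory function sidesteps the __new__ protocol entirely. A sketch with illustrative class names (a similar pattern is what a make_scorer-style helper amounts to):

```python
# Sketch: a function, not a public class, decides which private scorer type
# to build. No __new__ overloading needed.
class _PredictScorer:
    def __init__(self, score_func):
        self._score_func = score_func

    def __call__(self, estimator, X, y):
        return self._score_func(y, estimator.predict(X))


class _ThresholdScorer:
    def __init__(self, score_func):
        self._score_func = score_func

    def __call__(self, estimator, X, y):
        return self._score_func(y, estimator.decision_function(X))


def make_scorer(score_func, needs_threshold=False):
    cls = _ThresholdScorer if needs_threshold else _PredictScorer
    return cls(score_func)
```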

@amueller
Member

amueller commented Jul 1, 2013

Ok, so you mean providing backward-compatibility is a problem? We don't need to do that, right?

@larsmans
Member

larsmans commented Jul 1, 2013

Why not? This is a public API!

@amueller
Member

amueller commented Jul 1, 2013

But not included in any release.

@larsmans
Member

larsmans commented Jul 1, 2013

Oh, sure, I meant: if we do release it, we're stuck with it. Anyway, see #2123.

@amueller
Member

@jnothman did you tag this? I'm not sure we can come up with a good solution within the next two days. First, we should review what the interface was at the last release and what we have already changed.

@jnothman
Member Author

I don't think I did.

@ogrisel
Member

ogrisel commented Jul 28, 2013

To be able to do this refactoring in 0.15 and have time to discuss it, I have issued a pull request to rename cv_scores_ back to grid_scores_ so as to have 0.14 compatible with the 0.13.1 public API: #2299 .

I will retag this PR to 0.15 now.

dwiel referenced this issue in amueller/scikit-learn on Nov 28, 2013: "… is actually a number, not an array (otherwise cross_val_score returns bogus)."
@dwiel

dwiel commented Nov 28, 2013

Any chance of this getting brought back up? It is a feature I would definitely enjoy.

@jnothman
Member Author

Me too... in fact, it was more or less the first thing that brought me to contribute to scikit-learn, and it's still something I need regularly... but I'm not sure we've come to any conclusions about a clean representation within the scorer framework.

@dengemann
Contributor

Did this go anywhere? It would be really nice to pass a list of metrics to cross_val_score and get a list of scores in the same order, or a dict with metric names as keys.

@amueller
Member

Patience ;) This is one of the highest things on my never-ending priority list. But we need to merge the model_selection refactoring before that, and before that we have to do a release.

@amueller
Member

@raghavrv this is still open, right? Is there a PR?

@jona-sassenhagen
Contributor

I have a use case where it would be cool if cross_val_score could return one score per target for a multi-target estimator. (@dengemann EEG channels :D)

@RokoMijic

RokoMijic commented Mar 2, 2017

Any progress on this front? I am busy putting an explanation in a docstring for some code, telling the reader why I am re-implementing cross-validation rather than using scikit-learn.

> jnothman opened this issue on Apr 11, 2013

Are we going to send this issue to elementary school? It's going to be 4 years old soon! ;)
Anyway, just to add: scikit-learn is awesome and I'm really grateful for the hard work that people put into it!

@jnothman
Member Author

jnothman commented Mar 3, 2017 via email

@jnothman
Member Author

jnothman commented Jul 8, 2017

Fixed in #7388
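
For later readers, a short usage sketch of the multi-metric evaluation this enabled, assuming a recent scikit-learn in which sklearn.model_selection.cross_validate accepts several scoring metrics at once:

```python
# Cross-validation returning multiple scores per fold as a dict.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    scoring=["precision_macro", "recall_macro", "f1_macro"])

print(results["test_f1_macro"])          # one value per fold
print(results["test_precision_macro"])   # precision comes along "for free"
```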

@jnothman jnothman closed this as completed Jul 8, 2017