
[WIP] Multiple-metric grid search #2759


Closed
mblondel wants to merge 57 commits

Conversation

mblondel
Member

This PR brings multiple-metric grid search. This is important for finding the best-tuned estimator on a per-metric basis without redoing the grid / randomized search from scratch for each metric.

Highlights:

  • No public API has been broken.
  • Some parts are actually simplified.
  • Updates cross_val_score to support a list as the scoring parameter. In this case, a 2d array of shape (n_scoring, n_folds) is returned instead of a 1d array of shape (n_folds,); a runnable sketch follows the description below.
  • Adds multiple metric support to GridSearchCV and RandomizedSearchCV.
  • Introduces _evaluate_scorers for computing several scorers without recomputing the predictions every time.
  • Fixes #2588 (make_scorer needs_threshold makes wrong assumptions): regressors can now be used with metrics that require needs_threshold=True.

Tagging this PR with the v0.15 milestone.
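
For illustration, here is a minimal runnable sketch of the multi-metric evaluation idea written against cross_validate from a current scikit-learn release (the estimator and metric names are arbitrary examples); note that this PR instead proposed a single (n_scoring, n_folds) array returned from cross_val_score.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=200, random_state=0)

# Evaluate several metrics in a single cross-validation pass.
scores = cross_validate(
    LogisticRegression(max_iter=1000), X, y,
    scoring=["accuracy", "roc_auc"],
    cv=5,
)

# One (n_folds,) array per metric, rather than the single
# (n_scoring, n_folds) array proposed in this PR.
print(scores["test_accuracy"])
print(scores["test_roc_auc"])
```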

AlexanderFabisch and others added 30 commits January 9, 2014 09:18

Merge conflicts resolved in:
	sklearn/cross_validation.py
	sklearn/feature_selection/rfe.py
	sklearn/grid_search.py
	sklearn/learning_curve.py
	sklearn/metrics/scorer.py
	sklearn/metrics/tests/test_score_objects.py
	sklearn/tests/test_grid_search.py

Commit note: we are not supposed to use `parameters` outside of the loop, and it makes the code very difficult to read.
@jnothman
Member

@amueller wrote:

What is the rationale behind returning a 2d array? I think I would have preferred a dict.
What are the dict keys when the scorers aren't entered as strings?

I'd really like to see this happen. I'd happily attempt to complete the PR. What do you consider to still be lacking, @mblondel?

I am however a little concerned that any code that attempts efficient calculation of multiple scorers (with the current definition of scorer) is going to be frameworkish to a great extent, and hence will be difficult to get merged. Is there some way to limit this?

@mblondel
Member Author

I am no longer working on this and would be glad if somebody could take over.

@jnothman
Member

Note: this PR fixes #1850.

@jnothman
Member

I still consider this feature sorely missing and of high priority. And seeing as it was possible to get multiple metrics back prior to the advent of scorers (because there was no output checking on score_func), it is really fixing a functional regression.

Obviously a lot of the codebase has changed since this PR was launched and much of the work yet to be done is transferring changes onto moved code. I do wonder whether there's a way to get it merged piece by piece in an agile way, or whether we just need a monolithic PR at the risk of growing stale again. Certainly, some of the auxiliary changes to validation_curve should be left for another time; and if there are any concerns, we can make do with a simplified _evaluate_scorers in which prediction is repeated for each scorer.

I think @mblondel has made some reasonable API decisions here, but we should decide on the following:

I think the only substantial question is whether scores should be a dict {scorer_name: score_data_structure} (for all of cross_val_score, validation_curve, *SearchCV.grid_scores_, *SearchCV.best_score_) or a list/array. I think a list, as presently implemented, is straightforward enough to understand and code with, but that code may not be as legible as with a dict. If we choose output as a dict, when scorers are custom, will the dict key just be a scorer object / function, or will it be possible to specify scoring as a dict {name: scorer}? I also anticipate a grid_scores_to_dataframe function in sklearn-pandas in any case!

The other issue is refit, and whether multiple best_estimator_ values can be refit. I think no. IMO, if multiple scoring values are specified, refit=True should be illegal, requiring refit=0 (indexing into a scoring list) or refit='mse'. I think the latter is my preference initially.
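
To make the dict option concrete, here is a sketch written against the scoring/refit interface of a current scikit-learn release (the estimator, parameter grid and metric names are arbitrary); at the time of this discussion, the {name: scorer} form and the refit-by-name semantics were still open questions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10]},
    scoring={"acc": "accuracy", "auc": "roc_auc"},  # the {name: scorer} form
    refit="acc",  # refit a single best_estimator_ for one named metric
    cv=3,
)
search.fit(X, y)

# Results are keyed per metric name, e.g. mean_test_acc and mean_test_auc.
print(search.cv_results_["mean_test_acc"])
print(search.best_score_)  # best mean score for the refit metric only
```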

@amueller and others, do you have an opinion on these API issues?

Apart from those things, what remains to be done appears to be: moving the changes to the current codebase; ensuring test coverage; documentation; and an example or two.

@maniteja123
Contributor

Hi everyone, it seems that the scorer API has changed a lot since this PR was started. I have tried to follow the discussions in issue #1850 and pull request #2123 and would like to ask your opinion on working on this enhancement. I understand that this touches a lot of API and there needs to be a strong decision from the core devs. I would like to know if there is a possibility for a newbie to work on this? Thanks.

@jnothman
Member

I don't think the scorer API has changed much, no.

@maniteja123
Contributor

Oh, really sorry. I have been going through all the related PRs for some time and confused it with the changes in grid_search, cross_val_score and the other places which use the scorers as arguments and will probably be affected here. I just wanted to ask if it would be advisable to work on this feature. Thanks again.

@jnothman
Member

I think it would be a decent thing to work on, if you're comfortable with that part of the code base.

@maniteja123
Contributor

I have looked into the main aim of this PR, coming from here and the discussion at #1850, but if it is okay, could I ask about the decisions regarding:

  • the type of scoring as a list/tuple or a dict
  • the behaviour of the refit option
  • _evaluate_scorers seems to have been agreed upon.
  • this PR seems to refactor the code into a common Scorer rather than the PredictScorer, ProbaScorer and ThresholdScorer of the current implementation; which is preferred?
  • What else, other than cross_val_score, validation_curve, *SearchCV.grid_scores_ and *SearchCV.best_score_, could support this feature?
  • Also, as a side note, though #3524 is not related to this, could you also comment on the status of adding sample_weight and scorer_params to cross_val_score, GridSearchCV and the others mentioned in #1574?

Sorry for asking so many questions. I am not that aware of the practical use cases and am trying to get an idea based on the discussions here. Thanks again for patiently answering.

@maniteja123
Contributor

Sorry @rvraghav93, I didn't know that you intended to work on this issue. I just thought it was challenging and also important; that's why I looked at it.

@raghavrv
Member

Yes. @amueller and I had a discussion on this. I've also let @GaelVaroquaux know that I will be working on this after my Tree PRs.

@jnothman
Member

yay for this happening!


@jnothman
Member

Something that we really need to support in this work (ping @rvraghav93 if you see yourself as doing this) is returning per-class performance for multilabel and multiclass problems without making a scorer for each class. At first glance I frankly have no idea how to do this neatly.

@raghavrv
Member

raghavrv commented Aug 8, 2016

To make myself clearer, we have two problems to face here:

  • Returning per-class/per-label performance for multiclass/multilabel problems
  • Returning per-metric performance for multi-metric evaluation.

What will we do for the case of multi-metric multi-label?

Can I suggest that we have the notation of *_*_metric_cls0 replacing the *_*_score to handle both use cases?

Either that, or we have to agree on using a list of arrays for a key *_*_metric, with each element of the array denoting the performance of that class?

@raghavrv
Member

raghavrv commented Aug 8, 2016

Can I suggest that we have the notation of *_*_metric_cls0 replacing the *_*_score to handle both use cases?

Even with this approach we might have to be okay with a list of arrays for a key *_*_metric_cls0, as the performance for cls0 may still not be a single value, depending on the metric. Correct?

In which case, can we allow a list of 2d arrays itself? But how do we rank them in that case?
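
Purely for illustration, the two alternatives sketched above might produce result structures like the following for a three-class problem scored with precision; the key names and values are hypothetical, not an actual API.

```python
# Option 1: one flat key per (metric, class) pair, one value per candidate.
per_class_keys = {
    "mean_test_precision_cls0": [0.81, 0.79, 0.84],
    "mean_test_precision_cls1": [0.66, 0.70, 0.68],
    "mean_test_precision_cls2": [0.90, 0.88, 0.91],
}

# Option 2: one key per metric, each entry holding a per-class array,
# which makes ranking candidates by this key ambiguous.
per_metric_key = {
    "mean_test_precision": [
        [0.81, 0.66, 0.90],  # candidate 0
        [0.79, 0.70, 0.88],  # candidate 1
        [0.84, 0.68, 0.91],  # candidate 2
    ],
}
```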

@jnothman
Member

jnothman commented Aug 8, 2016

With the implemented metrics, performance for any particular class for any particular metric is currently a scalar. (However, precision_recall_fscore_support(average=None) returns a tuple of arrays, such that there is 1 scalar per tuple element!) I think at this point we should maintain the promise that each key maps to an array of shape (n_candidates,), and choose an appropriate schema for munging the path to a particular value into a string. It would be nice if, when the classes are named, those names turned up in the metric string, but that may be going too far.

But we should first worry about implementing multiple-metric scoring where each metric returns one value. If need be (with a runtime and API-specification cost), class-wise metrics can all be made to return a single scalar, just by wrapping the scorer in lambda *args, **kwargs: scorer(*args, **kwargs)[idx].
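
A minimal runnable sketch of that wrapping trick, assuming a per-class precision metric and using functools.partial instead of a lambda to avoid late-binding surprises; the helper name, class count and use of cross_validate from current scikit-learn are illustrative choices, not part of this PR.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import cross_validate


def class_precision(estimator, X, y, idx):
    # precision_score(average=None) returns one value per class; keep one.
    return precision_score(y, estimator.predict(X), average=None)[idx]


X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                           random_state=0)

# One single-valued scorer per class index.
scoring = {f"precision_cls{i}": partial(class_precision, idx=i)
           for i in range(3)}

scores = cross_validate(LogisticRegression(max_iter=1000), X, y,
                        scoring=scoring, cv=3)
print(scores["test_precision_cls0"])
```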

@raghavrv
Member

@jnothman Thanks for the comment. I'll raise a PR soon. (Hopefully by this weekend while my other PRs are pending for review/comments ;( )

@jnothman
Member

jnothman commented Jul 8, 2017

Fixed in #7388

jnothman closed this Jul 8, 2017