[WIP] Multiple-metric grid search #2759
Conversation
We are not supposed to use `parameters` outside of the loop. And this makes the code very difficult to read.
@amueller wrote:
I'd really like to see this happen. I'd happily attempt to complete the PR. What do you consider to still be lacking, @mblondel? I am however a little concerned that any code that attempts efficient calculation of multiple scorers (with the current definition of scorer) is going to be frameworkish to a great extent, and hence will be difficult to get merged. Is there some way to limit this?
I am no longer working on this and would be glad if somebody could take over.
Note: this PR fixes #1850.
I still consider this feature sorely missing and of high priority, especially seeing as it was possible to get multiple metrics back prior to the advent of scorers (there was no output checking on the score function's return value back then).

Obviously a lot of the codebase has changed since this PR was launched, and much of the work yet to be done is transferring changes onto moved code. I do wonder whether there's a way to get it merged piece by piece in an agile way, or whether we just need a monolithic PR at the risk of growing stale again. Certainly, some of the auxiliary changes could be merged separately.

I think @mblondel has made some reasonable API decisions here, but we should decide on the following. The only substantial question, I think, is whether scores should be a dict {scorer_name: score_data_structure} for all of the reported results, or some other structure; the other issue is a more minor one. @amueller and others, do you have an opinion on these API issues?

Apart from those things, what remains to be done appears to be: moving the changes to the current codebase; ensuring test coverage; documentation; and an example or two.
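For illustration only (the names and values below are hypothetical, not part of this PR's API), results keyed by scorer name could be as simple as:

```python
import numpy as np

# Hypothetical sketch: per-scorer results keyed by scorer name, one value
# per CV fold. Nothing here reflects the PR's actual data structures.
grid_point_scores = {
    "accuracy": np.array([0.91, 0.89, 0.93]),
    "roc_auc": np.array([0.95, 0.94, 0.97]),
}
best_score_per_metric = {name: folds.mean() for name, folds in grid_point_scores.items()}
```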
Hi everyone, it seems that the scorer API has changed a lot since this PR was started. I have tried to follow the discussions in issue #1850 and PR #2123, and would like to ask your opinion on working on this enhancement. I understand that this touches a lot of API and needs a strong decision from the core devs. I would like to know if there is a possibility for a newbie to work on this? Thanks.
I don't think the scorer API has changed much, no.
Oh, really sorry, I have been going through all the related PRs for some time and confused it with changes made elsewhere.
I think it would be a decent thing to work on, if you're comfortable with that part of the code base.
I have looked into the main aim of this PR, coming from here and the discussion at #1850, but if it is okay, could I ask about the decisions regarding:
Sorry for asking so many questions. I am not that aware of the practical use cases and am trying to get an idea based on the discussions here. Thanks again for patiently answering.
Sorry @rvraghav93, I didn't know that you intended to work on this issue. I just thought it was challenging and also important; that's why I looked at it.
Yes. @amueller and I had a discussion on this. I've also let @GaelVaroquaux know that I will be working on this after my Tree PRs.
Yay for this happening!
Something that we really need to support in this work (ping @rvraghav93 if you see yourself as doing this) is returning per-class performance for multilabel and multiclass problems without making a scorer for each class. At first glance I frankly have no idea how to do this neatly.
To make myself clearer, do we have two problems to face here?
What will we do for the case of multi-metric multi-label? Can I suggest that we settle on a notation for this? Either that, or we have to agree on using a list of arrays for a key of the results.
Even with this approach, we might have to be okay with a list of arrays for a key. In which case, can we allow a list of 2d arrays itself? But how do we rank them in that case?
With the implemented metrics, performance for any particular class for any particular metric is currently a scalar. But we should first worry about implementing multiple-metric scoring where each metric returns one value. If need be (with a runtime and API specification cost), class-wise metrics can all be specified to return a single scalar, just by wrapping the scorer appropriately.
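As a minimal sketch of that wrapping idea (using the existing `make_scorer` API, with `f1_score` and the label values chosen purely for illustration; this is not code from the PR):

```python
from sklearn.metrics import f1_score, make_scorer

# Restrict the metric to a single label and macro-average over that one
# label, so each scorer returns a scalar for its class alone.
per_class_f1 = {
    "f1_class_%d" % label: make_scorer(f1_score, labels=[label], average="macro")
    for label in (0, 1, 2)
}
```

Each of these could then be passed around as an ordinary single-value scorer, at the cost of recomputing predictions once per class.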
@jnothman Thanks for the comment. I'll raise a PR soon. (Hopefully by this weekend, while my other PRs are pending review/comments ;( )
Fixed in #7388
This PR brings multiple-metric grid search. This is important for finding the best-tuned estimator on a per-metric basis without redoing the grid / randomized search from scratch for each metric.

Highlights:

- Refactor `cross_val_score` so as to support lists as the `scoring` parameter. In this case, a 2d array of shape `(n_scoring, n_folds)` is returned instead of a 1d array of shape `(n_folds,)`.
- The same `scoring` lists are supported in `GridSearchCV` and `RandomizedSearchCV`.
- Add `_evaluate_scorers` for computing several scorers without recomputing the predictions every time, including scorers with `needs_threshold=True`.

Tagging this PR with the v0.15 milestone.
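As a usage sketch of the behaviour described above (this reflects the API proposed in this work-in-progress PR and the module layout of that era, not the interface that was eventually released):

```python
# Sketch only: in released scikit-learn, `scoring` does not accept a list here.
from sklearn.cross_validation import cross_val_score  # pre-model_selection layout
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)

# Proposed behaviour: passing a list for `scoring` yields a 2d array of
# shape (n_scoring, n_folds), here (2, 5), instead of the usual (n_folds,).
scores = cross_val_score(LogisticRegression(), X, y,
                         scoring=["accuracy", "roc_auc"], cv=5)
```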