
WIP new, simpler scorer API #2123


Closed
wants to merge 2 commits

Conversation


@larsmans larsmans commented Jul 1, 2013

Here's a simpler scorer API. Sorry for being so late to the game, I was too busy to bother with the discussions. I hope no-one takes offense (esp. @amueller who worked hard on this). Ping @jnothman. Anyway, the highlights (a short sketch follows the list):

  • Scorers are just callables that take estimator, X, y, **kwargs.
  • No more greater_is_better. A scorer made from a loss function returns minus the loss.
  • No more Scorer class. There's a factory function make_scorer, but it produces objects of an internal class hierarchy that do their best to look like plain callables.
  • There's a currently unused scorer class for probabilistic classification, in anticipation of "MRG add log loss (cross-entropy loss) to metrics" (#2013).
  • Scorers can return tuples to report additional information beyond a simple score. The first element of such a tuple should be the score (see "Cross-validation returning multiple scores", #1850).
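
To make the contract concrete, here is a minimal sketch of a scorer under this proposal. It is illustrative only: the make_scorer signature (in particular the greater_is_better flag) is still being discussed below, and the import paths follow the scikit-learn module layout of the time.

from sklearn.cross_validation import cross_val_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, mean_squared_error, make_scorer

# A scorer is just a callable taking (estimator, X, y) and returning a
# number that should be maximized.
def my_f1_scorer(estimator, X, y):
    return f1_score(y, estimator.predict(X))

# The factory builds the same kind of callable from a metric; a loss is
# negated internally so that "greater is better" holds for every scorer.
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

X, y = make_classification(random_state=0)
print(cross_val_score(LogisticRegression(), X, y, scoring=my_f1_scorer))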

TODO:

  • Docs need copy-editing, see commit message.
  • Should make_scorer be public at all?

@@ -987,13 +987,11 @@ follows:
the ground truth target for ``X`` (in the supervised case) or ``None`` in the
unsupervised case.

- The call returns a number indicating the quality of estimator.
- It returns a pair (sign, score), where sign -1 means this is a score to
Member

This is not accurate, is it?

Member Author

No, this is still the old proposal. Sorry, I'll change it.


amueller commented Jul 1, 2013

The general approach looks good to me, I don't have a good overview of how it might interact with @jnothman's PRs, though. I'm kinda busy, so don't wait for my 👍 to go ahead. Btw, @ogrisel might also have an opinion, as the original interface was pretty much his idea.


jnothman commented Jul 1, 2013

I'm definitely +1 for getting rid of needs_threshold. It didn't make much sense checking needs_threshold on __call__ anyway when it was a property of the metric.

Was there a reason for not using -loss previously?

And as to my PRs, my only requirement is that there be a way to get more information than a single objective score (multiple class scores; P/R/F) back from the scorer. E.g. the scorer could optionally return a tuple of (objective_score, score_dict). You might then use the following, which I don't think is very pretty but does the job:

score = scorer(estimator, X, y)
try:
    score, score_data = score
except TypeError:
    score_data = {'score': score}

:s

@@ -85,5 +85,5 @@
'silhouette_samples',
'v_measure_score',
'zero_one_loss',
'Scorer',
'make_corer',
Member

missing an s

Member Author

Indeed. Funny that make test still passes.

Member

It shouldn't pass since #2033... Might want to check that out.

@jaquesgrobler

I know it's still WIP, but I thought I'd just mention this assertion failure from Travis in the meantime:

======================================================================
FAIL: sklearn.tests.test_cross_validation.test_cross_val_score_with_score_func_regression
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/tests/test_cross_validation.py", line 403, in test_cross_val_score_with_score_func_regression
    assert_array_almost_equal(mse_scores, expected_mse, 2)
  File "/usr/lib/python2.7/dist-packages/numpy/testing/utils.py", line 800, in assert_array_almost_equal
    header=('Arrays are not almost equal to %d decimals' % decimal))
  File "/usr/lib/python2.7/dist-packages/numpy/testing/utils.py", line 636, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 2 decimals
(mismatch 100.0%)
 x: array([ -763.0785884 ,  -553.16968926,  -274.38203955,  -273.26281775,
       -1681.99954932])
 y: array([  763.07,   553.16,   274.38,   273.26,  1681.99])
>>  raise AssertionError('\nArrays are not almost equal to 2 decimals\n\n(mismatch 100.0%)\n x: array([ -763.0785884 ,  -553.16968926,  -274.38203955,  -273.26281775,\n       -1681.99954932])\n y: array([  763.07,   553.16,   274.38,   273.26,  1681.99])')

That aside, looks good to me so far. 👍


larsmans commented Jul 2, 2013

Strange, make test passed on my box. But this is to be expected, of course.

@jaquesgrobler

Yeah passed on mine too when I checked this branch. Travis is annoying like that sometimes haha


amueller commented Jul 2, 2013

@jnothman the reason not to use -loss was that I didn't like returning negative mean squared errors to the user. That is just very counter-intuitive.


jnothman commented Jul 2, 2013

I guess I was asking whether there was more than that: I actually find it quite clear to report a negative loss that is climbed towards 0. At least, loss is strictly non-negative (as opposed to some ranking metrics), so it should be unambiguous. The confusion is that it's the opposite of what you get when you use the metric function directly.
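
To illustrate the convention under discussion (a sketch, not the PR's wrapper code): a scorer built from a loss just returns the negated loss, so every scorer is maximized and a perfect model scores 0.

from sklearn.metrics import mean_squared_error

def mse_scorer(estimator, X, y):
    # Negate the loss so that "greater is better" holds for every scorer;
    # a perfect fit scores 0, worse fits score more negative.
    return -mean_squared_error(y, estimator.predict(X))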


jnothman commented Jul 2, 2013

@larsmans, despite your fix for CV over MSE, there's a more substantial deprecation issue: If someone passes a loss_func (deprecated) to cross_val_score or GridSearchCV you actually need to negate the output for them. Might as well add to the DeprecationWarning where loss_func is passed that the equivalent scoring parameter will return a negated version of the result.

Or we could have scorers returning tuples so this doesn't happen.


larsmans commented Jul 2, 2013

@jnothman You mean an (mse, 'minimize') tuple? I'm not too fond of that. It's possible to screw that up by returning different values from different calls. Always maximize is a robust and stable strategy that requires very little documentation (my favorite metric of complexity). I'm willing to jump through the deprecation hoops on this one if we can really simplify the API.

Compare the situation to that in scipy.optimize. That has an "always minimize" strategy, which is so obvious that when you screw it up, you're so ashamed that you fix the bug without posting issues about it ;)


jnothman commented Jul 2, 2013

No, I mean a (-mse, mse) tuple, or (-mse, {'loss': mse}). Certainly not simple; but again I'd really like to see P, R and F being returned somehow from a scorer, and that will add complexity somewhere, it just needs to be minimal...

(And I didn't mean to imply that the deprecation was excessively imposing,
but that you'd need to add it to your PR)


larsmans commented Jul 3, 2013

Yes, I agree with the tuple return value. For PRF I'd say we return values of type

_PRF1Tuple = namedtuple("_PRF1Tuple", "f1 precision recall")

That's simpler than a (scalar, dict) pair.
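
A sketch of what such a scorer might look like (the _PRF1Tuple name and field order come from the comment above; prf_scorer and the average='weighted' choice are hypothetical). The first field is the objective, so anything that only wants a scalar can take element 0.

from collections import namedtuple
from sklearn.metrics import precision_recall_fscore_support

_PRF1Tuple = namedtuple("_PRF1Tuple", "f1 precision recall")

def prf_scorer(estimator, X, y):
    # precision_recall_fscore_support returns (p, r, f, support);
    # reorder so that f1, the objective, comes first in the namedtuple.
    p, r, f, _ = precision_recall_fscore_support(y, estimator.predict(X),
                                                 average="weighted")
    return _PRF1Tuple(f1=f, precision=p, recall=r)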


jnothman commented Jul 3, 2013

Yes, I see some advantages in the namedtuple. At one point I had considered a dict where the field 'score' was required, with a bare scalar interpreted as just that score. If it returns a namedtuple, we'll probably end up calling _asdict() given something like #2079 anyway, but I have no real objection to assuming the first element of a namedtuple is the objective.

(One further annoyance with the namedtuple for PRF is that the precision_recall_fscore_support function returns things in that order, not fscore first.)


larsmans commented Jul 3, 2013

We could name the scorer differently, like fscorer_with_precision_recall.

@GaelVaroquaux

I had a quick look and my general feeling is that I like this approach and think that it is an improvement over the current code.

The good news is that the current code wasn't in a released version, right? So we won't have to go through deprecation and users hating us.

def __init__(self, score_func, sign, kwargs):
    self._kwargs = kwargs
    self._score_func = score_func
    self._sign = sign
Member

I think that the name 'greater_is_better' is more explicit than 'sign'.

Member Author

It's actually going to be a sign ∈ {-1, 1}:

sign = 1 if greater_is_better else -1

(Line 160.) This is a private API, btw.
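
For context, a minimal sketch of how such a private class might apply the sign (names and bodies are illustrative, not the PR's actual code):

class _BaseScorer(object):
    def __init__(self, score_func, sign, kwargs):
        self._kwargs = kwargs
        self._score_func = score_func
        self._sign = sign

    def __call__(self, estimator, X, y):
        # The sign turns a loss into a score that can be maximized.
        return self._sign * self._score_func(y, estimator.predict(X),
                                             **self._kwargs)

def make_scorer(score_func, greater_is_better=True, **kwargs):
    # Simplified version of the factory discussed in this PR.
    sign = 1 if greater_is_better else -1
    return _BaseScorer(score_func, sign, kwargs)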


larsmans commented Jul 9, 2013

No scorer-related stuff was ever released, so we're safe.

@larsmans

Do we want to rename Scorer to Evaluator? I find Scorer rather ambiguous.

@larsmans

Rebased on master. I can haz reviews?

@jnothman

I don't particularly mind Evaluator... but most users won't see this name anyway.

Would you advocate changing the scoring param in *SearchCV and cross_val_score to evaluator or evaluation? Are those names less ambiguous?

Remember also that the default Scorer is estimator.__class__.score, so it's extending on an existing API naming convention.

@GaelVaroquaux

Do we want to rename Scorer to Evaluator?

Sounds very French to me (Lars, are we having a bad influence on you?).

I actually like Scorer as it relates well to 'cross_val_score', and the
score method.

@larsmans

Very well :)


SCORERS = dict(r2=r2_scorer, mse=mse_scorer, accuracy=accuracy_scorer,
               f1=f1_scorer, roc_auc=auc_scorer,
               f1=f_scorer, roc_auc=auc_scorer,
Contributor

Does this mean that the behaviour of f1 has changed? If I am not wrong, it was previously returning single values while it now returns tuples. Can this break user code in any way?

Contributor

(I am not sure this was shipped in 0.13.1, so my question might not be relevant)

Contributor

Actually, this also seems related to the errors reported by Travis in the tests.

Member Author

Nothing was ever shipped. And yes, this is causing the test failure, will look into that.

Member

Yes, this requires more changes to score averaging, and I think it'll be better in a separate PR...


glouppe commented Jul 16, 2013

This looks like a nice improvement!

Once my comments above regarding the changes in F1 scores are addressed, I am +1 for merge.


ogrisel commented Jul 16, 2013

One thing I liked about the previous API, which made needs_threshold and score_func public attributes of the scorer instance, is that it was possible to introspect the scoring object in a custom fit_grid_point implementation and combine several operations on the raw predictions, for instance from a third-party model evaluation tool such as hyperopt.

For instance I would like to have grid search fit_grid_point calls be able to:

  • record the raw prediction time (the wall-clock duration of the call to predict / predict_proba / decision_function)
  • dump the raw predictions (or predict_proba output) to disk (for manual inspection of common classification mistakes, or to implement ensemble strategies such as Caruana et al., 2004)
  • compute ROC-AUC, PR-AUC, F1, precision and recall at the same time (for binary classification, for instance) without calling predict_proba twice on the fitted model

The new public API hides too many implementation details to be able to implement this efficiently from a third party tool like hyperopt or even to implement this as options in the *SearchCV using the public scoring API.

@larsmans

You can do all those things with custom scorers. They're allowed to return tuples holding arbitrary data, and they're still arbitrary Python callables. Partial example:

from collections import namedtuple
from time import time
from sklearn.metrics import f1_score, precision_score, recall_score

Results = namedtuple("Results", "f1 precision recall time")

def super_scorer(estimator, X, y_true):
    # Time the prediction, then compute several metrics from the same
    # predictions; f1 comes first, so it serves as the objective score.
    t0 = time()
    y_pred = estimator.predict(X)
    t = time() - t0
    return Results(f1_score(y_true, y_pred), precision_score(y_true, y_pred),
                   recall_score(y_true, y_pred), t)

@@ -392,6 +401,33 @@ def __init__(self, estimator, scoring=None, loss_func=None,
self.pre_dispatch = pre_dispatch
self._check_estimator()

def report(self, file=None):
Member

I'm not convinced by the format of this. Do we really need a report function that's little different from pprint(search.cv_scores_)?

Member

Or, indeed which is identical to print(*search.cv_scores_, file=file, sep='\n')?

Member

I think it would be much more useful to output something like a CSV, but that requires interpreting the data more.

Member Author

It's a proof of concept. I wanted to make clear in some way that just print(cv_scores_) doesn't give all the information. If you know a better solution (e.g. document the pprint trick?), I'm up for suggestions.

Member

I'm not convinced by the format of this. Do we really need a report function
that's little different from pprint(search.cv_scores_)?

In the long run, we might want such features, but in the short run, I'd
rather avoid.

Member

e.g. document the pprint trick?)

I think that teaching people to use pprint is a good idea.

Member

I don't think pprint is wonderful either. AFAIK it only knows about the basic standard collections (list, tuple, dict) and reprs everything else, including namedtuples, defaultdicts, arrays, etc., on the basis that its output should be eval()-able (except that most repr implementations don't support that).


ogrisel commented Jul 17, 2013

You can do all those things with custom scorers. They're allowed to return tuples holding arbitrary data, and they're still arbitrary Python callables.

Also I forgot one very important use case: compute the same scores on the training split of the data to be able to detect underfitting and overfitting at the same time.

But the meta custom scorer would have to know (by hardcoding) whether the sub-scorers "need thresholds", whether greater is better, and so on.

Ideally I would like to have the default GridSearchCV compute the scores on both the train and test splits at the same time.

I think it would be neat if it were possible to have GridSearchCV(model, param_grid, cv=5, scoring=('roc_auc', 'pr_auc', 'f1', 'precision', 'recall')) supported by the default sklearn tools, without having to write a custom scorer class each time.

For multi-class classification problems, one would also like to collect the confusion matrix and/or the detailed per-class P/R/F1 prior to multiclass averaging.


ogrisel commented Jul 17, 2013

Also, making the scorer responsible for dumping the raw predicted probabilities to the hard drive sounds like a design error (separation of concerns).

Whether score_func requires predict_proba to get probability estimates
out of a classifier.

needs_threshold : boolean, default=False
Member

I feel like having two boolean parameters is a bit confusing. What is the difference between needs_proba=True, needs_threshold=False and needs_proba=True, needs_threshold=True?

Member

Perhaps more explicit would be something like: default input_type='labels' vs input_type='binary_threshold' and input_type='class_proba'.

Member Author

Good point, except that "labels" doesn't cover the regression use case. Also, whether the decision function must give 1-d results is not this function's responsibility; the metric will just have to raise an exception. I suggest prediction_method.

Member Author

No wait, that doesn't really replace needs_threshold because that backs off to predict_proba when there's no decision_function.

Member

Also, ThresholdScorer slices [:, 1] in that case where ProbaScorer does not.

Member Author

The other option is to keep it this way and raise an exception when both are true...

Member Author

I'm just going to raise the exception. Rebase in progress.
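
For readers following the thread, here is a rough sketch of the distinction discussed above; the class names echo the ones mentioned in the comments, but the bodies are illustrative, not the PR's actual code.

class ProbaScorer(object):
    # Metrics such as log loss consume the full probability estimates.
    def __init__(self, score_func, sign, kwargs):
        self._score_func, self._sign, self._kwargs = score_func, sign, kwargs

    def __call__(self, clf, X, y):
        return self._sign * self._score_func(y, clf.predict_proba(X),
                                             **self._kwargs)

class ThresholdScorer(object):
    # Ranking metrics such as ROC-AUC consume a 1-d decision value; back
    # off to predict_proba and slice out the positive class if the
    # estimator has no decision_function.
    def __init__(self, score_func, sign, kwargs):
        self._score_func, self._sign, self._kwargs = score_func, sign, kwargs

    def __call__(self, clf, X, y):
        try:
            y_score = clf.decision_function(X)
        except (AttributeError, NotImplementedError):
            y_score = clf.predict_proba(X)[:, 1]
        return self._sign * self._score_func(y, y_score, **self._kwargs)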

@amueller

I'm not sure if I'm happy with the scope of the PR.
Could you explain in simple words what it does?

@amueller

I just realized the issue message actually is pretty descriptive, but it is not entirely clear which changes are for the last point and which are for the rest.

@jnothman

I'm not sure if I'm happy with the scope of the PR.

I have suggested the handling of multiple metrics should be a separate PR (as much as I am anxious to have the feature, I want it to be given some thought), and @ogrisel seemed to concur, as we need to consider its interaction with other extensibility in grid search output.

@amueller

I agree. We should move forward with the scorer refactoring. I plan to work on the multiple-metrics feature in the next couple of days. We want to release soon, and we should get stuff into a state that is future-ready but doesn't make changes that we might regret afterwards (like using named tuples ;)

Objects that meet those conditions as said to implement the sklearn Scorer
protocol.
Callables that meet those conditions as said to implement the scikit-learn
Scorer protocol.
Member

not capitalized maybe? or maybe scoring?

Member Author

Removed caps.


ogrisel commented Jul 23, 2013

Global +1 on @jnothman and @amueller's latest batch of comments.

@larsmans

@amueller @ogrisel @jnothman The scope of the PR is to simplify the scorers so that they are just functions. There are currently only two commits, the first of which establishes a simple API, the second exemplifies how it can be extended. If we want something that works right now, we can just pull the first commit and leave the structured score stuff for later.

As for @ogrisel's requirements of passing lots of information around and the complaint about separation of concerns: that's independent of the exact API. When you adorn scorers with all kinds of attributes to store generic information about the CV procedure, you're not doing proper separation of concerns.

@@ -673,8 +673,6 @@ Model Selection Interface
:toctree: generated/
:template: class_with_call.rst
Member

I think make_scorer should be public api, so it should be in the classes.rst

Member Author

It was intended to be, will add it.

Member

Small remark: when in a rush, I prioritize PRs that are 'MRG' rather than 'WIP', so I missed this one.

@amueller

+1 for merging the first commit.

@@ -29,18 +29,16 @@
'vect__max_features': (None, 5000, 10000, 50000)}
done in 1737.030s

Best score: 0.940
Best score: 0.923
Member

Why did that change?

Member Author

I changed the scoring from accuracy (the default) to F1 score, to demo and test the structured return values from f_scorer, and the F1 score here is lower than the accuracy. This is also why the best parameter set changed.

@amueller

the function is still missing from the references.

@GaelVaroquaux

the function is still missing from the references.

I was about to point it out :)

@larsmans

Yes, and the description of the scorer protocol mentioned tuples while it shouldn't for now. Will force-push a new version when I'm confident the tests pass on current master.

@larsmans

Ok, pushed the changes. The test failures are in the second commit.

larsmans added 2 commits July 24, 2013 14:18
A Scorer is now a function that returns a score that should be maximized.
Added a report method to GridSearchCV to use it.
@larsmans

Ok, pushed the first commit to master. I'm closing this PR because we'll have to rethink or at least discuss the structured score stuff.

@larsmans larsmans closed this Jul 24, 2013
@amueller

Thanks a lot @larsmans :)

@larsmans

Thank you for setting this up, Andy. This API turns out to be exactly what I needed to do better optimization :)
