
WIP new, simpler scorer API #2123


Closed
wants to merge 2 commits

Conversation


@larsmans larsmans commented Jul 1, 2013

Here's a simpler scorer API. Sorry for being so late to the game, I was too busy to bother with the discussions. I hope no-one takes offense (esp. @amueller who worked hard on this). Ping @jnothman. Anyway, the highlights (a short sketch follows the list):

  • Scorers are just callables that take estimator, X, y, **kwargs.
  • No more greater_is_better. A scorer made from a loss function returns minus the loss.
  • No more Scorer class. There's a factory function make_scorer, but it produces objects of an internal class hierarchy that do their best to look like plain callables.
  • There's a currently unused scorer class for probabilistic classification, in anticipation of "MRG add log loss (cross-entropy loss) to metrics" (#2013).
  • Scorers can return tuples to report additional information beyond a simple score. The first element of such a tuple should be the score (see "Cross-validation returning multiple scores", #1850).
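
To make the contract concrete, here is a minimal sketch of a scorer under this proposal. It is illustrative only: the make_scorer signature (in particular the greater_is_better flag) is still being discussed below, and the import paths follow the scikit-learn module layout of the time.

from sklearn.cross_validation import cross_val_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, mean_squared_error, make_scorer

# A scorer is just a callable taking (estimator, X, y) and returning a
# number that should be maximized.
def my_f1_scorer(estimator, X, y):
    return f1_score(y, estimator.predict(X))

# The factory builds the same kind of callable from a metric; a loss is
# negated internally so that "greater is better" holds for every scorer.
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

X, y = make_classification(random_state=0)
print(cross_val_score(LogisticRegression(), X, y, scoring=my_f1_scorer))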

TODO:

  • Docs need copy-editing, see commit message.
  • Should make_scorer be public at all?

@@ -987,13 +987,11 @@ follows:
the ground truth target for ``X`` (in the supervised case) or ``None`` in the
unsupervised case.

- The call returns a number indicating the quality of estimator.
- It returns a pair (sign, score), where sign -1 means this is a score to
Member

This is not accurate, is it?

Member Author

No, this is still the old proposal. Sorry, I'll change it.


amueller commented Jul 1, 2013

The general approach looks good to me, I don't have a good overview of how it might interact with @jnothman's PRs, though. I'm kinda busy, so don't wait for my 👍 to go ahead. Btw, @ogrisel might also have an opinion, as the original interface was pretty much his idea.


jnothman commented Jul 1, 2013

I'm definitely +1 for getting rid of needs_threshold. It didn't make much sense checking needs_threshold on __call__ anyway when it was a property of the metric.

Was there a reason for not using -loss previously?

And as to my PRs, my only requirement is that there be a way to get more information than a single objective score (multiple class scores; P/R/F) back from the scorer. E.g. the scorer could optionally return a tuple of (objective_score, score_dict). You might then use the following, which I don't think is very pretty but does the job:

score = scorer(estimator, X, y)
try:
    score, score_data = score
except TypeError:
    score_data = {'score': score}

:s

@@ -85,5 +85,5 @@
'silhouette_samples',
'v_measure_score',
'zero_one_loss',
'Scorer',
'make_corer',
Member

missing an s

Member Author

Indeed. Funny that make test still passes.

Member

It shouldn't pass since #2033... Might want to check that out.

@jaquesgrobler

I know it's still WIP, but I thought I'd just mention this assertion failure from Travis in the meantime:

======================================================================
FAIL: sklearn.tests.test_cross_validation.test_cross_val_score_with_score_func_regression
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/tests/test_cross_validation.py", line 403, in test_cross_val_score_with_score_func_regression
    assert_array_almost_equal(mse_scores, expected_mse, 2)
  File "/usr/lib/python2.7/dist-packages/numpy/testing/utils.py", line 800, in assert_array_almost_equal
    header=('Arrays are not almost equal to %d decimals' % decimal))
  File "/usr/lib/python2.7/dist-packages/numpy/testing/utils.py", line 636, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 2 decimals
(mismatch 100.0%)
 x: array([ -763.0785884 ,  -553.16968926,  -274.38203955,  -273.26281775,
       -1681.99954932])
 y: array([  763.07,   553.16,   274.38,   273.26,  1681.99])
>>  raise AssertionError('\nArrays are not almost equal to 2 decimals\n\n(mismatch 100.0%)\n x: array([ -763.0785884 ,  -553.16968926,  -274.38203955,  -273.26281775,\n       -1681.99954932])\n y: array([  763.07,   553.16,   274.38,   273.26,  1681.99])')

That aside, looks good to me so far. 👍


larsmans commented Jul 2, 2013

Strange, make test passed on my box. But this is to be expected, of course.

@jaquesgrobler

Yeah passed on mine too when I checked this branch. Travis is annoying like that sometimes haha


amueller commented Jul 2, 2013

@jnothman the reason not to use -loss was that I didn't like returning negative mean squared errors to the user. That is just very counter-intuitive.


jnothman commented Jul 2, 2013

I guess I was asking whether there was more than that: I actually find it quite clear to report a negative loss that is climbed towards 0. At least, loss is strictly non-negative (as opposed to some ranking metrics), so it should be unambiguous. The confusion is that it's the opposite of what you get when you use the metric function directly.
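
To illustrate the convention under discussion (a sketch, not the PR's wrapper code): a scorer built from a loss just returns the negated loss, so every scorer is maximized and a perfect model scores 0.

from sklearn.metrics import mean_squared_error

def mse_scorer(estimator, X, y):
    # Negate the loss so that "greater is better" holds for every scorer;
    # a perfect fit scores 0, worse fits score more negative.
    return -mean_squared_error(y, estimator.predict(X))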


jnothman commented Jul 2, 2013

@larsmans, despite your fix for CV over MSE, there's a more substantial deprecation issue: If someone passes a loss_func (deprecated) to cross_val_score or GridSearchCV you actually need to negate the output for them. Might as well add to the DeprecationWarning where loss_func is passed that the equivalent scoring parameter will return a negated version of the result.

Or we could have scorers returning tuples so this doesn't happen.


larsmans commented Jul 2, 2013

@jnothman You mean an (mse, 'minimize') tuple? I'm not too fond of that. It's possible to screw that up by returning different values from different calls. Always maximize is a robust and stable strategy that requires very little documentation (my favorite metric of complexity). I'm willing to jump through the deprecation hoops on this one if we can really simplify the API.

Compare the situation to that in scipy.optimize. That has an "always minimize" strategy, which is so obvious that when you screw it up, you're so ashamed that you fix the bug without posting issues about it ;)


jnothman commented Jul 2, 2013

No, I mean a (-mse, mse) tuple, or (-mse, {'loss': mse}). Certainly not simple; but again I'd really like to see P, R and F being returned somehow from a scorer, and that will add complexity somewhere, it just needs to be minimal...

(And I didn't mean to imply that the deprecation was excessively imposing,
but that you'd need to add it to your PR)


larsmans commented Jul 3, 2013

Yes, I agree with the tuple return value. For PRF I'd say we return values of type

_PRF1Tuple = namedtuple("_PRF1Tuple", "f1 precision recall")

That's simpler than a (scalar, dict) pair.
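
A sketch of what such a scorer might look like (the _PRF1Tuple name and field order come from the comment above; prf_scorer and the average='weighted' choice are hypothetical). The first field is the objective, so anything that only wants a scalar can take element 0.

from collections import namedtuple
from sklearn.metrics import precision_recall_fscore_support

_PRF1Tuple = namedtuple("_PRF1Tuple", "f1 precision recall")

def prf_scorer(estimator, X, y):
    # precision_recall_fscore_support returns (p, r, f, support);
    # reorder so that f1, the objective, comes first in the namedtuple.
    p, r, f, _ = precision_recall_fscore_support(y, estimator.predict(X),
                                                 average="weighted")
    return _PRF1Tuple(f1=f, precision=p, recall=r)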


jnothman commented Jul 3, 2013

Yes, I see some advantages in the namedtuple. At one point I had considered a dict where the field 'score' was required, with a bare scalar interpreted as just that score. If it returns a namedtuple, we'll probably end up calling _asdict() given something like #2079 anyway, but I have no real objection to assuming the first element of a namedtuple is the objective.

(One further annoyance with the namedtuple for PRF is that the precision_recall_fscore_support function returns things in that order, not fscore first.)


larsmans commented Jul 3, 2013

We could name the scorer differently, like fscorer_with_precision_recall.

@GaelVaroquaux

I had a quick look and my general feeling is that I like this approach and think that it is an improvement over the current code.

The good news is that the current code wasn't in a released version, right? So we won't have to go through deprecation and users hating us.

def __init__(self, score_func, sign, kwargs):
    self._kwargs = kwargs
    self._score_func = score_func
    self._sign = sign
Member

I think that the name 'greater_is_better' is more explicit than 'sign'.

Member Author

It's actually going to be a sign ∈ {-1, 1}:

sign = 1 if greater_is_better else -1

(Line 160.) This is a private API, btw.
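
For context, a minimal sketch of how such a private class might apply the sign (names and bodies are illustrative, not the PR's actual code):

class _BaseScorer(object):
    def __init__(self, score_func, sign, kwargs):
        self._kwargs = kwargs
        self._score_func = score_func
        self._sign = sign

    def __call__(self, estimator, X, y):
        # The sign turns a loss into a score that can be maximized.
        return self._sign * self._score_func(y, estimator.predict(X),
                                             **self._kwargs)

def make_scorer(score_func, greater_is_better=True, **kwargs):
    # Simplified version of the factory discussed in this PR.
    sign = 1 if greater_is_better else -1
    return _BaseScorer(score_func, sign, kwargs)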


larsmans commented Jul 9, 2013

No scorer-related stuff was ever released, so we're safe.

@larsmans

Do we want to rename Scorer to Evaluator? I find Scorer rather ambiguous.

@larsmans

Rebased on master. I can haz reviews?

@jnothman

I don't particularly mind Evaluator... but most users won't see this name anyway.

Would you advocate changing the scoring param in *SearchCV and cross_val_score to evaluator or evaluation? Are those names less ambiguous?

Remember also that the default Scorer is estimator.__class__.score, so it's extending on an existing API naming convention.

@GaelVaroquaux

Do we want to rename Scorer to Evaluator?

Sounds very French to me (Lars, are we having a bad influence on you?).

I actually like Scorer as it relates well to 'cross_val_score', and the
score method.

@larsmans

Very well :)


SCORERS = dict(r2=r2_scorer, mse=mse_scorer, accuracy=accuracy_scorer,
               f1=f1_scorer, roc_auc=auc_scorer,
               f1=f_scorer, roc_auc=auc_scorer,
Contributor

Does this mean that the behaviour of f1 has changed? If I am not wrong, it was previously returning single values while it now returns tuples. Can this break user code in any way?

Contributor

(I am not sure this was shipped in 0.13.1, so my question might not be relevant)

Contributor

Actually, this also seems related to the errors reported by Travis in the tests.

Member Author

Nothing was ever shipped. And yes, this is causing the test failure, will look into that.

Member

Yes, this requires more changes to score averaging, and I think it'll be better in a separate PR...


glouppe commented Jul 16, 2013

This looks like a nice improvement!

Once my comments above regarding the changes in F1 scores are addressed, I am +1 for merge.


ogrisel commented Jul 16, 2013

One thing I liked about the previous API, which made needs_threshold and score_func public attributes of the scorer instance, is that it was possible to introspect the scoring object in a custom fit_grid_point implementation and combine several operations on the raw predictions, for instance from a third-party model evaluation tool such as hyperopt.

For instance I would like to have grid search fit_grid_point calls be able to:

  • record the raw prediction time (the wall-clock duration of the call to predict / predict_proba / decision_function)
  • dump the raw predictions (or predict_proba output) to disk (for manual inspection of common classification mistakes, or to implement ensemble strategies such as Caruana et al., 2004)
  • compute ROC-AUC, PR-AUC, F1, precision and recall at the same time (for binary classification, for instance) without calling predict_proba twice on the fitted model

The new public API hides too many implementation details to be able to implement this efficiently from a third party tool like hyperopt or even to implement this as options in the *SearchCV using the public scoring API.

@larsmans

You can do all those things with custom scorers. They're allowed to return tuples holding arbitrary data, and they're still arbitrary Python callables. Partial example:

from collections import namedtuple
from time import time
from sklearn.metrics import f1_score, precision_score, recall_score

Results = namedtuple("Results", "f1 precision recall time")

def super_scorer(estimator, X, y_true):
    # Time the prediction, then compute several metrics from the same
    # predictions; f1 comes first, so it serves as the objective score.
    t0 = time()
    y_pred = estimator.predict(X)
    t = time() - t0
    return Results(f1_score(y_true, y_pred), precision_score(y_true, y_pred),
                   recall_score(y_true, y_pred), t)

@@ -392,6 +401,33 @@ def __init__(self, estimator, scoring=None, loss_func=None,
self.pre_dispatch = pre_dispatch
self._check_estimator()

def report(self, file=None):
Member

I'm not convinced by the format of this. Do we really need a report function that's little different from pprint(search.cv_scores_)?

Member

Or, indeed which is identical to print(*search.cv_scores_, file=file, sep='\n')?

Member

I think it would be much more useful to output something like a CSV, but that requires interpreting the data more.

Member Author

It's a proof of concept. I wanted to make clear in some way that just print(cv_scores_) doesn't give all the information. If you know a better solution (e.g. document the pprint trick?), I'm up for suggestions.

Member

I'm not convinced by the format of this. Do we really need a report function
that's little different from pprint(search.cv_scores_)?

In the long run, we might want such features, but in the short run, I'd
rather avoid.

Member

e.g. document the pprint trick?)

I think that teaching people to use pprint is a good idea.

Member

I don't think pprint is wonderful either. AFAIK it only knows about the basic standard collections (list, tuple, dict) and reprs everything else, including namedtuples, defaultdicts, arrays, etc., on the basis that its output should be eval()-able (except that most repr implementations don't support that).


ogrisel commented Jul 17, 2013

You can do all those things with custom scorers. They're allowed to return tuples holding arbitrary data, and they're still arbitrary Python callables.

Also I forgot one very important use case: compute the same scores on the training split of the data to be able to detect underfitting and overfitting at the same time.

But the meta custom scorer would have to know (by hardcoding) whether the sub-scorers "need thresholds", whether greater is better, and so on.

Ideally I would like to have the default GridSearchCV compute the scores on both the train and test splits at the same time.

I think it would be neat if it were possible to have GridSearchCV(model, param_grid, cv=5, scoring=('roc_auc', 'pr_auc', 'f1', 'precision', 'recall')) supported by the default sklearn tools, without having to write a custom scorer class each time.

For multi-class classification problems, one would also like to collect the confusion matrix and/or the detailed per-class P/R/F1 prior to multiclass averaging.


ogrisel commented Jul 17, 2013

Also, making the scorer responsible for dumping the raw predicted probabilities to the hard drive sounds like a design error (separation of concerns).

Whether score_func requires predict_proba to get probability estimates
out of a classifier.

needs_threshold : boolean, default=False
Member

I feel like having two boolean parameters is a bit confusing. What is the difference between needs_proba=True, needs_threshold=False and needs_proba=True, needs_threshold=True?

Member

Perhaps more explicit would be something like: default input_type='labels' vs input_type='binary_threshold' and input_type='class_proba'.

Member Author

Good point, except that "labels" doesn't cover the regression use case. Also, whether the decision function must give 1-d results is not this function's responsibility; the metric will just have to raise an exception. I suggest prediction_method.

Member Author

No wait, that doesn't really replace needs_threshold because that backs off to predict_proba when there's no decision_function.

Member

Also, ThresholdScorer slices [:, 1] in that case where ProbaScorer does not.

Member Author

The other option is to keep it this way and raise an exception when both are true...

Member Author

I'm just going to raise the exception. Rebase in progress.
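
For readers following the thread, here is a rough sketch of the distinction discussed above; the class names echo the ones mentioned in the comments, but the bodies are illustrative, not the PR's actual code.

class ProbaScorer(object):
    # Metrics such as log loss consume the full probability estimates.
    def __init__(self, score_func, sign, kwargs):
        self._score_func, self._sign, self._kwargs = score_func, sign, kwargs

    def __call__(self, clf, X, y):
        return self._sign * self._score_func(y, clf.predict_proba(X),
                                             **self._kwargs)

class ThresholdScorer(object):
    # Ranking metrics such as ROC-AUC consume a 1-d decision value; back
    # off to predict_proba and slice out the positive class if the
    # estimator has no decision_function.
    def __init__(self, score_func, sign, kwargs):
        self._score_func, self._sign, self._kwargs = score_func, sign, kwargs

    def __call__(self, clf, X, y):
        try:
            y_score = clf.decision_function(X)
        except (AttributeError, NotImplementedError):
            y_score = clf.predict_proba(X)[:, 1]
        return self._sign * self._score_func(y, y_score, **self._kwargs)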

@amueller

I'm not sure if I'm happy with the scope of the PR.
Could you explain in simple words what it does?

@amueller

I just realized the issue message actually is pretty descriptive, but it is not entirely clear which changes are for the last point and which are for the rest.

@jnothman

I'm not sure if I'm happy with the scope of the PR.

I have suggested the handling of multiple metrics should be a separate PR (as much as I am anxious to have the feature, I want it to be given some thought), and @ogrisel seemed to concur, as we need to consider its interaction with other extensibility in grid search output.

@amueller

I agree. We should move forward with the scorer refactoring. I plan to work on the multiple-metrics feature in the next couple of days. We want to release soon, and we should get stuff into a state that is future-ready but doesn't make changes that we might regret afterwards (like using named tuples ;)

Objects that meet those conditions as said to implement the sklearn Scorer
protocol.
Callables that meet those conditions as said to implement the scikit-learn
Scorer protocol.
Member

not capitalized maybe? or maybe scoring?

Member Author

Removed caps.


ogrisel commented Jul 23, 2013

Global +1 on @jnothman and @amueller's latest batch of comments.

@larsmans

@amueller @ogrisel @jnothman The scope of the PR is to simplify the scorers so that they are just functions. There are currently only two commits, the first of which establishes a simple API, the second exemplifies how it can be extended. If we want something that works right now, we can just pull the first commit and leave the structured score stuff for later.

As for @ogrisel's requirements of passing lots of information around and the complaint about separation of concerns: that's independent of the exact API. When you adorn scorers with all kinds of attributes to store generic information about the CV procedure, you're not doing proper separation of concerns.

@@ -673,8 +673,6 @@ Model Selection Interface
:toctree: generated/
:template: class_with_call.rst
Member

I think make_scorer should be public api, so it should be in the classes.rst

Member Author

It was intended to be, will add it.

Member

Small remark: when in a rush, I prioritize PRs that are 'MRG' rather than 'WIP', so I missed this one.

@amueller

+1 for merging the first commit.

@@ -29,18 +29,16 @@
'vect__max_features': (None, 5000, 10000, 50000)}
done in 1737.030s

Best score: 0.940
Best score: 0.923
Member

Why did that change?

Member Author

I changed the scoring from accuracy (the default) to F1 score, to demo and test the structured return values from f_scorer, and the F1 score here is lower than the accuracy. This is also why the best parameter set changed.

@amueller

the function is still missing from the references.

@GaelVaroquaux

the function is still missing from the references.

I was about to point it out :)

@larsmans

Yes, and the description of the scorer protocol mentioned tuples while it shouldn't for now. Will force-push a new version when I'm confident the tests pass on current master.

@larsmans

Ok, pushed the changes. The test failures are in the second commit.

larsmans added 2 commits July 24, 2013 14:18
A Scorer is now a function that returns a score that should be maximized.
Added a report method to GridSearchCV to use it.
@larsmans

Ok, pushed the first commit to master. I'm closing this PR because we'll have to rethink or at least discuss the structured score stuff.

@larsmans larsmans closed this Jul 24, 2013
@amueller

Thanks a lot @larsmans :)

@larsmans

Thank you for setting this up, Andy. This API turns out to be exactly what I needed to do better optimization :)
