WIP new, simpler scorer API #2123
Conversation
```
@@ -987,13 +987,11 @@ follows:
the ground truth target for ``X`` (in the supervised case) or ``None`` in the
unsupervised case.

- The call returns a number indicating the quality of estimator.
- It returns a pair (sign, score), where sign -1 means this is a score to
```
This is not accurate, is it?
No, this is still the old proposal. Sorry, I'll change it.
I'm definitely +1 for getting rid of […]. Was there a reason for not using […]? And as to my PRs, I only require that there be a way to get more information than a single objective score (multiple class scores; P/R/F) back from the scorer. E.g. the scorer could optionally return a tuple of score and extra data:

```python
score = scorer(estimator, X, y)
try:
    score, score_data = score
except TypeError:
    score_data = {'score': score}
```

:s
```diff
@@ -85,5 +85,5 @@
     'silhouette_samples',
     'v_measure_score',
     'zero_one_loss',
-    'Scorer',
+    'make_corer',
```
missing an s
Indeed. Funny that `make test` still passes.
It shouldn't pass since #2033... Might want to check that out.
I know it's still WIP, but I thought I'd just mention this assert failure from Travis in the meantime: […]
That aside, looks good to me so far. 👍
Strange, the tests pass on my machine.
Yeah passed on mine too when I checked this branch. Travis is annoying like that sometimes haha
@jnothman the reason not to use […]
I guess I was asking whether there was more than that: I actually find it quite clear to report negative loss being climbed positively towards 0. At least, loss is strictly non-negative (as opposed to some ranking metrics) so it should be unambiguous. The confusion is that it's the opposite of what you get when you use the metric function directly.
@larsmans, despite your fix for CV over MSE, there's a more substantial deprecation issue: if someone passes a […]. Or we could have scorers returning tuples so this doesn't happen.
@jnothman You mean an […]? Compare the situation to that in […].
No, I mean a […]. (And I didn't mean to imply that the deprecation was excessively imposing, […])
Yes, I agree with the tuple return value. For PRF I'd say we return values of a namedtuple type. That's simpler than a (scalar, dict) pair.
Yes, I see some advantages in the namedtuple. I had thought at one point of a dict where the field 'score' was required, and if a scalar is returned it's interpreted as such. If it returns a namedtuple, we'll probably land up calling […]. (One further annoyance with the namedtuple for PRF is that the […])
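For concreteness, a small sketch of that idea (the `PRFScore` name, field order, and `as_scalar` helper are made up here, not part of the PR):

```python
from collections import namedtuple

# Hypothetical structured return value for a P/R/F scorer; putting the
# objective (f1) first means score[0] is always the value to optimize.
PRFScore = namedtuple("PRFScore", ["f1", "precision", "recall"])

def as_scalar(score):
    # Interpret either a plain number or a namedtuple whose first field
    # is the objective, as discussed above.
    return score if isinstance(score, float) else score[0]

print(as_scalar(0.8))                        # 0.8
print(as_scalar(PRFScore(0.8, 0.75, 0.86)))  # 0.8
```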
We could name the scorer differently, like […].
I had a quick look and my general feeling is that I like this approach and think that it is an improvement over the current code. The good news is that the current code wasn't in a released version, right? So we won't have to go through deprecation and users hating us, right?
```python
def __init__(self, score_func, sign, kwargs):
    self._kwargs = kwargs
    self._score_func = score_func
    self._sign = sign
```
I think that the name 'greater_is_better' is more explicit than 'sign'.
It's actually going to be a sign ∈ {-1, 1}:

```python
sign = 1 if greater_is_better else -1
```

(Line 160.) This is a private API, btw.
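As a rough sketch of how the stored sign is then used (names simplified; not the exact class in this PR):

```python
class _SimpleScorer(object):
    # Wraps a metric so that higher return values are always better:
    # sign is +1 for scores and -1 for losses.
    def __init__(self, score_func, sign, kwargs):
        self._kwargs = kwargs
        self._score_func = score_func
        self._sign = sign

    def __call__(self, estimator, X, y):
        y_pred = estimator.predict(X)
        return self._sign * self._score_func(y, y_pred, **self._kwargs)
```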
No scorer-related stuff was ever released, so we're safe.
Do we want to rename Scorer to Evaluator? I find Scorer rather ambiguous.
Rebased on master. I can haz reviews?
I don't particularly mind […]. Would you advocate changing the […]? Remember also that the default […]
Sounds very French to me (Lars, are we having a bad influence on you?). I actually like Scorer as it relates well to 'cross_val_score', and the […]
Very well :)
```diff
 SCORERS = dict(r2=r2_scorer, mse=mse_scorer, accuracy=accuracy_scorer,
-               f1=f1_scorer, roc_auc=auc_scorer,
+               f1=f_scorer, roc_auc=auc_scorer,
```
Does this mean that the behaviour of `f1` has changed? If I am not wrong, it was previously returning single values while it now returns tuples. Can this break user code in any way?
(I am not sure this was shipped in 0.13.1, so my question might not be relevant)
Actually, this also seems related to the errors reported by Travis in the tests.
Nothing was ever shipped. And yes, this is causing the test failure, will look into that.
Yes, this requires more changes to score averaging, and I think it'll be better in a separate PR...
This looks like a nice improvement! Once my comments above regarding the changes in F1 scores are addressed, I am +1 for merge.
One thing I liked with the previous API is that it made […]. For instance I would like to have grid search […]. The new public API hides too many implementation details to be able to implement this efficiently from a third-party tool like hyperopt, or even to implement this as options in the […]
You can do all those things with custom scorers. They're allowed to return tuples holding arbitrary data, and they're still arbitrary Python callables. Partial example:
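(A sketch of what such a custom scorer could look like; the `prf_scorer` name is made up here, and it only relies on `sklearn.metrics`:)

```python
from sklearn.metrics import f1_score, precision_recall_fscore_support

def prf_scorer(estimator, X, y):
    # Any callable with this signature works as a scorer; here it returns
    # the objective plus a dict of per-class details as extra data.
    y_pred = estimator.predict(X)
    p, r, f, support = precision_recall_fscore_support(y, y_pred)
    return f1_score(y, y_pred), {"precision": p, "recall": r,
                                 "fscore": f, "support": support}
```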
```
@@ -392,6 +401,33 @@ def __init__(self, estimator, scoring=None, loss_func=None,
        self.pre_dispatch = pre_dispatch
        self._check_estimator()

    def report(self, file=None):
```
I'm not convinced by the format of this. Do we really need a report function that's little different from `pprint(search.cv_scores_)`?
Or, indeed, which is identical to `print(*search.cv_scores_, file=file, sep='\n')`?
I think it would be much more useful to output something like a CSV, but that requires interpreting the data more.
It's a proof of concept. I wanted to make clear in some way that just `print(cv_scores_)` doesn't give all the information. If you know a better solution (e.g. document the `pprint` trick?), I'm up for suggestions.
> I'm not convinced by the format of this. Do we really need a report function that's little different from `pprint(search.cv_scores_)`?

In the long run, we might want such features, but in the short run, I'd rather avoid.
> e.g. document the `pprint` trick?

I think that teaching people to use pprint is a good idea.
I don't think `pprint` is wonderful either. Afaik it only knows about the basic standard collections (list, tuple, dict) and `repr`s everything else, including namedtuples, defaultdicts, arrays, etc., on the basis that its output should be `eval`able (except that most `repr` implementations don't support that).
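For reference, the two one-liners being compared to a dedicated report method (a sketch; `search` is assumed to be a fitted grid-search object exposing the `cv_scores_` attribute discussed above):

```python
from __future__ import print_function
from pprint import pprint

def show_scores(search):
    # pprint only special-cases the basic collections (list, tuple, dict)
    # and falls back to repr() for namedtuples, arrays, etc.
    pprint(search.cv_scores_)
    # Plain print with one entry per line, as suggested above.
    print(*search.cv_scores_, sep='\n')
```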
Also I forgot one very important use case: compute the same scores on the training split of the data to be able to detect underfitting and overfitting at the same time. But the meta custom scorer would have to know (hardcoding) if the subscorers "need thresholds" and whether greater is better and so on. Ideally I would like to have the default […]. I think it would be neat if it were possible to have: […]. Also, for the multi-class classification problem, one would also like to collect the confusion matrix (and/or) the detailed per-class P/R/F1 prior to multiclass averaging.
Also, making the scorer responsible for dumping the raw predicted probabilities to the hard drive sounds like a design error (separation of concerns).
```
Whether score_func requires predict_proba to get probability estimates
out of a classifier.

needs_threshold : boolean, default=False
```
I feel like having two boolean parameters is a bit confusing. What is the difference between `needs_proba=True, needs_threshold=False` and `needs_proba=True, needs_threshold=True`?
Perhaps more explicit would be something like: default `input_type='labels'` vs `input_type='binary_threshold'` and `input_type='class_proba'`.
Good point, except that `"labels"` doesn't cover the regression use case. Also, whether the decision function must give 1-d results is not this function's responsibility; the metric will just have to raise an exception. I suggest `prediction_method`.
No wait, that doesn't really replace `needs_threshold` because that backs off to `predict_proba` when there's no `decision_function`.
Also, `ThresholdScorer` slices `[:, 1]` in that case where `ProbaScorer` does not.
The other option is to keep it this way and raise an exception when both are true...
I'm just going to raise the exception. Rebase in progress.
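To illustrate the distinction with a sketch (using present-day `sklearn.metrics` names, which may differ slightly from the code under review):

```python
from sklearn.metrics import log_loss, make_scorer, roc_auc_score

# needs_proba: the metric consumes predict_proba output; log loss is a
# loss, so greater_is_better=False makes the scorer return minus the loss.
neg_log_loss_scorer = make_scorer(log_loss, greater_is_better=False,
                                  needs_proba=True)

# needs_threshold: the metric consumes a continuous decision value from
# decision_function, falling back to predict_proba if it is missing.
auc_scorer = make_scorer(roc_auc_score, needs_threshold=True)

# Setting both flags at once is contradictory and, per the discussion
# above, raises an exception.
```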
I'm not sure if I'm happy with the scope of the PR.
I just realized the issue message actually is pretty descriptive, but it is not entirely clear which changes are for the last point and which are for the rest.
I have suggested the handling of multiple metrics should be a separate PR (as much as I am anxious to have the feature, I want it to be given some thought), and @ogrisel seemed to concur, as we need to consider its interaction with other extensibility in grid search output.
I agree. We should move forward with the scorer refactoring. I plan to work on the multiple metrics feature in the next couple of days. We want to release soon and we should get stuff into a state that is future-ready but doesn't make changes that we might regret afterwards (like using named tuples ;)
```diff
-Objects that meet those conditions as said to implement the sklearn Scorer
-protocol.
+Callables that meet those conditions as said to implement the scikit-learn
+Scorer protocol.
```
not capitalized maybe? or maybe scoring?
Removed caps.
@amueller @ogrisel @jnothman The scope of the PR is to simplify the scorers so that they are just functions. There are currently only two commits, the first of which establishes a simple API, the second exemplifies how it can be extended. If we want something that works right now, we can just pull the first commit and leave the structured score stuff for later. As for @ogrisel's requirements of passing lots of information around and the complaint about separation of concerns: that's independent of the exact API. When you adorn scorers with all kinds of attributes to store generic information about the CV procedure, you're not doing proper separation of concerns.
```
@@ -673,8 +673,6 @@ Model Selection Interface
   :toctree: generated/
   :template: class_with_call.rst
```
I think `make_scorer` should be public API, so it should be in the classes.rst
It was intended to be, will add it.
Small remark: when in a rush, I prioritize PRs that are 'MRG' rather than 'WIP', so I missed that one.
+1 for merging the first commit.
```diff
@@ -29,18 +29,16 @@
 'vect__max_features': (None, 5000, 10000, 50000)}
 done in 1737.030s

-Best score: 0.940
+Best score: 0.923
```
Why did that change?
I changed the scoring from accuracy (the default) to F1 score to demo and test the structured return values from `f_scorer`, and F1 score ≤ accuracy. This is also why the best parameter set changed.
the function is still missing from the references.
I was about to point it out :)
Yes, and the description of the scorer protocol mentioned tuples while it shouldn't for now. Will force-push a new version when I'm confident the tests pass on current master.
Ok, pushed the changes. The test failures are in the second commit.
A Scorer is now a function that returns a score that should be maximized.
Added a report method to GridSearchCV to use it.
Ok, pushed the first commit to master. I'm closing this PR because we'll have to rethink or at least discuss the structured score stuff.
Thanks a lot @larsmans :)
Thank you for setting this up, Andy. This API turns out to be exactly what I needed to do better optimization :)
Here's a simpler scorer API. Sorry for being so late to the game, I was too busy to bother with the discussions. I hope no-one takes offense (esp. @amueller who worked hard on this). Ping @jnothman. Anyway, the highlights:

- Scorers are plain callables, invoked as `scorer(estimator, X, y, **kwargs)`.
- No more `greater_is_better`. A scorer made from a loss function returns minus the loss.
- No more public `Scorer` class. There's a factory function `make_scorer`, but it produces objects of an internal class hierarchy that do their best to hide as callables.

TODO:

- Should `make_scorer` be public at all?
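A minimal usage sketch of the proposed API (dataset and estimator chosen arbitrarily, using current `sklearn` names):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_squared_error

# A scorer made from a loss returns minus the loss, so the result is
# always something to maximize; no greater_is_better flag is needed
# downstream (e.g. in GridSearchCV).
neg_mse = make_scorer(mean_squared_error, greater_is_better=False)

X, y = make_regression(random_state=0)
model = Ridge().fit(X, y)
print(neg_mse(model, X, y))  # <= 0; closer to 0 is better
```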