ENH annotate metrics to simplify populating SCORERS #1774


Closed
wants to merge 2 commits

Conversation

jnothman
Member

This patch makes needs_threshold and greater_is_better attributes of each metric, rather than providing them when a Scorer is constructed from them. I am not sure these are the most appropriate names, but their use means the information isn't duplicated in scorer.py. (It would also allow composite Scorers to be constructed easily, assuming their support in #1768 is granted.)

Perhaps there should be other machine-readable annotations on our library of metrics:

  • lower and upper bound
  • is_symmetric
  • multiclass/binary/regression

but their application is not clear to me.
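To illustrate the shape of the annotation (the annotate helper below is only illustrative; the patch itself sets these attributes where the metrics are defined):

from sklearn.metrics import f1_score, hinge_loss

def annotate(metric, greater_is_better=True, needs_threshold=False):
    # Attach the machine-readable annotations directly to the metric function.
    metric.greater_is_better = greater_is_better
    metric.needs_threshold = needs_threshold
    return metric

annotate(f1_score)  # a score computed on discrete predictions; higher is better
annotate(hinge_loss, greater_is_better=False, needs_threshold=True)  # a loss on decision values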

@amueller
Member

I actually implemented exactly this and we decided against it.

@amueller
Member

I am not entirely sure why you say the information is not duplicated. Why is it duplicated now? It is currently stored in the scorer file, whereas you store it on the scoring functions.

Anyhow, I actually do like this approach, but it was voted against.

@jnothman
Member Author

Oh, well. That's annoying. Can you point me to the reasons against?

I didn't mean that it's not duplicated; I meant that it's not stored in multiple places.

Assuming #1768, note that with this patch we can do something like the following:

import six  # py2/py3-compatible dict iteration (six.iteritems)

from sklearn.metrics import (Scorer, f1_score, precision_score, recall_score,
                             accuracy_score, hinge_loss, precision_recall_curve,
                             roc_curve, confusion_matrix)


class CompositeScorer(Scorer):
    def __init__(self, objective, **other_metrics):
        super(CompositeScorer, self).__init__(objective)
        self.metrics = ([], [])  # group metrics by their needs_threshold value
        other_metrics['score'] = objective
        for name, metric in six.iteritems(other_metrics):
            # _get_annotation reads the annotation added by this patch, with a default
            needs_threshold = self._get_annotation(metric, 'needs_threshold', False)
            self.metrics[int(needs_threshold)].append((name, metric))

    def calc_scores(self, estimator, X, y=None):
        for needs_threshold, metrics in enumerate(self.metrics):
            if not metrics:
                continue
            if needs_threshold:
                # continuous output: prefer decision_function, fall back to predict_proba
                try:
                    y_pred = estimator.decision_function(X).ravel()
                except (NotImplementedError, AttributeError):
                    y_pred = estimator.predict_proba(X)[:, 1]
            else:
                y_pred = estimator.predict(X)
            for name, metric in metrics:
                yield name, metric(y, y_pred)


binary_debug_scorer = CompositeScorer(f1_score,
    precision=precision_score, recall=recall_score,
    accuracy=accuracy_score, hinge_loss=hinge_loss,
    pr_curve=precision_recall_curve, roc_curve=roc_curve,
    confusion=confusion_matrix)
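For illustration, it could be consumed like this (clf, X_test and y_test are placeholder names for a fitted binary classifier and a held-out split):

for name, value in binary_debug_scorer.calc_scores(clf, X_test, y_test):
    print('%s: %r' % (name, value))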

@amueller
Member

Hum, I never really looked into more than one score actually ;)

@larsmans
Member

@amueller Where was the vote? I like this approach better than the current one too.

@amueller
Member

In the three pull requests I sent in and the massive thread on the ML.
http://sourceforge.net/mailarchive/forum.php?thread_name=CAFvE7K4aU7Dj-%3De_jDxLNUAoByNNvO9caOAs%2BYQ8K65ctGceGg%40mail.gmail.com&forum_name=scikit-learn-general

#1381 #1198 #1014

Actually, the discussion there was about whether we should let the score functions know how they can be applied.
The decision not to do that was the reason for introducing the scorer objects.
If we do annotate them, I'm not entirely sure we need the scorer objects at all?
Well, some of the code that is now in the scorer objects would then live in GridSearchCV and cross_val_score.

I worked on this for quite a while and had endless discussions to get any working solution into sklearn, and I am a bit fed up with the issue. The current state is probably not the perfect final solution, though.
Talk to @GaelVaroquaux.

@larsmans
Member

Alright. Shall we close this PR?

@amueller
Member

Not sure ;)

@jnothman
Member Author

Right, so this comes down to "what is Scorer for?". Currently it:

  • transforms a function over estimator output into one over estimator input, using needs_threshold to decide how;
  • notes whether the metric is a loss/score.

With this patch it still performs the first function, but is not necessary for the second.

With #1768 it also:

  • may output multiple scores by name (without necessarily having to call predict multiple times).

Things it does not do:

  • know how to aggregate its output (*SearchCV assumes sum, or arithmetic mean weighted by number of samples, if iid=True)
  • decide whether one score is better than another (despite having the facility to do so in greater_is_better).

So, what's a Scorer for??

@amueller
Member

The idea behind the scorer was that it can do more than just thresholding.
@GaelVaroquaux didn't like the idea of building scoring objects just around classification, even though that is the main area where they are used.
The scorer interface makes it possible to build all kinds of estimator-specific scores, for example taking AIC, BIC, model complexity or margin into account. These are not my ideas, so I can't really speak for them.

Have you read the other pull requests and the thread?
My first solution was basically to add a "needs threshold" decorator to the loss. That would have solved my problem.
@GaelVaroquaux (and maybe also @ogrisel?) said this was too classification-specific.

Now you say you want multiple scores. I must admit I don't really see the use case for that (I'd actually like to have fewer, more aggregated views of the results, not more verbose ones). But if you can implement what you want with the scorer interface, that is nice :)
So maybe this is a useful abstraction after all.

Btw, if you want to write less for the CompositeScorer, you can just use the Scorers instead of the functions and all will be good ;)

@jnothman
Member Author

"The scorer interface makes it possible to build all kinds of estimator-specific scores"

Yes of course it does... so it makes the metric not just a function of the input to the estimator, but of the estimator itself.

No, I haven't read up on the history yet. It seems the problem comes down to the domain of predict sometimes being integer classes, and sometimes float regression values. Perhaps the exceptional case is a metric that is categorical, not one that needs_threshold. I.e. if the metric is marked categorical, the Scorer should use predict; otherwise it tries decision_function, predict_proba and predict...
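Roughly, the dispatch I have in mind (is_categorical and the helper name are hypothetical, just to make the idea concrete):

def _predictions_for(metric, estimator, X):
    # Metrics marked categorical get discrete predictions; everything else gets
    # the richest continuous output the estimator offers, falling back to predict.
    if getattr(metric, 'is_categorical', False):
        return estimator.predict(X)
    for method in ('decision_function', 'predict_proba', 'predict'):
        if hasattr(estimator, method):
            output = getattr(estimator, method)(X)
            if method == 'decision_function':
                return output.ravel()
            if method == 'predict_proba':
                return output[:, 1]  # probability of the positive class
            return output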

With the calc_scores idea I suggested in #1768, it can output multiple scores, which I think is at least useful for something like GridSearchCV to keep the objective and other diagnostic evaluations, but discard the models built along the way.

Yes, I considered using Scorers in the CompositeScorer, but assuming we're using something like calc_scores, it's trickier to provide an API for renaming the scores output from the constituent scorers' calc_scores. And if the Scorer is already composite, introspecting to use the minimal number of calls to predict or decision_function is messy. So composing a scorer out of metrics makes somewhat more sense, except that it doesn't handle score functions that depend on the estimator...

@jnothman
Member Author

Or perhaps the metrics should be annotated directly with something like:

@applies_to('predict')
@applies_to('decision_function', 'predict_proba')

:s
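A sketch of what such a decorator could look like (the name applies_to and the attribute it sets are hypothetical, not an existing API):

def applies_to(*estimator_methods):
    """Record which estimator outputs a metric accepts, in preference order."""
    def decorate(metric):
        metric.applies_to = estimator_methods
        return metric
    return decorate

@applies_to('decision_function', 'predict_proba')
def some_ranking_metric(y_true, y_score):
    # Placeholder body for a metric that operates on continuous scores.
    raise NotImplementedError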

For the moment I am persuaded that this PR might be ignored until we decide whether we're interested in composite scorers as in #1768.

@arjoly
Member

arjoly commented Jul 25, 2013

I think that this PR can be closed given the excellent work of many people (see #2123).

@arjoly closed this Jul 25, 2013