
WIP: Enh/metrics #443


Merged: 20 commits merged into scikit-learn:master on Jan 27, 2012

Conversation

@satra (Member) commented Nov 15, 2011

adding a couple of metrics

TODO before merge:

  • devise an API refactoring to be able to pass parameterized scoring callables to the GridSearchCV and cross_val_score utilities.


"""
mcc = np.corrcoef(y_true, y_pred)[0,1]
if np.isnan(mcc):

Member:

PEP 8 (missing space after comma) :)

@satra (Member Author) commented Nov 15, 2011

@Gael: the difference from f1_score in multiclass settings is that, whenever there is a binary classification problem, f1_score returns only the positive f1 score by default, whereas this one always returns the average. clarified in the docstring now.

@satra (Member Author) commented Nov 15, 2011

also, we should work towards including all the metrics from the doi referenced under matthews_corrcoef. there are some generally useful ones for the multiclass case. we already have several of the mutual info metrics.

@ogrisel (Member) commented Nov 15, 2011

Maybe we should instead write separate functions for the binary and multiclass f1 scores, such that calling the binary f1_score on multiclass input would raise a ValueError telling the user to use the multiclass version instead? The binary version could be named f1_score and the multiclass version could be named either multiclass_f1_score or average_f1_score. WDYT?

Also I would rather avoid names such as avg_f1_score and use the explicit name average_f1_score instead.

@satra (Member Author) commented Nov 16, 2011

@ogrisel: just a quick question. in the binary case the current f1_score returns the value for the positive label. that's purely a convention, right? there is no reason to think that the +1 label is any different from a -1 or a 0 label.

if that's the case, wouldn't it make sense to have a single function whose default is pos_label=None and which always returns np.average(f1, weights=support)? if pos_label is not None, it would return np.average(f1[pos_label], weights=support). that would be a generic single-function solution, but it would change the api.

i'll change avg -> average. should corrcoef be changed to correlation_coefficient?

@GaelVaroquaux (Member)

i'll change avg -> average. should corrcoef be changed to correlation_coefficient?

I don't think so: corrcoef is just a question of following numpy.

@ogrisel (Member) commented Nov 16, 2011

@satra: yes, the choice of the positive label is a convention in binary classification. In information retrieval, positive means "belongs to the search results of the query" whereas all other documents from the index are classified as "negative". Hence the unbalanced labeling between positive and negative.

As for the treatment of pos_label=None to tell multiclass classification tasks apart from binary classification tasks, I would rather change the existing fbeta_score function (and related functions such as precision_score and recall_score) to do:

y_true, y_pred = check_arrays(y_true, y_pred)

if labels is None:
    labels = unique_labels(y_true, y_pred)
else:
    labels = np.asarray(labels, dtype=np.int)
_, _, f, s = precision_recall_fscore_support(y_true, y_pred, beta=beta, labels=labels)
if pos_label is not None:
    if f.shape[0] != 2:
        raise ValueError("pos_label should be set to None for multiclass tasks, got %d" % pos_label)
    if pos_label not in labels:
        raise ValueError("pos_label=%d is not a valid label: %r" % (pos_label, labels))
    pos_label_idx = list(labels).index(pos_label)
    return f[pos_label_idx]
else:
    return np.average(f, weights=s)

In that case we don't need to introduce a new function (we can reuse the existing ones). Of course we need to make it explicit that binary classification tasks should provide an explicit pos_label, otherwise the problem is treated as multiclass classification with support-weighted averaging of the scores.

@satra (Member Author) commented Nov 16, 2011

@ogrisel: this sounds good to me. shall i post to the list to get some feedback before i go ahead and change?

@ogrisel (Member) commented Nov 16, 2011

Yes, asking for more comments would be great.

References
----------
http://en.wikipedia.org/wiki/Matthews_correlation_coefficient
doi: 10.1093/bioinformatics/16.5.412

Member:

Can we get the full reference here? The Wikipedia page may change in the future

Member Author:

is the doi not sufficient?

Member:

On Sat, Dec 10, 2011 at 02:54:05AM -0800, Satrajit Ghosh wrote:

is the doi not sufficient?

I'd rather have a clickable link. Ideally freely downloadable.

G

Member Author:

http://dx.doi.org/10.1093/bioinformatics/16.5.412

Member:

On Sat, Dec 10, 2011 at 02:59:22AM -0800, Satrajit Ghosh wrote:

http://dx.doi.org/10.1093/bioinformatics/16.5.412

That's great: I can click on it, and get a PDF :)

G

@robertlayton (Member)

This is great - it was actually on my todo list for when I got some time!
I would also prefer ogrisel's method of doing both binary and average f1 in the same function.
If f.shape[0] > 2, should the value of pos_label just be ignored?

@satra (Member Author) commented Nov 16, 2011

can't we simply do

np.average(f[labels], weights=s[labels])

where labels can be an array. we can do the appropriate checks and change pos_label to labels.

@ogrisel (Member) commented Nov 16, 2011

Since the default value is pos_label=None, it's better to throw an exception when the user passes a non-None value on a multiclass dataset than to silently fix the problem in the background without telling the caller that what was asked for is meaningless.

@mblondel (Member)

I didn't quite understand how the new metric for average f1 score differs from what we currently have (I'm a bit in a hurry so I guess I'm missing something). By the way, the way we average the f1 score (before this PR) is not standard. We need to add micro and macro averaging instead. See #83.

@satra (Member Author) commented Nov 17, 2011

@mblondel: the difference with the new metric is that even for two classes it gives the weighted average, which is impossible with the current one. i like the micro/macro averaging idea. is there a reference i can look up?

* master: (85 commits)
  Benchmark: Added random forests and extra-trees to bench_sgd_covertype.py
  ENH: Address @agramfort comments
  fix: rm `nu` argument from sparse.SVR (taken from dense SVR).
  fix: learning rate schedule doc.
  Cosmit
  DOC: Use ELLIPSIS in doc-test
  FIX: Bug in plot_forest_iris
  FIX: Bug with bootstrapping (2)
  FIX: Bug with bootstrapping
  Revert "FIX : removing param nu from sparse.SVR, C from NuSVR + pep8"
  Revert "Remove C from NuSVR."
  DOC: Removed 'default'
  DOC: Added warning in make_estimator
  ENH: Use trailing _ for private attributes
  Web page layout tweaks.
  DOC: rename n -> p
  Docstring conventions.
  ENH: Simplified RandomForest and ExtraTrees API
  Move implementation details into RST doc.
  Minor update and fixes to linear_model documentation
  ...

Only in the binary case does this relate to information about true and
false positives and negatives. See references below.

Member:

Perhaps this should include a brief description of what the Matthews correlation coefficient is? Something like the Wikipedia page would be good.

Member Author:

will do.
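
For reference, a minimal sketch of the binary-case Matthews correlation coefficient computed from a 2x2 confusion matrix (the function name and the zero-denominator convention here are illustrative, not the exact code in this PR):

import numpy as np
from sklearn.metrics import confusion_matrix

def mcc_from_confusion(y_true, y_pred):
    # Binary case: confusion matrix is [[tn, fp], [fn, tp]] for labels {0, 1}.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    num = tp * tn - fp * fn
    den = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    # Convention used here: return 0 when a marginal is empty (den == 0).
    return 0.0 if den == 0 else num / den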

* master: (69 commits)
  FIX - error in the bibtex entry - extra comma that makes bibtex fail
  Add arXiv link to Halko et al. 2009 paper.
  DOC rm Methods section from KMeans docstring
  Fix svmlight loader doc.
  modified 'gid' to 'git'
  COSMIT use Python 2.6 except-as syntax
  use cPickle in spectral clustering tests
  rm Py2.5 compat factorial and combinations from utils.extmath
  COSMIT a few more easy cases of with open syntax
  COSMIT use ABCMeta in naive_bayes
  COSMIT use urlretrieve and "with" syntax in LFW module
  Revert "FIX: python2.5 SyntaxError"
  Revert "FIX: more python2.5 SyntaxError"
  don't use assert_in, not supported by nose on buildbot
  COSMIT use utils.deprecated as a class decorator
  test @deprecated using warnings.catch_warnings
  DOC I don't think Ubuntu 10.04 will be the last LTS release
  Update README.txt dependencies info to match the configuration tested on jenkins
  cosmit
  COSMIT docstring fix + US spelling in K-means code
  ...

@satra (Member Author) commented Dec 15, 2011

there are a couple of tests in other sections that are failing. could somebody please review this to ensure that this is doing what was intended?

fp = cm[0,1]
fn = cm[1,0]
num = (tp*tn-fp*fn)
den = np.sqrt((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn))

Member:

can you please run pep8 on this to spot the code style issues? The goal is to keep this plot going downward: https://jenkins.shiningpanda.com/scikit-learn/job/python2.7-numpy1.5.1-scipy-0.10.0/violations/

Member Author:

done

assert_array_almost_equal(ps, 0.62, 2)

rs = recall_score(y_true, y_pred)
rs = recall_score(y_true, y_pred, pos_label=None, average='micro')
assert_array_almost_equal(rs, 0.61, 2)

Member:

Recent discussions here and here seem to imply that micro averaging should yield the same score for precision and recall (hence f1 too). This is not the case in this test. Who is right?

Member Author:

these were actually identical, but since i ran the tests without checking the actual number, 0.62 and 0.61 both satisfy assert_array_almost_equal. the true answer is 0.6133.

in this case where these numbers are equal, does it make sense to have 'micro' as the default?

Member:

We could keep weighted as the default for backward compatibility.

Member:

Alright, thanks for checking, I was too lazy :) As for the default value, I have no strong opinion. Both micro and weighted are counter-intuitive for different reasons though; I think the docstring should make it explicit that:

  • 'weighted' averaging can yield an f1 score that is not between precision and recall
  • 'micro' averaging implies that precision == recall == f1

Member:

@mblondel I am not sure we want to keep 'weighted' as the default, as it seems to be a scikit-learn-specific approach that nobody else is using. I slightly prefer 'micro', even though the fact that precision == recall == f1 in the multiclass case might be a bit counter-intuitive. I think that's fine as long as this behavior is documented in the docstring.

avg_recall = true_pos.sum() / (true_pos.sum() + false_neg.sum())
avg_fscore = (1 + beta2) * (avg_precision * avg_recall) / \
    (beta2 * avg_precision + avg_recall)
if average == 'macro':

Member:

Any reason to use if rather than elif?


average : string, ['micro', 'macro', 'weighted']
    in the multiclass classification case, this determines the
    type of averaging performed on the data.

Member:

A small description of each averaging strategy would be nice.

macro: average over classes (does not take imbalance into account)
micro: average over instances (takes imbalance into account)
weighted: average weighted by support (takes imbalance into account)
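
For illustration, a minimal numpy sketch of the three schemes (the per-class counts below are hypothetical, not taken from this PR):

import numpy as np

# Hypothetical per-class counts for a 3-class, single-label problem.
tp = np.array([50., 10., 5.])
fp = np.array([5., 2., 8.])
fn = np.array([3., 7., 5.])
support = tp + fn

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

macro_f1 = f1.mean()                           # unweighted mean over classes
weighted_f1 = np.average(f1, weights=support)  # support-weighted mean

# 'micro' pools the counts over classes; in single-label multiclass the
# pooled fp and fn totals coincide, hence micro precision == recall == f1.
micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)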

Member Author:

pushed some more changes.

@satra (Member Author) commented Dec 15, 2011

query about default behavior: a bunch of tests are failing in test_sgd because we currently set the defaults to pos_label=1, average='weighted', but a ValueError is raised when multiclass input is passed. i don't want to touch the sgd tests, but the idea there would be to call f1_score with pos_label=None for multiclass input.

also along these lines, the cross_validation routines are failing. is there a way to assign something like score_func=f1_score(pos_label=None)?

@ogrisel (Member) commented Dec 15, 2011

Hum, the cross validation point is a very good remark.

Maybe we should split features and have explicit functions for the multiclass and binary cases after all. But cross_val_score is also restrictive in other cases: for instance, to use AUC as a cross-validated score I have to subclass the estimator and override estimator.score, because the cross_val_score API is not flexible enough.

Otherwise, as a short-term solution, maybe we could use pos_label=None as the default?

@ogrisel (Member) commented Dec 16, 2011

About the cross_val_score API issues, I had a chat with @agramfort on the bus from the Alhambra to Sierra Nevada and we came to the same conclusion: we probably need a parameterized Scorer API in the scikit that would be suitable for the following use cases:

  • cross validation with explicit integer predictions, transforms, or AUC based on decision functions
  • grid search with cross validation (same as above)
  • early stopping against a fixed validation set for online / large scale incremental estimators (SGD, MinibatchKMeans, MinibatchDictionaryLearning)

The gist would be to have the user provide one-time, constructor-level parameters (like the averaging scheme for multiclass evaluation using the F1 score) or a fixed validation set to compute the score against, which could then be treated as a curried evaluation function to be called many times (on different cross-val folds during grid search, or every n_minibatches for early stopping).

We will need to think this idea through further: I think we need a new experimental branch to devise the exact API by implementing the aforementioned use cases.

I would also like to keep the *_score functions in parallel so as to make it easier for the user to compute one-off evaluations of an estimator (e.g. in an IPython session). The scorer implementations could be based on those function implementations when it makes sense.

WDYT @satra and @mblondel?

@satra (Member Author) commented Dec 21, 2011

@ogrisel

  1. first i changed pos_label to None - this is an api change, so it should be made amply clear to folks.
  2. yes we need a new api for scoring in relation to any functionality that calls the scoring object (also add permutation test to your list above). something along the following lines?
class Scorer(object):
    def __init__(self, score_func, **score_func_args):
        self._score_func = score_func
        self._score_func_args = score_func_args
        #TODO: validate that score_func_args can be applied to score_func

    def score(self, y_true, y_pred):
        return self._score_func(y_true, y_pred, **self._score_func_args)

I presume a similar thing for loss_func.

@ogrisel could you say a little more about the validation set and the transforms aspect? it seems like that should be separate from the scoring api. perhaps you are thinking that the scorer api could itself take a pipeline object?
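
For concreteness, a hedged usage sketch of the Scorer idea above (the keyword argument shown is illustrative, not necessarily the final API):

import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 2, 1, 1, 0, 2])
y_pred = np.array([0, 1, 1, 2, 0, 2])

# Bind the averaging choice once, then reuse the scorer on every fold or
# minibatch; Scorer here is the class sketched in the comment above.
scorer = Scorer(f1_score, average='macro')
print(scorer.score(y_true, y_pred))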

@ogrisel (Member) commented Dec 31, 2011

@satra sorry for the late reply. I think @GaelVaroquaux has an alternative suggestion based on functools to address this issue.
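
One possible reading of that functools-based alternative, assuming it amounts to functools.partial (this is a guess, not @GaelVaroquaux's actual proposal):

from functools import partial
from sklearn.metrics import f1_score

# Bind the metric's keyword arguments up front; the result keeps the plain
# (y_true, y_pred) signature that score_func-style arguments expect.
macro_f1 = partial(f1_score, average='macro')
# e.g. passing score_func=macro_f1 to a cross-validation helper -- hypothetical usage.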

@satra (Member Author) commented Jan 10, 2012

@GaelVaroquaux could you please elaborate on your functools suggestion?

@mblondel (Member)

I pushed a couple of commits to my metrics branch (https://github.com/mblondel/scikit-learn/tree/metrics). First, I reverted pos_label back to 1: pos_label is just to specify a label encoding and I think it's confusing to use it as a switch to enable / disable a behavior. Also now people who want the values of the two classes in the binary case (a feature that @satra wanted) need to pass an explicit average=False or average=None. I think my solution is simpler. The tests are passing. I hope I didn't misunderstand some of the things you were trying to do.

I think we can now merge @satra's PR with my commits to master and work on the scorer API in another PR. This PR has been pending for too long (and I need averaged metrics for my research :p)

In the grid-search module, there are plenty of calls to predict, but for AUC we need to use either decision_function or predict_proba. Currently, score_func must be a callable. I suggest we also allow strings so we can, for example, do GridSearchCV(score_func="auc"). This way we can easily switch to decision_function or predict_proba where necessary.
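
A rough sketch of how such string dispatch could look (the registry and helper below are hypothetical, not an existing scikit-learn API):

from sklearn.metrics import f1_score, roc_auc_score

# Map a name to (metric, needs_threshold); metrics flagged True should be fed
# decision_function / predict_proba outputs instead of hard predictions.
SCORE_FUNCS = {
    "f1": (f1_score, False),
    "auc": (roc_auc_score, True),
}

def resolve_score_func(score_func):
    # Accept either a callable or a registered string name.
    if callable(score_func):
        return score_func, False
    return SCORE_FUNCS[score_func]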

@ogrisel (Member) commented Jan 27, 2012

Sounds like a reasonable plan to advance incrementally on this issue.

@ogrisel (Member) commented Jan 27, 2012

@satra can you merge @mblondel's change into your branch so that we can merge this PR?

@mblondel (Member)

It could be convenient to have alias functions macro_precision, micro_precision, ..., micro_f1_score. Pro: useful for cross-validation. Con: 6 more functions. What do you think?

satra added 2 commits January 27, 2012 09:13
* upstream/master: (823 commits)
  Set the download link to PYPI.
  Address Issue scikit-learn#590 : use relative path link to about.html
  check random state in _fit_transform
  add random_state to LocallyLinearEmbedding
  MISC: fix bibtex
  Add n_jobs option to pairwise_distances and pairwise_kernels.
  euclidian_distances is to be deprecated in v0.11.
  Better default threshold for L1-regularized models.
  Remove CoefSelectTransformerMixin and use SelectorMixin instead.
  COSMIT removed unused imports
  Cosmit: remove unused imports
  Refactor in KFold.
  COSMIT pep8
  FIX: MurmurHash3 compilation on older GCC
  DOC added required versions of python, numpy and scipy to install documentation. Closes issue scikit-learn#579
  DOC website: added link to 0.10 docs under support.
  FIX: broken build / tests
  Move preprocessing.py to sklearn/.
  preprocessing/__init__.py -> preprocessing/preprocessing.py
  ENH: Default in Vectorizer "None" as @ogrisel suggested
  ...
* mblondel/metrics:
  Factor some code.
  Backward compatibility in precision, recall and f1-score.

@satra (Member Author) commented Jan 27, 2012

@mblondel i think allowing strings is what would solve the problem, but @GaelVaroquaux had some alternate thoughts (?)

also i have no issue with adding helper functions. in fact what got this started was an additional helper function called average_f1_score!

@ogrisel : merged @mblondel's branch.

@ogrisel (Member) commented Jan 27, 2012

I am -0 for adding more public functions for these metrics.

If we add new public functions they should be suffixed with "_score" when positive and "_loss" when negative to be consistent with the rest of the module. In this case (precision and recall) they are both positive.

@mblondel (Member)

@satra: if you don't have any further changes, please press the green button :)

We can add the helper functions later in master if we need them (for now, I put a few helpers in my user-land code).

satra added a commit that referenced this pull request Jan 27, 2012
@satra merged commit 689931c into scikit-learn:master on Jan 27, 2012