
WIP: Enh/metrics #443


Merged: 20 commits merged into scikit-learn:master on Jan 27, 2012

Conversation

@satra (Member) commented Nov 15, 2011

adding a couple of metrics

TODO before merge:

  • devise an API refactoring to be able to pass parameterized scoring callables to the GridSearchCV and cross_val_score utilities.


"""
mcc = np.corrcoef(y_true, y_pred)[0,1]
if np.isnan(mcc):

Member:

PEP 8 (missing space after comma) :)

@satra (Member Author) commented Nov 15, 2011

@Gael: the difference from f1_score in multiclass settings is that, whenever there is a binary classification problem, f1_score returns only the positive f1 score by default, whereas this one always returns the average. clarified in the docstring now.

@satra (Member Author) commented Nov 15, 2011

also, we should work towards including all the metrics from the doi referenced under matthews_corrcoef. there are some generally useful ones for the multiclass case. we already have several of the mutual info metrics.

@ogrisel (Member) commented Nov 15, 2011

Maybe we should instead write separate functions for the binary and multiclass f1 scores, such that calling the binary f1_score on multiclass input would raise a ValueError telling the user to use the multiclass version instead? The binary version could be named f1_score and the multiclass version could be named either multiclass_f1_score or average_f1_score. WDYT?

Also I would rather avoid names such as avg_f1_score and use the explicit name average_f1_score instead.

@satra (Member Author) commented Nov 16, 2011

@ogrisel: just a quick question. in the binary case the current f1_score returns the value for the positive label. that's purely a convention, right? there is no reason to think that the +1 label is any different from a -1 or a 0 label.

if that's the case, wouldn't it make sense to have a single function whose default is pos_label=None and which always returns np.average(f1, weights=support)? if pos_label is not None, it would return np.average(f1[pos_label], weights=support). that would be a generic single-function solution, but it would change the api.

i'll change avg -> average. should corrcoef be changed to correlation_coefficient?

@GaelVaroquaux (Member)

i'll change avg -> average. should corrcoef be changed to correlation_coefficient?

I don't think so: corrcoef is just a question of following numpy.

@ogrisel (Member) commented Nov 16, 2011

@satra: yes, the choice of the positive label is a convention in binary classification. In information retrieval, positive means "belongs to the search results of the query" whereas all other documents from the index are classified as "negative". Hence the unbalanced labeling between positive and negative.

As for the treatment of pos_label=None to tell multiclass classification tasks apart from binary classification tasks, I would rather change the existing fbeta_score function (and related functions such as precision_score and recall_score) to do:

y_true, y_pred = check_arrays(y_true, y_pred)

if labels is None:
    labels = unique_labels(y_true, y_pred)
else:
    labels = np.asarray(labels, dtype=np.int)
_, _, f, s = precision_recall_fscore_support(y_true, y_pred, beta=beta, labels=labels)
if pos_label is not None:
    if f.shape[0] != 2:
        raise ValueError("pos_label should be set to None for multiclass tasks, got %d" % pos_label)
    if pos_label not in labels:
        raise ValueError("pos_label=%d is not a valid label: %r" % (pos_label, labels))
    pos_label_idx = list(labels).index(pos_label)
    return f[pos_label_idx]
else:
    return np.average(f, weights=s)

In that case we don't need to introduce a new function (we can reuse the existing ones). Of course we need to make it explicit that binary classification tasks should provide an explicit pos_label, otherwise the problem is treated as multiclass classification with support-weighted averaging of the scores.

@satra (Member Author) commented Nov 16, 2011

@ogrisel: this sounds good to me. shall i post to the list to get some feedback before i go ahead and change?

@ogrisel (Member) commented Nov 16, 2011

Yes, asking for more comments would be great.

References
----------
http://en.wikipedia.org/wiki/Matthews_correlation_coefficient
doi: 10.1093/bioinformatics/16.5.412

Member:

Can we get the full reference here? The Wikipedia page may change in the future

Member Author:

is the doi not sufficient?

Member:

On Sat, Dec 10, 2011 at 02:54:05AM -0800, Satrajit Ghosh wrote:

is the doi not sufficient?

I'd rather have a clickable link. Ideally freely downloadable.

G

Member Author:

http://dx.doi.org/10.1093/bioinformatics/16.5.412

Member:

On Sat, Dec 10, 2011 at 02:59:22AM -0800, Satrajit Ghosh wrote:

http://dx.doi.org/10.1093/bioinformatics/16.5.412

That's great: I can click on it, and get a PDF :)

G

@robertlayton (Member)

This is great - it was actually on my todo list for when I got some time!
I would also prefer ogrisel's method of doing both binary and average f1 in the same function.
If f.shape[0] > 2, should the value of pos_label just be ignored?

@satra (Member Author) commented Nov 16, 2011

can't we simply do

np.average(f[labels], weights=s[labels])

where labels can be an array. we can do the appropriate checks and change pos_label to labels.

@ogrisel (Member) commented Nov 16, 2011

Since the default value is pos_label=None, it's better to throw an exception when the user passes a non-None value on a multiclass dataset than to silently fix the problem in the background without telling the caller that what was asked for is meaningless.

@mblondel (Member)

I didn't quite understand how the new metric for average f1 score differs from what we currently have (I'm a bit in a hurry so I guess I'm missing something). By the way, the way we average the f1 score (before this PR) is not standard. We need to add micro and macro averaging instead. See #83.

@satra (Member Author) commented Nov 17, 2011

@mblondel: the difference with the new metric is that even for two classes it gives the weighted average, which is impossible with the current one. i like the micro/macro averaging idea. is there a reference i can look up?

* master: (85 commits)
  Benchmark: Added random forests and extra-trees to bench_sgd_covertype.py
  ENH: Address @agramfort comments
  fix: rm `nu` argument from sparse.SVR (taken from dense SVR).
  fix: learning rate schedule doc.
  Cosmit
  DOC: Use ELLIPSIS in doc-test
  FIX: Bug in plot_forest_iris
  FIX: Bug with bootstrapping (2)
  FIX: Bug with bootstrapping
  Revert "FIX : removing param nu from sparse.SVR, C from NuSVR + pep8"
  Revert "Remove C from NuSVR."
  DOC: Removed 'default'
  DOC: Added warning in make_estimator
  ENH: Use trailing _ for private attributes
  Web page layout tweaks.
  DOC: rename n -> p
  Docstring conventions.
  ENH: Simplified RandomForest and ExtraTrees API
  Move implementation details into RST doc.
  Minor update and fixes to linear_model documentation
  ...

Only in the binary case does this relate to information about true and
false positives and negatives. See references below.

Member:

Perhaps this should include a brief description of what the Matthews correlation coefficient is? Something like the Wikipedia page would be good.

Member Author:

will do.
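
For reference, a minimal sketch of the binary-case Matthews correlation coefficient computed from a 2x2 confusion matrix (the function name and the zero-denominator convention here are illustrative, not the exact code in this PR):

import numpy as np
from sklearn.metrics import confusion_matrix

def mcc_from_confusion(y_true, y_pred):
    # Binary case: confusion matrix is [[tn, fp], [fn, tp]] for labels {0, 1}.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    num = tp * tn - fp * fn
    den = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    # Convention used here: return 0 when a marginal is empty (den == 0).
    return 0.0 if den == 0 else num / den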

* master: (69 commits)
  FIX - error in the bibtex entry - extra comma that makes bibtex fail
  Add arXiv link to Halko et al. 2009 paper.
  DOC rm Methods section from KMeans docstring
  Fix svmlight loader doc.
  modified 'gid' to 'git'
  COSMIT use Python 2.6 except-as syntax
  use cPickle in spectral clustering tests
  rm Py2.5 compat factorial and combinations from utils.extmath
  COSMIT a few more easy cases of with open syntax
  COSMIT use ABCMeta in naive_bayes
  COSMIT use urlretrieve and "with" syntax in LFW module
  Revert "FIX: python2.5 SyntaxError"
  Revert "FIX: more python2.5 SyntaxError"
  don't use assert_in, not supported by nose on buildbot
  COSMIT use utils.deprecated as a class decorator
  test @deprecated using warnings.catch_warnings
  DOC I don't think Ubuntu 10.04 will be the last LTS release
  Update README.txt dependencies info to match the configuration tested on jenkins
  cosmit
  COSMIT docstring fix + US spelling in K-means code
  ...

@satra (Member Author) commented Dec 15, 2011

there are a couple of tests in other sections that are failing. could somebody please review this to ensure that this is doing what was intended?

fp = cm[0,1]
fn = cm[1,0]
num = (tp*tn-fp*fn)
den = np.sqrt((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn))

Member:

can you please run pep8 on this to spot the code style issues? The goal is to keep this plot going downward: https://jenkins.shiningpanda.com/scikit-learn/job/python2.7-numpy1.5.1-scipy-0.10.0/violations/

Member Author:

done

assert_array_almost_equal(ps, 0.62, 2)

rs = recall_score(y_true, y_pred)
rs = recall_score(y_true, y_pred, pos_label=None, average='micro')
assert_array_almost_equal(rs, 0.61, 2)

Member:

Recent discussions here and here seem to imply that micro averaging should yield the same score for precision and recall (hence f1 too). This is not the case in this test. Who is right?

Member Author:

these were actually identical, but since i ran the tests without checking the actual number, 0.62 and 0.61 both satisfy assert_array_almost_equal. the true answer is 0.6133.

in this case where these numbers are equal, does it make sense to have 'micro' as the default?

Member:

We could keep weighted as the default for backward compatibility.

Member:

Alright, thanks for checking, I was too lazy :) As for the default value, I have no strong opinion. Both micro and weighted are counter-intuitive for different reasons though; I think the docstring should make it explicit that:

  • 'weighted' averaging can yield an f1 score that is not between precision and recall
  • 'micro' averaging implies that precision == recall == f1

Member:

@mblondel I am not sure we want to keep 'weighted' as the default, as it seems to be a scikit-learn-specific approach that nobody else is using. I slightly prefer 'micro', even though the fact that precision == recall == f1 in the multiclass case might be a bit counter-intuitive. I think that's fine as long as this behavior is documented in the docstring.

avg_recall = true_pos.sum() / (true_pos.sum() + false_neg.sum())
avg_fscore = (1 + beta2) * (avg_precision * avg_recall) / \
    (beta2 * avg_precision + avg_recall)
if average == 'macro':

Member:

Any reason to use if rather than elif?


average : string, ['micro', 'macro', 'weighted']
    in the multiclass classification case, this determines the
    type of averaging performed on the data.

Member:

A small description of each averaging strategy would be nice.

macro: average over classes (does not take imbalance into account)
micro: average over instances (takes imbalance into account)
weighted: average weighted by support (takes imbalance into account)
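
For illustration, a minimal numpy sketch of the three schemes (the per-class counts below are hypothetical, not taken from this PR):

import numpy as np

# Hypothetical per-class counts for a 3-class, single-label problem.
tp = np.array([50., 10., 5.])
fp = np.array([5., 2., 8.])
fn = np.array([3., 7., 5.])
support = tp + fn

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

macro_f1 = f1.mean()                           # unweighted mean over classes
weighted_f1 = np.average(f1, weights=support)  # support-weighted mean

# 'micro' pools the counts over classes; in single-label multiclass the
# pooled fp and fn totals coincide, hence micro precision == recall == f1.
micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)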

Member Author:

pushed some more changes.

@satra (Member Author) commented Dec 15, 2011

query about default behavior: a bunch of tests are failing in test_sgd because we currently set the defaults to pos_label=1, average='weighted', but a ValueError is raised when multiclass input is passed. i don't want to touch the sgd tests, but the idea there would be to call f1_score with pos_label=None for multiclass input.

also along these lines, the cross_validation routines are failing. is there a way to assign something like score_func=f1_score(pos_label=None)?

@ogrisel (Member) commented Dec 15, 2011

Hum, the cross validation point is a very good remark.

Maybe we should split features and have explicit functions for the multiclass and binary cases after all. But cross_val_score is also restrictive in other cases: for instance, to use AUC as a cross-validated score I have to subclass the estimator and override estimator.score, because the cross_val_score API is not flexible enough.

Otherwise, as a short-term solution, maybe we could use pos_label=None as the default?

@ogrisel (Member) commented Dec 16, 2011

About the cross_val_score API issues, I had a chat with @agramfort on the bus from the Alhambra to Sierra Nevada and we came to the same conclusion: we probably need a parameterized Scorer API in the scikit that would be suitable for the following use cases:

  • cross validation with explicit integer predictions, transforms, or AUC based on decision functions
  • grid search with cross validation (same as above)
  • early stopping against a fixed validation set for online / large scale incremental estimators (SGD, MinibatchKMeans, MinibatchDictionaryLearning)

The gist would be to have the user provide one-time, constructor-level parameters (like the averaging scheme for multiclass evaluation using the F1 score) or a fixed validation set to compute the score against, which could then be treated as a curried evaluation function to be called many times (on different cross-val folds during grid search, or every n_minibatches for early stopping).

We will need to think this idea through further: I think we need a new experimental branch to devise the exact API by implementing the aforementioned use cases.

I would also like to keep the *_score functions in parallel so as to make it easier for the user to compute one-off evaluations of an estimator (e.g. in an IPython session). The scorer implementations could be based on those function implementations when it makes sense.

WDYT @satra and @mblondel?

@satra (Member Author) commented Dec 21, 2011

@ogrisel

  1. first i changed pos_label to None - this is an api change, so it should be made amply clear to folks.
  2. yes we need a new api for scoring in relation to any functionality that calls the scoring object (also add permutation test to your list above). something along the following lines?
class Scorer(object):
    def __init__(self, score_func, **score_func_args):
        self._score_func = score_func
        self._score_func_args = score_func_args
        #TODO: validate that score_func_args can be applied to score_func

    def score(self, y_true, y_pred):
        return self._score_func(y_true, y_pred, **self._score_func_args)

I presume a similar thing for loss_func.

@ogrisel could you say a little more about the validation set and the transforms aspect? it seems like that should be separate from the scoring api. perhaps you are thinking that the scorer api could itself take a pipeline object?
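
For concreteness, a hedged usage sketch of the Scorer idea above (the keyword argument shown is illustrative, not necessarily the final API):

import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 2, 1, 1, 0, 2])
y_pred = np.array([0, 1, 1, 2, 0, 2])

# Bind the averaging choice once, then reuse the scorer on every fold or
# minibatch; Scorer here is the class sketched in the comment above.
scorer = Scorer(f1_score, average='macro')
print(scorer.score(y_true, y_pred))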

@ogrisel (Member) commented Dec 31, 2011

@satra sorry for the late reply. I think @GaelVaroquaux has an alternative suggestion based on functools to address this issue.
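
One possible reading of that functools-based alternative, assuming it amounts to functools.partial (this is a guess, not @GaelVaroquaux's actual proposal):

from functools import partial
from sklearn.metrics import f1_score

# Bind the metric's keyword arguments up front; the result keeps the plain
# (y_true, y_pred) signature that score_func-style arguments expect.
macro_f1 = partial(f1_score, average='macro')
# e.g. passing score_func=macro_f1 to a cross-validation helper -- hypothetical usage.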

@satra (Member Author) commented Jan 10, 2012

@GaelVaroquaux could you please elaborate on your functools suggestion?

@mblondel (Member)

I pushed a couple of commits to my metrics branch (https://github.com/mblondel/scikit-learn/tree/metrics). First, I reverted pos_label back to 1: pos_label is just to specify a label encoding and I think it's confusing to use it as a switch to enable / disable a behavior. Also now people who want the values of the two classes in the binary case (a feature that @satra wanted) need to pass an explicit average=False or average=None. I think my solution is simpler. The tests are passing. I hope I didn't misunderstand some of the things you were trying to do.

I think we can now merge @satra's PR with my commits to master and work on the scorer API in another PR. This PR has been pending for too long (and I need averaged metrics for my research :p)

In the grid-search module, there are plenty of calls to predict, but for AUC we need to use either decision_function or predict_proba. Currently, score_func must be a callable. I suggest we also allow strings so we can, for example, do GridSearchCV(score_func="auc"). This way we can easily switch to decision_function or predict_proba where necessary.
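
A rough sketch of how such string dispatch could look (the registry and helper below are hypothetical, not an existing scikit-learn API):

from sklearn.metrics import f1_score, roc_auc_score

# Map a name to (metric, needs_threshold); metrics flagged True should be fed
# decision_function / predict_proba outputs instead of hard predictions.
SCORE_FUNCS = {
    "f1": (f1_score, False),
    "auc": (roc_auc_score, True),
}

def resolve_score_func(score_func):
    # Accept either a callable or a registered string name.
    if callable(score_func):
        return score_func, False
    return SCORE_FUNCS[score_func]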

@ogrisel (Member) commented Jan 27, 2012

Sounds like a reasonable plan to advance incrementally on this issue.

@ogrisel (Member) commented Jan 27, 2012

@satra can you merge @mblondel's change into your branch so that we can merge this PR?

@mblondel (Member)

It could be convenient to have alias functions macro_precision, micro_precision, ..., micro_f1_score. Pro: useful for cross-validation. Con: 6 more functions. What do you think?

satra added 2 commits January 27, 2012 09:13
* upstream/master: (823 commits)
  Set the download link to PYPI.
  Address Issue scikit-learn#590 : use relative path link to about.html
  check random state in _fit_transform
  add random_state to LocallyLinearEmbedding
  MISC: fix bibtex
  Add n_jobs option to pairwise_distances and pairwise_kernels.
  euclidian_distances is to be deprecated in v0.11.
  Better default threshold for L1-regularized models.
  Remove CoefSelectTransformerMixin and use SelectorMixin instead.
  COSMIT removed unused imports
  Cosmit: remove unused imports
  Refactor in KFold.
  COSMIT pep8
  FIX: MurmurHash3 compilation on older GCC
  DOC added required versions of python, numpy and scipy to install documentation. Closes issue scikit-learn#579
  DOC website: added link to 0.10 docs under support.
  FIX: broken build / tests
  Move preprocessing.py to sklearn/.
  preprocessing/__init__.py -> preprocessing/preprocessing.py
  ENH: Default in Vectorizer "None" as @ogrisel suggested
  ...
* mblondel/metrics:
  Factor some code.
  Backward compatibility in precision, recall and f1-score.

@satra (Member Author) commented Jan 27, 2012

@mblondel i think allowing strings is what would solve the problem, but @GaelVaroquaux had some alternate thoughts (?)

also i have no issue with adding helper functions. in fact what got this started was an additional helper function called average_f1_score!

@ogrisel : merged @mblondel's branch.

@ogrisel (Member) commented Jan 27, 2012

I am -0 for adding more public functions for these metrics.

If we add new public functions they should be suffixed with "_score" when positive and "_loss" when negative to be consistent with the rest of the module. In this case (precision and recall) they are both positive.

@mblondel (Member)

@satra: if you don't have any further changes, please press the green button :)

We can add the helper functions later in master if we need them (for now, I put a few helpers in my user-land code).

satra added a commit that referenced this pull request Jan 27, 2012
@satra merged commit 689931c into scikit-learn:master on Jan 27, 2012