WIP: Enh/metrics #443
Conversation
    """
    mcc = np.corrcoef(y_true, y_pred)[0,1]
    if np.isnan(mcc):
PEP 8 (missing space after comma) :)
@Gael: the difference with f1_score in multiclass settings is that whenever there is a binary classification problem, it will return only the positive f1 score by default, whereas this will always return the average. clarified in the doc string now.
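A small illustration of that distinction, using today's `f1_score` keyword names as stand-ins (the exact parameter names in this PR may differ):

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1])

# binary convention: only the F1 of the positive class is reported
print(f1_score(y_true, y_pred, pos_label=1, average='binary'))

# the behaviour described above: a support-weighted average over both classes
print(f1_score(y_true, y_pred, average='weighted'))
```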
also we should work towards including all the metrics in the referenced doi under matthews_corrcoef. there are some generally useful ones for the multiclass case. we already have several of the mutual info metrics.
Maybe we should instead write separate functions for binary and multiclass f1 score, such that calling the binary classification variant on a multiclass problem fails explicitly. Also I would rather avoid names such as `avg_f1_score`.
@ogrisel: just a quick question. in the binary case the current f1_score returns the value for the positive label. that's purely convention, right? there is no reason to think that the +1 label is any different from a -1 or a 0 label. if that's the case, wouldn't it make sense to have a single function with pos_label=None as the default that always returns np.average(f1, weights=support), and if pos_label is not None, returns the f1 score for that label? that would be a generic single-function solution, but would change the api. i'll change avg -> average. should corrcoef be changed to correlation_coefficient?
I don't think so: corrcoef is just a question of following numpy.
@satra: yes the choice of the positive label is a convention in binary classification. In information retrieval positive means "belongs to the search results of the query" whereas all other documents from the index are classified as "negative". Hence the unbalanced labeling between positive and negative. As for the treatment of `pos_label`, something along these lines should work:

    y_true, y_pred = check_arrays(y_true, y_pred)
    if labels is None:
        labels = unique_labels(y_true, y_pred)
    else:
        labels = np.asarray(labels, dtype=np.int)
    _, _, f, s = precision_recall_fscore_support(y_true, y_pred, beta=beta,
                                                 labels=labels)
    if pos_label is not None:
        if f.shape[0] != 2:
            raise ValueError("pos_label should be set to None for "
                             "multiclass tasks, got %d" % pos_label)
        if pos_label not in labels:
            raise ValueError("pos_label=%d is not a valid label: %r"
                             % (pos_label, labels))
        pos_label_idx = list(labels).index(pos_label)
        return f[pos_label_idx]
    else:
        return np.average(f, weights=s)

In that case we don't need to introduce a new function (we can reuse the existing ones). Of course we need to make it explicit that binary classification tasks should provide an explicit `pos_label`.
@ogrisel: this sounds good to me. shall i post to the list to get some feedback before i go ahead and change?
Yes that would be great to ask for more comments. |
    References
    ----------
    http://en.wikipedia.org/wiki/Matthews_correlation_coefficient
    doi: 10.1093/bioinformatics/16.5.412
Can we get the full reference here? The Wikipedia page may change in the future.
is the doi not sufficient?
On Sat, Dec 10, 2011 at 02:54:05AM -0800, Satrajit Ghosh wrote:

> - http://en.wikipedia.org/wiki/Matthews_correlation_coefficient
> - doi: 10.1093/bioinformatics/16.5.412
>
> is the doi not sufficient?

I'd rather have a clickable link. Ideally freely downloadable.

G
On Sat, Dec 10, 2011 at 02:59:22AM -0800, Satrajit Ghosh wrote:
That's great: I can click on it, and get a PDF :)
G
This is great - it was actually on my todo list for when I get some time!
can't we simply do `np.average(f[labels], weights=s[labels])` where ...
The default value being ...
I didn't quite understand how the new metric for average f1 score differs from what we currently have (I'm a bit in a hurry so I guess I'm missing something). By the way, the way we average the f1 score (before this PR) is not standard. We need to add micro and macro averaging instead. See #83.
@mblondel: the difference with the new metric is that even for two classes it gives the weighted average, which is impossible to do with the current one. i like the micro/macro averaging idea. is there a reference i can look up?
* master: (85 commits) Benchmark: Added random forests and extra-trees to bench_sgd_covertype.py ENH: Address @agramfort comments fix: rm `nu` argument from sparse.SVR (taken from dense SVR). fix: learning rate schedule doc. Cosmit DOC: Use ELLIPSIS in doc-test FIX: Bug in plot_forest_iris FIX: Bug with bootstrapping (2) FIX: Bug with bootstrapping Revert "FIX : removing param nu from sparse.SVR, C from NuSVR + pep8" Revert "Remove C from NuSVR." DOC: Removed 'default' DOC: Added warning in make_estimator ENH: Use trailing _ for private attributes Web page layout tweaks. DOC: rename n -> p Docstring conventions. ENH: Simplified RandomForest and ExtraTrees API Move implementation details into RST doc. Minor update and fixes to linear_model documentation ...
    Only in the binary case does this relate to information about true and
    false positives and negatives. See references below.
Perhaps this should include a brief description of what the Matthews correlation coefficient is? Something like the Wikipedia page would be good.
will do.
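For reference, the standard binary-case definition (as given on the Wikipedia page cited in the docstring) is:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)\,(TP + FN)\,(TN + FP)\,(TN + FN)}}$$

which is the Pearson correlation coefficient between the true and predicted binary labels, i.e. exactly what the `np.corrcoef(y_true, y_pred)[0,1]` call quoted above computes.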
* master: (69 commits) FIX - error in the bibtex entry - extra comma that makes bibtex fail Add arXiv link to Halko et al. 2009 paper. DOC rm Methods section from KMeans docstring Fix svmlight loader doc. modified 'gid' to 'git' COSMIT use Python 2.6 except-as syntax use cPickle in spectral clustering tests rm Py2.5 compat factorial and combinations from utils.extmath COSMIT a few more easy cases of with open syntax COSMIT use ABCMeta in naive_bayes COSMIT use urlretrieve and "with" syntax in LFW module Revert "FIX: python2.5 SyntaxError" Revert "FIX: more python2.5 SyntaxError" don't use assert_in, not supported by nose on buildbot COSMIT use utils.deprecated as a class decorator test @deprecated using warnings.catch_warnings DOC I don't think Ubuntu 10.04 will be the last LTS release Update README.txt dependencies info to match the configuration tested on jenkins cosmit COSMIT docstring fix + US spelling in K-means code ...
removed avg_f1_score
there are a couple of tests in other sections that are failing. could somebody please review this to ensure that this is doing what was intended?
    fp = cm[0,1]
    fn = cm[1,0]
    num = (tp*tn-fp*fn)
    den = np.sqrt((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn))
can you please run pep8 on this to spot the code style issues? The goal is to keep this plot going downward: https://jenkins.shiningpanda.com/scikit-learn/job/python2.7-numpy1.5.1-scipy-0.10.0/violations/
done
    assert_array_almost_equal(ps, 0.62, 2)

    rs = recall_score(y_true, y_pred)
    rs = recall_score(y_true, y_pred, pos_label=None, average='micro')
    assert_array_almost_equal(rs, 0.61, 2)
these were actually identical, but since i ran the tests without checking the actual number, 0.62 and 0.61 both satisfy assert_array_almost_equal. the true answer is 0.6133.
in this case where these numbers are equal, does it make sense to have 'micro' as the default?
We could keep `weighted` as the default for backward compatibility.
Alright, thanks for checking, I was too lazy :) As for the default value, I have no strong opinion. Both micro and weighted are counter intuitive for different reasons though. I think the docstring should make it explicit that:

- 'weighted' averaging can yield an f1 score that is not between precision and recall
- 'micro' averaging implies that precision == recall == f1
@mblondel I am not sure we want to keep 'weighted' as the default as it does seem to be a scikit-learn specific approach that nobody else is using. I think I slightly like "micro" better even though the fact that precision == recall == f1 in multiclass might be a bit counter intuitive. I think that's fine as long as this behavior is documented in the docstring.
    avg_recall = true_pos.sum() / (true_pos.sum() + false_neg.sum())
    avg_fscore = (1 + beta2) * (avg_precision * avg_recall) / \
        (beta2 * avg_precision + avg_recall)
    if average == 'macro':
Any reason to use `if` rather than `elif`?
    average : string, ['micro', 'macro', 'weighted']
        in the multiclass classification case, this determines the
        type of averaging performed on the data.
A small description of each averaging strategy would be nice.
- macro: average over classes (does not take imbalance into account)
- micro: average over instances (takes imbalance into account)
- weighted: average weighted by support (takes imbalance into account)
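A self-contained sketch of how the three schemes differ (the per-class counts below are made up for illustration, not taken from this PR's tests):

```python
import numpy as np

# Hypothetical per-class counts for a 3-class problem.
tp = np.array([20., 15., 5.])   # true positives per class
fp = np.array([5., 10., 2.])    # false positives per class
fn = np.array([4., 8., 5.])     # false negatives per class
support = tp + fn               # number of true instances per class

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

macro_f1 = f1.mean()                           # plain mean over classes
weighted_f1 = np.average(f1, weights=support)  # mean weighted by class support

micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

# For single-label multiclass data the total number of false positives equals
# the total number of false negatives, so micro precision == recall == f1
# (here 40/57 for all three).
print(macro_f1, weighted_f1, micro_f1)
```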
pushed some more changes.
query about default behavior. a bunch of tests are failing in test_sgd because of the defaults we currently set. also along these lines the cross_validation routines are failing. is there a way to assign `score_func=f1_score(pos_label=None)`?
Hum, the cross validation issue is a very good remark. Maybe we should split features and have explicit methods for the multiclass and binary cases after all. But also ... Otherwise, maybe as a short term solution we could use ...
About the ...
The gist would be to have the user provide one-time constructor level parameters (like the averaging scheme for multiclass evaluation using the F1 score) or a fixed validation set to compute the score against, which could then be treated as a curried evaluation function to be called many times (on different cross val folds during grid search, or every ...). We will need to think this idea further: I think we need a new experimental branch to devise the exact API by implementing the aforementioned use cases. I would also like to keep the ...
    class Scorer(object):

        def __init__(self, score_func, **score_func_args):
            self._score_func = score_func
            self._score_func_args = score_func_args
            # TODO: validate that score_func_args can be applied to score_func

        def score(self, y_true, y_pred):
            return self._score_func(y_true, y_pred, **self._score_func_args)

I presume a similar thing for loss_func. @ogrisel could you clarify a little more on the validation set and the transforms aspect? seems like that should be different from the scoring api. perhaps you are thinking that score api can itself take a pipeline object?
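As a hypothetical illustration of how the `Scorer` sketch above might be used (the keyword arguments assume today's `f1_score` names, not necessarily the final API):

```python
import numpy as np
from sklearn.metrics import f1_score

# Assumes the Scorer class sketched in the comment above is in scope.
# The constructor freezes the metric's extra arguments once, so downstream
# code (grid search, cross-validation) only calls scorer.score(y_true, y_pred).
scorer = Scorer(f1_score, average='weighted')

y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 1])
print(scorer.score(y_true, y_pred))
```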
@satra sorry for the late reply. I think @GaelVaroquaux has an alternative suggestion based on `functools`.
@GaelVaroquaux could you please elaborate on your functools suggestion?
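A minimal sketch of the `functools.partial` idea under discussion; the keyword arguments below follow the API debated in this thread and are assumptions, not the final signature:

```python
from functools import partial
import numpy as np
from sklearn.metrics import f1_score

# A partially-applied metric keeps the plain (y_true, y_pred) signature,
# so it can be handed to any utility that expects a score_func callable.
weighted_f1 = partial(f1_score, pos_label=None, average='weighted')

y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 1])
print(weighted_f1(y_true, y_pred))

# e.g. something along the lines of:
#     cross_val_score(clf, X, y, score_func=weighted_f1)
```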
I pushed a couple of commits to my branch. I think we can now merge @satra's PR with my commits to master and work on the scorer API in another PR. This PR has been pending for too long (and I need averaged metrics for my research :p). In the grid-search module, there are plenty of calls to ...
Sounds like a reasonable plan to advance incrementally on this issue.
It could be convenient to have alias functions ...
* upstream/master: (823 commits) Set the download link to PYPI. Address Issue scikit-learn#590 : use relative path link to about.html check random state in _fit_transform add random_state to LocallyLinearEmbedding MISC: fix bibtex Add n_jobs option to pairwise_distances and pairwise_kernels. euclidian_distances is to be deprecated in v0.11. Better default threshold for L1-regularized models. Remove CoefSelectTransformerMixin and use SelectorMixin instead. COSMIT removed unused imports Cosmit: remove unused imports Refactor in KFold. COSMIT pep8 FIX: MurmurHash3 compilation on older GCC DOC added required versions of python, numpy and scipy to install documentation. Closes issue scikit-learn#579 DOC website: added link to 0.10 docs under support. FIX: broken build / tests Move preprocessing.py to sklearn/. preprocessing/__init__.py -> preprocessing/preprocessing.py ENH: Default in Vectorizer "None" as @ogrisel suggested ...
* mblondel/metrics: Factor some code. Backward compatibility in precision, recall and f1-score.
@mblondel i think allowing strings is what would solve the problem but @GaelVaroquaux had some alternate thoughts (?) also i have no issue with adding helper functions. in fact what got this started was an additional helper function called average_f1_score!
I am -0 for adding more public functions for these metrics. If we add new public functions they should be suffixed with "_score" when positive and "_loss" when negative to be consistent with the rest of the module. In this case (precision and recall) they are both positive.
@satra: if you don't have any further changes, please press the green button :) We can add the helper functions later in master if we need them (for now, I put a few helpers in my user-land code).
adding a couple of metrics
TODO before merge: make this work with the GridSearchCV and cross_val_score utilities.