labels argument of classification_report is not useful when y is a list of strings #3123

Closed
kmike opened this issue Apr 29, 2014 · 8 comments
@kmike
Contributor

kmike commented Apr 29, 2014

In scikit-learn 0.14.1 it was possible to pass y_true and y_pred as lists of strings, together with a list of strings as the labels argument to classification_report, and it worked as expected: only labels from this list were included in the report. This no longer works in scikit-learn master.

It was never documented that this should work: the docs say that labels is an "Optional list of label indices to include in the report." So, according to the docs, the behavior was undefined when y consists of strings and the labels argument is passed, because the caller has no correct indices to pass in that case.

It seems it is better to either raise an error if labels is passed when y is not pre-transformed by a LabelEncoder, or to restore and document 0.14.1 behavior. What do you think?

@arjoly
Member

arjoly commented Apr 30, 2014

Thanks for reporting. Can you give a small example?

@kmike
Contributor Author

kmike commented Apr 30, 2014

Sure, sorry for not doing that! This used to work:

>>> import sklearn
>>> sklearn.__version__
'0.14.1'
>>> from sklearn.metrics import classification_report
>>> y_true = ['foo', 'bar', 'baz', 'spam']
>>> y_pred = ['foo', 'bar', 'bar', 'spam']
>>> print classification_report(y_true, y_pred, labels=['bar', 'spam'])
             precision    recall  f1-score   support

        bar       0.50      1.00      0.67         1
       spam       1.00      1.00      1.00         1

avg / total       0.75      1.00      0.83         2

And this is how it fails now:

>>> import sklearn
>>> sklearn.__version__
'0.15-git'
>>> from sklearn.metrics import classification_report
>>> y_true = ['foo', 'bar', 'baz', 'spam']
>>> y_pred = ['foo', 'bar', 'bar', 'spam']
>>> print classification_report(y_true, y_pred, labels=['bar', 'spam'])
ValueError                                Traceback (most recent call last)
/Users/kmike/svn/scikit-learn/sklearn/metrics/metrics.py in classification_report(y_true, y_pred, labels, target_names, sample_weight)
   2055                                                   labels=labels,
   2056                                                   average=None,
-> 2057                                                   sample_weight=sample_weight)
   2058 
   2059     for i, label in enumerate(labels):

/Users/kmike/svn/scikit-learn/sklearn/metrics/metrics.py in precision_recall_fscore_support(y_true, y_pred, beta, labels, pos_label, average, warn_for, sample_weight)
   1724         lb = LabelEncoder()
   1725         lb.fit(labels)
-> 1726         y_true = lb.transform(y_true)
   1727         y_pred = lb.transform(y_pred)
   1728         labels = lb.classes_

/Users/kmike/svn/scikit-learn/sklearn/preprocessing/label.pyc in transform(self, y)
    135         if len(np.intersect1d(classes, self.classes_)) < len(classes):
    136             diff = np.setdiff1d(classes, self.classes_)
--> 137             raise ValueError("y contains new labels: %s" % str(diff))
    138         return np.searchsorted(self.classes_, y)
    139 

ValueError: y contains new labels: ['baz' 'foo']

@arjoly
Member

arjoly commented Apr 30, 2014

+1 for restoring the previous behaviour. This would be in line with #2610.

@amueller amueller added the Bug label Jan 22, 2015
@amueller amueller added this to the 0.16 milestone Jan 22, 2015
@amueller amueller modified the milestones: 0.16, 0.17 Sep 11, 2015
@raghavrv
Member

raghavrv commented Oct 7, 2015

@ogrisel This has been fixed by Joel's #4287

>>> import sklearn
>>> sklearn.__version__
'0.17.dev0'
>>> from sklearn.metrics import classification_report
>>> y_true = ['foo', 'bar', 'baz', 'spam']
>>> y_pred = ['foo', 'bar', 'bar', 'spam']
>>> print classification_report(y_true, y_pred, labels=['bar', 'spam'])
             precision    recall  f1-score   support

        bar       0.50      1.00      0.67         1
       spam       1.00      1.00      1.00         1

avg / total       0.75      1.00      0.83         2
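
For readers on Python 3, the same check can be sketched without relying on the printed table layout, which has changed since this thread. This assumes a scikit-learn release recent enough to support the output_dict argument of classification_report; the variable names are illustrative:

```python
from sklearn.metrics import classification_report

y_true = ["foo", "bar", "baz", "spam"]
y_pred = ["foo", "bar", "bar", "spam"]

# Restrict the report to a subset of the string labels.
report = classification_report(
    y_true, y_pred, labels=["bar", "spam"], output_dict=True
)

# Per-label metrics are keyed by the label string itself.
bar = report["bar"]
spam = report["spam"]
print(bar["precision"], bar["recall"], bar["support"])
print(spam["precision"], spam["recall"], spam["support"])
```

Only the per-label entries are inspected here; the names and number of the averaged rows vary between releases.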

@ogrisel
Member

ogrisel commented Oct 7, 2015

Indeed, thanks for the heads up @rvraghav93, closing.

@ogrisel ogrisel closed this as completed Oct 7, 2015
@YanjingWang

I still have this problem and do not know how to fix it.

mapper = DataFrameMapper(
    [
        ('AgeGroup', LabelEncoder()),
        ('Education', LabelEncoder()),
        ('Workclass', LabelEncoder()),
        ('MaritalStatus', LabelEncoder()),
        ('Occupation', LabelEncoder()),
        ('Relationship', LabelEncoder()),
        ('Race', LabelEncoder()),
        ('Sex', LabelEncoder()),
        ('Income', LabelEncoder()),
    ],
    df_out=True,
    default=None,
)

cols = list(df_train_set.columns)
cols.remove("Income")
cols = cols[:-3] + ["Income"] + cols[-3:]

df_train = mapper.fit_transform(df_train_set.copy())
df_train.columns = cols

df_test = mapper.transform(df_test_set.copy())
df_test.columns = cols

cols.remove("Income")
x_train, y_train = df_train[cols].values, df_train["Income"].values
x_test, y_test = df_test[cols].values, df_test["Income"].values

ValueError                                Traceback (most recent call last)
in ()
      8 df_train.columns = cols
      9
---> 10 df_test = mapper.transform(df_test_set.copy())
     11 df_test.columns = cols
     12

~/anaconda/lib/python3.6/site-packages/sklearn_pandas/dataframe_mapper.py in transform(self, X)
    277 if transformers is not None:
    278 with add_column_names_to_exception(columns):
--> 279 Xt = transformers.transform(Xt)
    280 extracted.append(_handle_feature(Xt))
    281

~/anaconda/lib/python3.6/site-packages/sklearn/preprocessing/label.py in transform(self, y)
    131 if len(np.intersect1d(classes, self.classes_)) < len(classes):
    132 diff = np.setdiff1d(classes, self.classes_)
--> 133 raise ValueError("y contains new labels: %s" % str(diff))
    134 return np.searchsorted(self.classes_, y)
    135

ValueError: Income: y contains new labels: [' <=50K.' ' >50K.']

@jnothman
Member

jnothman commented Dec 4, 2017

Your issue has nothing to do with classification_report, and belongs on a user forum or Stack Overflow, not a bug tracker. You have punctuation and whitespace in some of your labels, and that is causing your error.
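
One possible way to handle this, sketched under the assumption that the training labels are the same strings without the trailing period (the thread does not show the training values, so train_labels below is a guess, and normalize_label is a hypothetical helper, not part of any library):

```python
def normalize_label(label):
    """Strip surrounding whitespace and a trailing period from a label."""
    return label.strip().rstrip(".")


train_labels = [" <=50K", " >50K"]    # assumed training values
test_labels = [" <=50K.", " >50K."]   # values from the error message

# Before cleaning, every test label is unknown to the fitted encoder.
unseen_before = set(test_labels) - set(train_labels)

# After cleaning both sides the same way, nothing is unknown.
seen = {normalize_label(label) for label in train_labels}
unseen_after = {normalize_label(label) for label in test_labels} - seen

print(sorted(unseen_before))
print(sorted(unseen_after))
```

The cleaning would need to be applied to the Income column of both the training and test frames before fitting the mapper.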

@YanjingWang

YanjingWang commented Dec 4, 2017 via email
