[MRG + 2] FIX LogisticRegressionCV to correctly handle string labels #5874

raghavrv · 2015-11-18T14:17:01Z

On master:

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

>>> X = np.arange(12)[:, np.newaxis]
>>> y = ['a',] * 4 + ['b',] * 4 + ['c',] * 4

>>> LogisticRegressionCV(solver='lbfgs', multi_class='multinomial').fit(X, y).predict(X)
ValueError: could not convert string to float: 'a'

On this branch:

>>> LogisticRegressionCV(solver='lbfgs', multi_class='multinomial').fit(X, y).predict(X)
array(['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'], dtype='<U1')

@agramfort @GaelVaroquaux

MechCoder · 2015-11-18T14:59:35Z

sklearn/linear_model/logistic.py

@@ -1525,9 +1525,6 @@ def fit(self, X, y, sample_weight=None):
        cv = check_cv(self.cv, y, classifier=True)
        folds = list(cv.split(X, y))

-        self._enc = LabelEncoder()
-        self._enc.fit(y)


Surely self._enc must be used somewhere?

It was added in 67585f6 but is not used anywhere. I can't work out why, though...

MechCoder · 2015-12-04T19:18:39Z

Instead of this, can we encode y once at fit time in LogisticRegressionCV and pass the encoded y to every split?

If you did this, you would need to rebuild class_weight when it is given as a dict, so that it is keyed by the encoded class labels. Other than that I don't foresee any problem.

Also, you can move check_classification_targets to before the type of y is checked.
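
A minimal sketch of the suggestion above, not the code that was merged. It assumes a hypothetical helper, `encode_targets`, that encodes y once with LabelEncoder and re-keys a class_weight dict from the original labels to the encoded ones:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


def encode_targets(y, class_weight=None):
    """Hypothetical helper (not in scikit-learn): encode y once and
    re-key a class_weight dict by the encoded labels."""
    encoder = LabelEncoder().fit(y)
    y_encoded = encoder.transform(y)
    if isinstance(class_weight, dict):
        # encoder.classes_ is sorted; position i is the code of classes_[i].
        class_weight = {
            code: class_weight[label]
            for code, label in enumerate(encoder.classes_)
            if label in class_weight
        }
    return y_encoded, class_weight, encoder


y = np.array(['a'] * 4 + ['b'] * 4 + ['c'] * 4)
y_encoded, encoded_weights, encoder = encode_targets(
    y, class_weight={'a': 1.0, 'b': 2.0, 'c': 0.5})
```

The encoded y could then be passed to every split, and `encoder.inverse_transform` would recover the original labels at predict time.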

MechCoder · 2015-12-04T19:27:24Z

This would also be a good time to clean up some parts of the code, especially the encoding logic, dtype checks, etc.

amueller · 2016-09-12T22:41:38Z

hm this is still a bug, right?

jnothman · 2016-09-17T10:54:09Z

sklearn/linear_model/logistic.py

@@ -898,7 +897,8 @@ def _log_reg_scoring_path(X, y, train, test, pos_class=None, Cs=10,
        y_test[~mask] = -1.

    # To deal with object dtypes, we need to convert into an array of floats.
-    y_test = check_array(y_test, dtype=np.float64, ensure_2d=False)
+    y_test = check_array(LabelEncoder().fit_transform(y_test),


Surely y_test needs to use the same encoder as was used for training.
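
To illustrate the concern, here is a small sketch with made-up labels (the arrays are assumptions chosen for illustration). When a test fold is missing a class, a fresh encoder fitted on that fold assigns different codes than the encoder fitted on the training labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y_train = np.array(['a', 'b', 'c', 'a', 'b', 'c'])
y_test = np.array(['b', 'c', 'c'])  # this fold happens to contain no 'a'

# Encoder fitted on the training labels: codes stay consistent across folds.
shared = LabelEncoder().fit(y_train)
consistent = shared.transform(y_test)

# A fresh encoder fitted on the test fold alone renumbers the classes.
inconsistent = LabelEncoder().fit_transform(y_test)
```

Here 'b' is code 1 under the shared encoder but code 0 under the fresh one, so scores computed against the second encoding would compare against the wrong classes.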

raghavrv · 2016-09-27T15:08:59Z

@MechCoder @jnothman @amueller Could you take a look at this now?

TomDLT · 2016-09-28T09:37:33Z

~~What about just adding label encoding after _check_solver_option in _log_reg_scoring_path ?~~

y = LabelEncoder().fit_transform(y)

TomDLT · 2016-09-28T09:45:45Z

sklearn/linear_model/tests/test_logistic.py

+    y = np.array(y) - 1
+    # Test for string labels
+    lr = LogisticRegression(solver='lbfgs', multi_class='multinomial')
+    lr_cv = LogisticRegression(solver='lbfgs', multi_class='multinomial')


you probably wanted to use LogisticRegressionCV here

jnothman · 2016-10-06T08:28:51Z

sklearn/linear_model/logistic.py

-        check_consistent_length(X, y)
+        if y.dtype.kind in ('S', 'U'):
+            # Encode for string labels
+            self._label_encoder = LabelEncoder().fit(y)


Why can't you do encoding in all cases and assume that LabelEncoder handles efficiency issues?

Thanks for the comment. I did so in the most recent commit...

jnothman · 2016-10-06T08:28:56Z

sklearn/linear_model/logistic.py

+                for new_key, old_key in zip(new_keys, old_keys):
+                    self.class_weight[new_key] = self.class_weight[old_key]
+        else:
+            self.classes_enc_ = self.classes_


Does this work when classes are [1,2,3], not [0,1,2]?

I'm encoding in all cases now, as you suggested, and there is a test for this use case.
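
A quick check of the [1, 2, 3] case with LabelEncoder alone (a sketch only, separate from the test added in the PR). Integer labels that do not start at 0 are mapped to 0..n_classes-1 and can be mapped back:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array([1, 2, 3, 1, 2, 3])
encoder = LabelEncoder().fit(y)

# Codes are positions in the sorted encoder.classes_ array.
y_encoded = encoder.transform(y)
# inverse_transform recovers the original labels.
y_restored = encoder.inverse_transform(y_encoded)
```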

raghavrv · 2016-10-12T23:50:42Z

Arghh. This is tougher to solve than I thought. After encoding the string labels at the fit of LogisticRegressionCV, ~~the logistic_regression_path helper again encodes the data and this causes a test failure I think... I'll have a deeper look at it soon...~~ The scores and coef dict require the unencoded class label as keys... ah...
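
A rough sketch of the constraint described above, using placeholder numbers rather than real results: per-class values computed against the encoded labels have to be re-keyed by the original class labels before they are exposed in a dict. The names `classes`, `scores_by_code` and `scores` are illustrative assumptions.

```python
# Original class labels in sorted order; position i is the encoded label.
classes = ['a', 'b', 'c']

# Placeholder per-class values indexed by encoded label (made-up numbers).
scores_by_code = [0.9, 0.8, 0.7]

# Re-key by the original label so callers can look up scores['a'].
scores = dict((cls, scores_by_code[code]) for code, cls in enumerate(classes))
```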

raghavrv · 2016-10-13T04:41:15Z

Now it should pass all the tests... Gentle ping @TomDLT @MechCoder for reviews too...

TomDLT · 2016-10-21T14:38:26Z

Your variable names are quite confusing:
enc_labels, iter_labels, enc_lbl
cls_labels, iter_classes, cls_lbl

what about:
encoded_labels, iter_encoded_labels, encoded_label
classes_labels, iter_classes_labels, classes_label
or
enc_labels, iter_enc_labels, enc_label
cls_labels, iter_cls_labels, cls_label

or even remove iter_labels and iter_classes and use enc_labels and cls_labels directly

TomDLT · 2016-10-21T14:38:50Z

Otherwise LGTM

raghavrv · 2016-11-09T19:26:39Z

Merging once CIs pass...

raghavrv · 2016-11-09T20:41:16Z

Thanks @jnothman @amueller @TomDLT @MechCoder for the reviews!

amueller · 2016-11-09T20:44:14Z

thanks for the pr :)

…cikit-learn#5874)
* TST if LogisticRegressionCV handles string labels properly
* TST Add a test with class_weight dict
* ENH Encode y and class_weight dict
* Better variable names
* TYPO casses --> classes
* FIX Use dict comprehension; classes_labels --> classes
* Revert dict comprehension (for Python 2.6 compat)
* MNT reorder validation to improve clarity
* Add whatsnew entry

jnothman · 2016-11-20T22:23:31Z

We seem to be getting test failures from this on PRs, such as at https://ci.appveyor.com/project/sklearn-ci/scikit-learn/build/1.0.10296/job/qol1vrsk30ycsopl

raghavrv · 2016-11-20T22:33:35Z

Which PR is this? I saw a similar error, but it was caused by an incorrect update of master (and was resolved after fixing that).

jnothman · 2016-11-20T22:38:33Z

#7838. The commit history looks clean.

raghavrv · 2016-11-20T22:39:32Z

Okay. Thanks for the notice. I'll look into it tomorrow...

* tag '0.18.1': (144 commits) skip tree-test on 32bit do the warning test as we do it in other places. Replase assert_equal by assert_almost_equal in cosine test version bump 0.18.1 fix merge conflict mess in whatsnew add the python2.6 warning to 0.18.1 fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR. sync whatsnew with master [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553) FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680) Add whats new entry for scikit-learn#6282 (scikit-learn#7629) [MGR + 2] fix selectFdr bug (scikit-learn#7490) fixed whatsnew cherry-pick mess (somewhat) [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874) [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764) [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824) Fix docstring typo (scikit-learn#7844) n_features --> n_components [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818) [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799) DOC : fix docstring of AIC/BIC in GMM ...

raghavrv mentioned this pull request Nov 18, 2015

LogisticRegressionCV fails when labels are strings #5868

Closed

raghavrv force-pushed the string_multinomial branch from 7a54392 to 41777f3 Compare November 18, 2015 14:57

MechCoder reviewed Nov 18, 2015

raghavrv changed the title ~~FIX LabelEncoder to correctly handle string labels~~ [WIP] FIX LabelEncoder to correctly handle string labels Nov 21, 2015

MechCoder mentioned this pull request Dec 4, 2015

FIX LabelEncoder to correctly handle string labels (Issue #5868) #5951

Closed

amueller added the Bug label Sep 14, 2016

amueller added this to the 0.19 milestone Sep 14, 2016

jnothman requested changes Sep 17, 2016

raghavrv force-pushed the string_multinomial branch 2 times, most recently from 3e85d50 to c61373d Compare September 27, 2016 15:08

raghavrv force-pushed the string_multinomial branch 2 times, most recently from 1a7889b to 9097471 Compare September 27, 2016 15:18

raghavrv changed the title ~~[WIP] FIX LabelEncoder to correctly handle string labels~~ [MRG] FIX LogisticRegression(CV) to correctly handle string labels Sep 27, 2016

raghavrv changed the title ~~[MRG] FIX LogisticRegression(CV) to correctly handle string labels~~ [MRG] FIX LogisticRegressionCV to correctly handle string labels Sep 27, 2016

raghavrv force-pushed the string_multinomial branch 2 times, most recently from c3d207e to 0b7129a Compare September 27, 2016 17:23

TomDLT reviewed Sep 28, 2016

jnothman requested changes Oct 6, 2016

raghavrv force-pushed the string_multinomial branch from 0b7129a to 14699b7 Compare October 12, 2016 23:45

raghavrv force-pushed the string_multinomial branch from 14699b7 to 934493c Compare October 13, 2016 04:38

raghavrv added the Waiting for Reviewer label Oct 13, 2016

raghavrv added 10 commits November 9, 2016 20:10

ENH Encode y and class_weight dict

3306f76

Better variable names

9bd4c76

TYPO casses --> classes

599aa10

FIX Use dict comprehension; classes_labels --> classes

41e4e19

Don't use dict comprehension

f8e3e96

MNT reorder validation to improve clarity

026e97f

Use enumerate

427eb41

BUGFIX

7f3795e

zip was not needed

cc70ff9

Add whatsnew

4961d97

raghavrv force-pushed the string_multinomial branch from b0de8f3 to 4961d97 Compare November 9, 2016 19:11

raghavrv changed the title ~~[MRG + 1] FIX LogisticRegressionCV to correctly handle string labels~~ [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels Nov 9, 2016

raghavrv merged commit 31ee1a8 into scikit-learn:master Nov 9, 2016

raghavrv deleted the string_multinomial branch November 9, 2016 20:46
