Skip to content

[MRG + 2] FIX LogisticRegressionCV to correctly handle string labels #5874

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Nov 9, 2016

Conversation

raghavrv
Copy link
Member

@raghavrv raghavrv commented Nov 18, 2015

Fixes #5868

In master -

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

>>> X = np.arange(12)[:, np.newaxis]
>>> y = ['a',] * 4 + ['b',] * 4 + ['c',] * 4

>>> LogisticRegressionCV(solver='lbfgs', multi_class='multinomial').fit(X, y).predict(X)
ValueError: could not convert string to float: 'a'

In this branch -

>>> LogisticRegressionCV(solver='lbfgs', multi_class='multinomial').fit(X, y).predict(X)
array(['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'], dtype='<U1')

@agramfort @GaelVaroquaux

@@ -1525,9 +1525,6 @@ def fit(self, X, y, sample_weight=None):
cv = check_cv(self.cv, y, classifier=True)
folds = list(cv.split(X, y))

self._enc = LabelEncoder()
self._enc.fit(y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely self._enc must be used somewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was added here - 67585f6, but not used anywhere... I am unable to understand why though...

@raghavrv raghavrv changed the title FIX LabelEncoder to correctly handle string labels [WIP] FIX LabelEncoder to correctly handle string labels Nov 21, 2015
@MechCoder
Copy link
Member

Instead of this can we encode y once in the fit time of LogisticRegressionCV and pass the encoded y to every split.

If you did this you would need to reconstruct the class_weights if they are in a dict format, such that they use the encoded class labels. Other than that I don't foresee any problem.

Also you can remove the check_classification_targets to before the type of y is checked.

@MechCoder
Copy link
Member

Would also be a good time to clean up some parts of the code, especially the encoding logic, dtype checks etc

@amueller
Copy link
Member

hm this is still a bug, right?

@amueller amueller added the Bug label Sep 14, 2016
@amueller amueller added this to the 0.19 milestone Sep 14, 2016
@@ -898,7 +897,8 @@ def _log_reg_scoring_path(X, y, train, test, pos_class=None, Cs=10,
y_test[~mask] = -1.

# To deal with object dtypes, we need to convert into an array of floats.
y_test = check_array(y_test, dtype=np.float64, ensure_2d=False)
y_test = check_array(LabelEncoder().fit_transform(y_test),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely y_test needs to use the same encoder as was used for training.

@raghavrv raghavrv force-pushed the string_multinomial branch 2 times, most recently from 3e85d50 to c61373d Compare September 27, 2016 15:08
@raghavrv
Copy link
Member Author

@MechCoder @jnothman @amueller Could you take a look at this now?

@raghavrv raghavrv force-pushed the string_multinomial branch 2 times, most recently from 1a7889b to 9097471 Compare September 27, 2016 15:18
@raghavrv raghavrv changed the title [WIP] FIX LabelEncoder to correctly handle string labels [MRG] FIX LogisticRegression(CV) to correctly handle string labels Sep 27, 2016
@raghavrv raghavrv changed the title [MRG] FIX LogisticRegression(CV) to correctly handle string labels [MRG] FIX LogisticRegressionCV to correctly handle string labels Sep 27, 2016
@raghavrv raghavrv force-pushed the string_multinomial branch 2 times, most recently from c3d207e to 0b7129a Compare September 27, 2016 17:23
@TomDLT
Copy link
Member

TomDLT commented Sep 28, 2016

What about just adding label encoding after _check_solver_option in _log_reg_scoring_path ?

y = LabelEncoder().fit_transform(y)

y = np.array(y) - 1
# Test for string labels
lr = LogisticRegression(solver='lbfgs', multi_class='multinomial')
lr_cv = LogisticRegression(solver='lbfgs', multi_class='multinomial')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you probably wanted to use LogisticRegressionCV here

check_consistent_length(X, y)
if y.dtype.kind in ('S', 'U'):
# Encode for string labels
self._label_encoder = LabelEncoder().fit(y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't you do encoding in all cases and assume that LabelEncoder handles efficiency issues?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the comment I did so in the recent commit...

for new_key, old_key in zip(new_keys, old_keys):
self.class_weight[new_key] = self.class_weight[old_key]
else:
self.classes_enc_ = self.classes_
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work when classes are [1,2,3], not [0,1,2]?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm encoding for all cases now like you suggested. And there is a test for this usecase.

@raghavrv
Copy link
Member Author

raghavrv commented Oct 12, 2016

Arghh. This is tougher to solve than I thought. After encoding the string labels at the fit of LogisticRegressionCV, the logistic_regression_path helper again encodes the data and this causes a test failure I think... I'll have a deeper look at it soon... The scores and coef dict require the unencoded class label as keys... ah...

@raghavrv
Copy link
Member Author

Now it should pass all the tests... Gentle ping @TomDLT @MechCoder for reviews too...

@TomDLT
Copy link
Member

TomDLT commented Oct 21, 2016

Your variables names are quite confusing:
enc_labels, iter_labels, enc_lbl
cls_labels, iter_classes, cls_lbl

what about:
encoded_labels, iter_encoded_labels, encoded_label
classes_labels, iter_classes_labels, classes_label
or
enc_labels, iter_enc_labels, enc_label
cls_labels, iter_cls_labels, cls_label

or even remove iter_labels and iter_classes and use enc_labels and cls_labels directly

@TomDLT
Copy link
Member

TomDLT commented Oct 21, 2016

Otherwise LGTM

@raghavrv raghavrv changed the title [MRG + 1] FIX LogisticRegressionCV to correctly handle string labels [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels Nov 9, 2016
@raghavrv
Copy link
Member Author

raghavrv commented Nov 9, 2016

Merging once CIs pass...

@raghavrv raghavrv merged commit 31ee1a8 into scikit-learn:master Nov 9, 2016
@raghavrv
Copy link
Member Author

raghavrv commented Nov 9, 2016

Thanks @jnothman @amueller @TomDLT @MechCoder for the reviews!

@amueller
Copy link
Member

amueller commented Nov 9, 2016

thanks for the pr :)

@raghavrv raghavrv deleted the string_multinomial branch November 9, 2016 20:46
amueller pushed a commit to amueller/scikit-learn that referenced this pull request Nov 9, 2016
…cikit-learn#5874)

* TST if LogisticRegressionCV handles string labels properly
* TST Add a test with class_weight dict
* ENH Encode y and class_weight dict
* Better variable names
* TYPO casses --> classes
* FIX Use dict comprehension; classes_labels --> classes
* Revert dict comprehension (for Python 2.6 compat)
* MNT reorder validation to improve clarity
* Add whatsnew entry
@jnothman
Copy link
Member

We seem to be getting test failures from this on PRs, such as at https://ci.appveyor.com/project/sklearn-ci/scikit-learn/build/1.0.10296/job/qol1vrsk30ycsopl

@raghavrv
Copy link
Member Author

Which PR is this? I saw a similar error but because of an incorrect updation of master. (Which got resolved after fixing that...)

@jnothman
Copy link
Member

#7838 The commit history
looks clean.

On 21 November 2016 at 09:33, Raghav RV notifications@github.com wrote:

Which PR is this? I saw a similar error but because of an incorrect
updation of master. (Which got resolved after fixing that...)


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#5874 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz68dXJJvBChav9vNhG-Dby-LREw8Oks5rAMrAgaJpZM4Gkt-Q
.

@raghavrv
Copy link
Member Author

Okay. Thanks for the notice. I'll look into it tomorrow...

sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
…cikit-learn#5874)

* TST if LogisticRegressionCV handles string labels properly
* TST Add a test with class_weight dict
* ENH Encode y and class_weight dict
* Better variable names
* TYPO casses --> classes
* FIX Use dict comprehension; classes_labels --> classes
* Revert dict comprehension (for Python 2.6 compat)
* MNT reorder validation to improve clarity
* Add whatsnew entry
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…cikit-learn#5874)

* TST if LogisticRegressionCV handles string labels properly
* TST Add a test with class_weight dict
* ENH Encode y and class_weight dict
* Better variable names
* TYPO casses --> classes
* FIX Use dict comprehension; classes_labels --> classes
* Revert dict comprehension (for Python 2.6 compat)
* MNT reorder validation to improve clarity
* Add whatsnew entry
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017
* tag '0.18.1': (144 commits)
  skip tree-test on 32bit
  do the warning test as we do it in other places.
  Replase assert_equal by assert_almost_equal in cosine test
  version bump 0.18.1
  fix merge conflict mess in whatsnew
  add the python2.6 warning to 0.18.1
  fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR.
  sync whatsnew with master
  [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553)
  FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680)
  Add whats new entry for scikit-learn#6282 (scikit-learn#7629)
  [MGR + 2] fix selectFdr bug (scikit-learn#7490)
  fixed whatsnew cherry-pick mess (somewhat)
  [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874)
  [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764)
  [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824)
  Fix docstring typo (scikit-learn#7844) n_features --> n_components
  [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818)
  [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799)
  DOC : fix docstring of AIC/BIC in GMM
  ...
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017
* releases: (144 commits)
  skip tree-test on 32bit
  do the warning test as we do it in other places.
  Replase assert_equal by assert_almost_equal in cosine test
  version bump 0.18.1
  fix merge conflict mess in whatsnew
  add the python2.6 warning to 0.18.1
  fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR.
  sync whatsnew with master
  [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553)
  FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680)
  Add whats new entry for scikit-learn#6282 (scikit-learn#7629)
  [MGR + 2] fix selectFdr bug (scikit-learn#7490)
  fixed whatsnew cherry-pick mess (somewhat)
  [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874)
  [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764)
  [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824)
  Fix docstring typo (scikit-learn#7844) n_features --> n_components
  [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818)
  [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799)
  DOC : fix docstring of AIC/BIC in GMM
  ...

 Conflicts:  removed
	sklearn/externals/joblib/__init__.py
	sklearn/externals/joblib/_parallel_backends.py
	sklearn/externals/joblib/testing.py
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017
* dfsg: (144 commits)
  skip tree-test on 32bit
  do the warning test as we do it in other places.
  Replase assert_equal by assert_almost_equal in cosine test
  version bump 0.18.1
  fix merge conflict mess in whatsnew
  add the python2.6 warning to 0.18.1
  fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR.
  sync whatsnew with master
  [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553)
  FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680)
  Add whats new entry for scikit-learn#6282 (scikit-learn#7629)
  [MGR + 2] fix selectFdr bug (scikit-learn#7490)
  fixed whatsnew cherry-pick mess (somewhat)
  [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874)
  [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764)
  [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824)
  Fix docstring typo (scikit-learn#7844) n_features --> n_components
  [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818)
  [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799)
  DOC : fix docstring of AIC/BIC in GMM
  ...
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
…cikit-learn#5874)

* TST if LogisticRegressionCV handles string labels properly
* TST Add a test with class_weight dict
* ENH Encode y and class_weight dict
* Better variable names
* TYPO casses --> classes
* FIX Use dict comprehension; classes_labels --> classes
* Revert dict comprehension (for Python 2.6 compat)
* MNT reorder validation to improve clarity
* Add whatsnew entry
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…cikit-learn#5874)

* TST if LogisticRegressionCV handles string labels properly
* TST Add a test with class_weight dict
* ENH Encode y and class_weight dict
* Better variable names
* TYPO casses --> classes
* FIX Use dict comprehension; classes_labels --> classes
* Revert dict comprehension (for Python 2.6 compat)
* MNT reorder validation to improve clarity
* Add whatsnew entry
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
…cikit-learn#5874)

* TST if LogisticRegressionCV handles string labels properly
* TST Add a test with class_weight dict
* ENH Encode y and class_weight dict
* Better variable names
* TYPO casses --> classes
* FIX Use dict comprehension; classes_labels --> classes
* Revert dict comprehension (for Python 2.6 compat)
* MNT reorder validation to improve clarity
* Add whatsnew entry
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants