-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG + 2] FIX LogisticRegressionCV to correctly handle string labels #5874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
7a54392
to
41777f3
Compare
@@ -1525,9 +1525,6 @@ def fit(self, X, y, sample_weight=None): | |||
cv = check_cv(self.cv, y, classifier=True) | |||
folds = list(cv.split(X, y)) | |||
|
|||
self._enc = LabelEncoder() | |||
self._enc.fit(y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surely self._enc must be used somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was added here - 67585f6, but not used anywhere... I am unable to understand why though...
Instead of this can we encode If you did this you would need to reconstruct the Also you can remove the |
Would also be a good time to clean up some parts of the code, especially the encoding logic, dtype checks etc |
hm this is still a bug, right? |
@@ -898,7 +897,8 @@ def _log_reg_scoring_path(X, y, train, test, pos_class=None, Cs=10, | |||
y_test[~mask] = -1. | |||
|
|||
# To deal with object dtypes, we need to convert into an array of floats. | |||
y_test = check_array(y_test, dtype=np.float64, ensure_2d=False) | |||
y_test = check_array(LabelEncoder().fit_transform(y_test), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surely y_test
needs to use the same encoder as was used for training.
3e85d50
to
c61373d
Compare
@MechCoder @jnothman @amueller Could you take a look at this now? |
1a7889b
to
9097471
Compare
c3d207e
to
0b7129a
Compare
y = LabelEncoder().fit_transform(y) |
y = np.array(y) - 1 | ||
# Test for string labels | ||
lr = LogisticRegression(solver='lbfgs', multi_class='multinomial') | ||
lr_cv = LogisticRegression(solver='lbfgs', multi_class='multinomial') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you probably wanted to use LogisticRegressionCV
here
check_consistent_length(X, y) | ||
if y.dtype.kind in ('S', 'U'): | ||
# Encode for string labels | ||
self._label_encoder = LabelEncoder().fit(y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't you do encoding in all cases and assume that LabelEncoder
handles efficiency issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the comment I did so in the recent commit...
for new_key, old_key in zip(new_keys, old_keys): | ||
self.class_weight[new_key] = self.class_weight[old_key] | ||
else: | ||
self.classes_enc_ = self.classes_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work when classes are [1,2,3], not [0,1,2]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm encoding for all cases now like you suggested. And there is a test for this usecase.
0b7129a
to
14699b7
Compare
Arghh. This is tougher to solve than I thought. After encoding the string labels at the fit of |
14699b7
to
934493c
Compare
Now it should pass all the tests... Gentle ping @TomDLT @MechCoder for reviews too... |
Your variables names are quite confusing: what about: or even remove |
Otherwise LGTM |
b0de8f3
to
4961d97
Compare
Merging once CIs pass... |
Thanks @jnothman @amueller @TomDLT @MechCoder for the reviews! |
thanks for the pr :) |
…cikit-learn#5874) * TST if LogisticRegressionCV handles string labels properly * TST Add a test with class_weight dict * ENH Encode y and class_weight dict * Better variable names * TYPO casses --> classes * FIX Use dict comprehension; classes_labels --> classes * Revert dict comprehension (for Python 2.6 compat) * MNT reorder validation to improve clarity * Add whatsnew entry
We seem to be getting test failures from this on PRs, such as at https://ci.appveyor.com/project/sklearn-ci/scikit-learn/build/1.0.10296/job/qol1vrsk30ycsopl |
Which PR is this? I saw a similar error but because of an incorrect updation of master. (Which got resolved after fixing that...) |
#7838 The commit history On 21 November 2016 at 09:33, Raghav RV notifications@github.com wrote:
|
Okay. Thanks for the notice. I'll look into it tomorrow... |
…cikit-learn#5874) * TST if LogisticRegressionCV handles string labels properly * TST Add a test with class_weight dict * ENH Encode y and class_weight dict * Better variable names * TYPO casses --> classes * FIX Use dict comprehension; classes_labels --> classes * Revert dict comprehension (for Python 2.6 compat) * MNT reorder validation to improve clarity * Add whatsnew entry
…cikit-learn#5874) * TST if LogisticRegressionCV handles string labels properly * TST Add a test with class_weight dict * ENH Encode y and class_weight dict * Better variable names * TYPO casses --> classes * FIX Use dict comprehension; classes_labels --> classes * Revert dict comprehension (for Python 2.6 compat) * MNT reorder validation to improve clarity * Add whatsnew entry
* tag '0.18.1': (144 commits) skip tree-test on 32bit do the warning test as we do it in other places. Replase assert_equal by assert_almost_equal in cosine test version bump 0.18.1 fix merge conflict mess in whatsnew add the python2.6 warning to 0.18.1 fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR. sync whatsnew with master [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553) FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680) Add whats new entry for scikit-learn#6282 (scikit-learn#7629) [MGR + 2] fix selectFdr bug (scikit-learn#7490) fixed whatsnew cherry-pick mess (somewhat) [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874) [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764) [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824) Fix docstring typo (scikit-learn#7844) n_features --> n_components [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818) [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799) DOC : fix docstring of AIC/BIC in GMM ...
* releases: (144 commits) skip tree-test on 32bit do the warning test as we do it in other places. Replase assert_equal by assert_almost_equal in cosine test version bump 0.18.1 fix merge conflict mess in whatsnew add the python2.6 warning to 0.18.1 fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR. sync whatsnew with master [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553) FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680) Add whats new entry for scikit-learn#6282 (scikit-learn#7629) [MGR + 2] fix selectFdr bug (scikit-learn#7490) fixed whatsnew cherry-pick mess (somewhat) [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874) [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764) [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824) Fix docstring typo (scikit-learn#7844) n_features --> n_components [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818) [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799) DOC : fix docstring of AIC/BIC in GMM ... Conflicts: removed sklearn/externals/joblib/__init__.py sklearn/externals/joblib/_parallel_backends.py sklearn/externals/joblib/testing.py
* dfsg: (144 commits) skip tree-test on 32bit do the warning test as we do it in other places. Replase assert_equal by assert_almost_equal in cosine test version bump 0.18.1 fix merge conflict mess in whatsnew add the python2.6 warning to 0.18.1 fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR. sync whatsnew with master [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553) FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680) Add whats new entry for scikit-learn#6282 (scikit-learn#7629) [MGR + 2] fix selectFdr bug (scikit-learn#7490) fixed whatsnew cherry-pick mess (somewhat) [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874) [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764) [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824) Fix docstring typo (scikit-learn#7844) n_features --> n_components [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818) [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799) DOC : fix docstring of AIC/BIC in GMM ...
…cikit-learn#5874) * TST if LogisticRegressionCV handles string labels properly * TST Add a test with class_weight dict * ENH Encode y and class_weight dict * Better variable names * TYPO casses --> classes * FIX Use dict comprehension; classes_labels --> classes * Revert dict comprehension (for Python 2.6 compat) * MNT reorder validation to improve clarity * Add whatsnew entry
…cikit-learn#5874) * TST if LogisticRegressionCV handles string labels properly * TST Add a test with class_weight dict * ENH Encode y and class_weight dict * Better variable names * TYPO casses --> classes * FIX Use dict comprehension; classes_labels --> classes * Revert dict comprehension (for Python 2.6 compat) * MNT reorder validation to improve clarity * Add whatsnew entry
…cikit-learn#5874) * TST if LogisticRegressionCV handles string labels properly * TST Add a test with class_weight dict * ENH Encode y and class_weight dict * Better variable names * TYPO casses --> classes * FIX Use dict comprehension; classes_labels --> classes * Revert dict comprehension (for Python 2.6 compat) * MNT reorder validation to improve clarity * Add whatsnew entry
Fixes #5868
In master -
In this branch -
@agramfort @GaelVaroquaux