
[MRG+1] Fix cross_val_predict behavior for binary classification in decision_function (Fixes #9589) #9593


Merged

Conversation

@reiinakano (Contributor) commented Aug 21, 2017

Reference Issue

Fixes #9589

What does this implement/fix? Explain your changes.

I figured that the purpose of this code (original)

if method in ['decision_function', 'predict_proba', 'predict_log_proba']:
    n_classes = len(set(y))
    predictions_ = np.zeros((X_test.shape[0], n_classes))
    if method == 'decision_function' and len(estimator.classes_) == 2:
        predictions_[:, estimator.classes_[-1]] = predictions
    else:
        predictions_[:, estimator.classes_] = predictions
    predictions = predictions_

is to handle cases where the cross-validation done in cross_val_predict doesn't properly stratify the classes i.e. one fold in the CV may not have all classes present.

However, it misses an edge case for decision_function when len(estimator.classes_) == 2 (the train split has only 2 classes) and n_classes == 2 (the full data has only 2 classes). It wrongly assumes that the output will be of shape (n_samples, n_classes), when in fact this is only true for decision_function if n_classes > 2. For the correct behavior, we must first check whether the total number of classes is greater than 2, and if it is not, we must stick to an output shape of (n_samples,).
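A minimal sketch of the fixed handling, written as a standalone helper (the name _pad_predictions and its signature are hypothetical, for illustration only; the merged code works inline on predictions, as in the snippet above):

import numpy as np

def _pad_predictions(predictions, method, classes_in_fold, n_classes, n_test):
    # Hypothetical helper sketching the fix: pad per-fold outputs to
    # n_classes columns, except binary decision_function, which stays 1-D.
    # Assumes integer class labels usable as column indices, as in the
    # original code above.
    if method == 'decision_function' and n_classes == 2:
        # Binary decision_function returns shape (n_samples,); keep it.
        return predictions
    padded = np.zeros((n_test, n_classes))
    padded[:, classes_in_fold] = predictions
    return padded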

Any other comments?

While this PR fixes the case raised in #9589, it still doesn't take into account the fact that for sklearn.svm.SVC, decision_function does not always follow the shape (n_samples, n_classes) or (n_samples,). If decision_function_shape is set to 'ovo' rather than 'ovr', the shape is (n_samples, n_classes * (n_classes - 1) / 2), and if different train splits have different numbers of classes (not stratified), this inconsistency can't be fixed by the approach currently taken by cross_val_predict.
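As a quick illustration of the OvO output shape (load_digits has 10 classes; assumes scikit-learn 0.19+):

from sklearn.datasets import load_digits
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
clf = SVC(kernel='linear', decision_function_shape='ovo').fit(X, y)
# 10 classes -> 10 * 9 / 2 = 45 pairwise (one-vs-one) columns:
print(clf.decision_function(X).shape)  # (1797, 45)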

I would also like to point out my personal opinion that cross_val_predict should probably just throw an error if the classes are not stratified properly (by this I mean some classes are entirely absent from some splits) instead of handling it automatically as it tries to do now. Metrics like log loss and ROC might be affected if, for some splits, entire columns are zeroed out. Idk, it just feels inconsistent. Of course, you guys have the final say on this.

What I mean by inconsistent:

from sklearn.model_selection import cross_val_predict, KFold
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
cross_val_predict(LogisticRegression(), X, y, cv=KFold().split(X),
                  method='predict_proba')

returns (array cut down in size)

array([[  0.00000000e+00,   9.99987841e-01,   1.21585683e-05],
       [  0.00000000e+00,   9.99963190e-01,   3.68104260e-05],
       [  0.00000000e+00,   9.99970224e-01,   2.97757119e-05],
       [  0.00000000e+00,   9.99978481e-01,   2.15185386e-05],
       [  0.00000000e+00,   9.99993136e-01,   6.86428459e-06],
       [  1.20030529e-01,   0.00000000e+00,   8.79969471e-01],
       [  1.13183668e-01,   0.00000000e+00,   8.86816332e-01],
       [  6.75506225e-02,   0.00000000e+00,   9.32449377e-01],
       [  1.07273496e-01,   0.00000000e+00,   8.92726504e-01],
       [  1.26854747e-01,   0.00000000e+00,   8.73145253e-01],
       [  1.16881976e-01,   0.00000000e+00,   8.83118024e-01],
       [  2.45077442e-04,   9.99754923e-01,   0.00000000e+00],
       [  2.70475201e-04,   9.99729525e-01,   0.00000000e+00],
       [  6.59523734e-04,   9.99340476e-01,   0.00000000e+00],
       [  4.07574999e-04,   9.99592425e-01,   0.00000000e+00],
       [  2.42451670e-03,   9.97575483e-01,   0.00000000e+00],
       [  2.03959503e-03,   9.97960405e-01,   0.00000000e+00]])

Notice how there are "zeroed out" columns.

And even worse:

from sklearn.model_selection import cross_val_predict, KFold
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
cross_val_predict(LogisticRegression(), X, y, cv=KFold().split(X),
                  method='decision_function')

returns (array cut down in size)

array([[  0.        ,   0.        , -11.31746426],
       [  0.        ,   0.        , -10.20969263],
       [  0.        ,   0.        , -10.42178776],
       [  0.        ,   0.        ,  -9.60397656],
       [  0.        ,   0.        , -11.30005126],
       [  0.        ,   0.        , -11.1906238 ],
       [  0.        ,   0.        , -10.05510072],
       [  0.        ,   0.        , -10.74657422],
       [  0.        ,   0.        ,  -9.20295991],
       [  0.        ,   8.3136912 ,   0.        ],
       [  0.        ,   6.77281268,   0.        ],
       [  0.        ,   7.79874366,   0.        ],
       [  0.        ,   7.29615134,   0.        ],
       [  0.        ,   7.9199709 ,   0.        ],
       [  0.        ,   9.163118  ,   0.        ],
       [  0.        ,   5.8858718 ,   0.        ],
       [  0.        ,   6.37425957,   0.        ],
       [  0.        ,   6.66261924,   0.        ],
       [  0.        ,   6.19296233,   0.        ]])

which is hardly of use to anyone.

@jnothman (Member)

Thanks!

Running ROC over the output of cross_val_predict is a really bad idea in any case...

@jnothman (Member)

Yes, perhaps 0 isn't the best pick, particularly for predict_proba. NaN?

@jnothman (Member)

I think better not to error in the context of using this in ClassifierChain or a stacking meta-estimator.

@reiinakano (Contributor, Author) commented Aug 21, 2017

> I think better not to error in the context of using this in ClassifierChain or a stacking meta-estimator.

Hmm, you're right. One note, though: I can't speak for ClassifierChain, but for stacking, these zeroed-out columns would mess with how the secondary learner learns, and I wouldn't trust the results. Perhaps display a warning?

> Yes, perhaps 0 isn't the best pick, particularly for predict_proba. NaN?

NaN makes it clearer that these columns weren't considered at all for this particular split, but wouldn't that break things? For instance, I frequently use np.argmax(cross_val_predict(..., method='predict_proba')) to recover the true labels for calculating accuracy, etc. Of course, I always make sure my folds are stratified, so I'm probably not going to face it, but still...

Also, wouldn't that break ClassifierChains and stacking meta-estimators anyway?
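For context, the argmax pattern mentioned above in a minimal, runnable form (a sketch; with an integer cv, scikit-learn uses stratified folds for classifiers, so no columns are zeroed out):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)
proba = cross_val_predict(LogisticRegression(), X, y, cv=3,
                          method='predict_proba')
# Iris labels are 0..2, so column i corresponds to class label i:
y_pred = np.argmax(proba, axis=1)
print((y_pred == y).mean())  # cross-validated accuracy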

@jnothman (Member) commented Aug 21, 2017 via email

@jnothman (Member) left a comment

Could you please add a non-regression test?

@reiinakano (Contributor, Author)

I've added the necessary unit tests.

Although at this point, I'd like to point out that there's another regression, for the case of SVC with decision_function_shape='ovo'.

Code to test:

from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)
cross_val_predict(SVC(kernel='linear', decision_function_shape='ovo'),
                  X, y, method='decision_function').shape

Expected Results
In previous versions, this is (1797, 45)

Actual Results

ValueError: shape mismatch: value array of shape (602,45) could not be broadcast to indexing result of shape (10,602)

This error is due to the shape of an SVC decision function with decision_function_shape set to 'ovo', which is (n_samples, n_classes * (n_classes - 1) / 2).

Should I include the fix in this PR?

@jnothman (Member) commented Aug 22, 2017 via email

@jnothman (Member) left a comment

LGTM

@jnothman jnothman changed the title [WIP] Fix cross_val_predict behavior for binary classification in decision_function (Fixes #9589) [MRG+1] Fix cross_val_predict behavior for binary classification in decision_function (Fixes #9589) Aug 22, 2017
@jnothman jnothman added this to the 0.19.1 milestone Aug 22, 2017
@reiinakano reiinakano changed the title [MRG+1] Fix cross_val_predict behavior for binary classification in decision_function (Fixes #9589) [WIP] Fix cross_val_predict behavior for binary classification in decision_function (Fixes #9589) Aug 22, 2017
@reiinakano (Contributor, Author)

Hi @jnothman, I changed this back to WIP since there was a conflict with a recent merge. I also figured out a better fix that also fixes the SVC "ovo" regression.

@jnothman (Member)

Okay. Let me know when it's MRG, i.e. ready for review.

@reiinakano reiinakano changed the title [WIP] Fix cross_val_predict behavior for binary classification in decision_function (Fixes #9589) [MRG] Fix cross_val_predict behavior for binary classification in decision_function (Fixes #9589) Aug 22, 2017
@reiinakano (Contributor, Author)

@jnothman Ready for review

@jnothman (Member) left a comment

Did you mean to add a test for the OvO case?

else:
    predictions_[:, estimator.classes_] = predictions
predictions = predictions_
if not n_classes == len(estimator.classes_):
(Member)

not ... == -> ... !=

@reiinakano (Contributor, Author)

Yup, sorry. Added unit test and addressed comments.

@reiinakano (Contributor, Author) commented Oct 11, 2017

I think I've mostly sorted this out. For the non-special case of decision_function (case iv: n_classes_train > 2 and n_classes > 2, but n_classes_train != n_classes), I handle it the same as predict_log_proba. For the other special cases, there is an error message.

Apologies for the confusing code previously, I should've done it this way in the first place, but I didn't expect decision_function to have so many edge cases.
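A rough sketch of the resulting case analysis as a standalone helper (the name _handle_decision_function and its exact messages are hypothetical; see the diff fragments below for the real code):

import numpy as np

def _handle_decision_function(predictions, classes_in_fold, n_classes):
    # Hypothetical sketch of the special-case handling described above,
    # not the actual scikit-learn implementation. Assumes integer class
    # labels usable as column indices.
    if len(classes_in_fold) <= 2:
        # A fold that saw <= 2 classes yields a 1-D binary output, which
        # cannot be mapped onto n_classes > 2 per-class columns.
        raise ValueError('Only {} classes in training fold; use a '
                         'cross-validation technique resulting in '
                         'properly stratified folds.'
                         .format(len(classes_in_fold)))
    if predictions.ndim == 2 and predictions.shape[1] != len(classes_in_fold):
        # e.g. SVC(decision_function_shape='ovo'): n*(n-1)/2 pairwise
        # columns cannot be reconciled with per-class columns.
        raise ValueError('Output shape {} does not match number of '
                         'classes ({}) in fold.'
                         .format(predictions.shape, len(classes_in_fold)))
    # Case iv: pad missing classes, as for predict_log_proba, using the
    # dtype minimum as the fill value.
    padded = np.full((predictions.shape[0], n_classes),
                     np.finfo(predictions.dtype).min)
    padded[:, classes_in_fold] = predictions
    return padded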

@lesteve (Member) left a comment

I think we still need a warning in any case when n_classes_training != n_classes.

I would still like feedback from some core devs about #9593 (comment), especially about the default value for missing classes for predict_log_proba and decision_function (personally I would favour -inf but maybe there are consumers of cross_val_predict which would raise an error with infinite values)

'use a cross-validation technique resulting '
'in properly stratified folds\.',
cross_val_predict, est, X, y,
cv=KFold(), method='decision_function')
(Member)

Use explicit number of splits here n_splits=3.

predictions_[:, estimator.classes_] = predictions
predictions = predictions_
if n_classes != len(estimator.classes_):
    if predictions.ndim == 2 and \
(Member)

Nitpick: I would use parentheses rather than backslash.

if n_classes != len(estimator.classes_):
    if predictions.ndim == 2 and \
            predictions.shape[1] != len(estimator.classes_):
        # This handles the case when the shape of predictions
(Member)

I don't think the comment matches the error message. IIUC the error can happen for one of these reasons:

  • you use SVC with decision_function_shape='ovo', whose decision_function output does not follow the same convention as all the other classifiers. This corresponds to the comment. In this case the recommendation in the error message is not helpful at all.
  • you don't have all the classes in your training fold. This corresponds to the error message.

@reiinakano (Contributor, Author)

Actually, for the first case (ovo), the recommendation in the error message is helpful. We can handle the ovo case but only if the folds are stratified.

    predictions_[:, estimator.classes_] = predictions

else:  # Special handling logic for decision_function
    err_mess = 'Output shape {} of {} does not match ' \
(Member)

I'd rather have parentheses than backslashes

error_message = ('asdfsa '
                 'asdfds '
                 'asdfsa ')


predictions_ = np.full((_num_samples(X_test), n_classes),
                       np.finfo(predictions.dtype).min)
predictions_[:, estimator.classes_] = predictions
(Member)

What changes for each method is the default value, you could simplify the code and use better variable names, e.g. something like this in the non-error case:

float_min = np.finfo(predictions.dtype).min
default_values = {'decision_function': float_min,
                  'predict_log_proba': float_min,
                  'predict_proba': 0}
predictions_for_all_classes = np.full((_num_samples(predictions), n_classes),
                                      default_values[method])
predictions_for_all_classes[:, estimator.classes_] = predictions

' in different folds. To fix this, use ' \
'a cross-validation technique resulting' \
' in properly stratified folds'
if predictions.ndim == 2 and \
(Member)

Parentheses rather than backslashes

@@ -1267,6 +1347,18 @@ def test_cross_val_predict_class_subset():
est, method)
assert_array_almost_equal(expected_predictions, predictions)

# Special test for decision_function. This makes sure not to trigger
# any of the numerous edge cases in decision_function
X = np.arange(200).reshape(100, 2)
(Member)

I think you should keep decision_function in the loop above. Just find a dataset that works for decision_function, e.g. this one, or one with many classes, to avoid the edge cases that raise an error.

@lesteve (Member) left a comment

> I think we still need a warning in any case when n_classes_training != n_classes.

You have not tackled this comment AFAICT. It would be good to add a test for this with assert_warns as well.

                                 len(estimator.classes_)))
if len(estimator.classes_) <= 2:
    # In this special case, `predictions` contains a 1D array.
    raise ValueError(err_mess.format(predictions.shape, method,
(Member)

I don't think this error message makes sense here, since predictions has the right shape compared to len(estimator.classes_). I am not sure what the error message should say, to be honest; maybe something like "only two classes in the training fold, this is not supported for method='decision_function'", plus the error message above about using properly stratified folds.

@@ -1259,7 +1339,8 @@ def test_cross_val_predict_class_subset():
assert_array_almost_equal(expected_predictions, predictions)

# Testing unordered labels
y = [1, 1, -4, 6]
y = np.array([x//20 for x in range(-100, 100, 2)])
y = shuffle(y, random_state=0)
@lesteve (Member) commented Oct 13, 2017

or just y = np.repeat([1, 1, -4, 6], 25). Always a good idea to avoid unnecessary complications ...

@@ -776,6 +777,87 @@ def split(self, X, y=None, groups=None):

assert_raises(ValueError, cross_val_predict, est, X, y, cv=BadCV())

X, y = load_iris(return_X_y=True)
assert_warns(RuntimeWarning, cross_val_predict, LogisticRegression(),
(Member)

Can you use assert_warns_message to test the warning message?

(Member)

Apart from this, do you have any concerns with the PR?

(Member)

I don't think so, I am fine with the general strategy as I mentioned above. There are some minor things I will fix shortly.

If you haven't already, maybe you can have a quick look at the warnings/error message and make sure they are fine.

@lesteve (Member) commented Oct 17, 2017

Another thing: are you fine with float_min as the default value for missing classes for predict_log_proba and decision_function? Another option is -inf, but maybe some consumers downstream will raise an error with infinite values.

(Member)

predict_log_proba should probably use -inf... decision_function is a messy case.

(Member)

I don't think float_min is perfect, but I'm documenting it at least.
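For reference, the two candidate fill values under discussion, illustrated with plain numpy:

import numpy as np

print(np.finfo(np.float64).min)  # -1.7976931348623157e+308 (finite)
print(np.log(0.0))               # -inf (emits a RuntimeWarning)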

@codecov (bot) commented Oct 17, 2017

Codecov Report

Merging #9593 into master will decrease coverage by 0.18%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #9593      +/-   ##
==========================================
- Coverage   96.16%   95.97%   -0.19%     
==========================================
  Files         336      336              
  Lines       62442    62566     +124     
==========================================
+ Hits        60046    60049       +3     
- Misses       2396     2517     +121
Impacted Files Coverage Δ
sklearn/model_selection/_validation.py 96.87% <100%> (+0.08%) ⬆️
sklearn/model_selection/tests/test_validation.py 98.43% <100%> (-0.34%) ⬇️
sklearn/utils/fixes.py 65.76% <0%> (-15.32%) ⬇️
sklearn/utils/tests/test_deprecation.py 85.29% <0%> (-14.71%) ⬇️
sklearn/utils/deprecation.py 85.41% <0%> (-10.42%) ⬇️
sklearn/utils/tests/test_utils.py 91.19% <0%> (-8.81%) ⬇️
sklearn/_build_utils/__init__.py 55.1% <0%> (-6.13%) ⬇️
sklearn/utils/tests/test_estimator_checks.py 90.72% <0%> (-5.97%) ⬇️
sklearn/datasets/mldata.py 91.25% <0%> (-5%) ⬇️
sklearn/utils/extmath.py 93.77% <0%> (-2.4%) ⬇️
... and 31 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 534f68b...bc405c8.

# it with. This case is found when sklearn.svm.SVC is
# set to `decision_function_shape='ovo'`.
raise ValueError('Output shape {} of {} does not match '
                 'number of classes ({}) in fold. Cannot '
@lesteve (Member) commented Oct 17, 2017

I think this error message is wrong:

  • I would remove "Cannot reconcile different number of classes in different folds", because the problem is specific to one fold
  • the recommendation of using properly stratified folds does not apply. If we are sure that this problem only exists for SVC with decision_function_shape='ovo', we should say "use decision_function_shape='ovr'", maybe?

(Member)

Maybe. It could happen in other situations? Could also be a custom estimator?

(Member)

Or just say that wacky decision functions are not supported :)

@jnothman (Member)

I would like to see this merged and 0.19.1 released by the end of the week. Is that reasonable, @amueller?

default score needs to be assigned to all instances for that class if
``method`` produces columns per class, as in {'decision_function',
'predict_proba', 'predict_log_proba'}. For ``predict_proba`` this value is
0. In order to ensure finite output, we approximate negative infinity by
(Member)

I think this behaviour and note is sufficient for a first release.

@amueller amueller merged commit ebc8730 into scikit-learn:master Oct 19, 2017
@reiinakano reiinakano deleted the fix-cross-val-predict-decision-function branch October 19, 2017 23:05
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Oct 20, 2017
…ecision_function (Fixes scikit-learn#9589) (scikit-learn#9593)

* fix cross_val_predict for binary classification in decision_function

* Add unit tests

* Add unit tests

* Add unit tests

* better fix

* fix conflict

* fix broken

* only calculate n_classes if one of 'decision_function', 'predict_proba', 'predict_log_proba'

* add test for SVC ovo in cross_val_predict

* flake8 fix

* fix case of ovo and imbalanced folds for binary classification

* change assert_raises to assert_raise_message for ovo case

* fix flake8 linetoo long

* add comments and clearer tests

* improve comments and error message for OvO

* fix .format error with L

* use assert_raises_regex for better error message

* raise error in decision_function special cases. change predict_log_proba missing classes to minimum numpy value

* fix broken tests due to special cases of decision_function

* add modified test for decision_function behavior that does not trigger edge cases

* fix typos

* fix typos

* escape regex .

* escape regex .

* address comments. one unaddressed comment

* simplify code

* flake

* wrong classes range

* address comments. adjust error message

* add warning

* change warning to runtimewarning

* add test for the warning

* Use assert_warns_message rather than assert_warns

Other minor fixes

* Note on class-absent replacement values

* Improve error message
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017 (same commit series as above)
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017 (same commit series as above)
Development

Successfully merging this pull request may close these issues.

Regression in cross_val_predict of decision_function in the binary case
4 participants