Check predict_proba shape in ThresholdScorer #12486


Merged: 19 commits into scikit-learn:master on Nov 14, 2018

Conversation

@reshamas (Member) commented Oct 30, 2018

Reference Issues/PRs

Continues and resolves #12221, fixes #7598

What does this implement/fix? Explain your changes.

Any other comments?

This time I ran flake8 and fixed the formatting.

cc: @AMDonati

@amueller Can you point me to the GMM test? I did not see one in here:
scikit-learn/sklearn/tests

Looks good in general, but you need to add a regression test (you could use the GMM one or just a classification one with a single class maybe)
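A minimal sketch of the single-class situation the suggested regression test would cover, assuming any classifier fit on data containing only one class; the estimator choice here is illustrative, not the merged test:

```python
# Sketch: fitting a classifier on data with a single class makes
# predict_proba return one column of probabilities, all equal to 1.
# Illustrative only; not the regression test actually merged in this PR.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(10, 2)
y = np.zeros(10)  # only one class present

proba = DecisionTreeClassifier().fit(X, y).predict_proba(X)
print(proba.shape)  # (10, 1)
```

This one-column output is what later trips the `y_pred[:, 1]` indexing discussed below.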

@sklearn-lgtm

This pull request introduces 1 alert when merging 73489b3 into f6b0c67 - view on LGTM.com

new alerts:

  • 1 for Unreachable code

Comment posted by LGTM.com

@amueller (Member)

gmm tests are in sklearn/mixture/tests

@reshamas (Member Author)

@AMDonati all tests passed!

@jnothman (Member) left a comment

I'm confused. The test change doesn't match the code change, and neither matches the title of the PR.

@amueller (Member) commented Nov 5, 2018

I'm also confused by the test?

@amueller amueller changed the title Check if classifier Check predict_proba shape in roc_auc_score Nov 5, 2018
@amueller amueller changed the title Check predict_proba shape in roc_auc_score Check predict_proba shape in ThresholdScorer Nov 5, 2018
@amueller (Member) commented Nov 5, 2018

Renamed to something more concrete (hopefully).

@amueller (Member) commented Nov 5, 2018

is this also an issue with _ProbaScorer?

@reshamas (Member Author) commented Nov 5, 2018

is this also an issue with _ProbaScorer?

We created our own test, which failed. That's when I went over and used the mixture test which passed.
cc: @AMDonati

@amueller (Member) commented Nov 9, 2018

OK, I added the appropriate test.

if y_pred.shape[1] == 2:
    y_pred = y_pred[:, 1]
else:
    raise ValueError('Must use classifier with two classes')
Member

should we say "got predict_proba of shape"?

Member Author

That seems like a good idea, because that is what is actually happening.

But, will the user understand that? Is there some place in the documentation where we can write:

When obtaining predictions from a classifier, if there is only one class, the ValueError raised will say "got predict_proba of shape". That means predict_proba returned a single column, with every probability equal to 1.
Q: How to fix this?
A: Must use a classifier with 2 classes.

Or will they be able to see it from the test below:
with pytest.raises(ValueError, match="use classifier with two classes")
I would feel better if the user had explicit feedback on how to solve the error.

cc: @AMDonati
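A hedged sketch of what such a test could look like end to end; the estimator choice and the exact match string are assumptions for illustration, not the merged test verbatim:

```python
# Sketch of a regression test along the lines quoted above. The
# DecisionTreeClassifier and the match string are illustrative choices;
# the merged test may differ.
import numpy as np
import pytest
from sklearn.metrics import get_scorer
from sklearn.tree import DecisionTreeClassifier


def test_roc_auc_single_class_raises():
    X = np.random.RandomState(0).rand(10, 2)
    y = np.zeros(10)  # single class: predict_proba has shape (10, 1)
    clf = DecisionTreeClassifier().fit(X, y)
    with pytest.raises(ValueError, match="classifier with two classes"):
        get_scorer("roc_auc")(clf, X, y)
```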

Member

Are you saying the current error message is not informative enough? The test is just to make sure that the error that is raised is actually informative.
I suggested adding the predict_proba shape to the error message, not replacing the message that's there now.

Member Author

That works.
I tried making the update myself but ran into multiple problems: errors list

Member Author

ok, making progress. Please let me know next steps.

cc: @AMDonati

if y_pred.shape[1] == 2:
    y_pred = y_pred[:, 1]
else:
    raise ValueError('got predict_proba of shape;')
Member

you're not actually providing the shape.

@reshamas (Member Author) commented Nov 12, 2018

OK, here is what is happening:

  1. Previously, when there was only one classifier but two were expected, y_pred could not be indexed. This was the error: IndexError: index 1 is out of bounds for axis 1 with size 1
  2. So we added a check: if predict_proba returned only a single column of probabilities (all 1's), we raise a ValueError instead.
  3. What should we call it? I don't think returning the shape (even if we did that) is informative. How about:
  • "vector of probabilities returned (predict_proba) is all 1's because only one classifier is in the model."

Does that work?
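The IndexError in step 1 can be reproduced directly on such a one-column array; this sketch uses a plain NumPy array to stand in for the predict_proba output:

```python
# Reproducing the original failure: a single-class fit yields a
# predict_proba output with one column, so indexing column 1 fails.
import numpy as np

y_pred = np.ones((5, 1))  # stand-in for predict_proba of a one-class fit
try:
    positive_proba = y_pred[:, 1]
except IndexError as exc:
    print(exc)  # index 1 is out of bounds for axis 1 with size 1
```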

if y_pred.shape[1] == 2:
    y_pred = y_pred[:, 1]
else:
    raise ValueError('got predict_proba of shape;')
Member

aren't you meant to output the shape here?

Member Author

Why is it a good idea to output the shape here? Isn't this more informative:
"vector of probabilities returned (predict_proba) is all 1's because only one classifier is in the model."

Member

do you mean "only one class" rather than "only one classifier"?

Either way, the proposed error message here says "got predict_proba of shape; ..." which simply doesn't make sense.

@reshamas (Member Author) commented Nov 13, 2018

How about I change the error message to:
"vector of probabilities returned (predict_proba) is all 1's because there is only one class in the input to the model."

Member

It can also fail with GMMs where the shape would be wrong but the entries wouldn't be all ones.

Member

Yes, but you should use the actual y_pred.shape[1], not hard-code 1, as there might be code paths that have >2 (unless we can make sure that's not happening).
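To illustrate the >2-column case mentioned here (a sketch; LogisticRegression is just an example estimator): a three-class fit gives a three-column predict_proba, so the message should report the real shape rather than assume one column.

```python
# Sketch: with three classes, predict_proba has three columns, so an
# error message should report y_pred.shape rather than hard-coding it.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(30, 2)
y = np.arange(30) % 3  # three classes: 0, 1, 2
proba = LogisticRegression().fit(X, y).predict_proba(X)
print(proba.shape)  # (30, 3)
```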

Member Author

you mean like this?

            if y_pred.shape[1] == 2:
                y_pred = y_pred[:, 1]
            elif y_pred.shape[1] == 1:
                raise ValueError('got predict_proba of shape;')

Member Author

Or, I think you mean this:

predict_proba has shape (n_samples, y_pred.shape[1]) which returns only 1 vector of probabilities (because it is a single class) but 2 classes in the data are required.

Member

No I mean like

ValueError("got predict_proba of shape {}, but need classifier with two classes for {} scoring".format(y_pred.shape, self._score_func.__name__))
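As a standalone sketch of this suggestion (the helper name check_proba_shape is made up for illustration; in the PR the check lives inside _ThresholdScorer, where score_func is self._score_func):

```python
# Sketch of the suggested check as a free function; in the actual PR
# this logic sits inside _ThresholdScorer, not a standalone helper.
import numpy as np
from sklearn.metrics import roc_auc_score


def check_proba_shape(y_pred, score_func=roc_auc_score):
    if y_pred.shape[1] == 2:
        return y_pred[:, 1]  # keep the positive-class column
    raise ValueError("got predict_proba of shape {}, but need classifier "
                     "with two classes for {} scoring"
                     .format(y_pred.shape, score_func.__name__))


print(check_proba_shape(np.array([[0.3, 0.7], [0.9, 0.1]])))  # [0.7 0.1]
```

A one-column input now produces an error naming both the offending shape and the scoring function, e.g. "got predict_proba of shape (4, 1), but need classifier with two classes for roc_auc_score scoring".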

Member Author

ok, I have made the updates.

@amueller (Member)

thanks.

@jnothman (Member) left a comment

Thanks!

@jnothman jnothman merged commit 94db3d9 into scikit-learn:master Nov 14, 2018
@reshamas (Member Author)

Congratulations @AMDonati. We closed our first PR on scikit-learn! 🎆
Thanks @amueller for all your help!

@amueller (Member)

yay!

@amueller amueller mentioned this pull request Nov 20, 2018
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Nov 20, 2018
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

Successfully merging this pull request may close these issues.

BUG: Using GridSearchCV with scoring='roc_auc' and GMM as classifier gives IndexError