Feat: DummyClassifier strategy that produces randomized probabilities #31462
Comments
I think this could be useful. The open questions are what to call the strategy and which strategy to use.
Thanks, @betatim. Do you recommend I create a PR or wait for more discussion?
To be honest, I don't know. If you are OK investing a bit of time to make a PR, that would be good, though it could be wasted if people don't like the idea. @ogrisel do you have an opinion on this or know who we could ask?
Looks like the author of #31488 eagerly knocked this out! We just need an approving reviewer.
It will always give a ROC AUC score of around 0.5. The most useful application of
@glevv good question: please see my original post for an example. Some performance metrics, such as a gains chart, depend on there being high entropy in the predicted probabilities. The
Describe the workflow you want to enable
Motivation
The dummy module is fantastic for testing pipelines all the way up through enterprise scales. The strategies offered in the DummyClassifier are excellent for testing corner cases. However, the strategies offered fall short when testing pipelines that include downstream tasks that depend on moments of the predicted probabilities (e.g. gains charts).
This is because the existing strategies do not include sampling random probabilities.
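To make the gap concrete, here is a small sketch that counts how many distinct positive-class scores each existing strategy returns from predict_proba. The toy data and the variable names (X, y, distinct_scores) are assumptions made for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy, balanced binary data (illustrative only; DummyClassifier ignores the features).
X = np.zeros((100, 1))
y = np.array([0, 1] * 50)

distinct_scores = {}
for strategy in ["most_frequent", "prior", "stratified", "uniform"]:
    clf = DummyClassifier(strategy=strategy, random_state=0).fit(X, y)
    proba = clf.predict_proba(X)
    # Number of distinct predicted probabilities for the positive class.
    distinct_scores[strategy] = np.unique(proba[:, 1]).size
    print(strategy, distinct_scores[strategy])
```

Each strategy yields either a constant score or one-hot rows, so anything downstream that ranks or bins samples by score has almost nothing to work with.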
Proposed API:
Consider adding a new strategy with a name like uniform-proba or score-random or something similar that results in randomized predicted probabilities for binary classification.
Describe your proposed solution
Proposed implementation
I had something like this in mind:
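A minimal sketch of that idea, assuming a hypothetical standalone helper named random_proba rather than actual DummyClassifier internals, and assuming NumPy's Generator API for the sampling:

```python
import numpy as np


def random_proba(n_samples, n_classes, random_state=None):
    """Hypothetical helper: draw one random probability vector per sample."""
    rng = np.random.default_rng(random_state)
    # alpha = 1 for every class gives a flat Dirichlet, i.e. a uniform
    # distribution over all valid probability vectors.
    alpha = np.ones(n_classes)
    return rng.dirichlet(alpha, size=n_samples)


# Binary case: each row is (P(class 0), P(class 1)) and sums to 1.
proba = random_proba(n_samples=5, n_classes=2, random_state=0)
print(proba)
print(proba.sum(axis=1))
```

In the binary case the positive-class column is uniformly distributed on [0, 1], so the scores are all distinct and spread out, which is what a gains chart needs.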
Similar to the "stratified" strategy, this simple implementation relies on numpy.random, in this case the dirichlet distribution. By setting all the alphas to 1, we are specifying that every valid probability vector is equally likely, i.e. a uniform distribution over the simplex. In contrast, the "stratified" strategy returns one-hot rows (it draws a single class per sample according to the empirical class prior), which loosely resembles a degenerate Dirichlet that puts all of its mass on one class.
Describe alternatives you've considered, if relevant
No response
Additional context
I am happy to make the PR. The biggest question is what the strategy string should be.
Thank you for reading 🙏.