Feat: DummyClassifier strategy that produces randomized probabilities #31462

Open
tmcclintock opened this issue Jun 1, 2025 · 6 comments · May be fixed by #31488
Labels
Needs Decision - Include Feature Requires decision regarding including feature New Feature

Comments

@tmcclintock

Describe the workflow you want to enable

Motivation

The dummy module is fantastic for testing pipelines all the way up through enterprise scales. The strategies offered in the DummyClassifier are excellent for testing corner cases. However, the strategies offered fall short when testing pipelines that include downstream tasks that depend on moments of the predicted probabilities (e.g. gains charts).

This is because the existing strategies do not include sampling random probabilities.

Proposed API:

Consider adding a new strategy with a name like uniform-proba or score-random or something similar that results in this behavior for binary classification:

print(DummyClassifier(strategy="uniform-proba").fit(X, y).predict_proba(X))
"""
[[0.5651713  0.4348287 ]
 [0.36557341 0.63442659]
 [0.42386353 0.57613647]
 ...
 [0.30348692 0.69651308]
 [0.59589879 0.40410121]
 [0.32664176 0.67335824]]
"""

Describe your proposed solution

Proposed implementation

I had something like this in mind:

class DummyClassifier(MultiOutputMixin, ClassifierMixin, BaseEstimator):
    ...

    def predict_proba(self, X):
        ...
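        for k in range(self.n_outputs_):
            if self._strategy == "uniform-proba":
                # One Dirichlet draw per sample; each row sums to 1.
                out = rs.dirichlet([1] * n_classes_[k], size=n_samples)
                # dirichlet already returns float64; the cast only makes that explicit.
                out = out.astype(np.float64)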
            ...

Similar to the "stratified" strategy, this simple implementation relies on numpy.random, in this case the Dirichlet distribution. Setting all the alphas to 1 makes every class's probability identically distributed, with each row drawn uniformly from the probability simplex. In contrast, the "stratified" strategy returns one-hot rows (a single 1 and the rest 0) drawn from a multinomial over the empirical class prior, which is what a Dirichlet produces only in the limit of very small alphas.
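
To make the contrast concrete, here is a minimal sketch of the two sampling behaviours using NumPy only. It is not the scikit-learn implementation: `class_prior` is a made-up example prior, and the multinomial draw is only an approximation of how "stratified" behaves, stated here as an assumption.

```python
import numpy as np

# A seeded RandomState, similar to what scikit-learn's check_random_state
# returns for an integer seed (assumption: used here only for illustration).
rs = np.random.RandomState(0)
n_samples, n_classes = 1000, 3

# Proposed behaviour: Dirichlet with all alphas equal to 1,
# i.e. rows drawn uniformly from the probability simplex.
dirichlet_proba = rs.dirichlet([1] * n_classes, size=n_samples)

# Rough stand-in for "stratified": one multinomial trial per sample,
# which yields one-hot rows. class_prior is a hypothetical example.
class_prior = np.array([0.2, 0.3, 0.5])
one_hot_proba = rs.multinomial(1, class_prior, size=n_samples)

print(dirichlet_proba[:3])  # continuous values, each row sums to 1
print(one_hot_proba[:3])    # rows containing a single 1
```

Both arrays have valid probability rows, but only the Dirichlet draw produces probabilities that vary continuously from sample to sample.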

Describe alternatives you've considered, if relevant

No response

Additional context

I am happy to make the PR. The biggest question is what the strategy string should be.

Thank you for reading 🙏.

@tmcclintock tmcclintock added New Feature Needs Triage Issue requires triage labels Jun 1, 2025
@betatim
Member

betatim commented Jun 3, 2025

I think this could be useful. The open questions are what to call the strategy and which sampling strategy to use.

@betatim betatim removed the Needs Triage Issue requires triage label Jun 3, 2025
@tmcclintock
Author

Thanks, @betatim. Do you recommend I create a PR or wait for more discussion?

@betatim
Member

betatim commented Jun 4, 2025

To be honest, I don't know. If you are ok investing a bit of time to make a PR that would be good, though it could be wasted if people don't like the idea.

@ogrisel do you have an opinion on this or know who we could ask?

@tmcclintock
Author

Looks like the author of #31488 eagerly knocked this out! We just need an approving reviewer.

@virchan virchan added the Needs Decision - Include Feature Requires decision regarding including feature label Jun 8, 2025
@glevv
Contributor

glevv commented Jun 8, 2025

It will always give a ROC AUC score of around 0.5. The most useful application of DummyClassifier is for model selection and comparisons. Wouldn't the uniform-proba strategy be a bit redundant in this case?

@tmcclintock
Author

@glevv good question, please see my original post for an example. Some performance metrics, such as a gains chart, depend on there being high entropy in the predicted probabilities across samples. The existing "uniform" strategy returns the same probabilities for every sample, so it is not high-entropy enough to test these.
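
As a rough illustration of the point about gains charts, the sketch below compares constant scores (what the "uniform" strategy yields for binary problems, 0.5 everywhere) with Dirichlet-sampled scores. `cumulative_gains` is a hypothetical helper written for this example, not a scikit-learn function, and the labels are random placeholders.

```python
import numpy as np


def cumulative_gains(y_true, scores):
    """Fraction of all positives captured while walking down the samples
    sorted by descending score (hypothetical helper for illustration)."""
    order = np.argsort(-scores, kind="stable")
    hits = np.cumsum(y_true[order])
    return hits / y_true.sum()


rs = np.random.RandomState(0)
n_samples = 1000
y = rs.randint(0, 2, size=n_samples)  # placeholder binary labels

# Constant scores: every sample ties, so the ranking is arbitrary.
constant_scores = np.full(n_samples, 0.5)

# Dirichlet([1, 1]) scores: positive-class probability varies per sample.
random_scores = rs.dirichlet([1, 1], size=n_samples)[:, 1]

print(np.unique(constant_scores).size)  # number of distinct score values
print(np.unique(random_scores).size)

gains = cumulative_gains(y, random_scores)
print(gains[:5], gains[-1])
```

With constant scores there is only one distinct value, so any binning or ranking step in the downstream chart code works on ties alone; the sampled scores exercise the sorting and bucketing logic even though they carry no signal.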
