Skip to content

Fix spurious warning from type_of_target when called on estimator.classes_ #31584

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jul 14, 2025

Conversation

saskra
Copy link
Contributor

@saskra saskra commented Jun 18, 2025

Reference Issues/PRs

Fixes #31583

What does this implement/fix? Explain your changes.

This PR suppresses an unintended warning in get_response_values, where type_of_target is called on estimator.classes_. Since classes_ does not represent full sample-level data, this call may spuriously trigger the warning:

"The number of unique classes is greater than 50% of the number of samples."

This is now avoided by passing suppress_warning=True to type_of_target() at this specific location.

This patch is intentionally minimal and does not affect calls to type_of_target that operate on actual sample labels (y, y_true, etc.).

Any other comments?

This was first observed while calibrating classifiers with many classes. Although the dataset was large and well-balanced, the warning appeared due to how classes_ was passed into type_of_target.

Apologies in advance if this is already known or intentional – this is my first contribution here, and I appreciate any feedback or corrections.

Thanks for your time and for maintaining this great library!

Copy link

github-actions bot commented Jun 18, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 421728c. Link to the linter CI: here

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @saskra. Please add a test for _get_response_values in test_response.py to check that no warning is raised now.

@saskra
Copy link
Contributor Author

saskra commented Jul 2, 2025

Thanks for the PR @saskra. Please add a test for _get_response_values in test_response.py to check that no warning is raised now.

Like this?

def test_response_values_warn_bug_type_of_target_on_classes_issue():
    """
    Ensure no warning is raised due to incorrect call to `type_of_target(classes_)`
    when using `CalibratedClassifierCV` with many classes.

    This test verifies that using `classes_` instead of `y` in `_get_response_values`
    does not incorrectly trigger the "unique classes > 50% of samples" warning.
    """
    n_samples = 1000
    n_features = 40
    n_classes = 30  # Well below 50% of 1000 samples

    rng = np.random.RandomState(42)
    X = rng.rand(n_samples, n_features)
    y = np.tile(np.arange(n_classes), int(np.ceil(n_samples / n_classes)))[:n_samples]

    base_clf = RandomForestClassifier(n_estimators=10, random_state=0)
    clf = CalibratedClassifierCV(base_clf, method="isotonic", cv=2)
    clf.fit(X, y)

    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter("always")

        _get_response_values(clf, X, response_method="predict_proba")

        warning_messages = [str(warning.message) for warning in w]

    # This warning was raised due to incorrect handling of classes_
    unexpected = [
        msg
        for msg in warning_messages
        if "number of unique classes is greater than 50%" in msg
    ]

    assert not unexpected, f"Unexpected warning: {unexpected}"

@jeremiedbb
Copy link
Member

Yes but I think that we can make it a bit simpler

def test_response_values_type_of_target_on_classes_no_warning():
    """
    Ensure that _get_response_values doesn't raise the "unique classes > 50% of samples"
    warning when calling `type_of_target(classes_)`.

    non-regression test for issue #31583.
    """
    X = np.random.RandomState(0).randn(120, 3)
    # 30 classes, less than 50% of number of samples
    y = np.repeat(np.arange(30), 4)

    clf = LogisticRegression().fit(X, y)

    with warnings.catch_warnings():
        warnings.simplefilter("error", UserWarning)

        _get_response_values(clf, X, response_method="predict_proba")

…lues_type_of_target_on_classes_no_warning; do not add suppress_warning
@saskra saskra force-pushed the fix-type-of-target-warning branch from fdb8624 to eb90ddd Compare July 2, 2025 13:39
@jeremiedbb
Copy link
Member

Thanks @saskra, looks good !
I just directly pushed a small fix because np.concat didn't exist in all numpy versions that we support.
Please add a changelog entry.

@jeremiedbb jeremiedbb added this to the 1.7.1 milestone Jul 7, 2025
Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jeremiedbb
Copy link
Member

ping @lucyleeow or @StefanieSenger for a second review maybe ?

Copy link
Member

@lucyleeow lucyleeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @saskra ! Nitpicks only

@@ -0,0 +1,3 @@
- Fixed a spurious warning in :func:`utils.multiclass.type_of_target` that could be triggered
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's be specific about which warning we are talking about.

I also wonder if it is worth detailing all public functions this will affect, since whats new is aimed for users, and type_of_target may be less meaningful to a user than e.g., GridSearchCV will no longer give spurious "The number of unique classes is greater than 50% of the number of samples" warning when the number of samples is...etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you find out all the public functions that are affected?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a clever solution outside of searching in the codebase. Maybe a regular expression search for when classes is passed to type_of_target ? Or something that is not y_* ?
Failing that, just a list of classes/functions that use _get_response_values would be nice.

Copy link
Contributor Author

@saskra saskra Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like this?

- Fixed a spurious warning that could occur when passing ``estimator.classes_``
  or similar arrays with many unique values to classification utilities.
  The warning came from :func:`sklearn.utils.multiclass.type_of_target` and has
  now been suppressed when the input is not a true target vector.
  The warning message was: "The number of unique classes is greater than 
  50% of the number of samples."
  This could appear in tools that internally validate classification outputs,
  such as :class:`~sklearn.model_selection.GridSearchCV`,
  :func:`~sklearn.model_selection.cross_val_score`,
  :func:`~sklearn.metrics.make_scorer`,
  :class:`~sklearn.multioutput.MultiOutputClassifier`,
  and :class:`~sklearn.calibration.CalibratedClassifierCV`.

  By :user:`Sascha D. Krauss <saskra>`

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the large amount of potentially affect public classes and functions, I'm not sure it's worth trying to list them all. It was just an unexpected warning, not a critical bug. Since type_of_target is a public function I think it's fine for the changelog to only be about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point, what about something like:

- Fixed a spurious warning (about the number of unique classes being
  greater than 50% of the number of samples) that could occur when
  passing `classes` :func:`utils.multiclass.type_of_target`.

Note you don't need the sklearn at the start in whats new entries.

@@ -302,7 +302,7 @@ def test_type_of_target_too_many_unique_classes():
We need to check that we don't raise if we have less than 20 samples.
"""

y = np.arange(25)
y = np.hstack((np.arange(20), [0]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a note here to explain why we add the '0' at the end?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe we could keep the original y = np.arange(25) and check that this does not raise a warning?

Copy link
Contributor Author

@saskra saskra Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, the original y = np.arange(25) now raises a warning because the condition y.shape[0] > classes.shape[0] is not fulfilled, but y.shape[0] == classes.shape[0]. Hence the second zero in the new array in the test, so that you at least have a single class with two samples. The whole point was to prevent the warning from being issued incorrectly if the function is only called with the unique class list instead of the real list of class labels per sample - hence the original idea was also an argument to suppress the warning for this case.

So should we rather adapt the test or treat the case y.shape[0] == classes.shape[0] in type_of_target differently?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a note here to explain why we add the '0' at the end?

+1
Let's add a comment to explain that we create an array with almost all classes represented only once except 0 represented twice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe we could keep the original y = np.arange(25) and check that this does not raise a warning?

However, the original y = np.arange(25) now raises a warning because the condition y.shape[0] > classes.shape[0] is not fulfilled, but y.shape[0] == classes.shape[0]

@saskra I think @lucyleeow is correct. Now the np.arange(25) case won't raise the warning because we special cased y.shape[0] == classes.shape[0] so we can check that indeed no warning is raised. Something like the following that we'd put at the end of the test.

# More than 20 samples but only unique classes, no warning should be raised
y = np.arange(25)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    type_of_target(y)

saskra and others added 2 commits July 11, 2025 08:36
Co-authored-by: Lucy Liu <jliu176@gmail.com>
@lucyleeow
Copy link
Member

Also #31584 (comment) should help you fix lint issues

@jeremiedbb
Copy link
Member

@saskra something very strange happened to the test_multiclass.py file. Looks like you changed your default tab/spacing in you ide or something 😄

@saskra
Copy link
Contributor Author

saskra commented Jul 11, 2025

@saskra something very strange happened to the test_multiclass.py file. Looks like you changed your default tab/spacing in you ide or something 😄

PyCharm has been doing a lot of weird standalone things since the developers started focussing on AI only. Is it fixed now?

@@ -247,7 +247,6 @@ def _generate_sparse(
],
}


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please revert unrelated line removal

saskra and others added 2 commits July 11, 2025 12:44
@jeremiedbb jeremiedbb added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Jul 11, 2025
Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @saskra

Copy link
Member

@lucyleeow lucyleeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit about comments

Co-authored-by: Lucy Liu <jliu176@gmail.com>
Co-authored-by: Lucy Liu <jliu176@gmail.com>
Copy link
Member

@lucyleeow lucyleeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience!

@lucyleeow lucyleeow merged commit 9b7a86f into scikit-learn:main Jul 14, 2025
36 checks passed
@saskra saskra deleted the fix-type-of-target-warning branch July 14, 2025 08:47
jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 15, 2025
…sses_ (scikit-learn#31584)

Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Co-authored-by: Lucy Liu <jliu176@gmail.com>
@jeremiedbb jeremiedbb mentioned this pull request Jul 15, 2025
13 tasks
jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 16, 2025
…sses_ (scikit-learn#31584)

Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Co-authored-by: Lucy Liu <jliu176@gmail.com>
jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 16, 2025
…sses_ (scikit-learn#31584)

Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Co-authored-by: Lucy Liu <jliu176@gmail.com>
jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 16, 2025
…sses_ (scikit-learn#31584)

Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Co-authored-by: Lucy Liu <jliu176@gmail.com>
jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 16, 2025
…sses_ (scikit-learn#31584)

Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Co-authored-by: Lucy Liu <jliu176@gmail.com>
jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 16, 2025
…sses_ (scikit-learn#31584)

Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Co-authored-by: Lucy Liu <jliu176@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:utils To backport PR merged in master that need a backport to a release branch defined based on the milestone.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unjustified "number of unique classes > 50%" warning in CalibratedClassifierCV
3 participants