-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
Fix spurious warning from type_of_target when called on estimator.classes_ #31584
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix spurious warning from type_of_target when called on estimator.classes_ #31584
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @saskra. Please add a test for _get_response_values
in test_response.py
to check that no warning is raised now.
Like this? def test_response_values_warn_bug_type_of_target_on_classes_issue():
"""
Ensure no warning is raised due to incorrect call to `type_of_target(classes_)`
when using `CalibratedClassifierCV` with many classes.
This test verifies that using `classes_` instead of `y` in `_get_response_values`
does not incorrectly trigger the "unique classes > 50% of samples" warning.
"""
n_samples = 1000
n_features = 40
n_classes = 30 # Well below 50% of 1000 samples
rng = np.random.RandomState(42)
X = rng.rand(n_samples, n_features)
y = np.tile(np.arange(n_classes), int(np.ceil(n_samples / n_classes)))[:n_samples]
base_clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf = CalibratedClassifierCV(base_clf, method="isotonic", cv=2)
clf.fit(X, y)
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter("always")
_get_response_values(clf, X, response_method="predict_proba")
warning_messages = [str(warning.message) for warning in w]
# This warning was raised due to incorrect handling of classes_
unexpected = [
msg
for msg in warning_messages
if "number of unique classes is greater than 50%" in msg
]
assert not unexpected, f"Unexpected warning: {unexpected}" |
Yes but I think that we can make it a bit simpler def test_response_values_type_of_target_on_classes_no_warning():
"""
Ensure that _get_response_values doesn't raise the "unique classes > 50% of samples"
warning when calling `type_of_target(classes_)`.
non-regression test for issue #31583.
"""
X = np.random.RandomState(0).randn(120, 3)
# 30 classes, less than 50% of number of samples
y = np.repeat(np.arange(30), 4)
clf = LogisticRegression().fit(X, y)
with warnings.catch_warnings():
warnings.simplefilter("error", UserWarning)
_get_response_values(clf, X, response_method="predict_proba") |
…lues_type_of_target_on_classes_no_warning; do not add suppress_warning
fdb8624
to
eb90ddd
Compare
Thanks @saskra, looks good ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
ping @lucyleeow or @StefanieSenger for a second review maybe ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @saskra ! Nitpicks only
@@ -0,0 +1,3 @@ | |||
- Fixed a spurious warning in :func:`utils.multiclass.type_of_target` that could be triggered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's be specific about which warning we are talking about.
I also wonder if it is worth detailing all public functions this will affect, since whats new is aimed for users, and type_of_target
may be less meaningful to a user than e.g., GridSearchCV
will no longer give spurious "The number of unique classes is greater than 50% of the number of samples" warning when the number of samples is...etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you find out all the public functions that are affected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a clever solution outside of searching in the codebase. Maybe a regular expression search for when classes
is passed to type_of_target
? Or something that is not y_*
?
Failing that, just a list of classes/functions that use _get_response_values
would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like this?
- Fixed a spurious warning that could occur when passing ``estimator.classes_``
or similar arrays with many unique values to classification utilities.
The warning came from :func:`sklearn.utils.multiclass.type_of_target` and has
now been suppressed when the input is not a true target vector.
The warning message was: "The number of unique classes is greater than
50% of the number of samples."
This could appear in tools that internally validate classification outputs,
such as :class:`~sklearn.model_selection.GridSearchCV`,
:func:`~sklearn.model_selection.cross_val_score`,
:func:`~sklearn.metrics.make_scorer`,
:class:`~sklearn.multioutput.MultiOutputClassifier`,
and :class:`~sklearn.calibration.CalibratedClassifierCV`.
By :user:`Sascha D. Krauss <saskra>`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the large amount of potentially affect public classes and functions, I'm not sure it's worth trying to list them all. It was just an unexpected warning, not a critical bug. Since type_of_target
is a public function I think it's fine for the changelog to only be about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point, what about something like:
- Fixed a spurious warning (about the number of unique classes being
greater than 50% of the number of samples) that could occur when
passing `classes` :func:`utils.multiclass.type_of_target`.
Note you don't need the sklearn
at the start in whats new entries.
@@ -302,7 +302,7 @@ def test_type_of_target_too_many_unique_classes(): | |||
We need to check that we don't raise if we have less than 20 samples. | |||
""" | |||
|
|||
y = np.arange(25) | |||
y = np.hstack((np.arange(20), [0])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a note here to explain why we add the '0' at the end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And maybe we could keep the original y = np.arange(25)
and check that this does not raise a warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, the original y = np.arange(25)
now raises a warning because the condition y.shape[0] > classes.shape[0]
is not fulfilled, but y.shape[0] == classes.shape[0]
. Hence the second zero in the new array in the test, so that you at least have a single class with two samples. The whole point was to prevent the warning from being issued incorrectly if the function is only called with the unique class list instead of the real list of class labels per sample - hence the original idea was also an argument to suppress the warning for this case.
So should we rather adapt the test or treat the case y.shape[0] == classes.shape[0]
in type_of_target
differently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a note here to explain why we add the '0' at the end?
+1
Let's add a comment to explain that we create an array with almost all classes represented only once except 0 represented twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And maybe we could keep the original y = np.arange(25) and check that this does not raise a warning?
However, the original y = np.arange(25) now raises a warning because the condition y.shape[0] > classes.shape[0] is not fulfilled, but y.shape[0] == classes.shape[0]
@saskra I think @lucyleeow is correct. Now the np.arange(25)
case won't raise the warning because we special cased y.shape[0] == classes.shape[0]
so we can check that indeed no warning is raised. Something like the following that we'd put at the end of the test.
# More than 20 samples but only unique classes, no warning should be raised
y = np.arange(25)
with warnings.catch_warnings():
warnings.simplefilter("ignore", UserWarning)
type_of_target(y)
Co-authored-by: Lucy Liu <jliu176@gmail.com>
…ique_classes plus explanatory comments
Also #31584 (comment) should help you fix lint issues |
@saskra something very strange happened to the |
PyCharm has been doing a lot of weird standalone things since the developers started focussing on AI only. Is it fixed now? |
@@ -247,7 +247,6 @@ def _generate_sparse( | |||
], | |||
} | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please revert unrelated line removal
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @saskra
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a nit about comments
Co-authored-by: Lucy Liu <jliu176@gmail.com>
Co-authored-by: Lucy Liu <jliu176@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your patience!
…sses_ (scikit-learn#31584) Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai> Co-authored-by: Lucy Liu <jliu176@gmail.com>
…sses_ (scikit-learn#31584) Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai> Co-authored-by: Lucy Liu <jliu176@gmail.com>
…sses_ (scikit-learn#31584) Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai> Co-authored-by: Lucy Liu <jliu176@gmail.com>
…sses_ (scikit-learn#31584) Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai> Co-authored-by: Lucy Liu <jliu176@gmail.com>
…sses_ (scikit-learn#31584) Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai> Co-authored-by: Lucy Liu <jliu176@gmail.com>
…sses_ (scikit-learn#31584) Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai> Co-authored-by: Lucy Liu <jliu176@gmail.com>
Reference Issues/PRs
Fixes #31583
What does this implement/fix? Explain your changes.
This PR suppresses an unintended warning in get_response_values, where type_of_target is called on estimator.classes_. Since classes_ does not represent full sample-level data, this call may spuriously trigger the warning:
"The number of unique classes is greater than 50% of the number of samples."
This is now avoided by passing suppress_warning=True to type_of_target() at this specific location.
This patch is intentionally minimal and does not affect calls to type_of_target that operate on actual sample labels (y, y_true, etc.).
Any other comments?
This was first observed while calibrating classifiers with many classes. Although the dataset was large and well-balanced, the warning appeared due to how classes_ was passed into type_of_target.
Apologies in advance if this is already known or intentional – this is my first contribution here, and I appreciate any feedback or corrections.
Thanks for your time and for maintaining this great library!