-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
Mention possibility of regression targets in warning about unique classes >50% of n_samples #31689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sklearn/utils/multiclass.py
Outdated
"of samples. Note this may be because the data belongs to a regression " | ||
"problem, not a classification problem.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would only be triggered by a regression target containing only integers and with an integer dtype. Is it worth it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well technically integer float values are "multiclass", but I get your point...
type_of_target([1.0, 0.0, 3.0])
'multiclass'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it is too rare. Happy to close
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I misread the previous code block, integer float values end-up as multiclass.
The original motivation for the warning was not about confusion with regression problems according to #16399 (comment).
However this comment #26335 (comment) goes in your direction. So I think your addition is worth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm yes. Christian also mentions it later: #26335 (comment)
sklearn/utils/multiclass.py
Outdated
"of samples. Note this may be because the data belongs to a regression " | ||
"problem, not a classification problem.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nit:
"of samples. Note this may be because the data belongs to a regression " | |
"problem, not a classification problem.", | |
"of samples. The target data could represent a regression " | |
"problem, not a classification problem.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nicer thank you. What do you think about:
"y
could represent a regression problem, not a classification problem." ..?
(I am happy either way)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your suggestion works for me to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
…sses >50% of n_samples (scikit-learn#31689) Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
…sses >50% of n_samples (scikit-learn#31689) Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
…sses >50% of n_samples (scikit-learn#31689) Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
…sses >50% of n_samples (scikit-learn#31689) Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
…sses >50% of n_samples (scikit-learn#31689) Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
…sses >50% of n_samples (scikit-learn#31689) Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Reference Issues/PRs
Addresses #26335 (comment)
What does this implement/fix? Explain your changes.
In #26335 we added a warning when unique classes >50% of
n_samples
. This adds a note that it could also be due to the data being from a regression problem.Any other comments?