-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
FIX raise error for max_df and min_df greater than 1 in Vectorizer #20752
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX raise error for max_df and min_df greater than 1 in Vectorizer #20752
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @AlekLefebvre ! Generally looks good aside from the following two comments.
doc/whats_new/v1.0.rst
Outdated
@@ -366,6 +366,10 @@ Changelog | |||
error with unsupported value type. | |||
:pr:`19520` by :user:`Jeff Zhao <kamiyaa>`. | |||
|
|||
- |Fix| Fixed a bug in :class:`feature_extraction.CountVectorizer` by raising an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- |Fix| Fixed a bug in :class:`feature_extraction.CountVectorizer` by raising an | |
- |Fix| Fixed a bug in :class:`feature_extraction.CountVectorizer` and | |
:class:`feature_extraction.TfidfVectorizer` by raising an |
Also please move this entry to v1.1.
|
debug.py
Outdated
@@ -0,0 +1,18 @@ | |||
from sklearn.feature_extraction.text import CountVectorizer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this file
@@ -832,6 +832,28 @@ def test_vectorizer_min_df(): | |||
assert len(vect.stop_words_) == 5 | |||
|
|||
|
|||
def test_vectorizer_max_df_unwanted_float(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could have a single test with parametrization. The parameter could be passed as a parameter of the function as well as the error message and error type that is expected.
doc/whats_new/v1.1.rst
Outdated
|
||
- |Fix| Fixed a bug in :class:`feature_extraction.CountVectorizer` and | ||
:class:`feature_extraction.TfidfVectorizer` by raising an | ||
error when min_idf or max_idf is a float and > 1.0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error when min_idf or max_idf is a float and > 1.0. | |
error when `min_idf` or `max_idf` are floating-point numbers greater than 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we start to normalize the use of check_scalar
, I think it would be best. The validation is also not done in the right location if we follow our own API ;)
sklearn/feature_extraction/text.py
Outdated
try: | ||
check_scalar(self.min_df, "min_df", int, min_val=0) | ||
except TypeError: | ||
check_scalar(self.min_df, "min_df", float, min_val=0.0, max_val=1.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try: | |
check_scalar(self.min_df, "min_df", int, min_val=0) | |
except TypeError: | |
check_scalar(self.min_df, "min_df", float, min_val=0.0, max_val=1.0) | |
if isinstance(self.min_df, numbers.Integral): | |
check_scalar(self.min_df, "min_df", numbers.Integral, min_val=0) | |
else: | |
check_scalar(self.min_df, "min_df", numbers.Real, min_val=0.0, max_val=1.0) |
sklearn/feature_extraction/text.py
Outdated
check_scalar(self.min_df, "min_df", float, min_val=0.0, max_val=1.0) | ||
|
||
if self.max_df is not None: | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can follow the same pattern as above.
(1.5, None, "min_df == 1.5, must be <= 1.0."), | ||
), | ||
) | ||
def test_vectorizer_max_df_unwanted_float(min_df, max_df, message): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you as well modify max_features
, could you add the validation in this check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can rename the test test_vectorizer_params_validation
@@ -832,6 +832,20 @@ def test_vectorizer_min_df(): | |||
assert len(vect.stop_words_) == 5 | |||
|
|||
|
|||
@pytest.mark.parametrize( | |||
"min_df, max_df, message", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of parametrizing on 2 variables, you can pass directly a dictionary with all the values
In addition, codecov was failing. Make sure to try all the possible combinations for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I just resolve the merge issue
Merging @AlekLefebvre Thanks |
…cikit-learn#20752) Co-authored-by: Alek Lefebvre <info@aleklefebvre.ca> Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
…20752) Co-authored-by: Alek Lefebvre <info@aleklefebvre.ca> Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
…cikit-learn#20752) Co-authored-by: Alek Lefebvre <info@aleklefebvre.ca> Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Reference Issues/PRs
Fixes #20746
What does this implement/fix? Explain your changes.
When setting the min_idf and max_idf of CountVectorizer, throw an error if the value is not an int and bigger than 1. This reflects the range and functionality documented for these parameters.
Any other comments?