Skip to content

FIX raise error for max_df and min_df greater than 1 in Vectorizer #20752

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Oct 12, 2021

Conversation

AlekLefebvre
Copy link
Contributor

@AlekLefebvre AlekLefebvre commented Aug 15, 2021

Reference Issues/PRs

Fixes #20746

What does this implement/fix? Explain your changes.

When setting the min_idf and max_idf of CountVectorizer, throw an error if the value is not an int and bigger than 1. This reflects the range and functionality documented for these parameters.

Any other comments?

@glemaitre glemaitre self-requested a review September 2, 2021 17:16
Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @AlekLefebvre ! Generally looks good aside from the following two comments.

@@ -366,6 +366,10 @@ Changelog
error with unsupported value type.
:pr:`19520` by :user:`Jeff Zhao <kamiyaa>`.

- |Fix| Fixed a bug in :class:`feature_extraction.CountVectorizer` by raising an
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- |Fix| Fixed a bug in :class:`feature_extraction.CountVectorizer` by raising an
- |Fix| Fixed a bug in :class:`feature_extraction.CountVectorizer` and
:class:`feature_extraction.TfidfVectorizer` by raising an

Also please move this entry to v1.1.

@rth
Copy link
Member

rth commented Sep 11, 2021

Also please resolve conflicts Never mind I had an outdated version.

debug.py Outdated
@@ -0,0 +1,18 @@
from sklearn.feature_extraction.text import CountVectorizer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this file

@@ -832,6 +832,28 @@ def test_vectorizer_min_df():
assert len(vect.stop_words_) == 5


def test_vectorizer_max_df_unwanted_float():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have a single test with parametrization. The parameter could be passed as a parameter of the function as well as the error message and error type that is expected.


- |Fix| Fixed a bug in :class:`feature_extraction.CountVectorizer` and
:class:`feature_extraction.TfidfVectorizer` by raising an
error when min_idf or max_idf is a float and > 1.0.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
error when min_idf or max_idf is a float and > 1.0.
error when `min_idf` or `max_idf` are floating-point numbers greater than 1.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we start to normalize the use of check_scalar, I think it would be best. The validation is also not done in the right location if we follow our own API ;)

@glemaitre glemaitre changed the title Throw error when min_idf or max_idf is float and > 1. Fixes #20746. FIX raise error for max_df and min_df greater than 1 in Vectorizer Sep 16, 2021
Comment on lines 1268 to 1271
try:
check_scalar(self.min_df, "min_df", int, min_val=0)
except TypeError:
check_scalar(self.min_df, "min_df", float, min_val=0.0, max_val=1.0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
try:
check_scalar(self.min_df, "min_df", int, min_val=0)
except TypeError:
check_scalar(self.min_df, "min_df", float, min_val=0.0, max_val=1.0)
if isinstance(self.min_df, numbers.Integral):
check_scalar(self.min_df, "min_df", numbers.Integral, min_val=0)
else:
check_scalar(self.min_df, "min_df", numbers.Real, min_val=0.0, max_val=1.0)

check_scalar(self.min_df, "min_df", float, min_val=0.0, max_val=1.0)

if self.max_df is not None:
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can follow the same pattern as above.

(1.5, None, "min_df == 1.5, must be <= 1.0."),
),
)
def test_vectorizer_max_df_unwanted_float(min_df, max_df, message):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you as well modify max_features, could you add the validation in this check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can rename the test test_vectorizer_params_validation

@@ -832,6 +832,20 @@ def test_vectorizer_min_df():
assert len(vect.stop_words_) == 5


@pytest.mark.parametrize(
"min_df, max_df, message",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of parametrizing on 2 variables, you can pass directly a dictionary with all the values

@glemaitre
Copy link
Member

In addition, codecov was failing. Make sure to try all the possible combinations for min_df and max_df (i.e integer and float)

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I just resolve the merge issue

@glemaitre glemaitre merged commit c9525d1 into scikit-learn:main Oct 12, 2021
@glemaitre
Copy link
Member

Merging @AlekLefebvre Thanks

@glemaitre glemaitre mentioned this pull request Oct 23, 2021
10 tasks
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Oct 23, 2021
…cikit-learn#20752)

Co-authored-by: Alek Lefebvre <info@aleklefebvre.ca>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
glemaitre added a commit that referenced this pull request Oct 25, 2021
…20752)

Co-authored-by: Alek Lefebvre <info@aleklefebvre.ca>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
…cikit-learn#20752)

Co-authored-by: Alek Lefebvre <info@aleklefebvre.ca>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Text vectorizer with unwanted float for max_idf or min_idf
3 participants