FIX raise error for max_df and min_df greater than 1 in Vectorizer #20752

AlekLefebvre · 2021-08-15T18:14:30Z

Reference Issues/PRs

Fixes #20746

What does this implement/fix? Explain your changes.

When setting min_df or max_df on CountVectorizer, raise an error if the value is a float greater than 1. This matches the range and behavior documented for these parameters.
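To illustrate why a float above 1 is rejected, here is a minimal stdlib-only sketch of the documented semantics: an integer is an absolute document count, and a float in [0.0, 1.0] is a proportion of the corpus. The helper name `doc_count_threshold` is hypothetical and not part of scikit-learn. It only mirrors the behavior described above.

```python
import numbers


def doc_count_threshold(value, n_docs, name="max_df"):
    """Hypothetical helper: convert a min_df/max_df setting to a document count."""
    if isinstance(value, numbers.Integral):
        # An integer is read as an absolute number of documents.
        return value
    if value > 1.0:
        # A proportion above 1 has no sensible meaning, so reject it outright.
        raise ValueError(f"{name} == {value}, must be <= 1.0.")
    # A float in [0.0, 1.0] is read as a proportion of the corpus.
    return value * n_docs


print(doc_count_threshold(2, n_docs=10))    # integer: absolute count
print(doc_count_threshold(0.5, n_docs=10))  # float: proportion of 10 documents
```

Before a check of this kind, a value such as `max_df=2.0` would have been silently treated as a proportion of 200%.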

Any other comments?

rth

Thanks @AlekLefebvre ! Generally looks good aside from the following two comments.

rth · 2021-09-10T15:32:29Z

doc/whats_new/v1.0.rst

@@ -366,6 +366,10 @@ Changelog
  error with unsupported value type.
  :pr:`19520` by :user:`Jeff Zhao <kamiyaa>`.

+- |Fix| Fixed a bug in :class:`feature_extraction.CountVectorizer` by raising an


Suggested change

-- |Fix| Fixed a bug in :class:`feature_extraction.CountVectorizer` by raising an
+- |Fix| Fixed a bug in :class:`feature_extraction.CountVectorizer` and
+  :class:`feature_extraction.TfidfVectorizer` by raising an

Also please move this entry to v1.1.

sklearn/feature_extraction/text.py

rth · 2021-09-11T07:53:38Z

~~Also please resolve conflicts~~ Never mind, I had an outdated version.

rth · 2021-09-11T07:55:42Z

debug.py

@@ -0,0 +1,18 @@
+from sklearn.feature_extraction.text import CountVectorizer


Please remove this file

glemaitre · 2021-09-16T09:18:54Z

sklearn/feature_extraction/tests/test_text.py

@@ -832,6 +832,28 @@ def test_vectorizer_min_df():
    assert len(vect.stop_words_) == 5


+def test_vectorizer_max_df_unwanted_float():


We could have a single parametrized test. The parameter value, the expected error message, and the expected error type could all be passed as arguments to the test function.

glemaitre · 2021-09-16T09:19:50Z

doc/whats_new/v1.1.rst

+
+- |Fix| Fixed a bug in :class:`feature_extraction.CountVectorizer` and
+  :class:`feature_extraction.TfidfVectorizer` by raising an
+  error when min_idf or max_idf is a float and > 1.0.


Suggested change

-  error when min_idf or max_idf is a float and > 1.0.
+  error when `min_idf` or `max_idf` are floating-point numbers greater than 1.

glemaitre

Since we have started to standardize on check_scalar, I think it would be best to use it here. The validation is also not done in the right location if we follow our own API ;)

glemaitre · 2021-09-23T13:33:14Z

sklearn/feature_extraction/text.py

+            try:
+                check_scalar(self.min_df, "min_df", int, min_val=0)
+            except TypeError:
+                check_scalar(self.min_df, "min_df", float, min_val=0.0, max_val=1.0)


Suggested change

-            try:
-                check_scalar(self.min_df, "min_df", int, min_val=0)
-            except TypeError:
-                check_scalar(self.min_df, "min_df", float, min_val=0.0, max_val=1.0)
+            if isinstance(self.min_df, numbers.Integral):
+                check_scalar(self.min_df, "min_df", numbers.Integral, min_val=0)
+            else:
+                check_scalar(self.min_df, "min_df", numbers.Real, min_val=0.0, max_val=1.0)
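A minimal sketch of the dispatch suggested above, using only the standard library. `check_scalar_sketch` is a simplified, hypothetical stand-in for `sklearn.utils.check_scalar` (it is not the real implementation), and `validate_min_df` is an illustrative name. The point is that dispatching on `numbers.Integral` and `numbers.Real` accepts any integral or real type, whereas the concrete `int` and `float` types used in the try/except version would reject values such as a `Fraction`.

```python
import numbers
from fractions import Fraction


def check_scalar_sketch(x, name, target_type, min_val=None, max_val=None):
    """Simplified stand-in for sklearn.utils.check_scalar (assumption, not the real one)."""
    if not isinstance(x, target_type):
        raise TypeError(f"{name} must be an instance of {target_type}, not {type(x)}.")
    if min_val is not None and x < min_val:
        raise ValueError(f"{name} == {x}, must be >= {min_val}.")
    if max_val is not None and x > max_val:
        raise ValueError(f"{name} == {x}, must be <= {max_val}.")
    return x


def validate_min_df(min_df):
    """Pick the bounds from the abstract numeric type of the value."""
    if isinstance(min_df, numbers.Integral):
        # Integers are absolute counts: only a lower bound applies.
        return check_scalar_sketch(min_df, "min_df", numbers.Integral, min_val=0)
    # Everything else must be a real number in [0.0, 1.0].
    return check_scalar_sketch(min_df, "min_df", numbers.Real, min_val=0.0, max_val=1.0)


validate_min_df(3)               # accepted as a count
validate_min_df(0.25)            # accepted as a proportion
validate_min_df(Fraction(1, 2))  # a Real that is not a float, still accepted
```

Calling `validate_min_df(1.5)` in this sketch raises a `ValueError`, and a non-numeric value such as a string falls through to the `numbers.Real` branch and raises a `TypeError`.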

glemaitre · 2021-09-23T13:33:38Z

sklearn/feature_extraction/text.py

+                check_scalar(self.min_df, "min_df", float, min_val=0.0, max_val=1.0)
+
+        if self.max_df is not None:
+            try:


You can follow the same pattern as above.

glemaitre · 2021-09-23T13:36:00Z

sklearn/feature_extraction/tests/test_text.py

+        (1.5, None, "min_df == 1.5, must be <= 1.0."),
+    ),
+)
+def test_vectorizer_max_df_unwanted_float(min_df, max_df, message):


Since you also modify max_features, could you add its validation to this check?

You can rename the test to test_vectorizer_params_validation.

glemaitre · 2021-09-23T13:37:03Z

sklearn/feature_extraction/tests/test_text.py

@@ -832,6 +832,20 @@ def test_vectorizer_min_df():
    assert len(vect.stop_words_) == 5


+@pytest.mark.parametrize(
+    "min_df, max_df, message",


Instead of parametrizing on 2 variables, you can directly pass a dictionary with all the values.

glemaitre · 2021-09-23T13:39:00Z

In addition, codecov was failing. Make sure to try all the possible combinations for min_df and max_df (i.e., integer and float).
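The review suggestions above (one test, a dictionary of parameters per case, integer and float combinations) can be sketched as a table-driven check in plain Python. `validate_params` is a hypothetical stand-in for the vectorizer's validation, and the expected messages are assumptions modeled on the `"min_df == 1.5, must be <= 1.0."` message quoted in the diff. With pytest, the same `CASES` table would typically be passed to `pytest.mark.parametrize`.

```python
import numbers


def validate_params(min_df=1, max_df=1.0):
    """Hypothetical stand-in for the vectorizer's parameter validation."""
    for name, value in (("min_df", min_df), ("max_df", max_df)):
        if isinstance(value, numbers.Integral):
            if value < 0:
                raise ValueError(f"{name} == {value}, must be >= 0.")
        elif isinstance(value, numbers.Real):
            if value < 0.0:
                raise ValueError(f"{name} == {value}, must be >= 0.0.")
            if value > 1.0:
                raise ValueError(f"{name} == {value}, must be <= 1.0.")
        else:
            raise TypeError(f"{name} must be a number, not {type(value).__name__}.")


# One dictionary of keyword arguments per case, plus the expected error.
CASES = [
    ({"min_df": 1.5}, ValueError, "min_df == 1.5, must be <= 1.0."),
    ({"max_df": 2.0}, ValueError, "max_df == 2.0, must be <= 1.0."),
    ({"min_df": -1}, ValueError, "min_df == -1, must be >= 0."),
    ({"max_df": -0.5}, ValueError, "max_df == -0.5, must be >= 0.0."),
    ({"min_df": "a"}, TypeError, "min_df must be a number, not str."),
]


def check_case(params, err_type, message):
    """Return True if validate_params(**params) raises the expected error."""
    try:
        validate_params(**params)
    except err_type as exc:
        return str(exc) == message
    return False


results = [check_case(*case) for case in CASES]
```

Passing the parameters as a dictionary keeps each case to the arguments it actually exercises, so adding a new parameter such as max_features does not change the test signature.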

glemaitre

LGTM. I just resolved the merge issue.

glemaitre · 2021-10-12T20:56:55Z

Merging. Thanks @AlekLefebvre!

…cikit-learn#20752) Co-authored-by: Alek Lefebvre <info@aleklefebvre.ca> Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Throw error when min_idf or max_idf is float and > 1

937666a

github-actions bot added the module:feature_extraction label Aug 15, 2021

changelog

5b004f5

glemaitre self-requested a review September 2, 2021 17:16

rth reviewed Sep 10, 2021

merge and moved changes to 1.1

ca5835d

rth approved these changes Sep 11, 2021

removed debug file

06f70e9

glemaitre reviewed Sep 16, 2021

glemaitre requested changes Sep 16, 2021

glemaitre changed the title ~~Throw error when min_idf or max_idf is float and > 1. Fixes #20746.~~ FIX raise error for max_df and min_df greater than 1 in Vectorizer Sep 16, 2021

Alek Lefebvre added 3 commits September 19, 2021 11:03

refactor validation and parametrize tests

0a9640c

resolve conflicts

03ee6d0

resolve conflicts

83233c1

glemaitre reviewed Sep 23, 2021

Alek Lefebvre and others added 3 commits October 1, 2021 11:06

improve validation and test coverage

caa022b

resolve conflicts

60af959

Merge branch 'main' into unwanted_float_min_max_idf

98802a6

glemaitre approved these changes Oct 12, 2021

glemaitre merged commit c9525d1 into scikit-learn:main Oct 12, 2021

glemaitre mentioned this pull request Oct 23, 2021

Release 1.0.1 #21404

Merged

10 tasks

glemaitre added a commit that referenced this pull request Oct 25, 2021

FIX raise error for max_df and min_df greater than 1 in Vectorizer (#…

0d598f4

…20752) Co-authored-by: Alek Lefebvre <info@aleklefebvre.ca> Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

reshamas mentioned this pull request Dec 8, 2021

Use the function check_scalar for parameters validation #21927

Closed

41 tasks
