
Text vectorizer with unwanted float for max_idf or min_idf #20746


Closed
maziyank opened this issue Aug 15, 2021 · 3 comments · Fixed by #20752

Comments

@maziyank

When working with CountVectorizer, I accidentally set max_df to a float greater than 1.0 and it silently gave me a wrong result. According to the documentation, max_df and min_df accept either a float in [0, 1] or an int.

Steps/Code to Reproduce

Taken from the documentation example:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer(analyzer='word', min_df=2, max_df=3)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# result ['document', 'first']

Suppose I accidentally set max_df to a float greater than 1.0. This gives a different result:

vectorizer = CountVectorizer(analyzer='word', min_df=2, max_df=3.0)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# result ['document', 'first', 'is', 'the', 'this']

Looking at the source:

max_doc_count = (max_df
                 if isinstance(max_df, numbers.Integral)
                 else max_df * n_doc)

It seems that setting max_df=3.0 causes the value to be multiplied by the number of documents (4 in this case), which gives the same result as setting max_df=12 directly.
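
To illustrate, here is the quoted expression evaluated on its own (a minimal sketch, with n_doc = 4 matching the corpus above), showing why the float ends up behaving like max_df=12:

import numbers

n_doc = 4  # len(corpus) in the example above

for max_df in (3, 3.0):
    # Same expression as in the scikit-learn source quoted above
    max_doc_count = (max_df
                     if isinstance(max_df, numbers.Integral)
                     else max_df * n_doc)
    print(max_df, "->", max_doc_count)

# 3   -> 3     (int: interpreted as an absolute document count)
# 3.0 -> 12.0  (float: interpreted as a proportion and scaled by n_doc)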

Expected Results

A warning, or some prevention for the user, when max_df or min_df is set to a float outside the range [0, 1].

Versions

Linux-5.11.0-25-generic-x86_64-with-glibc2.10
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0]
NumPy 1.19.5
SciPy 1.6.2
Scikit-Learn 0.24.2
Imbalanced-Learn 0.7.0

@AlekLefebvre
Contributor

Thank you for your detailed bug report.

I was able to reproduce it and would like to contribute a PR. Is a warning sufficient or should it throw an error?

@AlekLefebvre
Contributor

I opted for throwing an error to enforce the range. Let me know if it should be done differently.
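
For readers following along, a minimal sketch of the kind of range check being discussed (the helper name _validate_df is illustrative only; the merged PR #20752 is the authoritative fix):

import numbers

def _validate_df(value, name):
    # Integers are absolute document counts and pass through unchanged.
    if isinstance(value, numbers.Integral):
        return value
    # Floats are proportions of documents, so they must lie in [0.0, 1.0].
    if not 0.0 <= value <= 1.0:
        raise ValueError(
            f"{name}={value!r} is a float, so it must be in the range [0.0, 1.0]"
        )
    return value

# _validate_df(3.0, "max_df") would raise a ValueError instead of silently
# scaling the value by the number of documents.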

@maziyank
Author

maziyank commented Aug 16, 2021

Yes @AlekLefebvre, that is sufficient, I think.
