
Text vectorizer with unwanted float for max_idf or min_idf #20746


Closed
maziyank opened this issue Aug 15, 2021 · 3 comments · Fixed by #20752

Comments

@maziyank

When working with CountVectorizer, I accidentally set max_df to a float greater than 1.0 and it silently gave me a wrong result. According to the documentation, max_df and min_df accept either a float in [0, 1] or an int.

Steps/Code to Reproduce

Taken from the documentation example:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer(analyzer='word', min_df=2, max_df=3)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# result ['document', 'first']

Suppose I accidentally set max_df to a float greater than 1.0. This gives a different result:

vectorizer = CountVectorizer(analyzer='word', min_df=2, max_df=3.0)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# result ['document', 'first', 'is', 'the', 'this']

Looking at the source:

max_doc_count = (max_df
                 if isinstance(max_df, numbers.Integral)
                 else max_df * n_doc)

It seems that setting max_df=3.0 causes the value to be multiplied by the number of documents (4 in this case), which gives the same result as setting max_df=12 directly.
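
To illustrate, here is the quoted expression evaluated on its own (a minimal sketch, with n_doc = 4 matching the corpus above), showing why the float ends up behaving like max_df=12:

import numbers

n_doc = 4  # len(corpus) in the example above

for max_df in (3, 3.0):
    # Same expression as in the scikit-learn source quoted above
    max_doc_count = (max_df
                     if isinstance(max_df, numbers.Integral)
                     else max_df * n_doc)
    print(max_df, "->", max_doc_count)

# 3   -> 3     (int: interpreted as an absolute document count)
# 3.0 -> 12.0  (float: interpreted as a proportion and scaled by n_doc)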

Expected Results

A warning, or some prevention for the user, when max_df or min_df is set to a float outside the range [0, 1].

Versions

Linux-5.11.0-25-generic-x86_64-with-glibc2.10
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0]
NumPy 1.19.5
SciPy 1.6.2
Scikit-Learn 0.24.2
Imbalanced-Learn 0.7.0

@AlekLefebvre
Contributor

Thank you for your detailed bug report.

I was able to reproduce it and would like to contribute a PR. Is a warning sufficient or should it throw an error?

@AlekLefebvre
Contributor

I opted for throwing an error to enforce the range. Let me know if it should be done differently.
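
For readers following along, a minimal sketch of the kind of range check being discussed (the helper name _validate_df is illustrative only; the merged PR #20752 is the authoritative fix):

import numbers

def _validate_df(value, name):
    # Integers are absolute document counts and pass through unchanged.
    if isinstance(value, numbers.Integral):
        return value
    # Floats are proportions of documents, so they must lie in [0.0, 1.0].
    if not 0.0 <= value <= 1.0:
        raise ValueError(
            f"{name}={value!r} is a float, so it must be in the range [0.0, 1.0]"
        )
    return value

# _validate_df(3.0, "max_df") would raise a ValueError instead of silently
# scaling the value by the number of documents.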

@maziyank
Author

maziyank commented Aug 16, 2021

Yes @AlekLefebvre, that is sufficient, I think.
