When working with CountVectorizer, accidentally setting max_df to a float greater than 1.0 gives a wrong result. According to the documentation, max_df and min_df can be set either to a float in the range [0.0, 1.0] or to an int when using CountVectorizer.
Steps/Code to Reproduce
Taken from example:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer(analyzer='word', min_df=2, max_df=3)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# result: ['document', 'first']
Suppose I accidentally set max_df to a float greater than 1.0. This gives me a different result:
Looking at the source, it seems that setting max_df=3.0 causes it to be multiplied by the number of documents (4, in this case), giving the same result as directly setting max_df=12.
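The behaviour described above can be shown with a plain-Python sketch of the thresholding step (the variable names are mine, and the scikit-learn source checks numbers.Integral, simplified here to int):

```python
n_docs = 4          # size of the example corpus above
max_df = 3.0        # the accidental float

# An int is taken as an absolute document count; any float -- even one
# greater than 1.0 -- is treated as a proportion and scaled by n_docs.
max_doc_count = max_df if isinstance(max_df, int) else max_df * n_docs
print(max_doc_count)  # 12.0 -- equivalent to max_df=12
```

Because 12 exceeds every term's document frequency in a 4-document corpus, max_df silently filters nothing at all.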
Expected Results
A warning, or prevention, for the user if they set max_df or min_df to a float outside the range [0, 1].
Versions
Linux-5.11.0-25-generic-x86_64-with-glibc2.10
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0]
NumPy 1.19.5
SciPy 1.6.2
Scikit-Learn 0.24.2
Imbalanced-Learn 0.7.0
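The prevention requested under Expected Results could be sketched as a small guard like the following; the function name and error message are my own, not scikit-learn API:

```python
def check_df_param(name, value):
    """Hypothetical guard: reject a float min_df/max_df outside [0, 1]."""
    if isinstance(value, float) and not 0.0 <= value <= 1.0:
        raise ValueError(
            f"{name}={value!r}: a float must be in the range [0, 1]; "
            f"use an int for an absolute document count."
        )
    return value

check_df_param("min_df", 2)      # OK: int, absolute document count
check_df_param("max_df", 0.95)   # OK: float proportion
# check_df_param("max_df", 3.0)  # would raise ValueError
```

Run at construction or fit time, this would turn the silent misbehaviour above into an immediate, explicit error.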