Description
Describe the bug
With strip_accents='unicode', some characters are expanded to uppercase letters (e.g. ™ -> TM) even though lowercase=True.
This then triggers the warning "UserWarning: Upper case characters found in vocabulary while 'lowercase' is True".
(OK, there are worse things happening in the world ;-). Thanks for scikit-learn)
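A possible explanation, sketched with the standard library only: the scikit-learn docs state that strip_accents='unicode' relies on NFKD normalization, and NFKD expands ™ (U+2122) to the two capital letters "TM". If lowercasing runs before accent stripping (an assumption about the internal order, inferred from the output reported in this issue), those capitals are never lowercased. The helper `strip_accents_nfkd` is a hypothetical stand-in written for this illustration, not scikit-learn's own function.

```python
import unicodedata


def strip_accents_nfkd(text):
    """Hypothetical stand-in: NFKD-normalize, then drop combining marks."""
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(c for c in normalized if not unicodedata.combining(c))


word = "Problematic™"

# Order 1 (assumed to match the reported behaviour): lowercase, then strip.
# "™" has no lowercase form, so it survives .lower() and is expanded afterwards.
lower_then_strip = strip_accents_nfkd(word.lower())

# Order 2: strip first, then lowercase, so the expanded "TM" is lowercased too.
strip_then_lower = strip_accents_nfkd(word).lower()

print(lower_then_strip)  # problematicTM
print(strip_then_lower)  # problematictm
```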
Steps/Code to Reproduce
from sklearn.feature_extraction.text import CountVectorizer
x = ['This is Problematic™.', 'THIS IS NOT']
cv = CountVectorizer(
    lowercase=True,
    strip_accents='unicode',
    ngram_range=(1, 1)
)
x_v = cv.fit_transform(x)
print(cv.get_feature_names_out()) # Contains "problematicTM"
# You then get the warning
# "UserWarning: Upper case characters found in vocabulary while 'lowercase' is True"
# if you both fit a classifier AND call cv.transform
# (no warning if you only do one of the two)
from sklearn.svm import LinearSVC
y = [1, 0]
xtest = ['This is not']
ytest = [0]
# Comment out either of the two following lines and no warning is displayed
clf = LinearSVC(random_state=42).fit(x_v, y)
xtest_v = cv.transform(xtest)
Expected Results
print: ['is' 'not' 'problematictm' 'this']
Actual Results
print: ['is' 'not' 'problematicTM' 'this']
warning message:
/Users/fps/_fps/DeveloperTools/virtualenvs/fps_env/lib/python3.9/site-packages/sklearn/feature_extraction/text.py:1208: UserWarning: Upper case characters found in vocabulary while 'lowercase' is True. These entries will not be matched with any documents
warnings.warn(
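A possible workaround, as a minimal sketch: a custom preprocessor that strips accents first and lowercases last, so that letters produced by the NFKD expansion are lowercased as well. The name `strip_then_lower` is hypothetical, and the function is plain Python rather than part of scikit-learn.

```python
import unicodedata


def strip_then_lower(doc):
    """Hypothetical preprocessor: NFKD-normalize, drop combining marks, lowercase last."""
    normalized = unicodedata.normalize("NFKD", doc)
    stripped = "".join(c for c in normalized if not unicodedata.combining(c))
    return stripped.lower()


print(strip_then_lower("This is Problematic™."))  # this is problematictm.
print(strip_then_lower("Café"))                   # cafe
```

According to the documentation of the `preprocessor` parameter, passing a callable as `CountVectorizer(preprocessor=strip_then_lower)` overrides the built-in strip_accents and lowercase stage while keeping tokenization and n-gram generation. This has not been run against the versions listed in this report.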
Versions
/Users/fps/_fps/DeveloperTools/virtualenvs/fps_env/lib/python3.9/site-packages/setuptools/distutils_patch.py:25: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first.
warnings.warn(
System:
python: 3.9.2 (v3.9.2:1a79785e3e, Feb 19 2021, 09:06:10) [Clang 6.0 (clang-600.0.57)]
executable: /Users/fps/_fps/DeveloperTools/virtualenvs/fps_env/bin/python3
machine: macOS-10.16-x86_64-i386-64bit
Python dependencies:
pip: 21.2.4
setuptools: 49.2.1
sklearn: 1.0
numpy: 1.21.2
scipy: 1.7.1
Cython: None
pandas: 1.3.3
matplotlib: None
joblib: 1.0.1
threadpoolctl: 2.2.0
Built with OpenMP: True