CountVectorizer(lowercase=True,strip_accents='unicode') may produce vocab that contains uppercase chars #21207

@fpservant

Describe the bug

Some characters are converted to uppercase ASCII characters (e.g. ™ -> TM), despite lowercase=True.
This then triggers the warning "UserWarning: Upper case characters found in vocabulary while 'lowercase' is True".
(OK, there are worse things happening in the world ;-). Thanks for scikit-learn)
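
The transformation can be reproduced with the standard library alone. This is a minimal sketch, assuming that strip_accents='unicode' relies on NFKD normalization and that it runs after lowercasing (an assumption about the internals, not something confirmed in this report). The variable names are only for illustration.

```python
import unicodedata

token = "problematic™"

# Step 1: lowercasing leaves the trade mark sign untouched,
# because U+2122 has no lowercase form.
lowered = token.lower()

# Step 2: NFKD normalization expands U+2122 into its compatibility
# decomposition, the two uppercase ASCII letters "T" and "M".
normalized = unicodedata.normalize("NFKD", lowered)

# Drop combining marks, as an accent stripper would.
stripped = "".join(c for c in normalized if not unicodedata.combining(c))

print(stripped)  # problematicTM
```

If that assumption holds, the uppercase letters appear only after the lowercasing step has already run, which would explain why lowercase=True does not catch them.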

Steps/Code to Reproduce

from sklearn.feature_extraction.text import CountVectorizer
x = ['This is Problematic™.','THIS IS NOT']

cv = CountVectorizer(
    lowercase=True,
    strip_accents='unicode',
    ngram_range=(1,1)
    )

x_v = cv.fit_transform(x)
print(cv.get_feature_names_out()) # Contains "problematicTM"

# The following warning can then appear:
# "UserWarning: Upper case characters found in vocabulary while 'lowercase' is True"
# It only shows up if you both fit a classifier AND call cv.transform
# (no message if you only do one of the two)

from sklearn.svm import LinearSVC
y = [1,0]
xtest = ['This is not']
ytest = [0]

# Comment out either of the two following lines and no warning is displayed
clf = LinearSVC(random_state=42).fit(x_v, y)
xtest_v = cv.transform(xtest)

Expected Results

print: ['is' 'not' 'problematictm' 'this']

Actual Results

print: ['is' 'not' 'problematicTM' 'this']

warning message:
/Users/fps/_fps/DeveloperTools/virtualenvs/fps_env/lib/python3.9/site-packages/sklearn/feature_extraction/text.py:1208: UserWarning: Upper case characters found in vocabulary while 'lowercase' is True. These entries will not be matched with any documents
warnings.warn(
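
A possible workaround, sketched here under stated assumptions rather than offered as a fix: strip accents first and lowercase afterwards, so that any uppercase letters introduced by normalization are lowercased too. The helper name strip_then_lower is hypothetical, and the NFKD-based stripping is a stand-in for whatever strip_accents='unicode' does internally.

```python
import unicodedata


def strip_then_lower(doc: str) -> str:
    """Strip accents via NFKD normalization, then lowercase the result."""
    normalized = unicodedata.normalize("NFKD", doc)
    # Remove combining marks left over from decomposed accented characters.
    stripped = "".join(c for c in normalized if not unicodedata.combining(c))
    # Lowercase last, so letters produced by normalization (e.g. "TM") are covered.
    return stripped.lower()


print(strip_then_lower("This is Problematic™."))  # this is problematictm.
```

Such a function could be passed as CountVectorizer(preprocessor=strip_then_lower). As far as I understand the documentation, a custom preprocessor overrides the built-in strip_accents and lowercase stage, so those two options would no longer apply to the documents.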

Versions

/Users/fps/_fps/DeveloperTools/virtualenvs/fps_env/lib/python3.9/site-packages/setuptools/distutils_patch.py:25: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first.
warnings.warn(

System:
python: 3.9.2 (v3.9.2:1a79785e3e, Feb 19 2021, 09:06:10) [Clang 6.0 (clang-600.0.57)]
executable: /Users/fps/_fps/DeveloperTools/virtualenvs/fps_env/bin/python3
machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
pip: 21.2.4
setuptools: 49.2.1
sklearn: 1.0
numpy: 1.21.2
scipy: 1.7.1
Cython: None
pandas: 1.3.3
matplotlib: None
joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True
