Description
Describe the bug
Since version 1.0, calling CountVectorizer.transform() is more than 100 times slower than in previous versions. I did some basic profiling, and I think the slowdown is related to the changes introduced in #19401.
Benchmarks from my machine (see example code):
- 0.24.2: 32.9 ms ± 352 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
- 1.0: 6.19 s ± 175 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I understand the value of checking the vocabulary for non-lowercase entries, but it would be great if, as a user, I could opt out of this check so that it does not impact performance. We have some models in production that make predictions record by record (batching is not possible), and this issue is currently blocking us from upgrading to 1.0.
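To illustrate why a check like this is costly for record-by-record prediction, here is a minimal pure-Python sketch. It does not reproduce scikit-learn's internals: `has_uppercase_entry`, `count_terms`, `transform_with_check`, and `CachedCheck` are hypothetical names, and it assumes the check scans every vocabulary entry on each call, which is what the profiling seems to suggest.

```python
import timeit

# Hypothetical stand-in for a fitted vocabulary (term -> column index).
vocabulary = {f"token{i}": i for i in range(100_000)}


def has_uppercase_entry(vocab):
    """Scan every term; the cost grows linearly with the vocabulary size."""
    return any(term != term.lower() for term in vocab)


def count_terms(vocab, text):
    """Toy 'transform': count known terms in a whitespace-tokenized text."""
    counts = {}
    for token in text.lower().split():
        if token in vocab:
            counts[token] = counts.get(token, 0) + 1
    return counts


def transform_with_check(vocab, text):
    """Assumed slow path: re-validate the whole vocabulary on every call."""
    has_uppercase_entry(vocab)
    return count_terms(vocab, text)


class CachedCheck:
    """Validate once, then reuse the result for every later call."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.has_uppercase = has_uppercase_entry(vocab)  # paid only once

    def transform(self, text):
        return count_terms(self.vocab, text)


text = "token1 token2 token1 unknown"
cached = CachedCheck(vocabulary)

slow = timeit.timeit(lambda: transform_with_check(vocabulary, text), number=20)
fast = timeit.timeit(lambda: cached.transform(text), number=20)
print(f"check on every call: {slow:.4f} s, check cached: {fast:.4f} s")
```

With one short text per call, the per-call scan dominates the runtime, so running the check once (or letting the user skip it) would remove most of the overhead in this toy setup.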
Steps/Code to Reproduce
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups(subset='train')
vec = CountVectorizer(lowercase=True).fit(data.data)
%timeit [vec.transform([text]) for text in data.data[:100]]
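The snippet above relies on the IPython %timeit magic and on downloading the 20 newsgroups data. For anyone who wants to run a comparable measurement as a plain script, here is a rough sketch using the standard-library timeit module. The synthetic corpus and the helper name `transform_one_by_one` are assumptions made for illustration; they are not part of the original benchmark.

```python
import timeit

from sklearn.feature_extraction.text import CountVectorizer

# Small synthetic corpus (an assumption for illustration; the report itself
# uses fetch_20newsgroups, which requires a download).
docs = [f"Document number {i} talks about topic {i % 7}" for i in range(200)]

vec = CountVectorizer(lowercase=True).fit(docs)


def transform_one_by_one(texts):
    """Mimic record-by-record prediction: one transform() call per text."""
    return [vec.transform([text]) for text in texts]


# Take the best of 3 repeats, each running the loop 5 times.
runs = 5
best = min(
    timeit.repeat(lambda: transform_one_by_one(docs[:100]), number=runs, repeat=3)
)
print(f"{best / runs * 1e3:.2f} ms per loop over 100 single-record transforms")
```

The vocabulary of this synthetic corpus is tiny, so the timings will not match the numbers reported above; a corpus with a large vocabulary is needed to see a comparable gap between versions.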
Expected Results
Performance of CountVectorizer.transform() is (roughly) similar between 0.24.2 and 1.0.
Actual Results
CountVectorizer.transform() is more than 100 times slower in 1.0 (6.2 s vs. 33 ms).
Versions
Old:
System:
python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
executable: /home/sobayed/miniconda3/envs/py39/bin/python
machine: Linux-4.15.0-135-generic-x86_64-with-glibc2.27
Python dependencies:
pip: 21.2.4
setuptools: 58.0.4
sklearn: 0.24.2
numpy: 1.20.3
scipy: 1.7.1
Cython: None
pandas: None
matplotlib: None
joblib: 1.0.1
threadpoolctl: 2.2.0
Built with OpenMP: True
New:
System:
python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
executable: /home/sobayed/miniconda3/envs/temp/bin/python
machine: Linux-4.15.0-135-generic-x86_64-with-glibc2.27
Python dependencies:
pip: 21.2.4
setuptools: 58.0.4
sklearn: 1.0
numpy: 1.21.1
scipy: 1.7.1
Cython: 0.29.14
pandas: 1.3.3
matplotlib: 3.4.2
joblib: 1.0.1
threadpoolctl: 2.2.0
Built with OpenMP: True