CountVectorizer.transform() much slower since version 1.0 #21242

sobayed · 2021-10-05T09:24:54Z

Describe the bug

Since version 1.0, calling CountVectorizer.transform() is more than 100 times slower than in previous versions. I did some basic profiling and I think it is related to the changes introduced in #19401.

Benchmarks from my machine (see example code):

0.24.2: 32.9 ms ± 352 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.0: 6.19 s ± 175 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I understand the value of checking the vocabulary for non-lowercase entries, but it would be great if, as a user, I could somehow opt out of this check to avoid the performance hit. We have some models in production making predictions record by record (batching is not possible), and this issue is unfortunately blocking us from upgrading to 1.0 at the moment.
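To illustrate why a vocabulary check inside transform() hurts record-by-record prediction in particular, here is a minimal sketch. It is not scikit-learn's code: ToyVectorizer, has_uppercase_terms, and the check_on_transform flag are hypothetical names made up for illustration. The point is only that a scan over the whole vocabulary costs time proportional to the vocabulary size on every call, however short the input document is.

```python
from collections import Counter


def has_uppercase_terms(vocabulary):
    """Scan every term; cost grows with the vocabulary, not with the input text."""
    return any(term != term.lower() for term in vocabulary)


class ToyVectorizer:
    """Hypothetical stand-in for a count vectorizer (not scikit-learn's implementation)."""

    def __init__(self, check_on_transform):
        # Hypothetical flag: True repeats the vocabulary scan on every transform().
        self.check_on_transform = check_on_transform
        self.vocabulary_ = {}
        self.checks_run = 0

    def _check_vocabulary(self):
        self.checks_run += 1
        return has_uppercase_terms(self.vocabulary_)

    def fit(self, docs):
        terms = sorted({tok for doc in docs for tok in doc.lower().split()})
        self.vocabulary_ = {term: i for i, term in enumerate(terms)}
        self._check_vocabulary()  # one scan at fit time
        return self

    def transform(self, doc):
        if self.check_on_transform:
            self._check_vocabulary()  # one extra full scan per call
        counts = Counter(
            tok for tok in doc.lower().split() if tok in self.vocabulary_
        )
        return {self.vocabulary_[tok]: n for tok, n in counts.items()}
```

With a large vocabulary and one document per call, the per-call scan can easily outweigh the actual counting work, while both variants return the same counts.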

Steps/Code to Reproduce

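from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups

# Fit once on the full 20 newsgroups training set (downloaded on first use).
data = fetch_20newsgroups(subset='train')
vec = CountVectorizer(lowercase=True).fit(data.data)

# Transform 100 documents one at a time, mimicking record-by-record prediction.
# %timeit is an IPython/Jupyter magic, so run this in IPython or a notebook.
%timeit [vec.transform([text]) for text in data.data[:100]]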

Expected Results

Performance of CountVectorizer.transform() (roughly) similar between 0.24.2 and 1.0.

Actual Results

CountVectorizer.transform() is more than 100 times slower in 1.0 (6.2 s vs. 33 ms).
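For reference, the reported figures work out as follows. This is plain arithmetic on the two %timeit results quoted in the bug description, where each loop makes 100 single-document transform() calls; the variable names are only for illustration.

```python
# Mean time per %timeit loop, taken from the benchmark results in this report.
old_loop_s = 32.9e-3   # sklearn 0.24.2: 32.9 ms per loop
new_loop_s = 6.19      # sklearn 1.0:    6.19 s per loop
calls_per_loop = 100   # each loop transforms data.data[:100] one document at a time

slowdown = new_loop_s / old_loop_s
old_per_call_ms = old_loop_s / calls_per_loop * 1e3
new_per_call_ms = new_loop_s / calls_per_loop * 1e3

print(f"slowdown: {slowdown:.0f}x")
print(f"per call: {old_per_call_ms:.2f} ms -> {new_per_call_ms:.1f} ms")
```

So the regression is closer to 190x than 100x on this machine, and a single-record transform goes from roughly a third of a millisecond to about 62 ms.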

Versions

Old:
System:
python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
executable: /home/sobayed/miniconda3/envs/py39/bin/python
machine: Linux-4.15.0-135-generic-x86_64-with-glibc2.27

Python dependencies:
pip: 21.2.4
setuptools: 58.0.4
sklearn: 0.24.2
numpy: 1.20.3
scipy: 1.7.1
Cython: None
pandas: None
matplotlib: None
joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True

New:
System:
python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
executable: /home/sobayed/miniconda3/envs/temp/bin/python
machine: Linux-4.15.0-135-generic-x86_64-with-glibc2.27

Python dependencies:
pip: 21.2.4
setuptools: 58.0.4
sklearn: 1.0
numpy: 1.21.1
scipy: 1.7.1
Cython: 0.29.14
pandas: 1.3.3
matplotlib: 3.4.2
joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True


jeremiedbb · 2021-10-06T09:54:36Z

take

sobayed added the Bug: triage label Oct 5, 2021

jeremiedbb added Performance Regression labels Oct 6, 2021

github-actions bot assigned jeremiedbb Oct 6, 2021

jeremiedbb mentioned this issue Oct 6, 2021

FIX CountVectorizer: check upper case in vocab only in fit #21251

Merged

glemaitre closed this as completed in #21251 Oct 20, 2021
