CountVectorizer.transform() much slower since version 1.0 #21242

@sobayed

Describe the bug

Since version 1.0, calling CountVectorizer.transform() is more than 100 times slower than in previous versions. I did some basic profiling, and I think the slowdown is related to the changes introduced in #19401.

Benchmarks from my machine (see example code):

  • 0.24.2: 32.9 ms ± 352 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • 1.0: 6.19 s ± 175 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
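For reference, a quick calculation from the two timings above. Each timed loop runs 100 transform() calls (see the example code), so the figures can be broken down per call. The variable names here are only for illustration; the numbers are the ones reported above, and the ratio comes out at roughly 188x.

```python
# Timings reported above; each loop performs 100 single-document transform() calls.
calls_per_loop = 100
old_loop_s = 32.9e-3   # 0.24.2: 32.9 ms per loop
new_loop_s = 6.19      # 1.0:    6.19 s per loop

# Ratio of the two loop times.
slowdown = new_loop_s / old_loop_s

# Average cost of one transform() call, in milliseconds.
old_per_call_ms = old_loop_s / calls_per_loop * 1e3
new_per_call_ms = new_loop_s / calls_per_loop * 1e3

print(f"slowdown: {slowdown:.0f}x")
print(f"per call: {old_per_call_ms:.3f} ms (0.24.2) vs {new_per_call_ms:.1f} ms (1.0)")
```

For record-by-record prediction, the per-call figure is the one that matters.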

I understand the value of checking the vocabulary for non-lowercase entries, but it would be great if, as a user, I could opt out of this check so that it does not affect performance. We have some models in production that make predictions record by record (batching is not possible), and this issue is unfortunately blocking us from upgrading to 1.0 at the moment.
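To illustrate why such a check would hurt record-by-record use in particular, here is a minimal pure-Python sketch. It is not scikit-learn's code: `has_uppercase`, `count_tokens`, `transform_checked` and `transform_unchecked` are hypothetical names, the 100,000-entry vocabulary is made up, and it assumes (as my profiling suggests, but I have not confirmed) that the check walks the whole vocabulary on every call.

```python
import timeit

# Hypothetical stand-in for a fitted vocabulary: 100,000 lowercase tokens.
vocabulary = {f"token{i}": i for i in range(100_000)}


def has_uppercase(vocab):
    """Scan every vocabulary entry; cost is linear in the vocabulary size."""
    return any(term != term.lower() for term in vocab)


def count_tokens(text, vocab):
    """Toy transform: count the in-vocabulary tokens of a single document."""
    counts = {}
    for tok in text.lower().split():
        idx = vocab.get(tok)
        if idx is not None:
            counts[idx] = counts.get(idx, 0) + 1
    return counts


def transform_checked(text, vocab):
    # Assumed behaviour: the vocabulary is validated again on every call.
    has_uppercase(vocab)
    return count_tokens(text, vocab)


def transform_unchecked(text, vocab):
    # Same work without the per-call vocabulary scan.
    return count_tokens(text, vocab)


doc = "token1 token2 token2 unknown"
t_checked = timeit.timeit(lambda: transform_checked(doc, vocabulary), number=20)
t_unchecked = timeit.timeit(lambda: transform_unchecked(doc, vocabulary), number=20)
print(f"checked: {t_checked:.4f} s, unchecked: {t_unchecked:.6f} s (20 calls each)")
```

With one short document per call, the full vocabulary scan dominates the runtime, whereas for a large batch the same scan would be amortised over many documents. Running the check once (for example at fit time) or making it optional would avoid the per-call cost.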

Steps/Code to Reproduce

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups

# Fit once on the 20 newsgroups training set.
data = fetch_20newsgroups(subset='train')
vec = CountVectorizer(lowercase=True).fit(data.data)

# IPython/Jupyter magic: time 100 single-document transform() calls.
%timeit [vec.transform([text]) for text in data.data[:100]]

Expected Results

Performance of CountVectorizer.transform() (roughly) similar between 0.24.2 and 1.0.

Actual Results

CountVectorizer.transform() is more than 100 times slower in 1.0 (6.2 s vs. 33 ms).

Versions

Old:
System:
python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
executable: /home/sobayed/miniconda3/envs/py39/bin/python
machine: Linux-4.15.0-135-generic-x86_64-with-glibc2.27

Python dependencies:
pip: 21.2.4
setuptools: 58.0.4
sklearn: 0.24.2
numpy: 1.20.3
scipy: 1.7.1
Cython: None
pandas: None
matplotlib: None
joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True

New:
System:
python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
executable: /home/sobayed/miniconda3/envs/temp/bin/python
machine: Linux-4.15.0-135-generic-x86_64-with-glibc2.27

Python dependencies:
pip: 21.2.4
setuptools: 58.0.4
sklearn: 1.0
numpy: 1.21.1
scipy: 1.7.1
Cython: 0.29.14
pandas: 1.3.3
matplotlib: 3.4.2
joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True
