CountVectorizer.transform() much slower since version 1.0 #21242

Closed
sobayed opened this issue Oct 5, 2021 · 1 comment · Fixed by #21251
sobayed commented Oct 5, 2021

Describe the bug

Since version 1.0, calling CountVectorizer.transform() is more than 100 times slower than in previous versions. I did some basic profiling, and I think it is related to the changes introduced in #19401.

Benchmarks from my machine (see example code):

  • 0.24.2: 32.9 ms ± 352 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • 1.0: 6.19 s ± 175 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I understand the value of checking the vocabulary for non-lowercase entries, but it would be great if, as a user, I could opt out of this check so that it does not impact performance. We have some models in production that make predictions record by record (batching is not possible), and this issue is unfortunately blocking us from upgrading to 1.0 at the moment.
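To illustrate why a vocabulary check hurts record-by-record prediction in particular, here is a minimal sketch in plain Python. It is not scikit-learn's implementation: `TinyVectorizer`, `check_every_call`, and `checks_run` are hypothetical names made up for this example. It only assumes that the check scans the whole vocabulary, so its cost grows with vocabulary size rather than with the length of the input text.

```python
import warnings


class TinyVectorizer:
    """Toy stand-in for a fitted vectorizer (hypothetical, not scikit-learn code)."""

    def __init__(self, vocabulary, lowercase=True, check_every_call=True):
        self.vocabulary = dict(vocabulary)
        self.lowercase = lowercase
        self.check_every_call = check_every_call
        self.checks_run = 0  # number of full vocabulary scans so far
        self._vocabulary_checked = False

    def _check_vocabulary_case(self):
        # Scans every term, so the cost is O(len(vocabulary)) per invocation.
        self.checks_run += 1
        if self.lowercase and any(t != t.lower() for t in self.vocabulary):
            warnings.warn("Vocabulary has upper-case terms but lowercase=True.")

    def transform(self, text):
        # Either repeat the scan on each call, or run it once and remember it.
        if self.check_every_call or not self._vocabulary_checked:
            self._check_vocabulary_case()
            self._vocabulary_checked = True
        tokens = text.lower().split() if self.lowercase else text.split()
        counts = {}
        for token in tokens:
            index = self.vocabulary.get(token)
            if index is not None:
                counts[index] = counts.get(index, 0) + 1
        return counts


vocab = {f"word{i}": i for i in range(100_000)}
slow = TinyVectorizer(vocab, check_every_call=True)
fast = TinyVectorizer(vocab, check_every_call=False)
for _ in range(5):
    slow.transform("word1 word2 word1")
    fast.transform("word1 word2 word1")
print(slow.checks_run, fast.checks_run)  # 5 1
```

With one record per call, the repeated scan dominates the actual counting work. Running the check once per fitted object (or letting the user disable it) would keep the warning without the per-call cost.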

Steps/Code to Reproduce

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups

# Fit once on the full training set.
data = fetch_20newsgroups(subset='train')
vec = CountVectorizer(lowercase=True).fit(data.data)

# Transform record by record; %timeit is an IPython/Jupyter magic.
%timeit [vec.transform([text]) for text in data.data[:100]]
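Outside IPython, where the `%timeit` magic is unavailable, a similar measurement can be sketched with the standard-library `timeit` module. The helper `time_per_record` and the stand-in `dummy_transform` are hypothetical names used only for illustration. Note that this sketch reports the best of several runs, whereas `%timeit` reports a mean and standard deviation.

```python
import timeit


def time_per_record(transform, texts, repeat=7, number=1):
    """Best wall-clock time in seconds for one pass that transforms texts one by one."""
    def run():
        return [transform([text]) for text in texts]

    # timeit.repeat returns one total per repetition; divide by number for a single pass.
    return min(timeit.repeat(run, repeat=repeat, number=number)) / number


def dummy_transform(batch):
    # Stand-in so the sketch runs without scikit-learn or a dataset download.
    return [len(text.split()) for text in batch]


elapsed = time_per_record(dummy_transform, ["a b c", "d e"] * 50)
print(f"{elapsed * 1e3:.3f} ms per pass over 100 records")
```

With the fitted `vec` from the snippet above, the equivalent call would be `time_per_record(vec.transform, data.data[:100])`.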

Expected Results

Performance of CountVectorizer.transform() (roughly) similar between 0.24.2 and 1.0.

Actual Results

CountVectorizer.transform() is more than 100 times slower in 1.0 (6.2 s vs. 33 ms).

Versions

Old:
System:
python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
executable: /home/sobayed/miniconda3/envs/py39/bin/python
machine: Linux-4.15.0-135-generic-x86_64-with-glibc2.27

Python dependencies:
pip: 21.2.4
setuptools: 58.0.4
sklearn: 0.24.2
numpy: 1.20.3
scipy: 1.7.1
Cython: None
pandas: None
matplotlib: None
joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True

New:
System:
python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
executable: /home/sobayed/miniconda3/envs/temp/bin/python
machine: Linux-4.15.0-135-generic-x86_64-with-glibc2.27

Python dependencies:
pip: 21.2.4
setuptools: 58.0.4
sklearn: 1.0
numpy: 1.21.1
scipy: 1.7.1
Cython: 0.29.14
pandas: 1.3.3
matplotlib: 3.4.2
joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True
