Describe the bug
Since version 1.0, calling CountVectorizer.transform() is more than 100 times slower than in previous versions. I did some basic profiling and I think it is related to the changes introduced in #19401.
Benchmarks from my machine (see example code):
0.24.2: 32.9 ms ± 352 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.0: 6.19 s ± 175 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I understand the value of checking the vocabulary for non-lowercase entries, but it would be great if, as a user, I could somehow opt out of this check so that it does not impact performance. We have some models in production making predictions record by record (batching is not possible), and this issue is unfortunately blocking us from upgrading to 1.0 at the moment.
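For anyone who wants to repeat the profiling step, here is a minimal sketch using the standard library's cProfile. The vocabulary size and the uuid-based tokens are assumptions made for illustration, not the production setup described above, and the functions that dominate the report will depend on the installed scikit-learn version.

```python
import cProfile
import io
import pstats
import uuid

from sklearn.feature_extraction.text import CountVectorizer

# Assumption: a fixed vocabulary of unique, lowercase hex tokens stands in
# for a real production vocabulary.
vocabulary = [uuid.uuid4().hex for _ in range(50_000)]
vectorizer = CountVectorizer(vocabulary=vocabulary)

# Profile a single record-by-record call to transform().
profiler = cProfile.Profile()
profiler.enable()
vectorizer.transform(["one record at a time"])
profiler.disable()

# Collect the ten entries with the largest cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the top-level transform() call first and the helpers it spends its time in directly below it, which makes it easier to see whether vocabulary validation accounts for the difference between versions.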
Steps/Code to Reproduce
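A minimal sketch of a benchmark that exercises the same call, not necessarily the exact script behind the timings above. The vocabulary size, the uuid-based tokens, and the helper name transform_one_record are assumptions chosen for illustration.

```python
import timeit
import uuid

from sklearn.feature_extraction.text import CountVectorizer

# Assumption: a large fixed vocabulary of unique, lowercase hex tokens.
N_TERMS = 100_000
vocabulary = [uuid.uuid4().hex for _ in range(N_TERMS)]

# The vocabulary is passed to the constructor, so no fit() call is needed.
vectorizer = CountVectorizer(vocabulary=vocabulary)


def transform_one_record():
    # Mimics record-by-record prediction: one short document per call.
    return vectorizer.transform(["a single short record"])


n_calls = 10
seconds = timeit.timeit(transform_one_record, number=n_calls)
print(f"{seconds / n_calls * 1e3:.2f} ms per transform() call")
```

Running the same script in two environments, one with scikit-learn 0.24.2 and one with 1.0, should make the per-call difference visible. Absolute numbers will vary with the machine and the vocabulary size.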
Expected Results
Performance of CountVectorizer.transform() is (roughly) similar between 0.24.2 and 1.0.
Actual Results
CountVectorizer.transform() is more than 100 times slower in 1.0 (6.2 s vs. 33 ms).
Versions
Old:
System:
python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
executable: /home/sobayed/miniconda3/envs/py39/bin/python
machine: Linux-4.15.0-135-generic-x86_64-with-glibc2.27
Python dependencies:
pip: 21.2.4
setuptools: 58.0.4
sklearn: 0.24.2
numpy: 1.20.3
scipy: 1.7.1
Cython: None
pandas: None
matplotlib: None
joblib: 1.0.1
threadpoolctl: 2.2.0
Built with OpenMP: True
New:
System:
python: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
executable: /home/sobayed/miniconda3/envs/temp/bin/python
machine: Linux-4.15.0-135-generic-x86_64-with-glibc2.27
Python dependencies:
pip: 21.2.4
setuptools: 58.0.4
sklearn: 1.0
numpy: 1.21.1
scipy: 1.7.1
Cython: 0.29.14
pandas: 1.3.3
matplotlib: 3.4.2
joblib: 1.0.1
threadpoolctl: 2.2.0
Built with OpenMP: True