[MRG+1] Add text vectorizers benchmarks #9086


Merged

merged 3 commits into scikit-learn:master from rth:text-benchmark on Jun 28, 2017

Conversation

@rth rth (Member) commented Jun 9, 2017

In text processing, vectorization usually has large run times compared to the .fit and .predict times of models further down the processing pipeline (cf. the last two figures in this example). As a result, new PRs are recurrently submitted aiming to improve the run time performance of text vectorization (e.g. #7567, #7272, #7470).

This PR adds standard run time and memory benchmarks for CountVectorizer, TfidfVectorizer and HashingVectorizer over a combination of common values for ngram_range and analyzer, so that such benchmarks don't have to be implemented anew for each new PR. cc @lesteve
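
For reference, a minimal sketch of the kind of measurement loop such a script performs (the run_vectorizer closure and the memory_profiler usage here are illustrative, not the exact code of this PR):

import itertools
import timeit

import numpy as np
from memory_profiler import memory_usage
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import (CountVectorizer,
                                             HashingVectorizer,
                                             TfidfVectorizer)

text = fetch_20newsgroups(subset='train').data


def run_vectorizer(Vectorizer, text, **params):
    # return a closure so the same callable can be passed both to
    # timeit.repeat and to memory_usage
    def f():
        Vectorizer(**params).fit_transform(text)
    return f


for Vectorizer, (analyzer, ngram_range) in itertools.product(
        [CountVectorizer, TfidfVectorizer, HashingVectorizer],
        [('word', (1, 1)), ('word', (1, 2)), ('char', (4, 4))]):
    params = {'analyzer': analyzer, 'ngram_range': ngram_range}
    f = run_vectorizer(Vectorizer, text, **params)
    dt = timeit.repeat(f, number=1, repeat=3)    # run times over 3 runs
    mem = np.max(memory_usage((f, (), {})))      # peak memory in MB
    print(Vectorizer.__name__, params,
          'time: {:.2f} (+-{:.2f}) s'.format(np.mean(dt), np.std(dt)),
          'peak mem: {:.1f} MB'.format(mem))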

The output of this benchmark can be found below:

$ python benchmarks/bench_text_vectorizers.py
================================================================================
#    Text vectorizers benchmark
================================================================================

Using a subset of the 20 newsgroups dataset (11314 documents).
This benchmark runs in ~20 min ...

========== Run time performance (sec) ===========

vectorizer           CountVectorizer HashingVectorizer TfidfVectorizer
analyzer ngram_range
char     (4, 4)        19.18 (±0.07)     18.40 (±0.02)   20.94 (±0.03)
char_wb  (4, 4)        16.87 (±0.06)     16.95 (±0.10)   17.73 (±0.03)
word     (1, 1)         3.23 (±0.02)      3.63 (±0.02)    3.29 (±0.02)
         (1, 2)        12.99 (±0.02)      8.13 (±0.00)   14.10 (±0.02)
         (1, 4)        41.93 (±0.02)     15.18 (±0.06)   45.08 (±0.11)

=============== Memory usage (MB) ===============

vectorizer           CountVectorizer HashingVectorizer TfidfVectorizer
analyzer ngram_range
char     (4, 4)                785.5             611.1           821.8
char_wb  (4, 4)                570.5             380.7           593.9
word     (1, 1)                205.9             205.4           246.9
         (1, 2)                840.9             238.0           828.5
         (1, 4)               3286.7             370.5          3329.5

@lesteve lesteve (Member) commented Jun 9, 2017

LGTM, thanks a lot!

@lesteve lesteve changed the title [MRG] Add text vectorizers benchmarks [MRG+1] Add text vectorizers benchmarks Jun 9, 2017
@rth rth (Member Author) commented Jun 9, 2017

@TomDLT Would you be able to review this PR? Thanks.


res = []

# Wrap the result of itertools.product with tqdm to get a progress bar
Member

is this comment supposed to be a tip? I'd remove it

Member Author

That was intended more as a justification for why itertools.product was used there, and to prevent somebody from refactoring it into a double for loop, which would be, after all, simpler and more readable. Since the run time is quite long (~20 min) I felt it was useful to have some indication of the progress, even though I didn't want to add tqdm as a dependency (or write a manual progress indicator with print).
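
For illustration, a minimal sketch of that pattern with tqdm kept optional (the no-op fallback and the sleep stand-in are made up for the example):

import itertools
import time

try:
    from tqdm import tqdm  # optional: draws a progress bar if installed
except ImportError:
    def tqdm(iterable):  # no-op fallback so the script runs without tqdm
        return iterable

grid = list(itertools.product(['word', 'char'], [(1, 1), (1, 2)]))
for analyzer, ngram_range in tqdm(grid):
    time.sleep(0.1)  # stand-in for the actual benchmark work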

Member

I didn't get that, and I didn't question the use of product at all. I think it's idiomatic to flatten deep loops this way in Python.


bench = {'vectorizer': Vectorizer.__name__}
params = {'analyzer': analyzer, 'ngram_range': ngram_range}
bench.update(params)
Member

What am I missing? Why not simply

bench = {
    'vectorizer': ...,
    'analyzer': ...,
    'ngram_range': ...,
}

Member Author

bench holds the results, params the input parameters (so I can do e.g. run_vectorizer(Vectorizer, text, **params)); I couldn't find a way to use a single dictionary for both here...
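
For illustration, a small self-contained example of that split (the two-document corpus here is made up):

from sklearn.feature_extraction.text import CountVectorizer

text = ['a minimal corpus', 'of two documents']

# params holds only keyword arguments valid for the vectorizer,
# so it can be unpacked directly into the constructor call ...
params = {'analyzer': 'word', 'ngram_range': (1, 2)}
X = CountVectorizer(**params).fit_transform(text)

# ... while bench mixes input parameters with measured results,
# so it could not be passed on as **params.
bench = {'vectorizer': CountVectorizer.__name__}
bench.update(params)
bench['time'] = None  # filled in once the timings are measured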

Member

you're right

# run_vectorizer returns a callable, so its result can be passed
# directly to timeit.repeat
dt = timeit.repeat(run_vectorizer(Vectorizer, text, **params),
                   number=1,
                   repeat=3)
bench['time'] = "{:.2f} (±{:.2f})".format(np.mean(dt), np.std(dt))
Member

In Python 2 you need to add an encoding declaration at the top of the file for non-ASCII characters such as the one used here.

I find it very strange to report the mean and std of three points. Is this too slow to run at least 10 times? If so, it might be better to show the median and maybe range, not sure...

@ogrisel what would you report?

Member Author

Thanks, I forgot about this. Added an encoding declaration at the top.
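
For reference, an encoding declaration of this form (PEP 263; the exact spelling can vary) goes on the first or second line of the file:

# -*- coding: utf-8 -*-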

I agree that computing it on 3 points is not ideal, but these benchmarks already take quite some time (~20 min), and I felt that the current 10k document dataset is quite representative of a real-world dataset (vocabulary-size-wise etc.), so I wasn't sure about reducing its size to make the computations faster. Besides, in this benchmark the std is ~1% of the run time, so even if it's off by a factor of 2 or so, it doesn't really matter for any optimization that gains more than 5% performance. It was about giving an idea of the order of magnitude of the standard deviation, to say that run-to-run variations are almost negligible.

If so, it might be better to show the median and maybe range, not sure...

Well, the thing is that when running these benchmarks on multiple configurations the tables are already hard to parse as it is (e.g. see this example); adding a range would make them even less readable, I think. About the median, you might be right, but the mean is more consistent with the regular %timeit typically used in benchmarks and should be more stable run to run.

What do you think? Also thanks for the review! Removed the comment about itertools.product, you are right.

Member

Yeah, I see what you mean. Maybe just change the printout to "mean %f std %f over 3 runs" so that nobody is misled by the script (in which case no need for the # encoding either :P).

But it's a nitpick. Other than that this is great, thanks!

Member Author

OK, thanks. I added an additional printed sentence to specify that the statistics are computed over 3 runs, and replaced \pm with +- to remove the need for the unicode encoding header (trying to keep as little repeated text as possible in the table cells).

@rth rth (Member Author) commented Jun 11, 2017

@lesteve Hmm, flake8 is misbehaving: it raises an error on print("="*80 + ...) in this run.

Style-wise this should be a correct statement (cf. the discussion here), and besides I can't reproduce any style errors in this script with the latest flake8 (version 3.3.0), nor with the one in Travis CI, installed with

 conda create -n flake8-env flake8=2.5.1 python=3.4

Not sure why it's unhappy in Travis...
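
For reference, reproducing the check locally against that environment would look something like this:

 source activate flake8-env
 flake8 benchmarks/bench_text_vectorizers.py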

@vene vene (Member) commented Jun 11, 2017

Our convention is to use spaces around arithmetic operators as far as I know...

@rth rth (Member Author) commented Jun 11, 2017

Our convention is to use spaces around arithmetic operators as far as I know...

I thought that too, but according to PEP 8 all of the following lines are valid (see the 3rd point of the "Other recommendations" section),

x = 2
x*x + x*x

while

x * x + x * x

is not the recommended form.

And flake8 doesn't raise any errors for any of these 3 lines, so the Travis CI flake8 job shouldn't fail...

Edit: edited the comment; the previous version of it was incorrect.

@rth rth force-pushed the text-benchmark branch from 8dfea89 to ce66d1e Compare June 14, 2017 13:37
@rth rth (Member Author) commented Jun 14, 2017

Rebased on master; after #9123 all tests appear to pass. @vene Are there any other comments I should address? Thanks again for the review!

@lesteve lesteve (Member) commented Jun 28, 2017

Merging, thanks!

@lesteve lesteve merged commit edabec2 into scikit-learn:master Jun 28, 2017
@rth rth deleted the text-benchmark branch June 28, 2017 17:26