[MRG+1] Add text vectorizers benchmarks #9086


Merged

merged 3 commits into scikit-learn:master from rth:text-benchmark on Jun 28, 2017

Conversation

@rth rth (Member) commented Jun 9, 2017

In text processing, vectorization usually has large run times compared to the .fit and .predict times of models further down the processing pipeline (cf. the last two figures in this example). As a result, new PRs are recurrently submitted aiming to improve the run time performance of text vectorization (e.g. #7567, #7272, #7470).

This PR adds standard run time and memory benchmarks for CountVectorizer, TfidfVectorizer and HashingVectorizer over a combination of common values for ngram_range and analyzer, so that such benchmarks don't have to be implemented anew for each new PR. cc @lesteve
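
For reference, a minimal sketch of the kind of measurement loop such a script performs (the run_vectorizer closure and the memory_profiler usage here are illustrative, not the exact code of this PR):

import itertools
import timeit

import numpy as np
from memory_profiler import memory_usage
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import (CountVectorizer,
                                             HashingVectorizer,
                                             TfidfVectorizer)

text = fetch_20newsgroups(subset='train').data


def run_vectorizer(Vectorizer, text, **params):
    # return a closure so the same callable can be passed both to
    # timeit.repeat and to memory_usage
    def f():
        Vectorizer(**params).fit_transform(text)
    return f


for Vectorizer, (analyzer, ngram_range) in itertools.product(
        [CountVectorizer, TfidfVectorizer, HashingVectorizer],
        [('word', (1, 1)), ('word', (1, 2)), ('char', (4, 4))]):
    params = {'analyzer': analyzer, 'ngram_range': ngram_range}
    f = run_vectorizer(Vectorizer, text, **params)
    dt = timeit.repeat(f, number=1, repeat=3)    # run times over 3 runs
    mem = np.max(memory_usage((f, (), {})))      # peak memory in MB
    print(Vectorizer.__name__, params,
          'time: {:.2f} (+-{:.2f}) s'.format(np.mean(dt), np.std(dt)),
          'peak mem: {:.1f} MB'.format(mem))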

The output of this benchmark can be found below:

$ python benchmarks/bench_text_vectorizers.py
================================================================================
#    Text vectorizers benchmark
================================================================================

Using a subset of the 20 newsgroups dataset (11314 documents).
This benchmark runs in ~20 min ...

========== Run time performance (sec) ===========

vectorizer           CountVectorizer HashingVectorizer TfidfVectorizer
analyzer ngram_range
char     (4, 4)        19.18 (±0.07)     18.40 (±0.02)   20.94 (±0.03)
char_wb  (4, 4)        16.87 (±0.06)     16.95 (±0.10)   17.73 (±0.03)
word     (1, 1)         3.23 (±0.02)      3.63 (±0.02)    3.29 (±0.02)
         (1, 2)        12.99 (±0.02)      8.13 (±0.00)   14.10 (±0.02)
         (1, 4)        41.93 (±0.02)     15.18 (±0.06)   45.08 (±0.11)

=============== Memory usage (MB) ===============

vectorizer           CountVectorizer HashingVectorizer TfidfVectorizer
analyzer ngram_range
char     (4, 4)                785.5             611.1           821.8
char_wb  (4, 4)                570.5             380.7           593.9
word     (1, 1)                205.9             205.4           246.9
         (1, 2)                840.9             238.0           828.5
         (1, 4)               3286.7             370.5          3329.5

@lesteve lesteve (Member) commented Jun 9, 2017

LGTM, thanks a lot!

@lesteve lesteve changed the title [MRG] Add text vectorizers benchmarks [MRG+1] Add text vectorizers benchmarks Jun 9, 2017
@rth rth (Member Author) commented Jun 9, 2017

@TomDLT Would you be able to review this PR? Thanks.


res = []

# Wrap the result of itertools.product with tqdm to get a progress bar
Member

is this comment supposed to be a tip? I'd remove it

Member Author

That was intended more as a justification for why itertools.product was used there, and to prevent somebody from refactoring it into a double for loop, which would be, after all, simpler and more readable. Since the run time is quite long (~20 min) I felt it was useful to have some indication of the progress, even though I didn't want to add tqdm as a dependency (or write a manual progress indicator with print).
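
For illustration, a minimal sketch of that pattern with tqdm kept optional (the no-op fallback and the sleep stand-in are made up for the example):

import itertools
import time

try:
    from tqdm import tqdm  # optional: draws a progress bar if installed
except ImportError:
    def tqdm(iterable):  # no-op fallback so the script runs without tqdm
        return iterable

grid = list(itertools.product(['word', 'char'], [(1, 1), (1, 2)]))
for analyzer, ngram_range in tqdm(grid):
    time.sleep(0.1)  # stand-in for the actual benchmark work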

Member

I didn't get that, and I didn't question the use of product at all. I think it's idiomatic to flatten deep loops this way in Python.


bench = {'vectorizer': Vectorizer.__name__}
params = {'analyzer': analyzer, 'ngram_range': ngram_range}
bench.update(params)
Member

What am I missing? Why not simply

bench = {
    'vectorizer': ...,
    'analyzer': ...,
    'ngram_range': ...,
}

Member Author

bench holds the results, params the input parameters (so I can do e.g. run_vectorizer(Vectorizer, text, **params)); I couldn't find a way to use a single dictionary for both here...
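
For illustration, a small self-contained example of that split (the two-document corpus here is made up):

from sklearn.feature_extraction.text import CountVectorizer

text = ['a minimal corpus', 'of two documents']

# params holds only keyword arguments valid for the vectorizer,
# so it can be unpacked directly into the constructor call ...
params = {'analyzer': 'word', 'ngram_range': (1, 2)}
X = CountVectorizer(**params).fit_transform(text)

# ... while bench mixes input parameters with measured results,
# so it could not be passed on as **params.
bench = {'vectorizer': CountVectorizer.__name__}
bench.update(params)
bench['time'] = None  # filled in once the timings are measured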

Member

you're right

# run_vectorizer returns a callable, so its result can be passed
# directly to timeit.repeat
dt = timeit.repeat(run_vectorizer(Vectorizer, text, **params),
                   number=1,
                   repeat=3)
bench['time'] = "{:.2f} (±{:.2f})".format(np.mean(dt), np.std(dt))
Member

In Python 2 you need to add an encoding declaration at the top of the file for non-ASCII characters such as the one used here.

I find it very strange to report the mean and std of three points. Is this too slow to run at least 10 times? If so, it might be better to show the median and maybe range, not sure...

@ogrisel what would you report?

Member Author

Thanks, I forgot about this. Added an encoding declaration at the top.
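
For reference, an encoding declaration of this form (PEP 263; the exact spelling can vary) goes on the first or second line of the file:

# -*- coding: utf-8 -*-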

I agree that computing it on 3 points is not ideal, but these benchmarks already take quite some time (~20 min), and I felt that the current 10k document dataset is quite representative of a real-world dataset (vocabulary-size-wise etc.), so I wasn't sure about reducing its size to make the computations faster. Besides, in this benchmark the std is ~1% of the run time, so even if it's off by a factor of 2 or so, it doesn't really matter for any optimization that gains more than 5% performance. It was about giving an idea of the order of magnitude of the standard deviation, to say that run-to-run variations are almost negligible.

If so, it might be better to show the median and maybe range, not sure...

Well, the thing is that when running these benchmarks on multiple configurations the tables are already hard to parse as it is (e.g. see this example); adding a range would make them even less readable, I think. About the median, you might be right, but the mean is more consistent with the regular %timeit typically used in benchmarks and should be more stable run to run.

What do you think? Also thanks for the review! Removed the comment about itertools.product, you are right.

Member

Yeah, I see what you mean. Maybe just change the printout to "mean %f std %f over 3 runs" so that nobody is misled by the script (in which case no need for the # encoding either :P).

But it's a nitpick. Other than that this is great, thanks!

Member Author

OK, thanks. I added an additional printed sentence to specify that the statistics are computed over 3 runs, and replaced \pm with +- to remove the need for the unicode encoding header (trying to keep as little repeated text as possible in the table cells).

@rth rth (Member Author) commented Jun 11, 2017

@lesteve Hmm, flake8 is misbehaving: it raises an error on print("="*80 + ...) in this run.

Style-wise this should be a correct statement (cf. the discussion here), and besides I can't reproduce any style errors in this script with the latest flake8 (version 3.3.0), nor with the one in Travis CI, installed with

 conda create -n flake8-env flake8=2.5.1 python=3.4

Not sure why it's unhappy in Travis...
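
For reference, reproducing the check locally against that environment would look something like this:

 source activate flake8-env
 flake8 benchmarks/bench_text_vectorizers.py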

@vene vene (Member) commented Jun 11, 2017

Our convention is to use spaces around arithmetic operators as far as I know...

@rth rth (Member Author) commented Jun 11, 2017

Our convention is to use spaces around arithmetic operators as far as I know...

I thought that too, but according to PEP 8 all of the following lines are valid (see the 3rd point of the "Other recommendations" section),

x = 2
x*x + x*x

while

x * x + x * x

is not the recommended form.

And flake8 doesn't raise any errors for any of these 3 lines, so the Travis CI flake8 job shouldn't fail...

Edit: edited the comment; the previous version of it was incorrect.

@rth rth force-pushed the text-benchmark branch from 8dfea89 to ce66d1e Compare June 14, 2017 13:37
@rth rth (Member Author) commented Jun 14, 2017

Rebased on master; after #9123 all tests appear to pass. @vene Are there any other comments I should address? Thanks again for the review!

@lesteve lesteve (Member) commented Jun 28, 2017

Merging, thanks!

@lesteve lesteve merged commit edabec2 into scikit-learn:master Jun 28, 2017
@rth rth deleted the text-benchmark branch June 28, 2017 17:26