[MRG+1] Add text vectorizers benchmarks #9086
Conversation
LGTM, thanks a lot!

@TomDLT Would you be able to review this PR? Thanks.
benchmarks/bench_text_vectorizers.py
Outdated
```python
res = []

# Wrap the result of itertools.product with tqdm to get a progress bar
```
is this comment supposed to be a tip? I'd remove it
That was more intended as a justification for why `itertools.product` was used there, and to prevent somebody from refactoring it into a double `for` loop, which would be, after all, simpler and more readable. Since the run time is quite long (~20 min), I felt it was useful to have some indication of the progress, even though I didn't want to add `tqdm` as a dependency (or write a manual progress indicator with `print`)...
I didn't get that, and I didn't question the use of `product` at all. I think it's idiomatic to flatten deep loops this way in Python.
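For illustration, a minimal sketch of the flattening being discussed; the parameter values here are made up, not the benchmark's actual grid:

```python
import itertools

# A single flat loop over the Cartesian product of the parameter grids...
for vectorizer, analyzer in itertools.product(['count', 'tfidf'],
                                              ['word', 'char']):
    print(vectorizer, analyzer)

# ...equivalent to the nested version:
# for vectorizer in ['count', 'tfidf']:
#     for analyzer in ['word', 'char']:
#         print(vectorizer, analyzer)
```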
```python
bench = {'vectorizer': Vectorizer.__name__}
params = {'analyzer': analyzer, 'ngram_range': ngram_range}
bench.update(params)
```
What am I missing, why not simply

```python
bench = {
    'vectorizer': ...,
    'analyzer': ...,
    'ngram_range': ...,
}
```
`bench` holds the results, `params` holds the input parameters (so I can do e.g. `run_vectorizer(Vectorizer, text, **params)`); I couldn't find a way to use a single dictionary for both here...
you're right
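A minimal sketch of the two-dictionary pattern being discussed; the `run_vectorizer` stub and the toy corpus below are placeholders, only the calling convention matters:

```python
from sklearn.feature_extraction.text import CountVectorizer

def run_vectorizer(Vectorizer, text, **params):
    # Stub standing in for the benchmark's factory function.
    return lambda: Vectorizer(**params).fit_transform(text)

text = ['a few short documents', 'to vectorize']
params = {'analyzer': 'word', 'ngram_range': (1, 1)}  # input parameters
bench = {'vectorizer': CountVectorizer.__name__}      # results row
bench.update(params)                                  # record params alongside results

# params remains usable on its own for the call:
run_vectorizer(CountVectorizer, text, **params)()
```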
benchmarks/bench_text_vectorizers.py
Outdated
```python
dt = timeit.repeat(run_vectorizer(Vectorizer, text, **params),
                   number=1,
                   repeat=3)
bench['time'] = "{:.2f} (±{:.2f})".format(np.mean(dt), np.std(dt))
```
In Python 2 you need to add an encoding declaration at the top of the file for non-ASCII characters such as the one used here.
I find it very strange to report the mean and std of three points. Is this too slow to run at least 10 times? If so, it might be better to show the median and maybe range, not sure...
@ogrisel what would you report?
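For reference, the encoding declaration mentioned above is a PEP 263 special comment on the first or second line of the file:

```python
# -*- coding: utf-8 -*-
```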
Thanks, forgot about this. Added an encoding declaration at the top.

I agree that computing it on 3 points is not ideal, but these benchmarks already take quite some time (20 min), and I felt that the current 10k-document dataset is quite representative of a real-world dataset (vocabulary-size-wise etc.), so I wasn't sure about reducing its size to make the computations faster. Besides, in this benchmark the std is ~1% of the run time, so even if it's off by a factor of 2 or so, it doesn't really matter for any optimization that gains more than 5% in performance. It was about giving an idea of the order of magnitude of the standard deviation, to say that run-to-run variations are almost negligible.
> If so, it might be better to show the median and maybe range, not sure...
Well, the thing is that when running these benchmarks on multiple configurations, the tables are already hard to parse as it is (e.g. see this example); adding a range would make them even less readable, I think. About the median, you might be right, but the mean is more consistent with the regular `%timeit` typically used in benchmarks and should be more stable run to run.

What do you think? Also thanks for the review! Removed the comment about `itertools.product`, you are right.
Yeah, I see what you mean. Maybe just change the printout to "mean %f std %f over 3 runs" so that nobody is misled by the script (in which case no need for the `# encoding` either :P).

But it's a nitpick. Other than that this is great, thanks!
OK, thanks. I added an additional printed sentence to specify that the statistics are computed over 3 runs, and replaced `\pm` with `+-` to remove the unicode header (trying to keep the cells in the table with as little repeated text as possible).
@lesteve Hmm, flake8 is misbehaving: it raises an error here even though, style-wise, this should be a correct statement (cf. the discussion here), and besides I can't reproduce any style errors in this script with the latest flake8 (version 3.3.0), nor with the one in Travis CI, installed with `conda create -n flake8-env flake8=2.5.1 python=3.4`. Not sure why it's unhappy in Travis...
Our convention is to use spaces around arithmetic operators as far as I know...
I thought that too, but according to PEP 8 all of the following lines are valid (see the 3rd point of the "Other recommendations" section):

```python
x = 2
x*x + x*x
```

while `x * x + x * x` is not. And flake8 doesn't raise any errors for any of these 3 lines, so the Travis CI flake8 job shouldn't fail...

Edit: edited the comment, the previous version of it was incorrect.
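For reference, the examples PEP 8 itself gives for this rule ("consider adding whitespace around the operators with the lowest priority(ies)") are the following; the variable definitions on the first line are added here only to make the snippet runnable:

```python
i = submitted = x = a = b = 1  # added so the PEP 8 examples below execute

i = i + 1
submitted += 1
x = x*2 - 1
hypot2 = x*x + y*y
c = (a+b) * (a-b)
```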
Merging, thanks!
In text processing, vectorization usually has a large run time compared to the `.fit` and `.predict` times of models further down the processing pipeline (cf. the last two figures in this example). As a result, new PRs are recurrently submitted aiming to improve the run time performance of text vectorization (e.g. #7567, #7272, #7470).

This PR adds standard run time and memory benchmarks for `CountVectorizer`, `TfidfVectorizer` and `HashingVectorizer`, over a combination of common values for `ngram_range` and `analyzer`, so that such benchmarks wouldn't need to be implemented anew for each new PR. cc @lesteve

The output of this benchmark can be found below.
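To give an idea of the shape of such a benchmark, here is a condensed sketch; it is not the actual `bench_text_vectorizers.py` script, and the dataset loader, parameter grid and timing details are illustrative assumptions:

```python
import itertools
import timeit

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import (CountVectorizer,
                                             HashingVectorizer,
                                             TfidfVectorizer)

# Assumed corpus: the 20 newsgroups training set (~11k documents).
text = fetch_20newsgroups(subset='train').data

print("Statistics are computed over 3 runs.")
for Vectorizer, analyzer, ngram_range in itertools.product(
        [CountVectorizer, TfidfVectorizer, HashingVectorizer],
        ['word', 'char'],
        [(1, 1), (1, 2)]):
    params = {'analyzer': analyzer, 'ngram_range': ngram_range}
    bench = {'vectorizer': Vectorizer.__name__}
    bench.update(params)
    # Time fit_transform: 3 repeats of a single run each, report mean +- std.
    dt = timeit.repeat(lambda: Vectorizer(**params).fit_transform(text),
                       number=1, repeat=3)
    bench['time'] = "{:.2f} (+-{:.2f}) s".format(np.mean(dt), np.std(dt))
    print(bench)
```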