
gh-106628: email parsing speedup #106629


Merged
corona10 merged 2 commits into python:main on Jul 13, 2023

Conversation

Contributor

@cfbolz cfbolz commented Jul 11, 2023

As described in #106628, this PR speeds up email parsing by no longer compiling a regular expression for every single email parsed. On the benchmark submitted by the original bug reporter, this gives a 20% speedup when parsing the 235 MiB example mbox file containing 10,000 emails on CPython main (on PyPy the speedup is much larger, but only because the previous performance was especially bad).

cfbolz added 2 commits July 11, 2023 16:32
Don't compile a new regular expression for every single email that is
being parsed. Instead, use str.startswith and a generic regular
expression.
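
For illustration, the approach described in the commit message can be sketched roughly as follows. This is a minimal sketch under assumptions, not the actual diff: the names `BOUNDARY_END_RE` and `make_boundary_matcher` are hypothetical, and the pattern is only meant to show a prefix check with `str.startswith` combined with a single, boundary-independent regular expression that is compiled once.

```python
import re

# Generic tail pattern, compiled once at import time (hypothetical name).
# It matches what may follow the boundary separator on a line: an optional
# closing "--", optional trailing whitespace, and an optional line ending.
BOUNDARY_END_RE = re.compile(
    r'(?P<end>--)?(?P<ws>[ \t]*)(?P<linesep>\r\n|\r|\n)?$')


def make_boundary_matcher(boundary):
    """Return a function that matches multipart boundary lines.

    Hypothetical helper for illustration; not a name used in the PR.
    """
    separator = '--' + boundary

    def boundarymatch(line):
        # A cheap prefix check replaces the per-message compiled regex.
        if not line.startswith(separator):
            return None
        # Match only the remainder of the line with the shared pattern.
        return BOUNDARY_END_RE.match(line, len(separator))

    return boundarymatch


match = make_boundary_matcher('frontier')
print(match('--frontier\n') is not None)      # boundary line
print(match('--frontier--\n').group('end'))   # closing boundary
print(match('not a boundary\n'))              # ordinary line
```

Because the boundary string only appears in the `startswith` check, nothing boundary-specific has to be escaped or compiled for each message.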
Member

@corona10 corona10 left a comment


lgtm

baseline

Number of emails in the mbox file: 10000 filesize: 246081876
[1000]  avg 0.434 msec/email   (total time: 0.44 seconds)
[2000]  avg 0.350 msec/email   (total time: 0.79 seconds)
[3000]  avg 0.368 msec/email   (total time: 1.15 seconds)
[4000]  avg 0.462 msec/email   (total time: 1.62 seconds)
[5000]  avg 0.375 msec/email   (total time: 1.99 seconds)
[6000]  avg 0.412 msec/email   (total time: 2.40 seconds)
[7000]  avg 0.374 msec/email   (total time: 2.78 seconds)
[8000]  avg 0.405 msec/email   (total time: 3.18 seconds)
[9000]  avg 0.383 msec/email   (total time: 3.56 seconds)

PR

Number of emails in the mbox file: 10000 filesize: 246081876
[1000]  avg 0.398 msec/email   (total time: 0.40 seconds)
[2000]  avg 0.309 msec/email   (total time: 0.71 seconds)
[3000]  avg 0.329 msec/email   (total time: 1.04 seconds)
[4000]  avg 0.435 msec/email   (total time: 1.48 seconds)
[5000]  avg 0.336 msec/email   (total time: 1.81 seconds)
[6000]  avg 0.385 msec/email   (total time: 2.20 seconds)
[7000]  avg 0.326 msec/email   (total time: 2.52 seconds)
[8000]  avg 0.350 msec/email   (total time: 2.87 seconds)
[9000]  avg 0.321 msec/email   (total time: 3.19 seconds)
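
As a rough way to read the two runs above, the cumulative totals at the 9000-email mark can be compared directly. This is only a sketch of the arithmetic on this one pair of runs; the variable names are made up for illustration, and a single run on one machine need not reproduce the figure quoted in the PR description.

```python
# Cumulative totals (seconds) after 9000 emails, copied from the two runs above.
baseline_total = 3.56
pr_total = 3.19

# Fraction of wall-clock time saved relative to the baseline run.
time_saved = (baseline_total - pr_total) / baseline_total

# Speedup factor: how many times faster the PR run is than the baseline.
speedup = baseline_total / pr_total

print(f"time reduced by {time_saved:.1%}, speedup factor {speedup:.2f}x")
```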

@corona10
Member

I will merge this PR if there are no major objections this week :)

@corona10 corona10 self-assigned this Jul 11, 2023
@corona10 corona10 merged commit 7e6ce48 into python:main Jul 13, 2023