gh-137627: Make csv.Sniffer.sniff() delimiter detection 1.5x faster #137628
Title changed from "csv.Sniffer._guess_delimiter() 2x faster" to "csv.Sniffer.sniff() 2x faster"
Are the benchmarks done with a PGO+LTO build?
Misc/NEWS.d/next/Library/2025-08-11-04-52-18.gh-issue-137627.Ku5Yi2.rst
I'm not a CSV expert, but here is a cursory review of the set logic. You should provide a (range of) benchmarks to back up the claim that it is twice as fast, though, ideally using pyperformance.
@picnixz @AA-Turner I really appreciate your feedback! It's great. I will provide more benchmarks, including with enabled optimizations and ideally with |
Benchmarks without optimizations are not relevant, so just run those with.
Title changed from "csv.Sniffer.sniff() 2x faster" to "csv.Sniffer.sniff() delimiter detection 1.5x faster"
@picnixz @AA-Turner @ZeroIntensity Thank you for all the comments:
The basic idea is not to iterate over all 127 ASCII characters and count their frequency on each line in _guess_delimiter, but only over present characters, and just backfill zeros.
Benchmark
There is no csv.Sniffer benchmark in pyperformance, so I constructed a simple benchmark with:
using all 149 files from CSVSniffer (MIT License), reading only the sample, as recommended in the docs.python.org example. That's what real users do, too.
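For reference, the "read only a sample" pattern from the csv documentation looks roughly like the sketch below. This is not the benchmark script from this PR; the in-memory file and its contents are made up for illustration, and a real run would open each corpus file instead.

```python
import csv
import io

# Hypothetical in-memory stand-in for a CSV file on disk.
csvfile = io.StringIO(
    "name;age;city\n"
    "Ada;36;London\n"
    "Bob;41;Paris\n"
    "Cy;29;Rome\n"
)

# Sniff only a leading sample rather than the whole file,
# then rewind and parse with the detected dialect.
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
rows = list(csv.reader(csvfile, dialect))
```

Because only the first 1024 characters are passed to sniff(), the time spent in delimiter detection depends on the sample size, not on the size of the file.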
Results
The full results:
Environment
sudo ./python -m pyperf system tune ensured.
Notes
This change affects only csv.Sniffer()._guess_delimiter(), which runs only if the regular expressions in csv.Sniffer()._guess_quote_and_delimiter() failed, so there's no guarantee that csv.Sniffer().sniff() will always be faster.
csv.Sniffer._guess_delimiter() iterates over all ASCII on each line #137627
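The basic idea from the description (count only the characters present on each line, then backfill zeros) can be sketched as follows. This is a simplified illustration, not the code in this PR: the function names and the shape of the frequency tables are assumptions, and the rest of the delimiter heuristics are left out.

```python
from collections import Counter, defaultdict

ASCII = [chr(c) for c in range(127)]


def freq_tables_all_ascii(lines):
    """Baseline: for every line, count every ASCII character."""
    tables = {ch: defaultdict(int) for ch in ASCII}
    for line in lines:
        for ch in ASCII:
            # Maps "occurrences on a line" to "number of such lines".
            tables[ch][line.count(ch)] += 1
    return {ch: dict(table) for ch, table in tables.items()}


def freq_tables_present_only(lines):
    """Count only characters that occur on a line, then backfill zeros."""
    tables = defaultdict(lambda: defaultdict(int))
    for line in lines:
        for ch, n in Counter(line).items():
            if ord(ch) < 127:  # keep the same ASCII restriction
                tables[ch][n] += 1

    total = len(lines)
    result = {}
    for ch in ASCII:
        table = dict(tables.get(ch, {}))
        # Lines where the character never appeared have frequency 0.
        missing = total - sum(table.values())
        if missing:
            table[0] = missing
        result[ch] = table
    return result
```

Both functions build the same tables; the second does work proportional to the distinct characters on each line instead of 127 count() calls per line, which is where a speedup of this kind would come from.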