-
-
Notifications
You must be signed in to change notification settings - Fork 32.5k
gh-137627: Make csv.Sniffer.sniff()
2x faster
#137628
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
csv.Sniffer._guess_delimiter()
2x fastercsv.Sniffer.sniff()
2x faster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the benchmarks done with a POG+LTO build?
Lib/csv.py
Outdated
candidate_chars = set("".join(chunk)) | ||
candidate_chars.intersection_update(ascii) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do candidates = ascii.intersection(chunk) I think.
@@ -0,0 +1 @@ | |||
:meth:`csv.Sniffer.sniff` 2x faster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a bit more formal description. Also it would be good to have it in a whatsnew as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a CSV expert, but here is a cursory review of the set logic. You should provide a (range of) benchmarks to back up the claim that it is twice as fast, though, ideally using pyperformance
.
|
||
# build frequency tables | ||
chunkLength = min(10, len(data)) | ||
iteration = 0 | ||
charFrequency = {} | ||
# {char -> {count_per_line -> num_lines_with_that_count}} | ||
charFrequency = defaultdict(Counter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We generally discourage renaming like this, but you're changing almost all the use-sites in this PR already:
charFrequency = defaultdict(Counter) | |
char_frequency = defaultdict(Counter) |
@picnixz @AA-Turner I really appreciate your feedback! It's great. I will provide more benchmarks, including with enabled optimizations and ideally with |
Benchmarks without optimizations are not relevant so just run those with. |
The basic idea is not to iterate over all 127 ASCII characters and count their frequency on each line in
_guess_delimiter
but only over present characters, and just backfill zeros.Benchmark
yob2024.txt
from U.S. Social Security Baby Names data, 400K, 31904 linesThe
main
branch:The PR branch:
The comparison:
csv.Sniffer._guess_delimiter()
iterates over all ASCII on each line #137627