gh-137627: Make csv.Sniffer.sniff() 2x faster #137628

Open
wants to merge 8 commits into main
Conversation

@maurycy (Contributor) commented Aug 11, 2025

The basic idea is to stop iterating over all 127 ASCII characters on every line in _guess_delimiter, and instead count only the characters actually present on each line, backfilling zero counts afterwards.
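The approach can be sketched roughly like this (a simplified illustration of the idea, not the PR's actual code; the function name and shape are hypothetical):

```python
from collections import Counter, defaultdict

def count_present_chars(lines):
    """Build per-character frequency tables, visiting only the
    characters that actually occur on each line."""
    # char -> {count_per_line -> number_of_lines_with_that_count}
    char_frequency = defaultdict(Counter)
    for line in lines:
        for char, count in Counter(line).items():
            char_frequency[char][count] += 1
    # Backfill: lines where a character never appeared count as zero.
    n_lines = len(lines)
    for counts in char_frequency.values():
        missing = n_lines - sum(counts.values())
        if missing:
            counts[0] = missing
    return char_frequency

freqs = count_present_chars(["a,b,c", "d,e,f", "g;h"])
# ',' occurs twice on two lines and never on the third:
assert freqs[","] == Counter({2: 2, 0: 1})
```

The inner loop now does work proportional to the distinct characters per line rather than a fixed 127 iterations, which is where the speedup comes from on typical CSV data.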

Benchmark

The main branch:

% ./python.exe -m pyperf timeit -s "import csv; from pathlib import Path; data = Path('yob2024.txt').read_text(encoding='utf-8')" "csv.Sniffer().sniff(data)" -o main.json
Mean +- std dev: 1.93 sec +- 0.04 sec

The PR branch:

% ./python.exe -m pyperf timeit -s "import csv; from pathlib import Path; data = Path('yob2024.txt').read_text(encoding='utf-8')" "csv.Sniffer().sniff(data)" -o pr.json
.....................
Mean +- std dev: 990 ms +- 19 ms

The comparison:

% ./python.exe -m pyperf compare_to main.json pr.json
Mean +- std dev: [main] 1.93 sec +- 0.03 sec -> [pr] 990 ms +- 19 ms: 1.95x faster

@maurycy maurycy changed the title gh-137627: Make csv.Sniffer._guess_delimiter() 2x faster gh-137627: Make csv.Sniffer.sniff() 2x faster Aug 11, 2025
@maurycy maurycy requested a review from AA-Turner August 11, 2025 05:37
@picnixz (Member) left a comment

Are the benchmarks done with a PGO+LTO build?
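For context, an optimized CPython build (the kind the reviewer is asking about) is typically configured like this; this is a standard build sketch, not something taken from the PR:

```shell
# Enable profile-guided optimization and link-time optimization
# before benchmarking, so numbers reflect a release-quality build.
./configure --enable-optimizations --with-lto
make -j"$(nproc)"
```

Benchmarks on a default (non-optimized) debug-friendly build can differ substantially from an optimized build, so speedup ratios measured without PGO+LTO may not hold.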

Lib/csv.py Outdated
Comment on lines 384 to 385
candidate_chars = set("".join(chunk))
candidate_chars.intersection_update(ascii)
Member
You can do candidates = ascii.intersection(chunk) I think.
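One caveat worth noting about this suggestion: `set.intersection` accepts any iterable, but it compares whole elements. Whether the one-liner is equivalent depends on whether `chunk` yields characters or whole lines at that point. A small illustration (names are stand-ins, not the module's actual variables):

```python
ascii_chars = set(map(chr, range(127)))  # stand-in for csv.py's `ascii`

chunk = ["a,b", "c;d"]  # if chunk is a list of lines...

# ...intersection compares whole lines against 1-char set members,
# so nothing matches and the result is empty:
assert ascii_chars.intersection(chunk) == set()

# Joining the lines first intersects at the character level:
assert ascii_chars.intersection("".join(chunk)) == {"a", "b", "c", "d", ",", ";"}
```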

@@ -0,0 +1 @@
:meth:`csv.Sniffer.sniff` 2x faster
Member

This needs a slightly more formal description. It would also be good to mention the change in What's New.

@AA-Turner (Member) left a comment

I'm not a CSV expert, but here is a cursory review of the set logic. You should provide a range of benchmarks to back up the claim that it is twice as fast, though, ideally using pyperformance.


 # build frequency tables
 chunkLength = min(10, len(data))
 iteration = 0
-charFrequency = {}
+# {char -> {count_per_line -> num_lines_with_that_count}}
+charFrequency = defaultdict(Counter)
Member

We generally discourage renaming like this, but you're changing almost all the use-sites in this PR already:

Suggested change
charFrequency = defaultdict(Counter)
char_frequency = defaultdict(Counter)

@maurycy (Contributor, Author) commented Aug 11, 2025

@picnixz @AA-Turner I really appreciate your feedback! I will provide more benchmarks, including runs with optimizations enabled and ideally with pyperformance, rephrase the NEWS entry, and add a What's New entry.

@picnixz (Member) commented Aug 12, 2025

Benchmarks without optimizations are not relevant, so only run them on an optimized build.

4 participants