gh-137627: Make `csv.Sniffer.sniff()` 2x faster #137628

maurycy · 2025-08-11T02:57:13Z

The basic idea is to stop iterating over all 127 ASCII characters and counting each one's frequency on every line in `_guess_delimiter`. Instead, only the characters actually present are counted, and the zero counts are backfilled afterwards.
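A minimal sketch of that idea, for illustration only. It is not the code from this PR: the function names, the `ASCII_CHARS` constant, and the sample lines are hypothetical. It builds the same per-character tables two ways, once by probing every ASCII character on every line and once by counting only the characters that occur and backfilling the zero bucket.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the 127 ASCII characters mentioned above.
ASCII_CHARS = frozenset(map(chr, range(127)))


def frequency_tables_dense(lines):
    """Baseline: probe every ASCII character on every line."""
    tables = defaultdict(Counter)
    for line in lines:
        for char in ASCII_CHARS:
            tables[char][line.count(char)] += 1
    return tables


def frequency_tables_sparse(lines):
    """Count only characters that occur, then backfill the zero bucket."""
    tables = defaultdict(Counter)
    for line in lines:
        for char, count in Counter(line).items():
            if char in ASCII_CHARS:
                tables[char][count] += 1
    total = len(lines)
    for table in tables.values():
        # Lines not accounted for are lines where the character was absent.
        missing = total - sum(table.values())
        if missing:
            table[0] = missing
    return tables


lines = ["a,b,c", "d,e", "f;g,h"]  # made-up sample
dense = frequency_tables_dense(lines)
sparse = frequency_tables_sparse(lines)
```

For every character that appears at least once, both functions produce the same table. The sparse variant simply has no entry for characters that never occur, which is where the saved work comes from.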

Benchmark

yob2024.txt from U.S. Social Security Baby Names data, 400K, 31904 lines

The main branch:

% ./python.exe -m pyperf timeit -s "import csv; from pathlib import Path; data = Path('yob2024.txt').read_text(encoding='utf-8')" "csv.Sniffer().sniff(data)" -o main.json
Mean +- std dev: 1.93 sec +- 0.04 sec

The PR branch:

% ./python.exe -m pyperf timeit -s "import csv; from pathlib import Path; data = Path('yob2024.txt').read_text(encoding='utf-8')" "csv.Sniffer().sniff(data)" -o pr.json
.....................
Mean +- std dev: 990 ms +- 19 ms

The comparison:

% ./python.exe -m pyperf compare_to main.json pr.json
Mean +- std dev: [main] 1.93 sec +- 0.03 sec -> [pr] 990 ms +- 19 ms: 1.95x faster

Issue: csv.Sniffer._guess_delimiter() iterates over all ASCII on each line #137627

Lib/csv.py

picnixz

Are the benchmarks done with a PGO+LTO build?

picnixz · 2025-08-11T07:19:44Z

Lib/csv.py

+            candidate_chars = set("".join(chunk))
+            candidate_chars.intersection_update(ascii)


You can do `candidates = ascii.intersection(chunk)`, I think.
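A small sketch of how the two forms behave, assuming `chunk` is a list of line strings (which the `"".join(chunk)` in the diff suggests) and using a hypothetical `ASCII_CHARS` frozenset in place of the `ascii` name. `set.intersection()` accepts any iterable, but it iterates over that iterable's elements, so a list of lines is compared line by line rather than character by character.

```python
# Hypothetical stand-in for the `ascii` collection in the diff.
ASCII_CHARS = frozenset(map(chr, range(127)))
chunk = ["a,b,c", "1,2,3"]  # made-up sample lines

# Form from the diff: build a set of characters, then narrow it in place.
candidate_chars = set("".join(chunk))
candidate_chars.intersection_update(ASCII_CHARS)

# One-call form: intersecting with the joined string iterates characters.
candidates = ASCII_CHARS.intersection("".join(chunk))

# Passing the list itself iterates whole lines, so nothing matches here.
line_level = ASCII_CHARS.intersection(chunk)
```

Under that assumption, the one-call form still needs the join (or an equivalent flattening) to give the same result as the two-step version.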

picnixz · 2025-08-11T07:21:18Z

Misc/NEWS.d/next/Library/2025-08-11-04-52-18.gh-issue-137627.Ku5Yi2.rst

@@ -0,0 +1 @@
+:meth:`csv.Sniffer.sniff` 2x faster


This needs a more formal description. It would also be good to mention it in What's New.

AA-Turner

I'm not a CSV expert, but here is a cursory review of the set logic. You should provide a (range of) benchmarks to back up the claim that it is twice as fast, though, ideally using pyperformance.
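As a rough illustration of what a range of inputs could look like, here is a sketch that times `csv.Sniffer().sniff()` on synthetic samples with different delimiters and sizes. It uses the standard-library `timeit` module rather than pyperf or pyperformance, and the helper names, cell contents, and sizes are all assumptions made for this example, so the numbers it prints are only indicative.

```python
import csv
import timeit


def make_sample(delimiter, rows=200, columns=5):
    """Build synthetic delimited text; shape and cell values are arbitrary."""
    return "\n".join(
        delimiter.join(f"r{r}c{c}" for c in range(columns))
        for r in range(rows)
    )


def time_sniff(sample, number=3, repeat=3):
    """Return the best per-call time in seconds for sniffing `sample`."""
    timer = timeit.Timer(lambda: csv.Sniffer().sniff(sample))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    for delimiter in (",", ";", "\t"):
        for rows in (100, 1_000, 10_000):
            sample = make_sample(delimiter, rows=rows)
            print(f"{delimiter!r:>5} {rows:>6} rows: {time_sniff(sample):.4f} s")
```

Running the same script on both branches with an optimized build gives a quick first comparison; a pyperf run remains the better tool for reportable numbers.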

Lib/csv.py

AA-Turner · 2025-08-11T07:16:15Z

Lib/csv.py


        # build frequency tables
        chunkLength = min(10, len(data))
        iteration = 0
-        charFrequency = {}
+        # {char -> {count_per_line -> num_lines_with_that_count}}
+        charFrequency = defaultdict(Counter)


We generally discourage renaming like this, but you're changing almost all the use-sites in this PR already:

Suggested change

charFrequency = defaultdict(Counter)

char_frequency = defaultdict(Counter)
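To make the comment in the diff concrete, here is a small sketch of the `{char -> {count_per_line -> num_lines_with_that_count}}` shape, using the suggested `char_frequency` name. The sample lines are made up, and reading the most common count with `Counter.most_common()` is just one convenient way to inspect the table, not necessarily how `Lib/csv.py` does it.

```python
from collections import Counter, defaultdict

char_frequency = defaultdict(Counter)
lines = ["a,b,c", "d,e,f", "g,h"]  # made-up sample

for line in lines:
    # Counter(line) maps each character to its count on this line.
    for char, count in Counter(line).items():
        char_frequency[char][count] += 1

# For ",": two lines contain 2 commas and one line contains 1 comma.
count, num_lines = char_frequency[","].most_common(1)[0]
```

Here `char_frequency[","]` is `Counter({2: 2, 1: 1})`, so the most common per-line count for a comma is 2, seen on 2 lines.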

Lib/csv.py

maurycy · 2025-08-11T15:50:36Z

@picnixz @AA-Turner I really appreciate your feedback! It's great. I will provide more benchmarks, including with optimizations enabled and ideally with pyperformance, rephrase the NEWS entry, and add a What's New entry.

picnixz · 2025-08-12T07:43:48Z

Benchmarks without optimizations are not relevant, so just run them with optimizations enabled.

maurycy added 3 commits August 11, 2025 04:40

do not iterate over all ascii

80be530

NEWS entry

2d636cf

bring back the comment

1f0b25e

bedevere-app bot added the awaiting review label Aug 11, 2025

bedevere-app bot mentioned this pull request Aug 11, 2025

csv.Sniffer._guess_delimiter() iterates over all ASCII on each line #137627

Open

bang

601b2f1

maurycy changed the title ~~gh-137627: Make csv.Sniffer._guess_delimiter() 2x faster~~ gh-137627: Make csv.Sniffer.sniff() 2x faster Aug 11, 2025

document the public method

2dc0d41

AA-Turner reviewed Aug 11, 2025

Lib/csv.py (outdated)

import within Sniffer

f106da2

maurycy requested a review from AA-Turner August 11, 2025 05:37

Merge branch 'main' into csv-sniffer-counter-set

7f7dca1

picnixz reviewed Aug 11, 2025

AA-Turner reviewed Aug 11, 2025

_ASCII_CHARS, set operators

2f1ea73

ZeroIntensity reviewed Aug 11, 2025

Lib/csv.py
