
gh-137627: Make csv.Sniffer.sniff() delimiter detection 1.6x faster #137628


Open
wants to merge 22 commits into main
Conversation

maurycy
Contributor

@maurycy maurycy commented Aug 11, 2025

The basic idea is to stop iterating over all ASCII characters and counting their frequency on each line in _guess_delimiter, and instead iterate only over the characters actually present in each line, backfilling the zero counts afterwards.
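The approach can be sketched roughly like this (a simplified, self-contained illustration, not the actual Lib/csv.py patch; the helper name count_frequencies is made up, but charFrequency mirrors the structure _guess_delimiter builds, mapping each character to a table of per-line occurrence counts):

```python
from collections import Counter

def count_frequencies(lines):
    # charFrequency[char][per_line_count] = number of lines in which
    # `char` occurred exactly `per_line_count` times.
    charFrequency = {}
    seen = 0
    for line in lines:
        seen += 1
        # Iterate only over the characters present in this line,
        # instead of looping over the whole ASCII range per line.
        for char, count in Counter(line).items():
            if char.isascii():
                freq = charFrequency.setdefault(char, {})
                freq[count] = freq.get(count, 0) + 1
    # Backfill the zero bucket: lines in which a char never appeared.
    for freq in charFrequency.values():
        missed = seen - sum(freq.values())
        if missed:
            freq[0] = missed
    return charFrequency
```

The zero bucket cannot be filled while scanning (Counter never reports absent characters), hence the second pass over the much smaller charFrequency dict.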

Benchmark

There is no csv.Sniffer benchmark in pyperformance, so I created a simple benchmark instead:

import csv, pathlib, pyperf

def sniff(s):
    csv.Sniffer()._guess_delimiter(s, None)

r = pyperf.Runner()
sizes = [1024, 2048, 4096]
for file in pathlib.Path("/home/maurycy/CSVsniffer/CSV/").glob("*.csv"):
    for s in sizes:
        with file.open() as f:
            try:
                # UnicodeDecodeError is raised by f.read(), so catch it
                # around the read rather than around bench_func()
                sample = f.read(s)
            except UnicodeDecodeError:
                continue
        r.bench_func(f"csv_sniff({file.name}, {s})", sniff, sample)

It uses all 149 files from CSVsniffer (MIT License), reading only a sample of each file, as recommended in the docs.python.org example; that matches how real users call the class, too. Unfortunately, the full run takes a few hours.

Results

Geometric mean: 1.60x faster

The full results (compare_to --table --table-format=md, and JSON files):

Environment

% ./python -c "import sysconfig; print(sysconfig.get_config_var('CONFIG_ARGS'))"
'--with-lto' '--enable-optimizations'

System tuning was ensured with sudo ./python -m pyperf system tune.

Notes

  • The optimization is in csv.Sniffer()._guess_delimiter(), which runs only if the regular expressions in csv.Sniffer()._guess_quote_and_delimiter() failed, so there is no guarantee that csv.Sniffer().sniff() will always be faster.
  • My original patch relied on confusing set operations.

@maurycy maurycy changed the title gh-137627: Make csv.Sniffer._guess_delimiter() 2x faster gh-137627: Make csv.Sniffer.sniff() 2x faster Aug 11, 2025
@maurycy maurycy requested a review from AA-Turner August 11, 2025 05:37
Member

@picnixz picnixz left a comment

Are the benchmarks done with a PGO+LTO build?

Member

@AA-Turner AA-Turner left a comment

I'm not a CSV expert, but here is a cursory review of the set logic. You should provide a (range of) benchmarks to back up the claim that it is twice as fast, though, ideally using pyperformance.

@maurycy
Contributor Author

maurycy commented Aug 11, 2025

@picnixz @AA-Turner I really appreciate your feedback! It's great. I will provide more benchmarks, including with optimizations enabled and ideally with pyperformance, rephrase the NEWS entry, and add a whatsnew entry.

@picnixz
Member

picnixz commented Aug 12, 2025

Benchmarks without optimizations are not relevant, so just run the ones with optimizations.

maurycy added a commit to maurycy/cpython-test that referenced this pull request Aug 13, 2025
@maurycy maurycy changed the title gh-137627: Make csv.Sniffer.sniff() 2x faster gh-137627: Make csv.Sniffer.sniff() delimiter detection 1.5x faster Aug 13, 2025
@maurycy
Contributor Author

maurycy commented Aug 13, 2025

@picnixz @AA-Turner @ZeroIntensity

Thank you for all the comments:

  • I created a much more rigorous benchmark than before; see the results: https://github.com/maurycy/cpython-test/tree/main/csv-sniffer-counter-set The benchmark now matches how people actually use the class.
  • I've changed the approach, ditching the confusing set operations. The speedup now comes mostly from not iterating over the full ASCII range for every line, plus more efficient zero backfilling.
  • There is nothing about csv.Sniffer() in pyperformance, so I was a bit stuck there.
  • I updated the docs.

Lib/csv.py Outdated
seen += 1
charCounts = Counter(line)
for char, count in charCounts.items():
    if ord(char) < 127:
Member

Is it faster to do ord(char) or char.isascii()? Just run python -m pyperf timeit -s "x='x'" "x.isascii()" and compare it to python -m pyperf timeit -s "char='x'" "ord(char) < 127" (ensure that a name lookup, ord(char) or x.*, is used to avoid inlining the checks).

Contributor Author

@picnixz

20:52:34.106551089PM CEST maurycy@gunnbjorn ~/cpython (main) % ./python -m pyperf timeit -s "x='x'" "x.isascii()"
.....................
Mean +- std dev: 13.8 ns +- 0.0 ns
20:52:46.818268693PM CEST maurycy@gunnbjorn ~/cpython (main) % ./python -m pyperf timeit -s "char='x'" "ord(char) < 127"
.....................
Mean +- std dev: 17.4 ns +- 0.0 ns

The reason for having ord in the first place is preserving the original off-by-one error and not changing the fragile behavior.

Member

Oh. Wait. The check is < 127 not <= 127. Was it intentional?

Member

@picnixz picnixz Aug 18, 2025

Well, there's an easy way to use isascii() and still be efficient: after processing the lines, just do charFrequency.pop('\x7f', None). But I don't think we should do it, because counting control characters (e.g., 0x1) is still as nonsensical as counting \x7f.
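For reference, the behavioral difference between the two checks shows up only on the DEL character (a minimal illustration; the charFrequency contents below are made up):

```python
# '\x7f' (DEL) is code point 127: str.isascii() accepts it (ASCII is 0-127),
# while the original `ord(char) < 127` check excludes it.
char = "\x7f"
print(char.isascii())    # True
print(ord(char) < 127)   # False

# Hence the suggested cleanup when switching to isascii():
# count with isascii() during the scan, then drop the DEL bucket once.
charFrequency = {"\x7f": {1: 3}, ",": {2: 5}}
charFrequency.pop("\x7f", None)
print(charFrequency)     # {',': {2: 5}}
```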

Contributor Author

@maurycy maurycy Aug 19, 2025

@picnixz

Yes, intentional:

I agree it doesn't make sense to keep this off-by-one.

str.isascii() is much faster than ord(), confirmed with the new benchmark:

Geometric mean: 1.60x faster

Lib/csv.py Outdated
Comment on lines 390 to 395
presentCount = sum(counts.values())
zeroCount = seen - presentCount
if zeroCount > 0:
    items = list(counts.items()) + [(0, zeroCount)]
else:
    items = list(counts.items())
Member

Suggested change
- presentCount = sum(counts.values())
- zeroCount = seen - presentCount
- if zeroCount > 0:
-     items = list(counts.items()) + [(0, zeroCount)]
- else:
-     items = list(counts.items())
+ items = list(counts.items())
+ missed_lines = seen - sum(counts.values())
+ if missed_lines:
+     # charFrequency[char][0] can only be deduced now
+     # as it cannot be obtained when parsing the lines.
+     assert 0 not in counts.keys()
+     # Store the number of lines 'char' was missing from.
+     items.append((0, missed_lines))

I think this is what you want, right?
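The two formulations can be checked for equivalence with a quick sketch (hypothetical helper names; counts maps a per-line occurrence count to the number of lines with that count, seen is the total number of lines scanned):

```python
def backfill_original(counts, seen):
    # PR author's version: branch on whether any line missed the char
    presentCount = sum(counts.values())
    zeroCount = seen - presentCount
    if zeroCount > 0:
        return list(counts.items()) + [(0, zeroCount)]
    return list(counts.items())

def backfill_suggested(counts, seen):
    # reviewer's version: unconditional list, conditional append
    items = list(counts.items())
    missed_lines = seen - sum(counts.values())
    if missed_lines:
        assert 0 not in counts.keys()
        items.append((0, missed_lines))
    return items

# Both produce the same items, with the zero bucket appended last.
for counts, seen in [({1: 2, 3: 1}, 5), ({2: 4}, 4)]:
    assert backfill_original(counts, seen) == backfill_suggested(counts, seen)
```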

Member

I'm not entirely sure of my suggestion, so someone else needs to check this (I'm sick, hence tired).

Contributor Author

@maurycy maurycy Aug 18, 2025

@picnixz It’s OK. I’ve spent way too much time thinking about this code and I believe they’re the same; yours is a bit more readable. :-)

I’m running the benchmark as we speak.

There are only three gotchas:

  • Not sure if I agree with the comment (the zeros can be obtained but it’s inefficient),
  • Are we OK with leaving the assert?
  • There is a very subtle difference between the original code and my code, that is the tie break. I don’t think it matters, though.

Get better soon!

Member

Not sure if I agree with the comment (the zeros can be obtained but it’s inefficient),

Yes if we change the construction. But with the usage of Counter(line) it's not possible to obtain them as we ... don't count chars that are missing.

There is a very subtle difference between the original code and my code, that is the tie break. It should not matter, though.

Oh. Can you give us an example, if possible? Then we can add a test as well.

Are we OK with leaving the assert?

This specific assert should be cheap but we can remove it (along with the comment that can be misread). It's good that benchmarks are performed with that assert though.

Contributor Author

@maurycy maurycy Aug 19, 2025

There is a very subtle difference between the original code and my code, that is the tie break. It should not matter, though.

Oh. Can you give us an example if possible? and then we can add a test as well.

Definitely!

FYI, the first test:

caught my attention because of:

which should be just str.splitlines() but I'm still afraid of touching too much here. :-)

Are we OK with leaving the assert?

This specific assert should be cheap but we can remove it (along with the comment that can be misread). It's good that benchmarks are performed with that assert though.

I've removed it:

The new benchmark:

ran over 4b62c84:

and included the assert.

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
@maurycy maurycy changed the title gh-137627: Make csv.Sniffer.sniff() delimiter detection 1.5x faster gh-137627: Make csv.Sniffer.sniff() delimiter detection 1.6x faster Aug 19, 2025
@maurycy
Contributor Author

maurycy commented Aug 19, 2025

@picnixz

Thank you for yesterday's review:

  • I renamed the variables and simplified the zero-bucket, as per your suggestion,
  • I added the tests for the tie break (insert v. append of the zero bucket) to confirm that the behavior has not changed,
  • I updated the docs and the benchmark (it's now 1.6x faster!)

Thank you!

@maurycy maurycy requested a review from picnixz August 19, 2025 10:49
@ZeroIntensity
Member

Love the enthusiasm, but please try to avoid continuously rebasing (pressing "update branch"). It wastes CI time and also puts a ding in all of our inboxes. Updating the branch should generally only be done to resolve merge conflicts.
