gh-102140: fix false negative in csv.Sniffer.has_header #103341

Drakariboo · 2023-04-07T10:50:14Z

We've improved the heuristic of the has_header() method in Lib/csv.py. We wanted to respect the methodology behind the original function, even though its determining factor, string length, is meaningless.

We compute the average of the string lengths and compare it to the header length, to stay consistent with the existing approach. We added a condition that checks whether the dictionary is empty and whether all elements are strings. If so, we use the average computed earlier.
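
For context, here is a minimal way to exercise the method on an all-string sample of the kind the linked issue describes. The sample data is made up for illustration. Per the issue, interpreters without this change are expected to print False here even though the first row is a header, so no particular output is claimed.

```python
import csv

# Hypothetical sample: every cell is a string and lengths vary per column.
sample = "name,city\nAlice,Paris\nBob,London\nCharlotte,Berlin\n"

# has_header() sniffs the dialect itself, then votes on the first row.
result = csv.Sniffer().has_header(sample)
print(result)
```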

Contributors: @Drakariboo and @Vanille-22

Issue: False negative from csv.Sniffer.has_header with only strings #102140

bedevere-bot · 2023-04-07T10:50:17Z

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

ghost · 2023-04-07T12:32:31Z

All commit authors signed the Contributor License Agreement.

Drakariboo · 2023-05-30T17:51:46Z

What do you expect us to do to resolve the backport labels? Is there another file to modify?

merwok · 2023-05-30T18:05:21Z

Oh don’t worry, the backport labels are used by bots to create follow-up pull requests!

johnD18 · 2023-05-31T03:30:41Z

I have made the requested changes; please review again

Drakariboo · 2023-05-31T09:19:44Z

Hi @merwok!
I don't really get it. What exactly do we have to do to pass every test?
Also, is the test/hypothesis Ubuntu check fixed?
We have a failure in test__xxsubinterpreters again but, as said before, it does not come from our changes.

Thanks for taking the time to help us. I saw the other issue with the same problem; tell me if there is anything new on it.
Have a good day! :)

AlexWaygood · 2023-05-31T09:30:40Z

@merwok, because you previously requested changes on this PR, you will either need to dismiss your prior review as "stale", or formally approve this PR. Otherwise the "check labels" CI check will continue to fail due to the "awaiting changes" label on this PR.

If you don't know how to dismiss your prior review as stale but would like to do that, I can do that for you.

@Drakariboo: please don't worry about the test_threading and/or test__xxsubinterpreters failing on this PR. We're fully aware that it's not your fault, and it's not blocking this PR being merged. Once the PR has been approved by a core developer, we will be able to merge the PR even if test_threading and/or test__xxsubinterpreters are failing on this PR. (There's no requirement for all tests to be passing in order for a PR to be merged — if it's known that a test is failing for unrelated reasons, it can be ignored 🙂)

The test_threading and test__xxsubinterpreters crashes are a known problem, and other people are working on fixing those tests.

AlexWaygood · 2023-05-31T10:39:44Z

(You also don't really need to worry too much about keeping your PR branch bang-up-to-date with main, unless there's a merge conflict. The merge commits just add noise for people subscribed to the thread :-)

arhadthedev · 2023-06-14T08:35:01Z

> I don't really get it. What exactly do we have to do to pass every test?

The "Check labels / DO-NOT-MERGE / unresolved review" check fails because of the "awaiting changes" label left over from @merwok's first review. So we just need to wait.

CAM-Gerlach

Standard reminder: You can directly apply all the suggestions you want in one go by going to Files changed -> Clicking Add to batch on each suggestion -> When done, clicking Commit

Thanks for the ping @merwok (and all your great help and guidance here!) and sorry for the delay, I was taking my annual post-PyCon GitHub notification break to recover a bit.

BTW, the docs warnings that are not on or near lines touched by this PR can be ignored for now; we want to have those only show up for such lines, but due to a few issues it's not quite as easy to do as it would seem, and we haven't been able to implement that just yet, sorry.

Misc/NEWS.d/next/Library/2023-05-01-18-53-20.gh-issue-102140._4gFLu.rst

CAM-Gerlach · 2023-06-15T04:08:08Z

Doc/library/csv.rst

+      lengths, the average length of all the strings becomes a crucial factor
+      in the determination process.


I found this rather vague, and would really recommend being specific here about how the average length is used, and under what conditions it means this method returns True, just like the rest of this description does for the other cases. Even skimming the code and description here it wasn't totally clear to me, so I didn't suggest something specific, but this should presumably look something like the following:

lengths, the average length of each row is (used to/compared with) ... and if (greater than/less than) ... , ``True`` is returned.

CAM-Gerlach · 2023-06-15T04:09:53Z

Lib/csv.py

@@ -394,6 +394,8 @@ def has_header(self, sample):
        # can't be determined, it is assumed to be a string in which case
        # the length of the string is the determining factor: if all of the
        # rows except for the first are the same length, it's a header.
+        # when the strings have varying length, the average length of all
+        # strings becomes a determining factor.


Similar to the above, this is very unclear to me. I suggest something like

# columns in a row determines...what? how?
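
To make the question concrete, here is a simplified sketch of the pre-existing length rule that the comment in the diff describes. It is not the code in Lib/csv.py: the function name `surviving_length_columns` is hypothetical and the numeric-type detection is left out. It shows why columns whose strings vary in length leave nothing to vote on.

```python
def surviving_length_columns(header, rows):
    """Map column index -> common cell length, dropping columns whose
    data cells do not all share one length (simplified illustration)."""
    column_lengths = {i: None for i in range(len(header))}
    for row in rows:
        if len(row) != len(header):
            continue  # skip rows with an unexpected number of columns
        for col in list(column_lengths):
            length = len(row[col])
            if column_lengths[col] is None:
                column_lengths[col] = length  # first length seen
            elif column_lengths[col] != length:
                del column_lengths[col]  # lengths vary: column is dropped
    return column_lengths


fixed = surviving_length_columns(["id", "city"], [["A1", "Paris"], ["B2", "London"]])
varying = surviving_length_columns(["name", "city"], [["Alice", "Paris"], ["Bob", "London"]])
print(fixed)    # only the fixed-width column survives
print(varying)  # empty: no column is left to compare with the header
```

When the mapping comes back empty, a vote based on the surviving columns has no input, which is the situation the new comment needs to explain.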

merwok · 2023-06-15T14:06:48Z

Lib/csv.py

@@ -402,8 +404,9 @@ def has_header(self, sample):
        header = next(rdr) # assume first row is header

        columns = len(header)
-        columnTypes = {}
-        for i in range(columns): columnTypes[i] = None
+        columnTypes = {i: None for i in range(columns)}


(could even use dict.fromkeys, but this is already clear!)
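
A quick check that the three spellings mentioned here build the same mapping. The value of `columns` is arbitrary and chosen only for illustration.

```python
columns = 3  # arbitrary example width

# Original loop form removed by the diff.
loop_form = {}
for i in range(columns):
    loop_form[i] = None

# Dict comprehension adopted in the PR.
comprehension_form = {i: None for i in range(columns)}

# dict.fromkeys defaults every value to None.
fromkeys_form = dict.fromkeys(range(columns))

print(loop_form == comprehension_form == fromkeys_form)
```

All three are safe here because the shared default is None; `dict.fromkeys` only becomes a trap when a single mutable default is shared across keys.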

johnD18 and others added 2 commits March 27, 2023 12:07

fixing has_header

60d7501

Merge branch 'python:main' into has_header_false_neg_fix

1636e7b

bedevere-bot mentioned this pull request Apr 7, 2023

False negative from csv.Sniffer.has_header with only strings #102140

Open

bedevere-bot added the awaiting review label Apr 7, 2023

Merge branch 'main' into has_header_false_neg_fix

7020dd5

arhadthedev added the stdlib Python modules in the Lib dir label Apr 7, 2023

arhadthedev changed the title from "# gh-102140: False neg csv header bug fix" to "gh-102140: False neg csv header bug fix" Apr 7, 2023

arhadthedev reviewed Apr 7, 2023

adding genericalias

e2a76d9

adding fieldnames

02b645d

johnD18 force-pushed the has_header_false_neg_fix branch from 667c9a2 to 02b645d Compare April 7, 2023 12:44

Merge branch 'python:main' into has_header_false_neg_fix

747cbc8

correcting deletions

6db356f

Eclips4 reviewed Apr 7, 2023

Eclips4 reviewed Apr 7, 2023

Eclips4 reviewed Apr 7, 2023

Merge branch 'python:main' into has_header_false_neg_fix

38166fe

correction of comments

2fe30c0

corrections

b78cd78

merwok added the needs backport to 3.12 only security fixes label May 29, 2023

adding a comment on csv.py & csv.rst update

bb136bb

Drakariboo and others added 4 commits May 30, 2023 21:37

Merge branch 'main' into has_header_false_neg_fix

dcf1af4

Merge branch 'python:main' into has_header_false_neg_fix

920cc1e

Merge branch 'python:main' into has_header_false_neg_fix

1971df6

Merge branch 'main' into has_header_false_neg_fix

8746e64

Merge branch 'main' into has_header_false_neg_fix

a9ea1be

Merge branch 'python:main' into has_header_false_neg_fix

6d139a6

merwok self-requested a review May 31, 2023 14:02

Merge branch 'main' into has_header_false_neg_fix

141cf5d

CAM-Gerlach added the type-bug An unexpected behavior, bug, or error label Jun 15, 2023

CAM-Gerlach reviewed Jun 15, 2023

Drakariboo and others added 6 commits June 15, 2023 09:37

Update Lib/csv.py

9686653

Change line 397 : "w" in uppercase. Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>

Update Misc/NEWS.d/next/Library/2023-05-01-18-53-20.gh-issue-102140._…

680c67b

…4gFLu.rst Rewording to specify this is more a defect fix than an enhancement Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>

Update Lib/csv.py

149938a

Replacing "checking" by "check" in the comments Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>

Update Lib/csv.py

4b22c77

Change comments to keep a more reasonable line length and use imperative. Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>

Update Lib/csv.py

2092173

change lines 407-410, init and assignment columnTypes directly. Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>

Merge branch 'main' into has_header_false_neg_fix

95f2b61

merwok reviewed Jun 15, 2023

serhiy-storchaka added needs backport to 3.13 bugs and security fixes and removed needs backport to 3.11 only security fixes labels May 9, 2024

Yhg1s removed the needs backport to 3.12 only security fixes label Apr 8, 2025

serhiy-storchaka added the needs backport to 3.14 bugs and security fixes label May 8, 2025
