Skip to content

gh-102140: fix false negative in csv.Sniffer.has_header #103341

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 75 commits into
base: main
Choose a base branch
from

Conversation

Drakariboo
Copy link

@Drakariboo Drakariboo commented Apr 7, 2023

We've improved the heuristic of has_header() method in Lib/csv.py. We wanted to respect the methodology on how this function was created, even if the determining factor string length is meaningless .

We've made the average of string lengths and compared it to the header length to keep the consistency. We added a condition in which we check if the dictionnary is empty and if all elements are strings. If it's true, we use the average calculated before.

Contributors : @Drakariboo and @Vanille-22

@bedevere-bot
Copy link

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

@bedevere-bot

This comment was marked as duplicate.

@arhadthedev arhadthedev added the stdlib Python modules in the Lib dir label Apr 7, 2023
@arhadthedev arhadthedev changed the title # gh-102140: False neg csv header bug fix gh-102140: False neg csv header bug fix Apr 7, 2023
@bedevere-bot

This comment was marked as duplicate.

@bedevere-bot

This comment was marked as duplicate.

1 similar comment
@bedevere-bot

This comment was marked as duplicate.

@ghost
Copy link

ghost commented Apr 7, 2023

All commit authors signed the Contributor License Agreement.
CLA signed

@johnD18 johnD18 force-pushed the has_header_false_neg_fix branch from 667c9a2 to 02b645d Compare April 7, 2023 12:44
@bedevere-bot

This comment was marked as duplicate.

@bedevere-bot

This comment was marked as duplicate.

@bedevere-bot

This comment was marked as duplicate.

@bedevere-bot

This comment was marked as duplicate.

@bedevere-bot

This comment was marked as duplicate.

@merwok merwok added needs backport to 3.11 only security fixes needs backport to 3.12 only security fixes labels May 29, 2023
@Drakariboo
Copy link
Author

What you expect us to do to solve labels backports ? Is there another file to modify ?

@merwok
Copy link
Member

merwok commented May 30, 2023

Oh don’t worry, the backport labels are used by bots to create follow-up pull requests!

@johnD18
Copy link

johnD18 commented May 31, 2023

I have made the requested changes; please review again

@Drakariboo
Copy link
Author

Hi @merwok !
I don't really get it. What do we have to do exactly to pass every tests ?
Also, is test/hypothesis ubuntu fixed ?
Because, we have again a failure with test_xxsubintepreters, but as said before, it was not coming from us.

Thanks for your time to help us. I saw the other issue with the same problem, tell me if there is something new from this.
Have a good day ! :)

@AlexWaygood
Copy link
Member

AlexWaygood commented May 31, 2023

@merwok, because you previously requested changes on this PR, you will either need to dismiss your prior review as "stale", or formally approve this PR. Otherwise the "check labels" CI check will continue to fail due to the "awaiting changes" label on this PR.

If you don't know how to dismiss your prior review as stale but would like to do that, I can do that for you.

@Drakariboo: please don't worry about the test_threading and/or test__xxsubinterpreters failing on this PR. We're fully aware that it's not your fault, and it's not blocking this PR being merged. Once the PR has been approved by a core developer, we will be able to merge the PR even if test_threading and/or test__xxsubinterpreters are failing on this PR. (There's no requirement for all tests to be passing in order for a PR to be merged — if it's known that a test is failing for unrelated reasons, it can be ignored 🙂)

The test_threading and test__xxsubinterpreters crashes are a known problem, and other people are working on fixing those tests.

@AlexWaygood
Copy link
Member

(You also don't really need to worry too much about keeping your PR branch bang-up-to-date with main, unless there's a merge conflict. The merge commits just add noise for people subscribed to the thread :-)

@merwok merwok self-requested a review May 31, 2023 14:02
@arhadthedev
Copy link
Member

I don't really get it. What do we have to do exactly to pass every tests ?

Check labels / DO-NOT-MERGE / unresolved review fails because of awaiting changes label left after the first review of @merwok. So we just need to wait.

@CAM-Gerlach CAM-Gerlach added the type-bug An unexpected behavior, bug, or error label Jun 15, 2023
Copy link
Member

@CAM-Gerlach CAM-Gerlach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Standard reminder: You can directly apply all the suggestions you want in one go by going to Files changed -> Clicking Add to batch on each suggestion -> When done, clicking Commit

Thanks for the ping @merwok (and all your great help and guidance here!) and sorry for the delay, I was taking my annual post-PyCon GitHub notification break to recover a bit.

BTW, the docs warnings that are not on or near lines touched by this PR can be ignored for now; we want to have those only show up for such lines, but due to a few issues it's not quite as easy to do as it would seem, and we haven't been able to implement that just yet, sorry.

Comment on lines +299 to +300
lengths, the average length of all the strings becomes a crucial factor
in the determination process.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this rather vague, and would really recommend being specific here about how the average length is used, and under what conditions it means this method returns True, just like the rest of this description does for the other cases. Even skimming the code and description here it wasn't totally clear to me, so I didn't suggest something specific, but this should presumably look something like the following:

      lengths, the average length of each row is (used to/compared with) ...
      and if (greater than/less than) ... , ``True`` is returned.

@@ -394,6 +394,8 @@ def has_header(self, sample):
# can't be determined, it is assumed to be a string in which case
# the length of the string is the determining factor: if all of the
# rows except for the first are the same length, it's a header.
# when the strings have varying length, the average length of all
# strings becomes a determining factor.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the above, this is very unclear to me. I suggest something like

        # columns in a row determines...what? how?

Drakariboo and others added 6 commits June 15, 2023 09:37
Change line 397 : "w" in uppercase.

Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
…4gFLu.rst


Rewording to specify this is more a defect fix than an enhancement

Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
Replacing "checking" by "check" in the comments

Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
Change comments to keep a more reasonable line length and use imperative.

Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
change lines 407-410, init and assignment columnTypes directly.

Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
@@ -402,8 +404,9 @@ def has_header(self, sample):
header = next(rdr) # assume first row is header

columns = len(header)
columnTypes = {}
for i in range(columns): columnTypes[i] = None
columnTypes = {i: None for i in range(columns)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(could even use dict.fromkeys, but this is already clear!)

@serhiy-storchaka serhiy-storchaka added needs backport to 3.13 bugs and security fixes and removed needs backport to 3.11 only security fixes labels May 9, 2024
@Yhg1s Yhg1s removed the needs backport to 3.12 only security fixes label Apr 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.13 bugs and security fixes awaiting changes needs backport to 3.13 bugs and security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants