-
-
Notifications
You must be signed in to change notification settings - Fork 31.8k
gh-102140: fix false negative in csv.Sniffer.has_header #103341
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Most changes to Python require a NEWS entry. Please add it using the blurb_it web app or the blurb command-line tool. |
This comment was marked as duplicate.
This comment was marked as duplicate.
This comment was marked as duplicate.
This comment was marked as duplicate.
This comment was marked as duplicate.
This comment was marked as duplicate.
1 similar comment
This comment was marked as duplicate.
This comment was marked as duplicate.
667c9a2
to
02b645d
Compare
This comment was marked as duplicate.
This comment was marked as duplicate.
This comment was marked as duplicate.
This comment was marked as duplicate.
This comment was marked as duplicate.
This comment was marked as duplicate.
This comment was marked as duplicate.
This comment was marked as duplicate.
This comment was marked as duplicate.
This comment was marked as duplicate.
What you expect us to do to solve labels backports ? Is there another file to modify ? |
Oh don’t worry, the backport labels are used by bots to create follow-up pull requests! |
I have made the requested changes; please review again |
Hi @merwok ! Thanks for your time to help us. I saw the other issue with the same problem, tell me if there is something new from this. |
@merwok, because you previously requested changes on this PR, you will either need to dismiss your prior review as "stale", or formally approve this PR. Otherwise the "check labels" CI check will continue to fail due to the "awaiting changes" label on this PR. If you don't know how to dismiss your prior review as stale but would like to do that, I can do that for you. @Drakariboo: please don't worry about the The |
(You also don't really need to worry too much about keeping your PR branch bang-up-to-date with |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Standard reminder: You can directly apply all the suggestions you want in one go by going to Files changed
-> Clicking Add to batch on each suggestion -> When done, clicking Commit
Thanks for the ping @merwok (and all your great help and guidance here!) and sorry for the delay, I was taking my annual post-PyCon GitHub notification break to recover a bit.
BTW, the docs warnings that are not on or near lines touched by this PR can be ignored for now; we want to have those only show up for such lines, but due to a few issues it's not quite as easy to do as it would seem, and we haven't been able to implement that just yet, sorry.
Misc/NEWS.d/next/Library/2023-05-01-18-53-20.gh-issue-102140._4gFLu.rst
Outdated
Show resolved
Hide resolved
lengths, the average length of all the strings becomes a crucial factor | ||
in the determination process. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this rather vague, and would really recommend being specific here about how the average length is used, and under what conditions it means this method returns True
, just like the rest of this description does for the other cases. Even skimming the code and description here it wasn't totally clear to me, so I didn't suggest something specific, but this should presumably look something like the following:
lengths, the average length of each row is (used to/compared with) ...
and if (greater than/less than) ... , ``True`` is returned.
@@ -394,6 +394,8 @@ def has_header(self, sample): | |||
# can't be determined, it is assumed to be a string in which case | |||
# the length of the string is the determining factor: if all of the | |||
# rows except for the first are the same length, it's a header. | |||
# when the strings have varying length, the average length of all | |||
# strings becomes a determining factor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the above, this is very unclear to me. I suggest something like
# columns in a row determines...what? how?
Change line 397 : "w" in uppercase. Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
…4gFLu.rst Rewording to specify this is more a defect fix than an enhancement Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
Replacing "checking" by "check" in the comments Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
Change comments to keep a more reasonable line length and use imperative. Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
change lines 407-410, init and assignment columnTypes directly. Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
@@ -402,8 +404,9 @@ def has_header(self, sample): | |||
header = next(rdr) # assume first row is header | |||
|
|||
columns = len(header) | |||
columnTypes = {} | |||
for i in range(columns): columnTypes[i] = None | |||
columnTypes = {i: None for i in range(columns)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(could even use dict.fromkeys, but this is already clear!)
We've improved the heuristic of
has_header()
method inLib/csv.py
. We wanted to respect the methodology on how this function was created, even if the determining factorstring length
is meaningless .Contributors : @Drakariboo and @Vanille-22