Skip to content

Line filtering error on large patches #10

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
justinboneh opened this issue Sep 13, 2022 · 17 comments · Fixed by #11 or #26
Closed

Line filtering error on large patches #10

justinboneh opened this issue Sep 13, 2022 · 17 comments · Fixed by #11 or #26
Labels
bug Something isn't working

Comments

@justinboneh
Copy link

KeyError: 'line_filter' when running with lines-changed-only: true.
It fails on a file with 4288 line additions, so the file object is returned with no "patch" field.

{
    "sha": "a6af39f8ae9809a353b26af5f1f2909a76cda742",
    "filename": "xxxxxxxxxx.cpp",
    "status": "added",
    "additions": 4288,
    "deletions": 0,
    "changes": 4288,
    "blob_url": "xxx",
    "raw_url": "xxx",
    "contents_url": "xxx"
}

From the log:

2022-09-13T13:08:09.8017447Z Traceback (most recent call last):
2022-09-13T13:08:09.8018099Z File "/home/runner/.local/bin/cpp-linter", line 8, in
2022-09-13T13:08:09.8018358Z sys.exit(main())
2022-09-13T13:08:09.8018791Z File "/home/runner/.local/lib/python3.8/site-packages/cpp_linter/run.py", line 1007, in main
2022-09-13T13:08:09.8019111Z capture_clang_tools_output(
2022-09-13T13:08:09.8019617Z File "/home/runner/.local/lib/python3.8/site-packages/cpp_linter/run.py", line 697, in capture_clang_tools_output
2022-09-13T13:08:09.8020009Z run_clang_format(filename, file, version, style, lines_changed_only)
2022-09-13T13:08:09.8020518Z File "/home/runner/.local/lib/python3.8/site-packages/cpp_linter/run.py", line 595, in run_clang_format
2022-09-13T13:08:09.8020878Z for line_range in file_obj["line_filter"][ranges]:
2022-09-13T13:08:09.8021162Z KeyError: 'line_filter'
2022-09-13T13:08:09.8314093Z ##[error]Process completed with exit code 1.

@2bndy5
Copy link
Contributor

2bndy5 commented Sep 13, 2022

If the file object has no patch field then we can't respect the lines-changed-only feature when it is enabled. This means clang-tidy will analyze the whole file.

I suppose it is better than crashing, but we should output a warning about file objects that don't have a diff (a patch field).

It would be beneficial to reproduce this for unit tests, but I would be hard pressed to induce this on purpose.

@2bndy5
Copy link
Contributor

2bndy5 commented Sep 13, 2022

@justinboneh I'm curious, did this happened on a PR event or a push event?

@shenxianpeng I don't think this would be a problem if we used git directly to get the diff for a push event. (#4)

@justinboneh
Copy link
Author

@2bndy5
Copy link
Contributor

2bndy5 commented Sep 13, 2022

Yeah, I've read the REST API docs - it seems to have changed a bit though since I last read it (the JSON schema was not included before). It doesn't seem to say anything about the possibility of not including the patch field. In the past, we also hit a similar problem with the thread comments on private repos - seems those docs could still use more info.

2bndy5 added a commit that referenced this issue Sep 16, 2022
clang-tidy analyses the whole file if there is no diff info for the file

This means the annotations & comments will be unfiltered - even if the
lines-changed-only option is enabled but only when
there is no diff for the file.
@2bndy5 2bndy5 linked a pull request Sep 16, 2022 that will close this issue
@2bndy5 2bndy5 added the bug Something isn't working label Sep 16, 2022
2bndy5 added a commit that referenced this issue Sep 16, 2022
clang-tidy analyses the whole file if there is no diff info for the file

This means the annotations & comments will be unfiltered - even if the
lines-changed-only option is enabled but only when
there is no diff for the file.
@kleisauke
Copy link

I think I encountered the same issue on a large patch.

 INFO:CPP Linter:Fetching files list from url: https://api.github.com/repos/libvips/libvips/pulls/3125/files
  Traceback (most recent call last):
    File "/home/runner/.local/bin/cpp-linter", line 8, in <module>
      sys.exit(main())
    File "/home/runner/.local/lib/python3.10/site-packages/cpp_linter/run.py", line 1018, in main
      exit_early = not filter_out_non_source_files(
    File "/home/runner/.local/lib/python3.10/site-packages/cpp_linter/run.py", line 426, in filter_out_non_source_files
      or (lines_changed_only == 2 and file["line_filter"]["lines_added"])
  KeyError: 'line_filter'
  Error: Process completed with exit code 1.

https://github.com/libvips/libvips/actions/runs/3531825080/jobs/5925431028

@2bndy5
Copy link
Contributor

2bndy5 commented Nov 23, 2022

I forgot that our solution to this was to lint the whole file if no patch was given (by REST API payload). This means our solution to #18 has introduced a regression that instigated this issue again.

the exact file that the workflow run failed on (as reported by @kleisauke ) is

  {
    "sha": "e40dca00ae710ba292abfef1006b248586a76617",
    "filename": "cplusplus/include/vips/VImage8.h",
    "status": "modified",
    "additions": 4570,
    "deletions": 4568,
    "changes": 9138,
    "blob_url": "https://github.com/libvips/libvips/blob/0fff6c6384fd0c2bb5108e65119d1643e6f53ffc/cplusplus%2Finclude%2Fvips%2FVImage8.h",
    "raw_url": "https://github.com/libvips/libvips/raw/0fff6c6384fd0c2bb5108e65119d1643e6f53ffc/cplusplus%2Finclude%2Fvips%2FVImage8.h",
    "contents_url": "https://api.github.com/repos/libvips/libvips/contents/cplusplus%2Finclude%2Fvips%2FVImage8.h?ref=0fff6c6384fd0c2bb5108e65119d1643e6f53ffc"
  },

I have a WIP branch that will fetch the diff format from REST API calls, but it hasn't been fully vetted yet. I believe that such a change will prevent this issue as the diff will be parsed directly instead of parsing JSON with the expectation of having a patch field for every file. My one main concern is if the diff is quite large, then we have to cycle through multiple pages of diffs returned by multiple calls to the REST API.

@2bndy5 2bndy5 reopened this Nov 23, 2022
@2bndy5
Copy link
Contributor

2bndy5 commented Nov 23, 2022

Wow, for a diff of 563 files, the payload returned a diff 311412 lines long (in a single REST API call). Of that 563 files, cpp-linter is focusing on 537 (using what the ignore option was set to). And the WIP branch (using git diff) was able to discern that 4570 lines were added and 5907 lines were in the diff chunks. 👍🏼

This preliminary test/parsing is promising. I just have to adjust/add unit tests on the WIP branch and submit the PR.

@kleisauke

This comment was marked as off-topic.

@2bndy5
Copy link
Contributor

2bndy5 commented Nov 23, 2022

The WIP branch uses git directly when executed in a local dev env, but we have to use REST API calls because the default behavior for actions/checkout provides a shallow reference (no head or base commits/trees) -- more info on that problem in #4. My experiments using git diff in CI returned all changes to the repo in question since it was created.

@kleisauke
Copy link

... also, as noticed in libvips/libvips#3125 (comment), GitHub's Rest API doesn't return all files, see for example: https://api.github.com/repos/libvips/libvips/pulls/3125/files

You could parse these files using https://github.com/libvips/libvips/pull/3125.patch, but that'll be huge. 😅

@kleisauke
Copy link

Ah, that's indeed tracked in #4, I only checked the issues at https://github.com/cpp-linter/cpp-linter-action, sorry for the noise.

@2bndy5
Copy link
Contributor

2bndy5 commented Nov 23, 2022

... also, as noticed in libvips/libvips#3125 (comment), GitHub's Rest API doesn't return all files, see for example: https://api.github.com/repos/libvips/libvips/pulls/3125/files

There are limits to the REST API.

  1. Using the JSON format, fetching changed files only provides max of 30 files per page. We currently don't have the logic implemented to continue fetching pages from REST API if the diff includes more than 30 files. Using the diff format seems to avoid this problem.
  2. There's a limit to the length of payload we can send to create a thread comment (something like a 50KB payload). There's also a limit to the number of file annotations (seen in the picture from that linked comment) that can be emitted from a single workflow run (currently 50 annotations max). For annotations, we're using github's log commands.

@2bndy5
Copy link
Contributor

2bndy5 commented Nov 23, 2022

You could parse these files using https://github.com/libvips/libvips/pull/3125.patch, but that'll be huge.

That is essentially what the diff format is returning from a REST API call. So, my investigation does exactly that.

@shenxianpeng
Copy link
Contributor

shenxianpeng commented Nov 24, 2022

Could we check the dict of file["line_filter"] before using it to avoid KeyError: 'line_filter' first and then consider a better solution? @2bndy5

@2bndy5
Copy link
Contributor

2bndy5 commented Nov 24, 2022

Yes, but not having a line_filter key means we can't adhere to the lines-changed-only setting. So it doesn't really fix the problem.

My use-git-directly branch is the better solution because it ensures we always have a diff for each file and it seems to not require multiple pages (meaning multiple REST API calls) for events with more than 30 changed files (another problem we haven't really discussed).

I'm currently working on rewriting the unit tests to make use of the diff format from REST API calls.

@2bndy5
Copy link
Contributor

2bndy5 commented Nov 24, 2022

You could parse these files using https://github.com/libvips/libvips/pull/3125.patch, but that'll be huge.

That is essentially what the diff format is returning from a REST API call. So, my investigation does exactly that.

I just noticed that the URL + ".patch" tactic includes info about every commit (like description/title/summary) in the tree (embedded in the diff between chunks). But, using the diff format from the REST API excludes this info (closely resembling git diff output). This is good for cpp-linter because we don't need to know the commit(s) title/description/summary which would probably screw up my diff parsing logic anyway.

EDIT: The URL + ".patch" tactic can also have multiple diffs for a single file as each commit in the patch is described separately.

@2bndy5
Copy link
Contributor

2bndy5 commented Nov 26, 2022

@shenxianpeng I'm going to push some major updates to the units tests. 1 of the major changes is using the very large diff in libvips/libvips@fe82be3. Currently, the pre-commit config doesn't allow large files to be pushed, but this is an exceptional case because the unit test needs to have 4 very large files:

  • the commit diff as returned by REST API (8930 KB) This is to avoid hitting the REST API usage limits when running the unit tests.
  • 3 expected_results.json files (3 * 2129 KB). Each json will represents the expected results after various lines-changed-only filtering.

I'm going to increase the check-added-large-files pre-commit hook's threshold to 9000 KB because this issue is more important than keeping our repo's footprint extremely minimal. Currently, I don't see a better way around this...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants