Skip to content

switching to git directly for getting an event's diff #4

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2bndy5 opened this issue Aug 24, 2022 · 15 comments · Fixed by #26
Closed

switching to git directly for getting an event's diff #4

2bndy5 opened this issue Aug 24, 2022 · 15 comments · Fixed by #26
Assignees
Labels
enhancement New feature or request pinned

Comments

@2bndy5
Copy link
Contributor

2bndy5 commented Aug 24, 2022

I've had an idea about switching to git directly for getting a push event's diff (and list of changed files). Maybe such feature would warrant v1.5.0 tag also.

I'm just not sure how use git to list a diff for a PR. Determining the head and base commits in that context seems non-trivial.

Originally posted by @2bndy5 in #1 (comment)

@2bndy5
Copy link
Contributor Author

2bndy5 commented Aug 24, 2022

Do you want to get push event diff using git instead GitHub API? it's a good idea if we can use git directly, and support other Non-GitHub SCM like Bitbucket and GitLab easily. This can be considered after v1.5.0 is released.

posted by @shenxianpeng in #1 (comment)

@2bndy5
Copy link
Contributor Author

2bndy5 commented Aug 24, 2022

Do you want to get push event diff using git instead GitHub API?

Yes. I've been playing around with this idea locally using git status. It is very possible for push events, but it isn't as easy for PR'd branches,

support other Non-GitHub SCM like Bitbucket and GitLab

That's the main intention. It would also allow better usage of the pkg in local dev env, namely the files-changed-only and lines-changed-only options. Although, I think it would only make sense to support push events locally because PRs tend to be specific to the SCM host (ie GitHub, GitLab, etc)

@2bndy5 2bndy5 added the enhancement New feature or request label Aug 24, 2022
@2bndy5 2bndy5 self-assigned this Aug 24, 2022
@shenxianpeng
Copy link
Contributor

Switching to git for getting event diff, I assume we need to git checkout all history, so it should have two problems

  1. users must allow to checkout all history
  2. if the repo is too big git checkout all history is impossible

Because adding comments or annotation also needs to call API, probably calling the API to get the event diff is still the best way.

@2bndy5
Copy link
Contributor Author

2bndy5 commented Aug 25, 2022

users must allow to checkout all history

Only for PRs, but I haven't assessed the best way to address that event. This will be less of a problem in local dev context. If it is a problem, then we can try to use git to accommodate for shallow history.

if the repo is too big git checkout all history is impossible

I have rarely seen this occur. The only time that I had a problem during local checkout was when the repo itself was extremely large (like Google's Material Icons repo). Even then, the Github runner is better able to do a checkout of any repo hosted on Github with ease (likely because they share the same servers).

Because adding comments or annotation also needs to call API, probably calling the API to get the event diff is still the best way

  • thread-comments requires REST API. There's no way around that.
  • file-annotations currently does not require the REST API, but the code to do so is there and tested if we need to use the REST API.

Using git directly won't require a GITHUB_TOKEN supplied because the token is specific to the REST API. Again, I'm not sure how to address PR events using git only. I think I could get the head and base commits from the event payload of a PR event, then just use git dif <head>...<base>. This applies to the following options (only for private repos):

  • files-changed-only
  • lines-changed-only

@shenxianpeng
Copy link
Contributor

I have rarely seen this occur.

For example, when I clone llvm-project and linux repos.

-sh-4.2$ time git clone git@github.com:llvm/llvm-project.git
Cloning into 'llvm-project'...
remote: Enumerating objects: 5005074, done.
remote: Counting objects: 100% (2813/2813), done.
remote: Compressing objects: 100% (485/485), done.
remote: Total 5005074 (delta 2509), reused 2441 (delta 2326), pack-reused 5002261
Receiving objects: 100% (5005074/5005074), 1.96 GiB | 51.08 MiB/s, done.
Resolving deltas: 100% (4098747/4098747), done.
Checking out files: 100% (119993/119993), done.

real    3m53.050s
user    5m40.690s
sys     0m32.289s

-sh-4.2$ time git clone git@github.com:torvalds/linux.git
Cloning into 'linux'...
remote: Enumerating objects: 8978158, done.
remote: Total 8978158 (delta 0), reused 0 (delta 0), pack-reused 8978158
Receiving objects: 100% (8978158/8978158), 3.71 GiB | 62.09 MiB/s, done.
Resolving deltas: 100% (7452118/7452118), done.
Checking out files: 100% (78011/78011), done.

real    5m23.826s
user    7m36.899s
sys     0m59.142s

-sh-4.2$ du -h --max-depth=1
3.6G    ./llvm-project
5.4G    ./linux

I also see some Git repos mixed code and binaries together (for some repos migrate from SVN to GIT), git clone with --depth 1 also needs 7GB, if clone all history, it will be 21 GB, so it will speed much more time than the above examples.

The main concern is we don't know how users use git, so checkout with all history will have performance issues and takes extreme space.

Also, supporting and maintaining other SCM's REST APIs might help us explore commercialization.

@2bndy5
Copy link
Contributor Author

2bndy5 commented Aug 26, 2022

supporting and maintaining other SCM's REST APIs

I'm trying to avoid having to do this because I envision a "spaghetti code" solution (riddled with if blocks). If the SCM site's event payload is somewhat uniformed, then we might be able to do it easily. Currently, I'm constructing the REST API URLs manually in the code, but most of the REST API URLs I'm using should be available in the event payload's JSON (though I suspect the field's keys will vary depending on the SCM site used).

I don't think we should be held accountable for the size of the user's repo. The full history would only be required for PR events when using git.

This implementation depends on knowing how to get the diff (including filenames with paths relative to the repo's root) for the push/PR event. We may not need to use git in the CI runners, but it would help to use git locally when not using a CI runner. More importantly, I was hoping to stop needing the GITHUB_TOKEN (as objected to in cpp-linter/cpp-linter-action#55) because it could potentially compromise user security (in many ways).

@shenxianpeng
Copy link
Contributor

I suspect the field's keys will vary depending on the SCM site used

I guess so. Different SCM could create different py files to deal with.

I don't think we should be held accountable for the size of the user's repo. The full history would only be required for PR events when using git.

Analyzing PR events is the most important scenario for each project's CI workflow. I suspect that using REST API the performance would be better than git directly most of the time, if using git directly also works for PR events you could also try.

More importantly, I was hoping to stop needing the GITHUB_TOKEN

To access GitHub API it's common practice to use a GitHub token, like the following actions:

Switching to git directly, it seems getting event diffs can stop using GITHUB_TOKEN, but it is still needed when adding a comment.

@2bndy5
Copy link
Contributor Author

2bndy5 commented Sep 17, 2022

Well, I'm rather inexperienced with using git directly, but I think I got it working on the use-git-directly branch. I've often preferred to use tools like gitkraken to do complex git commands because I always tend to mess things up when I use git manually in CLI.

I had some trouble with getting the diff in CI because the default for actions/checkout action is an extremely shallow history. Meaning that the reported diff was showing everything in the repo as a new file. I tried using

subprocess.run(["git", "fetch", "--depth=3"], check=True)

but that didn't seem to work. So, I had to set the depth using the action

      - uses: actions/checkout@v3
        with:
          fetch-depth: 3

I haven't added any unit tests to verify the integrity of the diff reported yet. And, there was some regex parsing error (specifically when getting the file name) when the depth was 1 (still don't know why)...

I think it would be safer to use git directly from a local machine (eg. during a pre-commit hook) which should allow using the lines-changed-only and files-changed-only options without needing the Github REST API.


As for compatibility with non-GitHub SCM, I'm not sure yet how we could abstract the REST API usage for others. I've been looking at GitLab's REST API, and the response schema is wildly different because the runners for GitLab are less "pre-configured". I still don't know how to get the JSON payload describing what event triggered the GitLab CI run.

@stale
Copy link

stale bot commented Nov 7, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Nov 7, 2022
@2bndy5
Copy link
Contributor Author

2bndy5 commented Nov 7, 2022

Please don't close this.

I still have a branch for this but it seems to only work for local clones of a repo. In the CI, the depth needs to be >= 3 (not shallow).

@shenxianpeng
Copy link
Contributor

Added label pinned, this issue will not be closed.

@shenxianpeng shenxianpeng removed the wontfix This will not be worked on label Nov 7, 2022
@2bndy5
Copy link
Contributor Author

2bndy5 commented Nov 16, 2022

I went with a different approach concerning get_list_of_changed_files()...

  • If executed on a CI runner, then get the diff format from REST API.
  • If executed from a local dev env, then use git status --verbose.

This should ensure that we are getting the "patch" from REST API calls. And, this allows the same diff parsing algorithm used for either CI env or local env. Essentially, this approach doesn't fully satisfy my original idea, but working with shallow checkouts in the CI has proved rather painful when using git directly in CI.

In the end, we should be able to now use --lines-changed-only and --files-changed-only options from a local env. I still want to investigate how this will affect using cpp-linter as a pre-commit hook, but I'm optimistic since pre-commit temporarily stashes unstaged files (which should not show up in git status -v).

@shenxianpeng
Copy link
Contributor

It looks like you've made good progress, it's nice to support cpp-linter as a pre-commit hook.
This is the clang-format hook https://github.com/pre-commit/mirrors-clang-format, just for reference.

@2bndy5
Copy link
Contributor Author

2bndy5 commented Nov 23, 2022

I just realized that by using the diff format from REST API, we will not be getting the raw_url field that we current use to download files when the user has not added the actions/checkout step. I guess I could manually assemble a link that would act as the raw_url field instead.

@2bndy5
Copy link
Contributor Author

2bndy5 commented Dec 13, 2023

I just found out about pygit2. It is a python binding for the C library that git uses under the hood (called libgit2).

I'm currently working on changing the current solution for this issue (using git CLI) into using pygit2.

PS - I'm also implementing some performance improvements I discovered when experimenting with cpp_linter_dart, so stayed tuned for another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request pinned
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants