Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: repository.Init() to Add gitRepoPaths using the proper new cache key format #20962

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

jpbelangerupgrade
Copy link

@jpbelangerupgrade jpbelangerupgrade commented Nov 26, 2024

Regression from #18388

The gitRepoPaths cache key was updated to now include not just the repository url but the repository Project. Since during init we have no way of knowing the repository project. At least add that repository into cache as the "default" / undefined project.

Note: Cherrypicking this in all active release would be great given 2.12+ are all impacted.

Context:
Our manifest repository is ~20Gi or so. It takes about 5mins minimum to get it cloned properly. What happens on repo-server startup is that repo-server become available BEFORE the manifest repository is cloned. Creating a high cpu usage and also a massive pile-up of request to the restarting repo-server which then cause app to become "unknown" and paging our dev team. This makes argo-cd completely freeze up for 5-15mins everytime we do a node update, recycle a node or any update to repo-server itself.
More information and alternative discussed can be read on #7927 (comment)

To alleviate that situation, we run an init-container that pre-clones the repo before starting the repo-server process. Which makes repo-server startup much more stable and doesn't come with a freeze on argo-cd reconciliation

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

@jpbelangerupgrade jpbelangerupgrade requested a review from a team as a code owner November 26, 2024 14:44
Copy link

bunnyshell bot commented Nov 26, 2024

🔴 Preview Environment stopped on Bunnyshell

See: Environment Details | Pipeline Logs

Available commands (reply to this comment):

  • 🔵 /bns:start to start the environment
  • 🚀 /bns:deploy to redeploy the environment
  • /bns:delete to remove the environment

@jpbelangerupgrade jpbelangerupgrade force-pushed the fix/repository-path-cache-key branch from fcb7feb to 2ed5acb Compare November 26, 2024 14:58
…key format

Signed-off-by: Jean-Philippe Bélanger <30598824+jpbelangerupgrade@users.noreply.github.com>
@jpbelangerupgrade jpbelangerupgrade force-pushed the fix/repository-path-cache-key branch from 2ed5acb to c6534c0 Compare November 26, 2024 14:58
@jpbelangerupgrade
Copy link
Author

jpbelangerupgrade commented Nov 26, 2024

We use pre-cloned repository so if a repository is already cloned, before it would cache that repo location. Now because of the project key, it’s ignoring the existing cloned repository and cloning a new one.
There are also other using persistent volume for their repo-server to have long live repo. Since some apparently have monorepo in the 20+ gig size.

So before this change. we’d end up with twice the repo cloned.
After, the cache is use and our existing cloned repo is getting used again (same as 2.11.x)

return err
}

s.gitRepoPaths.Add(string(keyData), fullPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we update getters as well? How does it work?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All other "Get" (GetPath and GetPathIfExists) have been updated already to use the Project with the composite key. Add was left behind apparently in the original PR

@ddumaisupgrade
Copy link

Hey @andrii-korotkov-verkada . Do you have think we can have an ETA on that ? We are waiting for that to be able to update.

@andrii-korotkov-verkada
Copy link
Contributor

Unfortunately no. I'm not an approver myself and it's a holiday season. I suggest posting in #argo-cd-contributors in CNCF Slack, but still it can take time.

@rumstead
Copy link
Member

rumstead commented Jan 7, 2025

@blakepettersson just as an FYI

@crenshaw-dev
Copy link
Member

We should have a unit test that reproduces the bug.

@jpbelangerupgrade
Copy link
Author

@crenshaw-dev As much as I'd like to, this whole method uses os package and binds itself to the filesystem. This would be extremely hard to mock and actually do a valid test case.
Any chance we can move forward as-is to fix the regression at least. That will allow us to update argo-cd that's lagging being 2 major release right now. As the behavior makes argo-cd unusable for us given the size of our repo.

Copy link

codecov bot commented Jan 14, 2025

Codecov Report

Attention: Patch coverage is 40.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 55.27%. Comparing base (c3600d2) to head (cc2a012).
Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
reposerver/repository/repository.go 40.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #20962      +/-   ##
==========================================
+ Coverage   55.21%   55.27%   +0.05%     
==========================================
  Files         337      337              
  Lines       56840    56843       +3     
==========================================
+ Hits        31387    31419      +32     
+ Misses      22757    22736      -21     
+ Partials     2696     2688       -8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@leoluz leoluz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey JP! Long time no see :)

@jpbelangerupgrade Can you please clarify in the PR description the real impact of this supposed regression?
It seems that you are trying to pre-clone the monorepo and have the repo-server using it. Can you pls confirm? Why is this a hard requirement for monorepos? I'd like to understand how impactful this regression really is.

reposerver/repository/repository.go Outdated Show resolved Hide resolved
@@ -172,7 +172,12 @@ func (s *Service) Init() error {
closer := s.gitRepoInitializer(fullPath)
if repo, err := gogit.PlainOpen(fullPath); err == nil {
if remotes, err := repo.Remotes(); err == nil && len(remotes) > 0 && len(remotes[0].Config().URLs) > 0 {
s.gitRepoPaths.Add(git.NormalizeGitURL(remotes[0].Config().URLs[0]), fullPath)
keyData, err := json.Marshal(map[string]string{"url": git.NormalizeGitURL(remotes[0].Config().URLs[0]), "project": ""})
Copy link
Collaborator

@leoluz leoluz Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blakepettersson Do you see an issue about setting the project with an empty string at this stage? User is claiming that the introduction of the project in the key broke this flow. Can you please confirm if this is indeed a regression of #18388?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also updated PR description with more details and link to another issue we were discussing for alternative

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpbelangerupgrade Can you pls confirm how you use AppProject at your company? Do you have a single AppProject for all your apps, 1:1 or everything goes in the default project??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To follow up on @leoluz's question, I presume you have a single (non-project scoped) credential for the repo (I presume yes since you have not yet upgraded to 2.12)?

My concern with this approach is that while this may solve your immediate predicament, this would break once/if a project scoped repo credential is added for the same repo url, since that would involve another git clone. Thinking out loud, it might be worthwhile having a way to configure these cache mappings in the repo-server which it would then use to initialize the cache mappings? Something like Repo URL -> n projects + paths? IMO this needs a bit more thought before saying yay nor nay.

Ultimately, since your repo is ginormous (20gb+) I think alternative ways should be considered (e.g #11198 and/or something more immediate, e.g this comment).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shallow and sparse checkouts aren't a viable short-term option. I'm really not sure they're even a viable long-term option. We really should be able to support 20GB repos without too much pain.

I'm trying to make sure I understand why we don't have the project name available in Service.Init. It looks to me as if we're reconstructing the git paths cache from the repos found on disk. If we need additional info from storage (i.e. the project(s) associated with each repo dir), why don't we just write that info to disk so it's available when we need it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I feel reasonably okay w/ multiple projects sharing the same cache dir. Even though different credentials could technically expose different contents, I think that would be super atypical. The permissions granularity in every SCM afaik is "clone the repo" or "not clone the repo."

Copy link
Collaborator

@leoluz leoluz Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I feel reasonably okay w/ multiple projects sharing the same cache dir. Even though different credentials could technically expose different contents, I think that would be super atypical. The permissions granularity in every SCM afaik is "clone the repo" or "not clone the repo."

@crenshaw-dev Do you mean, removing the project from the cache key?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first comment is proposing keeping the cache key, but storing it on disk so it's available for cache restoration.

My second comment is, yep, might make sense to just remove the project from the cache key.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which entity is the owner of the cached revision? Would it be the project or the credentials used to access the repo?

Copy link
Member

@crenshaw-dev crenshaw-dev Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, I think it makes sense to say that the credential owns the cached revision. But practically, I think project serves as a reasonable proxy for credentials which may or may not be different. At worst, we treat two projects as different when the credentials are actually the same.

But I'm not sure it makes sense to differentiate cache ownership by credential or project. I feel like it's safe to assume that each credential has the same read access to the repo (full clone access) and that any write operations performed by applications in different projects are protected by the concurrent access lock (i.e. one project can't lift a secret that was written to disk by a different project).

But I think it would be fair to review any write operations that occur during manifest generation (for example, modifications to files in the .git config repository) to ensure that 1) they're protected by the concurrent access lock and 2) sensitive changes are cleaned up between operations.

Signed-off-by: Jean-Philippe Bélanger <30598824+jpbelangerupgrade@users.noreply.github.com>
@jpbelangerupgrade
Copy link
Author

jpbelangerupgrade commented Jan 16, 2025

Hey JP! Long time no see :)

@jpbelangerupgrade Can you please clarify in the PR description the real impact of this supposed regression? It seems that you are trying to pre-clone the monorepo and have the repo-server using it. Can you pls confirm? Why is this a hard requirement for monorepos? I'd like to understand how impactful this regression really is.

Long time no see indeed. Thank for taking a look. Updated the description with more details and impact on our side.

@jpbelangerupgrade
Copy link
Author

I did a validation of before and after this change. One thing that @leoluz was mentioning was that my change would regress the current behavior of the project based auth.

Based on my test it doesn't:

  1. If a user leverage project based repo clone, this PR change won't do anything. The repo will be added as default but also, the repo will get recloned for the required project on first use. So this will be of no impact for this use-case
  2. If a user doesn't use projects, which is probably most people (like me), then this will behave as before and allow user to reuse existing clone instead of recloning on each startup of repo-server

@crenshaw-dev and @blakepettersson this might be of interest to figure out the next step here.

Running v2.13.1
Pre-Cloning enabled
drwx------. 30 argocd argocd 16384 Feb  3 15:07 461130fb-e5de-4983-a8b8-7f5484835df8
drwxr-xr-x.  3 argocd argocd    18 Feb  3 15:17 f6e0b016-48c9-4893-8601-771d433a06b7
d---------. 30 argocd root   16384 Feb  3 15:02 k8s-manifests

Default init
time="2025-02-03T15:04:03Z" level=info msg="Initializing https://github.com/credify/k8s-manifests.git to /tmp/_argocd-repo/461130fb-e5de-4983-a8b8-7f5484835df8"
time="2025-02-03T15:07:46Z" level=info msg=Trace args="[git fetch origin --tags --force --prune]" dir=/tmp/_argocd-repo/461130fb-e5de-4983-a8b8-7f5484835df8 operation_name="exec git" time_ms=222532.09260099998

Project based init
time="2025-02-03T15:17:48Z" level=info msg="Initializing https://github.com/credify/k8s-manifests to /tmp/_argocd-repo/f6e0b016-48c9-4893-8601-771d433a06b7"
level=info msg=Trace args="[git fetch origin --tags --force --prune]" dir=/tmp/_argocd-repo/f6e0b016-48c9-4893-8601-771d433a06b7 operation_name="exec git" time_ms=484082.929101


Results:
- Pre-cloned repo is unused. Associated to a cache item that is never able to be looked up
- f6e0b016-48c9-4893-8601-771d433a06b7 is a project based clone
- 461130fb-e5de-4983-a8b8-7f5484835df8 is the default (no project) clone

——
Updated to PR code w/ pre-cloning enabled

total 16
drwxr-xr-x.  3 argocd root    27 Feb  3 18:25 .
drwxrwxrwx.  3 root   root    58 Feb  3 18:32 ..
d---------. 30 argocd root 16384 Feb  3 18:31 k8s-manifests

time="2025-02-03T18:39:52Z" level=info msg="Initializing https://github.com/credify/k8s-manifests to /tmp/_argocd-repo/7131c06f-3577-4da1-95c8-6f72afbbc22f"
time="2025-02-03T18:46:35Z" level=info msg=Trace args="[git fetch origin --tags --force --prune]" dir=/tmp/_argocd-repo/7131c06f-3577-4da1-95c8-6f72afbbc22f operation_name="exec git" time_ms=402383.445823
time="2025-02-03T18:46:35Z" level=error msg="finished unary call with code Unknown" error="failed to initialize repository resources: rpc error: code = Internal desc = Failed to fetch default: `git fetch origin --tags --force --prune` failed timeout after 1m30s" grpc.code=Unknown grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2025-02-03T18:39:52Z" grpc.time_ms=403061.62 span.kind=server system=grpc

No repo initializing done for “default” reused the pre-clone
total 32
drwxr-xr-x.  4 argocd root      71 Feb  3 18:39 .
drwxrwxrwx.  3 root   root      58 Feb  3 18:32 ..
drwx------. 30 argocd argocd 16384 Feb  3 18:56 7131c06f-3577-4da1-95c8-6f72afbbc22f
d---------. 30 argocd root   16384 Feb  3 18:31 k8s-manifests

Results:
- Pre-cloned repo was reused. And no other clone was done
- 7131c06f-3577-4da1-95c8-6f72afbbc22f is a project based clone

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Changes Requested
Development

Successfully merging this pull request may close these issues.

8 participants