GitHub Rate Limit #14982


Open
francisco-mata opened this issue Oct 4, 2024 · 19 comments
Labels
customer-reported Bugs reported by enterprise customers. Only humans may set this.

Comments

@francisco-mata

francisco-mata commented Oct 4, 2024

We are hitting this issue on Coder v2.10.3: GitHub Rate Limit #10853

[screenshot: Prometheus metrics for GitHub TokenSource requests with 429 and 200 response codes]

We are not sure whether users are triggering this amount of TokenSource 429 and 200 responses. We started seeing this issue after we accidentally corrupted our DB and had to recover it from a daily backup. The day after the restore, we started seeing the OAuth "too many requests" errors.

@coder-labeler coder-labeler bot added bug customer-reported Bugs reported by enterprise customers. Only humans may set this. labels Oct 4, 2024
@ammario
Member

ammario commented Oct 5, 2024

Hey, I see you're six versions behind at this point. Is updating a practical option?

cc @Emyrk, who is probably the best person on our side to investigate.

@francisco-mata
Author

We will keep the current version for a while. We now have monitoring for this issue with Prometheus.

Since Friday we have stopped seeing the issue. If we have new findings I will report them here. We are also considering upgrading the server version :)

@Emyrk
Member

Emyrk commented Oct 9, 2024

What may have resolved your issue is that all the tokens hit the expiration time stored in the database, so we could invalidate them before reaching out to GitHub.

What is strange is that the TokenSource response code seems to be 200 in your metrics. A 200 would not trigger any invalidation code, and for whatever reason the tokens seem to believe they should be continually refreshed.

I will take a longer look into this section of code for infinite-loop-style behavior.

For your specific situation, though, I'd ask what the oauth_expiry column of external_auth_links generally looks like. You might have gone past the original expiry after the DB corruption, resetting all the tokens.

@francisco-mata
Author

@Emyrk is there a safe way to delete all expired tokens directly from the DB?

I found the oauth_expiry column; the query below returns 56 rows. Should I just delete them? Might that break any correlation with another table?

select * from external_auth_links WHERE oauth_expiry < NOW();

@Emyrk
Member

Emyrk commented Oct 14, 2024

@francisco-mata do you also have refresh tokens?

If you delete all rows with an expired OAuth token, those users will have to relink.

An OAuth token can be expired but still have a refresh token, though; in that case the token will be refreshed on the next use.

If the refresh token is also expired, or invalid because it was already used, then the token is essentially useless. But you cannot tell whether the refresh token is expired, as that state is all kept on the IDP's end.

@Emyrk Emyrk self-assigned this Oct 14, 2024
@francisco-mata
Author

Hi @Emyrk,

I will try to delete all the old "expired" rows to refresh that table and see if it's possible to un-trigger the event.

We hit the issue again recently:

[screenshot: Prometheus metrics showing another spike in TokenSource requests]

So we are considering upgrading Coder Server, but when we tried upgrading our Coder stage deployment from 2.10 directly to 2.15, the DB crashed with migration issues. We could only jump to 2.11; from 2.12 onward some of the migrations started to crash.

Example:

2.11 to 2.12

ALTER TABLE template_versions ADD CONSTRAINT template_versions_id_key UNIQUE (id);

2.12 to 2.13

ALTER TABLE users ADD CONSTRAINT users_id_key UNIQUE (id);
ALTER TABLE users ADD PRIMARY KEY (id);

So we are now considering a fresh 2.15 deployment and then restoring the DB onto the new instance, because altering the DB structure without really knowing it seems like a flaky fix.

@Emyrk
Member

Emyrk commented Oct 14, 2024

@francisco-mata I see your status codes are 200. Is there any way we can get insight into the workspace activity?

I ask because GIT_ASKPASS does an external auth query on each git operation, so a workspace running a lot of git CLI commands could be triggering these token validations.

@matifali matifali removed the bug label Oct 14, 2024
@francisco-mata
Author

@Emyrk I don't follow you. We have two Coder + GitHub interactions:
Login using GitHub OAuth and external auth using a GitHub App for workspace operations.

The one that blocks us is the login via OAuth. Can you explain how this token might be used? We expected that every user logging in via GitHub, after clicking the auth button, is sent to GitHub, which creates a token confirming that the user's email belongs to a specific organization and grants access to the Coder WebUI. That would be just one call per user. Is there another process inside the Coder workflow that creates a new OAuth token for this app?

We know the other token, created by the GitHub App, should be used by the workspace for git pull, git fetch, and so on, but it should also be generated once per workspace and be valid for X time.

@Emyrk
Member

Emyrk commented Oct 17, 2024

Login using GitHub OAuth and external auth using a GitHub App for workspace operations.

I am assuming the large number of requests comes from the external auth used for workspace operations. There should be a name property on the metrics to tell you whether it is github-login or the external auth app. It might be worth separating TokenSource by the name label so we can know for sure.

We know the other token, created by the GitHub App, should be used by the workspace for git pull, git fetch, and so on, but it should also be generated once per workspace and be valid for X time.

Every git operation can trigger a ValidateToken request to GitHub. This is because a token can be revoked prematurely, so it is impossible to tell whether a token will work without making a request.

You are correct, though, that TokenSource should only be hit for an expired auth token.
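The per-git-operation behavior described here can be made concrete with a small Go sketch. The names (`validateToken`, `gitOperation`) are illustrative stand-ins, not Coder's real functions; the point is only the one-to-one mapping between git commands and provider requests:

```go
package main

import "fmt"

// validateCalls counts upstream validation requests made to the provider.
var validateCalls int

// validateToken simulates asking the provider whether a token still works.
// A locally cached "looks valid" answer is not enough, because the provider
// may have revoked the token at any time; only a live request can tell.
func validateToken(token string) bool {
	validateCalls++
	return token != "" // stand-in for the provider's actual answer
}

// gitOperation models one git clone/pull/fetch inside a workspace: the
// GIT_ASKPASS helper re-validates the token before each use.
func gitOperation(token string) bool {
	return validateToken(token)
}

func main() {
	for i := 0; i < 3; i++ {
		gitOperation("tok_123")
	}
	// One provider request per git operation: a busy workspace (or a script
	// running git in a loop) multiplies these calls accordingly.
	fmt.Println(validateCalls)
}
```

A workspace script running `git fetch` in a tight loop would therefore generate validation traffic at the loop's frequency, which matches the request spikes seen in the metrics.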

@francisco-mata
Author

francisco-mata commented Oct 18, 2024

Awesome, I didn't see the name label... really useful. You are correct: the issue seems to be on the external GitHub app. We need to check with the template owners to see which workspace could be triggering this loop of calls.

Is there a way to get visibility into which workspace might be triggering these OAuth requests? Or maybe via the network on the pod?

Could Coder alert on, or block, a workspace if it creates 500 requests/min or X requests/hr?

@Emyrk
Member

Emyrk commented Oct 21, 2024

@francisco-mata There is no way today to know which workspace it is. Unfortunately I decided against adding labels for workspaces, as workspace IDs are an unbounded set, and Prometheus does not like unbounded label cardinality.

Could Coder alert on, or block, a workspace if it creates 500 requests/min or X requests/hr?

This was an idea posted here: #11843. We had no reports of this issue for a while, so we closed it. Might be worth bringing back.

@Emyrk Emyrk removed their assignment Oct 22, 2024
@francisco-mata
Author

@Emyrk it seems like a good feature to have in place, so you can protect enterprise infrastructure from a Coder misconfiguration or a misbehaving user. We moved forward with our plan to deploy 2.15; we had some migration issues we couldn't find a solution for, so we only migrated templates and started workspaces from scratch.

@Emyrk
Member

Emyrk commented Nov 20, 2024

I am unable to identify any sort of loop in Coder that blows up the TokenSource function.

A solution I can (and will today) implement is to not try to refresh a token we know will not work. Currently, if a refresh token is invalid, we will continually retry it on each invocation. This change should limit the retry count for any single token.

I do not think the spec has a standard "bad refresh token" response, so I'm unsure whether this will work universally: I need to know the error is with the refresh token, and not some unrelated error that a retry might fix.

I have this idea in my head; I will look into the implementation details and see how bulletproof I can make it. From what I can tell, it should prevent these excessive calls.

@francisco-mata
Author

francisco-mata commented Nov 20, 2024

Hi @Emyrk, we finally upgraded to the latest version and restored the templates. This issue blocks all Coder operations, so we needed to do something.

We couldn't restore the whole DB due to the migration issues, but we figured it out: a full wipe of all users/workspaces, restoring only templates on the latest version. Even so, as you say, it's better to be bulletproof in this case, because hitting this issue blocks all auth into Coder via GitHub.

We are also exploring the idea of having multiple auth IDs for the same Coder user so they can link to the same workspace inside Coder. We usually use LDAP on some of our internal applications, and you can correlate the same email address in LDAP and GitHub as the same Coder user.

Regards

Regards

@Emyrk
Member

Emyrk commented Nov 20, 2024

@francisco-mata there is likely still value in creating an internal rate limit. My hesitancy is requiring even more configuration to make that work across all the different types of IDPs.

In the short term, I have this PR, #15608, that might fix the issue. Essentially, if a problematic token is spamming because of a bad refresh token, this reduces its spam to one refresh call per token, rather than one refresh call per API request (git action, loading the account settings UI page, etc.).

This will only fix it if the problem is related to refresh tokens, which I suspect it might be.

@Emyrk
Member

Emyrk commented Dec 16, 2024

The same way I solved this for invalid refresh tokens, we should solve it for revoked tokens: #15890

@davo-canva

FWIW, GitHub has some guidelines around hitting their rate limits, which include a few headers with information about when the rate limits reset.

https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28#exceeding-the-rate-limit

@Emyrk
Member

Emyrk commented Dec 20, 2024

@davo-canva yes, we should take this into account. Technically the external apps are generic OAuth apps, and I'm not sure how consistent these headers are across services.

Considering these tokens are validated via some user action, like git clone, the best we can do is block the request locally. Very unfortunate, but better than a 429.

@bpmct
Member

bpmct commented Jan 4, 2025

Moving this as a candidate for our next sprint to respect GitHub rate limits: #16040
