-
Notifications
You must be signed in to change notification settings - Fork 881
chore: remove stats batcher and workspace usage tracker #13393
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I was under the impression that |
Yep, up to 1024 stats updates. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to remove this if it's not enough of a net positive!
The batcher is totally valid, but my concern is that we are only saving 1 db call when reporting the stat already takes 7-8 db calls to get to that point, and continues to use the DB afterwards too. In return we lose the ability to return an error on failed writes, and a good bit of code to manage this buffer. |
Going to continue removing some batchers on this branch and going to scaletest the impact to see if these changes are safe to merge |
I see, that's fair reasoning. I believe writing to this table (agent stats) used to be more expensive as we kept 6 months of data in it. Now we only keep about 1 day. Edit: Still, there's a big difference between read and write queries. I assume most of those 8 DB calls are read-only? |
@mafredri Thanks for bringing this up, I do share your concerns since it's a write operation, so I've went ahead and made an alternate PR (#13418) where we keep these structures in place under the |
👍🏻
If we find that they're not beneficial, I'm fine with removing them entirely. We can always dig in the git history if we realize we need them back. |
@@ -146,11 +147,45 @@ func (r *Reporter) ReportAgentStats(ctx context.Context, now time.Time, workspac | |||
|
|||
var errGroup errgroup.Group | |||
errGroup.Go(func() error { | |||
err := r.opts.StatsBatcher.Add(now, workspaceAgent.ID, workspace.TemplateID, workspace.OwnerID, workspace.ID, stats) | |||
start := time.Now() | |||
connectionsByProto := json.RawMessage(`[]`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if [{}]
would be a more accurate representation here? 🤔 Not sure it's necessary though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After performing a 1,000 user scaletest and comparing with results from yesterday. Given a GCP db2-custom-8
instance with 100GB of disk:
- Before:
- With no active connections, seeing DB CPU usage of around 80% on a GCP
db2-custom-8
instance. - With active SSH connections, seeing DB CPU usage of around 90%
- InsertWorkspaceAgentStats averaging 1.2 q/s
- With no active connections, seeing DB CPU usage of around 80% on a GCP
- After:
- With no active connections, seeing DB CPU usage of around 86%
- With active SSH connections, seeing DB CPU usage maxing out.
- InsertWorkspaceAgentStats averaging 33.3 q/s
The increased queries during active connections is likely due to the agents only reporting stats if there are active connections.
I can't be sure of the wider ramifications of this change, but it does definitely cause increased database load in a scaletest environment. I would recommend against merging this.
Thanks for this data, glad to close and happy to know these are working as intended! |
It's become clear that only
workspacestats.ReportAgentStat()
is the only caller ofStatsBatcher.Add()
. Using the stats batcher saves us a DB call, but only 1 call is saved and many are still called outside of the batcher for reporting a stat. I don't think the StatsBatcher caching abstraction is serving us anymore, so I think it should be removed in favor of writing directly to the DB when called.