Skip to content

chore: remove stats batcher and workspace usage tracker #13393

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

f0ssel
Copy link
Contributor

@f0ssel f0ssel commented May 29, 2024

It's become clear that only workspacestats.ReportAgentStat() is the only caller of StatsBatcher.Add(). Using the stats batcher saves us a DB call, but only 1 call is saved and many are still called outside of the batcher for reporting a stat. I don't think the StatsBatcher caching abstraction is serving us anymore, so I think it should be removed in favor of writing directly to the DB when called.

@f0ssel f0ssel requested a review from johnstcn May 29, 2024 15:54
@f0ssel f0ssel requested a review from johnstcn May 29, 2024 18:12
@mafredri
Copy link
Member

I was under the impression that batchstats buffers up to a certain amount of updates, so let's say 100 workspaces all report their agent stats at the same time, all 100 stats would be written in one go. Is this not the case? (Please note, I'm not making a case for or against the batcher here, just wondering if the premise is correct?)

@johnstcn
Copy link
Member

I was under the impression that batchstats buffers up to a certain amount of updates, so let's say 100 workspaces all report their agent stats at the same time, all 100 stats would be written in one go. Is this not the case? (Please note, I'm not making a case for or against the batcher here, just wondering if the premise is correct?)

Yep, up to 1024 stats updates.

johnstcn
johnstcn previously approved these changes May 29, 2024
Copy link
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to remove this if it's not enough of a net positive!

@f0ssel
Copy link
Contributor Author

f0ssel commented May 29, 2024

The batcher is totally valid, but my concern is that we are only saving 1 db call when reporting the stat already takes 7-8 db calls to get to that point, and continues to use the DB afterwards too. In return we lose the ability to return an error on failed writes, and a good bit of code to manage this buffer.

@f0ssel
Copy link
Contributor Author

f0ssel commented May 30, 2024

Going to continue removing some batchers on this branch and going to scaletest the impact to see if these changes are safe to merge

@f0ssel f0ssel changed the title chore: remove stats batcher chore: remove stats batcher and workspace usage tracker May 30, 2024
@mafredri
Copy link
Member

mafredri commented May 30, 2024

The batcher is totally valid, but my concern is that we are only saving 1 db call when reporting the stat already takes 7-8 db calls to get to that point, and continues to use the DB afterwards too. In return we lose the ability to return an error on failed writes, and a good bit of code to manage this buffer.

I see, that's fair reasoning. I believe writing to this table (agent stats) used to be more expensive as we kept 6 months of data in it. Now we only keep about 1 day.

Edit: Still, there's a big difference between read and write queries. I assume most of those 8 DB calls are read-only?

@f0ssel
Copy link
Contributor Author

f0ssel commented May 30, 2024

@mafredri Thanks for bringing this up, I do share your concerns since it's a write operation, so I've went ahead and made an alternate PR (#13418) where we keep these structures in place under the workspacestats.Reporter. I'd also like your perspective on scale testing: If we do a scale test and find that these batchers are not showing any impact (that we measure), would you still want to keep them for posterity or would you be in favor of removing them?

@f0ssel f0ssel requested a review from mafredri May 30, 2024 19:45
@mafredri
Copy link
Member

@mafredri Thanks for bringing this up, I do share your concerns since it's a write operation, so I've went ahead and made an alternate PR (#13418) where we keep these structures in place under the workspacestats.Reporter.

👍🏻

I'd also like your perspective on scale testing: If we do a scale test and find that these batchers are not showing any impact (that we measure), would you still want to keep them for posterity or would you be in favor of removing them?

If we find that they're not beneficial, I'm fine with removing them entirely. We can always dig in the git history if we realize we need them back.

@@ -146,11 +147,45 @@ func (r *Reporter) ReportAgentStats(ctx context.Context, now time.Time, workspac

var errGroup errgroup.Group
errGroup.Go(func() error {
err := r.opts.StatsBatcher.Add(now, workspaceAgent.ID, workspace.TemplateID, workspace.OwnerID, workspace.ID, stats)
start := time.Now()
connectionsByProto := json.RawMessage(`[]`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if [{}] would be a more accurate representation here? 🤔 Not sure it's necessary though.

Copy link
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After performing a 1,000 user scaletest and comparing with results from yesterday. Given a GCP db2-custom-8 instance with 100GB of disk:

  • Before:
    • With no active connections, seeing DB CPU usage of around 80% on a GCP db2-custom-8 instance.
    • With active SSH connections, seeing DB CPU usage of around 90%
    • InsertWorkspaceAgentStats averaging 1.2 q/s
  • After:
    • With no active connections, seeing DB CPU usage of around 86%
    • With active SSH connections, seeing DB CPU usage maxing out.
    • InsertWorkspaceAgentStats averaging 33.3 q/s

The increased queries during active connections is likely due to the agents only reporting stats if there are active connections.

I can't be sure of the wider ramifications of this change, but it does definitely cause increased database load in a scaletest environment. I would recommend against merging this.

@johnstcn johnstcn dismissed their stale review May 31, 2024 13:21

scaletest results

@f0ssel
Copy link
Contributor Author

f0ssel commented May 31, 2024

Thanks for this data, glad to close and happy to know these are working as intended!

@f0ssel f0ssel closed this May 31, 2024
@github-actions github-actions bot locked and limited conversation to collaborators May 31, 2024
@github-actions github-actions bot deleted the f0ssel/remove-stats-batcher-2 branch December 1, 2024 00:07
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants