-
Notifications
You must be signed in to change notification settings - Fork 887
feat(coderd/database/dbpurge): retain most recent agent build logs #14460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
139174a
to
34be657
Compare
a7e0f15
to
e826ae9
Compare
12df09c
to
3321a28
Compare
|
||
// After dbpurge completes, the ticker is reset. Trap this call. | ||
trapReset := clk.Trap().TickerReset() | ||
func awaitDoTick(ctx context.Context, t *testing.T, clk *quartz.Mock) chan struct{} { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addresses #14480 (comment)
@@ -181,139 +173,216 @@ func containsWorkspaceAgentStat(stats []database.GetWorkspaceAgentStatsRow, need | |||
|
|||
//nolint:paralleltest // It uses LockIDDBPurge. | |||
func TestDeleteOldWorkspaceAgentLogs(t *testing.T) { | |||
db, _ := dbtestutil.NewDB(t) | |||
ctx := testutil.Context(t, testutil.WaitShort) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addresses #14480 (comment)
…t builds and delete never-connected workspaces
31c3351
to
0f8ab25
Compare
// Start the ticker with the initial delay. | ||
ticker := clk.NewTicker(delay) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review: I found this to be the source of a race condition while testing.
I moved the initial 'tick' to instead be inside the goroutine below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the race condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may actually be the same race condition you mentioned below, just moved around slightly.
now := dbtime.Now() | ||
|
||
//nolint:paralleltest // It uses LockIDDBPurge. | ||
t.Run("AgentHasNotConnectedSinceWeek_LogsExpired", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review: I collapsed both of the sub-tests into a single test with more breadth.
-- If an agent hasn't connected in the last 7 days, we purge it's logs. | ||
-- Exception: if the logs are related to the latest build, we keep those around. | ||
-- Logs can take up a lot of space, so it's important we clean up frequently. | ||
-- name: DeleteOldWorkspaceAgentLogs :exec | ||
DELETE FROM workspace_agent_logs WHERE agent_id IN | ||
(SELECT id FROM workspace_agents WHERE last_connected_at IS NOT NULL | ||
AND last_connected_at < @threshold :: timestamptz); | ||
WITH | ||
latest_builds AS ( | ||
SELECT | ||
workspace_id, max(build_number) AS max_build_number | ||
FROM | ||
workspace_builds | ||
GROUP BY | ||
workspace_id | ||
), | ||
old_agents AS ( | ||
SELECT | ||
wa.id | ||
FROM | ||
workspace_agents AS wa | ||
JOIN | ||
workspace_resources AS wr | ||
ON | ||
wa.resource_id = wr.id | ||
JOIN | ||
workspace_builds AS wb | ||
ON | ||
wb.job_id = wr.job_id | ||
LEFT JOIN | ||
latest_builds | ||
ON | ||
latest_builds.workspace_id = wb.workspace_id | ||
AND | ||
latest_builds.max_build_number = wb.build_number | ||
WHERE | ||
-- Filter out the latest builds for each workspace. | ||
latest_builds.workspace_id IS NULL | ||
AND CASE | ||
-- If the last time the agent connected was before @threshold | ||
WHEN wa.last_connected_at IS NOT NULL THEN | ||
wa.last_connected_at < @threshold :: timestamptz | ||
-- The agent never connected, and was created before @threshold | ||
ELSE wa.created_at < @threshold :: timestamptz | ||
END | ||
) | ||
DELETE FROM workspace_agent_logs WHERE agent_id IN (SELECT id FROM old_agents); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the main change here. I welcome any further scrutiny or suggestions of this query.
I realize that old_agents
can potentially get pretty big, but #14340 may help address this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be interesting to see a https://explain.dalibo.com against dogfood. If the postgres query analyzer is working good with this query, it should hopefully be able to minimize the work in the CTE, but since it's a CTE it might fail at that optimization. Might actually produce a better result moving the old agents
CTE inline in the delete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran the query and two variants (one with the old_agents
CTE inlined, and one with both CTEs inelined).
The plan looks basically the same for all three:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Looks like there's not much difference and the original is easiest to plan for so might as well leave it. Perhaps issues may surface on a larger scale but I think it's alright to deal with those as they appear.
(Side note: So many sequential scans though, guess we don't have indexes to optimize this join order :/.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong, with this change you leave the latest workspace build logs as the agent might have not connected yet?
If I understood you weel, then 👍 from me.
Correct, that was the overall consensus from #10576 |
if !toBeDeleted { | ||
validLogs = append(validLogs, log) | ||
/* | ||
old_agents AS ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really nice!
I'm concerned it'll go stale...
Although it could be argued all the business logic duplicated in this file is stale the second it's written.
We really need to get rid of dbmem.
coderd/database/dbpurge/dbpurge.go
Outdated
@@ -72,13 +75,15 @@ func New(ctx context.Context, logger slog.Logger, db database.Store, clk quartz. | |||
go func() { | |||
defer close(closed) | |||
defer ticker.Stop() | |||
// Force an initial tick immediately. | |||
ticker.Reset(time.Nanosecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't this set the ticker's new period to a nanosecond?
I believe this is the idiomatic way to tick immediately:
ticker := time.NewTicker(period)
defer ticker.Stop()
for ; true; <- ticker.C {
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't this set the ticker's new period to a nanosecond?
Yes, briefly until the second prong of the select
is reached. Then the ticker is stopped and doTick()
executes. At the end of doTick()
, the ticker is reset with a longer delay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The correction reset happens in doTick
, so yes and no. This does however introduce a race where we may tick two times before instead of once on startup.
- tick reset nano
- tick
- <-ticker.C
- tick
- ticker.Stop()
- doTick()
- <-ticker.C (immediately after because of 4)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "immediate tick" idiom you mention may not play nicely with <-ctx.Done()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah, that's very possible. I didn't think about it when I commented but original code is also not impervious to this race.
An alternative non-racy way is to kick it off with a different channel:
var tickerC <-chan time.Time
initC := make(chan time.Time, 1)
initC <- time.Time{}
close(initC)
tickerC = initC
// ...
case tick := tickerC:
tickerC = ticker.C
doTick(...)
(Or a separate case if that's preferable.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative is to just call doTick()
immediately before the for-select
loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Get out of here @johnstcn with your good and reasonable ideas!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the "initCh" approach definitely has merits, but not sure if it's necessary here as the context is still available in doTick
. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except for the ticker init change which may produce two initial ticks, LGTM.
// Start the ticker with the initial delay. | ||
ticker := clk.NewTicker(delay) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the race condition?
coderd/database/dbpurge/dbpurge.go
Outdated
@@ -72,13 +75,15 @@ func New(ctx context.Context, logger slog.Logger, db database.Store, clk quartz. | |||
go func() { | |||
defer close(closed) | |||
defer ticker.Stop() | |||
// Force an initial tick immediately. | |||
ticker.Reset(time.Nanosecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The correction reset happens in doTick
, so yes and no. This does however introduce a race where we may tick two times before instead of once on startup.
- tick reset nano
- tick
- <-ticker.C
- tick
- ticker.Stop()
- doTick()
- <-ticker.C (immediately after because of 4)
func mustCreateAgentWithLogs(ctx context.Context, t *testing.T, db database.Store, user database.User, org database.Organization, tmpl database.Template, tv database.TemplateVersion, agentLastConnectedAt time.Time, output string) uuid.UUID { | ||
agent := mustCreateAgent(t, db, user, org, tmpl, tv) | ||
return ch | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice 👍🏻
Builds upon #14480
Addresses #10576
Updates the
DeleteOldWorkspaceAgentLogs
to:The motivation for the second changes is from running the following query on the dogfood database: