
feat(coderd/database/dbpurge): retain most recent agent build logs #14460


Merged
merged 9 commits into from
Aug 30, 2024

Conversation

johnstcn
Member

@johnstcn johnstcn commented Aug 27, 2024

Builds upon #14480
Addresses #10576

Updates the DeleteOldWorkspaceAgentLogs query to:

  • Retain logs for the most recent build of each workspace, regardless of age,
  • Delete logs for agents that never connected and were created before the cutoff, while still retaining the logs of the most recent build.

The motivation for the second change comes from running the following query against the dogfood database:

coder=> SELECT COUNT(*), MIN(workspace_agents.created_at) FROM workspace_agent_logs 
JOIN workspace_agents 
ON workspace_agent_logs.agent_id = workspace_agents.id
WHERE workspace_agents.last_connected_at IS NULL
AND workspace_agents.created_at < (NOW() - INTERVAL '7 days');
 count |              min              
-------+-------------------------------
  2169 | 2023-11-21 21:48:26.339238+00
(1 row)

@johnstcn johnstcn self-assigned this Aug 27, 2024
@johnstcn johnstcn force-pushed the cj/retain-last-build-logs branch 4 times, most recently from 139174a to 34be657 on August 29, 2024 15:59
@johnstcn johnstcn changed the base branch from main to cj/dbpurge-quartz August 29, 2024 15:59
@johnstcn johnstcn changed the title wip: feat(coderd/database/dbpurge): retain most recent agent build logs feat(coderd/database/dbpurge): retain most recent agent build logs Aug 29, 2024
@johnstcn johnstcn force-pushed the cj/retain-last-build-logs branch from 12df09c to 3321a28 on August 30, 2024 09:38

// After dbpurge completes, the ticker is reset. Trap this call.
trapReset := clk.Trap().TickerReset()
func awaitDoTick(ctx context.Context, t *testing.T, clk *quartz.Mock) chan struct{} {
Member Author

addresses #14480 (comment)

@@ -181,139 +173,216 @@ func containsWorkspaceAgentStat(stats []database.GetWorkspaceAgentStatsRow, need

//nolint:paralleltest // It uses LockIDDBPurge.
func TestDeleteOldWorkspaceAgentLogs(t *testing.T) {
db, _ := dbtestutil.NewDB(t)
ctx := testutil.Context(t, testutil.WaitShort)
Member Author

Addresses #14480 (comment)

Base automatically changed from cj/dbpurge-quartz to main August 30, 2024 10:55
@johnstcn johnstcn force-pushed the cj/retain-last-build-logs branch from 31c3351 to 0f8ab25 on August 30, 2024 11:00
Comment on lines +34 to +35
// Start the ticker with the initial delay.
ticker := clk.NewTicker(delay)
Member Author

Review: I found this to be the source of a race condition while testing.
I moved the initial 'tick' to be inside the goroutine below instead.

Member

What was the race condition?

Member Author

It may actually be the same race condition you mentioned below, just moved around slightly.

now := dbtime.Now()

//nolint:paralleltest // It uses LockIDDBPurge.
t.Run("AgentHasNotConnectedSinceWeek_LogsExpired", func(t *testing.T) {
Member Author

Review: I collapsed both of the sub-tests into a single test with more breadth.

Comment on lines 190 to +233
-- If an agent hasn't connected in the last 7 days, we purge its logs.
-- Exception: if the logs are related to the latest build, we keep those around.
-- Logs can take up a lot of space, so it's important we clean up frequently.
-- name: DeleteOldWorkspaceAgentLogs :exec
DELETE FROM workspace_agent_logs WHERE agent_id IN
(SELECT id FROM workspace_agents WHERE last_connected_at IS NOT NULL
AND last_connected_at < @threshold :: timestamptz);
WITH
latest_builds AS (
SELECT
workspace_id, max(build_number) AS max_build_number
FROM
workspace_builds
GROUP BY
workspace_id
),
old_agents AS (
SELECT
wa.id
FROM
workspace_agents AS wa
JOIN
workspace_resources AS wr
ON
wa.resource_id = wr.id
JOIN
workspace_builds AS wb
ON
wb.job_id = wr.job_id
LEFT JOIN
latest_builds
ON
latest_builds.workspace_id = wb.workspace_id
AND
latest_builds.max_build_number = wb.build_number
WHERE
-- Filter out the latest builds for each workspace.
latest_builds.workspace_id IS NULL
AND CASE
-- If the last time the agent connected was before @threshold
WHEN wa.last_connected_at IS NOT NULL THEN
wa.last_connected_at < @threshold :: timestamptz
-- The agent never connected, and was created before @threshold
ELSE wa.created_at < @threshold :: timestamptz
END
)
DELETE FROM workspace_agent_logs WHERE agent_id IN (SELECT id FROM old_agents);
Member Author

@johnstcn johnstcn Aug 30, 2024

This is the main change here. I welcome any further scrutiny or suggestions of this query.

I realize that old_agents can potentially get pretty big, but #14340 may help address this.

Member

Would be interesting to see a https://explain.dalibo.com run against dogfood. If the Postgres query planner handles this query well, it should hopefully be able to minimize the work in the CTE, but since it's a CTE it might fail at that optimization. It might actually produce a better result to move the old_agents CTE inline into the DELETE.

Member Author

I ran the query and two variants (one with the old_agents CTE inlined, and one with both CTEs inlined).

The plan looks basically the same for all three:

Member

Thanks! Looks like there's not much difference, and the original is easiest to plan for, so we might as well leave it. Issues may surface at a larger scale, but I think it's alright to deal with those as they appear.

(Side note: So many sequential scans though, guess we don't have indexes to optimize this join order :/.)

@johnstcn johnstcn marked this pull request as ready for review August 30, 2024 11:05
Member

@mtojek mtojek left a comment

Correct me if I'm wrong: with this change you keep the latest workspace build logs because the agent might not have connected yet?

If I understood you well, then 👍 from me.

@johnstcn
Member Author

Correct me if I'm wrong: with this change you keep the latest workspace build logs because the agent might not have connected yet?

Correct, that was the overall consensus from #10576

if !toBeDeleted {
validLogs = append(validLogs, log)
/*
old_agents AS (
Contributor

This is really nice!

I'm concerned it'll go stale...
Although it could be argued all the business logic duplicated in this file is stale the second it's written.
We really need to get rid of dbmem.

@@ -72,13 +75,15 @@ func New(ctx context.Context, logger slog.Logger, db database.Store, clk quartz.
go func() {
defer close(closed)
defer ticker.Stop()
// Force an initial tick immediately.
ticker.Reset(time.Nanosecond)
Contributor

Won't this set the ticker's new period to a nanosecond?

I believe this is the idiomatic way to tick immediately:

ticker := time.NewTicker(period)
defer ticker.Stop()
for ; true; <-ticker.C {
    ...
}

Member Author

Won't this set the ticker's new period to a nanosecond?

Yes, briefly, until the second case of the select is reached. Then the ticker is stopped and doTick() executes. At the end of doTick(), the ticker is reset with the longer delay.

Member

The correcting reset happens in doTick, so yes and no. This does, however, introduce a race where we may tick twice instead of once on startup:

  1. tick reset nano
  2. tick
  3. <-ticker.C
  4. tick
  5. ticker.Stop()
  6. doTick()
  7. <-ticker.C (immediately after because of 4)

Member Author

The "immediate tick" idiom you mention may not play nicely with <-ctx.Done()

Member

@mafredri mafredri Aug 30, 2024

Ah yeah, that's very possible. I didn't think about it when I commented, but the original code is also not impervious to this race.

An alternative non-racy way is to kick it off with a different channel:

var tickerC <-chan time.Time
initC := make(chan time.Time, 1)
initC <- time.Time{}
close(initC)
tickerC = initC

// ...

	case tick := <-tickerC:
		tickerC = ticker.C
		doTick(...)

(Or a separate case if that's preferable.)

Member Author

An alternative is to just call doTick() immediately before the for-select loop.

Member

Get out of here @johnstcn with your good and reasonable ideas!

Member Author

I think the "initCh" approach definitely has merits, but I'm not sure it's necessary here, as the context is still available in doTick. :)

Member

@mafredri mafredri left a comment

Except for the ticker init change, which may produce two initial ticks, LGTM.

func mustCreateAgentWithLogs(ctx context.Context, t *testing.T, db database.Store, user database.User, org database.Organization, tmpl database.Template, tv database.TemplateVersion, agentLastConnectedAt time.Time, output string) uuid.UUID {
agent := mustCreateAgent(t, db, user, org, tmpl, tv)
return ch
}
Member

Nice 👍🏻

@johnstcn johnstcn merged commit 0f8251b into main Aug 30, 2024
27 checks passed
@johnstcn johnstcn deleted the cj/retain-last-build-logs branch August 30, 2024 16:39
@github-actions github-actions bot locked and limited conversation to collaborators Aug 30, 2024