chore: prevent nil derefs in non-critical paths #11411

Emyrk · 2024-01-04T15:39:50Z

Similar PR here: https://github.com/coder/coder/pull/9892/files

I could not reproduce this, so it must just be flakey. This obviously sidesteps the problem, because ideally we just track down why the times are nil. However all this does is print a message with a duration to the UI. It is not served in any stats, or used in anything besides the build logs.

So I think this solution is fine just to prevent these in the future.

mafredri · 2024-01-04T17:26:32Z

This is an interesting one. I was quite sure that this shouldn't happen since the API is supposed to give us guarantees about which fields are available based on what lifecycle state we're in. So I did some digging and quite a bit of head scratching later, I do think we have a bug, but it's not here.

If we look at the agent logs, it's reporting in as it should:

2023-12-21T11:27:49.6967595Z     t.go:84: 2023-12-21 11:26:24.479 [debu]  agent: set lifecycle state  current={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}  last={"state":"created","changed_at":"0001-01-01T00:00:00Z"}
2023-12-21T11:27:49.6971361Z     t.go:84: 2023-12-21 11:26:24.480 [debu]  agent: reporting lifecycle state  payload={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}
2023-12-21T11:27:49.6974135Z     t.go:84: 2023-12-21 11:26:24.480 [debu]  agent: sdk request  method=POST  url=http://localhost:64379/api/v2/workspaceagents/me/report-lifecycle  body=""
2023-12-21T11:27:49.6977363Z     t.go:84: 2023-12-21 11:26:24.480 [debu]  agent: set lifecycle state  current={"state":"ready","changed_at":"2023-12-21T11:26:24.480422Z"}  last={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}
2023-12-21T11:27:49.7020898Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  agent: sdk response  method=POST  url=http://localhost:64379/api/v2/workspaceagents/me/report-lifecycle  status=204  body=""  trace_id=""  span_id=""
2023-12-21T11:27:49.7026281Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  agent: reporting lifecycle state  payload={"state":"ready","changed_at":"2023-12-21T11:26:24.480422Z"}
2023-12-21T11:27:49.7027697Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  agent: sdk request  method=POST  url=http://localhost:64379/api/v2/workspaceagents/me/report-lifecycle  body=""
2023-12-21T11:27:49.7039717Z     t.go:84: 2023-12-21 11:26:24.487 [debu]  agent: sdk response  method=POST  url=http://localhost:64379/api/v2/workspaceagents/me/report-lifecycle  status=204  body=""  trace_id=""  span_id=""

Likewise, the API is seeing what we expect it to see:

2023-12-21T11:27:49.7003352Z     t.go:84: 2023-12-21 11:26:24.482 [debu]  workspace agent state report  workspace_agent_id=ff758f63-c35e-4767-a37e-52f9c937f835  workspace_id=c18b2771-df4b-42d5-a3f1-a014910197b8  payload={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}  request_id=3ce9c96a-feb5-4f09-803a-d3ec49016807
2023-12-21T11:27:49.7013125Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  POST  host=localhost:64379  path=/api/v2/workspaceagents/me/report-lifecycle  proto=HTTP/1.1  remote_addr=127.0.0.1  start="2023-12-21T11:26:24.4822116Z"  took=2.1672ms  status_code=204  latency_ms=2  request_id=3ce9c96a-feb5-4f09-803a-d3ec49016807
2023-12-21T11:27:49.7034778Z     t.go:84: 2023-12-21 11:26:24.485 [debu]  workspace agent state report  workspace_agent_id=e16a4d48-6057-425b-afcb-2d5ca5660e49  workspace_id=c18b2771-df4b-42d5-a3f1-a014910197b8  payload={"state":"ready","changed_at":"2023-12-21T11:26:24.480422Z"}  request_id=db18b534-cd87-48b5-96d8-fc919bb72a8e
2023-12-21T11:27:49.7037500Z     t.go:84: 2023-12-21 11:26:24.486 [debu]  POST  host=localhost:64379  path=/api/v2/workspaceagents/me/report-lifecycle  proto=HTTP/1.1  remote_addr=127.0.0.1  start="2023-12-21T11:26:24.4855906Z"  took=1.1765ms  status_code=204  latency_ms=1  request_id=db18b534-cd87-48b5-96d8-fc919bb72a8e

However, a bit later it seems like either the started_at field has not been updated, or it has been overwritten by stale data:

2023-12-21T11:27:49.7086907Z     t.go:84: 2023-12-21 11:26:24.574 [debu]  accepting agent details  request_id=1afb922e-aa48-4a48-b427-36854b703ada  agent={"id":"e16a4d48-6057-425b-afcb-2d5ca5660e49","created_at":"2023-12-21T11:26:24.483817Z","updated_at":"2023-12-21T11:26:24.483817Z","name":"example","first_connected_at":{"Time":"0001-01-01T00:00:00Z","Valid":false},"last_connected_at":{"Time":"0001-01-01T00:00:00Z","Valid":false},"disconnected_at":{"Time":"0001-01-01T00:00:00Z","Valid":false},"resource_id":"b0af077e-bc12-403b-acd7-bee021dcbac7","auth_token":"9d39a8d6-0d75-456e-90f1-546b6575a011","auth_instance_id":{"String":"","Valid":false},"architecture":"","environment_variables":{"RawMessage":null,"Valid":false},"operating_system":"","instance_metadata":{"RawMessage":null,"Valid":false},"resource_metadata":{"RawMessage":null,"Valid":false},"directory":"","version":"","last_connected_replica_id":null,"connection_timeout_seconds":0,"troubleshooting_url":"","motd_file":"","lifecycle_state":"ready","expanded_directory":"","logs_length":0,"logs_overflowed":false,"started_at":{"Time":"0001-01-01T00:00:00Z","Valid":false},"ready_at":{"Time":"2023-12-21T11:26:24.480422Z","Valid":true},"subsystems":null,"display_apps":null,"api_version":""}

As you can see, it's null/invalid even though the agent pushed the updated and coderd saw it, before the update for ready.

Since this is the in-memory database, we can assume that the HTTP 204 response means the data was most definitely written. So how did we lose it? We even send lifecycle updates synchronously, so there should be no risk that start/ready concurrently competing for the update.

As for this PR, I'm not opposed to handling this so that we can guarantee there are no panics, but I wouldn't want to pass it silently off as 0, either. We could at least indicate to the user that it's a non-normal state. Maybe returning -1 from the safe func and then printing [unknown], or similar?

Emyrk · 2024-01-04T18:32:18Z

Trying to see if we just forgot to do some assignment in dbmem...

Found something unrelated: #11412

Emyrk · 2024-01-04T18:36:16Z

A duration of <0 gets overwritten here:

coder/cli/cliui/provisionerjob.go

Lines 232 to 234 in b3613ec

    
           if duration < 0 { 
        
           	duration = 0 
        
           }

If we want to fail non-silently, we can write some line to the stage writer?

mafredri · 2024-01-04T19:04:33Z

A duration of <0 gets overwritten here:

coder/cli/cliui/provisionerjob.go

Lines 232 to 234 in b3613ec

if duration < 0 {

duration = 0

}

Oh yeah, that was to protect against us occasionally showing negative millis.

If we want to fail non-silently, we can write some line to the stage writer?

Sure 👍🏻

Emyrk · 2024-01-04T19:16:05Z

@mafredri I added the logging of nil times.

Emyrk · 2024-01-04T19:20:56Z

@mafredri that makes the test fail anyways 🤦

https://github.com/coder/coder/actions/runs/7413833011/job/20173631242?pr=11411#step:5:384

Flake happened

mafredri · 2024-01-04T19:28:25Z

cli/cliui/agent.go

 			for agent.Status == codersdk.WorkspaceAgentDisconnected {
 				if agent, err = fetch(); err != nil {
 					return xerrors.Errorf("fetch: %w", err)
 				}
 			}
-			sw.Complete(stage, agent.LastConnectedAt.Sub(disconnectedAt))
+			sw.Complete(stage, safeDuration(sw, agent.LastConnectedAt, agent.DisconnectedAt))


This failed the new log change because the logic changed here, we handled the nil time by checking status before, but now we log instead.

Or wait, that doesn't sound quite right.

Ok third comments the charm. You need to keep handling @ disconnectedAt := *agent.DisconnectedAt because we want the cached disconnect value after we've reconnected.

Pushed a fix.

mafredri · 2024-01-04T19:34:42Z

cli/cliui/agent.go

+// A panic has been observed in a test.
+func safeDuration(sw *stageWriter, a, b *time.Time) time.Duration {
+	if a == nil || b == nil {
+		if sw != nil {


No need for sw to be an optional param here, IMO 🤔 (i.e. no need to check).

I just really didn't want to accidently introduce a new place to deref a nil lol

spikecurtis · 2024-01-05T04:51:05Z

This smells like db corruption because of a read-modify-update cycle not protected by a transaction

spikecurtis · 2024-01-05T05:57:34Z

Ok, I have a theory:

check out the timing of the API logs:

2023-12-21T11:27:49.7003352Z     t.go:84: 2023-12-21 11:26:24.482 [debu]  workspace agent state report  workspace_agent_id=ff758f63-c35e-4767-a37e-52f9c937f835  workspace_id=c18b2771-df4b-42d5-a3f1-a014910197b8  payload={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}  request_id=3ce9c96a-feb5-4f09-803a-d3ec49016807
2023-12-21T11:27:49.7013125Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  POST  host=localhost:64379  path=/api/v2/workspaceagents/me/report-lifecycle  proto=HTTP/1.1  remote_addr=127.0.0.1  start="2023-12-21T11:26:24.4822116Z"  took=2.1672ms  status_code=204  latency_ms=2  request_id=3ce9c96a-feb5-4f09-803a-d3ec49016807
2023-12-21T11:27:49.7034778Z     t.go:84: 2023-12-21 11:26:24.485 [debu]  workspace agent state report  workspace_agent_id=e16a4d48-6057-425b-afcb-2d5ca5660e49  workspace_id=c18b2771-df4b-42d5-a3f1-a014910197b8  payload={"state":"ready","changed_at":"2023-12-21T11:26:24.480422Z"}  request_id=db18b534-cd87-48b5-96d8-fc919bb72a8e
2023-12-21T11:27:49.7037500Z     t.go:84: 2023-12-21 11:26:24.486 [debu]  POST  host=localhost:64379  path=/api/v2/workspaceagents/me/report-lifecycle  proto=HTTP/1.1  remote_addr=127.0.0.1  start="2023-12-21T11:26:24.4855906Z"  took=1.1765ms  status_code=204  latency_ms=1  request_id=db18b534-cd87-48b5-96d8-fc919bb72a8e

The API call to set the lifecycle to "starting" begins at 2023-12-21T11:26:24.4822116Z

However, the "created_at" time of the agent row is: 2023-12-21T11:26:24.483817Z, that is about a millisecond after the "starting" lifecycle.

Then finally the "ready" lifecycle update comes in at 2023-12-21T11:26:24.4855906Z

So, the "starting" lifecycle update was processed before the agent row was created, which explains why the data is missing.

But, the "starting" lifecycle update was successful, which doesn't make sense if the agent row didn't exist. However, the agent doesn't actually know it's own ID, it uses /api/v2/workspaceagents/me/report-lifecycle to report lifecycle, and coderd looks up the agent ID using the auth token. In the test that failed, we create the workspace, then stop it before running the SSH command which then starts a new build. Crucially, we hard-code the authentication token in the template version, which is a limitation of the Echo provisioner. So, when the "starting" lifecycle change was POSTed, Coderd found the agent from the previous build and set the time there.

Then later, when "ready" posted, the new agent row existed, and so Coderd found the new agent (it always looks for the latest matching build for the token).

Some take-aways:

The Coder API doesn't enforce that "ready" comes after "starting" in the lifecycle. We rely on the Agent doing the right thing. Bugs in the Agent shouldn't create panics elsewhere, so the API needs to enforce the rules about consistent state (e.g. that ready implies started_at is valid and/or "ready" comes after starting). Also, the Agent and it's token are under end user control.
Token uniqueness with respect to agents is assumed by the way our API works (i.e. paths like /api/v2/workspaceagents/me/report-lifecycle), but Echo provisioner doesn't create unique tokens and we don't enforce uniqueness in the database. @johnstcn once tried to enforce this at the database and found a bunch of tests broke so he gave up. But this bug is evidence that there are lurking issues with this oversight.
The in-mem database doesn't enforce uniqueness of agent IDs. I don't think that contributed directly to this bug, but I noticed it and went down a rabbit hole wondering if we were inserting the same agentID more than once. This is probably endemic to many other types of data in the fake database.

spikecurtis · 2024-01-05T06:04:42Z

Oh, also, https://github.com/coder/coder/blame/58873fa7e2e90cb026938b4cb4a05bcc7007b3bf/coderd/workspaceagents.go#L1755

does suffer from read-modify-update outside a transaction. With a well-behaved agent, it serializes calls to this API, but again, a bugged agent shouldn't cause panics in other components. And, a poorly made template could create two instances of the same agent with the same token, so the assumption that calls are serialized is dangerous.

chore: prevent nil derefs in non-critical paths

b3613ec

Emyrk requested a review from mafredri January 4, 2024 15:39

github-actions bot assigned Emyrk Jan 4, 2024

log line when nil time detected

ae7c34c

mafredri reviewed Jan 4, 2024

View reviewed changes

fix

55258d0

mafredri approved these changes Jan 4, 2024

View reviewed changes

Emyrk merged commit c6366e5 into main Jan 4, 2024

Emyrk deleted the stevenmasley/time_panics branch January 4, 2024 20:51

github-actions bot locked and limited conversation to collaborators Jan 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: prevent nil derefs in non-critical paths #11411

chore: prevent nil derefs in non-critical paths #11411

Emyrk commented Jan 4, 2024

mafredri commented Jan 4, 2024

Emyrk commented Jan 4, 2024

Emyrk commented Jan 4, 2024

mafredri commented Jan 4, 2024

Emyrk commented Jan 4, 2024

Emyrk commented Jan 4, 2024

mafredri Jan 4, 2024

mafredri Jan 4, 2024

mafredri Jan 4, 2024

mafredri Jan 4, 2024

mafredri Jan 4, 2024

Emyrk Jan 4, 2024

spikecurtis commented Jan 5, 2024

spikecurtis commented Jan 5, 2024

spikecurtis commented Jan 5, 2024

chore: prevent nil derefs in non-critical paths #11411

chore: prevent nil derefs in non-critical paths #11411

Conversation

Emyrk commented Jan 4, 2024

mafredri commented Jan 4, 2024

Emyrk commented Jan 4, 2024

Emyrk commented Jan 4, 2024

mafredri commented Jan 4, 2024

Emyrk commented Jan 4, 2024

Emyrk commented Jan 4, 2024

mafredri Jan 4, 2024

Choose a reason for hiding this comment

mafredri Jan 4, 2024

Choose a reason for hiding this comment

mafredri Jan 4, 2024

Choose a reason for hiding this comment

mafredri Jan 4, 2024

Choose a reason for hiding this comment

mafredri Jan 4, 2024

Choose a reason for hiding this comment

Emyrk Jan 4, 2024

Choose a reason for hiding this comment

spikecurtis commented Jan 5, 2024

spikecurtis commented Jan 5, 2024

spikecurtis commented Jan 5, 2024