
chore: prevent nil derefs in non-critical paths #11411


Merged: 3 commits into main from stevenmasley/time_panics on Jan 4, 2024

Conversation

@Emyrk (Member) commented Jan 4, 2024

Similar PR here: https://github.com/coder/coder/pull/9892/files

Fixes: #11308

I could not reproduce this, so it must just be flaky. This admittedly sidesteps the problem; ideally we would track down why the times are nil. However, all this code does is print a message with a duration to the UI. The duration is not surfaced in any stats, or used in anything besides the build logs.

So I think this solution is fine to prevent these panics in the future.
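
Concretely, the guard is a small nil-safe helper along these lines (a sketch only; the log wording and the stageWriter.Log signature are assumptions, see the diff excerpt in the review thread below for the real function header):

func safeDuration(sw *stageWriter, a, b *time.Time) time.Duration {
    if a == nil || b == nil {
        if sw != nil {
            // Make the unexpected nil visible in the build log instead of
            // silently reporting a zero duration.
            sw.Log(time.Now(), codersdk.LogLevelWarn, "Skipping duration, a connection timestamp is unexpectedly nil.")
        }
        return 0
    }
    return a.Sub(*b)
}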

@Emyrk requested a review from mafredri on January 4, 2024 at 15:39
@mafredri (Member) commented Jan 4, 2024

This is an interesting one. I was quite sure this shouldn't happen, since the API is supposed to give us guarantees about which fields are available based on which lifecycle state we're in. After some digging and quite a bit of head scratching, I do think we have a bug, but it's not here.

If we look at the agent logs, it's reporting in as it should:

2023-12-21T11:27:49.6967595Z     t.go:84: 2023-12-21 11:26:24.479 [debu]  agent: set lifecycle state  current={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}  last={"state":"created","changed_at":"0001-01-01T00:00:00Z"}
2023-12-21T11:27:49.6971361Z     t.go:84: 2023-12-21 11:26:24.480 [debu]  agent: reporting lifecycle state  payload={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}
2023-12-21T11:27:49.6974135Z     t.go:84: 2023-12-21 11:26:24.480 [debu]  agent: sdk request  method=POST  url=http://localhost:64379/api/v2/workspaceagents/me/report-lifecycle  body=""
2023-12-21T11:27:49.6977363Z     t.go:84: 2023-12-21 11:26:24.480 [debu]  agent: set lifecycle state  current={"state":"ready","changed_at":"2023-12-21T11:26:24.480422Z"}  last={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}
2023-12-21T11:27:49.7020898Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  agent: sdk response  method=POST  url=http://localhost:64379/api/v2/workspaceagents/me/report-lifecycle  status=204  body=""  trace_id=""  span_id=""
2023-12-21T11:27:49.7026281Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  agent: reporting lifecycle state  payload={"state":"ready","changed_at":"2023-12-21T11:26:24.480422Z"}
2023-12-21T11:27:49.7027697Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  agent: sdk request  method=POST  url=http://localhost:64379/api/v2/workspaceagents/me/report-lifecycle  body=""
2023-12-21T11:27:49.7039717Z     t.go:84: 2023-12-21 11:26:24.487 [debu]  agent: sdk response  method=POST  url=http://localhost:64379/api/v2/workspaceagents/me/report-lifecycle  status=204  body=""  trace_id=""  span_id=""

Likewise, the API is seeing what we expect it to see:

2023-12-21T11:27:49.7003352Z     t.go:84: 2023-12-21 11:26:24.482 [debu]  workspace agent state report  workspace_agent_id=ff758f63-c35e-4767-a37e-52f9c937f835  workspace_id=c18b2771-df4b-42d5-a3f1-a014910197b8  payload={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}  request_id=3ce9c96a-feb5-4f09-803a-d3ec49016807
2023-12-21T11:27:49.7013125Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  POST  host=localhost:64379  path=/api/v2/workspaceagents/me/report-lifecycle  proto=HTTP/1.1  remote_addr=127.0.0.1  start="2023-12-21T11:26:24.4822116Z"  took=2.1672ms  status_code=204  latency_ms=2  request_id=3ce9c96a-feb5-4f09-803a-d3ec49016807
2023-12-21T11:27:49.7034778Z     t.go:84: 2023-12-21 11:26:24.485 [debu]  workspace agent state report  workspace_agent_id=e16a4d48-6057-425b-afcb-2d5ca5660e49  workspace_id=c18b2771-df4b-42d5-a3f1-a014910197b8  payload={"state":"ready","changed_at":"2023-12-21T11:26:24.480422Z"}  request_id=db18b534-cd87-48b5-96d8-fc919bb72a8e
2023-12-21T11:27:49.7037500Z     t.go:84: 2023-12-21 11:26:24.486 [debu]  POST  host=localhost:64379  path=/api/v2/workspaceagents/me/report-lifecycle  proto=HTTP/1.1  remote_addr=127.0.0.1  start="2023-12-21T11:26:24.4855906Z"  took=1.1765ms  status_code=204  latency_ms=1  request_id=db18b534-cd87-48b5-96d8-fc919bb72a8e

However, a bit later it seems like either the started_at field has not been updated, or it has been overwritten by stale data:

2023-12-21T11:27:49.7086907Z     t.go:84: 2023-12-21 11:26:24.574 [debu]  accepting agent details  request_id=1afb922e-aa48-4a48-b427-36854b703ada  agent={"id":"e16a4d48-6057-425b-afcb-2d5ca5660e49","created_at":"2023-12-21T11:26:24.483817Z","updated_at":"2023-12-21T11:26:24.483817Z","name":"example","first_connected_at":{"Time":"0001-01-01T00:00:00Z","Valid":false},"last_connected_at":{"Time":"0001-01-01T00:00:00Z","Valid":false},"disconnected_at":{"Time":"0001-01-01T00:00:00Z","Valid":false},"resource_id":"b0af077e-bc12-403b-acd7-bee021dcbac7","auth_token":"9d39a8d6-0d75-456e-90f1-546b6575a011","auth_instance_id":{"String":"","Valid":false},"architecture":"","environment_variables":{"RawMessage":null,"Valid":false},"operating_system":"","instance_metadata":{"RawMessage":null,"Valid":false},"resource_metadata":{"RawMessage":null,"Valid":false},"directory":"","version":"","last_connected_replica_id":null,"connection_timeout_seconds":0,"troubleshooting_url":"","motd_file":"","lifecycle_state":"ready","expanded_directory":"","logs_length":0,"logs_overflowed":false,"started_at":{"Time":"0001-01-01T00:00:00Z","Valid":false},"ready_at":{"Time":"2023-12-21T11:26:24.480422Z","Valid":true},"subsystems":null,"display_apps":null,"api_version":""}

As you can see, started_at is null/invalid even though the agent pushed the update and coderd saw it before the update for "ready".

Since this is the in-memory database, we can assume that the HTTP 204 response means the data was most definitely written. So how did we lose it? We even send lifecycle updates synchronously, so there should be no risk of the starting/ready updates concurrently competing for the write.


As for this PR, I'm not opposed to handling this so that we can guarantee there are no panics, but I wouldn't want to silently pass it off as 0, either. We could at least indicate to the user that it's an abnormal state. Maybe return -1 from the safe func and then print [unknown], or similar?

@Emyrk (Member Author) commented Jan 4, 2024

Trying to see if we just forgot to do some assignment in dbmem...

Found something unrelated: #11412

@Emyrk (Member Author) commented Jan 4, 2024

A duration of <0 gets overwritten here:

if duration < 0 {
    duration = 0
}

If we want to fail non-silently, we can write some line to the stage writer?

@mafredri (Member) commented Jan 4, 2024

A duration of <0 gets overwritten here:

if duration < 0 {
    duration = 0
}

Oh yeah, that was to protect against us occasionally showing negative millis.

If we want to fail non-silently, we can write some line to the stage writer?

Sure 👍🏻

@Emyrk (Member Author) commented Jan 4, 2024

@mafredri I added the logging of nil times.

@Emyrk (Member Author) commented Jan 4, 2024

@mafredri that makes the test fail anyways 🤦

https://github.com/coder/coder/actions/runs/7413833011/job/20173631242?pr=11411#step:5:384

Flake happened

for agent.Status == codersdk.WorkspaceAgentDisconnected {
    if agent, err = fetch(); err != nil {
        return xerrors.Errorf("fetch: %w", err)
    }
}
- sw.Complete(stage, agent.LastConnectedAt.Sub(disconnectedAt))
+ sw.Complete(stage, safeDuration(sw, agent.LastConnectedAt, agent.DisconnectedAt))
Member:

This failed after the new log change because the logic changed here: before, we handled the nil time by checking the status, but now we log instead.

Member:

Or wait, that doesn't sound quite right.

Member:

Ok, third comment's the charm. You need to keep the handling at disconnectedAt := *agent.DisconnectedAt because we want the cached disconnect value after we've reconnected.
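
Roughly, keeping the cached value while still routing the math through the nil-safe helper looks like this (an illustrative sketch, not the exact commit):

disconnectedAt := agent.DisconnectedAt // cache before polling so the pre-reconnect value is kept
for agent.Status == codersdk.WorkspaceAgentDisconnected {
    if agent, err = fetch(); err != nil {
        return xerrors.Errorf("fetch: %w", err)
    }
}
sw.Complete(stage, safeDuration(sw, agent.LastConnectedAt, disconnectedAt))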

Member:

Pushed a fix.

// A panic has been observed in a test.
func safeDuration(sw *stageWriter, a, b *time.Time) time.Duration {
    if a == nil || b == nil {
        if sw != nil {
Member:

No need for sw to be an optional param here, IMO 🤔 (i.e. no need to check).

Member Author:

I just really didn't want to accidentally introduce a new place to deref a nil lol

@Emyrk merged commit c6366e5 into main on Jan 4, 2024
@Emyrk deleted the stevenmasley/time_panics branch on January 4, 2024 at 20:51
@github-actions bot locked and limited conversation to collaborators on Jan 4, 2024
@spikecurtis (Contributor):

This smells like db corruption because of a read-modify-update cycle not protected by a transaction

@spikecurtis (Contributor):

Ok, I have a theory:

check out the timing of the API logs:

2023-12-21T11:27:49.7003352Z     t.go:84: 2023-12-21 11:26:24.482 [debu]  workspace agent state report  workspace_agent_id=ff758f63-c35e-4767-a37e-52f9c937f835  workspace_id=c18b2771-df4b-42d5-a3f1-a014910197b8  payload={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}  request_id=3ce9c96a-feb5-4f09-803a-d3ec49016807
2023-12-21T11:27:49.7013125Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  POST  host=localhost:64379  path=/api/v2/workspaceagents/me/report-lifecycle  proto=HTTP/1.1  remote_addr=127.0.0.1  start="2023-12-21T11:26:24.4822116Z"  took=2.1672ms  status_code=204  latency_ms=2  request_id=3ce9c96a-feb5-4f09-803a-d3ec49016807
2023-12-21T11:27:49.7034778Z     t.go:84: 2023-12-21 11:26:24.485 [debu]  workspace agent state report  workspace_agent_id=e16a4d48-6057-425b-afcb-2d5ca5660e49  workspace_id=c18b2771-df4b-42d5-a3f1-a014910197b8  payload={"state":"ready","changed_at":"2023-12-21T11:26:24.480422Z"}  request_id=db18b534-cd87-48b5-96d8-fc919bb72a8e
2023-12-21T11:27:49.7037500Z     t.go:84: 2023-12-21 11:26:24.486 [debu]  POST  host=localhost:64379  path=/api/v2/workspaceagents/me/report-lifecycle  proto=HTTP/1.1  remote_addr=127.0.0.1  start="2023-12-21T11:26:24.4855906Z"  took=1.1765ms  status_code=204  latency_ms=1  request_id=db18b534-cd87-48b5-96d8-fc919bb72a8e

The API call to set the lifecycle to "starting" begins at 2023-12-21T11:26:24.4822116Z

However, the "created_at" time of the agent row is: 2023-12-21T11:26:24.483817Z, that is about a millisecond after the "starting" lifecycle.

Then finally the "ready" lifecycle update comes in at 2023-12-21T11:26:24.4855906Z

So, the "starting" lifecycle update was processed before the agent row was created, which explains why the data is missing.

But, the "starting" lifecycle update was successful, which doesn't make sense if the agent row didn't exist. However, the agent doesn't actually know it's own ID, it uses /api/v2/workspaceagents/me/report-lifecycle to report lifecycle, and coderd looks up the agent ID using the auth token. In the test that failed, we create the workspace, then stop it before running the SSH command which then starts a new build. Crucially, we hard-code the authentication token in the template version, which is a limitation of the Echo provisioner. So, when the "starting" lifecycle change was POSTed, Coderd found the agent from the previous build and set the time there.

Then later, when "ready" posted, the new agent row existed, and so Coderd found the new agent (it always looks for the latest matching build for the token).


Some take-aways:

  1. The Coder API doesn't enforce that "ready" comes after "starting" in the lifecycle. We rely on the Agent doing the right thing. Bugs in the Agent shouldn't create panics elsewhere, so the API needs to enforce the rules about consistent state (e.g. that "ready" implies started_at is valid and/or that "ready" comes after "starting"); a rough sketch of such a check follows this list. Also, the Agent and its token are under end-user control.

  2. Token uniqueness with respect to agents is assumed by the way our API works (i.e. paths like /api/v2/workspaceagents/me/report-lifecycle), but the Echo provisioner doesn't create unique tokens and we don't enforce uniqueness in the database. @johnstcn once tried to enforce this at the database level and found that a bunch of tests broke, so he gave up. But this bug is evidence that there are lurking issues with this oversight.

  3. The in-mem database doesn't enforce uniqueness of agent IDs. I don't think that contributed directly to this bug, but I noticed it and went down a rabbit hole wondering if we were inserting the same agent ID more than once. This is probably endemic to many other types of data in the fake database.
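
For take-away 1, the check could live in the report-lifecycle handler before the write; a hedged sketch (the helper name is hypothetical, and the field names follow the database/agentsdk types visible in the logs above):

func validateLifecycleReport(agent database.WorkspaceAgent, req agentsdk.PostLifecycleRequest) error {
    // "ready" only makes sense after "starting" has been recorded. Rejecting
    // (or backfilling) here prevents rows where ready_at is set while
    // started_at is still the zero value, which is what later panicked the CLI.
    if req.State == codersdk.WorkspaceAgentLifecycleReady && !agent.StartedAt.Valid {
        return xerrors.New("lifecycle state \"ready\" reported before \"starting\"")
    }
    return nil
}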

@spikecurtis (Contributor):

Oh, also, https://github.com/coder/coder/blame/58873fa7e2e90cb026938b4cb4a05bcc7007b3bf/coderd/workspaceagents.go#L1755

does suffer from read-modify-update outside a transaction. With a well-behaved agent, calls to this API are serialized, but again, a buggy agent shouldn't cause panics in other components. And a poorly made template could create two instances of the same agent with the same token, so the assumption that calls are serialized is dangerous.
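
A hedged sketch of closing that gap by doing the read and the update inside a single transaction via the store's InTx helper (the query and parameter names follow coderd's general patterns but are illustrative, not lifted from the linked code):

err := db.InTx(func(tx database.Store) error {
    // Read and write under the same transaction so a concurrent
    // report-lifecycle call cannot interleave between them.
    agent, err := tx.GetWorkspaceAgentByID(ctx, workspaceAgentID)
    if err != nil {
        return err
    }
    // ... derive the new lifecycle_state / started_at / ready_at from the report ...
    return tx.UpdateWorkspaceAgentLifecycleStateByID(ctx, database.UpdateWorkspaceAgentLifecycleStateByIDParams{
        ID:             agent.ID,
        LifecycleState: agent.LifecycleState,
        StartedAt:      agent.StartedAt,
        ReadyAt:        agent.ReadyAt,
    })
}, nil)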


Successfully merging this pull request may close these issues.

test flake: TestSSH/StartStoppedWorkspace