-
Notifications
You must be signed in to change notification settings - Fork 887
chore: prevent nil derefs in non-critical paths #11411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This is an interesting one. I was quite sure that this shouldn't happen since the API is supposed to give us guarantees about which fields are available based on what lifecycle state we're in. So I did some digging and quite a bit of head scratching later, I do think we have a bug, but it's not here. If we look at the agent logs, it's reporting in as it should:
Likewise, the API is seeing what we expect it to see:
However, a bit later it seems like either the
As you can see, it's null/invalid even though the agent pushed the updated and coderd saw it, before the update for ready. Since this is the in-memory database, we can assume that the HTTP 204 response means the data was most definitely written. So how did we lose it? We even send lifecycle updates synchronously, so there should be no risk that start/ready concurrently competing for the update. As for this PR, I'm not opposed to handling this so that we can guarantee there are no panics, but I wouldn't want to pass it silently off as 0, either. We could at least indicate to the user that it's a non-normal state. Maybe returning |
Trying to see if we just forgot to do some assignment in dbmem... Found something unrelated: #11412 |
A duration of coder/cli/cliui/provisionerjob.go Lines 232 to 234 in b3613ec
If we want to fail non-silently, we can write some line to the stage writer? |
Oh yeah, that was to protect against us occasionally showing negative millis.
Sure 👍🏻 |
@mafredri I added the logging of nil times. |
@mafredri that makes the test fail anyways 🤦 https://github.com/coder/coder/actions/runs/7413833011/job/20173631242?pr=11411#step:5:384 Flake happened |
cli/cliui/agent.go
Outdated
for agent.Status == codersdk.WorkspaceAgentDisconnected { | ||
if agent, err = fetch(); err != nil { | ||
return xerrors.Errorf("fetch: %w", err) | ||
} | ||
} | ||
sw.Complete(stage, agent.LastConnectedAt.Sub(disconnectedAt)) | ||
sw.Complete(stage, safeDuration(sw, agent.LastConnectedAt, agent.DisconnectedAt)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This failed the new log change because the logic changed here, we handled the nil time by checking status before, but now we log instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or wait, that doesn't sound quite right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok third comments the charm. You need to keep handling @ disconnectedAt := *agent.DisconnectedAt because we want the cached disconnect value after we've reconnected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed a fix.
// A panic has been observed in a test. | ||
func safeDuration(sw *stageWriter, a, b *time.Time) time.Duration { | ||
if a == nil || b == nil { | ||
if sw != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for sw to be an optional param here, IMO 🤔 (i.e. no need to check).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just really didn't want to accidently introduce a new place to deref a nil lol
This smells like db corruption because of a read-modify-update cycle not protected by a transaction |
Ok, I have a theory: check out the timing of the API logs:
The API call to set the lifecycle to "starting" begins at 2023-12-21T11:26:24.4822116Z However, the "created_at" time of the agent row is: 2023-12-21T11:26:24.483817Z, that is about a millisecond after the "starting" lifecycle. Then finally the "ready" lifecycle update comes in at 2023-12-21T11:26:24.4855906Z So, the "starting" lifecycle update was processed before the agent row was created, which explains why the data is missing. But, the "starting" lifecycle update was successful, which doesn't make sense if the agent row didn't exist. However, the agent doesn't actually know it's own ID, it uses Then later, when "ready" posted, the new agent row existed, and so Coderd found the new agent (it always looks for the latest matching build for the token). Some take-aways:
|
does suffer from read-modify-update outside a transaction. With a well-behaved agent, it serializes calls to this API, but again, a bugged agent shouldn't cause panics in other components. And, a poorly made template could create two instances of the same agent with the same token, so the assumption that calls are serialized is dangerous. |
Similar PR here: https://github.com/coder/coder/pull/9892/files
Fixes: #11308
I could not reproduce this, so it must just be flakey. This obviously sidesteps the problem, because ideally we just track down why the times are nil. However all this does is print a message with a duration to the UI. It is not served in any stats, or used in anything besides the build logs.
So I think this solution is fine just to prevent these in the future.