Do not allow multiple active runners for a subgraph #5715
Conversation
Force-pushed from 87d6188 to 7f430fa.
```diff
@@ -58,6 +58,10 @@ impl SubgraphKeepAlive {
             self.sg_metrics.running_count.inc();
         }
     }
+
+    pub fn contains(&self, deployment_id: &DeploymentId) -> bool {
```
nitpick: I would rename this to `is_alive` or `is_running`, which seems (subjectively) better suited to "SubgraphKeepAlive".
There are already `remove` and `insert` methods, and I thought `contains` would follow the same format :)
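For illustration, here is a minimal sketch of the shape this thread assumes, with placeholder types; the real `SubgraphKeepAlive` in graph-node differs in detail (for one, it also updates metrics, as the diff above shows):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Placeholders for the real graph-node types.
type DeploymentId = i32;
pub struct CancelGuard;

#[derive(Clone, Default)]
pub struct SubgraphKeepAlive {
    alive_map: Arc<RwLock<HashMap<DeploymentId, CancelGuard>>>,
}

impl SubgraphKeepAlive {
    pub fn insert(&self, deployment_id: DeploymentId, guard: CancelGuard) {
        self.alive_map.write().unwrap().insert(deployment_id, guard);
    }

    pub fn remove(&self, deployment_id: &DeploymentId) {
        self.alive_map.write().unwrap().remove(deployment_id);
    }

    // The new method under review: a read-only membership check whose
    // name mirrors the existing insert/remove pair.
    pub fn contains(&self, deployment_id: &DeploymentId) -> bool {
        self.alive_map.read().unwrap().contains_key(deployment_id)
    }
}
```

Keeping the `insert`/`remove`/`contains` naming mirrors `HashMap`'s own API, which is the convention the author cites above.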
```rust
// It is possible that the subgraph was unassigned, but the runner was in
// a retry delay state and did not observe the cancel signal.
if block_stream_cancel_handle.is_canceled() {
```
If the SG is re-assigned, could we lose the cancellation, end up with this check not being triggered, and have 2 SGs running on different nodes?
Based on the current implementation of CancelGuard, this should not be possible: the guard is considered canceled on drop, and a reassign to the same node replaces the canceler in the context, which essentially drops the previous one. An unassign followed by a reassign to a different node will simply drop the canceler from the context, and it will be considered canceled.
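To make that concrete, here is a toy model of the cancel-on-drop behavior described above; the real `CancelGuard`/`CancelHandle` in graph-node differ in detail, this only illustrates the semantics:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

pub struct CancelGuard {
    canceled: Arc<AtomicBool>,
}

impl CancelGuard {
    pub fn new() -> Self {
        CancelGuard {
            canceled: Arc::new(AtomicBool::new(false)),
        }
    }

    // Hand out a handle that the runner can poll.
    pub fn handle(&self) -> CancelHandle {
        CancelHandle {
            canceled: self.canceled.clone(),
        }
    }
}

// Dropping the guard flips the flag, so both unassign (the guard is
// dropped from the context) and reassign (a new guard replaces, and
// thereby drops, the old one) leave the old runner's handle canceled.
impl Drop for CancelGuard {
    fn drop(&mut self) {
        self.canceled.store(true, Ordering::SeqCst);
    }
}

#[derive(Clone)]
pub struct CancelHandle {
    canceled: Arc<AtomicBool>,
}

impl CancelHandle {
    pub fn is_canceled(&self) -> bool {
        self.canceled.load(Ordering::SeqCst)
    }
}
```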
OK, from that I understand that we are not missing events; rather, the event was processed but we just didn't check `is_canceled` often enough. Is that correct?
Yes, that is correct. There were several paths that skipped the check, and the fix essentially covers all of them.
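As a hypothetical sketch of the fixed control flow, reusing the toy `CancelHandle` from the model above (`process_next_block` and the delay are made up for illustration, not the actual graph-node code):

```rust
use std::time::Duration;

// Stand-in for real block processing.
async fn process_next_block() -> Result<(), ()> {
    Ok(())
}

async fn run(handle: CancelHandle) {
    loop {
        // Re-check the cancel signal on every iteration, including right
        // after a retry delay; those were the paths that skipped the check.
        if handle.is_canceled() {
            // The deployment was unassigned or reassigned while this
            // runner slept; shut down instead of duplicating work.
            return;
        }
        if process_next_block().await.is_err() {
            // Before the fix, a runner sitting in this retry-delay state
            // could loop without ever observing the cancel signal.
            tokio::time::sleep(Duration::from_secs(5)).await;
        }
    }
}
```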
Force-pushed from 4e82847 to 603045e.
Force-pushed from 603045e to 18eda16.
This PR fixes a long-standing bug that caused many subgraphs to fail repeatedly when a simple sequence of errors, retries, and restarts occurred.
Closes #5452
Context
There have been multiple reports of subgraphs failing randomly, with restarts only helping temporarily.
I examined several days of logs for some of these subgraphs and found patterns that lead to this strange state in which nothing seems to help the subgraphs recover.
Based on that, I set out to reproduce the bug locally and looked for a way to fix it with minimal refactoring.
Reproducing the bug locally
The simplest reproduction* is to restart a failing subgraph with `graphman restart <ID>`: the old runner, stuck in a retry delay, misses the cancel signal and keeps running alongside the new runner started by `graphman restart`, leading to indefinite failures.
* There are many more ways to reproduce this bug, but I have chosen the simplest one.
Testing the solution
I tested the proposed solution in a local setup, attempting to reproduce the bug based on the patterns I saw in the logs, and none of the attempts succeeded. The fix detects the duplicate runner and shuts it down, allowing the new runner to progress.
* I set up a local Grafana Loki instance to view the logs and make sure that graph-node behaves as expected after the fix.