
Do not allow multiple active runners for a subgraph #5715


Merged

Conversation

isum
Member

@isum isum commented Nov 22, 2024

This PR fixes a long-standing bug that caused many subgraphs to fail repeatedly when a simple sequence of errors, retries, and restarts occurred.

Closes #5452: Ensure that non-deterministic failures are cleared when a subgraph restarts after shutting down the writer too early

Context

There have been multiple reports of subgraphs failing seemingly at random, with restarts helping only temporarily.

I examined several days of logs for some of the affected subgraphs and found the patterns that lead to this strange state in which nothing seems to recover them.

Based on that, I set out to reproduce the bug locally and looked for a way to fix it with minimal refactoring.

Reproducing the bug locally

  • I set up a minimal subgraph that records block numbers and timestamps and deployed it locally.
  • I modified the graph-node code (here) to inject a non-deterministic error after 10 blocks were processed; a standalone sketch of such a fault injector appears after this list.
  • When the simulated error triggered, the runner entered a retry delay state.
  • Then, I executed graphman restart <ID>.
  • The subgraph restarted, creating a new runner, but the first runner continued waiting for the retry delay to expire.
  • During the retry delay, the second runner progressed, but when the delay ended, the first runner initiated a soft restart. This replaced the block stream canceller in the context and terminated the second runner.
  • At this stage, only the first runner remained, but its writer had been killed by the graphman restart, leading to indefinite failures.
  • After that, restarts no longer helped.
  • The simplest way to recover was to restart graph-node. Unassigns weren’t always reliable, as I observed scenarios with multiple (up to 4) failing runners for the same subgraph active at the same time.

* There are many more ways to reproduce this bug, but I have chosen the simplest one.
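
For reference, the fault injection needs nothing more than a counter. The following standalone sketch illustrates the approach; it is an illustration, not the actual patch applied to graph-node:

use std::sync::atomic::{AtomicU64, Ordering};

// Illustration of the repro's fault injection, not the actual graph-node
// patch: fail exactly once, on the tenth processed block, mimicking a
// non-deterministic (transient) error.
static BLOCKS_PROCESSED: AtomicU64 = AtomicU64::new(0);

fn process_block(block_number: u64) -> Result<(), String> {
    let n = BLOCKS_PROCESSED.fetch_add(1, Ordering::SeqCst) + 1;
    if n == 10 {
        // Transient: retrying the same block later would succeed.
        return Err(format!("injected non-deterministic error at block {block_number}"));
    }
    Ok(()) // normal block processing would happen here
}

fn main() {
    for b in 1..=12 {
        match process_block(b) {
            Ok(()) => println!("processed block {b}"),
            Err(e) => println!("runner enters retry delay: {e}"),
        }
    }
}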

Testing the solution

I have tested the proposed solution in a local setup, attempting to reproduce the bug using the patterns I saw in the logs; none of the attempts succeeded. The fix detects the duplicate runner and shuts it down, allowing the new runner to progress (a sketch of this check follows below).

* I've set up a local Grafana Loki instance to view the logs and make sure that graph-node behaves as expected after the fix.
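
The shape of the check is roughly the following. This is a sketch pieced together from the review discussion below, with illustrative names (KeepAlive, alive_map), not the verbatim diff:

use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type DeploymentId = u64;

// Sketch of a keep-alive registry with one entry per deployment; in the
// real code the value would be the runner's block stream cancel guard.
#[derive(Default, Clone)]
struct KeepAlive {
    alive_map: Arc<RwLock<HashMap<DeploymentId, ()>>>,
}

impl KeepAlive {
    fn insert(&self, id: DeploymentId) {
        self.alive_map.write().unwrap().insert(id, ());
    }
    fn remove(&self, id: &DeploymentId) {
        self.alive_map.write().unwrap().remove(id);
    }
    fn contains(&self, id: &DeploymentId) -> bool {
        self.alive_map.read().unwrap().contains_key(id)
    }
}

fn main() {
    let instances = KeepAlive::default();
    let deployment: DeploymentId = 42;

    instances.insert(deployment);
    // graphman restart: the deployment is unassigned and a new runner is
    // registered, so the old runner's entry is replaced or removed.
    instances.remove(&deployment);

    // The old runner consults the registry before continuing; finding no
    // entry for itself, it shuts down instead of racing the new runner.
    assert!(!instances.contains(&deployment));
    println!("duplicate runner detected; shutting it down");
}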

@isum isum self-assigned this Nov 22, 2024
@isum isum force-pushed the ion/do-not-allow-multiple-active-runners-for-a-subgraph branch from 87d6188 to 7f430fa on November 22, 2024 at 17:49
@isum isum marked this pull request as ready for review November 22, 2024 17:49
@@ -58,6 +58,10 @@ impl SubgraphKeepAlive {
self.sg_metrics.running_count.inc();
}
}

pub fn contains(&self, deployment_id: &DeploymentId) -> bool {
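    // Note: the diff preview is truncated here. A plausible body, assuming
    // SubgraphKeepAlive wraps a locked map keyed by DeploymentId
    // (alive_map is a hypothetical field name, not quoted from the PR):
    self.alive_map.read().unwrap().contains_key(deployment_id)
}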
Contributor

nitpick: I would rename this to is_alive or is_running which seems better (subjectively) suited for "SubgraphKeepAlive"

Member Author

There are already remove and insert methods, and I thought contains would follow the same format :)


// It is possible that the subgraph was unassigned, but the runner was in
// a retry delay state and did not observe the cancel signal.
if block_stream_cancel_handle.is_canceled() {
Contributor

If the SG is re-assigned, could we lose the cancellation and end up with this not triggered and have 2 SGs on different nodes?

Member Author

Based on the current implementation of CancelGuard, this should not be possible: the guard is considered canceled on drop, and a reassign to the same node replaces the canceller in the context, which essentially drops the previous one. An unassign followed by a reassign to a different node will simply drop the canceller from the context, so it will be considered canceled as well.
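
A minimal stand-in for the canceled-on-drop semantics described here (not graph-node's actual CancelGuard, just an illustration):

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Stand-in for canceled-on-drop semantics; not graph-node's real type.
struct Guard {
    canceled: Arc<AtomicBool>,
}

impl Drop for Guard {
    fn drop(&mut self) {
        // Dropping the guard is what marks the stream as canceled.
        self.canceled.store(true, Ordering::SeqCst);
    }
}

fn main() {
    let canceled = Arc::new(AtomicBool::new(false));
    let guard = Guard { canceled: canceled.clone() };

    // Replacing the canceller in the context drops the previous guard...
    drop(guard);

    // ...so the old runner's handle observes the cancellation.
    assert!(canceled.load(Ordering::SeqCst));
}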

Contributor

OK, from that I understand we are not missing events; rather, the event was processed but we just didn't check is_canceled often enough. Is that correct?

Member Author

Yes, that is correct. There were several paths that skipped the check, and the fix essentially covers all of them.
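
As an illustration of "checking often enough", the pattern is to re-check the cancel handle on every path through the loop, including right after a retry delay. A standalone sketch, not the PR's verbatim code:

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    let canceled = Arc::new(AtomicBool::new(false));
    let handle = canceled.clone();

    let runner = thread::spawn(move || {
        for block in 0u64.. {
            // Check the cancel handle at the top of every iteration, so
            // waking from a retry delay cannot skip the check.
            if handle.load(Ordering::SeqCst) {
                println!("cancel observed at block {block}; runner exits");
                return;
            }
            thread::sleep(Duration::from_millis(10)); // work or retry delay
        }
    });

    thread::sleep(Duration::from_millis(50));
    canceled.store(true, Ordering::SeqCst); // e.g. a graphman restart
    runner.join().unwrap();
}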

@isum isum force-pushed the ion/do-not-allow-multiple-active-runners-for-a-subgraph branch 2 times, most recently from 4e82847 to 603045e on November 22, 2024 at 18:24
@isum isum force-pushed the ion/do-not-allow-multiple-active-runners-for-a-subgraph branch from 603045e to 18eda16 on November 22, 2024 at 18:41
@isum isum merged commit 1e7732c into master Nov 25, 2024
6 checks passed
@isum isum deleted the ion/do-not-allow-multiple-active-runners-for-a-subgraph branch November 25, 2024 17:02