
Do not allow multiple active runners for a subgraph #5715


Merged

Conversation

isum
Member

@isum isum commented Nov 22, 2024

This PR fixes a long-standing bug that caused many subgraphs to fail repeatedly when a simple sequence of errors, retries, and restarts occurred.

Closes #5452: Ensure that non-deterministic failures are cleared when a subgraph restarts after shutting down the writer too early

Context

There have been multiple reports of subgraphs failing seemingly at random, with restarts helping only temporarily.

I examined several days of logs for some of the affected subgraphs and found the patterns that lead to this strange state in which nothing seems to recover them.

Based on that, I set out to reproduce the bug locally and looked for a way to fix it with minimal refactoring.

Reproducing the bug locally

  • I set up a minimal subgraph that records block numbers and timestamps and deployed it locally.
  • I modified the graph-node code (here) to inject a non-deterministic error after 10 blocks were processed; a standalone sketch of such a fault injector appears after this list.
  • When the simulated error triggered, the runner entered a retry delay state.
  • Then, I executed graphman restart <ID>.
  • The subgraph restarted, creating a new runner, but the first runner continued waiting for the retry delay to expire.
  • During the retry delay, the second runner progressed, but when the delay ended, the first runner initiated a soft restart. This replaced the block stream canceller in the context and terminated the second runner.
  • At this stage, only the first runner remained, but its writer had been killed by the graphman restart, leading to indefinite failures.
  • After that, restarts no longer helped.
  • The simplest way to recover was to restart graph-node. Unassigns weren’t always reliable, as I observed scenarios with multiple (up to 4) failing runners for the same subgraph active at the same time.

* There are many more ways to reproduce this bug, but I have chosen the simplest one.
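
For reference, the fault injection needs nothing more than a counter. The following standalone sketch illustrates the approach; it is an illustration, not the actual patch applied to graph-node:

use std::sync::atomic::{AtomicU64, Ordering};

// Illustration of the repro's fault injection, not the actual graph-node
// patch: fail exactly once, on the tenth processed block, mimicking a
// non-deterministic (transient) error.
static BLOCKS_PROCESSED: AtomicU64 = AtomicU64::new(0);

fn process_block(block_number: u64) -> Result<(), String> {
    let n = BLOCKS_PROCESSED.fetch_add(1, Ordering::SeqCst) + 1;
    if n == 10 {
        // Transient: retrying the same block later would succeed.
        return Err(format!("injected non-deterministic error at block {block_number}"));
    }
    Ok(()) // normal block processing would happen here
}

fn main() {
    for b in 1..=12 {
        match process_block(b) {
            Ok(()) => println!("processed block {b}"),
            Err(e) => println!("runner enters retry delay: {e}"),
        }
    }
}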

Testing the solution

I have tested the proposed solution in a local setup, attempting to reproduce the bug using the patterns I saw in the logs; none of the attempts succeeded. The fix detects the duplicate runner and shuts it down, allowing the new runner to progress (a sketch of this check follows below).

* I've set up a local Grafana Loki instance to view the logs and make sure that graph-node behaves as expected after the fix.
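
The shape of the check is roughly the following. This is a sketch pieced together from the review discussion below, with illustrative names (KeepAlive, alive_map), not the verbatim diff:

use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type DeploymentId = u64;

// Sketch of a keep-alive registry with one entry per deployment; in the
// real code the value would be the runner's block stream cancel guard.
#[derive(Default, Clone)]
struct KeepAlive {
    alive_map: Arc<RwLock<HashMap<DeploymentId, ()>>>,
}

impl KeepAlive {
    fn insert(&self, id: DeploymentId) {
        self.alive_map.write().unwrap().insert(id, ());
    }
    fn remove(&self, id: &DeploymentId) {
        self.alive_map.write().unwrap().remove(id);
    }
    fn contains(&self, id: &DeploymentId) -> bool {
        self.alive_map.read().unwrap().contains_key(id)
    }
}

fn main() {
    let instances = KeepAlive::default();
    let deployment: DeploymentId = 42;

    instances.insert(deployment);
    // graphman restart: the deployment is unassigned and a new runner is
    // registered, so the old runner's entry is replaced or removed.
    instances.remove(&deployment);

    // The old runner consults the registry before continuing; finding no
    // entry for itself, it shuts down instead of racing the new runner.
    assert!(!instances.contains(&deployment));
    println!("duplicate runner detected; shutting it down");
}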

@isum isum self-assigned this Nov 22, 2024
@isum isum force-pushed the ion/do-not-allow-multiple-active-runners-for-a-subgraph branch from 87d6188 to 7f430fa on November 22, 2024 at 17:49
@isum isum marked this pull request as ready for review November 22, 2024 17:49
@@ -58,6 +58,10 @@ impl SubgraphKeepAlive {
self.sg_metrics.running_count.inc();
}
}

pub fn contains(&self, deployment_id: &DeploymentId) -> bool {
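    // Note: the diff preview is truncated here. A plausible body, assuming
    // SubgraphKeepAlive wraps a locked map keyed by DeploymentId
    // (alive_map is a hypothetical field name, not quoted from the PR):
    self.alive_map.read().unwrap().contains_key(deployment_id)
}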
Contributor

nitpick: I would rename this to is_alive or is_running which seems better (subjectively) suited for "SubgraphKeepAlive"

Member Author

There are already remove and insert methods, and I thought contains would follow the same format :)


// It is possible that the subgraph was unassigned, but the runner was in
// a retry delay state and did not observe the cancel signal.
if block_stream_cancel_handle.is_canceled() {
Contributor

If the SG is re-assigned, could we lose the cancellation and end up with this not triggered and have 2 SGs on different nodes?

Member Author

Based on the current implementation of CancelGuard, this should not be possible: the guard is considered canceled on drop, and a reassign to the same node replaces the canceller in the context, which essentially drops the previous one. An unassign followed by a reassign to a different node will simply drop the canceller from the context, so it will be considered canceled as well.
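
A minimal stand-in for the canceled-on-drop semantics described here (not graph-node's actual CancelGuard, just an illustration):

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Stand-in for canceled-on-drop semantics; not graph-node's real type.
struct Guard {
    canceled: Arc<AtomicBool>,
}

impl Drop for Guard {
    fn drop(&mut self) {
        // Dropping the guard is what marks the stream as canceled.
        self.canceled.store(true, Ordering::SeqCst);
    }
}

fn main() {
    let canceled = Arc::new(AtomicBool::new(false));
    let guard = Guard { canceled: canceled.clone() };

    // Replacing the canceller in the context drops the previous guard...
    drop(guard);

    // ...so the old runner's handle observes the cancellation.
    assert!(canceled.load(Ordering::SeqCst));
}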

Contributor

OK, from that I understand we are not missing events; rather, the event was processed but we just didn't check is_canceled often enough. Is that correct?

Member Author

Yes, that is correct. There were several paths that skipped the check, and the fix essentially covers all of them.
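
As an illustration of "checking often enough", the pattern is to re-check the cancel handle on every path through the loop, including right after a retry delay. A standalone sketch, not the PR's verbatim code:

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    let canceled = Arc::new(AtomicBool::new(false));
    let handle = canceled.clone();

    let runner = thread::spawn(move || {
        for block in 0u64.. {
            // Check the cancel handle at the top of every iteration, so
            // waking from a retry delay cannot skip the check.
            if handle.load(Ordering::SeqCst) {
                println!("cancel observed at block {block}; runner exits");
                return;
            }
            thread::sleep(Duration::from_millis(10)); // work or retry delay
        }
    });

    thread::sleep(Duration::from_millis(50));
    canceled.store(true, Ordering::SeqCst); // e.g. a graphman restart
    runner.join().unwrap();
}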

@isum isum force-pushed the ion/do-not-allow-multiple-active-runners-for-a-subgraph branch 2 times, most recently from 4e82847 to 603045e on November 22, 2024 at 18:24
@isum isum force-pushed the ion/do-not-allow-multiple-active-runners-for-a-subgraph branch from 603045e to 18eda16 on November 22, 2024 at 18:41
@isum isum merged commit 1e7732c into master Nov 25, 2024
6 checks passed
@isum isum deleted the ion/do-not-allow-multiple-active-runners-for-a-subgraph branch November 25, 2024 17:02