
Occasionally client/agent connection stalls when sending data over SSH #7262

@mafredri

As the title says, there is occasionally a network stall when communicating with the agent over SSH, and it quite possibly affects agent connections in general.

I've tried to narrow down when and how this happens, but so far I can only say that, at least with SSH connections, it happens in various scenarios.

❯ ./test_upload.sh
run: slow print dots (2048)
took: 4.1984798908233643
run: upload data.bin (size: 100 MB)
took: 4.8269269466400146
run: upload data.bin (size: 100 MB) after 3s delay
took: 4.3507531166076658
run: upload data.bin (size: 100 MB) after 10s delay
took: 36.445715951919553
run: slow print dots (2048)
took: 4.1797709465026855
run: upload data.bin (size: 100 MB)
took: 5.8617269992828369
run: upload data.bin (size: 100 MB) after 3s delay
took: 4.652900981903076
run: upload data.bin (size: 100 MB) after 10s delay
took: 4.8792789459228514
run: slow print dots (2048)
took: 4.2638399600982666
run: upload data.bin (size: 100 MB)
took: 27.658783912658691
run: upload data.bin (size: 100 MB) after 3s delay
took: 5.0256939411163328
run: upload data.bin (size: 100 MB) after 10s delay
took: 36.462532091140744
run: slow print dots (2048)
took: 4.1673328876495361
run: upload data.bin (size: 100 MB)
took: 59.976847887039185
run: upload data.bin (size: 100 MB) after 3s delay
took: 35.357865858078
run: upload data.bin (size: 100 MB) after 10s delay
took: 4.3665278434753416
run: slow print dots (2048)
took: 4.2197508811950684
run: upload data.bin (size: 100 MB)
took: 4.948814868927002
run: upload data.bin (size: 100 MB) after 3s delay
took: 4.5854170799255369
run: upload data.bin (size: 100 MB) after 10s delay
took: 4.4678581237792967

Test script: test_upload.sh.txt
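
For readers without the attachment, the timing pattern in the log above can be approximated with a small harness. This is a minimal sketch and not the attached test_upload.sh: the function name `timed_run` and its parameters are hypothetical, and the actual transfer over SSH is left to the caller as `action`. It assumes the reported time covers only the action, not the preceding delay, which matches the log (the "after 3s delay" runs still take about 4s).

```python
import time


def timed_run(label, action, delay=0.0, sleep=time.sleep, clock=time.monotonic):
    """Idle for `delay` seconds, run `action`, and return the time the action took.

    Hypothetical helper: `action` is any callable that performs the transfer
    (for example, a function that shells out to an SSH upload).
    """
    suffix = f" after {delay:g}s delay" if delay else ""
    print(f"run: {label}{suffix}")
    sleep(delay)              # leave the connection idle before sending data
    start = clock()
    action()                  # the operation being measured
    elapsed = clock() - start
    print(f"took: {elapsed}")
    return elapsed
```

Calling it with delays of 0, 3 and 10 seconds around the same upload would give one group of measurements like those in the log.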

According to my observations, the stall seems to be roughly a multiple of 5 seconds, and the numbers repeat often: ~35s total (i.e. a ~30s stall) is especially common in my test setup. Five seconds is also the time between WireGuard reconnection attempts, which could be related.
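
The "multiple of 5" estimate can be checked against the log with a little arithmetic. The sketch below is only an illustration: it uses the upload timings from the run above (rounded to two decimals), takes the median as the no-stall baseline, and rounds the extra time to the nearest multiple of 5 seconds. The choice of the median as baseline and the half-interval cutoff are assumptions, not part of the original analysis.

```python
from statistics import median

# Upload timings in seconds, copied from the log above and rounded to 2 decimals.
upload_times = [
    4.83, 4.35, 36.45,
    5.86, 4.65, 4.88,
    27.66, 5.03, 36.46,
    59.98, 35.36, 4.37,
    4.95, 4.59, 4.47,
]

INTERVAL = 5.0  # seconds; the suspected retry interval


def estimate_stalls(times, interval=INTERVAL):
    """Return (baseline, [(total, extra, nearest_multiple), ...]) for stalled runs."""
    baseline = median(times)  # assumption: most runs did not stall
    stalls = []
    for total in times:
        extra = total - baseline
        if extra < interval / 2:  # too small to count as a stall
            continue
        nearest = round(extra / interval) * interval
        stalls.append((total, round(extra, 2), nearest))
    return baseline, stalls


baseline, stalls = estimate_stalls(upload_times)
print(f"baseline: {baseline:.2f}s")
for total, extra, nearest in stalls:
    print(f"took {total:5.2f}s, stall ~{extra:5.2f}s, nearest multiple of 5: {nearest:.0f}s")
```

With this data the baseline is 4.95s and the five slow runs round to stalls of 30, 25, 30, 55 and 30 seconds. The 27.66s run is the weakest fit, since its raw extra time is about 22.7s.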

In the above run, I didn't manage to trigger the stall for run: slow print dots (2048). However, it did seem to happen in another run, which took 5 seconds longer than it normally would (i.e. 4s -> 9s). This could suggest that the amount of data sent doesn't matter; rather, the data has to be sent at the right time to trigger the issue.

This stall could be the reason for many of our test flakes where we expect something to happen within a reasonable amount of time (5-15s), but as shown here, it can take 30s or even more to reconnect.

Labels

s1: Bugs that break core workflows. Only humans may set this.
