Upgrading coder v0.19.2 -> v0.20.0 results in code-server becoming inaccessible (blank page on load) #6746

Closed
edmondsiu0 opened this issue Mar 23, 2023 · 19 comments
Labels
s0 Major regression, all-hands-on-deck to fix

Comments

@edmondsiu0

Hi Coder team, first of all, thanks for Coder!

I have noticed an issue with the latest version of Coder, v0.20.0: after upgrading, the following features result in a (nearly) blank page that is stuck loading forever:

image

  • code-server
  • Terminal

image

No network activity was captured by Chrome while the page sat waiting to load (the screenshot below shows code-server).

image

That said, there was some activity in the logs on the container running code-server, including "invalid initiation message":

2023-03-23 08:06:24.838 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - Sending handshake initiation
2023-03-23 08:06:25.034 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - Received handshake initiation
2023-03-23 08:06:25.034 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - ConsumeMessageInitiation: handshake replay @ 2023-03-23 08:06:25.016777216 +0000 UTC
2023-03-23 08:06:25.034 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - Sending handshake response
2023-03-23 08:06:25.034 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] Received invalid initiation message from ee673a36eec77a8bb35c94328b0e331b35ed66eb3d300493831ea31d8bf91d05
2023-03-23 08:06:30.051 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - Handshake did not complete after 5 seconds, retrying (try 2)
2023-03-23 08:06:30.051 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - Sending handshake initiation
2023-03-23 08:06:30.303 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - Received handshake initiation
2023-03-23 08:06:30.303 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - Sending handshake response
2023-03-23 08:06:30.304 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - Received handshake initiation
2023-03-23 08:06:30.304 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - Sending handshake response
2023-03-23 08:06:35.166 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - Handshake did not complete after 5 seconds, retrying (try 3)
2023-03-23 08:06:35.572 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - Received handshake initiation
2023-03-23 08:06:35.572 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - Sending handshake response
2023-03-23 08:06:35.572 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - ConsumeMessageInitiation: handshake replay @ 2023-03-23 08:06:35.536870912 +0000 UTC
2023-03-23 08:06:35.572 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] Received invalid initiation message from ee673a36eec77a8bb35c94328b0e331b35ed66eb3d300493831ea31d8bf91d05
2023-03-23 08:06:36.728 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/userspace.go:1254>	(*userspaceEngine).Ping	ping(fd7a:115c:a1e0:4ae8:a38a:2d92:c807:8a7c): sending disco ping to [7mc6N]  ...
2023-03-23 08:06:36.894 [DEBUG]	<./codersdk/agentsdk/agentsdk.go:196>	(*Client).Listen.func1	got coordinate pong	{"took": "41.230589ms"}
2023-03-23 08:06:40.792 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - Received handshake initiation
2023-03-23 08:06:40.792 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - Received handshake initiation
2023-03-23 08:06:40.792 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - Sending handshake response
2023-03-23 08:06:40.792 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [v2] [7mc6N] - Sending handshake response
2023-03-23 08:06:40.792 [DEBUG]	(tailnet.wgengine)	<./../../../tailscale.com/wgengine/wglog/wglog.go:81>	NewLogger.func1	wg: [7mc6N] - Failed to create response message: handshake initiation must be consumed first
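For triage, an excerpt like the one above can be condensed into per-event counts. The following is a minimal sketch, assuming the log lines keep the format shown above; the function name `summarize_handshakes` and the category labels are hypothetical and not part of Coder or Tailscale.

```python
import re
from collections import Counter

# Message fragments copied from the wgengine lines in the excerpt above.
# The keys are hypothetical labels chosen for this sketch.
PATTERNS = {
    "sent_initiation": re.compile(r"Sending handshake initiation"),
    "received_initiation": re.compile(r"Received handshake initiation"),
    "sent_response": re.compile(r"Sending handshake response"),
    "replay": re.compile(r"ConsumeMessageInitiation: handshake replay"),
    "invalid_initiation": re.compile(r"Received invalid initiation message"),
    "retry": re.compile(r"Handshake did not complete after \d+ seconds, retrying"),
    "response_failed": re.compile(r"Failed to create response message"),
}


def summarize_handshakes(lines):
    """Count how often each handshake event appears in the given log lines."""
    counts = Counter()
    for line in lines:
        # Skip lines that are not WireGuard messages (for example, disco pings).
        if "wg:" not in line:
            continue
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break  # each line is counted under at most one category
    return counts
```

Passing an open log file (or any iterable of lines) to `summarize_handshakes` returns a `Counter`. A high count of retries and invalid initiations relative to completed exchanges would match the symptom reported here, though the counts alone do not identify the cause.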

I was able to replicate this issue on both my existing Coder installation and a fresh upgrade from a clean v0.19.2.

@matifali
Member

Can you reproduce this on a fresh, clean install of 0.20.0?

@edmondsiu0
Author

> Can you reproduce this on a fresh, clean install of 0.20.0?

Was just about to post an update on this - I can reproduce the same issue with a fresh clean v0.20.0 (while fresh clean v0.19.2 works).

@renesas-brandon-hussey

renesas-brandon-hussey commented Mar 23, 2023

I have this same issue. I spent most of yesterday trying to narrow it down and find an existing issue, but could not find one. The problem does not always happen. Some new Workspaces work right after being created, but if I go back and use them some time later, I get the blank screen. Even when I get a blank screen, I can still SSH into the Workspace and use it with no issues. The only problem I've seen is with using the coder_apps through the web browser.

I did check the logs, and I have the same "Received invalid initiation message" message as shown in the original post.

@matifali
Member

@bpmct Is this a regression?

@bpmct
Member

bpmct commented Mar 23, 2023

Thanks for reporting, and sorry this happened! We'll start investigating.

It'd be helpful to see both the Coder agent logs (/tmp/coder-agent.log inside the container/VM/pod) and the coder server logs. You can email them to me if you'd prefer (email in my GitHub profile).

bpmct added the bug and s1 (Bugs that break core workflows. Only humans may set this.) labels Mar 23, 2023

@edmondsiu0
Author

@bpmct Logs sent via email! :)

@renesas-brandon-hussey

@bpmct I can provide logs too if needed.

@bpmct
Member

bpmct commented Mar 23, 2023

It wouldn't hurt :)

@kylecarbs
Member

Humph, @edmondsiu0 @renesas-brandon-hussey I'm not sure why it's not working, but I know what the problematic change is...

#6595

I'll revert this for now and we can do a patch, but I'm not sure why it'd seemingly be dropping some of the traffic. Are you using a proxy in front of your Coder deployment?

@kylecarbs
Member

@edmondsiu0 @renesas-brandon-hussey could y'all try running Coder with verbose logs and send me the output? That's where the failing connection is coming from.

kylecarbs added a commit that referenced this issue Mar 23, 2023
Not sure why this is breaking deployments, but it seems
likely to be the cause.

See #6746.
kylecarbs added a commit that referenced this issue Mar 23, 2023
This was causing boatloads of connects to reestablish every time...

See #6746
@edmondsiu0
Author

edmondsiu0 commented Mar 23, 2023

@kylecarbs I am fronting Coder with Traefik (this worked until v0.19.2). I looked at Traefik earlier for obvious connection errors, but nothing stood out to me.

Happy to supply verbose logs, but I can't seem to find info on turning on verbose logging.
Is there an environment variable or CMD override that I can reference?

@kylecarbs
Member

kylecarbs commented Mar 23, 2023

@edmondsiu0 actually, I think I found the issue! It only seems to affect some environments...

Could you add me on our Coder Discord so we can debug further? https://discord.gg/coder

@renesas-brandon-hussey

I am also using Traefik. Same as @edmondsiu0: I looked for obvious issues but did not see any. Coder is running in Docker, and my Templates are docker-in-docker based.

@kylecarbs
Member

@renesas-brandon-hussey @edmondsiu0 v0.20.1 is coming out, and it should fix the problem for y'all!

@kylecarbs
Member

It's out now!

@edmondsiu0
Author

🥳 v0.20.1 fixed it for me!
Thanks for fixing!

@kylecarbs
Member

Sick! Happy to hear it :)

@renesas-brandon-hussey

I have updated to v0.20.1 and in my initial testing the issue is fixed. Thanks for the quick turnaround!

@kylecarbs
Member

Of course! Apologies that it happened in the first place, but now our code is improved, so win win.

bpmct added the s0 (Major regression, all-hands-on-deck to fix) label and removed the s1 (Bugs that break core workflows. Only humans may set this.) label Apr 17, 2023
matifali added the bug label Jun 6, 2023