Skip to content

Panic when trying to open terminal session #3870

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
skerit opened this issue Sep 5, 2022 · 10 comments
Closed

Panic when trying to open terminal session #3870

skerit opened this issue Sep 5, 2022 · 10 comments
Labels
api Area: HTTP API

Comments

@skerit
Copy link

skerit commented Sep 5, 2022

Coder version & template

I'm using coder v0.8.11.
The template in question is the docker-code-server example template.
(The only thing I changed in it was the DNS setting)

Error

When I try to open a terminal session via the web interface, my docker container crashes with this error:

[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x120d8a5],
panic: runtime error: invalid memory address or nil pointer dereference,
2022-09-05 10:14:27.027 [DEBUG]	<./peer/channel.go:128>	(*Channel).init.func3	datachannel opening	{"id": 5, "label": "65bfa104-41da-4399-a059-d3738f521653:80:80:"},
2022-09-05 10:14:27.027 [WARN]	<./agent/agent.go:744>	(*agent).handleReconnectingPTY	start reconnecting pty command	{"id": "65bfa104-41da-4399-a059-d3738f521653"},
2022-09-05 10:14:27.035 [INFO]	<./cli/agent.go:73>	workspaceAgent.func1	reaper process exiting,
	/home/runner/work/coder/coder/agent/agent.go:775 +0x83b,
created by github.com/coder/coder/agent.(*agent).handleReconnectingPTY,
github.com/coder/coder/agent.(*agent).handleReconnectingPTY.func2(),
	/home/runner/work/coder/coder/agent/agent.go:778 +0x25,
goroutine 1268 [running]:,

And when I try to connect using coder ssh testspace it just fails:

2022-09-05 10:24:49.061 [DEBUG]	<./peer/conn.go:241>	(*Conn).init.func5	dtls transport state updated	{"state": "connected"}
2022-09-05 10:24:49.062 [DEBUG]	<./peer/conn.go:201>	(*Conn).init.func3.1	rtc connection updated	{"state": "connected"}
2022-09-05 10:24:49.062 [DEBUG]	<./peer/channel.go:128>	(*Channel).init.func3	datachannel opening	{"id": 1, "label": "ping"}
2022-09-05 10:24:49.062 [DEBUG]	<./peer/channel.go:128>	(*Channel).init.func3	datachannel opening	{"id": 2, "label": "echo"}
2022-09-05 10:24:49.063 [DEBUG]	<./peer/channel.go:128>	(*Channel).init.func3	datachannel opening	{"id": 3, "label": "ssh"}
2022-09-05 10:24:49.070 [WARN]	<./agent/agent.go:481>	(*agent).init.func2	ssh session failed ...
  "error": start command:
               github.com/coder/coder/agent.(*agent).handleSSHSession
                   /home/runner/work/coder/coder/agent/agent.go:663
             - start:
               github.com/coder/coder/pty.startPty
                   /home/runner/work/coder/coder/pty/start_other.go:41
             - fork/exec /bin/bash: no such file or directory

Opening a console via docker just works though. (And /bin/bash exists just fine)

Also...

I can't open code-server either, not sure if it's a related issue:

code-server is offline!
remote dial error: dial 'tcp://localhost:8080': dial tcp 127.0.0.1:8080: connect: connection refused

Which might be because the agent can't seem to connect to coder?
I'm getting a failed to dial error in the logs, immediately after starting the workspace:


2022-09-05 10:13:06.929 [WARN]	<./agent/agent.go:150>	(*agent).run	failed to dial	{"error": "GET https://coder.domain.be/api/v2/workspaceagents/me/metadata: unexpected status code 401: Agent token is invalid.: Try logging in using 'coder login \u003curl\u003e'."},
2022-09-05 10:13:06.927 [INFO]	<./agent/agent.go:141>	(*agent).run	connecting,
2022-09-05 10:13:06.604 [INFO]	<./agent/agent.go:450>	(*agent).init	generating host key,
2022-09-05 10:13:06.581 [INFO]	<./cli/agent.go:78>	workspaceAgent.func1	starting agent	{"url": "https://coder.domain.be/", "auth": "token", "version": "v0.8.11+cde036c"},
2022-09-05 10:13:06.513 [INFO]	<./cli/agent.go:63>	workspaceAgent.func1	spawning reaper process

After restarting the workspace, I'm getting slightly different agent errors:

2022-09-05 10:58:46.515 [WARN]	<./agent/agent.go:170>	(*agent).run.func1	agent script failed ...,
  "error": run:,
               github.com/coder/coder/agent.(*agent).runStartupScript,
                   /home/runner/work/coder/coder/agent/agent.go:375,
             - chdir /home/skerit: no such file or directory,
2022-09-05 10:58:46.516 [DEBUG]	(tailnet.wgengine)	<./tailnet/conn.go:521>	Logger.func1	[v1] using fake (no-op) tun device,
2022-09-05 10:58:46.518 [DEBUG]	(tailnet.wgengine)	<./tailnet/conn.go:521>	Logger.func1	[v1] magicsock: starting endpoint update (derp-map-update),
2022-09-05 10:58:46.516 [DEBUG]	(tailnet.wgengine)	<./tailnet/conn.go:521>	Logger.func1	[v1] using fake (no-op) OS network configurator,
2022-09-05 10:58:46.516 [DEBUG]	(tailnet.wgengine)	<./tailnet/conn.go:521>	Logger.func1	[v1] using fake (no-op) DNS configurator,
2022-09-05 10:58:46.516 [DEBUG]	(tailnet.wgengine)	<./tailnet/conn.go:521>	Logger.func1	dns: using dns.noopManager
...
2022-09-05 10:58:46.623 [WARN]	<./agent/agent.go:320>	(*agent).runWebRTCNetworking	failed to dial ...,
  "error": GET https://coder.domain.be/api/v2/workspaceagents/me/listen: unexpected status code 403: Agent trying to connect from non-latest build.,
           	Error: build is outdated,

Weird that now the error turned into a Agent trying to connect from non-latest build error, when the template has not been changed.

@mafredri
Copy link
Member

mafredri commented Sep 5, 2022

Thanks for your detailed report @skerit!

I created #3871 to fix the nil pointer deref you ran into with reconnecting PTY, however it won't help with your underlying problem.

I'm assuming you're using codercom/code-server as the image (it's the default in docker-code-server)?

There's something odd going on with your setup though, are you sure DNS is the only thing you changed? Especially this error: chdir /home/skerit: no such file or directory would indicate that there are some other changes too. Typically the codercom/code-server image only has the coder user, and if you're using a non-existent user then that could perhaps explain the error (and maybe even /bin/bash error).

Regarding the build is outdated error, this could happen if you e.g. did coder stop workspace (or start) but there was an error and the workspace wasn't deleted. I.e. a state transition like start/stop trigger a new workspace build.

@skerit
Copy link
Author

skerit commented Sep 5, 2022

My pleasure @mafredri, it's the least I can do 😄

I'm actually not running coder itself inside a docker container. I used the install script (on an Arch linux server)

Here's my systemd service (which I copied from this repository, I believe)

[Unit]
Description="Coder - Self-hosted developer workspaces on your infra"
Documentation=https://coder.com/docs/coder-oss
Requires=network-online.target postgres-coder.service
After=network-online.target
ConditionFileNotEmpty=/etc/coder.d/coder.env
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
Type=notify
EnvironmentFile=/etc/coder.d/coder.env
User=coder
Group=coder
ProtectSystem=full
PrivateTmp=yes
PrivateDevices=yes
SecureBits=keep-caps
AmbientCapabilities=CAP_IPC_LOCK CAP_NET_BIND_SERVICE
CacheDirectory=coder
CapabilityBoundingSet=CAP_SYSLOG CAP_IPC_LOCK CAP_NET_BIND_SERVICE
KillSignal=SIGINT
NoNewPrivileges=yes
ExecStart=/usr/local/bin/coder server
Restart=on-failure
RestartSec=5
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target

And the environment file:

# Run "coder server --help" for flag information.
CODER_ACCESS_URL=https://coder.mydomain.be
CODER_ADDRESS=127.0.0.1:3091
CODER_PG_CONNECTION_URL=postgresql://postgres:password@127.0.0.1:5433?sslmode=disable
CODER_TLS_CERT_FILE=
CODER_TLS_ENABLE=
CODER_TLS_KEY_FILE=
# Generate a unique *.try.coder.app access URL
CODER_TUNNEL=false

I've cleared the entire Postgres database & started fresh, but I'm still getting the same chdir /home/skerit: no such file or directory error.

I thought the install script was the preferred way of installing coder, but I can try the docker way too.
(Update: Tried running it inside docker, but I keep getting the permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: error)

@mafredri
Copy link
Member

mafredri commented Sep 5, 2022

I was under the impression that you were seeing chdir /home/skerit: no such file or directory in the agent/workspace, not on the server? If so, it shouldn't matter how you run the coder server.

I think it'd help if you could share the exact template you're using and we can see if there's something odd there. (E.g. coder templates pull docker-code-server my-docker-code-server.tar, and upload the file, assuming there's nothing sensitive in there.)

(That socket error you're seeing is likely just from wrong permissions on the socket, e.g. an ugly workaround is to chmod 0700 /var/run/docker.sock, but you don't want to do that. Instead make sure e.g. the coder server user group inside the Docker container is allowed to write to it, e.g. chgrp 666 /var/run/docker.sock; chmod g+w /var/run/docker.sock, given it has a group ID of 666.)

@skerit
Copy link
Author

skerit commented Sep 5, 2022

Yes indeed, the chdir /home/skerit error is in the container of my workspace (coder-skerit-test1 in this case)

Here's the template. Saved it again as a zip so I could upload it on Github. (But again: it's just the docker-code-server example with a different dns server... 😬)
testspace.zip

About the permission error when running the server in docker: ok, I thought maybe the docker image would have the correct permissions set automatically, but I guess I can just connect to the container's console and change the permissions like that then.
Though I still prefer to not run coder itself in docker :)

@mafredri
Copy link
Member

mafredri commented Sep 5, 2022

Alright, the template you shared definitely matches the example, like you said. Then I suppose we'll have to figure out where /home/skerit is coming from since it doesn't come from the Terraform template, nor from the code-server Docker image. This is what the Docker image does.

Perhaps there would be some hints in the environment of the workspace?

$ docker exec -it coder-skerit-test1 /bin/bash
coder@test1:~$ env
...

Variables like PWD, HOME, USER would be most interesting here.

Though I still prefer to not run coder itself in docker :)

Yes, feel free to run it however you like. The problem is most likely not with the Coder server here.

@skerit
Copy link
Author

skerit commented Sep 5, 2022

Interesting:

[skerit@test1 coder]$ env
HOSTNAME=test1
PWD=/home/coder
HOME=/home/skerit
ENTRYPOINTD=/entrypoint.d
LANG=en_US.UTF-8
CODER_AGENT_TOKEN=xxx
TERM=xterm
USER=coder
SHLVL=1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
_=/usr/sbin/env

I have no idea how that HOME environment variable got in there.
When I inspect the workspace container, these are the configured environment variables there:

0: CODER_AGENT_TOKEN=xxx
1: PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2: LANG=en_US.UTF-8
3: ENTRYPOINTD=/entrypoint.d
4: USER=coder

To clarify: skerit is my account on the host server itself (uid 1000)
It's also my username of coder itself.

My other containers (the one running coder's postgres database for example) do not have HOME set to /home/skerit

@mafredri
Copy link
Member

mafredri commented Sep 5, 2022

Ok, thanks. It's as I suspected, but I am confused as to why it happens.

I think you can solve this (for you) by updating your template:

resource "docker_container" "workspace" {
  ...
  user = "coder:coder"
}

(Or alternatively, adding "HOME=/home/coder" to env = [...].)

However, it would be great to understand why this happens. Are you running Docker rootless for instance?

There are some hints in Docker run reference: ENV:

HOME Set based on the value of USER

And since the coder-server Dockerfile defines user as USER 1000, it could be picking up the UID->name mapping for your system (theory). If this theory holds, then it means our current Dockerfile/docker-compose setup (in repo) actually hides this issue because USER 1000 is coder (contents of /etc/passwd is coder:x:1000:1000::/:/bin/sh).

It may very well be that we should update our example template(s), or some at least.

@skerit
Copy link
Author

skerit commented Sep 5, 2022

If you want me to run any more test: no problem.

I'm not running rootless docker. Or at least: I did not install it that way. It's just a plain pacman -S docker install :)
I did test podman for a while (for other containers) but decided to stick with plain docker instead, so it has been totally removed.

If this theory holds, then it means our current Dockerfile/docker-compose setup (in repo) actually hides this issue

So I tested this in 2 ways now:

  • With code-server running outside of docker, as a systemd service
  • With code-server setup using docker-compose

Both had the issue.
(Though I didn't recreate the template for both tests, I'm not sure if that was needed?)

Updating the template

I added the user change to the template & pushed an update.
The agent still refuses to work unfortunately 😬

2022-09-05 16:12:58.723 [DEBUG]	(tailnet.wgengine)	<./tailnet/conn.go:521>	Logger.func1	"[v1] magicsock: ReSTUN: endpoint update active, need another later (\"derp-map-update\")"
2022-09-05 16:12:58.723 [DEBUG]	(tailnet.wgengine)	<./tailnet/conn.go:521>	Logger.func1	[v1] magicsock: starting endpoint update (derp-map-update)
2022-09-05 16:12:58.723 [WARN]	<./agent/agent.go:170>	(*agent).run.func1	agent script failed ...
  "error": run:
               github.com/coder/coder/agent.(*agent).runStartupScript
                   /home/runner/work/coder/coder/agent/agent.go:375
             - exit status 127
2022-09-05 16:12:58.827 [INFO]	<./agent/agent.go:323>	(*agent).runWebRTCNetworking	connected to webrtc broker
2022-09-05 16:12:58.827 [INFO]	<./agent/agent.go:279>	(*agent).runCoordinator	connected to coordination server

*Edit: Ah, seems I've got another error now. somehow the token is invalid inside the container?

2022-09-05 16:20:46.013 [INFO]	<./agent/agent.go:450>	(*agent).init	generating host key
2022-09-05 16:20:46.213 [INFO]	<./agent/agent.go:141>	(*agent).run	connecting
2022-09-05 16:20:46.215 [WARN]	<./agent/agent.go:150>	(*agent).run	failed to dial	{"error": "GET https://coder.mydomain.be/api/v2/workspaceagents/me/metadata: unexpected status code 401: Agent token is invalid.: Try logging in using 'coder login \u003curl\u003e'."}

I've restarted coder, deleted the workspaces, deleted the template & created it again, but now I always get the "Agent token is invalid" error. This seems to prevent code-server from starting, but I can open a terminal session via the web interface now.

@skerit
Copy link
Author

skerit commented Sep 6, 2022

Since I've got a good workaround for the original issue, I'll ask for help for the "Agent token is invalid" error on Discord!

@mafredri
Copy link
Member

mafredri commented Sep 8, 2022

After a lot of investigation on Discord, @skerit found this was caused by the Btrfs driver for Docker. Somehow the host filesystem got mounted inside the Docker container which resulted in a wrong HOME (via host /etc/passwd) and even in the Agent token is invalid error (we still don't know the why for this one).

This may have been tied to performing Btrfs rollbacks from snapshots. But somehow the issue only affected the docker-coder-server template (i.e. the codercom/code-server Docker image). Using the docker coder template worked fine. Perhaps the rollback corrupted the code-server image.

I'm closing this since the reported nil pointer dereference has been fixed and the rest is not a bug in Coder.

@mafredri mafredri closed this as completed Sep 8, 2022
@mafredri mafredri added bug api Area: HTTP API labels Sep 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api Area: HTTP API
Projects
None yet
Development

No branches or pull requests

2 participants