Investigate possible solutions for "Text file busy" errors during builds #14726

Open
sreya opened this issue Sep 18, 2024 · 13 comments
Labels
s2 Broken use cases or features (with a workaround). Only humans may set this.

Comments

@sreya
Collaborator

sreya commented Sep 18, 2024

This is unfortunately an issue with Terraform (and largely just filesystems in general). Some background:

  • "text file busy" (ETXTBSY) can occur on Linux when a process tries to write to or modify an executable that's currently running. In TF's case this typically happens when multiple processes try to write and use plugins from the same cache directory.
  • In Coder's case we use a separate cache directory per runner. For example, if you're running 3 runners, each should write its plugins to its own exclusive directory, avoiding this problem.
  • Despite this, I'm guessing that some terraform or plugin process is not exiting on a run, which causes subsequent runs to fail.

Regardless, we should do our best to work around this when we detect it during a build. I don't know how long the process lingers, but it could potentially be indefinite, which would render that particular runner dead for subsequent runs. The end user would then have to rebuild a number of times before getting a successful build, which is not acceptable.

e.g.

Initializing the backend...
09:45:57.672 Initializing provider plugins...
09:45:57.672 - Finding coder/coder versions matching "~> 0.23.0"...
09:45:57.761 - Finding hashicorp/kubernetes versions matching "~> 2.30.0"...
09:45:58.031 - Installing coder/coder v0.23.0...
09:45:58.330 - Installing hashicorp/kubernetes v2.30.0...
09:45:59.405 - Installed hashicorp/kubernetes v2.30.0 (signed by HashiCorp)
09:45:59.406 Error: Failed to install provider
09:45:59.406 Error while installing coder/coder v0.23.0: open
09:45:59.407 /tmp/coder/provisioner-1/tf/registry.terraform.io/coder/coder/0.23.0/linux_amd64/terraform-provider-coder_v0.23.0:
09:45:59.408 text file busy
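
Detecting this condition from Terraform's output could look roughly like the sketch below. It is only an illustration: the function name `findTextFileBusy` is made up, and it assumes any timestamps have already been stripped from the log lines, so that the path line starts with `/` and ends with `:` as in the error above.

```go
package main

import (
	"fmt"
	"strings"
)

// findTextFileBusy scans terraform output line by line. It reports whether a
// "text file busy" error is present and, if a line that looks like an absolute
// path ending in ":" was seen before it, returns that path.
// Assumption: timestamps have already been removed from each line.
func findTextFileBusy(output string) (path string, found bool) {
	for _, line := range strings.Split(output, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "/") && strings.HasSuffix(line, ":") {
			path = strings.TrimSuffix(line, ":")
		}
		if strings.Contains(line, "text file busy") {
			return path, true
		}
	}
	return "", false
}

func main() {
	out := `Error: Failed to install provider
Error while installing coder/coder v0.23.0: open
/tmp/coder/provisioner-1/tf/registry.terraform.io/coder/coder/0.23.0/linux_amd64/terraform-provider-coder_v0.23.0:
text file busy`
	path, found := findTextFileBusy(out)
	fmt.Println("busy:", found)
	fmt.Println("path:", path)
}
```

The returned path is the binary that some process presumably still has open for execution, which is the input the later investigation steps need.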
@sreya sreya added s2 Broken use cases or features (with a workaround). Only humans may set this. bug labels Sep 18, 2024
@spikecurtis
Contributor

Terraform plugins use https://github.com/hashicorp/go-plugin to communicate with Terraform. Plugins themselves are gRPC services that listen on a Unix domain socket in a temp directory (or on localhost on Windows). Terraform is the "client" of the gRPC service, but manages the lifecycle of the "service".

  1. Terraform starts the provider binary as a subcommand
  2. The provider starts listening on a Unix domain socket in a temp directory
  3. The provider writes the socket path to stdout (along with things like protocol version)
  4. Terraform reads the socket path over the pipe to the provider process
  5. Terraform connects to the gRPC server over the domain socket, and uses the provider via its gRPC API
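
To make steps 3 and 4 concrete, here is a rough sketch of parsing the line a provider prints to stdout. The field layout (`core version|app version|network|address|protocol`) is my understanding of go-plugin's documented handshake and should be treated as an assumption, as should the `handshake` type and `parseHandshake` name, which are invented for illustration. Extra trailing fields (for example a server certificate when auto mTLS is on) are ignored here.

```go
package main

import (
	"fmt"
	"strings"
)

// handshake holds the fields a plugin is assumed to print on stdout.
type handshake struct {
	CoreVersion string
	AppVersion  string
	Network     string // e.g. "unix"
	Addr        string // e.g. the socket path
	Protocol    string // e.g. "grpc"
}

// parseHandshake splits a pipe-delimited handshake line into its fields.
// Assumption: at least five fields are present; any extras are ignored.
func parseHandshake(line string) (handshake, error) {
	parts := strings.Split(strings.TrimSpace(line), "|")
	if len(parts) < 5 {
		return handshake{}, fmt.Errorf("expected at least 5 fields, got %d", len(parts))
	}
	return handshake{
		CoreVersion: parts[0],
		AppVersion:  parts[1],
		Network:     parts[2],
		Addr:        parts[3],
		Protocol:    parts[4],
	}, nil
}

func main() {
	// Hypothetical example line; the socket path is made up.
	h, err := parseHandshake("1|6|unix|/tmp/plugin123|grpc\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("dial %s %s using %s\n", h.Network, h.Addr, h.Protocol)
}
```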

At the end of the day, Terraform is supposed to call a special API to tell the provider to shut down, and/or forcibly kill the child process. However, it seems that sometimes that's not happening, and the provider process can linger, just listening on its domain socket.

@spikecurtis
Contributor

Some ideas:

If we get "text file busy" error:

  1. we could try to find (via ps or the /proc system) the offending process and kill it
  2. we could attempt to connect to its unix domain socket and send the Shutdown command --- might be complicated by authentication protocols (go-plugin supports an "auto mTLS" setting that ensures only the original client can connect).
  3. we could just have that particular provisioner daemon exit. If it's external, then the cluster manager (e.g. K8s) can restart it. If it's in-process with coderd, then we could have some threshold of killed provisioner daemons that triggers coderd to also exit and be restarted
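
For idea 1, a minimal sketch of finding the offending process through /proc might look like this. It assumes a Linux-style /proc where each `<pid>/exe` is a symlink to the running binary; the function name `pidsRunning` is hypothetical, and the proc root is a parameter only so the function can be pointed at a test directory.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// pidsRunning returns the PIDs under procRoot (normally "/proc") whose exe
// symlink points at exePath.
func pidsRunning(procRoot, exePath string) ([]int, error) {
	entries, err := os.ReadDir(procRoot)
	if err != nil {
		return nil, err
	}
	var pids []int
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a process directory
		}
		target, err := os.Readlink(filepath.Join(procRoot, e.Name(), "exe"))
		if err != nil {
			continue // process exited, kernel thread, or permission denied
		}
		// Linux appends " (deleted)" if the binary was unlinked while running.
		target = strings.TrimSuffix(target, " (deleted)")
		if target == exePath {
			pids = append(pids, pid)
		}
	}
	return pids, nil
}

func main() {
	// Demo: look for processes running this very executable.
	self, err := os.Executable()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	pids, err := pidsRunning("/proc", self)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("pids running", self, ":", pids)
}
```

Each returned PID could then be signalled, for example with `syscall.Kill(pid, syscall.SIGKILL)`, though whether killing is the right response is exactly the open question in this thread.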

@spikecurtis
Contributor

We could fix Terraform and/or OpenTofu such that they don't reinstall the provider binary if it already exists (possibly including hashing contents). This would sidestep the issue, since we don't write to the file. However, the underlying issue of leaking provider processes would remain.
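
As a sketch of what "don't reinstall if it already exists" could mean, the snippet below hashes the existing file and skips the write when the contents already match. This is not how Terraform or OpenTofu install providers today; `installIfChanged` and `fileSHA256` are invented names. The temp-file-plus-rename step is an extra technique I've added: renaming over the target replaces the directory entry instead of opening the running binary for writing.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// fileSHA256 streams the file at path through SHA-256.
func fileSHA256(path string) ([]byte, error) {
	f, err := os.Open(path) // read-only open; never opens the file for writing
	if err != nil {
		return nil, err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return nil, err
	}
	return h.Sum(nil), nil
}

// installIfChanged writes data to path unless a file with identical contents
// is already there. It reports whether it wrote anything.
func installIfChanged(path string, data []byte) (bool, error) {
	want := sha256.Sum256(data)
	got, err := fileSHA256(path)
	if err == nil && bytes.Equal(got, want[:]) {
		return false, nil // already installed; leave the existing binary alone
	}
	if err != nil && !os.IsNotExist(err) {
		return false, err
	}
	// Write a sibling temp file, then rename it over the target.
	tmp, err := os.CreateTemp(filepath.Dir(path), ".install-*")
	if err != nil {
		return false, err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly after a successful rename
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return false, err
	}
	if err := tmp.Chmod(0o755); err != nil {
		tmp.Close()
		return false, err
	}
	if err := tmp.Close(); err != nil {
		return false, err
	}
	if err := os.Rename(tmp.Name(), path); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	dir, err := os.MkdirTemp("", "provider-cache-")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(dir)
	target := filepath.Join(dir, "terraform-provider-example") // hypothetical name
	for i := 0; i < 2; i++ {
		wrote, err := installIfChanged(target, []byte("provider bytes"))
		fmt.Println("wrote:", wrote, "err:", err)
	}
}
```

As noted above, this would only sidestep the write; a leaked provider process would still be running.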

@spikecurtis
Contributor

spikecurtis commented Sep 20, 2024

I'll also check whether Terraform is setting Pdeathsignal in https://pkg.go.dev/syscall#SysProcAttr

UPDATE: Terraform doesn't set any special SysProcAttr, so I think that just means the child gets SIGHUP by default if the parent dies without sending it a signal.
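
For reference, setting Pdeathsig from Go would look something like the sketch below. Terraform does not do this per the update above; the helper name `commandWithPdeathsig` is made up, and the `Pdeathsig` field only exists in `syscall.SysProcAttr` on Linux (and FreeBSD), so this will not compile elsewhere.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// commandWithPdeathsig builds a command whose child is sent SIGKILL by the
// kernel when its parent goes away. Linux-only sketch.
func commandWithPdeathsig(name string, args ...string) *exec.Cmd {
	cmd := exec.Command(name, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Pdeathsig: syscall.SIGKILL,
	}
	return cmd
}

func main() {
	cmd := commandWithPdeathsig("sleep", "60")
	if err := cmd.Start(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("started child", cmd.Process.Pid)
	// Clean up the demo child explicitly so nothing lingers.
	_ = cmd.Process.Kill()
	_ = cmd.Wait()
}
```

One caveat from the Go `syscall` documentation: the signal is tied to the OS thread that created the child, not the whole parent process, so real code may need `runtime.LockOSThread` around the start of the child.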

@spikecurtis
Contributor

I don't think I have the full story yet, but I've just confirmed that if you specify a provider_installation -> filesystem_mirror in your .terraformrc, then when "installing" providers, Terraform just creates symlinks to your mirror. That means that when running multiple provisionerds on a single host, they're not actually being isolated at the filesystem level, even though we take some care to give them unique cache directories.

If you use the unpacked layout, Terraform will attempt to create a symbolic link to the mirror directory when installing the provider, rather than creating a deep copy of the directory. The packed layout prevents this because Terraform must extract the zip file during installation.

https://developer.hashicorp.com/terraform/cli/config/config-file#filesystem_mirror

Pretty sure at least one customer who is seeing this issue is using a filesystem_mirror; we should confirm with the others.
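
One way to check whether a provider path in a provisioner's cache directory really lives there, or resolves through a symlink to somewhere else such as a mirror, is sketched below. The name `resolvesOutside` and the directory names in `main` are illustrative only.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// resolvesOutside reports whether path, after following all symlinks, ends up
// outside dir (also resolved).
func resolvesOutside(dir, path string) (bool, error) {
	realDir, err := filepath.EvalSymlinks(dir)
	if err != nil {
		return false, err
	}
	realPath, err := filepath.EvalSymlinks(path)
	if err != nil {
		return false, err
	}
	rel, err := filepath.Rel(realDir, realPath)
	if err != nil {
		return false, err
	}
	outside := rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator))
	return outside, nil
}

func main() {
	root, err := os.MkdirTemp("", "tf-mirror-demo-")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(root)

	// Hypothetical layout: a mirror directory and a cache that symlinks into it.
	mirror := filepath.Join(root, "mirror")
	cache := filepath.Join(root, "cache")
	if err := os.MkdirAll(mirror, 0o755); err != nil {
		fmt.Println("error:", err)
		return
	}
	if err := os.MkdirAll(cache, 0o755); err != nil {
		fmt.Println("error:", err)
		return
	}
	link := filepath.Join(cache, "provider")
	if err := os.Symlink(mirror, link); err != nil {
		fmt.Println("error:", err)
		return
	}

	outside, err := resolvesOutside(cache, link)
	fmt.Println("resolves outside cache:", outside, "err:", err)
}
```

If this reports true for a provider directory, two provisioners with separate cache directories could still be sharing one binary on disk.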

@spikecurtis
Contributor

One of the customers seeing this is using the zipped layout for their filesystem mirror. That explains why the provider gets opened for writing (unzipping the provider package), but doesn't explain why the provider is still executing, since the symlink stuff I mentioned only applies to unpacked layout.

I've confirmed that the output of lsof on BusyBox, which the customer is using, lists the real path of the executable, and it shows that the cache directory is not symlinked to the filesystem mirror. So symlinking to the mirror doesn't explain how they saw text file busy.

I'm following up with the other customers who have seen the issue, to see what kind of layout they are using.

I'm back to thinking that our provider must be still running from a previous build on the same provisionerd. Not sure how this could happen.

@spikecurtis
Contributor

Another customer has said they aren't using a filesystem_mirror, so that discredits the symlinking theory even more.

@datapedd

datapedd commented Oct 3, 2024

I also get this error:

Initializing the backend...
Initializing provider plugins...
- Finding latest version of hashicorp/kubernetes...
- Finding latest version of coder/coder...
- Installing hashicorp/kubernetes v2.32.0...
- Installed hashicorp/kubernetes v2.32.0 (signed by HashiCorp)
- Installing coder/coder v1.0.3...
Error: Failed to install provider
Error while installing coder/coder v1.0.3: open
/home/coder/.cache/coder/provisioner-0/tf/registry.terraform.io/coder/coder/1.0.3/linux_amd64/terraform-provider-coder_v1.0.3:
text file busy

this is the tf file:
main.tf.txt

and here are the coderd provisioner logs (busy error):
coderd logs.txt

@spikecurtis
Contributor

Well, I guess that means coder/terraform-provider-coder#290 didn't help, since that was released in v1.0.3

@spikecurtis
Contributor

@datapedd what was the workspace ID of the build that failed with "text file busy"?

Is there any chance you could exec into the coderd pod and see if /home/coder/.cache/coder/provisioner-0/tf/registry.terraform.io/coder/coder/1.0.3/linux_amd64/terraform-provider-coder_v1.0.3 is still running, and if so, get a core dump?

@datapedd

datapedd commented Oct 3, 2024

In the coderd pod, the provisioner is running (per ps aux). I can't create a dump because it says read-only. After I killed the provisioner process, it worked again.

@datapedd

datapedd commented Oct 3, 2024

[screenshots: terra_2, terra_1]

Maybe related to some pod failing during the bash script to get released by terraform?

@spikecurtis
Contributor

@datapedd what was the exact time the build that failed started? You should be able to see this by clicking the builds icon on the left:

[image: builds-icon]

Then, click the failed build.

@matifali matifali removed the bug label Oct 14, 2024
spikecurtis added a commit that referenced this issue Oct 16, 2024
re: #14726

If we see "text file busy" in the errors while initializing terraform,
attempt to query the pprof endpoint set up by
coder/terraform-provider-coder#295 and log at
CRITICAL.

---------

Signed-off-by: Spike Curtis <spike@coder.com>
stirby pushed a commit that referenced this issue Oct 28, 2024
(cherry picked from commit d676ad5)
stirby added a commit that referenced this issue Oct 28, 2024
(cherry picked from commit d676ad5)

Co-authored-by: Spike Curtis <spike@coder.com>