Investigate possible solutions for "Text file busy" errors during builds #14726

sreya · 2024-09-18T21:27:55Z

This is unfortunately an issue with Terraform (and largely just filesystems in general). Some background:

"text file busy" (ETXTBSY) can occur on Linux when a process tries to write to or modify an executable that is currently running. In Terraform's case, this typically happens when multiple processes write to and use plugins from the same cache directory.
In Coder's case, we use a separate cache directory per runner: if you're running 3 runners, each should write its plugins to its own exclusive directory, which avoids this problem.
Despite this, I'm guessing that some Terraform or plugin process is not exiting after a run, which causes subsequent runs to fail.

Regardless, we should do our best to work around this when we detect it during a build. I don't know how long the process lingers, but it could be indefinitely, which would render that particular runner dead for subsequent runs. The end user would then have to rebuild a number of times before getting a successful build, which is not acceptable.

e.g.

Initializing the backend...
09:45:57.672 Initializing provider plugins...
09:45:57.672 - Finding coder/coder versions matching "~> 0.23.0"...
09:45:57.761 - Finding hashicorp/kubernetes versions matching "~> 2.30.0"...
09:45:58.031 - Installing coder/coder v0.23.0...
09:45:58.330 - Installing hashicorp/kubernetes v2.30.0...
09:45:59.405 - Installed hashicorp/kubernetes v2.30.0 (signed by HashiCorp)
09:45:59.406 Error: Failed to install provider
09:45:59.406 Error while installing coder/coder v0.23.0: open
09:45:59.407 /tmp/coder/provisioner-1/tf/registry.terraform.io/coder/coder/0.23.0/linux_amd64/terraform-provider-coder_v0.23.0:
09:45:59.408 text file busy


spikecurtis · 2024-09-20T11:38:48Z

Terraform plugins use https://github.com/hashicorp/go-plugin to communicate with Terraform. Plugins themselves are gRPC services that listen on a Unix domain socket in a temp directory (or on localhost on Windows). Terraform is the "client" of the gRPC service, but manages the lifecycle of the "service".

Terraform starts the provider binary as a subcommand
The provider starts a Unix domain socket in a temp directory
The provider writes the socket path to stdout (along with things like protocol version)
Terraform reads the socket path over the pipe to the provider process
Terraform connects to the gRPC server over the domain socket, and uses the provider via its gRPC API
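
To make the handshake step concrete, here is a rough Go sketch that parses the line a go-plugin server writes to stdout. It assumes the pipe-delimited layout described in the go-plugin documentation (core protocol version, app protocol version, network type, network address, protocol); the type and function names, and the sample socket path, are made up for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// handshake holds the fields of the line a plugin prints on startup.
// Field names are illustrative, not go-plugin's own types.
type handshake struct {
	CoreVersion string
	AppVersion  string
	Network     string // e.g. "unix"
	Addr        string // e.g. the socket path in a temp directory
	Protocol    string // e.g. "grpc"
}

// parseHandshake splits a handshake line on "|". This sketch requires the
// first five fields and ignores any extra trailing fields.
func parseHandshake(line string) (handshake, error) {
	parts := strings.Split(strings.TrimSpace(line), "|")
	if len(parts) < 5 {
		return handshake{}, fmt.Errorf("malformed handshake line: %q", line)
	}
	return handshake{
		CoreVersion: parts[0],
		AppVersion:  parts[1],
		Network:     parts[2],
		Addr:        parts[3],
		Protocol:    parts[4],
	}, nil
}

func main() {
	// Sample line with a hypothetical socket path.
	h, err := parseHandshake("1|6|unix|/tmp/plugin123|grpc\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("dial %s at %s using %s\n", h.Network, h.Addr, h.Protocol)
}
```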

At the end of the day, Terraform is supposed to call a special API to tell the provider to shut down, and/or forcibly kill the child process. However, it seems that sometimes that's not happening, and the provider process can linger, just listening on its domain socket.

spikecurtis · 2024-09-20T11:52:26Z

Some ideas:

If we get "text file busy" error:

we could try to find (via ps or the /proc filesystem) the offending process and kill it
we could attempt to connect to its Unix domain socket and send the Shutdown command, though this might be complicated by authentication protocols (go-plugin supports an "auto mTLS" setting that ensures only the original client can connect).
we could just have that particular provisioner daemon exit. If it's external, then the cluster manager (e.g. K8s) can restart it. If it's in-process with coderd, then we could have some threshold of killed provisioner daemons that triggers coderd to also exit and be restarted

spikecurtis · 2024-09-20T11:54:39Z

We could fix Terraform and/or OpenTofu such that they don't reinstall the provider binary if it already exists (possibly including hashing contents). This would sidestep the issue, since we don't write to the file. However, the underlying issue of leaking provider processes would remain.

spikecurtis · 2024-09-20T12:01:26Z

I'll also check whether Terraform is setting Pdeathsignal in https://pkg.go.dev/syscall#SysProcAttr

UPDATE: Terraform doesn't set any special SysProcAttr, so I think that just means the child gets SIGHUP by default if the parent dies without sending it a signal.

spikecurtis · 2024-09-23T08:03:45Z

I don't think I have the full story yet, but I've just confirmed that if you specify a provider_installation -> filesystem_mirror in your .terraformrc, then when "installing" providers, Terraform just creates symlinks to your mirror. That means that when running multiple provisionerds on a single host, they're not actually being isolated at the filesystem level, even though we take some care to give them unique cache directories.

If you use the unpacked layout, Terraform will attempt to create a symbolic link to the mirror directory when installing the provider, rather than creating a deep copy of the directory. The packed layout prevents this because Terraform must extract the zip file during installation.

https://developer.hashicorp.com/terraform/cli/config/config-file#filesystem_mirror

Pretty sure at least one customer who is seeing this issue is using a filesystem_mirror; we should confirm with the others.

spikecurtis · 2024-09-24T11:25:42Z

One of the customers seeing this is using the zipped layout for their filesystem mirror. That explains why the provider gets opened for writing (unzipping the provider package), but doesn't explain why the provider is still executing, since the symlink stuff I mentioned only applies to unpacked layout.

I've confirmed that the output of lsof on Busybox, which the customer is using, lists the real path of the executable, and it shows that the cache directory is not symlinked to the filesystem mirror. So symlinking to the mirror doesn't explain how they saw text file busy.

I'm following up with the other customers who have seen the issue, to see what kind of layout they are using.

I'm back to thinking that our provider must be still running from a previous build on the same provisionerd. Not sure how this could happen.

spikecurtis · 2024-09-30T04:18:19Z

Another customer has said they aren't using a filesystem_mirror, so that discredits the symlinking theory even more.

datapedd · 2024-10-03T08:56:37Z

I also get this error:
Initializing the backend...
Initializing provider plugins...

Finding latest version of hashicorp/kubernetes...
Finding latest version of coder/coder...
Installing hashicorp/kubernetes v2.32.0...
Installed hashicorp/kubernetes v2.32.0 (signed by HashiCorp)
Installing coder/coder v1.0.3...
Error: Failed to install provider
Error while installing coder/coder v1.0.3: open
/home/coder/.cache/coder/provisioner-0/tf/registry.terraform.io/coder/coder/1.0.3/linux_amd64/terraform-provider-coder_v1.0.3:
text file busy

this is the tf file:
main.tf.txt

and here are the coderd provisioner logs (busy error):
coderd logs.txt

spikecurtis · 2024-10-03T11:02:41Z

Well, I guess that means coder/terraform-provider-coder#290 didn't help, since that was released in v1.0.3

spikecurtis · 2024-10-03T11:04:28Z

@datapedd what was the workspace ID of the build that failed with "text file busy"?

Is there any chance you could exec into the coderd pod and see if /home/coder/.cache/coder/provisioner-0/tf/registry.terraform.io/coder/coder/1.0.3/linux_amd64/terraform-provider-coder_v1.0.3 is still running, and if so, get a core dump?

datapedd · 2024-10-03T11:35:20Z

In the coderd pod, the provisioner is running (ps aux). I can't create a dump; it says read-only. After I killed the provisioner process, it worked again.

datapedd · 2024-10-03T11:41:22Z

Maybe related to some pod failing during the bash script to get released by terraform?

spikecurtis · 2024-10-03T12:36:17Z

@datapedd what was the exact time the build that failed started? You should be able to see this by clicking the builds icon on the left:

Then, click the failed build.

re: #14726 If we see "text file busy" in the errors while initializing terraform, attempt to query the pprof endpoint set up by coder/terraform-provider-coder#295 and log at CRITICAL. --------- Signed-off-by: Spike Curtis <spike@coder.com>

re: #14726 If we see "text file busy" in the errors while initializing terraform, attempt to query the pprof endpoint set up by coder/terraform-provider-coder#295 and log at CRITICAL. --------- Signed-off-by: Spike Curtis <spike@coder.com> (cherry picked from commit d676ad5)

sreya added the s2 and bug labels Sep 18, 2024

sreya assigned spikecurtis Sep 19, 2024

spikecurtis mentioned this issue Sep 24, 2024

fix: update to terraform-plugin-sdk v2.34.0 coder/terraform-provider-coder#290

Merged

spikecurtis mentioned this issue Oct 4, 2024

chore: add http/pprof server over unix socket for debug coder/terraform-provider-coder#295

Merged

matifali removed the bug label Oct 14, 2024

spikecurtis mentioned this issue Oct 15, 2024

chore: log provider stack traces on text file busy #15078

Merged

stirby mentioned this issue Oct 28, 2024

chore: log provider stack traces on text file busy #15249

Merged
