Skip to content

fix: Ensure terraform tests have a cache path and logger #3161

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Aug 4, 2022

Conversation

mafredri
Copy link
Member

This PR sets cache path for terraform in tests.

However, for some reason this does not work.

Related: #3159

@mafredri
Copy link
Member Author

Looks like CI tests pass, but locally I keep running into:

--- FAIL: TestProvision (0.06s)
    t.go:81: 2022-07-22 21:28:14.595 [DEBUG]	<github.com/coder/coder/provisioner/terraform/provision.go:90>	(*server).Provision	running initialization
    t.go:81: 2022-07-22 21:28:14.604 [DEBUG]	<github.com/coder/coder/provisioner/terraform/provision.go:90>	(*server).Provision	running initialization
    t.go:81: 2022-07-22 21:28:14.604 [DEBUG]	<github.com/coder/coder/provisioner/terraform/provision.go:90>	(*server).Provision	running initialization
    t.go:81: 2022-07-22 21:28:14.613 [DEBUG]	<github.com/coder/coder/provisioner/terraform/provision.go:90>	(*server).Provision	running initialization
    t.go:81: 2022-07-22 21:28:14.620 [DEBUG]	<github.com/coder/coder/provisioner/terraform/provision.go:90>	(*server).Provision	running initialization
    t.go:81: 2022-07-22 21:28:14.631 [DEBUG]	<github.com/coder/coder/provisioner/terraform/provision.go:90>	(*server).Provision	running initialization
    t.go:81: 2022-07-22 21:28:14.687 [DEBUG]	<github.com/coder/coder/provisioner/terraform/provision.go:95>	(*server).Provision	ran initialization
    t.go:81: 2022-07-22 21:28:14.723 [DEBUG]	<github.com/coder/coder/provisioner/terraform/provision.go:95>	(*server).Provision	ran initialization
    t.go:81: 2022-07-22 21:28:14.724 [DEBUG]	<github.com/coder/coder/provisioner/terraform/provision.go:95>	(*server).Provision	ran initialization
    t.go:81: 2022-07-22 21:28:14.786 [DEBUG]	<github.com/coder/coder/provisioner/terraform/provision.go:90>	(*server).Provision	running initialization
    t.go:81: 2022-07-22 21:28:14.831 [DEBUG]	<github.com/coder/coder/provisioner/terraform/provision.go:90>	(*server).Provision	running initialization
    t.go:81: 2022-07-22 21:28:14.894 [DEBUG]	<github.com/coder/coder/provisioner/terraform/provision.go:95>	(*server).Provision	ran initialization
    t.go:81: 2022-07-22 21:28:16.656 [DEBUG]	<github.com/coder/coder/provisioner/terraform/provision.go:95>	(*server).Provision	ran initialization
    --- FAIL: TestProvision/single-resource-dry-run (3.40s)
        provision_test.go:244: log: [DEBUG]
        provision_test.go:244: log: [DEBUG] Initializing the backend...
        provision_test.go:244: log: [DEBUG]
        provision_test.go:244: log: [DEBUG] Initializing provider plugins...
        provision_test.go:244: log: [DEBUG] - Finding latest version of hashicorp/null...
        provision_test.go:244: log: [DEBUG] - Installing hashicorp/null v3.1.1...
        provision_test.go:244: log: [ERROR]
        provision_test.go:244: log: [ERROR] Error: Failed to install provider
        provision_test.go:244: log: [ERROR]
        provision_test.go:244: log: [ERROR] Error while installing hashicorp/null v3.1.1: open
        provision_test.go:244: log: [ERROR] /tmp/TestProvision4045446544/001/registry.terraform.io/hashicorp/null/3.1.1/linux_amd64/terraform-provider-null_v3.1.1_x5:
        provision_test.go:244: log: [ERROR] text file busy
        provision_test.go:244: log: [ERROR]
        provision_test.go:251:
            	Error Trace:	/home/maf/src/coder/provisioner/terraform/provision_test.go:251
            	Error:      	Received unexpected error:
            	            	initialize terraform: exit status 1
            	            		storj.io/drpc/drpcwire.UnmarshalError:26
            	            		storj.io/drpc/drpcstream.(*Stream).HandlePacket:198
            	            		storj.io/drpc/drpcmanager.(*Manager).manageReader:216
            	Test:       	TestProvision/single-resource-dry-run
FAIL
FAIL	github.com/coder/coder/provisioner/terraform	63.882s

@mafredri
Copy link
Member Author

Here's another error I see locally, it's even more weird since the installation of null seems successful:

    --- FAIL: TestProvision/single-resource (19.10s)
        provision_test.go:246: log: [DEBUG]
        provision_test.go:246: log: [DEBUG] Initializing the backend...
        provision_test.go:246: log: [DEBUG]
        provision_test.go:246: log: [DEBUG] Initializing provider plugins...
        provision_test.go:246: log: [DEBUG] - Finding latest version of hashicorp/null...
        provision_test.go:246: log: [DEBUG] - Installing hashicorp/null v3.1.1...
        provision_test.go:246: log: [DEBUG] - Installed hashicorp/null v3.1.1 (signed by HashiCorp)
        provision_test.go:246: log: [DEBUG]
        provision_test.go:246: log: [DEBUG] Terraform has created a lock file .terraform.lock.hcl to record the provider
        provision_test.go:246: log: [DEBUG] selections it made above. Include this file in your version control repository
        provision_test.go:246: log: [DEBUG] so that Terraform can guarantee to make the same selections by default when
        provision_test.go:246: log: [DEBUG] you run "terraform init" in the future.
        provision_test.go:246: log: [DEBUG]
        provision_test.go:246: log: [DEBUG] Terraform has been successfully initialized!
        provision_test.go:246: log: [DEBUG]
        provision_test.go:246: log: [DEBUG] You may now begin working with Terraform. Try running "terraform plan" to see
        provision_test.go:246: log: [DEBUG] any changes that are required for your infrastructure. All Terraform commands
        provision_test.go:246: log: [DEBUG] should now work.
        provision_test.go:246: log: [DEBUG]
        provision_test.go:246: log: [DEBUG] If you ever set or change modules or backend configuration for Terraform,
        provision_test.go:246: log: [DEBUG] rerun this command to reinitialize your working directory. If you forget, other
        provision_test.go:246: log: [DEBUG] commands will detect it and remind you to do so if necessary.
        provision_test.go:246: log: [INFO] Terraform 1.1.9
        provision_test.go:246: log: [INFO] null_resource.A: Plan to create
        provision_test.go:246: log: [INFO] Plan: 1 to add, 0 to change, 0 to destroy.
        provision_test.go:246: log: [INFO] null_resource.A: Creating...
        provision_test.go:246: log: [INFO] null_resource.A: Creation complete after 0s [id=5091610938476510646]
        provision_test.go:246: log: [INFO] Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
        provision_test.go:246: log: [INFO] Outputs: 0
        provision_test.go:283:
            	Error Trace:	/home/maf/src/coder/provisioner/terraform/provision_test.go:283
            	Error:      	Not equal:
            	            	expected: ""
            	            	actual  : "terraform show state: \x1b[31m\x1b[31m╷\x1b[0m\x1b[0m\n\x1b[31m│\x1b[0m \x1b[0m\x1b[1m\x1b[31mError: \x1b[0m\x1b[0m\x1b[1mRequired plugins are not installed\x1b[0m\n\x1b[31m│\x1b[0m \x1b[0m\n\x1b[31m│\x1b[0m \x1b[0m\x1b[0mThe installed provider plugins are not consistent with the packages\n\x1b[31m│\x1b[0m \x1b[0mselected in the dependency lock file:\n\x1b[31m│\x1b[0m \x1b[0m  - registry.terraform.io/hashicorp/null: the cached package for registry.terraform.io/hashicorp/null 3.1.1 (in .terraform/providers) does not match any of the checksums recorded in the dependency lock file\n\x1b[31m│\x1b[0m \x1b[0m\n\x1b[31m│\x1b[0m \x1b[0mTerraform uses external plugins to integrate with a variety of different\n\x1b[31m│\x1b[0m \x1b[0minfrastructure services. To download the plugins required for this\n\x1b[31m│\x1b[0m \x1b[0mconfiguration, run:\n\x1b[31m│\x1b[0m \x1b[0m  terraform init\n\x1b[31m╵\x1b[0m\x1b[0m\n\x1b[0m\x1b[0m\n: exit status 1"

            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1 +1,14 @@
            	            	-
            	            	+terraform show state: ╷
            	            	+│ Error: Required plugins are not installed
            	            	+│
            	            	+│ The installed provider plugins are not consistent with the packages
            	            	+│ selected in the dependency lock file:
            	            	+│   - registry.terraform.io/hashicorp/null: the cached package for registry.terraform.io/hashicorp/null 3.1.1 (in .terraform/providers) does not match any of the checksums recorded in the dependency lock file
            	            	+│
            	            	+│ Terraform uses external plugins to integrate with a variety of different
            	            	+│ infrastructure services. To download the plugins required for this
            	            	+│ configuration, run:
            	            	+│   terraform init
            	            	+╵
            	            	+
            	            	+: exit status 1
            	Test:       	TestProvision/single-resource
    t.go:81: 2022-07-25 11:14:24.049 [DEBUG]	<github.com/coder/coder/provisioner/terraform/provision.go:95>	(*server).Provision	ran initialization

@mafredri
Copy link
Member Author

Hmm, problem goes away when disabling parallel tests. Would indicate there's a concurrency problem with running multiple Terraform instances in parallel. Not sure why it only seems to manifests when setting the CachePath.

@mafredri
Copy link
Member Author

@kylecarbs @spikecurtis any ideas as to why this is happening? Do we just disable parallel tests for these cases?

@spikecurtis
Copy link
Contributor

It appears that when the cachedir is not set, we don't set Terraform's TF_PLUGIN_CACHE_DIR environment variable, and so the Plugin Cache is not set. This means that each individual job has the plugins downloaded during terraform init.

By setting that directory, and setting at the top level, all the test cases are using the same cache dir, and perhaps terraform is not carefully built to handle non-exclusive access to that cache by multiple terraform processes.

@mafredri
Copy link
Member Author

Ah, thanks for that insight @spikecurtis. While testing I did try to explicitly set terraform init -lock=true in our provisioner code because I was unsure of the default:

  -lock=false             Don't hold a state lock during backend migration.
                          This is dangerous if others might concurrently run
                          commands against the same workspace.

However, that did not help. So based on what you wrote it would seem that indeed terraform is not being careful (enough).

How about if we ensure each provisioner sets the terraform cache dir to a subdir, e.g. TF_PLUGIN_CACHE_DIR=filepath.Join(cachePath, id)? Where id could be like 001, 002.

@spikecurtis
Copy link
Contributor

It does seem a shame that we can’t have multiple jobs share the plug-in cache, as this will limit performance/scale. Perhaps we should report the issue to Hashicorp and see what they say.

Setting the cache to a subdir based on provisioner won’t unblock this PR because there is just one provisioner created at the top level of the test. And if you refactor to create new provisioners in the subtests, the issue will probably go away for now since we only run a single job at a time.

@mafredri
Copy link
Member Author

Oh doh, that makes sense, thanks for saving me from that goose chase. I'll look into reporting this upstream.

@github-actions
Copy link

github-actions bot commented Aug 4, 2022

This Pull Request is becoming stale. In order to minimize WIP, prevent merge conflicts and keep the tracker readable, I'm going close to this PR in 3 days if there isn't more activity.

@github-actions github-actions bot added the stale This issue is like stale bread. label Aug 4, 2022
@mafredri mafredri force-pushed the mafredri/provisionerd-cache-path branch from 22886d3 to ca6b211 Compare August 4, 2022 09:08
@mafredri
Copy link
Member Author

mafredri commented Aug 4, 2022

@spikecurtis I found this in the documentation:

Note: The plugin cache directory is not guaranteed to be concurrency safe. The provider installer's behavior in environments with multiple terraform init calls is undefined.

So it looks like a conscious choice not to support it.

I've implemented a fix in ca6b211 (#3161).

One thing that worries me is that we could, potentially, still have multiple calls to terraform init in the same directory, with no cache path set. Perhaps we should always allow only one single call to terraform init at a time (per server).

@mafredri mafredri marked this pull request as ready for review August 4, 2022 11:08
@spikecurtis
Copy link
Contributor

One thing that worries me is that we could, potentially, still have multiple calls to terraform init in the same directory, with no cache path set. Perhaps we should always allow only one single call to terraform init at a time (per server).

If no cache path is set, then no cache is used. Each job gets its own directory, so it should not be possible to have multiple calls to terraform init in the same directory. If it were possible to have multiple terraform calls in the same directory, any call to terraform would be bad news, not just init.

One thing I am worried about though, is that we are in the habit of starting multiple provisionerds on the same system, so we'll need to take some care to ensure that they have independent cache directories (if set) since the locking is only in the context of a single provisioner.

@mafredri
Copy link
Member Author

mafredri commented Aug 4, 2022

If no cache path is set, then no cache is used. Each job gets its own directory, so it should not be possible to have multiple calls to terraform init in the same directory. If it were possible to have multiple terraform calls in the same directory, any call to terraform would be bad news, not just init.

Yeah, that's a fair conclusion.

One thing I am worried about though, is that we are in the habit of starting multiple provisionerds on the same system, so we'll need to take some care to ensure that they have independent cache directories (if set) since the locking is only in the context of a single provisioner.

Good callout. This is something I had in mind too. In this case I was thinking we could assign each provisioner a subfolder within the cache, combined with the mutex that should solve the immediate issue. We'd need to assign each provisioner an ID though. Perhaps a numerical ID from 0 onward so that the cache can be re-used after a restart of the service.

@mafredri mafredri merged commit fb9fca8 into main Aug 4, 2022
@mafredri mafredri deleted the mafredri/provisionerd-cache-path branch August 4, 2022 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale This issue is like stale bread.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants