Skip to content

chore: cache terraform providers between CI test runs #17373

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 28, 2025

Conversation

hugodutka
Copy link
Contributor

@hugodutka hugodutka commented Apr 12, 2025

Addresses coder/internal#322.

This PR starts caching Terraform providers used by TestProvision in provisioner/terraform/provision_test.go. The goal is to improve the reliability of this test by cutting down on the number of network calls to external services. It leverages GitHub Actions cache, which on depot runners is persisted for 14 days by default.

Other than the aforementioned TestProvision, I couldn't find any other tests which depend on external terraform providers.

@hugodutka hugodutka force-pushed the hugodutka/terraform-providers-ci-cache branch 12 times, most recently from 9a5b5a1 to 8885000 Compare April 13, 2025 19:47
@github-actions github-actions bot added the stale This issue is like stale bread. label Apr 21, 2025
@johnstcn johnstcn removed the stale This issue is like stale bread. label Apr 22, 2025
@hugodutka hugodutka force-pushed the hugodutka/terraform-providers-ci-cache branch 11 times, most recently from 9d41243 to 2ab657c Compare April 22, 2025 12:05
@hugodutka hugodutka force-pushed the hugodutka/terraform-providers-ci-cache branch from 2ab657c to 3a9e0b3 Compare April 22, 2025 12:17
@hugodutka hugodutka marked this pull request as ready for review April 22, 2025 13:13
@hugodutka hugodutka requested a review from spikecurtis April 22, 2025 13:13
file := templateFiles[fileName]
_, err := hasher.Write([]byte(fileName))
require.NoError(t, err)
_, err = hasher.Write([]byte(file))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to include some sort of delimiter or dogfood: stuff will hash the same as dog: foodstuff


for fileName, file := range templateFiles {
filePath := filepath.Join(filesDir, fileName)
if _, err := os.Stat(filePath); os.IsNotExist(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't we just check that the directory doesn't already exist? How could the file already exist if the directory doesn't?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, this check is redundant. I'll remove it.

// Each test gets a unique cache dir based on its name and template files.
// This ensures that tests can download providers in parallel and that they
// will redownload providers if the template files change.
hash := hashTemplateFilesAndTestName(t, t.Name(), templateFiles)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Providers are versioned, so I don't think each test or each version of each test needs its own unique mirror.

It results in multiple copies of the providers being downloaded on each system, and there doesn't seem to be any mechanism to clean up old versions of tests. The CI caches expire after 14 days, but people's personal development environments won't get cleaned up.

terraform providers mirror can be run multiple times on the same target directory and it will merge the results:

You can run terraform providers mirror again on an existing mirror directory to update it with new packages.

https://developer.hashicorp.com/terraform/cli/commands/providers/mirror

So, what do you think of doing two passes over the test cases: one serially that sets of the mirror with the totality of the modules we need, then a second, parallel pass that runs the tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing a serial pass is pretty slow if you haven't cached the providers beforehand. I just tried it, and it takes 27 seconds to cache the providers serially in my Coder workspace. In contrast, doing it in parallel only takes 5 seconds.

The cache size across all tests is currently around 70 MB - probably not a big deal. That said, it could balloon if someone starts iterating on a test locally. To avoid that, maybe we should add cleanup logic instead of doing a serial pass?

We already know the paths to provider directories based on test hashes, so at the start of TestProvision, we could check which ones are needed and remove the rest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleanup logic would also be fine here, so we don't blow up people's cache.

with:
path: ${{ inputs.cache-path }}
# The key doesn't need to include an OS name. The action already takes
# that into account: https://github.com/actions/cache/tree/5a3ec84eff668545956fd18022155c47e93e2684?tab=readme-ov-file#cache-version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we're entirely out of the woods here. While the different paths on Windows vs Linux might mean that those caches are separate, different architectures on the same OS don't seem like they'd have a different cache version.

Also, the paths might not be so different on macOS vs Linux, for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, you're right about different architectures on the same OS, I haven't thought about that. I'll generate unique cache key prefixes based on os and architecture.

# job is running on the main branch.
# Without it, PRs would create one-use caches that would linger until
# expiration and we'd be charged for them. I evaluated a couple of options
# for limiting the cache to the main branch, and forking was the simplest.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we handle this use case by using explicit restore and save steps? I.e. we restore in all cases, but only save on main?

https://github.com/actions/cache/blob/5a3ec84eff668545956fd18022155c47e93e2684/caching-strategies.md#saving-cache

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered it but the GitHub Actions syntax makes it unwieldy. I wanted to package all the cache steps into a single action to avoid code duplication across the 5 workflows that need the cache. I tried doing it with restore and save, but found that I cannot add a defer-like save statement in a composite action. So I'd need to add 3 additional steps to each of the workflows: one to generate the cache keys, one restore, one save. That's a lot of code duplication - I believe the current approach is cleaner.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forking the entire repo seems like a lot of code duplication! It's just better hidden, and means we have yet another fork to manage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I'll refactor it to use the more verbose restore and save approach.

# This prevents PRs from rebuilding the cache on the first day of the month.
restore-keys: |
${{ inputs.key-prefix }}-${{ steps.dates.outputs.year-month }}-
${{ github.ref != 'refs/heads/main' && format('{0}-{1}-', inputs.key-prefix, steps.dates.outputs.prev-year-month) || '' }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get why we need to have this special case for PRs. If we are preventing PRs from saving to the cache anyway, why do we need to restrict only PRs from restoring last month's cache if this month is not available?

Also, why are we even bothering to match on month? If today's cache isn't available, doesn't the cache action automatically match the most recent available cache based on the key-prefix alone?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach lets us reset the cache completely at the start of each month. We upload an updated cache daily, and each update is based on the previous day's cache. Without the monthly cut-off on main, any file added to the cache would persist indefinitely. The monthly reset ensures that old or obsolete files do not linger in the cache forever.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you implement cleanup in the test code, which I think is a good idea, then we'll only cache what we need daily, and a monthly reset won't be required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant to get rid of the monthly clean-up for two reasons:

  1. If other tests start using the cache and don't clean up properly after themselves, the cache will grow.
  2. If a Terraform provider that one of the tests depends on becomes permanently unavailable, we won't be able to detect it without a cache reset.

I don't think matching on the month complicates the code too much, so I'd be for keeping it.

@hugodutka hugodutka force-pushed the hugodutka/terraform-providers-ci-cache branch 3 times, most recently from 2c226f2 to d56aaa7 Compare April 24, 2025 12:03
@hugodutka
Copy link
Contributor Author

@spikecurtis I addressed your feedback:

  • started using a delimiter when calculating test hashes
  • skipped the os.Stat check
  • started cleaning up unused test caches
  • started using cache key prefixes that include the OS and the Arch
  • refactored caching to use separate restore and save steps

@hugodutka hugodutka merged commit b47d54d into main Apr 28, 2025
35 of 36 checks passed
@hugodutka hugodutka deleted the hugodutka/terraform-providers-ci-cache branch April 28, 2025 08:57
@github-actions github-actions bot locked and limited conversation to collaborators Apr 28, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants