chore: cache terraform providers between CI test runs #17373
Conversation
Force-pushed from 9a5b5a1 to 8885000
Force-pushed from 9d41243 to 2ab657c
Force-pushed from 2ab657c to 3a9e0b3
file := templateFiles[fileName]
_, err := hasher.Write([]byte(fileName))
require.NoError(t, err)
_, err = hasher.Write([]byte(file))
You need to include some sort of delimiter, or `dogfood` + `stuff` will hash the same as `dog` + `foodstuff`.
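For illustration, a minimal sketch of one way to make the chunk boundaries unambiguous, assuming a SHA-256 hasher (the `hashChunks` helper is made up for this example, not code from the PR):

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// hashChunks writes a length prefix before each chunk, so the boundaries
// between chunks are unambiguous: ("dogfood", "stuff") and
// ("dog", "foodstuff") produce different digests.
func hashChunks(chunks ...string) string {
	h := sha256.New()
	for _, c := range chunks {
		var lenBuf [8]byte
		binary.LittleEndian.PutUint64(lenBuf[:], uint64(len(c)))
		h.Write(lenBuf[:]) // the length acts as the delimiter
		h.Write([]byte(c))
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	fmt.Println(hashChunks("dogfood", "stuff") == hashChunks("dog", "foodstuff")) // false
}
```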
for fileName, file := range templateFiles {
	filePath := filepath.Join(filesDir, fileName)
	if _, err := os.Stat(filePath); os.IsNotExist(err) {
didn't we just check that the directory doesn't already exist? How could the file already exist if the directory doesn't?
You're right, this check is redundant. I'll remove it.
// Each test gets a unique cache dir based on its name and template files.
// This ensures that tests can download providers in parallel and that they
// will redownload providers if the template files change.
hash := hashTemplateFilesAndTestName(t, t.Name(), templateFiles)
Providers are versioned, so I don't think each test or each version of each test needs its own unique mirror.
It results in multiple copies of the providers being downloaded on each system, and there doesn't seem to be any mechanism to clean up old versions of tests. The CI caches expire after 14 days, but people's personal development environments won't get cleaned up.
`terraform providers mirror` can be run multiple times on the same target directory and it will merge the results:
> You can run terraform providers mirror again on an existing mirror directory to update it with new packages.
https://developer.hashicorp.com/terraform/cli/commands/providers/mirror
So, what do you think of doing two passes over the test cases: one serial pass that sets up the mirror with the totality of the providers we need, then a second, parallel pass that runs the tests?
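For illustration, a rough sketch of that two-pass idea, assuming a `TestMain` that serially populates one shared mirror before the parallel tests run; the package name, template directories, and mirror location are made up for this example:

```go
package terraform_test

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"testing"
)

// TestMain is the serial first pass: it mirrors providers from every
// template directory into one shared target, then runs the (parallel) tests.
func TestMain(m *testing.M) {
	mirrorDir := filepath.Join(os.TempDir(), "tf-provider-mirror") // illustrative location
	for _, templateDir := range []string{"testdata/simple", "testdata/docker"} { // illustrative paths
		// `terraform providers mirror` merges into an existing target
		// directory, so repeated runs only add missing packages.
		cmd := exec.Command("terraform", "providers", "mirror", mirrorDir)
		cmd.Dir = templateDir
		if out, err := cmd.CombinedOutput(); err != nil {
			fmt.Fprintf(os.Stderr, "mirror %s: %v\n%s", templateDir, err, out)
			os.Exit(1)
		}
	}
	os.Exit(m.Run())
}
```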
Doing a serial pass is pretty slow if you haven't cached the providers beforehand. I just tried it, and it takes 27 seconds to cache the providers serially in my Coder workspace. In contrast, doing it in parallel only takes 5 seconds.
The cache size across all tests is currently around 70 MB - probably not a big deal. That said, it could balloon if someone starts iterating on a test locally. To avoid that, maybe we should add cleanup logic instead of doing a serial pass?
We already know the paths to provider directories based on test hashes, so at the start of `TestProvision`, we could check which ones are needed and remove the rest.
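A sketch of what that cleanup could look like, assuming the per-test provider dirs sit directly under one cache root and the set of needed hashes can be computed from the current test cases (the function and parameter names are illustrative, not code from the PR):

```go
package terraform_test

import (
	"os"
	"path/filepath"
)

// cleanStaleProviderDirs removes cached provider directories whose hash is
// not referenced by any current test case.
func cleanStaleProviderDirs(cacheRoot string, neededHashes map[string]bool) error {
	entries, err := os.ReadDir(cacheRoot)
	if err != nil {
		if os.IsNotExist(err) {
			return nil // nothing cached yet, nothing to clean
		}
		return err
	}
	for _, entry := range entries {
		if entry.IsDir() && !neededHashes[entry.Name()] {
			if err := os.RemoveAll(filepath.Join(cacheRoot, entry.Name())); err != nil {
				return err
			}
		}
	}
	return nil
}
```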
Cleanup logic would also be fine here, so we don't blow up people's cache.
with:
  path: ${{ inputs.cache-path }}
  # The key doesn't need to include an OS name. The action already takes
  # that into account: https://github.com/actions/cache/tree/5a3ec84eff668545956fd18022155c47e93e2684?tab=readme-ov-file#cache-version
I don't think we're entirely out of the woods here. While the different paths on Windows vs Linux might mean that those caches are separate, different architectures on the same OS don't seem like they'd have a different cache version.
Also, the paths might not be so different on macOS vs Linux, for example.
Hmm, you're right about different architectures on the same OS; I hadn't thought about that. I'll generate unique cache key prefixes based on OS and architecture.
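For example, something along these lines (illustrative only: `runner.os` and `runner.arch` are the standard runner context values, while the rest of the key shape is assumed to match the existing `dates` step):

```yaml
- uses: actions/cache@v4
  with:
    path: ${{ inputs.cache-path }}
    # Bake OS and architecture into the key so runners on different
    # platforms never share a cache entry.
    key: ${{ inputs.key-prefix }}-${{ runner.os }}-${{ runner.arch }}-${{ steps.dates.outputs.year-month }}
```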
# job is running on the main branch.
# Without it, PRs would create one-use caches that would linger until
# expiration and we'd be charged for them. I evaluated a couple of options
# for limiting the cache to the main branch, and forking was the simplest.
Can't we handle this use case by using explicit `restore` and `save` steps? I.e. we restore in all cases, but only save on `main`?
I considered it, but the GitHub Actions syntax makes it unwieldy. I wanted to package all the cache steps into a single action to avoid code duplication across the 5 workflows that need the cache. I tried doing it with `restore` and `save`, but found that I cannot add a `defer`-like `save` statement in a composite action. So I'd need to add 3 additional steps to each of the workflows: one to generate the cache keys, one to `restore`, and one to `save`. That's a lot of code duplication - I believe the current approach is cleaner.
Forking the entire repo seems like a lot of code duplication! It's just better hidden, and means we have yet another fork to manage.
That's a good point. I'll refactor it to use the more verbose restore and save approach.
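For reference, a sketch of that split (the step IDs and the key-generating `keys` step are assumptions): restore runs unconditionally, while the save step is gated to `main`.

```yaml
- name: Restore Terraform provider cache
  uses: actions/cache/restore@v4
  with:
    path: ${{ inputs.cache-path }}
    key: ${{ steps.keys.outputs.key }}
    restore-keys: ${{ steps.keys.outputs.restore-keys }}

# ... build and test steps run here ...

- name: Save Terraform provider cache
  if: github.ref == 'refs/heads/main'
  uses: actions/cache/save@v4
  with:
    path: ${{ inputs.cache-path }}
    key: ${{ steps.keys.outputs.key }}
```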
# This prevents PRs from rebuilding the cache on the first day of the month.
restore-keys: |
  ${{ inputs.key-prefix }}-${{ steps.dates.outputs.year-month }}-
  ${{ github.ref != 'refs/heads/main' && format('{0}-{1}-', inputs.key-prefix, steps.dates.outputs.prev-year-month) || '' }}
I don't get why we need this special case for PRs. If we are preventing PRs from saving to the cache anyway, why do we need to restrict only PRs from restoring last month's cache if this month's is not available?
Also, why are we even bothering to match on month? If today's cache isn't available, doesn't the cache action automatically match the most recent available cache based on the `key-prefix` alone?
This approach lets us reset the cache completely at the start of each month. We upload an updated cache daily, and each update is based on the previous day's cache. Without the monthly cut-off on `main`, any file added to the cache would persist indefinitely. The monthly reset ensures that old or obsolete files do not linger in the cache forever.
If you implement cleanup in the test code, which I think is a good idea, then we'll only cache what we need daily, and a monthly reset won't be required.
I'm hesitant to get rid of the monthly clean-up for two reasons:
- If other tests start using the cache and don't clean up properly after themselves, the cache will grow.
- If a Terraform provider that one of the tests depends on becomes permanently unavailable, we won't be able to detect it without a cache reset.
I don't think matching on the month complicates the code too much, so I'd be for keeping it.
Force-pushed from 2c226f2 to d56aaa7
@spikecurtis I addressed your feedback:
Addresses coder/internal#322.
This PR starts caching Terraform providers used by `TestProvision` in `provisioner/terraform/provision_test.go`. The goal is to improve the reliability of this test by cutting down on the number of network calls to external services. It leverages GitHub Actions cache, which on depot runners is persisted for 14 days by default.
Other than the aforementioned `TestProvision`, I couldn't find any other tests which depend on external Terraform providers.