Skip to content

test: fix cleanup order on provisioner daemon work dir #9668

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 13, 2023

Conversation

Emyrk
Copy link
Member

@Emyrk Emyrk commented Sep 13, 2023

Attempting to fix some of the error here: #9379. I think other categories of this test flake exist. I was only investigating the provisioner daemon route.

In unit tests our provisioner daemon creates a working directory. For some unit tests (like one of the ones in the issue), we do not wait for the template version job to complete before ending the test. So a provisioner job can exist when the test is closing.

The previous cleanup order tried to delete the working directory first, and then close the priovisioner, which can cause the race condition of a job writing files when the delete happens.

This PR fixes the t.Cleanup order, however I think there is still a race condition because in the job runner, we do not check the context when extracting an archive:

func (s *Session) extractArchive() error {


A small race condition still exists. The for loop loops over the tar file, writing each file to disk. I added a check for the context cancelled to abort the loop early and stop writing files if it fails, but I did not remove the race condition entirely (see comment).

Copy link
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks plausible. I wonder if we should have a check in NewProvisionerDaemon after t.TempDir() that validates that the directory was successfully removed?

@Emyrk
Copy link
Member Author

Emyrk commented Sep 13, 2023

This looks plausible. I wonder if we should have a check in NewProvisionerDaemon after t.TempDir() that validates that the directory was successfully removed?

Eh, t.TempDir() already does that. It does an os.RemoveAll() and if it fails, the test fails. Any write to a file in the dir after would fail because the directory no longer exists. So that check would just be another failure. It would never be the only failure.

@Emyrk Emyrk merged commit 0e4d689 into main Sep 13, 2023
@Emyrk Emyrk deleted the stevenmasley/cleanup_order branch September 13, 2023 18:13
@github-actions github-actions bot locked and limited conversation to collaborators Sep 13, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants