Skip to content

Conversation

jku
Copy link
Member

@jku jku commented Aug 29, 2025

Add advisory file locking to make it safer to run multiple updaters (using the same metadata directory) in separate processes. This should help with #2836

  • Metadata access is protected with <METADATA_DIR>/.lock: this lock is used in Updater.__init__(), Updater.refresh() and Updater.get_targetinfo() -- lock is typically held through the whole method (in other words while the metadata is being downloaded as well)
  • Artifact cache access is protected with a lock on the specific artifact file
  • No dependencies (whether this is a good idea remains to be seen, see later comments about complexity and alternatives)

@stefanberger comments welcome

Implementation notes

The implementation has just enough platform specific complexity that I'm not totally happy with it. There's clearly a chance of bugs here... Unfortunately I'm not convinced using a dependency would make that possibility go away

  • The main complexity is relates to the weird file access mechanism in Windows: mscvrt.locking() is the recommended API but 99% of the time it's not useful because we need a file handle to call it, and we can't get one because another process already has the file open. So we need a dumb loop that keeps calling open() and sleep(). It's ugly but no-one seems to have a better solution
  • the core difference between the two implementations is this: windows implementation will timeout in ~30 seconds if it does not get a lock. the posix implementation will keep waiting in fcntl.lockf() until the lock opens. Both have a tradeoff: On windows the updater might fail if another updater is a bit slow. On posix, if one updater hangs with a lock for some reason, all future updaters will just wait indefinitely until the original hung process is killed
  • There is a test that starts 10 processes that all run 50 metadata refreshes as fast as possible using the same metadata directory

TODO

  • I think the posix locking probably should timeout as well if someone is hogging the lock for long enough.
  • on the other hand, someone somewhere will always see the timeout (even though it would work some time later) whatever the we set the timeout to :(

Alternatives

https://github.com/tox-dev/filelock/ looks reasonable -- although artifact locking will be either more complicated or less parallel since you can't use filelock on the actual file you want to write.

jku added 10 commits August 22, 2025 18:37
This likely fails on all platforms right now, but the Windows
behaviour cannot be fixed without actual locking.

Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
Use get_targetinfo() so that the delegated role loading is tested as
well

Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
This should prevent issues with multiple processes trying to write at
same time.

Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
Otherwise another process might delete the file underneath us

Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
There does not seem to be a way around a ugly loop over open()...

Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
The file locking should make multiple processes safe

Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
@jku jku requested a review from a team as a code owner August 29, 2025 08:05
@jku jku marked this pull request as draft August 29, 2025 08:05
@coveralls
Copy link

Coverage Status

coverage: 95.316% (-1.3%) from 96.603%
when pulling 55dbb53 on advisory-locking
into 7ad10ad on develop.

@lukpueh
Copy link
Member

lukpueh commented Aug 29, 2025 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants