
Conversation


@jku jku commented Aug 29, 2025

Add advisory file locking to make it safer to run multiple updaters (using the same metadata directory) in separate processes. This should help with #2836

  • Metadata access is protected with <METADATA_DIR>/.lock: this lock is used in Updater.__init__(), Updater.refresh() and Updater.get_targetinfo() -- the lock is typically held through the whole method (in other words, while the metadata is being downloaded as well); see the sketch after this list
  • Artifact cache access is protected with a lock on the specific artifact file
  • No new dependencies (whether this is a good idea remains to be seen; see the later comments about complexity and alternatives)
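
As a rough illustration only (the helper name and signature below are made up, not the PR's actual code), holding a directory-level advisory lock for the duration of a whole method could look roughly like this on POSIX:

    # Minimal sketch, POSIX only; _metadata_lock and dir_path are illustrative names
    import fcntl
    import os
    from contextlib import contextmanager

    @contextmanager
    def _metadata_lock(dir_path: str):
        lock_path = os.path.join(dir_path, ".lock")
        with open(lock_path, "wb") as f:
            fcntl.lockf(f, fcntl.LOCK_EX)  # blocks until the lock is free
            try:
                yield
            finally:
                fcntl.lockf(f, fcntl.LOCK_UN)

    # usage: the lock stays held while metadata is downloaded and persisted
    # with _metadata_lock(metadata_dir):
    #     ... load local metadata, download and verify new metadata ...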

@stefanberger comments welcome

Implementation notes

The implementation has just enough platform-specific complexity that I'm not totally happy with it. There's clearly a chance of bugs here... Unfortunately, I'm not convinced that using a dependency would make that possibility go away.

  • The main complexity relates to the weird file access mechanism on Windows: msvcrt.locking() is the recommended API, but 99% of the time it's not useful because we need a file handle to call it, and we can't get one because another process already has the file open. So we need a dumb loop that keeps calling open() and sleep(). It's ugly, but no one seems to have a better solution (see the sketch after this list)
  • The core difference between the two implementations is this: the Windows implementation will time out in ~30 seconds if it does not get a lock, while the POSIX implementation will keep waiting in fcntl.lockf() until the lock becomes available. Both have a tradeoff: on Windows the updater might fail if another updater is a bit slow; on POSIX, if one updater hangs while holding the lock for some reason, all future updaters will wait indefinitely until the original hung process is killed
  • There is a test that starts 10 processes that all run 50 metadata refreshes as fast as possible using the same metadata directory
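
To make the platform split concrete, here is a rough sketch of the two code paths (illustrative only, not the exact code in this PR; acquire_lock and LOCK_TIMEOUT are made-up names):

    import sys
    import time

    LOCK_TIMEOUT = 30  # seconds; only the Windows path times out in this sketch

    if sys.platform == "win32":
        import msvcrt

        def acquire_lock(path: str):
            # Keep trying to open and lock the file, give up after ~30 seconds
            deadline = time.monotonic() + LOCK_TIMEOUT
            while time.monotonic() < deadline:
                try:
                    f = open(path, "ab")
                except OSError:
                    time.sleep(0.1)
                    continue
                try:
                    msvcrt.locking(f.fileno(), msvcrt.LK_NBLCK, 1)
                    return f  # caller unlocks with LK_UNLCK and closes
                except OSError:
                    f.close()
                    time.sleep(0.1)
            raise TimeoutError(f"Could not lock {path}")
    else:
        import fcntl

        def acquire_lock(path: str):
            # Block in fcntl.lockf() until the lock becomes available
            f = open(path, "ab")
            fcntl.lockf(f, fcntl.LOCK_EX)
            return f  # caller unlocks with LOCK_UN and closes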

TODO

  • I think the posix locking probably should time out as well if someone is hogging the lock for long enough (see the sketch after this list).
  • On the other hand, someone somewhere will always see the timeout (even though it would work some time later), whatever value we set the timeout to :(
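
One way to add a POSIX timeout (a sketch only, not part of this PR; lock_with_timeout is a made-up name) is to replace the blocking fcntl.lockf() call with a non-blocking attempt in a retry loop:

    import errno
    import fcntl
    import time

    def lock_with_timeout(f, timeout: float = 30.0, poll: float = 0.1) -> None:
        # Try to take an exclusive lock on the open file f, failing after `timeout` seconds
        deadline = time.monotonic() + timeout
        while True:
            try:
                fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                return
            except OSError as e:
                if e.errno not in (errno.EACCES, errno.EAGAIN):
                    raise
                if time.monotonic() >= deadline:
                    raise TimeoutError("timed out waiting for file lock") from e
                time.sleep(poll)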

Alternatives

https://github.com/tox-dev/filelock/ looks reasonable -- although artifact locking will be either more complicated or less parallel since you can't use filelock on the actual file you want to write.
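
For comparison, a sketch of what artifact locking might look like with filelock (hypothetical path and timeout; the lock has to live in a sibling ".lock" file because filelock cannot lock the artifact file itself):

    from filelock import FileLock, Timeout

    artifact_path = "targets/example-artifact.tar.gz"  # illustrative path

    # filelock serializes writers on "<artifact>.lock", not on the artifact itself
    lock = FileLock(artifact_path + ".lock", timeout=30)
    try:
        with lock:
            # download, verify and write the artifact here
            ...
    except Timeout:
        # another process held the lock for more than 30 seconds
        raise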

jku added 10 commits August 22, 2025 18:37
This likely fails on all platforms right now, but the Windows
behaviour cannot be fixed without actual locking.

Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
Use get_targetinfo() so that the delegated role loading is tested as
well

Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
This should prevent issues with multiple processes trying to write at
the same time.

Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
Otherwise another process might delete the file underneath us

Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
There does not seem to be a way around an ugly loop over open()...

Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
The file locking should make running multiple processes safe

Signed-off-by: Jussi Kukkonen <jkukkonen@google.com>
@jku jku requested a review from a team as a code owner August 29, 2025 08:05
@jku jku marked this pull request as draft August 29, 2025 08:05
@coveralls

Coverage Status

coverage: 95.316% (-1.3%) from 96.603% when pulling 55dbb53 on advisory-locking into 7ad10ad on develop.


lukpueh commented Aug 29, 2025 via email

@stefanberger

On Windows I get a lot of sequences of the errors shown below. It looks like one process gets to lock the file in each of the test loops, but the others all have to wait for quite a while until another one gets to hold the lock for a while.

How many files are you locking so that the Windows error 'Resource deadlock avoided' would be justified? I see only one file reported as being the cause of 'a deadlock'.

Other than that I see no errors, so that's good.

(venv) C:\Users\StefanBerger\python-tuf>python
Python 3.13.7 (tags/v3.13.7:bcee1c3, Aug 14 2025, 14:15:11) [MSC v.1944 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from sigstore import sign
... while True:
...     sign.TrustedRoot.production()
...
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided
Unsuccessful lock attempt for C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\.lock: [Errno 36] Resource deadlock avoided

@stefanberger

Update... All(!!!) 3 of the 3 test processes terminated like this:

  File "C:\Users\StefanBerger\python-tuf\tuf\ngclient\updater.py", line 376, in _persist_file
    raise e
  File "C:\Users\StefanBerger\python-tuf\tuf\ngclient\updater.py", line 369, in _persist_file
    os.replace(temp_file.name, filename)
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\StefanBerger\\AppData\\Local\\sigstore\\sigstore-python\\tuf\\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\\tmpzxnsj1mu' -> 'C:\\Users\\StefanBerger\\AppData\\Local\\sigstore\\sigstore-python\\tuf\\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\\root_history\\12.root.json'
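
For context, this is the usual write-to-temp-file-then-replace persist step; the sketch below shows the general pattern (not the exact updater.py code), and on Windows os.replace() raises PermissionError if another process still has the destination file open:

    import os
    import tempfile

    def persist_file(data: bytes, filename: str) -> None:
        # Write to a temp file in the same directory, then move it into place
        dirname = os.path.dirname(filename)
        with tempfile.NamedTemporaryFile(dir=dirname, delete=False) as temp_file:
            temp_file.write(data)
        # Atomic on POSIX; on Windows this fails with PermissionError
        # if another process has `filename` open without share-delete
        os.replace(temp_file.name, filename)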

        yield f
        return
    except FileNotFoundError:
        # could be from yield or from open() -- either way we bail

You are opening the file for writing. Should this ever lead to a FileNotFoundError?
