-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Reftables support #7117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
pks-gitlab
wants to merge
42
commits into
libgit2:main
Choose a base branch
from
pks-gitlab:pks/reftables-support
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Reftables support #7117
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
We have a bunch of references that we treat like pseudo-refs. Those references are (sometimes) read and written by going to the filesystem directly, at other times they are read and written via the refdb. This works alright with the "files" ref storage format given that any root reference never gets packed into the "packed-refs" file, and thus they would always be accessible a loose ref if present. The behaviour is wrong though when considering alternate backends like the "reftable" backend. All references except for pseudo-refs must be read via the backend, and that includes root refs. Historically this part of Git has been ill-defined, and it wasn't quite clear which refs are considered pseudo-refs in the first place. This was clarified in 6fd80375640 (Documentation/glossary: redefine pseudorefs as special refs, 2024-05-15): there only are two pseudorefs, "FETCH_HEAD" and "MERGE_HEAD". The reason why those two references are considered special is that they may contain additional data that doesn't fit into the normal reference format. In any case, our current handling of a couple of root references is broken in this new world. Fix this for "ORIG_HEAD" by exclusively going through the refdb to read and write that reference. Rename the define accordingly to clarify that it is a reference and not a file.
Fix handling of "REVERT_HEAD" by exclusively reading and writing it via the reference database.
Fix handling of "CHERRY_PICK_HEAD" by exclusively reading and writing it via the reference database.
The GIT_STASH_FILE define contains the path to the stash reference. While we know that this used to be a file with the "files" backend, it's not a standalone file with the "reftable" backend anymore. Rename the macro to GIT_STASH_REF to indicate that this is a proper ref.
Introduce the GIT_HEAD_REF define so that we can clearly indicate that we're talking about the "HEAD" reference and not necessarily a file. Note that there still are a couple of places where GIT_HEAD_FILE is being used: - `git_repository_create_head()`: This function is used to create HEAD when initializing a new repository. This should get fixed eventually so that we create HEAD via the refdb, but this is a more involved refactoring that will be handled in a separate patch series. - `repo_init_head()`: Likewise. - `conditional_match_onbranch()`: This function is used to decide whether or not an `includeIf.onbranch` condition matches. This will be fixed in subsequent commits. Other than that there shouldn't be any more references to GIT_HEAD_FILE.
When we read the repository format information we do so by using the full configuration of that repository. This configuration not only includes the repository-level configuration though, but it also includes the global- and system-level configuration. These configurations should in practice never contain information about which format a specific repository uses. Despite this obvious conceptual error there's also a more subtle issue: reading the full configuration may require us to evaluate conditional includes. Those conditional includes may themselves require that the repository format is already populated though. This is for example the case with the "onbranch" condition: we need to populate the refdb to evaluate that condition, but to populate the refdb we need to first know about the repository format. Fix this by using the repository-level configuration, only, to determine the repository's format.
To support multiple different reference backend implementations, Git introduced a "refStorage" extension that stores the reference storage format a Git client should try to use. Wire up the logic to read this new extension when we open a repository from disk. For now, only the "files" backend is supported by us. When trying to open a repository that has a refstorage format that we don't understand we now error out. There are two functions that create a new repository that doesn't really have references. While those are mostly non-functional when it comes to references, we do expect that you can access the refdb, even if it's not yielding any refs. For now we mark those to use the "files" backend, so that the status quo is retained. Eventually though it might not be the worst idea to introduce an explicit "in-memory" reference database. But that is outside the scope of this patch series.
While we only support initializing repositories with the "files" reference backend right now, we are in the process of implementing a second backend with the "reftable" format. And while we already have the infrastructure to decide which format a repository should use when we open it, we do not have infrastructure yet to create new repositories with a different reference format. Introduce a new field `git_repository_init_options::refdb_type`. If unset, we'll default to the "files" backend. Otherwise though, if set to a valid `git_refdb_t`, we will use that new format to initialize the repostiory. Note that for now the only thing we do is to write the "refStorage" extension accordingly. What we explicitly don't yet do is to also handle the backend-specific logic to initialize the refdb on disk. This will be implemented in subsequent commits.
In our tests for "onbranch" config conditionals we set HEAD to point to various different branches via `git_repository_create_head()`. This function circumvents the refdb though and directly writes to the "HEAD" file. While this works now, it will create problems once we have multiple refdb backends. Furthermore, the function is about to go away in the next commit. So let's prepare for that and use `git_reference_symbolic_create()` instead.
The initialization of the on-disk state of refdbs is currently not handled by the actual refdb backend, but it's implemented ad-hoc where needed. This is problematic once we have multiple different refdbs as the filesystem structure is of course not the same. Introduce a new callback function `git_refdb_backend::init()`. If set, this callback can be invoked via `git_refdb_init()` to initialize the on-disk state of a refdb. Like this, each backend can decide for itself how exactly to do this. Note that the initialization of the refdb is a bit intricate. A repository is only recognized as such when it has a "HEAD" file as well as a "refs/" directory. Consequently, regardless of which refdb format we use, those files must always be present. This also proves to be problematic for us, as we cannot access the repository and thus don't have access to the refdb if those files didn't exist. To work around the issue we thus handle the creation of those files outside of the refdb-specific logic. We actually use the same strategy as Git does, and write the invalid reference "ref: refs/heads/.invalid" into "HEAD". This looks almost like a ref, but the name of that ref is not valid and should thus trip up Git clients that try to read that ref in a repository that really uses a different format. So while that invalid "HEAD" reference will of course get rewritten by the "files" backend, other backends should just retain it as-is.
Introduce a function that reads the "refStorage" extension so that we can easily figure out whether a specific repository uses the "files" or any other reference format. While we don't support other formats yet, we are about to add support for the "reftable" format.
There are a bunch of tests where we read or write references via the filesystem directly. This only works with the "files" backend, but naturally breaks if we supported any other reference format. Refactor these tests to instead use the refdb to access those.
When testing conditional includes we overwrite the repository's config file with the relevant conditions. This causes us to fully overwrite all repository configuration, including the repository format version and any extensions. While the test repository used in this test does not have any extensions, we will add a reftable-enabled repository that does rely on the "refStorage" extension eventually. Fix this by only modifying the relevant config keys.
We have a bunch of checks for properties of the "files" reference backend: - Whether a specific reference has been packed or whether it still exists as a loose reference. - Whether empty ref directories get pruned. - Whether we properly fsync data to disk. These checks continue to be sensible for that backend, but for any other backend they plain don't work. Adapt the tests so that we only run them in case the repository uses the "files" backend.
When initializing the "files" refdb we read a couple of values from the Git configuration. Unfortunately, this causes a chicken-and-egg problem when reading configuration with "includeIf.onbranch" conditionals: we need to instantiate the refdb to evaluate the condition, but we need to read the configuration to initialize the refdb. We currently work around the issue by reading the "HEAD" file directly when evaluating the conditional include. But while that works with the "files" backend, any other backends that store "HEAD" anywhere else will break. Prepare for a fix by deferring reading the configuration. We really only need to be able to execute `git_refdb_lookup()`, so all we need to ensure is that we can look up a branch without triggering any config reads.
With the preceding commit we have refactored the "files" backend so that it can be both instantiated and used to look up a reference without reading any configuration. With this change in place we don't cause infinite recursion anymore when using the refdb to evaluate "onbranch" conditions. Refactor the code to use the refdb to look up "HEAD". Note that we cannot use `git_reference_lookup()` here, as that function itself tries to normalize reference names, which in turn involves reading the Git configuration. So instead, we use the lower-level `git_refdb_lookup()` function, as we don't need the normalization anyway.
Expose a function to read a loose reference. This function will be used in a subsequent commit to read pseudo-refs on the generic refdb layer.
Regardless of which reference storage format is used, pseudorefs will always be looked up via the filesystem as loose refs. This is because pseudorefs do not strictly follow the reference format and may contain additional metadata that is not present in a normal reference. We don't honor this in `git_reference_lookup()` though but instead defer to the refdb to read such references. This obviously works just fine with the "files" backend, but any other backend would have to grow custom logic to handle reading pseudorefs. Refactor `git_reference_lookup_resolved()` so that it knows to always read pseudorefs as loose references. This allows refdb implementations to not care about pseudoref handling at all.
Closed
Import the reftable library from commit f1228cd12c1 (reftable: make REFTABLE_UNUSED C99 compatible, 2025-05-29). This is an exact copy of the reftable library. The library will be wired into libgit2 over the next couple of commits.
e9d60f7
to
a8f50a6
Compare
a293186
to
0c29ee0
Compare
While the reftable library is mostly decoupled from the Git codebase, some specific functionality intentionally taps into Git subsystems. To make porting of the reftable library easier all of these are contained in "system.h" and "system.c". Reimplement those compatibility shims so that they work for libgit2.
The reftable writer accidentally uses the Git-specific `QSORT()` macro. This macro removes the need for the caller to provide the element size, but other than that it's mostly equivalent to `qsort()`. Replace the macro accordingly. This incompatibility is a bug in Git that needs to be fixed upstream.
While perfectly legal, older compiler toolchains complain when zero-initializing structs that contain nested structs with `{0}`: /home/libgit2/source/deps/reftable/stack.c:862:35: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces] struct reftable_addition empty = REFTABLE_ADDITION_INIT; ^~~~~~~~~~~~~~~~~~~~~~ /home/libgit2/source/deps/reftable/stack.c:707:33: note: expanded from macro 'REFTABLE_ADDITION_INIT' #define REFTABLE_ADDITION_INIT {0} ^ Silence this warning by using `{{0}}` instead.
Wire up the reftable library with CMake. At the current point in time the library is not yet plugged into libgit2 itself.
While it is possible to pass flags to `reftable_stack_new_addition()` to alter the behaviour of a stack addition, it is not possible to do so via `reftable_stack_add()`. Plug this gap.
The `flock` interface is implemented as part of "reftable/system.c" and thus needs to be implemented by the integrator between the reftable library and its parent code base. As such, we cannot rely on any specific implementation thereof. Regardless of that, users of the `flock` subsystem rely on `errno` being set to specific values. This is fragile and not documented anywhere and doesn't really make for a good interface. Refactor the code so that the implementations themselves are expected to return reftable-specific error codes.
The `inline` keyword is not supported by all compilers, but we use it freely in the reftable library. Introduce the `REFTABLE_INLINE()` macro as part of the integration library so that projects can define it as supported by their respective platforms.
error C2036: 'void *': unknown size
Both `reftable_writer_add_refs()` and `reftable_writer_add_logs()` accept an array of records that should be added to the new table. Callers of this function are expected to also pass the number of such records to the function to tell it how many such records it is supposed to write. But while all callers pass in a `size_t`, which is a sensible choice, the function in fact accepts an `int` as argument, which is less so. Fix this. Signed-off-by: Patrick Steinhardt <ps@pks.im>
Implement the reftable backend that is used to read and write reftables. The backend is not yet used anywhere after this commit.
Wire up the "reftable" storage format via the newly implemented reftable backend.
Generate test resources for reftables. These resources are basically the Git repositories we already have, but converted to use the "reftable" format. For most of the part, this conversion is done by executing `git refs migrate`. A couple notes: - This require a recent Git upstream version with not-yet-upstreamed patches due to a bug in `git refs migrate` with reflogs. - The migration command does not yet support repositories with worktrees. Those were converted by first removing the worktrees, migrating the refs and then recreating them. - The HEAD_TRACKER reference in testrepo.git is not recognized as a root ref and is thus not automatically migrated. - testrepo.git has an empty reflog for refs/heads/with-empty-log that does not get migrated.
b810239
to
fba6f0e
Compare
Wire up reftable tests so that they can be executed by setting the `CLAR_REF_FORMAT` environment variable. This only catches tests that use `cl_git_sandbox_init()`, but that should cover most of our tests. So this infrastructure isn't perfect, but for now it's good enough. We may want to iterate on it in the future.
fba6f0e
to
b122fb0
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This pull request introduces support for the reftable backend. It depends on the following list of pull requests:
Other than that, all tests are passing and there aren't any memory leaks. But some polishing is still required for non-Linux platforms.
This pull request supersedes #5462.