Reftables support #7117

pks-gitlab · 2025-08-04T14:23:25Z

This pull request introduces support for the reftable backend. It depends on the following list of pull requests:

Introduction of the "refFormat" extension #7102 introduces the refFormat extension.
Fix handling of pseudorefs with different refdb backends #7111 fixes our handling of pseudo and root refs to align with modern Git.
Test fixes to prepare for the reftable backend #7113 adjusts our tests to not access references via the filesystem directly anymore.
cmake: disable warnings for operands with different enum types #7114 is a semi-related fix, but our CI is currently broken due to changes in the MSVC compiler version.

Other than that, all tests are passing and there aren't any memory leaks. But some polishing is still required for non-Linux platforms.

This pull request supersedes #5462.

We have a bunch of references that we treat like pseudo-refs. Those references are (sometimes) read and written by going to the filesystem directly, at other times they are read and written via the refdb. This works alright with the "files" ref storage format given that any root reference never gets packed into the "packed-refs" file, and thus they would always be accessible a loose ref if present. The behaviour is wrong though when considering alternate backends like the "reftable" backend. All references except for pseudo-refs must be read via the backend, and that includes root refs. Historically this part of Git has been ill-defined, and it wasn't quite clear which refs are considered pseudo-refs in the first place. This was clarified in 6fd80375640 (Documentation/glossary: redefine pseudorefs as special refs, 2024-05-15): there only are two pseudorefs, "FETCH_HEAD" and "MERGE_HEAD". The reason why those two references are considered special is that they may contain additional data that doesn't fit into the normal reference format. In any case, our current handling of a couple of root references is broken in this new world. Fix this for "ORIG_HEAD" by exclusively going through the refdb to read and write that reference. Rename the define accordingly to clarify that it is a reference and not a file.

Fix handling of "REVERT_HEAD" by exclusively reading and writing it via the reference database.

Fix handling of "CHERRY_PICK_HEAD" by exclusively reading and writing it via the reference database.

The GIT_STASH_FILE define contains the path to the stash reference. While we know that this used to be a file with the "files" backend, it's not a standalone file with the "reftable" backend anymore. Rename the macro to GIT_STASH_REF to indicate that this is a proper ref.

Introduce the GIT_HEAD_REF define so that we can clearly indicate that we're talking about the "HEAD" reference and not necessarily a file. Note that there still are a couple of places where GIT_HEAD_FILE is being used: - `git_repository_create_head()`: This function is used to create HEAD when initializing a new repository. This should get fixed eventually so that we create HEAD via the refdb, but this is a more involved refactoring that will be handled in a separate patch series. - `repo_init_head()`: Likewise. - `conditional_match_onbranch()`: This function is used to decide whether or not an `includeIf.onbranch` condition matches. This will be fixed in subsequent commits. Other than that there shouldn't be any more references to GIT_HEAD_FILE.

When we read the repository format information we do so by using the full configuration of that repository. This configuration not only includes the repository-level configuration though, but it also includes the global- and system-level configuration. These configurations should in practice never contain information about which format a specific repository uses. Despite this obvious conceptual error there's also a more subtle issue: reading the full configuration may require us to evaluate conditional includes. Those conditional includes may themselves require that the repository format is already populated though. This is for example the case with the "onbranch" condition: we need to populate the refdb to evaluate that condition, but to populate the refdb we need to first know about the repository format. Fix this by using the repository-level configuration, only, to determine the repository's format.

To support multiple different reference backend implementations, Git introduced a "refStorage" extension that stores the reference storage format a Git client should try to use. Wire up the logic to read this new extension when we open a repository from disk. For now, only the "files" backend is supported by us. When trying to open a repository that has a refstorage format that we don't understand we now error out. There are two functions that create a new repository that doesn't really have references. While those are mostly non-functional when it comes to references, we do expect that you can access the refdb, even if it's not yielding any refs. For now we mark those to use the "files" backend, so that the status quo is retained. Eventually though it might not be the worst idea to introduce an explicit "in-memory" reference database. But that is outside the scope of this patch series.

While we only support initializing repositories with the "files" reference backend right now, we are in the process of implementing a second backend with the "reftable" format. And while we already have the infrastructure to decide which format a repository should use when we open it, we do not have infrastructure yet to create new repositories with a different reference format. Introduce a new field `git_repository_init_options::refdb_type`. If unset, we'll default to the "files" backend. Otherwise though, if set to a valid `git_refdb_t`, we will use that new format to initialize the repostiory. Note that for now the only thing we do is to write the "refStorage" extension accordingly. What we explicitly don't yet do is to also handle the backend-specific logic to initialize the refdb on disk. This will be implemented in subsequent commits.

In our tests for "onbranch" config conditionals we set HEAD to point to various different branches via `git_repository_create_head()`. This function circumvents the refdb though and directly writes to the "HEAD" file. While this works now, it will create problems once we have multiple refdb backends. Furthermore, the function is about to go away in the next commit. So let's prepare for that and use `git_reference_symbolic_create()` instead.

The initialization of the on-disk state of refdbs is currently not handled by the actual refdb backend, but it's implemented ad-hoc where needed. This is problematic once we have multiple different refdbs as the filesystem structure is of course not the same. Introduce a new callback function `git_refdb_backend::init()`. If set, this callback can be invoked via `git_refdb_init()` to initialize the on-disk state of a refdb. Like this, each backend can decide for itself how exactly to do this. Note that the initialization of the refdb is a bit intricate. A repository is only recognized as such when it has a "HEAD" file as well as a "refs/" directory. Consequently, regardless of which refdb format we use, those files must always be present. This also proves to be problematic for us, as we cannot access the repository and thus don't have access to the refdb if those files didn't exist. To work around the issue we thus handle the creation of those files outside of the refdb-specific logic. We actually use the same strategy as Git does, and write the invalid reference "ref: refs/heads/.invalid" into "HEAD". This looks almost like a ref, but the name of that ref is not valid and should thus trip up Git clients that try to read that ref in a repository that really uses a different format. So while that invalid "HEAD" reference will of course get rewritten by the "files" backend, other backends should just retain it as-is.

Introduce a function that reads the "refStorage" extension so that we can easily figure out whether a specific repository uses the "files" or any other reference format. While we don't support other formats yet, we are about to add support for the "reftable" format.

There are a bunch of tests where we read or write references via the filesystem directly. This only works with the "files" backend, but naturally breaks if we supported any other reference format. Refactor these tests to instead use the refdb to access those.

When testing conditional includes we overwrite the repository's config file with the relevant conditions. This causes us to fully overwrite all repository configuration, including the repository format version and any extensions. While the test repository used in this test does not have any extensions, we will add a reftable-enabled repository that does rely on the "refStorage" extension eventually. Fix this by only modifying the relevant config keys.

We have a bunch of checks for properties of the "files" reference backend: - Whether a specific reference has been packed or whether it still exists as a loose reference. - Whether empty ref directories get pruned. - Whether we properly fsync data to disk. These checks continue to be sensible for that backend, but for any other backend they plain don't work. Adapt the tests so that we only run them in case the repository uses the "files" backend.

When initializing the "files" refdb we read a couple of values from the Git configuration. Unfortunately, this causes a chicken-and-egg problem when reading configuration with "includeIf.onbranch" conditionals: we need to instantiate the refdb to evaluate the condition, but we need to read the configuration to initialize the refdb. We currently work around the issue by reading the "HEAD" file directly when evaluating the conditional include. But while that works with the "files" backend, any other backends that store "HEAD" anywhere else will break. Prepare for a fix by deferring reading the configuration. We really only need to be able to execute `git_refdb_lookup()`, so all we need to ensure is that we can look up a branch without triggering any config reads.

With the preceding commit we have refactored the "files" backend so that it can be both instantiated and used to look up a reference without reading any configuration. With this change in place we don't cause infinite recursion anymore when using the refdb to evaluate "onbranch" conditions. Refactor the code to use the refdb to look up "HEAD". Note that we cannot use `git_reference_lookup()` here, as that function itself tries to normalize reference names, which in turn involves reading the Git configuration. So instead, we use the lower-level `git_refdb_lookup()` function, as we don't need the normalization anyway.

Expose a function to read a loose reference. This function will be used in a subsequent commit to read pseudo-refs on the generic refdb layer.

Regardless of which reference storage format is used, pseudorefs will always be looked up via the filesystem as loose refs. This is because pseudorefs do not strictly follow the reference format and may contain additional metadata that is not present in a normal reference. We don't honor this in `git_reference_lookup()` though but instead defer to the refdb to read such references. This obviously works just fine with the "files" backend, but any other backend would have to grow custom logic to handle reading pseudorefs. Refactor `git_reference_lookup_resolved()` so that it knows to always read pseudorefs as loose references. This allows refdb implementations to not care about pseudoref handling at all.

Import the reftable library from commit f1228cd12c1 (reftable: make REFTABLE_UNUSED C99 compatible, 2025-05-29). This is an exact copy of the reftable library. The library will be wired into libgit2 over the next couple of commits.

While the reftable library is mostly decoupled from the Git codebase, some specific functionality intentionally taps into Git subsystems. To make porting of the reftable library easier all of these are contained in "system.h" and "system.c". Reimplement those compatibility shims so that they work for libgit2.

The reftable writer accidentally uses the Git-specific `QSORT()` macro. This macro removes the need for the caller to provide the element size, but other than that it's mostly equivalent to `qsort()`. Replace the macro accordingly. This incompatibility is a bug in Git that needs to be fixed upstream.

While perfectly legal, older compiler toolchains complain when zero-initializing structs that contain nested structs with `{0}`: /home/libgit2/source/deps/reftable/stack.c:862:35: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces] struct reftable_addition empty = REFTABLE_ADDITION_INIT; ^~~~~~~~~~~~~~~~~~~~~~ /home/libgit2/source/deps/reftable/stack.c:707:33: note: expanded from macro 'REFTABLE_ADDITION_INIT' #define REFTABLE_ADDITION_INIT {0} ^ Silence this warning by using `{{0}}` instead.

Wire up the reftable library with CMake. At the current point in time the library is not yet plugged into libgit2 itself.

While it is possible to pass flags to `reftable_stack_new_addition()` to alter the behaviour of a stack addition, it is not possible to do so via `reftable_stack_add()`. Plug this gap.

The `flock` interface is implemented as part of "reftable/system.c" and thus needs to be implemented by the integrator between the reftable library and its parent code base. As such, we cannot rely on any specific implementation thereof. Regardless of that, users of the `flock` subsystem rely on `errno` being set to specific values. This is fragile and not documented anywhere and doesn't really make for a good interface. Refactor the code so that the implementations themselves are expected to return reftable-specific error codes.

The `inline` keyword is not supported by all compilers, but we use it freely in the reftable library. Introduce the `REFTABLE_INLINE()` macro as part of the integration library so that projects can define it as supported by their respective platforms.

error C2036: 'void *': unknown size

Both `reftable_writer_add_refs()` and `reftable_writer_add_logs()` accept an array of records that should be added to the new table. Callers of this function are expected to also pass the number of such records to the function to tell it how many such records it is supposed to write. But while all callers pass in a `size_t`, which is a sensible choice, the function in fact accepts an `int` as argument, which is less so. Fix this. Signed-off-by: Patrick Steinhardt <ps@pks.im>

Implement the reftable backend that is used to read and write reftables. The backend is not yet used anywhere after this commit.

Wire up the "reftable" storage format via the newly implemented reftable backend.

Generate test resources for reftables. These resources are basically the Git repositories we already have, but converted to use the "reftable" format. For most of the part, this conversion is done by executing `git refs migrate`. A couple notes: - This require a recent Git upstream version with not-yet-upstreamed patches due to a bug in `git refs migrate` with reflogs. - The migration command does not yet support repositories with worktrees. Those were converted by first removing the worktrees, migrating the refs and then recreating them. - The HEAD_TRACKER reference in testrepo.git is not recognized as a root ref and is thus not automatically migrated. - testrepo.git has an empty reflog for refs/heads/with-empty-log that does not get migrated.

Wire up reftable tests so that they can be executed by setting the `CLAR_REF_FORMAT` environment variable. This only catches tests that use `cl_git_sandbox_init()`, but that should cover most of our tests. So this infrastructure isn't perfect, but for now it's good enough. We may want to iterate on it in the future.

pks-t added 18 commits July 18, 2025 14:06

refs: handle REVERT_HEAD via the refdb

ee55d24

Fix handling of "REVERT_HEAD" by exclusively reading and writing it via the reference database.

refs: handle CHERRY_PICK_HEAD via the refdb

5740985

Fix handling of "CHERRY_PICK_HEAD" by exclusively reading and writing it via the reference database.

refdb_fs: expose function to read a loose ref

c40dd7d

Expose a function to read a loose reference. This function will be used in a subsequent commit to read pseudo-refs on the generic refdb layer.

pks-t mentioned this pull request Aug 4, 2025

WIP: Reftables support #5462

Closed

pks-t added 4 commits August 4, 2025 16:30

Merge branch 'pks-refformat-extension' into HEAD

d7fc1cd

Merge branch 'pks-refdb-pseudorefs' into HEAD

db25710

Merge branch 'pks-test-fixes-for-reftables' into HEAD

d4ea2f3

deps: import reftable library

1d1d1d5

Import the reftable library from commit f1228cd12c1 (reftable: make REFTABLE_UNUSED C99 compatible, 2025-05-29). This is an exact copy of the reftable library. The library will be wired into libgit2 over the next couple of commits.

pks-gitlab force-pushed the pks/reftables-support branch 7 times, most recently from e9d60f7 to a8f50a6 Compare August 5, 2025 05:12

pks-gitlab force-pushed the pks/reftables-support branch 4 times, most recently from a293186 to 0c29ee0 Compare August 5, 2025 06:31

pks-t added 18 commits August 5, 2025 11:34

deps/reftable: wire up library with CMake

736a75a

Wire up the reftable library with CMake. At the current point in time the library is not yet plugged into libgit2 itself.

deps/reftable: allow passing flags to reftable_stack_add()

52b0ece

While it is possible to pass flags to `reftable_stack_new_addition()` to alter the behaviour of a stack addition, it is not possible to do so via `reftable_stack_add()`. Plug this gap.

deps/reftable: fix for concurrent compaction

0abd1a6

deps/reftable: provide mmap and munap via system.h

c9a7198

deps/reftable: allow providing fstat(3p) at all

d4f3862

deps/reftable: allow providing sleeping-related functions

9b24d21

deps/reftable: don't use fsync(3p) if none was provided

f7b9d26

deps/reftable: fix "unknown size" warning with MSVC

76a102c

error C2036: 'void *': unknown size

deps/reftable: consistently include reftable-basics.h

318dedf

refdb: implement reftable backend

1831e12

Implement the reftable backend that is used to read and write reftables. The backend is not yet used anywhere after this commit.

refdb: wire up "reftable" storage format

08d8620

Wire up the "reftable" storage format via the newly implemented reftable backend.

pks-gitlab force-pushed the pks/reftables-support branch 3 times, most recently from b810239 to fba6f0e Compare August 5, 2025 11:27

pks-t added 2 commits August 5, 2025 14:04

github: wire up reftable tests

b122fb0

pks-gitlab force-pushed the pks/reftables-support branch from fba6f0e to b122fb0 Compare August 5, 2025 12:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Reftables support #7117

Reftables support #7117

Uh oh!

pks-gitlab commented Aug 4, 2025

Uh oh!

Uh oh!

Reftables support #7117

Are you sure you want to change the base?

Reftables support #7117

Uh oh!

Conversation

pks-gitlab commented Aug 4, 2025

Uh oh!

Uh oh!