Skip to content

Conversation

pks-gitlab
Copy link

This pull request introduces support for the reftable backend. It depends on the following list of pull requests:

Other than that, all tests are passing and there aren't any memory leaks. But some polishing is still required for non-Linux platforms.

This pull request supersedes #5462.

pks-t added 18 commits July 18, 2025 14:06
We have a bunch of references that we treat like pseudo-refs. Those
references are (sometimes) read and written by going to the filesystem
directly, at other times they are read and written via the refdb. This
works alright with the "files" ref storage format given that any root
reference never gets packed into the "packed-refs" file, and thus they
would always be accessible a loose ref if present.

The behaviour is wrong though when considering alternate backends like
the "reftable" backend. All references except for pseudo-refs must be
read via the backend, and that includes root refs.

Historically this part of Git has been ill-defined, and it wasn't quite
clear which refs are considered pseudo-refs in the first place. This was
clarified in 6fd80375640 (Documentation/glossary: redefine pseudorefs as
special refs, 2024-05-15): there only are two pseudorefs, "FETCH_HEAD"
and "MERGE_HEAD". The reason why those two references are considered
special is that they may contain additional data that doesn't fit into
the normal reference format.

In any case, our current handling of a couple of root references is
broken in this new world.

Fix this for "ORIG_HEAD" by exclusively going through the refdb to read
and write that reference. Rename the define accordingly to clarify that
it is a reference and not a file.
Fix handling of "REVERT_HEAD" by exclusively reading and writing it via
the reference database.
Fix handling of "CHERRY_PICK_HEAD" by exclusively reading and writing it
via the reference database.
The GIT_STASH_FILE define contains the path to the stash reference.
While we know that this used to be a file with the "files" backend, it's
not a standalone file with the "reftable" backend anymore.

Rename the macro to GIT_STASH_REF to indicate that this is a proper ref.
Introduce the GIT_HEAD_REF define so that we can clearly indicate that
we're talking about the "HEAD" reference and not necessarily a file.
Note that there still are a couple of places where GIT_HEAD_FILE is
being used:

  - `git_repository_create_head()`: This function is used to create HEAD
    when initializing a new repository. This should get fixed eventually
    so that we create HEAD via the refdb, but this is a more involved
    refactoring that will be handled in a separate patch series.

  - `repo_init_head()`: Likewise.

  - `conditional_match_onbranch()`: This function is used to decide
    whether or not an `includeIf.onbranch` condition matches. This will
    be fixed in subsequent commits.

Other than that there shouldn't be any more references to GIT_HEAD_FILE.
When we read the repository format information we do so by using the
full configuration of that repository. This configuration not only
includes the repository-level configuration though, but it also includes
the global- and system-level configuration. These configurations should
in practice never contain information about which format a specific
repository uses.

Despite this obvious conceptual error there's also a more subtle issue:
reading the full configuration may require us to evaluate conditional
includes. Those conditional includes may themselves require that the
repository format is already populated though. This is for example the
case with the "onbranch" condition: we need to populate the refdb to
evaluate that condition, but to populate the refdb we need to first know
about the repository format.

Fix this by using the repository-level configuration, only, to determine
the repository's format.
To support multiple different reference backend implementations,
Git introduced a "refStorage" extension that stores the reference
storage format a Git client should try to use.

Wire up the logic to read this new extension when we open a repository
from disk. For now, only the "files" backend is supported by us. When
trying to open a repository that has a refstorage format that we don't
understand we now error out.

There are two functions that create a new repository that doesn't really
have references. While those are mostly non-functional when it comes to
references, we do expect that you can access the refdb, even if it's not
yielding any refs. For now we mark those to use the "files" backend, so
that the status quo is retained. Eventually though it might not be the
worst idea to introduce an explicit "in-memory" reference database. But
that is outside the scope of this patch series.
While we only support initializing repositories with the "files"
reference backend right now, we are in the process of implementing a
second backend with the "reftable" format. And while we already have the
infrastructure to decide which format a repository should use when we
open it, we do not have infrastructure yet to create new repositories
with a different reference format.

Introduce a new field `git_repository_init_options::refdb_type`. If
unset, we'll default to the "files" backend. Otherwise though, if set to
a valid `git_refdb_t`, we will use that new format to initialize the
repostiory.

Note that for now the only thing we do is to write the "refStorage"
extension accordingly. What we explicitly don't yet do is to also handle
the backend-specific logic to initialize the refdb on disk. This will be
implemented in subsequent commits.
In our tests for "onbranch" config conditionals we set HEAD to point to
various different branches via `git_repository_create_head()`. This
function circumvents the refdb though and directly writes to the "HEAD"
file. While this works now, it will create problems once we have
multiple refdb backends.

Furthermore, the function is about to go away in the next commit. So
let's prepare for that and use `git_reference_symbolic_create()`
instead.
The initialization of the on-disk state of refdbs is currently not
handled by the actual refdb backend, but it's implemented ad-hoc where
needed. This is problematic once we have multiple different refdbs as
the filesystem structure is of course not the same.

Introduce a new callback function `git_refdb_backend::init()`. If set,
this callback can be invoked via `git_refdb_init()` to initialize the
on-disk state of a refdb. Like this, each backend can decide for itself
how exactly to do this.

Note that the initialization of the refdb is a bit intricate. A
repository is only recognized as such when it has a "HEAD" file as well
as a "refs/" directory. Consequently, regardless of which refdb format
we use, those files must always be present. This also proves to be
problematic for us, as we cannot access the repository and thus don't
have access to the refdb if those files didn't exist.

To work around the issue we thus handle the creation of those files
outside of the refdb-specific logic. We actually use the same strategy
as Git does, and write the invalid reference "ref: refs/heads/.invalid"
into "HEAD". This looks almost like a ref, but the name of that ref
is not valid and should thus trip up Git clients that try to read that
ref in a repository that really uses a different format.

So while that invalid "HEAD" reference will of course get rewritten by
the "files" backend, other backends should just retain it as-is.
Introduce a function that reads the "refStorage" extension so that we
can easily figure out whether a specific repository uses the "files" or
any other reference format. While we don't support other formats yet, we
are about to add support for the "reftable" format.
There are a bunch of tests where we read or write references via the
filesystem directly. This only works with the "files" backend, but
naturally breaks if we supported any other reference format.

Refactor these tests to instead use the refdb to access those.
When testing conditional includes we overwrite the repository's config
file with the relevant conditions. This causes us to fully overwrite all
repository configuration, including the repository format version and
any extensions. While the test repository used in this test does not
have any extensions, we will add a reftable-enabled repository that does
rely on the "refStorage" extension eventually.

Fix this by only modifying the relevant config keys.
We have a bunch of checks for properties of the "files" reference
backend:

  - Whether a specific reference has been packed or whether it still
    exists as a loose reference.

  - Whether empty ref directories get pruned.

  - Whether we properly fsync data to disk.

These checks continue to be sensible for that backend, but for any other
backend they plain don't work. Adapt the tests so that we only run them
in case the repository uses the "files" backend.
When initializing the "files" refdb we read a couple of values from the
Git configuration. Unfortunately, this causes a chicken-and-egg problem
when reading configuration with "includeIf.onbranch" conditionals: we
need to instantiate the refdb to evaluate the condition, but we need to
read the configuration to initialize the refdb.

We currently work around the issue by reading the "HEAD" file directly
when evaluating the conditional include. But while that works with the
"files" backend, any other backends that store "HEAD" anywhere else will
break.

Prepare for a fix by deferring reading the configuration. We really only
need to be able to execute `git_refdb_lookup()`, so all we need to
ensure is that we can look up a branch without triggering any config
reads.
With the preceding commit we have refactored the "files" backend so that
it can be both instantiated and used to look up a reference without
reading any configuration. With this change in place we don't cause
infinite recursion anymore when using the refdb to evaluate "onbranch"
conditions.

Refactor the code to use the refdb to look up "HEAD". Note that we
cannot use `git_reference_lookup()` here, as that function itself tries
to normalize reference names, which in turn involves reading the Git
configuration. So instead, we use the lower-level `git_refdb_lookup()`
function, as we don't need the normalization anyway.
Expose a function to read a loose reference. This function will be used
in a subsequent commit to read pseudo-refs on the generic refdb layer.
Regardless of which reference storage format is used, pseudorefs will
always be looked up via the filesystem as loose refs. This is because
pseudorefs do not strictly follow the reference format and may contain
additional metadata that is not present in a normal reference.

We don't honor this in `git_reference_lookup()` though but instead defer
to the refdb to read such references. This obviously works just fine
with the "files" backend, but any other backend would have to grow
custom logic to handle reading pseudorefs.

Refactor `git_reference_lookup_resolved()` so that it knows to always
read pseudorefs as loose references. This allows refdb implementations
to not care about pseudoref handling at all.
@pks-t pks-t mentioned this pull request Aug 4, 2025
pks-t added 4 commits August 4, 2025 16:30
Import the reftable library from commit f1228cd12c1 (reftable: make
REFTABLE_UNUSED C99 compatible, 2025-05-29). This is an exact copy of
the reftable library.

The library will be wired into libgit2 over the next couple of commits.
@pks-gitlab pks-gitlab force-pushed the pks/reftables-support branch 7 times, most recently from e9d60f7 to a8f50a6 Compare August 5, 2025 05:12
@pks-gitlab pks-gitlab force-pushed the pks/reftables-support branch 4 times, most recently from a293186 to 0c29ee0 Compare August 5, 2025 06:31
pks-t added 18 commits August 5, 2025 11:34
While the reftable library is mostly decoupled from the Git codebase,
some specific functionality intentionally taps into Git subsystems. To
make porting of the reftable library easier all of these are contained
in "system.h" and "system.c".

Reimplement those compatibility shims so that they work for libgit2.
The reftable writer accidentally uses the Git-specific `QSORT()` macro.
This macro removes the need for the caller to provide the element size,
but other than that it's mostly equivalent to `qsort()`.

Replace the macro accordingly. This incompatibility is a bug in Git that
needs to be fixed upstream.
While perfectly legal, older compiler toolchains complain when
zero-initializing structs that contain nested structs with `{0}`:

	/home/libgit2/source/deps/reftable/stack.c:862:35: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
		struct reftable_addition empty = REFTABLE_ADDITION_INIT;
						 ^~~~~~~~~~~~~~~~~~~~~~
	/home/libgit2/source/deps/reftable/stack.c:707:33: note: expanded from macro 'REFTABLE_ADDITION_INIT'
	#define REFTABLE_ADDITION_INIT {0}
					^

Silence this warning by using `{{0}}` instead.
Wire up the reftable library with CMake. At the current point in time
the library is not yet plugged into libgit2 itself.
While it is possible to pass flags to `reftable_stack_new_addition()` to
alter the behaviour of a stack addition, it is not possible to do so via
`reftable_stack_add()`. Plug this gap.
The `flock` interface is implemented as part of "reftable/system.c" and
thus needs to be implemented by the integrator between the reftable
library and its parent code base. As such, we cannot rely on any
specific implementation thereof.

Regardless of that, users of the `flock` subsystem rely on `errno` being
set to specific values. This is fragile and not documented anywhere and
doesn't really make for a good interface.

Refactor the code so that the implementations themselves are expected to
return reftable-specific error codes.
The `inline` keyword is not supported by all compilers, but we use it
freely in the reftable library. Introduce the `REFTABLE_INLINE()` macro
as part of the integration library so that projects can define it as
supported by their respective platforms.
error C2036: 'void *': unknown size
Both `reftable_writer_add_refs()` and `reftable_writer_add_logs()`
accept an array of records that should be added to the new table.
Callers of this function are expected to also pass the number of such
records to the function to tell it how many such records it is supposed
to write.

But while all callers pass in a `size_t`, which is a sensible choice,
the function in fact accepts an `int` as argument, which is less so. Fix
this.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Implement the reftable backend that is used to read and write reftables.
The backend is not yet used anywhere after this commit.
Wire up the "reftable" storage format via the newly implemented reftable
backend.
Generate test resources for reftables. These resources are basically the
Git repositories we already have, but converted to use the "reftable"
format. For most of the part, this conversion is done by executing `git
refs migrate`.

A couple notes:

  - This require a recent Git upstream version with not-yet-upstreamed
    patches due to a bug in `git refs migrate` with reflogs.

  - The migration command does not yet support repositories with
    worktrees. Those were converted by first removing the worktrees,
    migrating the refs and then recreating them.

  - The HEAD_TRACKER reference in testrepo.git is not recognized as a
    root ref and is thus not automatically migrated.

  - testrepo.git has an empty reflog for refs/heads/with-empty-log that
    does not get migrated.
@pks-gitlab pks-gitlab force-pushed the pks/reftables-support branch 3 times, most recently from b810239 to fba6f0e Compare August 5, 2025 11:27
pks-t added 2 commits August 5, 2025 14:04
Wire up reftable tests so that they can be executed by setting the
`CLAR_REF_FORMAT` environment variable. This only catches tests that use
`cl_git_sandbox_init()`, but that should cover most of our tests. So
this infrastructure isn't perfect, but for now it's good enough. We may
want to iterate on it in the future.
@pks-gitlab pks-gitlab force-pushed the pks/reftables-support branch from fba6f0e to b122fb0 Compare August 5, 2025 12:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants