RFC: Towards reproducible builds for our PyPI release wheels #28151
In addition to wheels, we might also want to obtain reproducible conda-forge release artifacts. This is being discussed here:
The following blog posts by @sethmlarson can serve as interesting references:
In particular, we could attempt to generate SBOM descriptions (Software Bill of Materials) using the
Does this include pinning our build dependencies in the
I don't think so, because we don't want the generated package metadata to be pinned to a particular version of the dependencies. But we would need to find a way to tell
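One way to reconcile unpinned package metadata with a fully pinned build environment is a separate, hash-pinned requirements file consumed only by the release CI via pip's hash-checking mode (`pip install --require-hashes -r build-requirements.txt`). A sketch, with a hypothetical file name and a placeholder digest:

```
# build-requirements.txt (hypothetical file, used only by the release CI;
# the runtime metadata in pyproject.toml keeps its loose minimum versions)
numpy==2.0.1 \
    --hash=sha256:<digest-of-the-exact-sdist-or-wheel>
cython==3.0.10 \
    --hash=sha256:<digest-of-the-exact-sdist-or-wheel>
```

With `--require-hashes`, pip refuses to install anything whose digest does not match, so a compromised index or mirror cannot silently substitute a different artifact.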
I had understood it so that there is a tool that creates a record of your build environment, and a second (set of) tool(s) that can recreate the build environment from that record. I am not sure whether this means you are free to create the build environment however you want (pinned or not), or whether you need to use a particular tool/strategy to create it so that the recording tool can do its job.

In general I think this would be a useful thing to have/work on. As Olivier said, it will be quite a lot of work though, so picking some easier first steps that already provide some value is a good way to get started.

A learning from mybinder.org users is that you need to exercise the "recreate environment from your record" step regularly to check that it works and continues to work. A bit like you can't just create a tape backup and then put the tape in a safe: you need to regularly try to restore your backup to check it still works, reposition the tape, copy to a newer tape, etc. The learning was that just because today you can successfully recreate the environment does not mean that you will be able to do so tomorrow. The chances are high, but not 100%. So yet another thing to think about/work on :D
There is ongoing work to survey SBOM generation tools for Python projects here:
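To make concrete what an SBOM records, here is a minimal hand-rolled sketch in the CycloneDX JSON format (the component name, version, and digest below are placeholders, not real release data; in practice a dedicated tool would generate this):

```python
import json

# Minimal SBOM sketch in the CycloneDX JSON format: each build dependency is
# listed with its exact version and content digest, so auditors can later
# verify precisely what went into a release artifact.
sbom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "components": [
        {
            "type": "library",
            "name": "example-build-dependency",  # placeholder component
            "version": "1.0.0",                  # placeholder version
            "hashes": [{"alg": "SHA-256", "content": "0" * 64}],  # placeholder digest
        }
    ],
}
print(json.dumps(sbom, indent=2))
```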
Given the popularity of our project, our release automation might be considered an interesting target to conduct supply chain attacks to make our binaries ship spyware or ransomware to some of our users.
One way to detect such attacks would be to:
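Whatever the exact verification steps end up being, the core check is rebuilding the wheel independently from the tagged source and comparing digests. A sketch (the `.whl` files below are stand-ins created for illustration; in a real check one would be downloaded from PyPI and the other rebuilt locally):

```shell
# Stand-ins for the real artifacts: the PyPI download and the local rebuild.
printf 'wheel-bytes' > published.whl
printf 'wheel-bytes' > rebuilt.whl

# Byte-for-byte comparison via content digests.
published=$(sha256sum published.whl | awk '{print $1}')
rebuilt=$(sha256sum rebuilt.whl | awk '{print $1}')

if [ "$rebuilt" = "$published" ]; then
    echo "REPRODUCIBLE"
else
    echo "MISMATCH: $rebuilt != $published"
fi
```

A mismatch would not prove an attack (it could be a nondeterministic build), which is exactly why the build first needs to be made reproducible.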
The first step to make our wheels as reproducible as possible would be to define deterministic values for the `SOURCE_DATE_EPOCH` (and maybe `PYTHONHASHSEED`, which cannot hurt) environment variables. However, this would not be enough.
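Wheels are zip archives, so the timestamps and permissions embedded at build time are exactly the metadata `SOURCE_DATE_EPOCH` is meant to pin down. A minimal illustration (not the actual wheel-building code path) of why fixing the timestamp makes an archive byte-for-byte reproducible:

```python
import io
import time
import zipfile


def build_archive(payload: bytes, epoch: int) -> bytes:
    """Build a small zip (wheels are zips) with fully pinned metadata."""
    buf = io.BytesIO()
    # Zip entries store a local timestamp; deriving it from a fixed epoch
    # (the role SOURCE_DATE_EPOCH plays for build tools) removes one source
    # of nondeterminism.
    info = zipfile.ZipInfo("pkg/module.py", date_time=time.gmtime(epoch)[:6])
    info.external_attr = 0o644 << 16  # fixed permissions, independent of umask
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(info, payload)
    return buf.getvalue()


a = build_archive(b"print('hello')\n", 1700000000)
b = build_archive(b"print('hello')\n", 1700000000)
assert a == b  # identical inputs and pinned metadata -> identical bytes
```

With a wall-clock timestamp instead of a fixed epoch, two builds of the same source would already differ at the byte level even though their contents are identical.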
To get this to fully work as expected, we would also need to guarantee that:
- we use recent enough versions of pip/setuptools/wheel/auditwheel/delocate that honor `SOURCE_DATE_EPOCH`;
- a full description of the build environment (e.g. versions and sha256 digests of the compilers and other build dependencies) is archived in our source repo for a given tag of scikit-learn. Ideally, all those build dependencies should themselves be byte-for-byte reproducible from their own public source code repos.

Currently, some build dependencies such as NumPy and Cython come from the `pyproject.toml` file, which only specifies a minimum version. This means that we may end up with newer versions of these dependencies than the ones used to build the wheels for a given tag. `cibuildwheel` itself is not pinned, and hence neither are the dependencies it installs in its managed venvs (pip, setuptools, wheel, auditwheel, delocate).

Furthermore, we do not archive or pin the versions and sha256 digests of the compilers yet. For Linux, this depends on the manylinux docker image used by cibuildwheel, which at the time of writing is not guaranteed to be reproducible, even when using the same docker image tag. For Windows and macOS, the compilers come from the VM image used on our CI; we archive neither their version numbers nor the hashes of their binaries.
Ideally all this information should be in our source code at the time of the release (reachable via a checkout of our commit tag).
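As a concrete step toward pinning the tools cibuildwheel installs in its managed venvs, cibuildwheel exposes a `dependency-versions` setting. A sketch, assuming that option and current cibuildwheel configuration conventions:

```toml
# pyproject.toml sketch: ask cibuildwheel to use its own pinned versions of
# pip/setuptools/wheel/auditwheel/delocate instead of the latest releases.
[tool.cibuildwheel]
dependency-versions = "pinned"
```

The cibuildwheel version itself would still need to be pinned separately in the CI workflow that invokes it, so that the same tag of scikit-learn always resolves to the same build toolchain.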
Finally, we might need to set a specific `umask`:

Not sure about how to get deterministic file permission metadata for macOS and Windows wheels.
EDIT: now that we use meson, this problem with umask might have gone away, but we need to check.
EDIT2: I tried and I think we still have a sensitivity to umask after the switch to the meson build system.
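The umask sensitivity is easy to demonstrate: files created during the build inherit permissions from the process umask, and those modes end up in the wheel's zip metadata. A minimal check:

```python
import os
import stat
import tempfile


def mode_with_umask(mask: int) -> int:
    """Create a file under the given umask and return its permission bits."""
    old = os.umask(mask)
    try:
        path = os.path.join(tempfile.mkdtemp(), "built_artifact")
        # open() requests mode 0o666; the kernel masks it with the umask,
        # which is how two builders with different umasks get different wheels.
        with open(path, "w") as f:
            f.write("data")
        return stat.S_IMODE(os.stat(path).st_mode)
    finally:
        os.umask(old)


assert mode_with_umask(0o022) == 0o644
assert mode_with_umask(0o077) == 0o600
```

Setting an explicit `umask 022` at the start of the build scripts (or normalizing permissions when the wheel is assembled) would remove this source of variation on Linux; how to achieve the same for macOS and Windows wheels remains open, as noted above.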
Finally, once our builds are made 100% reproducible, we would need to document:
This is just for scikit-learn itself. But for this kind of supply chain audit to be meaningful, we would need to make sure that all the tools in the build pipeline of scikit-learn are themselves reproducible and regularly and independently reproduced, including:

- `auditwheel`/`delocate`/`delvewheel`/`repairwheel`;
- the `sha256sum` command :)

We would also need to snapshot the provenance info before running the tests (in case pytest or any test dependency is itself supply-chain attacked). For instance, the test environment was effectively used to hide the attack on the xz binaries.
Note that a large fraction of Debian is already reproducible, but we would need to trace everything in our build process to check that this is the case for our pipeline.
Doing all of this will require a significant investment of maintainer time, but we can probably start from low-hanging fruit such as setting `SOURCE_DATE_EPOCH` in our release CI scripts.