RFC: Towards reproducible builds for our PyPI release wheels #28151

Open
ogrisel opened this issue Jan 17, 2024 · 6 comments
@ogrisel
Member

ogrisel commented Jan 17, 2024

Given the popularity of our project, our release automation might be considered an interesting target for supply chain attacks that make our binaries ship spyware or ransomware to some of our users.

One way to detect such attacks would be to:

  • make sure we produce reproducible builds;
  • rebuild our wheels from independent build environments and check that we obtain the same hashes as the binaries produced by our release CI, to make sure that our release CI environment has not been tampered with to inject malware into our binaries;
  • optionally make it possible to publish GPG signed statements that some released artifact digests were successfully byte-for-byte reproduced from source independently.

The first step towards making our wheels as reproducible as possible would be to define deterministic values for the SOURCE_DATE_EPOCH (and maybe PYTHONHASHSEED, which cannot hurt) environment variables.
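As a minimal sketch of that first step (the fallback epoch value and the exact script location are assumptions, not decisions), the release CI script could derive the timestamp from the commit being released so that every rebuild of the same tag embeds identical timestamps:

```shell
# Hypothetical sketch: pin SOURCE_DATE_EPOCH to the release commit's date so
# that rebuilds of the same tag embed the same timestamps in the archives.
# Falls back to an arbitrary fixed epoch when git metadata is unavailable.
SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct 2>/dev/null || echo 315532800)
export SOURCE_DATE_EPOCH
export PYTHONHASHSEED=0  # avoid hash-randomization-dependent ordering
echo "SOURCE_DATE_EPOCH=$SOURCE_DATE_EPOCH"
```

Tools that honor the convention (recent setuptools, wheel, etc.) will then use this value instead of the current time when writing file metadata.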

However, this would not be enough.

To get this to fully work as expected, we would also need to guarantee that:

  • we use recent enough versions of pip/setuptools/wheel/auditwheel/delocate
    that honor SOURCE_DATE_EPOCH;

  • a full description of the build environment (e.g. versions and sha256 digests of the
    compilers and other build dependencies) is archived in our source repo for a given tag
    of scikit-learn. Ideally, all those build dependencies should themselves be
    byte-for-byte reproducible from their own public source code repo.

Currently, some build dependencies such as NumPy and Cython come from the pyproject.toml file, which only specifies a minimum version. This means that we may end up with newer versions of these dependencies than the ones used to build the wheels for a given tag. cibuildwheel itself is not pinned, and neither are the dependencies it installs in its managed venvs (pip, setuptools, wheel, auditwheel, delocate).

Furthermore, we do not yet archive or pin the versions and sha256 digests of the compilers. For Linux, this depends on the manylinux docker image used by cibuildwheel, which, at the time of writing, is not guaranteed to be reproducible, even when using the same docker image tag. For Windows and macOS, the compilers come from the VM image used on our CI; we archive neither their version numbers nor the hashes of their binaries.

Ideally all this information should be in our source code at the time of the release (reachable via a checkout of our commit tag).
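To make the "archive a full description of the build environment" point concrete, here is a hypothetical sketch (the manifest layout and function name are made up for illustration) of recording the version and per-file sha256 digests of installed build dependencies, so the resulting JSON could be committed to the repo for a given tag:

```python
# Hypothetical sketch: record a build-environment manifest (version plus
# sha256 digests of the installed files) for each build dependency.
import hashlib
import json
from importlib import metadata


def build_manifest(packages):
    manifest = {}
    for name in packages:
        dist = metadata.distribution(name)
        digests = {}
        for f in dist.files or []:
            path = dist.locate_file(f)
            if path.is_file():
                digests[str(f)] = hashlib.sha256(path.read_bytes()).hexdigest()
        manifest[name] = {"version": dist.version, "files": digests}
    return manifest


if __name__ == "__main__":
    # In a real release script, this would cover numpy, Cython, meson-python...
    print(json.dumps(build_manifest(["pip"]), indent=2)[:300])
```

A rebuilder could then diff a freshly generated manifest against the archived one before attempting a byte-for-byte reproduction.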

Finally, we might need to set a specific umask:

Not sure about how to get deterministic file permission metadata for macOS and Windows wheels.

EDIT: now that we use meson, this problem with umask might have gone away, but we need to check.

EDIT2: I tried and I think we still have a sensitivity to umask after the switch to the meson build system.
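One way to check that umask sensitivity empirically is to inspect the Unix permission bits stored in the wheel itself: wheels are zip files, and each entry's `external_attr` embeds the file mode, which is exactly where a differing umask leaks into the artifact. A small sketch (the helper name is made up):

```python
# Hypothetical sketch: list the Unix file modes recorded inside a wheel.
# Two builds of the same tag under different umasks would show differing
# modes here, breaking byte-for-byte reproducibility.
import stat
import zipfile


def entry_modes(wheel_path):
    with zipfile.ZipFile(wheel_path) as zf:
        return {
            info.filename: stat.filemode(info.external_attr >> 16)
            for info in zf.infolist()
        }
```

Comparing the output of `entry_modes` for two independently built wheels pinpoints permission-metadata differences without diffing the raw bytes.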

Finally, once our builds are made 100% reproducible, we would need to:

  • document instructions (and provide official scripts) to allow anyone to easily rebuild the binaries from source independently;
  • make the process easy to automate on private infrastructure (distinct from our usual public CI);
  • publish official reproducibility results on a public site, ideally not only on our GitHub Pages hosted website, and perhaps sign them with GPG;
  • coordinate with people from https://scientific-python.org or even the PSF or pypi.org admins to define and follow community-wide best practices.

This is just for scikit-learn itself. But for this kind of supply chain audit to be meaningful, we would need to make sure that all the tools in the build pipeline of scikit-learn are themselves reproducible and regularly and independently reproduced, including:

  • compilers;
  • runtime libraries such as the libc;
  • build dependencies (numpy, Cython, meson-python, ninja, cibuildwheel);
  • wheel binary editing tools such as auditwheel/delocate/delvewheel/repairwheel;
  • the sha256sum command :)
  • the whole manylinux docker image and probably docker since it is required to build manylinux wheels.

We would also need to snapshot the provenance info before running the tests (in case pytest or any test dependency is itself the target of a supply chain attack). For instance, the test suite was effectively used to hide the backdoor in the xz binaries.

Note that a large fraction of Debian is already reproducible, but we would need to trace everything in our build process to check that this is the case for our whole tool chain.

Doing all of this will require a significant investment of maintainer time, but we can probably start with low-hanging fruit such as setting SOURCE_DATE_EPOCH in our release CI scripts.

@github-actions github-actions bot added the Needs Triage Issue requires triage label Jan 17, 2024
@ogrisel ogrisel added RFC Meta-issue General issue associated to an identified list of tasks and removed Needs Triage Issue requires triage labels Jan 17, 2024
@ogrisel ogrisel changed the title Towards reproducible builds for our PyPI release wheels RFC: Towards reproducible builds for our PyPI release wheels Jan 17, 2024
@ogrisel
Member Author

ogrisel commented Jan 18, 2024

In addition to wheels, we might also want to obtain reproducible conda-forge release artifacts. This is being discussed here:

@ogrisel
Member Author

ogrisel commented Jan 19, 2024

The following blog posts by @sethmlarson can serve as interesting references:

In particular, we could attempt to generate SBOM (Software Bill of Materials) descriptions in the .spdx.json format for each of our wheels (and for the source tarball) as a first step, and then provide a script to automate byte-for-byte reproductions using the information provided in the SBOM files.
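To illustrate what such a per-wheel SBOM entry could carry, here is a minimal sketch; the field names follow the SPDX JSON conventions (`versionInfo`, `checksums`), but a real SBOM would be produced by a dedicated tool rather than written by hand, and the function name is made up:

```python
# Hypothetical sketch: a minimal SPDX-style package entry for one wheel,
# carrying exactly the digest a rebuilder would need to compare against.
import hashlib
import json


def sbom_package(name, version, wheel_bytes):
    return {
        "name": name,
        "versionInfo": version,
        "checksums": [
            {
                "algorithm": "SHA256",
                "checksumValue": hashlib.sha256(wheel_bytes).hexdigest(),
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(sbom_package("scikit-learn", "1.4.0", b"fake-bytes"), indent=2))
```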

@thomasjpfan
Member

Does this include pinning our build dependencies in the pyproject.toml for the release branch?

@ogrisel
Member Author

ogrisel commented Jan 29, 2024

Does this include pinning our build dependencies in the pyproject.toml for the release branch?

I don't think so, because we don't want the generated package metadata to be pinned to a particular version of the dependencies.

But we would need to find a way to tell cibuildwheel not to rely on pyproject.toml to set up its build environment, but instead use one we can control with some form of lock file.
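One possible shape for this (a sketch only; the constraints file name is an assumption, and the exact `build-frontend` table syntax should be checked against the cibuildwheel version in use) is to disable build isolation and pre-install pinned build requirements:

```toml
# Hypothetical sketch of a cibuildwheel configuration that bypasses the
# pyproject.toml-driven isolated build environment and installs pinned
# build requirements from a lock/constraints file instead.
[tool.cibuildwheel]
build-frontend = { name = "pip", args = ["--no-build-isolation"] }
before-build = "pip install -c build-constraints.txt numpy cython meson-python ninja"
```

This keeps the published package metadata unpinned while making the actual build environment fully determined by the constraints file checked into the repo.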

@betatim
Member

betatim commented Jan 30, 2024

I had understood it such that there is a tool that creates a record of your build environment and a second (set of) tool(s) that can recreate the build environment from that record.

Not sure if this would mean that you are free to create the build environment however you want (pinned or not) or if you need to use a particular tool/strategy to create it so that the tool that creates the record of the build environment can do its job.


In general I think this would be a useful thing to have/work on. As Olivier said it will be quite a lot of work though. So picking some easier first steps that already provide some value is a good way to get started.

A lesson from mybinder.org users is that you need to exercise the "recreate environment from your record" step regularly to check that it works and continues to work. A bit like you can't just create a tape backup and then put the tape in a safe. You need to regularly try to restore your backup to check it still works, reposition the tape, copy to a newer tape, etc. The lesson was that just because today you can successfully recreate the environment does not mean that you will be able to do so tomorrow. Yes, the chances are high but not 100%. So yet another thing to think about/work on :D
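The "exercise the restore regularly" idea could be automated with a scheduled CI job; a sketch (the workflow, script, and digest file names are all assumptions):

```yaml
# Hypothetical sketch of a scheduled job that regularly re-runs the
# "rebuild from record and compare digests" check, so we notice early
# when the recorded environment can no longer be recreated.
name: reproducibility-check
on:
  schedule:
    - cron: "0 3 * * 1"  # weekly
jobs:
  rebuild-and-compare:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./build_tools/rebuild_wheel.sh       # assumed rebuild script
      - run: sha256sum -c recorded-digests.sha256  # assumed digest record
```

Running this on infrastructure distinct from the release CI would also serve as the independent reproduction mentioned earlier in the thread.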

@ogrisel
Member Author

ogrisel commented Feb 6, 2025

There is ongoing work to survey SBOM generation tools for Python projects here:
