RFC: Towards reproducible builds for our PyPI release wheels #28151
In addition to wheels, we might also want to obtain reproducible conda-forge release artifacts. This is being discussed here:
The following blog posts by @sethmlarson can serve as interesting references:
In particular, we could attempt to generate SBOM descriptions (Software Bill of Materials) using the
Does this include pinning our build dependencies in the
I don't think so, because we don't want the generated package metadata to be pinned to a particular version of the dependencies. But we would need to find a way to tell
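One way to reconcile unpinned package metadata with a fully pinned build environment is a separate, hash-pinned requirements file consumed only by the release CI via pip's hash-checking mode (`pip install --require-hashes -r build-requirements.txt`). A sketch, with a hypothetical file name and a placeholder digest:

```
# build-requirements.txt (hypothetical file, used only by the release CI;
# the runtime metadata in pyproject.toml keeps its loose minimum versions)
numpy==2.0.1 \
    --hash=sha256:<digest-of-the-exact-sdist-or-wheel>
cython==3.0.10 \
    --hash=sha256:<digest-of-the-exact-sdist-or-wheel>
```

With `--require-hashes`, pip refuses to install anything whose digest does not match, so a compromised index or mirror cannot silently substitute a different artifact.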
I had understood it so that there is a tool that creates a record of your build environment, and a second (set of) tool(s) that can recreate the build environment from that record. I am not sure whether this means you are free to create the build environment however you want (pinned or not), or whether you need to use a particular tool/strategy to create it so that the recording tool can do its job.

In general I think this would be a useful thing to have/work on. As Olivier said, it will be quite a lot of work though, so picking some easier first steps that already provide some value is a good way to get started.

A learning from mybinder.org users is that you need to exercise the "recreate environment from your record" step regularly to check that it works and continues to work. A bit like you can't just create a tape backup and then put the tape in a safe: you need to regularly try to restore your backup to check it still works, reposition the tape, copy to a newer tape, etc. The learning was that just because today you can successfully recreate the environment does not mean that you will be able to do so tomorrow. The chances are high, but not 100%. So yet another thing to think about/work on :D
There is ongoing work to survey SBOM generation tools for Python projects here:
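To make concrete what an SBOM records, here is a minimal hand-rolled sketch in the CycloneDX JSON format (the component name, version, and digest below are placeholders, not real release data; in practice a dedicated tool would generate this):

```python
import json

# Minimal SBOM sketch in the CycloneDX JSON format: each build dependency is
# listed with its exact version and content digest, so auditors can later
# verify precisely what went into a release artifact.
sbom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "components": [
        {
            "type": "library",
            "name": "example-build-dependency",  # placeholder component
            "version": "1.0.0",                  # placeholder version
            "hashes": [{"alg": "SHA-256", "content": "0" * 64}],  # placeholder digest
        }
    ],
}
print(json.dumps(sbom, indent=2))
```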
Given the popularity of our project, our release automation might be considered an interesting target to conduct supply chain attacks to make our binaries ship spyware or ransomware to some of our users.
One way to detect such attacks would be to:
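Whatever the exact verification steps end up being, the core check is rebuilding the wheel independently from the tagged source and comparing digests. A sketch (the `.whl` files below are stand-ins created for illustration; in a real check one would be downloaded from PyPI and the other rebuilt locally):

```shell
# Stand-ins for the real artifacts: the PyPI download and the local rebuild.
printf 'wheel-bytes' > published.whl
printf 'wheel-bytes' > rebuilt.whl

# Byte-for-byte comparison via content digests.
published=$(sha256sum published.whl | awk '{print $1}')
rebuilt=$(sha256sum rebuilt.whl | awk '{print $1}')

if [ "$rebuilt" = "$published" ]; then
    echo "REPRODUCIBLE"
else
    echo "MISMATCH: $rebuilt != $published"
fi
```

A mismatch would not prove an attack (it could be a nondeterministic build), which is exactly why the build first needs to be made reproducible.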
The first step to make our wheels as reproducible as possible would be to define deterministic values for the `SOURCE_DATE_EPOCH` (and maybe `PYTHONHASHSEED`, which cannot hurt) environment variables. However, this would not be enough.
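Wheels are zip archives, so the timestamps and permissions embedded at build time are exactly the metadata `SOURCE_DATE_EPOCH` is meant to pin down. A minimal illustration (not the actual wheel-building code path) of why fixing the timestamp makes an archive byte-for-byte reproducible:

```python
import io
import time
import zipfile


def build_archive(payload: bytes, epoch: int) -> bytes:
    """Build a small zip (wheels are zips) with fully pinned metadata."""
    buf = io.BytesIO()
    # Zip entries store a local timestamp; deriving it from a fixed epoch
    # (the role SOURCE_DATE_EPOCH plays for build tools) removes one source
    # of nondeterminism.
    info = zipfile.ZipInfo("pkg/module.py", date_time=time.gmtime(epoch)[:6])
    info.external_attr = 0o644 << 16  # fixed permissions, independent of umask
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(info, payload)
    return buf.getvalue()


a = build_archive(b"print('hello')\n", 1700000000)
b = build_archive(b"print('hello')\n", 1700000000)
assert a == b  # identical inputs and pinned metadata -> identical bytes
```

With a wall-clock timestamp instead of a fixed epoch, two builds of the same source would already differ at the byte level even though their contents are identical.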
To get this to fully work as expected, we would also need to guarantee that:
- we use recent enough versions of pip/setuptools/wheel/auditwheel/delocate that honor `SOURCE_DATE_EPOCH`;
- a full description of the build environment (e.g. versions and sha256 digests of the compilers and other build dependencies) is archived in our source repo for a given tag of scikit-learn. Ideally, all those build dependencies should themselves be byte-for-byte reproducible from their own public source code repos.

Currently, some build dependencies such as NumPy and Cython come from the `pyproject.toml` file, which only specifies a minimum version. This means that we may end up with newer versions of these dependencies than the ones used to build the wheels for a given tag. `cibuildwheel` itself is not pinned, and hence neither are the dependencies it installs in its managed venvs (pip, setuptools, wheel, auditwheel, delocate).

Furthermore, we do not archive or pin the versions and sha256 digests of the compilers yet. For Linux, this depends on the manylinux docker image used by cibuildwheel, which at the time of writing is not guaranteed to be reproducible, even when using the same docker image tag. For Windows and macOS, the compilers come from the VM image used on our CI; we archive neither their version numbers nor the hashes of their binaries.
Ideally all this information should be in our source code at the time of the release (reachable via a checkout of our commit tag).
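As a concrete step toward pinning the tools cibuildwheel installs in its managed venvs, cibuildwheel exposes a `dependency-versions` setting. A sketch, assuming that option and current cibuildwheel configuration conventions:

```toml
# pyproject.toml sketch: ask cibuildwheel to use its own pinned versions of
# pip/setuptools/wheel/auditwheel/delocate instead of the latest releases.
[tool.cibuildwheel]
dependency-versions = "pinned"
```

The cibuildwheel version itself would still need to be pinned separately in the CI workflow that invokes it, so that the same tag of scikit-learn always resolves to the same build toolchain.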
Finally, we might need to set a specific `umask`:

Not sure about how to get deterministic file permission metadata for macOS and Windows wheels.
EDIT: now that we use meson, this problem with umask might have gone away, but we need to check.
EDIT2: I tried and I think we still have a sensitivity to umask after the switch to the meson build system.
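The umask sensitivity is easy to demonstrate: files created during the build inherit permissions from the process umask, and those modes end up in the wheel's zip metadata. A minimal check:

```python
import os
import stat
import tempfile


def mode_with_umask(mask: int) -> int:
    """Create a file under the given umask and return its permission bits."""
    old = os.umask(mask)
    try:
        path = os.path.join(tempfile.mkdtemp(), "built_artifact")
        # open() requests mode 0o666; the kernel masks it with the umask,
        # which is how two builders with different umasks get different wheels.
        with open(path, "w") as f:
            f.write("data")
        return stat.S_IMODE(os.stat(path).st_mode)
    finally:
        os.umask(old)


assert mode_with_umask(0o022) == 0o644
assert mode_with_umask(0o077) == 0o600
```

Setting an explicit `umask 022` at the start of the build scripts (or normalizing permissions when the wheel is assembled) would remove this source of variation on Linux; how to achieve the same for macOS and Windows wheels remains open, as noted above.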
Finally, once our builds are made 100% reproducible, we would need to document:
This is just for scikit-learn itself. But for this kind of supply chain audit to be meaningful, we would need to make sure that all the tools in the build pipeline of scikit-learn are themselves reproducible and regularly and independently reproduced, including:

- `auditwheel`/`delocate`/`delvewheel`/`repairwheel`;
- the `sha256sum` command :)

We would also need to snapshot the provenance info before running the tests (in case pytest or any test dependency is itself supply-chain attacked). For instance, the test environment was effectively used to hide the attack on the xz binaries.
Note that a large fraction of Debian is already reproducible, but we would need to trace everything in our build process to check that this is the case for our pipeline.
Doing all of this will require a significant investment of maintainer time, but we can probably start from low-hanging fruit such as setting `SOURCE_DATE_EPOCH` in our release CI scripts.