Towards reproducible minimal source code tarballs? On *-src.tar.gz

While the work to analyze the xz backdoor is in progress, several ideas have been suggested to improve the software supply chain ecosystem. Some of those ideas are good, some of the ideas are at best irrelevant and harmless, and some suggestions are plain bad. I’d like to attempt to formalize two ideas, which have been discussed before, but the context in which they can be appreciated have not been as clear as it is today.

  1. Reproducible tarballs. The idea is that published source tarballs should be possible to reproduce independently somehow, and that this should be continuously tested and verified — preferrably as part of the upstream project continuous integration system (e.g., GitHub action or GitLab pipeline). While nominally this looks easy to achieve, there are some complex matters in this, for example: what timestamps to use for files in the tarball? I’ve brought up this aspect before.
  2. Minimal source tarballs without generated vendor files. Most GNU Autoconf/Automake-based tarballs pre-generated files which are important for bootstrapping on exotic systems that does not have the required dependencies. For the bootstrapping story to succeed, this approach is important to support. However it has become clear that this practice raise significant costs and risks. Most modern GNU/Linux distributions have all the required dependencies and actually prefers to re-build everything from source code. These pre-generated extra files introduce uncertainty to that process.

My strawman proposal to improve things is to define new tarball format *-src.tar.gz with at least the following properties:

  1. The tarball should allow users to build the project, which is the entire purpose of all this. This means that at least all source code for the project has to be included.
  2. The tarballs should be signed, for example with PGP or minisign.
  3. The tarball should be possible to reproduce bit-by-bit by a third party using upstream’s version controlled sources and a pointer to which revision was used (e.g., git tag or git commit).
  4. The tarball should not require an Internet connection to download things.
    • Corollary: every external dependency either has to be explicitly documented as such (e.g., gcc and GnuTLS), or included in the tarball.
    • Observation: This means including all *.po gettext translations which are normally downloaded when building from version controlled sources.
  5. The tarball should contain everything required to build the project from source using as much externally released versioned tooling as possible. This is the “minimal” property lacking today.
    • Corollary: This means including a vendored copy of OpenSSL or libz is not acceptable: link to them as external projects.
    • Open question: How about non-released external tooling such as gnulib or autoconf archive macros? This is a bit more delicate: most distributions either just package one current version of gnulib or autoconf archive, not previous versions. While this could change, and distributions could package the gnulib git repository (up to some current version) and the autoconf archive git repository — and packages were set up to extract the version they need (gnulib’s ./bootstrap already supports this via the –gnulib-refdir parameter), this is not normally in place.
    • Suggested Corollary: The tarball should contain content from git submodule’s such as gnulib and the necessary Autoconf archive M4 macros required by the project.
  6. Similar to how the GNU project specify the ./configure interface we need a documented interface for how to bootstrap the project. I suggest to use the already well established idiom of running ./bootstrap to set up the package to later be able to be built via ./configure. Of course, some projects are not using the autotool ./configure interface and will not follow this aspect either, but like most build systems that compete with autotools have instructions on how to build the project, they should document similar interfaces for bootstrapping the source tarball to allow building.

If tarballs that achieve the above goals were available from popular upstream projects, distributions could more easily use them instead of current tarballs that include pre-generated content. The advantage would be that the build process is not tainted by “unnecessary” files. We need to develop tools for maintainers to create these tarballs, similar to make dist that generate today’s foo-1.2.3.tar.gz files.

I think one common argument against this approach will be: Why bother with all that, and just use git-archive outputs? Or avoid the entire tarball approach and move directly towards version controlled check outs and referring to upstream releases as git URL and commit tag or id. My counter-argument is that this optimize for packagers’ benefits at the cost of upstream maintainers: most upstream maintainers do not want to store gettext *.po translations in their source code repository. A compromise between the needs of maintainers and packagers is useful, so this *-src.tar.gz tarball approach is the indirection we need to solve that.

What do you think?

Enforcing wrap-and-sort -satb

For Debian package maintainers, the wrap-and-sort tool is one of those nice tools that I use once in a while, and every time have to re-read the documentation to conclude that I want to use the --wrap-always --short-indent --trailing-comma --sort-binary-package options (or -satb for short). Every time, I also wish that I could automate this and have it always be invoked to keep my debian/ directory tidy, so I don’t have to do this manually once every blue moon. I haven’t found a way to achieve this automation in a non-obtrusive way that interacts well with my git-based packaging workflow. Ideally I would like for something like the lintian-hook during gbp buildpackage to check for this – ideas?

Meanwhile, I have come up with a way to make sure I don’t forget to run wrap-and-sort for long, and that others who work on the same package won’t either: create an autopkgtest which is invoked during the Salsa CI/CD pipeline using the following as debian/tests/wrap-and-sort:

#!/bin/sh

set -eu

TMPDIR=$(mktemp -d)
trap "rm -rf $TMPDIR" 0 INT QUIT ABRT PIPE TERM

cp -a debian $TMPDIR
cd $TMPDIR
wrap-and-sort -satb
diff -ur $OLDPWD/debian debian

Add the following to debian/tests/control to invoke it – which is intentionally not indented properly so that the self-test will fail so you will learn how it behaves.

Tests: wrap-and-sort
Depends: devscripts, python3-debian
Restrictions: superficial

Now I will get build failures in the pipeline once I upload the package into Salsa, which I usually do before uploading into Debian. I will get a diff output, and it won’t be happy until I push a commit with the output of running wrap-and-sort with the parameters I settled with.

While autopkgtest is intended to test the installed package, the tooling around autopkgtest is powerful and easily allows this mild abuse of its purpose for a pleasant QA improvement.

Thoughts? Happy hacking!