Towards reproducible minimal source code tarballs? On *-src.tar.gz

While the work to analyze the xz backdoor is in progress, several ideas have been suggested to improve the software supply chain ecosystem. Some of those ideas are good, some are at best irrelevant and harmless, and some are plain bad. I’d like to attempt to formalize two ideas that have been discussed before, although the context in which they can be appreciated has not been as clear as it is today.

  1. Reproducible tarballs. The idea is that published source tarballs should be possible to reproduce independently somehow, and that this should be continuously tested and verified, preferably as part of the upstream project’s continuous integration system (e.g., a GitHub action or GitLab pipeline). While nominally this looks easy to achieve, there are some complex matters to settle, for example: what timestamps to use for files in the tarball? I’ve brought up this aspect before.
  2. Minimal source tarballs without generated vendor files. Most GNU Autoconf/Automake-based tarballs include pre-generated files, which are important for bootstrapping on exotic systems that do not have the required dependencies. For the bootstrapping story to succeed, this approach is important to support. However, it has become clear that this practice carries significant costs and risks. Most modern GNU/Linux distributions have all the required dependencies and actually prefer to re-build everything from source code. These pre-generated extra files introduce uncertainty into that process.
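
To make the timestamp question in the first idea concrete, here is a minimal Python sketch of how a tarball could be written so that its bytes depend only on file names and contents. It is an illustration under stated assumptions, not a proposed tool: the function name make_reproducible_tarball and its parameters are hypothetical, and the choice of a single fixed mtime for every file is just one possible answer to the timestamp question. Note also that the compressed bytes can still differ between zlib implementations, so the uncompressed tar stream is the more robust thing to compare.

```python
import gzip
import os
import tarfile


def make_reproducible_tarball(src_dir, out_path, prefix, mtime):
    """Write the files under src_dir to out_path as a normalized .tar.gz.

    Hypothetical helper: 'prefix' is the top-level directory name inside
    the archive and 'mtime' is the single timestamp given to every member.
    """

    def normalize(info):
        # Drop metadata that varies between machines or build runs.
        info.mtime = mtime
        info.uid = info.gid = 0
        info.uname = info.gname = ""
        # Keep only the executable/non-executable distinction.
        if info.isdir() or (info.mode & 0o100):
            info.mode = 0o755
        else:
            info.mode = 0o644
        return info

    # Collect regular entries and sort them so the member order is stable.
    # Directories are not stored as separate members in this sketch.
    paths = []
    for root, dirs, files in os.walk(src_dir):
        dirs.sort()
        for name in files:
            paths.append(os.path.join(root, name))
    paths.sort()

    with open(out_path, "wb") as raw:
        # An empty filename and mtime=0 keep the gzip header constant.
        with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=0) as gz:
            with tarfile.open(fileobj=gz, mode="w",
                              format=tarfile.GNU_FORMAT) as tar:
                for path in paths:
                    rel = os.path.relpath(path, src_dir).replace(os.sep, "/")
                    tar.add(path, arcname=prefix + "/" + rel,
                            recursive=False, filter=normalize)
```

A third party running the same function on a checkout of the same revision, with the same prefix and mtime, should get the same tar stream. One candidate for mtime is the commit time of the tagged revision, but that is an assumption on my part rather than something settled here.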

My strawman proposal to improve things is to define a new tarball format *-src.tar.gz with at least the following properties:

  1. The tarball should allow users to build the project, which is the entire purpose of all this. This means that at least all source code for the project has to be included.
  2. The tarballs should be signed, for example with PGP or minisign.
  3. The tarball should be possible to reproduce bit-by-bit by a third party using upstream’s version controlled sources and a pointer to which revision was used (e.g., git tag or git commit).
  4. The tarball should not require an Internet connection to download things.
    • Corollary: every external dependency either has to be explicitly documented as such (e.g., gcc and GnuTLS), or included in the tarball.
    • Observation: This means including all *.po gettext translations which are normally downloaded when building from version controlled sources.
  5. The tarball should contain everything required to build the project from source using as much externally released versioned tooling as possible. This is the “minimal” property lacking today.
    • Corollary: This means including a vendored copy of OpenSSL or libz is not acceptable: link to them as external projects.
    • Open question: How about non-released external tooling such as gnulib or autoconf archive macros? This is a bit more delicate: most distributions package just one current version of gnulib or autoconf archive, not previous versions. This could change: distributions could package the gnulib git repository (up to some current version) and the autoconf archive git repository, and packages could be set up to extract the version they need (gnulib’s ./bootstrap already supports this via the --gnulib-refdir parameter). However, this is not normally in place.
    • Suggested Corollary: The tarball should contain content from git submodules such as gnulib and the necessary Autoconf archive M4 macros required by the project.
  6. Similar to how the GNU project specifies the ./configure interface, we need a documented interface for how to bootstrap the project. I suggest using the already well-established idiom of running ./bootstrap to set up the package so that it can later be built via ./configure. Of course, some projects do not use the autotools ./configure interface and will not follow this aspect either, but just as most build systems that compete with autotools have instructions on how to build the project, they should document similar interfaces for bootstrapping the source tarball to allow building.
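
As a rough illustration of the “minimal” property, the following Python sketch lists members of a tarball that look like pre-generated Autotools output. The function name find_generated_files and the set of file names are my own assumptions for the example; the set is deliberately short and not exhaustive, and a real checker would need a per-project policy.

```python
import posixpath
import tarfile

# Illustrative, non-exhaustive set of files that autoconf, automake,
# aclocal, autoheader and libtoolize typically generate. Assumption:
# a minimal *-src.tar.gz would not ship any of these.
GENERATED_BASENAMES = {
    "configure",
    "Makefile.in",
    "aclocal.m4",
    "config.h.in",
    "ltmain.sh",
}


def find_generated_files(tar_path):
    """Return the sorted member names that look pre-generated."""
    with tarfile.open(tar_path, "r:*") as tar:
        return sorted(
            member.name
            for member in tar.getmembers()
            if member.isfile()
            and posixpath.basename(member.name) in GENERATED_BASENAMES
        )
```

An empty result does not prove the tarball is minimal; it only says that none of the listed names are present.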

If tarballs that achieve the above goals were available from popular upstream projects, distributions could more easily use them instead of current tarballs that include pre-generated content. The advantage would be that the build process is not tainted by “unnecessary” files. We need to develop tools for maintainers to create these tarballs, similar to make dist, which generates today’s foo-1.2.3.tar.gz files.

I think one common argument against this approach will be: why bother with all that instead of just using git-archive output? Or why not avoid the entire tarball approach and move directly towards version-controlled checkouts, referring to upstream releases by git URL and commit tag or id? One problem with this is that SHA-1 is broken, so placing trust in a SHA-1 identifier is simply not secure. Another counter-argument is that this optimizes for packagers’ benefit at the cost of upstream maintainers: most upstream maintainers do not want to store gettext *.po translations in their source code repository. A compromise between the needs of maintainers and packagers is useful, so this *-src.tar.gz tarball approach is the indirection we need to solve that. Update: In my experiment with source-only tarballs for Libntlm I actually did use git-archive output.
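
If a tarball is reproducible, a packager does not have to rely on a git SHA-1 identifier to pin it: the tarball bytes themselves can be pinned with a stronger digest. A minimal Python sketch, assuming the digest is SHA-256 and is published as a hex string next to the release (the helper names sha256_of_file and matches_published are hypothetical, and this is not a replacement for checking the signature):

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 16):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare a local tarball against a published hex digest."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

Someone who rebuilds the tarball independently from the tagged revision could run the same comparison on their own output to confirm that it matches what upstream published.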

What do you think?

9 Replies to “Towards reproducible minimal source code tarballs? On *-src.tar.gz”

  1. Ever since I encountered a crash (can’t remember if it was a security hole) introduced by way of translations with mismatched format strings, even though there are some defences against that kind of thing, I’ve considered it best practice as a maintainer to at least minimally review .po files and commit them rather than just pulling them in automatically.

    It’s certainly more work. I would say I find errors in maybe 10% of submissions – higher than you might expect, but it’s often possible to notice likely structural errors in diffs of translations of technical material even if you don’t read the target language.

    • Indeed — I used to store all *.po files in git for this reason, and to update them manually reviewing changes. I am also concerned that a maliciously crafted *.po file may lead to a compromise of a developer machine — did anyone analyse that? Translations are downloaded without any hash or signature verification, so there is good opportunity to plant backdoors there, and maybe even to trigger them.

      Since then I’ve started to use gnulib’s ./bootstrap instead, and I’ve given up on this practice. Maybe this should be reconsidered.

      /Simon

  2. will c/c++ finally get good library management with both download / offline capabilities? maybe even as part of standard?

    • I doubt that this will happen, since there is so little least-common-denominator on how to achieve it. GNU autotools is one way, cmake another, homegrown *.sh build scripts another, and I’m sure there are many more widely used approaches. This flexibility is probably one of the strengths with C but, as usual, also a weakness.

      /Simon

  3. I had never heard complaints about storing .po files upstream, and the suggestion seems rather odd, if the alternative is to download random stuff from internet at release time. Translations should be part of the upstream repository. Like Colin, I also tend to review, find and fix issues in .po files submitted by translators (as usual bug reports).

    I’ve considered in the past, though, removing the .pot files from the repository, because of my policy on not storing any autogenerated files there, but it’s pretty much the only exception I’ve made to that rule, given that it means translators do not need any local tools to be able to work on translations: they just download the .po and/or .pot files and are set to go.

    I’m also not a fan of the gnulib and autoconf-archive embedding workflows. The former should IMO be switched to a shared library, which of course would incur significantly more overhead for gnulib upstream. Personally, for anything I’d need from either, I just tend to implement it myself from scratch if necessary to avoid those workflows.

    For the archive, I’ve also been pondering whether to generate either pristine tarballs, or both pristine and portable ones. If I’d go with the second, I think I’d default to naming the pristine one as the name-version.tar.xz, and the portable one as something else, as I assume that would be the common case I’d want people to use, while people that are on systems that do not have the needed tooling could download the portable fully autoreconf’ed one. But I’ve not decided on this yet.

    What I’ve now also prepared for dpkg, for example, is to also track the commit id in the tarball via a new .dist-vcs-id, to add to the traceability, because before that the commit id was only present in the package version iff the current commit was not part of a signed tag.

  4. Here are the problems I can personally think of with your “minimal source tarball” suggestion:
    1. For package maintainers, they may not want to make two tarballs with a release of the software. That’s an increased load for the release process, and provides benefits to few people. I know there are builders like you, including distros, who want to re-generate all pregenerated build files. How about just downloading the release tarball and stripping the files yourself? For safety you can distrust the pregenerated makefile and run “autoreconf && ./configure && make maintainer-clean”.
       There is a potential for simplifying the process for autotools-based packages.
    2. Gnulib. Contrary to what people think, there would be no shared library for Gnulib. Gnulib is a set of support routines that wrap around an existing libc implementation and are intended to be linked statically. Which routines from Gnulib are needed is determined by each package. Perhaps the best we can do is implement an update mechanism so that if a distro provides Gnulib then “./configure” can replace (or update) the Gnulib module from source with a module from the distro.

  5. Reproducible source tarballs are a no-brainer to me. I’ve been distributing reproducible source tarballs for years; GNU Mes, Dezyne and Gash do so, just to name a few.

    Sure, Autotools and Gettext sadly still get (seriously) in the way of this, but they’re not the only ones; the “let’s add a timestamp” fetishism that has spread like a virus also really doesn’t help.

    Just yesterday I submitted “[PATCH v2 00/12] Reproducible `make dist’ tarball in defiance of Autotools and Gettext” to GNU Guix to make their source tarball reproducible.

    See https://issues.guix.gnu.org/70169/#21

  6. Pingback: Reproducible and minimal source-only tarballs – Simon Josefsson's blog
