Apt archive mirrors in Git-LFS

My effort to improve the transparency of and confidence in public apt archives continues. I started working on this in “Apt Archive Transparency”, in which I mention the debdistget project in passing. Debdistget is responsible for mirroring index files for some public apt archives. I’ve realized that having a publicly auditable and preserved mirror of the apt repositories is central to being able to do apt transparency work, so the debdistget project has become more central to my effort than I thought. Currently I track Trisquel, PureOS, Gnuinos and their upstreams Ubuntu, Debian and Devuan.

Debdistget downloads Release/Packages/Sources files and stores them in a git repository published on GitLab. Due to size constraints, it uses two repositories: one for the Release/InRelease files (which are small) and one that also includes the Packages/Sources files (which are large). See for example the repository for Trisquel release files and the Trisquel packages/sources files. Repositories for all distributions can be found in debdistutils’ archives GitLab sub-group.

The reason for splitting into two repositories was that the git repository for the combined files becomes large, and that some of my use-cases only need the release files. Currently the repositories with packages (which by now contain a couple of months’ worth of data) are 9GB for Ubuntu, 2.5GB for Trisquel/Debian/PureOS, 970MB for Devuan and 450MB for Gnuinos. The repository size is correlated with the size of the archive (for the initial import) plus the frequency and size of updates. Ubuntu’s use of Apt Phased Updates (which triggers a higher churn of Packages file modifications) appears to be the primary reason for its larger size.

Working with large Git repositories is inefficient, and the GitLab CI/CD jobs generate considerable network traffic downloading the git repository over and over again. The heaviest user is the debdistdiff project, which downloads all distribution package repositories to do diff operations on the package lists between distributions. The daily job takes around 80 minutes to run, with the majority of the time spent on downloading the archives. Yes, I know I could look into runner-side caching, but I dislike the complexity caused by caching.

Fortunately not all use-cases require the package files. The debdistcanary project only needs the Release/InRelease files, in order to commit signatures to the Sigstore and Sigsum transparency logs. These jobs still run fairly quickly, but watching the repositories grow worries me. Currently these repositories are at Debian 440MB, PureOS 130MB, Ubuntu/Devuan 90MB, Trisquel 12MB, Gnuinos 2MB. Here I believe the main size correlation is update frequency, and Debian is large because I track the volatile unstable suite.

So I hit a scalability limit with my first approach. A couple of months ago I “solved” this by discarding and resetting these archival repositories. The GitLab CI/CD jobs were fast again and all was well. However, this meant discarding precious historic information. A couple of days ago I was reaching the limits of practicality again, and started to explore ways to fix this. I like having data stored in git (it allows easy integration with software integrity tools such as GnuPG and Sigstore, and the git log provides a kind of temporal ordering of the data), so switching to a traditional on-disk database felt like giving up those nice properties. So I started to learn about Git-LFS, and understanding that it was able to handle multiple GB worth of data looked promising.

Fairly quickly I scripted up a GitLab CI/CD job that incrementally updates the Release/Packages/Sources files in a git repository that uses Git-LFS to store all the files. The repository size is now at Ubuntu 650KB, Debian 300KB, Trisquel 50KB, Devuan 250KB, PureOS 172KB and Gnuinos 17KB. As can be expected, jobs are quick to clone the git archives: debdistdiff pipelines went from a run-time of 80 minutes down to 10 minutes, which correlates more reasonably with the archive size and CPU run-time.
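To sketch the idea, the incremental update step boils down to something like the following (the mirror URL and suite name are placeholders, not debdistget’s actual configuration):

MIRROR=https://archive.example.org/trisquel   # placeholder mirror URL
SUITE=aramo                                   # placeholder suite name
cd mirror-repo
for f in InRelease Release Release.gpg; do
    wget -q -O "dists/$SUITE/$f" "$MIRROR/dists/$SUITE/$f"
done
wget -q -O "dists/$SUITE/main/binary-amd64/Packages.gz" \
    "$MIRROR/dists/$SUITE/main/binary-amd64/Packages.gz"
git add dists
# Only commit (and thus create new LFS objects) when something changed.
git diff --cached --quiet || { git commit -m "Update $SUITE." && git push; }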

The LFS storage size for those repositories is at Ubuntu 15GB, Debian 8GB, Trisquel 1.7GB, Devuan 1.1GB, PureOS/Gnuinos 420MB. This is for a couple of days’ worth of data. It seems native Git is better at compressing/deduplicating data than Git-LFS is: the combined size for Ubuntu is already 15GB for a couple of days of data, compared to 8GB for a couple of months’ worth of data with pure Git. This may be due to a sub-optimal implementation of Git-LFS in GitLab, but it does worry me that this new approach will be difficult to scale too. At some level the difference is understandable: Git-LFS probably stores two different Packages files — around 90MB each for Trisquel — as two 90MB files, whereas native Git stores one compressed version of the 90MB file plus a relatively small delta to turn the old file into the next one. So the Git-LFS approach, surprisingly, scales less well for overall storage size. Still, the original repository is much smaller, and you usually don’t have to pull all LFS files anyway, so it is a net win.
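You can observe this difference locally in a checkout: native Git delta-compresses its packed object store, while the LFS store keeps one complete blob per file version. Something like the following shows both:

git count-objects -vH     # size-pack: Git's delta-compressed object store
du -sh .git/lfs/objects   # one full, non-delta object per tracked file version
git lfs ls-files --size   # the LFS-tracked files and their sizes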

Throughout this work, I kept thinking about how my approach relates to Debian’s snapshot service. Ultimately what I would want is a combination of these two services. To have a good foundation for transparency work I would want a collection of all Release/Packages/Sources files ever published, and ultimately also the source code and binaries. While it makes sense to start on the latest stable releases of distributions, this effort should scale backwards in time as well. For reproducing binaries from source code, I need to be able to securely find earlier versions of binary packages used for rebuilds. So I need to import all the Release/Packages/Sources files from snapshot into my repositories. The latency when retrieving files from that server is high, and I haven’t been able to find an efficient/parallelized way to download the files. If I’m able to finish this, I would have confidence that my new Git-LFS based approach to storing these files will scale over many years to come. This remains to be seen. Perhaps the repository has to be split up per release or per architecture or similar.
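One way to parallelize the retrieval might be along these lines, assuming a pre-computed list of snapshot timestamps in a file (the snapshot.debian.org URL pattern is real, but the timestamp file and concurrency level are made-up examples):

# timestamps.txt holds snapshot timestamps such as 20230101T000000Z, one per line.
while read -r ts; do
    echo "https://snapshot.debian.org/archive/debian/$ts/dists/unstable/Release"
done < timestamps.txt | xargs -P 8 -n 1 wget -q -x
# wget -x recreates the URL's directory structure, so files don't clobber each other.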

Another factor is storage costs. While the git repository size for a Git-LFS based repository with files from several years may be possible to sustain, the Git-LFS storage size surely won’t be. It seems GitLab charges the same for files in repositories and in Git-LFS, around $500 per 100GB per year. It may be possible to set up a separate Git-LFS backend not hosted at GitLab to serve the LFS files. Does anyone know of a suitable server implementation for this? I had a quick look at the Git-LFS implementation list, and it seems the closest reasonable approach would be to set up the Gitea-clone Forgejo as a self-hosted server. Perhaps a cloud storage approach à la S3 is the way to go? The cost to host this on GitLab will be manageable for up to ~1TB ($5000/year), but scaling it to storing, say, 500TB of data would mean a yearly fee of $2.5M, which seems like poor value for the money.
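Mechanically, pointing a repository at an external backend is at least simple: Git-LFS reads its endpoint from an .lfsconfig file committed to the repository, so something like this should work against any server that speaks the LFS batch API (the URL is a placeholder):

git config -f .lfsconfig lfs.url https://lfs.example.org/debian-mirror
git add .lfsconfig
git commit -m "Use external Git-LFS endpoint."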

I realized that ultimately I would want a git repository locally with the entire content of all apt archives, including their binary and source packages, ever published. The storage requirements for a service like snapshot (~300TB of data?) are today not prohibitively expensive: 20TB disks are $500 a piece, so a storage enclosure with 36 disks would be around $18,000 for 720TB, and using RAID1 means 360TB, which is a good start. While I have heard about ~TB-sized Git-LFS repositories, would Git-LFS scale to 1PB? Perhaps the size of a git repository with multiple millions of Git-LFS pointer files will become unmanageable? To get started on this approach, I decided to import a mirror of Debian’s bookworm for amd64 into a Git-LFS repository. That is around 175GB, so reasonably cheap to host even on GitLab ($1000/year for 200GB). Having this repository publicly available will make it possible to write software that uses this approach (e.g., porting debdistreproduce), to find out if this is useful and if it could scale. Distributing the apt repository via Git-LFS would also enable other interesting ideas for protecting the data. Consider configuring apt to use a local file:// URL to this git repository, and verifying the git checkout using some method similar to Guix’s approach to trusting git content or Sigstore’s gitsign.
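As a sketch of that apt idea, something along these lines ought to work, assuming the repository preserves the standard dists/ and pool/ layout (which the debmirror import below does):

git clone https://gitlab.com/debdistutils/archives/debian/mirror.git
cd mirror
git lfs pull
echo "deb [signed-by=/usr/share/keyrings/debian-archive-keyring.gpg] file:$PWD bookworm main" \
    | sudo tee /etc/apt/sources.list.d/git-mirror.list
sudo apt update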

A naive push of the 175GB archive in a single git commit ran into pack size limitations:

remote: fatal: pack exceeds maximum allowed size (4.88 GiB)

However, breaking up the commit into smaller commits for parts of the archive made it possible to push the entire archive. Here are the commands I used to create this repository:

git init
git lfs install
git lfs track 'dists/**' 'pool/**'
git add .gitattributes
git commit -m "Add Git-LFS track attributes." .gitattributes
time debmirror --method=rsync --host ftp.se.debian.org --root :debian --arch=amd64 --source --dist=bookworm,bookworm-updates --section=main --verbose --diff=none --keyring /usr/share/keyrings/debian-archive-keyring.gpg --ignore .git .
git add dists project
git commit -m "Add." -a
git remote add origin git@gitlab.com:debdistutils/archives/debian/mirror.git
git push --set-upstream origin --all
for d in pool/*/*/; do
  echo "$d"
  time git add "$d"
  git commit -m "Add $d." -a
  git push
done

The resulting repository size is around 27MB, with Git-LFS object storage around 174GB. I think this approach would scale to handle all architectures for one release, but working with a single git repository for all releases and all architectures may lead to an overly large git repository (>1GB). So maybe one repository per release? These repositories could also be split up over subsets of the pool/ files, or there could be one repository per release per architecture, or one per release for the sources.

Finally, I have concerns about using SHA1 for identifying objects. It seems both Git and Debian’s snapshot service are currently using SHA1. For Git there is a SHA-256 transition underway, and it seems GitLab is working on support for SHA256-based repositories. For serious long-term deployment of these concepts, it would be nice to go for SHA256 identifiers directly. Git-LFS already uses SHA256, but Git internally uses SHA1, as does the Debian snapshot service.
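Git’s SHA-256 object format can already be tried out locally with git 2.29 or later, although such repositories do not interoperate with SHA1 repositories and forge support is still limited:

git init --object-format=sha256 sha256-repo
cd sha256-repo
git rev-parse --show-object-format    # prints: sha256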

What do you think? Happy Hacking!

Validating debian/copyright: licenserecon

Recently I noticed a new tool called licenserecon, written by Peter Blackman, and I helped get licenserecon into Debian. The purpose of licenserecon is to reconcile the licenses declared in debian/copyright against the output from licensecheck, a tool written by Jonas Smedegaard. It assumes DEP5 copyright files. You run the tool in a directory that has a debian/ sub-directory, and here is its output when it notices mismatches (this is for resolv-wrapper):

# sudo apt install licenserecon
jas@kaka:~/dpkg/resolv-wrapper$ lrc

Parsing Source Tree ....
Running licensecheck ....

d/copyright     | licensecheck

BSD-3-Clauses   | BSD-3-clause     src/resolv_wrapper.c
BSD-3-Clauses   | BSD-3-clause     tests/dns_srv.c
BSD-3-Clauses   | BSD-3-clause     tests/test_dns_fake.c
BSD-3-Clauses   | BSD-3-clause     tests/test_res_query_search.c
BSD-3-Clauses   | BSD-3-clause     tests/torture.c
BSD-3-Clauses   | BSD-3-clause     tests/torture.h

jas@kaka:~/dpkg/resolv-wrapper$ 

Noticing one-character typos like this may not bring satisfaction except to the most obsessive-compulsive among us; however, the tool has the potential to discover more serious mistakes.

Using it manually once in a while may be useful; however, I tend to forget QA steps that are not automated. Could we add this to the Salsa CI/CD pipeline? I recently proposed a merge request to add a wrap-and-sort job to the Salsa CI/CD pipeline (disabled by default) and learned how easy it was to extend it. I think licenserecon is still a bit rough around the edges, and I haven’t been able to successfully use it on any but the simplest packages yet, so I wouldn’t want to suggest it be added to the normal Salsa CI/CD pipeline, even disabled by default. But if you maintain a Debian package on Salsa and wish to add a licenserecon job to your pipeline, I wrote licenserecon.yml for you.

The simplest way to use licenserecon.yml is to replace recipes/debian.yml@salsa-ci-team/pipeline as the Salsa CI/CD configuration file setting with debian/salsa-ci.yml@debian/licenserecon. If you use a debian/salsa-ci.yml file you may put something like this in it instead:

---
include:
  - https://salsa.debian.org/salsa-ci-team/pipeline/raw/master/recipes/debian.yml
  - https://salsa.debian.org/debian/licenserecon/raw/main/debian/licenserecon.yml

Once you trigger the pipeline, this will result in a new licenserecon job that validates debian/copyright against the licensecheck output on every build! I have added this to the libcpucycles package on Salsa, and its pipeline now contains a licenserecon job whose output currently ends with:

$ cd ${WORKING_DIR}/${SOURCE_DIR}
$ lrc
Parsing Source Tree ....
Running licensecheck ....
No differences found
Cleaning up project directory and file based variables

If upstream releases a new version with files not matching our debian/copyright file, we will detect that on the next Salsa build job, rather than months later when somebody happens to run the tool manually or some license conflict arises.

Incidentally, licenserecon is written in Pascal, which brought back old memories of Turbo Pascal from the MS-DOS days. Thanks to Peter for licenserecon, and to Jonas for licensecheck, for making this possible!

Trisquel is 42% Reproducible!

The absolute number may not be impressive, but what I hope is at least a useful contribution is that there now actually is a number for how much of Trisquel is reproducible. Hopefully this will inspire others to help improve the actual metric.

tl;dr: go to reproduce-trisquel.

When I set about understanding how Trisquel worked, I identified a number of things that would improve my confidence in it. The lowest-hanging fruit for me was to manually audit the package archive, and I wrote a tool called debdistdiff to automate this for me. That led me to think about apt archive transparency more generally. I have done some further work in that area (hint: apt-verify) that deserves its own blog post eventually. Most apt archive transparency work is futile if we don’t trust the intended packages that are in the archive. One way to measurably increase trust in the packages is to provide reproducible builds of them, which should by now be an established best practice. Code review is still important, but since it will never provide positive guarantees, we need other processes that can identify sub-optimal situations automatically. The way reproducible builds easily identify negative results is, I believe, what has driven much of their success: the results are tangible and measurable. The field of software engineering is in need of more such practices.

The design of my setup to build Trisquel reproducibly is as follows.

  • The project debdistget is responsible for downloading Release/Packages files (which are the most relevant files from dists/) from apt archives, and works by committing them into GitLab-hosted git repositories. I maintain several such repositories for popular apt archives, including for Trisquel and its upstream Ubuntu. GitLab invokes a scheduled pipeline to do the downloading, and there are some race conditions here.
  • The project debdistdiff is used to produce the list of added and modified packages, which is the input for knowing what packages to reproduce. It publishes a human-readable summary of the differences for several distributions, including Trisquel vs Ubuntu. Early on I decided that rebuilding all of the upstream Ubuntu packages is out of scope for me: my personal trust in the official Debian/Ubuntu apt archives is greater than my trust in the added/modified packages in Trisquel.
  • The final project reproduce-trisquel puts the pieces together, briefly as follows, with everything being driven from its .gitlab-ci.yml file.
    • There is a (manually triggered) job generate-build-image to create a build image to speed up CI/CD runs, using a simple Dockerfile.
    • There is a (manually triggered) job generate-package-lists that uses debdistdiff to generate package lists and stores its output in lists/. The reason this is manually triggered right now is a race condition.
    • There is a (scheduled) job that does two things: from the package lists, the script generate-ci-packages.sh builds a GitLab CI/CD instruction file ci-packages.yml that describes jobs for each package to build. The second part is generate-readme.sh, which re-generates the project’s README.md based on the build logs and diffoscope outputs that are stored in the git repository.
    • Through the ci-packages.yml file, a large number of jobs are dynamically defined; currently these are manually triggered so as not to overload the build servers. The script build-package.sh is invoked and attempts to rebuild a package, and stores the build log and diffoscope output in the git project itself (a sketch of the rebuild idea follows below this list).
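The core of the rebuild step is essentially “build from source and compare against the official binary”. A simplified sketch of the idea (not the actual build-package.sh, which also handles chroots, dependency installation and storing the artifacts in git) could look like this:

pkg="$1"
mkdir -p rebuild official
# Rebuild the package from source; the .deb ends up in rebuild/.
(cd rebuild && apt-get source "$pkg" && cd "$pkg"-*/ && dpkg-buildpackage -us -uc -b) \
    > "$pkg.build.log" 2>&1
# Fetch the official binary package from the archive.
(cd official && apt-get download "$pkg")
# diffoscope exits non-zero when the two .debs differ.
diffoscope --text "$pkg.diffoscope.txt" official/"$pkg"_*.deb rebuild/"$pkg"_*.deb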

I did not expect to be able to use the GitLab shared runners to do the building, but they turned out to work quite well, so I postponed setting up my own runner. There is a manually curated lists/disabled-aramo.txt with some packages that either required too much disk space or took over two hours to build. Today I finally took the time to set up a GitLab runner using podman running Trisquel aramo, and I expect to complete builds of the remaining packages soon — one of my Dell R630 servers with 256GB RAM and dual 2680v4 CPUs should deliver sufficient performance.

Current limitations and ideas on further work (most are filed as project issues) include:

  • We don’t support *.buildinfo files. As far as I am aware, Trisquel does not publish them for their builds. Improving this would be a first step forward — anyone able to help? Compare buildinfo.debian.net. For example, many packages differ only in their NT_GNU_BUILD_ID note inside the ELF binary, see the example diffoscope output for libgpg-error. By poking around in jenkins.trisquel.org I managed to discover that Trisquel built initramfs-tools in the randomized path /build/initramfs-tools-bzRLUp, and hard-coding that path allowed me to reproduce that package (see the sketch after this list). I expect the same to hold for many other packages. Unfortunately, turning this failure into success for that package moved the needle from 42% to 43% reproducibility; however, I didn’t let that stand in the way of a good headline.
  • The mechanism to download the Release/Packages files from dists/ is not fool-proof: we may not capture all such files ever published. While this is less of a concern for reproducibility, it is more of a concern for apt transparency. Still, having Trisquel provide a service similar to snapshot.debian.org would help.
  • Having at least one other CPU architecture would be nice.
  • Due to lack of time and mental focus, handling incremental updates of new versions of packages is not yet working. This means we only ever build one version of a package, and never discover any newly published versions of the same package. Now that Trisquel aramo is released, the expected rate of new versions should be low, but they still happen due to security updates or backports.
  • Porting this to test supposedly FSDG-compliant distributions such as PureOS and Gnuinos should be relatively easy. I’m also looking at Devuan because of Gnuinos.
  • The elephant in the room is how reproducible Ubuntu is in the first place.
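The build-path trick from the first item above amounts to recreating the randomized directory before building, so that the path embedded in the binaries matches the official build. A hedged sketch (assuming /build is writable and deb-src lines are configured):

mkdir -p /build && cd /build
apt-get source initramfs-tools
mv initramfs-tools-*/ initramfs-tools-bzRLUp
cd initramfs-tools-bzRLUp
dpkg-buildpackage -us -uc -b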

Happy Easter Hacking!

Update 2023-04-17: The original project “reproduce-trisquel” that was announced here has been archived and replaced with two projects, one generic “debdistreproduce” and one with results for Trisquel: “reproduce/trisquel”.

On language bindings & Relaunching Guile-GnuTLS

The Guile bindings for GnuTLS have been part of GnuTLS since spring 2007, when Ludovic Courtès contributed them after some initial discussion. I have been looking into getting back to doing GnuTLS coding, and during a recent GnuTLS meeting one topic was the Guile bindings. It seemed like a fairly self-contained project to pick up on. It is interesting to re-read the old thread from when this work was included: some of the concerns brought up there now have a track record to be evaluated against. My opinion is that the cost of introducing a new project per language binding today is smaller than the cost of maintaining language bindings as part of the core project. I believe the cost/benefit ratio has changed during the past 15 years: introducing a new project used to come with a significant cost, but this is no longer the case, as tooling and processes for packaging have improved. I have had similar experiences with the Java, C# and Emacs Lisp bindings for GNU Libidn, where maintaining them centrally slowed down the pace of updates. Andreas Metzler pointed to a similar conclusion reached by Russ Allbery.

There are many ways to separate a project into two; just copying the files into a new git repository would have been the simplest, and was my original plan. However, Ludo’ mentioned git-filter-branch in an email, and the idea of keeping all the git history for some of the relevant files seemed worth pursuing. I quickly found git-filter-repo, which appears to be the recommended approach, and experimenting with it I found a way to filter the GnuTLS repo down into a small git repository that Guile-GnuTLS could be based on. These are the commands I used, if you want to reproduce this.

$ git clone https://gitlab.com/gnutls/gnutls.git guile-gnutls
$ cd guile-gnutls/
$ git checkout f5dcbdb46df52458e3756193c2a23bf558a3ecfd
$ git-filter-repo --path guile/ --path m4/guile.m4 --path doc/gnutls-guile.texi --path doc/extract-guile-c-doc.scm --path doc/cha-copying.texi --path doc/fdl-1.3.texi

I debated with myself back and forth whether to include some files that would be named the same in the new repository but would share little to no similar lines, for example configure.ac and Makefile.am, not to mention README and NEWS. Initially I thought it would be nice to preserve the history of all lines that went into the new project, but this is a subjective judgement call. What brought me over to a more minimal approach was that the contributor history and attribution would be quite strange for the new repository: should Guile-GnuTLS attribute the work of the thousands of commits to configure.ac that had nothing to do with Guile? Should the people who wrote that be mentioned as contributors to Guile-GnuTLS? I think not.

The next step was to get a reasonable GitLab CI/CD pipeline up, to make sure the project builds on some free GNU/Linux distributions like Trisquel and PureOS, as well as the usual non-free distributions like Debian and Fedora, to have coverage of both dpkg-based and rpm-based distributions. I included builds on Alpine and Arch Linux as well, because they tend to trigger other portability issues. I wish there were GNU Guix docker images available for easy testing on that platform too. The GitLab CI/CD rules for a project like this are fairly simple.
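Each per-distribution job essentially installs the build dependencies in the distribution’s container image and runs the usual autotools dance. Roughly like this for a dpkg-based job (the package names are my guesses and may not match the actual pipeline):

apt-get update
apt-get install -y --no-install-recommends make gcc pkg-config texinfo \
    guile-3.0-dev libgnutls28-dev
./configure
make check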

To get things out the door, I tagged the result as v3.7.9 and published a GitLab release page for Guile-GnuTLS that includes OpenPGP-signed source tarballs built on my laptop and uploaded manually. The URLs for these tarballs are not very pleasant to work with, and discovering new releases automatically appears unreliable, but I don’t know of a better approach.

To finish this project, I have proposed a GnuTLS merge request to remove all Guile-related parts from the GnuTLS core.

Doing some GnuTLS-related work again felt nice; it was quite some time ago, so thank you for giving me this opportunity. Thoughts or comments? Happy hacking!