Apt archive mirrors in Git-LFS

My effort to improve transparency and confidence of public apt archives continues. I started to work on this in “Apt Archive Transparency” in which I mention the debdistget project in passing. Debdistget is responsible for mirroring index files for some public apt archives. I’ve realized that having a publicly auditable and preserved mirror of the apt repositories is central to being able to do apt transparency work, so the debdistget project has become more central to my project than I thought. Currently I track Trisquel, PureOS, Gnuinos and their upstreams Ubuntu, Debian and Devuan.

Debdistget download Release/Package/Sources files and store them in a git repository published on GitLab. Due to size constraints, it uses two repositories: one for the Release/InRelease files (which are small) and one that also include the Package/Sources files (which are large). See for example the repository for Trisquel release files and the Trisquel package/sources files. Repositories for all distributions can be found in debdistutils’ archives GitLab sub-group.

The reason for splitting into two repositories was that the git repository for the combined files become large, and that some of my use-cases only needed the release files. Currently the repositories with packages (which contain a couple of months worth of data now) are 9GB for Ubuntu, 2.5GB for Trisquel/Debian/PureOS, 970MB for Devuan and 450MB for Gnuinos. The repository size is correlated to the size of the archive (for the initial import) plus the frequency and size of updates. Ubuntu’s use of Apt Phased Updates (which triggers a higher churn of Packages file modifications) appears to be the primary reason for its larger size.

Working with large Git repositories is inefficient and the GitLab CI/CD jobs generate quite some network traffic downloading the git repository over and over again. The most heavy user is the debdistdiff project that download all distribution package repositories to do diff operations on the package lists between distributions. The daily job takes around 80 minutes to run, with the majority of time is spent on downloading the archives. Yes I know I could look into runner-side caching but I dislike complexity caused by caching.

Fortunately not all use-cases requires the package files. The debdistcanary project only needs the Release/InRelease files, in order to commit signatures to the Sigstore and Sigsum transparency logs. These jobs still run fairly quickly, but watching the repository size growth worries me. Currently these repositories are at Debian 440MB, PureOS 130MB, Ubuntu/Devuan 90MB, Trisquel 12MB, Gnuinos 2MB. Here I believe the main size correlation is update frequency, and Debian is large because I track the volatile unstable.

So I hit a scalability end with my first approach. A couple of months ago I “solved” this by discarding and resetting these archival repositories. The GitLab CI/CD jobs were fast again and all was well. However this meant discarding precious historic information. A couple of days ago I was reaching the limits of practicality again, and started to explore ways to fix this. I like having data stored in git (it allows easy integration with software integrity tools such as GnuPG and Sigstore, and the git log provides a kind of temporal ordering of data), so it felt like giving up on nice properties to use a traditional database with on-disk approach. So I started to learn about Git-LFS and understanding that it was able to handle multi-GB worth of data that looked promising.

Fairly quickly I scripted up a GitLab CI/CD job that incrementally update the Release/Package/Sources files in a git repository that uses Git-LFS to store all the files. The repository size is now at Ubuntu 650kb, Debian 300kb, Trisquel 50kb, Devuan 250kb, PureOS 172kb and Gnuinos 17kb. As can be expected, jobs are quick to clone the git archives: debdistdiff pipelines went from a run-time of 80 minutes down to 10 minutes which more reasonable correlate with the archive size and CPU run-time.

The LFS storage size for those repositories are at Ubuntu 15GB, Debian 8GB, Trisquel 1.7GB, Devuan 1.1GB, PureOS/Gnuinos 420MB. This is for a couple of days worth of data. It seems native Git is better at compressing/deduplicating data than Git-LFS is: the combined size for Ubuntu is already 15GB for a couple of days data compared to 8GB for a couple of months worth of data with pure Git. This may be a sub-optimal implementation of Git-LFS in GitLab but it does worry me that this new approach will be difficult to scale too. At some level the difference is understandable, Git-LFS probably store two different Packages files — around 90MB each for Trisquel — as two 90MB files, but native Git would store it as one compressed version of the 90MB file and one relatively small patch to turn the old files into the next file. So the Git-LFS approach surprisingly scale less well for overall storage-size. Still, the original repository is much smaller, and you usually don’t have to pull all LFS files anyway. So it is net win.

Throughout this work, I kept thinking about how my approach relates to Debian’s snapshot service. Ultimately what I would want is a combination of these two services. To have a good foundation to do transparency work I would want to have a collection of all Release/Packages/Sources files ever published, and ultimately also the source code and binaries. While it makes sense to start on the latest stable releases of distributions, this effort should scale backwards in time as well. For reproducing binaries from source code, I need to be able to securely find earlier versions of binary packages used for rebuilds. So I need to import all the Release/Packages/Sources packages from snapshot into my repositories. The latency to retrieve files from that server is slow so I haven’t been able to find an efficient/parallelized way to download the files. If I’m able to finish this, I would have confidence that my new Git-LFS based approach to store these files will scale over many years to come. This remains to be seen. Perhaps the repository has to be split up per release or per architecture or similar.

Another factor is storage costs. While the git repository size for a Git-LFS based repository with files from several years may be possible to sustain, the Git-LFS storage size surely won’t be. It seems GitLab charges the same for files in repositories and in Git-LFS, and it is around $500 per 100GB per year. It may be possible to setup a separate Git-LFS backend not hosted at GitLab to serve the LFS files. Does anyone know of a suitable server implementation for this? I had a quick look at the Git-LFS implementation list and it seems the closest reasonable approach would be to setup the Gitea-clone Forgejo as a self-hosted server. Perhaps a cloud storage approach a’la S3 is the way to go? The cost to host this on GitLab will be manageable for up to ~1TB ($5000/year) but scaling it to storing say 500TB of data would mean an yearly fee of $2.5M which seems like poor value for the money.

I realized that ultimately I would want a git repository locally with the entire content of all apt archives, including their binary and source packages, ever published. The storage requirements for a service like snapshot (~300TB of data?) is today not prohibitly expensive: 20TB disks are $500 a piece, so a storage enclosure with 36 disks would be around $18.000 for 720TB and using RAID1 means 360TB which is a good start. While I have heard about ~TB-sized Git-LFS repositories, would Git-LFS scale to 1PB? Perhaps the size of a git repository with multi-millions number of Git-LFS pointer files will become unmanageable? To get started on this approach, I decided to import a mirror of Debian’s bookworm for amd64 into a Git-LFS repository. That is around 175GB so reasonable cheap to host even on GitLab ($1000/year for 200GB). Having this repository publicly available will make it possible to write software that uses this approach (e.g., porting debdistreproduce), to find out if this is useful and if it could scale. Distributing the apt repository via Git-LFS would also enable other interesting ideas to protecting the data. Consider configuring apt to use a local file:// URL to this git repository, and verifying the git checkout using some method similar to Guix’s approach to trusting git content or Sigstore’s gitsign.

A naive push of the 175GB archive in a single git commit ran into pack size limitations:

remote: fatal: pack exceeds maximum allowed size (4.88 GiB)

however breaking up the commit into smaller commits for parts of the archive made it possible to push the entire archive. Here are the commands to create this repository:

git init
git lfs install
git lfs track 'dists/**' 'pool/**'
git add .gitattributes
git commit -m"Add Git-LFS track attributes." .gitattributes
time debmirror --method=rsync --host ftp.se.debian.org --root :debian --arch=amd64 --source --dist=bookworm,bookworm-updates --section=main --verbose --diff=none --keyring /usr/share/keyrings/debian-archive-keyring.gpg --ignore .git .
git add dists project
git commit -m"Add." -a
git remote add origin git@gitlab.com:debdistutils/archives/debian/mirror.git
git push --set-upstream origin --all
for d in pool//; do
echo $d;
time git add $d;
git commit -m"Add $d." -a
git push
done

The resulting repository size is around 27MB with Git LFS object storage around 174GB. I think this approach would scale to handle all architectures for one release, but working with a single git repository for all releases for all architectures may lead to a too large git repository (>1GB). So maybe one repository per release? These repositories could also be split up on a subset of pool/ files, or there could be one repository per release per architecture or sources.

Finally, I have concerns about using SHA1 for identifying objects. It seems both Git and Debian’s snapshot service is currently using SHA1. For Git there is SHA-256 transition and it seems GitLab is working on support for SHA256-based repositories. For serious long-term deployment of these concepts, it would be nice to go for SHA256 identifiers directly. Git-LFS already uses SHA256 but Git internally uses SHA1 as does the Debian snapshot service.

What do you think? Happy Hacking!

On language bindings & Relaunching Guile-GnuTLS

The Guile bindings for GnuTLS has been part of GnuTLS since spring 2007 when Ludovic Court├Ęs contributed it after some initial discussion. I have been looking into getting back to do GnuTLS coding, and during a recent GnuTLS meeting one topic was Guile bindings. It seemed like a fairly self-contained project to pick up on. It is interesting to re-read the old thread when this work was included: some of the concerns brought up there now have track record to be evaluated on. My opinion that the cost of introducing a new project per language binding today is smaller than the cost of maintaining language bindings as part of the core project. I believe the cost/benefit ratio has changed during the past 15 years: introducing a new project used to come with a significant cost but this is no longer the case, as tooling and processes for packaging have improved. I have had similar experience with Java, C# and Emacs Lisp bindings for GNU Libidn as well, where maintaining them centralized slow down the pace of updates. Andreas Metzler pointed to a similar conclusion reached by Russ Allbery.

There are many ways to separate a project into two projects; just copying the files into a new git repository would have been the simplest and was my original plan. However Ludo’ mentioned git-filter-branch in an email, and the idea of keeping all git history for some of the relevant files seemed worth pursuing to me. I quickly found git-filter-repo which appears to be the recommend approach, and experimenting with it I found a way to filter out the GnuTLS repo into a small git repository that Guile-GnuTLS could be based on. The commands I used were the following, if you want to reproduce things.

$ git clone https://gitlab.com/gnutls/gnutls.git guile-gnutls
$ cd guile-gnutls/
$ git checkout f5dcbdb46df52458e3756193c2a23bf558a3ecfd
$ git-filter-repo --path guile/ --path m4/guile.m4 --path doc/gnutls-guile.texi --path doc/extract-guile-c-doc.scm --path doc/cha-copying.texi --path doc/fdl-1.3.texi

I debated with myself back and forth whether to include some files that would be named the same in the new repository but would share little to no similar lines, for example configure.ac, Makefile.am not to mention README and NEWS. Initially I thought it would be nice to preserve the history for all lines that went into the new project, but this is a subjective judgement call. What brought me over to a more minimal approach was that the contributor history and attribution would be quite strange for the new repository: Should Guile-GnuTLS attribute the work of the thousands of commits to configure.ac which had nothing to do with Guile? Should the people who wrote that be mentioned as contributor of Guile-GnuTLS? I think not.

The next step was to get a reasonable GitLab CI/CD pipeline up, to make sure the project builds on some free GNU/Linux distributions like Trisquel and PureOS as well as the usual non-free distributions like Debian and Fedora to have coverage of dpkg and rpm based distributions. I included builds on Alpine and ArchLinux as well, because they tend to trigger other portability issues. I wish there were GNU Guix docker images available for easy testing on that platform as well. The GitLab CI/CD rules for a project like this are fairly simple.

To get things out of the door, I tagged the result as v3.7.9 and published a GitLab release page for Guile-GnuTLS that includes OpenPGP-signed source tarballs manually uploaded built on my laptop. The URLs for these tarballs are not very pleasant to work with, and discovering new releases automatically appears unreliable, but I don’t know of a better approach.

To finish this project, I have proposed a GnuTLS merge request to remove all Guile-related parts from the GnuTLS core.

Doing some GnuTLS-related work again felt nice, it was quite some time ago so thank you for giving me this opportunity. Thoughts or comments? Happy hacking!

Cosmos – A Simple Configuration Management System

Back in early 2012 I had been helping with system administration of a number of Debian/Ubuntu-based machines, and the odd Solaris machine, for a couple of years at $DAYJOB. We had a combination of hand-written scripts, documentation notes that we cut’n’paste’d from during installation, and some locally maintained Debian packages for pulling in dependencies and providing some configuration files. As the number of people and machines involved grew, I realized that I wasn’t happy with how these machines were being administrated. If one of these machines would disappear in flames, it would take time (and more importantly, non-trivial manual labor) to get its services up and running again. I wanted a system that could automate the complete configuration of any Unix-like machine. It should require minimal human interaction. I wanted the configuration files to be version controlled. I wanted good security properties. I did not want to rely on a centralized server that would be a single point of failure. It had to be portable and be easy to get to work on new (and very old) platforms. It should be easy to modify a configuration file and get it deployed. I wanted it to be easy to start to use on an existing server. I wanted it to allow for incremental adoption. Surely this must exist, I thought.

During January 2012 I evaluated the existing configuration management systems around, like CFEngine, Chef, and Puppet. I don’t recall my reasons for rejecting each individual project, but needless to say I did not find what I was looking for. The reasons for rejecting the projects I looked at ranged from centralization concerns (single-point-of-failure central servers), bad security (no OpenPGP signing integration), to the feeling that the projects were too complex and hence fragile. I’m sure there were other reasons too.

In February I started going back to my original needs and tried to see if I could abstract something from the knowledge that was in all these notes, script snippets and local dpkg packages. I realized that the essence of what I wanted was one shell script per machine, OpenPGP signed, in a Git repository. I could check out that Git repository on every new machine that I wanted to configure, verify the OpenPGP signature of the shell script, and invoke the script. The script would do everything needed to get the machine up into an operational stage again, including package installation and configuration file changes. Since I would usually want to modify configuration files on a system even after its initial installation (hey not everyone is perfect), it was natural to extend this idea to a cron job that did ‘git pull’, verified the OpenPGP signature, and ran the script. The script would then have to be a bit more clever and not redo everything every time.

Since we had many machines, it was obvious that there would be huge code duplication between scripts. It felt natural to think of splitting up the shell script into a directory with many smaller shell scripts, and invoke each shell script in turn. Think of the /etc/init.d/ hierarchy and how it worked with System V initd. This would allow re-use of useful snippets across several machines. The next realization was that large parts of the shell script would be to create configuration files, such as /etc/network/interfaces. It would be easier to modify the content of those files if they were stored as files in a separate directory, an “overlay” stored in a sub-directory overlay/, and copied into the file system’s hierarchy with rsync. The final realization was that it made some sense to run one set of scripts before rsync’ing in the configuration files (to be able to install packages or set things up for the configuration files to make sense), and one set of scripts after the rsync (to perform tasks that require some package to be installed and configured). These set of scripts were called the “pre-tasks” and “post-tasks” respectively, and stored in sub-directories called pre-tasks.d/ and post-tasks.d/.

I started putting what would become Cosmos together during February 2012. Incidentally, I had been using etckeeper on our machines, and I had been reading its source code, and it greatly inspired the internal design of Cosmos. The git history shows well how the ideas evolved — even that Cosmos was initially called Eve but in retrospect I didn’t like the religious connotations — and there were a couple of rewrites on the way, but on the 28th of February I pushed out version 1.0. It was in total 778 lines of code, with at least 200 of those lines being the license boiler plate at the top of each file. Version 1.0 had a debian/ directory and I built the dpkg file and started to deploy on it some machines. There were a couple of small fixes in the next few days, but development stopped on March 5th 2012. We started to use Cosmos, and converted more and more machines to it, and I quickly also converted all of my home servers to use it. And even my laptops. It took until September 2014 to discover the first bug (the fix is a one-liner). Since then there haven’t been any real changes to the source code. It is in daily use today.

The README that comes with Cosmos gives a more hands-on approach on using it, which I hope will serve as a starting point if the above introduction sparked some interest. I hope to cover more about how to use Cosmos in a later blog post. Since Cosmos does so little on its own, to make sense of how to use it, you want to see a Git repository with machine models. If you want to see how the Git repository for my own machines looks you can see the sjd-cosmos repository. Don’t miss its README at the bottom. In particular, its global/ sub-directory contains some of the foundation, such as OpenPGP key trust handling.

Redmine on Debian Lenny Using Lighttpd

The GnuTLS trac installation is in a poor shape. To fix that, I looked into alternatives and found Redmine. Redmine appears to do most things that I liked in Trac (wiki, roadmap and issue tracking) plus it supports more than one project (would come in handy for my other projects) and has built-in git support. I would like to see better spam handling and OpenID support, but it is good enough for our purposes now, and there are similar concerns with trac.

However, getting it up and running with lighttpd on a modern debian lenny installation was not trivial, and I needed some help from #redmine (thanks stbuehler). After finally getting it up and running, I made a copy of the machine using rsync and rsnapshot, so I could re-create a working configuration if I get stuck, and then re-installed the virtual machine.

The notes below are the steps required to set up Redmine using Lighttpd and MySQL on a Debian Lenny. I’m posting this to help others searching for the error messages I got, and to help my own memory in case I need to re-install the server sometime.
Continue reading Redmine on Debian Lenny Using Lighttpd