Apt archive mirrors in Git-LFS

My effort to improve the transparency of, and confidence in, public apt archives continues. I started to work on this in “Apt Archive Transparency”, in which I mention the debdistget project in passing. Debdistget is responsible for mirroring index files for some public apt archives. I’ve realized that having a publicly auditable and preserved mirror of the apt repositories is central to being able to do apt transparency work, so the debdistget project has become more central to my project than I thought. Currently I track Trisquel, PureOS, Gnuinos and their upstreams Ubuntu, Debian and Devuan.

Debdistget downloads Release/Packages/Sources files and stores them in a git repository published on GitLab. Due to size constraints, it uses two repositories: one for the Release/InRelease files (which are small) and one that also includes the Packages/Sources files (which are large). See for example the repository for Trisquel release files and the Trisquel package/sources files. Repositories for all distributions can be found in debdistutils’ archives GitLab sub-group.
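
A minimal sketch of what such a mirroring job boils down to (the real debdistget scripts handle all suites and verify what they download; the URL and paths here are only illustrative):

mkdir -p dists/aramo
for f in InRelease Release Release.gpg; do
    wget -nv -O dists/aramo/$f https://archive.trisquel.info/trisquel/dists/aramo/$f
done
git add dists
# commit only when something actually changed
git diff-index --quiet HEAD || git commit -m "Update $(date --utc --iso-8601=seconds)."
git push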

The reason for splitting into two repositories was that the git repository for the combined files became large, and that some of my use-cases only needed the release files. Currently the repositories with packages (which contain a couple of months’ worth of data now) are 9GB for Ubuntu, 2.5GB for Trisquel/Debian/PureOS, 970MB for Devuan and 450MB for Gnuinos. The repository size correlates with the size of the archive (for the initial import) plus the frequency and size of updates. Ubuntu’s use of Apt Phased Updates (which triggers a higher churn of Packages file modifications) appears to be the primary reason for its larger size.

Working with large Git repositories is inefficient, and the GitLab CI/CD jobs generate quite some network traffic downloading the git repository over and over again. The heaviest user is the debdistdiff project, which downloads all distribution package repositories to do diff operations on the package lists between distributions. The daily job takes around 80 minutes to run, with the majority of the time spent on downloading the archives. Yes, I know I could look into runner-side caching, but I dislike the complexity caused by caching.
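
A less complex mitigation than runner-side caching might be shallow clones, which GitLab CI/CD exposes through the GIT_DEPTH variable; a sketch of the equivalent manual command (the repository URL is illustrative):

git clone --depth 1 https://gitlab.com/debdistutils/archives/trisquel/mirror.git
# a shallow clone skips the history but still transfers all blobs reachable
# from the tip commit, so it helps less when the current files are large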

Fortunately not all use-cases require the package files. The debdistcanary project only needs the Release/InRelease files, in order to commit signatures to the Sigstore and Sigsum transparency logs. These jobs still run fairly quickly, but watching the repository size grow worries me. Currently these repositories are at Debian 440MB, PureOS 130MB, Ubuntu/Devuan 90MB, Trisquel 12MB, Gnuinos 2MB. Here I believe the main size correlation is update frequency, and Debian is large because I track the volatile unstable suite.

So I hit a scalability wall with my first approach. A couple of months ago I “solved” this by discarding and resetting these archival repositories. The GitLab CI/CD jobs were fast again and all was well. However, this meant discarding precious historic information. A couple of days ago I was reaching the limits of practicality again, and started to explore ways to fix this. I like having data stored in git (it allows easy integration with software integrity tools such as GnuPG and Sigstore, and the git log provides a kind of temporal ordering of the data), so switching to a traditional on-disk database felt like giving up those nice properties. So I started to learn about Git-LFS, and understanding that it is designed to handle multi-GB worth of data looked promising.

Fairly quickly I scripted up a GitLab CI/CD job that incrementally updates the Release/Packages/Sources files in a git repository that uses Git-LFS to store all the files. The repository size is now at Ubuntu 650KB, Debian 300KB, Trisquel 50KB, Devuan 250KB, PureOS 172KB and Gnuinos 17KB. As can be expected, jobs are quick to clone the git archives: debdistdiff pipelines went from a run-time of 80 minutes down to 10 minutes, which correlates more reasonably with the archive size and CPU run-time.
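
The job itself is conceptually simple. A sketch of the steps involved, assuming the Git-LFS tracking rules shown later in this post are already committed (paths and URLs are illustrative):

git clone git@gitlab.com:debdistutils/archives/trisquel/mirror.git
cd mirror
wget -nv -O dists/aramo/main/binary-amd64/Packages.gz \
  https://archive.trisquel.info/trisquel/dists/aramo/main/binary-amd64/Packages.gz
git add dists
# since .gitattributes marks dists/** as LFS-tracked, git stores a small
# pointer file in the repository and uploads the blob to LFS storage on push
git commit -m "Update Packages." && git push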

The LFS storage size for those repositories is at Ubuntu 15GB, Debian 8GB, Trisquel 1.7GB, Devuan 1.1GB, PureOS/Gnuinos 420MB. This is for a couple of days’ worth of data. It seems native Git is better at compressing/deduplicating data than Git-LFS is: the combined size for Ubuntu is already 15GB for a couple of days of data, compared to 8GB for a couple of months’ worth of data with pure Git. This may be due to a sub-optimal implementation of Git-LFS in GitLab, but it does worry me that this new approach will be difficult to scale as well. At some level the difference is understandable: Git-LFS probably stores two different Packages files (around 90MB each for Trisquel) as two 90MB files, whereas native Git would store one compressed version of the 90MB file and one relatively small delta to turn the old file into the new one. So the Git-LFS approach surprisingly scales less well for overall storage size. Still, the original repository is much smaller, and you usually don’t have to pull all LFS files anyway, so it is a net win.
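
The difference can be inspected locally: native Git delta-compresses blobs inside its pack files, while Git-LFS stores every object whole, one file per SHA-256. In a local clone:

git count-objects -vH     # size of Git's delta-compressed pack data
du -sh .git/lfs/objects   # local LFS store: complete files, no deltas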

Throughout this work, I kept thinking about how my approach relates to Debian’s snapshot service. Ultimately what I would want is a combination of these two services. To have a good foundation for transparency work, I would want a collection of all Release/Packages/Sources files ever published, and ultimately also the source code and binaries. While it makes sense to start with the latest stable releases of the distributions, this effort should scale backwards in time as well. For reproducing binaries from source code, I need to be able to securely find earlier versions of binary packages used for rebuilds. So I need to import all the Release/Packages/Sources files from snapshot into my repositories. The latency when retrieving files from that server is high, and I haven’t been able to find an efficient/parallelized way to download the files. If I’m able to finish this, I would have confidence that my new Git-LFS based approach to storing these files will scale over many years to come. This remains to be seen. Perhaps the repository has to be split up per release or per architecture or similar.
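
I would expect something like xargs-driven parallelism to be the natural approach, although I don’t know how well snapshot.debian.org tolerates parallel clients; a sketch (the URL layout follows snapshot’s /archive/debian/<timestamp>/ scheme, and files.txt is hypothetical):

# files.txt: one URL per line, e.g.
# https://snapshot.debian.org/archive/debian/20230101T000000Z/dists/unstable/InRelease
xargs -n1 -P8 wget -nv --no-clobber < files.txt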

Another factor is storage costs. While the git repository size for a Git-LFS based repository with files from several years may be possible to sustain, the Git-LFS storage size surely won’t be. It seems GitLab charges the same for files in repositories and in Git-LFS, around $500 per 100GB per year. It may be possible to set up a separate Git-LFS backend not hosted at GitLab to serve the LFS files. Does anyone know of a suitable server implementation for this? I had a quick look at the Git-LFS implementation list and it seems the closest reasonable approach would be to set up the Gitea-clone Forgejo as a self-hosted server. Perhaps a cloud storage approach à la S3 is the way to go? The cost to host this on GitLab will be manageable for up to ~1TB ($5,000/year), but scaling it to storing, say, 500TB of data would mean a yearly fee of $2.5M, which seems like poor value for the money.
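
For what it’s worth, pointing a repository at an external Git-LFS backend only requires committing an .lfsconfig file at the repository root; the server URL here is hypothetical:

# .lfsconfig committed at the repository root
[lfs]
    url = https://lfs.example.org/debdistutils/archives/debian/mirror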

I realized that ultimately I would want a git repository locally with the entire content of all apt archives ever published, including their binary and source packages. The storage requirement for a service like snapshot (~300TB of data?) is today not prohibitively expensive: 20TB disks are $500 a piece, so a storage enclosure with 36 disks would be around $18,000 for 720TB, and using RAID1 means 360TB, which is a good start. While I have heard about ~TB-sized Git-LFS repositories, would Git-LFS scale to 1PB? Perhaps the size of a git repository with millions of Git-LFS pointer files will become unmanageable? To get started on this approach, I decided to import a mirror of Debian’s bookworm for amd64 into a Git-LFS repository. That is around 175GB, so reasonably cheap to host even on GitLab ($1,000/year for 200GB). Having this repository publicly available will make it possible to write software that uses this approach (e.g., porting debdistreproduce), to find out if this is useful and if it can scale. Distributing the apt repository via Git-LFS would also enable other interesting ideas for protecting the data. Consider configuring apt to use a local file:// URL to this git repository, and verifying the git checkout using some method similar to Guix’s approach to trusting git content or Sigstore’s gitsign.
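
As a sketch of that idea (paths hypothetical; note that git lfs pull fetches all ~175GB of objects, and apt still verifies the InRelease signature as usual):

git clone https://gitlab.com/debdistutils/archives/debian/mirror.git /srv/debian-mirror
cd /srv/debian-mirror && git lfs pull
echo 'deb file:///srv/debian-mirror bookworm main' > /etc/apt/sources.list.d/local-mirror.list
apt-get update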

A naive push of the 175GB archive in a single git commit ran into pack size limitations:

remote: fatal: pack exceeds maximum allowed size (4.88 GiB)

However, breaking up the commit into smaller commits for parts of the archive made it possible to push the entire archive. Here are the commands to create this repository:

git init
git lfs install
git lfs track 'dists/**' 'pool/**'
git add .gitattributes
git commit -m"Add Git-LFS track attributes." .gitattributes
time debmirror --method=rsync --host ftp.se.debian.org --root :debian --arch=amd64 --source --dist=bookworm,bookworm-updates --section=main --verbose --diff=none --keyring /usr/share/keyrings/debian-archive-keyring.gpg --ignore .git .
git add dists project
git commit -m"Add." -a
git remote add origin git@gitlab.com:debdistutils/archives/debian/mirror.git
git push --set-upstream origin --all
for d in pool/*/; do
echo $d;
time git add $d;
git commit -m"Add $d." -a
git push
done

The resulting repository size is around 27MB, with Git-LFS object storage around 174GB. I think this approach would scale to handle all architectures for one release, but working with a single git repository for all releases for all architectures may lead to a too-large git repository (>1GB). So maybe one repository per release? These repositories could also be split up by a subset of pool/ files, or there could be one repository per release per architecture (or one for sources).

Finally, I have concerns about using SHA-1 for identifying objects. It seems both Git and Debian’s snapshot service are currently using SHA-1. For Git there is a SHA-256 transition underway, and it seems GitLab is working on support for SHA-256-based repositories. For serious long-term deployment of these concepts, it would be nice to go for SHA-256 identifiers directly. Git-LFS already uses SHA-256, but Git internally uses SHA-1, as does the Debian snapshot service.
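
For reference, recent Git versions can already create SHA-256 repositories, even if forge support lags behind:

git init --object-format=sha256 repo
cd repo
git rev-parse --show-object-format    # prints "sha256"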

What do you think? Happy Hacking!

Sigstore for Apt Archives: apt-cosign

As suggested in my initial announcement of apt-sigstore, my plan was to look into stronger uses of Sigstore than rekor, and I’m now happy to announce that the apt-cosign plugin has been added to apt-sigstore. The operational project debdistcanary is publishing cosign statements about the InRelease files published by the following distributions: Trisquel GNU/Linux, PureOS, Gnuinos, Ubuntu, Debian and Devuan.

Summarizing the commands that you need to run as root to experience the great new world:

# run everything as root: su / sudo -i / doas -s
apt-get install -y apt gpg bsdutils wget
# install the apt-verify gpgv wrapper and hook it into apt
wget -nv -O/usr/local/bin/apt-verify-gpgv https://gitlab.com/debdistutils/apt-verify/-/raw/main/apt-verify-gpgv
chmod +x /usr/local/bin/apt-verify-gpgv
mkdir -p /etc/apt/verify.d
ln -s /usr/bin/gpgv /etc/apt/verify.d
echo 'APT::Key::gpgvcommand "apt-verify-gpgv";' > /etc/apt/apt.conf.d/75verify
# install cosign, verifying its checksum
wget -O/usr/local/bin/cosign https://github.com/sigstore/cosign/releases/download/v2.0.1/cosign-linux-amd64
echo 924754b2e62f25683e3e74f90aa5e166944a0f0cf75b4196ee76cb2f487dd980  /usr/local/bin/cosign | sha256sum -c
chmod +x /usr/local/bin/cosign
# install the apt-cosign verify.d plugin
wget -nv -O/etc/apt/verify.d/apt-cosign https://gitlab.com/debdistutils/apt-sigstore/-/raw/main/apt-cosign
chmod +x /etc/apt/verify.d/apt-cosign
# fetch the cosign public key for your distribution and point apt-cosign
# at the cosign statements published by debdistcanary
mkdir -p /etc/apt/trusted.cosign.d
dist=$(lsb_release --short --id | tr A-Z a-z)
wget -O/etc/apt/trusted.cosign.d/cosign-public-key-$dist.txt "https://gitlab.com/debdistutils/debdistcanary/-/raw/main/cosign/cosign-public-key-$dist.txt"
echo "Cosign::Base-URL \"https://gitlab.com/debdistutils/canary/$dist/-/raw/main/cosign\";" > /etc/apt/apt.conf.d/77cosign

Then run your usual apt-get update and look in the syslog to debug things.

This is the kind of work that gets done while waiting for the build machines to attempt to reproducibly build PureOS. Unfortunately, the result is that a meager 16% of the 765 added/modified packages are reproducible by me. There is some infrastructure work to be done to improve things: we should use sbuild, for example. The build infrastructure should produce signed statements for each package it builds: one statement saying that it attempted to reproducibly build a particular binary package (thus generating build logs and diffoscope output for auditing), and one statement saying that it actually was able to reproduce a package. Verifying such claims during apt-get install or possibly dpkg -i is a logical next step.
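
A sketch of what such statements could look like with the same cosign tooling as above (key and file names are hypothetical):

# statement 1: sign the evidence of a rebuild attempt (log + diffoscope output)
cosign sign-blob --yes --key cosign.key --output-signature build.log.sig build.log
# statement 2: sign only when the package actually was reproduced bit-by-bit
cosign sign-blob --yes --key cosign.key --output-signature hello_2.10_amd64.deb.sig hello_2.10_amd64.deb
# third parties can then verify against the published public key
cosign verify-blob --key cosign.pub --signature hello_2.10_amd64.deb.sig hello_2.10_amd64.deb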

There are some code cleanups and release work to be done now. Which distribution will be the first apt-based distribution to include native support for Sigstore? Let’s see.

Sigstore is not the only relevant transparency log around, and I’ve been trying to learn a bit about Sigsum to be able to support it as well. The more improved confidence in system security, the merrier!

More on Differential Reproducible Builds: Devuan is 46% reproducible!

Building on my work to rebuild Trisquel GNU/Linux 11.0 aramo, it felt simple to generalize the tooling to any pair of apt repositories, so I’ve created debdistreproduce as a template project for doing this through the infrastructure of GitLab CI/CD, and meanwhile even set up my own gitlab-runner on spare hardware. I’ve brought reproduce/trisquel over to using debdistreproduce as well, and archived the old reproduce-trisquel project.

After fixing some quirks, building Devuan GNU+Linux 4.0 Chimaera was fairly quick, since they do not modify that many packages, and I’m now able to reproduce 46% of the packages that Devuan Chimaera adds/modifies on amd64. I have more work in progress here (hint: reproduce/pureos), but PureOS is considerably larger than both Trisquel and Devuan together. I’m not sure how interested Devuan or PureOS are in reproducible builds, though.

Reflecting on this work made me realize that while the natural thing to do here was to compare two different apt-based distributions, it would also be interesting, the same way as for debdistdiff, to compare, say, Debian bookworm with Debian unstable, especially now that they should be fairly close together. My tooling should support that too. However, given Debian’s already extensive reproducible-builds testing, doing so would need to provide some further benefit, and I can’t articulate one right now.

One ultimate goal of my effort is to improve trust in apt repositories, and combining transparency-style protection à la apt-sigstore with third-party validated reproducible builds may indeed be one such use-case that would benefit the wider community of apt repositories. Imagine having your system not install any package unless it can verify it against a third-party reproducible-build organization that commits its results to a tamper-proof transparency ledger. But I’m on repeat here now, so will stop.

Trisquel is 42% Reproducible!

The absolute number may not be impressive, but what I hope is at least a useful contribution is that there now actually is a number for how much of Trisquel is reproducible. Hopefully this will inspire others to help improve the metric.

tl;dr: go to reproduce-trisquel.

When I set about to understand how Trisquel worked, I identified a number of things that would improve my confidence in it. The lowest hanging fruit for me was to manually audit the package archive, and I wrote a tool called debdistdiff to automate this for me. That led me to think about apt archive transparency more generally. I have done some further work in that area (hint: apt-verify) that deserves its own blog post eventually. Much of apt archive transparency is futile if we don’t trust the packages that are in the archive. One way to measurably increase trust in the packages is to provide reproducible builds of them, which should by now be an established best practice. Code review is still important, but since it will never provide positive guarantees, we need other processes that can identify sub-optimal situations automatically. The way reproducible builds easily identify negative results is what I believe has driven much of their success: the results are tangible and measurable. The field of software engineering is in need of more such practices.

The design of my setup to build Trisquel reproducibly is as follows.

  • The project debdistget is responsible for downloading Release/Packages files (which are the most relevant files from dists/) from apt archives, and works by committing them into GitLab-hosted git repositories. I maintain several such repositories for popular apt archives, including for Trisquel and its upstream Ubuntu. GitLab invokes a scheduled pipeline to do the downloading, and there are some race conditions here.
  • The project debdistdiff is used to produce the list of added and modified packages, which is the input for knowing which packages to reproduce. It publishes human-readable summaries of the differences for several distributions, including Trisquel vs Ubuntu. Early on I decided that rebuilding all of the upstream Ubuntu packages is out of scope for me: my personal trust in the official Debian/Ubuntu apt archives is greater than my trust in the added/modified packages in Trisquel.
  • The final project reproduce-trisquel puts the pieces together briefly as follows, everything being driven from its .gitlab-ci.yml file.
    • There is a (manually triggered) job generate-build-image to create a build image to speed up CI/CD runs, using a simple Dockerfile.
    • There is a (manually triggered) job generate-package-lists that uses debdistdiff to generate package lists and stores its output in lists/. The reason this is manually triggered right now is a race condition.
    • There is a (scheduled) job that does two things: from the package lists, the script generate-ci-packages.sh builds a GitLab CI/CD instruction file ci-packages.yml that describes jobs for each package to build (see the sketch after this list). The second part is generate-readme.sh, which re-generates the project’s README.md based on the build logs and diffoscope outputs that are stored in the git repository.
    • Through the ci-packages.yml file, a large number of jobs are dynamically defined; currently they are manually triggered so as not to overload the build servers. The script build-package.sh is invoked and attempts to rebuild a package, and stores the build log and diffoscope output in the git project itself.
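
To illustrate the dynamic job generation, here is a simplified sketch of what generate-ci-packages.sh amounts to (the list file name and job fields are illustrative; the real script is in the project):

#!/bin/sh
# emit one manually-triggered CI/CD job per package on the list
for pkg in $(cat lists/packages-aramo.txt); do
cat <<EOF
$pkg:
  stage: build
  when: manual
  script:
    - ./build-package.sh $pkg
EOF
done > ci-packages.yml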

I did not expect to be able to use the GitLab shared runners to do the building, but they turned out to work quite well, so I postponed setting up my own runner. There is a manually curated lists/disabled-aramo.txt with some packages that either required too much disk space or took over two hours to build. Today I finally took the time to set up a GitLab runner using podman running Trisquel aramo, and I expect to complete builds of the remaining packages soon; one of my Dell R630 servers with 256GB RAM and dual 2680v4 CPUs should deliver sufficient performance.

Current limitations and ideas on further work (most are filed as project issues) include:

  • We don’t support *.buildinfo files. As far as I am aware, Trisquel does not publish them for their builds. Improving this would be a first step forward; anyone able to help? Compare buildinfo.debian.net. For example, many packages differ only in their NT_GNU_BUILD_ID symbol inside the ELF binary; see the example diffoscope output for libgpg-error. By poking around in jenkins.trisquel.org I managed to discover that Trisquel built initramfs-tools in the randomized path /build/initramfs-tools-bzRLUp, and hard-coding that path allowed me to reproduce that package (a sketch of this trick follows after this list). I expect the same to hold for many other packages. Unfortunately for the headline, turning this failure into success for that package moved the needle from 42% to 43% reproducibility; however, I didn’t let that stand in the way of a good headline.
  • The mechanism to download the Release/Packages files from dists/ is not fool-proof: we may not capture all such files ever published. While this is less of a concern for reproducibility, it is more of a concern for apt transparency. Still, having Trisquel provide a service similar to snapshot.debian.org would help.
  • Having at least one other CPU architecture would be nice.
  • Due to lack of time and mental focus, handling incremental updates of new versions of packages is not yet working. This means we only ever build one version of a package, and never discover any newly published versions of the same package. Now that Trisquel aramo is released, the expected rate of new versions should be low, but they still happen due to security fixes or backports.
  • Porting this to test supposedly FSDG-compliant distributions such as PureOS and Gnuinos should be relatively easy. I’m also looking at Devuan because of Gnuinos.
  • The elephant in the room is how reproducible Ubuntu is in the first place.
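
For the build-path trick mentioned in the first item above, the idea is simply to recreate the randomized directory the Trisquel builder used and build there. A sketch, assuming standard Debian build tooling with build dependencies already installed:

# recreate the exact build path observed on jenkins.trisquel.org
mkdir -p /build/initramfs-tools-bzRLUp
cd /build/initramfs-tools-bzRLUp
apt-get source initramfs-tools
cd initramfs-tools-*/
dpkg-buildpackage -us -uc -b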

Happy Easter Hacking!

Update 2023-04-17: The original project “reproduce-trisquel” that was announced here has been archived and replaced with two projects, one generic “debdistreproduce” and one with results for Trisquel: “reproduce/trisquel”.

Apt Archive Transparency: debdistdiff & apt-canary

I’ve always found the operation of apt software package repositories to be a mystery. There appears to be a lack of transparency into which people have access to important apt package repositories out there, how the automatic non-human update mechanism is implemented, and what changes are published. I’m thinking of big distributions like Ubuntu and Debian, but also the free GNU/Linux distributions like Trisquel and PureOS that are derived from the more well-known distributions.

As far as I can tell, anyone who has the OpenPGP private key trusted by an apt-based GNU/Linux distribution can sign a modified Release/InRelease file, and if my machine somehow downloads that version of the release file, my machine could be made to download and install packages that the distribution didn’t intend me to install. Further, it seems that anyone who has access to the main HTTP server, or any of its mirrors, or is anywhere on the network between them and my machine (when plaintext HTTP is used), can either stall security updates on my machine (on a per-IP basis), or use it to send my machine (again, on a per-IP basis to avoid detection) a modified Release/InRelease file if they had been able to obtain the private signing key for the archive. These are mighty powers that warrant oversight.

I’ve always put off learning about the processes used to protect the apt infrastructure, mentally filing it under “so many people rely on this infrastructure that enough people are likely to have invested time reviewing and improving these processes”. Simultaneously, I’ve always followed the more free-software-friendly Debian-derived distributions such as gNewSense, and have run them on some machines. I’ve never put them into serious production use, because the trust issues with their apt package repositories have been a big question mark for me. The “enough people” part of my rationale for deferring this is not convincing. Even the simple question of “is someone updating the apt repository?” is not easy to answer on a running gNewSense system. At some point in time the gNewSense cron job to pull in security updates from Debian must have stopped working, and I wouldn’t have had any good mechanism to notice that. Most likely it happened without any public announcement. I’ve recently switched to Trisquel on production machines, and these questions have come back to haunt me.

The situation is unsatisfying, and I looked into what could be done to improve it. I could try to understand who the key people involved in each project are, and maybe even learn what hardware components are used, or what software is involved to update and sign apt repositories. Is the server running non-free software? Proprietary BIOS or NIC firmware? Are the GnuPG private keys on disk? Smartcard? TPM? YubiKey? HSM? Where is the server co-located, and who has access to it? I tried to do a bit of this, and discovered things like Trisquel having a DSA1024 key in its default apt trust store (although in fairness, it seems that apt by default does not trust such signatures). However, I’m not certain that understanding this more deeply would scale to securing my machines against attacks on this infrastructure. Even people with the best intentions, and state-of-the-art hardware and software, will have problems.

To increase my trust in Trisquel, I set out to understand how it worked. To make it easier to sort out which parts of the Trisquel archive were interesting to audit further, I created debdistdiff to produce human-readable text output comparing one apt archive with another. There is a GitLab CI/CD cron job that runs this every day, producing output comparing Trisquel vs Ubuntu and PureOS vs Debian. Working with these output files has made me learn more about how the process works, and I even stumbled upon something that is likely a bug, where Trisquel aramo was imported from Ubuntu jammy while it contained a couple of packages (e.g., gcc-8, python3.9) that were removed from the final Ubuntu jammy release.

After working on auditing the Trisquel archive manually that way, I realized that whatever I could tell from comparing Trisquel with Ubuntu would only be based on a current snapshot of the archives. Tomorrow it may look completely different. What felt necessary was to audit the differences of the Trisquel archive continuously. I was quite happy to have developed debdistdiff for one purpose (comparing two different archives like Trisquel and Ubuntu) and discovered that the tool could be used for another (comparing the Trisquel archive at two different points in time). At this point I realized that I needed a log of all apt archive metadata to be able to produce an audit log of the archive’s differences over time. I created manually curated git repositories with the Release/InRelease and Packages files for each architecture/component of the well-known distributions Trisquel, Ubuntu, Debian and PureOS. Eventually I wrote scripts to automate this, which are now published in the debdistget project.

At this point, one of the early questions, about per-IP substitution of Release files, was lingering in my mind. However, with the tooling I now had available, coming up with a way to resolve this was simple! Merely have apt compute a SHA256 checksum of the just-downloaded InRelease file, and see if my git repository had the same file. At this point I started reading the apt source code, and now I had more doubts about the security of my systems than I ever had before. Oh boy, how the name Apt has never before felt more… Apt?! Oh well, we must leave some exercises for the students. Eventually I realized I wanted to touch as little of the apt code base as possible, and noticed the SigVerify::CopyAndVerify function, which called ExecGPGV, which called apt-key verify, which called GnuPG’s gpgv. By setting Apt::Key::gpgvcommand I could get apt-key verify to call another tool than gpgv. See where I’m going? I thought wrapping this up would be trivial, but for some reason the hash checksum I computed locally never matched what was on my server. I gave up and started working on other things instead.

Today I came back to this idea, and started to debug exactly how the local files I got from apt looked and how they differed from what I had in my git repositories, which came straight from the apt archives. Eventually I traced this back to SplitClearSignedFile, which takes an InRelease file and splits it into two files, probably mimicking the (old?) way of distributing both Release and Release.gpg. So the clearsigned InRelease file is split into one cleartext file (similar to the Release file) and one OpenPGP signature file (similar to the Release.gpg file). But why didn’t the cleartext variant of the InRelease file hash to the same value as the hash of the Release file? Sadly, they differ by the final newline.
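
The mismatch is easy to demonstrate on any pair of InRelease/Release files (a sketch; gpg prints the cleartext part of a clearsigned file when asked to decrypt it):

gpg --output cleartext --decrypt InRelease   # extract the signed cleartext
sha256sum cleartext Release                  # different hashes...
cmp cleartext Release                        # ...and cmp points at the very end: the final newline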

Having solved this technicality, wrapping the pieces up was easy, and I came up with a project, apt-canary, that provides a script apt-canary-gpgv which verifies the local apt release files against something I call an “apt canary witness” file stored at a URL somewhere.
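
The heart of the idea is small enough to sketch (a simplified version; the real script is in the apt-canary project, and the witness URL variable is hypothetical):

#!/bin/bash
# verify the OpenPGP signature as usual, then require the canary witness
# to know about the release file we just downloaded
set -e
gpgv "$@"
signedfile="${@: -1}"                           # last argument: the signed file
hash=$(sha256sum "$signedfile" | cut -d' ' -f1)
witness=$(wget -nv -O- "$CANARY_WITNESS_URL")   # fetch the published witness
echo "$witness" | grep -q "$hash" || {
    echo "apt-canary: release file $hash not witnessed" >&2
    exit 1
}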

I’m now running apt-canary on my Trisquel aramo laptop, a Trisquel nabia server, and a Talos II ppc64el Debian machine. This means I have solved the per-IP substitution worries (or at least made such attacks less likely to succeed, since the same malicious release files would have to be sent to both GitLab and my system), and it allows me to keep an audit log of all release files that I actually use for installing and downloading packages.

What do you think? There is clearly a lot of work and many improvements to be made. This is a proof-of-concept implementation of an idea, but instead of refining it to perfection and delaying feedback, I wanted to publish it to get others to think about the problems and the various ways to resolve them.

Btw, I’m going to be at FOSDEM’23 this weekend, helping to manage the Security Devroom. Catch me if you want to chat about this or other things. Happy Hacking!