Independently Reproducible Git Bundles

The gnulib project publishes a git bundle as a stable archival copy of the gnulib git repository once in a while.

Why? We don’t know exactly what this may be useful for, but I’m promoting it to see if we can establish some good use cases.

A git bundle may help to establish provenance in case of an attack on the Savannah hosting platform that compromises the gnulib git repository.

Another use is in the Debian gnulib package: that gnulib bundle is git cloned when building some Debian packages, to get exactly the gnulib commit used by each upstream project – see my talk on gnulib at DebConf24 – and this approach reduces the amount of vendored code that is part of Debian’s source code, which is relevant to mitigating XZ-style attacks.

The first time we published the bundle, I wanted others to be able to re-create it bit-by-bit identically.

At the time I discovered a well-written blog post by Paul Beacher on reproducible git bundles and thought he had solved the problem for me. Essentially it boils down to disabling threading during compression when producing the bundle, and his final example shows that this results in predictable, bit-by-bit identical output:

$ for i in $(seq 1 100); do \
> git -c 'pack.threads=1' bundle create -q /tmp/bundle-$i --all; \
> done
$ md5sum /tmp/bundle-* | cut -f 1 -d ' ' | uniq -c
    100 4898971d4d3b8ddd59022d28c467ffca

So what remains to be said about this? It seems reproducibility goes deeper than that. One desirable property is that someone else should be able to reproduce the same git bundle, not only that a single individual is able to reproduce things on one machine.

It surprised me to see that when I ran the same set of commands on a different machine (starting from a fresh git clone), I got a different checksum. The checksums differed even when nothing had been committed on the server side between the two runs.

I thought the reason had to do with other sources of unpredictable data, and I explored several ways to work around this but eventually gave up. I settled for the following sequence of commands:

REV=ac9dd0041307b1d3a68d26bf73567aa61222df54 # master branch commit to package
git clone https://git.savannah.gnu.org/git/gnulib.git
cd gnulib
git fsck # attempt to validate input
# inspect that the new tree matches a trusted copy
git checkout -B master $REV # put $REV at master
for b in $(git branch -r | grep origin/stable- | sort --version-sort); do git checkout ${b#origin/}; done
git remote remove origin # drop some unrelated branches
git gc --prune=now # drop any commits after $REV
git -c 'pack.threads=1' bundle create gnulib.bundle --all
V=$(env TZ=UTC0 git show -s --date=format:%Y%m%d --pretty=%cd master)
mv gnulib.bundle gnulib-$V.bundle
build-aux/gnupload --to ftp.gnu.org:gnulib gnulib-$V.bundle

At the time it felt more important to publish something than to reach for perfection, so we did so using the above snippet. Afterwards I reached out to the git community about this and there was a good discussion about my challenge.

At the end of that thread you see that I was finally able to reproduce bit-by-bit identical bundles from two different clones, by using an intermediate git -c pack.threads=1 repack -adF step. I now assume that the unpredictable data I got earlier was introduced during the ‘git clone’ step, compressing the pack differently each time due to threaded compression. The outcome could also depend on what content the server provided: if someone ran git gc or git repack on the server side, things would change for the user, even if the user forced threading to 1 during cloning. More experiments on what kind of server-side alterations result in client-side differences would be good research.
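
For reference, here is a minimal sketch of the kind of cross-clone check I have in mind; the directory and file names are my own, and it assumes nothing changes on the server between the two clones:

# Hypothetical cross-clone check (my own sketch): two fresh clones should
# yield bit-by-bit identical bundles after the repack step, assuming the
# server content does not change between the two clones.
for d in clone-a clone-b; do
  git clone -q https://git.savannah.gnu.org/git/gnulib.git $d
  git -C $d -c pack.threads=1 repack -adF
  git -C $d -c pack.threads=1 bundle create -q ../$d.bundle --all
done
sha256sum clone-a.bundle clone-b.bundle # the two digests should be identical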

A couple of months have passed and it is now time to publish another gnulib bundle – somewhat paired with the twice-yearly stable gnulib branches – so let’s walk through the commands and explain what they do. First clone the repository:

REV=225973a89f50c2b494ad947399425182dd42618c   # master branch commit to package
S1REV=475dd38289d33270d0080085084bf687ad77c74d # stable-202501 branch commit
S2REV=e8cc0791e6bb0814cf4e88395c06d5e06655d8b5 # stable-202507 branch commit
git clone https://git.savannah.gnu.org/git/gnulib.git
cd gnulib
git fsck # attempt to validate input

I believe git fsck will validate that the chain of SHA1 commits is linked together, preventing someone from smuggling in unrelated commits earlier in the history without having to produce a SHA1 collision. SHA1 collisions are economically feasible today, so this isn’t much of a guarantee of anything, though.
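
Beyond git fsck, one could add a few extra checks of one’s own, for example confirming that the pinned commits exist in the fresh clone and are also known to an independently obtained copy. The following is a hypothetical sketch, where ~/trusted/gnulib is an imaginary clone fetched earlier through another channel:

# Hypothetical extra checks, beyond what was used for the published bundle:
# confirm the pinned commits exist in the fresh clone...
for c in $REV $S1REV $S2REV; do
  git rev-parse --verify "$c^{commit}" > /dev/null && git log -1 --oneline "$c"
done
# ...and that an independently obtained trusted clone knows them too
# (~/trusted/gnulib is a hypothetical path).
for c in $REV $S1REV $S2REV; do
  git -C ~/trusted/gnulib rev-parse --verify "$c^{commit}" > /dev/null || echo "missing: $c"
done

Continuing the walk-through, pin the branches to those commits: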

git checkout -B master $REV # put $REV at master
# Add all stable-* branches locally:
for b in $(git branch -r | grep origin/stable- | sort --version-sort); do git checkout ${b#origin/}; done
git checkout -B stable-202501 $S1REV
git checkout -B stable-202507 $S2REV
git remote remove origin # drop some unrelated branches
git gc --prune=now # drop any unrelated commits, not clear this helps

This establishes a set of branches pinned to particular commits. The older stable-* branches are no longer updated, so they shouldn’t be moving targets. In case they are modified in the future, the particular commits we used can be found in the official git bundle.
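
If you want to double-check before packing, something along these lines (my own addition, not part of the published procedure) lists the branch tips and compares them against the pinned commits:

# Hypothetical double-check: list the local branch tips and compare them
# against the pinned commits before packing.
git for-each-ref --format='%(refname:short) %(objectname)' refs/heads/
test "$(git rev-parse master)" = "$REV" || echo "master mismatch"
test "$(git rev-parse stable-202501)" = "$S1REV" || echo "stable-202501 mismatch"
test "$(git rev-parse stable-202507)" = "$S2REV" || echo "stable-202507 mismatch"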

time git -c pack.threads=1 repack -adF

That’s the new magic command to repack and recompress things in a hopefully more predictable way. This leads to a 72MB git pack under .git/objects/pack/ and a 62MB git bundle. The runtime on my laptop is around 5 minutes.
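
To see those numbers on your own machine, the sizes can be inspected with something like this (again my own sketch):

# Inspect the repacked object store; the pack is around 72MB here.
ls -lh .git/objects/pack/*.pack
git count-objects -v -H # object store summary with human-readable sizes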

I experimented with -c pack.compression=1 and -c pack.compression=9 but the size was roughly the same: 76MB pack and 66MB bundle for level 1, and 72MB and 62MB for level 9. Runtime was still around 5 minutes.

Git uses zlib by default, which isn’t the most optimal compression around. I tried -c pack.compression=0 and got a 163MB git pack and a 153MB git bundle. The runtime is still around 5 minutes, indicating that compression is not the bottleneck for the git repack command.

That 153MB uncompressed git bundle compresses to 48MB with gzip default settings and 46MB with gzip -9; to 39MB with zstd defaults and 34MB with zstd -9; and to 28MB using xz defaults, with a slightly smaller 26MB using xz -9.
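
If you want to redo that comparison, a loop along these lines works; the file name is a placeholder for the 153MB bundle produced with pack.compression=0:

# Hypothetical size comparison: compress the uncompressed bundle with
# several general-purpose compressors and print the sizes in bytes.
B=gnulib-uncompressed.bundle
for tool in 'gzip' 'gzip -9' 'zstd' 'zstd -9' 'xz' 'xz -9'; do
  printf '%-8s %s\n' "$tool" "$($tool -c < $B | wc -c)"
done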

Still, the inconvenience of having to uncompress a 30-40MB archive into the much larger 153MB is probably not worth the savings compared to shipping and using the (still relatively modest) 62MB git bundle.

Now finally prepare the bundle and ship it:

git -c 'pack.threads=1' bundle create gnulib.bundle --all
V=$(env TZ=UTC0 git show -s --date=format:%Y%m%d --pretty=%cd master)
mv gnulib.bundle gnulib-$V.bundle
build-aux/gnupload --to ftp.gnu.org:gnulib gnulib-$V.bundle

Yay! Another gnulib git bundle snapshot is available from https://ftp.gnu.org/gnu/gnulib/.
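
For anyone wanting to consume the bundle, a rough sketch of downloading, verifying and cloning from it could look like the following; the file name depends on the snapshot date and the final check assumes the $REV pinned above:

# Hypothetical consumer-side usage of the published bundle.
wget https://ftp.gnu.org/gnu/gnulib/gnulib-$V.bundle
git bundle verify gnulib-$V.bundle # with older git this may need to run inside a repository
git clone gnulib-$V.bundle gnulib-from-bundle
git -C gnulib-from-bundle rev-parse origin/master # should print the $REV pinned above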

The essential part of the git repack command is the -F parameter. In the thread -f was suggested, which translates into the git pack-objects --no-reuse-delta parameter:

--no-reuse-delta

When creating a packed archive in a repository that has existing packs, the command reuses existing deltas. This sometimes results in a slightly suboptimal pack. This flag tells the command not to reuse existing deltas but compute them from scratch.

When reading the man page, I thought that using -F, which translates into --no-reuse-object, would be slightly stronger:

--no-reuse-object

This flag tells the command not to reuse existing object data at all, including non deltified object, forcing recompression of everything. This implies --no-reuse-delta. Useful only in the obscure case where wholesale enforcement of a different compression level on the packed data is desired.

On the surface, without --no-reuse-object, some amount of earlier compression could taint the final result. Still, I was able to get bit-by-bit identical bundles by using -f, so possibly reaching for -F is not necessary.
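
One way to probe that would be to repack one fresh clone with -f and another with -F and compare the resulting bundles. This is a hypothetical experiment of mine, not something done as part of the release procedure, and it assumes the server content does not change between the two clones:

# Hypothetical experiment: repack one fresh clone with -f and another with -F
# and compare the resulting bundles.
git clone -q https://git.savannah.gnu.org/git/gnulib.git test-small-f
git clone -q https://git.savannah.gnu.org/git/gnulib.git test-big-f
git -C test-small-f -c pack.threads=1 repack -adf
git -C test-big-f -c pack.threads=1 repack -adF
git -C test-small-f -c pack.threads=1 bundle create -q ../small-f.bundle --all
git -C test-big-f -c pack.threads=1 bundle create -q ../big-f.bundle --all
sha256sum small-f.bundle big-f.bundle # identical digests would suggest -f suffices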

All the commands were done using git 2.51.0 as packaged by Guix. I fear the result may be different with other git versions and/or zlib libraries. I was able to reproduce the same bundle on a Trisquel 12 aramo (derived from Ubuntu 22.04) machine, which uses git 2.34.1. This suggests there is some chance of this being reproducible in 20 years’ time. Time will tell.

I also fear these commands may be insufficient if something is moving on the server side of the gnulib git repository (even something as simple as a new commit). I tried to make some experiments with this, but let’s aim for incremental progress here. At least I have now been able to reproduce the same bundle on different machines, which wasn’t the case last time.

Happy Reproducible Git Bundle Hacking!
