Back in early 2012 I had been helping with system administration of a number of Debian/Ubuntu-based machines, and the odd Solaris machine, for a couple of years at $DAYJOB. We had a combination of hand-written scripts, documentation notes that we cut’n’paste’d from during installation, and some locally maintained Debian packages for pulling in dependencies and providing some configuration files. As the number of people and machines involved grew, I realized that I wasn’t happy with how these machines were being administered. If one of these machines were to disappear in flames, it would take time (and more importantly, non-trivial manual labor) to get its services up and running again. I wanted a system that could automate the complete configuration of any Unix-like machine. It should require minimal human interaction. I wanted the configuration files to be version controlled. I wanted good security properties. I did not want to rely on a centralized server that would be a single point of failure. It had to be portable and easy to get working on new (and very old) platforms. It should be easy to modify a configuration file and get it deployed. I wanted it to be easy to start using on an existing server. I wanted it to allow for incremental adoption. Surely this must exist, I thought.
During January 2012 I evaluated the existing configuration management systems around, like CFEngine, Chef, and Puppet. I don’t recall my reasons for rejecting each individual project, but needless to say I did not find what I was looking for. The reasons for rejecting the projects I looked at ranged from centralization concerns (single-point-of-failure central servers), bad security (no OpenPGP signing integration), to the feeling that the projects were too complex and hence fragile. I’m sure there were other reasons too.
In February I started going back to my original needs and tried to see if I could abstract something from the knowledge that was in all these notes, script snippets and local dpkg packages. I realized that the essence of what I wanted was one shell script per machine, OpenPGP signed, in a Git repository. I could check out that Git repository on every new machine that I wanted to configure, verify the OpenPGP signature of the shell script, and invoke the script. The script would do everything needed to get the machine into an operational state again, including package installation and configuration file changes. Since I would usually want to modify configuration files on a system even after its initial installation (hey, not everyone is perfect), it was natural to extend this idea to a cron job that did ‘git pull’, verified the OpenPGP signature, and ran the script. The script would then have to be a bit more clever and not redo everything every time.
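The pull, verify, run cycle described above can be sketched roughly as follows. This is only an illustration, not the actual Cosmos code: the function names are made up, and it assumes a detached OpenPGP signature stored next to the script as `<script>.sig`, which is just one of several ways the verification could be done.

```shell
#!/bin/sh
# Hypothetical sketch of the pull/verify/run cycle; not the actual Cosmos code.
# Assumption: a detached signature is stored next to the script as "<script>.sig".

# Update the local checkout of the configuration repository.
fetch_repo() {
    (cd "$1" && git pull --quiet)
}

# Succeed only if the detached OpenPGP signature matches the script.
verify_script() {
    gpg --quiet --verify "$1.sig" "$1" 2>/dev/null
}

# update_and_run REPO_DIR SCRIPT_NAME
# Pull, verify, and run the per-machine script; refuse to run on any failure.
update_and_run() {
    repo=$1
    script="$repo/$2"
    fetch_repo "$repo" || { echo "pull failed" >&2; return 1; }
    verify_script "$script" || { echo "bad signature, not running" >&2; return 1; }
    sh "$script"
}
```

A cron job might then call something like `update_and_run /path/to/checkout machine.sh`, where both arguments are placeholders. The important property is the ordering: nothing from the repository is executed until the signature check has succeeded.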
Since we had many machines, it was obvious that there would be huge code duplication between scripts. It felt natural to think of splitting up the shell script into a directory with many smaller shell scripts, and invoking each shell script in turn. Think of the /etc/init.d/ hierarchy and how it worked with System V init. This would allow re-use of useful snippets across several machines. The next realization was that large parts of the shell script would exist only to create configuration files, such as /etc/network/interfaces. It would be easier to modify the content of those files if they were stored as files in a separate directory, an “overlay” stored in a sub-directory overlay/, and copied into the file system’s hierarchy with rsync. The final realization was that it made some sense to run one set of scripts before rsync’ing in the configuration files (to be able to install packages or set things up for the configuration files to make sense), and one set of scripts after the rsync (to perform tasks that require some package to be installed and configured). These sets of scripts were called the “pre-tasks” and “post-tasks” respectively, and stored in sub-directories called pre-tasks.d/ and post-tasks.d/.
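To make the three phases concrete, here is a minimal sketch of how one model directory could be applied. It is not the actual Cosmos implementation; the function names `run_tasks` and `apply_model` are hypothetical, and it assumes that task scripts are executable files that should run in lexical order.

```shell
#!/bin/sh
# Hypothetical sketch of the three phases; not the actual Cosmos code.

# Run every executable file in a directory, in lexical order; stop on failure.
run_tasks() {
    for task in "$1"/*; do
        [ -f "$task" ] && [ -x "$task" ] || continue
        "$task" || return 1
    done
}

# apply_model MODEL_DIR ROOT_DIR
# Phase 1: pre-tasks.d/, phase 2: copy overlay/ into place, phase 3: post-tasks.d/.
apply_model() {
    model=$1
    root=$2
    run_tasks "$model/pre-tasks.d" || return 1
    if [ -d "$model/overlay" ]; then
        # The trailing slashes make rsync copy the contents of overlay/ into ROOT_DIR.
        rsync -a "$model/overlay/" "$root/" || return 1
    fi
    run_tasks "$model/post-tasks.d"
}
```

On a real machine the second argument would be `/`, but passing a scratch directory instead is a convenient way to see what a model would install before letting it touch the live file system.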
I started putting what would become Cosmos together during February 2012. Incidentally, I had been using etckeeper on our machines, and I had been reading its source code, and it greatly inspired the internal design of Cosmos. The git history shows well how the ideas evolved (even that Cosmos was initially called Eve, but in retrospect I didn’t like the religious connotations), and there were a couple of rewrites on the way, but on the 28th of February I pushed out version 1.0. It was in total 778 lines of code, with at least 200 of those lines being the license boilerplate at the top of each file. Version 1.0 had a debian/ directory, and I built the dpkg file and started to deploy it on some machines. There were a couple of small fixes in the next few days, but development stopped on March 5th 2012. We started to use Cosmos, and converted more and more machines to it, and I quickly also converted all of my home servers to use it. And even my laptops. It took until September 2014 to discover the first bug (the fix is a one-liner). Since then there haven’t been any real changes to the source code. It is in daily use today.
The README that comes with Cosmos takes a more hands-on approach to using it, which I hope will serve as a starting point if the above introduction sparked some interest. I hope to cover more about how to use Cosmos in a later blog post. Since Cosmos does so little on its own, to make sense of how to use it, you want to see a Git repository with machine models. If you want to see what the Git repository for my own machines looks like, you can see the sjd-cosmos repository. Don’t miss its README at the bottom. In particular, its global/ sub-directory contains some of the foundation, such as OpenPGP key trust handling.
Quite interesting. Actually, I discovered cosmos earlier, when looking at ici, which uses the same structure (and explicitly refers to you).
Cosmos interests me, but like all other system management tools, it seems concerned about machines and the presentation is geared toward a machine-centric organization. Me, I’m often thinking in a feature-centric manner (so, I’d have a structure for, say, fail2ban, with a global setup of that feature, and then some directory or script that handled the special cases, which could be just a small number of all the machines affected). Do you see a way to adapt cosmos to that kind of thinking?
There’s another thing that makes me cringe a bit… in some cases, a too automatic update of all machines could be fatal. For example, if you keep iptables with cosmos and make one little mistake just before committing (which is admittedly stupid to say the least, but we’re all human and need to admit that we make blunders at times, even of that magnitude), it could knock quite a number of machines off the internet. I manage remote machines, so that scenario scares the hell out of me. So what I wonder is, is it possible to have cosmos do automatic updates for most stuff but make an exception for a few?
Hi. Thanks for looking at this!
The feature-oriented approach was part of the initial thinking, but I have only limited deployment experience with it. As you can see in my sjd-cosmos repository, I have a global/ sub-directory and all of my machines have a /etc/cosmos/cosmos.conf that contains (replacing latte.josefsson.org with the hostname):
COSMOS_REPO_MODELS="$COSMOS_REPO/global/:$COSMOS_REPO/latte.josefsson.org/"
So this machine pulls in stuff from the global/ sub-directory and the latte.josefsson.org/ sub-directory. My thinking was that you would have many sub-directories for different “features” and that you would configure each machine to read files from different directories. I never deployed this idea for more than the global/ sub-directory though. The reason was that someone with experience from another config system advised against it, because it had become unwieldy for them when some features started to depend on others. I’m not sure cosmos suffers from this though. So, yes, possible, but not deployed by me.
And, yes, one pet issue with the COSMOS_REPO_MODELS variable is that as it looks now it is quite unreadable. I think it should look like this instead:
COSMOS_REPO_MODELS=global:latte.josefsson.org
In a more feature-oriented approach you would then have something like this:
COSMOS_REPO_MODELS=global:fail2ban:apache:emacs:latte.josefsson.org
Note that you can manage /etc/cosmos/cosmos.conf from cosmos itself, but you have to be careful about making mistakes…
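Until the shorter syntax exists, a small helper could translate it into the full paths that the current format expects. The following is only a sketch: `expand_models` is a hypothetical function, not part of Cosmos, and it assumes that model names contain no colons, whitespace or glob characters.

```shell
#!/bin/sh
# Hypothetical helper; not part of Cosmos, which expects the full paths.

# expand_models REPO_DIR SHORT_LIST
# Turns "global:fail2ban:host" into "REPO/global/:REPO/fail2ban/:REPO/host/".
# The body runs in a subshell so the IFS change does not leak to the caller.
expand_models() (
    repo=$1
    result=
    IFS=:
    for name in $2; do
        # Prepend a ":" separator only when result is already non-empty.
        result="${result:+$result:}$repo/$name/"
    done
    printf '%s\n' "$result"
)
```

It could be used along the lines of `COSMOS_REPO_MODELS=$(expand_models "$COSMOS_REPO" global:fail2ban:apache:emacs:latte.josefsson.org)`, which keeps the per-machine line readable while still producing the long form.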
Doing automated updates for most stuff and not for others shouldn’t be too hard to achieve. You probably have to think a bit about how it would work. One idea that comes to mind is for pre-tasks/post-tasks scripts to check an environment variable that says whether the script is being run with a human at the keyboard. Then you would have to log into the machine and run ‘HUMAN=YES /etc/cron.daily/sjd-cosmos’ (or whatever). Or use ‘isatty’ or something similar.
However, the way I deal with making these potentially damaging changes is to develop them on the machine, commit them when I am done, and then manually do an update on that machine and see that it worked. If you are committing things in a global/ sub-directory you have to be much more careful, though, and check it on several different hosts before you are done.
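As a rough sketch of that idea, a pre-task or post-task script could guard its risky steps like this. Nothing here is a Cosmos feature: `HUMAN` is just the illustrative variable name from above, the function names are made up, and `[ -t 0 ]` is used as the shell counterpart of isatty on standard input.

```shell
#!/bin/sh
# Hypothetical guard for a pre-task or post-task script; not a Cosmos feature.

# Succeed if a person is present: either HUMAN=YES was set explicitly,
# or standard input is attached to a terminal.
human_present() {
    [ "${HUMAN:-}" = "YES" ] || [ -t 0 ]
}

# run_if_attended COMMAND [ARGS...]
# Run a risky command only when attended; otherwise skip it and say so on stderr.
run_if_attended() {
    if human_present; then
        "$@"
    else
        echo "skipping (unattended run): $*" >&2
    fi
}
```

A firewall task could then wrap its dangerous step, for example `run_if_attended iptables-restore /etc/iptables.rules` (the rules path is a placeholder). Under cron, standard input is not a terminal, so the step is skipped unless HUMAN=YES is set deliberately.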
I agree this can be dangerous, but I don’t see any way around it if you want to allow full functionality. Any system that manages several machines in some automated fashion will have the ability to destroy the machine, or it is not flexible enough.
Btw, it isn’t necessary to run cosmos as root: you could set it up in a per-user home directory, overriding configuration paths etc.
/Simon
All right, cool, that makes it worth considering.
You might want to consider having a page somewhere, or somewhere in the README, with instructions on how to provide code, patches or other contributions. Knowing me, it’s a safe bet that I’ll find something to tinker with 😉