Passive Icinga Checks: icinga-pusher

I use Icinga to monitor the availability of my Debian/OpenWRT/etc machines. So far I have relied on server-side checks, run from the Icinga system itself, that monitor the externally visible behaviour of the services I care about. In theory, monitoring externally visible properties should be good enough. Recently, however, I had one strange incident that was caused by a disk running out of space on one system. This prompted me to revisit my thinking and to start monitoring internal factors as well, so that I can detect problems such as a full disk before they cause an outage.

Another reason that I only had server-side checks was that I didn’t like the complexity of the Icinga agent, nor did I want to open up for incoming SSH connections from the Icinga server to my other servers. Complexity and machine-based authorization tend to lead to security problems, so I prefer to avoid them. The manual mentions agents that use the REST API, which was the start of my journey towards something better.

What I would prefer is for the hosts to push their self-test results to the central Icinga server. Fortunately, modern versions of Icinga (including version 2.10 that I use) provide a REST API, and its process-check-result action can be used to submit passive check results. Getting this up and running required a bit more research and creativity than I would have hoped for, so I thought it was material enough for a blog post. My main requirement was to keep complexity down, hence I ended up with a simple shell script that is run from cron. None of the existing API clients mentioned in the manual appealed to me.

Prepare the Icinga server with some configuration changes to support the push clients (replace blahonga with a fresh long random password).

icinga# cat > /etc/icinga2/conf.d/api-users.conf
object ApiUser "pusher" {
  password = "blahonga"
  permissions = [ "actions/process-check-result" ]
}
^D
icinga# icinga2 feature enable api && systemctl reload icinga2
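
To sanity-check that the API is reachable and that the new API user works, you can query the API root with curl from any machine that can reach port 5665 (the hostname, password and -k flag below are placeholders matching the examples in this post; with proper certificates you would use --cacert instead). It should respond with a short greeting confirming which user you are authenticated as and which permissions it has.

curl -k -s -u pusher:blahonga 'https://icinga.yoursite.com:5665/v1'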

Then add some Service definitions and assign them to some hosts, in /etc/icinga2/conf.d/services.conf:

apply Service "passive-disk" {
  import "generic-service"
  check_command = "passive"
  check_interval = 2h
  assign where host.vars.os == "Debian"
}
apply Service "passive-apt" {
  import "generic-service"
  check_command = "passive"
  check_interval = 2h
  assign where host.vars.os == "Debian"
}
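
The passive check command comes from the Icinga Template Library on my installation. If your version does not provide it, a dummy-based definition along the following lines should be roughly equivalent (the state and text just control what gets reported when no fresh result has arrived):

object CheckCommand "passive" {
  import "dummy"
  vars.dummy_state = 3
  vars.dummy_text = "No passive check result received."
}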

I’m using a relaxed check interval of 2 hours because I will submit results from a cron job that runs every hour. The next step is to set up the machines to submit the results. Create /etc/cron.d/icinga-pusher with the content below. Note that % characters need to be escaped in crontab files. I’m running this as the munin user, which is a non-privileged account that exists on all of my machines, but you may want to modify this. The check_disk command comes from the monitoring-plugins-basic Debian package, which includes other useful plugins like check_apt, which I recommend.

30 * * * * munin /usr/local/bin/icinga-pusher `hostname -f` passive-apt /usr/lib/nagios/plugins/check_apt

40 * * * * munin /usr/local/bin/icinga-pusher `hostname -f` passive-disk "/usr/lib/nagios/plugins/check_disk -w 20\% -c 5\% -X tmpfs -X devtmpfs"
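
Before relying on cron, it is worth running the script once by hand as the same user, to see that the plugin runs and the submission succeeds (outside a crontab the % characters need no escaping):

sudo -u munin /usr/local/bin/icinga-pusher "$(hostname -f)" passive-disk \
     "/usr/lib/nagios/plugins/check_disk -w 20% -c 5% -X tmpfs -X devtmpfs"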

My icinga-pusher script requires a configuration file with some information about the Icinga setup. Put the following content in /etc/default/icinga-pusher (again replacing blahonga with your password):

ICINGA_PUSHER_CREDS="-u pusher:blahonga"
ICINGA_PUSHER_URL="https://icinga.yoursite.com:5665"
ICINGA_PUSHER_CA="-k"

The parameters above are used by the icinga-pusher script. ICINGA_PUSHER_CREDS contains the API user credentials, either a simple "-u user:password" combination or something like "--cert /etc/ssl/yourclient.crt --key /etc/ssl/yourclient.key" for client certificates. ICINGA_PUSHER_URL is the base URL of your Icinga setup, including the API port, which is usually 5665. ICINGA_PUSHER_CA is either "--cacert /etc/ssl/icingaca.crt" or "-k" to skip CA verification (not recommended!).

Below is the script icinga-pusher itself. Some error handling has been removed for brevity — I have put the script in a separate “icinga-pusher” git repository which will be where I make any updates to this project in the future.

#!/bin/sh

# Copyright (C) 2019 Simon Josefsson.
# Released under the GPLv3+ license.

# Pull in ICINGA_PUSHER_CREDS, ICINGA_PUSHER_URL and ICINGA_PUSHER_CA.
. /etc/default/icinga-pusher

HOST="$1"     # Icinga host name, e.g. `hostname -f`
SERVICE="$2"  # Icinga service name, e.g. passive-disk
CMD="$3"      # monitoring plugin command line to run

# Run the plugin and capture its output and exit status.
OUT=$($CMD)
RC=$?

# Plugin output follows the usual 'text|performance data' convention,
# so split it on the '|' separator.
oIFS="$IFS"
IFS='|'
set -- $OUT
IFS="$oIFS"

OUTPUT="$1"
PERFORMANCE="$2"

# Build the JSON payload for the process-check-result API call.
data='{ "type": "Service", "filter": "host.name==\"'$HOST'\" && service.name==\"'$SERVICE'\"", "exit_status": '$RC', "plugin_output": "'$OUTPUT'", "performance_data": "'$PERFORMANCE'" }'

curl $ICINGA_PUSHER_CA $ICINGA_PUSHER_CREDS \
     -s -H 'Accept: application/json' -X POST \
     "$ICINGA_PUSHER_URL/v1/actions/process-check-result" \
     -d "$data"

exit 0
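
To illustrate the splitting step, here is what the two variables end up holding for a made-up check_disk output line (the values are invented purely for the example):

# Made-up plugin output in the usual 'text|performance data' format.
OUT='DISK OK - free space: / 9000 MiB (45% inode=99%);| /=11000MiB;16000;19000;0;20000'
oIFS="$IFS"; IFS='|'; set -- $OUT; IFS="$oIFS"
echo "plugin_output:    $1"    # DISK OK - free space: / 9000 MiB (45% inode=99%);
echo "performance_data: $2"    #  /=11000MiB;16000;19000;0;20000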

What do you think? Is there a simpler way of achieving what I want? Thanks for reading.

Combining Dnsmasq and Unbound

For my home office network I have been using Dnsmasq for some time. Dnsmasq provides me with DNS, DHCP, DHCPv6, and IPv6 Router Advertisement. I run dnsmasq on a Debian Jessie server, but it works similarly on OpenWRT if you want to use a smaller device. My entire /etc/dnsmasq.d/local configuration used to look like this:

dhcp-authoritative
interface=eth1
read-ethers
dhcp-range=192.168.1.100,192.168.1.150,12h
dhcp-range=2001:9b0:104:42::100,2001:9b0:104:42::1500
dhcp-option=option6:dns-server,[::]
enable-ra

Here dhcp-authoritative enables authoritative DHCP mode. interface=eth1 says to listen on eth1 only, which is my internal (IPv4 NAT) network. I try to keep track of the MAC addresses of all my devices in an /etc/ethers file, so I use read-ethers to have dnsmasq give them stable IP addresses. The dhcp-range lines enable DHCP and DHCPv6 on my internal network. The dhcp-option=option6:dns-server,[::] statement is needed to inform DHCP clients of the DNS resolver’s IPv6 address, otherwise they would only get the IPv4 DNS server address. The enable-ra parameter enables IPv6 router advertisement on the internal network, which removes the need to also run radvd (useful since I prefer to use copyleft software).

Recently I had a desire to use DNSSEC, and enabled it in Dnsmasq using the following statements:

dnssec
trust-anchor=.,19036,8,2,49AAC11D7B6F6446702E54A1607371607A1A41855200FD2CE1CDDE32F24E8FB5
dnssec-check-unsigned

The dnssec keyword enables DNSSEC validation in dnsmasq, using the indicated trust-anchor (get the root anchors from IANA). The dnssec-check-unsigned option deserves some more discussion. The dnsmasq manpage describes it as follows:

As a default, dnsmasq does not check that unsigned DNS replies are legitimate: they are assumed to be valid and passed on (without the “authentic data” bit set, of course). This does not protect against an attacker forging unsigned replies for signed DNS zones, but it is fast. If this flag is set, dnsmasq will check the zones of unsigned replies, to ensure that unsigned replies are allowed in those zones. The cost of this is more upstream queries and slower performance.

In other words, without this flag dnsmasq’s DNSSEC functionality is not secure against active man-in-the-middle attacks between dnsmasq and the DNS server it is using. Even if example.org used DNSSEC properly, an attacker could claim to dnsmasq that the zone was unsigned, and I would potentially get incorrect values in return. We all know that the Internet is not a secure place, and your threat model should include active attackers. I believe this mode should be the default in dnsmasq, and users should have to explicitly configure dnsmasq to not be in that mode if they really want to (with the obvious security warning).

Running with this enabled for a couple of days resulted in frustration about not being able to reach a couple of domains. The behaviour was that my clients would hang indefinitely or get a SERVFAIL, both resulting in lack of service. You can enable query logging in dnsmasq with log-queries, and doing so I noticed three distinct error patterns:

jow13gw dnsmasq 460 - -  forwarded www.fritidsresor.se to 213.80.101.3
jow13gw dnsmasq 460 - -  validation result is BOGUS
jow13gw dnsmasq 547 - -  reply cloudflare-dnssec.net is BOGUS DNSKEY
jow13gw dnsmasq 547 - -  validation result is BOGUS
jow13gw dnsmasq 547 - -  reply linux.conf.au is BOGUS DS
jow13gw dnsmasq 547 - -  validation result is BOGUS

The first only happened intermittently, the second did not cause any noticeable problem, and the final one was reproducible. To be fair, I only found the last example after starting to search for problem reports (see post confirming bug).

At this point, I had a confirmed bug in dnsmasq that affected my use case. I want to use official Debian packages on this machine, so installing newer versions manually is not an option. So I started to look into alternatives for DNS resolving, and quickly found Unbound. Installing it was easy:

apt-get install unbound
unbound-control-setup 

I created a local configuration file in /etc/unbound/unbound.conf.d/local.conf as follows:

server:
	interface: 127.0.0.1
	interface: ::1
	interface: 192.168.1.2
	interface: 2001:9b0:104:42::2
	access-control: 127.0.0.1 allow
	access-control: ::1 allow
	access-control: 192.168.1.2/24 allow
	access-control: 2001:9b0:104:42::2/64 allow
	outgoing-interface: 155.4.17.2
	outgoing-interface: 2001:9b0:1:1a04::2
#	log-queries: yes
#	verbosity: 2

The interface keywords determine which IP addresses to listen on; here I used the loopback interface and the local addresses of the physical network interface for my internal network. The access-control lines allow recursive DNS resolving from those networks. And outgoing-interface specifies my external Internet-connected interface. log-queries and/or verbosity are useful for debugging.

To make things work, dnsmasq has to stop providing DNS services. This can be achieved with the port=0 keyword; however, that will also stop dnsmasq from informing DHCP clients about which DNS server to use, so this has to be added manually. I ended up adding the following two lines to /etc/dnsmasq.d/local:

port=0
dhcp-option=option:dns-server,192.168.1.2
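
After restarting dnsmasq and unbound, a quick way to confirm that dnsmasq has released port 53 and that only Unbound is serving DNS is to list the listening sockets (a sketch; the process column needs root, and the exact output format depends on your ss version):

# Only unbound should be listening on port 53 now.
ss -lntup | grep ':53 '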

With that in place, I now have working (and secure) internal DNSSEC-aware name resolution over both IPv4 and IPv6. I can verify that resolution works, and that Unbound verifies signatures and rejects bad domains properly, with the host command as below, or use an online DNSSEC resolver test page, although I’m not sure how confident you can be in the result from that page.

$ host linux.conf.au
linux.conf.au has address 192.55.98.190
linux.conf.au mail is handled by 1 linux.org.au.
$ host sigfail.verteiltesysteme.net
;; connection timed out; no servers could be reached
$ 
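
For a more explicit check, dig can show whether Unbound actually validated the answer: a validated reply carries the ad (“authentic data”) flag in the header flags line, while the deliberately broken test domain should come back with status SERVFAIL (the resolver address is the one from my configuration above):

$ dig +dnssec linux.conf.au @192.168.1.2 | grep flags
$ dig +dnssec sigfail.verteiltesysteme.net @192.168.1.2 | grep status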

I use Munin to monitor my services, and I was happy to find a nice Unbound Munin plugin. I installed the file in /usr/share/munin/plugins/ and created a Munin plugin configuration file /etc/munin/plugin-conf.d/unbound as follows:

[unbound*]
user root
env.statefile /var/lib/munin-node/plugin-state/root/unbound.state
env.unbound_conf /etc/unbound/unbound.conf
env.unbound_control /usr/sbin/unbound-control
env.spoof_warn 1000
env.spoof_crit 100000

I ran munin-node-configure --shell | sh to enable it. For the plugin to work, unbound has to be configured as well, so I created /etc/unbound/unbound.conf.d/munin.conf as follows.

server:
	extended-statistics: yes
	statistics-cumulative: no
	statistics-interval: 0
remote-control:
	control-enable: yes
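
With remote control enabled, the statistics can also be inspected by hand, which is a quick way to confirm that the channel the Munin plugin depends on actually works (stats_noreset avoids clearing the counters that Munin reads):

# Pick up the new statistics settings, then peek at a few counters.
unbound-control reload
unbound-control stats_noreset | head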

The graphs may be viewed at my munin instance.

Cosmos – A Simple Configuration Management System

Back in early 2012 I had been helping with system administration of a number of Debian/Ubuntu-based machines, and the odd Solaris machine, for a couple of years at $DAYJOB. We had a combination of hand-written scripts, documentation notes that we cut’n’paste’d from during installation, and some locally maintained Debian packages for pulling in dependencies and providing some configuration files. As the number of people and machines involved grew, I realized that I wasn’t happy with how these machines were being administered. If one of these machines disappeared in flames, it would take time (and more importantly, non-trivial manual labor) to get its services up and running again. I wanted a system that could automate the complete configuration of any Unix-like machine. It should require minimal human interaction. I wanted the configuration files to be version controlled. I wanted good security properties. I did not want to rely on a centralized server that would be a single point of failure. It had to be portable and easy to get to work on new (and very old) platforms. It should be easy to modify a configuration file and get it deployed. I wanted it to be easy to start using on an existing server. I wanted it to allow for incremental adoption. Surely this must exist, I thought.

During January 2012 I evaluated the existing configuration management systems around, like CFEngine, Chef, and Puppet. I don’t recall my reasons for rejecting each individual project, but needless to say I did not find what I was looking for. The reasons for rejecting the projects I looked at ranged from centralization concerns (single-point-of-failure central servers), bad security (no OpenPGP signing integration), to the feeling that the projects were too complex and hence fragile. I’m sure there were other reasons too.

In February I started going back to my original needs and tried to see if I could abstract something from the knowledge that was in all these notes, script snippets and local dpkg packages. I realized that the essence of what I wanted was one shell script per machine, OpenPGP signed, in a Git repository. I could check out that Git repository on every new machine that I wanted to configure, verify the OpenPGP signature of the shell script, and invoke the script. The script would do everything needed to get the machine back into an operational state, including package installation and configuration file changes. Since I would usually want to modify configuration files on a system even after its initial installation (hey, not everyone is perfect), it was natural to extend this idea to a cron job that did ‘git pull’, verified the OpenPGP signature, and ran the script. The script would then have to be a bit more clever and not redo everything every time.

Since we had many machines, it was obvious that there would be huge code duplication between scripts. It felt natural to split the shell script into a directory with many smaller shell scripts, and invoke each shell script in turn. Think of the /etc/init.d/ hierarchy and how it worked with System V init. This would allow re-use of useful snippets across several machines. The next realization was that large parts of the shell script would exist just to create configuration files, such as /etc/network/interfaces. It would be easier to modify the content of those files if they were stored as files in a separate directory, an “overlay” stored in a sub-directory overlay/, and copied into the file system’s hierarchy with rsync. The final realization was that it made sense to run one set of scripts before rsync’ing in the configuration files (to be able to install packages or set things up for the configuration files to make sense), and one set of scripts after the rsync (to perform tasks that require some package to be installed and configured). These sets of scripts were called the “pre-tasks” and “post-tasks” respectively, and stored in sub-directories called pre-tasks.d/ and post-tasks.d/.
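
To make the flow concrete, here is a rough sketch of the kind of update cycle each machine would run from cron under this model. It is my own illustration of the idea rather than Cosmos’s actual code, and the repository path and the signature-verification step are simplified placeholders:

#!/bin/sh
# Hypothetical sketch of the per-machine update cycle described above; the
# paths and the signature check are placeholders, not Cosmos's implementation.
set -e

cd /var/cache/cosmos-repo       # local checkout of the machine-model repository
git pull --ff-only              # fetch the latest machine descriptions
git verify-commit HEAD          # stop unless the latest commit is OpenPGP-signed

for t in pre-tasks.d/*; do      # prepare: install packages, create users, ...
    [ -x "$t" ] && "$t"
done

rsync -a overlay/ /             # copy the configuration file overlay into place

for t in post-tasks.d/*; do     # finish: restart services that use the new files
    [ -x "$t" ] && "$t"
done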

I started putting what would become Cosmos together during February 2012. Incidentally, I had been using etckeeper on our machines and reading its source code, and it greatly inspired the internal design of Cosmos. The git history shows well how the ideas evolved (Cosmos was even initially called Eve, but in retrospect I didn’t like the religious connotations), and there were a couple of rewrites on the way, but on the 28th of February I pushed out version 1.0. It was in total 778 lines of code, with at least 200 of those lines being the license boilerplate at the top of each file. Version 1.0 had a debian/ directory, and I built the dpkg file and started to deploy it on some machines. There were a couple of small fixes in the next few days, but development stopped on March 5th 2012. We started to use Cosmos, and converted more and more machines to it, and I quickly also converted all of my home servers to use it. And even my laptops. It took until September 2014 to discover the first bug (the fix is a one-liner). Since then there haven’t been any real changes to the source code. It is in daily use today.

The README that comes with Cosmos gives a more hands-on approach to using it, which I hope will serve as a starting point if the above introduction sparked some interest. I hope to cover more about how to use Cosmos in a later blog post. Since Cosmos does so little on its own, to make sense of how to use it you want to see a Git repository with machine models. If you want to see how the Git repository for my own machines looks, you can see the sjd-cosmos repository. Don’t miss its README at the bottom. In particular, its global/ sub-directory contains some of the foundation, such as OpenPGP key trust handling.