Corresponding parent ticket: #5734

Introduction

Why?

  • Some pieces of our infrastructure are critical to e.g.:

    • the development process (if the ISO build fails, developers cannot work)
    • the release process -- which may block us from putting out emergency security fixes
    • users (if the APT repository is down, the "additional software packages" persistence feature is broken)
  • We want to avoid contributors getting used to ignoring alerts sent by our CI system. The more false positives there are, the more they will "learn" to do so. Here we want to reduce the rate of false positives caused by malfunctioning infrastructure.

  • We want to shorten the dev/feedback loop for sysadmins when they deploy changes, and also when changes are applied automatically (e.g. Puppet agent runs, or automatic APT upgrades).

  • We want to be notified when a service we run doesn't come back up properly post-reboot, without having to manually test every service.

  • We want to minimize the rate of non-sysadmins discovering and reporting problems before we learn about them ourselves. This is highly subjective, but replying "we're aware of this problem and are working on it" is much more confidence-inspiring than "really, it's broken?"

Nomenclature

Here, we call:

  • machine: a computer (be it bare metal or virtual) and its operating system
  • monitored machine: a machine we monitor
  • monitoring machine: the machine(s) that monitors the... monitored machines
  • monitoring system, or monitoring setup: all the software components that we run so that the monitoring machine can monitor the monitored ones, and their configuration

Note that the monitoring machine may very well itself be monitored at the same time (be it by itself, or by another monitoring machine).

Requirements

Human interface

The monitoring system:

  • MUST send email notifications to the sysadmin(s) in charge, so that downtime is kept short.
  • MUST offer an overview of the status of our systems, via a web interface that works within Tor Browser with the security slider set to Medium-High.
  • MAY additionally offer a read-only version of this overview, that we may want to make available to selected contributors, or anonymous users. Needless to say, this must be carefully balanced with the security implications of such a system (in other words, a set of exported static HTML pages is totally fine, but a huge dynamic web application is probably a non-starter).
  • MUST support configuring, with per-check/per-service granularity, a threshold of N failures in a row before an alert is raised. Still, it SHOULD support triggering alerts based on the frequency of such failures, even when they never fail twice in a row (we don't want to miss the fact that $service is down for 5 minutes every day). Implementation details may vary, but you get the idea; a sketch of this logic follows.
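
To make this concrete, here is a minimal Python sketch (with made-up class names and thresholds, so purely illustrative) of the kind of alerting logic we have in mind: alert after N failures in a row, and also when failures recur too often within a sliding time window, even if they never happen twice in a row.

    import time
    from collections import deque

    class CheckAlerter:
        """Illustrative only: alert on N consecutive failures, or on too
        many failures within a sliding time window."""

        def __init__(self, max_consecutive=3, window_seconds=86400,
                     max_in_window=5):
            self.max_consecutive = max_consecutive
            self.window_seconds = window_seconds
            self.max_in_window = max_in_window
            self.consecutive = 0
            self.failure_times = deque()

        def record(self, ok, now=None):
            """Record one check result; return True if an alert should fire."""
            now = time.time() if now is None else now
            if ok:
                self.consecutive = 0
            else:
                self.consecutive += 1
                self.failure_times.append(now)
            # Forget failures that fell out of the sliding window.
            while (self.failure_times
                   and now - self.failure_times[0] > self.window_seconds):
                self.failure_times.popleft()
            return (self.consecutive >= self.max_consecutive
                    or len(self.failure_times) >= self.max_in_window)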

Threat model

Compromised monitored machine

  • We do not try to prevent it from reporting wrong information (including missing information) about itself.
  • It MUST NOT result in a compromise of the monitoring machine.
  • It MUST NOT be able to DoS the sysadmin(s) in charge, e.g. by flooding them with alerts.
  • It MUST NOT result in a compromise of the network traffic between other monitored machines and the monitoring machine (e.g. if that traffic is encrypted, the monitored machines MUST NOT use the same private key).
  • It SHOULD NOT be able to alter the information about other monitored machines.

Compromised monitoring machine

  • We do not try to prevent it from DoSing the sysadmin(s) in charge, e.g. by flooding them with alerts.
  • We do not try to prevent it from reporting wrong information about the monitored machines.
  • It MUST NOT be able to run arbitrary code as root on any of the monitored machines.
  • It SHOULD NOT be able to run arbitrary code as a non-privileged user on any of the monitored machines.

Network attacker

Here, we consider an attacker that may be active or passive, and can sit at any point they choose on the Internet.

We accept the risk that a network attacker:

  • can enumerate the machines and services we monitor;
  • can view the reports, test results, and similar information about monitored services that the monitoring system needs to learn; this of course implies that we should be careful about what kind of information flows this way: it MUST NOT be a big deal if it leaks into the hands of an adversary;
  • can DoS our monitoring, e.g. by blocking network connections;
  • can spoof the reports, test results, and the like about monitored services that a client has no credible means to authenticate.

However, a network attacker:

  • SHOULD NOT be able to spoof the reports, test results, and the like that monitored machines send about themselves;
  • MUST NOT be able to run arbitrary code on the monitored machines;
  • MUST NOT be able to run arbitrary code on the monitoring machine.

Availability, sustainability

Here, we assume that the entire monitoring system has both software components that run on the monitored machines (that we call the "agent"), and software components that run on the monitoring machine (that we call the "server"). Below, the agent implicitly includes anything needed for basic usage (plugins, checks, whatever); and similarly, the server implicitly includes its web interface, and anything needed for basic usage (plugins, checks, etc.).

  • The agent MUST be readily available in all of Debian oldstable, stable, and testing -- possibly thanks to pre-existing and well-maintained official backports. All these versions of the agent MUST be compatible with the chosen version of the server.

  • The server MUST be readily available either in current Debian stable (Jessie) or in current Debian testing (Stretch). We are considering running the version from Debian testing mainly because it might avoid a costly upgrade process in a couple of years, e.g. to switch to the next major, incompatible version of the software.

  • Both the agent and the server MUST be actively maintained in all the versions of Debian we care about (see above). Hint: this excludes Nagios 4.

  • Both the agent and the server MUST be DFSG-free.

  • For all involved software, the upstream project MUST be mature and active, and MUST have a confidence-inspiring future: we can't afford to migrate to a totally different monitoring setup in three years, to the extent that this can be foreseen. Hint: given Nagios 4 is not an option (see above), this in turn excludes all older versions of Nagios.

  • It SHOULD be realistically possible for external contributors to have patches merged into the upstream codebase of the involved software.

  • All the involved software MUST have a not-too-scary security track record.

Configuration

Here, we have two major desires. One is the ability for humans to easily review the monitoring system's configuration, or changes proposed to it, so that contributions are made easier. The other is the ability to include monitoring aspects within the description of the services we run, in a self-contained way, so that describing them in Puppet is easier. Note that a system that satisfies the second requirement has a good chance of mostly satisfying the first one as well.

The chosen monitoring system:

  • SHOULD allow encoding, in the description of a service (read: in the corresponding Puppet class), how it needs to be monitored; see the sketch after this list.

    • Additionally, if this optional (but warmly welcome) requirement is satisfied, then the "shared Puppet modules" we use SHOULD already support the chosen monitoring system (hint: in practice, this means something compatible with Nagios).
    • Note: this gives us for free the ability to review the monitoring configuration for service checks, but it is unrelated to our ability to review the global configuration of the server components that run on the monitoring machine.
  • SHOULD allow humans to easily review the service checks configuration. Really, that's a strong SHOULD. A system that doesn't make this possible will need to have very serious advantages in other areas to be attractive to us.

  • SHOULD allow humans to review the global configuration of the server components that run on the monitoring machine. This assumes that said configuration is mostly static, and is unaffected when adding or modifying service checks.
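
To illustrate what "encoding monitoring in the service description" could look like in the end, here is a hypothetical Python sketch that renders a declarative, per-service description into Nagios-style service definitions; in practice Puppet exported resources would play this role, and every host, command, and parameter below is made up:

    # Each entry is a self-contained, human-reviewable description of how
    # one service is monitored (hypothetical data).
    SERVICES = [
        {"host": "apt.example.org", "description": "stable APT suite over HTTP",
         "check": "check_http", "args": "-H apt.example.org"},
        {"host": "jenkins.example.org", "description": "Jenkins web interface",
         "check": "check_http", "args": "-H jenkins.example.org -S"},
    ]

    def render_service(svc):
        # Nagios-style syntax: "command!arguments".
        return ("define service {\n"
                f"    host_name           {svc['host']}\n"
                f"    service_description {svc['description']}\n"
                f"    check_command       {svc['check']}!{svc['args']}\n"
                "}\n")

    if __name__ == "__main__":
        print("\n".join(render_service(s) for s in SERVICES))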

Adequacy to our resources

Being able to operate the monitoring system for 20-50 monitored systems MUST NOT require Tails sysadmins to invest lots of time and become experts at hand-holding a complex software stack: the main focus of our system and automation engineers shall not become monitoring. For example, we don't want a monitoring system that is trivial to set up for monitoring 5-10 hosts, but requires adding more and more moving parts and complex optional components to scale up to 50 hosts.

Miscellaneous

  • We run Tor hidden services that we want to monitor, so the monitoring system MUST allow using a configured SOCKS proxy for specific checks (worst case, for all checks, but then every other check would needlessly go through Tor too). Wrapping checks with torsocks might be an acceptable option, depending on how involved and hackish this would be. The ability to retry, and to not notify on the first error, is interesting here, since Tor connections fail intermittently.
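
As an example of the torsocks-free option, here is a minimal sketch of a check that reaches a hidden service through a SOCKS proxy, using the PySocks library; the helper name and timeout are assumptions:

    import socks  # PySocks

    def check_onion_tcp(onion, port,
                        proxy_host="127.0.0.1", proxy_port=9050):
        """Return True if we can open a TCP connection to onion:port
        through the given SOCKS5 proxy (hypothetical helper)."""
        s = socks.socksocket()
        # rdns=True makes the proxy resolve the .onion name itself.
        s.set_proxy(socks.SOCKS5, proxy_host, proxy_port, rdns=True)
        s.settimeout(60)
        try:
            s.connect((onion, port))
            return True
        except (socks.ProxyError, OSError):
            return False
        finally:
            s.close()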

Hosting of the monitoring machine

  • The monitoring machine MUST be a virtual machine.
  • We MUST be able to administrate the OS of the monitoring machine ourselves: we need to be root, we need a Puppet agent that talks to our own puppetmaster, and we want to do the initial OS installation.
  • The monitoring machine MUST be hosted on infrastructure managed by people the Tails sysadmins trust quite a bit.
  • The people who manage the underlying hardware and infrastructure MUST be reactive and easy to get in touch with.
  • We MUST be given out-of-band access to the monitoring machine.
  • The monitoring machine MUST have unfiltered access to the Internet, and SHOULD be assigned at least one public IPv4 address.
  • Hosting MUST be affordable (say, max. 20€/month).
  • The monitoring machine SHOULD allow at least some flexibility regarding future "hardware" upgrades (e.g. allocating more disk space, memory, CPU cores).
  • TODO: exact hardware specifications, depending on the chosen monitoring system. Let's keep in mind that collecting exported Puppet resources is expensive.

Service and system checks

Below, CRITICAL, HIGH, MEDIUM and LOW are priority levels for the implementation of these checks.

For a description of individual services, see the sysadmins page.

All systems

  • HIGH: up and running!
  • HIGH: disk space usage (bytes and inodes)
  • HIGH: memory usage
  • MEDIUM: Puppet agent last run
  • MEDIUM: APT indices are fresh (i.e. apt-get update was successfully run recently; a sketch follows this list)
  • MEDIUM: systemctl is-system-running (see #8262)
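
As an example, here is a hypothetical sketch of the "APT indices" check, following the usual Nagios plugin convention (exit 0 = OK, 1 = WARNING, 2 = CRITICAL); the last successful apt-get update is approximated by the age of /var/cache/apt/pkgcache.bin, and the thresholds are made up:

    import os
    import sys
    import time

    CACHE = "/var/cache/apt/pkgcache.bin"
    WARN, CRIT = 2 * 86400, 7 * 86400  # thresholds, in seconds

    def main():
        try:
            age = time.time() - os.stat(CACHE).st_mtime
        except FileNotFoundError:
            print(f"CRITICAL: {CACHE} is missing")
            return 2
        if age > CRIT:
            print(f"CRITICAL: APT indices are {age / 86400:.1f} days old")
            return 2
        if age > WARN:
            print(f"WARNING: APT indices are {age / 86400:.1f} days old")
            return 1
        print("OK: APT indices are fresh")
        return 0

    if __name__ == "__main__":
        sys.exit(main())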

APT repository

  • CRITICAL: stable APT suite over HTTP
  • CRITICAL: freezable APT repository, once it exists

Bitcoind

  • MEDIUM: compare getblockcount with what the Internet says it should be (probably requires exporting the output of bitcoin-cli getblockcount to a place that's readable by the monitoring agent)
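
A hypothetical sketch of such a comparison: read the exported block count and compare it with what a public block explorer reports. The file path, explorer URL and allowed lag are all assumptions:

    import urllib.request

    EXPORTED = "/var/lib/monitoring/bitcoind-blockcount"  # written by e.g. a cron job
    EXPLORER = "https://blockchain.info/q/getblockcount"  # assumed to return a bare number
    MAX_LAG = 10  # how many blocks behind we tolerate

    def check_blockcount():
        with open(EXPORTED) as f:
            local = int(f.read().strip())
        with urllib.request.urlopen(EXPLORER, timeout=60) as resp:
            remote = int(resp.read().strip())
        # We only worry about our node falling behind the network.
        return remote - local <= MAX_LAG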

BitTorrent

  • LOW: last Tails release is seeded

Gitolite

  • MEDIUM: git pull or git clone a test repository over all supported protocols (currently: git:// and SSH)
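
For example (a sketch, with made-up repository URLs), git ls-remote is a cheap way to exercise each protocol without a full clone:

    import subprocess

    TEST_REPOS = [
        "git://git.example.org/monitoring-test.git",
        "ssh://git@git.example.org/monitoring-test.git",
    ]

    def check_gitolite():
        for url in TEST_REPOS:
            try:
                result = subprocess.run(["git", "ls-remote", url, "HEAD"],
                                        capture_output=True, timeout=120)
            except subprocess.TimeoutExpired:
                return False
            if result.returncode != 0:
                return False
        return True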

git-annex

  • HIGH: our Tor Browser archive must be reachable over HTTP, and contain directories with tarballs

Jenkins

  • CRITICAL: the HTTP server must be up, and unauthenticated connections must be forbidden (this may require installing its TLS certificate, or skipping certificate validation, or something similar)
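
A minimal sketch of this check, assuming a made-up URL and leaving certificate validation enabled: the service is healthy if it answers, and answers an unauthenticated request with 401 or 403:

    import urllib.error
    import urllib.request

    def check_jenkins(url="https://jenkins.example.org/"):
        try:
            urllib.request.urlopen(url, timeout=60)
        except urllib.error.HTTPError as e:
            # Up, and properly refusing anonymous access.
            return e.code in (401, 403)
        except urllib.error.URLError:
            return False  # down or unreachable
        return False  # up, but anonymous access was allowed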

Nightly builds

rsync

  • CRITICAL: check, over rsync://, that expected directories are there
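
A sketch of what this could look like, with made-up module and directory names:

    import subprocess

    def check_rsync(url="rsync://rsync.example.org/tails/",
                    expected=("alpha", "stable")):
        """List the module and make sure the expected directories show up."""
        try:
            result = subprocess.run(["rsync", "--list-only", url],
                                    capture_output=True, text=True,
                                    timeout=120)
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0:
            return False
        return all(name in result.stdout for name in expected)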

Test suite infrastructure

  • HIGH: the (fake or limited) SSH and SFTP access used by core contributors and robots when running the test suite must be up

Website

WhisperBack relay

  • HIGH: SMTP server is up
  • MEDIUM: email is actually relayed (would be truly good to have, but hard to implement, so the cost/benefit ratio is likely to be pretty bad)
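
For the first check above, a minimal sketch (the host name is an assumption): connect to the SMTP port and expect a successful EHLO. The second check would need a round-trip probe (send a message, then poll a test mailbox), which is left out here:

    import smtplib

    def check_smtp(host="whisperback.example.org", port=25):
        try:
            with smtplib.SMTP(host, port, timeout=60) as smtp:
                code, _ = smtp.ehlo()
                return 200 <= code < 300  # 250 expected
        except (smtplib.SMTPException, OSError):
            return False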

XMPP server

  • MEDIUM: responds on the TCP/IP port it is listening on