Highly-available monitoring with Prometheus and Alertmanager on NixOS

This post is the end result of a year-long nerd-snipe caused by a real problem:

How can I be sure that I am told when my site goes down, when the service that tells me can also go down?

Strange issues

I host multiple services on multiple servers.

These have been running for anywhere from two years to eleven years. Over that time, all sorts of weird things have gone wrong:

  • AWS has lost some of my Route53 DNS records without telling me.

  • A space leak in Social Dance Today took down the entire server.

  • EC2 instances have run out of CPU credits and have gotten stuck.

  • An EC2 server has gotten so old that amazonka's onEC2 :: IO Bool started returning False.

  • Hetzner's DHCP servers randomly stopped letting my server lease an IP address.

The point is: These services and servers have gone down for a number of unpredictable reasons. These reasons were not only unknown but they were unknown unknowns. I had not even imagined that these things could happen before they did.

Getting services back online

My services go down for whatever reason. What do I do next? I'd like to get them back up again.

Some solutions let me do this automatically, but some don't. Assuming a service cannot be brought back automatically, I first need to be made aware that it is down.

Auto-restart

You can configure systemd to automatically restart a service that has failed.

[Service]
# Also restart on success; we need the service to stay online.
Restart=always
# Make sure that restarts don't happen too quickly.
RestartSec=1

[Unit]
# Make sure that Restart=always is honoured.
StartLimitIntervalSec=0
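
On NixOS, the same settings go into the service's definition. A minimal sketch, assuming a hypothetical service called myservice:

{
  systemd.services.myservice = {
    unitConfig = {
      # Make sure that Restart=always is honoured indefinitely.
      StartLimitIntervalSec = 0;
    };
    serviceConfig = {
      # Also restart on success; we need the service to stay online.
      Restart = "always";
      # Make sure that restarts don't happen too quickly.
      RestartSec = 1;
    };
  };
}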

This can resolve downtime from services that die transiently. Examples of such downtime could include sporadic segfaults.

This solution does not help with downtime that comes from a service getting stuck, or from extra-service problems like someone pulling the plug.

Watchdog

The systemd watchdog lets you ask systemd to kill a service if the service has not notified the watchdog within a given amount of time.

For example: you tell the watchdog to kill the service if it has not been notified in 30 seconds, and you program the service to notify the watchdog every 15 seconds.

[Service]
# Ask systemd to kill the service if the service has not
# notified the watchdog in 30 seconds.
WatchdogSec=30
# Allow subprocesses and threads to notify the watchdog.
NotifyAccess=all
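
On NixOS this again goes into the service's definition. Note that the service itself must periodically send a WATCHDOG=1 notification (for example via sd_notify) for this to work. A sketch, again for a hypothetical myservice:

{
  systemd.services.myservice = {
    serviceConfig = {
      # Ask systemd to kill the service if it has not notified
      # the watchdog in 30 seconds.
      WatchdogSec = 30;
      # Allow subprocesses and threads to notify the watchdog.
      NotifyAccess = "all";
      # As in the previous section, so the service comes back up
      # after the watchdog has killed it.
      Restart = "always";
    };
  };
}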

Any downtime that affects only the service and that can be fixed with a restart will then be fixed automatically. Examples of such downtime include sporadic deadlocks.

This solution still does not help with downtime that comes from extra-service problems like someone pulling the plug.

Monitoring

From here on this post will describe solutions that alarm the administrator (me) about downtime instead of trying to fix anything automatically.

Health checks

One such solution is to point a health checker at the service. A health checker would periodically try to access my site to see if it is still online and send me an alarm otherwise.

The health checker needs to be a separate process in order to send any meaningful alarms, otherwise it would be down whenever the service is down. Ideally the health checker would also be on another machine so that I can still get alarms if the entire machine hosting the service is down.
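
As a sketch of the idea (every name here is hypothetical, and the actual alarm delivery is left out), a naive health checker on NixOS could be a systemd timer that periodically curls the site and triggers a separate notification unit on failure:

{ pkgs, ... }:
{
  systemd.services.check-mysite = {
    # Fail if the site does not answer within 10 seconds.
    script = ''
      ${pkgs.curl}/bin/curl --fail --silent --max-time 10 https://mysite.example > /dev/null
    '';
    serviceConfig.Type = "oneshot";
    # Hypothetical unit that would send the actual alarm.
    onFailure = [ "notify-failure.service" ];
  };
  systemd.timers.check-mysite = {
    wantedBy = [ "timers.target" ];
    timerConfig.OnCalendar = "minutely";
  };
}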

This raises the question: what if the machine hosting the health checker goes down at the same time as the machine hosting the service?

I know this sounds unlikely, but this has happened to me around once per year in the past eleven years of hosting this blog. I would like my solution to solve this problem as well.

Nerd-snipe

This is where the nerd-snipe happened. Clearly I would need multiple machines to solve this problem, but how would they have to communicate to be sure that I would definitely receive an alarm? There are some really interesting sub-problems here:

  • Can we use a dead man's switch instead of a health check to avoid problems with firewalls? (I.e. use a watchdog instead of a health check.)

  • Can we avoid assuming that machines' clocks are synchronised?

  • Can we avoid assuming that machines' clocks run at the same speed?

  • Can we avoid assuming that all nodes can reach all other nodes? I.e. non-clique topologies.

  • Can we avoid the difficult problem of leader election?

  • When we get alarms, can we avoid getting way too many of them?

I started on a proof of concept, thinking "this should be doable". Instead, I found myself at 03:00 on a Saturday reading the academic literature on distributed failure detectors, going back to an old(ish) 300-page book explaining the topic in detail. I even started on proofs and asked a mathematician friend for help with them.

Eventually (yes, very late), I realised: The problem is probably a lot simpler in practice than in papers, and someone else must have done this before.

Prometheus & Alertmanager

It turns out that the problem is much easier in practice:

  • We assume no firewalls will get in the way. (This is fine in practice; you can put your failure detector inside the network.)

  • We don't need to rely on any properties of timestamps across machines.

  • We can have all nodes talk to each other, no need for complicated topologies.

  • Leader election is a solved problem, even if it is not easy to implement yourself.

  • A two-tier system can de-duplicate alarms.

Someone has indeed done this before, but I haven't found great explanations of how to do it in practice. A bird's-eye view of the system looks like this:

  • The system is two-tiered: we use multiple redundant health checkers (Prometheus) combined with highly-available alarmers (Alertmanager).

  • The health checkers all use the same configuration and watch the same services. They are duplicated so that all but one of them can go down.

  • These health checkers send all alerts to all the alarmers.

  • The alarmers gossip with each other about which alerts to turn into alarms. They are also duplicated so that all but one of them can go down, and so that they can agree on which alerts to de-duplicate before sending alarms.

Prometheus and Alertmanager can already do all of this. I was able to throw away all of my nerd-snipe proof-of-concept work and "just" set up three Prometheus instances and three Alertmanagers in a cluster.

An extra nice side benefit of this setup is that Prometheus doesn't just check that your services are online. It is a full time-series scraper that can store all sorts of metrics about your services, and it can send you alerts about problems other than services being down. For example, I now get alerts when my backups haven't run (successfully) for a while, or when the storage for those backups is getting full.

Prometheus and Alertmanager on NixOS

There are really nice NixOS modules for Prometheus servers.

Here is a snippet based on my configuration. This configuration is then duplicated (slightly differently) on three servers.

{ config, pkgs, ... }:
{
  # Keep these in sync with the other prometheus servers
  services.prometheus = {
    enable = true;
    webExternalUrl = "https://prometheus.thisdomain.com";
    globalConfig = {
      scrape_interval = "10s";
      # Use identical external_labels to have 
      # alertmanagers deduplicate alerts.
      external_labels = {
        monitor = "global";
      };
    };

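    # Every Prometheus instance sends every alert to every alertmanager;
    # the alertmanagers then deduplicate among themselves.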
    alertmanagers = [{
      scheme = "https";
      static_configs = [{
        targets = [
          "alertmanager.thisdomain.com"
          "alertmanager.otherdomain.com"
          "alertmanager.thirddomain.com"
        ];
      }];
    }];

    scrapeConfigs = [
      {
        job_name = "alerts";
        static_configs = [{
          targets = [
            "alertmanager.thisdomain.com"
            "alertmanager.otherdomain.com"
            "alertmanager.thirddomain.com"
            "prometheus.thisdomain.com"
            "prometheus.otherdomain.com"
            "prometheus.thirddomain.com"
          ];
        }];
      }
    ];

    ruleFiles = [
      "${pkgs.writeText "general-rule" (builtins.toJSON {
        groups = [
          {
            name = "instance_down";
            rules = [{
              alert = "InstanceDown";
              expr = "up < 1";
              for = "1m";
              labels = { severity = "alarm"; };
              annotations = {
                summary = "Instance {{ $labels.instance }} is down";
              };
            }];
          }
        ];
      })}"
    ];
  };

  services.nginx.virtualHosts."prometheus.thisdomain.com" = {
    enableACME = true;
    forceSSL = true;
    locations."/" = {
      proxyPass = "http://localhost:${builtins.toString config.services.prometheus.port}";
      recommendedProxySettings = true;
    };
  };
}
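
The backup alerts mentioned earlier fit into the same ruleFiles mechanism, as another entry in the groups list above. A sketch, assuming a hypothetical metric backup_last_success_timestamp_seconds that your backup job would need to export:

{
  name = "backups";
  rules = [{
    alert = "BackupTooOld";
    # Hypothetical metric: unix timestamp of the last successful backup.
    expr = "time() - backup_last_success_timestamp_seconds > 2 * 86400";
    for = "1h";
    labels = { severity = "alarm"; };
    annotations = {
      summary = "Backups of {{ $labels.instance }} have not succeeded in over two days";
    };
  }];
}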

The same goes for Alertmanager; here is the corresponding snippet:

{ config, ... }:
let
  # The port the alertmanager instances use to gossip with each other.
  clusterPort = 9094;
in
{
  networking.firewall.allowedTCPPorts = [ clusterPort ];

  services.prometheus.alertmanager = {
    enable = true;
    webExternalUrl = "https://alertmanager.thisdomain.com";
    clusterPeers = [
      "alertmanager.otherdomain.com"
      "alertmanager.thirddomain.com"
    ];
    extraFlags = [
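      # myIP is this machine's own publicly reachable IP address.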
      "--cluster.advertise-address=${myIP}:${builtins.toString clusterPort}"
      "--cluster.listen-address=0.0.0.0:${builtins.toString clusterPort}"
    ];
    configuration = {
      global = { };
      route = {
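        # group_wait: how long to buffer the first alerts of a group before sending a notification.
        # group_interval: how long to wait before notifying about new alerts added to an existing group.
        # repeat_interval: how long to wait before re-sending a notification for a still-firing alert.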
        group_wait = "10s";
        group_interval = "1m";
        repeat_interval = "1h";

        receiver = "telegram"; # Default receiver
      };
      receivers = [
        {
          name = "telegram";
          telegram_configs = [{
            send_resolved = true;
            bot_token_file = config.age.secrets.telegram-bot-token.path;
            chat_id = 0; # Replace with your numeric Telegram chat id.
          }];
        }
      ];
    };
  };

  services.nginx.virtualHosts."alertmanager.thisdomain.com" = {
    enableACME = true;
    forceSSL = true;
    locations."/" = {
      proxyPass = "http://localhost:${builtins.toString config.services.prometheus.alertmanager.port}";
      recommendedProxySettings = true;
    };
  };
}

With a setup like this, any two of the three Prometheus servers or Alertmanagers can go down and alerts will still be sent as usual.
