
# Monitoring

## Stack Overview

```mermaid
graph TD
    subgraph "london-a (FreeBSD)"
        Prometheus[":9090 Prometheus"] -->|query| Grafana[":3000 Grafana"]
    end

    Prometheus -->|scrape over Tailscale| NE["node_exporter<br/>(all hosts) :9100"]
    Prometheus -->|scrape over Tailscale| SE["smartctl_exporter<br/>(london-b) :9633"]
    Prometheus -->|scrape over Tailscale| PE["plex_exporter<br/>(london-b)"]
```

Both Prometheus and Grafana are accessible via:

- grafana.pez.sh (behind Authelia)
- prometheus.pez.sh (behind Authelia)

## Prometheus

Prometheus runs on london-a and scrapes metrics from exporters across the fleet. All scrape targets are reached over Tailscale — no ports need to be exposed on the public internet.

### Scrape Targets

| Target | Host | Port | What |
|---|---|---|---|
| node_exporter | All hosts | 9100 | System metrics (CPU, memory, disk, network) |
| smartctl_exporter | london-b | 9633 | Disk SMART health data |
| prom-plex-exporter | london-b | (varies) | Plex streaming activity |

`node_exporter` is deployed to every host via Ansible; it's one of the first things installed on a new server.

### Adding a scrape target

1. Deploy the exporter to the host (via Ansible or Docker)
2. Add the target to the Prometheus config in `services/prometheus/`
3. Deploy the updated config (Ansible or manual restart)
4. Verify that the target shows up on the Prometheus targets page
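The config change in step 2 is just a new entry under `scrape_configs`. A minimal sketch, assuming Tailscale MagicDNS hostnames resolve on the mesh — the job names and file layout here are illustrative, not taken from the repo:

```yaml
# services/prometheus/prometheus.yml (fragment) -- hypothetical layout
scrape_configs:
  - job_name: node_exporter
    static_configs:
      # Tailscale hostnames, so nothing is exposed publicly
      - targets:
          - london-a:9100
          - london-b:9100
          - helsinki-a:9100

  - job_name: smartctl_exporter
    static_configs:
      - targets:
          - london-b:9633
```

After reloading Prometheus, the new job should appear under Status → Targets.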

## Grafana

Grafana reads from Prometheus and provides dashboards for everything worth watching.

### Dashboards

| Dashboard | What it shows |
|---|---|
| Server Health | CPU, memory, disk usage, network I/O across all hosts |
| ZFS | Pool status, usage, scrub results for london-b |
| SMART | Disk health metrics, temperature, error counts |
| Plex | Active streams, transcoding status, library stats |

### Adding a dashboard

Dashboards are defined in `services/grafana/`. Export as JSON from Grafana and commit to the repo to keep them in version control.
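Grafana's file-based provisioning is one way to have the committed JSON loaded automatically instead of re-importing by hand. A sketch only — the paths and provider name are assumptions, and the actual `services/grafana/` layout may differ:

```yaml
# grafana/provisioning/dashboards/pez.yaml -- hypothetical provider config
apiVersion: 1

providers:
  - name: pez-dashboards
    type: file
    # Re-read the JSON files periodically so committed edits show up
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
```

Any dashboard JSON dropped into `options.path` is then picked up by Grafana without a manual import.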

## Exporters

### node_exporter

Standard Prometheus node exporter. Deployed on every host. Provides system-level metrics:

- CPU usage and load averages
- Memory usage
- Disk space and I/O
- Network traffic
- System uptime

Installed via Ansible as part of the base server setup.
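That base-setup step might look something like the following Ansible play — a sketch under stated assumptions, since the real playbook isn't shown here and the package name can differ per OS:

```yaml
# Illustrative only: task and package names are assumptions,
# not the repo's actual playbook
- name: Install and start node_exporter
  hosts: all
  become: true
  tasks:
    - name: Install node_exporter package
      ansible.builtin.package:
        name: node_exporter
        state: present

    - name: Enable and start the service
      ansible.builtin.service:
        name: node_exporter
        state: started
        enabled: true
```

Using the generic `ansible.builtin.package` module keeps the same play working across FreeBSD (pkg) and Linux hosts.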

### smartctl_exporter

Runs on london-b (the ZFS storage server with 8 spinning disks). Exposes SMART data from all drives:

- Temperature
- Reallocated sectors
- Read/write error rates
- Power-on hours
- Overall health assessment

Critical for catching dying drives before they take out a RAIDZ1 vdev.

### prom-plex-exporter

Runs on london-b. Scrapes the Plex API and exposes metrics about:

- Active streams
- Transcode sessions
- Library size
- User activity

Mostly for fun — it's satisfying to see the Plex dashboard light up when people are streaming.

## Status Page

status.pez.sh is a lightweight public status page that shows service availability.

- Pulls availability data from Prometheus
- Shows 90-day uptime history
- Hosted on helsinki-a at `/srv/status`
- Source: `RWejlgaard/pez-status`
- Not behind Authelia — it's public by design

## Alerting

Alerting rules are configured alongside the Prometheus config. Conditions worth alerting on:

- Host down (node_exporter unreachable)
- Disk space critical (>90% usage)
- ZFS scrub errors
- SMART drive failures
- High memory usage
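The first two conditions could be expressed as rules like the following. This is a sketch only — the rule-file name, thresholds, and durations are assumptions, not what `services/prometheus/` actually contains:

```yaml
# alerts.yml -- hypothetical rule file, loaded via rule_files in prometheus.yml
groups:
  - name: fleet
    rules:
      - alert: HostDown
        # up == 0 means Prometheus could not scrape the target
        expr: up{job="node_exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable"

      - alert: DiskSpaceCritical
        # Fires when a filesystem is more than 90% full
        expr: >
          (node_filesystem_avail_bytes{fstype!~"tmpfs|devfs"}
           / node_filesystem_size_bytes) < 0.10
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} is over 90% full"
```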

Grafana can also be configured with alert channels (email, webhooks, etc.) for dashboard-based alerts.