# Monitoring

## Stack Overview
```mermaid
graph TD
    subgraph "london-a (FreeBSD)"
        Prometheus[":9090 Prometheus"] -->|query| Grafana[":3000 Grafana"]
    end
    Prometheus -->|scrape over Tailscale| NE["node_exporter<br/>(all hosts) :9100"]
    Prometheus -->|scrape over Tailscale| SE["smartctl_exporter<br/>(london-b) :9633"]
    Prometheus -->|scrape over Tailscale| PE["plex_exporter<br/>(london-b)"]
```
Both Prometheus and Grafana are accessible via:
- grafana.pez.sh (behind Authelia)
- prometheus.pez.sh (behind Authelia)
## Prometheus
Prometheus runs on london-a and scrapes metrics from exporters across the fleet. All scrape targets are reached over Tailscale — no ports need to be exposed on the public internet.
### Scrape Targets
| Target | Host | Port | What |
|---|---|---|---|
| node_exporter | All hosts | 9100 | System metrics (CPU, memory, disk, network) |
| smartctl_exporter | london-b | 9633 | Disk SMART health data |
| prom-plex-exporter | london-b | (varies) | Plex streaming activity |
node_exporter is deployed to every host via Ansible. It's one of the first things that gets installed on a new server.
### Adding a scrape target
- Deploy the exporter to the host (via Ansible or Docker)
- Add the target to the Prometheus config in `services/prometheus/`
- Deploy the updated config (Ansible or manual restart)
- Verify it shows up on the Prometheus targets page
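As a sketch, the config change in the second step might look like this. The job names and Tailscale hostnames below are illustrative assumptions, not taken from the actual repo:

```yaml
# services/prometheus/prometheus.yml (fragment)
# Hypothetical job names and Tailscale MagicDNS hostnames for illustration.
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - london-a:9100
          - london-b:9100
          - helsinki-a:9100

  - job_name: smartctl
    static_configs:
      - targets:
          - london-b:9633
```

Because the targets resolve over Tailscale, nothing here needs a public IP or firewall rule.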
## Grafana
Grafana reads from Prometheus and provides dashboards for everything worth watching.
### Dashboards
| Dashboard | What it shows |
|---|---|
| Server Health | CPU, memory, disk usage, network I/O across all hosts |
| ZFS | Pool status, usage, scrub results for london-b |
| SMART | Disk health metrics, temperature, error counts |
| Plex | Active streams, transcoding status, library stats |
### Adding a dashboard

Dashboards are defined in `services/grafana/`. Export as JSON from Grafana and commit to the repo to keep them in version control.
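If the committed JSON is also provisioned back into Grafana from disk, the provider config might resemble the following. The file path and provider name are assumptions for illustration, not the actual setup:

```yaml
# Hypothetical Grafana dashboard provisioning file,
# e.g. services/grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: pez-dashboards          # illustrative provider name
    type: file
    options:
      # Directory where the JSON exports from the repo are mounted
      path: /etc/grafana/dashboards
```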
## Exporters

### node_exporter
Standard Prometheus node exporter. Deployed on every host. Provides system-level metrics:
- CPU usage and load averages
- Memory usage
- Disk space and I/O
- Network traffic
- System uptime
Installed via Ansible as part of the base server setup.
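A minimal sketch of what that Ansible step might look like, assuming a generic role layout; the package and service names below are illustrative, and FreeBSD hosts would differ (pkg, rc.d service names):

```yaml
# Hypothetical tasks/main.yml for a node_exporter role (Linux hosts).
- name: Install node_exporter
  ansible.builtin.package:
    name: prometheus-node-exporter   # package name varies by distro
    state: present

- name: Enable and start node_exporter
  ansible.builtin.service:
    name: prometheus-node-exporter
    state: started
    enabled: true
```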
### smartctl_exporter
Runs on london-b (the ZFS storage server with 8 spinning disks). Exposes SMART data from all drives:
- Temperature
- Reallocated sectors
- Read/write error rates
- Power-on hours
- Overall health assessment
Critical for catching dying drives before they take out a RAIDZ1 vdev.
### prom-plex-exporter
Runs on london-b. Scrapes the Plex API and exposes metrics about:
- Active streams
- Transcode sessions
- Library size
- User activity
Mostly for fun — it's satisfying to see the Plex dashboard light up when people are streaming.
## Status Page
status.pez.sh is a lightweight public status page that shows service availability.
- Pulls availability data from Prometheus
- Shows 90-day uptime history
- Hosted on helsinki-a at `/srv/status`
- Source: RWejlgaard/pez-status
- Not behind Authelia — it's public by design
## Alerting
Prometheus alerting rules can be configured in the Prometheus config. Alert conditions worth monitoring:
- Host down (node_exporter unreachable)
- Disk space critical (>90% usage)
- ZFS scrub errors
- SMART drive failures
- High memory usage
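The first two conditions above, sketched as Prometheus alerting rules. Thresholds, durations, and the `job="node"` label are illustrative assumptions, not taken from the live config:

```yaml
# Hypothetical services/prometheus/alerts.yml
groups:
  - name: fleet
    rules:
      - alert: HostDown
        expr: up{job="node"} == 0       # node_exporter unreachable
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "node_exporter on {{ $labels.instance }} is unreachable"

      - alert: DiskSpaceCritical
        # Fires when less than 10% of a filesystem remains free
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs|devtmpfs"}
            / node_filesystem_size_bytes) < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} is over 90% full"
```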
Grafana can also be configured with alert channels (email, webhooks, etc.) for dashboard-based alerts.