# Monitoring
## Stack Overview
Observability is a fully managed pipeline today: every host runs **Grafana Alloy** as the local collector, and everything ships to **Grafana Cloud**. Synthetic checks are also driven from Grafana Cloud, and alerts are routed to **PagerDuty**.
```mermaid
graph LR
    subgraph "Fleet (each host)"
        NE["node_exporter :9100"]
        SE["systemd_exporter :9558"]
        XE["host-specific<br/>exporters<br/>(smartctl, plex,<br/>octopus...)"]
        Alloy["alloy.service<br/>(Grafana Alloy)"]
        NE --> Alloy
        SE --> Alloy
        XE --> Alloy
    end
    Alloy -->|metrics, logs, traces| GC["Grafana Cloud<br/>pez.grafana.net"]
    SM["Synthetic Monitoring<br/>probes (London)"] -->|HTTPS GETs| Internet["https://*.pez.sh"]
    SM --> GC
    GC -->|alerts| PD["PagerDuty"]
```
`terraform/grafana/` is the source of truth for the Grafana Cloud side: the stack, Fleet Management collectors, fleet pipelines, dashboards, and synthetic checks. `terraform/pagerduty/` configures the on-call destination.
## Grafana Alloy (per-host collector)
Alloy runs as `alloy.service` on every host in the inventory. Each host is registered as a Grafana Fleet Management collector in `terraform/grafana/fleet_collectors.tf`, tagged with a `location` attribute (`london`, `copenhagen`, `cloud`) so pipelines can target subsets of the fleet.
Pipelines (what to scrape, how to relabel, where to ship) live in `terraform/grafana/fleet_pipelines/` and are pushed to Grafana Cloud as `grafana_fleet_management_pipeline` resources. The Alloy daemon on each host pulls its config from Fleet Management.
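To make the targeting idea concrete, here is a minimal Python sketch of how a pipeline could select collectors by their `location` attribute. It only models plain equality matching; it is not the Fleet Management API. The `Collector` class, the `select_collectors` helper, and the host-to-location mapping in `FLEET` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Collector:
    """Hypothetical model of a Fleet Management collector and its attributes."""
    name: str
    attributes: dict = field(default_factory=dict)


def select_collectors(collectors, **wanted):
    """Return names of collectors whose attributes equal every wanted key/value."""
    return [
        c.name
        for c in collectors
        if all(c.attributes.get(key) == value for key, value in wanted.items())
    ]


# Assumed mapping, for illustration only; the real values live in
# terraform/grafana/fleet_collectors.tf.
FLEET = [
    Collector("london-b", {"location": "london"}),
    Collector("copenhagen-a", {"location": "copenhagen"}),
    Collector("helsinki-a", {"location": "cloud"}),
]

if __name__ == "__main__":
    print(select_collectors(FLEET, location="london"))
```

Calling `select_collectors(FLEET)` with no attributes matches the whole fleet, which mirrors a pipeline that applies to every host.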
### Local exporters scraped by Alloy
| Exporter | Hosts | What |
|---|---|---|
| node_exporter | All hosts | CPU, memory, disk, network, system uptime |
| systemd_exporter | All hosts | Per-unit systemd state |
| smartctl_exporter (Docker) | london-b, copenhagen-a | Disk SMART data |
| prom-plex-exporter (Docker) | london-b | Plex streaming activity |
| octopus_exporter (Docker) | london-c | Octopus Energy electricity usage |
| Caddy `/metrics` | helsinki-a | HTTP request metrics, upstream health (per host) |
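
When a series is missing in Grafana Cloud, it can help to check the exporter directly before looking at Alloy. The sketch below fetches and parses the Prometheus text format from an exporter's `/metrics` endpoint. It assumes the exporter is reachable over plain HTTP on the ports shown in the diagram (9100 for node_exporter, 9558 for systemd_exporter) and that sample lines carry no trailing timestamps; `parse_metrics`, `fetch_metrics`, and `SAMPLE` are illustrative names, not part of this repo.

```python
from urllib.request import urlopen


def parse_metrics(text):
    """Parse Prometheus text-format samples into {series: value}.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    Assumes sample lines have no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field; the rest is the series.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_metrics(host, port=9100, timeout=5):
    """Fetch and parse /metrics from an exporter (9100 = node_exporter)."""
    with urlopen(f"http://{host}:{port}/metrics", timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Small hand-written sample so the parser can be tried without a live host.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
"""

if __name__ == "__main__":
    print(parse_metrics(SAMPLE))
```

On a host, `fetch_metrics("localhost", 9558)` would do the same check against systemd_exporter, assuming it listens on localhost.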
### Logs
Alloy ships systemd journal entries from every host to Grafana Cloud Logs. Log-derived alerts (e.g. SSH brute-force, mail server errors) can be configured directly in Grafana Cloud.
## Synthetic Monitoring
Grafana Cloud's Synthetic Monitoring service probes the public services over HTTPS from the London region every 10 minutes. The checks are configured in `terraform/grafana/synthetic_checks.tf`:
| Check | URL |
|---|---|
| pez_sh | https://pez.sh |
| pez_solutions | https://pez.solutions |
| jellyfin | https://jellyfin.pez.sh |
| plex | https://plex.pez.sh (auth header) |
| request | https://request.pez.sh |
| jellyfin_requests | https://jellyfin-requests.pez.sh |
| git | https://git.pez.sh |
Each check has a `ProbeFailedExecutionsTooHigh` alert wired up (3 failed executions in a 30-minute window).
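
The arithmetic behind that threshold is worth spelling out: at one probe every 10 minutes, a 30-minute window holds about three executions, so a threshold of 3 effectively means every probe in the window failed. The Python sketch below models this. The function names are hypothetical, and treating the condition as "failures >= threshold" is an assumption about how the alert is evaluated.

```python
def executions_per_window(window_minutes, frequency_minutes):
    """Number of whole probe executions that fit in the alert window."""
    return window_minutes // frequency_minutes


def should_alert(results, threshold=3):
    """Decide whether the alert fires for one window of probe results.

    results: iterable of booleans, True for a successful execution.
    Assumes the alert fires when failures reach the threshold (>=).
    """
    failures = sum(1 for ok in results if not ok)
    return failures >= threshold


if __name__ == "__main__":
    print(executions_per_window(30, 10))
    print(should_alert([False, False, False]))
    print(should_alert([False, True, False]))
```

Under these assumptions a single successful probe inside the window is enough to keep the alert quiet, so a brief blip will not page, while a full outage pages after roughly 30 minutes.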
## Alerting → PagerDuty
PagerDuty is configured in `terraform/pagerduty/`:
- A single service (`pez-infra`) receives alerts
- The escalation policy pages me directly
- The Grafana Cloud → PagerDuty integration sends every fired alert (synthetic check failures today; can be extended to log/metric alerts)
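
To test the paging path end to end without waiting for a real failure, a trigger event can be sent by hand. The sketch below builds a PagerDuty Events API v2 payload in Python. It assumes the `pez-infra` service has an Events API v2 integration; the routing key, summary, and source values are placeholders, and `build_trigger_event` and `send_event` are illustrative names.

```python
import json
from urllib.request import Request, urlopen

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
VALID_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build an Events API v2 'trigger' payload as a plain dict."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    return {
        "routing_key": routing_key,  # integration key of the target service
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }


def send_event(event, timeout=10):
    """POST the event to PagerDuty and return the HTTP status code."""
    request = Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Placeholder values; this only prints the payload and does not page anyone.
    event = build_trigger_event("PLACEHOLDER_KEY", "Test page", "manual-test")
    print(json.dumps(event, indent=2))
```

Keeping payload construction separate from `send_event` means the payload can be inspected or unit-tested without paging anyone.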
## Status Page
**status.pez.sh** is a public status page hosted on helsinki-a at `/srv/status`.
- Cron-driven static JSON (see `ansible/roles/status_page/`) — does not require Grafana Cloud to render
- Hosted directly by Caddy as a `file_server`
- Public by design (no Authelia)
- Source repo for the front-end: [RWejlgaard/pez-status](https://github.com/RWejlgaard/pez-status)
## History
Monitoring used to run locally on **london-a** (FreeBSD) with a self-hosted Prometheus + Grafana. When london-a was reinstalled as Proxmox VE, the local stack was retired and everything moved to Grafana Cloud + Alloy. Older docs (and a few legacy hard-coded IPs in helper scripts) may still reference `100.122.219.41:9090` — that endpoint no longer exists.