diff --git a/README.md b/README.md index 99689e8..1bb80a3 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,7 @@ Infrastructure-as-code monorepo for managing my homelab and cloud server fleet. ## What's in this repo - **Ansible** — Playbooks, roles, and inventory for configuring servers, deploying Docker-based services, and managing dotfiles -- **Terraform** — OpenTofu/Terraform configs for cloud resources (Cloudflare DNS, Hetzner servers) +- **Terraform** — OpenTofu/Terraform configs for cloud resources (Hetzner Cloud, Cloudflare DNS, Grafana Cloud, PagerDuty) - **Services** — Docker Compose definitions and config files for each self-hosted service - **Documentation** — Architecture decisions, networking topology, and operational guides @@ -13,54 +13,59 @@ Infrastructure-as-code monorepo for managing my homelab and cloud server fleet. ```mermaid graph TD - CF[Cloudflare
DNS + CDN] --> HEL[helsinki-a
Caddy proxy
Hetzner Cloud] + CF[Cloudflare
DNS + CDN] --> HEL[helsinki-a
Caddy proxy + SSO
Hetzner Cloud] HEL --> TS{Tailscale mesh} - TS --> LB[london-b
Storage, Docker services] - TS --> LA[london-a
Monitoring
Prometheus, Grafana] + TS --> LB[london-b
Storage, media
Docker + systemd] + TS --> LA[london-a
Proxmox VE hypervisor] + TS --> LC[london-c
Raspberry Pi
Octopus Energy exporter] TS --> CA[copenhagen-a
Gaming
Minecraft, WoW MaNGOS] TS --> NUR[nuremberg-a
Mail, poste.io] - TS --> CC[copenhagen-c
idle] + TS --> CC[copenhagen-c
Raspberry Pi
cloudflared, idle] + TS -.-> GC[Grafana Cloud
metrics, logs, traces] ``` -Traffic enters via Cloudflare DNS, hits a Caddy reverse proxy on a Hetzner cloud instance, and is forwarded to backend services running on various hosts connected over a Tailscale mesh network. Authentication is handled by Authelia with an LLDAP backend. +Traffic enters via Cloudflare DNS, hits a Caddy reverse proxy on a Hetzner cloud instance, and is forwarded to backend services running on various hosts connected over a Tailscale mesh network. Authentication for protected services is handled by Authelia with an LLDAP backend. Observability is shipped from every host to Grafana Cloud via Grafana Alloy. ### Hosts | Host | Location | OS | Role | |------|----------|-----|------| -| helsinki-a | Hetzner Cloud | Linux | Reverse proxy (Caddy), main traffic gateway | -| london-b | London | Linux | Primary storage (ZFS), Docker services | -| london-a | London | FreeBSD | Monitoring (Prometheus, Grafana) | -| nuremberg-a | Hetzner Cloud | Alpine Linux | Mail server (poste.io) | -| copenhagen-a | Copenhagen | Linux | Gaming servers (Minecraft, WoW/MaNGOS) | -| copenhagen-c | Copenhagen | Linux | Idle/available | +| helsinki-a | Hetzner Cloud (Helsinki) | Debian 13 | Reverse proxy (Caddy), SSO (Authelia + LLDAP), Bitwarden, Forgejo | +| london-b | London | Ubuntu 24.04 | Primary storage (ZFS), media servers, *arr stack | +| london-a | London | Debian 13 / Proxmox VE | Hypervisor (currently runs a Mac VM; platform for future VMs) | +| london-c | London | Debian 13 (Raspberry Pi) | Octopus Energy exporter, edge utility box | +| nuremberg-a | Hetzner Cloud (Nuremberg) | Debian 13 | Mail server (poste.io) | +| copenhagen-a | Copenhagen | Ubuntu 22.04 | Gaming servers (Minecraft, WoW/MaNGOS) | +| copenhagen-c | Copenhagen | Debian 12 (Raspberry Pi) | cloudflared tunnel, idle/available | ## Directory Structure ``` ├── ansible/ # Ansible playbooks, roles, inventory, and all managed files -│ ├── roles/ # Ansible roles (caddy, docker, dotfiles, etc.) 
+│ ├── roles/ # Ansible roles (caddy, docker, media_stack, proxmox_ve, etc.)
 │ ├── services/ # Docker Compose definitions and service configs
 │ ├── dotfiles/ # Shell config (fish, nvim, tmux, git, etc.)
+│ ├── playbooks/ # One-off playbooks (updates, reboots, status)
 │ └── scripts/ # Utility and maintenance scripts
-├── terraform/ # Terraform/OpenTofu for Cloudflare DNS, Hetzner servers
-└── docs/ # Architecture, networking, services, and monitoring docs
+├── terraform/ # Terraform/OpenTofu for Hetzner, Cloudflare, Grafana Cloud, PagerDuty
+└── docs/ # Architecture, networking, services, monitoring, and per-host docs
 ```

 ## Getting Started

 ### Prerequisites

-- SSH access to hosts via Tailscale
+- SSH access to hosts via Tailscale (all hosts SSH as `root`)
 - `ansible` for configuration management
 - `tofu` (OpenTofu) or `terraform` for infrastructure provisioning
+- `sops` + `age` for editing encrypted secrets

 ### Usage

 1. **Clone:** `git clone git@github.com:RWejlgaard/pez-infra.git`
 2. **Services:** Each service has its own directory under `ansible/services/` with a `docker-compose.yml` and config files
-3. **Deploy:** Ansible playbooks in `ansible/` handle deployment (see individual playbook docs)
-4. **Infrastructure:** Terraform configs in `terraform/` manage DNS and cloud resources
+3. **Deploy:** `cd ansible && make deploy` runs the unified `deploy.yml` against the whole fleet (or `make deploy-host HOST=`)
+4. **Infrastructure:** Terraform configs in `terraform/` manage Hetzner servers, Cloudflare DNS, Grafana Cloud, and PagerDuty

 ### Secrets

@@ -73,5 +78,6 @@ Detailed documentation lives in [`docs/`](docs/):
 - **[Architecture](docs/architecture.md)** — Network topology, traffic flow, design principles
 - **[Networking](docs/networking.md)** — Tailscale mesh, DNS flow, physical networking
 - **[Services](docs/services.md)** — Complete service map with ports, auth, and deployment info
-- **[Monitoring](docs/monitoring.md)** — Prometheus, Grafana, exporters, status page
+- **[Monitoring](docs/monitoring.md)** — Grafana Cloud, Alloy, synthetic checks, PagerDuty
+- **[Hosts](docs/hosts/)** — Per-host detail (hardware, services, quirks)
 - **[Getting Started](docs/getting-started.md)** — How to work with this repo
diff --git a/ansible/README.md b/ansible/README.md
index e95f7a9..42bfccb 100644
--- a/ansible/README.md
+++ b/ansible/README.md
@@ -25,26 +25,28 @@ make deploy-host HOST=helsinki-a
 | Playbook | Purpose | Usage |
 |----------|---------|-------|
 | `deploy.yml` | Full host rebuild from repo | `make deploy` or `--limit ` |
-| `playbooks/update-all.yml` | OS package updates (all hosts) | `make update-all` |
-| `playbooks/update-linux.yml` | Linux-only updates (apt + apk) | `make update-linux` |
-| `playbooks/update-freebsd.yml` | FreeBSD-only updates (pkg) | `make update-freebsd` |
+| `playbooks/update-all.yml` | OS package updates (all hosts, apt) | `make update-all` |
+| `playbooks/update-linux.yml` | Alias for update-all (apt) | `make update-linux` |
 | `playbooks/docker-status.yml` | Show running containers | `make docker-status` |
 | `playbooks/reboot.yml` | Safe reboot with pre-flight | `make reboot HOST=` |
+| `playbooks/zfs.yml` | ZFS scrub scheduling (london-b) | `ansible-playbook playbooks/zfs.yml` |

 ## Deploy Stages

-The deploy playbook runs in stages, each independently taggable:
+The deploy playbook runs in stages, each independently taggable (see `deploy.yml`):

-1. **common** — Baseline packages, SSH hardening, fish shell
-2. **docker** — Docker engine on container hosts
-3. **node-exporter** — Prometheus monitoring agent on all hosts
-4. **services** — Per-host service deployment:
-   - `helsinki-a`: Caddy reverse proxy
-   - `london-b`: Docker Compose services (Jellyseer, etc.)
-   - `nuremberg-a`: poste.io mail
-   - `copenhagen-a`: Minecraft + MaNGOS systemd services
-   - `london-a`: Prometheus + Grafana (FreeBSD)
-5. **verify** — Post-deploy health check
+1. **common / baseline** — Baseline packages, SSH hardening, fish shell, dotfiles
+2. **docker** — Docker engine on container hosts (`docker_hosts` group)
+3. **services** — Per-host service deployment:
+   - `helsinki-a`: Caddy + status-page + custom systemd units
+   - `docker_hosts`: Docker Compose stacks from `services/`
+   - `nuremberg-a`: poste.io mail (Docker)
+   - `london-b`: `media_stack` + `backup` (rclone to B2)
+   - `copenhagen-a`: MaNGOS systemd units + MariaDB
+   - `london-a`: `proxmox_ve` (apt repo, nag patch, CIFS storage)
+   - `zfs_hosts`: ZFS scrub scheduling
+
+Observability (node_exporter, systemd_exporter, Grafana Alloy) is part of the `common` baseline — every host gets it.

 Run a single stage: `ansible-playbook deploy.yml --tags docker`

@@ -52,13 +54,18 @@ Run a single stage: `ansible-playbook deploy.yml --tags docker`

 | Role | Description |
 |------|-------------|
-| `common` | Base packages, SSH hardening, fish shell |
-| `docker` | Docker engine install and setup |
-| `docker-services` | Deploy compose files from `services/` |
+| `common` | Base packages, SSH hardening, fish shell, exporters, Alloy |
 | `dotfiles` | Shell config from `dotfiles/` |
+| `docker` | Docker engine install and setup |
+| `docker_services` | Deploy compose files from `services/` |
 | `caddy` | Caddy reverse proxy (helsinki-a) |
-| `node-exporter` | Prometheus node_exporter |
-| `systemd-services` | Custom systemd units from `services/` |
+| `status_page` | status.pez.sh generator script + cron |
+| `systemd_services` | Custom systemd units from `services/` |
+| `media_stack` | *Arr stack, Plex/Jellyfin, Samba, Syncthing on london-b |
+| `backup` | rclone-to-B2 cron job on london-b |
+| `mariadb` | Native MariaDB (used by MaNGOS on copenhagen-a) |
+| `proxmox_ve` | PVE no-subscription repo, UI lockdown, CIFS storage |
+| `zfs` | Weekly scrub cron on ZFS hosts |

 ## Inventory

diff --git a/ansible/services/README.md b/ansible/services/README.md
index dfd12b4..8bd6a8e 100644
--- a/ansible/services/README.md
+++ b/ansible/services/README.md
@@ -1,45 +1,53 @@
 # Services

-Version-controlled service definitions across the fleet.
+Version-controlled service definitions across the fleet. Each subdirectory is a single deployable unit — either a Docker Compose stack, a systemd unit, or a static config file set — that the Ansible roles in `ansible/roles/` pick up and deploy.
-## Directory Structure +## Layout ``` services/ -├── systemd/ # systemd unit files (Linux hosts) -│ ├── copenhagen-a/ -│ │ ├── mangos-realmd.service # MaNGOS Zero realm server -│ │ └── mangos-world.service # MaNGOS Zero world server -│ └── helsinki-a/ -│ ├── caddy.service # Caddy reverse proxy (stock unit) -│ └── thiswebsitedoesnotexist.service # Node.js app on port 3721 -└── rc.d/ # FreeBSD rc.conf and rc.d scripts - └── london-a/ - └── rc.conf # /etc/rc.conf — all enabled services +├── / +│ ├── docker-compose.yml # Docker services +│ ├── .service # Native systemd unit (when applicable) +│ ├── config/ # Mounted/copied config files +│ ├── *.enc.{yml,yaml,env} # SOPS-encrypted secrets +│ └── README.md # Service-specific notes (where relevant) ``` -## Notes +There is **no** per-host subdirectory — services are named by what they are, and the host they land on is decided by `docker_services` / `systemd_services` lists in `ansible/inventory/host_vars/.yml`. -### copenhagen-a (Linux) +## Service inventory -| Service | Unit | Status | Notes | -|---------|------|--------|-------| -| MaNGOS realmd | `mangos-realmd.service` | enabled, custom | Realm server for WoW private server. Depends on MariaDB. | -| MaNGOS world | `mangos-world.service` | enabled, custom | World server. Depends on MariaDB and realmd. | +| Service | Type | Host(s) | Notes | +|---|---|---|---| +| caddy | Native (apt) | helsinki-a | Reverse proxy. Caddyfile lives here. 
| +| authelia | Docker | helsinki-a | SSO, plus MariaDB and LLDAP sidecars | +| bitwarden | Docker | helsinki-a | Vaultwarden + MariaDB | +| forgejo | Docker | helsinki-a | Git forge | +| poste-io | Docker | nuremberg-a | Mail | +| jellyseerr | Docker | london-b | Plex request manager | +| navidrome | Docker | london-b | Music streaming | +| slskd | Docker | london-b | Soulseek client | +| miniflux | Docker | london-b | RSS reader (with postgres) | +| smartctl-exporter | Docker | london-b, copenhagen-a | SMART metrics | +| plex-exporter | Docker | london-b | Plex metrics | +| octopus-exporter | Docker | london-c | Octopus Energy metrics | +| minecraft | Docker | copenhagen-a | PaperMC server | +| radarr / sonarr / lidarr / readarr / prowlarr / whisparr | systemd | london-b | *Arr stack (systemd unit files here) | +| transmission | systemd | london-b | Config files (the daemon itself is apt) | +| samba / vsftpd | systemd | london-b | File-sharing config | +| ollama | systemd | london-b | Custom unit + binary install | +| mangos-realmd / mangos-world / mangos-zero | systemd | copenhagen-a | MaNGOS WoW server | +| promtail | systemd | (currently unused; historical) | Log shipper, replaced by Alloy | +| status-page | Cron script | helsinki-a | `update-status.sh` writes `/srv/status` | +| rc.d | FreeBSD rc.conf | (historical) | Snapshot of london-a's old FreeBSD setup | -### helsinki-a (Linux) +## Conventions -| Service | Unit | Status | Notes | -|---------|------|--------|-------| -| Caddy | `caddy.service` | enabled, stock | Installed via package manager. Config at `/etc/caddy/Caddyfile`. | -| thiswebsitedoesnotexist | `thiswebsitedoesnotexist.service` | enabled, custom | Node.js app. Env vars in `/opt/thiswebsitedoesnotexist/.env`. | - -### london-a (Linux) - -No custom rc.d scripts — all services installed via `pkg`. 
The `rc.conf` captures all enabled services: - -| Service | Unit | Notes | -|---------|-----------------|-------| -| libvirtd | `libvirtd.service` | Virtualisation daemon | +- **Compose stacks** live at `/docker-compose.yml` and are deployed to `/opt/docker//` on the target host. +- **Systemd units** are copied to `/etc/systemd/system/.service` by the `media_stack` or `systemd_services` role. +- **Secrets** are SOPS-encrypted (`*.enc.yml`) and decrypted into place at deploy time. +## Adding a new service +See [docs/getting-started.md](../../docs/getting-started.md#adding-a-new-service) for the end-to-end flow (compose → host_vars → Caddy → DNS → docs). diff --git a/docs/README.md b/docs/README.md index fbe54c0..0823b2a 100644 --- a/docs/README.md +++ b/docs/README.md @@ -7,17 +7,19 @@ Everything you need to understand how this infrastructure works. - **[Architecture](architecture.md)** — High-level overview, network topology, traffic flow diagrams - **[Networking](networking.md)** — Tailscale mesh, physical networking, DNS and proxy flow - **[Services](services.md)** — Complete service map: what runs where, ports, auth -- **[Monitoring](monitoring.md)** — Prometheus, Grafana, exporters, alerting, status page +- **[Monitoring](monitoring.md)** — Grafana Cloud, Alloy, synthetic checks, alerting via PagerDuty - **[Secrets](secrets.md)** — SOPS + age encryption: setup, usage, CI integration - **[Getting Started](getting-started.md)** — How to work with this repo, deploy changes, add services +- **[Hosts](hosts/)** — Per-host detail (hardware, services, quirks) ## Quick Reference | Host | Tailscale IP | Location | Role | |------|-------------|----------|------| -| helsinki-a | 100.67.6.27 | Hetzner Cloud | Reverse proxy, SSO, Bitwarden | +| helsinki-a | 100.67.6.27 | Hetzner Cloud (Helsinki) | Reverse proxy, SSO, Bitwarden, Forgejo | +| london-a | 100.122.180.98 | London | Proxmox VE hypervisor | | london-b | 100.84.65.101 | London | Storage, media, Docker services | -| 
london-a | 100.122.219.41 | London | Prometheus + Grafana | -| nuremberg-a | 100.117.235.28 | Hetzner Cloud | Mail (poste.io) | -| copenhagen-a | 100.89.206.60 | Copenhagen | Minecraft, WoW | -| copenhagen-c | 100.115.45.53 | Copenhagen | Idle | +| london-c | 100.123.72.87 | London | Raspberry Pi, Octopus Energy exporter | +| nuremberg-a | 100.70.180.24 | Hetzner Cloud (Nuremberg) | Mail (poste.io) | +| copenhagen-a | 100.89.206.60 | Copenhagen | Minecraft, WoW/MaNGOS | +| copenhagen-c | 100.115.45.53 | Copenhagen | Raspberry Pi, cloudflared, idle | diff --git a/docs/architecture.md b/docs/architecture.md index 4810044..f8ef102 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -2,26 +2,29 @@ ## Overview -The infrastructure spans four physical locations connected by a Tailscale mesh network. All public traffic enters through a single Hetzner Cloud VPS (helsinki-a) running Caddy as a reverse proxy, which forwards requests over Tailscale to backend services running on physical servers in London and Copenhagen. +The infrastructure spans three physical locations (London, Copenhagen, Hetzner Cloud) connected by a Tailscale mesh network. All public traffic enters through a single Hetzner Cloud VPS (helsinki-a) running Caddy as a reverse proxy, which forwards requests over Tailscale to backend services running on physical servers in London and Copenhagen. -The setup is entirely self-hosted (with the exception of Hetzner Cloud VPSs and Cloudflare for DNS/CDN). Servers are old personal computers repurposed into server duty — cheaper than cloud, and I get a rack cabinet that doubles as a bedroom white noise machine. +The setup is entirely self-hosted (with the exception of Hetzner Cloud VPSs, Cloudflare for DNS/CDN, and Grafana Cloud for observability). Most physical servers are old personal computers repurposed into server duty — cheaper than cloud, and I get a rack cabinet that doubles as a bedroom white noise machine. 
## Network Topology ```mermaid graph TD - CF["Cloudflare
DNS + CDN
*.pez.sh"] + CF["Cloudflare
DNS + CDN
*.pez.sh, *.pez.solutions"] CF -->|HTTPS| HEL - HEL["helsinki-a
Hetzner Cloud VPS

Caddy (reverse proxy)
Authelia (SSO)
Bitwarden
LLDAP"] + HEL["helsinki-a
Hetzner Cloud VPS

Caddy (reverse proxy)
Authelia (SSO)
LLDAP (Authelia backend)
Bitwarden (Vaultwarden)
Forgejo"] HEL --> TS["Tailscale Mesh
WireGuard-based VPN"] - TS --> LB["london-b
Storage / Media
Docker services
(46T ZFS)"] - TS --> LA["london-a
Monitoring
Prometheus / Grafana
(FreeBSD)"] - TS --> NA["nuremberg-a
Mail
poste.io
(Alpine)"] - TS --> CA["copenhagen-a
Gaming
Minecraft / WoW/MaNGOS
(Ubuntu)"] - TS --> CC["copenhagen-c
(idle)"] + TS --> LB["london-b
Storage / Media
*arr stack, Plex, Jellyfin
(Threadripper, 87T ZFS)"] + TS --> LA["london-a
Proxmox VE hypervisor
(Debian 13)"] + TS --> LC["london-c
Raspberry Pi
Octopus Energy exporter"] + TS --> NA["nuremberg-a
Mail
poste.io"] + TS --> CA["copenhagen-a
Gaming
Minecraft / WoW (MaNGOS)"] + TS --> CC["copenhagen-c
Raspberry Pi
cloudflared, idle"] + + TS -.->|Alloy| GC["Grafana Cloud
metrics, logs, traces
synthetic checks"] style CC stroke-dasharray: 5 5 ``` @@ -34,7 +37,7 @@ All public-facing services follow the same pattern: User → Cloudflare (DNS + TLS) → helsinki-a (Caddy) → Backend (over Tailscale) ``` -1. DNS for `*.pez.sh` is managed by Cloudflare (provisioned via Terraform) +1. DNS for `pez.sh` and `pez.solutions` is managed by Cloudflare (provisioned via Terraform) 2. Cloudflare proxies traffic to helsinki-a 3. Caddy on helsinki-a terminates TLS and routes to the correct backend 4. For protected services, Caddy calls Authelia first (`forward_auth`) @@ -51,8 +54,8 @@ graph LR R["radarr.pez.sh"] --> A1 --> LB1["london-b:7878"] J["jellyfin.pez.sh"] --> A2 --> LB2["london-b:8096"] - G["grafana.pez.sh"] --> A3 --> LA["london-a:3000"] - AU["auth.pez.sh"] --> A4 --> LO["localhost:9091"] + G["git.pez.sh"] --> A3 --> LO3["localhost:3000 (Forgejo)"] + AU["auth.pez.sh"] --> A4 --> LO["localhost:9091 (Authelia)"] ``` ## Auth Architecture @@ -60,17 +63,22 @@ graph LR ```mermaid graph TD Caddy["Caddy
forward_auth"] --> Authelia["Authelia
SSO
auth.pez.sh"] - Authelia --> LLDAP["LLDAP
User directory"] + Authelia --> LLDAP["LLDAP
User directory
(Authelia backend only)"] + Authelia --> MariaDB["MariaDB
Authelia session/state"] ``` -Authelia authenticates against LLDAP (both on helsinki-a). One place to manage users — add or remove someone in LDAP and it propagates to all protected services. +Authelia authenticates against LLDAP and uses a MariaDB for session/state. All three run as Docker containers on helsinki-a. LLDAP is **not** wired into other apps — it's purely Authelia's user backend. Services that sit behind Authelia inherit users from LLDAP via the Caddy `forward_auth` flow; services with their own auth (Bitwarden, Plex, Jellyfin, Navidrome, Jellyseerr, Forgejo, poste.io) maintain their own user databases. -Services with their own auth (Bitwarden, Jellyfin, Plex, Nextcloud, Navidrome, Jellyseerr) are not behind Authelia. +## Observability + +Metrics, logs, and traces ship to **Grafana Cloud** from every host via **Grafana Alloy**. The Alloy collectors are registered in Grafana Fleet Management (configured in `terraform/grafana/`). Synthetic uptime checks for the public sites run from Grafana Cloud probes, and PagerDuty handles alert delivery. + +> **History:** Monitoring used to run locally on london-a (FreeBSD, with Prometheus + Grafana). london-a has since been wiped and reinstalled as Proxmox VE; the local stack was retired in favour of Grafana Cloud. See [monitoring.md](monitoring.md) for the current setup. ## Design Principles - **Self-hosted first.** Cloud VPSs only where it makes sense (public gateway, mail with clean IP reputation). Everything else runs on physical hardware I own. - **Tailscale as the backbone.** No ports exposed on residential IPs. All inter-server communication goes over the mesh. -- **Ansible for everything.** If a server dies, reinstall the OS, install Tailscale, run Ansible. 30 minutes to full recovery. -- **Terraform for DNS.** All Cloudflare records are in code. No clicking around in dashboards. +- **Ansible for everything.** If a server dies, reinstall the OS, install Tailscale, run `make deploy`. 
Roughly 30 minutes to full recovery. +- **Terraform for cloud + DNS.** Hetzner servers, Cloudflare records, Grafana Cloud configuration, and PagerDuty are all in code. No clicking around in dashboards. - **Cattle, not pets (as much as possible).** The servers are technically pets — old hardware in specific locations — but the configs are cattle. Everything is reproducible from this repo. diff --git a/docs/getting-started.md b/docs/getting-started.md index a4357df..7a74b4c 100644 --- a/docs/getting-started.md +++ b/docs/getting-started.md @@ -8,10 +8,10 @@ You'll need: - **Tailscale** — installed and connected to the tailnet. All SSH access goes through Tailscale. No servers have SSH exposed on the public internet. - **SSH keys** — set up for each host you need to access -- **Ansible** — for configuration management and deployments -- **OpenTofu** (or Terraform) — for managing Cloudflare DNS and infrastructure +- **Ansible** — for configuration management and deployments (`make deps` from `ansible/` installs collections) +- **OpenTofu** (or Terraform) — for Hetzner, Cloudflare, Grafana Cloud, and PagerDuty - **Docker** — helpful to understand, since most services are containerised -- **SOPS + age** — for secrets encryption/decryption (run `./scripts/sops-setup.sh`) +- **SOPS + age** — for secrets encryption/decryption (run `./ansible/scripts/sops-setup.sh`) - **Git** — obviously - **gh CLI** — for GitHub operations (PRs, issues, etc.) @@ -28,76 +28,98 @@ cd pez-infra pez-infra/ ├── docs/ # You are here ├── ansible/ # Ansible playbooks, roles, inventory, and all managed files -│ ├── roles/ # Ansible roles (caddy, docker, dotfiles, etc.) +│ ├── roles/ # Ansible roles (common, caddy, docker, media_stack, proxmox_ve, etc.) │ ├── services/ # Docker Compose definitions and service configs │ ├── dotfiles/ # Shell config (fish, nvim, tmux, git, etc.) 
+│ ├── playbooks/ # One-off playbooks (updates, reboots, status)
 │ └── scripts/ # Utility and maintenance scripts
-└── terraform/ # Terraform/OpenTofu for Cloudflare, DNS, etc.
+└── terraform/ # Terraform/OpenTofu for Hetzner, Cloudflare, Grafana Cloud, PagerDuty
 ```

 ## Connecting to hosts

-All access is via Tailscale. Once you're on the tailnet, SSH using the Tailscale IP or hostname:
+All access is via Tailscale, as `root`. Once you're on the tailnet, SSH using the Tailscale IP or hostname:

 ```bash
 ssh root@helsinki-a # or ssh root@100.67.6.27
-ssh root@london-b # or ssh root@100.84.65.101
-ssh root@london-a # FreeBSD — might need a different user
-ssh root@copenhagen-a # or ssh root@100.89.206.60
+ssh root@london-a # Proxmox VE host
+ssh root@london-b # storage / media
+ssh root@london-c # Raspberry Pi
+ssh root@copenhagen-a
+ssh root@copenhagen-c # Raspberry Pi
+ssh root@nuremberg-a
 ```

 ## Common Tasks

 ### Deploying configuration changes

-Ansible handles deployments. Playbooks are in `ansible/` and are structured by host/role.
+Ansible handles deployments. The unified `deploy.yml` rebuilds a host from bare-metal-with-Tailscale to fully configured.

 ```bash
-# Run the full site playbook
-cd ansible
-ansible-playbook site.yml
+cd ansible/

-# Target a specific host
-ansible-playbook site.yml --limit london-b
+# Install collections
+make deps

-# Dry run first
-ansible-playbook site.yml --check --diff
+# Dry run — see what would change
+make deploy-check
+
+# Deploy everything
+make deploy
+
+# Deploy a single host
+make deploy-host HOST=london-b
+
+# Or run a single stage
+ansible-playbook deploy.yml --tags docker
 ```

 Ansible also runs automatically via GitHub Actions on commits to the main branch — so a quick commit from your phone can fix a misconfiguration when you're out.

-### Managing DNS
+Other playbooks live under `ansible/playbooks/`:

-DNS records are managed via Terraform in the `terraform/` directory:
+| Playbook | Purpose |
+|---|---|
+| `update-all.yml` | OS package updates (all hosts) |
+| `update-linux.yml` | Linux-only updates (apt) |
+| `docker-status.yml` | Show running containers per host |
+| `reboot.yml` | Safe reboot with pre-flight (interactive confirm for london-b) |
+| `zfs.yml` | ZFS scrub scheduling |
+
+### Managing cloud + DNS + observability
+
+Terraform manages Hetzner servers, Cloudflare DNS, Grafana Cloud (stack, fleet, dashboards, synthetic checks), and PagerDuty:

 ```bash
 cd terraform
-tofu plan # see what would change
-tofu apply # apply the changes
+make init # initialize providers and B2 backend
+make plan # preview changes
+make apply # apply the changes
 ```

-All Cloudflare DNS records, pages, and access policies are defined here. Don't click around in the Cloudflare dashboard — if it's not in Terraform, it doesn't exist.
+State lives in a Backblaze B2 bucket (`pez-infra-tfstate`) via the S3-compatible backend. Don't click around in the Cloudflare or Grafana Cloud dashboards — if it's not in Terraform, it doesn't exist.

 ### Adding a new service

-1. **Create a Docker Compose file** in `ansible/services//docker-compose.yml`
-2. **Add the Caddy route** — if it needs a public subdomain, add a block to the Caddyfile in `ansible/services/caddy/`
-3. **Add a DNS record** — add the subdomain to `terraform/` and run `tofu apply`
-4. **Add Ansible deployment** — create or update the relevant role in `ansible/` so the service gets deployed automatically
-5. **Add monitoring** — if the service has a metrics endpoint, add it as a Prometheus scrape target
-6. **Update docs** — add the service to `docs/services.md`
+1. **Create a Docker Compose file** in `ansible/services//docker-compose.yml` (or a systemd unit if it's native)
+2. **Add the host_var** — list the service under `docker_services` (or `systemd_services`) in `ansible/inventory/host_vars/.yml`
+3. **Add the Caddy route** — if it needs a public subdomain, add a block to `ansible/services/caddy/Caddyfile`
+4. **Add a DNS record** — add the subdomain to `terraform/hetzner/dns.tf` and run `tofu apply`
+5. **Add monitoring** — if the service has a metrics endpoint, scrape it via Alloy (`terraform/grafana/fleet_pipelines/`)
+6. **Update docs** — add the service to `docs/services.md` (and the relevant `docs/hosts/.md` page)

 ### Adding a new server

-1. Install the OS (Ubuntu preferred — see below)
-2. Set up SSH keys
+1. Install the OS (Debian 13 or Ubuntu LTS preferred — see below)
+2. Set up SSH keys for `root`
 3. Install Tailscale and join the tailnet
-4. Add the host to the Ansible inventory in `ansible/`
-5. Assign roles (at minimum: node_exporter for monitoring)
-6. Run `ansible-playbook site.yml --limit `
-7. Update `docs/services.md` and `docs/architecture.md`
+4. Add the host to `ansible/inventory/hosts.ini` and create `ansible/inventory/host_vars/.yml`
+5. Run `make deploy-host HOST=` from `ansible/`
+6. Register the host as a Grafana Fleet collector in `terraform/grafana/fleet_collectors.tf` and `tofu apply`
+7. Add a doc at `docs/hosts/.md` and update `docs/services.md` + `docs/architecture.md`

-That's it. Ansible takes care of installing node_exporter, configuring the system, and deploying any assigned services.
+That's it. The common role installs node_exporter, systemd_exporter, and Alloy as part of the baseline, so observability is automatic.

 ### Working with ZFS (london-b)

@@ -108,17 +130,20 @@ zpool status hdd
 # Check usage
 zfs list

-# Scrub status (runs weekly on Sundays)
+# Scrub status (runs weekly on Sundays at 12:00)
 zpool status hdd | grep scan
 ```

-ZFS is set up with 3× RAIDZ1 vdevs across 8 drives. Tolerates one drive failure per vdev.
+ZFS is set up with 3× RAIDZ1 vdevs of 4 drives each (12 drives total) on the `hdd` pool. Tolerates one drive failure per vdev. The long-term plan is to replace the 8 TB drives with 24 TB drives and grow the pool toward 24 drives / ~0.5 PB raw.

 ## OS Choice

-Ubuntu is the preferred OS for new servers. Not because I love it — Alpine is faster and leaner — but because Ansible support is vastly better. The lack of GNU binaries and systemd on Alpine caused enough headaches that the switch to Ubuntu was worth it.
+- **Debian (12 or 13)** is the default for new hosts — including the Raspberry Pis. Stable, well-supported by Ansible, predictable.
+- **Ubuntu LTS** is on london-b and copenhagen-a (historical — both came up before the Debian standard).
+- **Proxmox VE** (Debian 13 under the hood) on london-a.
+- **No more FreeBSD.** london-a used to run FreeBSD for Prometheus/Grafana; that's all on Grafana Cloud now and london-a is Linux/Proxmox.

-FreeBSD is used on london-a (monitoring) and works well for that single-purpose role.
+Alpine has been tried and rejected — the missing GNU binaries / systemd caused enough Ansible headaches to not be worth the size savings.

 ## Secrets

@@ -151,7 +176,7 @@ This monorepo replaces several standalone repos:
 |----------|-------------|
 | pez-ansible | `ansible/` |
 | pez-terraform | `terraform/` |
-| pez-grafana | `services/grafana/` |
-| pez-proxy | `services/caddy/` |
+| pez-grafana | `terraform/grafana/` |
+| pez-proxy | `ansible/services/caddy/` |
 | pez-docs | `docs/` |
-| server-scripts | `scripts/` and `ansible/` |
+| server-scripts | `ansible/scripts/` and `ansible/roles/` |
diff --git a/docs/hosts/copenhagen-a.md b/docs/hosts/copenhagen-a.md
index 775aea7..2526035 100644
--- a/docs/hosts/copenhagen-a.md
+++ b/docs/hosts/copenhagen-a.md
@@ -7,7 +7,7 @@ Game servers. Located at my dad's place in Copenhagen as an off-site location.
 | | |
 |---|---|
 | **Location** | Copenhagen |
-| **OS** | Ubuntu 22.04 |
+| **OS** | Ubuntu 22.04 LTS |
 | **Tailscale IP** | 100.89.206.60 |
 | **Role** | Gaming servers (Minecraft, WoW) |
 | **Form factor** | Lenovo "tiny" desktop (lunchbox-sized) |
@@ -18,7 +18,7 @@ Game servers. Located at my dad's place in Copenhagen as an off-site location.
 |---|---|
 | CPU | Intel i5-4570T (4 threads) |
 | Memory | 16 GB |
-| Boot disk | 500 GB (26% used) |
+| Boot disk | 500 GB |

 Compact Lenovo desktop — powered by a standard ThinkPad charging brick. Small, quiet, and draws minimal power.

@@ -28,11 +28,11 @@ Compact Lenovo desktop — powered by a standard ThinkPad charging brick. Small,

 | | |
 |---|---|
-| Image | `marctv/minecraft-papermc-server` |
+| Image | `itzg/minecraft-server` |
 | Port | 25565 |
 | Deployment | Docker |

-PaperMC for better performance than vanilla. Not proxied through Caddy — accessed directly via Tailscale or the host's IP.
+Not proxied through Caddy — accessed directly via Tailscale or the host's public IP.

 ### World of Warcraft (MaNGOS Zero)

@@ -47,29 +47,18 @@ WoW 1.12 (Vanilla) private server using the MaNGOS Zero emulator. Runs natively
 - Runs as the `mangos` user
 - Install path: `/home/mangos/mangos/zero/`
 - MariaDB hosts the character, world, and auth databases locally
+- Both `mangos-realmd` and `mangos-world` start automatically on boot via systemd
+- The `mariadb` Ansible role manages package + secrets; the `systemd_services` role drops the unit files (`ansible/services/mangos-realmd/`, `ansible/services/mangos-world/`)

-Both `mangos-realmd` and `mangos-world` start automatically on boot via systemd.
+### Other

-### Monitoring
-
-| Service | Port | Managed by |
-|---------|------|-----------|
-| node_exporter | 9100 | systemd (Ansible-managed) |
-
-Prometheus Node Exporter for host metrics. Installed and managed via the Ansible `node_exporter` role. Scraped by Prometheus on london-a via Tailscale.
- -> **Note:** Stale Docker images for `prom/node-exporter` and `quay.io/prometheus/node-exporter` exist on the host from a previous Docker-based deployment. These should be cleaned up — the systemd service is the active one. - -### Potentially Unused Services - -The following services are running but have no known active consumers: - -| Service | Notes | -|---------|-------| -| PostgreSQL 14 | Only default databases (template0, template1, postgres). Likely leftover. | -| Redis 6.0 | Running but no known application depends on it. | - -These are candidates for removal or investigation. +| Service | Port | Deployment | Notes | +|---------|------|-----------|-------| +| smartctl_exporter | 9633 | Docker | Disk SMART metrics scraped by Alloy | +| node_exporter | 9100 | Native | Host metrics | +| systemd_exporter | — | Native | systemd unit metrics | +| Alloy | — | Native | Ships everything to Grafana Cloud | +| Tailscale | — | Native | Mesh networking | ## Networking @@ -77,4 +66,4 @@ Connected directly to the ISP router's built-in switch. Symmetrical 500 Mbit con ## Notes -Copenhagen-a has a static IP, which is needed for game servers that require direct client connections (WoW realm list, Minecraft server list). +copenhagen-a has a static public IP, which is needed for game servers that require direct client connections (WoW realm list, Minecraft server list). The reboot playbook (`ansible/playbooks/reboot.yml`) does a netplan pre-flight check before rebooting to make sure the static IP config will come back up cleanly. diff --git a/docs/hosts/copenhagen-c.md b/docs/hosts/copenhagen-c.md index 47fe460..dbd2d03 100644 --- a/docs/hosts/copenhagen-c.md +++ b/docs/hosts/copenhagen-c.md @@ -1,21 +1,29 @@ # copenhagen-c -General purpose box. Currently idle. +Raspberry Pi at the Copenhagen site. General-purpose / off-site utility box. 
## Overview | | | |---|---| | **Location** | Copenhagen | -| **OS** | Debian 12 | +| **OS** | Debian 12 (Bookworm), aarch64 | | **Tailscale IP** | 100.115.45.53 | -| **Role** | Idle / available | -| **Disk** | 117 GB (15% used) | +| **Role** | Idle / cloudflared tunnel | +| **Form factor** | Raspberry Pi (ARM64) | -## Status +## Services -No active workloads. Connected to Tailscale and available for future use. Has node_exporter running for monitoring. +| Service | Deployment | Notes | +|---------|-----------|-------| +| cloudflared | Native (systemd) | Cloudflare-managed tunnel for ad-hoc exposure of services from this site | +| Tailscale | Native | Mesh networking | +| Alloy | Native | Ships metrics/logs to Grafana Cloud | +| node_exporter | Native | Host metrics | +| Docker / containerd | Native | Available, but no compose services currently scheduled here | + +The cloudflared token is stored directly in the systemd unit (`/etc/systemd/system/cloudflared.service`); the tunnel itself is configured in the Cloudflare dashboard. ## Notes -Part of the Copenhagen off-site setup at my dad's place. Available if I need to spin up something that benefits from a Copenhagen location or just need another box. +Part of the Copenhagen off-site setup at my dad's place. Otherwise idle — available if I need to spin up something that benefits from a Copenhagen location or just need another always-on box. diff --git a/docs/hosts/helsinki-a.md b/docs/hosts/helsinki-a.md index f1b3364..5d85e3a 100644 --- a/docs/hosts/helsinki-a.md +++ b/docs/hosts/helsinki-a.md @@ -7,31 +7,44 @@ Public-facing traffic gateway. 
Everything exposed to the internet goes through t | | | |---|---| | **Location** | Hetzner Cloud (Helsinki) | -| **OS** | Linux (Ubuntu/Debian) | +| **OS** | Debian 13 (Trixie) | | **Tailscale IP** | 100.67.6.27 | -| **Role** | Reverse proxy, SSO, Bitwarden, LDAP | +| **Role** | Reverse proxy, SSO, Bitwarden, Forgejo | | **Provider** | Hetzner Cloud VPS | ## What it does -This is the front door. All public subdomains (*.pez.sh) terminate here via Caddy, which proxies traffic to the appropriate backend over Tailscale. +This is the front door. All public subdomains under `pez.sh` and `pez.solutions` terminate here via Caddy, which proxies traffic to the appropriate backend over Tailscale. -It also runs the auth stack — Authelia for SSO and LLDAP for user management. Having auth on the same box as the proxy keeps latency low for the `forward_auth` check. +It also runs the auth stack — Authelia for SSO and LLDAP as Authelia's user backend. Having auth on the same box as the proxy keeps latency low for the `forward_auth` check. -Bitwarden (Vaultwarden) lives here too, because password management needs to be available even if the London servers are having a moment. +Bitwarden (Vaultwarden) and Forgejo also live here. Both expose their own login and don't go through Authelia. Bitwarden is on helsinki-a for availability — password management needs to be reachable even if the London servers are having a moment. Forgejo is colocated for the same reason and to keep Git access independent of home internet. ## Services | Service | Port | Deployment | Notes | |---------|------|-----------|-------| -| Caddy | 80, 443 | Docker | Reverse proxy + TLS termination | +| Caddy | 80, 443 | Native (apt + systemd) | Reverse proxy + TLS termination. 
Config at `/etc/caddy/Caddyfile` | | Authelia | 9091 | Docker | SSO, accessible at auth.pez.sh | -| Bitwarden (Vaultwarden) | 8443 | Docker | bitwarden.pez.sh, own auth | -| LLDAP | 3890/17170 | Docker | User directory for Authelia | +| Authelia MariaDB | (internal) | Docker | Authelia session/state | +| LLDAP | 3890, 17170 | Docker | User directory for Authelia (UI at ldap.pez.sh) | +| Bitwarden (Vaultwarden) | 8443, 8080 | Docker | bitwarden.pez.sh, own auth | +| Bitwarden MariaDB | (internal) | Docker | Backing DB | +| Forgejo | 3000 (HTTP), 2222 (SSH) | Docker | git.pez.sh, own auth; SSH on `git.pez.sh:2222` | -Also serves static content: -- **status.pez.sh** → `/srv/status` (public status page) -- **apps.pez.sh** → `/srv/apps` (behind Authelia) +Caddy is the only service installed natively — it needs to bind 80/443 directly and there's no benefit to wrapping it in Docker on a single-purpose proxy host. Everything else runs as Docker Compose stacks under `/opt/docker//` (managed by the `docker_services` Ansible role from `ansible/services//docker-compose.yml`). + +### Static sites + +Caddy also serves static content from `/srv/`: + +| Path | URL | Auth | +|---|---|---| +| `/srv/status` | status.pez.sh | — | +| `/srv/apps` | apps.pez.sh, apps.pez.solutions | Authelia | +| `/srv/pez.sh` | pez.sh | — | +| `/srv/pez.solutions` | pez.solutions | — | +| `/srv/pez-signup` | signup.pez.solutions | — | ## Why Hetzner Cloud diff --git a/docs/hosts/london-a.md b/docs/hosts/london-a.md index f17ba92..20f71a5 100644 --- a/docs/hosts/london-a.md +++ b/docs/hosts/london-a.md @@ -1,13 +1,13 @@ # london-a -Proxmox VE hypervisor. +Proxmox VE hypervisor. The platform for any VM workloads I want to run on owned hardware. 
## Overview | | | |---|---| | **Location** | London (NW9) | -| **OS** | Proxmox VE (Debian Bookworm) | +| **OS** | Debian 13 (Trixie) with Proxmox VE 9.x | | **Tailscale IP** | 100.122.180.98 | | **Role** | Hypervisor (Proxmox VE) | @@ -25,9 +25,40 @@ Old gaming PC. Runs Proxmox VE on bare metal. | Service | Port | Status | Notes | |---------|------|--------|-------| -| Proxmox VE | 8006 | Active | Web UI — Tailscale only | +| Proxmox VE | 8006 | Active | Web UI — reachable via `london-a.pez.sh` (Caddy) or Tailscale IP | | Tailscale | — | Active | Mesh networking | +| node_exporter, systemd_exporter, Alloy | — | Active | Observability baseline (Ansible-managed) | + +### Storage + +Proxmox is connected to a CIFS share on **london-b** (`100.84.65.101 /pve`) for ISO/template/backup storage. The mount is configured by the `proxmox_ve` Ansible role: + +| Storage ID | Type | Backing | +|---|---|---| +| `local-lvm` | LVM-Thin | Local boot disk | +| `hdd` | CIFS | london-b `/pve` share | + +### VMs + +| VMID | Name | Status | Notes | +|---|---|---|---| +| 100 | Mac-Server | Stopped | macOS Sequoia VM (OpenCore bootloader). Intended for occasional macOS workloads. | + +The VM list will grow over time — this is a general-purpose hypervisor, not a single-VM appliance. + +## Ansible + +The `proxmox_ve` role: + +- Swaps the enterprise apt repo for `pve-no-subscription` so updates work without a paid subscription +- Patches `proxmoxlib.js` to suppress the subscription nag dialog +- Restricts the web UI to the `tailscale0` interface via UFW +- Mounts the london-b CIFS storage ## Networking -Connected via Cat 5 to the Ubiquiti switch alongside london-b. +Connected via Cat 5 to the Ubiquiti switch alongside london-b and london-c. + +## History + +london-a used to run **FreeBSD** as a single-purpose monitoring host (Prometheus + Grafana). Monitoring moved to Grafana Cloud, the box was repaved as Proxmox VE, and the FreeBSD-specific Ansible has been removed. 
diff --git a/docs/hosts/london-b.md b/docs/hosts/london-b.md index 13202ec..e82914d 100644 --- a/docs/hosts/london-b.md +++ b/docs/hosts/london-b.md @@ -19,24 +19,26 @@ Primary storage and media server. The workhorse of the fleet. | Memory | 64 GB | | GPU | Nvidia GTX 980 | | Boot disk | 500 GB | -| Storage pool | ~64 TB (ZFS) | +| Storage pool | ~87 TB raw / ~64 TB usable (ZFS) | This machine is ridiculously overpowered as a media server. It's my old gaming/workstation PC repurposed into server duty. The GPU helps with Plex transcoding but the CPU can handle it fine on its own. ## Storage -ZFS pool `hdd`: 3× RAIDZ1 vdevs, 8 drives total. +ZFS pool `hdd`: 3× RAIDZ1 vdevs, 4 drives each (12 drives total). | Metric | Value | |---|---| -| Used | 46 TB | -| Free | 18 TB | -| Total | ~64 TB | -| Usage | 72% | -| Scrub | Weekly (Sundays) | +| Used | ~61 TB | +| Free | ~26 TB | +| Total | ~87 TB raw | +| Usage | ~70% | +| Scrub | Weekly (Sundays at 12:00, cron `/sbin/zpool scrub hdd`) | RAIDZ1 tolerates one drive failure per vdev. With this many drives and this much data, ZFS checksumming is essential — silent data corruption on spinning disks is real and you don't want to find out about it years later. +**Roadmap:** the long-term plan is to gradually replace the 8 TB drives with 24 TB drives and grow the pool toward 24 drives / ~0.5 PB raw. + ## Services ### Media Servers @@ -58,15 +60,19 @@ RAIDZ1 tolerates one drive failure per vdev. 
With this many drives and this much | Prowlarr | 9696 | prowlarr.pez.sh | | Transmission | 9091 | download.pez.sh | | Jellyseerr | 5055 | request.pez.sh | +| Overseerr (snap) | 5056 | jellyfin-requests.pez.sh | ### Other | Service | Port | URL | |---------|------|-----| -| Nextcloud AIO | 11000 | cloud.pez.sh | +| Nextcloud AIO | 11000 | cloud.pez.sh (internal) | +| Miniflux | 8181 | rss.pez.sh | | slskd (Soulseek) | 5030 | soulseek.pez.sh | -| smartctl_exporter | 9633 | (Prometheus scrape) | -| prom-plex-exporter | — | (Prometheus scrape) | +| Syncthing (`syncthing@pez`) | 8384 | (LAN / Tailscale) | +| Ollama | 11434 | (Tailscale) | +| smartctl_exporter | 9633 | (Alloy scrape) | +| prom-plex-exporter | 9594 | (Alloy scrape) | ### Systemd Services (non-Docker) @@ -85,12 +91,15 @@ The media automation suite and several supporting services run as native systemd | Transmission | transmission-daemon | Package-managed | | Samba | smbd | Package-managed | | Ollama | ollama | /usr/local/bin, custom unit | -| Promtail | promtail | Custom unit, ships logs to Loki | +| Syncthing | syncthing@pez | Package-managed, user instance | | vsftpd | vsftpd | FTP server for /hdd/ftp | | systemd_exporter | systemd_exporter | Ansible-managed | -| node_exporter | node_exporter | Ansible-managed | +| node_exporter | prometheus-node-exporter | apt-managed | +| Alloy | alloy | Grafana Alloy, fleet-managed config | -Docker services: Nextcloud AIO, Jellyseerr, Navidrome, slskd, Miniflux, smartctl-exporter, plex-exporter. +Docker services: Nextcloud AIO, Jellyseerr, Navidrome, slskd, Miniflux (with postgres sidecar), smartctl-exporter, plex-exporter. + +Snap: Overseerr (`latest/beta` channel). 
### Cron Jobs @@ -99,7 +108,8 @@ Docker services: Nextcloud AIO, Jellyseerr, Navidrome, slskd, Miniflux, smartctl | Every hour | `/root/scripts/movie-rename-fix.fish` | | Midnight daily | `systemctl restart radarr` | | Midnight daily | `systemctl restart sonarr` | -| 22:00 daily | `/root/scripts/backup.sh` (rclone to B2) | +| 22:00 daily | `/root/scripts/backup.sh` (rclone to Backblaze B2) | +| Sundays 12:00 | `/sbin/zpool scrub hdd` | ### Samba Shares @@ -108,8 +118,9 @@ Docker services: Nextcloud AIO, Jellyseerr, Navidrome, slskd, Miniflux, smartctl | HDD | /hdd | pez, root (rw) | | Movies | /hdd/movies | public (ro) | | TV Shows | /hdd/tv | public (ro) | +| pve | /hdd/pve | london-a Proxmox (rw) — ISO/template/backup storage | -Media is served directly from the ZFS pool. +Media is served directly from the ZFS pool. Docker root (`/hdd/docker`) and PVE storage (`/hdd/pve`) live on the pool too. ## Networking diff --git a/docs/hosts/london-c.md b/docs/hosts/london-c.md new file mode 100644 index 0000000..31a22db --- /dev/null +++ b/docs/hosts/london-c.md @@ -0,0 +1,36 @@ +# london-c + +Raspberry Pi at the London site. Edge utility box for lightweight workloads that don't justify spinning up the Threadripper. 
+ +## Overview + +| | | +|---|---| +| **Location** | London (NW9) | +| **OS** | Debian 13 (Trixie), aarch64 | +| **Tailscale IP** | 100.123.72.87 | +| **Role** | Octopus Energy exporter, general-purpose Pi | +| **Form factor** | Raspberry Pi (ARM64) | + +## Services + +| Service | Port | Deployment | Notes | +|---------|------|-----------|-------| +| octopus_exporter | 9359 | Docker (`rwejlgaard/octopus_exporter`) | Pulls electricity-usage data from the Octopus Energy API; scraped by Alloy | +| Tailscale | — | Native | Mesh networking | +| Docker / containerd | — | Native | For octopus-exporter | +| Alloy | — | Native (Ansible-managed) | Ships metrics/logs to Grafana Cloud | +| node_exporter | 9100 | Native | Host metrics | +| systemd_exporter | — | Native | systemd unit metrics | +| fail2ban | — | Native | SSH brute-force protection | + +Compose file lives at `ansible/services/octopus-exporter/docker-compose.yml`. The `OCTOPUS_API_KEY` is templated in from a SOPS-encrypted variable. + +## Networking + +Connected via Ethernet to the Ubiquiti switch alongside london-a and london-b. + +## Notes + +- Single-board-computer form factor — runs cool, draws ~5 W, lives on the rack shelf without active cooling. +- A natural place to park future "small but always-on" workloads (sensors, cron jobs, smart-home glue) that don't need to share fate with london-b. diff --git a/docs/hosts/nuremberg-a.md b/docs/hosts/nuremberg-a.md index 9c3b493..62e71fb 100644 --- a/docs/hosts/nuremberg-a.md +++ b/docs/hosts/nuremberg-a.md @@ -7,7 +7,7 @@ Dedicated mail server. One job, does it well. | | | |---|---| | **Location** | Hetzner Cloud (Nuremberg) | -| **OS** | Debian | +| **OS** | Debian 13 (Trixie) | | **Tailscale IP** | 100.70.180.24 | | **Role** | Mail server (poste.io) | | **Provider** | Hetzner Cloud VPS | @@ -16,10 +16,12 @@ Dedicated mail server. One job, does it well. 
| Service | Ports | Deployment | |---------|-------|-----------| -| poste.io | 25, 587, 993, 443 | Docker | +| poste.io | 25, 80, 110, 143, 443, 465, 587, 993, 995 | Docker | poste.io is a batteries-included mail server that bundles postfix, dovecot, rspamd, and webmail into a single Docker container. No juggling separate containers for each mail component. +The compose definition lives at `ansible/services/poste-io/docker-compose.yml` and is deployed via the `docker_services` Ansible role (see `ansible/inventory/host_vars/nuremberg-a.yml`). + ## Why a separate server Mail lives on its own VPS to isolate its IP reputation. If the IP gets flagged for any reason, it doesn't affect the rest of the infrastructure. And if something else gets flagged, it doesn't affect mail deliverability. @@ -35,4 +37,4 @@ Mail-related DNS records are managed via Cloudflare (Terraform): ## Firewall -Managed by Hetzner Cloud firewall rules (Terraform). Mail ports are exposed via Docker port mappings in `ansible/services/poste-io/docker-compose.yml`. +Managed by Hetzner Cloud firewall rules (Terraform, `terraform/hetzner/firewall.tf`). Mail ports are exposed via Docker port mappings in `ansible/services/poste-io/docker-compose.yml`. diff --git a/docs/monitoring.md b/docs/monitoring.md index e296a50..cfce62a 100644 --- a/docs/monitoring.md +++ b/docs/monitoring.md @@ -2,111 +2,82 @@ ## Stack Overview +Observability is a fully managed pipeline today: every host runs **Grafana Alloy** as the local collector, and everything ships to **Grafana Cloud**. Synthetic checks are also driven from Grafana Cloud, and alerts are routed to **PagerDuty**. + ```mermaid -graph TD - subgraph "london-a (FreeBSD)" - Prometheus[":9090 Prometheus"] -->|query| Grafana[":3000 Grafana"] +graph LR + subgraph "Fleet (each host)" + NE["node_exporter :9100"] + SE["systemd_exporter :9558"] + XE["host-specific
exporters
(smartctl, plex,
octopus...)"] + Alloy["alloy.service
(Grafana Alloy)"] + NE --> Alloy + SE --> Alloy + XE --> Alloy end - Prometheus -->|scrape over Tailscale| NE["node_exporter
(all hosts) :9100"] - Prometheus -->|scrape over Tailscale| SE["smartctl_exporter
(london-b) :9633"] - Prometheus -->|scrape over Tailscale| PE["plex_exporter
(london-b)"] + Alloy -->|metrics, logs, traces| GC["Grafana Cloud
pez.grafana.net"] + SM["Synthetic Monitoring
probes (London)"] -->|HTTPS GETs| Internet["https://*.pez.sh"] + SM --> GC + GC -->|alerts| PD["PagerDuty"] ``` -Both Prometheus and Grafana are accessible via: -- **grafana.pez.sh** (behind Authelia) -- **prometheus.pez.sh** (behind Authelia) +Everything in `terraform/grafana/` is the source of truth for the Grafana Cloud side: stack, Fleet Management collectors, fleet pipelines, dashboards, and synthetic checks. Everything in `terraform/pagerduty/` configures the on-call destination. -## Prometheus +## Grafana Alloy (per-host collector) -Prometheus runs on london-a and scrapes metrics from exporters across the fleet. All scrape targets are reached over Tailscale — no ports need to be exposed on the public internet. +Alloy runs as `alloy.service` on every host in the inventory. Each host is registered as a Grafana Fleet Management collector in `terraform/grafana/fleet_collectors.tf`, tagged with a `location` attribute (`london`, `copenhagen`, `cloud`) so pipelines can target subsets of the fleet. -### Scrape Targets +Pipelines (what to scrape, how to relabel, where to ship) live in `terraform/grafana/fleet_pipelines/` and are pushed to Grafana Cloud as a `grafana_fleet_management_pipeline` resource. The Alloy daemons on each host pull their config from Fleet Management. -| Target | Host | Port | What | -|--------|------|------|------| -| node_exporter | All hosts | 9100 | System metrics (CPU, memory, disk, network) | -| smartctl_exporter | london-b | 9633 | Disk SMART health data | -| prom-plex-exporter | london-b | (varies) | Plex streaming activity | +### Local exporters scraped by Alloy -node_exporter is deployed to every host via Ansible. It's one of the first things that gets installed on a new server. 
+| Exporter | Hosts | What | +|---|---|---| +| node_exporter | All hosts | CPU, memory, disk, network, system uptime | +| systemd_exporter | All hosts | Per-unit systemd state | +| smartctl_exporter (Docker) | london-b, copenhagen-a | Disk SMART data | +| prom-plex-exporter (Docker) | london-b | Plex streaming activity | +| octopus_exporter (Docker) | london-c | Octopus Energy electricity usage | +| Caddy `/metrics` | helsinki-a | HTTP request metrics, upstream health (per host) | -### Adding a scrape target +### Logs -1. Deploy the exporter to the host (via Ansible or Docker) -2. Add the target to the Prometheus config in `services/prometheus/` -3. Deploy the updated config (Ansible or manual restart) -4. Verify it shows up in Prometheus targets page +Alloy ships systemd journal entries from every host to Grafana Cloud Logs. Log-derived alerts (e.g. SSH brute-force, mail server errors) can be configured directly in Grafana Cloud. -## Grafana +## Synthetic Monitoring -Grafana reads from Prometheus and provides dashboards for everything worth watching. +Grafana Cloud's Synthetic Monitoring service runs HTTPS probes from the London region against the public services, every 10 minutes. 
Configured in `terraform/grafana/synthetic_checks.tf`: -### Dashboards +| Check | URL | +|---|---| +| pez_sh | https://pez.sh | +| pez_solutions | https://pez.solutions | +| jellyfin | https://jellyfin.pez.sh | +| plex | https://plex.pez.sh (auth header) | +| request | https://request.pez.sh | +| jellyfin_requests | https://jellyfin-requests.pez.sh | +| git | https://git.pez.sh | -| Dashboard | What it shows | -|-----------|--------------| -| Server Health | CPU, memory, disk usage, network I/O across all hosts | -| ZFS | Pool status, usage, scrub results for london-b | -| SMART | Disk health metrics, temperature, error counts | -| Plex | Active streams, transcoding status, library stats | +Each check has a `ProbeFailedExecutionsTooHigh` alert wired up (3 failed executions in a 30-minute window). -### Adding a dashboard +## Alerting → PagerDuty -Dashboards are defined in `services/grafana/`. Export as JSON from Grafana and commit to the repo to keep them in version control. +PagerDuty is configured in `terraform/pagerduty/`: -## Exporters - -### node_exporter - -Standard Prometheus node exporter. Deployed on every host. Provides system-level metrics: -- CPU usage and load averages -- Memory usage -- Disk space and I/O -- Network traffic -- System uptime - -Installed via Ansible as part of the base server setup. - -### smartctl_exporter - -Runs on london-b (the ZFS storage server with 8 spinning disks). Exposes SMART data from all drives: -- Temperature -- Reallocated sectors -- Read/write error rates -- Power-on hours -- Overall health assessment - -Critical for catching dying drives before they take out a RAIDZ1 vdev. - -### prom-plex-exporter - -Runs on london-b. Scrapes the Plex API and exposes metrics about: -- Active streams -- Transcode sessions -- Library size -- User activity - -Mostly for fun — it's satisfying to see the Plex dashboard light up when people are streaming. 
+- A single service (`pez-infra`) receives alerts +- Escalation policy fires to me directly +- The Grafana Cloud → PagerDuty integration sends every fired alert (synthetic check failures today; can be extended to log/metric alerts) ## Status Page -**status.pez.sh** is a lightweight public status page that shows service availability. +**status.pez.sh** is a public status page hosted on helsinki-a at `/srv/status`. -- Pulls availability data from Prometheus -- Shows 90-day uptime history -- Hosted on helsinki-a at `/srv/status` -- Source: [RWejlgaard/pez-status](https://github.com/RWejlgaard/pez-status) -- Not behind Authelia — it's public by design +- Cron-driven static JSON (see `ansible/roles/status_page/`) — does not require Grafana Cloud to render +- Hosted directly by Caddy as a `file_server` +- Public by design (no Authelia) +- Source repo for the front-end: [RWejlgaard/pez-status](https://github.com/RWejlgaard/pez-status) -## Alerting +## History -Prometheus alerting rules can be configured in the Prometheus config. Alert conditions worth monitoring: - -- Host down (node_exporter unreachable) -- Disk space critical (>90% usage) -- ZFS scrub errors -- SMART drive failures -- High memory usage - -Grafana can also be configured with alert channels (email, webhooks, etc.) for dashboard-based alerts. +Monitoring used to run locally on **london-a** (FreeBSD) with a self-hosted Prometheus + Grafana. When london-a was reinstalled as Proxmox VE, the local stack was retired and everything moved to Grafana Cloud + Alloy. Older docs (and a few legacy hard-coded IPs in helper scripts) may still reference `100.122.219.41:9090` — that endpoint no longer exists. 
diff --git a/docs/networking.md b/docs/networking.md index ad8139d..58752b4 100644 --- a/docs/networking.md +++ b/docs/networking.md @@ -9,16 +9,17 @@ All inter-server communication uses Tailscale IPs: | Host | Tailscale IP | |------|-------------| | helsinki-a | 100.67.6.27 | +| london-a | 100.122.180.98 | | london-b | 100.84.65.101 | -| london-a | 100.122.219.41 | -| nuremberg-a | 100.117.235.28 | +| london-c | 100.123.72.87 | +| nuremberg-a | 100.70.180.24 | | copenhagen-a | 100.89.206.60 | | copenhagen-c | 100.115.45.53 | ### What Tailscale is used for - **Reverse proxying:** Caddy on helsinki-a forwards traffic to backends via Tailscale IPs -- **Monitoring:** Prometheus on london-a scrapes exporters on all hosts via Tailscale +- **Observability:** Grafana Alloy on each host pushes metrics/logs/traces to Grafana Cloud; intra-fleet probes (e.g. Proxmox UI) hop over Tailscale - **SSH access:** All SSH is done over Tailscale — no SSH ports exposed to the internet - **Ansible deployments:** GitHub Actions runs Ansible over Tailscale SSH connections - **Exit nodes:** Servers can act as VPN endpoints — useful for accessing UK content from Copenhagen or vice versa @@ -29,21 +30,22 @@ All inter-server communication uses Tailscale IPs: graph TD HEL["helsinki-a"] <--> LB["london-b"] HEL <--> LA["london-a"] + HEL <--> LC["london-c"] HEL <--> NA["nuremberg-a"] - LB <--> LA - LB <--> CA["copenhagen-a"] - LA <--> CA - CA <--> CC["copenhagen-c"] - NA <--> CA - HEL <--> CA - HEL <--> CC + HEL <--> CA["copenhagen-a"] + HEL <--> CC["copenhagen-c"] + LA <--> LB + LA <--> LC + LB <--> LC + LB <--> CA LB <--> CC + NA <--> CA NA <--> LB - NA <--> CC NA <--> LA - LA <--> CC + CA <--> CC style CC stroke-dasharray: 5 5 + style LC stroke-dasharray: 5 5 ``` > Every node can reach every other node directly. The mesh is fully connected. 
@@ -57,7 +59,7 @@ The London setup is in a rack cabinet in the bedroom (great white noise machine, - **Router:** Ubiquiti Dream Machine Special Edition — overkill for a home setup but gives excellent routing performance vs an ISP router - **ISP:** BT, 1 Gbit down / 300 Mbit up, ~£90/month - **Cabling:** Cat 5 in the walls, patch panel in the utility closet, connected to a Ubiquiti switch -- **Servers:** london-a and london-b connected via Ethernet to the switch +- **Servers:** london-a, london-b, and london-c all wired into the Ubiquiti switch (london-c is a Raspberry Pi running over Ethernet) ### Copenhagen @@ -65,22 +67,23 @@ A stack of servers at my dad's place — acts as an off-site location. - **Router:** ISP-provided (not my house, can't exactly install a Ubiquiti rack) - **ISP:** Symmetrical 500 Mbit — plenty for what's running there -- **Servers:** copenhagen-a and copenhagen-c connected directly to the ISP router's built-in switch +- **Servers:** copenhagen-a (Lenovo tiny desktop) and copenhagen-c (Raspberry Pi) connected directly to the ISP router's built-in switch ### Helsinki / Nuremberg (Hetzner Cloud) - Standard Hetzner Cloud VPS networking -- Public IPv4 addresses -- helsinki-a is the only server that receives traffic from the public internet -- nuremberg-a receives mail (ports 25, 587, 993) +- Public IPv4 addresses, managed via the `terraform/hetzner/` module +- helsinki-a is the only server that receives general HTTP/HTTPS traffic from the public internet +- nuremberg-a receives mail (ports 25, 465, 587, 993, 995) ## DNS Flow All DNS is managed by Cloudflare, provisioned via Terraform. -### Domain: pez.sh +### Domains -The domain is registered on Hover.com with nameservers pointed to Cloudflare. +- **pez.sh** — primary domain. Registered on Hover.com with nameservers pointed to Cloudflare. +- **pez.solutions** — alternate domain. 
Most services that have a `*.pez.sh` host also accept the matching `*.pez.solutions` host, so apps remain reachable if one TLD has trouble. ### How a request reaches a service @@ -102,28 +105,33 @@ graph TD ### Public Subdomains -All subdomains are Cloudflare-proxied and terminate at helsinki-a: +All subdomains are Cloudflare-proxied and terminate at helsinki-a. Hosts marked with both `pez.sh` and `pez.solutions` are reachable on either TLD. | Subdomain | Backend | Auth | |---|---|---| -| auth.pez.sh | helsinki-a:9091 | — | -| bitwarden.pez.sh | helsinki-a:8443 | — | -| status.pez.sh | helsinki-a:/srv/status | — | -| apps.pez.sh | helsinki-a:/srv/apps | Authelia | -| grafana.pez.sh | london-a:3000 | Authelia | -| prometheus.pez.sh | london-a:9090 | Authelia | -| jellyfin.pez.sh | london-b:8096 | — | -| plex.pez.sh | london-b:32400 | — | -| request.pez.sh | london-b:5055 | — | -| cloud.pez.sh | london-b:11000 | — | -| music.pez.sh | london-b:4533 | — | -| radarr.pez.sh | london-b:7878 | Authelia | -| sonarr.pez.sh | london-b:8989 | Authelia | -| lidarr.pez.sh | london-b:8686 | Authelia | -| readarr.pez.sh | london-b:8787 | Authelia | -| prowlarr.pez.sh | london-b:9696 | Authelia | -| soulseek.pez.sh | london-b:5030 | Authelia | -| download.pez.sh | london-b:9091 | Authelia | +| auth.pez.sh / auth.pez.solutions | helsinki-a:9091 (Authelia) | — | +| bitwarden.pez.sh | helsinki-a:8443 (Vaultwarden) | Own auth | +| git.pez.sh | helsinki-a:3000 (Forgejo) | Own auth | +| ldap.pez.sh | helsinki-a:17170 (LLDAP web UI) | LLDAP login | +| status.pez.sh | helsinki-a:/srv/status (static) | — | +| apps.pez.sh / apps.pez.solutions | helsinki-a:/srv/apps (static dashboard) | Authelia | +| pez.sh | helsinki-a:/srv/pez.sh (static) | — | +| pez.solutions | helsinki-a:/srv/pez.solutions (static) | — | +| signup.pez.solutions | helsinki-a:/srv/pez-signup (static) | — | +| london-a.pez.sh | london-a:8006 (Proxmox UI) | Proxmox login | +| jellyfin.pez.sh / .solutions | london-b:8096 | 
Own auth | +| plex.pez.sh / .solutions | london-b:32400 | Own auth | +| music.pez.sh | london-b:4533 (Navidrome) | Own auth | +| rss.pez.sh | london-b:8181 (Miniflux) | Authelia | +| request.pez.sh / .solutions | london-b:5055 (Jellyseerr) | Own auth | +| jellyfin-requests.pez.sh / .solutions | london-b:5056 (Overseerr) | Own auth | +| radarr.pez.sh / .solutions | london-b:7878 | Authelia | +| sonarr.pez.sh / .solutions | london-b:8989 | Authelia | +| lidarr.pez.sh / .solutions | london-b:8686 | Authelia | +| readarr.pez.sh / .solutions | london-b:8787 | Authelia | +| prowlarr.pez.sh / .solutions | london-b:9696 | Authelia | +| soulseek.pez.sh / .solutions | london-b:5030 (slskd) | Authelia | +| download.pez.sh / .solutions | london-b:9091 (Transmission) | Authelia | ### Mail DNS @@ -140,13 +148,13 @@ Caddy handles TLS termination for the Cloudflare-to-origin connection. Certifica Example Caddyfile block for a protected service: -``` +```caddyfile radarr.pez.sh { - forward_auth helsinki-a:9091 { - uri /api/verify?rd=https://auth.pez.sh + forward_auth localhost:9091 { + uri /api/authz/forward-auth copy_headers Remote-User Remote-Groups Remote-Name Remote-Email } - reverse_proxy london-b:7878 + reverse_proxy 100.84.65.101:7878 } ``` diff --git a/docs/services.md b/docs/services.md index f029953..8d3510b 100644 --- a/docs/services.md +++ b/docs/services.md @@ -2,20 +2,25 @@ Complete map of every service in the fleet — what it does, where it runs, how it's deployed, and whether it's behind auth. 
-## helsinki-a — Gateway & Auth +## helsinki-a — Gateway, Auth, Git | Service | Port | Deployment | Auth | URL | |---------|------|-----------|------|-----| -| Caddy | 80, 443 | Native (apt) | — | (reverse proxy, no direct URL) | +| Caddy | 80, 443 | Native (apt + systemd) | — | (reverse proxy, no direct URL) | | Authelia | 9091 | Docker | — | auth.pez.sh | -| Bitwarden (Vaultwarden) | 8443 | Docker | Own auth | bitwarden.pez.sh | -| LLDAP | 3890/17170 | Docker | — | (internal, used by Authelia) | +| Authelia MariaDB | 3306 (internal) | Docker | — | (Authelia session/state) | +| LLDAP | 3890, 17170 | Docker | — | ldap.pez.sh (UI) — used by Authelia | +| Bitwarden (Vaultwarden) | 8443, 8080 | Docker | Own auth | bitwarden.pez.sh | +| Bitwarden MariaDB | 3306 (internal) | Docker | — | (Vaultwarden backing DB) | +| Forgejo | 3000 (HTTP), 2222 (SSH) | Docker | Own auth | git.pez.sh | -Caddy is the single entry point for all public traffic. Authelia and LLDAP provide SSO. Bitwarden is on helsinki-a for availability — it needs to be reachable even if the London servers are down. +Caddy is the single entry point for all public traffic and runs as a native apt-managed systemd service so it can bind 80/443 directly. Everything else on this host runs in Docker. + +Authelia provides SSO via Caddy `forward_auth`. LLDAP is Authelia's user backend — it is **not** wired into Forgejo or Bitwarden, both of which keep their own user databases. Bitwarden lives on helsinki-a so password management stays reachable even if the London servers are down. Forgejo hosts internal Git repositories and exposes Git over SSH at `git.pez.sh:2222`. ## london-b — Storage & Media -The workhorse. Threadripper 3970X, 64GB RAM, 64TB ZFS storage. Everything media-related lives here. +The workhorse. Threadripper 3970X, 64GB RAM. Everything media-related lives here.
 ### Media Servers

@@ -31,35 +36,51 @@ I run both Plex and Jellyfin — some clients work better with one than the othe

 | Service | Port | Deployment | Auth | URL |
 |---------|------|-----------|------|-----|
-| Radarr | 7878 | Native (systemd) | Authelia | radarr.pez.sh |
-| Sonarr | 8989 | Native (apt/systemd) | Authelia | sonarr.pez.sh |
-| Lidarr | 8686 | Native (systemd) | Authelia | lidarr.pez.sh |
-| Readarr | 8787 | Native (systemd) | Authelia | readarr.pez.sh |
-| Prowlarr | 9696 | Native (systemd) | Authelia | prowlarr.pez.sh |
+| Radarr | 7878 | Custom systemd unit (`/opt/Radarr`) | Authelia | radarr.pez.sh |
+| Sonarr | 8989 | Native (apt/systemd, mono) | Authelia | sonarr.pez.sh |
+| Lidarr | 8686 | Custom systemd unit (`/opt/Lidarr`) | Authelia | lidarr.pez.sh |
+| Readarr | 8787 | Custom systemd unit (`/opt/Readarr`) | Authelia | readarr.pez.sh |
+| Prowlarr | 9696 | Custom systemd unit (`/opt/Prowlarr`) | Authelia | prowlarr.pez.sh |
+| Whisparr | — | Custom systemd unit (disabled) | — | — |
 | Transmission | 9091 | Native (apt/systemd) | Authelia | download.pez.sh |
 | Jellyseerr | 5055 | Docker | Own auth | request.pez.sh |
+| Overseerr | 5056 | Snap (`overseerr` from `latest/beta`) | Own auth | jellyfin-requests.pez.sh |

-The arr stack pipeline: Jellyseerr accepts requests → Radarr/Sonarr/Lidarr/Readarr search via Prowlarr → sends to Transmission → downloaded content is moved to the library → Plex and Jellyfin pick it up automatically.
+The arr stack pipeline: Jellyseerr/Overseerr accept requests → Radarr/Sonarr/Lidarr/Readarr search via Prowlarr → send to Transmission → downloaded content is moved to the library → Plex and Jellyfin pick it up automatically. Two requesters because Overseerr is hooked into Jellyfin and Jellyseerr into Plex.

 ### Other

 | Service | Port | Deployment | Auth | URL |
 |---------|------|-----------|------|-----|
-| Nextcloud AIO | 11000 | Docker | Own auth | cloud.pez.sh |
+| Nextcloud AIO | 11000 | Docker | Own auth | cloud.pez.sh (internal/Tailscale) |
+| Miniflux | 8181 | Docker (with postgres sidecar) | Authelia | rss.pez.sh |
 | slskd (Soulseek) | 5030 | Docker | Authelia | soulseek.pez.sh |
-| smartctl exporter | 9633 | Docker | — | (scraped by Prometheus) |
-| prom-plex-exporter | — | Docker | — | (scraped by Prometheus) |
+| Syncthing (`syncthing@pez`) | 8384 | Native (apt) | Own auth | (LAN/Tailscale only) |
+| Samba (`smbd`) | 445 | Native (apt) | Local users | (LAN/Tailscale only) |
+| vsftpd | 21 | Native (apt) | Local users | (LAN/Tailscale only) |
+| Ollama | 11434 | Native (`/usr/local/bin`) | — | (Tailscale only) |
+| smartctl_exporter | 9633 | Docker | — | (scraped by Alloy → Grafana Cloud) |
+| prom-plex-exporter | 9594 | Docker | — | (scraped by Alloy → Grafana Cloud) |

-## london-a — Monitoring
+## london-a — Proxmox VE Hypervisor

-Dedicated monitoring host running FreeBSD. Very lightly loaded.
+Repurposed gaming PC (i7-4790K, 32 GB) running Proxmox VE on bare metal. Currently hosts a single Mac VM and is the landing zone for future virtual machines.

 | Service | Port | Deployment | Auth | URL |
 |---------|------|-----------|------|-----|
-| Prometheus | 9090 | Native | Authelia | prometheus.pez.sh |
-| Grafana | 3000 | Native | Authelia | grafana.pez.sh |
+| Proxmox VE | 8006 | Native (Debian Bookworm-based PVE) | Proxmox login | london-a.pez.sh |

-See [monitoring.md](monitoring.md) for details on scrape targets, dashboards, and exporters.
+The web UI is exposed via Caddy at `london-a.pez.sh` but is also reachable directly over Tailscale at `https://100.122.180.98:8006`. Proxmox storage is augmented with a CIFS share mounted from london-b's `/hdd/pve` for ISO/template/backup storage (configured by the `proxmox_ve` Ansible role).
+
+## london-c — Edge Utility (Raspberry Pi)
+
+Raspberry Pi running Debian 13. Houses helper services that don't need a beefy box.
+
+| Service | Port | Deployment | Auth | URL |
+|---------|------|-----------|------|-----|
+| octopus_exporter | 9359 | Docker | — | (scraped by Alloy → Grafana Cloud) |
+
+The `octopus_exporter` pulls electricity consumption data from the Octopus Energy API and exposes it as Prometheus-formatted metrics, which Alloy then ships to Grafana Cloud.

 ## nuremberg-a — Mail

@@ -67,43 +88,48 @@ Dedicated mail server on Hetzner Cloud. Isolated to protect IP reputation.

 | Service | Port | Deployment | Auth | URL |
 |---------|------|-----------|------|-----|
-| poste.io | 25, 587, 993, 443 | Docker | Own auth | (webmail via direct access) |
+| poste.io | 25, 80, 110, 143, 443, 465, 587, 993, 995 | Docker | Own auth | (webmail via direct host access) |

 poste.io bundles everything — postfix, dovecot, rspamd, webmail — into a single container. Makes updates straightforward.

 ## copenhagen-a — Gaming

-Game servers. Not publicly exposed via Caddy — accessed directly or over Tailscale.
+Game servers. Not publicly exposed via Caddy — accessed directly over the public IP/Tailscale.

 | Service | Port | Deployment | Auth | URL |
 |---------|------|-----------|------|-----|
-| Minecraft (PaperMC) | 25565 | Docker | — | (direct connection) |
+| Minecraft (`itzg/minecraft-server`) | 25565 | Docker | — | (direct connection) |
 | MaNGOS realmd | 3724 | Native (systemd) | — | (direct connection) |
 | MaNGOS world | 8085 | Native (systemd) | — | (direct connection) |
-| MariaDB | 3306 | Native | — | (local, used by MaNGOS) |
+| MariaDB | 3306 | Native (apt) | — | (local, used by MaNGOS) |
+| smartctl_exporter | 9633 | Docker | — | (scraped by Alloy → Grafana Cloud) |

 MaNGOS Zero is a WoW 1.12 (Vanilla) private server. Runs natively under systemd as the `mangos` user from `/home/mangos/mangos/zero/`. Not containerised — it predates the Docker setup on this host.

-## copenhagen-c — Idle
+## copenhagen-c — Idle (Raspberry Pi)

-No active services. Available for future use.
+Raspberry Pi running Debian 12 at the Copenhagen site. Mostly idle, but runs a cloudflared tunnel for one-off use.

-## Exporters (Monitoring)
+| Service | Port | Deployment | Auth | URL |
+|---------|------|-----------|------|-----|
+| cloudflared | — | Native (systemd) | — | (Cloudflare-managed tunnel) |

-These run on various hosts and are scraped by Prometheus:
+## Observability Agents

-| Exporter | Host | What it monitors |
-|----------|------|-----------------|
-| node_exporter | All hosts | CPU, memory, disk, network |
-| smartctl_exporter | london-b | Disk SMART health data |
-| prom-plex-exporter | london-b | Plex activity metrics |
+Every host runs:
+
+- **Grafana Alloy** (`alloy.service`) — collects metrics/logs/traces and ships them to Grafana Cloud
+- **node_exporter** (`prometheus-node-exporter.service`) — host metrics (CPU/memory/disk/network)
+- **systemd_exporter** (`systemd_exporter.service`) — per-unit systemd metrics
+
+Plus host-specific exporters (smartctl, plex, octopus) called out above. See [monitoring.md](monitoring.md) for details on what gets shipped and where.

 ## Auth Summary

 Services fall into two categories:

-**Behind Authelia** (SSO via Caddy forward_auth):
-- Grafana, Prometheus, Radarr, Sonarr, Lidarr, Readarr, Prowlarr, Transmission, Soulseek, apps.pez.sh
+**Behind Authelia** (SSO via Caddy `forward_auth`):
+- Radarr, Sonarr, Lidarr, Readarr, Prowlarr, Transmission, Soulseek, Miniflux, apps.pez.sh

 **Own auth** (handle login themselves):
-- Bitwarden, Plex, Jellyfin, Nextcloud, Navidrome, Jellyseerr, poste.io
+- Bitwarden, Forgejo, Plex, Jellyfin, Navidrome, Jellyseerr, Overseerr, Proxmox, poste.io