pez-infra/docs/getting-started.md
Rasmus Wejlgaard 361133ec7e docs: catch up with the Cloudflare to Hetzner DNS move, fix secrets/terraform drift
The docs still described Cloudflare as DNS + CDN in front of helsinki-a,
but that was dropped in #90 - pez.sh lives on Hetzner DNS via Terraform
now and records point straight at the origin. Updated README,
architecture, networking, getting-started and the nuremberg-a host doc
to match, and noted that pez.solutions still resolves via Cloudflare
outside Terraform.

Also fixed while I was in there:
- terraform/README: PagerDuty provider is ~> 3.32 (table said ~> 2.2),
  and the B2 secret keys are backblaze_keyID/backblaze_applicationKey
- secrets docs: group_vars secrets file is .enc.yaml, dropped the
  FreeBSD install steps, the long-gone .sops.yaml placeholder note and
  the ANSIBLE_VAULT_PASS migration note, swapped the cloudflare_record
  example for hcloud
- getting-started referenced ansible/scripts/sops-setup.sh which
  doesn't exist
- added naveen.pez.sh to the subdomain tables and a note about the
  DNS-only records (mail, minecraft, wow, public)
2026-06-10 19:35:53 +01:00


Getting Started

How to work with this repo, deploy changes, and not break things.

Prerequisites

You'll need:

  • Tailscale — installed and connected to the tailnet. All SSH access goes through Tailscale. No servers have SSH exposed on the public internet.
  • SSH keys — set up for each host you need to access
  • Ansible — for configuration management and deployments (make deps from ansible/ installs collections)
  • OpenTofu (or Terraform) — for Hetzner (servers + DNS), Grafana Cloud, and PagerDuty
  • Docker — helpful to understand, since most services are containerised
  • SOPS + age — for secrets encryption/decryption (see Secrets for setup)
  • Git — obviously
  • gh CLI — for GitHub operations (PRs, issues, etc.)
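
A quick way to see what's still missing from that list is to check your PATH. This is a minimal sketch, not something that ships in the repo: missing_tools is a hypothetical helper, and the binary names are assumptions (OpenTofu is checked as tofu, Ansible as ansible-playbook).

```shell
#!/bin/sh
# Print the name of every tool in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Binary names are assumptions; swap "tofu" for "terraform" if that is
# what you have installed.
missing_tools tailscale ssh ansible-playbook tofu docker sops age git gh
```

Anything it prints still needs installing; no output means every listed binary was found. It says nothing about whether Tailscale is connected or whether your age key is set up.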

Clone the repo

git clone git@github.com:RWejlgaard/pez-infra.git
cd pez-infra

Repo Structure

pez-infra/
├── docs/           # You are here
├── ansible/        # Ansible playbooks, roles, inventory, and all managed files
│   ├── roles/      # Ansible roles (common, caddy, docker, media_stack, proxmox_ve, etc.)
│   ├── services/   # Docker Compose definitions and service configs
│   ├── dotfiles/   # Shell config (fish, nvim, tmux, git, etc.)
│   ├── playbooks/  # One-off playbooks (updates, reboots, status)
│   └── scripts/    # Utility and maintenance scripts
└── terraform/      # Terraform/OpenTofu for Hetzner (servers + DNS), Grafana Cloud, PagerDuty

Connecting to hosts

All access is via Tailscale, as root. Once you're on the tailnet, SSH using the Tailscale IP or hostname:

ssh root@helsinki-a        # or ssh root@100.67.6.27
ssh root@london-a          # Proxmox VE host
ssh root@london-b          # storage / media
ssh root@london-c          # Raspberry Pi
ssh root@copenhagen-a
ssh root@copenhagen-c      # Raspberry Pi
ssh root@nuremberg-a

Common Tasks

Deploying configuration changes

Ansible handles deployments. The unified deploy.yml rebuilds a host from bare-metal-with-Tailscale to fully configured.

cd ansible/

# Install collections
make deps

# Dry run — see what would change
make deploy-check

# Deploy everything
make deploy

# Deploy a single host
make deploy-host HOST=london-b

# Or run a single stage
ansible-playbook deploy.yml --tags docker
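
The single-stage run combines with the standard ansible-playbook options --limit, --check and --diff if you want a dry run of one stage on one host. A sketch, assuming deploy.yml behaves sensibly in check mode (make deploy-check suggests it does). stage_check and RUNNER are hypothetical names, not part of the repo's Makefile.

```shell
#!/bin/sh
# Sketch: dry-run one stage of deploy.yml on one host.
# RUNNER defaults to "echo", so the command is only printed for review.
# Set RUNNER=env and run from ansible/ to actually execute it.
stage_check() {
    host="$1"   # inventory hostname, e.g. london-b
    tag="$2"    # stage tag, e.g. docker
    ${RUNNER:-echo} ansible-playbook deploy.yml \
        --limit "$host" --tags "$tag" --check --diff
}

stage_check london-b docker
```

Printing by default is deliberate: it keeps the sketch harmless until you opt in with RUNNER=env.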

Ansible also runs automatically via GitHub Actions on commits to the main branch — so a quick commit from your phone can fix a misconfiguration when you're out.

Other playbooks live under ansible/playbooks/:

| Playbook | Purpose |
| --- | --- |
| update-all.yml | OS package updates (all hosts) |
| update-linux.yml | Linux-only updates (apt) |
| docker-status.yml | Show running containers per host |
| reboot.yml | Safe reboot with pre-flight (interactive confirm for london-b) |
| zfs.yml | ZFS scrub scheduling |

Managing cloud + DNS + observability

Terraform manages Hetzner servers + DNS, Grafana Cloud (stack, fleet, dashboards, synthetic checks), and PagerDuty:

cd terraform
make init   # initialize providers and B2 backend
make plan   # preview changes
make apply  # apply the changes

State lives in a Backblaze B2 bucket (pez-infra-tfstate) via the S3-compatible backend. Don't click around in the Hetzner or Grafana Cloud dashboards — if it's not in Terraform, it doesn't exist.
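
If someone did click around, a plan will show it. One way to check for drift without reading the full plan is the -detailed-exitcode flag of tofu plan (terraform plan has the same flag): exit code 0 means no changes, 1 means the plan failed, 2 means changes are pending. A small sketch; plan_status is a hypothetical helper, and running tofu directly from terraform/ rather than through make is an assumption about the repo layout.

```shell
#!/bin/sh
# Map the exit code of `tofu plan -detailed-exitcode` to a short status.
#   0 = plan succeeded, nothing to change
#   2 = plan succeeded, changes pending (possibly drift)
#   anything else = the plan itself failed
plan_status() {
    case "$1" in
        0) echo "in sync" ;;
        2) echo "changes pending" ;;
        *) echo "plan failed" ;;
    esac
}

# Usage, from the directory holding the root module (assumed: terraform/):
#   tofu plan -detailed-exitcode >/dev/null; plan_status "$?"
```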

Adding a new service

  1. Create a Docker Compose file in ansible/services/<service-name>/docker-compose.yml (or a systemd unit if it's native)
  2. Add the host_var — list the service under docker_services (or systemd_services) in ansible/inventory/host_vars/<host>.yml
  3. Add the Caddy route — if it needs a public subdomain, add a block to ansible/services/caddy/Caddyfile
  4. Add a DNS record — add the subdomain to terraform/hetzner/dns.tf and run tofu apply
  5. Add monitoring — if the service has a metrics endpoint, scrape it via Alloy (terraform/grafana/fleet_pipelines/)
  6. Update docs — add the service to docs/services.md (and the relevant docs/hosts/<host>.md page)
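
Step 1 is easy to scaffold. The sketch below only creates the directory and a placeholder Compose file; steps 2 to 6 are still manual. scaffold_service is a hypothetical helper, and the image name and restart policy in the generated file are placeholders, not repo conventions.

```shell
#!/bin/sh
# Sketch: create ansible/services/<name>/docker-compose.yml (step 1 only).
# $1 = service name, $2 = repo root (defaults to the current directory).
scaffold_service() {
    name="$1"
    dir="${2:-.}/ansible/services/$name"
    compose="$dir/docker-compose.yml"

    # Refuse to overwrite an existing definition.
    if [ -e "$compose" ]; then
        echo "already exists: $compose" >&2
        return 1
    fi

    mkdir -p "$dir" || return 1
    # Placeholder content: image and restart policy are illustrative only.
    cat > "$compose" <<EOF
services:
  $name:
    image: example/$name:latest   # hypothetical image
    container_name: $name
    restart: unless-stopped
EOF
    echo "$compose"
}

# Usage, from the repo root:
#   scaffold_service whoami
```

It prints the path of the file it created, and fails rather than clobbering a service that already exists.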

Adding a new server

  1. Install the OS (Debian 13 or Ubuntu LTS preferred; see OS Choice below)
  2. Set up SSH keys for root
  3. Install Tailscale and join the tailnet
  4. Add the host to ansible/inventory/hosts.ini and create ansible/inventory/host_vars/<host>.yml
  5. Run make deploy-host HOST=<new-host> from ansible/
  6. Register the host as a Grafana Fleet collector in terraform/grafana/fleet_collectors.tf and tofu apply
  7. Add a doc at docs/hosts/<host>.md and update docs/services.md + docs/architecture.md

That's it. The common role installs node_exporter, systemd_exporter, and Alloy as part of the baseline, so observability is automatic.

Working with ZFS (london-b)

# Check pool status
zpool status hdd

# Check usage
zfs list

# Scrub status (runs weekly on Sundays at 12:00)
zpool status hdd | grep scan

The hdd pool is made up of 3× RAIDZ1 vdevs of 4 drives each (12 drives total), so it tolerates one drive failure per vdev. The long-term plan is to replace the 8 TB drives with 24 TB drives and grow the pool toward 24 drives / ~0.5 PB raw.
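
The capacity numbers are simple arithmetic: RAIDZ1 spends one drive per vdev on parity. A rough sketch, assuming all 12 current drives are 8 TB (the text only implies this) and ignoring ZFS metadata and padding overhead. The function names are made up for illustration.

```shell
#!/bin/sh
# Usable space of a RAIDZ1 pool: one drive per vdev goes to parity.
# Arguments: number of vdevs, drives per vdev, drive size in TB.
raidz1_usable_tb() {
    vdevs="$1"; per_vdev="$2"; drive_tb="$3"
    echo $(( vdevs * (per_vdev - 1) * drive_tb ))
}

# Raw space: drive count times drive size in TB.
raw_tb() {
    drives="$1"; drive_tb="$2"
    echo $(( drives * drive_tb ))
}

raw_tb 12 8             # current raw, assuming 12 x 8 TB      -> 96
raidz1_usable_tb 3 4 8  # current usable, before ZFS overhead  -> 72
raw_tb 24 24            # long-term target, 24 x 24 TB         -> 576
```

576 TB raw is about 0.58 PB in decimal units, or about 0.51 PiB. The usable figure for the target depends on a vdev layout that isn't decided here, so the sketch only computes raw capacity for it.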

OS Choice

  • Debian (12 or 13) is the default for new hosts — including the Raspberry Pis. Stable, well-supported by Ansible, predictable.
  • Ubuntu LTS is on london-b and copenhagen-a (historical — both came up before the Debian standard).
  • Proxmox VE (Debian Bookworm under the hood) on london-a.
  • No more FreeBSD. london-a used to run FreeBSD for Prometheus/Grafana; that's all on Grafana Cloud now and london-a is Linux/Proxmox.

Alpine has been tried and rejected: the missing GNU binaries and systemd caused enough Ansible headaches that the size savings weren't worth it.

Secrets

Secrets are encrypted in-repo using SOPS + age. Encrypted files have .enc. in their name, just before the extension (e.g. secrets.enc.yaml).

# Edit an encrypted file
sops ansible/services/authelia/config.enc.yml

# Decrypt to stdout
sops -d ansible/services/authelia/config.enc.yml
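
Because the .enc. naming is only a convention, it's worth checking that a file with that name really is encrypted before committing it. SOPS writes a top-level sops: metadata key into encrypted YAML, so a crude check is to look for it. This is a sketch, not a repo script: unencrypted_enc_files is a hypothetical name, and it only covers YAML files.

```shell
#!/bin/sh
# Sketch: list *.enc.yaml / *.enc.yml files under a directory that lack the
# top-level "sops:" metadata key, i.e. files that look unencrypted.
# $1 = directory to scan (defaults to the current directory).
unencrypted_enc_files() {
    root="${1:-.}"
    find "$root" -type f \( -name '*.enc.yaml' -o -name '*.enc.yml' \) |
    while IFS= read -r f; do
        grep -q '^sops:' "$f" || printf '%s\n' "$f"
    done
}

# Usage, from the repo root:
#   unencrypted_enc_files ansible
```

No output means every matching file carries SOPS metadata. It doesn't verify that the file decrypts; sops -d is still the real test.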

Full documentation: docs/secrets.md

Branching

  • main is the production branch. Ansible runs from main via GitHub Actions.
  • Feature branches for changes, PRs for review.
  • Branch naming: <author>/PESO-<number>-<description> for Jira-tracked work.
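
The naming pattern is easy to check mechanically, for example in a local hook. A sketch under assumptions: valid_branch is a hypothetical helper, the allowed character set (lowercase letters, digits, hyphens) is a guess, and the author and ticket number in the example are made up. Work that isn't tracked in Jira isn't covered by this pattern.

```shell
#!/bin/sh
# Sketch: check a branch name against <author>/PESO-<number>-<description>.
# Assumed character set: lowercase letters, digits and hyphens.
valid_branch() {
    printf '%s\n' "$1" | grep -Eq '^[a-z0-9-]+/PESO-[0-9]+-[a-z0-9-]+$'
}

# Hypothetical author and ticket number.
valid_branch "rasmus/PESO-123-fix-dns" && echo "ok"
```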

Consolidated Repos

This monorepo replaces several standalone repos:

| Old repo | Now lives in |
| --- | --- |
| pez-ansible | ansible/ |
| pez-terraform | terraform/ |
| pez-grafana | terraform/grafana/ |
| pez-proxy | ansible/services/caddy/ |
| pez-docs | docs/ |
| server-scripts | ansible/scripts/ and ansible/roles/ |