pez-infra/docs/getting-started.md
Rasmus "Pez" Wejlgaard 0a357fc69a
docs: catch up with the Cloudflare to Hetzner DNS move, fix secrets/terraform drift (#130)
The docs still described Cloudflare as DNS + CDN in front of helsinki-a,
but that was dropped in #90 - pez.sh lives on Hetzner DNS via Terraform
now and records point straight at the origin. Updated README,
architecture, networking, getting-started and the nuremberg-a host doc
to match, and noted that pez.solutions still resolves via Cloudflare
outside Terraform.

Also fixed while I was in there:
- terraform/README: PagerDuty provider is ~> 3.32 (table said ~> 2.2),
  and the B2 secret keys are backblaze_keyID/backblaze_applicationKey
- secrets docs: group_vars secrets file is .enc.yaml, dropped the
  FreeBSD install steps, the long-gone .sops.yaml placeholder note and
  the ANSIBLE_VAULT_PASS migration note, swapped the cloudflare_record
  example for hcloud
- getting-started referenced ansible/scripts/sops-setup.sh which
  doesn't exist
- added naveen.pez.sh to the subdomain tables and a note about the
  DNS-only records (mail, minecraft, wow, public)
2026-06-10 20:59:23 +01:00

# Getting Started
How to work with this repo, deploy changes, and not break things.
## Prerequisites
You'll need:
- **Tailscale** — installed and connected to the tailnet. All SSH access goes through Tailscale. No servers have SSH exposed on the public internet.
- **SSH keys** — set up for each host you need to access
- **Ansible** — for configuration management and deployments (`make deps` from `ansible/` installs collections)
- **OpenTofu** (or Terraform) — for Hetzner (servers + DNS), Grafana Cloud, and PagerDuty
- **Docker** — helpful to understand, since most services are containerised
- **SOPS + age** — for secrets encryption/decryption (see [Secrets](secrets.md) for setup)
- **Git** — obviously
- **gh CLI** — for GitHub operations (PRs, issues, etc.)
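
A quick way to confirm the tooling above is installed is to check each command on `PATH`. The sketch below assumes the usual binary names (`tofu` for OpenTofu, `ansible-playbook` for Ansible); adjust the list if you use Terraform or a different packaging.

```bash
# Print every required command that is not found on PATH.
# The default list mirrors the prerequisites; the binary names are assumptions.
missing_tools() {
  if [ "$#" -eq 0 ]; then
    set -- tailscale ssh ansible-playbook tofu docker sops age git gh
  fi
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage: prints nothing when everything is installed.
# missing_tools
```
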
## Clone the repo
```bash
git clone git@github.com:RWejlgaard/pez-infra.git
cd pez-infra
```
## Repo Structure
```
pez-infra/
├── docs/               # You are here
├── ansible/            # Ansible playbooks, roles, inventory, and all managed files
│   ├── roles/          # Ansible roles (common, caddy, docker, media_stack, proxmox_ve, etc.)
│   ├── services/       # Docker Compose definitions and service configs
│   ├── dotfiles/       # Shell config (fish, nvim, tmux, git, etc.)
│   ├── playbooks/      # One-off playbooks (updates, reboots, status)
│   └── scripts/        # Utility and maintenance scripts
└── terraform/          # Terraform/OpenTofu for Hetzner (servers + DNS), Grafana Cloud, PagerDuty
```
## Connecting to hosts
All access is via Tailscale, as `root`. Once you're on the tailnet, SSH using the Tailscale IP or hostname:
```bash
ssh root@helsinki-a # or ssh root@100.67.6.27
ssh root@london-a # Proxmox VE host
ssh root@london-b # storage / media
ssh root@london-c # Raspberry Pi
ssh root@copenhagen-a
ssh root@copenhagen-c # Raspberry Pi
ssh root@nuremberg-a
```
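
To see which hosts are reachable before a deploy, a loop over the hostnames above can help. This is a sketch: it assumes key-based login as `root` already works, and the 5-second timeout is an arbitrary choice.

```bash
# Try a no-op SSH command on each host and report the result.
# BatchMode=yes makes ssh fail instead of prompting for a password, so
# "unreachable" can also mean the key was rejected.
check_hosts() {
  for host in "$@"; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$host" true 2>/dev/null; then
      printf '%s ok\n' "$host"
    else
      printf '%s unreachable\n' "$host"
    fi
  done
}

# Usage:
# check_hosts helsinki-a london-a london-b london-c copenhagen-a copenhagen-c nuremberg-a
```
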
## Common Tasks
### Deploying configuration changes
Ansible handles deployments. The unified `deploy.yml` rebuilds a host from bare-metal-with-Tailscale to fully configured.
```bash
cd ansible/
# Install collections
make deps
# Dry run — see what would change
make deploy-check
# Deploy everything
make deploy
# Deploy a single host
make deploy-host HOST=london-b
# Or run a single stage
ansible-playbook deploy.yml --tags docker
```
Ansible also runs automatically via GitHub Actions on commits to the main branch — so a quick commit from your phone can fix a misconfiguration when you're out.
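
When a change only affects one host, a dry run scoped to that host can be a useful step before `make deploy-host`. The sketch below calls `ansible-playbook` directly with its standard `--limit`, `--check`, and `--diff` flags. It assumes `deploy.yml` picks up the inventory from the repo's Ansible config the same way the Makefile targets do, which is not confirmed here.

```bash
# Dry-run deploy.yml against a single host and show what would change.
# Run from ansible/. Check mode is intended to make no changes; tasks whose
# modules do not support it are skipped.
deploy_dry_run() {
  if [ -z "${1:-}" ]; then
    echo "usage: deploy_dry_run <host>" >&2
    return 2
  fi
  ansible-playbook deploy.yml --limit "$1" --check --diff
}

# Usage:
# deploy_dry_run london-b
```
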
Other playbooks live under `ansible/playbooks/`:
| Playbook | Purpose |
|---|---|
| `update-all.yml` | OS package updates (all hosts) |
| `update-linux.yml` | Linux-only updates (apt) |
| `docker-status.yml` | Show running containers per host |
| `reboot.yml` | Safe reboot with pre-flight (interactive confirm for london-b) |
| `zfs.yml` | ZFS scrub scheduling |
### Managing cloud + DNS + observability
Terraform manages Hetzner servers + DNS, Grafana Cloud (stack, fleet, dashboards, synthetic checks), and PagerDuty:
```bash
cd terraform
make init # initialize providers and B2 backend
make plan # preview changes
make apply # apply the changes
```
State lives in a Backblaze B2 bucket (`pez-infra-tfstate`) via the S3-compatible backend. Don't click around in the Hetzner or Grafana Cloud dashboards — if it's not in Terraform, it doesn't exist.
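
One way to make sure that what gets applied is exactly what was reviewed is to save the plan to a file and apply that file. This sketch calls `tofu` directly rather than the Makefile targets, so it assumes you are in the directory holding the root module and that `make init` has already run. The `tfplan` filename is arbitrary.

```bash
# Save a plan, ask for confirmation, then apply that exact plan.
# Plan files can contain sensitive values, so keep them out of Git.
plan_then_apply() {
  plan_file="${1:-tfplan}"
  tofu plan -out="$plan_file" || return 1
  printf 'Apply %s? [y/N] ' "$plan_file"
  read -r answer
  if [ "$answer" = "y" ]; then
    tofu apply "$plan_file"
  else
    echo "Skipped apply."
  fi
}

# Usage:
# plan_then_apply
```
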
### Adding a new service
1. **Create a Docker Compose file** in `ansible/services/<service-name>/docker-compose.yml` (or a systemd unit if it's native)
2. **Add the host_var** — list the service under `docker_services` (or `systemd_services`) in `ansible/inventory/host_vars/<host>.yml`
3. **Add the Caddy route** — if it needs a public subdomain, add a block to `ansible/services/caddy/Caddyfile`
4. **Add a DNS record** — add the subdomain to `terraform/hetzner/dns.tf` and run `tofu apply`
5. **Add monitoring** — if the service has a metrics endpoint, scrape it via Alloy (`terraform/grafana/fleet_pipelines/`)
6. **Update docs** — add the service to `docs/services.md` (and the relevant `docs/hosts/<host>.md` page)
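
Step 1 can be sketched as a small scaffold function. The compose contents are placeholders: `example/image:latest` is a hypothetical image, and real services in this repo may follow conventions not shown here.

```bash
# Create ansible/services/<name>/docker-compose.yml with a minimal skeleton.
# Run from the repo root. Refuses to overwrite an existing file.
scaffold_service() {
  name="${1:-}"
  if [ -z "$name" ]; then
    echo "usage: scaffold_service <service-name>" >&2
    return 2
  fi
  dir="ansible/services/$name"
  file="$dir/docker-compose.yml"
  if [ -e "$file" ]; then
    echo "$file already exists" >&2
    return 1
  fi
  mkdir -p "$dir"
  cat > "$file" <<EOF
services:
  $name:
    image: example/image:latest  # hypothetical image, replace before deploying
    restart: unless-stopped
EOF
  echo "created $file"
}

# Usage:
# scaffold_service my-service
```

Steps 2 to 6 still need doing by hand; the scaffold only covers the file layout.
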
### Adding a new server
1. Install the OS (Debian 13 or Ubuntu LTS preferred; see [OS Choice](#os-choice) below)
2. Set up SSH keys for `root`
3. Install Tailscale and join the tailnet
4. Add the host to `ansible/inventory/hosts.ini` and create `ansible/inventory/host_vars/<host>.yml`
5. Run `make deploy-host HOST=<new-host>` from `ansible/`
6. Register the host as a Grafana Fleet collector in `terraform/grafana/fleet_collectors.tf` and `tofu apply`
7. Add a doc at `docs/hosts/<host>.md` and update `docs/services.md` + `docs/architecture.md`
That's it. The common role installs node_exporter, systemd_exporter, and Alloy as part of the baseline, so observability is automatic.
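
Before running step 5 it can help to confirm that step 4 is complete. The check below is a sketch that only looks for the hostname in `hosts.ini` and for the matching `host_vars` file. It assumes the inventory lists each host at the start of a line, and it says nothing about whether the variables themselves are correct.

```bash
# Verify that a host is in the inventory and has a host_vars file.
# Run from ansible/.
preflight_new_host() {
  host="${1:-}"
  if [ -z "$host" ]; then
    echo "usage: preflight_new_host <host>" >&2
    return 2
  fi
  status=0
  if ! grep -Eq "^${host}([[:space:]]|\$)" inventory/hosts.ini; then
    echo "missing from inventory/hosts.ini: $host"
    status=1
  fi
  if [ ! -f "inventory/host_vars/$host.yml" ]; then
    echo "missing inventory/host_vars/$host.yml"
    status=1
  fi
  return "$status"
}

# Usage:
# preflight_new_host <new-host> && make deploy-host HOST=<new-host>
```
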
### Working with ZFS (london-b)
```bash
# Check pool status
zpool status hdd
# Check usage
zfs list
# Scrub status (runs weekly on Sundays at 12:00)
zpool status hdd | grep scan
```
The `hdd` pool is built from 3× RAIDZ1 vdevs of 4 drives each (12 drives total), so it tolerates one drive failure per vdev. The long-term plan is to replace the 8 TB drives with 24 TB drives and grow the pool toward 24 drives, which is 576 TB (roughly 0.58 PB) raw.
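
The capacity figures follow from simple arithmetic. The sketch below assumes every drive in the pool has the same size, and it ignores ZFS metadata, padding, and the TB/TiB difference, so real usable space will be lower.

```bash
# RAIDZ1 spends one drive's worth of space per vdev on parity.
# Prints raw and approximate usable capacity in TB.
raidz1_capacity() {
  vdevs="$1"; width="$2"; size_tb="$3"
  raw=$(( vdevs * width * size_tb ))
  usable=$(( vdevs * (width - 1) * size_tb ))
  echo "raw=${raw}TB usable~${usable}TB"
}

raidz1_capacity 3 4 8     # current layout, assuming all twelve drives are 8 TB
echo "$(( 24 * 24 ))TB"   # raw size of 24 drives at 24 TB each
```
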
## OS Choice
- **Debian (12 or 13)** is the default for new hosts — including the Raspberry Pis. Stable, well-supported by Ansible, predictable.
- **Ubuntu LTS** is on london-b and copenhagen-a (historical — both came up before the Debian standard).
- **Proxmox VE** (Debian Bookworm under the hood) on london-a.
- **No more FreeBSD.** london-a used to run FreeBSD for Prometheus/Grafana; that's all on Grafana Cloud now and london-a is Linux/Proxmox.
Alpine has been tried and rejected — the missing GNU binaries / systemd caused enough Ansible headaches to not be worth the size savings.
## Secrets
Secrets are encrypted in-repo using **SOPS + age**. Encrypted files have `.enc.` in their filename, just before the extension (e.g. `secrets.enc.yaml`, `config.enc.yml`).
```bash
# Edit an encrypted file
sops ansible/services/authelia/config.enc.yml
# Decrypt to stdout
sops -d ansible/services/authelia/config.enc.yml
```
Full documentation: [docs/secrets.md](secrets.md)
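
Because the `.enc.` name alone does not prove a file is encrypted, a check before committing can catch one that was saved in plaintext. This sketch relies on SOPS writing a top-level `sops:` metadata key into encrypted YAML. It only covers `.enc.yaml` and `.enc.yml` files, and it is an illustration rather than part of the repo's tooling.

```bash
# List *.enc.yaml / *.enc.yml files under a directory that lack SOPS metadata.
find_unencrypted() {
  find "${1:-.}" -type f \( -name '*.enc.yaml' -o -name '*.enc.yml' \) |
    while IFS= read -r file; do
      grep -q '^sops:' "$file" || printf '%s\n' "$file"
    done
}

# Usage: prints nothing when every matching file carries SOPS metadata.
# find_unencrypted ansible/
```
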
## Branching
- `main` is the production branch. Ansible runs from main via GitHub Actions.
- Feature branches for changes, PRs for review.
- Branch naming: `<author>/PESO-<number>-<description>` for Jira-tracked work.
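
The naming convention can be checked mechanically. The pattern below is an assumption about what counts as a valid author and description (lowercase letters, digits, dots, underscores, hyphens); this guide does not state the allowed characters.

```bash
# Return 0 if the branch name matches <author>/PESO-<number>-<description>.
valid_branch_name() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9._-]+/PESO-[0-9]+-[a-z0-9._-]+$'
}

# Usage (the branch name is a made-up example):
# valid_branch_name "pez/PESO-42-fix-dns" && echo ok
```
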
## Consolidated Repos
This monorepo replaces several standalone repos:
| Old repo | Now lives in |
|----------|-------------|
| pez-ansible | `ansible/` |
| pez-terraform | `terraform/` |
| pez-grafana | `terraform/grafana/` |
| pez-proxy | `ansible/services/caddy/` |
| pez-docs | `docs/` |
| server-scripts | `ansible/scripts/` and `ansible/roles/` |