The docs still described Cloudflare as DNS + CDN in front of helsinki-a,
but that was dropped in #90 - pez.sh lives on Hetzner DNS via Terraform
now and records point straight at the origin. Updated README,
architecture, networking, getting-started and the nuremberg-a host doc
to match, and noted that pez.solutions still resolves via Cloudflare
outside Terraform.
Also fixed while I was in there:
- terraform/README: PagerDuty provider is ~> 3.32 (table said ~> 2.2),
and the B2 secret keys are backblaze_keyID/backblaze_applicationKey
- secrets docs: group_vars secrets file is .enc.yaml, dropped the
FreeBSD install steps, the long-gone .sops.yaml placeholder note and
the ANSIBLE_VAULT_PASS migration note, swapped the cloudflare_record
example for hcloud
- getting-started referenced ansible/scripts/sops-setup.sh which
doesn't exist
- added naveen.pez.sh to the subdomain tables and a note about the
DNS-only records (mail, minecraft, wow, public)
docker-log-cleanup.sh lived in the repo but nothing deployed it — the
script and monthly cron on nuremberg-a were set up by hand and got wiped
when the host was reinstalled. Fold both into the docker role so every
docker_hosts member gets the script in /usr/local/bin and a monthly cron,
and it survives a rebuild.
copenhagen-c stopped reporting to Grafana Cloud on 2026-05-20: a transient
TLS failure to fleet-management tripped systemd's default start rate-limit,
systemd gave up, and the host sat silently unmonitored for ~2.5 weeks.
Add a 10-resilience.conf systemd drop-in for alloy.service on every host
(StartLimitIntervalSec=0, Restart=always, RestartSec=30) so a momentary
upstream/TLS blip can no longer permanently kill the collector.
Also drop the old self-hosted Grafana package that was left enabled and
failing on copenhagen-c after the move to Grafana Cloud.
cloudflared was retired in #56 when Caddy + Authelia replaced Cloudflare
Tunnels, but copenhagen-a was unreachable at the time so its
cloudflared.service was never stopped and is still running.
Add a cleanup task to the common role that stops, disables and purges
cloudflared wherever the unit lingers. Gated on the unit file existing so
it self-targets copenhagen-a and is a no-op everywhere else, and explicitly
excludes copenhagen-c, which legitimately runs a hand-configured tunnel.
Samba on london-b was allowed on 445/tcp from anywhere via UFW, exposing
SMB/CIFS to the public internet. Tailscale already reaches it through the
tailscale0 allow-all rule, so scope the explicit rule to the local London
LAN (192.168.1.0/24) instead of the world.
The common UFW task only ever adds allow rules, so it gained support for an
optional per-port from_ip, plus a follow-up task that deletes the superseded
world-open variant of any source-restricted port — otherwise the old
'445 ALLOW Anywhere' rule would linger on the host and defeat the change.
PESO-145
Bookshelf (PR #122) is a Readarr revival and now owns port 8787 on
london-b, so the old custom Readarr systemd unit is removed:
- drop readarr from the media_stack role's unit-deploy and enable loops,
and add an idempotent decommission task (stop, disable, remove unit)
so the host tears it down via Ansible rather than ad-hoc SSH
- delete services/readarr/readarr.service
- update docs (services, london-b host, service inventory) to describe
bookshelf as a Docker service instead of a custom systemd unit
The public readarr.pez.sh hostname is kept and now reverse-proxies to
bookshelf on :8787 — DNS, Caddy and Authelia (pez_readarr_users group)
are unchanged.
Bookshelf (a Readarr revival) for managing the ebook/audiobook library.
Runs on london-b with config at /root/bookshelf and the library at
/hdd/books mounted into the container at the same path.
The .terraform.lock.hcl was gitignored while providers use floating
~> constraints, so every CI 'tofu init' resolved provider versions
fresh and could drift from what was tested locally, with no checksum
verification on the providers.
Track the lock file instead, with hashes for linux_amd64 (CI) plus
darwin_arm64/amd64 (local). Dependabot's terraform updates now surface
exact provider version bumps as reviewable, hash-pinned changes.
The nightly job runs 'rclone sync', which permanently deletes or overwrites
objects at the B2 destination. That means an accidental deletion or a
ransomware encryption on /hdd propagates straight to the backup on the next
run, leaving no clean copy.
Add --backup-dir so every superseded version is moved into a dated folder
under _versions/ rather than thrown away, and prune that folder after 30
days so it doesn't grow unbounded.
The docker_services and systemd_services roles ran their "start the
service" tasks with `failed_when: false`, so a container or unit that
failed to come up still reported the deploy as green. Drop it from both
start tasks so a broken deploy actually fails CI. The compose/unit *copy*
tasks keep `failed_when: false` — that's load-bearing for the
`item is not failed` filter that skips services without a compose/unit file.
Also:
- Remove a duplicate "Template service .env files" task in docker_services
(second copy used a hardcoded path and didn't register; first one is the
one the start task reads).
- Don't trigger a full fleet deploy on docs/markdown/workflow-only pushes
to main — add docs/**, **/*.md and .github/** to paths-ignore.
- Drop the dangling `update-freebsd` Make target (playbook doesn't exist;
fleet has no FreeBSD hosts).
* chore: add dependabot config
Add Dependabot for the three supported ecosystems in this repo:
GitHub Actions, Terraform (root + grafana/hetzner/pagerduty modules),
and Docker (service compose files + dotfile Dockerfiles). Weekly
schedule with per-ecosystem grouping to keep PR noise down.
* ci: make terraform validation work on dependabot PRs
Dependabot PRs run with no access to repository secrets and a read-only
token, so the SOPS decrypt step (and the PR-comment step) fail. Give
Dependabot a secret-free path: stub the secrets.yaml keys it reads and
run init -backend=false + validate, skipping decrypt/plan/comment. Human
PRs are unchanged and still get a full plan.
Add Dependabot for the three supported ecosystems in this repo:
GitHub Actions, Terraform (root + grafana/hetzner/pagerduty modules),
and Docker (service compose files + dotfile Dockerfiles). Weekly
schedule with per-ecosystem grouping to keep PR noise down.
* ci: serialize infra runs and enable terraform state locking
Add concurrency guards to the terraform and deploy-on-merge workflows so
two merges in quick succession can't run against the same state or the
same hosts at once (queue, never cancel an in-flight run).
Enable native S3 state locking (use_lockfile) on the Backblaze B2 backend,
which needs OpenTofu 1.10+, so bump the CI tofu version 1.9.0 -> 1.10.10
and the required_version constraint to >= 1.10.0.
* ci: bump tofu to 1.10.10 in the validate workflow too
Missed this one in the last commit — the PR-time validate still pinned
1.9.0, which trips the new required_version >= 1.10.0 constraint.
* ci: drop use_lockfile — Backblaze B2 can't do native state locking
B2's S3 API returns 501 NotImplemented for the conditional PutObject that
use_lockfile relies on, so tofu plan/apply fails to acquire the lock.
Revert the lockfile and the 1.10 version bump it required; rely on the
concurrency guard to serialize applies instead. Left a note in the
backend block so this isn't re-attempted.
* Grafana Cloud migration, adding dashboards, fleet, alloy and synthetics
* modulize stuff now that we have multiple substantial things in here
* provider updates and new secrets
* remove grafana and prometheus from ansible