Commit graph

146 commits

Author SHA1 Message Date
a218acac34 ci: extract shared SOPS/tofu steps into composite actions
The SOPS install + version, the decrypt loop, the OpenTofu version, and
the Backblaze backend-credential extraction were copy-pasted across
terraform.yml (twice), validate-terraform.yml, and _deploy-core.yml.
A version bump meant editing the same string in up to four places and
was easy to do partially.

Pull them into three local composite actions so each is defined once:
  - setup-tofu          (pins OpenTofu version)
  - sops-decrypt        (installs SOPS, decrypts *.enc.* in place)
  - tofu-backend-creds  (exports Backblaze S3 creds to GITHUB_ENV)

Behaviour is unchanged; sops-decrypt also matches *.enc.env everywhere
(previously only _deploy-core did), which is a no-op in terraform/.
2026-06-18 20:23:35 +01:00
e9d5f9bc76
ci: make Caddyfile validation download robust (#134)
The validate-caddyfile workflow fetched the Caddy binary by first hitting
api.github.com/releases/latest to resolve the version tag, then building a
release-asset URL from it. That API call is unauthenticated, so it shares
the 60-requests/hour-per-IP limit across all GitHub-hosted runners and
returns 403 under load. On failure jq emits "null", the URL becomes
caddy_null_linux_amd64.tar.gz, and `curl -sL` silently pipes a 404 page
into tar — a confusing, flaky failure on every PR that touches the Caddyfile.

Switch to Caddy's official download API, which serves the latest linux/amd64
binary directly: one request, no GitHub API, no jq/tar parsing. Add `-f` so
curl fails loudly on an HTTP error instead of writing an error page to disk.
2026-06-15 20:38:21 +01:00
ac8dabe9a4
media_stack: capture london-b sonarr.service unit in repo (PESO-140) (#133)
Some checks failed
Deploy (on merge) / Discover hosts (push) Has been cancelled
Deploy (on merge) / deploy (push) Has been cancelled
sonarr was the only *arr service without its systemd unit in the repo —
it was treated as package-managed and never captured, so a london-b
rebuild would lose the unit. Capture the running unit (APT/mono Sonarr
v3) into ansible/services/sonarr/sonarr.service and have the media_stack
role deploy it to /etc/systemd/system like radarr/lidarr/prowlarr,
overriding the package-owned copy. Move sonarr out of the
package-managed enable loop into the custom-unit deploy + enable loops.
2026-06-14 21:10:43 +01:00
26f8224941
make Dependabot tofu validate stubs satisfy provider validators (#132)
Some checks failed
Deploy (on merge) / Discover hosts (push) Has been cancelled
Terraform / Plan (push) Has been cancelled
Deploy (on merge) / deploy (push) Has been cancelled
Terraform / Apply (push) Has been cancelled
2026-06-12 19:25:24 +01:00
8665a5fe99
remove stale promtail/rc.d leftovers, rss DNS record, fix london-c host description (#131) 2026-06-12 19:24:39 +01:00
0a357fc69a
docs: catch up with the Cloudflare to Hetzner DNS move, fix secrets/terraform drift (#130)
Some checks failed
Terraform / Plan (push) Has been cancelled
Terraform / Apply (push) Has been cancelled
The docs still described Cloudflare as DNS + CDN in front of helsinki-a,
but that was dropped in #90 - pez.sh lives on Hetzner DNS via Terraform
now and records point straight at the origin. Updated README,
architecture, networking, getting-started and the nuremberg-a host doc
to match, and noted that pez.solutions still resolves via Cloudflare
outside Terraform.

Also fixed while I was in there:
- terraform/README: PagerDuty provider is ~> 3.32 (table said ~> 2.2),
  and the B2 secret keys are backblaze_keyID/backblaze_applicationKey
- secrets docs: group_vars secrets file is .enc.yaml, dropped the
  FreeBSD install steps, the long-gone .sops.yaml placeholder note and
  the ANSIBLE_VAULT_PASS migration note, swapped the cloudflare_record
  example for hcloud
- getting-started referenced ansible/scripts/sops-setup.sh which
  doesn't exist
- added naveen.pez.sh to the subdomain tables and a note about the
  DNS-only records (mail, minecraft, wow, public)
2026-06-10 20:59:23 +01:00
0c00a3cb4d
docs: remove decommissioned Miniflux refs; fix status-page + minor drift (#129)
Some checks failed
Deploy (on merge) / Discover hosts (push) Has been cancelled
Deploy (on merge) / deploy (push) Has been cancelled
2026-06-09 19:49:16 +01:00
9d56a22c30
Ansible-manage docker-log-cleanup script and cron (PESO-142) (#128)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / deploy (push) Blocked by required conditions
docker-log-cleanup.sh lived in the repo but nothing deployed it — the
script and monthly cron on nuremberg-a were set up by hand and got wiped
when the host was reinstalled. Fold both into the docker role so every
docker_hosts member gets the script in /usr/local/bin and a monthly cron,
and it survives a rebuild.
2026-06-08 18:38:19 +01:00
3945b8cafc
remove miniflux — decommissioned (#127)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / deploy (push) Blocked by required conditions
Stopped and removed containers on london-b. Removed compose definition,
Caddy reverse proxy route for rss.pez.sh, and london-b host_vars entry.
2026-06-07 18:07:11 +01:00
9ac179dbec
Make Alloy resilient to transient failures; remove leftover Grafana (PESO-149) (#126)
copenhagen-c stopped reporting to Grafana Cloud on 2026-05-20: a transient
TLS failure to fleet-management tripped systemd's default start rate-limit,
systemd gave up, and the host sat silently unmonitored for ~2.5 weeks.

Add a 10-resilience.conf systemd drop-in for alloy.service on every host
(StartLimitIntervalSec=0, Restart=always, RestartSec=30) so a momentary
upstream/TLS blip can no longer permanently kill the collector.

Also drop the old self-hosted Grafana package that was left enabled and
failing on copenhagen-c after the move to Grafana Cloud.
2026-06-07 14:30:08 +01:00
81efa1b717
Remove stale cloudflared service from copenhagen-a (PESO-138) (#125)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / deploy (push) Blocked by required conditions
cloudflared was retired in #56 when Caddy + Authelia replaced Cloudflare
Tunnels, but copenhagen-a was unreachable at the time so its
cloudflared.service was never stopped and is still running.

Add a cleanup task to the common role that stops, disables and purges
cloudflared wherever the unit lingers. Gated on the unit file existing so
it self-targets copenhagen-a and is a no-op everywhere else, and explicitly
excludes copenhagen-c, which legitimately runs a hand-configured tunnel.
2026-06-07 11:45:35 +01:00
3871dc8f90
Restrict london-b Samba (445) to LAN + Tailscale, off public internet (#124)
Samba on london-b was allowed on 445/tcp from anywhere via UFW, exposing
SMB/CIFS to the public internet. Tailscale already reaches it through the
tailscale0 allow-all rule, so scope the explicit rule to the local London
LAN (192.168.1.0/24) instead of the world.

The common UFW task only ever adds allow rules, so it gained support for an
optional per-port from_ip, plus a follow-up task that deletes the superseded
world-open variant of any source-restricted port — otherwise the old
'445 ALLOW Anywhere' rule would linger on the host and defeat the change.

PESO-145
2026-06-07 11:37:45 +01:00
644b608831
chore: retire readarr service, replaced by bookshelf (#123)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / deploy (push) Blocked by required conditions
Bookshelf (PR #122) is a Readarr revival and now owns port 8787 on
london-b, so the old custom Readarr systemd unit is removed:

- drop readarr from the media_stack role's unit-deploy and enable loops,
  and add an idempotent decommission task (stop, disable, remove unit)
  so the host tears it down via Ansible rather than ad-hoc SSH
- delete services/readarr/readarr.service
- update docs (services, london-b host, service inventory) to describe
  bookshelf as a Docker service instead of a custom systemd unit

The public readarr.pez.sh hostname is kept and now reverse-proxies to
bookshelf on :8787 — DNS, Caddy and Authelia (pez_readarr_users group)
are unchanged.
2026-06-06 15:50:37 +01:00
98ac065056
feat: add bookshelf service on london-b (#122)
Bookshelf (a Readarr revival) for managing the ebook/audiobook library.
Runs on london-b with config at /root/bookshelf and the library at
/hdd/books mounted into the container at the same path.
2026-06-06 15:34:57 +01:00
85d1cb945e
chore: commit terraform lock file for reproducible provider versions (#121)
Some checks failed
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / deploy (push) Blocked by required conditions
Terraform / Plan (push) Has been cancelled
Terraform / Apply (push) Has been cancelled
The .terraform.lock.hcl was gitignored while providers use floating
~> constraints, so every CI 'tofu init' resolved provider versions
fresh and could drift from what was tested locally, with no checksum
verification on the providers.

Track the lock file instead, with hashes for linux_amd64 (CI) plus
darwin_arm64/amd64 (local). Dependabot's terraform updates now surface
exact provider version bumps as reviewable, hash-pinned changes.
2026-06-06 13:19:08 +01:00
a40cd60d60
backup: keep deleted/overwritten versions instead of mirroring them away (#120)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / deploy (push) Blocked by required conditions
Terraform / Plan (push) Waiting to run
Terraform / Apply (push) Blocked by required conditions
The nightly job runs 'rclone sync', which permanently deletes or overwrites
objects at the B2 destination. That means an accidental deletion or a
ransomware encryption on /hdd propagates straight to the backup on the next
run, leaving no clean copy.

Add --backup-dir so every superseded version is moved into a dated folder
under _versions/ rather than thrown away, and prune that folder after 30
days so it doesn't grow unbounded.
2026-06-05 21:23:04 +01:00
dependabot[bot]
7f2cbd4af1
chore(deps): bump the github-actions group across 1 directory with 2 updates (#117)
Bumps the github-actions group with 2 updates in the / directory: [ansible/ansible-lint](https://github.com/ansible/ansible-lint) and [actions/github-script](https://github.com/actions/github-script).


Updates `ansible/ansible-lint` from 25 to 26
- [Release notes](https://github.com/ansible/ansible-lint/releases)
- [Commits](https://github.com/ansible/ansible-lint/compare/v25...v26)

Updates `actions/github-script` from 7 to 9
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
- dependency-name: ansible/ansible-lint
  dependency-version: '26'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-05 21:13:03 +01:00
dependabot[bot]
24431466c5
chore(deps): bump the terraform group across 2 directories with 1 update (#116)
Updates the requirements on  and [pagerduty/pagerduty](https://github.com/PagerDuty/terraform-provider-pagerduty) to permit the latest version.

Updates `pagerduty/pagerduty` to 3.32.4
- [Release notes](https://github.com/PagerDuty/terraform-provider-pagerduty/releases)
- [Changelog](https://github.com/PagerDuty/terraform-provider-pagerduty/blob/master/CHANGELOG.md)
- [Commits](https://github.com/PagerDuty/terraform-provider-pagerduty/compare/v2.2.0...v3.32.4)

Updates `pagerduty/pagerduty` to 3.32.4
- [Release notes](https://github.com/PagerDuty/terraform-provider-pagerduty/releases)
- [Changelog](https://github.com/PagerDuty/terraform-provider-pagerduty/blob/master/CHANGELOG.md)
- [Commits](https://github.com/PagerDuty/terraform-provider-pagerduty/compare/v2.2.0...v3.32.4)

---
updated-dependencies:
- dependency-name: pagerduty/pagerduty
  dependency-version: 3.32.4
  dependency-type: direct:production
  dependency-group: terraform
- dependency-name: pagerduty/pagerduty
  dependency-version: 3.32.4
  dependency-type: direct:production
  dependency-group: terraform
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-05 21:12:59 +01:00
9815f44b84
fix: stop masking failed service deploys; trim dead config (#119)
Some checks failed
Deploy (on merge) / Discover hosts (push) Has been cancelled
Deploy (on merge) / deploy (push) Has been cancelled
The docker_services and systemd_services roles ran their "start the
service" tasks with `failed_when: false`, so a container or unit that
failed to come up still reported the deploy as green. Drop it from both
start tasks so a broken deploy actually fails CI. The compose/unit *copy*
tasks keep `failed_when: false` — that's load-bearing for the
`item is not failed` filter that skips services without a compose/unit file.

Also:
- Remove a duplicate "Template service .env files" task in docker_services
  (second copy used a hardcoded path and didn't register; first one is the
  one the start task reads).
- Don't trigger a full fleet deploy on docs/markdown/workflow-only pushes
  to main — add docs/**, **/*.md and .github/** to paths-ignore.
- Drop the dangling `update-freebsd` Make target (playbook doesn't exist;
  fleet has no FreeBSD hosts).
2026-06-04 18:41:24 +01:00
7b2552fea5
chore: fix dependabot PRs (#118)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / deploy (push) Blocked by required conditions
* chore: add dependabot config

Add Dependabot for the three supported ecosystems in this repo:
GitHub Actions, Terraform (root + grafana/hetzner/pagerduty modules),
and Docker (service compose files + dotfile Dockerfiles). Weekly
schedule with per-ecosystem grouping to keep PR noise down.

* ci: make terraform validation work on dependabot PRs

Dependabot PRs run with no access to repository secrets and a read-only
token, so the SOPS decrypt step (and the PR-comment step) fail. Give
Dependabot a secret-free path: stub the secrets.yaml keys it reads and
run init -backend=false + validate, skipping decrypt/plan/comment. Human
PRs are unchanged and still get a full plan.
2026-06-03 19:29:23 +01:00
7e74232d64
chore: add dependabot config (#115)
Add Dependabot for the three supported ecosystems in this repo:
GitHub Actions, Terraform (root + grafana/hetzner/pagerduty modules),
and Docker (service compose files + dotfile Dockerfiles). Weekly
schedule with per-ecosystem grouping to keep PR noise down.
2026-06-03 19:15:12 +01:00
65090ca9d6
ci: serialize terraform and deploy runs with concurrency guards (#114)
Some checks failed
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / deploy (push) Blocked by required conditions
Terraform / Plan (push) Has been cancelled
Terraform / Apply (push) Has been cancelled
* ci: serialize infra runs and enable terraform state locking

Add concurrency guards to the terraform and deploy-on-merge workflows so
two merges in quick succession can't run against the same state or the
same hosts at once (queue, never cancel an in-flight run).

Enable native S3 state locking (use_lockfile) on the Backblaze B2 backend,
which needs OpenTofu 1.10+, so bump the CI tofu version 1.9.0 -> 1.10.10
and the required_version constraint to >= 1.10.0.

* ci: bump tofu to 1.10.10 in the validate workflow too

Missed this one in the last commit — the PR-time validate still pinned
1.9.0, which trips the new required_version >= 1.10.0 constraint.

* ci: drop use_lockfile — Backblaze B2 can't do native state locking

B2's S3 API returns 501 NotImplemented for the conditional PutObject that
use_lockfile relies on, so tofu plan/apply fails to acquire the lock.
Revert the lockfile and the 1.10 version bump it required; rely on the
concurrency guard to serialize applies instead. Left a note in the
backend block so this isn't re-attempted.
2026-06-02 19:39:13 +01:00
45dff99e7c
fix: update octopus exporter (#113)
Some checks failed
Deploy (on merge) / Discover hosts (push) Has been cancelled
Deploy (on merge) / deploy (push) Has been cancelled
2026-05-26 20:56:07 +01:00
a031d4218b
fix: Documentation overhaul (#112)
Some checks failed
Deploy (on merge) / Discover hosts (push) Has been cancelled
Deploy (on merge) / deploy (push) Has been cancelled
* fix: Documentation overhaul

* removing joke graph
2026-05-19 18:49:21 +01:00
1ec4e10eb1
Update cache action (#111)
Some checks failed
Deploy (on merge) / Discover hosts (push) Has been cancelled
Deploy (on merge) / deploy (push) Has been cancelled
* fix: update cache version

* fix: update cache
2026-05-16 11:13:38 +01:00
a6aa561147
fix: update cache version (#110) 2026-05-16 11:03:12 +01:00
7ad2766f94
hotfix: broken pipeline (#109)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / deploy (push) Blocked by required conditions
* fix: cleanup deploy.yml and share workflow

* lint issue

* hotfix: broken pipeline
2026-05-15 20:19:56 +01:00
9f84652102
fix: cleanup deploy.yml and share workflow (#108)
* fix: cleanup deploy.yml and share workflow

* lint issue
2026-05-15 20:17:28 +01:00
69145b3089
fix: add smb mount (#107)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
* fix: add smb mount

* update secrets

* address linting issues
2026-05-14 20:49:25 +01:00
5481292b7f
fix: remove subscription nag and lock down proxmox (#106)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
2026-05-13 21:09:54 +01:00
d3b516c594
fix: cleanup freebsd and alpine stuff (#105)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
2026-05-12 22:43:12 +01:00
e502a92451
fix: tracing on caddy services (#104)
Some checks failed
Deploy (on merge) / Discover hosts (push) Has been cancelled
Terraform / Plan (push) Has been cancelled
Deploy (on merge) / Deploy → (push) Has been cancelled
Terraform / Apply (push) Has been cancelled
2026-05-10 10:18:53 +01:00
06552c5b75
fix: slight tweaks (#103)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
Terraform / Plan (push) Waiting to run
Terraform / Apply (push) Blocked by required conditions
* fix: slight tweaks

* remove vendor
2026-05-09 20:49:46 +01:00
b5d5537c1f
Proxmox ve on london a (#102)
* fix: update config for london-a for new proxmox install

* fix: update proxmox endpoint
2026-05-09 19:29:44 +01:00
928d1d0b99
fix: update config for london-a for new proxmox install (#101) 2026-05-09 19:22:34 +01:00
51efda6053
Update fleet_pipelines.tf
Some checks are pending
Terraform / Plan (push) Waiting to run
Terraform / Apply (push) Blocked by required conditions
2026-05-08 19:32:58 +01:00
d88d2e5d12
Add git synthetic check (#99)
Some checks failed
Terraform / Plan (push) Has been cancelled
Terraform / Apply (push) Has been cancelled
2026-05-06 06:01:59 +01:00
7d22ad1ce1
bug: add retry to restarting caddy (#97)
Some checks failed
Terraform / Plan (push) Waiting to run
Terraform / Apply (push) Blocked by required conditions
Deploy (on merge) / Discover hosts (push) Has been cancelled
Deploy (on merge) / Deploy → (push) Has been cancelled
* bug: add retry to restarting caddy

* skip terraform pipeline when no terraform changes has been done
2026-05-05 20:42:52 +01:00
abb283c1d7
terraform plan on pr and caddy metrics on localhost since we have all… (#96)
* terraform plan on pr and caddy metrics on localhost since we have alloy now

* remove refreshing state
2026-05-05 13:35:37 +01:00
9bde71fbf9
adding pagerduty stack (#95)
Some checks are pending
Terraform / Plan (push) Waiting to run
Terraform / Apply (push) Blocked by required conditions
* adding pagerduty stack

* rename files to not be overly descriptive
2026-05-04 20:50:31 +01:00
043c783361
Grafana Cloud Migration (#94)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
Terraform / Plan (push) Waiting to run
Terraform / Apply (push) Blocked by required conditions
* Grafana Cloud migration, adding dashboards, fleet, alloy and synthetics

* modulize stuff now that we have multiple substantial things in here

* provider updates and new secrets

* remove grafana and prometheus from ansible
2026-05-04 13:40:30 +01:00
83f023aedd
Migration to Grafana Cloud, nuremberg-a reinstalled, london-a reinsta… (#93)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
Terraform / Plan (push) Waiting to run
Terraform / Apply (push) Blocked by required conditions
* Migration to Grafana Cloud, nuremberg-a reinstalled, london-a reinstalled

* dns config for cockpit
2026-05-03 14:00:22 +01:00
d22f7a52a0
fix: clean up of terraform (#92)
Some checks failed
Terraform / Plan (push) Has been cancelled
Terraform / Apply (push) Has been cancelled
2026-05-02 14:46:03 +01:00
03ad9b476d
make dns more neat (#91)
Some checks are pending
Terraform / Plan (push) Waiting to run
Terraform / Apply (push) Blocked by required conditions
2026-05-01 21:05:53 +01:00
b5cef4b985
fix: remove cloudflare resources (#90)
Some checks failed
Terraform / Plan (push) Has been cancelled
Terraform / Apply (push) Has been cancelled
* phase 1 - add all the records to both providers to A/B test

* dkim fix

* remove cloudflare resources
2026-04-30 15:55:14 +01:00
ba04d49c4e
Clou dflaring out mayday mayday mayday (#89)
Some checks failed
Terraform / Plan (push) Waiting to run
Terraform / Apply (push) Blocked by required conditions
Deploy (on merge) / Discover hosts (push) Has been cancelled
Deploy (on merge) / Deploy → (push) Has been cancelled
* phase 1 - add all the records to both providers to A/B test

* dkim fix
2026-04-29 21:23:15 +01:00
dd112fd505
phase 1 - add all the records to both providers to A/B test (#88) 2026-04-29 20:47:34 +01:00
e5306a5409
Fixing loki alloy (#87)
* add alloy to docker group

* fix: use docker driver instead of hacky alloy setup

* fixing linting issue
2026-04-29 20:07:40 +01:00
a51a0879d3
add alloy to docker group (#86)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
2026-04-28 20:53:19 +01:00
6a3618aa4a
fix: Fixing loki alloy (#85)
* fix: alloy

* fix: alpine doesn't need a hacky install
2026-04-28 20:30:30 +01:00