Commit graph

39 commits

Author SHA1 Message Date
9ac179dbec
Make Alloy resilient to transient failures; remove leftover Grafana (PESO-149) (#126)
copenhagen-c stopped reporting to Grafana Cloud on 2026-05-20: a transient
TLS failure to fleet-management tripped systemd's default start rate-limit,
systemd gave up, and the host sat silently unmonitored for ~2.5 weeks.

Add a 10-resilience.conf systemd drop-in for alloy.service on every host
(StartLimitIntervalSec=0, Restart=always, RestartSec=30) so a momentary
upstream/TLS blip can no longer permanently kill the collector.

Also drop the old self-hosted Grafana package that was left enabled and
failing on copenhagen-c after the move to Grafana Cloud.
2026-06-07 14:30:08 +01:00
81efa1b717
Remove stale cloudflared service from copenhagen-a (PESO-138) (#125)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / deploy (push) Blocked by required conditions
cloudflared was retired in #56 when Caddy + Authelia replaced Cloudflare
Tunnels, but copenhagen-a was unreachable at the time so its
cloudflared.service was never stopped and is still running.

Add a cleanup task to the common role that stops, disables and purges
cloudflared wherever the unit lingers. Gated on the unit file existing so
it self-targets copenhagen-a and is a no-op everywhere else, and explicitly
excludes copenhagen-c, which legitimately runs a hand-configured tunnel.
2026-06-07 11:45:35 +01:00
3871dc8f90
Restrict london-b Samba (445) to LAN + Tailscale, off public internet (#124)
Samba on london-b was allowed on 445/tcp from anywhere via UFW, exposing
SMB/CIFS to the public internet. Tailscale already reaches it through the
tailscale0 allow-all rule, so scope the explicit rule to the local London
LAN (192.168.1.0/24) instead of the world.

The common UFW task only ever adds allow rules, so it gained support for an
optional per-port from_ip, plus a follow-up task that deletes the superseded
world-open variant of any source-restricted port — otherwise the old
'445 ALLOW Anywhere' rule would linger on the host and defeat the change.

PESO-145
2026-06-07 11:37:45 +01:00
644b608831
chore: retire readarr service, replaced by bookshelf (#123)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / deploy (push) Blocked by required conditions
Bookshelf (PR #122) is a Readarr revival and now owns port 8787 on
london-b, so the old custom Readarr systemd unit is removed:

- drop readarr from the media_stack role's unit-deploy and enable loops,
  and add an idempotent decommission task (stop, disable, remove unit)
  so the host tears it down via Ansible rather than ad-hoc SSH
- delete services/readarr/readarr.service
- update docs (services, london-b host, service inventory) to describe
  bookshelf as a Docker service instead of a custom systemd unit

The public readarr.pez.sh hostname is kept and now reverse-proxies to
bookshelf on :8787 — DNS, Caddy and Authelia (pez_readarr_users group)
are unchanged.
2026-06-06 15:50:37 +01:00
9815f44b84
fix: stop masking failed service deploys; trim dead config (#119)
Some checks failed
Deploy (on merge) / Discover hosts (push) Has been cancelled
Deploy (on merge) / deploy (push) Has been cancelled
The docker_services and systemd_services roles ran their "start the
service" tasks with `failed_when: false`, so a container or unit that
failed to come up still reported the deploy as green. Drop it from both
start tasks so a broken deploy actually fails CI. The compose/unit *copy*
tasks keep `failed_when: false` — that's load-bearing for the
`item is not failed` filter that skips services without a compose/unit file.

Also:
- Remove a duplicate "Template service .env files" task in docker_services
  (second copy used a hardcoded path and didn't register; first one is the
  one the start task reads).
- Don't trigger a full fleet deploy on docs/markdown/workflow-only pushes
  to main — add docs/**, **/*.md and .github/** to paths-ignore.
- Drop the dangling `update-freebsd` Make target (playbook doesn't exist;
  fleet has no FreeBSD hosts).
2026-06-04 18:41:24 +01:00
69145b3089
fix: add smb mount (#107)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
* fix: add smb mount

* update secrets

* address linting issues
2026-05-14 20:49:25 +01:00
5481292b7f
fix: remove subscription nag and lock down proxmox (#106)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
2026-05-13 21:09:54 +01:00
d3b516c594
fix: cleanup freebsd and alpine stuff (#105)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
2026-05-12 22:43:12 +01:00
928d1d0b99
fix: update config for london-a for new proxmox install (#101) 2026-05-09 19:22:34 +01:00
7d22ad1ce1
bug: add retry to restarting caddy (#97)
Some checks failed
Terraform / Plan (push) Waiting to run
Terraform / Apply (push) Blocked by required conditions
Deploy (on merge) / Discover hosts (push) Has been cancelled
Deploy (on merge) / Deploy → (push) Has been cancelled
* bug: add retry to restarting caddy

* skip terraform pipeline when no terraform changes has been done
2026-05-05 20:42:52 +01:00
83f023aedd
Migration to Grafana Cloud, nuremberg-a reinstalled, london-a reinsta… (#93)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
Terraform / Plan (push) Waiting to run
Terraform / Apply (push) Blocked by required conditions
* Migration to Grafana Cloud, nuremberg-a reinstalled, london-a reinstalled

* dns config for cockpit
2026-05-03 14:00:22 +01:00
e5306a5409
Fixing loki alloy (#87)
* add alloy to docker group

* fix: use docker driver instead of hacky alloy setup

* fixing linting issue
2026-04-29 20:07:40 +01:00
a51a0879d3
add alloy to docker group (#86)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
2026-04-28 20:53:19 +01:00
6a3618aa4a
fix: Fixing loki alloy (#85)
* fix: alloy

* fix: alpine doesn't need a hacky install
2026-04-28 20:30:30 +01:00
b474e28528
fix: alloy (#84) 2026-04-28 20:10:20 +01:00
5391c500e1
fix: loki & alloy (#83)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
* fix: loki & alloy

* fix linting
2026-04-28 16:40:45 +01:00
af2f462c1c
fix: prometheus retention and authelia fix (#73)
Some checks are pending
Deploy (on merge) / Deploy (push) Waiting to run
Terraform / Plan (push) Waiting to run
Terraform / Apply (push) Blocked by required conditions
* fix: prometheus retention time

* also fix bug with authelia

* linting issues

* more linting
2026-04-25 21:35:39 +01:00
b3cc47f3d6
fix: optimize deploy playbook and get rid of deprecated stuff (#70) 2026-04-25 15:04:16 +01:00
177fbb4014
Change provider for plex metrics (#65)
* change provider for plex metrics

* update plex token

* update plex token loading
2026-04-13 19:04:54 +01:00
2a98a89eb4
Change provider for plex metrics (#64)
* change provider for plex metrics

* update plex token
2026-04-12 21:21:24 +01:00
49cee191b5
fix: bind mariadb to local ip (#62) 2026-04-11 21:24:11 +01:00
ed6eb22f60
Remove cloudflared — replaced by Caddy reverse proxy (#56)
Cloudflared tunnels are no longer used. All traffic now routes through
Cloudflare DNS to Caddy on helsinki-a over Tailscale.

- Remove cloudflared systemd unit files (copenhagen-a, london-b)
- Remove cloudflared from media_stack role and copenhagen-a host_vars
- Remove cloudflared references from services README and host docs
- Remove cloudflared deploy trigger from CI workflow

Live service on london-b stopped and disabled. copenhagen-a was
unreachable but the tunnel is unused regardless.
2026-04-03 22:51:12 +01:00
a31f8b5651
Add systemd_exporter Ansible role and Prometheus scrape config (#49)
* Add systemd_exporter Ansible role and Prometheus scrape config

- Create systemd_exporter role (download binary, create user, deploy service)
- Add scrape job for london-b:9558 and copenhagen-a:9558
- Add systemd_exporter_hosts inventory group
- Add stage 3b to deploy.yml
- Map role to deploy-on-merge scope

Closes PESO-120

* Fix line length lint violations in systemd_exporter tasks

* Fix var-naming lint: use systemd_exporter_ prefix for role variables
2026-04-03 12:23:38 +01:00
49cea826e0
capture overseerr, syncthing, and fix slskd on london-b (#43) 2026-04-03 09:52:10 +01:00
853386ce2f
fix: remove custom node_exporter, standardise on package version (#40)
london-b had both a custom node_exporter.service and the
package-managed prometheus-node-exporter.service installed.
Both tried to bind port 9100, causing the package version to fail.

- Add cleanup tasks to remove custom /etc/systemd/system/node_exporter.service
  and /usr/local/bin/node_exporter if present
- Add node_exporter_extra_collectors variable for configurable collectors
- Configure london-b with systemd/processes/sysctl/ethtool/zfs collectors
  matching its previous custom setup

Resolves PESO-109
2026-04-03 01:50:13 +01:00
3c751af3ce
fix(firewall_alpine): replace empty iptables ruleset with proper INPUT filtering (#32)
* Bind node_exporter to Tailscale IP on public-facing hosts

node_exporter was listening on 0.0.0.0:9100 on helsinki-a and london-a,
exposing metrics to the public internet.

Changes:
- Add node_exporter_bind_tailscale flag (default false) to opt in
- Set flag on helsinki-a and london-a host_vars
- Debian: configure ARGS in /etc/default/prometheus-node-exporter
- FreeBSD: use native node_exporter_listen_address rc.conf variable
- Add handlers to restart on config change

Prometheus already scrapes via Tailscale IPs, no scrape config changes needed.

Fixes PESO-98

* fix(firewall_alpine): replace empty iptables ruleset with proper INPUT filtering

The rules.v4.j2 template deployed a ruleset with INPUT ACCEPT and zero
custom rules — effectively a no-op. nuremberg-a is a public-facing mail
server and needs actual filtering.

Changes:
- INPUT default policy set to DROP
- Allow loopback, established/related, Tailscale interface, SSH, ICMP
- FORWARD stays ACCEPT for Docker port-forwarding
- Added firewall_alpine_extra_input_rules variable for host-specific rules

Mail ports remain handled by Docker's FORWARD chain, not INPUT.

Closes PESO-119
2026-04-02 21:18:11 +01:00
f2cebcdf38
Bind node_exporter to Tailscale IP on public-facing hosts (#31)
node_exporter was listening on 0.0.0.0:9100 on helsinki-a and london-a,
exposing metrics to the public internet.

Changes:
- Add node_exporter_bind_tailscale flag (default false) to opt in
- Set flag on helsinki-a and london-a host_vars
- Debian: configure ARGS in /etc/default/prometheus-node-exporter
- FreeBSD: use native node_exporter_listen_address rc.conf variable
- Add handlers to restart on config change

Prometheus already scrapes via Tailscale IPs, no scrape config changes needed.

Fixes PESO-98
2026-03-30 22:56:59 +01:00
cfb2e83070 fix: remove docker-compose-v2 before installing docker-compose-plugin
copenhagen-a had Ubuntu's docker-compose-v2 package installed, which
conflicts with Docker's official docker-compose-plugin over
/usr/libexec/docker/cli-plugins/docker-compose.

Moved the removal task before the install task and added docker-compose-v2
to the removal list.
2026-03-30 18:08:50 +00:00
431c65065a
Add Docker official apt repo to docker role (#24)
* Add Docker official apt repo to docker role

The docker role was installing docker-compose-plugin which is only
available from Docker's official apt repository. helsinki-a had it
configured manually, but london-b and copenhagen-a did not, causing
deploy failures.

Now the role:
- Adds Docker's GPG key and apt repo (handles both Debian and Ubuntu)
- Installs docker-ce, docker-ce-cli, containerd.io, docker-compose-plugin
- Removes conflicting stock packages (docker.io, docker-compose)

* fix: resolve yamllint violations in docker role

- Remove standalone comment blocks that caused indentation errors
- Collapse multiline repo string to single line
- Ensure document start marker is present

* fix: keep all lines under 160 chars for yamllint

Use set_fact to build the Docker repo line in parts instead of
one long inline string.

* fix: resolve yamllint errors in london-b host_vars and promtail config

- Remove trailing blank line in inventory/host_vars/london-b.yml
- Add missing document start marker to promtail config
- Fix indentation in promtail scrape_configs (indent list items under key)

* Remove ansible-lint on push, keep PR-only

Lint already runs on pull_request — no need to double up on push to main.
2026-03-29 21:11:33 +01:00
353c2ad790
Capture london-b media stack and systemd services (#19)
Add the full media automation stack (sonarr, radarr, prowlarr, lidarr,
readarr, whisparr), media servers (jellyfin, plex), and supporting
services (transmission, samba, ollama, promtail, cloudflared, vsftpd)
to the repo as a media_stack Ansible role.

Includes:
- Custom systemd unit files for non-package-managed services
- Config files for promtail, samba, transmission, vsftpd
- Cron jobs for movie-rename-fix, sonarr/radarr midnight restarts
- Updated deploy.yml to wire the role into london-b's stage
- Updated london-b docs with full service inventory

Backup script (backup.sh) already covered by the existing backup role.
Node/systemd exporters already covered by existing monitoring roles.

Closes PESO-92
2026-03-29 19:13:48 +01:00
69918c8619
Add ZFS management role: scrub scheduling and pool monitoring (#18)
- New zfs role with cron-based scrub scheduling for Linux and FreeBSD
- Weekly Sunday scrubs at noon (matching existing manual crons)
- Add zfs_hosts inventory group with london-a and london-b
- Configure zfs_pools per host: zroot (london-a), hdd (london-b)
- Add Prometheus alert rules for degraded/faulted/offline pools
- Add zfs.yml playbook for targeted deploys

Captures the previously untracked scrub cron on london-a and
re-enables the commented-out scrub on london-b.

Refs: PESO-93
2026-03-29 19:12:42 +01:00
0247f6aa6b
Fix docker-compose package conflict and alpine firewall handler (#22)
- Docker role: replace docker-compose with docker-compose-plugin (v2).
  The old docker-compose package conflicts with docker-compose-plugin
  already installed on helsinki-a. Also removes the conflicting package
  if present.

- firewall_alpine handler: use ansible.builtin.shell instead of
  ansible.builtin.command for iptables-restore, since the redirect
  operator (<) requires a shell.
2026-03-29 19:11:52 +01:00
b0acdb72e3
capture helsinki-a status page cron in repo (#17)
add status_page role that deploys update-status.sh and its cron job.
script queries prometheus for caddy upstream health and writes
status.json + history to /srv/status/ every minute.

refs: PESO-94
2026-03-29 15:39:35 +01:00
42eba42522
Add backup role to deploy hdd-backup.sh and cron to london-b (#16)
Captures the existing /root/scripts/backup.sh and its 22:00 daily cron
job as an Ansible role so it's managed via pez-infra deploys.

Refs: PESO-95
2026-03-29 15:09:01 +01:00
a7a71e4f87
capture nuremberg-a firewall rules in pez-infra (#15)
Add firewall_alpine role for Alpine hosts with iptables persistence
and fail2ban SSH jails. Wire it into nuremberg-a's deploy stage.

Mail ports are already exposed via Docker port mappings in the
poste-io docker-compose — this captures the surrounding iptables
and fail2ban config that was previously undocumented.

Closes PESO-96
2026-03-29 14:40:10 +01:00
f9d0a7ebf4
fix: resolve UFW ansible-lint failures and deploy error (#11)
- Fix 'interface_or_direction' → 'direction' (required param for ufw module)
- Rename ufw_enabled/ufw_allowed_ports → common_ufw_enabled/common_ufw_allowed_ports (role prefix convention)
- Fix yaml[braces] violations in helsinki-a host_vars
2026-03-29 10:53:54 +01:00
4554dec7d2
Remove unused Prometheus alerting config (#10)
* Configure UFW firewall rules in common Ansible role

Add UFW configuration to the common role for Debian hosts:
- Default deny incoming, allow outgoing
- Allow all traffic on tailscale0 interface (mesh comms)
- Allow SSH port 22 as safety net
- Per-host allowed ports via ufw_allowed_ports variable
- Enable UFW after rules are applied

helsinki-a gets ports 80/443 for reverse proxy traffic.
Other Debian hosts only need Tailscale + SSH.

Closes PESO-79

* Remove unused alerting and rule_files from prometheus.yml

Alerting is handled by Grafana, not Prometheus Alertmanager.
The empty alertmanagers and rule_files sections were just noise.

Resolves PESO-74
2026-03-29 10:37:25 +01:00
dc10ceacf5 fix remaining yaml lint nitpicks
- add missing document start (---) to contact-points.yml and docker-compose files
- fix extra spaces inside braces in dotfiles and common role tasks
2026-03-28 13:13:37 +00:00
Rasmus Wejlgaard
737d6e0bc1 initial commit 2026-03-28 12:39:41 +00:00