RWejlgaard/pez-infra

mirror of https://github.com/RWejlgaard/pez-infra.git synced 2026-07-04 15:46:16 +00:00

Author	SHA1	Message	Date
Rasmus "Pez" Wejlgaard	9ac179dbec	Make Alloy resilient to transient failures; remove leftover Grafana (PESO-149) (#126) copenhagen-c stopped reporting to Grafana Cloud on 2026-05-20: a transient TLS failure to fleet-management tripped systemd's default start rate-limit, systemd gave up, and the host sat silently unmonitored for ~2.5 weeks. Add a 10-resilience.conf systemd drop-in for alloy.service on every host (StartLimitIntervalSec=0, Restart=always, RestartSec=30) so a momentary upstream/TLS blip can no longer permanently kill the collector. Also drop the old self-hosted Grafana package that was left enabled and failing on copenhagen-c after the move to Grafana Cloud.	2026-06-07 14:30:08 +01:00
Rasmus "Pez" Wejlgaard	81efa1b717	Remove stale cloudflared service from copenhagen-a (PESO-138) (#125) cloudflared was retired in #56 when Caddy + Authelia replaced Cloudflare Tunnels, but copenhagen-a was unreachable at the time so its cloudflared.service was never stopped and is still running. Add a cleanup task to the common role that stops, disables and purges cloudflared wherever the unit lingers. Gated on the unit file existing so it self-targets copenhagen-a and is a no-op everywhere else, and explicitly excludes copenhagen-c, which legitimately runs a hand-configured tunnel.	2026-06-07 11:45:35 +01:00
Rasmus "Pez" Wejlgaard	3871dc8f90	Restrict london-b Samba (445) to LAN + Tailscale, off public internet (#124) Samba on london-b was allowed on 445/tcp from anywhere via UFW, exposing SMB/CIFS to the public internet. Tailscale already reaches it through the tailscale0 allow-all rule, so scope the explicit rule to the local London LAN (192.168.1.0/24) instead of the world. The common UFW task only ever adds allow rules, so it gained support for an optional per-port from_ip, plus a follow-up task that deletes the superseded world-open variant of any source-restricted port — otherwise the old '445 ALLOW Anywhere' rule would linger on the host and defeat the change. PESO-145	2026-06-07 11:37:45 +01:00
Rasmus "Pez" Wejlgaard	644b608831	chore: retire readarr service, replaced by bookshelf (#123) Bookshelf (PR #122) is a Readarr revival and now owns port 8787 on london-b, so the old custom Readarr systemd unit is removed: - drop readarr from the media_stack role's unit-deploy and enable loops, and add an idempotent decommission task (stop, disable, remove unit) so the host tears it down via Ansible rather than ad-hoc SSH - delete services/readarr/readarr.service - update docs (services, london-b host, service inventory) to describe bookshelf as a Docker service instead of a custom systemd unit The public readarr.pez.sh hostname is kept and now reverse-proxies to bookshelf on :8787 — DNS, Caddy and Authelia (pez_readarr_users group) are unchanged.	2026-06-06 15:50:37 +01:00
Rasmus "Pez" Wejlgaard	9815f44b84	fix: stop masking failed service deploys; trim dead config (#119) The docker_services and systemd_services roles ran their "start the service" tasks with `failed_when: false`, so a container or unit that failed to come up still reported the deploy as green. Drop it from both start tasks so a broken deploy actually fails CI. The compose/unit copy tasks keep `failed_when: false` — that's load-bearing for the `item is not failed` filter that skips services without a compose/unit file. Also: - Remove a duplicate "Template service .env files" task in docker_services (second copy used a hardcoded path and didn't register; first one is the one the start task reads). - Don't trigger a full fleet deploy on docs/markdown/workflow-only pushes to main — add docs/, /.md and .github/* to paths-ignore. - Drop the dangling `update-freebsd` Make target (playbook doesn't exist; fleet has no FreeBSD hosts).	2026-06-04 18:41:24 +01:00
Rasmus "Pez" Wejlgaard	69145b3089	fix: add smb mount (#107) * fix: add smb mount * update secrets * address linting issues	2026-05-14 20:49:25 +01:00
Rasmus "Pez" Wejlgaard	5481292b7f	fix: remove subscription nag and lock down proxmox (#106)	2026-05-13 21:09:54 +01:00
Rasmus "Pez" Wejlgaard	d3b516c594	fix: cleanup freebsd and alpine stuff (#105)	2026-05-12 22:43:12 +01:00
Rasmus "Pez" Wejlgaard	928d1d0b99	fix: update config for london-a for new proxmox install (#101)	2026-05-09 19:22:34 +01:00
Rasmus "Pez" Wejlgaard	7d22ad1ce1	bug: add retry to restarting caddy (#97) * bug: add retry to restarting caddy * skip terraform pipeline when no terraform changes has been done	2026-05-05 20:42:52 +01:00
Rasmus "Pez" Wejlgaard	83f023aedd	Migration to Grafana Cloud, nuremberg-a reinstalled, london-a reinsta… (#93) * Migration to Grafana Cloud, nuremberg-a reinstalled, london-a reinstalled * dns config for cockpit	2026-05-03 14:00:22 +01:00
Rasmus "Pez" Wejlgaard	e5306a5409	Fixing loki alloy (#87) * add alloy to docker group * fix: use docker driver instead of hacky alloy setup * fixing linting issue	2026-04-29 20:07:40 +01:00
Rasmus "Pez" Wejlgaard	a51a0879d3	add alloy to docker group (#86)	2026-04-28 20:53:19 +01:00
Rasmus "Pez" Wejlgaard	6a3618aa4a	fix: Fixing loki alloy (#85) * fix: alloy * fix: alpine doesn't need a hacky install	2026-04-28 20:30:30 +01:00
Rasmus "Pez" Wejlgaard	b474e28528	fix: alloy (#84)	2026-04-28 20:10:20 +01:00
Rasmus "Pez" Wejlgaard	5391c500e1	fix: loki & alloy (#83) * fix: loki & alloy * fix linting	2026-04-28 16:40:45 +01:00
Rasmus "Pez" Wejlgaard	af2f462c1c	fix: prometheus retention and authelia fix (#73) * fix: prometheus retention time * also fix bug with authelia * linting issues * more linting	2026-04-25 21:35:39 +01:00
Rasmus "Pez" Wejlgaard	b3cc47f3d6	fix: optimize deploy playbook and get rid of deprecated stuff (#70)	2026-04-25 15:04:16 +01:00
Rasmus "Pez" Wejlgaard	177fbb4014	Change provider for plex metrics (#65) * change provider for plex metrics * update plex token * update plex token loading	2026-04-13 19:04:54 +01:00
Rasmus "Pez" Wejlgaard	2a98a89eb4	Change provider for plex metrics (#64) * change provider for plex metrics * update plex token	2026-04-12 21:21:24 +01:00
Rasmus "Pez" Wejlgaard	49cee191b5	fix: bind mariadb to local ip (#62)	2026-04-11 21:24:11 +01:00
Rasmus "Pez" Wejlgaard	ed6eb22f60	Remove cloudflared — replaced by Caddy reverse proxy (#56 ) Cloudflared tunnels are no longer used. All traffic now routes through Cloudflare DNS to Caddy on helsinki-a over Tailscale. - Remove cloudflared systemd unit files (copenhagen-a, london-b) - Remove cloudflared from media_stack role and copenhagen-a host_vars - Remove cloudflared references from services README and host docs - Remove cloudflared deploy trigger from CI workflow Live service on london-b stopped and disabled. copenhagen-a was unreachable but the tunnel is unused regardless.	2026-04-03 22:51:12 +01:00
Rasmus "Pez" Wejlgaard	a31f8b5651	Add systemd_exporter Ansible role and Prometheus scrape config (#49 ) * Add systemd_exporter Ansible role and Prometheus scrape config - Create systemd_exporter role (download binary, create user, deploy service) - Add scrape job for london-b:9558 and copenhagen-a:9558 - Add systemd_exporter_hosts inventory group - Add stage 3b to deploy.yml - Map role to deploy-on-merge scope Closes PESO-120 * Fix line length lint violations in systemd_exporter tasks * Fix var-naming lint: use systemd_exporter_ prefix for role variables	2026-04-03 12:23:38 +01:00
Rasmus "Pez" Wejlgaard	49cea826e0	capture overseerr, syncthing, and fix slskd on london-b (#43)	2026-04-03 09:52:10 +01:00
Rasmus "Pez" Wejlgaard	853386ce2f	fix: remove custom node_exporter, standardise on package version (#40 ) london-b had both a custom node_exporter.service and the package-managed prometheus-node-exporter.service installed. Both tried to bind port 9100, causing the package version to fail. - Add cleanup tasks to remove custom /etc/systemd/system/node_exporter.service and /usr/local/bin/node_exporter if present - Add node_exporter_extra_collectors variable for configurable collectors - Configure london-b with systemd/processes/sysctl/ethtool/zfs collectors matching its previous custom setup Resolves PESO-109	2026-04-03 01:50:13 +01:00
Rasmus "Pez" Wejlgaard	3c751af3ce	fix(firewall_alpine): replace empty iptables ruleset with proper INPUT filtering (#32 ) * Bind node_exporter to Tailscale IP on public-facing hosts node_exporter was listening on 0.0.0.0:9100 on helsinki-a and london-a, exposing metrics to the public internet. Changes: - Add node_exporter_bind_tailscale flag (default false) to opt in - Set flag on helsinki-a and london-a host_vars - Debian: configure ARGS in /etc/default/prometheus-node-exporter - FreeBSD: use native node_exporter_listen_address rc.conf variable - Add handlers to restart on config change Prometheus already scrapes via Tailscale IPs, no scrape config changes needed. Fixes PESO-98 * fix(firewall_alpine): replace empty iptables ruleset with proper INPUT filtering The rules.v4.j2 template deployed a ruleset with INPUT ACCEPT and zero custom rules — effectively a no-op. nuremberg-a is a public-facing mail server and needs actual filtering. Changes: - INPUT default policy set to DROP - Allow loopback, established/related, Tailscale interface, SSH, ICMP - FORWARD stays ACCEPT for Docker port-forwarding - Added firewall_alpine_extra_input_rules variable for host-specific rules Mail ports remain handled by Docker's FORWARD chain, not INPUT. Closes PESO-119	2026-04-02 21:18:11 +01:00
Rasmus "Pez" Wejlgaard	f2cebcdf38	Bind node_exporter to Tailscale IP on public-facing hosts (#31 ) node_exporter was listening on 0.0.0.0:9100 on helsinki-a and london-a, exposing metrics to the public internet. Changes: - Add node_exporter_bind_tailscale flag (default false) to opt in - Set flag on helsinki-a and london-a host_vars - Debian: configure ARGS in /etc/default/prometheus-node-exporter - FreeBSD: use native node_exporter_listen_address rc.conf variable - Add handlers to restart on config change Prometheus already scrapes via Tailscale IPs, no scrape config changes needed. Fixes PESO-98	2026-03-30 22:56:59 +01:00
Rasmus Wejlgaard	cfb2e83070	fix: remove docker-compose-v2 before installing docker-compose-plugin copenhagen-a had Ubuntu's docker-compose-v2 package installed, which conflicts with Docker's official docker-compose-plugin over /usr/libexec/docker/cli-plugins/docker-compose. Moved the removal task before the install task and added docker-compose-v2 to the removal list.	2026-03-30 18:08:50 +00:00
Rasmus "Pez" Wejlgaard	431c65065a	Add Docker official apt repo to docker role (#24 ) * Add Docker official apt repo to docker role The docker role was installing docker-compose-plugin which is only available from Docker's official apt repository. helsinki-a had it configured manually, but london-b and copenhagen-a did not, causing deploy failures. Now the role: - Adds Docker's GPG key and apt repo (handles both Debian and Ubuntu) - Installs docker-ce, docker-ce-cli, containerd.io, docker-compose-plugin - Removes conflicting stock packages (docker.io, docker-compose) * fix: resolve yamllint violations in docker role - Remove standalone comment blocks that caused indentation errors - Collapse multiline repo string to single line - Ensure document start marker is present * fix: keep all lines under 160 chars for yamllint Use set_fact to build the Docker repo line in parts instead of one long inline string. * fix: resolve yamllint errors in london-b host_vars and promtail config - Remove trailing blank line in inventory/host_vars/london-b.yml - Add missing document start marker to promtail config - Fix indentation in promtail scrape_configs (indent list items under key) * Remove ansible-lint on push, keep PR-only Lint already runs on pull_request — no need to double up on push to main.	2026-03-29 21:11:33 +01:00
Rasmus "Pez" Wejlgaard	353c2ad790	Capture london-b media stack and systemd services (#19 ) Add the full media automation stack (sonarr, radarr, prowlarr, lidarr, readarr, whisparr), media servers (jellyfin, plex), and supporting services (transmission, samba, ollama, promtail, cloudflared, vsftpd) to the repo as a media_stack Ansible role. Includes: - Custom systemd unit files for non-package-managed services - Config files for promtail, samba, transmission, vsftpd - Cron jobs for movie-rename-fix, sonarr/radarr midnight restarts - Updated deploy.yml to wire the role into london-b's stage - Updated london-b docs with full service inventory Backup script (backup.sh) already covered by the existing backup role. Node/systemd exporters already covered by existing monitoring roles. Closes PESO-92	2026-03-29 19:13:48 +01:00
Rasmus "Pez" Wejlgaard	69918c8619	Add ZFS management role: scrub scheduling and pool monitoring (#18 ) - New zfs role with cron-based scrub scheduling for Linux and FreeBSD - Weekly Sunday scrubs at noon (matching existing manual crons) - Add zfs_hosts inventory group with london-a and london-b - Configure zfs_pools per host: zroot (london-a), hdd (london-b) - Add Prometheus alert rules for degraded/faulted/offline pools - Add zfs.yml playbook for targeted deploys Captures the previously untracked scrub cron on london-a and re-enables the commented-out scrub on london-b. Refs: PESO-93	2026-03-29 19:12:42 +01:00
Rasmus "Pez" Wejlgaard	0247f6aa6b	Fix docker-compose package conflict and alpine firewall handler (#22 ) - Docker role: replace docker-compose with docker-compose-plugin (v2). The old docker-compose package conflicts with docker-compose-plugin already installed on helsinki-a. Also removes the conflicting package if present. - firewall_alpine handler: use ansible.builtin.shell instead of ansible.builtin.command for iptables-restore, since the redirect operator (<) requires a shell.	2026-03-29 19:11:52 +01:00
Rasmus "Pez" Wejlgaard	b0acdb72e3	capture helsinki-a status page cron in repo (#17) add status_page role that deploys update-status.sh and its cron job. script queries prometheus for caddy upstream health and writes status.json + history to /srv/status/ every minute. refs: PESO-94	2026-03-29 15:39:35 +01:00
Rasmus "Pez" Wejlgaard	42eba42522	Add backup role to deploy hdd-backup.sh and cron to london-b (#16) Captures the existing /root/scripts/backup.sh and its 22:00 daily cron job as an Ansible role so it's managed via pez-infra deploys. Refs: PESO-95	2026-03-29 15:09:01 +01:00
Rasmus "Pez" Wejlgaard	a7a71e4f87	capture nuremberg-a firewall rules in pez-infra (#15) Add firewall_alpine role for Alpine hosts with iptables persistence and fail2ban SSH jails. Wire it into nuremberg-a's deploy stage. Mail ports are already exposed via Docker port mappings in the poste-io docker-compose — this captures the surrounding iptables and fail2ban config that was previously undocumented. Closes PESO-96	2026-03-29 14:40:10 +01:00
Rasmus "Pez" Wejlgaard	f9d0a7ebf4	fix: resolve UFW ansible-lint failures and deploy error (#11) - Fix 'interface_or_direction' → 'direction' (required param for ufw module) - Rename ufw_enabled/ufw_allowed_ports → common_ufw_enabled/common_ufw_allowed_ports (role prefix convention) - Fix yaml[braces] violations in helsinki-a host_vars	2026-03-29 10:53:54 +01:00
Rasmus "Pez" Wejlgaard	4554dec7d2	Remove unused Prometheus alerting config (#10 ) * Configure UFW firewall rules in common Ansible role Add UFW configuration to the common role for Debian hosts: - Default deny incoming, allow outgoing - Allow all traffic on tailscale0 interface (mesh comms) - Allow SSH port 22 as safety net - Per-host allowed ports via ufw_allowed_ports variable - Enable UFW after rules are applied helsinki-a gets ports 80/443 for reverse proxy traffic. Other Debian hosts only need Tailscale + SSH. Closes PESO-79 * Remove unused alerting and rule_files from prometheus.yml Alerting is handled by Grafana, not Prometheus Alertmanager. The empty alertmanagers and rule_files sections were just noise. Resolves PESO-74	2026-03-29 10:37:25 +01:00
Rasmus Wejlgaard	dc10ceacf5	fix remaining yaml lint nitpicks - add missing document start (---) to contact-points.yml and docker-compose files - fix extra spaces inside braces in dotfiles and common role tasks	2026-03-28 13:13:37 +00:00
Rasmus Wejlgaard	737d6e0bc1	initial commit	2026-03-28 12:39:41 +00:00

39 commits
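
The systemd drop-in described in commit 9ac179dbec can be sketched as follows. This is a minimal illustration, not the repository's actual Ansible task. The helper names `render_alloy_dropin` and `parse_dropin` are hypothetical, and the drop-in directory is assumed to be the conventional `/etc/systemd/system/alloy.service.d/`. Only the file name `10-resilience.conf`, the unit `alloy.service`, and the three settings come from the commit message. The sketch places `StartLimitIntervalSec` in `[Unit]` and the restart options in `[Service]`, which is where systemd documents them.

```python
import configparser

# Settings taken from the commit message for 9ac179dbec.
# Section placement follows systemd's documented layout:
# start-rate limiting lives in [Unit], restart policy in [Service].
DROPIN_SETTINGS = {
    "Unit": {"StartLimitIntervalSec": "0"},
    "Service": {"Restart": "always", "RestartSec": "30"},
}

# Assumed location: the conventional drop-in directory for a unit.
DROPIN_PATH = "/etc/systemd/system/alloy.service.d/10-resilience.conf"


def render_alloy_dropin(settings=DROPIN_SETTINGS):
    """Return the drop-in file content as text (hypothetical helper)."""
    lines = []
    for section, options in settings.items():
        lines.append(f"[{section}]")
        lines.extend(f"{key}={value}" for key, value in options.items())
        lines.append("")  # blank line between sections
    return "\n".join(lines)


def parse_dropin(text):
    """Parse drop-in text back into a nested dict, for sanity checks."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep systemd's case-sensitive keys
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


if __name__ == "__main__":
    print(f"# {DROPIN_PATH}")
    print(render_alloy_dropin())
```

With `StartLimitIntervalSec=0` the start rate limit is disabled, so `Restart=always` keeps retrying every 30 seconds instead of giving up after a burst of failures.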