RWejlgaard/pez-infra

mirror of https://github.com/RWejlgaard/pez-infra.git synced 2026-07-04 23:56:16 +00:00

Author	SHA1	Message	Date
Rasmus Wejlgaard	c1580a06e2	fix grafana alert rules missing relativeTimeRange Grafana 12 requires explicit relativeTimeRange on all alert rule data queries. Without it, queries default to {from: 0, to: 0} which is rejected as invalid, causing Grafana to crash on startup during alerting provisioning. Added relativeTimeRange to all data entries: - Prometheus queries: {from: 600, to: 0} (10 min lookback) - Expression refs: {from: 0, to: 0} This was preventing Grafana from starting on london-a, which meant alerts (including Host Down for copenhagen-a) could never auto-resolve.	2026-04-04 08:54:58 +00:00
Rasmus "Pez" Wejlgaard	267b392996	Add sonarr service directory with README (#51) Sonarr is running on london-b as an apt-managed systemd service but was the only arr service without a services/ directory in the repo. Add services/sonarr/README.md documenting the install method, data paths, and how it differs from the other arr services. Closes PESO-133	2026-04-04 09:31:39 +01:00
Rasmus "Pez" Wejlgaard	ed6eb22f60	Remove cloudflared — replaced by Caddy reverse proxy (#56) Cloudflared tunnels are no longer used. All traffic now routes through Cloudflare DNS to Caddy on helsinki-a over Tailscale. - Remove cloudflared systemd unit files (copenhagen-a, london-b) - Remove cloudflared from media_stack role and copenhagen-a host_vars - Remove cloudflared references from services README and host docs - Remove cloudflared deploy trigger from CI workflow Live service on london-b stopped and disabled. copenhagen-a was unreachable but the tunnel is unused regardless.	2026-04-03 22:51:12 +01:00
Rasmus "Pez" Wejlgaard	99c2091b96	Add smartctl-exporter to copenhagen-a and Prometheus scrape (#55) - Add smartctl-exporter to copenhagen-a docker_services - Add copenhagen-a as a Prometheus smartmontools scrape target - Update compose file comment to reflect multi-host usage Closes PESO-128	2026-04-03 21:20:20 +01:00
Rasmus "Pez" Wejlgaard	dca6a08ba1	Remove cloudflared from london-a (PESO-134) (#50) cloudflared has been replaced by Caddy + Authelia. Removed: - cloudflared service config (services/cloudflared/london-a/) - tunnel ID from london-a host_vars - cloudflared_enable from rc.conf Also synced rc.conf with live server state (disabled services from PESO-113, added node_exporter_listen_address). Live server: stopped service, removed from rc.conf, uninstalled pkg.	2026-04-03 18:51:51 +01:00
Rasmus "Pez" Wejlgaard	a31f8b5651	Add systemd_exporter Ansible role and Prometheus scrape config (#49) * Add systemd_exporter Ansible role and Prometheus scrape config - Create systemd_exporter role (download binary, create user, deploy service) - Add scrape job for london-b:9558 and copenhagen-a:9558 - Add systemd_exporter_hosts inventory group - Add stage 3b to deploy.yml - Map role to deploy-on-merge scope Closes PESO-120 * Fix line length lint violations in systemd_exporter tasks * Fix var-naming lint: use systemd_exporter_ prefix for role variables	2026-04-03 12:23:38 +01:00
Rasmus "Pez" Wejlgaard	49cea826e0	capture overseerr, syncthing, and fix slskd on london-b (#43)	2026-04-03 09:52:10 +01:00
Rasmus "Pez" Wejlgaard	2d7723d145	Add rule_files to prometheus.yml, remove empty node-exporter.rules (#46) prometheus.yml was missing the rule_files section, so alerting rules deployed to /usr/local/etc/prometheus/rules/ were never loaded. - Add rule_files glob so Prometheus evaluates the ZFS pool rules - Document that alerting notifications go through Grafana, not Alertmanager — no alerting: section needed - Remove node-exporter.rules (all rules were commented out) Resolves PESO-103	2026-04-03 04:49:16 +01:00
Rasmus "Pez" Wejlgaard	f75e2a8d5f	remove alertmanager caddyfile entry and clean up references (#42) alerting is handled by grafana, not alertmanager. removed the stale reverse proxy block from caddyfile template and updated caddy + prometheus docs to reflect grafana-only alerting.	2026-04-03 02:49:37 +01:00
Rasmus Wejlgaard	00b967d930	fix trailing blank line in copenhagen-a host_vars and missing document start in cloudflared config	2026-04-02 23:13:18 +00:00
Rasmus "Pez" Wejlgaard	3ce559d7b9	Wire thiswebsitedoesnotexist.service into deployment pipeline - Move unit file from services/systemd/helsinki-a/ to services/thiswebsitedoesnotexist/ (matches systemd_services role convention) - Add systemd_services: [thiswebsitedoesnotexist] to helsinki-a host_vars - Add systemd_services role to helsinki-a stage in deploy.yml - Remove redundant caddy.service (apt manages this via the caddy role) Closes PESO-117	2026-04-02 22:19:26 +01:00
Rasmus "Pez" Wejlgaard	0bcc53b01d	Document undocumented services on london-a (#29) Audit of london-a rc.conf found several services running but not captured in host_vars or docs: cloudflared, InfluxDB, Redis, PostgreSQL, and libvirtd. - InfluxDB: only _internal db, completely unused - Redis: empty keyspace, unused - PostgreSQL: has pez_vps db from a dead project, needs data review - libvirtd: zero VMs, related to same dead project - cloudflared: running tunnel 168eccae, config now captured Also documented the weekly ZFS scrub cron (Sundays at noon) which is in root's crontab but not ansible-managed. Ref: PESO-101	2026-03-30 21:39:57 +01:00
Rasmus "Pez" Wejlgaard	eb9f026abd	Clean up stale DNS records and Caddyfile entries (#28) Remove webdav.pez.sh DNS record (WebDAV replaced by Nextcloud AIO on cloud.pez.sh) Remove alertmanager.pez.sh DNS record and Caddyfile block (Alertmanager not running on london-a) Remove status-https HTTPS record pointing to old statuspage.io (status.pez.sh is self-hosted on helsinki-a) Remove commented-out WebDAV block from Caddyfile Remove empty section headers for decommissioned hosts (london-c, copenhagen-b, copenhagen-c) Closes PESO-102	2026-03-30 21:12:52 +01:00
Rasmus "Pez" Wejlgaard	431c65065a	Add Docker official apt repo to docker role (#24) * Add Docker official apt repo to docker role The docker role was installing docker-compose-plugin which is only available from Docker's official apt repository. helsinki-a had it configured manually, but london-b and copenhagen-a did not, causing deploy failures. Now the role: - Adds Docker's GPG key and apt repo (handles both Debian and Ubuntu) - Installs docker-ce, docker-ce-cli, containerd.io, docker-compose-plugin - Removes conflicting stock packages (docker.io, docker-compose) * fix: resolve yamllint violations in docker role - Remove standalone comment blocks that caused indentation errors - Collapse multiline repo string to single line - Ensure document start marker is present * fix: keep all lines under 160 chars for yamllint Use set_fact to build the Docker repo line in parts instead of one long inline string. * fix: resolve yamllint errors in london-b host_vars and promtail config - Remove trailing blank line in inventory/host_vars/london-b.yml - Add missing document start marker to promtail config - Fix indentation in promtail scrape_configs (indent list items under key) * Remove ansible-lint on push, keep PR-only Lint already runs on pull_request — no need to double up on push to main.	2026-03-29 21:11:33 +01:00
Rasmus "Pez" Wejlgaard	353c2ad790	Capture london-b media stack and systemd services (#19) Add the full media automation stack (sonarr, radarr, prowlarr, lidarr, readarr, whisparr), media servers (jellyfin, plex), and supporting services (transmission, samba, ollama, promtail, cloudflared, vsftpd) to the repo as a media_stack Ansible role. Includes: - Custom systemd unit files for non-package-managed services - Config files for promtail, samba, transmission, vsftpd - Cron jobs for movie-rename-fix, sonarr/radarr midnight restarts - Updated deploy.yml to wire the role into london-b's stage - Updated london-b docs with full service inventory Backup script (backup.sh) already covered by the existing backup role. Node/systemd exporters already covered by existing monitoring roles. Closes PESO-92	2026-03-29 19:13:48 +01:00
Rasmus "Pez" Wejlgaard	69918c8619	Add ZFS management role: scrub scheduling and pool monitoring (#18) - New zfs role with cron-based scrub scheduling for Linux and FreeBSD - Weekly Sunday scrubs at noon (matching existing manual crons) - Add zfs_hosts inventory group with london-a and london-b - Configure zfs_pools per host: zroot (london-a), hdd (london-b) - Add Prometheus alert rules for degraded/faulted/offline pools - Add zfs.yml playbook for targeted deploys Captures the previously untracked scrub cron on london-a and re-enables the commented-out scrub on london-b. Refs: PESO-93	2026-03-29 19:12:42 +01:00
Rasmus "Pez" Wejlgaard	b0acdb72e3	capture helsinki-a status page cron in repo (#17) add status_page role that deploys update-status.sh and its cron job. script queries prometheus for caddy upstream health and writes status.json + history to /srv/status/ every minute. refs: PESO-94	2026-03-29 15:39:35 +01:00
Rasmus "Pez" Wejlgaard	99cc0d6967	Fix Alertmanager Caddyfile route pointing to Grafana port (#13) Alertmanager reverse_proxy was pointing to :3000 (Grafana) instead of :9093 (Alertmanager). Copy-paste artifact. Fixed in both the Caddyfile and the template.	2026-03-29 11:07:41 +01:00
Rasmus "Pez" Wejlgaard	4554dec7d2	Remove unused Prometheus alerting config (#10) * Configure UFW firewall rules in common Ansible role Add UFW configuration to the common role for Debian hosts: - Default deny incoming, allow outgoing - Allow all traffic on tailscale0 interface (mesh comms) - Allow SSH port 22 as safety net - Per-host allowed ports via ufw_allowed_ports variable - Enable UFW after rules are applied helsinki-a gets ports 80/443 for reverse proxy traffic. Other Debian hosts only need Tailscale + SSH. Closes PESO-79 * Remove unused alerting and rule_files from prometheus.yml Alerting is handled by Grafana, not Prometheus Alertmanager. The empty alertmanagers and rule_files sections were just noise. Resolves PESO-74	2026-03-29 10:37:25 +01:00
Rasmus "Pez" Wejlgaard	03ce524730	Standardise Prometheus targets to Tailscale IPs (#4) Replace local network IPs (192.168.1.x) with Tailscale IPs for london-a and london-b in all scrape configs. This ensures consistent connectivity via Tailscale mesh regardless of network topology changes. Refs: PESO-80	2026-03-28 20:08:09 +00:00
Rasmus Wejlgaard	8bb91032f3	Add Authelia config and SOPS-encrypted secrets - Add configuration.yml from running helsinki-a deployment - Replace example secrets with real SOPS-encrypted config.enc.yml - Add LDAP and SMTP password file env vars to docker-compose (all secrets now via file mounts, zero inline passwords) - Update README with secret mapping and deployment steps Closes PESO-89	2026-03-28 17:42:07 +00:00
Rasmus "Pez" Wejlgaard	8163b226b3	Merge pull request #2 from RWejlgaard/fix-lint-nitpicks Fix ansible-lint yaml nitpicks	2026-03-28 13:19:37 +00:00
Rasmus Wejlgaard	46063246a2	fix last 3 yaml lint failures - add missing --- to notification-policy.yml - prometheus.yml: replace commented-out template defaults with empty lists	2026-03-28 13:17:42 +00:00
Rasmus Wejlgaard	dc198eea81	fix more yaml document-start and comment indentation - add missing --- to 13 more yml files - fix comment indentation in prometheus.yml	2026-03-28 13:15:46 +00:00
Rasmus Wejlgaard	dc10ceacf5	fix remaining yaml lint nitpicks - add missing document start (---) to contact-points.yml and docker-compose files - fix extra spaces inside braces in dotfiles and common role tasks	2026-03-28 13:13:37 +00:00
Rasmus Wejlgaard	269f1b2274	fix ansible-lint yaml nitpicks - rules-warning.yml: remove trailing blank line - pr-test.yml: quote 'on' key for yaml truthy, add newline at EOF - add .yamllint config to ignore SOPS-encrypted secrets (line-length unfixable without re-encrypting)	2026-03-28 13:10:16 +00:00
Rasmus Wejlgaard	cfd745b2b7	add mangos zero config and fix world service - add mangosd.conf, realmd.conf, ahbot.conf, aiplayerbot.conf from copenhagen-a - db password replaced with {{ mangos_db_password }} placeholder - fix mangos-world.service: was identical copy of realmd service, now points to mangosd - add README for mangos-zero service	2026-03-28 13:03:09 +00:00
Rasmus Wejlgaard	737d6e0bc1	initial commit	2026-03-28 12:39:41 +00:00

28 commits