Commit graph

50 commits

Author SHA1 Message Date
c1580a06e2 fix grafana alert rules missing relativeTimeRange
Grafana 12 requires explicit relativeTimeRange on all alert rule data
queries. Without it, queries default to {from: 0, to: 0} which is
rejected as invalid, causing Grafana to crash on startup during
alerting provisioning.

Added relativeTimeRange to all data entries:
- Prometheus queries: {from: 600, to: 0} (10 min lookback)
- Expression refs: {from: 0, to: 0}

This was preventing Grafana from starting on london-a, which meant
alerts (including Host Down for copenhagen-a) could never auto-resolve.
2026-04-04 08:54:58 +00:00
267b392996
Add sonarr service directory with README (#51)
Sonarr is running on london-b as an apt-managed systemd service
but was the only *arr service without a services/ directory in the
repo. Add services/sonarr/README.md documenting the install method,
data paths, and how it differs from the other *arr services.

Closes PESO-133
2026-04-04 09:31:39 +01:00
ed6eb22f60
Remove cloudflared — replaced by Caddy reverse proxy (#56)
Cloudflared tunnels are no longer used. All traffic now routes through
Cloudflare DNS to Caddy on helsinki-a over Tailscale.

- Remove cloudflared systemd unit files (copenhagen-a, london-b)
- Remove cloudflared from media_stack role and copenhagen-a host_vars
- Remove cloudflared references from services README and host docs
- Remove cloudflared deploy trigger from CI workflow

Live service on london-b stopped and disabled. copenhagen-a was
unreachable but the tunnel is unused regardless.
2026-04-03 22:51:12 +01:00
99c2091b96
Add smartctl-exporter to copenhagen-a and Prometheus scrape (#55)
- Add smartctl-exporter to copenhagen-a docker_services
- Add copenhagen-a as a Prometheus smartmontools scrape target
- Update compose file comment to reflect multi-host usage

Closes PESO-128
2026-04-03 21:20:20 +01:00
d8757d37e1
fix(london-a): correct grafana provisioning dir path (#53)
grafana.ini on london-a sets provisioning = /usr/local/etc/grafana/provisioning
but grafana_provisioning_dir pointed at /usr/local/share/grafana/conf/provisioning.

This meant deploy.yml synced alerting rules, dashboards provisioning, and
datasources to a path Grafana never reads — a from-scratch deploy would have
broken alerting entirely.

Fixes PESO-131
2026-04-03 20:20:15 +01:00
25d201f930
Add copenhagen-a to docker_hosts and wire up minecraft docker service (#52)
- Add copenhagen-a to [docker_hosts] inventory group so the docker role
  runs on it in Stage 2
- Add docker_services: [minecraft] to copenhagen-a host_vars
- Add docker_services role to Stage 4d (copenhagen-a) in deploy.yml
- Update deploy-on-merge scope mapping to include copenhagen-a for
  docker role changes

Closes PESO-132
2026-04-03 19:50:51 +01:00
dca6a08ba1
Remove cloudflared from london-a (PESO-134) (#50)
cloudflared has been replaced by Caddy + Authelia. Removed:
- cloudflared service config (services/cloudflared/london-a/)
- tunnel ID from london-a host_vars
- cloudflared_enable from rc.conf

Also synced rc.conf with live server state (disabled services
from PESO-113, added node_exporter_listen_address).

Live server: stopped service, removed from rc.conf, uninstalled pkg.
2026-04-03 18:51:51 +01:00
a31f8b5651
Add systemd_exporter Ansible role and Prometheus scrape config (#49)
* Add systemd_exporter Ansible role and Prometheus scrape config

- Create systemd_exporter role (download binary, create user, deploy service)
- Add scrape job for london-b:9558 and copenhagen-a:9558
- Add systemd_exporter_hosts inventory group
- Add stage 3b to deploy.yml
- Map role to deploy-on-merge scope

Closes PESO-120

* Fix line length lint violations in systemd_exporter tasks

* Fix var-naming lint: use systemd_exporter_ prefix for role variables
2026-04-03 12:23:38 +01:00
8a4a95b596
Add ZFS role to deploy.yml for scrub scheduling (#44) 2026-04-03 09:53:10 +01:00
49cea826e0
capture overseerr, syncthing, and fix slskd on london-b (#43) 2026-04-03 09:52:10 +01:00
2d7723d145
Add rule_files to prometheus.yml, remove empty node-exporter.rules (#46)
prometheus.yml was missing the rule_files section, so alerting rules
deployed to /usr/local/etc/prometheus/rules/ were never loaded.

- Add rule_files glob so Prometheus evaluates the ZFS pool rules
- Document that alerting notifications go through Grafana, not
  Alertmanager — no alerting: section needed
- Remove node-exporter.rules (all rules were commented out)

Resolves PESO-103
2026-04-03 04:49:16 +01:00
ff8d7a53e7
Remove copenhagen-a from docker_hosts and docker_services (#45)
Docker is masked on copenhagen-a and Minecraft is no longer managed
via Docker Compose. Removes:
- copenhagen-a from [docker_hosts] inventory group
- docker_services var from copenhagen-a host_vars
- docker_services role from Stage 4d deploy play

MaNGOS systemd services remain unchanged.

Fixes PESO-104
2026-04-03 04:18:46 +01:00
f75e2a8d5f
remove alertmanager caddyfile entry and clean up references (#42)
alerting is handled by grafana, not alertmanager. removed the
stale reverse proxy block from caddyfile template and updated
caddy + prometheus docs to reflect grafana-only alerting.
2026-04-03 02:49:37 +01:00
853386ce2f
fix: remove custom node_exporter, standardise on package version (#40)
london-b had both a custom node_exporter.service and the
package-managed prometheus-node-exporter.service installed.
Both tried to bind port 9100, causing the package version to fail.

- Add cleanup tasks to remove custom /etc/systemd/system/node_exporter.service
  and /usr/local/bin/node_exporter if present
- Add node_exporter_extra_collectors variable for configurable collectors
- Configure london-b with systemd/processes/sysctl/ethtool/zfs collectors
  matching its previous custom setup

Resolves PESO-109
2026-04-03 01:50:13 +01:00
d3bce0d5c2
nuremberg-a: add poste-io to docker_services (#38)
Adds docker_services list to nuremberg-a host_vars so the docker_services
role deploys and manages the poste-io mail container via docker compose,
replacing the current manual container setup.
2026-04-03 00:49:50 +01:00
5a5c60b6b2
london-a: disable unused services (InfluxDB, Redis, PostgreSQL, libvirtd) (#37)
Services stopped and disabled in rc.conf on london-a.
Removed audit variables from host_vars, replaced with cleanup note.

All four were leftovers from a defunct pez_vps project:
- InfluxDB: no user databases, only _internal
- Redis: empty keyspace, no clients
- PostgreSQL: defunct pez_vps database (Pez approved removal)
- libvirtd: zero VMs defined

Resolves PESO-113
2026-04-03 00:17:58 +01:00
00b967d930 fix trailing blank line in copenhagen-a host_vars and missing document start in cloudflared config 2026-04-02 23:13:18 +00:00
ca3d9c4261
Remove undocumented_services from copenhagen-a host_vars (#35)
PostgreSQL 14 and Redis have been stopped, disabled, purged, and
data directories removed from copenhagen-a. These were leftovers
from an old WordPress project with no user data.

Resolves: PESO-114
2026-04-02 23:53:15 +01:00
3ce559d7b9
Wire thiswebsitedoesnotexist.service into deployment pipeline
- Move unit file from services/systemd/helsinki-a/ to
  services/thiswebsitedoesnotexist/ (matches systemd_services role convention)
- Add systemd_services: [thiswebsitedoesnotexist] to helsinki-a host_vars
- Add systemd_services role to helsinki-a stage in deploy.yml
- Remove redundant caddy.service (apt manages this via the caddy role)

Closes PESO-117
2026-04-02 22:19:26 +01:00
3c751af3ce
fix(firewall_alpine): replace empty iptables ruleset with proper INPUT filtering (#32)
* Bind node_exporter to Tailscale IP on public-facing hosts

node_exporter was listening on 0.0.0.0:9100 on helsinki-a and london-a,
exposing metrics to the public internet.

Changes:
- Add node_exporter_bind_tailscale flag (default false) to opt in
- Set flag on helsinki-a and london-a host_vars
- Debian: configure ARGS in /etc/default/prometheus-node-exporter
- FreeBSD: use native node_exporter_listen_address rc.conf variable
- Add handlers to restart on config change

Prometheus already scrapes via Tailscale IPs, no scrape config changes needed.

Fixes PESO-98

* fix(firewall_alpine): replace empty iptables ruleset with proper INPUT filtering

The rules.v4.j2 template deployed a ruleset with INPUT ACCEPT and zero
custom rules — effectively a no-op. nuremberg-a is a public-facing mail
server and needs actual filtering.

Changes:
- INPUT default policy set to DROP
- Allow loopback, established/related, Tailscale interface, SSH, ICMP
- FORWARD stays ACCEPT for Docker port-forwarding
- Added firewall_alpine_extra_input_rules variable for host-specific rules

Mail ports remain handled by Docker's FORWARD chain, not INPUT.

Closes PESO-119
2026-04-02 21:18:11 +01:00
f2cebcdf38
Bind node_exporter to Tailscale IP on public-facing hosts (#31)
node_exporter was listening on 0.0.0.0:9100 on helsinki-a and london-a,
exposing metrics to the public internet.

Changes:
- Add node_exporter_bind_tailscale flag (default false) to opt in
- Set flag on helsinki-a and london-a host_vars
- Debian: configure ARGS in /etc/default/prometheus-node-exporter
- FreeBSD: use native node_exporter_listen_address rc.conf variable
- Add handlers to restart on config change

Prometheus already scrapes via Tailscale IPs, no scrape config changes needed.

Fixes PESO-98
2026-03-30 22:56:59 +01:00
a74213b4cb
copenhagen-a: document all live services in host_vars and docs (#30)
Audit of copenhagen-a found several running services not captured in
host_vars: cloudflared, node_exporter (systemd), and MariaDB. Also
found postgresql and redis running with no active consumers.

Updated host_vars to list all services and added undocumented_services
for the potentially unused ones. Updated docs with cloudflare tunnel,
monitoring, and notes about stale Docker images to clean up.

Closes PESO-100
2026-03-30 22:10:27 +01:00
0bcc53b01d
Document undocumented services on london-a (#29)
Audit of london-a rc.conf found several services running but not
captured in host_vars or docs: cloudflared, InfluxDB, Redis,
PostgreSQL, and libvirtd.

- InfluxDB: only _internal db, completely unused
- Redis: empty keyspace, unused
- PostgreSQL: has pez_vps db from a dead project, needs data review
- libvirtd: zero VMs, related to same dead project
- cloudflared: running tunnel 168eccae, config now captured

Also documented the weekly ZFS scrub cron (Sundays at noon) which
is in root's crontab but not ansible-managed.

Ref: PESO-101
2026-03-30 21:39:57 +01:00
eb9f026abd
Clean up stale DNS records and Caddyfile entries (#28)
Remove webdav.pez.sh DNS record (WebDAV replaced by Nextcloud AIO on cloud.pez.sh)
Remove alertmanager.pez.sh DNS record and Caddyfile block (Alertmanager not running on london-a)
Remove status-https HTTPS record pointing to old statuspage.io (status.pez.sh is self-hosted on helsinki-a)
Remove commented-out WebDAV block from Caddyfile
Remove empty section headers for decommissioned hosts (london-c, copenhagen-b, copenhagen-c)

Closes PESO-102
2026-03-30 21:12:52 +01:00
cfb2e83070 fix: remove docker-compose-v2 before installing docker-compose-plugin
copenhagen-a had Ubuntu's docker-compose-v2 package installed, which
conflicts with Docker's official docker-compose-plugin over
/usr/libexec/docker/cli-plugins/docker-compose.

Moved the removal task before the install task and added docker-compose-v2
to the removal list.
2026-03-30 18:08:50 +00:00
431c65065a
Add Docker official apt repo to docker role (#24)
* Add Docker official apt repo to docker role

The docker role was installing docker-compose-plugin which is only
available from Docker's official apt repository. helsinki-a had it
configured manually, but london-b and copenhagen-a did not, causing
deploy failures.

Now the role:
- Adds Docker's GPG key and apt repo (handles both Debian and Ubuntu)
- Installs docker-ce, docker-ce-cli, containerd.io, docker-compose-plugin
- Removes conflicting stock packages (docker.io, docker-compose)

* fix: resolve yamllint violations in docker role

- Remove standalone comment blocks that caused indentation errors
- Collapse multiline repo string to single line
- Ensure document start marker is present

* fix: keep all lines under 160 chars for yamllint

Use set_fact to build the Docker repo line in parts instead of
one long inline string.

* fix: resolve yamllint errors in london-b host_vars and promtail config

- Remove trailing blank line in inventory/host_vars/london-b.yml
- Add missing document start marker to promtail config
- Fix indentation in promtail scrape_configs (indent list items under key)

* Remove ansible-lint on push, keep PR-only

Lint already runs on pull_request — no need to double up on push to main.
2026-03-29 21:11:33 +01:00
353c2ad790
Capture london-b media stack and systemd services (#19)
Add the full media automation stack (sonarr, radarr, prowlarr, lidarr,
readarr, whisparr), media servers (jellyfin, plex), and supporting
services (transmission, samba, ollama, promtail, cloudflared, vsftpd)
to the repo as a media_stack Ansible role.

Includes:
- Custom systemd unit files for non-package-managed services
- Config files for promtail, samba, transmission, vsftpd
- Cron jobs for movie-rename-fix, sonarr/radarr midnight restarts
- Updated deploy.yml to wire the role into london-b's stage
- Updated london-b docs with full service inventory

Backup script (backup.sh) already covered by the existing backup role.
Node/systemd exporters already covered by existing monitoring roles.

Closes PESO-92
2026-03-29 19:13:48 +01:00
69918c8619
Add ZFS management role: scrub scheduling and pool monitoring (#18)
- New zfs role with cron-based scrub scheduling for Linux and FreeBSD
- Weekly Sunday scrubs at noon (matching existing manual crons)
- Add zfs_hosts inventory group with london-a and london-b
- Configure zfs_pools per host: zroot (london-a), hdd (london-b)
- Add Prometheus alert rules for degraded/faulted/offline pools
- Add zfs.yml playbook for targeted deploys

Captures the previously untracked scrub cron on london-a and
re-enables the commented-out scrub on london-b.

Refs: PESO-93
2026-03-29 19:12:42 +01:00
3d8fb84d1f
Feat/london b plex ufw (#21)
* Allow Plex port (32400/tcp) through UFW on london-b

Plex needs direct access on port 32400 for remote streaming.
Adds common_ufw_allowed_ports to london-b host_vars.

* Add BitTorrent port (6881) to london-b UFW allowed ports

Port was already manually configured in UFW, bringing it under Ansible management.

* Add Samba port (445/tcp) to london-b UFW allowed ports
2026-03-29 19:12:10 +01:00
0247f6aa6b
Fix docker-compose package conflict and alpine firewall handler (#22)
- Docker role: replace docker-compose with docker-compose-plugin (v2).
  The old docker-compose package conflicts with docker-compose-plugin
  already installed on helsinki-a. Also removes the conflicting package
  if present.

- firewall_alpine handler: use ansible.builtin.shell instead of
  ansible.builtin.command for iptables-restore, since the redirect
  operator (<) requires a shell.
2026-03-29 19:11:52 +01:00
106c45fc81
Add helsinki-a to docker_hosts inventory group (#20)
helsinki-a runs Docker containers (authelia, forgejo, bitwarden) but was
missing from docker_hosts. This means the docker role and docker-status
playbook weren't targeting it during deploys.

Closes PESO-91
2026-03-29 17:08:34 +01:00
b0acdb72e3
capture helsinki-a status page cron in repo (#17)
add status_page role that deploys update-status.sh and its cron job.
script queries prometheus for caddy upstream health and writes
status.json + history to /srv/status/ every minute.

refs: PESO-94
2026-03-29 15:39:35 +01:00
42eba42522
Add backup role to deploy hdd-backup.sh and cron to london-b (#16)
Captures the existing /root/scripts/backup.sh and its 22:00 daily cron
job as an Ansible role so it's managed via pez-infra deploys.

Refs: PESO-95
2026-03-29 15:09:01 +01:00
a7a71e4f87
capture nuremberg-a firewall rules in pez-infra (#15)
Add firewall_alpine role for Alpine hosts with iptables persistence
and fail2ban SSH jails. Wire it into nuremberg-a's deploy stage.

Mail ports are already exposed via Docker port mappings in the
poste-io docker-compose — this captures the surrounding iptables
and fail2ban config that was previously undocumented.

Closes PESO-96
2026-03-29 14:40:10 +01:00
8dffd3732b
Allow Plex port (32400/tcp) through UFW on london-b (#12)
* Allow Plex port (32400/tcp) through UFW on london-b

Plex needs direct access on port 32400 for remote streaming.
Adds common_ufw_allowed_ports to london-b host_vars.

* Add BitTorrent port (6881) to london-b UFW allowed ports

Port was already manually configured in UFW, bringing it under Ansible management.
2026-03-29 11:29:06 +01:00
99cc0d6967
Fix Alertmanager Caddyfile route pointing to Grafana port (#13)
Alertmanager reverse_proxy was pointing to :3000 (Grafana) instead of
:9093 (Alertmanager). Copy-paste artifact. Fixed in both the Caddyfile
and the template.
2026-03-29 11:07:41 +01:00
f9d0a7ebf4
fix: resolve UFW ansible-lint failures and deploy error (#11)
- Fix 'interface_or_direction' → 'direction' (required param for ufw module)
- Rename ufw_enabled/ufw_allowed_ports → common_ufw_enabled/common_ufw_allowed_ports (role prefix convention)
- Fix yaml[braces] violations in helsinki-a host_vars
2026-03-29 10:53:54 +01:00
4554dec7d2
Remove unused Prometheus alerting config (#10)
* Configure UFW firewall rules in common Ansible role

Add UFW configuration to the common role for Debian hosts:
- Default deny incoming, allow outgoing
- Allow all traffic on tailscale0 interface (mesh comms)
- Allow SSH port 22 as safety net
- Per-host allowed ports via ufw_allowed_ports variable
- Enable UFW after rules are applied

helsinki-a gets ports 80/443 for reverse proxy traffic.
Other Debian hosts only need Tailscale + SSH.

Closes PESO-79

* Remove unused alerting and rule_files from prometheus.yml

Alerting is handled by Grafana, not Prometheus Alertmanager.
The empty alertmanagers and rule_files sections were just noise.

Resolves PESO-74
2026-03-29 10:37:25 +01:00
da80c58ca4
fix: move authelia, forgejo, bitwarden to helsinki-a host_vars (#8)
These services run on helsinki-a, not london-b. Verified via docker ps
on both hosts. deploy.yml would have managed them on the wrong host.

Fixes PESO-73
2026-03-28 22:08:16 +00:00
03ce524730
Standardise Prometheus targets to Tailscale IPs (#4)
Replace local network IPs (192.168.1.x) with Tailscale IPs for
london-a and london-b in all scrape configs. This ensures consistent
connectivity via Tailscale mesh regardless of network topology changes.

Refs: PESO-80
2026-03-28 20:08:09 +00:00
92fb6f9d11 ignore all SOPS-encrypted files in yamllint 2026-03-28 18:50:08 +00:00
8bb91032f3 Add Authelia config and SOPS-encrypted secrets
- Add configuration.yml from running helsinki-a deployment
- Replace example secrets with real SOPS-encrypted config.enc.yml
- Add LDAP and SMTP password file env vars to docker-compose
  (all secrets now via file mounts, zero inline passwords)
- Update README with secret mapping and deployment steps

Closes PESO-89
2026-03-28 17:42:07 +00:00
8163b226b3
Merge pull request #2 from RWejlgaard/fix-lint-nitpicks
Fix ansible-lint yaml nitpicks
2026-03-28 13:19:37 +00:00
46063246a2 fix last 3 yaml lint failures
- add missing --- to notification-policy.yml
- prometheus.yml: replace commented-out template defaults with empty lists
2026-03-28 13:17:42 +00:00
dc198eea81 fix more yaml document-start and comment indentation
- add missing --- to 13 more yml files
- fix comment indentation in prometheus.yml
2026-03-28 13:15:46 +00:00
dc10ceacf5 fix remaining yaml lint nitpicks
- add missing document start (---) to contact-points.yml and docker-compose files
- fix extra spaces inside braces in dotfiles and common role tasks
2026-03-28 13:13:37 +00:00
6f5cb82ab9 remove pr-test.yml 2026-03-28 13:11:34 +00:00
269f1b2274 fix ansible-lint yaml nitpicks
- rules-warning.yml: remove trailing blank line
- pr-test.yml: quote 'on' key for yaml truthy, add newline at EOF
- add .yamllint config to ignore SOPS-encrypted secrets (line-length unfixable without re-encrypting)
2026-03-28 13:10:16 +00:00
cfd745b2b7 add mangos zero config and fix world service
- add mangosd.conf, realmd.conf, ahbot.conf, aiplayerbot.conf from copenhagen-a
- db password replaced with {{ mangos_db_password }} placeholder
- fix mangos-world.service: was identical copy of realmd service, now points to mangosd
- add README for mangos-zero service
2026-03-28 13:03:09 +00:00
Rasmus Wejlgaard
737d6e0bc1 initial commit 2026-03-28 12:39:41 +00:00