The nightly job runs 'rclone sync', which permanently deletes or overwrites
objects at the B2 destination. That means an accidental deletion or a
ransomware encryption on /hdd propagates straight to the backup on the next
run, leaving no clean copy.
Add --backup-dir so every superseded version is moved into a dated folder
under _versions/ rather than thrown away, and prune that folder after 30
days so it doesn't grow unbounded.
The docker_services and systemd_services roles ran their "start the
service" tasks with `failed_when: false`, so a container or unit that
failed to come up still reported the deploy as green. Drop it from both
start tasks so a broken deploy actually fails CI. The compose/unit *copy*
tasks keep `failed_when: false` — that's load-bearing for the
`item is not failed` filter that skips services without a compose/unit file.
Also:
- Remove a duplicate "Template service .env files" task in docker_services
(second copy used a hardcoded path and didn't register; first one is the
one the start task reads).
- Don't trigger a full fleet deploy on docs/markdown/workflow-only pushes
to main — add docs/**, **/*.md and .github/** to paths-ignore.
- Drop the dangling `update-freebsd` Make target (playbook doesn't exist;
fleet has no FreeBSD hosts).
* Grafana Cloud migration, adding dashboards, fleet, alloy and synthetics
* modulize stuff now that we have multiple substantial things in here
* provider updates and new secrets
* remove grafana and prometheus from ansible
* fix: actually decomission nextcloud and TWDNE
* ignore spaces in lint and remove dns for the services
* linting on the linting config wasn't linting the lints
node_exporter is deployed by the dedicated node_exporter Ansible role
using distro packages (prometheus-node-exporter). Having it in
systemd_services causes the systemd_services role to look for a
non-existent services/node_exporter/node_exporter.service file,
producing errors during deploy.
Resolves PESO-135
Sonarr is running on london-b as an apt-managed systemd service
but was the only *arr service without a services/ directory in the
repo. Add services/sonarr/README.md documenting the install method,
data paths, and how it differs from the other *arr services.
Closes PESO-133
Cloudflared tunnels are no longer used. All traffic now routes through
Cloudflare DNS to Caddy on helsinki-a over Tailscale.
- Remove cloudflared systemd unit files (copenhagen-a, london-b)
- Remove cloudflared from media_stack role and copenhagen-a host_vars
- Remove cloudflared references from services README and host docs
- Remove cloudflared deploy trigger from CI workflow
Live service on london-b stopped and disabled. copenhagen-a was
unreachable but the tunnel is unused regardless.
grafana.ini on london-a sets provisioning = /usr/local/etc/grafana/provisioning
but grafana_provisioning_dir pointed at /usr/local/share/grafana/conf/provisioning.
This meant deploy.yml synced alerting rules, dashboards provisioning, and
datasources to a path Grafana never reads — a from-scratch deploy would have
broken alerting entirely.
Fixes PESO-131
- Add copenhagen-a to [docker_hosts] inventory group so the docker role
runs on it in Stage 2
- Add docker_services: [minecraft] to copenhagen-a host_vars
- Add docker_services role to Stage 4d (copenhagen-a) in deploy.yml
- Update deploy-on-merge scope mapping to include copenhagen-a for
docker role changes
Closes PESO-132
cloudflared has been replaced by Caddy + Authelia. Removed:
- cloudflared service config (services/cloudflared/london-a/)
- tunnel ID from london-a host_vars
- cloudflared_enable from rc.conf
Also synced rc.conf with live server state (disabled services
from PESO-113, added node_exporter_listen_address).
Live server: stopped service, removed from rc.conf, uninstalled pkg.
* Add systemd_exporter Ansible role and Prometheus scrape config
- Create systemd_exporter role (download binary, create user, deploy service)
- Add scrape job for london-b:9558 and copenhagen-a:9558
- Add systemd_exporter_hosts inventory group
- Add stage 3b to deploy.yml
- Map role to deploy-on-merge scope
Closes PESO-120
* Fix line length lint violations in systemd_exporter tasks
* Fix var-naming lint: use systemd_exporter_ prefix for role variables
prometheus.yml was missing the rule_files section, so alerting rules
deployed to /usr/local/etc/prometheus/rules/ were never loaded.
- Add rule_files glob so Prometheus evaluates the ZFS pool rules
- Document that alerting notifications go through Grafana, not
Alertmanager — no alerting: section needed
- Remove node-exporter.rules (all rules were commented out)
Resolves PESO-103