Commit graph

44 commits

Author SHA1 Message Date
043c783361
Grafana Cloud Migration (#94)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
Terraform / Plan (push) Waiting to run
Terraform / Apply (push) Blocked by required conditions
* Grafana Cloud migration, adding dashboards, fleet, alloy and synthetics

* modulize stuff now that we have multiple substantial things in here

* provider updates and new secrets

* remove grafana and prometheus from ansible
2026-05-04 13:40:30 +01:00
83f023aedd
Migration to Grafana Cloud, nuremberg-a reinstalled, london-a reinsta… (#93)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
Terraform / Plan (push) Waiting to run
Terraform / Apply (push) Blocked by required conditions
* Migration to Grafana Cloud, nuremberg-a reinstalled, london-a reinstalled

* dns config for cockpit
2026-05-03 14:00:22 +01:00
5391c500e1
fix: loki & alloy (#83)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
* fix: loki & alloy

* fix linting
2026-04-28 16:40:45 +01:00
a7f51ec10c
fix: update octo exporter (#82)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
2026-04-27 20:10:11 +01:00
5c404dca87
fix: update octopus_exporter to v1.1.1 (#81)
Some checks failed
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
Terraform / Plan (push) Has been cancelled
Terraform / Apply (push) Has been cancelled
2026-04-26 21:01:24 +01:00
10bb940f87
fix: update living room dashboard (#74) 2026-04-26 14:09:09 +01:00
af2f462c1c
fix: prometheus retention and authelia fix (#73)
Some checks are pending
Deploy (on merge) / Deploy (push) Waiting to run
Terraform / Plan (push) Waiting to run
Terraform / Apply (push) Blocked by required conditions
* fix: prometheus retention time

* also fix bug with authelia

* linting issues

* more linting
2026-04-25 21:35:39 +01:00
b82013c2f0
fix: actually decomission nextcloud and TWDNE (#72)
* fix: actually decomission nextcloud and TWDNE

* ignore spaces in lint and remove dns for the services

* linting on the linting config wasn't linting the lints
2026-04-25 18:19:16 +01:00
35c5079d8f
fix: remove cloud and TWDNE and add energy dashboard for grafana (#71) 2026-04-25 17:46:17 +01:00
7df62e8848
fix: adding octopus_exporter compose (#69)
* fix: adding octopus_exporter compose

* add the secret for octopus
2026-04-25 12:38:12 +01:00
56bec98afc
fix: Add octopus_exporter job configuration (#68) 2026-04-22 21:28:14 +01:00
c495b73720
template prometheus config (#67) 2026-04-21 20:44:37 +01:00
177fbb4014
Change provider for plex metrics (#65)
* change provider for plex metrics

* update plex token

* update plex token loading
2026-04-13 19:04:54 +01:00
2a98a89eb4
Change provider for plex metrics (#64)
* change provider for plex metrics

* update plex token
2026-04-12 21:21:24 +01:00
a0ec92dfdd
change provider for plex metrics (#63) 2026-04-12 18:45:30 +01:00
41d7876260
change provider for mc server for more configurability (#58) 2026-04-04 12:01:28 +01:00
849ea208f0
fix grafana alert rules missing relativeTimeRange (#57) 2026-04-04 09:58:13 +01:00
267b392996
Add sonarr service directory with README (#51)
Sonarr is running on london-b as an apt-managed systemd service
but was the only *arr service without a services/ directory in the
repo. Add services/sonarr/README.md documenting the install method,
data paths, and how it differs from the other *arr services.

Closes PESO-133
2026-04-04 09:31:39 +01:00
ed6eb22f60
Remove cloudflared — replaced by Caddy reverse proxy (#56)
Cloudflared tunnels are no longer used. All traffic now routes through
Cloudflare DNS to Caddy on helsinki-a over Tailscale.

- Remove cloudflared systemd unit files (copenhagen-a, london-b)
- Remove cloudflared from media_stack role and copenhagen-a host_vars
- Remove cloudflared references from services README and host docs
- Remove cloudflared deploy trigger from CI workflow

Live service on london-b stopped and disabled. copenhagen-a was
unreachable but the tunnel is unused regardless.
2026-04-03 22:51:12 +01:00
99c2091b96
Add smartctl-exporter to copenhagen-a and Prometheus scrape (#55)
- Add smartctl-exporter to copenhagen-a docker_services
- Add copenhagen-a as a Prometheus smartmontools scrape target
- Update compose file comment to reflect multi-host usage

Closes PESO-128
2026-04-03 21:20:20 +01:00
dca6a08ba1
Remove cloudflared from london-a (PESO-134) (#50)
cloudflared has been replaced by Caddy + Authelia. Removed:
- cloudflared service config (services/cloudflared/london-a/)
- tunnel ID from london-a host_vars
- cloudflared_enable from rc.conf

Also synced rc.conf with live server state (disabled services
from PESO-113, added node_exporter_listen_address).

Live server: stopped service, removed from rc.conf, uninstalled pkg.
2026-04-03 18:51:51 +01:00
a31f8b5651
Add systemd_exporter Ansible role and Prometheus scrape config (#49)
* Add systemd_exporter Ansible role and Prometheus scrape config

- Create systemd_exporter role (download binary, create user, deploy service)
- Add scrape job for london-b:9558 and copenhagen-a:9558
- Add systemd_exporter_hosts inventory group
- Add stage 3b to deploy.yml
- Map role to deploy-on-merge scope

Closes PESO-120

* Fix line length lint violations in systemd_exporter tasks

* Fix var-naming lint: use systemd_exporter_ prefix for role variables
2026-04-03 12:23:38 +01:00
49cea826e0
capture overseerr, syncthing, and fix slskd on london-b (#43) 2026-04-03 09:52:10 +01:00
2d7723d145
Add rule_files to prometheus.yml, remove empty node-exporter.rules (#46)
prometheus.yml was missing the rule_files section, so alerting rules
deployed to /usr/local/etc/prometheus/rules/ were never loaded.

- Add rule_files glob so Prometheus evaluates the ZFS pool rules
- Document that alerting notifications go through Grafana, not
  Alertmanager — no alerting: section needed
- Remove node-exporter.rules (all rules were commented out)

Resolves PESO-103
2026-04-03 04:49:16 +01:00
f75e2a8d5f
remove alertmanager caddyfile entry and clean up references (#42)
alerting is handled by grafana, not alertmanager. removed the
stale reverse proxy block from caddyfile template and updated
caddy + prometheus docs to reflect grafana-only alerting.
2026-04-03 02:49:37 +01:00
00b967d930 fix trailing blank line in copenhagen-a host_vars and missing document start in cloudflared config 2026-04-02 23:13:18 +00:00
3ce559d7b9
Wire thiswebsitedoesnotexist.service into deployment pipeline
- Move unit file from services/systemd/helsinki-a/ to
  services/thiswebsitedoesnotexist/ (matches systemd_services role convention)
- Add systemd_services: [thiswebsitedoesnotexist] to helsinki-a host_vars
- Add systemd_services role to helsinki-a stage in deploy.yml
- Remove redundant caddy.service (apt manages this via the caddy role)

Closes PESO-117
2026-04-02 22:19:26 +01:00
0bcc53b01d
Document undocumented services on london-a (#29)
Audit of london-a rc.conf found several services running but not
captured in host_vars or docs: cloudflared, InfluxDB, Redis,
PostgreSQL, and libvirtd.

- InfluxDB: only _internal db, completely unused
- Redis: empty keyspace, unused
- PostgreSQL: has pez_vps db from a dead project, needs data review
- libvirtd: zero VMs, related to same dead project
- cloudflared: running tunnel 168eccae, config now captured

Also documented the weekly ZFS scrub cron (Sundays at noon) which
is in root's crontab but not ansible-managed.

Ref: PESO-101
2026-03-30 21:39:57 +01:00
eb9f026abd
Clean up stale DNS records and Caddyfile entries (#28)
Remove webdav.pez.sh DNS record (WebDAV replaced by Nextcloud AIO on cloud.pez.sh)
Remove alertmanager.pez.sh DNS record and Caddyfile block (Alertmanager not running on london-a)
Remove status-https HTTPS record pointing to old statuspage.io (status.pez.sh is self-hosted on helsinki-a)
Remove commented-out WebDAV block from Caddyfile
Remove empty section headers for decommissioned hosts (london-c, copenhagen-b, copenhagen-c)

Closes PESO-102
2026-03-30 21:12:52 +01:00
431c65065a
Add Docker official apt repo to docker role (#24)
* Add Docker official apt repo to docker role

The docker role was installing docker-compose-plugin which is only
available from Docker's official apt repository. helsinki-a had it
configured manually, but london-b and copenhagen-a did not, causing
deploy failures.

Now the role:
- Adds Docker's GPG key and apt repo (handles both Debian and Ubuntu)
- Installs docker-ce, docker-ce-cli, containerd.io, docker-compose-plugin
- Removes conflicting stock packages (docker.io, docker-compose)

* fix: resolve yamllint violations in docker role

- Remove standalone comment blocks that caused indentation errors
- Collapse multiline repo string to single line
- Ensure document start marker is present

* fix: keep all lines under 160 chars for yamllint

Use set_fact to build the Docker repo line in parts instead of
one long inline string.

* fix: resolve yamllint errors in london-b host_vars and promtail config

- Remove trailing blank line in inventory/host_vars/london-b.yml
- Add missing document start marker to promtail config
- Fix indentation in promtail scrape_configs (indent list items under key)

* Remove ansible-lint on push, keep PR-only

Lint already runs on pull_request — no need to double up on push to main.
2026-03-29 21:11:33 +01:00
353c2ad790
Capture london-b media stack and systemd services (#19)
Add the full media automation stack (sonarr, radarr, prowlarr, lidarr,
readarr, whisparr), media servers (jellyfin, plex), and supporting
services (transmission, samba, ollama, promtail, cloudflared, vsftpd)
to the repo as a media_stack Ansible role.

Includes:
- Custom systemd unit files for non-package-managed services
- Config files for promtail, samba, transmission, vsftpd
- Cron jobs for movie-rename-fix, sonarr/radarr midnight restarts
- Updated deploy.yml to wire the role into london-b's stage
- Updated london-b docs with full service inventory

Backup script (backup.sh) already covered by the existing backup role.
Node/systemd exporters already covered by existing monitoring roles.

Closes PESO-92
2026-03-29 19:13:48 +01:00
69918c8619
Add ZFS management role: scrub scheduling and pool monitoring (#18)
- New zfs role with cron-based scrub scheduling for Linux and FreeBSD
- Weekly Sunday scrubs at noon (matching existing manual crons)
- Add zfs_hosts inventory group with london-a and london-b
- Configure zfs_pools per host: zroot (london-a), hdd (london-b)
- Add Prometheus alert rules for degraded/faulted/offline pools
- Add zfs.yml playbook for targeted deploys

Captures the previously untracked scrub cron on london-a and
re-enables the commented-out scrub on london-b.

Refs: PESO-93
2026-03-29 19:12:42 +01:00
b0acdb72e3
capture helsinki-a status page cron in repo (#17)
add status_page role that deploys update-status.sh and its cron job.
script queries prometheus for caddy upstream health and writes
status.json + history to /srv/status/ every minute.

refs: PESO-94
2026-03-29 15:39:35 +01:00
99cc0d6967
Fix Alertmanager Caddyfile route pointing to Grafana port (#13)
Alertmanager reverse_proxy was pointing to :3000 (Grafana) instead of
:9093 (Alertmanager). Copy-paste artifact. Fixed in both the Caddyfile
and the template.
2026-03-29 11:07:41 +01:00
4554dec7d2
Remove unused Prometheus alerting config (#10)
* Configure UFW firewall rules in common Ansible role

Add UFW configuration to the common role for Debian hosts:
- Default deny incoming, allow outgoing
- Allow all traffic on tailscale0 interface (mesh comms)
- Allow SSH port 22 as safety net
- Per-host allowed ports via ufw_allowed_ports variable
- Enable UFW after rules are applied

helsinki-a gets ports 80/443 for reverse proxy traffic.
Other Debian hosts only need Tailscale + SSH.

Closes PESO-79

* Remove unused alerting and rule_files from prometheus.yml

Alerting is handled by Grafana, not Prometheus Alertmanager.
The empty alertmanagers and rule_files sections were just noise.

Resolves PESO-74
2026-03-29 10:37:25 +01:00
03ce524730
Standardise Prometheus targets to Tailscale IPs (#4)
Replace local network IPs (192.168.1.x) with Tailscale IPs for
london-a and london-b in all scrape configs. This ensures consistent
connectivity via Tailscale mesh regardless of network topology changes.

Refs: PESO-80
2026-03-28 20:08:09 +00:00
8bb91032f3 Add Authelia config and SOPS-encrypted secrets
- Add configuration.yml from running helsinki-a deployment
- Replace example secrets with real SOPS-encrypted config.enc.yml
- Add LDAP and SMTP password file env vars to docker-compose
  (all secrets now via file mounts, zero inline passwords)
- Update README with secret mapping and deployment steps

Closes PESO-89
2026-03-28 17:42:07 +00:00
8163b226b3
Merge pull request #2 from RWejlgaard/fix-lint-nitpicks
Fix ansible-lint yaml nitpicks
2026-03-28 13:19:37 +00:00
46063246a2 fix last 3 yaml lint failures
- add missing --- to notification-policy.yml
- prometheus.yml: replace commented-out template defaults with empty lists
2026-03-28 13:17:42 +00:00
dc198eea81 fix more yaml document-start and comment indentation
- add missing --- to 13 more yml files
- fix comment indentation in prometheus.yml
2026-03-28 13:15:46 +00:00
dc10ceacf5 fix remaining yaml lint nitpicks
- add missing document start (---) to contact-points.yml and docker-compose files
- fix extra spaces inside braces in dotfiles and common role tasks
2026-03-28 13:13:37 +00:00
269f1b2274 fix ansible-lint yaml nitpicks
- rules-warning.yml: remove trailing blank line
- pr-test.yml: quote 'on' key for yaml truthy, add newline at EOF
- add .yamllint config to ignore SOPS-encrypted secrets (line-length unfixable without re-encrypting)
2026-03-28 13:10:16 +00:00
cfd745b2b7 add mangos zero config and fix world service
- add mangosd.conf, realmd.conf, ahbot.conf, aiplayerbot.conf from copenhagen-a
- db password replaced with {{ mangos_db_password }} placeholder
- fix mangos-world.service: was identical copy of realmd service, now points to mangosd
- add README for mangos-zero service
2026-03-28 13:03:09 +00:00
Rasmus Wejlgaard
737d6e0bc1 initial commit 2026-03-28 12:39:41 +00:00