Commit graph

93 commits

Author SHA1 Message Date
9eb9b613c6 fix: update octopus_exporter to v1.1.1 2026-04-26 20:57:34 +01:00
d76be4828c
fix: add ssh key resource (#80) 2026-04-26 20:08:45 +01:00
19928358c5
fix: Update node version for gha (#79)
* fix: update checkout version to dodge deprecation

* fix: more deprecations

* forgot one
2026-04-26 18:35:15 +01:00
7c3fec983b
fix: Update node version for gha (#78)
* fix: update checkout version to dodge deprecation

* fix: more deprecations
2026-04-26 18:23:22 +01:00
98be03c273
fix: update checkout version to dodge deprecation (#77) 2026-04-26 18:13:38 +01:00
1c6784eade
fix: replace tailscale authkey use with oauth (#76)
2026-04-26 17:30:15 +01:00
e9fbd41cb4
fix: deploy using a matrix (#75) 2026-04-26 14:35:12 +01:00
10bb940f87
fix: update living room dashboard (#74) 2026-04-26 14:09:09 +01:00
af2f462c1c
fix: prometheus retention and authelia fix (#73)
* fix: prometheus retention time

* also fix bug with authelia

* linting issues

* more linting
2026-04-25 21:35:39 +01:00
b82013c2f0
fix: actually decommission nextcloud and TWDNE (#72)
* fix: actually decommission nextcloud and TWDNE

* ignore spaces in lint and remove dns for the services

* linting on the linting config wasn't linting the lints
2026-04-25 18:19:16 +01:00
35c5079d8f
fix: remove cloud and TWDNE and add energy dashboard for grafana (#71) 2026-04-25 17:46:17 +01:00
b3cc47f3d6
fix: optimize deploy playbook and get rid of deprecated stuff (#70) 2026-04-25 15:04:16 +01:00
7df62e8848
fix: adding octopus_exporter compose (#69)
* fix: adding octopus_exporter compose

* add the secret for octopus
2026-04-25 12:38:12 +01:00
56bec98afc
fix: Add octopus_exporter job configuration (#68) 2026-04-22 21:28:14 +01:00
c495b73720
template prometheus config (#67) 2026-04-21 20:44:37 +01:00
34820ee663
adding london-c (#66) 2026-04-20 20:52:19 +01:00
177fbb4014
Change provider for plex metrics (#65)
* change provider for plex metrics

* update plex token

* update plex token loading
2026-04-13 19:04:54 +01:00
2a98a89eb4
Change provider for plex metrics (#64)
* change provider for plex metrics

* update plex token
2026-04-12 21:21:24 +01:00
a0ec92dfdd
change provider for plex metrics (#63) 2026-04-12 18:45:30 +01:00
49cee191b5
fix: bind mariadb to local ip (#62) 2026-04-11 21:24:11 +01:00
1ef59ccc4a
fix: add mangos ports to firewall (#61) 2026-04-11 20:42:17 +01:00
1ab278e47a
only send email if something went wrong with backups (#60) 2026-04-06 18:33:07 +01:00
4c7ea76d81
fix: remove node_exporter from copenhagen-a systemd_services (#59)
node_exporter is deployed by the dedicated node_exporter Ansible role
using distro packages (prometheus-node-exporter). Having it in
systemd_services causes the systemd_services role to look for a
non-existent services/node_exporter/node_exporter.service file,
producing errors during deploy.

Resolves PESO-135
2026-04-04 12:51:52 +01:00
41d7876260
change provider for mc server for more configurability (#58) 2026-04-04 12:01:28 +01:00
849ea208f0
fix grafana alert rules missing relativeTimeRange (#57) 2026-04-04 09:58:13 +01:00
267b392996
Add sonarr service directory with README (#51)
Sonarr is running on london-b as an apt-managed systemd service
but was the only *arr service without a services/ directory in the
repo. Add services/sonarr/README.md documenting the install method,
data paths, and how it differs from the other *arr services.

Closes PESO-133
2026-04-04 09:31:39 +01:00
ed6eb22f60
Remove cloudflared — replaced by Caddy reverse proxy (#56)
Cloudflared tunnels are no longer used. All traffic now routes through
Cloudflare DNS to Caddy on helsinki-a over Tailscale.

- Remove cloudflared systemd unit files (copenhagen-a, london-b)
- Remove cloudflared from media_stack role and copenhagen-a host_vars
- Remove cloudflared references from services README and host docs
- Remove cloudflared deploy trigger from CI workflow

Live service on london-b stopped and disabled. copenhagen-a was
unreachable but the tunnel is unused regardless.
2026-04-03 22:51:12 +01:00
99c2091b96
Add smartctl-exporter to copenhagen-a and Prometheus scrape (#55)
- Add smartctl-exporter to copenhagen-a docker_services
- Add copenhagen-a as a Prometheus smartmontools scrape target
- Update compose file comment to reflect multi-host usage

Closes PESO-128
2026-04-03 21:20:20 +01:00
88377f3e93
fix: remove || true from compose lint so validation errors fail CI (#54)
The lint-docker-compose workflow was swallowing all validation errors with
|| true, meaning broken compose files would never fail the check.

- Remove || true and let validation failures propagate
- Add a pre-step that creates empty stubs for referenced env_file entries
  (e.g. bitwarden/settings.env) so docker compose config can validate
  structure without needing real secrets
- Track per-file pass/fail and exit non-zero if any file fails

Closes PESO-130
2026-04-03 20:50:47 +01:00
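
The stub-and-track approach described in #54 above could be sketched roughly as follows. This is an illustration in Python, not the workflow's actual script: the names `create_env_stubs`, `docker_compose_validate`, and `lint_compose_files` are hypothetical, and the validator is injectable so the tracking logic can be exercised without Docker. The default validator assumes `docker compose -f <file> config --quiet` is available on the runner.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable


def create_env_stubs(env_files: Iterable[str]) -> None:
    """Create empty placeholder files for referenced env_file entries."""
    for name in env_files:
        path = Path(name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)  # leaves an existing file's contents alone


def docker_compose_validate(compose_file: str) -> bool:
    """Return True if `docker compose config` accepts the file."""
    result = subprocess.run(
        ["docker", "compose", "-f", compose_file, "config", "--quiet"],
        check=False,
    )
    return result.returncode == 0


def lint_compose_files(
    files: Iterable[str],
    validate: Callable[[str], bool] = docker_compose_validate,
) -> int:
    """Validate every file, report failures, and return a process exit code."""
    failed = [f for f in files if not validate(f)]
    for f in failed:
        print(f"FAIL {f}")
    # Non-zero if any file failed, so nothing is swallowed the way `|| true` did.
    return 1 if failed else 0
```

A caller would pass the return value of `lint_compose_files` to `sys.exit` so that a single broken compose file fails the check.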
d8757d37e1
fix(london-a): correct grafana provisioning dir path (#53)
grafana.ini on london-a sets provisioning = /usr/local/etc/grafana/provisioning
but grafana_provisioning_dir pointed at /usr/local/share/grafana/conf/provisioning.

This meant deploy.yml synced alerting rules, dashboards provisioning, and
datasources to a path Grafana never reads — a from-scratch deploy would have
broken alerting entirely.

Fixes PESO-131
2026-04-03 20:20:15 +01:00
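
A mismatch like the one fixed in #53 above could be caught by a small consistency check. The sketch below is a hypothetical helper, not part of the repository: it assumes the `provisioning` key lives in the `[paths]` section of `grafana.ini`, and the function names are invented for illustration.

```python
import configparser
from typing import Optional


def provisioning_dir_from_ini(ini_text: str) -> Optional[str]:
    """Read the provisioning path from grafana.ini text, if it is set."""
    # interpolation=None so literal '%' characters in the file are not expanded.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser.get("paths", "provisioning", fallback=None)


def provisioning_matches(ini_text: str, ansible_dir: str) -> bool:
    """Compare the path Grafana reads with the path the deploy syncs to."""
    configured = provisioning_dir_from_ini(ini_text)
    if configured is None:
        return False
    return configured.rstrip("/") == ansible_dir.rstrip("/")
```

With the values quoted in the commit message, `provisioning_matches` returns `False` for the old `grafana_provisioning_dir` and `True` once the variable points at `/usr/local/etc/grafana/provisioning`.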
25d201f930
Add copenhagen-a to docker_hosts and wire up minecraft docker service (#52)
- Add copenhagen-a to [docker_hosts] inventory group so the docker role
  runs on it in Stage 2
- Add docker_services: [minecraft] to copenhagen-a host_vars
- Add docker_services role to Stage 4d (copenhagen-a) in deploy.yml
- Update deploy-on-merge scope mapping to include copenhagen-a for
  docker role changes

Closes PESO-132
2026-04-03 19:50:51 +01:00
dca6a08ba1
Remove cloudflared from london-a (PESO-134) (#50)
cloudflared has been replaced by Caddy + Authelia. Removed:
- cloudflared service config (services/cloudflared/london-a/)
- tunnel ID from london-a host_vars
- cloudflared_enable from rc.conf

Also synced rc.conf with live server state (disabled services
from PESO-113, added node_exporter_listen_address).

Live server: stopped service, removed from rc.conf, uninstalled pkg.
2026-04-03 18:51:51 +01:00
a31f8b5651
Add systemd_exporter Ansible role and Prometheus scrape config (#49)
* Add systemd_exporter Ansible role and Prometheus scrape config

- Create systemd_exporter role (download binary, create user, deploy service)
- Add scrape job for london-b:9558 and copenhagen-a:9558
- Add systemd_exporter_hosts inventory group
- Add stage 3b to deploy.yml
- Map role to deploy-on-merge scope

Closes PESO-120

* Fix line length lint violations in systemd_exporter tasks

* Fix var-naming lint: use systemd_exporter_ prefix for role variables
2026-04-03 12:23:38 +01:00
8f5eb385cc
Remove copenhagen-a from docker role mapping in deploy-on-merge (#48)
copenhagen-a is not in [docker_hosts] inventory group. Running the
docker role play against it just gets skipped, wasting CI time.

Fixes PESO-121
2026-04-03 11:49:41 +01:00
029c35fba6
Replace ASCII diagrams with mermaid in docs (#47)
Convert remaining ASCII art diagrams to mermaid syntax:
- monitoring.md: stack overview diagram
- networking.md: Tailscale mesh diagram + DNS request flow

architecture.md already used mermaid, no changes needed.

PESO-123
2026-04-03 10:48:41 +01:00
8a4a95b596
Add ZFS role to deploy.yml for scrub scheduling (#44) 2026-04-03 09:53:10 +01:00
49cea826e0
capture overseerr, syncthing, and fix slskd on london-b (#43) 2026-04-03 09:52:10 +01:00
2d7723d145
Add rule_files to prometheus.yml, remove empty node-exporter.rules (#46)
prometheus.yml was missing the rule_files section, so alerting rules
deployed to /usr/local/etc/prometheus/rules/ were never loaded.

- Add rule_files glob so Prometheus evaluates the ZFS pool rules
- Document that alerting notifications go through Grafana, not
  Alertmanager — no alerting: section needed
- Remove node-exporter.rules (all rules were commented out)

Resolves PESO-103
2026-04-03 04:49:16 +01:00
ff8d7a53e7
Remove copenhagen-a from docker_hosts and docker_services (#45)
Docker is masked on copenhagen-a and Minecraft is no longer managed
via Docker Compose. Removes:
- copenhagen-a from [docker_hosts] inventory group
- docker_services var from copenhagen-a host_vars
- docker_services role from Stage 4d deploy play

MaNGOS systemd services remain unchanged.

Fixes PESO-104
2026-04-03 04:18:46 +01:00
f75e2a8d5f
remove alertmanager caddyfile entry and clean up references (#42)
alerting is handled by grafana, not alertmanager. removed the
stale reverse proxy block from caddyfile template and updated
caddy + prometheus docs to reflect grafana-only alerting.
2026-04-03 02:49:37 +01:00
b6c8c18106
deploy-on-merge: add path-based host limiting (#41)
Instead of deploying to the entire fleet on every merge, detect which
files changed and limit ansible-playbook to only affected hosts.

Maps ansible roles, services, and host_vars to their target hosts.
Falls back to full fleet deploy for unmapped paths or changes to
shared infrastructure (common role, deploy.yml, inventory).

Closes PESO-108
2026-04-03 02:19:55 +01:00
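
The path-to-host mapping described in #41 above might look something like the following. This is a minimal sketch under stated assumptions: the path prefixes and the mapping table are hypothetical (only the host names come from this log), and returning `None` is used here to signal the full-fleet fallback.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical mapping from path prefix to affected hosts.
PATH_TO_HOSTS: Dict[str, Set[str]] = {
    "ansible/roles/media_stack/": {"london-b"},
    "ansible/host_vars/copenhagen-a": {"copenhagen-a"},
    "services/caddy/": {"helsinki-a"},
}

# Hypothetical shared-infrastructure paths that always trigger a full deploy.
SHARED_PREFIXES = (
    "ansible/roles/common/",
    "ansible/deploy.yml",
    "ansible/inventory",
)


def hosts_for_changes(changed_paths: Iterable[str]) -> Optional[Set[str]]:
    """Return the hosts affected by a change set, or None for the full fleet."""
    hosts: Set[str] = set()
    for path in changed_paths:
        if path.startswith(SHARED_PREFIXES):
            return None  # shared infrastructure changed
        matched = [h for prefix, h in PATH_TO_HOSTS.items() if path.startswith(prefix)]
        if not matched:
            return None  # unmapped path: play it safe and deploy everywhere
        for group in matched:
            hosts |= group
    return hosts
```

A non-`None` result could be joined with commas and passed to `ansible-playbook --limit`; a `None` result would run the playbook without a limit.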
853386ce2f
fix: remove custom node_exporter, standardise on package version (#40)
london-b had both a custom node_exporter.service and the
package-managed prometheus-node-exporter.service installed.
Both tried to bind port 9100, causing the package version to fail.

- Add cleanup tasks to remove custom /etc/systemd/system/node_exporter.service
  and /usr/local/bin/node_exporter if present
- Add node_exporter_extra_collectors variable for configurable collectors
- Configure london-b with systemd/processes/sysctl/ethtool/zfs collectors
  matching its previous custom setup

Resolves PESO-109
2026-04-03 01:50:13 +01:00
20274d49d4
ci: add ansible-galaxy collection install to deploy workflows (#39)
Both deploy-on-merge.yml and deploy.yml install ansible via pip but
never install the required Galaxy collections (community.docker,
community.general, ansible.posix) from ansible/requirements.yml.

This works by accident because the pip ansible package bundles some
collections, but it's fragile — a pip upgrade or runner image change
could break deploys silently.

Fixes PESO-110
2026-04-03 01:18:30 +01:00
d3bce0d5c2
nuremberg-a: add poste-io to docker_services (#38)
Adds docker_services list to nuremberg-a host_vars so the docker_services
role deploys and manages the poste-io mail container via docker compose,
replacing the current manual container setup.
2026-04-03 00:49:50 +01:00
5a5c60b6b2
london-a: disable unused services (InfluxDB, Redis, PostgreSQL, libvirtd) (#37)
Services stopped and disabled in rc.conf on london-a.
Removed audit variables from host_vars, replaced with cleanup note.

All four were leftovers from a defunct pez_vps project:
- InfluxDB: no user databases, only _internal
- Redis: empty keyspace, no clients
- PostgreSQL: defunct pez_vps database (Pez approved removal)
- libvirtd: zero VMs defined

Resolves PESO-113
2026-04-03 00:17:58 +01:00
6503bef2c6
Merge pull request #36 from RWejlgaard/fix/ansible-lint-yaml-violations
Fix ansible-lint yaml violations
2026-04-03 00:15:10 +01:00
00b967d930 fix trailing blank line in copenhagen-a host_vars and missing document start in cloudflared config 2026-04-02 23:13:18 +00:00
ca3d9c4261
Remove undocumented_services from copenhagen-a host_vars (#35)
PostgreSQL 14 and Redis have been stopped, disabled, purged, and
data directories removed from copenhagen-a. These were leftovers
from an old WordPress project with no user data.

Resolves: PESO-114
2026-04-02 23:53:15 +01:00
9317a712ec
Fix deployment methods in docs/services.md (#34)
Several services were incorrectly listed as Docker when they actually
run as native systemd services:

- helsinki-a: Caddy is apt-installed, not Docker
- london-b: Radarr, Sonarr, Lidarr, Readarr, Prowlarr are systemd
  services managed by media_stack role
- london-b: Jellyfin, Plex, Transmission are apt packages with systemd
  units

Updated Deployment column to reflect actual deployment method.

Fixes PESO-116
2026-04-02 22:48:14 +01:00
3ce559d7b9
Wire thiswebsitedoesnotexist.service into deployment pipeline
- Move unit file from services/systemd/helsinki-a/ to
  services/thiswebsitedoesnotexist/ (matches systemd_services role convention)
- Add systemd_services: [thiswebsitedoesnotexist] to helsinki-a host_vars
- Add systemd_services role to helsinki-a stage in deploy.yml
- Remove redundant caddy.service (apt manages this via the caddy role)

Closes PESO-117
2026-04-02 22:19:26 +01:00