Commit graph

28 commits

Author SHA1 Message Date
87439d47b8
ci: extract shared SOPS/tofu steps into composite actions (#135)
Some checks failed
Terraform / Plan (push) Has been cancelled
Terraform / Apply (push) Has been cancelled
The SOPS install + version, the decrypt loop, the OpenTofu version, and
the Backblaze backend-credential extraction were copy-pasted across
terraform.yml (twice), validate-terraform.yml, and _deploy-core.yml.
A version bump meant editing the same string in up to four places and
was easy to do partially.

Pull them into three local composite actions so each is defined once:
  - setup-tofu          (pins OpenTofu version)
  - sops-decrypt        (installs SOPS, decrypts *.enc.* in place)
  - tofu-backend-creds  (exports Backblaze S3 creds to GITHUB_ENV)

Behaviour is unchanged; sops-decrypt also matches *.enc.env everywhere
(previously only _deploy-core did), which is a no-op in terraform/.
2026-06-18 20:27:54 +01:00
e9d5f9bc76
ci: make Caddyfile validation download robust (#134)
The validate-caddyfile workflow fetched the Caddy binary by first hitting
api.github.com/releases/latest to resolve the version tag, then building a
release-asset URL from it. That API call is unauthenticated, so it shares
the 60-requests/hour-per-IP limit across all GitHub-hosted runners and
returns 403 under load. On failure jq emits "null", the URL becomes
caddy_null_linux_amd64.tar.gz, and `curl -sL` silently pipes a 404 page
into tar — a confusing, flaky failure on every PR that touches the Caddyfile.

Switch to Caddy's official download API, which serves the latest linux/amd64
binary directly: one request, no GitHub API, no jq/tar parsing. Add `-f` so
curl fails loudly on an HTTP error instead of writing an error page to disk.
2026-06-15 20:38:21 +01:00
26f8224941
make Dependabot tofu validate stubs satisfy provider validators (#132)
Some checks failed
Deploy (on merge) / Discover hosts (push) Has been cancelled
Terraform / Plan (push) Has been cancelled
Deploy (on merge) / deploy (push) Has been cancelled
Terraform / Apply (push) Has been cancelled
2026-06-12 19:25:24 +01:00
dependabot[bot]
7f2cbd4af1
chore(deps): bump the github-actions group across 1 directory with 2 updates (#117)
Bumps the github-actions group with 2 updates in the / directory: [ansible/ansible-lint](https://github.com/ansible/ansible-lint) and [actions/github-script](https://github.com/actions/github-script).


Updates `ansible/ansible-lint` from 25 to 26
- [Release notes](https://github.com/ansible/ansible-lint/releases)
- [Commits](https://github.com/ansible/ansible-lint/compare/v25...v26)

Updates `actions/github-script` from 7 to 9
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
- dependency-name: ansible/ansible-lint
  dependency-version: '26'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-05 21:13:03 +01:00
9815f44b84
fix: stop masking failed service deploys; trim dead config (#119)
Some checks failed
Deploy (on merge) / Discover hosts (push) Has been cancelled
Deploy (on merge) / deploy (push) Has been cancelled
The docker_services and systemd_services roles ran their "start the
service" tasks with `failed_when: false`, so a container or unit that
failed to come up still reported the deploy as green. Drop it from both
start tasks so a broken deploy actually fails CI. The compose/unit *copy*
tasks keep `failed_when: false` — that's load-bearing for the
`item is not failed` filter that skips services without a compose/unit file.

Also:
- Remove a duplicate "Template service .env files" task in docker_services
  (second copy used a hardcoded path and didn't register; first one is the
  one the start task reads).
- Don't trigger a full fleet deploy on docs/markdown/workflow-only pushes
  to main — add docs/**, **/*.md and .github/** to paths-ignore.
- Drop the dangling `update-freebsd` Make target (playbook doesn't exist;
  fleet has no FreeBSD hosts).
2026-06-04 18:41:24 +01:00
7b2552fea5
chore: fix dependabot PRs (#118)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / deploy (push) Blocked by required conditions
* chore: add dependabot config

Add Dependabot for the three supported ecosystems in this repo:
GitHub Actions, Terraform (root + grafana/hetzner/pagerduty modules),
and Docker (service compose files + dotfile Dockerfiles). Weekly
schedule with per-ecosystem grouping to keep PR noise down.

* ci: make terraform validation work on dependabot PRs

Dependabot PRs run with no access to repository secrets and a read-only
token, so the SOPS decrypt step (and the PR-comment step) fail. Give
Dependabot a secret-free path: stub the secrets.yaml keys it reads and
run init -backend=false + validate, skipping decrypt/plan/comment. Human
PRs are unchanged and still get a full plan.
2026-06-03 19:29:23 +01:00
65090ca9d6
ci: serialize terraform and deploy runs with concurrency guards (#114)
Some checks failed
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / deploy (push) Blocked by required conditions
Terraform / Plan (push) Has been cancelled
Terraform / Apply (push) Has been cancelled
* ci: serialize infra runs and enable terraform state locking

Add concurrency guards to the terraform and deploy-on-merge workflows so
two merges in quick succession can't run against the same state or the
same hosts at once (queue, never cancel an in-flight run).

Enable native S3 state locking (use_lockfile) on the Backblaze B2 backend,
which needs OpenTofu 1.10+, so bump the CI tofu version 1.9.0 -> 1.10.10
and the required_version constraint to >= 1.10.0.

* ci: bump tofu to 1.10.10 in the validate workflow too

Missed this one in the last commit — the PR-time validate still pinned
1.9.0, which trips the new required_version >= 1.10.0 constraint.

* ci: drop use_lockfile — Backblaze B2 can't do native state locking

B2's S3 API returns 501 NotImplemented for the conditional PutObject that
use_lockfile relies on, so tofu plan/apply fails to acquire the lock.
Revert the lockfile and the 1.10 version bump it required; rely on the
concurrency guard to serialize applies instead. Left a note in the
backend block so this isn't re-attempted.
2026-06-02 19:39:13 +01:00
1ec4e10eb1
Update cache action (#111)
Some checks failed
Deploy (on merge) / Discover hosts (push) Has been cancelled
Deploy (on merge) / deploy (push) Has been cancelled
* fix: update cache version

* fix: update cache
2026-05-16 11:13:38 +01:00
a6aa561147
fix: update cache version (#110) 2026-05-16 11:03:12 +01:00
7ad2766f94
hotfix: broken pipeline (#109)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / deploy (push) Blocked by required conditions
* fix: cleanup deploy.yml and share workflow

* lint issue

* hotfix: broken pipeline
2026-05-15 20:19:56 +01:00
9f84652102
fix: cleanup deploy.yml and share workflow (#108)
* fix: cleanup deploy.yml and share workflow

* lint issue
2026-05-15 20:17:28 +01:00
7d22ad1ce1
bug: add retry to restarting caddy (#97)
Some checks failed
Terraform / Plan (push) Waiting to run
Terraform / Apply (push) Blocked by required conditions
Deploy (on merge) / Discover hosts (push) Has been cancelled
Deploy (on merge) / Deploy → (push) Has been cancelled
* bug: add retry to restarting caddy

* skip terraform pipeline when no terraform changes has been done
2026-05-05 20:42:52 +01:00
abb283c1d7
terraform plan on pr and caddy metrics on localhost since we have all… (#96)
* terraform plan on pr and caddy metrics on localhost since we have alloy now

* remove refreshing state
2026-05-05 13:35:37 +01:00
19928358c5
fix: Update node version for gha (#79)
* fix: update checkout version to dodge deprecation

* fix: more deprecations

* forgot one
2026-04-26 18:35:15 +01:00
7c3fec983b
fix: Update node version for gha (#78)
* fix: update checkout version to dodge deprecation

* fix: more deprecations
2026-04-26 18:23:22 +01:00
98be03c273
fix: update checkout version to dodge deprecation (#77) 2026-04-26 18:13:38 +01:00
1c6784eade
fix: replace tailscale authkey use with oauth (#76)
Some checks are pending
Deploy (on merge) / Discover hosts (push) Waiting to run
Deploy (on merge) / Deploy → (push) Blocked by required conditions
2026-04-26 17:30:15 +01:00
e9fbd41cb4
fix: deploy using a matrix (#75) 2026-04-26 14:35:12 +01:00
ed6eb22f60
Remove cloudflared — replaced by Caddy reverse proxy (#56)
Cloudflared tunnels are no longer used. All traffic now routes through
Cloudflare DNS to Caddy on helsinki-a over Tailscale.

- Remove cloudflared systemd unit files (copenhagen-a, london-b)
- Remove cloudflared from media_stack role and copenhagen-a host_vars
- Remove cloudflared references from services README and host docs
- Remove cloudflared deploy trigger from CI workflow

Live service on london-b stopped and disabled. copenhagen-a was
unreachable but the tunnel is unused regardless.
2026-04-03 22:51:12 +01:00
88377f3e93
fix: remove || true from compose lint so validation errors fail CI (#54)
The lint-docker-compose workflow was swallowing all validation errors with
|| true, meaning broken compose files would never fail the check.

- Remove || true and let validation failures propagate
- Add a pre-step that creates empty stubs for referenced env_file entries
  (e.g. bitwarden/settings.env) so docker compose config can validate
  structure without needing real secrets
- Track per-file pass/fail and exit non-zero if any file fails

Closes PESO-130
2026-04-03 20:50:47 +01:00
25d201f930
Add copenhagen-a to docker_hosts and wire up minecraft docker service (#52)
- Add copenhagen-a to [docker_hosts] inventory group so the docker role
  runs on it in Stage 2
- Add docker_services: [minecraft] to copenhagen-a host_vars
- Add docker_services role to Stage 4d (copenhagen-a) in deploy.yml
- Update deploy-on-merge scope mapping to include copenhagen-a for
  docker role changes

Closes PESO-132
2026-04-03 19:50:51 +01:00
a31f8b5651
Add systemd_exporter Ansible role and Prometheus scrape config (#49)
* Add systemd_exporter Ansible role and Prometheus scrape config

- Create systemd_exporter role (download binary, create user, deploy service)
- Add scrape job for london-b:9558 and copenhagen-a:9558
- Add systemd_exporter_hosts inventory group
- Add stage 3b to deploy.yml
- Map role to deploy-on-merge scope

Closes PESO-120

* Fix line length lint violations in systemd_exporter tasks

* Fix var-naming lint: use systemd_exporter_ prefix for role variables
2026-04-03 12:23:38 +01:00
8f5eb385cc
Remove copenhagen-a from docker role mapping in deploy-on-merge (#48)
copenhagen-a is not in [docker_hosts] inventory group. Running the
docker role play against it just gets skipped, wasting CI time.

Fixes PESO-121
2026-04-03 11:49:41 +01:00
b6c8c18106
deploy-on-merge: add path-based host limiting (#41)
Instead of deploying to the entire fleet on every merge, detect which
files changed and limit ansible-playbook to only affected hosts.

Maps ansible roles, services, and host_vars to their target hosts.
Falls back to full fleet deploy for unmapped paths or changes to
shared infrastructure (common role, deploy.yml, inventory).

Closes PESO-108
2026-04-03 02:19:55 +01:00
20274d49d4
ci: add ansible-galaxy collection install to deploy workflows (#39)
Both deploy-on-merge.yml and deploy.yml install ansible via pip but
never install the required Galaxy collections (community.docker,
community.general, ansible.posix) from ansible/requirements.yml.

This works by accident because the pip ansible package bundles some
collections, but it's fragile — a pip upgrade or runner image change
could break deploys silently.

Fixes PESO-110
2026-04-03 01:18:30 +01:00
b16f89357b
replace hard set ip with vars (#25)
* replace hard set ip with vars

* run all PR checks every time
2026-03-29 21:33:50 +01:00
431c65065a
Add Docker official apt repo to docker role (#24)
* Add Docker official apt repo to docker role

The docker role was installing docker-compose-plugin which is only
available from Docker's official apt repository. helsinki-a had it
configured manually, but london-b and copenhagen-a did not, causing
deploy failures.

Now the role:
- Adds Docker's GPG key and apt repo (handles both Debian and Ubuntu)
- Installs docker-ce, docker-ce-cli, containerd.io, docker-compose-plugin
- Removes conflicting stock packages (docker.io, docker-compose)

* fix: resolve yamllint violations in docker role

- Remove standalone comment blocks that caused indentation errors
- Collapse multiline repo string to single line
- Ensure document start marker is present

* fix: keep all lines under 160 chars for yamllint

Use set_fact to build the Docker repo line in parts instead of
one long inline string.

* fix: resolve yamllint errors in london-b host_vars and promtail config

- Remove trailing blank line in inventory/host_vars/london-b.yml
- Add missing document start marker to promtail config
- Fix indentation in promtail scrape_configs (indent list items under key)

* Remove ansible-lint on push, keep PR-only

Lint already runs on pull_request — no need to double up on push to main.
2026-03-29 21:11:33 +01:00
Rasmus Wejlgaard
737d6e0bc1 initial commit 2026-03-28 12:39:41 +00:00