ADR 043: Unified Testing Architecture
Status: Accepted
Type: Process
Created: 2026-06-05
Supersedes: ADR 026 (Dashboard UI Test Classification)
Related-ADRs: 004 (development-tooling), 027 (config-system-refactoring-for-testability), 030 (two-level-build-architecture), 041 (privileged-operations-agent)
Related-notes: notes/v0.6-rootd-hardening.md, docs/src/dev/testing-strategy.md
Context
What's wrong
The project accumulated several parallel testing approaches, each added for a good local reason, with no unifying model. A contributor must hold a large set of entry points across many layers in their head — and know which still work, because a sizeable fraction are dead. An audit of the full surface (pytest layers, the hop3-test runner, the standalone demos/ harness, the tutorial scripts, validoc, nox, the Makefile/CI) found:
- No fast daily path, and the defaults are slow. c_system sits in the default testpaths, so bare uv run pytest and make test pull the Docker layer and can trigger a multi-minute image build. The fast unit suite is reachable only by hand-typing its directory. The loudest day-to-day complaint.
- The primary GitHub CI workflow is dead. ci.yml (the broadest trigger: every push and PR) runs nox sessions that do not exist; it fails at session selection on every run. The CI of record is SourceHut (.builds/), not GitHub Actions.
- hop3-test advertises commands it doesn't register. Only system, list, cloud exist; dev / apps / ci / nightly / hetzner / multi-distro / show error out, yet the README, CLAUDE.md, and the Makefile call them, so several Makefile targets are broken.
- Deploy-and-verify is implemented four times over the same path. hop3-test system, the pytest Docker fixtures, demos/demo.py + demos/lib, and scripts/run-all-tutorials.py each stand up a server and verify an app independently, over the one real hop3-deploy primitive that hop3-testing's DeploymentTarget ABC already wraps.
- Diagnostics are forked, with inverted coverage. The richest collector (journalctl, nginx error/access, docker daemon) is wired only to the Hetzner path; the everyday Docker run gets the leanest bundle; the pytest Docker layers collect nothing on failure. Crucially, no surface has a proxy-reachability probe, so the most common production failure, a healthy app behind a front-end 502 because nginx points at the wrong port/host (the "silent-502" class), is captured by none. This is the failure that motivated the review.
- The pyramid has collapsed in practice. c_system no longer pulls its weight: its only live tests are one fully-skipped module and one in-process TestClient + in-memory-DB test (which belongs in b_integration). Every test embodying the original "CLI ↔ server in a container, no deploys" intent is disabled. The real boundary today is b_integration (in-process, fast) vs d_e2e (Docker, real deploy).
- Three incompatible speed taxonomies, none load-bearing. Directory layers, decorative pytest markers, and hop3-test's P0/P1/P2 + tier labels. pytest -m "not slow" cannot select a fast lane because the markers aren't applied.
- Whole packages run in no CI. hop3-installer, hop3-rootd (security-sensitive; see ADR 041 and the v0.6 hardening note), hop3-tui, and hop3-testing appear in no Makefile target and no workflow.
What is actually good (and must be kept)
- The DeploymentTarget ABC (hop3-testing/targets/) is a clean abstraction over docker/ssh/hetzner: the right single deploy-and-verify primitive.
- validoc is a genuine literate-test substrate: tutorials are Markdown with executable, asserted code blocks. It can drive any CLI scenario, not just tutorials.
- An HTML report generator already exists (diagnostics.py::generate_html_report): the nightly artifact we want.
- nox is alive as SourceHut's multi-Python driver; only the GitHub workflow that referenced non-existent sessions is dead.
Why now
Pre-1.0, with license to make breaking changes. The cost of carrying a broken, fast-loop-less, diagnosis-less testing surface into 1.0 is far higher than consolidating now. Critically, the consolidation is mostly clarification and deletion, not a rewrite — the one genuinely new component is a shared diagnostic bundle, which we owe the operator regardless.
Decision
1. Three runners, three domains
We keep three runners, each with a crisp, non-overlapping domain; we do not collapse them into one binary, and pytest stays a first-class, directly-invokable runner.
| Runner | Tests… | Notes |
|---|---|---|
| pytest | the platform code (unit → integration → e2e) | the inner-loop tool; the only runner that produces coverage |
| hop3-test | applications (real apps + demos): deploy + verify | exercises the platform indirectly; owns the app catalog + the DeploymentTarget ABC |
| validoc | narratives: tutorials-as-tests, long CLI-driven scenarios | literate, doc-shaped; verifies docs match reality |
The "unification" is not one command. It is: (a) a memorable map of which runner for what, (b) one set of speed tiers across all three, and (c) one shared diagnostic bundle every runner calls whenever a deployed server is involved.
2. The pytest pyramid: three layers + a placement rule
c_system dissolves, and d_e2e is renamed c_e2e:
| Layer | Contains | Docker / root / host-mutation? | Counts toward coverage? |
|---|---|---|---|
| a_unit | pure functions/classes, no real I/O | no | ✅ yes |
| b_integration | multi-component, real DB/subprocess, but hermetic | no | ✅ yes |
| c_e2e | deploys / real-server tests | yes | ❌ no (out-of-process) |
The placement rule (the crux). The axis is not "is this test complex?" — it is "does this test need Docker, root, a real server, or host mutation?" Pure logic → a_unit. Heavyweight but hermetic (runs in tmp_path, temp/in-memory DB, monkeypatched HOME, no privileged ops, no /etc, cleans up after itself) → b_integration. Needs Docker / root / a real server / system mutation → c_e2e. This deliberately pushes work down: prefer b_integration whenever a test can run hermetically.
Duplication across layers is allowed and expected. A hermetic backup-logic test in b_integration and a real-deploy backup test in c_e2e test different things and should coexist; the only question is which layer a test's dependencies place it in.
This supersedes ADR 026, which classified dashboard tests by the real-vs-mocked-dependency axis. Under the Docker/root/host-mutation axis, a test running real App.create() against tmp_path with an in-memory DB is hermetic and belongs in b_integration — reversing ADR 026. The one in-process c_system test moves there; the skipped Postgres-backup test is made hermetic or moved to c_e2e; the disabled proxy tests are revived into c_e2e (they exercise the silent-502 class).
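The placement rule reduces to two questions asked in order, which can be sketched as a small decision function. The `Needs` model and the `place_layer` name are hypothetical, introduced here only to illustrate the rule; nothing like them is claimed to exist in the Hop3 codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    """What a test depends on (hypothetical model, for illustration only)."""

    docker: bool = False
    root: bool = False
    real_server: bool = False
    host_mutation: bool = False  # e.g. writes under /etc or leaves state behind
    real_io: bool = False  # tmp_path files, temp/in-memory DB, subprocesses


def place_layer(needs: Needs) -> str:
    """Return the pytest layer the placement rule assigns to a test."""
    # First question: does the test need anything non-hermetic?
    if needs.docker or needs.root or needs.real_server or needs.host_mutation:
        return "c_e2e"
    # Second question: hermetic, but does it perform real I/O?
    if needs.real_io:
        return "b_integration"
    # Pure logic only.
    return "a_unit"
```

Under this sketch, a test running real App.create() against tmp_path with an in-memory DB has only `real_io` set, so it lands in b_integration, which matches the reversal of ADR 026 described above.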
3. Coverage policy
Coverage is measured on a_unit + b_integration only — the in-process layers coverage.py can see. c_e2e runs in a separate process/container and is pass/fail. This produces a happy alignment: the two fast layers are exactly the two that move the coverage number, so the placement rule both tightens the feedback loop and raises coverage.
4. The speed-tier ladder
Tiers are named by feedback latency and selected by pytest markers stamped from the directory layer by an autouse conftest hook — so the markers become real with zero per-test edits and stop depending on the broken directory pyramid.
| Tier | Scope | Docker? | Dev moment | CI |
|---|---|---|---|---|
| fast | a_unit (+ fastest b_integration) | no | every save / pre-commit | non-Docker distros (Rocky, NixOS) |
| check | full pytest pyramid incl. c_e2e (backups, git-push, proxy) + lint + types | yes | pre-push / PR | Docker distros (Ubuntu) |
| apps | hop3-test: a P0 subset, or one named app | yes | touching deployer/proxy/builders | on-demand |
| nightly | full hop3-test app/demo matrix + validoc tutorials + multi-distro, HTML report | yes | cron / release | SourceHut nightly |
The slow Docker platform tests (backups, git-push, proxy) live in check, not nightly-only — they are core guarantees and should gate a push. check requires Docker, so non-Docker distros run fast instead, matching the .builds/ split. Bare pytest (and make test) run the in-process layers only (a_unit + b_integration) — the testpaths default excludes the Docker c_e2e layer, so a reflexive pytest never spins up Docker; CI invokes c_e2e explicitly.
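One way the directory-to-marker stamping could look in a top-level conftest.py is sketched below. It uses pytest's `pytest_collection_modifyitems` collection hook. Only the `fast` marker is named in this ADR; the other marker names, the `markers_for` helper, and the choice to stamp `fast` on a_unit alone are assumptions, since the text does not say how the "fastest b_integration" tests are identified.

```python
from pathlib import Path

# Directory layer -> markers to stamp. Only "fast" comes from the ADR;
# "unit", "integration" and "e2e" are assumed names for illustration.
LAYER_MARKERS = {
    "a_unit": ("unit", "fast"),
    "b_integration": ("integration",),
    "c_e2e": ("e2e",),
}


def markers_for(path: Path) -> tuple[str, ...]:
    """Return the markers implied by the layer directory in a test path."""
    for part in path.parts:
        if part in LAYER_MARKERS:
            return LAYER_MARKERS[part]
    return ()


def pytest_collection_modifyitems(config, items):
    """pytest hook: stamp each collected test with its layer's markers."""
    for item in items:
        # item.path is a pathlib.Path on pytest >= 7.
        for name in markers_for(Path(item.path)):
            item.add_marker(name)  # add_marker accepts a marker name string
```

With a hook of this shape, `pytest -m fast` selects by directory without any per-test decorator. The stamped names would also need registering under `markers` in the pytest configuration so that `--strict-markers` accepts them.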
5. Unified entry points
The Makefile targets collapse to a small, all-working set:
| Target | Runs |
|---|---|
| make test-fast | pytest -m fast (a_unit + fast integration), < 1 min |
| make test | the check tier |
| make lint / make check | ruff + reuse + deptry + pyrefly + mypy |
| make test-with-coverage | a_unit + b_integration with --cov |
| make test-apps / make test-app APP=… | hop3-test system over a subset / one app |
| make test-nightly | hop3-test full matrix + demos + validoc, HTML report |
All dead targets are removed, and hop3-test's advertised-but-unregistered commands are dropped from the docs.
6. One deploy-and-verify path
Everything that stands up a server routes through the DeploymentTarget ABC + the real hop3-deploy primitive: hop3-test (already does), the pytest c_e2e fixtures (today own a parallel Docker lifecycle), the demos, and validoc.
| Target | When | Gate |
|---|---|---|
| --docker | default; dev + CI | the only one wired into routine CI |
| --ssh --host $HOP3_DEV_HOST | systemd-specific paths (rootd, nginx reload, www-data perms) Docker can't fully exercise | nightly / manual |
| --hetzner | multi-distro, real-server release validation | release gate only, behind HETZNER_API_TOKEN |
A developer can run one real e2e fast: hop3-test system --docker <app> against the cached image, with --reuse to iterate on verification alone. Bug to fix in passing: c_e2e rebuilds the Docker image every session while c_system checks-then-reuses it; the reuse behaviour becomes the shared default.
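The check-then-reuse behaviour can be sketched as follows. `docker image inspect` and `docker build -t` are real Docker CLI commands; the `ensure_image` name, the injectable `run` callable, and the `force_rebuild` flag are assumptions made for illustration and do not describe the existing fixture code.

```python
import subprocess
from collections.abc import Callable, Sequence

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[Sequence[str]], int]


def _run(argv: Sequence[str]) -> int:
    """Default runner: execute the command, discarding its output."""
    return subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False
    ).returncode


def ensure_image(
    tag: str,
    context: str,
    *,
    force_rebuild: bool = False,
    run: Runner = _run,
) -> bool:
    """Build `tag` only if it is missing; return True when a build ran."""
    # `docker image inspect` exits non-zero when the image does not exist.
    exists = run(["docker", "image", "inspect", tag]) == 0
    if exists and not force_rebuild:
        return False  # reuse the cached image
    if run(["docker", "build", "-t", tag, context]) != 0:
        raise RuntimeError(f"docker build failed for {tag}")
    return True
```

Injecting the runner keeps the decision logic testable without a Docker daemon, which fits the preference for hermetic tests stated in the placement rule.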
7. Shared diagnostics: the silent-502 fix
Replace the forked collectors with one shared contract, always invoked on failure, before cleanup:
Called from a single try/finally in the deploy-verify loop and from a pytest c_e2e fixture finalizer, so every failure (docker, ssh, hetzner, in-test) collects the same bundle. It uses only target.exec_run, retiring the current remote-exec mechanisms. Three layers ensure logs are never missing and never overwhelming:
- Headline (always, ≤ ~12 lines): a classifier verdict (proxy-502 / build-failure / addon-unreachable / app-crash / timeout) plus the decisive signals.
- Drill-down (on demand): hop3-test why <run-id> --section nginx|journal|build|http|deploy|app replays one section from the saved bundle. Nothing is dumped inline by default.
- Artifact (always written): ~/.hop3/test-runs/<ISO>-<app>-<shortid>/, recorded in one result store that keeps the bundle path so failure history is queryable. CI uploads it.
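The "always invoked on failure, before cleanup" ordering could be expressed roughly as below. `collect_diagnostic_bundle` is named in this ADR, but its signature is not given, so the parameters shown here, the `run_with_diagnostics` wrapper, and the callables passed in are all assumptions.

```python
from collections.abc import Callable
from pathlib import Path
from typing import Any, Optional


def run_with_diagnostics(
    target: Any,
    deploy_and_verify: Callable[[Any], None],
    collect_diagnostic_bundle: Callable[[Any, Path], Path],
    cleanup: Callable[[Any], None],
    run_dir: Path,
) -> Optional[Path]:
    """Run one deploy-verify cycle; return the bundle path on failure.

    The bundle is collected while the target is still alive, and cleanup
    runs last regardless of the outcome.
    """
    bundle: Optional[Path] = None
    try:
        try:
            deploy_and_verify(target)
        except Exception:
            # Collect before cleanup: the logs live on the target.
            bundle = collect_diagnostic_bundle(target, run_dir)
    finally:
        cleanup(target)
    return bundle
```

Returning the bundle path instead of re-raising is a choice made for this sketch, so that a caller can record the failure alongside its artifact; a pytest fixture finalizer would more likely let the original failure propagate.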
Bundle manifest (the union of today's richest collector, made target-agnostic):
- proxy_probe.txt (the silent-502 fix, absent today): curl 127.0.0.1:<port>, ss -ltnp for the actual LISTEN port, the nginx proxy_pass target, and the diff between them.
- nginx.txt: error.log + access.log tail + the rendered conf (today only the static conf is dumped).
- app.txt: app/uwsgi/granian/docker logs + listen check.
- journal.txt: journalctl (ssh/hetzner; container logs on docker).
- build.txt, deploy.txt, http.txt, dns.txt, manifest.json.
Each section is tail-bounded — rich on disk, terse on screen. generate_html_report renders these bundles as the nightly HTML report.
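The comparison at the heart of proxy_probe.txt, the ports actually listening versus the ports nginx proxies to, can be sketched as a pure function over captured command output. This is a minimal sketch, not the planned collector: the function names are hypothetical, and the regexes assume the usual `ss -ltn` column layout and a literal host:port in `proxy_pass` (upstream names and unix sockets are not handled).

```python
import re

# Local address column of a LISTEN row in `ss -ltn` / `ss -ltnp` output.
LISTEN_RE = re.compile(r"^LISTEN\s+\d+\s+\d+\s+(\S+):(\d+)\s", re.MULTILINE)
# `proxy_pass http://host:port` in an nginx configuration.
PROXY_PASS_RE = re.compile(r"proxy_pass\s+https?://([^:/;\s]+):(\d+)")


def listening_ports(ss_output: str) -> set[int]:
    """Ports that have a TCP listener, per the captured `ss` output."""
    return {int(port) for _host, port in LISTEN_RE.findall(ss_output)}


def proxy_ports(nginx_conf: str) -> set[int]:
    """Ports that nginx forwards to, per `proxy_pass` directives."""
    return {int(port) for _host, port in PROXY_PASS_RE.findall(nginx_conf)}


def probe_diff(ss_output: str, nginx_conf: str) -> list[str]:
    """One line per proxy target port that has no listener behind it."""
    dangling = proxy_ports(nginx_conf) - listening_ports(ss_output)
    return [
        f"proxy_pass targets port {port}, but nothing is listening on it"
        for port in sorted(dangling)
    ]
```

For an app listening on 8000 behind a configuration containing `proxy_pass http://127.0.0.1:5000;`, `probe_diff` returns a single line naming port 5000, which is the kind of decisive signal the headline layer is meant to surface.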
8. CI: GitHub vestigial, SourceHut real
- GitHub Actions (ci.yml, test.yml, e2e.yml) are vestigial and all three are removed; SourceHut is the sole CI.
- SourceHut .builds/ is the CI of record (Ubuntu, Rocky, NixOS) and is itself partly broken. Fix: Docker distros run check once; the nox multi-Python fan-out runs the fast tier only; non-Docker distros run fast.
- Keep nox as the multi-Python driver. Its pytest session inherits the testpaths fix, so once c_e2e leaves the default path nox stops dragging Docker across Pythons. The redundant per-package fan-out can be dropped.
9. What gets deleted / absorbed
After a parity window (see migration):
- .github/workflows/ci.yml (dead).
- scripts/run-all-tutorials.py (and the never-existed .sh): tutorials become validoc runs driven by hop3-test.
- Orphaned hop3-testing code: runners/validations.py::_validate_validoc (a no-op that masks failures) and the redundant result DB.
- The forked diagnostics collectors, folded into collect_diagnostic_bundle.
- The dead Makefile targets (§5).
Kept: demos/demo.py, demos/lib, and runners/demo.py. A demo is simultaneously an educational walkthrough, a live demonstration, and a test, so a broken demo is a first-class regression. The meta runner exercises each demo in place (DemoTestRunner → demos/demo.py) and surfaces a non-zero exit as a failed test; runners/demo.py is the integration layer, not dead code. The HTML generator, the catalog/scanner, the targets/ ABC, the Hetzner orchestrator, and runners/tutorial.py (now reached via tutorial discovery) are also kept. Reducing demo/runner duplication is welcome; removing the demo engine or its educational/demonstration behaviour is not.
Migration plan
Strictly incremental; each phase ships value alone and is reversible. Ordered by leverage-per-hour.
Known constraint: hop3-test build-ready-image is referenced (in targets/docker.py's _start_prebuilt hint and the docs) as the way to build hop3-ready:latest, but the command does not exist. The fast app-test path therefore has no working build instruction; the intended ready-image lifecycle must be documented or restored.
- Phase 0: cheap wins. Add make test-fast. Make c_e2e reuse the cached image. Delete the dead ci.yml. Rename d_e2e → c_e2e; dissolve c_system. (Hold the testpaths edit until the bare-pytest default is decided.)
- Phase 1: the silent-502 fix. Build collect_diagnostic_bundle (proxy probe first); wire it into hop3-test system and the pytest c_e2e fixtures; add hop3-test why + the classifier + a bundle-aware result store. Revive the disabled proxy tests into c_e2e.
- Phase 2: tiers + entry points. Add the autouse marker-stamping hook; define fast/check; consolidate the Makefile (§5); bring installer/rootd/tui/testing into check and CI (rootd first, per ADR 041); fix .builds/; update docs to the true surface.
- Phase 3: absorb demos + tutorials. Promote demos and tutorials to catalog entries through the ABC; preserve the --narrate timing reporter; wire the HTML report into nightly. Run side-by-side with the old harnesses for one nightly parity cycle.
- Phase 4: cleanup (after parity). Delete scripts/run-all-tutorials*, runners/validations.py::_validate_validoc, the redundant result DB, and the dead Makefile targets. Trim nox to lint + fast pytest + audit. Keep the demo engine.
Consequences
- A contributor learns three runners and four tiers, all of which work, replacing a large mostly-broken surface.
- The daily inner loop drops to < 1 min and can never accidentally trigger a Docker build.
- The most common deployment failure (silent-502) gains a one-line diagnosis on every surface; failure history becomes queryable; nightly produces an HTML report.
- Previously-untested-in-CI packages (incl. security-sensitive rootd) enter the gate.
- The duplicated harness code is deleted; all deploy paths route through one primitive (consistent with "Hop3 is responsible for making things work" and functional-core/imperative-shell).
- Cost: real implementation work in hop3-testing for Phases 1 and 3; CI coordination for the GitHub→SourceHut clarification and any required branch-protection checks.
Resolved decisions
Points that could have been left open are resolved as follows:
- Bare-pytest default. The testpaths default is the in-process layers only (a_unit + b_integration); the Docker c_e2e layer is excluded, so a reflexive pytest never triggers Docker and CI invokes c_e2e explicitly.
- GitHub Actions. All three vestigial workflows (ci.yml, test.yml, e2e.yml) are removed; SourceHut is the sole CI.
- Supported-Python contract. 3.11–3.14 (requires-python = ">=3.11,<3.15"); nox fans out across those. 3.10 and 3.15 are dropped.
- Postgres-backup test placement. The real-deploy c_e2e backup test (c_e2e/test_backup.py) owns the coverage; the legacy c_system variant is retired.
- check naming. make check stays the lint alias; the check tier is make test.
- Bundle retention. Build logs are auto-pruned to a recent-run count ([retention].keep_runs, via hop3-test prune), not kept forever.
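The retention rule amounts to keeping the newest keep_runs bundle directories. A minimal sketch, assuming run directories sort chronologically by name because they begin with an ISO timestamp in one consistent format; `prune_runs` is a hypothetical function and is not the `hop3-test prune` implementation.

```python
import shutil
from pathlib import Path


def prune_runs(root: Path, keep_runs: int) -> list[Path]:
    """Delete all but the newest `keep_runs` run directories under `root`.

    Returns the removed paths, oldest first.
    """
    if keep_runs < 0:
        raise ValueError("keep_runs must be >= 0")
    # Same-format ISO-8601 prefixes sort lexicographically in time order.
    runs = sorted(p for p in root.iterdir() if p.is_dir())
    doomed = runs[: max(len(runs) - keep_runs, 0)]
    for path in doomed:
        shutil.rmtree(path)
    return doomed
```

Sorting by name avoids relying on filesystem modification times, which can change when a bundle is re-read or copied by CI.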
Supersedes: ADR 026: Dashboard UI Test Classification
Related ADRs: ADR 004: Development Tooling, ADR 027: Configuration System Refactoring for Testability, ADR 030: Two-Level Build Architecture, ADR 041: Privileged Operations Agent (hop3-rootd)