ADR 043: Unified Testing Architecture

Status: Accepted
Type: Process
Created: 2026-06-05
Supersedes: ADR 026 (Dashboard UI Test Classification)
Related-ADRs: 004 (development-tooling), 027 (config-system-refactoring-for-testability), 030 (two-level-build-architecture), 041 (privileged-operations-agent)
Related-notes: notes/v0.6-rootd-hardening.md, docs/src/dev/testing-strategy.md

Context

What's wrong

The project accumulated several parallel testing approaches, each added for a good local reason, with no unifying model. A contributor must hold a large set of entry points across many layers in their head — and know which still work, because a sizeable fraction are dead. An audit of the full surface (pytest layers, the hop3-test runner, the standalone demos/ harness, the tutorial scripts, validoc, nox, the Makefile/CI) found:

No fast daily path, and the defaults are slow. c_system sits in the default testpaths, so bare uv run pytest and make test pull in the Docker layer and can trigger a multi-minute image build. The fast unit suite is reachable only by hand-typing its directory. This is the loudest day-to-day complaint.
The primary GitHub CI workflow is dead. ci.yml (the broadest trigger — every push and PR) runs nox sessions that do not exist; it fails at session selection on every run. The CI of record is SourceHut (.builds/), not GitHub Actions.
hop3-test advertises commands it doesn't register. Only system, list, cloud exist; dev / apps / ci / nightly / hetzner / multi-distro / show error out — yet the README, CLAUDE.md, and the Makefile call them, so several Makefile targets are broken.
Deploy-and-verify is implemented four times over the same path. hop3-test system, the pytest Docker fixtures, demos/demo.py + demos/lib, and scripts/run-all-tutorials.py each stand up a server and verify an app independently — over the one real hop3-deploy primitive that hop3-testing's DeploymentTarget ABC already wraps.
Diagnostics are forked, with inverted coverage. The richest collector (journalctl, nginx error/access, docker daemon) is wired only to the Hetzner path; the everyday Docker run gets the leanest bundle; the pytest Docker layers collect nothing on failure. Crucially, no surface has a proxy-reachability probe — so the most common production failure, a healthy app behind a front-end 502 because nginx points at the wrong port/host (the "silent-502" class), is captured by none. This is the failure that motivated the review.
The pyramid has collapsed in practice. c_system no longer pulls its weight: its only live tests are one fully-skipped module and one in-process TestClient + in-memory-DB test (which belongs in b_integration). Every test embodying the original "CLI ↔ server in a container, no deploys" intent is disabled. The real boundary today is b_integration (in-process, fast) vs d_e2e (Docker, real deploy).
Three incompatible speed taxonomies, none load-bearing. Directory layers, decorative pytest markers, and hop3-test's P0/P1/P2 + tier labels. pytest -m "not slow" cannot select a fast lane because the markers aren't applied.
Whole packages run in no CI. hop3-installer, hop3-rootd (security-sensitive — ADR 041 and the v0.6 hardening note), hop3-tui, and hop3-testing appear in no Makefile target and no workflow.

What is actually good (and must be kept)

The DeploymentTarget ABC (hop3-testing/targets/) is a clean abstraction over docker/ssh/hetzner — the right single deploy-and-verify primitive.
validoc is a genuine literate-test substrate: tutorials are Markdown with executable, asserted code blocks. It can drive any CLI scenario, not just tutorials.
An HTML report generator already exists (diagnostics.py::generate_html_report) — the nightly artifact we want.
nox is alive as SourceHut's multi-Python driver; only the GitHub workflow that referenced non-existent sessions is dead.

Why now

Pre-1.0, with license to make breaking changes. The cost of carrying a broken, fast-loop-less, diagnosis-less testing surface into 1.0 is far higher than consolidating now. Critically, the consolidation is mostly clarification and deletion, not a rewrite — the one genuinely new component is a shared diagnostic bundle, which we owe the operator regardless.

Decision

1. Three runners, three domains

We keep three runners, each with a crisp, non-overlapping domain; we do not collapse them into one binary, and pytest stays a first-class, directly-invokable runner.

Runner	Tests…	Notes
pytest	the platform code (unit → integration → e2e)	the inner-loop tool; the only runner that produces coverage
hop3-test	applications (real apps + demos): deploy + verify	exercises the platform indirectly; owns the app catalog + the `DeploymentTarget` ABC
validoc	narratives: tutorials-as-tests, long CLI-driven scenarios	literate, doc-shaped; verifies docs match reality

The "unification" is not one command. It is: (a) a memorable map of which runner for what, (b) one set of speed tiers across all three, and (c) one shared diagnostic bundle every runner calls whenever a deployed server is involved.

2. The pytest pyramid: three layers + a placement rule

c_system dissolves, and d_e2e is renamed c_e2e:

Layer	Contains	Docker / root / host-mutation?	Counts toward coverage?
`a_unit`	pure functions/classes, no real I/O	no	✅ yes
`b_integration`	multi-component, real DB/subprocess, but hermetic	no	✅ yes
`c_e2e`	deploys / real-server tests	yes	❌ no (out-of-process)

The placement rule (the crux). The axis is not "is this test complex?" — it is "does this test need Docker, root, a real server, or host mutation?" Pure logic → a_unit. Heavyweight but hermetic (runs in tmp_path, temp/in-memory DB, monkeypatched HOME, no privileged ops, no /etc, cleans up after itself) → b_integration. Needs Docker / root / a real server / system mutation → c_e2e. This deliberately pushes work down: prefer b_integration whenever a test can run hermetically.
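
The placement rule reads as a small decision function. The sketch below is illustrative only: the `Needs` dataclass and `place` helper are hypothetical names, not part of the codebase, and they simply encode the three outcomes described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    """What a test depends on (hypothetical model, for illustration only)."""

    docker: bool = False
    root: bool = False
    real_server: bool = False
    host_mutation: bool = False
    # Real DB, subprocess, or filesystem work that stays hermetic
    # (tmp_path, in-memory DB, monkeypatched HOME, no privileged ops).
    hermetic_io: bool = False


def place(needs: Needs) -> str:
    """Return the pyramid layer that a test's dependencies place it in."""
    if needs.docker or needs.root or needs.real_server or needs.host_mutation:
        return "c_e2e"
    if needs.hermetic_io:
        return "b_integration"
    return "a_unit"
```

Complexity never appears as an input: a long, multi-component test with only hermetic I/O still lands in `b_integration`.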

Duplication across layers is allowed and expected. A hermetic backup-logic test in b_integration and a real-deploy backup test in c_e2e test different things and should coexist; the only question is which layer a test's dependencies place it in.

This supersedes ADR 026, which classified dashboard tests by the real-vs-mocked-dependency axis. Under the Docker/root/host-mutation axis, a test running real App.create() against tmp_path with an in-memory DB is hermetic and belongs in b_integration — reversing ADR 026. The one in-process c_system test moves there; the skipped Postgres-backup test is made hermetic or moved to c_e2e; the disabled proxy tests are revived into c_e2e (they exercise the silent-502 class).
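
As an illustration of what "hermetic" means here, the sketch below shows the shape of a `b_integration` test: real filesystem and database work, confined to `tmp_path` and an in-memory SQLite database. `create_app_record` is a hypothetical stand-in for the real `App.create()`, whose signature this ADR does not show.

```python
import sqlite3
from pathlib import Path


def create_app_record(db: sqlite3.Connection, apps_dir: Path, name: str) -> Path:
    """Hypothetical stand-in for App.create(): one directory plus one DB row."""
    app_dir = apps_dir / name
    app_dir.mkdir(parents=True)
    db.execute("INSERT INTO apps (name, path) VALUES (?, ?)", (name, str(app_dir)))
    return app_dir


def test_create_app_is_hermetic(tmp_path, monkeypatch):
    # Nothing escapes tmp_path: HOME is redirected and the DB lives in memory.
    monkeypatch.setenv("HOME", str(tmp_path))
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE apps (name TEXT PRIMARY KEY, path TEXT)")

    app_dir = create_app_record(db, tmp_path / "apps", "demo")

    assert app_dir.is_dir()
    assert db.execute("SELECT name FROM apps").fetchall() == [("demo",)]
    db.close()
```

The test needs no Docker, no root, and no cleanup beyond what pytest's `tmp_path` and `monkeypatch` fixtures already provide, which is what places it in `b_integration` under the rule above.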

3. Coverage policy

Coverage is measured on a_unit + b_integration only — the in-process layers coverage.py can see. c_e2e runs in a separate process/container and is pass/fail. This produces a happy alignment: the two fast layers are exactly the two that move the coverage number, so the placement rule both tightens the feedback loop and raises coverage.

4. The speed-tier ladder

Tiers are named by feedback latency and selected by pytest markers stamped from the directory layer by an autouse conftest hook — so the markers become real with zero per-test edits and stop depending on the broken directory pyramid.

Tier	Scope	Docker?	Dev moment	CI
fast	`a_unit` (+ fastest `b_integration`)	no	every save / pre-commit	non-Docker distros (Rocky, NixOS)
check	full pytest pyramid incl. `c_e2e` (backups, git-push, proxy) + lint + types	yes	pre-push / PR	Docker distros (Ubuntu)
apps	`hop3-test` — a P0 subset, or one named app	yes	touching deployer/proxy/builders	on-demand
nightly	full `hop3-test` app/demo matrix + `validoc` tutorials + multi-distro, HTML report	yes	cron / release	SourceHut nightly

The slow Docker platform tests (backups, git-push, proxy) live in check, not nightly-only — they are core guarantees and should gate a push. check requires Docker, so non-Docker distros run fast instead, matching the .builds/ split. Bare pytest (and make test) run the in-process layers only (a_unit + b_integration) — the testpaths default excludes the Docker c_e2e layer, so a reflexive pytest never spins up Docker; CI invokes c_e2e explicitly.
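
One way to stamp markers from the directory layer is pytest's `pytest_collection_modifyitems` hook in a top-level `conftest.py`. The sketch below is an assumption about how this could look, not the project's implementation. Only `fast` is a marker name taken from this ADR; `integration` and `e2e` are hypothetical, and the sketch marks all of `a_unit` as `fast` without deciding which `b_integration` tests count as "fastest". It assumes pytest 7 or later for `item.path`.

```python
# conftest.py (sketch)
from pathlib import Path

# Directory layer -> marker names to stamp on every test collected under it.
LAYER_MARKERS = {
    "a_unit": ("fast",),
    "b_integration": ("integration",),  # hypothetical marker name
    "c_e2e": ("e2e",),  # hypothetical marker name
}


def markers_for(path: Path) -> tuple[str, ...]:
    """Return the markers implied by the first layer directory found in path."""
    for part in path.parts:
        if part in LAYER_MARKERS:
            return LAYER_MARKERS[part]
    return ()


def pytest_collection_modifyitems(config, items):
    """Called by pytest once after collection; stamps each item by its layer."""
    for item in items:
        for name in markers_for(Path(str(item.path))):
            # add_marker accepts a marker name given as a string.
            item.add_marker(name)
```

With this in place `pytest -m fast` selects by marker rather than by hand-typed directory. The marker names would also need to be registered under the `markers` option in the pytest configuration to avoid unknown-marker warnings.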

5. Unified entry points

The Makefile targets collapse to a small, all-working set:

Target	Runs
`make test-fast`	`pytest -m fast` (a_unit + fast integration), < 1 min
`make test`	the `check` tier
`make lint` / `make check`	ruff + reuse + deptry + pyrefly + mypy
`make test-with-coverage`	`a_unit` + `b_integration` with `--cov`
`make test-apps` / `make test-app APP=…`	`hop3-test system` over a subset / one app
`make test-nightly`	`hop3-test` full matrix + demos + `validoc`, HTML report

All dead targets are removed, and hop3-test's advertised-but-unregistered commands are dropped from the docs.

6. One deploy-and-verify path

Everything that stands up a server routes through the DeploymentTarget ABC + the real hop3-deploy primitive: hop3-test (already does), the pytest c_e2e fixtures (today own a parallel Docker lifecycle), the demos, and validoc.

Target	When	Gate
`--docker`	default; dev + CI	the only one wired into routine CI
`--ssh --host $HOP3_DEV_HOST`	systemd-specific paths (rootd, nginx reload, www-data perms) Docker can't fully exercise	nightly / manual
`--hetzner`	multi-distro, real-server release validation	release gate only, behind `HETZNER_API_TOKEN`

A developer can run one real e2e fast: hop3-test system --docker <app> against the cached image, with --reuse to iterate on verification alone. Bug to fix in passing: c_e2e rebuilds the Docker image every session while c_system checks-then-reuses it; the reuse behaviour becomes the shared default.
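
The "check, then reuse" behaviour can be sketched as follows. This is a minimal illustration, not the existing fixture code: `ensure_image` and its parameters are hypothetical. It relies only on `docker image inspect` exiting non-zero when the tag is absent and on `docker build -t`.

```python
import subprocess
from typing import Callable, Sequence

# A runner executes a command and returns its exit code; injectable for tests.
Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Run a command quietly and return its exit code."""
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False
    )
    return result.returncode


def ensure_image(
    tag: str,
    context_dir: str,
    *,
    force_rebuild: bool = False,
    run: Runner = _run,
) -> bool:
    """Build the image only if it is missing. Returns True if a build ran."""
    if not force_rebuild and run(["docker", "image", "inspect", tag]) == 0:
        return False  # cached image found: reuse it
    if run(["docker", "build", "-t", tag, context_dir]) != 0:
        raise RuntimeError(f"docker build failed for {tag}")
    return True
```

A session-scoped fixture could call `ensure_image` once, so a second test session pays only the cost of the inspect call. `force_rebuild` leaves an explicit escape hatch for when the image contents change.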

7. Shared diagnostics: the silent-502 fix

Replace the forked collectors with one shared contract, always invoked on failure, before cleanup:

collect_diagnostic_bundle(target: DeploymentTarget, app: str) -> Bundle

Called from a single try/finally in the deploy-verify loop and from a pytest c_e2e fixture finalizer, so every failure (docker, ssh, hetzner, in-test) collects the same bundle. It uses only target.exec_run, retiring the current remote-exec mechanisms. Three layers ensure logs are never missing and never overwhelming:

Headline (always, ≤ ~12 lines): a classifier verdict (proxy-502 / build-failure / addon-unreachable / app-crash / timeout) plus the decisive signals.
Drill-down (on demand): hop3-test why <run-id> --section nginx|journal|build|http|deploy|app replays one section from the saved bundle. Nothing is dumped inline by default.
Artifact (always written): ~/.hop3/test-runs/<ISO>-<app>-<shortid>/, recorded in one result store that keeps the bundle path so failure history is queryable. CI uploads it.
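
The ordering constraint (collect on failure, before cleanup) can be sketched as below. The `Target` protocol is a hypothetical slice of `DeploymentTarget`: the real ABC's method names and signatures may differ, and `exec_run` returning an exit code plus output is an assumption. The collector is passed in as a parameter, so nothing here claims to be the real `collect_diagnostic_bundle`.

```python
from typing import Callable, Protocol


class Target(Protocol):
    """Hypothetical slice of DeploymentTarget used by this sketch."""

    def exec_run(self, cmd: str) -> tuple[int, str]: ...

    def cleanup(self) -> None: ...


def deploy_and_verify(
    target: Target,
    app: str,
    verify: Callable[[Target, str], None],
    collect: Callable[[Target, str], object],
) -> None:
    """Verify an app; on failure collect the bundle, and only then tear down."""
    try:
        verify(target, app)
    except Exception:
        try:
            collect(target, app)
        except Exception as exc:
            # A broken collector must not mask the original failure.
            print(f"diagnostics collection failed: {exc!r}")
        raise
    finally:
        # Runs after collection, so the evidence still exists when it is read.
        target.cleanup()
```

Because the `except` block completes before `finally` runs, the server is still up while the bundle is gathered. A pytest fixture finalizer would need the same ordering: collect first, then stop the container.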

Bundle manifest (the union of today's richest collector, made target-agnostic):

proxy_probe.txt — the silent-502 fix, absent today: curl 127.0.0.1:<port>, ss -ltnp for the actual LISTEN port, the nginx proxy_pass target, and the diff between them.
nginx.txt — error.log + access.log tail + the rendered conf (today only the static conf is dumped).
app.txt — app/uwsgi/granian/docker logs + listen check.
journal.txt — journalctl (ssh/hetzner; container logs on docker).
build.txt, deploy.txt, http.txt, dns.txt, manifest.json.

Each section is tail-bounded — rich on disk, terse on screen. generate_html_report renders these bundles as the nightly HTML report.
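
The core of the proxy probe is a comparison between where nginx sends traffic and what is actually listening. The sketch below shows that diff on captured text, under stated assumptions: the function names are hypothetical, the `ss -ltnp` parsing expects the usual column layout (local address in the fourth column), and only `proxy_pass` targets with an explicit numeric port are considered (Unix sockets and named upstreams are skipped).

```python
import re

# Matches the host[:port] part of e.g. "proxy_pass http://127.0.0.1:8000;".
PROXY_PASS = re.compile(r"proxy_pass\s+https?://([^\s;/]+)")


def listening_ports(ss_output: str) -> set[int]:
    """Ports in LISTEN state, parsed from `ss -ltnp` output."""
    ports: set[int] = set()
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "LISTEN":
            # Local address looks like 127.0.0.1:8000 or [::]:80.
            _, _, port = fields[3].rpartition(":")
            if port.isdigit():
                ports.add(int(port))
    return ports


def proxy_pass_ports(nginx_conf: str) -> set[int]:
    """Numeric ports named by proxy_pass directives in a rendered nginx conf."""
    ports: set[int] = set()
    for hostport in PROXY_PASS.findall(nginx_conf):
        _, sep, port = hostport.rpartition(":")
        if sep and port.isdigit():
            ports.add(int(port))
    return ports


def proxy_mismatch(ss_output: str, nginx_conf: str) -> set[int]:
    """Ports nginx proxies to that nothing is listening on."""
    return proxy_pass_ports(nginx_conf) - listening_ports(ss_output)
```

A non-empty result is the silent-502 signature: the app may be healthy on one port while nginx forwards to another. A fuller probe would also compare hosts, which this sketch ignores.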

8. CI: GitHub vestigial, SourceHut real

GitHub Actions (ci.yml, test.yml, e2e.yml) are vestigial and all three are removed; SourceHut is the sole CI.
SourceHut .builds/ is the CI of record (Ubuntu, Rocky, NixOS) and is itself partly broken. Fix: Docker distros run check once; the nox multi-Python fan-out runs the fast tier only; non-Docker distros run fast.
Keep nox as the multi-Python driver. Its pytest session inherits the testpaths fix, so once c_e2e leaves the default path nox stops dragging Docker across Pythons. The redundant per-package fan-out can be dropped.

9. What gets deleted / absorbed

After a parity window (see migration):

.github/workflows/ci.yml (dead).
scripts/run-all-tutorials.py (and the .sh variant, which never existed): tutorials become validoc runs driven by hop3-test.
Orphaned hop3-testing code: runners/validations.py::_validate_validoc (a no-op that masks failures) and the redundant result DB.
The forked diagnostics collectors, folded into collect_diagnostic_bundle.
The dead Makefile targets (§5).

Kept: demos/demo.py, demos/lib, and runners/demo.py. A demo is simultaneously an educational walkthrough, a live demonstration, and a test, so a broken demo is a first-class regression. The meta runner exercises each demo in place (DemoTestRunner → demos/demo.py) and surfaces a non-zero exit as a failed test; runners/demo.py is the integration layer, not dead code. The HTML generator, the catalog/scanner, the targets/ ABC, the Hetzner orchestrator, and runners/tutorial.py (now reached via tutorial discovery) are also kept. Reducing demo/runner duplication is welcome; removing the demo engine or its educational/demonstration behaviour is not.

Migration plan

Strictly incremental; each phase ships value alone and is reversible. Ordered by leverage-per-hour.

Known constraint: hop3-test build-ready-image is referenced (in targets/docker.py's _start_prebuilt hint and the docs) as the way to build hop3-ready:latest, but the command does not exist. The fast app-test path therefore has no working build instruction; the intended ready-image lifecycle must be documented or restored.

Phase 0 — cheap wins. Add make test-fast. Make c_e2e reuse the cached image. Delete the dead ci.yml. Rename d_e2e → c_e2e; dissolve c_system. (Hold the testpaths edit until the bare-pytest default is decided.)
Phase 1 — the silent-502 fix. Build collect_diagnostic_bundle (proxy probe first); wire it into hop3-test system and the pytest c_e2e fixtures; add hop3-test why + the classifier + a bundle-aware result store. Revive the disabled proxy tests into c_e2e.
Phase 2 — tiers + entry points. Add the autouse marker-stamping hook; define fast/check; consolidate the Makefile (§5); bring installer/rootd/tui/testing into check and CI (rootd first, per ADR 041); fix .builds/; update docs to the true surface.
Phase 3 — absorb demos + tutorials. Promote demos and tutorials to catalog entries through the ABC; preserve the --narrate timing reporter; wire the HTML report into nightly. Run side-by-side with the old harnesses for one nightly parity cycle.
Phase 4 — cleanup (after parity). Delete scripts/run-all-tutorials*, runners/validations.py::_validate_validoc, the redundant result DB, and the dead Makefile targets. Trim nox to lint + fast pytest + audit. Keep the demo engine.

Consequences

A contributor learns three runners and four tiers, all of which work, replacing a large mostly-broken surface.
The daily inner loop drops to < 1 min and can never accidentally trigger a Docker build.
The most common deployment failure (silent-502) gains a one-line diagnosis on every surface; failure history becomes queryable; nightly produces an HTML report.
Previously-untested-in-CI packages (incl. security-sensitive rootd) enter the gate.
The duplicated harness code is deleted; all deploy paths route through one primitive (consistent with "Hop3 is responsible for making things work" and functional-core/imperative-shell).
Cost: real implementation work in hop3-testing for Phases 1 and 3; CI coordination for the GitHub→SourceHut clarification and any required branch-protection checks.

Resolved decisions

Points that could have been left open are resolved as follows:

Bare-pytest default. The testpaths default is the in-process layers only (a_unit + b_integration); the Docker c_e2e layer is excluded, so a reflexive pytest never triggers Docker and CI invokes c_e2e explicitly.
GitHub Actions. All three vestigial workflows (ci.yml, test.yml, e2e.yml) are removed; SourceHut is the sole CI.
Supported-Python contract. 3.11–3.14 (requires-python = ">=3.11,<3.15"); nox fans out across those. 3.10 and 3.15 are dropped.
Postgres-backup test placement. The real-deploy c_e2e backup test (c_e2e/test_backup.py) owns the coverage; the legacy c_system variant is retired.
check naming. make check stays the lint alias; the check tier is make test.
Bundle retention. Build logs are auto-pruned to a recent-run count ([retention].keep_runs, via hop3-test prune), not kept forever.
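
Count-based retention can be sketched as below. This illustrates the idea only and is not the `hop3-test prune` implementation: `prune_runs` is a hypothetical helper, and it assumes every run directory name starts with a fixed-width ISO timestamp, so that sorting by name is chronological.

```python
import shutil
from pathlib import Path


def prune_runs(root: Path, keep_runs: int) -> list[Path]:
    """Delete all but the newest keep_runs run directories; return those removed."""
    if keep_runs < 0:
        raise ValueError("keep_runs must be >= 0")
    # Names begin with an ISO timestamp (assumption), so name order is age order.
    runs = sorted((p for p in root.iterdir() if p.is_dir()), key=lambda p: p.name)
    doomed = runs[: max(len(runs) - keep_runs, 0)]
    for path in doomed:
        shutil.rmtree(path)
    return doomed
```

Returning the removed paths lets the caller drop the matching rows from the result store, so failure history never points at a bundle that no longer exists.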

Supersedes: ADR 026: Dashboard UI Test Classification
Related ADRs: ADR 004: Development Tooling, ADR 027: Configuration System Refactoring for Testability, ADR 030: Two-Level Build Architecture, ADR 041: Privileged Operations Agent (hop3-rootd)