ADR 017: Distributed, Agent-Based Architecture
Status: Draft
Type: Feature
Created: 2024-07-17
Related-ADRs: 029
Related: Hop3 paper (Section 7.4: Agent Model)
Introduction
This ADR describes the long-term vision for Hop3's evolution from a single-server PaaS to a distributed, agent-based platform. It establishes the architectural principles and evolution path while recognizing that foundational work (ADR 029) must be completed first.
Summary
Hop3 will evolve through four phases:
- Phase 1 (ADR 029): Single-server reconciliation, health checks, and restart policies
- Phase 2: Extract agent responsibilities into a separate module
- Phase 3: Multi-server support with central coordinator
- Phase 4: Full distributed scheduling and orchestration
This phased approach allows Hop3 to gain production reliability immediately (Phase 1) while building toward a scalable distributed architecture over time.
Context and Goals
Context
Hop3 is currently a single-server PaaS optimized for simplicity. Comparing Hop3's architecture with production-grade orchestrators (like Kubernetes, Nomad, or the "Cube" pattern from container orchestration literature) reveals several gaps:
| Aspect | Production Orchestrators | Hop3 Current |
|---|---|---|
| Architecture | Distributed Manager-Worker | Single-server monolithic |
| Reconciliation | Periodic background loop | On-demand only |
| Health Checks | Active probing + remediation | Defined but not implemented |
| Restart Policy | Always/OnFailure/Never | Not implemented |
| Metrics | Worker collects CPU/Memory | Not implemented |
| Event Log | Immutable state changes | Not implemented |
While multi-server distribution isn't needed for Hop3's primary use case (single-server simplicity), the patterns from distributed systems (reconciliation loops, health monitoring, self-healing) provide significant reliability benefits even on a single server.
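To make the reconciliation pattern concrete, here is a minimal sketch of one pass: compare the declared state with the observed state and derive corrective actions. The `Action` type, the `reconcile` function, and the `"running"`/`"stopped"` state strings are illustrative assumptions, not the Hop3 implementation (ADR 029 holds the actual design).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A single corrective step produced by one reconciliation pass."""

    kind: str  # "start" or "stop" (hypothetical action names)
    app_name: str


def reconcile(desired: dict[str, str], actual: dict[str, str]) -> list[Action]:
    """Compare desired and observed app states and list corrective actions.

    Both mappings go from app name to "running" or "stopped".
    """
    actions = []
    for app_name, want in sorted(desired.items()):
        have = actual.get(app_name, "stopped")
        if want == "running" and have != "running":
            actions.append(Action("start", app_name))
        elif want == "stopped" and have == "running":
            actions.append(Action("stop", app_name))
    # Apps that are running but not declared at all are stopped as well.
    for app_name in sorted(set(actual) - set(desired)):
        if actual[app_name] == "running":
            actions.append(Action("stop", app_name))
    return actions


desired = {"blog": "running", "wiki": "running", "old-api": "stopped"}
actual = {"blog": "running", "wiki": "stopped", "old-api": "running"}
plan = reconcile(desired, actual)
```

Running the same function on a timer, rather than only when a user issues a command, is what turns on-demand convergence into a background reconciliation loop.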
Goals
- Incremental Evolution: Build distributed capabilities incrementally, not as a big-bang rewrite
- Single-Server First: Ensure single-server deployments gain full reliability benefits
- Production Patterns: Adopt proven patterns from production orchestrators
- Minimal Complexity: Add multi-server support only when justified by concrete requirements
- Promise-Based Design: Use the theory of promises for coordination semantics
Non-Goals
- Competing with Kubernetes for large-scale container orchestration
- Supporting arbitrary distributed consensus algorithms
- Real-time sub-second state synchronization
Decision
Hop3 will adopt an agent-based architecture where each server runs an autonomous agent that:
1. Manages local application lifecycle
2. Maintains promises about application state
3. Reports status to a coordinator (in multi-server mode)
4. Self-heals based on configured policies
The architecture will be implemented in phases, with each phase delivering standalone value.
Detailed Design
Phase 1: Single-Server Foundations (ADR 029)
Phase 1 implements the core patterns on a single server (the detailed design lives in ADR 029):
```text
┌─────────────────────────────────────────────────────────┐
│                       Hop3 Server                       │
│                                                         │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────┐   │
│  │   Watchdog   │───▶│   Database   │◀───│   CLI    │   │
│  │   Service    │    │   (SQLite)   │    │          │   │
│  └──────┬───────┘    └──────────────┘    └──────────┘   │
│         │                                               │
│         ▼                                               │
│  ┌──────────────┐    ┌──────────────┐                   │
│  │    Health    │    │    uWSGI     │                   │
│  │   Checker    │───▶│   Emperor    │                   │
│  └──────────────┘    └──────────────┘                   │
│                             │                           │
│                             ▼                           │
│                      ┌──────────────┐                   │
│                      │ App Processes│                   │
│                      └──────────────┘                   │
└─────────────────────────────────────────────────────────┘
```
Key deliverables:
- WatchdogService: Background reconciliation loop (30-60 second cycle)
- HealthChecker: Command and HTTP-based health probing
- RestartPolicy: NEVER, ON_FAILURE, ALWAYS with exponential backoff
- AppEvent: Immutable audit log of state changes
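As an illustration of the restart-policy deliverable, the sketch below shows how the three policies and an exponential backoff could interact. The function names, the 1-second base delay, and the 300-second cap are assumptions made for this example; ADR 029 may specify different values.

```python
import enum


class RestartPolicy(enum.Enum):
    NEVER = "never"
    ON_FAILURE = "on_failure"
    ALWAYS = "always"


def should_restart(policy: RestartPolicy, exit_code: int) -> bool:
    """Decide whether an exited process should be restarted."""
    if policy is RestartPolicy.ALWAYS:
        return True
    if policy is RestartPolicy.ON_FAILURE:
        return exit_code != 0
    return False


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Seconds to wait before restart number `attempt` (0-based).

    The delay doubles on each attempt and is capped so that a crash-looping
    app is retried at a bounded interval. `base` and `cap` are assumed values.
    """
    return min(cap, base * 2**attempt)
```

With these assumed defaults the delays run 1, 2, 4, 8, ... seconds until they reach the cap, which keeps a persistently failing app from consuming the watchdog cycle.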
Phase 2: Agent Module Extraction
Phase 2 extracts agent responsibilities into a distinct module, preparing for distribution:
```python
from __future__ import annotations

import asyncio
from datetime import UTC, datetime  # datetime.UTC requires Python 3.11+


class LocalAgent:
    """Agent running on a single Hop3 server."""

    def __init__(self, node_id: str, coordinator: Coordinator | None = None):
        self.node_id = node_id
        self.watchdog = WatchdogService()
        self.health_checker = HealthChecker()
        self.coordinator = coordinator  # None for single-server

    async def run(self) -> None:
        """Main agent loop."""
        while True:
            # 1. Reconcile local state
            await self.watchdog.reconcile()
            # 2. Run health checks
            await self.health_checker.check_all()
            # 3. Process restart policies
            await self.watchdog.process_restarts()
            # 4. Report to coordinator (if multi-server)
            if self.coordinator:
                await self.report_status()
            await asyncio.sleep(30)

    async def report_status(self) -> AgentStatus:
        """Build the status report for the coordinator."""
        return AgentStatus(
            node_id=self.node_id,
            timestamp=datetime.now(UTC),
            apps=self._get_app_statuses(),
            resources=self._get_resource_usage(),
        )

    def receive_task(self, task: Task) -> Promise:
        """Receive a task from coordinator and return a promise."""
        return Promise(
            promiser=self.node_id,
            promisee="coordinator",
            body=f"deploy:{task.app_name}",
            status=PromiseStatus.PENDING,  # Enum member, matching Promise.status
            timestamp=datetime.now(UTC),
        )
```
Key deliverables:
- LocalAgent: Encapsulates all agent responsibilities
- AgentStatus: Standardized status reporting format
- Task / Promise: Data structures for task assignment
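One possible shape for the standardized status report is sketched below. The four `AgentStatus` fields mirror the constructor call in `LocalAgent.report_status`; the `AppStatus` type, the resource key names, and the `to_json` helper are assumptions introduced here to show how the report might be serialized for transport.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class AppStatus:
    """Per-app entry in a status report (hypothetical type)."""

    name: str
    state: str  # e.g. "running", "stopped", "failed"


@dataclass
class AgentStatus:
    node_id: str
    timestamp: datetime
    apps: list[AppStatus] = field(default_factory=list)
    resources: dict[str, float] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the report; datetimes are emitted in ISO 8601 form."""
        payload = asdict(self)
        payload["timestamp"] = self.timestamp.isoformat()
        return json.dumps(payload, sort_keys=True)


status = AgentStatus(
    node_id="node-1",
    timestamp=datetime(2024, 7, 17, 12, 0, tzinfo=timezone.utc),
    apps=[AppStatus("blog", "running")],
    resources={"cpu_ratio": 0.25, "memory_ratio": 0.5},
)
decoded = json.loads(status.to_json())
```

A plain JSON payload keeps the report independent of the transport, so the same structure could be sent over either of the HTTP or gRPC options considered for Phase 3.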
Phase 3: Multi-Server Coordinator
Phase 3 adds a central coordinator for multi-server deployments:
```text
┌───────────────────────────────────────────────────────┐
│                      Coordinator                      │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐  │
│  │  Scheduler  │   │    Task     │   │   Promise   │  │
│  │  (3-phase)  │   │    Queue    │   │  Registry   │  │
│  └─────────────┘   └─────────────┘   └─────────────┘  │
└────────────────────────────┬──────────────────────────┘
                             │ HTTP/gRPC
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
         ┌──────────┐   ┌──────────┐   ┌──────────┐
         │  Agent   │   │  Agent   │   │  Agent   │
         │  Node 1  │   │  Node 2  │   │  Node 3  │
         └──────────┘   └──────────┘   └──────────┘
```
Coordinator Responsibilities:
```python
import asyncio


class Coordinator:
    """Central coordinator for multi-server Hop3."""

    def __init__(self) -> None:
        self.agents: dict[str, AgentConnection] = {}
        self.scheduler = Scheduler()
        self.task_queue = TaskQueue()
        self.promise_registry = PromiseRegistry()

    async def deploy_app(self, app: App) -> DeploymentResult:
        """Deploy application to best available node."""
        # 1. Filter: Find capable nodes
        candidates = self.scheduler.filter(
            self.agents.values(),
            requirements=app.resource_requirements,
        )
        # 2. Score: Rank by resource availability
        scored = self.scheduler.score(candidates)
        # 3. Pick: Select best node
        target = self.scheduler.pick(scored)
        # 4. Create task and send to agent
        task = Task(app_name=app.name, action="deploy")
        promise = await target.assign_task(task)
        # 5. Register promise and track
        self.promise_registry.register(promise)
        return DeploymentResult(node=target.node_id, promise_id=promise.id)

    async def reconciliation_loop(self) -> None:
        """Periodic reconciliation with all agents."""
        while True:
            for agent in self.agents.values():
                status = await agent.get_status()
                await self._reconcile_agent(agent, status)
            await asyncio.sleep(30)
```
Three-Phase Scheduler:
```python
from __future__ import annotations

from collections.abc import Iterable


class Scheduler:
    """Three-phase scheduler for task placement."""

    def filter(
        self,
        agents: Iterable[AgentConnection],
        requirements: ResourceRequirements,
    ) -> list[AgentConnection]:
        """Phase 1: Filter out incapable nodes."""
        return [
            agent for agent in agents
            if agent.has_capacity(requirements)
            and agent.is_healthy()
            and not agent.is_draining()
        ]

    def score(
        self,
        candidates: list[AgentConnection],
    ) -> list[tuple[AgentConnection, float]]:
        """Phase 2: Score remaining candidates."""
        scored = []
        for agent in candidates:
            score = (
                0.4 * agent.available_memory_ratio()
                + 0.3 * agent.available_cpu_ratio()
                + 0.2 * agent.app_spread_score()  # Prefer nodes with fewer apps
                + 0.1 * agent.locality_score()  # Prefer co-located dependencies
            )
            scored.append((agent, score))
        # Highest score first
        return sorted(scored, key=lambda x: x[1], reverse=True)

    def pick(
        self,
        scored: list[tuple[AgentConnection, float]],
    ) -> AgentConnection:
        """Phase 3: Pick the best candidate."""
        if not scored:
            raise NoCapacityError("No nodes available for scheduling")
        return scored[0][0]
```
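A small worked example may help show how the weights in the scoring phase trade off against each other. The `NodeSnapshot` type and the sample values are hypothetical; only the four weights (0.4, 0.3, 0.2, 0.1) come from the scheduler above, and every input is assumed to be a ratio between 0.0 and 1.0.

```python
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    """Hypothetical stand-in for the values an AgentConnection would report."""

    name: str
    memory_free: float  # fraction of memory available
    cpu_free: float  # fraction of CPU available
    spread: float  # higher means fewer apps already placed on the node
    locality: float  # higher means more dependencies are co-located


def weighted_score(node: NodeSnapshot) -> float:
    """Apply the same weights as the scoring phase of the scheduler."""
    return (
        0.4 * node.memory_free
        + 0.3 * node.cpu_free
        + 0.2 * node.spread
        + 0.1 * node.locality
    )


nodes = [
    NodeSnapshot("node-1", memory_free=0.5, cpu_free=0.5, spread=1.0, locality=0.0),
    NodeSnapshot("node-2", memory_free=0.75, cpu_free=0.25, spread=0.5, locality=1.0),
]
ranked = sorted(nodes, key=weighted_score, reverse=True)
```

Here node-1 scores 0.20 + 0.15 + 0.20 + 0.00 = 0.55 and node-2 scores 0.30 + 0.075 + 0.10 + 0.10 = 0.575, so node-2 ranks first even though it has less free CPU. Because the weights sum to 1.0, scores stay within 0.0 to 1.0 as long as the inputs do.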
Key deliverables:
- Coordinator: Central management service
- Scheduler: Three-phase scheduling algorithm
- AgentConnection: Agent communication protocol
- PromiseRegistry: Track and verify promises
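The `PromiseRegistry` is not specified further in this ADR. As a rough sketch of what it might look like, the in-memory version below supports registering promises, recording their outcome, and listing the ones still worth verifying. The `resolve` method, the integer identifiers, and the reading of "active" as "not yet broken" are assumptions for this example; `Promise` is reduced to the fields needed here.

```python
import enum
import itertools
from dataclasses import dataclass


class PromiseStatus(enum.Enum):
    PENDING = "pending"
    KEPT = "kept"
    BROKEN = "broken"


@dataclass
class Promise:
    promiser: str
    promisee: str
    body: str
    status: PromiseStatus = PromiseStatus.PENDING


class PromiseRegistry:
    """In-memory registry; a real one would presumably persist its state."""

    def __init__(self) -> None:
        self._promises: dict[int, Promise] = {}
        self._ids = itertools.count(1)

    def register(self, promise: Promise) -> int:
        """Store a promise and return its identifier."""
        promise_id = next(self._ids)
        self._promises[promise_id] = promise
        return promise_id

    def resolve(self, promise_id: int, kept: bool) -> None:
        """Record the outcome of verifying a promise."""
        promise = self._promises[promise_id]
        promise.status = PromiseStatus.KEPT if kept else PromiseStatus.BROKEN

    def active(self) -> list[Promise]:
        """Promises not known to be broken (pending or kept)."""
        return [
            p for p in self._promises.values()
            if p.status is not PromiseStatus.BROKEN
        ]


registry = PromiseRegistry()
first = registry.register(Promise("node-1", "coordinator", "running:blog"))
second = registry.register(Promise("node-2", "coordinator", "running:wiki"))
registry.resolve(first, kept=True)
registry.resolve(second, kept=False)
```

After these calls only the promise from node-1 remains active; the broken one would be handed back to the scheduler.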
Phase 4: Advanced Orchestration and Federation
Phase 4 adds advanced features and explores two complementary coordination models:
Engineering track (coordinator-based):
- Placement Constraints: Affinity/anti-affinity rules
- Rolling Updates: Zero-downtime deployments across nodes
- Resource Quotas: Per-tenant resource limits
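Placement constraints could plug into the filter phase of the scheduler. The sketch below is one hypothetical formulation: affinity lists apps that must already be on the node, and anti-affinity lists apps that must not be. The function name and the representation of placements as a mapping from node to app names are assumptions.

```python
from collections.abc import Set


def allowed_nodes(
    placements: dict[str, set[str]],
    affinity: Set[str] = frozenset(),
    anti_affinity: Set[str] = frozenset(),
) -> list[str]:
    """Return nodes hosting every `affinity` app and no `anti_affinity` app."""
    return sorted(
        node
        for node, apps in placements.items()
        if affinity <= apps and not (apps & anti_affinity)
    )


placements = {"node-1": {"db-primary"}, "node-2": {"blog"}, "node-3": set()}
# A database replica should not share a node with its primary.
replica_candidates = allowed_nodes(placements, anti_affinity={"db-primary"})
# A sidecar should land next to the app it serves.
sidecar_candidates = allowed_nodes(placements, affinity={"blog"})
```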
Research track (fully decentralized):
For edge and fog computing scenarios where a central coordinator is unavailable or undesirable (network partitions, multi-operator federation), an alternative coordination model based on CRDTs and gossip-based dissemination can replace the central coordinator:
- Each node maintains a local CRDT replica of the deployment state
- Nodes synchronise via gossip protocols rather than reporting to a coordinator
- Scheduling emerges from local evaluation of capability promises against workload requirements
- Conflict resolution uses CRDT merge semantics (see the Hop3 paper, Section 7.4)
This decentralized model is the natural end-state for edge deployments where nodes must operate autonomously under partition. The coordinator-based Phase 3 serves as a pragmatic intermediate step that validates the agent/promise model before removing the central authority.
The Hop3 paper develops the formal argument for this trajectory, grounding it in Promise Theory [PT1] and showing how Nix content-addressed closures enable bandwidth-efficient store-carry-forward updates between disconnected nodes.
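To illustrate what "CRDT merge semantics" means for deployment state, the sketch below uses a last-writer-wins map, one of the simplest CRDTs. It is chosen only for illustration; this ADR does not say which CRDT Hop3 would use, and the entry layout and logical timestamps are assumptions.

```python
# Each entry maps an app name to (timestamp, node_id, desired_state).
# Timestamps are logical counters; node_id breaks ties deterministically.
Entry = tuple[int, str, str]
Replica = dict[str, Entry]


def merge(left: Replica, right: Replica) -> Replica:
    """Merge two replicas, keeping the entry with the larger (timestamp, node_id).

    Assumes a node writes at most one value per timestamp, so the pair
    (timestamp, node_id) identifies a write uniquely.
    """
    merged = dict(left)
    for app_name, entry in right.items():
        current = merged.get(app_name)
        if current is None or entry[:2] > current[:2]:
            merged[app_name] = entry
    return merged


node_a = {"blog": (3, "node-a", "running"), "wiki": (1, "node-a", "running")}
node_b = {"blog": (2, "node-b", "stopped"), "wiki": (4, "node-b", "stopped")}
converged = merge(node_a, node_b)
```

The merge is commutative and idempotent, so replicas that exchange state through gossip in any order, and any number of times, end up with the same result. That property is what lets nodes keep working during a partition and reconcile afterwards without a coordinator.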
Theory of Promises Integration
Promise Theory (Mark Burgess; see References) provides the semantic foundation for coordination between the coordinator and its agents:
Promise Types:
```python
import enum
from dataclasses import dataclass
from datetime import datetime


# Defined before Promise so the annotation below resolves at class creation
class PromiseStatus(enum.Enum):
    PENDING = "pending"  # Promise made, not yet verified
    KEPT = "kept"  # Promise verified as fulfilled
    BROKEN = "broken"  # Promise verified as not fulfilled


@dataclass
class Promise:
    """A declaration by an agent about its behavior."""

    promiser: str  # Agent making the promise
    promisee: str  # Who can depend on it ("any" for broadcast)
    body: str  # What is promised (e.g., "running:myapp")
    status: PromiseStatus  # pending, kept, broken
    timestamp: datetime
```
Promise Verification:
```python
# Method of Coordinator
async def verify_promises(self) -> list[BrokenPromise]:
    """Verify all registered promises against actual state."""
    broken = []
    for promise in self.promise_registry.active():
        agent = self.agents.get(promise.promiser)
        if not agent:
            broken.append(BrokenPromise(promise, reason="agent_unavailable"))
            continue
        # Parse promise body and verify
        if promise.body.startswith("running:"):
            # Strip only the prefix so app names containing ":" stay intact
            app_name = promise.body.removeprefix("running:")
            status = await agent.get_app_status(app_name)
            if status.state != AppState.RUNNING:
                broken.append(BrokenPromise(promise, reason="app_not_running"))
    return broken
```
Promise-Based Coordination:
Instead of imperative commands ("deploy this app now"), the coordinator expresses desired state and agents make promises:
- Coordinator: "I need app X running on some node"
- Agent A: "I promise to run app X" (promise pending)
- Agent A deploys app X
- Agent A: Updates promise status to "kept"
- Coordinator: Verifies promise periodically
This provides:
- Decentralization: Agents decide how to fulfill promises
- Fault Tolerance: Broken promises trigger re-scheduling
- Auditability: Promise history provides clear audit trail
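The promise exchange described above can be read as a small state machine. The sketch below encodes one plausible set of transitions: a pending promise becomes kept or broken, a kept promise can later be broken if the app stops running, and a broken promise is final, so recovery goes through a new promise. This transition table and the helper names are assumptions, not something the ADR prescribes.

```python
import enum


class PromiseStatus(enum.Enum):
    PENDING = "pending"
    KEPT = "kept"
    BROKEN = "broken"


# Assumed transition table; BROKEN is terminal in this sketch.
TRANSITIONS: dict[PromiseStatus, set[PromiseStatus]] = {
    PromiseStatus.PENDING: {PromiseStatus.KEPT, PromiseStatus.BROKEN},
    PromiseStatus.KEPT: {PromiseStatus.BROKEN},
    PromiseStatus.BROKEN: set(),
}


def advance(current: PromiseStatus, new: PromiseStatus) -> PromiseStatus:
    """Return the new status, rejecting transitions the table does not allow."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value} to {new.value}")
    return new


def needs_rescheduling(status: PromiseStatus) -> bool:
    """A broken promise is the coordinator's signal to place the app again."""
    return status is PromiseStatus.BROKEN


# Agent deploys the app, then the app later fails a verification pass.
status = advance(PromiseStatus.PENDING, PromiseStatus.KEPT)
status = advance(status, PromiseStatus.BROKEN)
```

Recording every call to `advance` would yield the promise history that the auditability point refers to.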
Consequences
Benefits
- Incremental Value: Each phase delivers standalone improvements
- Production Reliability: Single-server deployments become self-healing
- Scalability Path: Clear evolution to multi-server when needed
- Semantic Clarity: Promise theory provides rigorous coordination semantics
- Debugging: Promise history and event logs enable root cause analysis
Drawbacks
- Complexity: Multi-server coordination adds significant complexity
- Overhead: Promise verification and reconciliation consume resources
- Learning Curve: Promise theory concepts require documentation and training
Trade-offs
| Aspect | Decision | Alternative | Why Rejected |
|---|---|---|---|
| Evolution | Phased | Big-bang | Risk too high, value delayed |
| Coordination | Promise-based | Imperative | Less fault tolerant |
| Scheduling | 3-phase | Simple round-robin | Insufficient for heterogeneous nodes |
| Storage | Per-agent SQLite | Distributed consensus | Unnecessary complexity for Hop3 scale |
Prior Art
- Kubernetes: Manager-worker architecture with reconciliation loops
- Nomad: Three-phase scheduler (feasibility, ranking, selection)
- CFEngine: Theory of promises for configuration management
- Cube Orchestrator: Educational reference for container orchestration patterns
- systemd: Local process supervision with restart policies
Related
- ADR 020: Pluggable Architecture - Plugin system used by agents
- ADR 022: Build and Deployment Plugin System - Deployment strategies
- ADR 029: Reconciliation and Health Checks - Phase 1 implementation details
References
- Burgess, M. "An Approach to Computer System Configuration Based on Promise Theory," Science of Computer Programming, vol. 71, no. 3, pp. 243–265, 2008.
- Burgess, M. and Bergstra, J.A. Promise Theory: Principles and Applications. Xtaxis Press, 2014.
- Burgess, M. "Testable System Administration," Communications of the ACM, vol. 54, no. 3, pp. 44–49, 2011.
- Burgess, M. "In Search of Certainty" (2013)
- Kubernetes Documentation: Controllers
- Nomad Documentation: Scheduling
- "Build an Orchestrator in Go (From Scratch)" by Tim Boring
- Hop3 paper, Section 7.4: "From Single Node to Distributed Edge: The Hop3 Agent Model"
Related ADRs: ADR 029: Application Reconciliation and Health Check System