
Building Poseidon #3: Control Plane vs Data Plane

How we ensured our system can survive a master node crash


In Part 2, I built the Master-Worker orchestration core: gRPC communication, container spawning and basic routing. It worked locally. Containers spun up, users got unique subdomains and traffic routed correctly.

Then I realized the architecture had a fundamental flaw.

The Problem: Master as a Single Point of Failure

My initial design had centralized routing. The Master instance handled both orchestration (deciding which worker runs a container) and routing (directing traffic to that container). This created two problems:

Problem 1: Availability. If the Master crashed, users lost access to running containers, even though those containers were still alive on the Worker nodes. A control plane failure cut off access to the data plane.

Problem 2: Scalability. The Master became a routing bottleneck. All traffic flowed through it, limiting horizontal scaling: we could add more workers, but routing capacity stayed fixed.

This violates a core principle: separate the control plane from the data plane.

The Solution: Service Mesh with Consul and Traefik

After I read Consul’s documentation and studied service mesh patterns, the solution was clear:

  • Master (Control Plane): Orchestrates the container lifecycle, maintains state in Redis, registers services in Consul

  • Workers (Data Plane): Run containers, run Traefik instances, route traffic based on Consul’s service catalog

  • Consul: Single source of truth for service discovery and routing metadata

The Master never touches production traffic. It just tells Consul, “container X is at IP Y on port Z with hostname H”. Every worker’s Traefik watches Consul and builds routing rules dynamically.

The New Architecture

User Request (a3f4b92c8d.orcactf.app)
    ↓
DNS Resolution (will point to an NLB on AWS)
    ↓
Any Worker's Traefik (via load balancer)
    ↓
Consul Lookup: "Which IP:port has this hostname?"
    ↓
Forward to Container (possibly on different worker)
    ↓
Container responds

Key Insight: Any Traefik instance can route to any container, even if that container isn’t on the same worker. They all read from the same Consul Catalog.
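At its core, this insight means every Traefik instance consults the same hostname-to-backend map. A minimal sketch of that shared-catalog lookup (the catalog entries and the `route` function here are illustrative, not Poseidon's actual API):

```python
# Sketch of the hostname -> backend lookup every Traefik instance
# effectively performs against the shared Consul catalog.
# The entries below are illustrative, not real services.
catalog = {
    "a3f4b92c8d.orcactf.app": ("10.0.2.14", 8000),  # container on worker 1
    "b7e1d03f55.orcactf.app": ("10.0.3.21", 8000),  # container on worker 2
}

def route(hostname):
    """Resolve a request hostname to its backend, wherever it runs."""
    return catalog.get(hostname)  # None -> Traefik answers 404
```

Because every worker reads the same catalog, the lookup returns the same backend no matter which Traefik instance handles the request.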

This means:

  • Master crash doesn’t affect routing.

  • Routing scales horizontally (N workers = N Traefik instances)

  • Clear separation: Master orchestrates, Workers route

  • Adding a Worker adds both compute and network routing capacity

Implementation: Service Registration in Consul

When the Master provisions a container, it now registers the service in Consul instead of updating Traefik configs:

import consul.aio  # async Consul client from the python-consul package

class CatalogManager:
    def __init__(self, consul_host: str, consul_port: int):
        self.client = consul.aio.Consul(host=consul_host, port=consul_port)

    async def register_route(self, route_def: RouteDefinition, worker_node: WorkerNode):
        service_id = f"instance-{route_def.instance_id}"
        protocol = route_def.protocol.value.lower()  # 'http' or 'tcp'

        tags = [
            "traefik.enable=true",
            f"traefik.{protocol}.routers.{service_id}.rule=Host(`{route_def.hostname}`)",
            f"traefik.{protocol}.routers.{service_id}.entrypoints=challenge-web",
            f"traefik.{protocol}.services.{service_id}.loadbalancer.server.port={route_def.backend_port}"
        ]

        await self.client.agent.service.register(
            name=f"instance-svc-{route_def.instance_id}",
            service_id=service_id,
            address=route_def.backend_ip,  # Container's internal IP
            port=route_def.backend_port,
            tags=tags
        )

What’s tagged here:

  1. Service ID: Unique identifier for this container instance

  2. Traefik Tags: Embedding routing rules that Traefik reads from Consul

  3. Address/Ports: Where the container is actually running (internal Docker network IP)
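Since these tag strings are easy to get subtly wrong, it can help to isolate their construction in a pure function. A sketch of that idea (standalone, taking plain arguments instead of Poseidon's RouteDefinition):

```python
def build_traefik_tags(service_id, hostname, backend_port, protocol="http"):
    """Build the Consul service tags Traefik's provider reads.

    protocol is 'http' or 'tcp', matching Traefik's router namespaces.
    """
    return [
        "traefik.enable=true",
        f"traefik.{protocol}.routers.{service_id}.rule=Host(`{hostname}`)",
        f"traefik.{protocol}.routers.{service_id}.entrypoints=challenge-web",
        f"traefik.{protocol}.services.{service_id}.loadbalancer.server.port={backend_port}",
    ]
```

A helper like this is also easy to unit-test, which matters when a single malformed tag silently breaks routing.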

When a container terminates, we de-register:

async def deregister_route(self, instance_id: str):
    service_id = f"instance-{instance_id}"
    await self.client.agent.service.deregister(service_id)

Simple. The Master completes its job of writing to Consul and moves on.

Implementation: Traefik as Consul Catalog Consumer

Each Worker runs a Traefik instance configured to watch Consul:

# traefik.yml on each worker

log:
  level: DEBUG

api:
  insecure: true # Only for local dev, will be changed in production
  dashboard: true # For debugging, will be disabled in production

entryPoints:
  traefik:
    address: ":8181"  # Dashboard

  challenge-tcp:
    address: ":2222"  # For SSH/TCP challenges

  challenge-web:
    address: ":8080"  # For HTTP challenges

providers:
  consulCatalog:
    endpoint:
      address: "127.0.0.1:8500"
      scheme: "http"

    # Only services tagged with traefik.enable=true
    constraints: "Tag(`traefik.enable=true`)"

    # Don't auto-expose everything in Consul
    exposedByDefault: false

    # Look for tags prefixed with 'traefik'
    prefix: "traefik"

Traefik polls Consul every few seconds, discovers services with traefik.enable=true, reads their tags, and dynamically generates routing rules.

No config files to update. No restarts required. Pure service discovery.
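The provider's behavior reduces to a poll-and-diff loop: fetch the tagged services, compare against the current routing table, and apply the difference. A simplified sketch of that loop's core (real Traefik uses Consul blocking queries rather than this naive comparison):

```python
def diff_catalog(current, fetched):
    """Compare two catalog snapshots: which services appeared or vanished?"""
    added = [sid for sid in fetched if sid not in current]
    removed = [sid for sid in current if sid not in fetched]
    return added, removed
```

Each iteration, Traefik creates routers for the added services from their tags and drops routers for the removed ones; no restarts, no config files.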

The Worker Bootstrap Process

Each Worker needs to run three processes: Consul agent, Traefik, and the gRPC worker server. We use supervisord to manage them:

[supervisord]
nodaemon=true

[program:consul]
command=consul agent -node-id=%%INSTANCE_ID%% \
  -data-dir=/consul/data \
  -config-dir=/consul/config \
  -retry-join="consul" \
  -client="0.0.0.0" \
  -enable-local-script-checks=true
autostart=true
autorestart=true

[program:traefik]
command=traefik --configfile=/etc/traefik/traefik.yml
autostart=true
autorestart=true

[program:poseidon-worker]
command=python -m src.main
autostart=true
autorestart=true

The entrypoint script generates a unique worker ID (locally for now; on AWS we’ll use the EC2 instance ID) and starts all three processes:

#!/bin/sh
set -e

INSTANCE_ID=$(uuidgen)
echo "Worker node identified as: ${INSTANCE_ID}"

# Template substitution for configs
sed "s/%%INSTANCE_ID%%/${INSTANCE_ID}/g" \
  /etc/supervisor/conf.d/supervisord.conf.template > /tmp/supervisord.conf

sed "s/%%INSTANCE_ID%%/${INSTANCE_ID}/g" \
  /consul/config/worker-service.json.template > /consul/config/worker-service.json

exec /usr/bin/supervisord -c /tmp/supervisord.conf

Each Worker joins the Consul cluster, registers its health check, and starts accepting gRPC requests from the Master.

What Broke: The Traefik Tag Nightmare

This all sounds clean in retrospect. In practice, getting Traefik to generate routes correctly took hours.

The Problem: Traefik was auto-generating routing rules based on the service name (instance-svc-{uuid}), not the hostname we wanted. Users would hit a3f4b92c8d.orcactf.app, and Traefik would return 404 because it had created a route for instance-svc-3f053024-2418-4b4c-97d1-0590c5d9f480.orcactf.app.

The Root Cause: Traefik’s Consul provider has “smart” defaults that infer routing rules from service metadata. It was too smart, overriding our tags with its own generated rules.

The Fix: Two critical config changes:

  1. Set exposedByDefault: false in traefik.yml

    • Forces Traefik to only process services explicitly tagged with traefik.enable=true

    • Disables auto-route generation

  2. Use explicit router rules in tags:

     f"traefik.{protocol}.routers.{service_id}.rule=Host(`{route_def.hostname}`)"
    
    • Overrides Traefik’s default rule generation

    • Ensures we control the hostname matching logic

After these changes, routes worked perfectly. Getting there took several passes through Traefik’s Consul provider documentation to understand how its rule generation works.

Lesson learned: When integrating tools that "just work," the hard part is understanding what they do automatically and how to override it when needed.

What’s Still Missing

This works locally. AWS deployment will add complexity:

  1. Network Load Balancer:

    Currently testing with direct Traefik access. In production, an NLB will front all workers, and the flow becomes:

     DNS → Route53 → NLB → Worker Traefik → Container
    
  2. Multi-node Consul Cluster

    Right now, we have a single Consul server, so if Consul dies, routing goes down. Scaling this will require a 3-node cluster for high availability (HA).

  3. VPC Networking:

    Workers need appropriate security group rules: inbound traffic from the NLB, inbound access from other workers, and outbound access to other workers.

  4. TLS/HTTPS:

    Currently running on HTTP only. We need Traefik’s TLS termination backed by Let’s Encrypt or AWS Certificate Manager.

  5. The Reaper Process:

    Containers have TTLs. We need a background process that:

    • Polls Redis for expired instances

    • Calls deprovision_instance() for each of them

    • Handles orphaned containers (if the Master crashes before cleanup)
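The reaper's core decision, which instances are past their TTL, is a pure function over whatever state the Master keeps in Redis. A sketch under the assumption that each record carries a creation timestamp and a TTL in seconds (the field names are hypothetical, not Poseidon's actual schema):

```python
import time

def find_expired(instances, now=None):
    """Return IDs of instances whose TTL has elapsed.

    instances maps instance_id -> {"created_at": epoch_secs, "ttl": secs};
    these field names are illustrative, not Poseidon's Redis schema.
    """
    now = time.time() if now is None else now
    return [
        iid for iid, rec in instances.items()
        if now - rec["created_at"] >= rec["ttl"]
    ]
```

The background loop would then call deprovision_instance() on each returned ID, and a periodic sweep comparing running containers against Redis would catch orphans.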

This is infrastructure work, not architecture work. The hard part of designing and implementing the system is done.

Failure Mode Analysis

Let's validate that we actually improved resilience:

| Failure | Old Architecture | New Architecture |
| --- | --- | --- |
| Master crashes | Routing dies, users disconnected | Routing unaffected, orchestration paused |
| Worker crashes | Only its containers die | Only its containers die; NLB removes it from the pool |
| Traefik crashes (single) | All routing dies | Other Traefik instances compensate |
| Consul crashes (single node) | N/A | Routing breaks (needs multi-node cluster) |

Net improvement: I eliminated the "Master down = routing down" failure mode, which was the biggest architectural risk.

Horizontal Scaling: The Math

Old architecture:

  • 1 Master with Traefik = fixed routing capacity

  • Add workers = more compute, same routing bottleneck

New architecture:

  • N workers = N Traefik instances = N × routing capacity

  • Add worker = more compute AND more routing capacity

Example: If each Traefik can handle 1000 req/sec:

  • 1 centralized Traefik = 1000 req/sec max

  • 5 workers with Traefik = 5000 req/sec max (behind NLB)

This scales routing capacity horizontally, not just compute capacity.

Lessons Learned

1. Control/Data Plane Separation is Real

This isn't just theory from distributed systems textbooks. The moment you separate "who decides" from "who executes," resilience improves dramatically.

2. Service Mesh Reduces Coupling

Master doesn't know about Traefik. Workers don't know about routing logic. Consul is the only shared dependency. This means:

  • We could swap Traefik for Nginx without touching the Master

  • We could change scheduling logic without touching Traefik

  • Each component has a single, clear responsibility

3. Tool Documentation Matters More Than You Think

The Traefik tag issue cost hours because I assumed defaults would work. Reading the Consul provider docs thoroughly upfront would have saved time.

4. Test Failure Modes Early

Spinning up multiple workers locally and killing processes randomly validated the architecture way before AWS deployment. If it doesn't work with 2 local workers, it won't work with 10 EC2 instances.

The system works. Now we need to make it production-ready.


Follow along as I continue building Poseidon. The code will be open sourced once we’re deployed and tested on AWS infrastructure.

Previous Posts: Part 1 | Part 2
