Building Poseidon #3: Control Plane vs Data Plane
How we ensured our system can survive a master node crash

In Part 2, I built the Master-Worker orchestration core: gRPC communication, container spawning, and basic routing. It worked locally. Containers spun up, users got unique subdomains, and traffic routed correctly.
Then I realized the architecture had a fundamental flaw.
The Problem: Master as a Single Point of Failure
My initial design had centralized routing. The Master instance handled both orchestration ( deciding which worker runs a container ) and routing ( directing traffic to that container ). This created two problems:
Problem 1: Availability. If the Master crashed, users lost access to running containers, even though the containers were still alive on the Worker nodes. A control plane failure restricted access to the data plane.
Problem 2: Scalability. The Master became a routing bottleneck. All traffic flowed through it, limiting horizontal scaling. We could add more workers, but routing capacity stayed fixed.
This violates a core principle: separate the control plane from the data plane.
The Solution: Service Mesh with Consul and Traefik
After reading Consul’s documentation and studying service mesh patterns, the solution was clear:
Master ( Control Plane ): Orchestrates the container lifecycle, maintains state in Redis, and registers services in Consul
Workers ( Data Plane ): Run containers, run Traefik instances, route traffic based on Consul’s service catalog
Consul: Single source of truth for service discovery and routing metadata.
The Master never touches production traffic. It just tells Consul, “container X is at IP Y on port Z with hostname H”. Every worker’s Traefik watches Consul and builds routing rules dynamically.
The New Architecture
User Request (a3f4b92c8d.orcactf.app)
↓
DNS Resolution (will point to NLB on AWS)
↓
Any Worker's Traefik (via load balancer)
↓
Consul Lookup: "Which IP:port has this hostname?"
↓
Forward to Container (possibly on different worker)
↓
Container responds
Key Insight: Any Traefik instance can route to any container, even if that container isn’t on the same worker. They all read from the same Consul Catalog.
This means:
Master crash doesn’t affect routing.
Routing scales horizontally ( N workers = N Traefik instances )
Clear separation: Master orchestrates, Workers route
Adding a Worker adds both compute and network routing capacity
Implementation: Service Registration in Consul
When the Master provisions a container, it now registers the service in Consul instead of updating Traefik configs:
import consul.aio  # python-consul's asyncio client

class CatalogManager:
    def __init__(self, consul_host: str, consul_port: int):
        self.client = consul.aio.Consul(host=consul_host, port=consul_port)

    async def register_route(self, route_def: RouteDefinition, worker_node: WorkerNode):
        service_id = f"instance-{route_def.instance_id}"
        protocol = route_def.protocol.value.lower()  # 'http' or 'tcp'
        tags = [
            "traefik.enable=true",
            f"traefik.{protocol}.routers.{service_id}.rule=Host(`{route_def.hostname}`)",
            f"traefik.{protocol}.routers.{service_id}.entrypoints=challenge-web",
            f"traefik.{protocol}.services.{service_id}.loadbalancer.server.port={route_def.backend_port}",
        ]
        await self.client.agent.service.register(
            name=f"instance-svc-{route_def.instance_id}",
            service_id=service_id,
            address=route_def.backend_ip,  # Container's internal IP
            port=route_def.backend_port,
            tags=tags,
        )
What’s tagged here:
Service ID: Unique identifier for this container instance
Traefik Tags: Embedding routing rules that Traefik reads from Consul
Address/Ports: Where the container is actually running ( internal Docker network IP )
When a container terminates, we de-register:
    async def deregister_route(self, instance_id: str):
        service_id = f"instance-{instance_id}"
        await self.client.agent.service.deregister(service_id)
Simple. The Master completes its job of writing to Consul and moves on.
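The tag strings above are easy to get subtly wrong, so one option is to isolate their construction in a pure helper that can be unit-tested without a running Consul server. A sketch of that idea, with a hypothetical helper name (the hard-coded `challenge-web` entrypoint mirrors the code above; this helper is not from the actual Poseidon codebase):

```python
def build_traefik_tags(protocol: str, service_id: str, hostname: str, backend_port: int) -> list[str]:
    """Build the Consul service tags that Traefik's consulCatalog provider reads."""
    return [
        "traefik.enable=true",
        f"traefik.{protocol}.routers.{service_id}.rule=Host(`{hostname}`)",
        f"traefik.{protocol}.routers.{service_id}.entrypoints=challenge-web",
        f"traefik.{protocol}.services.{service_id}.loadbalancer.server.port={backend_port}",
    ]

# Example: the tags a provision for a3f4b92c8d.orcactf.app would produce
tags = build_traefik_tags("http", "instance-a3f4b92c8d", "a3f4b92c8d.orcactf.app", 80)
```

A test can then assert the exact tag format once, instead of debugging malformed tags through Traefik's logs.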

Implementation: Traefik as Consul Catalog Consumer
Each Worker runs a Traefik instance configured to watch Consul:
# traefik.yml on each worker
log:
  level: DEBUG

api:
  insecure: true   # Only for local dev, will be changed in production
  dashboard: true  # For debugging, will be disabled in production

entryPoints:
  traefik:
    address: ":8181"  # Dashboard
  challenge-tcp:
    address: ":2222"  # For SSH/TCP challenges
  challenge-web:
    address: ":8080"  # For HTTP challenges

providers:
  consulCatalog:
    endpoint:
      address: "127.0.0.1:8500"
      scheme: "http"
    # Only services tagged with traefik.enable=true
    constraints: "Tag(`traefik.enable=true`)"
    # Don't auto-expose everything in Consul
    exposedByDefault: false
    # Look for tags prefixed with 'traefik'
    prefix: "traefik"
Traefik polls Consul every few seconds, discovers services with traefik.enable=true, reads their tags, and dynamically generates routing rules.
No config files to update. No restarts required. Pure service discovery.
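Conceptually, what the Consul Catalog provider does with those tags is parse `traefik.`-prefixed key=value strings into a nested router/service configuration. The simplified sketch below is my own illustration of that idea, not Traefik's actual implementation, but it shows why explicit tags translate directly into routing rules:

```python
def parse_traefik_tags(tags: list[str], prefix: str = "traefik") -> dict:
    """Turn flat `prefix.a.b.c=value` tags into a nested dict, ignoring non-matching tags."""
    config: dict = {}
    for tag in tags:
        if "=" not in tag or not tag.startswith(prefix + "."):
            continue  # skip tags that aren't traefik routing metadata
        key, value = tag.split("=", 1)
        parts = key.split(".")[1:]  # drop the 'traefik' prefix
        node = config
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return config

rules = parse_traefik_tags([
    "traefik.enable=true",
    "traefik.http.routers.svc1.rule=Host(`a.orcactf.app`)",
    "traefik.http.services.svc1.loadbalancer.server.port=80",
    "some-unrelated-consul-tag",
])
```

With `exposedByDefault: false`, only services carrying `traefik.enable=true` ever reach this stage, which is exactly the behavior we rely on below.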
The Worker Bootstrap Process
Each Worker needs to run three processes: Consul agent, Traefik, and the gRPC worker server. We use supervisord to manage them:
[supervisord]
nodaemon=true

[program:consul]
command=consul agent -node-id=%%INSTANCE_ID%% \
    -data-dir=/consul/data \
    -config-dir=/consul/config \
    -retry-join="consul" \
    -client="0.0.0.0" \
    -enable-local-script-checks=true
autostart=true
autorestart=true

[program:traefik]
command=traefik --configfile=/etc/traefik/traefik.yml
autostart=true
autorestart=true

[program:poseidon-worker]
command=python -m src.main
autostart=true
autorestart=true
The entrypoint script generates a unique worker ID ( a local UUID for now; on AWS it will use the instance ID ) and starts all three processes:
#!/bin/sh
set -e
INSTANCE_ID=$(uuidgen)
echo "Worker node identified as: ${INSTANCE_ID}"
# Template substitution for configs
sed "s/%%INSTANCE_ID%%/${INSTANCE_ID}/g" \
/etc/supervisor/conf.d/supervisord.conf.template > /tmp/supervisord.conf
sed "s/%%INSTANCE_ID%%/${INSTANCE_ID}/g" \
/consul/config/worker-service.json.template > /consul/config/worker-service.json
exec /usr/bin/supervisord -c /tmp/supervisord.conf
Each Worker joins the Consul cluster, registers a health check, and starts accepting gRPC requests from the Master.
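The `worker-service.json.template` referenced in the entrypoint isn't shown above. A minimal sketch of what such a Consul service definition could look like, following Consul's standard service definition schema (the gRPC port 50051 and the TCP health check are my assumptions, not Poseidon's actual values):

```json
{
  "service": {
    "name": "poseidon-worker",
    "id": "worker-%%INSTANCE_ID%%",
    "port": 50051,
    "check": {
      "tcp": "localhost:50051",
      "interval": "10s",
      "timeout": "2s"
    }
  }
}
```

Because the Consul agent loads everything in `/consul/config`, dropping the templated file there before `supervisord` starts is enough to register the worker on boot.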
What Broke: The Traefik Tag Nightmare
This all sounds clean in retrospect. In practice, getting Traefik to generate routes correctly took hours.
The Problem: Traefik was auto-generating routing rules based on the service name (instance-svc-{uuid}), not the hostname we wanted. Users would hit a3f4b92c8d.orcactf.app, and Traefik would return 404 because it had created a route for instance-svc-3f053024-2418-4b4c-97d1-0590c5d9f480.orcactf.app.
The Root Cause: Traefik’s Consul provider has “smart” defaults that infer routing rules from service metadata. It was too smart, overriding our tags with its own generated rules.
The Fix: Two critical config changes:
1. Set exposedByDefault: false in traefik.yml. This forces Traefik to process only services explicitly tagged with traefik.enable=true and disables auto-route generation.
2. Use explicit router rules in tags: f"traefik.{protocol}.routers.{service_id}.rule=Host(`{route_def.hostname}`)". This overrides Traefik’s default rule generation and ensures we control the hostname-matching logic.
After these changes, routes worked perfectly. But it took several passes through Traefik’s Consul provider documentation to understand how the rules are generated.
Lesson learned: When integrating tools that "just work," the hard part is understanding what they do automatically and how to override it when needed.
What’s Still Missing
This works locally. AWS deployment will add complexity:
Network Load Balancer:
Currently testing with direct Traefik access. In production, an NLB will front all workers, and the flow becomes DNS → Route53 → NLB → Worker Traefik → Container.
Multi-node Consul Cluster:
Right now we have a single Consul server; if it dies, routing goes down. If we ever scale this, we’ll need a 3-node cluster for HA ( high availability ).
VPC Networking:
Workers need appropriate security group rules allowing inbound traffic from the NLB, inbound access from other workers, and outbound access to other workers.
TLS/HTTPS:
We’re currently running on HTTP only. We need to implement TLS termination in Traefik, with certificates from Let’s Encrypt or AWS Certificate Manager.
The Reaper Process:
Containers have TTLs. We need a background process that:
Polls Redis for expired instances
Calls deprovision_instance() for each of them
Handles orphaned containers ( if the Master crashes before cleanup )
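A reaper loop could look something like the sketch below. The Redis key layout ( a sorted set keyed by expiry timestamp ) and the injected `deprovision_instance` callable are my assumptions, not the eventual implementation; the idea is that storing each instance's expiry as a sorted-set score lets one query fetch everything that has expired:

```python
import asyncio
import time

def find_expired(expirations: dict[str, float], now: float) -> list[str]:
    """Pure helper: return the instance IDs whose TTL has elapsed."""
    return [iid for iid, expires_at in expirations.items() if expires_at <= now]

async def reaper_loop(redis_client, deprovision_instance, interval: float = 30.0) -> None:
    """Background loop: poll Redis for expired instances and tear them down.

    `redis_client` is assumed to expose async zrangebyscore/zrem (e.g. redis.asyncio);
    `deprovision_instance` is the Master's (hypothetical) async teardown coroutine.
    """
    while True:
        # Assumed layout: sorted set "instance_expirations", member=instance_id, score=expiry epoch
        expired = await redis_client.zrangebyscore("instance_expirations", "-inf", time.time())
        for instance_id in expired:
            await deprovision_instance(instance_id)  # stop container, deregister from Consul
            await redis_client.zrem("instance_expirations", instance_id)  # forget only after cleanup succeeds
        await asyncio.sleep(interval)
```

Removing the Redis entry only after `deprovision_instance` succeeds means a Master crash mid-cleanup leaves the instance in the set, so the next reaper pass retries it, which covers the orphaned-container case.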
This is infrastructure work, not architecture work. The hard part of designing and implementing the system is done.
Failure Mode Analysis
Let's validate that we actually improved resilience:
| Failure | Old Architecture | New Architecture |
| --- | --- | --- |
| Master crashes | Routing dies, users disconnected | Routing unaffected, orchestration paused |
| Worker crashes | Only its containers die | Only its containers die, NLB removes it from the pool |
| Traefik crashes (single) | All routing dies | Other Traefik instances compensate |
| Consul crashes (single node) | N/A | Routing breaks (needs a multi-node cluster) |
Net improvement: I eliminated the "Master down = routing down" failure mode, which was the biggest architectural risk.
Horizontal Scaling: The Math
Old architecture:
1 Master with Traefik = fixed routing capacity
Add workers = more compute, same routing bottleneck
New architecture:
N workers = N Traefik instances = N × routing capacity
Add worker = more compute AND more routing capacity
Example: If each Traefik can handle 1000 req/sec:
1 centralized Traefik = 1000 req/sec max
5 workers with Traefik = 5000 req/sec max (behind NLB)
This scales routing capacity horizontally, not just compute capacity.
Lessons Learned
1. Control/Data Plane Separation is Real
This isn't just theory from distributed systems textbooks. The moment you separate "who decides" from "who executes," resilience improves dramatically.
2. Service Mesh Reduces Coupling
Master doesn't know about Traefik. Workers don't know about routing logic. Consul is the only shared dependency. This means:
We could swap Traefik for Nginx without touching the Master
We could change scheduling logic without touching Traefik
Each component has a single, clear responsibility
3. Tool Documentation Matters More Than You Think
The Traefik tag issue cost hours because I assumed defaults would work. Reading the Consul provider docs thoroughly upfront would have saved time.
4. Test Failure Modes Early
Spinning up multiple workers locally and killing processes randomly validated the architecture way before AWS deployment. If it doesn't work with 2 local workers, it won't work with 10 EC2 instances.
The system works. Now we need to make it production-ready.
Follow along as I continue building Poseidon. The code will be open sourced once we’re deployed and tested on AWS infrastructure.


