Building Poseidon #2: The Master-Worker Dance

Designing the gRPC API and Orchestration logic that powers ephemeral containers

In Part 1 of Building Poseidon, we established why we’re building Poseidon instead of using existing solutions like Kubernetes or AWS Fargate. We need instance-specific routing, fine-grained lifecycle control, and cost-effective operation within the AWS free tier.

Now comes the challenging part: actually building it.

Talk is cheap. Show me the code. - Linus Torvalds

This post dives into the master-worker communication protocol, the orchestration logic that schedules containers across workers, and the real-world problems we hit along the way. Everything described here is running on my local machine right now; AWS deployment comes later.

The Architecture: A Quick Recap

Poseidon follows a two-process model:

  • Master Process: Exposes a REST API to OrcaCTF’s backend, maintains global state in Redis, orchestrates container lifecycles across workers.

  • Worker Process: Runs on compute nodes, interfaces with the Docker daemon, spins up containers, and reports status back to the master.

The master itself doesn’t know ( or care ) what kind of images it’s running. It’s a dumb executor that receives requests like:

{
  "user_id": "litmus",
  "challenge_id": "localhost:5000/orca-challenges/our-first-challenge:latest",
  "mem_limit": 256,
  "nano_cpus": 5000000,
  "protocol": "http"
}

It returns a unique hostname once the user’s requested container is accessible.
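For illustration, the master-side model for this payload might look like the following sketch. `SpawnRequestModel` and its default are hypothetical, not Poseidon’s actual class; only the field names come from the JSON above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpawnRequestModel:
    """Illustrative model mirroring the JSON payload above (hypothetical name)."""
    user_id: str
    challenge_id: str    # full image reference, e.g. registry/repo:tag
    mem_limit: int       # megabytes
    nano_cpus: int       # 1_000_000_000 nano-CPUs == one full core
    protocol: str = "http"

payload = {
    "user_id": "litmus",
    "challenge_id": "localhost:5000/orca-challenges/our-first-challenge:latest",
    "mem_limit": 256,
    "nano_cpus": 5000000,
    "protocol": "http",
}
request = SpawnRequestModel(**payload)
```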

The Communication Protocol: REST for Clients, gRPC for Workers

Early on, I had to decide: how should the master and worker talk to each other?

The decision: a hybrid approach.

Master ↔ Frontend/Backend: REST API

  • OrcaCTF’s backend makes simple HTTP requests to start/stop containers

  • Easy to debug with cURL, integrates with existing FastAPI infrastructure

  • No gRPC dependencies needed on the client ( OrcaCTF backend ) side

Master ↔ Workers: gRPC

  • Workers are internal infrastructure, not exposed to users

  • Need high performance, low latency communication for container operations

  • Strong typing via Protocol Buffers catches integration bugs early

  • Bi-directional streaming support, which can be used for real-time logs and metrics

This isn’t an either-or decision. REST is great for public APIs, while gRPC is better for internal service-to-service communication where performance and type safety matter.

The gRPC Protocol Definition

Here’s our .proto defining the worker service:

syntax = "proto3";
package poseidon.worker;

service WorkerService {
    rpc SpawnContainer(SpawnRequest) returns (SpawnResponse);
    rpc TerminateContainer(TerminateRequest) returns (TerminateResponse);
    rpc GetWorkerStatus(WorkerStatusRequest) returns (WorkerStatusResponse);
}

message SpawnRequest {
    string request_id = 1;
    string image_name = 2;
    int32 start_timeout_seconds = 3;
    int32 mem_limit = 4;
    int32 nano_cpus = 5;
}

message SpawnResponse {
    string container_id = 1;
    string internal_ip = 2;
    int32 internal_port = 3;
    bool success = 4;
    string error_message = 5;
}

message TerminateRequest {
    string container_id = 1;
}

message TerminateResponse {
    bool success = 1;
    string error_message = 2;
}

message WorkerStatusRequest {}

message WorkerStatusResponse {
    string node_id = 1;
    float cpu_usage_percent = 2;
    float memory_percent = 3;
    int32 container_count = 4;
}

Key design decisions:

  • SpawnResponse returns the internal IP and port. The master needs to know where to route traffic. Containers run on a shared Docker network, so we use internal IPs, not published ports.

  • The success + error_message pattern: Workers can fail for many reasons ( image pull failure, resource exhaustion, Docker daemon issues ). Explicit success flags make error handling cleaner than relying on gRPC status codes alone.

  • GetWorkerStatus for load balancing: Workers periodically report CPU, memory and container count. The master uses this for intelligent scheduling.
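The success + error_message pattern can be normalized on the master side so that application-level failures raise just like transport failures. A minimal sketch; `SpawnError` and `check_spawn_response` are hypothetical names, and `SimpleNamespace` stands in for the generated protobuf message:

```python
from types import SimpleNamespace

class SpawnError(RuntimeError):
    """Raised when a worker reports success=False (illustrative name)."""

def check_spawn_response(response):
    # Turn an application-level failure into an exception,
    # so callers handle it the same way as a transport error.
    if not response.success:
        raise SpawnError(response.error_message or "worker reported failure")
    return response

# SimpleNamespace stands in for the generated SpawnResponse protobuf class.
ok = SimpleNamespace(success=True, error_message="", container_id="abc123")
bad = SimpleNamespace(success=False, error_message="image pull failed", container_id="")
```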

Worker Discovery: Why Consul?

One of the first problems: how does the master know which workers exist and whether they’re healthy?

We could maintain a worker registry in Redis, but that introduces a new failure mode: what if a worker crashes without deregistering itself? The master would keep sending requests to a dead worker.

Enter Consul

Consul is a service mesh solution that handles service discovery and health-checking. Workers register themselves with Consul on startup, and Consul continuously health-checks them via gRPC’s health protocol.

Why Consul over Manual Health Checks?

Separation of Concerns: The master doesn’t have to implement heartbeat logic. Consul does that out of the box.

Graceful draining: When we deploy to AWS, worker instances will sit behind a load balancer that may de-provision a worker when load is low. A worker can set a “draining” status in its health-check responses; Consul then informs the master, which stops scheduling new instances on that node while existing containers complete their tenure ( and requests for extensions are denied ).

Proven reliability: Consul is battle-tested infrastructure. We’re not re-inventing service discovery.

The Master queries Consul for healthy workers and gets back a list of (node_id, address) pairs. Simple.
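That query can be pictured as parsing Consul’s `/v1/health/service/<name>?passing=true` response into those pairs. A sketch using a mocked, trimmed-down payload in the shape Consul returns ( the helper name is ours ):

```python
def healthy_workers(consul_entries: list[dict]) -> list[tuple[str, str]]:
    """Map Consul health-API entries to (node_id, address) pairs."""
    pairs = []
    for entry in consul_entries:
        svc = entry["Service"]
        pairs.append((svc["ID"], f'{svc["Address"]}:{svc["Port"]}'))
    return pairs

# Trimmed-down mock of Consul's /v1/health/service response shape.
mock_response = [
    {"Node": {"Node": "node-a"},
     "Service": {"ID": "worker-1", "Address": "10.0.0.5", "Port": 50051}},
    {"Node": {"Node": "node-b"},
     "Service": {"ID": "worker-2", "Address": "10.0.0.6", "Port": 50051}},
]
```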

The Orchestration Flow: From Request to Running Container

Let’s walk through what happens when a user clicks “Start Challenge”.

Phase 1: Worker Selection ( Scheduling )

The Orchestrator asks the Scheduler for the best available Worker:

selected_worker = await self.scheduler.get_best_worker()
if not selected_worker:
    raise RuntimeError("No available workers to handle request")

The scheduler implements a weighted score system:

cpu_weight = 0.5
mem_weight = 0.3
container_weight = 0.2

score = (cpu * cpu_weight) + (mem * mem_weight) + (container_count * container_weight)

Workers with lower scores ( less loaded ) are preferred. This is a simple heuristic, not a sophisticated bin-packing algorithm, but it works at our expected scale.

The Scheduler polls all workers every 15 seconds via GetWorkerStatus, caching their metrics in Redis. When a container request arrives, it picks the worker with the lowest score from the cached data.

Trade-off: There’s a 15-second window where worker stats might be stale. A worker could become overloaded between polls, and the scheduler wouldn’t know. For our use case ( 15-20 concurrent users ) this should do the trick. When we scale and need real-time accuracy, we could poll more frequently or use event-driven updates.

Phase 2: Container Spawn

Once a worker is selected, the Orchestrator sends a gRPC SpawnContainer request:

running_container = await self.worker_client.spawn_container(
    selected_worker,
    challenge_id,
    instance_id,
    mem_limit,
    nano_cpus
)

On the Worker side, this triggers DockerManager:

from docker.errors import ImageNotFound
from docker.models.containers import Container

def run_container(self, image_name: str, instance_id: str,
                  mem_limit: int | None, nano_cpus: int | None,
                  pulled: bool = False) -> Container:
    memory_limit_str = f"{mem_limit}m" if mem_limit else "256m"

    try:
        container = self.client.containers.run(
            image=image_name,
            detach=True,
            auto_remove=True,
            labels={"poseidon.instance_id": instance_id},
            network=self.network_name,
            mem_limit=memory_limit_str,
            nano_cpus=nano_cpus
        )
        return container
    except ImageNotFound:
        if pulled:
            raise  # image still missing after a pull; don't recurse forever
        logger.warning(f"Image '{image_name}' not found locally. Pulling...")
        self.client.images.pull(image_name)
        return self.run_container(image_name, instance_id, mem_limit, nano_cpus, pulled=True)

Key Details:

  • auto_remove=True: Containers are ephemeral; when they stop, Docker automatically cleans them up

  • labels={"poseidon.instance_id": instance_id}: Labels enable visibility; descriptive labels like this help pinpoint the source of issues. Traefik ( our reverse proxy ) can also be configured to route traffic based on these labels

  • Automatic image pulling: If the challenge image isn’t cached locally, the worker pulls it. This adds latency on the first request, but simplifies deployment.

Phase 3: Network Discovery

After the container starts, the worker needs to determine its internal IP and report it, along with the port the container’s service listens on. This is trickier than it sounds.

Containers run on a shared Docker network. We don’t use published ports because that would require dynamic port allocation and port conflict handling. Instead, containers expose their services on an internal port ( like 8080 for HTTP challenges ), and Traefik routes traffic based on the hostname.

from docker.errors import NotFound

def get_container_details(self, container_id: str) -> dict | None:
    try:
        container = self.client.containers.get(container_id)
    except NotFound:
        return None
    container.reload()

    # Get IP from the shared network
    network_settings = container.attrs['NetworkSettings']['Networks'][self.network_name]
    ip_address = network_settings['IPAddress']

    # Get the internal port from the image's EXPOSE directive
    exposed_ports = container.attrs['Config']['ExposedPorts']
    internal_port_str = list(exposed_ports.keys())[0]  # e.g., "80/tcp"
    internal_port = int(internal_port_str.split('/')[0])

    return {
        "id": container.id,
        "ip_address": ip_address,
        "internal_port": internal_port,
    }

This relies on challenge images correctly declaring their exposed ports via EXPOSE in the Dockerfile. If the challenge doesn’t expose a port, this logic fails. This is intentional as it helps catch misconfigurations early.

Phase 4: State Management and Routing

Once the container is running, the Orchestrator:

  • Generates a unique hostname: sha256(instance_id)[:12].<ctf_platform_domain>

  • Saves the instance metadata to Redis

      instance = Instance(
          instance_id=instance_id,
          user_id=user_id,
          challenge_id=challenge_id,
          worker=selected_worker,
          hostname=f"{external_hostname}.local",
          container_id=running_container.container_id,
          internal_ip=running_container.internal_ip,
          internal_port=running_container.internal_port,
          created_at=now_ts,
          expires_at=now_ts + cfg.default_ttl_seconds
      )
      await self.state_manager.save_instance(instance)
  • Creates a routing rule in Traefik

      route_definition = RouteDefinition(
          instance_id=instance.instance_id,
          protocol=protocol,
          hostname=instance.hostname,
          backend_ip=running_container.internal_ip,
          backend_port=running_container.internal_port,
      )
      self.proxy_manager.create_route(route_definition)

Traefik now knows: “traffic to a3f4b92c8d.local should go to 172.18.0.5:80.”

The user receives their unique URL, and can connect immediately.
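The hostname generation and the routing rule can be sketched together. The hostname follows the sha256 scheme above; the key/value layout assumes Traefik v2’s KV ( e.g. Redis ) provider conventions, and both helper names are ours:

```python
import hashlib

def instance_hostname(instance_id: str, domain: str) -> str:
    # sha256(instance_id)[:12].<ctf_platform_domain>, per the scheme above
    prefix = hashlib.sha256(instance_id.encode()).hexdigest()[:12]
    return f"{prefix}.{domain}"

def traefik_kv_route(name: str, hostname: str,
                     backend_ip: str, backend_port: int) -> dict:
    """Key/value pairs in the shape Traefik's KV provider reads.
    Assumes Traefik v2's KV key layout (routers + services)."""
    return {
        f"traefik/http/routers/{name}/rule": f"Host(`{hostname}`)",
        f"traefik/http/routers/{name}/service": name,
        f"traefik/http/services/{name}/loadBalancer/servers/0/url":
            f"http://{backend_ip}:{backend_port}",
    }

host = instance_hostname("instance-42", "ctf.example.com")
route = traefik_kv_route("instance-42", host, "172.18.0.5", 80)
```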

What broke ( and how we fixed it )

Problem 1: Silent Deployment Failures

Early on, containers were failing to start, but the Orchestrator was reporting success. The issue? We weren’t checking the gRPC response properly:

if not running_container.success:
    raise RuntimeError(f"Failed to spawn container: {running_container.error_message}")

Now, if the worker reports success=False, the Orchestrator propagates the error up to the User. Simple, but easy to miss.

Problem 2: Docker Networking Hell

Getting Traefik to reliably discover containers on the shared Docker network took far longer than expected. Containers were starting, but Traefik couldn’t route to them.

The issue was that Traefik was looking on the wrong network. We had to explicitly configure:

network=self.network_name  # Must match Traefik's configured network
# self.network_name is read from an environment variable, initialized when the orchestrator spawns the worker

And ensure that Traefik was itself running on the same Docker network. Obvious in hindsight, nothing short of painful in practice.

Problem 3: CPU Limits don’t work in Docker-in-Docker

We’re running workers as Docker containers themselves ( for local testing ), but Docker-in-Docker doesn’t support nano_cpus limits properly. The inner containers inherit the outer container’s limits, not their own.

For now, we’ve accepted this limitation. On AWS EC2 instances ( where Docker runs natively ), CPU limits will work as intended. But it’s a reminder that abstractions leak, and testing in production-like environments matters.

What’s next

This post covered the orchestration core: worker discovery, container spawning, and state management. But we're not done. Next, we'll explore containment and resilience: isolation, resource control, and ensuring the system can shut down gracefully without abandoning active users.


Follow along as we continue building Poseidon. The code will be open-sourced soon after we test a deployed PoC.