
# Stakpak vs Claude Code

## Introduction

Engineers and founders run into the same question with AI agents:

Can agents reliably follow production workflows, or do they behave inconsistently, get lost mid-task, and burn time figuring things out?

Most general-purpose coding agents, including Claude Code, are great at reasoning and generation. But when it comes to shipping, knowing what to do isn’t enough. What matters is doing it the right way, in the right order, with the right constraints.

At [Stakpak](https://github.com/stakpak/agent), we built [Rulebooks](/docs/how-it-works/rulebooks.md) (Agent Skills) to solve this problem.

Rulebooks are markdown-based Standard Operating Procedures (SOPs) that encode how infrastructure work is actually done: not generic best practices, but guided, production-ready procedures.

But the real question isn’t what rulebooks are.\
It’s whether they measurably improve agent behavior.

So instead of relying on intuition, we ran controlled experiments to find out.

## The Experiment

**We ran** [**Stakpak**](#user-content-fn-1)[^1] **and Claude Code through 10 advanced infrastructure tasks:**

<details>

<summary>Monitoring &#x26; Alerting with Uptime Kuma</summary>

**Scenario:** Set up application monitoring with webhook alerts

**Success Criteria:**

* Uptime Kuma is running and accessible
* The web application is being monitored
* Webhook notifications are configured to alert on downtime
* Alerts are received when the server is taken down
* Task completes within the 30-minute timeout

</details>

<details>

<summary>End to End LLM Deployment</summary>

**Scenario:** Configure vLLM for OpenAI compatible API on CPU infrastructure

**Success Criteria:**

* Chose a suitable model size based on the machine's RAM (`cpus = 4`, `memory_mb = 8192`, `storage_mb = 20480`)
* Configured vLLM for the specified machine and tested a completion request
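
The model-size criterion comes down to simple arithmetic: the memory needed for weights is roughly the parameter count times the bytes per parameter. The sketch below illustrates that estimate only. The helper names, the dtype table, and the 50% usable-RAM budget are assumptions made for illustration, not part of the benchmark or of vLLM.

```python
# Rough check of whether a model's weights fit in RAM.
# Assumptions (not from the benchmark): helper names, the dtype table,
# and the 50% share of RAM reserved for weights.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_mb(num_params: int, dtype: str = "float32") -> float:
    """Approximate memory needed for the weights alone, in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


def fits_in_ram(num_params: int, memory_mb: int, dtype: str = "float32",
                usable_fraction: float = 0.5) -> bool:
    """True if the weights fit in the share of RAM left for them.

    The rest is assumed to go to the OS, the runtime, and the KV cache.
    """
    return weight_memory_mb(num_params, dtype) <= memory_mb * usable_fraction


if __name__ == "__main__":
    machine_memory_mb = 8192  # memory_mb from the task specification
    for label, params in [("125M", 125_000_000), ("2B", 2_000_000_000)]:
        print(label, fits_in_ram(params, machine_memory_mb))
```

Under these assumptions a 125M-parameter model fits comfortably at `float32`, while a 2B-parameter model at `float32` does not.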

</details>

<details>

<summary><strong>Coolify Self Hosted PaaS</strong></summary>

**Scenario:** Deploy FastAPI with Traefik reverse proxy and SSL certificates

**Success Criteria:**

* Created new EC2 instance with SSH key
* Installed Coolify with Traefik reverse proxy
* Configured DNS A record for `fastapi.guku.io`
* Built and deployed FastAPI container with Traefik labels
* Auto provisioned SSL certificate via Let's Encrypt

</details>

<details>

<summary>TanStack on AWS + Cloudflare CDN</summary>

**Scenario:** Deploy a TanStack app on AWS with Cloudflare as CDN

**Success Criteria:**

* All required credentials (AWS, Cloudflare, OpenWeatherMap API) are successfully read and accessible
* EC2 infrastructure is provisioned successfully (VPC, security group, key pair created)
* EC2 instance is launched and reachable
* Docker and all required dependencies are installed on the EC2 instance
* The weather application source code is cloned and built successfully with the API key configured
* The application container is running on the EC2 instance
* Cloudflared is installed and a Quick Tunnel is created successfully

</details>

<details>

<summary>Twelve Factor App Analysis</summary>

**Scenario:** Analyze a multi service app against all 12 factors, identify violations, and apply fixes

**Success Criteria:**

* A detailed JSON report is created at `/app/report.json`, containing an analysis of each service, compliance scores for all 12 factors, identified violations, and a proposed remediation plan.

</details>

<details>

<summary>Diagnose Production Performance Issues</summary>

**Scenario:** Analyze the performance of the API running at `http://localhost:8000` by profiling execution, memory usage, and load behavior using memray, perf, and k6. Summarize the findings and optimization recommendations in a diagnosis report.

**Success Criteria:**

* `memray` profiling results are analyzed for memory usage and allocation patterns
* `perf` analysis identifies CPU hotspots or inefficient execution paths
* `k6` load testing results capture latency, throughput, and error behavior
* A diagnosis report is written to `/app/diagnosis.md`
* The report clearly identifies performance bottlenecks and memory issues
* The report includes actionable optimization recommendations
* Task completes within the expected time limits (agent ≤ 500s, verifier ≤ 300s)

</details>

<details>

<summary>Serverless Deployment (Cloudflare Workers)</summary>

**Scenario:** Deploy a JavaScript-based Calculator API as a Cloudflare Worker by authenticating with Cloudflare, configuring Wrangler, deploying to the edge network, and verifying global API availability.

**Success Criteria:**

* The Cloudflare Worker is deployed successfully
* The deployment URL is written to `/app/deployment.json`
* The `/health` endpoint responds correctly
* The `/calculate` endpoint responds correctly
* The API is accessible via the deployed Cloudflare Worker URL
* Task completes within the expected time limits (agent ≤ 900s, verifier ≤ 900s)

</details>

<details>

<summary>MongoDB RPO / RTO Validation</summary>

**Scenario:** Implement and validate a MongoDB backup and restore strategy that meets strict RPO and RTO targets by simulating a database failure and measuring data loss and recovery time.

**Success Criteria:**

* A backup strategy is implemented for `mydb.events` on MongoDB
* The database is dropped and successfully restored from backup
* `/app/report.json` is generated with all required fields:
  * `pre_disaster_seq_id`
  * `post_restore_seq_id`
  * `records_lost`
  * `rpo_seconds`
  * `rto_seconds`
  * `rpo_pass`
  * `rto_pass`
* Calculated RPO is ≤ 60 seconds
* Calculated RTO is ≤ 120 seconds
* Both `rpo_pass` and `rto_pass` are `true`
* Task completes within the expected time limits (agent ≤ 900s, verifier ≤ 600s)
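
As a rough illustration of how the report fields relate to each other, the sketch below assembles them from measured sequence IDs and timestamps. It assumes that sequence IDs are increasing integers, that RPO is measured as the time between the last backup and the failure, and that RTO is measured from the failure to the end of the restore. The verifier's exact definitions are not given here, and `build_report` is a hypothetical helper.

```python
import json
from datetime import datetime

RPO_TARGET_SECONDS = 60   # from the success criteria
RTO_TARGET_SECONDS = 120  # from the success criteria


def build_report(pre_disaster_seq_id: int, post_restore_seq_id: int,
                 last_backup_at: datetime, disaster_at: datetime,
                 restore_done_at: datetime) -> dict:
    """Assemble the report fields (hypothetical helper, assumed definitions)."""
    # Assumed: RPO = age of the last backup at the moment of failure.
    rpo_seconds = (disaster_at - last_backup_at).total_seconds()
    # Assumed: RTO = time from failure until the restore finished.
    rto_seconds = (restore_done_at - disaster_at).total_seconds()
    return {
        "pre_disaster_seq_id": pre_disaster_seq_id,
        "post_restore_seq_id": post_restore_seq_id,
        "records_lost": pre_disaster_seq_id - post_restore_seq_id,
        "rpo_seconds": rpo_seconds,
        "rto_seconds": rto_seconds,
        "rpo_pass": rpo_seconds <= RPO_TARGET_SECONDS,
        "rto_pass": rto_seconds <= RTO_TARGET_SECONDS,
    }


if __name__ == "__main__":
    # Made-up measurements, for illustration only.
    report = build_report(
        pre_disaster_seq_id=1000,
        post_restore_seq_id=990,
        last_backup_at=datetime(2025, 1, 1, 12, 0, 0),
        disaster_at=datetime(2025, 1, 1, 12, 0, 30),
        restore_done_at=datetime(2025, 1, 1, 12, 1, 45),
    )
    print(json.dumps(report, indent=2))
```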

</details>

<details>

<summary>K3s Security &#x26; Compliance Audit</summary>

**Scenario:** Deploy a K3s cluster on an EC2 instance, deploy the provided web application and MongoDB manifests, then scan both container images and Kubernetes manifests for security issues using Trivy and Checkov. Generate JSON security reports on the EC2 instance.

**Success Criteria:**

* `trivy-mongo.json` is generated with vulnerability results for the `mongo:5.0` image
* `trivy-webapp.json` is generated with vulnerability results for the `nanajanashia/k8s-demo-app:v1.0` image
* `checkov-report.json` is generated with security and compliance findings for the Kubernetes manifests
* All three reports exist in `/app/security-reports/` in the local container
* Reports are valid JSON and contain scan results
* Task completes within the expected time limits (agent ≤ 1800s, verifier ≤ 300s)
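
The file-level criteria above (the reports exist, parse as JSON, and are not empty) could be checked with something like the following sketch. `check_reports` is a hypothetical helper written for illustration. It is not the benchmark's verifier, and it does not inspect the Trivy or Checkov schemas.

```python
import json
from pathlib import Path

# Report names taken from the success criteria.
EXPECTED_REPORTS = ["trivy-mongo.json", "trivy-webapp.json", "checkov-report.json"]


def check_reports(report_dir: str) -> dict:
    """Map each expected report to 'ok', 'missing', 'invalid JSON', or 'empty'."""
    status = {}
    for name in EXPECTED_REPORTS:
        path = Path(report_dir) / name
        if not path.is_file():
            status[name] = "missing"
            continue
        try:
            data = json.loads(path.read_text())
        except json.JSONDecodeError:
            status[name] = "invalid JSON"
            continue
        # An empty object or list parses fine but carries no scan results.
        status[name] = "ok" if data else "empty"
    return status


if __name__ == "__main__":
    print(check_reports("/app/security-reports"))
```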

</details>

<details>

<summary>HAProxy / Load Balancer Configuration</summary>

**Scenario:** Configure NGINX as a reverse proxy in front of multiple backend API servers and use HAProxy to load balance traffic across the NGINX instances with health checks, session affinity, and operational visibility.

**Success Criteria:**

* NGINX is running and proxying traffic to backend services on ports `8001`, `8002`, and `8003`
* HAProxy is running and load balancing traffic on port `80`
* Health checks are enabled and correctly monitor the `/health` endpoint
* Session affinity (sticky sessions) is enabled and functioning
* HAProxy stats page is accessible at `:8404/stats`
* Failover works when backend services become unhealthy
* `/app/report.json` contains:
  * `nginx_running: true`
  * `nginx_instances` equals the number of running backends
  * `haproxy_running: true`
  * `load_balancing_works: true`
  * `session_affinity_works: true`
  * `health_checks_work: true`
  * `stats_page_works: true`
* Task completes within the expected time limits (agent ≤ 900s, verifier ≤ 900s)
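
A check of the `/app/report.json` fields listed above might look like the sketch below. `validate_report` is a hypothetical helper, and the count of running backends is assumed to be supplied by the caller. The actual verifier may apply different or additional checks.

```python
# Boolean fields that the success criteria require to be true.
REQUIRED_TRUE = [
    "nginx_running",
    "haproxy_running",
    "load_balancing_works",
    "session_affinity_works",
    "health_checks_work",
    "stats_page_works",
]


def validate_report(report: dict, running_backends: int) -> list:
    """Return the names of fields that fail the criteria (empty list = pass)."""
    # Require a real boolean True, not just a truthy value such as "yes".
    failures = [key for key in REQUIRED_TRUE if report.get(key) is not True]
    if report.get("nginx_instances") != running_backends:
        failures.append("nginx_instances")
    return failures


if __name__ == "__main__":
    # Made-up report, for illustration only.
    example = {key: True for key in REQUIRED_TRUE}
    example["nginx_instances"] = 3
    print(validate_report(example, running_backends=3))
```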

</details>

Each scenario was run four times, testing Stakpak and Claude Code with both Claude Opus 4.5 and Claude Haiku 4.5.

{% hint style="success" %}
These rulebooks are now available in Stakpak.
{% endhint %}

### The Results

#### Stakpak[^2] vs Claude Code (Model: Opus 4.5)

ADD THE NEW TABLE HERE

#### Stakpak[^2] vs Claude Code (Model: Haiku 4.5)

### How Rulebooks Changed Everything

Success rates jumped from 0–13% to 100%, execution became faster and cheaper, and Stakpak stopped “figuring things out” and started following how things are actually done.

### Why Stakpak Rulebooks Matter

Every organization has tribal knowledge, the things senior engineers "just know" that aren't documented anywhere reliable:

* "On CPU instances, use opt-125m with --enforce-eager"
* "Coolify needs Traefik labels for SSL to work"
* "The Uptime Kuma UI setup must happen before webhook config"
* "For 8GB RAM, avoid models over 2B parameters"

This knowledge typically lives in:

* Senior engineers' heads
* Scattered Slack conversations
* Outdated wiki pages that no one updates

Stakpak rulebooks formalize tribal knowledge into executable procedures.

When that senior engineer is on vacation or leaves the company, Stakpak still knows what to do. When a new team member joins, they inherit decades of operational wisdom through Stakpak's rulebook system.

#### TLDR

AI agents fail in production not because they’re “dumb,” but because they rely on trial and error and guessing instead of deterministic execution.

We ran five real-world DevOps scenarios with Stakpak, once without rulebooks and once with them.

**Results:**

* Success rates jumped from 0–13% → 100%
* Tasks finished faster and cheaper
* Agents stopped guessing and started following proven workflows

**Why?**\
Rulebooks turn tribal knowledge (the stuff senior engineers “just know”) into executable, repeatable instructions.

With rulebooks, Stakpak doesn’t improvise; it operates the way your team does, every time.

#### **Ready to turn your team’s operational knowledge into something reusable?**

Check [How to Write a Rulebook?](/docs/how-it-works/rulebooks/how-to-write-a-rulebook.md) to create rulebooks that encode how your team operates, or explore community-contributed [Paks](https://paks.stakpak.dev/) for battle-tested, reusable infrastructure patterns.

[^1]: With Rulebooks (Agent Skills)

[^2]: Stakpak with Rulebooks