Stakpak vs Claude Code

Introduction

Engineers and founders run into the same question with AI Agents:

Can agents reliably follow production workflows, or do they behave inconsistently, get lost mid-task, and burn time figuring things out?

Most general-purpose coding agents, including Claude Code, are great at reasoning and generation. But when it comes to shipping, knowing what to do isn’t enough. What matters is doing it the right way, in the right order, with the right constraints.

At Stakpak, we built Rulebooks (Agent Skills) to solve this problem.

Rulebooks are markdown-based Standard Operating Procedures (SOPs) that encode how infrastructure work is actually done: not generic best practices, but guided execution and production-ready procedures.
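As an illustration of the format (this is not an actual Stakpak rulebook; the structure, headings, and commands are assumptions for the sake of example), a rulebook might encode a procedure like this:

```markdown
# Deploy vLLM on CPU-only instances

## When to use
Serving an OpenAI-compatible API on machines without a GPU.

## Constraints
- With 8 GB of RAM, prefer models under ~2B parameters.
- Always pass `--enforce-eager` on CPU instances.

## Steps
1. Pick a model that fits memory (e.g. `facebook/opt-125m`).
2. Start the server: `vllm serve facebook/opt-125m --enforce-eager`.
3. Verify with a test completion request against `/v1/completions`.
```

The point is that constraints and ordering live next to the steps, so the agent never has to rediscover them.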

But the real question isn’t what rulebooks are. It’s whether they measurably improve agent behavior.

So instead of relying on intuition, we ran controlled experiments to find out.

The Experiment

We ran Stakpak and Claude Code through 10 advanced infrastructure tasks:

Monitoring & Alerting with Uptime Kuma

Scenario: Set up application monitoring with webhook alerts

Success Criteria:

  • Uptime Kuma is running and accessible

  • The web application is being monitored

  • Webhook notifications are configured to alert on downtime

  • Shutting down the server confirms that downtime alerts are received

  • Completed within the 30-minute timeout
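To make the alert check concrete: a throwaway webhook receiver is enough to confirm that notifications actually arrive when the monitored server goes down. A minimal sketch using only the Python standard library (the payload shape is an assumption; Uptime Kuma's real webhook body depends on its notification configuration):

```python
import http.server
import json
import threading

received = []  # alert payloads captured by the receiver


class WebhookHandler(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and store the JSON body Uptime Kuma would POST on downtime
        length = int(self.headers.get("Content-Length", 0))
        received.append(json.loads(self.rfile.read(length)))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging


# Bind an ephemeral port and serve in the background
server = http.server.HTTPServer(("127.0.0.1", 0), WebhookHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
print(f"webhook receiver listening on port {server.server_address[1]}")
```

Point the Uptime Kuma webhook notification at this address, stop the monitored server, and check that `received` is non-empty.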

End-to-End LLM Deployment

Scenario: Configure vLLM for OpenAI compatible API on CPU infrastructure

Success Criteria:

  • Chose a suitable model size based on RAM requirements (cpus = 4, memory_mb = 8192, storage_mb = 20480)

  • Configured vLLM for the specified machine and tested a completion request
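The model-size decision in the first criterion amounts to a back-of-the-envelope memory check. A sketch of that arithmetic (the overhead multiplier and OS headroom are illustrative assumptions, not vLLM's actual memory model):

```python
def fits_in_ram(params_billion: float, ram_gb: float = 8.0,
                dtype_bytes: int = 2, overhead: float = 1.5) -> bool:
    """Rough check: do the model weights plus runtime overhead fit in RAM?

    params_billion * dtype_bytes ~= weight size in GB for fp16/bf16 weights;
    `overhead` is a crude multiplier for KV cache and runtime allocations.
    """
    weights_gb = params_billion * dtype_bytes
    return weights_gb * overhead <= ram_gb * 0.8  # keep ~20% headroom for the OS


# On the 8 GB test machine, a 125M model easily fits; a 7B model does not.
print(fits_in_ram(0.125), fits_in_ram(7.0))  # → True False
```

Under these assumptions, anything much beyond 2B parameters at fp16 is already borderline on 8 GB, which matches the "avoid models over 2B parameters" rule quoted later in this post.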

Coolify Self-Hosted PaaS

Scenario: Deploy FastAPI with Traefik reverse proxy and SSL certificates

Success Criteria:

  • Created new EC2 instance with SSH key

  • Installed Coolify with Traefik reverse proxy

  • Configured DNS A record for fastapi.guku.io

  • Built and deployed FastAPI container with Traefik labels

  • Auto provisioned SSL certificate via Let's Encrypt
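For context on the Traefik-labels criterion: SSL auto-provisioning in this kind of setup hinges on container labels. A hedged compose sketch (the service name, port, and certificate-resolver name are assumptions; Coolify generates its own labels in practice):

```yaml
services:
  fastapi:
    build: .
    labels:
      - traefik.enable=true
      - traefik.http.routers.fastapi.rule=Host(`fastapi.guku.io`)
      - traefik.http.routers.fastapi.entrypoints=websecure
      - traefik.http.routers.fastapi.tls.certresolver=letsencrypt
      - traefik.http.services.fastapi.loadbalancer.server.port=8000
```

If the router rule or resolver label is missing, the Let's Encrypt certificate is never requested, which is exactly the kind of ordering detail a rulebook captures.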

TanStack on AWS + Cloudflare CDN

Scenario: Deploy a TanStack app on AWS with Cloudflare as CDN

Success Criteria:

  • All required credentials (AWS, Cloudflare, OpenWeatherMap API) are successfully read and accessible

  • EC2 infrastructure is provisioned successfully (VPC, security group, key pair created)

  • EC2 instance is launched and reachable

  • Docker and all required dependencies are installed on the EC2 instance

  • The weather application source code is cloned and built successfully with the API key configured

  • The application container is running on the EC2 instance

  • Cloudflared is installed and a Quick Tunnel is created successfully
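The Quick Tunnel step in the last criterion is a single command; the local port below is an assumption based on a typical app setup:

```shell
# Expose the locally running container through a Cloudflare Quick Tunnel;
# cloudflared prints a public https://<random>.trycloudflare.com URL
cloudflared tunnel --url http://localhost:3000
```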

Twelve-Factor App Analysis

Scenario: Analyze a multi service app against all 12 factors, identify violations, and apply fixes

Success Criteria:

  • A detailed JSON report exists at /app/report.json containing per-service analysis, compliance scores for all 12 factors, identified violations, and a proposed remediation plan
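The task doesn't prescribe a schema beyond those fields, so the report structure is open-ended. A minimal sketch of assembling such a file (the factor keys, score scale, and remediation wording are all assumptions):

```python
import json

# The twelve factors, as report keys (naming is an assumption)
FACTORS = [
    "codebase", "dependencies", "config", "backing_services",
    "build_release_run", "processes", "port_binding", "concurrency",
    "disposability", "dev_prod_parity", "logs", "admin_processes",
]


def build_report(services: dict) -> dict:
    """Assemble a per-service twelve-factor report with scores and violations."""
    report = {"services": {}}
    for name, findings in services.items():
        scores = {f: findings.get(f, {}).get("score", 0) for f in FACTORS}
        violations = [f for f, s in scores.items() if s < 100]
        report["services"][name] = {
            "compliance_scores": scores,
            "violations": violations,
            "remediation_plan": [f"Fix {f} compliance" for f in violations],
        }
    return report


report = build_report({"api": {"config": {"score": 40}, "logs": {"score": 100}}})
with open("report.json", "w") as fh:  # the task writes to /app/report.json
    json.dump(report, fh, indent=2)
```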

Diagnose Production Performance Issues

Scenario: Analyze the performance of the API running at http://localhost:8000 by profiling execution, memory usage, and load behavior using memray, perf, and k6. Summarize the findings and optimization recommendations in a diagnosis report.

Success Criteria:

  • memray profiling results are analyzed for memory usage and allocation patterns

  • perf analysis identifies CPU hotspots or inefficient execution paths

  • k6 load testing results capture latency, throughput, and error behavior

  • A diagnosis report is written to /app/diagnosis.md

  • The report clearly identifies performance bottlenecks and memory issues

  • The report includes actionable optimization recommendations

  • Task completes within the expected time limits (agent ≤ 500s, verifier ≤ 300s)
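The three tools in this scenario are typically invoked along these lines; the script names, process filter, and durations are placeholders, not the task's actual files:

```shell
# Memory profiling: record allocations, then render a flamegraph
memray run -o profile.bin app.py
memray flamegraph profile.bin

# CPU profiling: sample the running API process, then inspect hotspots
perf record -g -p "$(pgrep -f uvicorn)" -- sleep 30
perf report --stdio | head -50

# Load testing: k6 script hitting http://localhost:8000
k6 run --vus 20 --duration 60s load-test.js
```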

Serverless Deployment (Cloudflare Workers)

Scenario: Deploy a JavaScript-based Calculator API as a Cloudflare Worker by authenticating with Cloudflare, configuring Wrangler, deploying to the edge network, and verifying global API availability.

Success Criteria:

  • The Cloudflare Worker is deployed successfully

  • The deployment URL is written to /app/deployment.json

  • The /health endpoint responds correctly

  • The /calculate endpoint responds correctly

  • The API is accessible via the deployed Cloudflare Worker URL

  • Task completes within the expected time limits (agent ≤ 900s, verifier ≤ 900s)
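The deploy-and-verify loop for this scenario typically looks like the following; the Worker URL and the /calculate query parameters are placeholders, since the task doesn't specify the API's exact contract:

```shell
# Authenticate and deploy the Worker defined by wrangler.toml
wrangler login
wrangler deploy

# Record the deployment URL, then verify both endpoints
echo '{"url": "https://calculator-api.example.workers.dev"}' > /app/deployment.json
curl -fsS https://calculator-api.example.workers.dev/health
curl -fsS "https://calculator-api.example.workers.dev/calculate?op=add&a=2&b=3"
```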

MongoDB RPO / RTO Validation

Scenario: Implement and validate a MongoDB backup and restore strategy that meets strict RPO and RTO targets by simulating a database failure and measuring data loss and recovery time.

Success Criteria:

  • A backup strategy is implemented for mydb.events on MongoDB

  • The database is dropped and successfully restored from backup

  • /app/report.json is generated with all required fields:

    • pre_disaster_seq_id

    • post_restore_seq_id

    • records_lost

    • rpo_seconds

    • rto_seconds

    • rpo_pass

    • rto_pass

  • Calculated RPO is ≤ 60 seconds

  • Calculated RTO is ≤ 120 seconds

  • Both rpo_pass and rto_pass are true

  • Task completes within the expected time limits (agent ≤ 900s, verifier ≤ 600s)
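The arithmetic behind those report fields can be sketched directly. In this sketch the RPO is estimated from lost records and write rate, and the RTO is the measured recovery time; the sequence IDs and rates are illustrative, not from the actual task harness:

```python
def rpo_rto_report(pre_disaster_seq_id: int, post_restore_seq_id: int,
                   seconds_per_record: float, recovery_seconds: float,
                   rpo_target: float = 60, rto_target: float = 120) -> dict:
    """Derive the report fields from before/after sequence IDs.

    records_lost counts writes that never made it into the backup;
    rpo_seconds converts that to time using the observed write rate.
    """
    records_lost = pre_disaster_seq_id - post_restore_seq_id
    rpo_seconds = records_lost * seconds_per_record
    return {
        "pre_disaster_seq_id": pre_disaster_seq_id,
        "post_restore_seq_id": post_restore_seq_id,
        "records_lost": records_lost,
        "rpo_seconds": rpo_seconds,
        "rto_seconds": recovery_seconds,
        "rpo_pass": rpo_seconds <= rpo_target,
        "rto_pass": recovery_seconds <= rto_target,
    }


# e.g. 30 records lost at 1 write/sec, restored in 95 seconds
print(rpo_rto_report(1030, 1000, 1.0, 95))
```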

K3s Security & Compliance Audit

Scenario: Deploy a K3s cluster on an EC2 instance, deploy the provided web application and MongoDB manifests, then scan both container images and Kubernetes manifests for security issues using Trivy and Checkov. Generate JSON security reports on the EC2 instance.

Success Criteria:

  • trivy-mongo.json is generated with vulnerability results for the mongo:5.0 image

  • trivy-webapp.json is generated with vulnerability results for the nanajanashia/k8s-demo-app:v1.0 image

  • checkov-report.json is generated with security and compliance findings for the Kubernetes manifests

  • All three reports exist in /app/security-reports/ in the local container

  • Reports are valid JSON and contain scan results

  • Task completes within the expected time limits (agent ≤ 1800s, verifier ≤ 300s)
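The scan commands behind those criteria look roughly like this; the manifest directory path is an assumption:

```shell
# Image vulnerability scans (JSON output)
trivy image --format json --output /app/security-reports/trivy-mongo.json mongo:5.0
trivy image --format json --output /app/security-reports/trivy-webapp.json \
  nanajanashia/k8s-demo-app:v1.0

# Manifest misconfiguration scan with Checkov
checkov -d ./manifests -o json > /app/security-reports/checkov-report.json

# Sanity check: every report must parse as JSON
for f in /app/security-reports/*.json; do jq -e . "$f" > /dev/null; done
```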

HAProxy / Load Balancer Configuration

Scenario: Configure NGINX as a reverse proxy in front of multiple backend API servers and use HAProxy to load balance traffic across the NGINX instances with health checks, session affinity, and operational visibility.

Success Criteria:

  • NGINX is running and proxying traffic to backend services on ports 8001, 8002, and 8003

  • HAProxy is running and load balancing traffic on port 80

  • Health checks are enabled and correctly monitor the /health endpoint

  • Session affinity (sticky sessions) is enabled and functioning

  • HAProxy stats page is accessible at :8404/stats

  • Failover works when backend services become unhealthy

  • /app/report.json contains:

    • nginx_running: true

    • nginx_instances equals the number of running backends

    • haproxy_running: true

    • load_balancing_works: true

    • session_affinity_works: true

    • health_checks_work: true

    • stats_page_works: true

  • Task completes within the expected time limits (agent ≤ 900s, verifier ≤ 900s)
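A haproxy.cfg skeleton covering the health-check, sticky-session, and stats criteria might look like this. The NGINX instance ports and server names are assumptions (the task only fixes the backend ports 8001–8003 and the stats port):

```
frontend http_in
    bind *:80
    default_backend nginx_pool

backend nginx_pool
    option httpchk GET /health
    cookie SRV insert indirect nocache        # sticky sessions via cookie
    server nginx1 127.0.0.1:8081 check cookie n1
    server nginx2 127.0.0.1:8082 check cookie n2
    server nginx3 127.0.0.1:8083 check cookie n3

listen stats
    bind *:8404
    stats enable
    stats uri /stats
```

The `check` keyword drives failover: when a server stops answering the `/health` probe, HAProxy pulls it out of rotation automatically.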

Each scenario was run four times: Stakpak and Claude Code, each tested with both Claude Opus 4.5 and Claude Haiku 4.5.


The Results

Stakpak Vs Claude Code (Model: Opus 4.5)

ADD THE NEW TABLE HERE

Stakpak Vs Claude Code (Model: Haiku 4.5)

How Rulebooks Changed Everything

Success rates jumped from 0–13% to 100%, execution became faster and cheaper, and Stakpak stopped “figuring things out” and started following how things are actually done.

Why Stakpak Rulebooks Matter

Every organization has tribal knowledge: the things senior engineers "just know" but that aren't reliably documented anywhere:

  • "On CPU instances, use opt-125m with --enforce-eager"

  • "Coolify needs Traefik labels for SSL to work"

  • "The Uptime Kuma UI setup must happen before webhook config"

  • "For 8GB RAM, avoid models over 2B parameters"

This knowledge typically lives in:

  • Senior engineers' heads

  • Scattered Slack conversations

  • Outdated wiki pages that no one updates

Stakpak rulebooks formalize tribal knowledge into executable procedures.

When that senior engineer is on vacation or leaves the company, Stakpak still knows what to do. When a new team member joins, they inherit decades of operational wisdom through Stakpak's rulebook system.

TLDR

AI agents fail in production not because they’re “dumb,” but because they rely on trial and error and guessing instead of deterministic execution.

We ran ten real-world DevOps scenarios, comparing Claude Code on its own against Stakpak running with rulebooks.

Results:

  • Success rates jumped from 0–13% → 100%

  • Tasks finished faster and cheaper

  • Agents stopped guessing and started following proven workflows

Why? Rulebooks turn tribal knowledge (the stuff senior engineers “just know”) into executable, repeatable instructions.

With rulebooks, Stakpak doesn’t improvise; it operates the way your team does, every time.

Ready to turn your team’s operational knowledge into something reusable?

Check out How to Write a Rulebook to create rulebooks that encode how your team operates, or explore community-contributed Paks for battle-tested, reusable infrastructure patterns.
