Stakpak Vs Claude Code
Introduction
Engineers and founders run into the same question with AI Agents:
Can agents reliably follow production workflows, or do they behave inconsistently, get lost mid task, and burn time figuring things out?
Most general-purpose coding agents, including Claude Code, are great at reasoning and generation. But when it comes to shipping, knowing what to do isn’t enough: what matters is doing it the right way, in the right order, with the right constraints.
At Stakpak, we built Rulebooks (Agent Skills) to solve this problem.
Rulebooks are markdown-based Standard Operating Procedures (SOPs) that encode how infrastructure work is actually done: not generic best practices, but guided, production-ready procedures.
But the real question isn’t what rulebooks are. It’s whether they measurably improve agent behavior.
So instead of relying on intuition, we ran controlled experiments to find out.
The Experiment
We ran Stakpak and Claude Code through 10 advanced infrastructure tasks:
Monitoring & Alerting with Uptime Kuma
Scenario: Set up application monitoring with webhook alerts
Success Criteria:
Uptime Kuma is running and accessible
The web application is being monitored
Webhook notifications are configured to alert on downtime
Shutting down the server and checking that alerts are received
Completed within the 30-minute timeout
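For reference, a minimal Uptime Kuma setup of the kind this scenario expects can be sketched with Docker. The container names and webhook target below are illustrative assumptions, not the exact setup used in the benchmark:

```shell
# Run Uptime Kuma (official image) with persistent data on port 3001
docker run -d --name uptime-kuma -p 3001:3001 \
  -v uptime-kuma-data:/app/data louislam/uptime-kuma:1

# Monitors and webhook notifications are configured in the web UI at
# http://localhost:3001. To test alerting end to end, stop the monitored
# app and confirm the webhook receiver gets a POST:
docker stop my-web-app   # hypothetical container name for the monitored app
```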
End to End LLM Deployment
Scenario: Configure vLLM for OpenAI compatible API on CPU infrastructure
Success Criteria:
Chose a suitable model size based on RAM requirements (cpus = 4, memory_mb = 8192, storage_mb = 20480)
Configured vLLM for the specified machine and tested a completion request
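A CPU-oriented vLLM setup along these lines might look like the following sketch; the model choice (`facebook/opt-125m`, small enough for 8 GB of RAM) and port are illustrative assumptions:

```shell
# Serve a small model with vLLM's OpenAI-compatible server
# (--enforce-eager disables CUDA graph capture, which suits CPU-only hosts)
vllm serve facebook/opt-125m --enforce-eager --port 8000

# Test a completion request against the OpenAI-compatible endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-125m", "prompt": "Hello,", "max_tokens": 16}'
```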
Coolify Self Hosted PaaS
Scenario: Deploy FastAPI with Traefik reverse proxy and SSL certificates
Success Criteria:
Created new EC2 instance with SSH key
Installed Coolify with Traefik reverse proxy
Configured DNS A record for `fastapi.guku.io`
Built and deployed FastAPI container with Traefik labels
Auto provisioned SSL certificate via Let's Encrypt
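The "Traefik labels" referenced above conventionally take the following shape in a Docker Compose file (written here as a shell heredoc for illustration; the router name, internal port, and `letsencrypt` certresolver name are assumptions, and Coolify normally injects equivalent labels itself):

```shell
cat > docker-compose.yml <<'EOF'
services:
  fastapi:
    build: .
    labels:
      - traefik.enable=true
      - traefik.http.routers.fastapi.rule=Host(`fastapi.guku.io`)
      - traefik.http.routers.fastapi.entrypoints=websecure
      # Traefik requests the Let's Encrypt certificate via this resolver
      - traefik.http.routers.fastapi.tls.certresolver=letsencrypt
      - traefik.http.services.fastapi.loadbalancer.server.port=8000
EOF
```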
TanStack on AWS + Cloudflare CDN
Scenario: Deploy a TanStack app on AWS with Cloudflare as CDN
Success Criteria:
All required credentials (AWS, Cloudflare, OpenWeatherMap API) are successfully read and accessible
EC2 infrastructure is provisioned successfully (VPC, security group, key pair created)
EC2 instance is launched and reachable
Docker and all required dependencies are installed on the EC2 instance
The weather application source code is cloned and built successfully with the API key configured
The application container is running on the EC2 instance
Cloudflared is installed and a Quick Tunnel is created successfully
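The Quick Tunnel step can be sketched as follows (Debian/Ubuntu install path and the local app port are assumptions):

```shell
# Install cloudflared and open a Quick Tunnel pointing at the app container;
# cloudflared prints a random *.trycloudflare.com URL on startup
curl -L -o cloudflared.deb \
  https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
sudo dpkg -i cloudflared.deb
cloudflared tunnel --url http://localhost:3000
```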
Twelve Factor App Analysis
Scenario: Analyze a multi service app against all 12 factors, identify violations, and apply fixes
Success Criteria:
A detailed JSON report is created at `/app/report.json` containing analysis for each service, compliance scores for all 12 factors, identified violations, and a proposed remediation plan
Diagnose Production Performance Issues
Scenario: Analyze the performance of the API running at http://localhost:8000 by profiling execution, memory usage, and load behavior using memray, perf, and k6. Summarize the findings and optimization recommendations in a diagnosis report.
Success Criteria:
`memray` profiling results are analyzed for memory usage and allocation patterns
`perf` analysis identifies CPU hotspots or inefficient execution paths
`k6` load testing results capture latency, throughput, and error behavior
A diagnosis report is written to `/app/diagnosis.md`
The report clearly identifies performance bottlenecks and memory issues
The report includes actionable optimization recommendations
Task completes within the expected time limits (agent ≤ 500s, verifier ≤ 300s)
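The three tools named in the scenario are typically invoked along these lines (the script name, process match, and durations are placeholders):

```shell
# Memory profiling with memray: record allocations, then render a flamegraph
memray run -o profile.bin app.py
memray flamegraph profile.bin            # writes an HTML flamegraph

# CPU profiling with perf: sample the running API process with call graphs
perf record -g -p "$(pgrep -f uvicorn)" -- sleep 30
perf report --stdio | head -50

# Load testing with k6 against the API
k6 run --vus 20 --duration 60s loadtest.js   # loadtest.js is a placeholder script
```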
Serverless Deployment (Cloudflare Workers)
Scenario: Deploy a JavaScript-based Calculator API as a Cloudflare Worker by authenticating with Cloudflare, configuring Wrangler, deploying to the edge network, and verifying global API availability.
Success Criteria:
The Cloudflare Worker is deployed successfully
The deployment URL is written to `/app/deployment.json`
The `/health` endpoint responds correctly
The `/calculate` endpoint responds correctly
The API is accessible via the deployed Cloudflare Worker URL
Task completes within the expected time limits (agent ≤ 900s, verifier ≤ 900s)
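A Wrangler-based deployment of this shape can be sketched as follows (the worker URL shown is a placeholder, not the benchmark's actual deployment):

```shell
# Authenticate and deploy with Wrangler (Cloudflare's Workers CLI)
npm install -g wrangler
wrangler login                    # or set CLOUDFLARE_API_TOKEN for non-interactive use
wrangler deploy                   # reads wrangler.toml in the project root

# Verify the deployed endpoints (URL below is a placeholder)
curl https://calculator-api.example.workers.dev/health
```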
MongoDB RPO / RTO Validation
Scenario: Implement and validate a MongoDB backup and restore strategy that meets strict RPO and RTO targets by simulating a database failure and measuring data loss and recovery time.
Success Criteria:
A backup strategy is implemented for `mydb.events` on MongoDB
The database is dropped and successfully restored from backup
`/app/report.json` is generated with all required fields: `pre_disaster_seq_id`, `post_restore_seq_id`, `records_lost`, `rpo_seconds`, `rto_seconds`, `rpo_pass`, `rto_pass`
Calculated RPO is ≤ 60 seconds
Calculated RTO is ≤ 120 seconds
Both `rpo_pass` and `rto_pass` are `true`
Task completes within the expected time limits (agent ≤ 900s, verifier ≤ 600s)
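A `mongodump`/`mongorestore` strategy that fits these criteria can be sketched as follows (the connection URI and backup path are assumptions; the 60-second RPO target implies backing up at least that often):

```shell
# Back up the events collection (run on a schedule, e.g. every 60s, to meet RPO)
mongodump --uri="mongodb://localhost:27017" --db=mydb --collection=events \
  --out=/backups/latest

# Simulate the disaster by dropping the database
mongosh --eval 'db.getSiblingDB("mydb").dropDatabase()'

# Restore from the latest backup; the elapsed time feeds the RTO measurement
time mongorestore --uri="mongodb://localhost:27017" /backups/latest
```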
K3s Security & Compliance Audit
Scenario: Deploy a K3s cluster on an EC2 instance, deploy the provided web application and MongoDB manifests, then scan both container images and Kubernetes manifests for security issues using Trivy and Checkov. Generate JSON security reports on the EC2 instance
Success Criteria:
`trivy-mongo.json` is generated with vulnerability results for the `mongo:5.0` image
`trivy-webapp.json` is generated with vulnerability results for the `nanajanashia/k8s-demo-app:v1.0` image
`checkov-report.json` is generated with security and compliance findings for the Kubernetes manifests
All three reports exist in `/app/security-reports/` in the local container
Reports are valid JSON and contain scan results
Task completes within the expected time limits (agent ≤ 1800s, verifier ≤ 300s)
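The report-generation step maps onto Trivy's and Checkov's JSON output modes roughly as follows (the manifests directory is an assumption):

```shell
mkdir -p /app/security-reports

# Scan both container images with Trivy, writing JSON reports
trivy image --format json --output /app/security-reports/trivy-mongo.json mongo:5.0
trivy image --format json --output /app/security-reports/trivy-webapp.json \
  nanajanashia/k8s-demo-app:v1.0

# Scan the Kubernetes manifests with Checkov (JSON goes to stdout)
checkov --directory ./manifests --output json \
  > /app/security-reports/checkov-report.json
```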
HAProxy / Load Balancer Configuration
Scenario: Configure NGINX as a reverse proxy in front of multiple backend API servers and use HAProxy to load balance traffic across the NGINX instances with health checks, session affinity, and operational visibility.
Success Criteria:
NGINX is running and proxying traffic to backend services on ports `8001`, `8002`, and `8003`
HAProxy is running and load balancing traffic on port `80`
Health checks are enabled and correctly monitor the `/health` endpoint
Session affinity (sticky sessions) is enabled and functioning
HAProxy stats page is accessible at `:8404/stats`
Failover works when backend services become unhealthy
`/app/report.json` contains: `nginx_running: true`, `nginx_instances` equals the number of running backends, `haproxy_running: true`, `load_balancing_works: true`, `session_affinity_works: true`, `health_checks_work: true`, `stats_page_works: true`
Task completes within the expected time limits (agent ≤ 900s, verifier ≤ 900s)
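A minimal `haproxy.cfg` covering health checks, cookie-based sticky sessions, and the stats page might look like this sketch (server addresses, cookie name, and timeouts are assumptions; the servers should point at the NGINX instances being balanced):

```shell
cat > haproxy.cfg <<'EOF'
defaults
  mode http
  timeout connect 5s
  timeout client  30s
  timeout server  30s

frontend web
  bind *:80
  default_backend nginx_pool

backend nginx_pool
  balance roundrobin
  option httpchk GET /health          # active health checks against /health
  cookie SRV insert indirect nocache  # sticky sessions via inserted cookie
  server nginx1 127.0.0.1:8081 check cookie nginx1
  server nginx2 127.0.0.1:8082 check cookie nginx2
  server nginx3 127.0.0.1:8083 check cookie nginx3

listen stats
  bind *:8404
  stats enable
  stats uri /stats
EOF
```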
Each scenario was run four times, testing Stakpak and Claude Code with both Claude Opus 4.5 and Claude Haiku 4.5.
These rulebooks are now available in Stakpak.
The Results
Stakpak Vs Claude Code (Model: Opus 4.5)
ADD THE NEW TABLE HERE
Stakpak Vs Claude Code (Model: Haiku 4.5)
How Rulebooks Changed Everything
Success rates jumped from 0–13% to 100%, execution became faster and cheaper, and Stakpak stopped “figuring things out” and started following how things are actually done.
Why Stakpak Rulebooks Matter
Every organization has tribal knowledge: things senior engineers "just know" that aren't documented anywhere reliable:
"On CPU instances, use opt-125m with --enforce-eager"
"Coolify needs Traefik labels for SSL to work"
"The Uptime Kuma UI setup must happen before webhook config"
"For 8GB RAM, avoid models over 2B parameters"
This knowledge typically lives in:
Senior engineers' heads
Scattered Slack conversations
Outdated wiki pages that no one updates
Stakpak rulebooks formalize tribal knowledge into executable procedures.
When that senior engineer is on vacation or leaves the company, Stakpak still knows what to do. When a new team member joins, they inherit decades of operational wisdom through Stakpak's rulebook system.
TLDR
AI agents fail in production not because they’re “dumb,” but because they rely on trial and error and guessing instead of deterministic execution.
We ran ten real-world DevOps scenarios, comparing Claude Code without rulebooks against Stakpak with them.
Results:
Success rates jumped from 0–13% → 100%
Tasks finished faster and cheaper
Agents stopped guessing and started following proven workflows
Why? Rulebooks turn tribal knowledge (the stuff senior engineers “just know”) into executable, repeatable instructions.
With rulebooks, Stakpak doesn’t improvise; it operates the way your team does, every time.
Ready to turn your team’s operational knowledge into something reusable?
Check out How to Write a Rulebook to create rulebooks that encode how your team operates, or explore community-contributed Paks for battle-tested, reusable infrastructure patterns.