Stakpak Vs Claude Code
Introduction
Engineers and founders run into the same question with AI Agents:
Can agents reliably follow production workflows, or do they behave inconsistently, get lost mid task, and burn time figuring things out?
Most general-purpose coding agents, including Claude Code, are great at reasoning and generation. But when it comes to shipping, knowing what to do isn’t enough: what matters is doing it the right way, in the right order, with the right constraints.
At Stakpak, we built Rulebooks (Agent Skills) to solve this problem.
Rulebooks are markdown-based Standard Operating Procedures (SOPs) that encode how infrastructure work is actually done: not generic best practices, but guided, production-ready procedures.
But the real question isn’t what rulebooks are. It’s whether they measurably improve agent behavior.
So instead of relying on intuition, we ran controlled experiments to find out.
The Experiment
We ran Stakpak and Claude Code through 10 advanced infrastructure tasks:
Monitoring & Alerting with Uptime Kuma
Scenario: Set up application monitoring with webhook alerts
Success Criteria:
Uptime Kuma is running and accessible
The web application is being monitored
Webhook notifications are configured to alert on downtime
Shutting down the server and checking that alerts are received
Completed within the 30-minute timeout
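For reference, a minimal Uptime Kuma setup of the kind this scenario expects can be sketched with Docker. The container names and webhook target below are illustrative assumptions, not the exact setup used in the benchmark:

```shell
# Run Uptime Kuma (official image) with persistent data on port 3001
docker run -d --name uptime-kuma -p 3001:3001 \
  -v uptime-kuma-data:/app/data louislam/uptime-kuma:1

# Monitors and webhook notifications are configured in the web UI at
# http://localhost:3001. To test alerting end to end, stop the monitored
# app and confirm the webhook receiver gets a POST:
docker stop my-web-app   # hypothetical container name for the monitored app
```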
End to End LLM Deployment
Scenario: Configure vLLM for OpenAI compatible API on CPU infrastructure
Success Criteria:
Chose a suitable model size based on RAM requirements (cpus = 4, memory_mb = 8192, storage_mb = 20480)
Configured vLLM for the specified machine and tested a completion request
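A CPU-oriented vLLM setup along these lines might look like the following sketch; the model choice (`facebook/opt-125m`, small enough for 8 GB of RAM) and port are illustrative assumptions:

```shell
# Serve a small model with vLLM's OpenAI-compatible server
# (--enforce-eager disables CUDA graph capture, which suits CPU-only hosts)
vllm serve facebook/opt-125m --enforce-eager --port 8000

# Test a completion request against the OpenAI-compatible endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-125m", "prompt": "Hello,", "max_tokens": 16}'
```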
Coolify Self Hosted PaaS
Scenario: Deploy FastAPI with Traefik reverse proxy and SSL certificates
Success Criteria:
Created new EC2 instance with SSH key
Installed Coolify with Traefik reverse proxy
Configured DNS A record for `fastapi.guku.io`
Built and deployed FastAPI container with Traefik labels
Auto provisioned SSL certificate via Let's Encrypt
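The "Traefik labels" referenced above conventionally take the following shape in a Docker Compose file (written here as a shell heredoc for illustration; the router name, internal port, and `letsencrypt` certresolver name are assumptions, and Coolify normally injects equivalent labels itself):

```shell
cat > docker-compose.yml <<'EOF'
services:
  fastapi:
    build: .
    labels:
      - traefik.enable=true
      - traefik.http.routers.fastapi.rule=Host(`fastapi.guku.io`)
      - traefik.http.routers.fastapi.entrypoints=websecure
      # Traefik requests the Let's Encrypt certificate via this resolver
      - traefik.http.routers.fastapi.tls.certresolver=letsencrypt
      - traefik.http.services.fastapi.loadbalancer.server.port=8000
EOF
```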
TanStack on AWS + Cloudflare CDN
Scenario: Deploy a TanStack app on AWS with Cloudflare as CDN
Success Criteria:
All required credentials (AWS, Cloudflare, OpenWeatherMap API) are successfully read and accessible
EC2 infrastructure is provisioned successfully (VPC, security group, key pair created)
EC2 instance is launched and reachable
Docker and all required dependencies are installed on the EC2 instance
The weather application source code is cloned and built successfully with the API key configured
The application container is running on the EC2 instance
Cloudflared is installed and a Quick Tunnel is created successfully
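The Quick Tunnel step can be sketched as follows (Debian/Ubuntu install path and the local app port are assumptions):

```shell
# Install cloudflared and open a Quick Tunnel pointing at the app container;
# cloudflared prints a random *.trycloudflare.com URL on startup
curl -L -o cloudflared.deb \
  https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
sudo dpkg -i cloudflared.deb
cloudflared tunnel --url http://localhost:3000
```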
Twelve Factor App Analysis
Scenario: Analyze a multi service app against all 12 factors, identify violations, and apply fixes
Success Criteria:
A detailed JSON report is created at `/app/report.json` containing analysis for each service, compliance scores for all 12 factors, identified violations, and a proposed remediation plan
Diagnose Production Performance Issues
Scenario: Analyze the performance of the API running at http://localhost:8000 by profiling execution, memory usage, and load behavior using memray, perf, and k6. Summarize the findings and optimization recommendations in a diagnosis report.
Success Criteria:
`memray` profiling results are analyzed for memory usage and allocation patterns
`perf` analysis identifies CPU hotspots or inefficient execution paths
`k6` load testing results capture latency, throughput, and error behavior
A diagnosis report is written to `/app/diagnosis.md`
The report clearly identifies performance bottlenecks and memory issues
The report includes actionable optimization recommendations
Task completes within the expected time limits (agent ≤ 500s, verifier ≤ 300s)
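The three tools named in the scenario are typically invoked along these lines (the script name, process match, and durations are placeholders):

```shell
# Memory profiling with memray: record allocations, then render a flamegraph
memray run -o profile.bin app.py
memray flamegraph profile.bin            # writes an HTML flamegraph

# CPU profiling with perf: sample the running API process with call graphs
perf record -g -p "$(pgrep -f uvicorn)" -- sleep 30
perf report --stdio | head -50

# Load testing with k6 against the API
k6 run --vus 20 --duration 60s loadtest.js   # loadtest.js is a placeholder script
```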
Serverless Deployment (Cloudflare Workers)
Scenario: Deploy a JavaScript-based Calculator API as a Cloudflare Worker by authenticating with Cloudflare, configuring Wrangler, deploying to the edge network, and verifying global API availability.
Success Criteria:
The Cloudflare Worker is deployed successfully
The deployment URL is written to `/app/deployment.json`
The `/health` endpoint responds correctly
The `/calculate` endpoint responds correctly
The API is accessible via the deployed Cloudflare Worker URL
Task completes within the expected time limits (agent ≤ 900s, verifier ≤ 900s)
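A Wrangler-based deployment of this shape can be sketched as follows (the worker URL shown is a placeholder, not the benchmark's actual deployment):

```shell
# Authenticate and deploy with Wrangler (Cloudflare's Workers CLI)
npm install -g wrangler
wrangler login                    # or set CLOUDFLARE_API_TOKEN for non-interactive use
wrangler deploy                   # reads wrangler.toml in the project root

# Verify the deployed endpoints (URL below is a placeholder)
curl https://calculator-api.example.workers.dev/health
```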
MongoDB RPO / RTO Validation
Scenario: Implement and validate a MongoDB backup and restore strategy that meets strict RPO and RTO targets by simulating a database failure and measuring data loss and recovery time.
Success Criteria:
A backup strategy is implemented for `mydb.events` on MongoDB
The database is dropped and successfully restored from backup
`/app/report.json` is generated with all required fields: `pre_disaster_seq_id`, `post_restore_seq_id`, `records_lost`, `rpo_seconds`, `rto_seconds`, `rpo_pass`, `rto_pass`
Calculated RPO is ≤ 60 seconds
Calculated RTO is ≤ 120 seconds
Both `rpo_pass` and `rto_pass` are `true`
Task completes within the expected time limits (agent ≤ 900s, verifier ≤ 600s)
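A `mongodump`/`mongorestore` strategy that fits these criteria can be sketched as follows (the connection URI and backup path are assumptions; the 60-second RPO target implies backing up at least that often):

```shell
# Back up the events collection (run on a schedule, e.g. every 60s, to meet RPO)
mongodump --uri="mongodb://localhost:27017" --db=mydb --collection=events \
  --out=/backups/latest

# Simulate the disaster by dropping the database
mongosh --eval 'db.getSiblingDB("mydb").dropDatabase()'

# Restore from the latest backup; the elapsed time feeds the RTO measurement
time mongorestore --uri="mongodb://localhost:27017" /backups/latest
```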
K3s Security & Compliance Audit
Scenario: Deploy a K3s cluster on an EC2 instance, deploy the provided web application and MongoDB manifests, then scan both container images and Kubernetes manifests for security issues using Trivy and Checkov. Generate JSON security reports on the EC2 instance
Success Criteria:
`trivy-mongo.json` is generated with vulnerability results for the `mongo:5.0` image
`trivy-webapp.json` is generated with vulnerability results for the `nanajanashia/k8s-demo-app:v1.0` image
`checkov-report.json` is generated with security and compliance findings for the Kubernetes manifests
All three reports exist in `/app/security-reports/` in the local container
Reports are valid JSON and contain scan results
Task completes within the expected time limits (agent ≤ 1800s, verifier ≤ 300s)
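The report-generation step maps onto Trivy's and Checkov's JSON output modes roughly as follows (the manifests directory is an assumption):

```shell
mkdir -p /app/security-reports

# Scan both container images with Trivy, writing JSON reports
trivy image --format json --output /app/security-reports/trivy-mongo.json mongo:5.0
trivy image --format json --output /app/security-reports/trivy-webapp.json \
  nanajanashia/k8s-demo-app:v1.0

# Scan the Kubernetes manifests with Checkov (JSON goes to stdout)
checkov --directory ./manifests --output json \
  > /app/security-reports/checkov-report.json
```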
HAProxy / Load Balancer Configuration
Scenario: Configure NGINX as a reverse proxy in front of multiple backend API servers and use HAProxy to load balance traffic across the NGINX instances with health checks, session affinity, and operational visibility.
Success Criteria:
NGINX is running and proxying traffic to backend services on ports `8001`, `8002`, and `8003`
HAProxy is running and load balancing traffic on port `80`
Health checks are enabled and correctly monitor the `/health` endpoint
Session affinity (sticky sessions) is enabled and functioning
HAProxy stats page is accessible at `:8404/stats`
Failover works when backend services become unhealthy
`/app/report.json` contains: `nginx_running: true`, `nginx_instances` equals the number of running backends, `haproxy_running: true`, `load_balancing_works: true`, `session_affinity_works: true`, `health_checks_work: true`, `stats_page_works: true`
Task completes within the expected time limits (agent ≤ 900s, verifier ≤ 900s)
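A minimal `haproxy.cfg` covering health checks, cookie-based sticky sessions, and the stats page might look like this sketch (server addresses, cookie name, and timeouts are assumptions; the servers should point at the NGINX instances being balanced):

```shell
cat > haproxy.cfg <<'EOF'
defaults
  mode http
  timeout connect 5s
  timeout client  30s
  timeout server  30s

frontend web
  bind *:80
  default_backend nginx_pool

backend nginx_pool
  balance roundrobin
  option httpchk GET /health          # active health checks against /health
  cookie SRV insert indirect nocache  # sticky sessions via inserted cookie
  server nginx1 127.0.0.1:8081 check cookie nginx1
  server nginx2 127.0.0.1:8082 check cookie nginx2
  server nginx3 127.0.0.1:8083 check cookie nginx3

listen stats
  bind *:8404
  stats enable
  stats uri /stats
EOF
```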
Each scenario was run four times, testing Stakpak and Claude Code with both Claude Opus 4.5 and Claude Haiku 4.5.
These rulebooks are now available in Stakpak.
The Results
Stakpak Vs Claude Code (Model: Opus 4.5)
ADD THE NEW TABLE HERE
Stakpak Vs Claude Code (Model: Haiku 4.5)
How Rulebooks Changed Everything
Success rates jumped from 0–13% to 100%, execution became faster and cheaper, and Stakpak stopped “figuring things out” and started following how things are actually done.
Why Stakpak Rulebooks Matter
Every organization has tribal knowledge: things senior engineers "just know" that aren't documented anywhere reliable:
"On CPU instances, use opt-125m with --enforce-eager"
"Coolify needs Traefik labels for SSL to work"
"The Uptime Kuma UI setup must happen before webhook config"
"For 8GB RAM, avoid models over 2B parameters"
This knowledge typically lives in:
Senior engineers' heads
Scattered Slack conversations
Outdated wiki pages that no one updates
Stakpak rulebooks formalize tribal knowledge into executable procedures.
When that senior engineer is on vacation or leaves the company, Stakpak still knows what to do. When a new team member joins, they inherit decades of operational wisdom through Stakpak's rulebook system.
TLDR
AI agents fail in production not because they’re “dumb,” but because they rely on trial and error and guessing instead of deterministic execution.
We ran ten real-world DevOps scenarios, comparing Claude Code without rulebooks against Stakpak with them.
Results:
Success rates jumped from 0–13% → 100%
Tasks finished faster and cheaper
Agents stopped guessing and started following proven workflows
Why? Rulebooks turn tribal knowledge (the stuff senior engineers “just know”) into executable, repeatable instructions.
With rulebooks, Stakpak doesn’t improvise; it operates the way your team does, every time.
Ready to turn your team’s operational knowledge into something reusable?
Check out How to Write a Rulebook to create rulebooks that encode how your team operates, or explore community-contributed Paks for battle-tested, reusable infrastructure patterns.