What happens when you give Stakpak step-by-step operational procedures? We ran the experiments; here's the data.
Introduction
DevOps engineers all run into the same question with AI agents:
Can they reliably follow production workflows, or do they behave inconsistently, get lost mid-task, and take too long?
At Stakpak, we built Rulebooks to solve this problem. Rulebooks are Markdown-based standard operating procedures that encode how your team actually operates, turning tribal knowledge into clear, executable guidance for the agent.
But the real question isn’t what rulebooks are; it’s whether they measurably improve agent behavior.
So instead of relying on intuition, we ran controlled experiments to find out.
The Experiment: Stakpak With and Without Rulebooks
We ran Stakpak through five demanding DevOps scenarios:
1. Monitoring & Alerting with Uptime Kuma:
Scenario: Set up application monitoring with webhook alerts
Success Criteria:
Uptime Kuma is running and accessible
The web application is being monitored
Webhook notifications are configured to alert on downtime
Alerts are received when the server is taken down
Completed within the 30-minute timeout
2. End to End LLM Deployment:
Scenario: Configure vLLM to serve an OpenAI-compatible API on CPU infrastructure
Success Criteria:
Chose a suitable model size based on the machine's resources (cpus = 4, memory_mb = 8192, storage_mb = 20480)
Configured vLLM for the specified machine and tested a completion request
3. Coolify Self Hosted PaaS:
Scenario: Deploy FastAPI with Traefik reverse proxy and SSL certificates
Success Criteria:
Created new EC2 instance with SSH key
Installed Coolify with Traefik reverse proxy
Configured DNS A record for fastapi.guku.io
Built and deployed FastAPI container with Traefik labels
Auto-provisioned SSL certificate via Let's Encrypt
4. TanStack on AWS + Cloudflare CDN:
Scenario: Deploy a TanStack app on AWS with Cloudflare as CDN
Success Criteria:
All required credentials (AWS, Cloudflare, OpenWeatherMap API) are successfully read and accessible
Docker and all required dependencies are installed on the EC2 instance
The weather application source code is cloned and built successfully with the API key configured
The application container is running on the EC2 instance
Cloudflared is installed and a Quick Tunnel is created successfully
5. Twelve Factor App Analysis:
Scenario: Analyze a multi service app against all 12 factors, identify violations, and apply fixes
Success Criteria:
A detailed JSON report exists at /app/report.json, containing an analysis of each service, compliance scores for all 12 factors, identified violations, and a proposed remediation plan.
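A criterion like this one can be checked mechanically. Here is a minimal Python sketch of such a check. The key names (`services`, `scores`, `violations`, `remediation_plan`) are assumptions made for illustration, since the actual report schema is not specified here.

```python
import json

# Hypothetical schema: these key names are assumptions for illustration only.
REQUIRED_TOP_LEVEL = ("services", "remediation_plan")
FACTOR_COUNT = 12


def validate_report(report: dict) -> list[str]:
    """Return a list of problems found in a twelve-factor report (empty if none)."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in report:
            problems.append(f"missing top-level key: {key}")

    # Assumed layout: a mapping of service name -> per-service analysis.
    services = report.get("services", {})
    if not services:
        problems.append("no services analyzed")
    for name, service in services.items():
        scores = service.get("scores", {})
        if len(scores) != FACTOR_COUNT:
            problems.append(
                f"{name}: expected {FACTOR_COUNT} factor scores, got {len(scores)}"
            )
        if "violations" not in service:
            problems.append(f"{name}: missing violations list")
    return problems


def validate_report_file(path: str = "/app/report.json") -> list[str]:
    """Load the report from disk and validate it."""
    with open(path, encoding="utf-8") as handle:
        return validate_report(json.load(handle))
```

An empty list from `validate_report_file()` would mean the report has the expected shape. It says nothing about whether the analysis itself is correct.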
Each scenario was run twice: once with Stakpak operating autonomously (no rulebook), and once with a Stakpak rulebook.
These rulebooks are now available in Stakpak.
The Results: The Power of Rulebooks
Stakpak Without Rulebooks
| Scenario | Context Utilization | Time | Success Rate |
| --- | --- | --- | --- |
| Monitoring & Alerting with Uptime Kuma | 22099 | 25 min | 13% |
| End to End LLM Deployment | 84407 | Timeout (30 min or more) | 0% |
| Coolify Self Hosted PaaS | 61937 | Timeout (30 min or more) | 0% |
| Deploy TanStack on AWS + Cloudflare CDN | 29995 | 4.27 min | 100% |
| Twelve Factor Analysis | 34110 | 5.11 min | 100% |
Stakpak With Rulebooks
| Scenario | Context Utilization | Time | Success Rate |
| --- | --- | --- | --- |
| Monitoring & Alerting with Uptime Kuma | 18951 (14.24% better) | 4.83 min | 100% |
| End to End LLM Deployment | 34378 (59.27% better) | 7.46 min | 100% |
| Coolify Self Hosted PaaS | 41105 (33.63% better) | 8.53 min | 100% |
| Deploy TanStack on AWS + Cloudflare CDN | 17270 (42.43% better) | 3.36 min | 100% |
| Twelve Factor Analysis | 34218 (0.31% worse) | 4.67 min | 100% |
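The "better" and "worse" percentages are the relative change in context utilization: (without - with) / without. The short Python sketch below recomputes them from the figures reported above. The function name is ours, and recomputed values can differ from the published ones in the last decimal place because of rounding.

```python
# Context utilization per scenario, copied from the results: (without, with).
CONTEXT = {
    "Monitoring & Alerting with Uptime Kuma": (22099, 18951),
    "End to End LLM Deployment": (84407, 34378),
    "Coolify Self Hosted PaaS": (61937, 41105),
    "Deploy TanStack on AWS + Cloudflare CDN": (29995, 17270),
    "Twelve Factor Analysis": (34110, 34218),
}


def context_reduction_pct(without: int, with_rulebook: int) -> float:
    """Relative reduction in context utilization, in percent (negative = worse)."""
    return (without - with_rulebook) / without * 100


for scenario, (without, with_rulebook) in CONTEXT.items():
    print(f"{scenario}: {context_reduction_pct(without, with_rulebook):.2f}%")
```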
How Rulebooks Changed Everything
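In the three scenarios where Stakpak struggled on its own, success rates jumped from 0–13% to 100%; the other two already succeeded without a rulebook. Every scenario finished faster, and four of the five used less context, which makes runs cheaper. Stakpak stopped “figuring things out” and started following how things are actually done.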
Why Stakpak Rulebooks Matter
Every organization has tribal knowledge, things senior engineers "just know" that aren't documented anywhere reliable:
"On CPU instances, use opt-125m with --enforce-eager"
"Coolify needs Traefik labels for SSL to work"
"The Uptime Kuma UI setup must happen before webhook config"
"For 8GB RAM, avoid models over 2B parameters"
This knowledge typically lives in:
Senior engineers' heads
Scattered Slack conversations
Outdated wiki pages that no one updates
Stakpak rulebooks formalize tribal knowledge into executable procedures.
When that senior engineer is on vacation or leaves the company, Stakpak still knows what to do. When a new team member joins, they inherit decades of operational wisdom through Stakpak's rulebook system.
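Rules of thumb like the 8 GB one quoted above usually come from simple arithmetic. As a rough sketch (our assumption, not a description of how Stakpak or vLLM sizes models): weights need about parameters × bytes per parameter. A 2B-parameter model therefore takes roughly 4 GB at 16-bit precision and 8 GB at 32-bit, before counting the KV cache, the runtime, and the OS. The helper names and the 50% headroom factor below are hypothetical.

```python
def weight_memory_mb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for model weights alone, in MiB."""
    return params_billion * 1e9 * bytes_per_param / (1024 ** 2)


def fits(params_billion: float, memory_mb: int, bytes_per_param: int = 2,
         headroom: float = 0.5) -> bool:
    """True if the weights fit in the share of RAM left after reserving headroom.

    headroom is a hypothetical safety margin for the KV cache, runtime, and OS.
    """
    budget_mb = memory_mb * (1 - headroom)
    return weight_memory_mb(params_billion, bytes_per_param) <= budget_mb


MEMORY_MB = 8192  # the machine from the LLM deployment scenario

for name, size_b in [("125M", 0.125), ("2B", 2.0), ("7B", 7.0)]:
    verdict = "fits" if fits(size_b, MEMORY_MB) else "too large"
    print(f"{name}: ~{weight_memory_mb(size_b):.0f} MiB of weights, {verdict}")
```

Under these assumptions a 2B model at 16-bit precision just fits in the budget, while the same model at 32-bit precision does not. That matches the spirit of the rule, and writing it down in a rulebook saves the agent from rediscovering it by trial and error.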
TL;DR
AI agents fail in production not because they’re “dumb,” but because they rely on trial and error and guessing instead of deterministic execution.
We ran five real world DevOps scenarios with Stakpak, once without rulebooks and once with them.
Results:
Success rates jumped from 0–13% to 100% in the three scenarios that had failed without rulebooks
Tasks finished faster and cheaper
Agents stopped guessing and started following proven workflows
Why?
Rulebooks turn tribal knowledge (the stuff senior engineers “just know”) into executable, repeatable instructions.
With rulebooks, Stakpak doesn’t improvise; it operates the way your team does, every time.
Ready to turn your team’s operational knowledge into something reusable?
Check "How to Write a Rulebook?" to create rulebooks that encode how your team operates, or explore community-contributed Paks for battle-tested, reusable infrastructure patterns.