
# Investigate Why AWS Costs Suddenly Increased

## Overview

By the end of this tutorial, you'll know how to use Stakpak to investigate an AWS cost spike in a live production account: identify every hidden source of unnecessary spend, apply the right remediations safely, validate that the bill returns to baseline, and configure Stakpak [Autopilot](/docs/how-it-works/autopilot.md) to help detect similar cost incidents automatically in the future.

{% hint style="info" %}
Stakpak is open source, vendor neutral, and works with any model you choose.
{% endhint %}

## Problem

Your AWS production application is healthy. The pipeline is green, the SLOs are green, the on-call channel is quiet.

**But your FinOps lead just pinged the team:**

* We're $4,300 over budget this month and trending 35% above last. Nothing in the apps catalog has changed.

You start the usual cost investigation loop: Cost Explorer by service and by tag, VPCs and NAT Gateways, unattached EBS volumes, stale snapshots, idle Elastic IPs, VPC endpoints, RDS instances, CloudWatch log retention, S3 lifecycle policies, and CloudTrail events.

Cost Explorer shows the highest cost is from EC2, Other, EKS, and CloudWatch. The rest is scattered across eight services in chunks too small to feel urgent on their own. Tag breakdowns are messy because half the spend rolls up under (no tag) or Owner=unknown, and the biggest single CUR line item is a Fargate workload nobody on the current team recognizes.

* Is the $890 NAT Gateway data line the orphaned VPC nobody decommissioned, or production traffic that should be flowing through a VPC endpoint?
* Are the 1,400+ EBS snapshots load-bearing, or from a Lambda deprecated 18 months ago and never disabled?
* Is the RDS instance tagged Environment=staging-old truly idle, or does some nightly job still touch it?
* Which of the 12 likely cost drivers, if any, would be the wrong thing to delete?

Cost Explorer gives you part of the picture. AWS resource APIs give you the rest. But you still have to connect them, attribute them to owners, correlate them with utilization, and decide what is safe to remediate.
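To make the manual loop concrete, here is a minimal Python sketch of its first step: comparing spend by service between two months. It assumes boto3 is installed and that the active credentials allow `ce:GetCostAndUsage`. The function names `fetch_monthly_costs` and `service_deltas` are illustrative only and are not part of Stakpak.

```python
from collections import defaultdict


def fetch_monthly_costs(start: str, end: str) -> dict:
    """Query Cost Explorer for monthly unblended cost grouped by service.

    `start` is inclusive and `end` is exclusive, both as YYYY-MM-DD.
    Pagination (NextPageToken) is ignored in this sketch.
    """
    import boto3  # imported lazily so service_deltas() works without boto3

    ce = boto3.client("ce")
    return ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )


def service_deltas(response: dict) -> list[tuple[str, float]]:
    """Return (service, last_period - first_period) pairs, largest increase first."""
    totals = []
    for period in response["ResultsByTime"]:
        costs = defaultdict(float)
        for group in period["Groups"]:
            service = group["Keys"][0]
            costs[service] += float(group["Metrics"]["UnblendedCost"]["Amount"])
        totals.append(costs)

    first, last = totals[0], totals[-1]
    services = set(first) | set(last)
    deltas = [(s, round(last.get(s, 0.0) - first.get(s, 0.0), 2)) for s in services]
    return sorted(deltas, key=lambda item: item[1], reverse=True)
```

Calling `service_deltas(fetch_monthly_costs(start, end))` over a two-month window ranks services by how much their spend grew. It does not say who owns the spend or whether it is safe to remove, which is the part that still needs the resource APIs.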

## How Stakpak Helps

Instead of manually cross-referencing Cost Explorer breakdowns, CUR line items, tag attribution, CloudWatch utilization, CloudTrail provenance, and resource configuration across a dozen AWS services, we can ask Stakpak to investigate the account for us.

Stakpak inspects both sides of the cost spike: the billing and usage data, and the live AWS environment. It connects the signals across the account, identifies every resource and configuration pattern driving unnecessary spend (including the ones a single Cost Explorer view cannot see), proposes a remediation plan ordered by impact and risk, applies the safe fixes to bring the bill back to baseline, and validates that the application stays healthy throughout.

Then, we'll configure Stakpak [Autopilot](/docs/how-it-works/autopilot.md) to continuously monitor the AWS account and help detect similar cost incidents automatically in the future.

## Application

Northstar Commerce is a B2B ecommerce platform running on AWS, with workloads spread across EKS, ECS Fargate, Lambda, and Vercel. The main components are:

* storefront: Customer-facing Next.js app on Vercel.
* api-gateway: Public REST and GraphQL edge on EKS.
* orders-service: Order lifecycle, Go on EKS, backed by Aurora PostgreSQL.
* payments-service: Java on ECS Fargate, integrates with Stripe.
* inventory-worker: Celery workers on EKS draining an SQS queue.
* search-indexer: Rust Lambda keeping OpenSearch in sync.
* admin-console: React SPA on S3 behind CloudFront.

Shared infrastructure includes an EKS cluster, an Aurora cluster, an ElastiCache Redis, an MSK cluster, an OpenSearch domain, ECR, Route 53, ACM, and Secrets Manager.

Primary region is us-east-1, with us-west-2 as a disaster recovery region.

Every workload in the catalog is healthy and serving traffic. None of the recent deploys touched infrastructure. That is what makes a 35% cost jump suspicious: the bill is growing faster than the application is.

Now that we understand the app and architecture, we can start investigating the cost spike.

## Step-by-Step Guide

### Prerequisites

1. [Install Stakpak](/docs/get-started/install-stakpak.md)
2. Cloud provider credentials configured
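Before starting, it can help to confirm which AWS identity the configured credentials resolve to. The following is a minimal sketch, assuming boto3 is installed. `describe_caller` is a hypothetical helper name; it accepts any client that exposes the STS `get_caller_identity` call.

```python
def describe_caller(sts_client) -> str:
    """Return a one-line summary of the identity behind the active credentials."""
    identity = sts_client.get_caller_identity()
    return f"account={identity['Account']} arn={identity['Arn']}"
```

With boto3 available, `describe_caller(boto3.client("sts"))` prints the account ID and ARN, which is a quick way to check that you are pointed at the intended account before any investigation begins.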

### Troubleshooting

1. Open Stakpak and ask it to `investigate the cloud cost spike`

Now let it do its magic.

<figure><img src="/files/l1FShhEjigbpbAf0IFCi" alt=""><figcaption></figcaption></figure>

Stakpak traced the cost spike across billing, utilization, and infrastructure signals and identified multiple sources of unnecessary spend driving the 35% increase.

It found that the $4,270 June overage came from 12 distinct cost drivers totaling \~$6,800/month of avoidable spend, none caused by application changes. The signals were spread across Cost Explorer deltas, tag anomalies (staging-old +5,854%, intern-summer-2025 +9,677%), CUR line items, CloudWatch utilization, and CloudTrail provenance.
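Tag anomaly percentages like these are presumably relative changes in spend per tag value against an earlier period. The sketch below shows that arithmetic. The dollar amounts in the usage note are hypothetical, since the tutorial does not give the underlying figures, and the function names and the 100% threshold are assumptions for illustration.

```python
def percent_change(previous: float, current: float) -> float:
    """Relative change from `previous` to `current`, in percent."""
    if previous == 0:
        raise ValueError("percent change is undefined when the previous value is 0")
    return (current - previous) / previous * 100.0


def flag_tag_anomalies(previous: dict, current: dict, threshold_pct: float = 100.0) -> dict:
    """Return {tag: percent_change} for tags whose spend grew by more than the threshold."""
    flagged = {}
    for tag, now in current.items():
        before = previous.get(tag, 0.0)
        if before == 0:
            continue  # brand-new tags have no baseline; handle them separately
        change = percent_change(before, now)
        if change > threshold_pct:
            flagged[tag] = round(change, 1)
    return flagged
```

For example, a tag whose spend went from a hypothetical $10 to $510 would be flagged at +5,000%, while one that went from $200 to $210 (+5%) would not. Tags with no spend in the earlier period are skipped here and would need their own check.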

Then it:

* Deleted the orphaned legacy VPC and its NAT Gateway, abandoned since the 2024 EKS migration
* Terminated three m5.2xlarge legacy batch workers idling at 2% CPU
* Deleted the forgotten eks-dev-intern cluster and its Fargate Spot profile, running since July 2025
* Deleted the staging-old RDS instance after 30 days of zero connections
* Removed five unattached EBS volumes, three idle Elastic IPs, and 1,400+ stale snapshots from a deprecated 2023 backup Lambda
* Disabled GuardDuty in eu-west-1 and ap-southeast-1 where no workloads exist
* Added an S3 Gateway VPC endpoint to the production VPC, eliminating $890/month of NAT data processing
* Applied lifecycle rules to northstar-prod-edge-logs and 30-day retention to three "Never expire" log groups
* Fixed cross-AZ traffic on orders-service
* Deployed AWS Budgets with anomaly detection, tag-enforcement SCPs, and Config rules
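Several of these findings can be spot-checked with read-only AWS API calls before anything is deleted. The sketch below assumes responses in the shape returned by boto3's `describe_volumes`, `describe_addresses`, and `describe_log_groups`. The helper names are hypothetical, and this is not a description of how Stakpak performs its own checks.

```python
def unattached_volume_ids(describe_volumes_response: dict) -> list[str]:
    """IDs of EBS volumes in the 'available' state with no attachments."""
    return [
        v["VolumeId"]
        for v in describe_volumes_response.get("Volumes", [])
        if v.get("State") == "available" and not v.get("Attachments")
    ]


def idle_elastic_ips(describe_addresses_response: dict) -> list[str]:
    """Public IPs of Elastic IPs that are not associated with any resource."""
    return [
        a["PublicIp"]
        for a in describe_addresses_response.get("Addresses", [])
        if "AssociationId" not in a
    ]


def never_expiring_log_groups(describe_log_groups_response: dict) -> list[str]:
    """Names of CloudWatch log groups that have no retention policy set."""
    return [
        g["logGroupName"]
        for g in describe_log_groups_response.get("logGroups", [])
        if "retentionInDays" not in g
    ]
```

The inputs would come from `boto3.client("ec2").describe_volumes()`, `boto3.client("ec2").describe_addresses()`, and `boto3.client("logs").describe_log_groups()`. The volume and log group calls are paginated, so a real audit would need to iterate over all pages. Keeping the filters separate from the API calls makes it easy to review what would be flagged before taking any destructive action.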

After the changes were applied, Stakpak verified that:

* All 12 driver resources are gone or reconfigured
* Every production workload remained healthy with no SLO regressions
* Projected run-rate dropped to \~$9,600/month, below the January baseline

Now everything is cleaned up 🥳

Now it's asking us whether we want to set up Stakpak [Autopilot](/docs/how-it-works/autopilot.md) to avoid future cost spikes.

{% hint style="info" %}
Stakpak Autopilot monitors your apps 24/7, detects unexpected changes, fixes what’s safe, and only alerts you when it actually matters.
{% endhint %}

### Monitoring

<figure><img src="/files/2q8eLxHBTCuqOM1XOKE1" alt=""><figcaption></figcaption></figure>

## Extra Resources

### Related Use Cases

* [Containerize a Python App](/docs/tutorial/containerize-a-python-app.md)
* [Fix Kubernetes CrashLoopBackOff in Minutes](/docs/tutorial/fix-kubernetes-crashloopbackoff-in-minutes.md)
* [Fix Kubernetes Apps That Are Running but Not Reachable](/docs/tutorial/fix-kubernetes-apps-that-are-running-but-not-reachable.md)

and more...

### References

* [Install Stakpak](/docs/get-started/install-stakpak.md)
* [Configure Stakpak](/docs/get-started/configure-stakpak.md)
* [Configuration and credential file settings in the AWS CLI](https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-files.html)
* [Autopilot](/docs/how-it-works/autopilot.md)
* [Handling Secrets](/docs/how-it-works/handling-secrets.md)
* [Warden Guardrails](/docs/how-it-works/warden-guardrails.md)


