# Investigate Why Terraform Drifted From Production

## Overview

By the end of this tutorial, you'll know how to use Stakpak to investigate Terraform drift in a live AWS environment, apply the fixes safely, validate that the infrastructure matches the code again, and configure Stakpak [Autopilot](/docs/how-it-works/autopilot.md) to help detect similar drift incidents automatically in the future.

{% hint style="info" %}
Stakpak is open source, vendor neutral, and works with any model you choose.
{% endhint %}

## Problem

You manage your AWS infrastructure with Terraform, and everything seems fine at first.

* The code is in Git.
* The state file is up to date.
* The last `terraform apply` ran cleanly two weeks ago.
* The application is serving traffic.

But a compliance scan just flagged your account for new "public S3" and "wide-open security group" findings, and you have no idea where they came from.

You check the workload:

```
curl -sS -o /dev/null -w "%{http_code}\n" http://northstar-prod-edge-alb-391835729.us-east-1.elb.amazonaws.com/
```

And the app returns 200. Traffic is fine. So you start the usual Terraform drift debugging loop:

```
# VPC_ID, BUCKET, and ROLE_NAME are placeholders for the masked values; set them first.
terraform state list

terraform plan

aws ec2 describe-security-groups --filters "Name=vpc-id,Values=$VPC_ID"

aws s3api get-bucket-policy --bucket "$BUCKET"

aws s3api get-public-access-block --bucket "$BUCKET"

aws iam list-role-policies --role-name "$ROLE_NAME"

aws iam list-attached-role-policies --role-name "$ROLE_NAME"

aws ec2 describe-instances --filters "Name=vpc-id,Values=$VPC_ID"

aws cloudtrail lookup-events --max-results 50
```

`terraform plan` shows `Plan: 0 to add, 5 to change, 0 to destroy`, but is that the whole story?

* Is the plan output complete, or is it missing resources that were never in state?
* Are there resources running in AWS that Terraform doesn't know about at all?
* Did someone attach an IAM policy out of band that escalates blast radius?
* Did someone modify an EC2 instance directly, and will it revert on the next ASG refresh?
* Did someone open a public S3 path through a resource type your code doesn't manage?

Terraform gives you part of the picture. AWS gives you the rest. But you still have to connect them.
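One way to approach the "resources Terraform doesn't know about" question is a plain set difference between the IDs recorded in state and the IDs the AWS API returns. The sketch below assumes you have already written each list to a file, one ID per line; the function name `unmanaged_ids` and the file layout are illustrative, not part of Stakpak or Terraform. The live side could come from, for example, `aws ec2 describe-security-groups` with a `--query` on `GroupId`.

```shell
# Print IDs that appear in the live AWS list but not in the Terraform state list.
# $1: file with one resource ID per line, taken from Terraform state
# $2: file with one resource ID per line, taken from the AWS API
unmanaged_ids() {
  state_sorted=$(mktemp)
  live_sorted=$(mktemp)
  # comm requires sorted input; -u also removes duplicate IDs.
  sort -u "$1" > "$state_sorted"
  sort -u "$2" > "$live_sorted"
  # -13 hides lines unique to the first file and lines common to both,
  # leaving only lines unique to the second (live) file.
  comm -13 "$state_sorted" "$live_sorted"
  rm -f "$state_sorted" "$live_sorted"
}
```

An ID in the output is only a candidate: it may belong to another Terraform workspace or be an AWS default resource, so it still needs a manual look before anyone treats it as drift.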

## How Stakpak Helps

Instead of manually cross-referencing Terraform state, `terraform plan` output, AWS API responses, CloudTrail events, and resource tags across security groups, IAM roles, S3 buckets, EC2 instances, and load balancers, we can ask Stakpak to investigate the environment for us.

Stakpak inspects both sides of the drift (the Terraform code and state on one side, the live AWS environment on the other), connects the signals across the account, identifies every resource that has diverged (including the ones `terraform plan` cannot see), applies the fix to reconcile code and reality, and validates that the application stays healthy throughout.

Then, we'll configure Stakpak [Autopilot](/docs/how-it-works/autopilot.md) to continuously monitor the AWS account and help detect similar Terraform drift incidents automatically in the future.

## Application

The application is the public edge tier of the Northstar Commerce platform, running on AWS and provisioned with Terraform.

It runs as an Auto Scaling Group of EC2 instances behind an Application Load Balancer because it needs horizontal scaling, automatic replacement of unhealthy instances, and a stable public entrypoint.

Terraform is the source of truth for the stack, and AWS must reflect exactly what the Terraform code and state describe. Otherwise, the environment has drifted and Terraform can no longer safely manage it.

**The main components are:**

* ALB: Public entrypoint that forwards HTTP traffic to healthy instances.
* Auto Scaling Group: Maintains the desired number of EC2 instances and replaces unhealthy ones.
* Launch Template: Defines the AMI, instance type, IAM profile, and security groups for every instance the ASG launches.
* Security Groups: One for the ALB (HTTP from internet), one for the instances (HTTP from ALB only).
* IAM role: Grants instances least-privilege access to SSM and the logs bucket.
* S3 bucket: Stores edge access logs. Versioned, encrypted, and blocked from public access.
* Terraform state: Records every managed resource and its expected configuration.

The normal management flow is: an engineer changes the Terraform code, CI runs `terraform plan`, the change is merged, and `terraform apply` reconciles AWS to match. Between applies, `terraform plan` should report `No changes`.
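The "no changes between applies" expectation can be checked mechanically with the `-detailed-exitcode` flag of `terraform plan`, which exits with 0 when the plan is empty, 1 on error, and 2 when changes are pending. The sketch below maps those codes to labels; the function name `classify_plan_exit` and the labels themselves are illustrative choices, not Terraform or Stakpak output.

```shell
# Map the exit status of `terraform plan -detailed-exitcode` to a label.
#   0 = plan succeeded and the diff is empty
#   2 = plan succeeded and changes are pending
#   anything else (including 1) = the plan itself failed
classify_plan_exit() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "changes-pending" ;;
    *) echo "error" ;;
  esac
}

# Example use, run inside the Terraform working directory:
#   terraform plan -detailed-exitcode -input=false >/dev/null
#   classify_plan_exit "$?"
```

A `changes-pending` result means code and AWS disagree, which covers both out-of-band drift and code changes that have not been applied yet, so it is a trigger for investigation rather than proof of drift.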

Now that we understand the app and architecture, we can start investigating the drift.

## Step-by-Step Guide

### Prerequisites

1. [Install Stakpak](/docs/get-started/install-stakpak.md)
2. Cloud provider credentials configured

### Troubleshooting

1. Open Stakpak and ask it to `investigate the Terraform drift`

Now let's let it do its magic.

<figure><img src="/files/l1FShhEjigbpbAf0IFCi" alt=""><figcaption></figcaption></figure>

Stakpak started investigating Terraform drift in the northstar-edge module and traced the issue through the module structure, Terraform plan refresh, declared resource definitions, and the live AWS state of each drifted resource.

It found that 5 resources had drifted from their declared Terraform state, all from out-of-band changes (likely AWS console edits):

<figure><img src="/files/zqqHbEkAiT6wczTj0des" alt=""><figcaption></figcaption></figure>

<figure><img src="/files/bx7jfElJEDtBxmYF0aAN" alt=""><figcaption></figcaption></figure>
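If you want to trace out-of-band edits like these by hand, CloudTrail's `lookup-events` can be filtered by event name. The sketch below is not what Stakpak runs internally; it is a minimal wrapper, and the function name `lookup_events_by_name` and the `AWS_CMD` override (a hook for dry runs) are assumptions made for illustration. Event names such as `AuthorizeSecurityGroupIngress` and `PutBucketVersioning` correspond to the kinds of changes found above.

```shell
# Look up recent CloudTrail management events for one event name.
# $1: CloudTrail event name, e.g. AuthorizeSecurityGroupIngress
# $2: maximum number of events to return (defaults to 20)
# AWS_CMD is a hypothetical override so the call can be dry-run with `echo`.
lookup_events_by_name() {
  "${AWS_CMD:-aws}" cloudtrail lookup-events \
    --lookup-attributes "AttributeKey=EventName,AttributeValue=$1" \
    --max-results "${2:-20}" \
    --query 'Events[].[EventTime,Username,EventName]' \
    --output text
}

# Example use:
#   lookup_events_by_name AuthorizeSecurityGroupIngress
#   lookup_events_by_name PutBucketVersioning 5
```

`lookup-events` only searches the last 90 days of management events, so older changes would need a trail delivered to S3 or CloudTrail Lake.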

Then it found the root cause:

<figure><img src="/files/CWxvm0gOGDczSCfRL7FV" alt=""><figcaption></figcaption></figure>

Then it started remediating the drift:

<figure><img src="/files/wKoEPkGHVCVnC2B5OLeN" alt=""><figcaption></figcaption></figure>

* Removed unauthorized SSH ingress rule from `aws_security_group.alb`
* Removed unauthorized MySQL ingress rule from `aws_security_group.app`
* Re-enabled all S3 public access protections
* Restored S3 bucket versioning
* Reverted ALB tags to declared Terraform values
* Removed manual-console metadata tag

Then it validated that everything was back to normal:

* SSH (22) no longer exposed to the internet
* MySQL (3306) no longer exposed to the internet
* S3 public access protections enabled
* S3 versioning enabled
* ALB tags match Terraform configuration
* `terraform plan` returned "No changes"
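Some of these checks are easy to repeat by hand. As a sketch, the S3 public access check can be reduced to "all four Public Access Block flags are true". The function below reads the flag values from stdin; the name `public_access_fully_blocked` is illustrative, and it assumes the AWS CLI's text output renders booleans as `True` and `False`.

```shell
# Succeed only if stdin holds exactly four whitespace-separated values
# and every one of them is "True".
public_access_fully_blocked() {
  count=0
  for flag in $(cat); do
    [ "$flag" = "True" ] || return 1
    count=$((count + 1))
  done
  [ "$count" -eq 4 ]
}

# Example use (BUCKET is a placeholder for your bucket name):
#   aws s3api get-public-access-block --bucket "$BUCKET" \
#     --query 'PublicAccessBlockConfiguration.[BlockPublicAcls,IgnorePublicAcls,BlockPublicPolicy,RestrictPublicBuckets]' \
#     --output text | public_access_fully_blocked && echo "public access blocked"
```

Requiring exactly four values keeps an empty or truncated response from passing as a success.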

Now everything is back to normal 🥳

Stakpak now asks whether we want to set up [Autopilot](/docs/how-it-works/autopilot.md) to catch future drift:

<figure><img src="/files/53cLwImGJkPdO4LaYpw8" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
Stakpak Autopilot monitors your apps 24/7, detects unexpected changes, fixes what’s safe, and only alerts you when it actually matters.
{% endhint %}

### Monitoring

<figure><img src="/files/kJvhkQWnYOtEg6Q6YS0D" alt=""><figcaption></figcaption></figure>

## Extra Resources

### Related Use Cases

* [Containerize a Python App](/docs/tutorial/containerize-a-python-app.md)
* [Fix Kubernetes CrashLoopBackOff in Minutes](/docs/tutorial/fix-kubernetes-crashloopbackoff-in-minutes.md)
* [Fix Kubernetes Apps That Are Running but Not Reachable](/docs/tutorial/fix-kubernetes-apps-that-are-running-but-not-reachable.md)

and more...

### References

* [Install Stakpak](/docs/get-started/install-stakpak.md)
* [Configure Stakpak](/docs/get-started/configure-stakpak.md)
* [Configuration and credential file settings in the AWS CLI](https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-files.html)
* [Autopilot](/docs/how-it-works/autopilot.md)
* [Handling Secrets](/docs/how-it-works/handling-secrets.md)
* [Warden Guardrails](/docs/how-it-works/warden-guardrails.md)


