# Fix Kubernetes CrashLoopBackOff in Minutes

## Overview

In this tutorial, we'll use Stakpak to investigate and fix three CrashLoopBackOff incidents in Kubernetes.

Instead of manually jumping between kubectl logs, kubectl describe, and YAML files trying to piece together what went wrong, Stakpak analyzes your cluster, pinpoints the root cause, and applies the fix quickly.

Then, we'll set up Stakpak Autopilot to monitor our cluster continuously and handle these issues automatically next time, so we don't have to wake up at 3 AM.

{% hint style="info" %}
Stakpak is open source, vendor neutral, and works with any model you choose.
{% endhint %}

## Problem

<figure><img src="/files/rbUc9yBNwKqVUzbe7vSf" alt=""><figcaption></figcaption></figure>

You deploy your services to Kubernetes, and everything looks fine, until it isn't.

Pods start crashing, restarting, and crashing again. Kubernetes shows you CrashLoopBackOff but doesn't tell you why.

So you start investigating. You check logs, describe pods, read events, compare YAML files, and Google exit codes.

The same debugging loop every time:

* `kubectl get pods` -> something is broken
* `kubectl logs` -> maybe empty, maybe unhelpful
* `kubectl describe pod` -> a wall of text, what matters?
* `kubectl get events` -> which event is the one?

And the root cause could be anything:

* Wrong command
* Bad health check
* Memory too low
* Missing config
* Bad image tag

These aren't code bugs. They're infrastructure misconfigurations that look the same on the surface but have completely different root causes.

## How Stakpak Helps

Debugging CrashLoopBackOff means connecting dots across logs, events, pod specs, and image configs. That takes time and experience.

Stakpak takes care of that for you:

* Reads the signals for you -> logs, events, exit codes, and pod state, analyzed together, not one at a time
* Pinpoints the root cause -> tells you why the pod is crashing, not just that it's crashing
* Applies the fix -> updates the manifest and rolls out the changes in one step
* Sets up continuous monitoring -> Stakpak [Autopilot](/docs/how-it-works/autopilot.md) watches your cluster 24/7 and fixes issues before you even notice

Instead of spending 15 minutes per incident cycling through kubectl commands, Stakpak resolves it in under 2 minutes, and with Autopilot, you will not have to get involved at all.

## Step-by-Step Guide

### Prerequisites

1. [Install Stakpak](/docs/get-started/install-stakpak.md)
2. [Configure Stakpak](/docs/get-started/configure-stakpak.md)

### Troubleshooting

1. Open Stakpak and ask it to `investigate and fix the CrashLoopBackOff`

Now let's let it do its magic.

<figure><img src="/files/fAp6izzGTD8nu72MFlGc" alt=""><figcaption></figcaption></figure>

2. It scans the cluster for pods in a CrashLoopBackOff state
3. Pulls logs from the crashed containers, including previous instances
4. Reads pod events to correlate exit codes, restart counts, and failure reasons
5. Cross-references the deployment spec, image config, and source code
6. Identifies the root cause for each failing pod
7. Applies the fix, updates the manifest, and rolls out the new deployment
8. Verifies that the pods are running healthy with zero restarts

As you can see, it found three different root causes for the CrashLoopBackOff:

<figure><img src="/files/TJ9KX4KK5A1SksczT3gT" alt=""><figcaption></figcaption></figure>

**Wrong CMD Crashloop**

* The container image was built with a command that references `app1.py`, but the actual file inside the container is `app.py`. The process exits immediately every time, and Kubernetes keeps restarting it.
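
The fix for this class of failure is to make the container's command point at a file that actually exists in the image. A minimal sketch of the corrected Deployment spec (the container name, image, and filenames here are hypothetical, based on the scenario above):

```yaml
# Deployment excerpt (hypothetical names) showing the corrected command
spec:
  containers:
    - name: web                              # hypothetical container name
      image: registry.example.com/web:1.0    # hypothetical image
      # Before: command: ["python", "app1.py"]  <- app1.py does not exist in the
      # image, so the process exits immediately and Kubernetes restarts it forever
      command: ["python", "app.py"]          # the file the image actually contains
```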

<figure><img src="/files/tmRGSaGV6rcYNatwPspq" alt=""><figcaption></figcaption></figure>

**Liveness Probe Crashloop**

* The app itself runs fine, but the liveness probe checks for a file (`/tmp/healthy`) that doesn't exist. Kubernetes marks the container unhealthy and keeps killing it. The app never did anything wrong.
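
One way to fix this is to point the liveness probe at something the app actually provides. A hedged sketch, assuming the Flask app serves HTTP on port 5000 and that probing the root path is acceptable (the name, path, and port are assumptions, not taken from the cluster):

```yaml
# Deployment excerpt (hypothetical names): probe an endpoint the app really serves
spec:
  containers:
    - name: web                 # hypothetical container name
      # Before: an exec probe checked for /tmp/healthy, a file the app never
      # creates, so every check failed and a healthy container kept getting killed
      livenessProbe:
        httpGet:
          path: /               # assumes the app answers HTTP here
          port: 5000            # assumed Flask port
        initialDelaySeconds: 5  # give the runtime time to start before probing
        periodSeconds: 10
```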

<figure><img src="/files/giubaHfGCBRk6e8vnJUx" alt=""><figcaption></figcaption></figure>

**Out of Memory Crashloop**

* The memory limit is set to `25Mi`, which isn't enough for a Python/Flask process. The container starts, loads the runtime, hits the ceiling, and gets OOM-killed. No error in the logs; the process just disappears.
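
The fix here is simply to raise the memory limit above what the runtime needs at startup. A sketch with assumed values (the exact numbers depend on your app; measure actual usage with `kubectl top pod` before settling on limits):

```yaml
# Deployment excerpt (hypothetical names): give the runtime enough headroom
spec:
  containers:
    - name: web             # hypothetical container name
      resources:
        requests:
          memory: "64Mi"    # assumed baseline for a small Flask app
        limits:
          # Before: 25Mi -> the Python runtime alone exceeds this, so the
          # container is OOM-killed before it can serve anything
          memory: "128Mi"   # assumed; size to measured usage plus headroom
```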

Now everything is working 🥳

Three pods, same symptom, three completely different causes. That's what makes CrashLoopBackOff frustrating to debug manually.

Now let's ask it to set up Stakpak [Autopilot](/docs/how-it-works/autopilot.md) so we avoid waking up at 3 AM because of a crashloop 🤡

{% hint style="info" %}
Stakpak Autopilot monitors your apps 24/7, detects unexpected changes, fixes what’s safe, and only alerts you when it actually matters.
{% endhint %}

### Monitoring

<figure><img src="/files/s4TjjZ44bcvg5DvKRIhp" alt=""><figcaption></figcaption></figure>

That's it. Now CrashLoopBackOff won't haunt our nightmares at 3 AM.

## Extra Resources

### References

* [Install Stakpak](/docs/get-started/install-stakpak.md)
* [Configure Stakpak](/docs/get-started/configure-stakpak.md)
* [Autopilot](/docs/how-it-works/autopilot.md)
* [Handling Secrets](/docs/how-it-works/handling-secrets.md)
* [Warden Guardrails](/docs/how-it-works/warden-guardrails.md)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://stakpak.gitbook.io/docs/tutorial/fix-kubernetes-crashloopbackoff-in-minutes.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
