Exercise 1: Runtime Incident Detection

Goal: Detect a runtime incident and correlate it to its source — proving that runtime detection catches what guardrails missed.

📝 Open docs/incident-observation.md — record your SRE Agent alert details and Defender correlation findings.

How SRE Agent Works — 6-Stage Architecture

Before diving in, understand what makes SRE Agent fundamentally different from traditional alerting:

┌───────────────────────────────────────────────────────────────────────────────────────┐
│                               Azure SRE Agent: 6 Stages                               │
├─────────────┬──────────────┬────────────────┬─────────────┬──────────────┬────────────┤
│ 1.CONNECTS  │ 2.MONITORS   │ 3.INVESTIGATES │ 4.DIAGNOSES │ 5.REMEDIATES │ 6.LEARNS   │
│             │              │                │             │              │            │
│ Loads code, │ Scheduled    │ Multi-signal   │ Root cause  │ Executes     │ Retains    │
│ logs,       │ intelligence │ correlation    │ with        │ runbooks     │ knowledge  │
│ metrics,    │ + reactive   │ (logs +        │ context     │ or requests  │ for faster │
│ toolchain   │ alerts       │ metrics +      │ from prior  │ human        │ response   │
│ at SETUP    │              │ topology +     │ incidents   │ approval     │ next time  │
│             │              │ code changes   │ (memory)    │              │            │
└─────────────┴──────────────┴────────────────┴─────────────┴──────────────┴────────────┘

🔑 Key: SRE Agent has DEEP CONTEXT. It loads your environment at setup, not when the alert fires. This is why diagnosis takes 2–10 min with the agent versus 15–30 min for a human scrambling for context.

📊 Microsoft at scale: 1,300+ SRE agents inside Microsoft, 35,000+ incidents mitigated, 20,000+ hours saved/month.

Proactive vs Reactive — Two Operating Modes

SRE Agent doesn’t just REACT to incidents — it can PREVENT them:

Mode	How It Works	Example
Reactive (this exercise)	Alert fires → Agent investigates → Diagnoses → Proposes fix	Health check failure detected
Proactive (Exercise 3 addition)	Scheduled intelligence → Agent checks health → Catches issues BEFORE incidents	Hourly probe validation

💡 The agent you’re seeing in Exercise 1 is in REACTIVE mode. In Exercise 3, you’ll configure it for PROACTIVE mode — preventing recurrence before the next incident.

Step 1: Observe the Alert

When the staged incident triggers, the container begins crash-looping (or Defender flags anomalous behavior).

# Watch pod status; the RESTARTS count should keep increasing (press Ctrl+C to stop watching)
kubectl get pods -n YOUR_NAMESPACE -w

Expected output:

NAME                        READY   STATUS             RESTARTS   AGE
myapp-7f8b9c6d4-x2k9p      0/1     CrashLoopBackOff   3          2m
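
To pick out crash-looping pods without watching the stream, the status table can be filtered with awk. The following is a minimal sketch, assuming the default kubectl get pods column layout shown above (NAME, READY, STATUS, RESTARTS, AGE). The function name crashlooping_pods is illustrative and not part of the workshop scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the workshop repo): reads `kubectl get pods`
# table output on stdin and prints "<pod> <restarts>" for every pod whose
# STATUS column is CrashLoopBackOff. Assumes the default column order.
crashlooping_pods() {
  awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1, $4 }'
}

# Usage against a live cluster (replace YOUR_NAMESPACE):
#   kubectl get pods -n YOUR_NAMESPACE | crashlooping_pods
```

An empty result means no pod is currently in CrashLoopBackOff; a pod can still be restarting while it briefly shows another status such as Error or Running.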

Step 2: SRE Agent Alert

SRE Agent receives the alert and displays:

What resource is affected: Container myapp in namespace YOUR_NAMESPACE
What the symptoms are: Readiness probe failing, container restarting
Suggested investigation steps: Check container logs, check recent deployments, review probe configuration

Review the SRE Agent notification in your configured channel (Azure Portal / Slack / Teams).

Step 3: Record T0

📝 Open MTTR-TRACKER.md and record:

T0 = Alert fired — the timestamp when the SRE Agent alert was received

Step 4: Defender Correlation

Open Microsoft Defender for Cloud and examine the alert context:

Navigate to Security Alerts → find the container health alert
Review the code-to-runtime correlation:

Runtime Alert → Container Image → Registry → Pipeline Build → Source Repository

Defender maps the runtime issue back to the source repository and deployment pipeline, showing the full chain.
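
The first hop of that chain (runtime alert to container image) can also be checked by hand. A minimal sketch, assuming the pods carry the app=myapp label used elsewhere in this exercise; split_image_ref is a hypothetical helper and myregistry.azurecr.io is a placeholder registry name. It handles the common registry/repository:tag form only, not digest references (@sha256:...) or registries that include a port.

```shell
# Print the image reference(s) the affected pods are running:
#   kubectl get pods -l app=myapp -n YOUR_NAMESPACE \
#     -o jsonpath='{.items[*].spec.containers[*].image}'

# Hypothetical helper: split "registry/repository:tag" into its parts.
split_image_ref() {
  ref="$1"
  registry="${ref%%/*}"         # text before the first "/"
  remainder="${ref#*/}"         # text after the first "/"
  repository="${remainder%:*}"  # drop the trailing ":tag"
  tag="${remainder##*:}"        # text after the last ":"
  printf 'registry=%s repository=%s tag=%s\n' "$registry" "$repository" "$tag"
}

# Example with a placeholder image reference:
#   split_image_ref "myregistry.azurecr.io/myapp:1.2.3"
#   prints: registry=myregistry.azurecr.io repository=myapp tag=1.2.3
```

The registry and tag are the values to look up when following the remaining hops (registry, pipeline build, source repository) in Defender.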

Step 5: Root Cause Analysis

Investigate: Was this vulnerability known? Was it caught by GitHub Advanced Security (GHAS)?

# Check recent code scanning alerts
gh api /repos/YOUR_ORG/YOUR_REPO/code-scanning/alerts --jq '.[0:5] | .[] | {rule: .rule.id, state: .state}'

# Check the pod events for root cause
kubectl describe pod -l app=myapp -n YOUR_NAMESPACE | grep -A 10 "Events:"

Key finding: This is a configuration issue (readiness probe misconfiguration), not a code vulnerability. GHAS code scanning would not catch this because it analyzes application source code, not Kubernetes manifests. This is exactly why runtime detection exists.
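
Probe failures appear in the pod's event list as Unhealthy events whose message contains "Readiness probe failed" or "Liveness probe failed". A minimal sketch for tallying them from kubectl describe output follows; count_probe_failures is an illustrative name, not a workshop script.

```shell
# Hypothetical helper: reads `kubectl describe pod` output on stdin and
# prints how many event lines report a failed readiness or liveness probe.
# `|| true` keeps the exit status 0 when grep finds no match (it prints 0).
count_probe_failures() {
  grep -c -E '(Readiness|Liveness) probe failed' || true
}

# Usage (replace YOUR_NAMESPACE):
#   kubectl describe pod -l app=myapp -n YOUR_NAMESPACE | count_probe_failures
#
# To view the probe definition itself (assumes the application container is
# the first container in the pod spec):
#   kubectl get pods -l app=myapp -n YOUR_NAMESPACE \
#     -o jsonpath='{.items[0].spec.containers[0].readinessProbe}'
```

The count is of event lines, not individual probe attempts, because kubectl aggregates repeated events into a single line with an "x N over ..." suffix.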

Step 6: Record T1

📝 Record in MTTR-TRACKER.md:

T1 = Investigation complete — you now understand the root cause
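
With T0 and T1 recorded, the investigation time is T1 minus T0. A minimal sketch, assuming both timestamps were captured as Unix epoch seconds with date +%s; the function name minutes_between and the sample values are illustrative, and MTTR-TRACKER.md may expect a different timestamp format.

```shell
# Capture a timestamp as Unix epoch seconds at each milestone:
#   T0=$(date +%s)   # when the alert fires
#   T1=$(date +%s)   # when the investigation is complete

# Hypothetical helper: whole minutes elapsed between two epoch timestamps
# (integer division, so partial minutes are truncated).
minutes_between() {
  echo $(( ($2 - $1) / 60 ))
}

# Example with made-up values 7 minutes 30 seconds apart:
minutes_between 1700000000 1700000450   # prints 7
```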

Expected Outcome

Alert received, context understood, source correlated. You know WHAT is wrong and WHY.

💡 Key Insight: “In production, you don’t always know a vulnerability exists until it manifests. Runtime detection catches what pre-deployment guardrails missed — and speed of detection determines blast radius.”

💡 Run scripts/verify-exercise1.sh to validate your Exercise 1 completion.
