
Runtime Incident Detection


Exercise 1: Runtime Incident Detection

Goal: Detect a runtime incident and correlate it to its source β€” proving that runtime detection catches what guardrails missed.

πŸ“ Open docs/incident-observation.md β€” record your SRE Agent alert details and Defender correlation findings.


How SRE Agent Works β€” 6-Stage Architecture

Before diving in, understand what makes SRE Agent fundamentally different from traditional alerting:

┌────────────────────────────────────────────────────────────────────────────────────┐
│                             Azure SRE Agent — 6 Stages                             │
├─────────────┬──────────────┬───────────────┬────────────┬─────────────┬────────────┤
│ 1.CONNECTS  │ 2.MONITORS   │3.INVESTIGATES │4.DIAGNOSES │5.REMEDIATES │ 6.LEARNS   │
│             │              │               │            │             │            │
│ Loads code, │ Scheduled    │ Multi-signal  │ Root cause │ Executes    │ Retains    │
│ logs,       │ intelligence │ correlation   │ with       │ runbooks    │ knowledge  │
│ metrics,    │ + reactive   │ (logs +       │ context    │ or requests │ for faster │
│ toolchain   │ alerts       │ metrics +     │ from prior │ human       │ response   │
│ at SETUP    │              │ topology +    │ incidents  │ approval    │ next time  │
│             │              │ code changes) │ (memory)   │             │            │
└─────────────┴──────────────┴───────────────┴────────────┴─────────────┴────────────┘

πŸ”‘ Key: SRE Agent has DEEP CONTEXT β€” it loaded your environment at setup, not when the alert fires. This is why diagnosis takes 2–10 min (AI) vs 15–30 min (human scrambling for context).

πŸ“Š Microsoft at scale: 1,300+ SRE agents inside Microsoft, 35,000+ incidents mitigated, 20,000+ hours saved/month.


Proactive vs Reactive β€” Two Operating Modes

SRE Agent doesn’t just REACT to incidents β€” it can PREVENT them:

| Mode | How It Works | Example |
| --- | --- | --- |
| Reactive (this exercise) | Alert fires → Agent investigates → Diagnoses → Proposes fix | Health check failure detected |
| Proactive (Exercise 3 addition) | Scheduled intelligence → Agent checks health → Catches issues BEFORE incidents | Hourly probe validation |

πŸ’‘ The agent you’re seeing in Exercise 1 is in REACTIVE mode. In Exercise 3, you’ll configure it for PROACTIVE mode β€” preventing recurrence before the next incident.
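To make the proactive mode concrete, here is a minimal sketch of what an hourly health probe amounts to. This is an illustration only: `HEALTH_URL` and `notify_oncall` are hypothetical placeholders, not part of the workshop repo or the SRE Agent API.

```shell
# Sketch: classify an HTTP status code from a scheduled health probe.
# Assumes a hypothetical HEALTH_URL and notify_oncall hook.
classify_probe() {
  case "$1" in
    2??|3??) echo healthy ;;    # 2xx/3xx: probe passed
    *)       echo unhealthy ;;  # anything else (or a failed request)
  esac
}

# Run hourly (e.g. cron: 0 * * * *):
#   status=$(curl -s -o /dev/null -w '%{http_code}' "$HEALTH_URL")
#   [ "$(classify_probe "$status")" = healthy ] || notify_oncall "$status"
```

The point of the sketch: a proactive check catches the unhealthy state on a schedule, before a user-facing incident fires the reactive path.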


Step 1: Observe the Alert

The staged incident triggers β€” the container begins crash-looping (or Defender detects anomalous behavior).

# Watch the deployment status β€” you should see restarts increasing
kubectl get pods -n YOUR_NAMESPACE -w

Expected output:

NAME                        READY   STATUS             RESTARTS   AGE
myapp-7f8b9c6d4-x2k9p      0/1     CrashLoopBackOff   3          2m
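If you prefer a non-interactive check over `-w` watching, you can filter the same output for crash-looping pods. This is a sketch; the sample input below mirrors the expected output above rather than a live cluster.

```shell
# Flag pods that are in CrashLoopBackOff or have restarted at least once.
# Columns from `kubectl get pods`: NAME READY STATUS RESTARTS AGE.
flag_crashloops() {
  awk 'NR > 1 && ($3 == "CrashLoopBackOff" || $4 + 0 > 0) { print $1 }'
}

printf '%s\n' \
  'NAME                        READY   STATUS             RESTARTS   AGE' \
  'myapp-7f8b9c6d4-x2k9p       0/1     CrashLoopBackOff   3          2m' \
  | flag_crashloops
# Prints: myapp-7f8b9c6d4-x2k9p
```

Against the live cluster, pipe the real command instead: `kubectl get pods -n YOUR_NAMESPACE | flag_crashloops`.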

Step 2: SRE Agent Alert

SRE Agent receives the alert and displays:

  • What resource is affected: Container myapp in namespace YOUR_NAMESPACE
  • What the symptoms are: Readiness probe failing, container restarting
  • Suggested investigation steps: Check container logs, check recent deployments, review probe configuration

Review the SRE Agent notification in your configured channel (Azure Portal / Slack / Teams).


Step 3: Record T0

πŸ“ Open MTTR-TRACKER.md and record:

  • T0 = Alert fired β€” the timestamp when the SRE Agent alert was received

Step 4: Defender Correlation

Open Microsoft Defender for Cloud and examine the alert context:

  1. Navigate to Security Alerts β†’ find the container health alert
  2. Review the code-to-runtime correlation:
Runtime Alert β†’ Container Image β†’ Registry β†’ Pipeline Build β†’ Source Repository

Defender maps the runtime issue back to the source repository and deployment pipeline, showing the full chain.


Step 5: Root Cause Analysis

Investigate: Was this vulnerability known? Was it caught by GHAS?

# Check recent code scanning alerts
gh api /repos/YOUR_ORG/YOUR_REPO/code-scanning/alerts --jq '.[0:5] | .[] | {rule: .rule.id, state: .state}'

# Check the pod events for root cause
kubectl describe pod -l app=myapp -n YOUR_NAMESPACE | grep -A 10 "Events:"

Key finding: This is a configuration issue (readiness probe misconfiguration), not a code vulnerability. GHAS code scanning wouldn’t catch this β€” it validates source code, not Kubernetes manifests. This is exactly why runtime detection exists.
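For context, a readiness-probe misconfiguration of this kind lives in the Deployment manifest, not the application source. The excerpt below is hypothetical — the path, port, and thresholds are illustrative, not taken from the workshop repo:

```yaml
# Hypothetical manifest excerpt — values are illustrative only.
readinessProbe:
  httpGet:
    path: /healthz          # wrong path: suppose the app actually serves /health
    port: 8080
  initialDelaySeconds: 0    # no warm-up time before the first probe
  periodSeconds: 5
  failureThreshold: 1       # a single missed probe marks the pod unready
```

Because this YAML never passes through code scanning, the first signal is the runtime probe failure itself.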


Step 6: Record T1

πŸ“ Record in MTTR-TRACKER.md:

  • T1 = Investigation complete β€” you now understand the root cause

Expected Outcome

Alert received, context understood, source correlated. You know WHAT is wrong and WHY.

πŸ’‘ Key Insight: β€œIn production, you don’t always know a vulnerability exists until it manifests. Runtime detection catches what pre-deployment guardrails missed β€” and speed of detection determines blast radius.”

πŸ’‘ Run scripts/verify-exercise1.sh to validate your Exercise 1 completion.
