Runtime Incident Detection
Exercise 1: Runtime Incident Detection
Goal: Detect a runtime incident and correlate it to its source β proving that runtime detection catches what guardrails missed.
π Open
docs/incident-observation.mdβ record your SRE Agent alert details and Defender correlation findings.
How SRE Agent Works β 6-Stage Architecture
Before diving in, understand what makes SRE Agent fundamentally different from traditional alerting:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Azure SRE Agent β 6 Stages β
ββββββββββββββ¬βββββββββββββ¬ββββββββββββββ¬ββββββββββββ¬βββββββββββββ¬ββββββββββββ€
β 1.CONNECTS β 2.MONITORS β3.INVESTIGATESβ4.DIAGNOSESβ5.REMEDIATESβ 6.LEARNS β
β β β β β β β
β Loads code,β Scheduled β Multi-signalβ Root causeβ Executes β Retains β
β logs, β intelligenceβ correlation β with β runbooks β knowledge β
β metrics, β + reactive β (logs + β context β or requestsβ for fasterβ
β toolchain β alerts β metrics + β from priorβ human β response β
β at SETUP β β topology + β incidents β approval β next time β
β β β code changesβ (memory) β β β
ββββββββββββββ΄βββββββββββββ΄ββββββββββββββ΄ββββββββββββ΄βββββββββββββ΄ββββββββββββ
π Key: SRE Agent has DEEP CONTEXT β it loaded your environment at setup, not when the alert fires. This is why diagnosis takes 2β10 min (AI) vs 15β30 min (human scrambling for context).
π Microsoft at scale: 1,300+ SRE agents inside Microsoft, 35,000+ incidents mitigated, 20,000+ hours saved/month.
Proactive vs Reactive β Two Operating Modes
SRE Agent doesnβt just REACT to incidents β it can PREVENT them:
| Mode | How It Works | Example |
|---|---|---|
| Reactive (this exercise) | Alert fires β Agent investigates β Diagnoses β Proposes fix | Health check failure detected |
| Proactive (Exercise 3 addition) | Scheduled intelligence β Agent checks health β Catches issues BEFORE incidents | Hourly probe validation |
π‘ The agent youβre seeing in Exercise 1 is in REACTIVE mode. In Exercise 3, youβll configure it for PROACTIVE mode β preventing recurrence before the next incident.
Step 1: Observe the Alert
The staged incident triggers β the container begins crash-looping (or Defender detects anomalous behavior).
# Watch the deployment status β you should see restarts increasing
kubectl get pods -n YOUR_NAMESPACE -w
Expected output:
NAME READY STATUS RESTARTS AGE
myapp-7f8b9c6d4-x2k9p 0/1 CrashLoopBackOff 3 2m
Step 2: SRE Agent Alert
SRE Agent receives the alert and displays:
- What resource is affected: Container
myappin namespaceYOUR_NAMESPACE - What the symptoms are: Readiness probe failing, container restarting
- Suggested investigation steps: Check container logs, check recent deployments, review probe configuration
Review the SRE Agent notification in your configured channel (Azure Portal / Slack / Teams).
Step 3: Record T0
π Open
MTTR-TRACKER.mdand record:
- T0 = Alert fired β the timestamp when the SRE Agent alert was received
Step 4: Defender Correlation
Open Microsoft Defender for Cloud and examine the alert context:
- Navigate to Security Alerts β find the container health alert
- Review the code-to-runtime correlation:
Runtime Alert β Container Image β Registry β Pipeline Build β Source Repository
Defender maps the runtime issue back to the source repository and deployment pipeline, showing the full chain.
Step 5: Root Cause Analysis
Investigate: Was this vulnerability known? Was it caught by GHAS?
# Check recent code scanning alerts
gh api /repos/YOUR_ORG/YOUR_REPO/code-scanning/alerts --jq '.[0:5] | .[] | {rule: .rule.id, state: .state}'
# Check the pod events for root cause
kubectl describe pod -l app=myapp -n YOUR_NAMESPACE | grep -A 10 "Events:"
Key finding: This is a configuration issue (readiness probe misconfiguration), not a code vulnerability. GHAS code scanning wouldnβt catch this β it validates source code, not Kubernetes manifests. This is exactly why runtime detection exists.
Step 6: Record T1
π Record in
MTTR-TRACKER.md:
- T1 = Investigation complete β you now understand the root cause
Expected Outcome
Alert received, context understood, source correlated. You know WHAT is wrong and WHY.
π‘ Key Insight: βIn production, you donβt always know a vulnerability exists until it manifests. Runtime detection catches what pre-deployment guardrails missed β and speed of detection determines blast radius.β
π‘ Run
scripts/verify-exercise1.shto validate your Exercise 1 completion.