AI-Assisted Investigation & Remediation

Exercise 2: AI-Assisted Investigation & Remediation

Goal: Measure how AI compresses the time between alert and fix — while keeping human judgment at every gate.

Copilot Lens: 🔍 “HOW FAST is the response?”

Step 1: SRE Agent Diagnosis

SRE Agent provides a natural-language diagnosis:

“Container myapp in namespace YOUR_NAMESPACE is crash-looping. Root cause: readiness probe timeout is set to 1s but the application needs approximately 5s to initialize. The probe fires before the application is ready, Kubernetes marks the pod as unhealthy, and the container is killed and restarted.

Recommended fix: Increase readinessProbe.timeoutSeconds to 10s and initialDelaySeconds to 10s.”
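Before acting on the diagnosis, you may want to cross-check it against the cluster yourself. The sketch below wraps two standard kubectl queries in a helper. The function name show_probe_evidence is hypothetical, and it assumes the affected container is the first one in the pod template.

```shell
# Hypothetical helper (not part of the workshop repo): print the configured
# readiness probe and recent warning events for a deployment's namespace.
# Assumes the affected container is the first one in the pod template.
show_probe_evidence() {
  deployment="$1"
  namespace="$2"

  # Current readiness probe settings as JSON (timeoutSeconds, initialDelaySeconds, ...)
  kubectl get deployment "$deployment" -n "$namespace" \
    -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'
  echo

  # Recent warning events, oldest first; probe failures show up with reason "Unhealthy"
  kubectl get events -n "$namespace" \
    --field-selector type=Warning \
    --sort-by=.lastTimestamp
}

# Usage (needs cluster access):
#   show_probe_evidence myapp YOUR_NAMESPACE
```

If the probe output shows timeoutSeconds set to 1 and the events list repeated probe failures, the evidence is consistent with the agent's diagnosis.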

Incident Response Plans — Autonomy Levels

Before approving, understand how organizations configure SRE Agent autonomy per incident type:

Incident Type	Severity	Autonomy	Approval Required?
Pod restart / scale up	Low	Auto-remediate	❌ No — agent fixes autonomously
Config rollback	Medium	Single approval	✅ 1 person approves
Production deployment change	High	Multi-person	✅ 2+ people approve
Data migration / schema change	Critical	Human-only	✅ Agent advises only

💡 For THIS workshop (probe misconfiguration), the agent requests single approval. In YOUR production environment, you’d configure autonomy levels to match your risk tolerance.
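To make the mapping in the table concrete, here is a minimal sketch of it as a shell lookup. It is illustrative only: the function name approval_policy and the policy labels are made up for this example and do not represent SRE Agent's actual configuration format.

```shell
# Illustrative only: maps the severity levels from the table to an approval
# policy label. This is NOT SRE Agent's configuration format.
approval_policy() {
  case "$1" in
    low)      echo "auto-remediate" ;;
    medium)   echo "single-approval" ;;
    high)     echo "multi-person-approval" ;;
    critical) echo "human-only" ;;
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

approval_policy medium   # prints: single-approval
```

Rejecting unknown severities, rather than defaulting to a permissive policy, keeps an unclassified incident from being remediated without review.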

Step 2: Approve SRE Agent Action

SRE Agent suggests remediation steps. You must click “Approve” before any action is executed.

⚠️ AI investigates and proposes. Humans decide. In this workshop, every SRE Agent action requires explicit approval.

Step 3: SRE Agent → GitHub Issue → Copilot PR Pipeline

The full agentic pipeline from incident detection to code fix:

SRE Agent Detects Incident
    ↓
SRE Agent Investigates (multi-signal correlation: logs + metrics + code changes)
    ↓
SRE Agent Creates GitHub Issue (auto-populated with RCA + metrics + blast radius)
    ↓
Copilot Coding Agent Picks Up Issue → Creates Fix PR
    ↓
Human Reviews & Approves PR
    ↓
CI/CD Deploys Fix → SRE Agent Confirms Resolution

🤖 TWO AI agents working together:

SRE Agent diagnoses and documents (creates the GitHub Issue with full RCA)

Copilot implements the fix (picks up the issue, writes code, opens PR)

Human approves at every gate

Assign Copilot coding agent the remediation task:

@copilot Fix the readiness probe configuration in the Kubernetes deployment manifest.
The readiness probe timeoutSeconds should be at least 10s and initialDelaySeconds
should be at least 10s. The application requires approximately 5 seconds to start up.
Also ensure a liveness probe is configured with appropriate thresholds.

Copilot coding agent will:

Create a new branch
Modify the Kubernetes manifest with the corrected probe configuration
Commit the fix
Open a Pull Request
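For orientation when you read the PR, the probe section of the manifest might end up looking roughly like the fragment printed below. The readiness timeoutSeconds and initialDelaySeconds values come from the recommended fix, and the /health path matches the endpoint used later in this exercise. The container port (8080) and all liveness values are assumptions for illustration, so Copilot's actual output may differ.

```shell
# Sketch of the probe section Copilot's PR might produce.
# Values marked "assumed" are illustrative, not taken from the workshop repo.
print_probe_sketch() {
  cat <<'EOF'
readinessProbe:
  httpGet:
    path: /health          # health endpoint used in this exercise
    port: 8080             # assumed container port
  initialDelaySeconds: 10
  timeoutSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8080             # assumed container port
  initialDelaySeconds: 15  # assumed
  timeoutSeconds: 10       # assumed
  failureThreshold: 3      # assumed (also the Kubernetes default)
EOF
}

print_probe_sketch
```

Note that timeoutSeconds and initialDelaySeconds take plain integers (seconds), so the manifest should contain 10 rather than 10s.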

Step 4: Record T2

📝 Record in MTTR-TRACKER.md:

T2 = Fix PR created — Copilot has opened the PR
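If you prefer to capture timestamps from the terminal, a small helper like the one below can append them for you. This is a sketch: the function name record_timestamp and the output line format are assumptions, and the real MTTR-TRACKER.md layout may differ, so adapt the printf line to match it. The same helper works for T3, T4, and T5.

```shell
# Hypothetical helper: append a labelled UTC timestamp to the tracker file.
# The real MTTR-TRACKER.md layout may differ; adapt the output line to match it.
record_timestamp() {
  label="$1"   # e.g. T2
  note="$2"    # e.g. "Fix PR created"
  tracker="${MTTR_TRACKER:-MTTR-TRACKER.md}"
  printf '%s = %s (%s)\n' "$label" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$note" >> "$tracker"
}

# Usage:
#   record_timestamp T2 "Fix PR created"
```

Recording in UTC avoids confusion when participants or pipeline logs are in different time zones.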

Step 5: Review and Approve the PR

Open the PR and review the diff:

# List open PRs
gh pr list --repo YOUR_ORG/YOUR_REPO

# View the PR diff
gh pr diff PR_NUMBER --repo YOUR_ORG/YOUR_REPO

Review checklist:

Is the fix correct? Is timeoutSeconds at least 10 (the field takes an integer number of seconds)?
Is initialDelaySeconds at least 10?
Does it include both readiness AND liveness probes?
Does it introduce any new issues?
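The first three checklist items can be partly automated. The sketch below is a crude line scan, not a YAML parser: it assumes a single container per manifest, and the function name check_probes and the manifest path in the usage comment are hypothetical. It does not replace reading the diff, and it cannot answer the last checklist question.

```shell
# Rough checklist helper: scans a manifest for readiness/liveness probes and
# checks the readiness thresholds. Line-based, so it assumes one container.
check_probes() {
  manifest="$1"
  awk '
    /readinessProbe:/ { section = "readiness"; seen_r = 1; next }
    /livenessProbe:/  { section = "liveness";  seen_l = 1; next }
    /startupProbe:/   { section = "startup"; next }
    section == "readiness" && /timeoutSeconds:/      { timeout = $2 + 0 }
    section == "readiness" && /initialDelaySeconds:/ { delay = $2 + 0 }
    END {
      ok = 1
      if (!seen_r)      { print "FAIL: no readinessProbe"; ok = 0 }
      if (!seen_l)      { print "FAIL: no livenessProbe"; ok = 0 }
      if (timeout < 10) { print "FAIL: readiness timeoutSeconds < 10"; ok = 0 }
      if (delay < 10)   { print "FAIL: readiness initialDelaySeconds < 10"; ok = 0 }
      if (ok) print "OK: probes match the checklist"
      exit !ok
    }
  ' "$manifest"
}

# Usage, from a local clone of the repository (manifest path is a placeholder):
#   gh pr checkout PR_NUMBER
#   check_probes path/to/deployment.yaml
```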

Approve the PR — this is the human-in-the-loop gate:

gh pr review PR_NUMBER --approve --repo YOUR_ORG/YOUR_REPO
gh pr merge PR_NUMBER --merge --repo YOUR_ORG/YOUR_REPO

Step 6: Record T3

📝 Record in MTTR-TRACKER.md:

T3 = Fix PR approved — human has reviewed and merged

Step 7: Deploy the Fix

The CI/CD pipeline rebuilds and redeploys the fixed container:

# Check the status of the most recent pipeline run
gh run list --repo YOUR_ORG/YOUR_REPO --limit 1

# Wait for deployment to complete, then verify
kubectl rollout status deployment/myapp -n YOUR_NAMESPACE
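If you would rather block until the deployment finishes than poll by hand, the two steps can be chained. The sketch below is one way to do that: the function name wait_for_deploy and the 180-second rollout timeout are assumptions, and it treats the most recent workflow run as the deploy run, which may not hold if the repository has several workflows.

```shell
# Hypothetical helper: block until the latest workflow run and the rollout finish.
# Assumes the most recent run in the repository is the deploy pipeline.
wait_for_deploy() {
  repo="$1"
  deployment="$2"
  namespace="$3"

  # ID of the most recent workflow run in the repository
  run_id=$(gh run list --repo "$repo" --limit 1 --json databaseId --jq '.[0].databaseId')

  # Follow the run until it finishes; --exit-status makes a failed run return non-zero
  gh run watch "$run_id" --repo "$repo" --exit-status || return 1

  # Wait up to 3 minutes (an assumed limit) for the new pods to become ready
  kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout=180s
}

# Usage:
#   wait_for_deploy YOUR_ORG/YOUR_REPO myapp YOUR_NAMESPACE
```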

Step 8: Record T4

📝 Record in MTTR-TRACKER.md:

T4 = Fix deployed — the corrected container is running

Step 9: Verify Resolution

# Confirm the container is now healthy
kubectl get pods -n YOUR_NAMESPACE -l app=myapp

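# Verify the health endpoint responds
kubectl port-forward svc/YOUR_SERVICE 8080:80 -n YOUR_NAMESPACE &
PF_PID=$!
sleep 2  # give the port-forward a moment to establish
# -f makes curl return non-zero on an HTTP error status
curl -sf http://localhost:8080/health && echo "✅ Incident resolved"
kill "$PF_PID"  # stop the background port-forward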

Confirm: SRE Agent alert has cleared. The application is healthy.

Step 10: Record T5 and Calculate MTTR

📝 Record in MTTR-TRACKER.md:

T5 = Incident resolved — SRE Agent alert cleared, application healthy

MTTR = T5 − T0
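As a worked example of the subtraction, the sketch below assumes you captured T0 and T5 as Unix epoch seconds (for example with date +%s). The function name mttr and the sample timestamps are made up for illustration.

```shell
# Format the gap between two Unix timestamps (seconds) as minutes and seconds.
mttr() {
  t0="$1"   # detection time, epoch seconds
  t5="$2"   # resolution time, epoch seconds
  elapsed=$((t5 - t0))
  echo "$((elapsed / 60))m $((elapsed % 60))s"
}

# Example with made-up timestamps 754 seconds apart
mttr 1700000000 1700000754   # prints: 12m 34s
```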

MTTR Analysis

Review your completed MTTR tracker:

Phase	What Happened	AI Role
T0 → T1 (Detection → Investigation)	SRE Agent diagnosed root cause	AI accelerated
T1 → T2 (Investigation → Fix PR)	Copilot created fix branch + PR	AI accelerated
T2 → T3 (PR created → PR approved)	Human reviewed and approved	Human judgment
T3 → T4 (Approved → Deployed)	CI/CD pipeline rebuilt and deployed	Automated
T4 → T5 (Deployed → Resolved)	Verification and alert clearance	Human verified

Which phase took the longest?
Where did AI save the most time?
Where is human judgment essential and irreplaceable?
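To answer the first question from your own numbers, you can compute each phase duration from the six timestamps. This is a minimal sketch assuming T0 through T5 were recorded as Unix epoch seconds; the function name phase_breakdown and the sample values are hypothetical.

```shell
# Print each phase duration in seconds and name the longest phase.
# Arguments: T0 T1 T2 T3 T4 T5 as epoch seconds.
phase_breakdown() {
  prev="$1"
  shift
  i=0
  longest=0
  longest_name=""
  for t in "$@"; do
    name="T$i->T$((i + 1))"
    d=$((t - prev))
    echo "$name: ${d}s"
    if [ "$d" -gt "$longest" ]; then
      longest="$d"
      longest_name="$name"
    fi
    prev="$t"
    i=$((i + 1))
  done
  echo "longest: $longest_name"
}

# Made-up timestamps for illustration
phase_breakdown 0 120 300 900 1080 1140
```

With these sample values the longest phase is T2->T3, the human review step, which is a common pattern once investigation and fix creation are automated.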

Expected Outcome

Full incident lifecycle completed. MTTR measured. AI compressed investigation and fix creation from hours to minutes.

💡 Key Insight: “The value of AI in incident response is not replacing humans. It’s compressing the time between alert and fix — from hours to minutes — while keeping human judgment at every approval gate.”

💡 Run scripts/verify-exercise2.sh to validate your Exercise 2 completion.
