Exercise 2: AI-Assisted Investigation & Remediation
Goal: Measure how AI compresses the time between alert and fix while keeping human judgment at every gate.
Copilot Lens: "HOW FAST is the response?"
Step 1: SRE Agent Diagnosis
SRE Agent provides a natural-language diagnosis:
"Container `myapp` in namespace `YOUR_NAMESPACE` is crash-looping. Root cause: readiness probe timeout is set to `1s` but the application needs approximately `5s` to initialize. The probe fires before the application is ready, Kubernetes marks the pod as unhealthy, and the container is killed and restarted. Recommended fix: Increase `readinessProbe.timeoutSeconds` to `10s` and `initialDelaySeconds` to `10s`."
Incident Response Plans: Autonomy Levels
Before approving, understand how organizations configure SRE Agent autonomy per incident type:
| Incident Type | Severity | Autonomy | Approval Required? |
|---|---|---|---|
| Pod restart / scale up | Low | Auto-remediate | No, agent fixes autonomously |
| Config rollback | Medium | Single approval | Yes, 1 person approves |
| Production deployment change | High | Multi-person | Yes, 2+ people approve |
| Data migration / schema change | Critical | Human-only | Agent advises only |
💡 For THIS workshop (probe misconfiguration), the agent requests single approval. In YOUR production, you'd configure autonomy levels that match your risk tolerance.
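To make the table concrete, here is a minimal sketch of the same policy as a lookup. It is not how SRE Agent is actually configured; the function name `required_approvals` and the `advise-only` token are hypothetical, and only the severity-to-approval mapping comes from the table above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: maps an incident severity to the number of human
# approvals needed before the agent may act. The mapping mirrors the table;
# the function name and the "advise-only" token are illustrative assumptions.
required_approvals() {
  case "$1" in
    low)      echo 0 ;;            # auto-remediate, no approval
    medium)   echo 1 ;;            # single approval
    high)     echo 2 ;;            # multi-person: at least two approvers
    critical) echo advise-only ;;  # agent never acts; humans do the work
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For the workshop's probe misconfiguration, `required_approvals medium` prints `1`, which matches the single approval requested in Step 2.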
Step 2: Approve SRE Agent Action
SRE Agent suggests remediation steps. You must click "Approve" before any action is executed.
⚠️ AI investigates and proposes. Humans decide. In this workshop, every SRE Agent action requires explicit approval.
Step 3: SRE Agent → GitHub Issue → Copilot PR Pipeline
The full agentic pipeline from incident detection to code fix:
SRE Agent Detects Incident
↓
SRE Agent Investigates (multi-signal correlation: logs + metrics + code changes)
↓
SRE Agent Creates GitHub Issue (auto-populated with RCA + metrics + blast radius)
↓
Copilot Coding Agent Picks Up Issue → Creates Fix PR
↓
Human Reviews & Approves PR
↓
CI/CD Deploys Fix → SRE Agent Confirms Resolution
🤖 TWO AI agents working together:
- SRE Agent diagnoses and documents (creates the GitHub Issue with full RCA)
- Copilot implements the fix (picks up the issue, writes code, opens PR)
- Human approves at every gate
Assign Copilot coding agent the remediation task:
@copilot Fix the readiness probe configuration in the Kubernetes deployment manifest.
The readiness probe timeoutSeconds should be at least 10s and initialDelaySeconds
should be at least 10s. The application requires approximately 5 seconds to start up.
Also ensure a liveness probe is configured with appropriate thresholds.
Copilot coding agent will:
- Create a new branch
- Modify the Kubernetes manifest with the corrected probe configuration
- Commit the fix
- Open a Pull Request
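As a rough idea of what the corrected configuration could look like, the sketch below writes an illustrative probe fragment to a file so you can compare it with the PR diff. The `/health` path is taken from the verification step later in this exercise; the container port `8080`, the liveness thresholds, and the function name `write_probe_example` are assumptions, not values from the workshop repository.

```shell
#!/usr/bin/env bash
# Writes an illustrative probe fragment to the file given as $1.
# Assumptions (not from the workshop repo): the container listens on port
# 8080, and the liveness values are placeholders you would tune.
write_probe_example() {
  cat > "$1" <<'EOF'
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  timeoutSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 3
EOF
}
```

Calling `write_probe_example /tmp/probe-example.yaml` produces a reference file only; the actual change still goes through the Copilot PR and human review.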
Step 4: Record T2
Record in `MTTR-TRACKER.md`:
- T2 = Fix PR created (Copilot has opened the PR)
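If you would rather capture timestamps from the terminal than type them by hand, a small helper along these lines could append them. The line format it writes is an assumption; adapt it to whatever layout your copy of `MTTR-TRACKER.md` uses. The function name `record_timestamp` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: appends "- LABEL = <UTC timestamp> (NOTE)" to a file.
# The output format is an assumption, not the tracker's official layout.
record_timestamp() {
  local label="$1" note="$2" file="${3:-MTTR-TRACKER.md}"
  # "--" stops option parsing, since the format string starts with "-".
  printf -- '- %s = %s (%s)\n' \
    "$label" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$note" >> "$file"
}
```

For example, `record_timestamp T2 "Fix PR created"` appends one line to `MTTR-TRACKER.md` in the current directory. The same call works for T3, T4, and T5 in the later steps.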
Step 5: Review and Approve the PR
Open the PR and review the diff:
# List open PRs
gh pr list --repo YOUR_ORG/YOUR_REPO
# View the PR diff
gh pr diff PR_NUMBER --repo YOUR_ORG/YOUR_REPO
Review checklist:
- Is the fix correct? Is `timeoutSeconds` >= 10s?
- Is `initialDelaySeconds` >= 10s?
- Does it include both readiness AND liveness probes?
- Does it introduce any new issues?
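Parts of this checklist can be spot-checked mechanically before the human review. The sketch below is a naive text scan, not a YAML parser: it assumes each probe key sits on its own line under a `readinessProbe:` or `livenessProbe:` header, and the function names are hypothetical. It does not replace reading the diff.

```shell
#!/usr/bin/env bash
# probe_setting FILE PROBE KEY: print the value of KEY inside the PROBE block.
# Naive line-based scan; assumes one "key: value" pair per line.
probe_setting() {
  awk -v probe="$2:" -v key="$3:" '
    $1 == probe                { in_probe = 1; next }
    in_probe && $1 ~ /Probe:$/ { in_probe = 0 }   # a different probe block starts
    in_probe && $1 == key      { print $2; exit }
  ' "$1"
}

# check_probes FILE: return non-zero if a checklist item fails.
check_probes() {
  local file="$1" status=0 key value
  for key in timeoutSeconds initialDelaySeconds; do
    value="$(probe_setting "$file" readinessProbe "$key")"
    # Missing or non-numeric values also count as failures.
    if ! [ "${value:-0}" -ge 10 ] 2>/dev/null; then
      echo "FAIL: readinessProbe.$key is '${value:-missing}', expected >= 10"
      status=1
    fi
  done
  if ! grep -q 'livenessProbe:' "$file"; then
    echo "FAIL: no livenessProbe found"
    status=1
  fi
  return "$status"
}
```

One way to use it is to check out the PR branch with `gh pr checkout PR_NUMBER` and run `check_probes` against the deployment manifest in your repository.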
Approve the PR. This is the human-in-the-loop gate:
gh pr review PR_NUMBER --approve --repo YOUR_ORG/YOUR_REPO
gh pr merge PR_NUMBER --merge --repo YOUR_ORG/YOUR_REPO
Step 6: Record T3
Record in `MTTR-TRACKER.md`:
- T3 = Fix PR approved (human has reviewed and merged)
Step 7: Deploy the Fix
The CI/CD pipeline rebuilds and redeploys the fixed container:
# Show the most recent pipeline run
gh run list --repo YOUR_ORG/YOUR_REPO --limit 1
# Wait for deployment to complete, then verify
kubectl rollout status deployment/myapp -n YOUR_NAMESPACE
Step 8: Record T4
Record in `MTTR-TRACKER.md`:
- T4 = Fix deployed (the corrected container is running)
Step 9: Verify Resolution
# Confirm the container is now healthy
kubectl get pods -n YOUR_NAMESPACE -l app=myapp
# Verify the health endpoint responds (-f makes curl exit non-zero on HTTP errors)
kubectl port-forward svc/YOUR_SERVICE 8080:80 -n YOUR_NAMESPACE &
curl -sf http://localhost:8080/health && echo "Incident resolved"
Confirm: SRE Agent alert has cleared. The application is healthy.
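The port-forward runs in the background and may not be ready when the first `curl` fires, so a single failed request does not necessarily mean the fix failed. A small retry wrapper, sketched here with a hypothetical function name, makes the check less sensitive to that race.

```shell
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND...: run COMMAND until it succeeds,
# giving up after ATTEMPTS tries and sleeping DELAY_SECONDS between tries.
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

For example, `retry 10 2 curl -sf http://localhost:8080/health` polls the health endpoint for up to roughly 20 seconds before reporting failure.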
Step 10: Record T5 and Calculate MTTR
Record in `MTTR-TRACKER.md`:
- T5 = Incident resolved (SRE Agent alert cleared, application healthy)
- MTTR = T5 - T0
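The subtraction can be done by hand, but if your timestamps are in ISO 8601 form a short helper avoids arithmetic slips. This sketch assumes GNU `date` (the `-d` flag; macOS/BSD `date` uses different options), and the timestamps in the usage note are made-up examples, not workshop results.

```shell
#!/usr/bin/env bash
# Assumes GNU date: -d parses an ISO 8601 timestamp, +%s prints epoch seconds.
to_epoch() { date -u -d "$1" +%s; }

# mttr_minutes T0 T5: print whole minutes elapsed between the two timestamps.
mttr_minutes() {
  local start end
  start="$(to_epoch "$1")" || return 1
  end="$(to_epoch "$2")" || return 1
  echo $(( (end - start) / 60 ))
}
```

With illustrative values, `mttr_minutes 2025-01-15T10:00:00Z 2025-01-15T10:42:00Z` prints `42`.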
MTTR Analysis
Review your completed MTTR tracker:
| Phase | What Happened | AI Role |
|---|---|---|
| T0 → T1 (Detection → Investigation) | SRE Agent diagnosed root cause | AI accelerated |
| T1 → T2 (Investigation → Fix PR) | Copilot created fix branch + PR | AI accelerated |
| T2 → T3 (PR created → PR approved) | Human reviewed and approved | Human judgment |
| T3 → T4 (Approved → Deployed) | CI/CD pipeline rebuilt and deployed | Automated |
| T4 → T5 (Deployed → Resolved) | Verification and alert clearance | Human verified |
- Which phase took the longest?
- Where did AI save the most time?
- Where is human judgment essential and cannot be replaced?
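To answer the first question from your own numbers, you could compute each phase's duration from the six recorded timestamps. The sketch below assumes GNU `date` and ISO 8601 timestamps passed in order T0 through T5; the function name `phase_report` is hypothetical.

```shell
#!/usr/bin/env bash
# Assumes GNU date (-d); macOS/BSD date needs different flags.
to_epoch() { date -u -d "$1" +%s; }

# phase_report T0 T1 T2 T3 T4 T5: print seconds spent in each phase,
# followed by the longest phase.
phase_report() {
  local i=0 prev_epoch=0 epoch ts d longest="" longest_s=-1
  for ts in "$@"; do
    epoch="$(to_epoch "$ts")" || return 1
    if [ "$i" -gt 0 ]; then
      d=$((epoch - prev_epoch))
      echo "T$((i - 1)) -> T$i: ${d}s"
      if [ "$d" -gt "$longest_s" ]; then
        longest_s="$d"
        longest="T$((i - 1)) -> T$i"
      fi
    fi
    prev_epoch="$epoch"
    i=$((i + 1))
  done
  echo "longest: $longest (${longest_s}s)"
}
```

Passing your six tracker timestamps prints one line per phase and a final `longest:` line, which you can compare against the AI Role column in the table above.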
Expected Outcome
Full incident lifecycle completed. MTTR measured. AI compressed investigation and fix creation from hours to minutes.
💡 Key Insight: "The value of AI in incident response is not replacing humans. It's compressing the time between alert and fix, from hours to minutes, while keeping human judgment at every approval gate."
💡 Run `scripts/verify-exercise2.sh` to validate your Exercise 2 completion.