Continuous Improvement Loop (Grand Finale)
Exercise 3: Continuous Improvement Loop
Goal: Prove the system gets stronger from every incident and close the DevSecOps loop back to Workshops 1–3.
Open docs/feedback-actions-checklist.md to track completion of each feedback action and which workshop layer it strengthens.
Each action below is a concrete, hands-on task: not a description of something you could do, but something you will do right now.
(a) Update a Ruleset (~2 min)
→ Strengthens: WS2 (Guardrails – Policy layer)
Add a required check that would have caught this misconfiguration before deployment:
- Navigate to Organization Settings → Rulesets (or Repository Settings → Rulesets)
- Edit your existing ruleset (or create a new one)
- Add a new Required Status Check:
kubernetes-manifest-validation
# Verify the ruleset is active
gh api /repos/YOUR_ORG/YOUR_REPO/rulesets --jq '.[] | {name: .name, enforcement: .enforcement}'
This check will block any PR that includes a Kubernetes manifest failing validation, including misconfigured readiness probes.
(b) Add a Copilot Custom Instruction (~2 min)
→ Strengthens: WS2 (Guardrails – AI remediation)
Update the .github/copilot-instructions.md (originally created in WS1) with lessons from this incident:
- Open .github/copilot-instructions.md in your editor
- Add the following section:
## Lessons from Incident: Runtime Health Check Failure
- When configuring Kubernetes readiness probes, always set `timeoutSeconds` >= 5s
and `initialDelaySeconds` >= 10s to allow the application time to start up.
- Always include BOTH readiness and liveness probes for container deployments.
- Never set readiness probe timeout lower than the application's known startup time.
- Liveness probe should use a longer interval than readiness probe to avoid
premature restarts during transient slowdowns.
- Verify: start a new Copilot Chat session, ask it to write a Kubernetes deployment, and observe that the response includes the corrected probe guidance.
@copilot Write a Kubernetes Deployment manifest for a Node.js web application
with health checks configured.
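If the instructions took effect, the probes in Copilot's generated manifest should look roughly like the sketch below (the port and path values are illustrative, not prescribed by the instructions):

```yaml
# Probe settings consistent with the new custom instructions:
# timeoutSeconds >= 5, initialDelaySeconds >= 10, both probe types present.
readinessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10
  timeoutSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 15
  periodSeconds: 20        # longer interval than readiness, per the instructions
  timeoutSeconds: 10
```

If Copilot's response still omits a probe or uses a 1-second timeout, the instruction file was likely not picked up; restart the chat session and confirm the file path.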
(c) Create a Regression Test (~3 min)
→ Strengthens: WS2 (Guardrails – Detection layer) + WS3 (Supply Chain – pipeline integrity)
Write a manifest validation script that prevents this class of issue from recurring:
- Create scripts/validate-k8s-manifests.sh:
#!/bin/bash
# Kubernetes Manifest Validation – Regression Test
# Ensures readiness/liveness probes meet minimum thresholds
set -euo pipefail

ERRORS=0

for file in $(find . -name '*.yaml' -o -name '*.yml' | xargs grep -l 'kind: Deployment' 2>/dev/null); do
  echo "Validating: $file"

  # Check readiness probe exists
  if ! grep -q 'readinessProbe' "$file"; then
    echo "  ❌ FAIL: Missing readinessProbe"
    ERRORS=$((ERRORS + 1))
  fi

  # Check liveness probe exists
  if ! grep -q 'livenessProbe' "$file"; then
    echo "  ❌ FAIL: Missing livenessProbe"
    ERRORS=$((ERRORS + 1))
  fi

  # Check readiness probe timeout >= 5s
  # (head -1 takes the first match; || true keeps set -e/pipefail from
  # aborting the script when the field is absent)
  TIMEOUT=$(grep -A5 'readinessProbe' "$file" | grep 'timeoutSeconds' | awk '{print $2}' | head -1 || true)
  if [ -n "$TIMEOUT" ] && [ "$TIMEOUT" -lt 5 ]; then
    echo "  ❌ FAIL: readinessProbe.timeoutSeconds is ${TIMEOUT}s (minimum 5s)"
    ERRORS=$((ERRORS + 1))
  fi

  # Check initialDelaySeconds >= 5s
  DELAY=$(grep -A5 'readinessProbe' "$file" | grep 'initialDelaySeconds' | awk '{print $2}' | head -1 || true)
  if [ -n "$DELAY" ] && [ "$DELAY" -lt 5 ]; then
    echo "  ❌ FAIL: readinessProbe.initialDelaySeconds is ${DELAY}s (minimum 5s)"
    ERRORS=$((ERRORS + 1))
  fi
done

if [ $ERRORS -gt 0 ]; then
  echo ""
  echo "❌ Validation failed with $ERRORS error(s)"
  exit 1
else
  echo ""
  echo "✅ All Kubernetes manifests passed validation"
fi
- Make it executable:
chmod +x scripts/validate-k8s-manifests.sh
- Add to your CI/CD workflow (.github/workflows/ci.yml):
  kubernetes-manifest-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Kubernetes manifests
        run: ./scripts/validate-k8s-manifests.sh
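The grep/awk extraction at the heart of the validation script can be sanity-checked in isolation. A minimal sketch (the manifest fragment and its probe values are hypothetical):

```shell
#!/bin/bash
# Sanity-check the probe-field extraction used by validate-k8s-manifests.sh.
fragment='        readinessProbe:
          httpGet:
            path: /health
            port: 80
          timeoutSeconds: 1
          initialDelaySeconds: 1'

# grep -A5 keeps the 5 lines after "readinessProbe"; awk pulls the value field.
TIMEOUT=$(printf '%s\n' "$fragment" | grep -A5 'readinessProbe' | grep 'timeoutSeconds' | awk '{print $2}' | head -1)
echo "extracted timeoutSeconds=${TIMEOUT}"   # prints: extracted timeoutSeconds=1

if [ "$TIMEOUT" -lt 5 ]; then
  echo "would FAIL validation (minimum 5s)"
fi
```

Note this is a text-level check, not a YAML parse: it assumes the probe fields sit within five lines of the `readinessProbe:` key, which holds for typical httpGet probes but not for every manifest layout.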
(d) Update the Threat Model (~2 min)
→ Strengthens: WS1 (Trust Boundary – risk awareness)
Open THREAT-MODEL.md and fill in row 10 with the incident learnings:
| # | Attack Vector | Target Asset | Threat Actor | Current Control | Gap? |
|---|---|---|---|---|---|
| 10 | Incident response is slow or absent | A4 – Deployed App | T1 – External | ✅ SRE Agent alert + Copilot auto-remediation + regression test | Gap: Manifest validation not org-wide |
The threat model is now more complete than when the series began β row 10 now has concrete controls and identified residual gaps from real incident learnings.
(e) Verify the Loop Closes (~3 min)
→ Proves: The system is now stronger than before the incident
- Attempt to push a new manifest with the SAME misconfiguration:
# test-bad-manifest.yaml – intentionally misconfigured
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-regression
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-regression
  template:
    metadata:
      labels:
        app: test-regression
    spec:
      containers:
        - name: app
          image: nginx:latest
          readinessProbe:
            httpGet:
              path: /health
              port: 80
            timeoutSeconds: 1        # ❌ Same bad value as before
            initialDelaySeconds: 1   # ❌ Same bad value as before
- Run the regression test against it:
./scripts/validate-k8s-manifests.sh
Expected output:
Validating: ./test-bad-manifest.yaml
  ❌ FAIL: Missing livenessProbe
  ❌ FAIL: readinessProbe.timeoutSeconds is 1s (minimum 5s)
  ❌ FAIL: readinessProbe.initialDelaySeconds is 1s (minimum 5s)

❌ Validation failed with 3 error(s)
- Observe Copilot's updated behavior: ask Copilot to fix the manifest:
@copilot Fix the readiness probe in test-bad-manifest.yaml
Copilot, now following your updated custom instructions, suggests timeoutSeconds: 10 and initialDelaySeconds: 10, the correct values.
- Clean up the test file:
rm test-bad-manifest.yaml
The system is NOW stronger than before the incident. 🎉
Persistent Learning: The Agent Gets Smarter
The SRE Agent ITSELF learns from this incident. Unlike static runbooks, the agent's persistent memory means:
- It remembers this incident pattern (probe misconfiguration → crash loop)
- The next time a similar probe misconfiguration occurs, it is diagnosed in SECONDS, not minutes
- The agent's institutional knowledge grows with every incident
You can explicitly TEACH the agent by adding this incident pattern to its knowledge base:
"Remember: readinessProbe.timeoutSeconds < 5 on containers with startup > 3s
causes crash loops. Fix: set timeoutSeconds >= 10 and initialDelaySeconds >= 10."
💡 This is the ultimate Agentic DevSecOps loop: the AI gets smarter, not just the rules.
(f) Schedule a Proactive Health Check (~1.5 min)
→ Strengthens: WS4 (Response – prevents recurrence proactively)
Create a scheduled intelligence task in SRE Agent that runs every hour:
- Open Azure Portal → SRE Agent → Scheduled Tasks
- Create a new scheduled task:
Name: "Readiness Probe Validation"
Schedule: Every 1 hour
Scope: All Kubernetes deployments in cluster
Check: readinessProbe.timeoutSeconds >= 5 for all containers
Action: Alert platform team if misconfigured probes found
Natural language: "Check all deployments hourly for readiness probes with
timeoutSeconds < 5. If found, create an alert for the platform team
BEFORE it becomes a production incident."
- Verify the scheduled task is active:
# Confirm scheduled tasks in SRE Agent (via Azure Portal or CLI)
az monitor scheduled-query list --resource-group YOUR_RG -o table
This transforms the feedback from REACTIVE (regression test catches it at PR time) to PROACTIVE (SRE Agent catches it at RUNTIME, before anyone pushes code).
Grand Finale: The Closed Loop
Every feedback action you just performed strengthens a specific layer from a previous workshop:
| Feedback Action | Strengthens Which Layer? |
|---|---|
| Updated ruleset | WS2 (Guardrails – Policy) |
| New Copilot instruction | WS2 (Guardrails – AI remediation) |
| New regression test | WS2 + WS3 (Detection + Pipeline) |
| Updated threat model | WS1 🛡️ (Trust Boundary – risk awareness) |
| Monitoring alert refined | WS4 (Response – faster next time) |
| Proactive scheduled check (NEW) | WS4 (Response – PREVENTS recurrence) |
| SRE Agent learning (NEW) | WS4 (Response – AI gets smarter) |
"DevSecOps is not a set of tools. It is a closed-loop operating model."
NIST SSDF Callout
NIST SP 800-218 RV.1 requires identifying and confirming vulnerabilities on an ongoing basis. RV.3 requires root cause analysis and implementing corrective actions. Our feedback loop (from incident detection to ruleset update, regression test, and threat model extension) is this requirement in practice.
💡 Key Insight: "DevSecOps is a closed loop. Every incident makes your guardrails, policies, AI instructions, and threat models stronger. The system that let this issue through today will catch it tomorrow."
💡 Run scripts/verify-exercise3.sh to validate your Exercise 3 completion, and celebrate closing the DevSecOps loop! 🎉
Series Conclusion
"We started by defining WHERE development trust lives. We built guardrails to prevent bad code. We secured the pipeline and gained visibility. And now we've closed the loop: every incident makes the entire system stronger."
This is Agentic DevSecOps.
WS1 🛡️ Trust Boundary & Platform Trust
WS2 Secure by Design Guardrails
WS3 Supply Chain Integrity & Code-to-Cloud Visibility
WS4 Operational Response & Continuous Improvement ✅ COMPLETE