Step 3

Continuous Improvement Loop (Grand Finale)


Exercise 3: Continuous Improvement Loop

Goal: Prove the system gets stronger from every incident, and close the DevSecOps loop back to Workshops 1–3.

📝 Open docs/feedback-actions-checklist.md to track completion of each feedback action and which workshop layer it strengthens.

Each action below is a concrete, hands-on task: not a description of something you could do, but something you will do right now.


(a) Update a Ruleset (~2 min)

→ Strengthens: WS2 (Guardrails – Policy layer)

Add a required check that would have caught this misconfiguration before deployment:

  1. Navigate to Organization Settings → Rulesets (or Repository Settings → Rulesets)
  2. Edit your existing ruleset (or create a new one)
  3. Add a new Required Status Check: kubernetes-manifest-validation
# Verify the ruleset is active
gh api /repos/YOUR_ORG/YOUR_REPO/rulesets --jq '.[] | {name: .name, enforcement: .enforcement}'

This check will block any PR that includes a Kubernetes manifest failing validation, including misconfigured readiness probes.
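The same required status check can also be scripted. A hedged sketch against GitHub's repository rulesets REST API (the org/repo names and ruleset ID are placeholders; only the payload is built and checked locally here):

```shell
# Build the required_status_checks rule payload (placeholder values throughout).
cat > /tmp/ruleset-rule.json <<'EOF'
{
  "rules": [
    {
      "type": "required_status_checks",
      "parameters": {
        "strict_required_status_checks_policy": true,
        "required_status_checks": [
          { "context": "kubernetes-manifest-validation" }
        ]
      }
    }
  ]
}
EOF
# Apply to an existing ruleset (requires repo admin rights; not run here):
#   gh api -X PUT /repos/YOUR_ORG/YOUR_REPO/rulesets/RULESET_ID --input /tmp/ruleset-rule.json
grep -q '"context": "kubernetes-manifest-validation"' /tmp/ruleset-rule.json && echo "payload ready"
```

The check name in the payload must match the CI job name exactly, or the check will never report.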


(b) Add a Copilot Custom Instruction (~2 min)

→ Strengthens: WS2 (Guardrails – AI remediation)

Update the .github/copilot-instructions.md (originally created in WS1) with lessons from this incident:

  1. Open .github/copilot-instructions.md in your editor
  2. Add the following section:
## Lessons from Incident – Runtime Health Check Failure

- When configuring Kubernetes readiness probes, always set `timeoutSeconds` >= 5s
  and `initialDelaySeconds` >= 10s to allow the application time to start up.
- Always include BOTH readiness and liveness probes for container deployments.
- Never set readiness probe timeout lower than the application's known startup time.
- Liveness probe should use a longer interval than readiness probe to avoid
  premature restarts during transient slowdowns.
  3. Verify: Start a new Copilot Chat session → ask it to write a Kubernetes deployment → observe that the response includes the corrected probe guidance.
@copilot Write a Kubernetes Deployment manifest for a Node.js web application
with health checks configured.
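Before the chat-based check, it can help to confirm the new section actually landed on disk. A minimal sketch using a temporary copy (your real path is .github/copilot-instructions.md):

```shell
# Sanity check on a temporary stand-in for .github/copilot-instructions.md:
# confirm the incident-lessons section is present before testing in Copilot Chat.
FILE=/tmp/copilot-instructions.md
cat > "$FILE" <<'EOF'
## Lessons from Incident - Runtime Health Check Failure
- When configuring Kubernetes readiness probes, always set `timeoutSeconds` >= 5s
EOF
if grep -q '## Lessons from Incident' "$FILE"; then
  echo "lesson section present"
else
  echo "lesson section missing" >&2
  exit 1
fi
```

Run the same grep against your actual instructions file in the repo.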

(c) Create a Regression Test (~3 min)

→ Strengthens: WS2 (Guardrails – Detection layer) + WS3 (Supply Chain – pipeline integrity)

Write a manifest validation script that prevents this class of issue from recurring:

  1. Create scripts/validate-k8s-manifests.sh:
#!/bin/bash
# Kubernetes Manifest Validation – Regression Test
# Ensures readiness/liveness probes meet minimum thresholds

set -euo pipefail

ERRORS=0

for file in $(find . -name '*.yaml' -o -name '*.yml' | xargs grep -l 'kind: Deployment' 2>/dev/null); do
  echo "Validating: $file"

  # Check readiness probe exists
  if ! grep -q 'readinessProbe' "$file"; then
    echo "  ❌ FAIL: Missing readinessProbe"
    ERRORS=$((ERRORS + 1))
  fi

  # Check liveness probe exists
  if ! grep -q 'livenessProbe' "$file"; then
    echo "  ❌ FAIL: Missing livenessProbe"
    ERRORS=$((ERRORS + 1))
  fi

  # Check readiness probe timeout >= 5s
  # ('|| true' keeps set -e/pipefail from aborting when the field is absent)
  TIMEOUT=$(grep -A5 'readinessProbe' "$file" | grep 'timeoutSeconds' | awk '{print $2}' | head -n1 || true)
  if [ -n "$TIMEOUT" ] && [ "$TIMEOUT" -lt 5 ]; then
    echo "  ❌ FAIL: readinessProbe.timeoutSeconds is ${TIMEOUT}s (minimum 5s)"
    ERRORS=$((ERRORS + 1))
  fi

  # Check initialDelaySeconds >= 5s
  DELAY=$(grep -A5 'readinessProbe' "$file" | grep 'initialDelaySeconds' | awk '{print $2}' | head -n1 || true)
  if [ -n "$DELAY" ] && [ "$DELAY" -lt 5 ]; then
    echo "  ❌ FAIL: readinessProbe.initialDelaySeconds is ${DELAY}s (minimum 5s)"
    ERRORS=$((ERRORS + 1))
  fi
done

if [ "$ERRORS" -gt 0 ]; then
  echo ""
  echo "❌ Validation failed with $ERRORS error(s)"
  exit 1
else
  echo ""
  echo "✅ All Kubernetes manifests passed validation"
fi
  2. Make it executable and add to CI:
chmod +x scripts/validate-k8s-manifests.sh
  3. Add this job under `jobs:` in your CI/CD workflow (.github/workflows/ci.yml):
  kubernetes-manifest-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Kubernetes manifests
        run: ./scripts/validate-k8s-manifests.sh
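Before wiring the script into CI, you can smoke-test its core extraction logic locally. This sketch runs the same grep/awk pipeline the script uses against a known-good probe block written to /tmp (the file path is illustrative):

```shell
# Write a compliant readiness probe block, then extract and check its thresholds
# exactly the way validate-k8s-manifests.sh does.
cat > /tmp/good-probe.yaml <<'EOF'
readinessProbe:
  httpGet:
    path: /health
    port: 80
  timeoutSeconds: 10
  initialDelaySeconds: 10
EOF
TIMEOUT=$(grep -A5 'readinessProbe' /tmp/good-probe.yaml | grep 'timeoutSeconds' | awk '{print $2}')
DELAY=$(grep -A5 'readinessProbe' /tmp/good-probe.yaml | grep 'initialDelaySeconds' | awk '{print $2}')
[ "$TIMEOUT" -ge 5 ] && [ "$DELAY" -ge 5 ] && echo "OK: timeoutSeconds=${TIMEOUT}s, initialDelaySeconds=${DELAY}s"
```

If the values meet the 5s floor, the check prints an OK line; Exercise (e) below exercises the failure path with a deliberately bad manifest.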

(d) Update the Threat Model (~2 min)

→ Strengthens: WS1 (Trust Boundary – risk awareness)

Open THREAT-MODEL.md and fill in row 10 with the incident learnings:

| # | Attack Vector | Target Asset | Threat Actor | Current Control | Gap? |
|---|---------------|--------------|--------------|-----------------|------|
| 10 | Incident response is slow or absent | A4 – Deployed App | T1 – External | ✅ SRE Agent alert + Copilot auto-remediation + regression test | Manifest validation not org-wide |

The threat model is now more complete than when the series began: row 10 records concrete controls and an identified residual gap drawn from real incident learnings.


(e) Verify the Loop Closes (~3 min)

→ Proves: The system is now stronger than before the incident

  1. Attempt to push a new manifest with the SAME misconfiguration:
# test-bad-manifest.yaml – intentionally misconfigured
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-regression
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-regression
  template:
    metadata:
      labels:
        app: test-regression
    spec:
      containers:
        - name: app
          image: nginx:latest
          readinessProbe:
            httpGet:
              path: /health
              port: 80
            timeoutSeconds: 1        # ← Same bad value as before
            initialDelaySeconds: 1   # ← Same bad value as before
  2. Run the regression test against it:
./scripts/validate-k8s-manifests.sh

Expected output:

Validating: ./test-bad-manifest.yaml
  ❌ FAIL: Missing livenessProbe
  ❌ FAIL: readinessProbe.timeoutSeconds is 1s (minimum 5s)
  ❌ FAIL: readinessProbe.initialDelaySeconds is 1s (minimum 5s)

❌ Validation failed with 3 error(s)
  3. Observe Copilot's updated behavior by asking Copilot to fix the manifest:
@copilot Fix the readiness probe in test-bad-manifest.yaml

Copilot, now following your updated custom instructions, should suggest timeoutSeconds: 10 and initialDelaySeconds: 10, the corrected values.

  4. Clean up the test file:
rm test-bad-manifest.yaml

The system is NOW stronger than before the incident. 🎉


Persistent Learning – The Agent Gets Smarter

The SRE Agent itself learns from this incident. Unlike static runbooks, the agent's persistent memory means:

  • It remembers this incident pattern (probe misconfiguration → crash loop)
  • The next time a similar probe misconfiguration occurs, it is diagnosed in SECONDS, not minutes
  • The agent's institutional knowledge grows with every incident

You can explicitly TEACH the agent by adding this incident pattern to its knowledge base:

"Remember: readinessProbe.timeoutSeconds < 5 on containers with startup > 3s
causes crash loops. Fix: set timeoutSeconds >= 10 and initialDelaySeconds >= 10."

💡 This is the ultimate Agentic DevSecOps loop: the AI gets smarter, not just the rules.


(f) Schedule a Proactive Health Check (~1.5 min)

→ Strengthens: WS4 (Response – prevents recurrence proactively)

Create a scheduled intelligence task in SRE Agent that runs every hour:

  1. Open Azure Portal → SRE Agent → Scheduled Tasks
  2. Create a new scheduled task:
Name: "Readiness Probe Validation"
Schedule: Every 1 hour
Scope: All Kubernetes deployments in cluster
Check: readinessProbe.timeoutSeconds >= 5 for all containers
Action: Alert platform team if misconfigured probes found
Natural language: "Check all deployments hourly for readiness probes with
timeoutSeconds < 5. If found, create an alert for the platform team
BEFORE it becomes a production incident."
  3. Verify the scheduled task is active:
# Confirm scheduled tasks in SRE Agent (via Azure Portal or CLI)
az monitor scheduled-query list --resource-group YOUR_RG -o table

🔄 This transforms the feedback from REACTIVE (the regression test catches it at PR time) to PROACTIVE (SRE Agent catches it at RUNTIME, before it becomes the next incident).
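The check the scheduled task performs can be sketched locally. A hedged simulation on a probe-settings report file (no cluster access assumed; the deployment names and file path are illustrative):

```shell
# Simulated hourly audit: flag any deployment whose readiness probe timeout is
# below the 5s floor. In the real task, these values come from the cluster.
cat > /tmp/probe-report.txt <<'EOF'
payments-api timeoutSeconds=1
web-frontend timeoutSeconds=10
batch-worker timeoutSeconds=3
EOF
ALERTS=0
while read -r name kv; do
  t=${kv#timeoutSeconds=}
  if [ "$t" -lt 5 ]; then
    echo "ALERT: $name readinessProbe timeoutSeconds=${t}s (below the 5s floor)"
    ALERTS=$((ALERTS + 1))
  fi
done < /tmp/probe-report.txt
echo "$ALERTS deployment(s) flagged"
```

Here two of the three illustrative deployments are flagged; the scheduled task would raise these as alerts for the platform team.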


Grand Finale: The Closed Loop

Every feedback action you just performed strengthens a specific layer from a previous workshop:

Feedback Action                 → Strengthens Which Layer?
──────────────────────────────────────────────────────────────
Updated ruleset                 → WS2 🔒 (Guardrails – Policy)
New Copilot instruction         → WS2 🔒 (Guardrails – AI remediation)
New regression test             → WS2 🔒 + WS3 🔗 (Detection + Pipeline)
Updated threat model            → WS1 🛡️ (Trust Boundary – risk awareness)
Monitoring alert refined        → WS4 🔄 (Response – faster next time)
Proactive scheduled check (NEW) → WS4 🔄 (Response – PREVENTS recurrence)
SRE Agent learning (NEW)        → WS4 🔄 (Response – AI gets smarter)

“DevSecOps is not a set of tools. It is a closed-loop operating model.”


NIST SSDF Callout

NIST SP 800-218 RV.1 requires identifying and confirming vulnerabilities on an ongoing basis. RV.3 requires root cause analysis and implementing corrective actions. Our feedback loop, from incident detection to ruleset update, regression test, and threat model extension, is this requirement in practice.

💡 Key Insight: “DevSecOps is a closed loop. Every incident makes your guardrails, policies, AI instructions, and threat models stronger. The system that let this issue through today will catch it tomorrow.”

💡 Run scripts/verify-exercise3.sh to validate your Exercise 3 completion and celebrate closing the DevSecOps loop! 🎉


🔄 Series Conclusion

“We started by defining WHERE development trust lives. We built guardrails to prevent bad code. We secured the pipeline and gained visibility. And now we’ve closed the loop: every incident makes the entire system stronger.”

This is Agentic DevSecOps.

  WS1 🛡️ Trust Boundary & Platform Trust
  WS2 🔒 Secure by Design Guardrails
  WS3 🔗 Supply Chain Integrity & Code-to-Cloud Visibility
  WS4 🔄 Operational Response & Continuous Improvement ← COMPLETE