Step 3

Continuous Improvement Loop (Grand Finale)


Exercise 3: Continuous Improvement Loop

Goal: Prove the system gets stronger from every incident, and close the DevSecOps loop back to Workshops 1–3.

📝 Open docs/feedback-actions-checklist.md to track completion of each feedback action and which workshop layer it strengthens.

Each action below is a concrete, hands-on task: not a description of something you could do, but something you will do right now.


(a) Update a Ruleset (~2 min)

→ Strengthens: WS2 (Guardrails – Policy layer)

Add a required check that would have caught this misconfiguration before deployment:

  1. Navigate to Organization Settings → Rulesets (or Repository Settings → Rulesets)
  2. Edit your existing ruleset (or create a new one)
  3. Add a new Required Status Check: kubernetes-manifest-validation
# Verify the ruleset is active
gh api /repos/YOUR_ORG/YOUR_REPO/rulesets --jq '.[] | {name: .name, enforcement: .enforcement}'

This check will block any PR that includes a Kubernetes manifest failing validation, including misconfigured readiness probes.
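The same required status check can also be scripted. A hedged sketch against GitHub's repository rulesets REST API (the org/repo names and ruleset ID are placeholders; only the payload is built and checked locally here):

```shell
# Build the required_status_checks rule payload (placeholder values throughout).
cat > /tmp/ruleset-rule.json <<'EOF'
{
  "rules": [
    {
      "type": "required_status_checks",
      "parameters": {
        "strict_required_status_checks_policy": true,
        "required_status_checks": [
          { "context": "kubernetes-manifest-validation" }
        ]
      }
    }
  ]
}
EOF
# Apply to an existing ruleset (requires repo admin rights; not run here):
#   gh api -X PUT /repos/YOUR_ORG/YOUR_REPO/rulesets/RULESET_ID --input /tmp/ruleset-rule.json
grep -q '"context": "kubernetes-manifest-validation"' /tmp/ruleset-rule.json && echo "payload ready"
```

The check name in the payload must match the CI job name exactly, or the check will never report.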


(b) Add a Copilot Custom Instruction (~2 min)

→ Strengthens: WS2 (Guardrails – AI remediation)

Update the .github/copilot-instructions.md (originally created in WS1) with lessons from this incident:

  1. Open .github/copilot-instructions.md in your editor
  2. Add the following section:
## Lessons from Incident – Runtime Health Check Failure

- When configuring Kubernetes readiness probes, always set `timeoutSeconds` >= 5s
  and `initialDelaySeconds` >= 10s to allow the application time to start up.
- Always include BOTH readiness and liveness probes for container deployments.
- Never set readiness probe timeout lower than the application's known startup time.
- Liveness probe should use a longer interval than readiness probe to avoid
  premature restarts during transient slowdowns.
  3. Verify: Start a new Copilot Chat session → ask it to write a Kubernetes deployment → observe that the response includes the corrected probe guidance.
@copilot Write a Kubernetes Deployment manifest for a Node.js web application
with health checks configured.
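Before the chat-based check, it can help to confirm the new section actually landed on disk. A minimal sketch using a temporary copy (your real path is .github/copilot-instructions.md):

```shell
# Sanity check on a temporary stand-in for .github/copilot-instructions.md:
# confirm the incident-lessons section is present before testing in Copilot Chat.
FILE=/tmp/copilot-instructions.md
cat > "$FILE" <<'EOF'
## Lessons from Incident - Runtime Health Check Failure
- When configuring Kubernetes readiness probes, always set `timeoutSeconds` >= 5s
EOF
if grep -q '## Lessons from Incident' "$FILE"; then
  echo "lesson section present"
else
  echo "lesson section missing" >&2
  exit 1
fi
```

Run the same grep against your actual instructions file in the repo.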

(c) Create a Regression Test (~3 min)

→ Strengthens: WS2 (Guardrails – Detection layer) + WS3 (Supply Chain – pipeline integrity)

Write a manifest validation script that prevents this class of issue from recurring:

  1. Create scripts/validate-k8s-manifests.sh:
#!/bin/bash
# Kubernetes Manifest Validation – Regression Test
# Ensures readiness/liveness probes meet minimum thresholds

set -euo pipefail

ERRORS=0

for file in $(find . -name '*.yaml' -o -name '*.yml' | xargs grep -l 'kind: Deployment' 2>/dev/null); do
  echo "Validating: $file"

  # Check readiness probe exists
  if ! grep -q 'readinessProbe' "$file"; then
    echo "  ❌ FAIL: Missing readinessProbe"
    ERRORS=$((ERRORS + 1))
  fi

  # Check liveness probe exists
  if ! grep -q 'livenessProbe' "$file"; then
    echo "  ❌ FAIL: Missing livenessProbe"
    ERRORS=$((ERRORS + 1))
  fi

  # Check readiness probe timeout >= 5s
  # ('|| true' keeps set -e/pipefail from aborting when the field is absent)
  TIMEOUT=$(grep -A5 'readinessProbe' "$file" | grep 'timeoutSeconds' | awk '{print $2}' | head -n1 || true)
  if [ -n "$TIMEOUT" ] && [ "$TIMEOUT" -lt 5 ]; then
    echo "  ❌ FAIL: readinessProbe.timeoutSeconds is ${TIMEOUT}s (minimum 5s)"
    ERRORS=$((ERRORS + 1))
  fi

  # Check initialDelaySeconds >= 5s
  DELAY=$(grep -A5 'readinessProbe' "$file" | grep 'initialDelaySeconds' | awk '{print $2}' | head -n1 || true)
  if [ -n "$DELAY" ] && [ "$DELAY" -lt 5 ]; then
    echo "  ❌ FAIL: readinessProbe.initialDelaySeconds is ${DELAY}s (minimum 5s)"
    ERRORS=$((ERRORS + 1))
  fi
done

if [ "$ERRORS" -gt 0 ]; then
  echo ""
  echo "❌ Validation failed with $ERRORS error(s)"
  exit 1
else
  echo ""
  echo "✅ All Kubernetes manifests passed validation"
fi
  2. Make it executable and add to CI:
chmod +x scripts/validate-k8s-manifests.sh
  3. Add this job under `jobs:` in your CI/CD workflow (.github/workflows/ci.yml):
  kubernetes-manifest-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Kubernetes manifests
        run: ./scripts/validate-k8s-manifests.sh
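Before wiring the script into CI, you can smoke-test its core extraction logic locally. This sketch runs the same grep/awk pipeline the script uses against a known-good probe block written to /tmp (the file path is illustrative):

```shell
# Write a compliant readiness probe block, then extract and check its thresholds
# exactly the way validate-k8s-manifests.sh does.
cat > /tmp/good-probe.yaml <<'EOF'
readinessProbe:
  httpGet:
    path: /health
    port: 80
  timeoutSeconds: 10
  initialDelaySeconds: 10
EOF
TIMEOUT=$(grep -A5 'readinessProbe' /tmp/good-probe.yaml | grep 'timeoutSeconds' | awk '{print $2}')
DELAY=$(grep -A5 'readinessProbe' /tmp/good-probe.yaml | grep 'initialDelaySeconds' | awk '{print $2}')
[ "$TIMEOUT" -ge 5 ] && [ "$DELAY" -ge 5 ] && echo "OK: timeoutSeconds=${TIMEOUT}s, initialDelaySeconds=${DELAY}s"
```

If the values meet the 5s floor, the check prints an OK line; Exercise (e) below exercises the failure path with a deliberately bad manifest.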

(d) Update the Threat Model (~2 min)

→ Strengthens: WS1 (Trust Boundary – risk awareness)

Open THREAT-MODEL.md and fill in row 10 with the incident learnings:

| # | Attack Vector | Target Asset | Threat Actor | Current Control | Gap? |
|---|---------------|--------------|--------------|-----------------|------|
| 10 | Incident response is slow or absent | A4 – Deployed App | T1 – External | ✅ SRE Agent alert + Copilot auto-remediation + regression test | Manifest validation not org-wide |

The threat model is now more complete than when the series began: row 10 records concrete controls and an identified residual gap drawn from real incident learnings.


(e) Verify the Loop Closes (~3 min)

→ Proves: The system is now stronger than before the incident

  1. Attempt to push a new manifest with the SAME misconfiguration:
# test-bad-manifest.yaml – intentionally misconfigured
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-regression
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-regression
  template:
    metadata:
      labels:
        app: test-regression
    spec:
      containers:
        - name: app
          image: nginx:latest
          readinessProbe:
            httpGet:
              path: /health
              port: 80
            timeoutSeconds: 1        # ← Same bad value as before
            initialDelaySeconds: 1   # ← Same bad value as before
  2. Run the regression test against it:
./scripts/validate-k8s-manifests.sh

Expected output:

Validating: ./test-bad-manifest.yaml
  ❌ FAIL: Missing livenessProbe
  ❌ FAIL: readinessProbe.timeoutSeconds is 1s (minimum 5s)
  ❌ FAIL: readinessProbe.initialDelaySeconds is 1s (minimum 5s)

❌ Validation failed with 3 error(s)
  3. Observe Copilot's updated behavior by asking Copilot to fix the manifest:
@copilot Fix the readiness probe in test-bad-manifest.yaml

Copilot, now following your updated custom instructions, should suggest timeoutSeconds: 10 and initialDelaySeconds: 10, the corrected values.

  4. Clean up the test file:
rm test-bad-manifest.yaml

The system is NOW stronger than before the incident. 🎉


Persistent Learning – The Agent Gets Smarter

The SRE Agent itself learns from this incident. Unlike static runbooks, the agent's persistent memory means:

  • It remembers this incident pattern (probe misconfiguration → crash loop)
  • The next time a similar probe misconfiguration occurs, it is diagnosed in SECONDS, not minutes
  • The agent's institutional knowledge grows with every incident

You can explicitly TEACH the agent by adding this incident pattern to its knowledge base:

"Remember: readinessProbe.timeoutSeconds < 5 on containers with startup > 3s
causes crash loops. Fix: set timeoutSeconds >= 10 and initialDelaySeconds >= 10."

💡 This is the ultimate Agentic DevSecOps loop: the AI gets smarter, not just the rules.


(f) Schedule a Proactive Health Check (~1.5 min)

→ Strengthens: WS4 (Response – prevents recurrence proactively)

Create a scheduled intelligence task in SRE Agent that runs every hour:

  1. Open Azure Portal → SRE Agent → Scheduled Tasks
  2. Create a new scheduled task:
Name: "Readiness Probe Validation"
Schedule: Every 1 hour
Scope: All Kubernetes deployments in cluster
Check: readinessProbe.timeoutSeconds >= 5 for all containers
Action: Alert platform team if misconfigured probes found
Natural language: "Check all deployments hourly for readiness probes with
timeoutSeconds < 5. If found, create an alert for the platform team
BEFORE it becomes a production incident."
  3. Verify the scheduled task is active:
# Confirm scheduled tasks in SRE Agent (via Azure Portal or CLI)
az monitor scheduled-query list --resource-group YOUR_RG -o table

🔄 This transforms the feedback from REACTIVE (the regression test catches it at PR time) to PROACTIVE (SRE Agent catches it at RUNTIME, before it becomes the next incident).
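The check the scheduled task performs can be sketched locally. A hedged simulation on a probe-settings report file (no cluster access assumed; the deployment names and file path are illustrative):

```shell
# Simulated hourly audit: flag any deployment whose readiness probe timeout is
# below the 5s floor. In the real task, these values come from the cluster.
cat > /tmp/probe-report.txt <<'EOF'
payments-api timeoutSeconds=1
web-frontend timeoutSeconds=10
batch-worker timeoutSeconds=3
EOF
ALERTS=0
while read -r name kv; do
  t=${kv#timeoutSeconds=}
  if [ "$t" -lt 5 ]; then
    echo "ALERT: $name readinessProbe timeoutSeconds=${t}s (below the 5s floor)"
    ALERTS=$((ALERTS + 1))
  fi
done < /tmp/probe-report.txt
echo "$ALERTS deployment(s) flagged"
```

Here two of the three illustrative deployments are flagged; the scheduled task would raise these as alerts for the platform team.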


Grand Finale: The Closed Loop

Every feedback action you just performed strengthens a specific layer from a previous workshop:

Feedback Action                 → Strengthens Which Layer?
──────────────────────────────────────────────────────────────
Updated ruleset                 → WS2 🔒 (Guardrails – Policy)
New Copilot instruction         → WS2 🔒 (Guardrails – AI remediation)
New regression test             → WS2 🔒 + WS3 🔗 (Detection + Pipeline)
Updated threat model            → WS1 🛡️ (Trust Boundary – risk awareness)
Monitoring alert refined        → WS4 🔄 (Response – faster next time)
Proactive scheduled check (NEW) → WS4 🔄 (Response – PREVENTS recurrence)
SRE Agent learning (NEW)        → WS4 🔄 (Response – AI gets smarter)

“DevSecOps is not a set of tools. It is a closed-loop operating model.”


NIST SSDF Callout

NIST SP 800-218 RV.1 requires identifying and confirming vulnerabilities on an ongoing basis. RV.3 requires root cause analysis and implementing corrective actions. Our feedback loop, from incident detection to ruleset update, regression test, and threat model extension, is this requirement in practice.

💡 Key Insight: “DevSecOps is a closed loop. Every incident makes your guardrails, policies, AI instructions, and threat models stronger. The system that let this issue through today will catch it tomorrow.”

💡 Run scripts/verify-exercise3.sh to validate your Exercise 3 completion and celebrate closing the DevSecOps loop! 🎉


🔄 Series Conclusion

“We started by defining WHERE development trust lives. We built guardrails to prevent bad code. We secured the pipeline and gained visibility. And now we’ve closed the loop: every incident makes the entire system stronger.”

This is Agentic DevSecOps.

  WS1 🛡️ Trust Boundary & Platform Trust
  WS2 🔒 Secure by Design Guardrails
  WS3 🔗 Supply Chain Integrity & Code-to-Cloud Visibility
  WS4 🔄 Operational Response & Continuous Improvement ← COMPLETE