incident-runbook-templates
Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.
Documentation
Incident Runbook Templates
Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.
Do not use this skill when
- The task is unrelated to incident runbook templates
- You need a different domain or tool outside this scope
Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open
resources/implementation-playbook.md.
Use this skill when
- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers
Core Concepts
1. Incident Severity Levels
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
2. Runbook Structure
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
Runbook Templates
Template 1: Service Outage Runbook
# [Service Name] Outage Runbook
## Overview
**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall
## Impact Assessment
- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?
## Detection
### Alerts
- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)
### Dashboards
- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)
## Initial Triage (First 5 Minutes)
### 1. Assess Scope
```bash
# Check service health
kubectl get pods -n payments -l app=payment-service
# Check recent deployments
kubectl rollout history deployment/payment-service -n payments
# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
2. Quick Health Checks
- Can you reach the service?
curl -I https://api.company.com/payments/health - Database connectivity? Check connection pool metrics
- External dependencies? Check Stripe, bank API status
- Recent changes? Check deploy history
3. Initial Classification
| Symptom | Likely Cause | Go To Section |
|---|---|---|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |
Mitigation Procedures
4.1 Service Completely Down
# Step 1: Check pod status
kubectl get pods -n payments
# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100
# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments
# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments
# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10
# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
4.2 High Latency
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
curl localhost:8080/metrics | grep db_pool
# Step 2: Check slow queries (if DB issue)
psql -h $DB_HOST -U $DB_USER -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND duration > interval '5 seconds'
ORDER BY duration DESC;"
# Step 3: Kill long-running queries if needed
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(pid);"
# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health
# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
4.3 Partial Failures (Specific Errors)
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
grep -i error | sort | uniq -c | sort -rn | head -20
# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments
# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
-d '{"flag": "DISABLE_PROB
Quick Info
- Source
- antigravity
- Category
- Document Processing
- Repository
- View Repo
- Scraped At
- Jan 29, 2026
Tags
Related Skills
ab-test-setup
Structured guide for setting up A/B tests with mandatory gates for hypothesis, metrics, and execution readiness.
airflow-dag-patterns
Build production Apache Airflow DAGs with best practices for operators, sensors, testing, and deployment. Use when creating data pipelines, orchestrating workflows, or scheduling batch jobs.
algorithmic-art
Creating algorithmic art using p5.js with seeded randomness and interactive parameter exploration. Use this when users request creating art using code, generative art, algorithmic art, flow fields, or particle systems. Create original algorithmic art rather than copying existing artists' work to avoid copyright violations.