Runbook: [Incident Name]
Runbook: [Incident Name]
Overview
Severity: [Critical/High/Medium/Low] Component: [Service/Feature name] Last Updated: [YYYY-MM-DD]
[Brief description of what this incident is and its impact on users or services.]
Detection & Alerts
Key Metrics to Monitor
- Metric name: [threshold] - What indicates this issue is happening
- Metric name: [threshold] - What to look for
Alert Triggers
- Alert name with condition
- What the alert tells you about the problem
Diagnosis
1. Confirm the Issue
- Check [monitoring dashboard/logs] for [specific metric/error pattern]
- Verify the issue affects [all users/specific regions/feature]
- Note the start time: _________________
- Check recent deployments or changes that might be related
2. Gather Information
Run these commands/queries:
[Command or log query example]
Look for:
- Error messages or stack traces
- Unusual latency spikes
- Resource exhaustion (CPU, memory, disk)
- Failed external service calls
3. Check Common Causes
- Service health: Check if [specific service] is running and responsive
- Database connectivity: Run
[command to verify DB connection] - External dependencies: Verify API status for [external services]
- Resource limits: Check [CPU/memory/disk] usage with
[command] - Recent code changes: Review commits in last [X hours]
4. Review Logs
Key log locations:
- Application logs:
[path] - System logs:
[path] - Error logs:
[path]
Search for:
[Example log patterns to search for]
Resolution Steps
Quick Fixes (Try These First)
-
Restart the service
[Command to restart]- Monitor [metric] for recovery
- Expected recovery time: [X minutes]
-
Clear cache/refresh state
[Command to clear cache]- Verify with:
[verification command]
- Verify with:
-
Scale up resources
- Increase [resource type] from [current] to [recommended]
- Command:
[scaling command] - Monitor: Watch [metric] for improvement
If Quick Fixes Don't Work
Step 1: Isolate the Issue
- Is it affecting [all/specific] users?
- Is it [all features/specific feature]?
- Reproduce locally with: [steps to reproduce]
Step 2: Deep Dive Debugging
- Enable debug logging:
[command] - Attach debugger to [component]:
[command] - Check [specific subsystem] with:
[diagnostic command]
Step 3: Implement Workaround (if needed)
- [Describe temporary workaround]
- Caution: This is temporary; file ticket for permanent fix
- Monitor for [specific issues] while workaround is active
Step 4: Deploy Fix
- Create hotfix branch:
git checkout -b hotfix/[issue-name] - Apply fix: [Brief description of changes]
- Test locally:
[test command] - Deploy:
[deployment command] - Monitor: Watch [key metrics] for [X minutes]
Rollback Plan
If the fix makes things worse:
- Identify rollback commit:
git log --oneline | head -20 - Deploy previous version:
[rollback command] - Verify recovery: Monitor [metric] for normalization
- Timeline: Should recover within [X minutes]
Verification
After applying the fix, verify recovery with:
- Metric [X] returned to normal:
[verification query] - No error rate spike:
[verification query] - End users can [key action]: Manual testing
- Alert [X] is no longer firing
Prevention & Post-Incident
Root Cause
[What actually caused this issue?]
Prevention Measures
- Add monitoring for [new metric]
- Add test case for [scenario that should catch this]
- Improve [documentation/code/architecture]
- Alert on [new threshold] to catch earlier
Tickets to File
- [Ticket description]: Priority [High/Medium]
- [Ticket description]: Priority [High/Medium]
Related Incidents
- [Link to previous similar incidents]
Escalation
On-call escalation path:
- Page [Team/Person] if service not recovered after [X minutes]
- Page [Escalation contact] if still unresolved after [X more minutes]
Contact:
- Primary on-call: [Slack/Phone]
- Backup: [Slack/Phone]
- Team lead: [Slack/Phone]
Additional Resources
- [Link to monitoring dashboard]
- [Link to relevant documentation]
- [Link to architecture diagram]
- [Link to deployment runbook]
Appendix: Common Issues & Solutions
| Issue | Symptoms | Quick Fix |
|---|---|---|
| [Issue name] | [What to look for] | [How to fix] |
| [Issue name] | [What to look for] | [How to fix] |
Last Updated By: [Name] Review Frequency: Quarterly or after each incident