Incident Response Runbook
Severity Levels
- P0 - Critical: Service completely down, data loss
- P1 - High: Major functionality broken, affecting many users
- P2 - Medium: Minor functionality broken, workaround available
- P3 - Low: Cosmetic issues, no user impact
Response Process
1. Acknowledge Incident
- Identify severity level
- Notify team via Slack/email
- Create incident ticket
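The Slack notification in this step can be scripted. The following is a minimal sketch, assuming the team uses a Slack incoming webhook whose URL is stored in a `SLACK_WEBHOOK_URL` environment variable. The variable name and the `notify_slack` function are hypothetical and not part of this runbook's tooling.

```shell
#!/usr/bin/env bash
# Sketch: post an incident notice to Slack through an incoming webhook.
# Assumption: SLACK_WEBHOOK_URL holds the webhook URL for the incident channel.

notify_slack() {
  local severity="$1" summary="$2"
  local escaped
  # Escape backslashes and double quotes so the summary is safe inside JSON.
  # This handles single-line summaries only.
  escaped=$(printf '%s' "${summary}" | sed 's/\\/\\\\/g; s/"/\\"/g')
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "{\"text\": \"[${severity}] ${escaped}\"}" \
    "${SLACK_WEBHOOK_URL}"
}
```

For example, `notify_slack P1 "Checkout API returning 500s"` would post a one-line message prefixed with the severity level.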
2. Investigate
- Check service health endpoints
- Review logs:
./scripts/dev/logs.sh <service>
- Check monitoring dashboards (Grafana)
- Review recent deployments
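The health-endpoint check in this step can be sketched as a small helper. It assumes each service exposes an HTTP health endpoint. The URL in the usage note and the `check_health` name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: probe a service health endpoint and report the result.
# Assumption: the endpoint returns a 2xx/3xx status when the service is healthy.

check_health() {
  local url="$1"
  # -f: treat HTTP errors (>= 400) as failures
  # -sS: no progress output, but still print errors
  # --max-time 5: give up after five seconds
  if curl -fsS --max-time 5 -o /dev/null "${url}"; then
    echo "healthy: ${url}"
  else
    echo "UNHEALTHY: ${url}"
    return 1
  fi
}
```

A call such as `check_health "http://localhost:8080/health"` (placeholder URL) prints one line and returns non-zero on failure, so it can be chained with other checks.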
3. Mitigate
- Apply quick fixes if available
- Roll back if a recent deployment caused the issue
- Scale up if the service is resource-constrained
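For services running as Kubernetes deployments, the scale-up and deployment-review steps might look like the sketch below. The function names are hypothetical, and the right replica count depends on the service and cluster capacity.

```shell
#!/usr/bin/env bash
# Sketch: mitigation helpers that wrap kubectl.
# The caller supplies the service, namespace, and replica count.

# Increase the number of replicas for a deployment.
scale_up() {
  local service="$1" namespace="$2" replicas="$3"
  kubectl scale "deployment/${service}" --replicas="${replicas}" -n "${namespace}"
}

# List recorded revisions to judge whether a recent deployment is the likely cause.
show_rollout_history() {
  local service="$1" namespace="$2"
  kubectl rollout history "deployment/${service}" -n "${namespace}"
}
```

Scaling manually may be overridden if an autoscaler manages the deployment, so treat it as a temporary measure.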
4. Resolve
- Implement permanent fix
- Verify resolution
- Update documentation
5. Post-Mortem
- Document incident
- Identify root cause
- Create action items
- Update runbooks
Common Scenarios
Service Down
- Check Kubernetes pods:
kubectl get pods -n <namespace>
- Check pod logs:
kubectl logs <pod-name> -n <namespace>
- Restart service:
kubectl rollout restart deployment/<service> -n <namespace>
- If the problem persists, roll back:
kubectl rollout undo deployment/<service> -n <namespace>
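The restart and rollback steps can be combined into helpers that wait for the rollout to finish before deciding what to do next. This is a sketch under the assumption that the service is a Kubernetes deployment. The function names and the default timeout are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: restart a deployment, wait for it to become ready, and roll back on request.

restart_and_wait() {
  local service="$1" namespace="$2" timeout="${3:-120s}"

  kubectl rollout restart "deployment/${service}" -n "${namespace}"

  # rollout status blocks until the rollout completes or the timeout expires.
  if kubectl rollout status "deployment/${service}" -n "${namespace}" --timeout="${timeout}"; then
    echo "rollout healthy"
  else
    echo "rollout not ready within ${timeout}; review history before rolling back" >&2
    kubectl rollout history "deployment/${service}" -n "${namespace}"
    return 1
  fi
}

# Roll back to the previous revision, or to a specific one if given.
rollback() {
  local service="$1" namespace="$2" revision="${3:-}"
  if [ -n "${revision}" ]; then
    kubectl rollout undo "deployment/${service}" -n "${namespace}" --to-revision="${revision}"
  else
    kubectl rollout undo "deployment/${service}" -n "${namespace}"
  fi
}
```

One caveat worth checking in your cluster: a restart changes the pod template and so is typically recorded as a new revision, which means a plain `rollout undo` run right after a restart may only revert the restart itself. Passing an explicit revision from `kubectl rollout history` avoids that ambiguity.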
Database Issues
- Check database connectivity
- Review slow queries
- Check connection pool
- Scale database if needed
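The runbook does not name a database engine, so the connectivity check below is a generic TCP probe rather than an engine-specific command. It is a sketch that assumes `nc` (netcat) is installed. The host and port are supplied by the caller, and `db_reachable` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: check whether the database port accepts TCP connections.
# Assumption: a netcat variant that supports -z and -w is available.

db_reachable() {
  local host="$1" port="$2"
  # -z: connect without sending data, -w 3: time out after three seconds
  if nc -z -w 3 "${host}" "${port}"; then
    echo "reachable: ${host}:${port}"
  else
    echo "UNREACHABLE: ${host}:${port}"
    return 1
  fi
}
```

A successful probe only shows that the port is open. It does not confirm authentication, query latency, or connection pool health, which still need the engine's own tools.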
High Error Rate
- Check error logs
- Review recent changes
- Check external dependencies
- Implement circuit breaker if needed
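To put a number on the error rate while checking logs, a small filter can count error lines. This sketch assumes log lines carry the literal level `ERROR`. Adjust the pattern to match the actual log format. The `error_rate` name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: read log lines on stdin and print the percentage that contain "ERROR".

error_rate() {
  awk '
    /ERROR/ { errors++ }
    END {
      # Avoid dividing by zero when there is no input.
      if (NR == 0) { print "0.0"; exit }
      printf "%.1f\n", (errors / NR) * 100
    }
  '
}
```

Assuming the log script writes to standard output, it could be used as `./scripts/dev/logs.sh <service> | error_rate`. Comparing the figure before and after a recent change helps decide whether a rollback is warranted.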