pos-system/microservices/docs/en/runbooks/incident-response.md
Ho Ngoc Hai 76d75c753b Migrate
2026-05-23 18:37:02 +07:00

Incident Response Runbook

Severity Levels

  • P0 - Critical: Service completely down, or data loss
  • P1 - High: Major functionality broken, affecting many users
  • P2 - Medium: Minor functionality broken, workaround available
  • P3 - Low: Cosmetic issues, no functional impact on users

Response Process

1. Acknowledge Incident

  • Identify severity level
  • Notify team via Slack/email
  • Create incident ticket

2. Investigate

  • Check service health endpoints
  • Review logs: ./scripts/dev/logs.sh <service>
  • Check monitoring dashboards (Grafana)
  • Review recent deployments
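The first check above can be sketched as a small shell helper. This is a minimal sketch, not the project's tooling: the URL and the `/health` path are assumptions, and `classify_status` and `check_health` are hypothetical names introduced here for illustration.

```shell
# Sketch: probe a service health endpoint and classify the result.
# The /health path and the example URL are assumptions, not taken from this runbook.

# Map an HTTP status code to a coarse health verdict.
classify_status() {
  case "$1" in
    2??)    echo "healthy" ;;
    ""|000) echo "unreachable" ;;  # curl reports 000 when no HTTP response arrived
    *)      echo "unhealthy" ;;
  esac
}

# Fetch only the status code, with a short timeout so a hung service
# does not block triage.
check_health() {
  url="$1"
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")" || true
  classify_status "$code"
}

# Example (hypothetical URL):
#   check_health "http://localhost:8080/health"
```

For the last check, `kubectl rollout history deployment/<service> -n <namespace>` lists the recorded revisions of a deployment, which helps correlate the incident with a recent release.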

3. Mitigate

  • Apply quick fixes if available
  • Roll back if a recent deployment caused the issue
  • Scale up if the service is resource-constrained
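The rollback and scale-up actions above might look like the following sketch. It defaults to a dry run that only prints the command, which is a design choice made here rather than something this runbook prescribes. The helper names (`run`, `rollback`, `scale_to`) and the example service and namespace are hypothetical.

```shell
# Sketch: mitigation helpers with a dry-run default.
# Set DRY_RUN=0 to actually execute the kubectl commands.

# Print the command when DRY_RUN=1 (the default here), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Revert a deployment to its previous revision.
# Arguments: <service> <namespace>
rollback() {
  run kubectl rollout undo "deployment/$1" -n "$2"
}

# Set the replica count of a deployment.
# Arguments: <service> <namespace> <replicas>
scale_to() {
  run kubectl scale "deployment/$1" --replicas="$3" -n "$2"
}

# Example (hypothetical service "orders" in namespace "pos"):
#   rollback orders pos
#   DRY_RUN=0 scale_to orders pos 4
```

Printing the command first gives the responder a chance to confirm the target deployment and namespace before changing anything in a live cluster.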

4. Resolve

  • Implement permanent fix
  • Verify resolution
  • Update documentation

5. Post-Mortem

  • Document incident
  • Identify root cause
  • Create action items
  • Update runbooks

Common Scenarios

Service Down

  1. Check Kubernetes pods: kubectl get pods -n <namespace>
  2. Check pod logs: kubectl logs <pod-name> -n <namespace>
  3. Restart service: kubectl rollout restart deployment/<service> -n <namespace>
  4. If the problem persists, roll back: kubectl rollout undo deployment/<service> -n <namespace>
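To speed up step 1, the pod list can be filtered down to pods that need attention. The sketch below assumes the default `kubectl get pods` column layout (NAME, READY, STATUS, ...); `unhealthy_pods` is a hypothetical helper name.

```shell
# Sketch: keep only pods that are not fully ready.
# Reads `kubectl get pods --no-headers` output on stdin and prints
# "<name> <status> <ready>" for every pod that is neither Completed
# nor Running with all containers ready.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 != "Completed" && ($3 != "Running" || ready[1] != ready[2]))
      print $1, $3, $2
  }'
}

# Usage (placeholders as in the steps above):
#   kubectl get pods -n <namespace> --no-headers | unhealthy_pods
#   kubectl logs <pod-name> -n <namespace> --previous   # logs of the last crashed container
#   kubectl rollout history deployment/<service> -n <namespace>
#   kubectl rollout undo deployment/<service> -n <namespace> --to-revision=<n>
```

One caveat worth checking before step 4: a restart may itself be recorded as a new revision, in which case a plain `kubectl rollout undo` returns to the pre-restart template rather than to the last known-good release. Inspecting `kubectl rollout history` and passing `--to-revision` makes the target explicit.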

Database Issues

  1. Check database connectivity
  2. Review slow queries
  3. Check connection pool
  4. Scale database if needed
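Steps 1 and 3 can be approximated without engine-specific tools. This runbook does not name the database engine, so the sketch below stays generic: the host and port, the 80% threshold, and the helper names `db_reachable` and `pool_usage` are all assumptions. `db_reachable` relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command.

```shell
# Sketch only: engine-agnostic database checks.
# Host, port, and the 80% saturation threshold are assumptions.

# TCP-level reachability: succeeds if <host> <port> accepts a connection
# within 3 seconds. This does not prove the database can serve queries.
db_reachable() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Report connection pool utilization as a percentage and flag saturation.
# Arguments: <active-connections> <max-connections>
pool_usage() {
  active="$1"
  max="$2"
  if [ "$max" -le 0 ]; then
    echo "invalid pool size: $max" >&2
    return 1
  fi
  pct=$(( active * 100 / max ))
  if [ "$pct" -ge 80 ]; then
    echo "pool ${pct}% used: near saturation"
  else
    echo "pool ${pct}% used: ok"
  fi
}

# Example (hypothetical host and numbers):
#   db_reachable db.internal 5432 && echo "port open" || echo "port closed"
#   pool_usage 45 50
```

The active and maximum connection counts have to come from the pool or the database itself; where to read them depends on the engine and client library in use.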

High Error Rate

  1. Check error logs
  2. Review recent changes
  3. Check external dependencies
  4. Add a circuit breaker around failing dependencies if needed
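For step 1, a rough error rate can be computed directly from log output. The sketch assumes each log line carries a literal `ERROR` level token, which may not match the services' actual log format; `error_rate` is a hypothetical helper name.

```shell
# Sketch: percentage of log lines that contain the token ERROR.
# Reads log lines on stdin and prints the rate with one decimal place.
# The ERROR token is an assumption about the log format.
error_rate() {
  awk '/ERROR/ { errors++ }
       END {
         if (NR == 0) print "0.0"
         else printf "%.1f\n", errors * 100 / NR
       }'
}

# Usage (placeholders as elsewhere in this runbook):
#   kubectl logs deployment/<service> -n <namespace> --since=10m | error_rate
```

Comparing the rate over a window before and after the most recent deployment gives a quick signal for step 2.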