Ho Ngoc Hai 76d75c753b Migrate

2026-05-23 18:37:02 +07:00

Incident Response Runbook

Severity Levels

P0 - Critical: Service completely down or data loss
P1 - High: Major functionality broken, affecting many users
P2 - Medium: Minor functionality broken, workaround available
P3 - Low: Cosmetic issues, no user impact

Response Process

1. Acknowledge Incident

Identify severity level
Notify team via Slack/email
Create incident ticket
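
The notification step could be scripted roughly as follows. This is a minimal sketch that assumes the team uses a Slack incoming webhook; the `SLACK_WEBHOOK_URL` variable and the function names `format_incident_message` and `notify_slack` are hypothetical and not part of this repository.

```shell
# Build a one-line announcement; reject anything other than P0..P3.
format_incident_message() {
    severity=$1 service=$2 summary=$3
    case "$severity" in
        P0|P1|P2|P3) ;;
        *) echo "unknown severity: $severity" >&2; return 1 ;;
    esac
    printf '[%s] %s: %s' "$severity" "$service" "$summary"
}

# Post the announcement to the webhook in SLACK_WEBHOOK_URL (assumed to be set).
# The text is embedded in JSON as-is, so keep it free of double quotes and
# backslashes, or build the payload with a JSON-aware tool instead.
notify_slack() {
    message=$(format_incident_message "$@") || return 1
    curl --fail --silent --show-error \
        -X POST -H 'Content-Type: application/json' \
        --data "{\"text\": \"$message\"}" \
        "$SLACK_WEBHOOK_URL"
}

# Usage: notify_slack P1 checkout "payments failing for card users"
```

Validating the severity before posting keeps announcements consistent with the four levels defined in this runbook.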

2. Investigate

Check service health endpoints
Review logs: ./scripts/dev/logs.sh <service>
Check monitoring dashboards (Grafana)
Review recent deployments
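
As an illustration of the first and last checks, the sketch below probes health endpoints with `curl` and lists recorded rollout revisions with `kubectl rollout history`. The URL layout `<base-url>/<service>/health` and the function names are assumptions; substitute whatever health endpoints the services actually expose.

```shell
# check_health URL: fails on connection errors, timeouts (5 s) and HTTP status >= 400.
check_health() {
    curl --fail --silent --show-error --output /dev/null --max-time 5 "$1"
}

# probe_services BASE_URL SERVICE...: print one status line per service and
# return non-zero if any probe failed. The /health path is hypothetical.
probe_services() {
    base_url=$1; shift
    failed=0
    for service in "$@"; do
        if check_health "$base_url/$service/health"; then
            echo "$service: healthy"
        else
            echo "$service: UNHEALTHY"
            failed=1
        fi
    done
    return "$failed"
}

# recent_rollouts SERVICE NAMESPACE: list the revisions Kubernetes has recorded
# for the deployment, which helps spot a recent change.
recent_rollouts() {
    kubectl rollout history "deployment/$1" -n "$2"
}

# Usage: probe_services https://gateway.example.internal cart orders
#        recent_rollouts cart <namespace>
```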

3. Mitigate

Apply quick fixes if available
Roll back if a recent deployment caused the issue
Scale up if the service is resource-constrained
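
For the scale-up case, a small wrapper around `kubectl scale` might look like this. The function name `scale_service` is hypothetical, and the sketch assumes the service runs as a Kubernetes Deployment, as in the rollback commands elsewhere in this runbook.

```shell
# scale_service SERVICE NAMESPACE REPLICAS
# Refuses anything that is not a non-negative integer before touching the cluster.
scale_service() {
    case "$3" in
        ''|*[!0-9]*) echo "replicas must be a non-negative integer" >&2; return 1 ;;
    esac
    kubectl scale "deployment/$1" --replicas="$3" -n "$2"
}

# Usage: scale_service cart <namespace> 5
```

If a HorizontalPodAutoscaler manages the deployment, it may override a manually set replica count.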

4. Resolve

Implement permanent fix
Verify resolution
Update documentation
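
One possible way to verify that a fix has actually rolled out is to wait on `kubectl rollout status`, which exits non-zero if the deployment does not become ready within the timeout. The function name `verify_rollout` and the 120-second default are assumptions chosen for illustration.

```shell
# verify_rollout SERVICE NAMESPACE [TIMEOUT]
# Blocks until the deployment's rollout completes or the timeout expires.
verify_rollout() {
    timeout=${3:-120s}
    kubectl rollout status "deployment/$1" -n "$2" --timeout="$timeout"
}

# Usage: verify_rollout cart <namespace> && echo "rollout complete"
```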

5. Post-Mortem

Document incident
Identify root cause
Create action items
Update runbooks

Common Scenarios

Service Down

Check Kubernetes pods: kubectl get pods -n <namespace>
Check pod logs: kubectl logs <pod-name> -n <namespace>
Restart service: kubectl rollout restart deployment/<service> -n <namespace>
If the problem persists, roll back: kubectl rollout undo deployment/<service> -n <namespace>
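
The first two checks can be combined into a quick triage helper. This is a sketch, not an established script in this repository: the function names are hypothetical, and it relies on the STATUS column being the third field of `kubectl get pods --no-headers` output.

```shell
# unhealthy_pods NAMESPACE: names of pods whose STATUS is neither Running nor
# Completed (for example CrashLoopBackOff, Pending, ImagePullBackOff).
# Pods that are Running but not Ready are not caught by this filter.
unhealthy_pods() {
    kubectl get pods -n "$1" --no-headers |
        awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# triage_pods NAMESPACE: for each unhealthy pod, show the tail of its
# description (where events are listed) and the logs of the previous,
# crashed container if one exists.
triage_pods() {
    for pod in $(unhealthy_pods "$1"); do
        echo "=== $pod ==="
        kubectl describe pod "$pod" -n "$1" | tail -n 20
        kubectl logs "$pod" -n "$1" --previous --tail=50 ||
            echo "(no previous container logs for $pod)"
    done
}

# Usage: triage_pods <namespace>
```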

Database Issues

Check database connectivity
Review slow queries
Check connection pool
Scale database if needed
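
This runbook does not say which database engine is in use. Assuming PostgreSQL purely for illustration, the first three checks might be sketched as below; with another engine the commands will differ. Connection settings are taken from the standard libpq environment variables (`PGHOST`, `PGPORT`, `PGUSER`, `PGDATABASE`), and the function names are hypothetical.

```shell
# ASSUMPTION: PostgreSQL. Adapt or replace for other engines.

# db_reachable: succeeds if the server accepts connections.
db_reachable() {
    pg_isready --quiet
}

# long_running_queries [SECONDS]: non-idle sessions whose current query has
# been running longer than the threshold (default 30 seconds).
long_running_queries() {
    threshold=${1:-30}
    case "$threshold" in
        ''|*[!0-9]*) echo "threshold must be a whole number of seconds" >&2; return 1 ;;
    esac
    psql -X -c "SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
                FROM pg_stat_activity
                WHERE state <> 'idle'
                  AND now() - query_start > interval '$threshold seconds'
                ORDER BY runtime DESC;"
}

# connection_usage: server-side connections in use versus max_connections.
# An application-side pool needs its own metrics in addition to this.
connection_usage() {
    psql -X -A -t -c "SELECT count(*)::text || ' of ' || current_setting('max_connections')
                      FROM pg_stat_activity;"
}

# Usage: db_reachable && long_running_queries 60 && connection_usage
```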

High Error Rate

Check error logs
Review recent changes
Check external dependencies
Implement circuit breaker if needed
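
To put a rough number on the error rate while checking logs, one could count matching lines in a recent log window. The sketch below assumes error lines contain the word "error" in any letter case, which may not match the services' actual log format; the function names are hypothetical.

```shell
# error_percent: read log lines on stdin and print the whole-number percentage
# of lines containing "error" (case-insensitive). Prints 0 for empty input.
error_percent() {
    awk '{ total++ }
         tolower($0) ~ /error/ { errors++ }
         END { if (total == 0) print 0; else printf "%d\n", errors * 100 / total }'
}

# service_error_percent SERVICE NAMESPACE [WINDOW]: error percentage over the
# last WINDOW of logs (default 10m). Note that "kubectl logs deployment/..."
# reads from a single pod of the deployment, so this is a sample, not a total.
service_error_percent() {
    kubectl logs "deployment/$1" -n "$2" --since="${3:-10m}" | error_percent
}

# Usage: service_error_percent cart <namespace> 5m
```

Comparing the figure before and after a suspected change gives a quick signal on whether that change is the cause.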