Commit Graph

2 Commits

Author SHA1 Message Date
Ho Ngoc Hai
9409706c58 feat(monitoring): add comprehensive alerting rules, Alertmanager, and DR validation
Expand production monitoring with full alert coverage for database connections,
Redis memory/connections, container resources, disk usage, service health, and
backup integrity. Add Alertmanager service with Slack routing for critical and
warning alerts, and add automated backup verification to the pg-backup cron
schedule. Update runbook with DR validation procedures and quarterly checklist.

- Expand Prometheus alert rules from 4 to 24 alerts across 7 groups
- Add Alertmanager container (prom/alertmanager:v0.27.0) with Slack routing
- Configure inhibition rules (critical suppresses warning for same service)
- Schedule automated backup verification at 04:00 UTC daily
- Add Alertmanager datasource to Grafana provisioning
- Update runbook with Section 9: DR Validation (automated + manual procedures)
- Add SLACK_WEBHOOK_URL and Grafana vars to .env.example
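
The inhibition behaviour described above (a critical alert suppresses warning alerts for the same service) can be illustrated with a minimal Python sketch. This is not the Alertmanager configuration from this commit, only a model of the intended semantics. The `Alert` class, the `apply_inhibition` helper, the `service` field used for matching, and the alert names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str      # hypothetical alert name, not taken from the commit
    service: str   # assumed label used to match alerts to each other
    severity: str  # "critical" or "warning"


def apply_inhibition(alerts):
    """Drop warning alerts whose service also has a firing critical alert."""
    critical_services = {a.service for a in alerts if a.severity == "critical"}
    return [
        a for a in alerts
        if not (a.severity == "warning" and a.service in critical_services)
    ]


# Illustrative set of firing alerts (names are made up for this example).
firing = [
    Alert("PostgresDown", "postgres", "critical"),
    Alert("PostgresConnectionsHigh", "postgres", "warning"),
    Alert("RedisMemoryHigh", "redis", "warning"),
]

notified = apply_inhibition(firing)
for alert in notified:
    print(alert.severity, alert.service, alert.name)
```

In this sketch the postgres warning is dropped because a critical alert is firing for the same service, while the redis warning still goes out because nothing critical is firing for redis.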

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-11 20:15:36 +07:00
Ho Ngoc Hai
f27b13f712 docs: add production operational runbook
Create comprehensive docs/RUNBOOK.md covering all 14 production services,
health checks, 10 common incident scenarios with diagnosis/resolution,
recovery procedures (DB restore, Redis flush, rolling restart, rollback),
escalation matrix, monitoring dashboards, and PromQL queries.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-11 00:28:02 +07:00