feat(monitoring): add comprehensive alerting rules, Alertmanager, and DR validation

Expand production monitoring with full alert coverage for database connections,
Redis memory/connections, container resources, disk usage, service health, and
backup integrity. Add Alertmanager service with Slack routing for critical and
warning alerts, and add automated backup verification to the pg-backup cron
schedule. Update runbook with DR validation procedures and quarterly checklist.

- Expand Prometheus alert rules from 4 to 24 alerts across 7 groups
- Add Alertmanager container (prom/alertmanager:v0.27.0) with Slack routing
- Configure inhibition rules (critical suppresses warning for same service)
- Schedule automated backup verification at 04:00 UTC daily
- Add Alertmanager datasource to Grafana provisioning
- Update runbook with Section 9: DR Validation (automated + manual procedures)
- Add SLACK_WEBHOOK_URL and Grafana vars to .env.example
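
A minimal sketch of the inhibition behaviour described above, not the exact
config shipped in this commit. It assumes alerts carry `severity` and
`service` labels; both label names are assumptions for illustration.

```yaml
# Hypothetical alertmanager.yml fragment (label names are assumptions).
inhibit_rules:
  # While a critical alert is firing for a service...
  - source_matchers:
      - severity="critical"
    # ...mute warning-level alerts...
    target_matchers:
      - severity="warning"
    # ...but only when both alerts share the same value for `service`.
    equal: ['service']
```

The `equal` list scopes the suppression: a critical alert on one service
leaves warnings on other services untouched.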

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Ho Ngoc Hai
2026-04-11 20:15:36 +07:00
parent 33c2e5ac1d
commit 9409706c58
8 changed files with 1108 additions and 2 deletions

@@ -164,3 +164,14 @@ KYC_ENCRYPTION_KEY_VERSION=1
# Logging
# -----------------------------------------------------------------------------
LOG_LEVEL=info

# -----------------------------------------------------------------------------
# Monitoring & Alerting
# -----------------------------------------------------------------------------
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=CHANGE_ME
GRAFANA_PORT=3002
GRAFANA_ROOT_URL=http://localhost:3002
# Slack webhook for alert notifications (Alertmanager + CI/CD)
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/CHANGE_ME