feat(monitoring): add comprehensive alerting rules, Alertmanager, and DR validation
Expand production monitoring with full alert coverage for database connections, Redis memory/connections, container resources, disk usage, service health, and backup integrity. Add Alertmanager service with Slack routing for critical and warning alerts, and add automated backup verification to the pg-backup cron schedule. Update runbook with DR validation procedures and quarterly checklist.

- Expand Prometheus alert rules from 4 to 24 alerts across 7 groups
- Add Alertmanager container (prom/alertmanager:v0.27.0) with Slack routing
- Configure inhibition rules (critical suppresses warning for same service)
- Schedule automated backup verification at 04:00 UTC daily
- Add Alertmanager datasource to Grafana provisioning
- Update runbook with Section 9: DR Validation (automated + manual procedures)
- Add SLACK_WEBHOOK_URL and Grafana vars to .env.example

Co-Authored-By: Paperclip <noreply@paperclip.ing>
11
.env.example
@@ -164,3 +164,14 @@ KYC_ENCRYPTION_KEY_VERSION=1
 # Logging
 # -----------------------------------------------------------------------------
 LOG_LEVEL=info
+
+# -----------------------------------------------------------------------------
+# Monitoring & Alerting
+# -----------------------------------------------------------------------------
+GRAFANA_ADMIN_USER=admin
+GRAFANA_ADMIN_PASSWORD=CHANGE_ME
+GRAFANA_PORT=3002
+GRAFANA_ROOT_URL=http://localhost:3002
+
+# Slack webhook for alert notifications (Alertmanager + CI/CD)
+SLACK_WEBHOOK_URL=https://hooks.slack.com/services/CHANGE_ME
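Because the new variables ship with CHANGE_ME placeholders, a deployment could start with an unusable Grafana password or Slack webhook. The following is a minimal sketch of a pre-deploy check, assuming a plain KEY=VALUE env file like the one in this diff. The helper names parse_env and unset_placeholders are hypothetical and not part of this commit.

```python
# Hypothetical pre-deploy check: report env keys still set to a placeholder.
# Assumes simple KEY=VALUE lines with '#' comments, as in .env.example.

PLACEHOLDER = "CHANGE_ME"  # placeholder token used in .env.example


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so URLs containing '=' stay intact.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def unset_placeholders(env: dict[str, str]) -> list[str]:
    """Return the sorted keys whose value still contains the placeholder."""
    return sorted(key for key, value in env.items() if PLACEHOLDER in value)


# The monitoring block added by this commit, copied from the diff above.
EXAMPLE = """\
# Monitoring & Alerting
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=CHANGE_ME
GRAFANA_PORT=3002
GRAFANA_ROOT_URL=http://localhost:3002

# Slack webhook for alert notifications (Alertmanager + CI/CD)
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/CHANGE_ME
"""

if __name__ == "__main__":
    # prints ['GRAFANA_ADMIN_PASSWORD', 'SLACK_WEBHOOK_URL']
    print(unset_placeholders(parse_env(EXAMPLE)))
```

Run against a real .env rather than the example file, a non-empty result would suggest that alert delivery or Grafana login is not yet configured.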