feat(monitoring): add comprehensive alerting rules, Alertmanager, and DR validation
Expand production monitoring with full alert coverage for database connections, Redis memory/connections, container resources, disk usage, service health, and backup integrity. Add Alertmanager service with Slack routing for critical and warning alerts, and add automated backup verification to the pg-backup cron schedule. Update runbook with DR validation procedures and quarterly checklist. - Expand Prometheus alert rules from 4 to 24 alerts across 7 groups - Add Alertmanager container (prom/alertmanager:v0.27.0) with Slack routing - Configure inhibition rules (critical suppresses warning for same service) - Schedule automated backup verification at 04:00 UTC daily - Add Alertmanager datasource to Grafana provisioning - Update runbook with Section 9: DR Validation (automated + manual procedures) - Add SLACK_WEBHOOK_URL and Grafana vars to .env.example Co-Authored-By: Paperclip <noreply@paperclip.ing>
This commit is contained in:
@@ -21,3 +21,12 @@ datasources:
|
||||
matcherRegex: 'correlationId":"([^"]+)'
|
||||
name: correlationId
|
||||
url: '$${__value.raw}'
|
||||
|
||||
- name: Alertmanager
|
||||
uid: alertmanager
|
||||
type: alertmanager
|
||||
access: proxy
|
||||
url: http://alertmanager:9093
|
||||
editable: true
|
||||
jsonData:
|
||||
implementation: prometheus
|
||||
|
||||
Reference in New Issue
Block a user