Expand production monitoring with full alert coverage for database connections, Redis memory/connections, container resources, disk usage, service health, and backup integrity. Add Alertmanager service with Slack routing for critical and warning alerts, and add automated backup verification to the pg-backup cron schedule. Update runbook with DR validation procedures and quarterly checklist.

- Expand Prometheus alert rules from 4 to 24 alerts across 7 groups
- Add Alertmanager container (prom/alertmanager:v0.27.0) with Slack routing
- Configure inhibition rules (critical suppresses warning for same service)
- Schedule automated backup verification at 04:00 UTC daily
- Add Alertmanager datasource to Grafana provisioning
- Update runbook with Section 9: DR Validation (automated + manual procedures)
- Add SLACK_WEBHOOK_URL and Grafana vars to .env.example

Co-Authored-By: Paperclip <noreply@paperclip.ing>
33 lines
658 B
YAML
apiVersion: 1

datasources:
  - name: Prometheus
    uid: prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

  - name: Loki
    uid: loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true
    jsonData:
      derivedFields:
        - datasourceUid: prometheus
          matcherRegex: 'correlationId":"([^"]+)'
          name: correlationId
          url: '$${__value.raw}'

  - name: Alertmanager
    uid: alertmanager
    type: alertmanager
    access: proxy
    url: http://alertmanager:9093
    editable: true
    jsonData:
      implementation: prometheus
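
The commit message mentions Slack routing for critical and warning alerts and an inhibition rule where a critical alert suppresses warnings for the same service. The Alertmanager configuration itself is not shown in this file, so the following is only a minimal sketch of what such a config might look like. The receiver names, channel names, the `severity` and `service` label names, and the webhook file path are all assumptions for illustration, not taken from the repository.

```yaml
# Hypothetical alertmanager.yml sketch (not the repository's actual file).
route:
  receiver: slack-warning            # assumed default receiver
  group_by: ['alertname', 'service'] # assumes alerts carry a `service` label
  routes:
    - matchers:
        - severity="critical"        # assumes a `severity` label on alert rules
      receiver: slack-critical

receivers:
  - name: slack-critical
    slack_configs:
      - api_url_file: /etc/alertmanager/slack_webhook_url  # assumed path
        channel: '#alerts-critical'                        # assumed channel
        send_resolved: true
  - name: slack-warning
    slack_configs:
      - api_url_file: /etc/alertmanager/slack_webhook_url  # assumed path
        channel: '#alerts-warning'                         # assumed channel
        send_resolved: true

inhibit_rules:
  # While a critical alert fires for a service, mute warning alerts
  # that share the same `service` label value.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['service']
```

Alertmanager does not expand environment variables inside its config file, so the sketch reads the webhook URL from a file via `api_url_file`. How the SLACK_WEBHOOK_URL value from .env.example actually reaches the container is not stated here.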