Expand production monitoring with full alert coverage for database connections, Redis memory/connections, container resources, disk usage, service health, and backup integrity. Add Alertmanager service with Slack routing for critical and warning alerts, and add automated backup verification to the pg-backup cron schedule. Update runbook with DR validation procedures and quarterly checklist.

- Expand Prometheus alert rules from 4 to 24 alerts across 7 groups
- Add Alertmanager container (prom/alertmanager:v0.27.0) with Slack routing
- Configure inhibition rules (critical suppresses warning for same service)
- Schedule automated backup verification at 04:00 UTC daily
- Add Alertmanager datasource to Grafana provisioning
- Update runbook with Section 9: DR Validation (automated + manual procedures)
- Add SLACK_WEBHOOK_URL and Grafana vars to .env.example

Co-Authored-By: Paperclip <noreply@paperclip.ing>
30 lines
696 B
YAML
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - 'alert-rules.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'goodgo-api'
    metrics_path: '/metrics'
    static_configs:
      # host.docker.internal for dev (API on host), api:3001 for prod (API in container)
      - targets: ['host.docker.internal:3001']
        labels:
          service: 'goodgo-api'
          environment: 'development'
      - targets: ['api:3001']
        labels:
          service: 'goodgo-api'
          environment: 'production'

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
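
The commit message describes Slack routing for critical and warning alerts, plus an inhibition rule where a critical alert suppresses warnings for the same service. The Alertmanager config itself is not shown on this page, so the following is only a minimal sketch of what such a file could look like. The receiver names, channel names, and the webhook file path are assumptions, and it further assumes that the alert rules attach `severity` and `service` labels. Alertmanager does not expand environment variables in its config, so the sketch assumes SLACK_WEBHOOK_URL is written to a file at container start and read through `api_url_file`.

```yaml
# alertmanager.yml (hypothetical sketch, not the file from this commit)
route:
  receiver: slack-warning            # default receiver (assumed name)
  group_by: ['alertname', 'service']
  routes:
    # Send anything labelled severity=critical to the critical channel
    - matchers:
        - 'severity="critical"'
      receiver: slack-critical

receivers:
  - name: slack-critical
    slack_configs:
      - api_url_file: /etc/alertmanager/slack_webhook_url   # assumed path
        channel: '#alerts-critical'                         # assumed channel
        send_resolved: true
  - name: slack-warning
    slack_configs:
      - api_url_file: /etc/alertmanager/slack_webhook_url   # assumed path
        channel: '#alerts-warning'                          # assumed channel
        send_resolved: true

inhibit_rules:
  # A firing critical alert mutes warning alerts that share its service label
  - source_matchers:
      - 'severity="critical"'
    target_matchers:
      - 'severity="warning"'
    equal: ['service']
```

With `equal: ['service']`, a warning is suppressed only when a critical alert with the same `service` value is firing, so warnings for unrelated services still reach Slack.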