goodgo-platform/monitoring/prometheus/prometheus.yml
Ho Ngoc Hai 9409706c58 feat(monitoring): add comprehensive alerting rules, Alertmanager, and DR validation
Expand production monitoring with full alert coverage for database connections,
Redis memory/connections, container resources, disk usage, service health, and
backup integrity. Add Alertmanager service with Slack routing for critical and
warning alerts, and add automated backup verification to the pg-backup cron
schedule. Update runbook with DR validation procedures and quarterly checklist.

- Expand Prometheus alert rules from 4 to 24 alerts across 7 groups
- Add Alertmanager container (prom/alertmanager:v0.27.0) with Slack routing
- Configure inhibition rules (critical suppresses warning for same service)
- Schedule automated backup verification at 04:00 UTC daily
- Add Alertmanager datasource to Grafana provisioning
- Update runbook with Section 9: DR Validation (automated + manual procedures)
- Add SLACK_WEBHOOK_URL and Grafana vars to .env.example

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-11 20:15:36 +07:00
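
The inhibition behaviour described in the commit message (a critical alert suppresses warnings for the same service) could be expressed in Alertmanager's configuration roughly as follows. This is only a sketch: the actual alertmanager.yml is not shown here, and the `severity` label values and the receiver name are assumptions, while `service` is the label attached to targets in prometheus.yml.

```yaml
# Hypothetical alertmanager.yml fragment; names are illustrative only.
route:
  receiver: 'slack-default'          # assumed receiver name
  group_by: ['alertname', 'service']

inhibit_rules:
  # While a critical alert is firing, mute warning alerts
  # that carry the same value for the 'service' label.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['service']
```

The `equal` list is what scopes the suppression to a single service: warnings for other services continue to be delivered.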

30 lines
696 B
YAML

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - 'alert-rules.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'goodgo-api'
    metrics_path: '/metrics'
    static_configs:
      # host.docker.internal for dev (API on host), api:3001 for prod (API in container)
      - targets: ['host.docker.internal:3001']
        labels:
          service: 'goodgo-api'
          environment: 'development'
      - targets: ['api:3001']
        labels:
          service: 'goodgo-api'
          environment: 'production'

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
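
For context, the `rule_files` entry above loads alert-rules.yml, which is not shown here. A rule in that file that uses the `job` and `environment` labels defined in this config might look like the sketch below. The group name, alert name, `for` duration, and `severity` label are assumptions for illustration, not the contents of the real file.

```yaml
# Hypothetical alert-rules.yml fragment; not the repository's actual rules.
groups:
  - name: service-health               # assumed group name
    rules:
      - alert: GoodgoApiDown           # assumed alert name
        # 'up' is set to 0 by Prometheus when a scrape of the target fails.
        expr: up{job="goodgo-api", environment="production"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: 'goodgo-api target {{ $labels.instance }} is not responding to scrapes'
```

Filtering on `environment` matters with this scrape config: both the development and production targets are listed in the same job, so an unfiltered `up == 0` rule would fire for whichever target is unreachable in the current deployment.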