feat(monitoring): add comprehensive alerting rules, Alertmanager, and DR validation

Expand production monitoring with full alert coverage for database connections, Redis memory/connections, container resources, disk usage, service health, and backup integrity. Add Alertmanager service with Slack routing for critical and warning alerts, and add automated backup verification to the pg-backup cron schedule. Update runbook with DR validation procedures and quarterly checklist.

- Expand Prometheus alert rules from 4 to 24 alerts across 7 groups
- Add Alertmanager container (prom/alertmanager:v0.27.0) with Slack routing
- Configure inhibition rules (critical suppresses warning for same service)
- Schedule automated backup verification at 04:00 UTC daily
- Add Alertmanager datasource to Grafana provisioning
- Update runbook with Section 9: DR Validation (automated + manual procedures)
- Add SLACK_WEBHOOK_URL and Grafana vars to .env.example

Co-Authored-By: Paperclip <noreply@paperclip.ing>

210
docs/RUNBOOK.md
@@ -53,6 +53,7 @@
| **promtail** | `grafana/promtail:3.0.0` | — | 0.25 CPU / 256 MB | — |
| **prometheus** | `prom/prometheus:v2.51.0` | 9090 (internal) | 0.5 CPU / 1 GB | `wget /-/healthy` |
| **grafana** | `grafana/grafana:10.4.1` | 3002 (external) | 0.5 CPU / 512 MB | `wget /api/health` |
| **alertmanager** | `prom/alertmanager:v0.27.0` | 9093 (internal) | 0.25 CPU / 256 MB | `wget /-/healthy` |

### Development-Only Services (`docker-compose.yml`)

@@ -67,7 +68,7 @@ web --> api --> pgbouncer --> postgres
|-> minio
|-> ai-services

grafana --> prometheus
grafana --> prometheus --> alertmanager
|-> loki --> promtail (Docker socket)

pg-backup --> postgres
@@ -128,6 +129,9 @@ curl -sf http://localhost:3100/ready && echo "Loki OK"

# Grafana
curl -sf http://localhost:3002/api/health | jq .

# Alertmanager
curl -sf http://localhost:9093/-/healthy && echo "Alertmanager OK"
```

### Container Resource Usage
@@ -864,6 +868,7 @@ All dashboards are provisioned automatically via `monitoring/grafana/provisionin
**Data Sources:**
- **Prometheus** (`http://prometheus:9090`) — Metrics (default)
- **Loki** (`http://loki:3100`) — Logs, with correlation ID linking to Prometheus
- **Alertmanager** (`http://alertmanager:9093`) — Alert state and silences

---

@@ -963,13 +968,216 @@ rate(container_cpu_usage_seconds_total{name=~"goodgo-.*"}[5m])

---

## 9. Disaster Recovery Validation

### Automated Verification

Backup verification runs **daily at 04:00 UTC** inside the `pg-backup` container. It restores the latest backup to an isolated test database and checks:

- Table existence (all 22 Prisma models)
- Row count comparison against live database
- Data checksums on critical tables (User, Property, Listing, Payment, Subscription, Transaction, Plan)
- PostGIS extension availability
- Index count match
- Enum type count match

**Check latest verification report:**

```bash
docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .
```

**Check verification logs:**

```bash
docker exec goodgo-pg-backup cat /var/log/pg-verify.log
```

### Manual DR Validation Procedure

Run this quarterly (or after major schema changes) to validate the full DR process end-to-end.

#### Step 1: Verify Backups Exist and Are Recent

```bash
# List backups with timestamps and sizes
# (sh -c makes the glob expand inside the container, not on the host)
docker exec goodgo-pg-backup sh -c 'ls -lht /backups/goodgo_*.sql.gz'

# Identify the latest backup; confirm its timestamp above is < 25 hours old
LATEST=$(docker exec goodgo-pg-backup sh -c 'ls -t /backups/goodgo_*.sql.gz | head -1')
echo "Latest backup: $LATEST"
```
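
The `< 25 hours` criterion in Step 1 can also be checked mechanically rather than by reading timestamps. The following is a minimal sketch, not part of the shipped scripts: `backup_age_ok` and `check_latest_backup` are hypothetical helper names, and it assumes `stat -c %Y` is available inside the `goodgo-pg-backup` container.

```bash
# backup_age_ok MTIME_EPOCH NOW_EPOCH [MAX_HOURS]
# Returns 0 when the backup is younger than MAX_HOURS (default 25).
backup_age_ok() {
  local mtime="$1" now="$2" max_hours="${3:-25}"
  local age_seconds=$(( now - mtime ))
  [ "$age_seconds" -lt $(( max_hours * 3600 )) ]
}

# check_latest_backup: reads the newest backup's mtime from inside the
# container and compares it with the current time on the host.
# Assumption: `stat -c %Y` (modification time as epoch seconds) is supported.
check_latest_backup() {
  local latest mtime
  latest=$(docker exec goodgo-pg-backup sh -c 'ls -t /backups/goodgo_*.sql.gz | head -1') || return 1
  mtime=$(docker exec goodgo-pg-backup stat -c %Y "$latest") || return 1
  if backup_age_ok "$mtime" "$(date +%s)"; then
    echo "OK: $latest is recent"
  else
    echo "STALE: $latest is older than 25 hours" >&2
    return 1
  fi
}
```

Calling `check_latest_backup` exits non-zero when the newest backup is stale or missing, so it could be used as a gate before the later steps.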

#### Step 2: Run Verification Against Latest Backup

```bash
# Automated verification (creates temp DB, validates, drops)
docker exec -e REPORT_FILE=/backups/verify-latest.json goodgo-pg-backup \
  /scripts/pg-verify-backup.sh

# Review results
docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .
```

**Expected output:** All checks pass, and the restore completes in < 60 seconds for a typical dataset.

#### Step 3: Test Full Restore (Staging Only)

> ⚠️ **WARNING:** Only perform this on a staging or isolated environment. Never on production.

```bash
# 1. Create a separate test environment
docker compose -f docker-compose.yml -p goodgo-dr-test up -d postgres

# 2. Wait for PostgreSQL to be ready (poll until it accepts connections)
until docker exec goodgo-dr-test-postgres-1 pg_isready; do sleep 1; done

# 3. Run restore against the test environment
PGHOST=localhost PGPORT=<test-port> PGUSER=goodgo PGPASSWORD=<password> \
  /scripts/pg-restore.sh /backups/<latest-backup>.sql.gz

# 4. Verify key tables (one -c per query so every count is printed)
docker exec goodgo-dr-test-postgres-1 psql -U goodgo -d goodgo \
  -c 'SELECT count(*) FROM "User";' \
  -c 'SELECT count(*) FROM "Property";' \
  -c 'SELECT count(*) FROM "Listing";'

# 5. Clean up test environment
docker compose -f docker-compose.yml -p goodgo-dr-test down -v
```

#### Step 4: Validate Service Recovery Chain

Test that all services can start from a clean state with restored data:

```bash
# 1. Note current service status
docker compose -f docker-compose.prod.yml ps --format "table {{.Name}}\t{{.Status}}\t{{.Health}}"

# 2. Restart all services in dependency order
docker compose -f docker-compose.prod.yml restart postgres
sleep 10  # Wait for PostgreSQL

docker compose -f docker-compose.prod.yml restart pgbouncer redis typesense
sleep 10  # Wait for data services

docker compose -f docker-compose.prod.yml restart api web ai-services
sleep 15  # Wait for application services

# 3. Verify all health checks
curl -sf http://localhost:3001/health/ready | jq .
curl -sf http://localhost:3000 > /dev/null && echo "Web OK"
curl -sf http://localhost:9090/-/healthy && echo "Prometheus OK"
curl -sf http://localhost:9093/-/healthy && echo "Alertmanager OK"
curl -sf http://localhost:3002/api/health | jq .
```
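
The fixed `sleep` values in Step 4 may be too short on a slow host and needlessly long on a fast one. As an alternative, the sketch below polls a readiness command until it succeeds or a retry budget runs out. `wait_for` is a hypothetical helper, not something defined in the repository, and the example invocations in the comments reuse the service names and health endpoints from this runbook.

```bash
# wait_for DESCRIPTION MAX_ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND repeatedly until it exits 0 or MAX_ATTEMPTS is reached.
wait_for() {
  local desc="$1" attempts="$2" delay="$3"
  shift 3
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "$desc OK (attempt $i)"
      return 0
    fi
    sleep "$delay"
    i=$(( i + 1 ))
  done
  echo "$desc not ready after $attempts attempts" >&2
  return 1
}

# Example invocations (not executed here):
#   wait_for "PostgreSQL" 30 2 docker compose -f docker-compose.prod.yml exec -T postgres pg_isready
#   wait_for "API" 30 2 curl -sf http://localhost:3001/health/ready
```

Because `wait_for` returns non-zero on timeout, a restart script can stop at the first service that fails to come up instead of continuing down the dependency chain.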

#### Step 5: Validate Alerting Pipeline

```bash
# 1. Check Prometheus is loading alert rules
curl -sf http://localhost:9090/api/v1/rules | jq '.data.groups | length'
# Expected: 7 groups

# 2. Check current alerts (should be empty if healthy)
curl -sf http://localhost:9090/api/v1/alerts | jq '.data.alerts | length'

# 3. Check Prometheus has an active Alertmanager to send alerts to
curl -sf http://localhost:9090/api/v1/alertmanagers | jq '.data.activeAlertmanagers'

# 4. Verify Alertmanager config is loaded
curl -sf http://localhost:9093/api/v2/status | jq '.config'
```
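
The checks in Step 5 confirm that rules and configuration are loaded, but they do not exercise delivery to Slack. One way to do that is to post a synthetic alert to Alertmanager's `POST /api/v2/alerts` endpoint, which accepts a JSON array of alerts. This is a sketch under assumptions: the alert name `DRValidationTest` and the helper names are hypothetical, and the `severity` and `service` label names are guesses at what the routing in `alertmanager.yml` matches on.

```bash
# test_alert_payload [SEVERITY]
# Prints a JSON array with one synthetic alert (default severity: warning).
# No endsAt is set, so Alertmanager resolves it after its resolve_timeout.
test_alert_payload() {
  local severity="${1:-warning}"
  printf '[{"labels":{"alertname":"DRValidationTest","severity":"%s","service":"dr-test"},"annotations":{"summary":"Synthetic alert from DR validation, safe to ignore"}}]\n' "$severity"
}

# send_test_alert [SEVERITY]
# Posts the synthetic alert to the local Alertmanager instance.
send_test_alert() {
  test_alert_payload "$1" |
    curl -sf -X POST -H 'Content-Type: application/json' \
      --data @- http://localhost:9093/api/v2/alerts
}
```

After running `send_test_alert critical`, the alert should appear in `curl -sf http://localhost:9093/api/v2/alerts` and, if routing matches, in the Slack channel.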

### DR Validation Checklist

Use this checklist during quarterly DR reviews:

- [ ] Latest backup is < 25 hours old
- [ ] Automated verification report shows all checks passed
- [ ] Manual restore to test DB succeeds with correct row counts
- [ ] Full service restart completes within RTO target (< 30 min)
- [ ] All health endpoints respond after restart
- [ ] Prometheus alert rules are loaded (7 groups)
- [ ] Alertmanager is reachable and configured
- [ ] Slack notification channel is receiving test alerts
- [ ] Grafana dashboards show data after restart
- [ ] Typesense search returns results after restart

### RPO/RTO Summary

| Metric | Target | Actual (Measured) | Notes |
|--------|--------|-------------------|-------|
| **RPO** | ≤ 24 hours | ~24h (daily at 02:00 UTC) | Reduce with WAL archiving |
| **RTO — Local backup** | ≤ 15 minutes | Measure during DR test | Restore + service restart |
| **RTO — Off-site backup** | ≤ 30 minutes | Measure during DR test | Add transfer time |
| **RTO — Full host recovery** | ≤ 60 minutes | Measure during DR test | New host + restore + deploy |
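
To fill in the "Actual (Measured)" column, the restore and restart can be timed and compared against the target. The helpers below are a hypothetical sketch (neither `measure_seconds` nor `within_rto` exists in the repository); they only wrap `date +%s` arithmetic around whatever command performs the recovery.

```bash
# measure_seconds COMMAND...
# Runs COMMAND, prints the elapsed whole seconds, and returns COMMAND's status.
measure_seconds() {
  local start end status
  start=$(date +%s)
  "$@" >/dev/null 2>&1
  status=$?
  end=$(date +%s)
  echo $(( end - start ))
  return "$status"
}

# within_rto ELAPSED_SECONDS TARGET_MINUTES
# Returns 0 when the elapsed time is within the target.
within_rto() {
  [ "$1" -le $(( $2 * 60 )) ]
}

# Example (not executed here; the restore command is a placeholder):
#   elapsed=$(measure_seconds <restore-and-restart-command>)
#   within_rto "$elapsed" 15 && echo "Local-backup RTO met: ${elapsed}s"
```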

---

## Appendix: Alert Rules Reference

### API & Error Alerts

| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `ApiLatencyP99High` | p99 > 1s | Warning | 5 min |
| `ApiEndpointLatencyP99High` | Per-route p99 > 2s | Warning | 5 min |
| `ApiLatencyP99Critical` | p99 > 3s (SLO breach) | Critical | 3 min |
| `ApiErrorRate5xxHigh` | 5xx rate > 1% | Warning | 5 min |
| `ApiErrorRate5xxCritical` | 5xx rate > 5% | Critical | 3 min |
| `ApiNoTraffic` | Request rate = 0 | Warning | 10 min |

### Database Alerts

| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `PostgresActiveConnectionsHigh` | Active connections > 15 | Warning | 5 min |
| `PostgresConnectionPoolCritical` | Total connections > 180 | Critical | 2 min |
| `PostgresSlowQueries` | Lock-waiting queries > 5 | Warning | 5 min |
| `PostgresDown` | API scrape target down | Critical | 1 min |

### Redis Alerts

| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `RedisMemoryHigh` | Memory usage > 80% | Warning | 5 min |
| `RedisMemoryCritical` | Memory usage > 95% | Critical | 2 min |
| `RedisConnectedClientsHigh` | Clients > 150 | Warning | 5 min |
| `RedisRejectedConnections` | Rejected connections > 0 | Critical | 1 min |

### Container Resource Alerts

| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `ContainerRestartLoop` | > 3 restarts in 15 min | Critical | 5 min |
| `ContainerMemoryHigh` | Memory > 85% of limit | Warning | 5 min |
| `ContainerCPUThrottled` | CPU throttle rate > 0.5s/s | Warning | 10 min |

### Disk & Infrastructure Alerts

| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `HostDiskUsageHigh` | Root disk > 80% | Warning | 10 min |
| `HostDiskUsageCritical` | Root disk > 90% | Critical | 5 min |
| `ApiHealthCheckFailing` | Health probe fails | Critical | 2 min |
| `PrometheusTargetDown` | Scrape target down | Warning | 5 min |

### Backup Alerts

| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `BackupTooOld` | Last backup > 25 hours ago | Warning | 5 min |
| `BackupVerificationFailed` | Verify result = fail | Warning | 1 min |

### Alert Routing

Alerts are routed via Alertmanager (`monitoring/alertmanager/alertmanager.yml`):

| Channel | Routes | Repeat Interval |
|---------|--------|-----------------|
| `#sre-oncall` (Slack) | All warning alerts | 4 hours |
| `#sre-oncall` (Slack) | All critical alerts (priority) | 1 hour |
| `#infrastructure` (Slack) | Backup-related alerts | 6 hours |

**Inhibition:** Warning alerts are suppressed when a critical alert for the same service is already firing.

Alert rules are defined in `monitoring/prometheus/alert-rules.yml` and evaluated every 15 seconds.
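
To confirm that the inhibition rules described above are present in the configuration Alertmanager is actually running, the loaded config can be pulled from the status API and inspected. This is a rough sketch: `has_inhibit_rules` is a hypothetical helper that only looks for a top-level `inhibit_rules:` key in YAML read from stdin, and it does not validate the content of the rules.

```bash
# has_inhibit_rules
# Reads Alertmanager YAML config on stdin; returns 0 if a top-level
# `inhibit_rules:` key is present, non-zero otherwise.
has_inhibit_rules() {
  grep -q '^inhibit_rules:'
}

# Example (not executed here): /api/v2/status exposes the loaded config
# as a YAML string under .config.original.
#   curl -sf http://localhost:9093/api/v2/status \
#     | jq -r '.config.original' \
#     | has_inhibit_rules && echo "Inhibition rules loaded"
```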