feat(monitoring): add comprehensive alerting rules, Alertmanager, and DR validation

Expand production monitoring with full alert coverage for database connections, Redis memory/connections, container resources, disk usage, service health, and backup integrity. Add Alertmanager service with Slack routing for critical and warning alerts, and add automated backup verification to the pg-backup cron schedule. Update runbook with DR validation procedures and quarterly checklist.

- Expand Prometheus alert rules from 4 to 24 alerts across 7 groups
- Add Alertmanager container (prom/alertmanager:v0.27.0) with Slack routing
- Configure inhibition rules (critical suppresses warning for same service)
- Schedule automated backup verification at 04:00 UTC daily
- Add Alertmanager datasource to Grafana provisioning
- Update runbook with Section 9: DR Validation (automated + manual procedures)
- Add SLACK_WEBHOOK_URL and Grafana vars to .env.example

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-11 20:15:36 +07:00
parent 33c2e5ac1d
commit 9409706c58
8 changed files with 1108 additions and 2 deletions
--- a/docs/RUNBOOK.md
+++ b/docs/RUNBOOK.md
@@ -53,6 +53,7 @@
 | **promtail** | `grafana/promtail:3.0.0` | — | 0.25 CPU / 256 MB | — |
 | **prometheus** | `prom/prometheus:v2.51.0` | 9090 (internal) | 0.5 CPU / 1 GB | `wget /-/healthy` |
 | **grafana** | `grafana/grafana:10.4.1` | 3002 (external) | 0.5 CPU / 512 MB | `wget /api/health` |
+| **alertmanager** | `prom/alertmanager:v0.27.0` | 9093 (internal) | 0.25 CPU / 256 MB | `wget /-/healthy` |

 ### Development-Only Services (`docker-compose.yml`)

@@ -67,7 +68,7 @@ web --> api --> pgbouncer --> postgres
                  |-> minio
                  |-> ai-services

-grafana --> prometheus
+grafana --> prometheus --> alertmanager
        |-> loki --> promtail (Docker socket)

 pg-backup --> postgres
@@ -128,6 +129,9 @@ curl -sf http://localhost:3100/ready && echo "Loki OK"

 # Grafana
 curl -sf http://localhost:3002/api/health | jq .
+
+# Alertmanager
+curl -sf http://localhost:9093/-/healthy && echo "Alertmanager OK"
 ```

 ### Container Resource Usage
@@ -864,6 +868,7 @@ All dashboards are provisioned automatically via `monitoring/grafana/provisionin
 **Data Sources:**
 - **Prometheus** (`http://prometheus:9090`) — Metrics (default)
 - **Loki** (`http://loki:3100`) — Logs, with correlation ID linking to Prometheus
+- **Alertmanager** (`http://alertmanager:9093`) — Alert state and silences

 ---

@@ -963,13 +968,216 @@ rate(container_cpu_usage_seconds_total{name=~"goodgo-.*"}[5m])

 ---

+## 9. Disaster Recovery Validation
+
+### Automated Verification
+
+Backup verification runs **daily at 04:00 UTC** inside the `pg-backup` container. It restores the latest backup to an isolated test database and checks:
+
+- Table existence (all 22 Prisma models)
+- Row count comparison against live database
+- Data checksums on critical tables (User, Property, Listing, Payment, Subscription, Transaction, Plan)
+- PostGIS extension availability
+- Index count match
+- Enum type count match
+
+**Check latest verification report:**
+
+```bash
+docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .
+```
+
+**Check verification logs:**
+
+```bash
+docker exec goodgo-pg-backup cat /var/log/pg-verify.log
+```
+
+### Manual DR Validation Procedure
+
+Run this quarterly (or after major schema changes) to validate the full DR process end-to-end.
+
+#### Step 1: Verify Backups Exist and Are Recent
+
+```bash
+# List backups with timestamps and sizes (sh -c so the glob expands inside the container)
+docker exec goodgo-pg-backup sh -c 'ls -lht /backups/goodgo_*.sql.gz'
+
+# Identify the latest backup; confirm from the listing above that it is < 25 hours old
+LATEST=$(docker exec goodgo-pg-backup sh -c 'ls -t /backups/goodgo_*.sql.gz | head -1')
+echo "Latest backup: $LATEST"
+```
+
+#### Step 2: Run Verification Against Latest Backup
+
+```bash
+# Automated verification (creates temp DB, validates, drops)
+docker exec -e REPORT_FILE=/backups/verify-latest.json goodgo-pg-backup \
+  /scripts/pg-verify-backup.sh
+
+# Review results
+docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .
+```
+
+**Expected output:** All checks pass, and the restore completes in < 60 seconds for a typical dataset.
+
+#### Step 3: Test Full Restore (Staging Only)
+
+> ⚠️ **WARNING:** Only perform this on a staging or isolated environment. Never on production.
+
+```bash
+# 1. Create a separate test environment
+docker compose -f docker-compose.yml -p goodgo-dr-test up -d postgres
+
+# 2. Wait for PostgreSQL to be ready (poll until pg_isready succeeds)
+until docker exec goodgo-dr-test-postgres-1 pg_isready; do sleep 2; done
+
+# 3. Run restore against the test environment
+PGHOST=localhost PGPORT=<test-port> PGUSER=goodgo PGPASSWORD=<password> \
+  /scripts/pg-restore.sh /backups/<latest-backup>.sql.gz
+
+# 4. Verify key tables
+docker exec goodgo-dr-test-postgres-1 psql -U goodgo -d goodgo -c \
+  "SELECT count(*) FROM \"User\"; SELECT count(*) FROM \"Property\"; SELECT count(*) FROM \"Listing\";"
+
+# 5. Clean up test environment
+docker compose -f docker-compose.yml -p goodgo-dr-test down -v
+```
+
+#### Step 4: Validate Service Recovery Chain
+
+Test that all services can start from a clean state with restored data:
+
+```bash
+# 1. Note current service status
+docker compose -f docker-compose.prod.yml ps --format "table {{.Name}}\t{{.Status}}\t{{.Health}}"
+
+# 2. Restart all services in dependency order
+docker compose -f docker-compose.prod.yml restart postgres
+sleep 10  # Wait for PostgreSQL
+
+docker compose -f docker-compose.prod.yml restart pgbouncer redis typesense
+sleep 10  # Wait for data services
+
+docker compose -f docker-compose.prod.yml restart api web ai-services
+sleep 15  # Wait for application services
+
+# 3. Verify all health checks
+curl -sf http://localhost:3001/health/ready | jq .
+curl -sf http://localhost:3000 > /dev/null && echo "Web OK"
+curl -sf http://localhost:9090/-/healthy && echo "Prometheus OK"
+curl -sf http://localhost:9093/-/healthy && echo "Alertmanager OK"
+curl -sf http://localhost:3002/api/health | jq .
+```
+
+#### Step 5: Validate Alerting Pipeline
+
+```bash
+# 1. Check Prometheus is loading alert rules
+curl -sf http://localhost:9090/api/v1/rules | jq '.data.groups | length'
+# Expected: 7 groups
+
+# 2. Count current alerts (expect 0 if healthy)
+curl -sf http://localhost:9090/api/v1/alerts | jq '.data.alerts | length'
+
+# 3. Check Alertmanager cluster status (received alerts are listed at /api/v2/alerts)
+curl -sf http://localhost:9093/api/v2/status | jq '.cluster'
+
+# 4. Verify Alertmanager config is loaded
+curl -sf http://localhost:9093/api/v2/status | jq '.config'
+```
+
+### DR Validation Checklist
+
+Use this checklist during quarterly DR reviews:
+
+- [ ] Latest backup is < 25 hours old
+- [ ] Automated verification report shows all checks passed
+- [ ] Manual restore to test DB succeeds with correct row counts
+- [ ] Full service restart completes within RTO target (< 30 min)
+- [ ] All health endpoints respond after restart
+- [ ] Prometheus alert rules are loaded (7 groups)
+- [ ] Alertmanager is reachable and configured
+- [ ] Slack notification channel is receiving test alerts
+- [ ] Grafana dashboards show data after restart
+- [ ] Typesense search returns results after restart
+
+### RPO/RTO Summary
+
+| Metric | Target | Actual (Measured) | Notes |
+|--------|--------|-------------------|-------|
+| **RPO** | ≤ 24 hours | ~24h (daily at 02:00 UTC) | Reduce with WAL archiving |
+| **RTO — Local backup** | ≤ 15 minutes | Measure during DR test | Restore + service restart |
+| **RTO — Off-site backup** | ≤ 30 minutes | Measure during DR test | Add transfer time |
+| **RTO — Full host recovery** | ≤ 60 minutes | Measure during DR test | New host + restore + deploy |
+
+---
+
 ## Appendix: Alert Rules Reference

+### API & Error Alerts
+
 | Alert | Expression | Severity | Duration |
 |-------|-----------|----------|----------|
 | `ApiLatencyP99High` | p99 > 1s | Warning | 5 min |
 | `ApiEndpointLatencyP99High` | Per-route p99 > 2s | Warning | 5 min |
 | `ApiLatencyP99Critical` | p99 > 3s (SLO breach) | Critical | 3 min |
 | `ApiErrorRate5xxHigh` | 5xx rate > 1% | Warning | 5 min |
+| `ApiErrorRate5xxCritical` | 5xx rate > 5% | Critical | 3 min |
+| `ApiNoTraffic` | Request rate = 0 | Warning | 10 min |
+
+### Database Alerts
+
+| Alert | Expression | Severity | Duration |
+|-------|-----------|----------|----------|
+| `PostgresActiveConnectionsHigh` | Active connections > 15 | Warning | 5 min |
+| `PostgresConnectionPoolCritical` | Total connections > 180 | Critical | 2 min |
+| `PostgresSlowQueries` | Lock-waiting queries > 5 | Warning | 5 min |
+| `PostgresDown` | API scrape target down | Critical | 1 min |
+
+### Redis Alerts
+
+| Alert | Expression | Severity | Duration |
+|-------|-----------|----------|----------|
+| `RedisMemoryHigh` | Memory usage > 80% | Warning | 5 min |
+| `RedisMemoryCritical` | Memory usage > 95% | Critical | 2 min |
+| `RedisConnectedClientsHigh` | Clients > 150 | Warning | 5 min |
+| `RedisRejectedConnections` | Rejected connections > 0 | Critical | 1 min |
+
+### Container Resource Alerts
+
+| Alert | Expression | Severity | Duration |
+|-------|-----------|----------|----------|
+| `ContainerRestartLoop` | > 3 restarts in 15 min | Critical | 5 min |
+| `ContainerMemoryHigh` | Memory > 85% of limit | Warning | 5 min |
+| `ContainerCPUThrottled` | CPU throttle rate > 0.5s/s | Warning | 10 min |
+
+### Disk & Infrastructure Alerts
+
+| Alert | Expression | Severity | Duration |
+|-------|-----------|----------|----------|
+| `HostDiskUsageHigh` | Root disk > 80% | Warning | 10 min |
+| `HostDiskUsageCritical` | Root disk > 90% | Critical | 5 min |
+| `ApiHealthCheckFailing` | Health probe fails | Critical | 2 min |
+| `PrometheusTargetDown` | Scrape target down | Warning | 5 min |
+
+### Backup Alerts
+
+| Alert | Expression | Severity | Duration |
+|-------|-----------|----------|----------|
+| `BackupTooOld` | Last backup > 25 hours ago | Warning | 5 min |
+| `BackupVerificationFailed` | Verify result = fail | Warning | 1 min |
+
+### Alert Routing
+
+Alerts are routed via Alertmanager (`monitoring/alertmanager/alertmanager.yml`):
+
+| Channel | Routes | Repeat Interval |
+|---------|--------|-----------------|
+| `#sre-oncall` (Slack) | All warning alerts | 4 hours |
+| `#sre-oncall` (Slack) | All critical alerts (priority) | 1 hour |
+| `#infrastructure` (Slack) | Backup-related alerts | 6 hours |
+
+**Inhibition:** Warning alerts are suppressed when a critical alert for the same service is already firing.

 Alert rules are defined in `monitoring/prometheus/alert-rules.yml` and evaluated every 15 seconds.
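
Step 1 of the manual DR procedure and the `BackupTooOld` alert both rely on a 25-hour freshness threshold, but the runbook commands only print the latest backup name. A minimal sketch of how that age check could be scripted follows. The helper name `backup_is_fresh` and the use of `stat -c %Y` inside the container are assumptions for illustration, not part of the repository's scripts.

```bash
#!/bin/sh
# Sketch only: decide whether a backup is fresh, given its modification time
# and the current time as Unix epoch seconds.
# backup_is_fresh is a hypothetical helper, not one of the repo's scripts.

MAX_AGE_HOURS=25  # threshold taken from the DR validation checklist

backup_is_fresh() {
  mtime=$1                            # backup modification time (epoch seconds)
  now=$2                              # current time (epoch seconds)
  max_age=$((MAX_AGE_HOURS * 3600))   # threshold in seconds
  age=$((now - mtime))
  # Fresh means: not dated in the future, and strictly younger than the threshold.
  [ "$age" -ge 0 ] && [ "$age" -lt "$max_age" ]
}

# Possible wiring against the backup container (assumes `stat -c %Y` is
# available there, as in GNU coreutils or BusyBox):
#   LATEST=$(docker exec goodgo-pg-backup sh -c 'ls -t /backups/goodgo_*.sql.gz | head -1')
#   MTIME=$(docker exec goodgo-pg-backup stat -c %Y "$LATEST")
#   if backup_is_fresh "$MTIME" "$(date +%s)"; then
#     echo "Backup is fresh"
#   else
#     echo "Backup is STALE"
#   fi
```

Keeping the comparison in a function that takes plain timestamps makes the threshold logic easy to check without a running container; only the commented wiring depends on Docker.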