feat(monitoring): add comprehensive alerting rules, Alertmanager, and DR validation

Expand production monitoring with full alert coverage for database connections,
Redis memory/connections, container resources, disk usage, service health, and
backup integrity. Add Alertmanager service with Slack routing for critical and
warning alerts, and add automated backup verification to the pg-backup cron
schedule. Update runbook with DR validation procedures and quarterly checklist.

- Expand Prometheus alert rules from 4 to 24 alerts across 7 groups
- Add Alertmanager container (prom/alertmanager:v0.27.0) with Slack routing
- Configure inhibition rules (critical suppresses warning for same service)
- Schedule automated backup verification at 04:00 UTC daily
- Add Alertmanager datasource to Grafana provisioning
- Update runbook with Section 9: DR Validation (automated + manual procedures)
- Add SLACK_WEBHOOK_URL and Grafana vars to .env.example

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Ho Ngoc Hai
2026-04-11 20:15:36 +07:00
parent 33c2e5ac1d
commit 9409706c58
8 changed files with 1108 additions and 2 deletions


@@ -53,6 +53,7 @@
| **promtail** | `grafana/promtail:3.0.0` | — | 0.25 CPU / 256 MB | — |
| **prometheus** | `prom/prometheus:v2.51.0` | 9090 (internal) | 0.5 CPU / 1 GB | `wget /-/healthy` |
| **grafana** | `grafana/grafana:10.4.1` | 3002 (external) | 0.5 CPU / 512 MB | `wget /api/health` |
| **alertmanager** | `prom/alertmanager:v0.27.0` | 9093 (internal) | 0.25 CPU / 256 MB | `wget /-/healthy` |
### Development-Only Services (`docker-compose.yml`)
@@ -67,7 +68,7 @@ web --> api --> pgbouncer --> postgres
|-> minio
|-> ai-services
grafana --> prometheus
grafana --> prometheus --> alertmanager
|-> loki --> promtail (Docker socket)
pg-backup --> postgres
@@ -128,6 +129,9 @@ curl -sf http://localhost:3100/ready && echo "Loki OK"
# Grafana
curl -sf http://localhost:3002/api/health | jq .
# Alertmanager
curl -sf http://localhost:9093/-/healthy && echo "Alertmanager OK"
```
### Container Resource Usage
@@ -864,6 +868,7 @@ All dashboards are provisioned automatically via `monitoring/grafana/provisionin
**Data Sources:**
- **Prometheus** (`http://prometheus:9090`) — Metrics (default)
- **Loki** (`http://loki:3100`) — Logs, with correlation ID linking to Prometheus
- **Alertmanager** (`http://alertmanager:9093`) — Alert state and silences
---
@@ -963,13 +968,216 @@ rate(container_cpu_usage_seconds_total{name=~"goodgo-.*"}[5m])
---
## 9. Disaster Recovery Validation
### Automated Verification
Backup verification runs **daily at 04:00 UTC** inside the `pg-backup` container. It restores the latest backup to an isolated test database and checks:
- Table existence (all 22 Prisma models)
- Row count comparison against live database
- Data checksums on critical tables (User, Property, Listing, Payment, Subscription, Transaction, Plan)
- PostGIS extension availability
- Index count match
- Enum type count match
**Check latest verification report:**
```bash
docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .
```
**Check verification logs:**
```bash
docker exec goodgo-pg-backup cat /var/log/pg-verify.log
```
### Manual DR Validation Procedure
Run this quarterly (or after major schema changes) to validate the full DR process end-to-end.
#### Step 1: Verify Backups Exist and Are Recent
```bash
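# List backups with timestamps and sizes
# (sh -c makes the glob expand inside the container, not on the host)
docker exec goodgo-pg-backup sh -c 'ls -lht /backups/goodgo_*.sql.gz'
# Verify latest backup is < 25 hours old (25 h = 1500 min)
LATEST=$(docker exec goodgo-pg-backup sh -c 'ls -t /backups/goodgo_*.sql.gz | head -1')
echo "Latest backup: $LATEST"
docker exec goodgo-pg-backup find "$LATEST" -mmin -1500 | grep -q . \
  && echo "Backup is fresh" || echo "Backup is older than 25 hours"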
```
#### Step 2: Run Verification Against Latest Backup
```bash
# Automated verification (creates temp DB, validates, drops)
docker exec -e REPORT_FILE=/backups/verify-latest.json goodgo-pg-backup \
/scripts/pg-verify-backup.sh
# Review results
docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .
```
**Expected output:** All checks pass, and the restore completes in under 60 seconds for a typical dataset.
#### Step 3: Test Full Restore (Staging Only)
> ⚠️ **WARNING:** Only perform this on a staging or isolated environment. Never on production.
```bash
# 1. Create a separate test environment
docker compose -f docker-compose.yml -p goodgo-dr-test up -d postgres
# 2. Wait for PostgreSQL to be ready
until docker exec goodgo-dr-test-postgres-1 pg_isready; do sleep 2; done
# 3. Run restore against the test environment
PGHOST=localhost PGPORT=<test-port> PGUSER=goodgo PGPASSWORD=<password> \
/scripts/pg-restore.sh /backups/<latest-backup>.sql.gz
# 4. Verify key tables
docker exec goodgo-dr-test-postgres-1 psql -U goodgo -d goodgo -c \
"SELECT count(*) FROM \"User\"; SELECT count(*) FROM \"Property\"; SELECT count(*) FROM \"Listing\";"
# 5. Clean up test environment
docker compose -f docker-compose.yml -p goodgo-dr-test down -v
```
#### Step 4: Validate Service Recovery Chain
Test that all services can start from a clean state with restored data:
```bash
# 1. Note current service status
docker compose -f docker-compose.prod.yml ps --format "table {{.Name}}\t{{.Status}}\t{{.Health}}"
# 2. Restart all services in dependency order
docker compose -f docker-compose.prod.yml restart postgres
sleep 10 # Wait for PostgreSQL
docker compose -f docker-compose.prod.yml restart pgbouncer redis typesense
sleep 10 # Wait for data services
docker compose -f docker-compose.prod.yml restart api web ai-services
sleep 15 # Wait for application services
# 3. Verify all health checks
curl -sf http://localhost:3001/health/ready | jq .
curl -sf http://localhost:3000 > /dev/null && echo "Web OK"
curl -sf http://localhost:9090/-/healthy && echo "Prometheus OK"
curl -sf http://localhost:9093/-/healthy && echo "Alertmanager OK"
curl -sf http://localhost:3002/api/health | jq .
```
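
The fixed `sleep` calls above can be too short on a slow host and needlessly long on a fast one. A minimal sketch of a polling helper follows, assuming a POSIX-like shell with `local` support; `wait_until` is a hypothetical name and is not part of the repository's scripts.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or the timeout elapses.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "Timed out after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Usage with the health endpoints from this runbook (timeouts are illustrative):
# wait_until 60 2 curl -sf http://localhost:3001/health/ready
# wait_until 60 2 curl -sf http://localhost:9093/-/healthy
```

Polling until a health endpoint answers also gives a more honest restart time when measuring against the RTO targets.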
#### Step 5: Validate Alerting Pipeline
```bash
# 1. Check Prometheus is loading alert rules
curl -sf http://localhost:9090/api/v1/rules | jq '.data.groups | length'
# Expected: 7 groups
# 2. Check current alerts (should be empty if healthy)
curl -sf http://localhost:9090/api/v1/alerts | jq '.data.alerts | length'
# 3. Check Prometheus has discovered Alertmanager as an alert target
curl -sf http://localhost:9090/api/v1/alertmanagers | jq '.data.activeAlertmanagers'
# 4. Verify Alertmanager config is loaded
curl -sf http://localhost:9093/api/v2/status | jq '.config'
```
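
The checks above confirm that rules and configuration are loaded, but not that a notification actually reaches Slack. One way to exercise the route end to end is to post a synthetic alert to the Alertmanager v2 API. The sketch below assumes routing is keyed on a `severity` label; the alert name `DRValidationTest` and the `service` value are hypothetical and should be adjusted to match `alertmanager.yml`.

```shell
# Build a JSON payload holding one synthetic alert (Alertmanager v2 API format).
# Alert name and label values are hypothetical, chosen only for illustration.
build_test_alert() {
  local severity="${1:-warning}"
  cat <<EOF
[
  {
    "labels": {
      "alertname": "DRValidationTest",
      "severity": "${severity}",
      "service": "dr-test"
    },
    "annotations": {
      "summary": "Synthetic alert sent during DR validation; safe to ignore."
    }
  }
]
EOF
}

# Usage: send the alert, then confirm Alertmanager lists it.
# build_test_alert warning | curl -sf -X POST -H 'Content-Type: application/json' \
#   --data @- http://localhost:9093/api/v2/alerts
# curl -sf http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'
```

Because the payload carries no `endsAt`, Alertmanager should treat the alert as resolved once its `resolve_timeout` passes without the alert being re-sent.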
### DR Validation Checklist
Use this checklist during quarterly DR reviews:
- [ ] Latest backup is < 25 hours old
- [ ] Automated verification report shows all checks passed
- [ ] Manual restore to test DB succeeds with correct row counts
- [ ] Full service restart completes within RTO target (< 30 min)
- [ ] All health endpoints respond after restart
- [ ] Prometheus alert rules are loaded (7 groups)
- [ ] Alertmanager is reachable and configured
- [ ] Slack notification channel is receiving test alerts
- [ ] Grafana dashboards show data after restart
- [ ] Typesense search returns results after restart
### RPO/RTO Summary
| Metric | Target | Actual (Measured) | Notes |
|--------|--------|-------------------|-------|
| **RPO** | ≤ 24 hours | ~24h (daily at 02:00 UTC) | Reduce with WAL archiving |
| **RTO — Local backup** | ≤ 15 minutes | Measure during DR test | Restore + service restart |
| **RTO — Off-site backup** | ≤ 30 minutes | Measure during DR test | Add transfer time |
| **RTO — Full host recovery** | ≤ 60 minutes | Measure during DR test | New host + restore + deploy |
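
The "Measure during DR test" cells can be filled in by timing the recovery command and comparing the result with the target. A minimal sketch, assuming second-level precision is enough; `check_rto` is a hypothetical helper, and the command passed to it stands in for whatever restore procedure is being tested.

```shell
# check_rto TARGET_SECONDS COMMAND [ARGS...]
# Hypothetical helper: times COMMAND and succeeds only if it finished within the target.
check_rto() {
  local target="$1"
  shift
  local start end elapsed
  start=$(date +%s)
  "$@" || { echo "Recovery command failed: $*" >&2; return 2; }
  end=$(date +%s)
  elapsed=$(( end - start ))
  echo "elapsed=${elapsed}s target=${target}s"
  [ "$elapsed" -le "$target" ]
}

# Usage (900 s = the 15-minute local-backup target; arguments are the runbook's placeholders):
# check_rto 900 /scripts/pg-restore.sh /backups/<latest-backup>.sql.gz
```
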
---
## Appendix: Alert Rules Reference
### API & Error Alerts
| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `ApiLatencyP99High` | p99 > 1s | Warning | 5 min |
| `ApiEndpointLatencyP99High` | Per-route p99 > 2s | Warning | 5 min |
| `ApiLatencyP99Critical` | p99 > 3s (SLO breach) | Critical | 3 min |
| `ApiErrorRate5xxHigh` | 5xx rate > 1% | Warning | 5 min |
| `ApiErrorRate5xxCritical` | 5xx rate > 5% | Critical | 3 min |
| `ApiNoTraffic` | Request rate = 0 | Warning | 10 min |
### Database Alerts
| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `PostgresActiveConnectionsHigh` | Active connections > 15 | Warning | 5 min |
| `PostgresConnectionPoolCritical` | Total connections > 180 | Critical | 2 min |
| `PostgresSlowQueries` | Lock-waiting queries > 5 | Warning | 5 min |
| `PostgresDown` | API scrape target down | Critical | 1 min |
### Redis Alerts
| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `RedisMemoryHigh` | Memory usage > 80% | Warning | 5 min |
| `RedisMemoryCritical` | Memory usage > 95% | Critical | 2 min |
| `RedisConnectedClientsHigh` | Clients > 150 | Warning | 5 min |
| `RedisRejectedConnections` | Rejected connections > 0 | Critical | 1 min |
### Container Resource Alerts
| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `ContainerRestartLoop` | > 3 restarts in 15 min | Critical | 5 min |
| `ContainerMemoryHigh` | Memory > 85% of limit | Warning | 5 min |
| `ContainerCPUThrottled` | CPU throttle rate > 0.5s/s | Warning | 10 min |
### Disk & Infrastructure Alerts
| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `HostDiskUsageHigh` | Root disk > 80% | Warning | 10 min |
| `HostDiskUsageCritical` | Root disk > 90% | Critical | 5 min |
| `ApiHealthCheckFailing` | Health probe fails | Critical | 2 min |
| `PrometheusTargetDown` | Scrape target down | Warning | 5 min |
### Backup Alerts
| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `BackupTooOld` | Last backup > 25 hours ago | Warning | 5 min |
| `BackupVerificationFailed` | Verify result = fail | Warning | 1 min |
### Alert Routing
Alerts are routed via Alertmanager (`monitoring/alertmanager/alertmanager.yml`):
| Channel | Routes | Repeat Interval |
|---------|--------|-----------------|
| `#sre-oncall` (Slack) | All warning alerts | 4 hours |
| `#sre-oncall` (Slack) | All critical alerts (priority) | 1 hour |
| `#infrastructure` (Slack) | Backup-related alerts | 6 hours |
**Inhibition:** Warning alerts are suppressed when a critical alert for the same service is already firing.
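
As an illustration of the inhibition behaviour described above, the sketch below writes a minimal stand-alone Alertmanager config containing one inhibit rule. It is not the repository's `alertmanager.yml`; the `service` label name, the `default` receiver, and the output path are assumptions.

```shell
# Write a minimal, hypothetical Alertmanager config that shows only the
# inhibition pattern, then print the path it was written to.
write_inhibit_example() {
  local out="${1:-/tmp/alertmanager-inhibit-example.yml}"
  cat > "$out" <<'EOF'
route:
  receiver: default
receivers:
  - name: default
inhibit_rules:
  # While a critical alert fires, mute warning alerts that carry the same
  # value for the "service" label (the label name is an assumption).
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ['service']
EOF
  echo "$out"
}

# Usage: validate the file with amtool, which ships in the prom/alertmanager image.
# amtool check-config "$(write_inhibit_example)"
```

The `equal` list is what scopes the suppression: without it, any critical alert would mute every warning alert regardless of service.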
Alert rules are defined in `monitoring/prometheus/alert-rules.yml` and evaluated every 15 seconds.