# GoodGo Platform — Production Runbook

> **Audience:** On-call SRE, DevOps engineers, and platform operators.
> **Last updated:** 2026-04-11

---

## Table of Contents

1. [Service Inventory](#1-service-inventory)
2. [Health Checks](#2-health-checks)
3. [Common Incidents](#3-common-incidents)
   - [3.1 Database Connection Pool Exhaustion](#31-database-connection-pool-exhaustion)
   - [3.2 Redis Connection Failure](#32-redis-connection-failure)
   - [3.3 Typesense Unavailable](#33-typesense-unavailable)
   - [3.4 High API Latency](#34-high-api-latency)
   - [3.5 Payment Callback Failures](#35-payment-callback-failures)
   - [3.6 Disk Space Alerts](#36-disk-space-alerts)
   - [3.7 MinIO / Object Storage Failure](#37-minio--object-storage-failure)
   - [3.8 AI Services Unavailable](#38-ai-services-unavailable)
   - [3.9 Log Pipeline Failure (Loki/Promtail)](#39-log-pipeline-failure-lokipromtail)
   - [3.10 5xx Error Rate Spike](#310-5xx-error-rate-spike)
4. [Recovery Procedures](#4-recovery-procedures)
   - [4.1 Database Restore from Backup](#41-database-restore-from-backup)
   - [4.2 Redis Cache Flush & Warm-up](#42-redis-cache-flush--warm-up)
   - [4.3 Rolling Restart Procedures](#43-rolling-restart-procedures)
   - [4.4 Rollback Deployment](#44-rollback-deployment)
   - [4.5 Typesense Reindex from PostgreSQL](#45-typesense-reindex-from-postgresql)
   - [4.6 Full Host Recovery](#46-full-host-recovery)
5. [Escalation Matrix](#5-escalation-matrix)
6. [Monitoring Dashboards](#6-monitoring-dashboards)
7. [Useful PromQL Queries](#7-useful-promql-queries)
8. [Environment Quick Reference](#8-environment-quick-reference)

---

## 1. Service Inventory

### Production Services (`docker-compose.prod.yml`)

| Service | Image | Port | Resource Limits | Health Check |
|---------|-------|------|-----------------|--------------|
| **api** (NestJS) | `ghcr.io/goodgo/goodgo-api` | 3001 | 1 CPU / 1 GB | `GET /health` (node fetch) |
| **web** (Next.js) | `ghcr.io/goodgo/goodgo-web` | 3000 | 0.5 CPU / 512 MB | `GET /` (node fetch) |
| **ai-services** (FastAPI) | `ghcr.io/goodgo/goodgo-ai-services` | 8000 | 1 CPU / 1 GB | `GET /health` (httpx) |
| **postgres** | `postgis/postgis:16-3.4` | 5432 (internal) | 2 CPU / 2 GB, shm=256m | `pg_isready` |
| **pgbouncer** | `edoburu/pgbouncer:1.23.1-p2` | 6432 (internal) | 0.5 CPU / 256 MB | `pg_isready -p 6432` |
| **redis** | `redis:7-alpine` | 6379 (internal) | 0.5 CPU / 768 MB | `redis-cli ping` |
| **typesense** | `typesense/typesense:27.1` | 8108 (internal) | 1 CPU / 1 GB | `curl /health` |
| **minio** | `minio/minio:latest` | 9000/9001 (internal) | 0.5 CPU / 1 GB | `mc ready local` |
| **pg-backup** | `postgis/postgis:16-3.4` | n/a | 0.5 CPU / 512 MB | n/a (cron daemon) |
| **loki** | `grafana/loki:3.0.0` | 3100 (internal) | 0.5 CPU / 512 MB | `wget /ready` |
| **promtail** | `grafana/promtail:3.0.0` | n/a | 0.25 CPU / 256 MB | n/a |
| **prometheus** | `prom/prometheus:v2.51.0` | 9090 (internal) | 0.5 CPU / 1 GB | `wget /-/healthy` |
| **grafana** | `grafana/grafana:10.4.1` | 3002 (external) | 0.5 CPU / 512 MB | `wget /api/health` |
| **alertmanager** | `prom/alertmanager:v0.27.0` | 9093 (internal) | 0.25 CPU / 256 MB | `wget /-/healthy` |

### Development-Only Services (`docker-compose.yml`)

Development uses the same data and monitoring services but runs API/Web on the host. The `pg-backup` service also runs in dev with default credentials.

### Service Dependency Chain

```
web --> api --> pgbouncer --> postgres
            |-> redis
            |-> typesense
            |-> minio
            |-> ai-services

grafana --> prometheus --> alertmanager
        |-> loki --> promtail (Docker socket)

pg-backup --> postgres
```

---

## 2. Health Checks

### Application Health Endpoints

| Endpoint | Type | Checks | Expected Response |
|----------|------|--------|-------------------|
| `GET /health` | Liveness | Process is running | `200 { status: "ok" }` |
| `GET /health/ready` | Readiness | PostgreSQL + Redis | `200 { status: "ok", info: { database: ..., redis: ... } }` |
| `GET /health/db` | Database only | PostgreSQL connectivity | `200 { status: "ok", info: { database: ... } }` |
| `GET /health/redis` | Redis only | Redis connectivity | `200 { status: "ok", info: { redis: ... } }` |

### Verify All Services Are Healthy

```bash
# Quick check: all containers
docker compose -f docker-compose.prod.yml ps --format "table {{.Name}}\t{{.Status}}\t{{.Health}}"

# API liveness
curl -sf http://localhost:3001/health && echo "API OK" || echo "API FAIL"

# API readiness (DB + Redis)
curl -sf http://localhost:3001/health/ready | jq .

# Individual dependency checks
curl -sf http://localhost:3001/health/db | jq .
curl -sf http://localhost:3001/health/redis | jq .

# Typesense
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health

# MinIO
docker exec goodgo-minio mc ready local && echo "MinIO OK"

# AI Services
curl -sf http://localhost:8000/health && echo "AI OK" || echo "AI FAIL"

# PostgreSQL (direct)
docker exec goodgo-postgres pg_isready -U ${DB_USER} -d ${DB_NAME}

# PgBouncer
docker exec goodgo-pgbouncer pg_isready -h 127.0.0.1 -p 6432 -U ${DB_USER}

# Redis
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" ping

# Prometheus
curl -sf http://localhost:9090/-/healthy && echo "Prometheus OK"

# Loki
curl -sf http://localhost:3100/ready && echo "Loki OK"

# Grafana
curl -sf http://localhost:3002/api/health | jq .

# Alertmanager
curl -sf http://localhost:9093/-/healthy && echo "Alertmanager OK"
```
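
The probes above are one-shot, so they can report a failure while a service is still starting. A minimal sketch of a retry wrapper follows; the `retry` helper and its argument order are hypothetical and not part of the repository.

```bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Hypothetical helper for illustration only.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (requires the stack to be running):
#   retry 10 3 curl -sf http://localhost:3001/health/ready
```

The helper returns the final outcome only, so it can be chained with `&& echo "OK" || echo "FAIL"` in the same way as the single probes.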

### Container Resource Usage

```bash
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.NetIO}}"
```

---

## 3. Common Incidents

### 3.1 Database Connection Pool Exhaustion

**Symptoms:**
- API returns 503 or hangs on requests
- `/health/ready` returns unhealthy for `database`
- PgBouncer logs: `no more connections allowed` or `query_wait_timeout`
- Prometheus: spike in active PostgreSQL connections (as reported by `pg_stat_activity`)

**Diagnosis:**

```bash
# Check PgBouncer pool status
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW POOLS;"
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW CLIENTS;"
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW STATS;"

# Check PostgreSQL active connections
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT state, count(*) FROM pg_stat_activity WHERE datname = '${DB_NAME}' GROUP BY state;"

# Identify long-running queries
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
   FROM pg_stat_activity
   WHERE datname = '${DB_NAME}' AND state != 'idle'
   ORDER BY duration DESC
   LIMIT 10;"
```

**Resolution:**

```bash
# 1. Kill long-running queries (> 5 minutes)
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE datname = '${DB_NAME}'
     AND state != 'idle'
     AND now() - query_start > interval '5 minutes'
     AND pid <> pg_backend_pid();"

# 2. If pool is fully exhausted, restart PgBouncer
docker compose -f docker-compose.prod.yml restart pgbouncer

# 3. If issue persists, increase pool size temporarily
# Edit PGBOUNCER_POOL_SIZE in .env, then:
docker compose -f docker-compose.prod.yml up -d --no-deps pgbouncer
```

**PgBouncer Configuration Reference:**
- Pool mode: `transaction` (connections returned to pool after each transaction)
- Default pool size: 20 server connections per user/db pair
- Max client connections: 200
- Reserve pool: 5 extra connections (after 3s wait)
- Query wait timeout: 120s (error if client waits this long)
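
With these limits, clients queue for a server connection (`cl_waiting` > 0 in `SHOW POOLS`) before they hit the 120s timeout, so a non-zero queue is an early signal. The sketch below filters `SHOW POOLS` output for such pools. The `pools_with_waiting_clients` helper is hypothetical, and it assumes psql's unaligned output format (`-A -F '|'`); columns are located by header name because their positions differ between PgBouncer versions.

```bash
# pools_with_waiting_clients: reads `SHOW POOLS;` output in psql unaligned
# format on stdin and prints "database user cl_waiting" for every pool that
# has clients queued for a server connection.
# Hypothetical helper for illustration only.
pools_with_waiting_clients() {
  awk -F'|' '
    NR == 1 {
      # Map column names to positions using the header row
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    # Skip the "(N rows)" footer, which has no field separators
    NF < 2 { next }
    $(col["cl_waiting"]) + 0 > 0 {
      print $(col["database"]), $(col["user"]), $(col["cl_waiting"])
    }
  '
}

# Example (requires the stack to be running):
#   docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer \
#     -A -F '|' -c "SHOW POOLS;" | pools_with_waiting_clients
```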

### 3.2 Redis Connection Failure

**Symptoms:**
- `/health/redis` returns unhealthy
- Increased API response times (cache misses hitting DB)
- API logs show Redis connection errors

**Diagnosis:**

```bash
# Check Redis container
docker logs --tail=50 goodgo-redis

# Test connectivity
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" ping
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO server
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO clients
```

**Resolution:**

```bash
# 1. Restart Redis (data persisted via AOF)
docker compose -f docker-compose.prod.yml restart redis

# 2. If OOM: check memory usage
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory | grep used_memory_human
# Max memory is 512 MB (prod), eviction policy: allkeys-lru

# 3. If AOF is corrupted
docker compose -f docker-compose.prod.yml stop redis
# `docker exec` cannot run inside a stopped container, so run the check in a
# one-off container that mounts the same data volume.
# Redis 7 keeps the AOF under /data/appendonlydir by default; adjust the path
# if appenddirname or appendfilename are customized.
docker compose -f docker-compose.prod.yml run --rm --no-deps redis \
  redis-check-aof --fix /data/appendonlydir/appendonly.aof.manifest
docker compose -f docker-compose.prod.yml start redis
```

**Graceful Degradation:** The API is designed to continue operating when Redis is unavailable. Cache misses fall through to PostgreSQL. Performance will degrade but functionality is preserved. Redis is non-critical for core operations.

### 3.3 Typesense Unavailable

**Symptoms:**
- Search functionality returns errors or falls back to basic DB search
- `curl http://localhost:8108/health` fails
- API logs show Typesense connection timeouts

**Diagnosis:**

```bash
# Check container status
docker logs --tail=50 goodgo-typesense

# Check health
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health

# Check collections
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/collections | jq .

# Check disk space for Typesense data volume
docker system df -v | grep typesense
```

**Resolution:**

```bash
# 1. Restart Typesense
docker compose -f docker-compose.prod.yml restart typesense

# 2. If data is corrupted: rebuild from PostgreSQL
docker compose -f docker-compose.prod.yml stop typesense
docker volume rm goodgo-platform-ai_typesense_data
docker compose -f docker-compose.prod.yml up -d typesense
# Wait for healthy, then reindex:
docker compose -f docker-compose.prod.yml exec api npx ts-node scripts/typesense-reindex.ts
# Or: pnpm run typesense:reindex

# 3. If volume backup exists: restore
docker compose -f docker-compose.prod.yml stop typesense
docker run --rm -v goodgo-platform-ai_typesense_data:/data -v $(pwd)/backups:/backup \
  alpine sh -c "rm -rf /data/* && tar xzf /backup/typesense_YYYYMMDD_HHMMSS.tar.gz -C /data"
docker compose -f docker-compose.prod.yml start typesense
```

**Fallback Behavior:** When Typesense is unavailable, property search falls back to PostgreSQL full-text search with PostGIS geo queries. Search quality degrades but core functionality works.

### 3.4 High API Latency

**Symptoms:**
- Prometheus alert `ApiLatencyP99High` fires (p99 > 1s for 5 min)
- Critical alert `ApiLatencyP99Critical` fires (p99 > 3s for 3 min, an SLO breach)
- Users report slow page loads

**Diagnosis:**

```bash
# 1. Check which endpoints are slow
# Grafana: GoodGo API Latency dashboard
# Or via PromQL:
curl -s "http://localhost:9090/api/v1/query" --data-urlencode \
  'query=topk(10, histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket[5m])) by (le, route, method)))' \
  | jq '.data.result[] | {route: .metric.route, method: .metric.method, p99: .value[1]}'

# 2. Check database slow queries
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT pid, now() - query_start AS duration, left(query, 100) AS query_preview
   FROM pg_stat_activity
   WHERE state = 'active' AND now() - query_start > interval '1 second'
   ORDER BY duration DESC;"

# 3. Check PgBouncer wait times
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW POOLS;"

# 4. Check container resource usage
docker stats --no-stream goodgo-api goodgo-postgres goodgo-redis goodgo-pgbouncer

# 5. Check Redis latency (samples in 3-second windows until interrupted; press Ctrl-C to stop)
docker exec -it goodgo-redis redis-cli -a "${REDIS_PASSWORD}" --latency-history -i 3

# 6. Check application logs for errors
docker logs --tail=200 --since=5m goodgo-api 2>&1 | grep -i "error\|timeout\|slow"
```

**Resolution:**

```bash
# 1. If DB slow queries: terminate them (limited to the application database,
#    excluding the session that runs this statement)
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE datname = '${DB_NAME}'
     AND state = 'active'
     AND now() - query_start > interval '30 seconds'
     AND pid <> pg_backend_pid();"

# 2. If connection pool exhaustion: see Section 3.1

# 3. If Redis is slow: restart
docker compose -f docker-compose.prod.yml restart redis

# 4. If the API container was OOM-killed: restart it
# A restart alone does not add memory. If OOM kills recur, raise the 1 GB limit
# in docker-compose.prod.yml and recreate with `up -d --no-deps api`.
docker compose -f docker-compose.prod.yml restart api

# 5. If specific endpoint is the bottleneck, check Loki logs:
# Grafana > Explore > Loki > {container_name="goodgo-api"} |= "slow"
```

### 3.5 Payment Callback Failures

**Symptoms:**
- Users report payments stuck in "pending" state
- VNPay/MoMo/ZaloPay IPN callbacks returning errors
- Payment reconciliation mismatches

**Diagnosis:**

```bash
# 1. Check payment callback logs
docker logs goodgo-api 2>&1 | grep -i "payment\|callback\|vnpay\|momo\|zalopay" | tail -50

# 2. Check for pending payments in DB
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT id, provider, status, \"amountVND\", \"createdAt\"
   FROM \"Payment\"
   WHERE status = 'PENDING'
     AND \"createdAt\" > now() - interval '24 hours'
   ORDER BY \"createdAt\" DESC
   LIMIT 20;"

# 3. Verify callback URL is reachable from external networks
# Any HTTP status (even a 4xx for a request without valid parameters) shows the
# route is reachable; 000 means the connection itself failed.
curl -s -o /dev/null -w '%{http_code}\n' https://your-domain.com/api/payments/vnpay/callback

# 4. Check if API is receiving callbacks (via Loki)
# Grafana > Explore > Loki > {container_name="goodgo-api"} |= "callback" |= "payment"
```

**Resolution:**

```bash
# 1. If callbacks are timing out: check API health and restart if needed
docker compose -f docker-compose.prod.yml restart api

# 2. If VNPay signature verification fails: verify VNPAY_* env vars
# (this prints secret values; avoid shared terminals and pasted output)
docker compose -f docker-compose.prod.yml exec api printenv | grep VNPAY

# 3. For stuck payments: manual reconciliation
# Check VNPay/MoMo merchant portal for actual transaction status
# Update payment status in DB if confirmed paid:
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "UPDATE \"Payment\" SET status = 'COMPLETED', \"updatedAt\" = now()
   WHERE id = '<payment-id>' AND status = 'PENDING';"

# 4. If callbacks are not reaching the server, check:
# - Firewall rules (port 3001 or reverse proxy port must be open)
# - SSL certificate validity
# - DNS resolution
# - Payment provider webhook configuration (correct callback URL)
```

**Important:** The payment callback handler uses idempotent processing with atomic state transitions. Replaying a callback is safe and will not duplicate payments.

### 3.6 Disk Space Alerts

**Symptoms:**
- Containers failing to start or crashing
- PostgreSQL refusing writes (`PANIC: could not write to file`)
- Docker daemon running out of space

**Diagnosis:**

```bash
# Host disk usage
df -h

# Docker disk usage
docker system df
docker system df -v

# Check individual volume sizes
for vol in $(docker volume ls -q | grep goodgo); do
  echo -n "$vol: "
  docker run --rm -v "${vol}:/data" alpine du -sh /data 2>/dev/null
done

# Check backup volume specifically
docker exec goodgo-pg-backup du -sh /backups/
docker exec goodgo-pg-backup ls -lht /backups/
```
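
To make the host check scriptable, the `df` output can be filtered against a usage threshold. This is a minimal sketch: the `mounts_over_threshold` helper and the 85% figure in the example are assumptions, not values taken from the alert rules, and mount points containing spaces are not handled.

```bash
# mounts_over_threshold PERCENT: reads `df -P` output on stdin and prints
# "mountpoint use%" for every filesystem at or above PERCENT.
# Hypothetical helper for illustration only.
mounts_over_threshold() {
  awk -v limit="$1" '
    NR == 1 { next }              # skip the header row
    {
      use = $5
      sub(/%$/, "", use)          # "91%" -> "91"
      if (use + 0 >= limit + 0) print $6, $5
    }
  '
}

# Example:
#   df -P | mounts_over_threshold 85
```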

**Resolution:**

```bash
# 1. Clean up Docker artifacts
docker system prune -f        # Remove stopped containers, unused networks, dangling images
docker image prune -a -f      # Remove ALL unused images (careful in prod: this also deletes
                              # the cached previous images used by the quick rollback in Section 4.4)

# 2. Clean old backups (if retention not working)
docker exec goodgo-pg-backup find /backups -name "goodgo_*.sql.gz" -mtime +7 -delete

# 3. Clean Prometheus data (if too large)
# Prometheus retention is 30d (prod) / 15d (dev), configured via --storage.tsdb.retention.time
# Blocks older than the retention window are deleted automatically; a restart does not free extra space.
# To shrink the TSDB sooner, lower the retention flag (or add --storage.tsdb.retention.size)
# in docker-compose.prod.yml, then recreate the container:
docker compose -f docker-compose.prod.yml up -d --no-deps prometheus

# 4. Clean Loki data (15-day retention)
# Loki handles its own cleanup via compactor. If urgent:
docker compose -f docker-compose.prod.yml restart loki

# 5. Truncate Docker container logs
sudo truncate -s 0 "$(docker inspect --format='{{.LogPath}}' goodgo-api)"
# Or for all containers:
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'
```

**Prevention:** All production containers use `json-file` logging with `max-size: 10m` and a `max-file` value of 3 to 5, depending on the service. Backup retention is 7 days (configurable via `BACKUP_RETENTION_DAYS`).
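
These rotation settings put an upper bound on container log usage: size cap times rotated files times containers. The arithmetic below is a rough sketch; the `max_log_mb` helper is hypothetical, and the container count of 14 is simply the number of services listed in Section 1.

```bash
# max_log_mb MAX_SIZE_MB MAX_FILE CONTAINERS
# Worst-case json-file log footprint in MB.
# Hypothetical helper for illustration only.
max_log_mb() {
  echo $(( $1 * $2 * $3 ))
}

max_log_mb 10 3 14   # every container at max-file: 3
max_log_mb 10 5 14   # every container at max-file: 5
```

If truncating logs frees far more than this bound, at least one container is probably running without the rotation options.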

### 3.7 MinIO / Object Storage Failure

**Symptoms:**
- Image/file uploads fail
- Property photos not loading
- MinIO console inaccessible at port 9001

**Diagnosis:**

```bash
docker logs --tail=50 goodgo-minio
docker exec goodgo-minio mc ready local
docker exec goodgo-minio mc admin info local
```

**Resolution:**

```bash
# 1. Restart MinIO
docker compose -f docker-compose.prod.yml restart minio

# 2. If data volume corrupted
docker compose -f docker-compose.prod.yml stop minio
docker volume rm goodgo-platform-ai_minio_data  # WARNING: data loss
docker compose -f docker-compose.prod.yml up -d minio
# Recreate buckets via API or admin console
```

### 3.8 AI Services Unavailable

**Symptoms:**
- AI-powered features (AVM, property descriptions) fail
- `GET /health` on port 8000 fails
- API logs show AI service connection timeouts

**Diagnosis:**

```bash
docker logs --tail=50 goodgo-ai-services
curl -sf http://localhost:8000/health
docker stats --no-stream goodgo-ai-services
```

**Resolution:**

```bash
# 1. Restart AI services
docker compose -f docker-compose.prod.yml restart ai-services

# 2. Check rate limits (default: 60/minute)
docker compose -f docker-compose.prod.yml exec ai-services printenv | grep AI_RATE_LIMIT

# 3. If OOM: the service has a 1 GB limit; it may need to be increased for large models
```

**Graceful Degradation:** AI features are optional. The API should handle AI service unavailability gracefully and return non-AI results.

### 3.9 Log Pipeline Failure (Loki/Promtail)

**Symptoms:**
- Grafana log explorer returns empty results
- Promtail container unhealthy or crash-looping
- Loki returning 503

**Diagnosis:**

```bash
docker logs --tail=50 goodgo-loki
docker logs --tail=50 goodgo-promtail
curl -sf http://localhost:3100/ready && echo "Loki ready" || echo "Loki NOT ready"
```

**Resolution:**

```bash
# 1. Restart the pipeline
docker compose -f docker-compose.prod.yml restart loki promtail

# 2. If Loki data corrupted
docker compose -f docker-compose.prod.yml stop loki promtail
docker volume rm goodgo-platform-ai_loki_data
docker compose -f docker-compose.prod.yml up -d loki promtail
# Historical logs are lost but new logs will flow immediately

# 3. If Promtail can't access Docker socket
ls -la /var/run/docker.sock
# Ensure the promtail container has the Docker socket mounted
```

### 3.10 5xx Error Rate Spike

**Symptoms:**
- Prometheus alert `ApiErrorRate5xxHigh` fires (> 1% 5xx for 5 min)
- Users reporting errors

**Diagnosis:**

```bash
# Check which endpoints are returning 5xx
curl -s "http://localhost:9090/api/v1/query" --data-urlencode \
  'query=topk(10, sum(rate(http_requests_total{job="goodgo-api", status_code=~"5.."}[5m])) by (route, method))' \
  | jq '.data.result'

# Check API error logs
docker logs --tail=200 --since=5m goodgo-api 2>&1 | grep -i "error\|exception\|500"

# Check all dependency health
curl -sf http://localhost:3001/health/ready | jq .
```

**Resolution:**

1. If DB-related: see [Section 3.1](#31-database-connection-pool-exhaustion)
2. If Redis-related: see [Section 3.2](#32-redis-connection-failure)
3. If recent deployment: see [Section 4.4](#44-rollback-deployment)
4. If unknown: restart API and investigate logs

---

## 4. Recovery Procedures

### 4.1 Database Restore from Backup

**Automated backups run daily at 02:00 UTC** via the `pg-backup` container. Retention: 7 days. Format: `pg_dump --format=custom --compress=6`.

**Automated verification runs daily at 04:00 UTC.** It restores the latest backup to an isolated test database and verifies table existence, row counts, checksums, the PostGIS extension, indexes, and enums. Reports are written to `/backups/verify-latest.json`.

#### List Available Backups

```bash
docker exec goodgo-pg-backup ls -lht /backups/goodgo_*.sql.gz
```

#### Create an On-Demand Backup

```bash
docker exec goodgo-pg-backup /scripts/pg-backup.sh
```

#### Full Restore Procedure

```bash
# 1. Stop application services
docker compose -f docker-compose.prod.yml stop api web ai-services

# 2. (Production) Stop PgBouncer to prevent stale connections
docker compose -f docker-compose.prod.yml stop pgbouncer

# 3. Run the restore script
docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
# The script will:
#   - Terminate active DB connections
#   - DROP and recreate the database
#   - Restore from the backup file

# 4. Verify the restore
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c '\dt'
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c 'SELECT count(*) FROM "User";'

# 5. Restart all services
docker compose -f docker-compose.prod.yml up -d

# 6. Apply any pending migrations
# (`exec` needs a running api container, so this runs after the restart)
docker compose -f docker-compose.prod.yml exec api npx prisma migrate deploy

# 7. Verify application health
curl -sf http://localhost:3001/health/ready | jq .
```

#### Verify a Backup Without Restoring

```bash
# Run verification against latest backup (creates temp DB, drops it after)
docker compose run --rm pg-verify-backup

# Or verify a specific backup file
docker exec goodgo-pg-backup /scripts/pg-verify-backup.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz

# Check latest verification report
docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .
```

**RPO/RTO:**
- RPO: ≤ 24 hours (daily backups; consider WAL archiving for lower RPO)
- RTO: ~15 minutes (local volume), ~30 minutes (off-site)
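
The 24-hour RPO only holds while the newest backup is less than a day old. The sketch below derives a backup's age from its file name. Both helpers are hypothetical; the sketch assumes the `goodgo_YYYYMMDD_HHMMSS` timestamp is in UTC and that `date` accepts `-d` (GNU coreutils or BusyBox).

```bash
# backup_epoch FILE: converts goodgo_YYYYMMDD_HHMMSS.sql.gz to a Unix timestamp.
# Assumes the timestamp in the file name is UTC.
backup_epoch() {
  local stamp
  stamp=$(basename "$1" .sql.gz)   # goodgo_YYYYMMDD_HHMMSS
  stamp=${stamp#goodgo_}           # YYYYMMDD_HHMMSS
  # Rewrite to "YYYY-MM-DD HH:MM:SS", a form that date -d understands
  stamp=$(printf '%s\n' "$stamp" |
    sed -E 's/^(....)(..)(..)_(..)(..)(..)$/\1-\2-\3 \4:\5:\6/')
  date -u -d "$stamp" +%s
}

# backup_age_hours FILE [NOW_EPOCH]: whole hours elapsed since the backup.
backup_age_hours() {
  local now="${2:-$(date -u +%s)}"
  echo $(( (now - $(backup_epoch "$1")) / 3600 ))
}

# Example:
#   [ "$(backup_age_hours /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz)" -le 24 ] || echo "RPO exceeded"
```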

### 4.2 Redis Cache Flush & Warm-up

```bash
# Flush all Redis data
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" FLUSHALL

# Verify flush
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" DBSIZE
# Should return: (integer) 0
```

**Warm-up:** Redis uses `allkeys-lru` eviction. Cache warms naturally as users make requests. No manual warm-up script is needed because cache misses fall through to PostgreSQL.

**When to flush:**
- After database restore (stale cache references)
- After data corruption at the application level
- After schema changes that alter cached data structures
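
Warm-up progress after a flush can be followed through the keyspace hit ratio. A minimal sketch, assuming the `keyspace_hits` and `keyspace_misses` fields of `INFO stats`; the `cache_hit_ratio` helper is hypothetical. These counters are cumulative and `FLUSHALL` does not reset them, so run `CONFIG RESETSTAT` right after the flush if the ratio should reflect only the warm-up period.

```bash
# cache_hit_ratio: reads `redis-cli INFO stats` output on stdin and prints the
# keyspace hit ratio as a whole percentage, or "n/a" if no lookups were recorded.
# Hypothetical helper for illustration only.
cache_hit_ratio() {
  # INFO lines end in CRLF; strip the carriage returns before parsing
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      if (total == 0) print "n/a"
      else printf "%d\n", (hits * 100) / total
    }
  '
}

# Example (requires the stack to be running):
#   docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO stats | cache_hit_ratio
```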

### 4.3 Rolling Restart Procedures

#### Single Service Restart (Zero Downtime)

```bash
# Note: `up -d` only recreates a container whose image or configuration changed;
# add --force-recreate to restart a service that is otherwise unchanged.

# API: the --wait flag ensures the health check passes before moving on
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api

# Web
docker compose -f docker-compose.prod.yml up -d --no-deps --wait web

# AI Services
docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services
```

#### Full Stack Rolling Restart

```bash
# Data services first (order matters for dependency chain)
docker compose -f docker-compose.prod.yml restart redis
docker compose -f docker-compose.prod.yml restart typesense

# Wait for data services to be healthy
sleep 10

# Connection pooling
docker compose -f docker-compose.prod.yml restart pgbouncer
sleep 5

# Application services
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services

# Verify
curl -sf http://localhost:3001/health/ready | jq .
```
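
The fixed `sleep` calls above are a guess at how long a service needs to come back. Polling the container's health status is a possible alternative for services that define a health check. A minimal sketch; the `wait_healthy` helper and its 60-second default are assumptions.

```bash
# wait_healthy CONTAINER [TIMEOUT_SECONDS]: polls the Docker health status once
# per second until it reads "healthy" or the timeout elapses.
# Hypothetical helper for illustration only.
wait_healthy() {
  local container="$1" timeout="${2:-60}" waited=0 status=""
  while [ "$waited" -lt "$timeout" ]; do
    status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)
    if [ "$status" = "healthy" ]; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "timeout waiting for $container (last status: ${status:-unknown})" >&2
  return 1
}

# Example (requires the stack to be running):
#   docker compose -f docker-compose.prod.yml restart redis && wait_healthy goodgo-redis 30
```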

#### Emergency: Restart Everything

```bash
docker compose -f docker-compose.prod.yml down
docker compose -f docker-compose.prod.yml up -d --wait
```

### 4.4 Rollback Deployment

The CI/CD pipeline (`.github/workflows/deploy.yml`) supports automatic rollback if production smoke tests fail. For manual rollback:

#### Quick Rollback (Revert to Previous Images)

```bash
# SSH into production host
ssh deploy@$PRODUCTION_HOST

cd ~/goodgo

# Stop current app containers
docker compose -f docker-compose.prod.yml down api web ai-services

# The previous images are still cached locally
# Restart without pulling: uses last-known-good images
# Note: this only reverts the release if the compose file resolves to the
# previous image tag; otherwise set IMAGE_TAG explicitly (next procedure).
docker compose -f docker-compose.prod.yml up -d --wait api web ai-services

# Verify
curl -sf http://localhost:3001/health && echo "Rollback successful"
```

#### Rollback to a Specific Git Commit / Image Tag

```bash
# Set the target tag (git SHA)
export IMAGE_TAG=<previous-commit-sha>
export REGISTRY_URL=ghcr.io/goodgo

# Pull specific version
docker compose -f docker-compose.prod.yml pull api web ai-services

# Deploy
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services

# Verify
curl -sf http://localhost:3001/health/ready | jq .
```
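
Before pulling, it can help to print the exact image reference a rollback would deploy and confirm that the tag exists in the registry. The sketch below assumes the compose file builds references as `${REGISTRY_URL}/goodgo-<service>:${IMAGE_TAG}`, which matches the image names in Section 1 but is not confirmed by the compose file itself; the `image_ref` helper is hypothetical.

```bash
# image_ref SERVICE: prints the image reference for SERVICE based on
# REGISTRY_URL and IMAGE_TAG (assumed naming scheme, see above).
# Hypothetical helper for illustration only.
image_ref() {
  if [ -z "${REGISTRY_URL:-}" ] || [ -z "${IMAGE_TAG:-}" ]; then
    echo "REGISTRY_URL and IMAGE_TAG must both be set" >&2
    return 1
  fi
  printf '%s/goodgo-%s:%s\n' "$REGISTRY_URL" "$1" "$IMAGE_TAG"
}

# Example (requires registry access):
#   docker manifest inspect "$(image_ref api)" > /dev/null && echo "tag exists"
```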

#### Rollback Database Migrations

```bash
# WARNING: Prisma does not support automatic down-migrations.
# For migration rollback, restore from the pre-migration backup:

# 1. Stop application
docker compose -f docker-compose.prod.yml stop api web ai-services pgbouncer

# 2. Restore from backup taken before the migration
docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/<pre-migration-backup>.sql.gz

# 3. Deploy the previous code version (older IMAGE_TAG)
export IMAGE_TAG=<previous-commit-sha>
docker compose -f docker-compose.prod.yml up -d --wait
```

### 4.5 Typesense Reindex from PostgreSQL

If Typesense data is lost or corrupted, rebuild the search index from PostgreSQL:

```bash
# 1. Ensure Typesense is running and healthy
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health

# 2. Run reindex
docker compose -f docker-compose.prod.yml exec api npx ts-node scripts/typesense-reindex.ts
# Or from host:
pnpm run typesense:reindex

# 3. Verify collections
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/collections | jq '.[].name'
```

### 4.6 Full Host Recovery

For complete host failure or migration to a new server:

```bash
# 1. Provision new host with Docker + Docker Compose
# Requirements: Docker >= 24, Docker Compose v2, 8 GB RAM minimum

# 2. Clone repository and configure
git clone <repo-url> ~/goodgo && cd ~/goodgo
cp .env.example .env
# Edit .env with production secrets (from secrets manager)

# 3. Restore PostgreSQL backup from off-site storage
# Transfer backup file to the new host
# (it must end up in the volume mounted at /backups in the pg-backup container)
scp backups/goodgo_latest.sql.gz deploy@newhost:~/goodgo/backups/

# 4. Start infrastructure services
# pg-backup is included because step 5 runs the restore script inside it
docker compose -f docker-compose.prod.yml up -d --wait postgres redis typesense minio pg-backup

# 5. Confirm PostgreSQL is ready, then restore
docker compose -f docker-compose.prod.yml exec postgres pg_isready
docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_latest.sql.gz

# 6. Start application services
docker compose -f docker-compose.prod.yml up -d

# 7. Run migrations (if backup predates latest code)
docker compose -f docker-compose.prod.yml exec api npx prisma migrate deploy

# 8. Rebuild Typesense index
pnpm run typesense:reindex

# 9. Flush Redis (stale cache from old host)
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" FLUSHALL

# 10. Verify everything
curl -sf http://localhost:3001/health/ready | jq .
curl -sf http://localhost:3000 > /dev/null && echo "Web OK"

# Expected RTO: ~60 minutes (depends on backup transfer speed)
```

---

## 5. Escalation Matrix

| Severity | Condition | First Responder | Escalation | SLA |
|----------|-----------|-----------------|------------|-----|
| **P0 - Critical** | Full outage, data loss, payment corruption | On-call SRE | CTO + CEO within 15 min | Acknowledge: 5 min, Resolve: 1 hour |
| **P1 - High** | Partial outage, SLO breach (p99 > 3s), 5xx > 5% | On-call SRE | Engineering lead within 30 min | Acknowledge: 15 min, Resolve: 4 hours |
| **P2 - Medium** | Degraded performance, single service down (non-critical), p99 > 1s | On-call SRE | Team lead next business day | Acknowledge: 1 hour, Resolve: 24 hours |
| **P3 - Low** | Cosmetic issues, monitoring gaps, non-urgent improvements | Assigned engineer | Sprint planning | Next sprint |
|
|
|
|
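The numeric thresholds in the matrix can be turned into a quick triage helper. The sketch below is illustrative only: `classify_severity` is a hypothetical function (not part of the repository), and it covers just the p99 latency and 5xx conditions from the P1 and P2 rows. P0 and P3 conditions still need human judgment.

```bash
# Hypothetical triage helper: maps p99 latency (seconds) and 5xx error
# rate (percent) to the numeric P1/P2 thresholds in the escalation matrix.
classify_severity() {
  local p99="$1" err_pct="$2"
  awk -v p99="$p99" -v err="$err_pct" 'BEGIN {
    if (p99 > 3 || err > 5) print "P1"     # SLO breach or 5xx > 5%
    else if (p99 > 1)       print "P2"     # degraded performance
    else                    print "none"   # below the numeric thresholds
  }'
}

# Example: classify_severity 3.5 0.2   (p99 = 3.5s, 5xx = 0.2%)
```

Feed it the values returned by the p99 and 5xx queries in Section 7 to get a first-pass severity before paging.
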
### Contact Channels

| Role | Channel |
|------|---------|
| On-call SRE | Slack `#sre-oncall` + PagerDuty |
| Engineering Lead | Slack `#engineering` |
| CTO | Slack DM / Phone (see PagerDuty) |
| Payment Issues | Slack `#payments` + VNPay/MoMo support portals |
| Infrastructure | Slack `#infrastructure` |

### Slack Notifications

The deploy pipeline automatically notifies `#deployments` (via `SLACK_WEBHOOK_URL`) on:
- Production deploy success
- Staging smoke test failure
- Production rollback triggered

---

## 6. Monitoring Dashboards

All dashboards are provisioned automatically via `monitoring/grafana/provisioning/` and are available in the **GoodGo** folder in Grafana.

| Dashboard | Grafana Path | Purpose |
|-----------|--------------|---------|
| **API Overview** | `api-overview` | Request rates, status codes, active connections |
| **API Latency** | `api-latency` | p50/p95/p99 latency by endpoint, latency heatmaps |
| **Database** | `database` | PostgreSQL connections, query performance, PgBouncer stats |
| **Search** | `search` | Typesense query rates, latency, index sizes |
| **Business Metrics** | `business-metrics` | Listings, inquiries, payments, user registrations |
| **Web Vitals** | `web-vitals` | Core Web Vitals (LCP, FID, CLS), page load times |
| **Logs** | `logs` | Loki log explorer with filters by service, level, correlation ID |

**Access:** `http://localhost:3002` (default credentials in `.env`: `GRAFANA_ADMIN_USER` / `GRAFANA_ADMIN_PASSWORD`)

**Data Sources:**
- **Prometheus** (`http://prometheus:9090`) — Metrics (default)
- **Loki** (`http://loki:3100`) — Logs, with correlation ID linking to Prometheus
- **Alertmanager** (`http://alertmanager:9093`) — Alert state and silences

---

## 7. Useful PromQL Queries

### API Performance

```promql
# Overall p99 latency
histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket{job="goodgo-api"}[5m])) by (le))

# Per-endpoint p99 latency (top 10 slowest)
topk(10, histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket{job="goodgo-api"}[5m])) by (le, route, method)))

# Request rate by status code
sum(rate(http_requests_total{job="goodgo-api"}[5m])) by (status_code)

# 5xx error percentage
(sum(rate(http_requests_total{job="goodgo-api", status_code=~"5.."}[5m])) / sum(rate(http_requests_total{job="goodgo-api"}[5m]))) * 100
```

### Database

```promql
# Active connections
pg_stat_activity_count{datname="goodgo", state="active"}

# Connection pool utilization (if PgBouncer metrics are scraped)
# Manual check via: SHOW POOLS in PgBouncer admin console
```

### Infrastructure

```promql
# Container memory usage
container_memory_usage_bytes{name=~"goodgo-.*"}

# Container CPU usage
rate(container_cpu_usage_seconds_total{name=~"goodgo-.*"}[5m])
```

---

## 8. Environment Quick Reference

### Key Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `DATABASE_URL` | Yes | PostgreSQL via PgBouncer (`postgresql://user:pass@pgbouncer:6432/db`) |
| `DATABASE_URL_DIRECT` | Yes (prod) | Direct PostgreSQL for migrations (`postgresql://user:pass@postgres:5432/db`) |
| `JWT_SECRET` | Yes | JWT signing secret |
| `JWT_REFRESH_SECRET` | Yes | Refresh token signing secret |
| `REDIS_URL` | Yes | Redis connection (`redis://:password@redis:6379`) |
| `REDIS_PASSWORD` | Yes (prod) | Redis auth password |
| `TYPESENSE_API_KEY` | Yes | Typesense admin API key |
| `MINIO_ACCESS_KEY` | Yes | MinIO root user |
| `MINIO_SECRET_KEY` | Yes | MinIO root password |
| `VNPAY_*` | Yes | VNPay payment gateway configuration |
| `AI_API_KEY` | Yes | AI services authentication |
| `GRAFANA_ADMIN_USER` | Yes (prod) | Grafana admin username |
| `GRAFANA_ADMIN_PASSWORD` | Yes (prod) | Grafana admin password |
| `PGBOUNCER_POOL_SIZE` | No | PgBouncer pool size (default: 20) |
| `PGBOUNCER_MAX_CLIENT_CONN` | No | Max PgBouncer client connections (default: 200) |
| `BACKUP_RETENTION_DAYS` | No | Backup retention period (default: 7) |
| `IMAGE_TAG` | No (prod) | Container image tag (default: `latest`) |

### Port Map

| Port | Service | Exposed |
|------|---------|---------|
| 3000 | Web (Next.js) | External |
| 3001 | API (NestJS) | External |
| 3002 | Grafana | External (admin only) |
| 5432 | PostgreSQL | Internal |
| 6432 | PgBouncer | Internal |
| 6379 | Redis | Internal |
| 8000 | AI Services | Internal |
| 8108 | Typesense | Internal |
| 9000 | MinIO API | Internal |
| 9001 | MinIO Console | Internal |
| 9090 | Prometheus | Internal |
| 3100 | Loki | Internal |

### Docker Volumes

| Volume | Service | Purpose |
|--------|---------|---------|
| `pgdata` | PostgreSQL | Database files |
| `redis_data` | Redis | AOF persistence |
| `typesense_data` | Typesense | Search index data |
| `minio_data` | MinIO | Object storage (images, files) |
| `pg_backups` | pg-backup | Database backup files |
| `loki_data` | Loki | Log storage (15-day retention) |
| `prometheus_data` | Prometheus | Metrics (30-day retention prod / 15-day dev) |
| `grafana_data` | Grafana | Dashboard state, user preferences |

---

## 9. Disaster Recovery Validation

### Automated Verification

Backup verification runs **daily at 04:00 UTC** inside the `pg-backup` container. It restores the latest backup to an isolated test database and checks:

- Table existence (all 22 Prisma models)
- Row count comparison against live database
- Data checksums on critical tables (User, Property, Listing, Payment, Subscription, Transaction, Plan)
- PostGIS extension availability
- Index count match
- Enum type count match

**Check latest verification report:**

```bash
docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .
```

**Check verification logs:**

```bash
docker exec goodgo-pg-backup cat /var/log/pg-verify.log
```

### Manual DR Validation Procedure

Run this quarterly (or after major schema changes) to validate the full DR process end-to-end.

#### Step 1: Verify Backups Exist and Are Recent

```bash
# List backups with timestamps and sizes
docker exec goodgo-pg-backup ls -lht /backups/goodgo_*.sql.gz

# Identify the latest backup; confirm from the listing above that it is < 25 hours old
LATEST=$(docker exec goodgo-pg-backup ls -t /backups/goodgo_*.sql.gz | head -1)
echo "Latest backup: $LATEST"
```

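The 25-hour freshness rule can also be checked mechanically instead of by reading the listing. This is a minimal sketch, not part of the repository: `backup_is_fresh` is a hypothetical helper that compares a file's modification time (epoch seconds) against the current time.

```bash
# Hypothetical helper: succeeds if the backup's mtime is less than
# max_hours old. Args: <mtime_epoch> <now_epoch> [max_hours, default 25]
backup_is_fresh() {
  local mtime="$1" now="$2" max_hours="${3:-25}"
  local age=$(( now - mtime ))
  [ "$age" -ge 0 ] && [ "$age" -lt $(( max_hours * 3600 )) ]
}

# Example usage (assumes `stat -c %Y` is available in the pg-backup image):
#   MTIME=$(docker exec goodgo-pg-backup stat -c %Y "$LATEST")
#   backup_is_fresh "$MTIME" "$(date +%s)" && echo "Backup is fresh" || echo "Backup is STALE"
```

The threshold matches the `BackupTooOld` alert (last backup > 25 hours ago), so a stale result here should coincide with that alert firing.
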
#### Step 2: Run Verification Against Latest Backup

```bash
# Automated verification (creates temp DB, validates, drops)
docker exec -e REPORT_FILE=/backups/verify-latest.json goodgo-pg-backup \
  /scripts/pg-verify-backup.sh

# Review results
docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .
```

**Expected output:** All checks pass, and the restore completes in < 60 seconds for a typical dataset.

#### Step 3: Test Full Restore (Staging Only)

> ⚠️ **WARNING:** Only perform this on a staging or isolated environment. Never run it on production.

```bash
# 1. Create a separate test environment
docker compose -f docker-compose.yml -p goodgo-dr-test up -d postgres

# 2. Wait for PostgreSQL to be ready
docker exec goodgo-dr-test-postgres-1 pg_isready

# 3. Run restore against the test environment
PGHOST=localhost PGPORT=<test-port> PGUSER=goodgo PGPASSWORD=<password> \
  /scripts/pg-restore.sh /backups/<latest-backup>.sql.gz

# 4. Verify key tables (one -c per query so every count is printed,
#    including on psql versions that show only the last result of a -c string)
docker exec goodgo-dr-test-postgres-1 psql -U goodgo -d goodgo \
  -c 'SELECT count(*) FROM "User";' \
  -c 'SELECT count(*) FROM "Property";' \
  -c 'SELECT count(*) FROM "Listing";'

# 5. Clean up test environment
docker compose -f docker-compose.yml -p goodgo-dr-test down -v
```

#### Step 4: Validate Service Recovery Chain

Test that all services can start from a clean state with restored data:

```bash
# 1. Note current service status
docker compose -f docker-compose.prod.yml ps --format "table {{.Name}}\t{{.Status}}\t{{.Health}}"

# 2. Restart all services in dependency order
docker compose -f docker-compose.prod.yml restart postgres
sleep 10  # Wait for PostgreSQL

docker compose -f docker-compose.prod.yml restart pgbouncer redis typesense
sleep 10  # Wait for data services

docker compose -f docker-compose.prod.yml restart api web ai-services
sleep 15  # Wait for application services

# 3. Verify all health checks
curl -sf http://localhost:3001/health/ready | jq .
curl -sf http://localhost:3000 > /dev/null && echo "Web OK"
curl -sf http://localhost:9090/-/healthy && echo "Prometheus OK"
curl -sf http://localhost:9093/-/healthy && echo "Alertmanager OK"
curl -sf http://localhost:3002/api/health | jq .
```

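The fixed `sleep` calls above assume each tier comes up within a set time. A polling helper is one way to make the restart chain less timing-sensitive. The following is a sketch only: `wait_until` is a hypothetical function, and the timeouts in the usage comments are illustrative rather than measured values.

```bash
# Hypothetical helper: retry a command until it succeeds or a timeout expires.
# Args: <timeout_seconds> <interval_seconds> <command...>
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "Timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example usage in place of the fixed sleeps:
#   wait_until 60 2 docker compose -f docker-compose.prod.yml exec -T postgres pg_isready
#   wait_until 90 3 curl -sf http://localhost:3001/health/ready
```

Because the helper returns non-zero on timeout, a failed wait can stop the procedure early instead of letting later health checks fail for unclear reasons.
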
#### Step 5: Validate Alerting Pipeline

```bash
# 1. Check Prometheus is loading alert rules
curl -sf http://localhost:9090/api/v1/rules | jq '.data.groups | length'
# Expected: 7 groups

# 2. Check current alerts (should be empty if healthy)
curl -sf http://localhost:9090/api/v1/alerts | jq '.data.alerts | length'

# 3. Check Prometheus has discovered Alertmanager as an alert target
curl -sf http://localhost:9090/api/v1/alertmanagers | jq '.data.activeAlertmanagers'

# 4. Verify Alertmanager is up and its config is loaded
curl -sf http://localhost:9093/api/v2/status | jq '.cluster.status, .config'
```

### DR Validation Checklist

Use this checklist during quarterly DR reviews:

- [ ] Latest backup is < 25 hours old
- [ ] Automated verification report shows all checks passed
- [ ] Manual restore to test DB succeeds with correct row counts
- [ ] Full service restart completes within RTO target (< 30 min)
- [ ] All health endpoints respond after restart
- [ ] Prometheus alert rules are loaded (7 groups)
- [ ] Alertmanager is reachable and configured
- [ ] Slack notification channel is receiving test alerts
- [ ] Grafana dashboards show data after restart
- [ ] Typesense search returns results after restart

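For the Slack test-alert item in the checklist, one option is to post a synthetic alert straight to Alertmanager's v2 API (`POST /api/v2/alerts`). This is a hedged sketch: the alert name `DRValidationTestAlert`, the `severity` label key, and the `ALERTMANAGER_URL` variable are assumptions for illustration, so match the labels to whatever `monitoring/alertmanager/alertmanager.yml` actually routes on.

```bash
# Build a minimal Alertmanager v2 payload: a JSON array with one alert.
# The alert name and `severity` label key are assumptions, not repo values.
test_alert_payload() {
  local severity="${1:-warning}"
  printf '[{"labels":{"alertname":"DRValidationTestAlert","severity":"%s"},"annotations":{"summary":"DR validation test alert, safe to ignore"}}]\n' "$severity"
}

# POST the payload to Alertmanager, which should then route it to Slack.
# ALERTMANAGER_URL is a hypothetical override; the default matches Step 4.
send_test_alert() {
  test_alert_payload "$1" | curl -sf -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- \
    "${ALERTMANAGER_URL:-http://localhost:9093}/api/v2/alerts"
}

# Example: send_test_alert warning
```

Since the payload sets no `endsAt`, Alertmanager treats the alert as resolved once its configured `resolve_timeout` passes without the alert being re-sent.
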
### RPO/RTO Summary

| Metric | Target | Actual (Measured) | Notes |
|--------|--------|-------------------|-------|
| **RPO** | ≤ 24 hours | ~24h (daily at 02:00 UTC) | Reduce with WAL archiving |
| **RTO — Local backup** | ≤ 15 minutes | Measure during DR test | Restore + service restart |
| **RTO — Off-site backup** | ≤ 30 minutes | Measure during DR test | Add transfer time |
| **RTO — Full host recovery** | ≤ 60 minutes | Measure during DR test | New host + restore + deploy |

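To fill in the "Actual (Measured)" column, each recovery step can be timed during the DR test. The helper below is a minimal sketch under that assumption: `measure_seconds` is a hypothetical function that reports wall-clock seconds for whatever command it wraps.

```bash
# Hypothetical helper: run a command, print elapsed wall-clock seconds on
# stdout, and return the command's own exit status.
measure_seconds() {
  local start end status=0
  start=$(date +%s)
  "$@" >&2 || status=$?   # send the command's output to stderr
  end=$(date +%s)
  echo $(( end - start ))
  return "$status"
}

# Example usage during a DR test (restore command taken from Section 4):
#   RESTORE_SECS=$(measure_seconds docker exec goodgo-pg-backup \
#     /scripts/pg-restore.sh /backups/goodgo_latest.sql.gz)
#   echo "Restore took ${RESTORE_SECS}s"
```

Summing the timings for restore, service restart, and (for off-site recovery) backup transfer gives a measured RTO to compare against the targets in the table.
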
---

## Appendix: Alert Rules Reference

### API & Error Alerts

| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `ApiLatencyP99High` | p99 > 1s | Warning | 5 min |
| `ApiEndpointLatencyP99High` | Per-route p99 > 2s | Warning | 5 min |
| `ApiLatencyP99Critical` | p99 > 3s (SLO breach) | Critical | 3 min |
| `ApiErrorRate5xxHigh` | 5xx rate > 1% | Warning | 5 min |
| `ApiErrorRate5xxCritical` | 5xx rate > 5% | Critical | 3 min |
| `ApiNoTraffic` | Request rate = 0 | Warning | 10 min |

### Database Alerts

| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `PostgresActiveConnectionsHigh` | Active connections > 15 | Warning | 5 min |
| `PostgresConnectionPoolCritical` | Total connections > 180 | Critical | 2 min |
| `PostgresSlowQueries` | Lock-waiting queries > 5 | Warning | 5 min |
| `PostgresDown` | API scrape target down | Critical | 1 min |

### Redis Alerts

| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `RedisMemoryHigh` | Memory usage > 80% | Warning | 5 min |
| `RedisMemoryCritical` | Memory usage > 95% | Critical | 2 min |
| `RedisConnectedClientsHigh` | Clients > 150 | Warning | 5 min |
| `RedisRejectedConnections` | Rejected connections > 0 | Critical | 1 min |

### Container Resource Alerts

| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `ContainerRestartLoop` | > 3 restarts in 15 min | Critical | 5 min |
| `ContainerMemoryHigh` | Memory > 85% of limit | Warning | 5 min |
| `ContainerCPUThrottled` | CPU throttle rate > 0.5s/s | Warning | 10 min |

### Disk & Infrastructure Alerts

| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `HostDiskUsageHigh` | Root disk > 80% | Warning | 10 min |
| `HostDiskUsageCritical` | Root disk > 90% | Critical | 5 min |
| `ApiHealthCheckFailing` | Health probe fails | Critical | 2 min |
| `PrometheusTargetDown` | Scrape target down | Warning | 5 min |

### Backup Alerts

| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `BackupTooOld` | Last backup > 25 hours ago | Warning | 5 min |
| `BackupVerificationFailed` | Verify result = fail | Warning | 1 min |

### Alert Routing

Alerts are routed via Alertmanager (`monitoring/alertmanager/alertmanager.yml`):

| Channel | Routes | Repeat Interval |
|---------|--------|-----------------|
| `#sre-oncall` (Slack) | All warning alerts | 4 hours |
| `#sre-oncall` (Slack) | All critical alerts (priority) | 1 hour |
| `#infrastructure` (Slack) | Backup-related alerts | 6 hours |

**Inhibition:** Warning alerts are suppressed when a critical alert for the same service is already firing.

Alert rules are defined in `monitoring/prometheus/alert-rules.yml` and evaluated every 15 seconds.