goodgo-platform/docs/RUNBOOK.md
Ho Ngoc Hai 9409706c58 feat(monitoring): add comprehensive alerting rules, Alertmanager, and DR validation
Expand production monitoring with full alert coverage for database connections,
Redis memory/connections, container resources, disk usage, service health, and
backup integrity. Add Alertmanager service with Slack routing for critical and
warning alerts, and add automated backup verification to the pg-backup cron
schedule. Update runbook with DR validation procedures and quarterly checklist.

- Expand Prometheus alert rules from 4 to 24 alerts across 7 groups
- Add Alertmanager container (prom/alertmanager:v0.27.0) with Slack routing
- Configure inhibition rules (critical suppresses warning for same service)
- Schedule automated backup verification at 04:00 UTC daily
- Add Alertmanager datasource to Grafana provisioning
- Update runbook with Section 9: DR Validation (automated + manual procedures)
- Add SLACK_WEBHOOK_URL and Grafana vars to .env.example

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-11 20:15:36 +07:00

GoodGo Platform — Production Runbook

Audience: On-call SRE, DevOps engineers, and platform operators.
Last updated: 2026-04-11


Table of Contents

  1. Service Inventory
  2. Health Checks
  3. Common Incidents
  4. Recovery Procedures
  5. Escalation Matrix
  6. Monitoring Dashboards
  7. Useful PromQL Queries
  8. Environment Quick Reference
  9. Disaster Recovery Validation

1. Service Inventory

Production Services (docker-compose.prod.yml)

| Service | Image | Port | Resource Limits | Health Check |
|---|---|---|---|---|
| api (NestJS) | ghcr.io/goodgo/goodgo-api | 3001 | 1 CPU / 1 GB | GET /health (node fetch) |
| web (Next.js) | ghcr.io/goodgo/goodgo-web | 3000 | 0.5 CPU / 512 MB | GET / (node fetch) |
| ai-services (FastAPI) | ghcr.io/goodgo/goodgo-ai-services | 8000 | 1 CPU / 1 GB | GET /health (httpx) |
| postgres | postgis/postgis:16-3.4 | 5432 (internal) | 2 CPU / 2 GB, shm=256m | pg_isready |
| pgbouncer | edoburu/pgbouncer:1.23.1-p2 | 6432 (internal) | 0.5 CPU / 256 MB | pg_isready -p 6432 |
| redis | redis:7-alpine | 6379 (internal) | 0.5 CPU / 768 MB | redis-cli ping |
| typesense | typesense/typesense:27.1 | 8108 (internal) | 1 CPU / 1 GB | curl /health |
| minio | minio/minio:latest | 9000/9001 (internal) | 0.5 CPU / 1 GB | mc ready local |
| pg-backup | postgis/postgis:16-3.4 | | 0.5 CPU / 512 MB | — (cron daemon) |
| loki | grafana/loki:3.0.0 | 3100 (internal) | 0.5 CPU / 512 MB | wget /ready |
| promtail | grafana/promtail:3.0.0 | | 0.25 CPU / 256 MB | |
| prometheus | prom/prometheus:v2.51.0 | 9090 (internal) | 0.5 CPU / 1 GB | wget /-/healthy |
| grafana | grafana/grafana:10.4.1 | 3002 (external) | 0.5 CPU / 512 MB | wget /api/health |
| alertmanager | prom/alertmanager:v0.27.0 | 9093 (internal) | 0.25 CPU / 256 MB | wget /-/healthy |

Development-Only Services (docker-compose.yml)

Development uses the same data and monitoring services but runs API/Web on the host. The pg-backup service also runs in dev with default credentials.

Service Dependency Chain

web --> api --> pgbouncer --> postgres
         |-> redis
         |-> typesense
         |-> minio
         |-> ai-services

grafana --> prometheus --> alertmanager
        |-> loki --> promtail (Docker socket)

pg-backup --> postgres

2. Health Checks

Application Health Endpoints

| Endpoint | Type | Checks | Expected Response |
|---|---|---|---|
| GET /health | Liveness | Process is running | 200 { status: "ok" } |
| GET /health/ready | Readiness | PostgreSQL + Redis | 200 { status: "ok", info: { database: ..., redis: ... } } |
| GET /health/db | Database only | PostgreSQL connectivity | 200 { status: "ok", info: { database: ... } } |
| GET /health/redis | Redis only | Redis connectivity | 200 { status: "ok", info: { redis: ... } } |

Verify All Services Are Healthy

# Quick check — all containers
docker compose -f docker-compose.prod.yml ps --format "table {{.Name}}\t{{.Status}}\t{{.Health}}"

# API liveness
curl -sf http://localhost:3001/health && echo "API OK" || echo "API FAIL"

# API readiness (DB + Redis)
curl -sf http://localhost:3001/health/ready | jq .

# Individual dependency checks
curl -sf http://localhost:3001/health/db | jq .
curl -sf http://localhost:3001/health/redis | jq .

# Typesense
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health

# MinIO
docker exec goodgo-minio mc ready local && echo "MinIO OK"

# AI Services
curl -sf http://localhost:8000/health && echo "AI OK" || echo "AI FAIL"

# PostgreSQL (direct)
docker exec goodgo-postgres pg_isready -U ${DB_USER} -d ${DB_NAME}

# PgBouncer
docker exec goodgo-pgbouncer pg_isready -h 127.0.0.1 -p 6432 -U ${DB_USER}

# Redis
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" ping

# Prometheus
curl -sf http://localhost:9090/-/healthy && echo "Prometheus OK"

# Loki
curl -sf http://localhost:3100/ready && echo "Loki OK"

# Grafana
curl -sf http://localhost:3002/api/health | jq .

# Alertmanager
curl -sf http://localhost:9093/-/healthy && echo "Alertmanager OK"
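
The HTTP probes above can be wrapped in a single function for on-call use. The following is a minimal sketch rather than a script that ships with the repository: the helper names check_http and check_all are hypothetical, the 5-second timeout is an assumed value, and only the unauthenticated endpoints and ports listed in this section are covered.

```shell
#!/usr/bin/env bash
# Sketch only: check_http and check_all are hypothetical helper names.

# Probe one URL; print "<name> OK" or "<name> FAIL" and return non-zero on failure.
check_http() {
  local name="$1" url="$2"
  # --max-time 5 is an assumed timeout, not a documented value.
  if curl -sf --max-time 5 -o /dev/null "$url"; then
    echo "$name OK"
  else
    echo "$name FAIL"
    return 1
  fi
}

# Probe every unauthenticated HTTP health endpoint from this section.
# Returns 1 if at least one probe failed, 0 otherwise.
check_all() {
  local failed=0
  check_http "API"          "http://localhost:3001/health"     || failed=1
  check_http "AI"           "http://localhost:8000/health"     || failed=1
  check_http "Prometheus"   "http://localhost:9090/-/healthy"  || failed=1
  check_http "Loki"         "http://localhost:3100/ready"      || failed=1
  check_http "Grafana"      "http://localhost:3002/api/health" || failed=1
  check_http "Alertmanager" "http://localhost:9093/-/healthy"  || failed=1
  return "$failed"
}
```

After sourcing the file, `check_all || echo "at least one service is unhealthy"` prints one line per service and a summary on failure. Every probe runs even after an earlier one fails, so a single invocation shows the full picture.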

Container Resource Usage

docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.NetIO}}"

3. Common Incidents

3.1 Database Connection Pool Exhaustion

Symptoms:

  • API returns 503 or hangs on requests
  • /health/ready returns unhealthy for database
  • PgBouncer logs: no more connections allowed or query_wait_timeout
  • Prometheus: spike in pg_stat_activity active connections

Diagnosis:

# Check PgBouncer pool status
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW POOLS;"
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW CLIENTS;"
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW STATS;"

# Check PostgreSQL active connections
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT state, count(*) FROM pg_stat_activity WHERE datname = '${DB_NAME}' GROUP BY state;"

# Identify long-running queries
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
   FROM pg_stat_activity
   WHERE datname = '${DB_NAME}' AND state != 'idle'
   ORDER BY duration DESC
   LIMIT 10;"

Resolution:

# 1. Kill long-running queries (> 5 minutes)
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE datname = '${DB_NAME}'
     AND state != 'idle'
     AND now() - query_start > interval '5 minutes'
     AND pid <> pg_backend_pid();"

# 2. If pool is fully exhausted, restart PgBouncer
docker compose -f docker-compose.prod.yml restart pgbouncer

# 3. If issue persists, increase pool size temporarily
#    Edit PGBOUNCER_POOL_SIZE in .env, then:
docker compose -f docker-compose.prod.yml up -d --no-deps pgbouncer

PgBouncer Configuration Reference:

  • Pool mode: transaction (connections returned to pool after each transaction)
  • Default pool size: 20 server connections per user/db pair
  • Max client connections: 200
  • Reserve pool: 5 extra connections (after 3s wait)
  • Query wait timeout: 120s (error if client waits this long)
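
When the pool is saturated, the cl_waiting column of SHOW POOLS shows how many clients are queued for a server connection. The sketch below filters that output down to the pools that have waiters. It is an illustration, not an existing script: pools_with_waiters is a hypothetical name, and it assumes psql is invoked in unaligned mode (-A -F'|') so the header row can be used to locate columns by name, since column positions can differ between PgBouncer versions.

```shell
#!/usr/bin/env bash
# Sketch only: pools_with_waiters is a hypothetical helper name.

# Reads `SHOW POOLS` output in psql unaligned format (pipe-separated, header row first)
# on stdin and prints "database user cl_waiting" for every pool with waiting clients.
pools_with_waiters() {
  awk -F'|' '
    # Header row: remember the position of each column by name.
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Data rows: print pools whose cl_waiting value is greater than zero.
    # The "(N rows)" footer has no such field, so it evaluates to 0 and is skipped.
    col["cl_waiting"] && $(col["cl_waiting"]) + 0 > 0 {
      print $(col["database"]), $(col["user"]), $(col["cl_waiting"])
    }'
}

# Usage (same connection parameters as the diagnosis commands above):
#   docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer \
#     -A -F'|' -c "SHOW POOLS;" | pools_with_waiters
```

Empty output means no client is currently waiting. Any printed line is a pool worth comparing against the pool size and reserve pool settings listed above.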

3.2 Redis Connection Failure

Symptoms:

  • /health/redis returns unhealthy
  • Increased API response times (cache misses hitting DB)
  • API logs show Redis connection errors

Diagnosis:

# Check Redis container
docker logs --tail=50 goodgo-redis

# Test connectivity
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" ping
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO server
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO clients

Resolution:

# 1. Restart Redis (data persisted via AOF)
docker compose -f docker-compose.prod.yml restart redis

# 2. If OOM — check memory usage
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory | grep used_memory_human
# Max memory is 512 MB (prod), eviction policy: allkeys-lru

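# 3. If AOF is corrupted (docker exec cannot run in a stopped container, so use a one-off container)
docker compose -f docker-compose.prod.yml stop redis
# Redis 7 stores the AOF under appendonlydir/ by default; adjust the path if it is customized
docker compose -f docker-compose.prod.yml run --rm --no-deps redis \
  redis-check-aof --fix /data/appendonlydir/appendonly.aof.manifest
docker compose -f docker-compose.prod.yml start redis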
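
For the OOM check in step 2, the raw byte counters in INFO memory are easier to judge as a percentage of the configured limit. The following is a small sketch under stated assumptions: redis_mem_pct is a hypothetical helper name, and it relies only on the used_memory and maxmemory fields of INFO memory.

```shell
#!/usr/bin/env bash
# Sketch only: redis_mem_pct is a hypothetical helper name.

# Reads `INFO memory` output on stdin and prints used_memory as a percentage
# of maxmemory with one decimal, or "unlimited" when maxmemory is 0 or absent.
redis_mem_pct() {
  # INFO output uses CRLF line endings, so strip carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { limit = $2 }
    END {
      if (limit > 0) printf "%.1f\n", used * 100 / limit
      else print "unlimited"
    }'
}

# Usage:
#   docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory | redis_mem_pct
```

A value that stays close to 100 is expected under allkeys-lru once the cache is full, so it is more useful when read together with the eviction and error counts in the API logs.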

Graceful Degradation: The API is designed to continue operating when Redis is unavailable. Cache misses fall through to PostgreSQL. Performance will degrade but functionality is preserved. Redis is non-critical for core operations.

3.3 Typesense Unavailable

Symptoms:

  • Search functionality returns errors or falls back to basic DB search
  • curl http://localhost:8108/health fails
  • API logs show Typesense connection timeouts

Diagnosis:

# Check container status
docker logs --tail=50 goodgo-typesense

# Check health
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health

# Check collections
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/collections | jq .

# Check disk space for Typesense data volume
docker system df -v | grep typesense

Resolution:

# 1. Restart Typesense
docker compose -f docker-compose.prod.yml restart typesense

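# 2. If data is corrupted — rebuild from PostgreSQL
# Remove the container as well: docker volume rm fails while any container, even a stopped one, references the volume
docker compose -f docker-compose.prod.yml rm -sf typesense
docker volume rm goodgo-platform-ai_typesense_data
docker compose -f docker-compose.prod.yml up -d typesense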
# Wait for healthy, then reindex:
docker compose -f docker-compose.prod.yml exec api npx ts-node scripts/typesense-reindex.ts
# Or: pnpm run typesense:reindex

# 3. If volume backup exists — restore
docker compose -f docker-compose.prod.yml stop typesense
docker run --rm -v goodgo-platform-ai_typesense_data:/data -v $(pwd)/backups:/backup \
  alpine sh -c "rm -rf /data/* && tar xzf /backup/typesense_YYYYMMDD_HHMMSS.tar.gz -C /data"
docker compose -f docker-compose.prod.yml start typesense

Fallback Behavior: When Typesense is unavailable, property search falls back to PostgreSQL full-text search with PostGIS geo queries. Search quality degrades but core functionality works.

3.4 High API Latency

Symptoms:

  • Prometheus alert ApiLatencyP99High fires (p99 > 1s for 5 min)
  • Critical alert ApiLatencyP99Critical fires (p99 > 3s for 3 min — SLO breach)
  • Users report slow page loads

Diagnosis:

# 1. Check which endpoints are slow
# Grafana: GoodGo API Latency dashboard
# Or via PromQL:
curl -s "http://localhost:9090/api/v1/query" --data-urlencode \
  'query=topk(10, histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket[5m])) by (le, route, method)))' \
  | jq '.data.result[] | {route: .metric.route, method: .metric.method, p99: .value[1]}'

# 2. Check database slow queries
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT pid, now() - query_start AS duration, left(query, 100) AS query_preview
   FROM pg_stat_activity
   WHERE state = 'active' AND now() - query_start > interval '1 second'
   ORDER BY duration DESC;"

# 3. Check PgBouncer wait times
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW POOLS;"

# 4. Check container resource usage
docker stats --no-stream goodgo-api goodgo-postgres goodgo-redis goodgo-pgbouncer

# 5. Check Redis latency
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" --latency-history -i 3

# 6. Check application logs for errors
docker logs --tail=200 --since=5m goodgo-api 2>&1 | grep -i "error\|timeout\|slow"

Resolution:

# 1. If DB slow queries — terminate them
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE datname = '${DB_NAME}'
     AND state = 'active'
     AND now() - query_start > interval '30 seconds'
     AND pid <> pg_backend_pid();"

# 2. If connection pool exhaustion — see Section 3.1

# 3. If Redis is slow — restart
docker compose -f docker-compose.prod.yml restart redis

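# 4. If API container OOM: restart to recover (a plain restart keeps the same 1 GB limit)
docker compose -f docker-compose.prod.yml restart api
#    If OOM recurs, raise the api memory limit in docker-compose.prod.yml, then recreate the container:
#    docker compose -f docker-compose.prod.yml up -d --no-deps --wait api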

# 5. If specific endpoint is the bottleneck — check Loki logs:
# Grafana > Explore > Loki > {container_name="goodgo-api"} |= "slow"

3.5 Payment Callback Failures

Symptoms:

  • Users report payments stuck in "pending" state
  • VNPay/MoMo/ZaloPay IPN callbacks returning errors
  • Payment reconciliation mismatches

Diagnosis:

# 1. Check payment callback logs
docker logs goodgo-api 2>&1 | grep -i "payment\|callback\|vnpay\|momo\|zalopay" | tail -50

# 2. Check for pending payments in DB
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT id, provider, status, \"amountVND\", \"createdAt\"
   FROM \"Payment\"
   WHERE status = 'PENDING'
   AND \"createdAt\" > now() - interval '24 hours'
   ORDER BY \"createdAt\" DESC
   LIMIT 20;"

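# 3. Verify callback URL is reachable from external networks
#    Any HTTP status (even a 4xx for a request without valid parameters) proves reachability; 000 means no connection
curl -s -o /dev/null -w "%{http_code}\n" https://your-domain.com/api/payments/vnpay/callback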

# 4. Check if API is receiving callbacks (via Loki)
# Grafana > Explore > Loki > {container_name="goodgo-api"} |= "callback" |= "payment"

Resolution:

# 1. If callbacks are timing out — check API health and restart if needed
docker compose -f docker-compose.prod.yml restart api

# 2. If VNPay signature verification fails — verify VNPAY_* env vars
docker compose -f docker-compose.prod.yml exec api printenv | grep VNPAY

# 3. For stuck payments — manual reconciliation
#    Check VNPay/MoMo merchant portal for actual transaction status
#    Update payment status in DB if confirmed paid:
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "UPDATE \"Payment\" SET status = 'COMPLETED', \"updatedAt\" = now()
   WHERE id = '<payment-id>' AND status = 'PENDING';"

# 4. If callbacks are not reaching the server — check:
#    - Firewall rules (port 3001 or reverse proxy port must be open)
#    - SSL certificate validity
#    - DNS resolution
#    - Payment provider webhook configuration (correct callback URL)

Important: The payment callback handler uses idempotent processing with atomic state transitions. Replaying a callback is safe and will not duplicate payments.

3.6 Disk Space Alerts

Symptoms:

  • Containers failing to start or crashing
  • PostgreSQL refusing writes (PANIC: could not write to file)
  • Docker daemon running out of space

Diagnosis:

# Host disk usage
df -h

# Docker disk usage
docker system df
docker system df -v

# Check individual volume sizes
for vol in $(docker volume ls -q | grep goodgo); do
  echo -n "$vol: "
  docker run --rm -v "${vol}:/data" alpine du -sh /data 2>/dev/null
done

# Check backup volume specifically
docker exec goodgo-pg-backup du -sh /backups/
docker exec goodgo-pg-backup ls -lht /backups/
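
To spot the filesystem that is actually filling up, the df output can be filtered by usage. The sketch below is illustrative only: mounts_over is a hypothetical helper name, the 85 percent default is an assumed threshold rather than a value from the alert rules, and the parsing assumes POSIX df -P output with mount points that contain no spaces.

```shell
#!/usr/bin/env bash
# Sketch only: mounts_over is a hypothetical helper name.

# Reads `df -P` output on stdin and prints "<mount point> <capacity>" for every
# filesystem whose usage is at or above the threshold (default 85, an assumed value).
mounts_over() {
  local threshold="${1:-85}"
  awk -v t="$threshold" '
    NR > 1 {                      # skip the header row
      pct = $5                    # Capacity column, e.g. "91%"
      sub(/%/, "", pct)
      if (pct + 0 >= t + 0) print $6, $5
    }'
}

# Usage:
#   df -P | mounts_over 85
```

The same filter works inside a container, for example by piping `docker exec goodgo-pg-backup df -P` into it to check the backup volume.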

Resolution:

# 1. Clean up Docker artifacts
docker system prune -f          # Remove stopped containers, unused networks, dangling images
docker image prune -a -f        # Remove ALL unused images (careful in prod)

# 2. Clean old backups (if retention not working)
docker exec goodgo-pg-backup find /backups -name "goodgo_*.sql.gz" -mtime +7 -delete

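# 3. Clean Prometheus data (if too large)
# Prometheus retention is 30d (prod) / 15d (dev) — configured via --storage.tsdb.retention.time
# Blocks older than the retention window are deleted automatically. To reclaim space sooner,
# lower --storage.tsdb.retention.time (or set --storage.tsdb.retention.size) and recreate the container:
docker compose -f docker-compose.prod.yml up -d --no-deps prometheus
# Note: POST /-/quit only works when Prometheus runs with --web.enable-lifecycle.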

# 4. Clean Loki data (15-day retention)
# Loki handles its own cleanup via compactor. If urgent:
docker compose -f docker-compose.prod.yml restart loki

# 5. Truncate Docker container logs
sudo truncate -s 0 $(docker inspect --format='{{.LogPath}}' goodgo-api)
# Or for all containers:
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'

Prevention: All production containers use json-file logging with max-size: 10m and max-file: 3 to 5 (varies by service). Backup retention is 7 days (configurable via BACKUP_RETENTION_DAYS).

3.7 MinIO / Object Storage Failure

Symptoms:

  • Image/file uploads fail
  • Property photos not loading
  • MinIO console inaccessible at port 9001

Diagnosis:

docker logs --tail=50 goodgo-minio
docker exec goodgo-minio mc ready local
docker exec goodgo-minio mc admin info local

Resolution:

# 1. Restart MinIO
docker compose -f docker-compose.prod.yml restart minio

# 2. If data volume corrupted
# Remove the container as well: docker volume rm fails while any container, even a stopped one, references the volume
docker compose -f docker-compose.prod.yml rm -sf minio
docker volume rm goodgo-platform-ai_minio_data  # WARNING: data loss
docker compose -f docker-compose.prod.yml up -d minio
# Recreate buckets via API or admin console

3.8 AI Services Unavailable

Symptoms:

  • AI-powered features (AVM, property descriptions) fail
  • GET /health on port 8000 fails
  • API logs show AI service connection timeouts

Diagnosis:

docker logs --tail=50 goodgo-ai-services
curl -sf http://localhost:8000/health
docker stats --no-stream goodgo-ai-services

Resolution:

# 1. Restart AI services
docker compose -f docker-compose.prod.yml restart ai-services

# 2. Check rate limits (default: 60/minute)
docker compose -f docker-compose.prod.yml exec ai-services printenv | grep AI_RATE_LIMIT

# 3. If OOM — the service has 1 GB limit; may need to increase for large models

Graceful Degradation: AI features are optional. The API should handle AI service unavailability gracefully and return non-AI results.

3.9 Log Pipeline Failure (Loki/Promtail)

Symptoms:

  • Grafana log explorer returns empty results
  • Promtail container unhealthy or crash-looping
  • Loki returning 503

Diagnosis:

docker logs --tail=50 goodgo-loki
docker logs --tail=50 goodgo-promtail
curl -sf http://localhost:3100/ready && echo "Loki ready" || echo "Loki NOT ready"

Resolution:

# 1. Restart the pipeline
docker compose -f docker-compose.prod.yml restart loki promtail

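# 2. If Loki data corrupted
# Remove the loki container as well: docker volume rm fails while any container, even a stopped one, references the volume
docker compose -f docker-compose.prod.yml stop promtail
docker compose -f docker-compose.prod.yml rm -sf loki
docker volume rm goodgo-platform-ai_loki_data
docker compose -f docker-compose.prod.yml up -d loki promtail
# Historical logs are lost but new logs will flow immediately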

# 3. If Promtail can't access Docker socket
ls -la /var/run/docker.sock
# Ensure the promtail container has the Docker socket mounted

3.10 5xx Error Rate Spike

Symptoms:

  • Prometheus alert ApiErrorRate5xxHigh fires (> 1% 5xx for 5 min)
  • Users reporting errors

Diagnosis:

# Check which endpoints are returning 5xx
curl -s "http://localhost:9090/api/v1/query" --data-urlencode \
  'query=topk(10, sum(rate(http_requests_total{job="goodgo-api", status_code=~"5.."}[5m])) by (route, method))' \
  | jq '.data.result'

# Check API error logs
docker logs --tail=200 --since=5m goodgo-api 2>&1 | grep -i "error\|exception\|500"

# Check all dependency health
curl -sf http://localhost:3001/health/ready | jq .

Resolution:

  1. If DB-related: see Section 3.1
  2. If Redis-related: see Section 3.2
  3. If recent deployment: see Section 4.4
  4. If unknown: restart API and investigate logs

4. Recovery Procedures

4.1 Database Restore from Backup

Automated backups run daily at 02:00 UTC via the pg-backup container. Retention: 7 days. Format: pg_dump --format=custom --compress=6.

Automated verification runs daily at 04:00 UTC — restores to an isolated test database, verifies table existence, row counts, checksums, PostGIS extension, indexes, and enums. Reports are written to /backups/verify-latest.json.

List Available Backups

# sh -c makes the glob expand inside the container rather than on the host
docker exec goodgo-pg-backup sh -c 'ls -lht /backups/goodgo_*.sql.gz'

Create an On-Demand Backup

docker exec goodgo-pg-backup /scripts/pg-backup.sh

Full Restore Procedure

# 1. Stop application services
docker compose -f docker-compose.prod.yml stop api web ai-services

# 2. (Production) Stop PgBouncer to prevent stale connections
docker compose -f docker-compose.prod.yml stop pgbouncer

# 3. Run the restore script
docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
# The script will:
#   - Terminate active DB connections
#   - DROP and recreate the database
#   - Restore from the backup file

# 4. Verify the restore
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c '\dt'
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c 'SELECT count(*) FROM "User";'

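# 5. Apply any pending migrations
#    The api container is stopped at this point, so use a one-off container instead of exec
docker compose -f docker-compose.prod.yml run --rm --no-deps api npx prisma migrate deploy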

# 6. Restart all services
docker compose -f docker-compose.prod.yml up -d

# 7. Verify application health
curl -sf http://localhost:3001/health/ready | jq .

Verify a Backup Without Restoring

# Run verification against latest backup (creates temp DB, drops it after)
docker compose run --rm pg-verify-backup

# Or verify a specific backup file
docker exec goodgo-pg-backup /scripts/pg-verify-backup.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz

# Check latest verification report
docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .

RPO/RTO:

  • RPO: ≤ 24 hours (daily backups; consider WAL archiving for lower RPO)
  • RTO: ~15 minutes (local volume), ~30 minutes (off-site)

4.2 Redis Cache Flush & Warm-up

# Flush all Redis data
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" FLUSHALL

# Verify flush
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" DBSIZE
# Should return: (integer) 0

Warm-up: Redis uses allkeys-lru eviction. Cache warms naturally as users make requests. No manual warm-up script is needed — cache misses fall through to PostgreSQL.

When to flush:

  • After database restore (stale cache references)
  • After data corruption at the application level
  • After schema changes that alter cached data structures

4.3 Rolling Restart Procedures

Single Service Restart (Zero Downtime)

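# up -d only recreates a container when its image or configuration changed;
# --force-recreate forces the restart, and --wait blocks until the health check passes.

# API
docker compose -f docker-compose.prod.yml up -d --no-deps --force-recreate --wait api

# Web
docker compose -f docker-compose.prod.yml up -d --no-deps --force-recreate --wait web

# AI Services
docker compose -f docker-compose.prod.yml up -d --no-deps --force-recreate --wait ai-services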

Full Stack Rolling Restart

# Data services first (order matters for dependency chain)
docker compose -f docker-compose.prod.yml restart redis
docker compose -f docker-compose.prod.yml restart typesense

# Wait for data services to be healthy
sleep 10

# Connection pooling
docker compose -f docker-compose.prod.yml restart pgbouncer
sleep 5

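# Application services (--force-recreate restarts them even when image and config are unchanged)
docker compose -f docker-compose.prod.yml up -d --no-deps --force-recreate --wait api
docker compose -f docker-compose.prod.yml up -d --no-deps --force-recreate --wait web
docker compose -f docker-compose.prod.yml up -d --no-deps --force-recreate --wait ai-services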

# Verify
curl -sf http://localhost:3001/health/ready | jq .
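
The fixed sleep calls in the sequence above assume the data services become healthy within a few seconds. A polling helper is one way to wait on the actual health status instead. The following is a sketch, not part of the repository: wait_healthy is a hypothetical name, the 60-second timeout and 2-second interval are assumed defaults, and it only works for containers that define a Docker health check.

```shell
#!/usr/bin/env bash
# Sketch only: wait_healthy is a hypothetical helper name.

# wait_healthy CONTAINER [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Polls the Docker health status until it is "healthy" (returns 0) or the
# timeout elapses (returns 1). Defaults of 60s and 2s are assumed values.
wait_healthy() {
  local container="$1" timeout="${2:-60}" interval="${3:-2}"
  local waited=0 status
  while :; do
    # An inspect error (missing container, no health check) is treated as "not healthy yet".
    status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null) || status=""
    if [ "$status" = "healthy" ]; then
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "$container not healthy after ${timeout}s (last status: ${status:-unknown})" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Usage, replacing "sleep 10" after the data service restarts:
#   docker compose -f docker-compose.prod.yml restart redis typesense
#   wait_healthy goodgo-redis && wait_healthy goodgo-typesense
```

Compared with a fixed sleep, this continues as soon as the service is ready and fails loudly when it is not, which keeps a broken data service from being masked by later steps.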

Emergency: Restart Everything

docker compose -f docker-compose.prod.yml down
docker compose -f docker-compose.prod.yml up -d --wait

4.4 Rollback Deployment

The CI/CD pipeline (.github/workflows/deploy.yml) supports automatic rollback if production smoke tests fail. For manual rollback:

Quick Rollback (Revert to Previous Images)

# SSH into production host
ssh deploy@$PRODUCTION_HOST

cd ~/goodgo

# Stop current app containers
docker compose -f docker-compose.prod.yml down api web ai-services

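# Restart without pulling. This only reverts the release if IMAGE_TAG now resolves to the
# previous image. If the failed deploy pulled a new image under the same tag (for example latest),
# the same image starts again; in that case use the tag-based rollback below.
docker compose -f docker-compose.prod.yml up -d --wait api web ai-services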

# Verify
curl -sf http://localhost:3001/health && echo "Rollback successful"

Rollback to a Specific Git Commit / Image Tag

# Set the target tag (git SHA)
export IMAGE_TAG=<previous-commit-sha>
export REGISTRY_URL=ghcr.io/goodgo

# Pull specific version
docker compose -f docker-compose.prod.yml pull api web ai-services

# Deploy
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services

# Verify
curl -sf http://localhost:3001/health/ready | jq .

Rollback Database Migrations

# WARNING: Prisma does not support automatic down-migrations.
# For migration rollback, restore from the pre-migration backup:

# 1. Stop application
docker compose -f docker-compose.prod.yml stop api web ai-services pgbouncer

# 2. Restore from backup taken before the migration
docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/<pre-migration-backup>.sql.gz

# 3. Deploy the previous code version (older IMAGE_TAG)
export IMAGE_TAG=<previous-commit-sha>
docker compose -f docker-compose.prod.yml up -d --wait

4.5 Typesense Reindex from PostgreSQL

If Typesense data is lost or corrupted, rebuild the search index from PostgreSQL:

# 1. Ensure Typesense is running and healthy
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health

# 2. Run reindex
docker compose -f docker-compose.prod.yml exec api npx ts-node scripts/typesense-reindex.ts
# Or from host:
pnpm run typesense:reindex

# 3. Verify collections
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/collections | jq '.[].name'

4.6 Full Host Recovery

For complete host failure or migration to a new server:

# 1. Provision new host with Docker + Docker Compose
# Requirements: Docker >= 24, Docker Compose v2, 8 GB RAM minimum

# 2. Clone repository and configure
git clone <repo-url> ~/goodgo && cd ~/goodgo
cp .env.example .env
# Edit .env with production secrets (from secrets manager)

# 3. Restore PostgreSQL backup from off-site storage
# Transfer backup file to the new host
scp backups/goodgo_latest.sql.gz deploy@newhost:~/goodgo/backups/

# 4. Start infrastructure services (pg-backup is needed to run the restore script)
docker compose -f docker-compose.prod.yml up -d --wait postgres redis typesense minio pg-backup

# 5. Copy the backup into the pg-backup container's /backups volume, then restore
docker cp ~/goodgo/backups/goodgo_latest.sql.gz goodgo-pg-backup:/backups/
docker exec goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_latest.sql.gz

# 6. Start application services
docker compose -f docker-compose.prod.yml up -d

# 7. Run migrations (if backup predates latest code)
docker compose -f docker-compose.prod.yml exec api npx prisma migrate deploy

# 8. Rebuild Typesense index
pnpm run typesense:reindex

# 9. Flush Redis (stale cache from old host)
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" FLUSHALL

# 10. Verify everything
curl -sf http://localhost:3001/health/ready | jq .
curl -sf http://localhost:3000 > /dev/null && echo "Web OK"

# Expected RTO: ~60 minutes (depends on backup transfer speed)

5. Escalation Matrix

| Severity | Condition | First Responder | Escalation | SLA |
|---|---|---|---|---|
| P0 — Critical | Full outage, data loss, payment corruption | On-call SRE | CTO + CEO within 15 min | Acknowledge: 5 min, Resolve: 1 hour |
| P1 — High | Partial outage, SLO breach (p99 > 3s), 5xx > 5% | On-call SRE | Engineering lead within 30 min | Acknowledge: 15 min, Resolve: 4 hours |
| P2 — Medium | Degraded performance, single service down (non-critical), p99 > 1s | On-call SRE | Team lead next business day | Acknowledge: 1 hour, Resolve: 24 hours |
| P3 — Low | Cosmetic issues, monitoring gaps, non-urgent improvements | Assigned engineer | Sprint planning | Next sprint |
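
The numeric thresholds in the matrix can be expressed as a small classifier for triage scripts. This is only a sketch: classify_severity is a hypothetical name, and it encodes nothing beyond the p99 and 5xx thresholds stated in the table. P0 and P3 depend on conditions that are not metric-based (outage scope, data loss, cosmetic issues), so they are left to the responder's judgment.

```shell
#!/usr/bin/env bash
# Sketch only: classify_severity is a hypothetical helper name.

# classify_severity P99_SECONDS ERROR_PERCENT
# Prints P1, P2, or "none" using only the numeric thresholds from the matrix:
#   P1: p99 > 3s or 5xx > 5%
#   P2: p99 > 1s
classify_severity() {
  awk -v p99="$1" -v err="$2" 'BEGIN {
    if (p99 > 3 || err > 5) print "P1"
    else if (p99 > 1)       print "P2"
    else                    print "none"
  }'
}

# Usage:
#   classify_severity 1.2 0.5
```

The inputs would typically come from the p99 latency and 5xx percentage queries in Section 7.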

Contact Channels

| Role | Channel |
|---|---|
| On-call SRE | Slack #sre-oncall + PagerDuty |
| Engineering Lead | Slack #engineering |
| CTO | Slack DM / Phone (see PagerDuty) |
| Payment Issues | Slack #payments + VNPay/MoMo support portals |
| Infrastructure | Slack #infrastructure |

Slack Notifications

The deploy pipeline automatically notifies #deployments (via SLACK_WEBHOOK_URL) on:

  • Production deploy success
  • Staging smoke test failure
  • Production rollback triggered

6. Monitoring Dashboards

All dashboards are provisioned automatically via monitoring/grafana/provisioning/ and are available in the GoodGo folder in Grafana.

| Dashboard | Grafana Path | Purpose |
|---|---|---|
| API Overview | api-overview | Request rates, status codes, active connections |
| API Latency | api-latency | p50/p95/p99 latency by endpoint, latency heatmaps |
| Database | database | PostgreSQL connections, query performance, PgBouncer stats |
| Search | search | Typesense query rates, latency, index sizes |
| Business Metrics | business-metrics | Listings, inquiries, payments, user registrations |
| Web Vitals | web-vitals | Core Web Vitals (LCP, FID, CLS), page load times |
| Logs | logs | Loki log explorer with filters by service, level, correlation ID |

Access: http://localhost:3002 (default credentials in .env: GRAFANA_ADMIN_USER / GRAFANA_ADMIN_PASSWORD)

Data Sources:

  • Prometheus (http://prometheus:9090) — Metrics (default)
  • Loki (http://loki:3100) — Logs, with correlation ID linking to Prometheus
  • Alertmanager (http://alertmanager:9093) — Alert state and silences

7. Useful PromQL Queries

API Performance

# Overall p99 latency
histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket{job="goodgo-api"}[5m])) by (le))

# Per-endpoint p99 latency (top 10 slowest)
topk(10, histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket{job="goodgo-api"}[5m])) by (le, route, method)))

# Request rate by status code
sum(rate(http_requests_total{job="goodgo-api"}[5m])) by (status_code)

# 5xx error percentage
(sum(rate(http_requests_total{job="goodgo-api", status_code=~"5.."}[5m])) / sum(rate(http_requests_total{job="goodgo-api"}[5m]))) * 100

Database

# Active connections
pg_stat_activity_count{datname="goodgo", state="active"}

# Connection pool utilization (if PgBouncer metrics are scraped)
# Manual check via: SHOW POOLS in PgBouncer admin console

Infrastructure

# Container memory usage
container_memory_usage_bytes{name=~"goodgo-.*"}

# Container CPU usage
rate(container_cpu_usage_seconds_total{name=~"goodgo-.*"}[5m])

8. Environment Quick Reference

Key Environment Variables

| Variable | Required | Description |
|---|---|---|
| DATABASE_URL | Yes | PostgreSQL via PgBouncer (postgresql://user:pass@pgbouncer:6432/db) |
| DATABASE_URL_DIRECT | Yes (prod) | Direct PostgreSQL for migrations (postgresql://user:pass@postgres:5432/db) |
| JWT_SECRET | Yes | JWT signing secret |
| JWT_REFRESH_SECRET | Yes | Refresh token signing secret |
| REDIS_URL | Yes | Redis connection (redis://:password@redis:6379) |
| REDIS_PASSWORD | Yes (prod) | Redis auth password |
| TYPESENSE_API_KEY | Yes | Typesense admin API key |
| MINIO_ACCESS_KEY | Yes | MinIO root user |
| MINIO_SECRET_KEY | Yes | MinIO root password |
| VNPAY_* | Yes | VNPay payment gateway configuration |
| AI_API_KEY | Yes | AI services authentication |
| GRAFANA_ADMIN_USER | Yes (prod) | Grafana admin username |
| GRAFANA_ADMIN_PASSWORD | Yes (prod) | Grafana admin password |
| PGBOUNCER_POOL_SIZE | No | PgBouncer pool size (default: 20) |
| PGBOUNCER_MAX_CLIENT_CONN | No | Max PgBouncer client connections (default: 200) |
| BACKUP_RETENTION_DAYS | No | Backup retention period (default: 7) |
| IMAGE_TAG | No (prod) | Container image tag (default: latest) |
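
Before starting the stack on a new host, it can help to confirm that every required variable from the table is set. The following is a sketch under stated assumptions: missing_required_vars and REQUIRED_VARS are hypothetical names, the list covers the variables marked "Yes" or "Yes (prod)", and VNPAY_* is omitted because the table does not list the individual variable names.

```shell
#!/usr/bin/env bash
# Sketch only: REQUIRED_VARS and missing_required_vars are hypothetical names.

# Variables marked "Yes" or "Yes (prod)" in the table. VNPAY_* is omitted because
# the individual names are not listed there.
REQUIRED_VARS="DATABASE_URL DATABASE_URL_DIRECT JWT_SECRET JWT_REFRESH_SECRET REDIS_URL REDIS_PASSWORD TYPESENSE_API_KEY MINIO_ACCESS_KEY MINIO_SECRET_KEY AI_API_KEY GRAFANA_ADMIN_USER GRAFANA_ADMIN_PASSWORD"

# Prints the name of each required variable that is unset or empty.
# Returns 1 if any is missing, 0 otherwise.
missing_required_vars() {
  local name value missing=0
  for name in $REQUIRED_VARS; do
    # Indirect lookup of the variable called $name; names come from the fixed list above.
    eval "value=\${$name-}"
    if [ -z "$value" ]; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage, assuming .env contains plain KEY=value lines that a shell can source:
#   set -a; . ./.env; set +a
#   missing_required_vars || echo "fill in the variables listed above"
```

The check only proves that a value is present. It does not validate connection strings or secrets.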

Port Map

| Port | Service | Exposed |
|---|---|---|
| 3000 | Web (Next.js) | External |
| 3001 | API (NestJS) | External |
| 3002 | Grafana | External (admin only) |
| 5432 | PostgreSQL | Internal |
| 6432 | PgBouncer | Internal |
| 6379 | Redis | Internal |
| 8000 | AI Services | Internal |
| 8108 | Typesense | Internal |
| 9000 | MinIO API | Internal |
| 9001 | MinIO Console | Internal |
| 9090 | Prometheus | Internal |
| 9093 | Alertmanager | Internal |
| 3100 | Loki | Internal |

Docker Volumes

| Volume | Service | Purpose |
|---|---|---|
| pgdata | PostgreSQL | Database files |
| redis_data | Redis | AOF persistence |
| typesense_data | Typesense | Search index data |
| minio_data | MinIO | Object storage (images, files) |
| pg_backups | pg-backup | Database backup files |
| loki_data | Loki | Log storage (15-day retention) |
| prometheus_data | Prometheus | Metrics (30-day retention prod / 15-day dev) |
| grafana_data | Grafana | Dashboard state, user preferences |

9. Disaster Recovery Validation

Automated Verification

Backup verification runs daily at 04:00 UTC inside the pg-backup container. It restores the latest backup to an isolated test database and checks:

  • Table existence (all 22 Prisma models)
  • Row count comparison against live database
  • Data checksums on critical tables (User, Property, Listing, Payment, Subscription, Transaction, Plan)
  • PostGIS extension availability
  • Index count match
  • Enum type count match

Check latest verification report:

docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .

Check verification logs:

docker exec goodgo-pg-backup cat /var/log/pg-verify.log

Manual DR Validation Procedure

Run this quarterly (or after major schema changes) to validate the full DR process end-to-end.

Step 1: Verify Backups Exist and Are Recent

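# List backups with timestamps and sizes
# (the glob is quoted so it expands inside the container, not on the host)
docker exec goodgo-pg-backup sh -c 'ls -lht /backups/goodgo_*.sql.gz'

# Identify the latest backup; its timestamp in the listing above should be < 25 hours old
LATEST=$(docker exec goodgo-pg-backup sh -c 'ls -t /backups/goodgo_*.sql.gz | head -1')
echo "Latest backup: $LATEST"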
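
The 25-hour freshness threshold can also be checked mechanically rather than by eye. The following is a minimal sketch, not part of the existing scripts: `backup_is_fresh` is a hypothetical helper, and the commented usage assumes `stat -c %Y` is available inside the pg-backup image.

```shell
# Hypothetical helper: succeed if a file modified at MTIME (epoch seconds)
# is younger than MAX_HOURS (default 25) as of NOW (epoch seconds).
backup_is_fresh() {
  mtime=$1
  now=$2
  max_hours=${3:-25}
  age=$(( now - mtime ))
  [ "$age" -lt $(( max_hours * 3600 )) ]
}

# Usage against the container (assumes `stat -c %Y` exists in the image):
# LATEST=$(docker exec goodgo-pg-backup sh -c 'ls -t /backups/goodgo_*.sql.gz | head -1')
# MTIME=$(docker exec goodgo-pg-backup stat -c %Y "$LATEST")
# backup_is_fresh "$MTIME" "$(date +%s)" && echo "Backup is fresh" || echo "Backup is STALE"
```

The comparison is strict, so a backup that is exactly 25 hours old counts as stale.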

Step 2: Run Verification Against Latest Backup

# Automated verification (creates temp DB, validates, drops)
docker exec -e REPORT_FILE=/backups/verify-latest.json goodgo-pg-backup \
  /scripts/pg-verify-backup.sh

# Review results
docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .

Expected output: all checks pass, and the restore completes in < 60 seconds for a typical dataset.

Step 3: Test Full Restore (Staging Only)

⚠️ WARNING: Only perform this on a staging or isolated environment. Never on production.

# 1. Create a separate test environment
docker compose -f docker-compose.yml -p goodgo-dr-test up -d postgres

# 2. Wait for PostgreSQL to be ready (poll until pg_isready succeeds)
until docker exec goodgo-dr-test-postgres-1 pg_isready; do sleep 2; done

# 3. Run restore against the test environment
PGHOST=localhost PGPORT=<test-port> PGUSER=goodgo PGPASSWORD=<password> \
  /scripts/pg-restore.sh /backups/<latest-backup>.sql.gz

# 4. Verify key tables
docker exec goodgo-dr-test-postgres-1 psql -U goodgo -d goodgo -c \
  "SELECT count(*) FROM \"User\"; SELECT count(*) FROM \"Property\"; SELECT count(*) FROM \"Listing\";"

# 5. Clean up test environment
docker compose -f docker-compose.yml -p goodgo-dr-test down -v

Step 4: Validate Service Recovery Chain

Test that all services can start from a clean state with restored data:

# 1. Note current service status
docker compose -f docker-compose.prod.yml ps --format "table {{.Name}}\t{{.Status}}\t{{.Health}}"

# 2. Restart all services in dependency order
docker compose -f docker-compose.prod.yml restart postgres
sleep 10  # Wait for PostgreSQL

docker compose -f docker-compose.prod.yml restart pgbouncer redis typesense
sleep 10  # Wait for data services

docker compose -f docker-compose.prod.yml restart api web ai-services
sleep 15  # Wait for application services

# 3. Verify all health checks
curl -sf http://localhost:3001/health/ready | jq .
curl -sf http://localhost:3000 > /dev/null && echo "Web OK"
curl -sf http://localhost:9090/-/healthy && echo "Prometheus OK"
curl -sf http://localhost:9093/-/healthy && echo "Alertmanager OK"
curl -sf http://localhost:3002/api/health | jq .
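
The fixed `sleep` intervals above may be too short on a loaded host and needlessly long on an idle one. One alternative is to poll each dependency until it responds. This is a sketch only: `wait_for` is a hypothetical helper, not part of the repository, and the timeouts in the usage comments are illustrative.

```shell
# Hypothetical helper: run a command every INTERVAL seconds until it succeeds,
# giving up (non-zero exit) once TIMEOUT seconds of waiting have accumulated.
wait_for() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    elapsed=$(( elapsed + interval ))
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Usage in place of the sleeps (timeouts are assumptions, tune to your environment):
# docker compose -f docker-compose.prod.yml restart postgres
# wait_for 60 2 docker compose -f docker-compose.prod.yml exec -T postgres pg_isready
# docker compose -f docker-compose.prod.yml restart api web ai-services
# wait_for 120 3 curl -sf http://localhost:3001/health/ready
```

Because the helper returns non-zero on timeout, it can gate the next restart step with `&&` or abort a scripted run.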

Step 5: Validate Alerting Pipeline

# 1. Check Prometheus is loading alert rules
curl -sf http://localhost:9090/api/v1/rules | jq '.data.groups | length'
# Expected: 7 groups

# 2. Check current alerts (should be empty if healthy)
curl -sf http://localhost:9090/api/v1/alerts | jq '.data.alerts | length'

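# 3. Check Alertmanager cluster status, and that Prometheus lists it as an active Alertmanager
curl -sf http://localhost:9093/api/v2/status | jq '.cluster.status'
curl -sf http://localhost:9090/api/v1/alertmanagers | jq '.data.activeAlertmanagers'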

# 4. Verify Alertmanager config is loaded
curl -sf http://localhost:9093/api/v2/status | jq '.config'
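
The checks above confirm that rules and configuration are loaded, but not that a notification actually reaches Slack. One way to exercise the path from Alertmanager outward is to post a synthetic alert to its v2 API. The sketch below is illustrative: `test_alert_payload` is a hypothetical helper, the alert name is made up, and it assumes routing matches on a `severity` label, which should be confirmed against alertmanager.yml.

```shell
# Hypothetical helper: print a minimal Alertmanager v2 payload
# (a JSON array containing one alert) for the given name and severity.
# The label names are assumptions; match them to your routing rules.
test_alert_payload() {
  alertname=$1
  severity=$2
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"DR validation test alert"}}]\n' \
    "$alertname" "$severity"
}

# Usage (commented out so that sourcing this snippet sends nothing):
# test_alert_payload DrValidationTest warning |
#   curl -sf -X POST -H 'Content-Type: application/json' \
#     --data @- http://localhost:9093/api/v2/alerts
```

No `endsAt` is set, so Alertmanager should treat the alert as resolved once its `resolve_timeout` elapses without the alert being re-sent.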

DR Validation Checklist

Use this checklist during quarterly DR reviews:

  • Latest backup is < 25 hours old
  • Automated verification report shows all checks passed
  • Manual restore to test DB succeeds with correct row counts
  • Full service restart completes within RTO target (< 30 min)
  • All health endpoints respond after restart
  • Prometheus alert rules are loaded (7 groups)
  • Alertmanager is reachable and configured
  • Slack notification channel is receiving test alerts
  • Grafana dashboards show data after restart
  • Typesense search returns results after restart

RPO/RTO Summary

| Metric | Target | Actual (Measured) | Notes |
|--------|--------|-------------------|-------|
| RPO | ≤ 24 hours | ~24h (daily at 02:00 UTC) | Reduce with WAL archiving |
| RTO (local backup) | ≤ 15 minutes | Measure during DR test | Restore + service restart |
| RTO (off-site backup) | ≤ 30 minutes | Measure during DR test | Add transfer time |
| RTO (full host recovery) | ≤ 60 minutes | Measure during DR test | New host + restore + deploy |
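
To fill in the "Actual (Measured)" column during a DR test, the restore or restart command can be wrapped in a timer. This is a minimal sketch: `measure_rto` is a hypothetical helper, and the 900-second figure in the usage comment is simply the 15-minute local-backup target expressed in seconds.

```shell
# Hypothetical helper: run a command, report how long it took, and succeed
# only if the command succeeded within TARGET seconds.
measure_rto() {
  target=$1
  shift
  start=$(date +%s)
  if "$@"; then
    status=0
  else
    status=$?
  fi
  elapsed=$(( $(date +%s) - start ))
  echo "elapsed=${elapsed}s target=${target}s exit=${status}"
  [ "$status" -eq 0 ] && [ "$elapsed" -le "$target" ]
}

# Usage (run against the DR test environment, never production):
# measure_rto 900 /scripts/pg-restore.sh /backups/<latest-backup>.sql.gz
```

The resolution is whole seconds, which is adequate for targets expressed in minutes.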

Appendix: Alert Rules Reference

API & Error Alerts

| Alert | Expression | Severity | Duration |
|-------|------------|----------|----------|
| ApiLatencyP99High | p99 > 1s | Warning | 5 min |
| ApiEndpointLatencyP99High | Per-route p99 > 2s | Warning | 5 min |
| ApiLatencyP99Critical | p99 > 3s (SLO breach) | Critical | 3 min |
| ApiErrorRate5xxHigh | 5xx rate > 1% | Warning | 5 min |
| ApiErrorRate5xxCritical | 5xx rate > 5% | Critical | 3 min |
| ApiNoTraffic | Request rate = 0 | Warning | 10 min |

Database Alerts

| Alert | Expression | Severity | Duration |
|-------|------------|----------|----------|
| PostgresActiveConnectionsHigh | Active connections > 15 | Warning | 5 min |
| PostgresConnectionPoolCritical | Total connections > 180 | Critical | 2 min |
| PostgresSlowQueries | Lock-waiting queries > 5 | Warning | 5 min |
| PostgresDown | API scrape target down | Critical | 1 min |

Redis Alerts

| Alert | Expression | Severity | Duration |
|-------|------------|----------|----------|
| RedisMemoryHigh | Memory usage > 80% | Warning | 5 min |
| RedisMemoryCritical | Memory usage > 95% | Critical | 2 min |
| RedisConnectedClientsHigh | Clients > 150 | Warning | 5 min |
| RedisRejectedConnections | Rejected connections > 0 | Critical | 1 min |

Container Resource Alerts

| Alert | Expression | Severity | Duration |
|-------|------------|----------|----------|
| ContainerRestartLoop | > 3 restarts in 15 min | Critical | 5 min |
| ContainerMemoryHigh | Memory > 85% of limit | Warning | 5 min |
| ContainerCPUThrottled | CPU throttle rate > 0.5s/s | Warning | 10 min |

Disk & Infrastructure Alerts

| Alert | Expression | Severity | Duration |
|-------|------------|----------|----------|
| HostDiskUsageHigh | Root disk > 80% | Warning | 10 min |
| HostDiskUsageCritical | Root disk > 90% | Critical | 5 min |
| ApiHealthCheckFailing | Health probe fails | Critical | 2 min |
| PrometheusTargetDown | Scrape target down | Warning | 5 min |

Backup Alerts

| Alert | Expression | Severity | Duration |
|-------|------------|----------|----------|
| BackupTooOld | Last backup > 25 hours ago | Warning | 5 min |
| BackupVerificationFailed | Verify result = fail | Warning | 1 min |

Alert Routing

Alerts are routed via Alertmanager (monitoring/alertmanager/alertmanager.yml):

| Channel | Routes | Repeat Interval |
|---------|--------|-----------------|
| #sre-oncall (Slack) | All warning alerts | 4 hours |
| #sre-oncall (Slack) | All critical alerts (priority) | 1 hour |
| #infrastructure (Slack) | Backup-related alerts | 6 hours |

Inhibition: Warning alerts are suppressed when a critical alert for the same service is already firing.

Alert rules are defined in monitoring/prometheus/alert-rules.yml and evaluated every 15 seconds.