Ho Ngoc Hai 9409706c58 feat(monitoring): add comprehensive alerting rules, Alertmanager, and DR validation

Expand production monitoring with full alert coverage for database connections,
Redis memory/connections, container resources, disk usage, service health, and
backup integrity. Add Alertmanager service with Slack routing for critical and
warning alerts, and add automated backup verification to the pg-backup cron
schedule. Update runbook with DR validation procedures and quarterly checklist.

- Expand Prometheus alert rules from 4 to 24 alerts across 7 groups
- Add Alertmanager container (prom/alertmanager:v0.27.0) with Slack routing
- Configure inhibition rules (critical suppresses warning for same service)
- Schedule automated backup verification at 04:00 UTC daily
- Add Alertmanager datasource to Grafana provisioning
- Update runbook with Section 9: DR Validation (automated + manual procedures)
- Add SLACK_WEBHOOK_URL and Grafana vars to .env.example

Co-Authored-By: Paperclip <noreply@paperclip.ing>

2026-04-11 20:15:36 +07:00

GoodGo Platform — Production Runbook

Audience: On-call SRE, DevOps engineers, and platform operators.
Last updated: 2026-04-11

Service Inventory
Health Checks
Common Incidents
Recovery Procedures
Escalation Matrix
Monitoring Dashboards
Useful PromQL Queries
Environment Quick Reference
Disaster Recovery Validation

1. Service Inventory

Production Services (`docker-compose.prod.yml`)

Service	Image	Port	Resource Limits	Health Check
api (NestJS)	`ghcr.io/goodgo/goodgo-api`	3001	1 CPU / 1 GB	`GET /health` (node fetch)
web (Next.js)	`ghcr.io/goodgo/goodgo-web`	3000	0.5 CPU / 512 MB	`GET /` (node fetch)
ai-services (FastAPI)	`ghcr.io/goodgo/goodgo-ai-services`	8000	1 CPU / 1 GB	`GET /health` (httpx)
postgres	`postgis/postgis:16-3.4`	5432 (internal)	2 CPU / 2 GB, shm=256m	`pg_isready`
pgbouncer	`edoburu/pgbouncer:1.23.1-p2`	6432 (internal)	0.5 CPU / 256 MB	`pg_isready -p 6432`
redis	`redis:7-alpine`	6379 (internal)	0.5 CPU / 768 MB	`redis-cli ping`
typesense	`typesense/typesense:27.1`	8108 (internal)	1 CPU / 1 GB	`curl /health`
minio	`minio/minio:latest`	9000/9001 (internal)	0.5 CPU / 1 GB	`mc ready local`
pg-backup	`postgis/postgis:16-3.4`	—	0.5 CPU / 512 MB	— (cron daemon)
loki	`grafana/loki:3.0.0`	3100 (internal)	0.5 CPU / 512 MB	`wget /ready`
promtail	`grafana/promtail:3.0.0`	—	0.25 CPU / 256 MB	—
prometheus	`prom/prometheus:v2.51.0`	9090 (internal)	0.5 CPU / 1 GB	`wget /-/healthy`
grafana	`grafana/grafana:10.4.1`	3002 (external)	0.5 CPU / 512 MB	`wget /api/health`
alertmanager	`prom/alertmanager:v0.27.0`	9093 (internal)	0.25 CPU / 256 MB	`wget /-/healthy`

Development-Only Services (`docker-compose.yml`)

Development uses the same data and monitoring services but runs API/Web on the host. The pg-backup service also runs in dev with default credentials.

Service Dependency Chain

web --> api --> pgbouncer --> postgres
                  |-> redis
                  |-> typesense
                  |-> minio
                  |-> ai-services

grafana --> prometheus --> alertmanager
        |-> loki --> promtail (Docker socket)

pg-backup --> postgres

2. Health Checks

Application Health Endpoints

Endpoint	Type	Checks	Expected Response
`GET /health`	Liveness	Process is running	`200 { status: "ok" }`
`GET /health/ready`	Readiness	PostgreSQL + Redis	`200 { status: "ok", info: { database: ..., redis: ... } }`
`GET /health/db`	Database only	PostgreSQL connectivity	`200 { status: "ok", info: { database: ... } }`
`GET /health/redis`	Redis only	Redis connectivity	`200 { status: "ok", info: { redis: ... } }`

Verify All Services Are Healthy

# Quick check — all containers
docker compose -f docker-compose.prod.yml ps --format "table {{.Name}}\t{{.Status}}\t{{.Health}}"

# API liveness
curl -sf http://localhost:3001/health && echo "API OK" || echo "API FAIL"

# API readiness (DB + Redis)
curl -sf http://localhost:3001/health/ready | jq .

# Individual dependency checks
curl -sf http://localhost:3001/health/db | jq .
curl -sf http://localhost:3001/health/redis | jq .

# Typesense
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health

# MinIO
docker exec goodgo-minio mc ready local && echo "MinIO OK"

# AI Services
curl -sf http://localhost:8000/health && echo "AI OK" || echo "AI FAIL"

# PostgreSQL (direct)
docker exec goodgo-postgres pg_isready -U ${DB_USER} -d ${DB_NAME}

# PgBouncer
docker exec goodgo-pgbouncer pg_isready -h 127.0.0.1 -p 6432 -U ${DB_USER}

# Redis
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" ping

# Prometheus
curl -sf http://localhost:9090/-/healthy && echo "Prometheus OK"

# Loki
curl -sf http://localhost:3100/ready && echo "Loki OK"

# Grafana
curl -sf http://localhost:3002/api/health | jq .

# Alertmanager
curl -sf http://localhost:9093/-/healthy && echo "Alertmanager OK"
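
For scripted use (cron, CI smoke tests), the individual HTTP checks above can be folded into one function that reports every endpoint and exits non-zero if any of them fails. This is a minimal sketch: `check_all` is a hypothetical helper name, the 5-second timeout is an assumption, and Typesense is left out because its check needs the API key header.

```bash
# Probe the HTTP health endpoints listed above and summarize the result.
check_all() {
  failed=0
  while IFS='|' read -r name url; do
    # -s silent, -f treat HTTP >= 400 as failure, --max-time bounds a hung service
    if curl -sf --max-time 5 -o /dev/null "$url" </dev/null; then
      echo "OK   $name"
    else
      echo "FAIL $name"
      failed=$((failed + 1))
    fi
  done <<'EOF'
api|http://localhost:3001/health/ready
ai-services|http://localhost:8000/health
prometheus|http://localhost:9090/-/healthy
loki|http://localhost:3100/ready
grafana|http://localhost:3002/api/health
alertmanager|http://localhost:9093/-/healthy
EOF
  # Exit status is non-zero when at least one endpoint failed
  [ "$failed" -eq 0 ]
}
```

Run it as `check_all || echo "one or more services unhealthy"`.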

Container Resource Usage

docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.NetIO}}"

3. Common Incidents

3.1 Database Connection Pool Exhaustion

Symptoms:

API returns 503 or hangs on requests
/health/ready returns unhealthy for database
PgBouncer logs: no more connections allowed or query_wait_timeout
Prometheus: spike in pg_stat_activity active connections

Diagnosis:

# Check PgBouncer pool status
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW POOLS;"
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW CLIENTS;"
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW STATS;"

# Check PostgreSQL active connections
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT state, count(*) FROM pg_stat_activity WHERE datname = '${DB_NAME}' GROUP BY state;"

# Identify long-running queries
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
   FROM pg_stat_activity
   WHERE datname = '${DB_NAME}' AND state != 'idle'
   ORDER BY duration DESC
   LIMIT 10;"

Resolution:

# 1. Kill long-running queries (> 5 minutes)
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE datname = '${DB_NAME}'
     AND state != 'idle'
     AND now() - query_start > interval '5 minutes'
     AND pid <> pg_backend_pid();"

# 2. If pool is fully exhausted, restart PgBouncer
docker compose -f docker-compose.prod.yml restart pgbouncer

# 3. If issue persists, increase pool size temporarily
#    Edit PGBOUNCER_POOL_SIZE in .env, then:
docker compose -f docker-compose.prod.yml up -d --no-deps pgbouncer

PgBouncer Configuration Reference:

Pool mode: transaction (connections returned to pool after each transaction)
Default pool size: 20 server connections per user/db pair
Max client connections: 200
Reserve pool: 5 extra connections (after 3s wait)
Query wait timeout: 120s (error if client waits this long)
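
A persistently non-zero `cl_waiting` value in `SHOW POOLS` suggests that the 20-connection pool (plus the 5 reserve connections) is saturated. The sketch below sums waiting clients across all pools. `pools_waiting` is a hypothetical helper; it locates the column by header name because the column order differs between PgBouncer versions.

```bash
# Sum the cl_waiting column from pipe-separated `SHOW POOLS` output on stdin.
pools_waiting() {
  awk -F'|' '
    NR == 1 {                        # header row: find the cl_waiting column
      for (i = 1; i <= NF; i++) if ($i == "cl_waiting") col = i
      next
    }
    col && NF >= col { total += $col }
    END { print total + 0 }          # prints 0 when the column is absent
  '
}
```

Usage, with psql switched to unaligned output and the row-count footer disabled:

`docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -A -F'|' -P footer=off -c "SHOW POOLS;" | pools_waiting`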

3.2 Redis Connection Failure

Symptoms:

/health/redis returns unhealthy
Increased API response times (cache misses hitting DB)
API logs show Redis connection errors

Diagnosis:

# Check Redis container
docker logs --tail=50 goodgo-redis

# Test connectivity
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" ping
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO server
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO clients
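
To see how close Redis is to its memory ceiling, the `INFO memory` output can be reduced to a single percentage. This is a sketch only; `redis_mem_pct` is a hypothetical helper that reads the `used_memory` and `maxmemory` fields from stdin.

```bash
# Print used_memory as a percentage of maxmemory from `INFO memory` on stdin.
redis_mem_pct() {
  awk -F: '
    { sub(/\r$/, "") }                # redis-cli emits CRLF line endings
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max > 0) printf "%.1f\n", used * 100 / max
      else print "unlimited"          # maxmemory 0 means no limit is configured
    }
  '
}
```

Usage: `docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory | redis_mem_pct`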

Resolution:

# 1. Restart Redis (data persisted via AOF)
docker compose -f docker-compose.prod.yml restart redis

# 2. If OOM — check memory usage
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory | grep used_memory_human
# Max memory is 512 MB (prod), eviction policy: allkeys-lru

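# 3. If AOF is corrupted: repair it from a one-off container, because docker exec
#    cannot run inside a stopped container. Redis 7 keeps the AOF under
#    /data/appendonlydir by default; adjust the path if appenddirname is customized.
docker compose -f docker-compose.prod.yml stop redis
docker compose -f docker-compose.prod.yml run --rm --no-deps redis \
  redis-check-aof --fix /data/appendonlydir/appendonly.aof.manifest
docker compose -f docker-compose.prod.yml start redis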

Graceful Degradation: The API is designed to continue operating when Redis is unavailable. Cache misses fall through to PostgreSQL. Performance will degrade but functionality is preserved. Redis is non-critical for core operations.

3.3 Typesense Unavailable

Symptoms:

Search functionality returns errors or falls back to basic DB search
curl http://localhost:8108/health fails
API logs show Typesense connection timeouts

Diagnosis:

# Check container status
docker logs --tail=50 goodgo-typesense

# Check health
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health

# Check collections
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/collections | jq .

# Check disk space for Typesense data volume
docker system df -v | grep typesense

Resolution:

# 1. Restart Typesense
docker compose -f docker-compose.prod.yml restart typesense

# 2. If data is corrupted — rebuild from PostgreSQL
docker compose -f docker-compose.prod.yml stop typesense
docker volume rm goodgo-platform-ai_typesense_data
docker compose -f docker-compose.prod.yml up -d typesense
# Wait for healthy, then reindex:
docker compose -f docker-compose.prod.yml exec api npx ts-node scripts/typesense-reindex.ts
# Or: pnpm run typesense:reindex

# 3. If volume backup exists — restore
docker compose -f docker-compose.prod.yml stop typesense
docker run --rm -v goodgo-platform-ai_typesense_data:/data -v $(pwd)/backups:/backup \
  alpine sh -c "rm -rf /data/* && tar xzf /backup/typesense_YYYYMMDD_HHMMSS.tar.gz -C /data"
docker compose -f docker-compose.prod.yml start typesense

Fallback Behavior: When Typesense is unavailable, property search falls back to PostgreSQL full-text search with PostGIS geo queries. Search quality degrades but core functionality works.

3.4 High API Latency

Symptoms:

Prometheus alert ApiLatencyP99High fires (p99 > 1s for 5 min)
Critical alert ApiLatencyP99Critical fires (p99 > 3s for 3 min — SLO breach)
Users report slow page loads

Diagnosis:

# 1. Check which endpoints are slow
# Grafana: GoodGo API Latency dashboard
# Or via PromQL:
curl -s "http://localhost:9090/api/v1/query" --data-urlencode \
  'query=topk(10, histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket[5m])) by (le, route, method)))' \
  | jq '.data.result[] | {route: .metric.route, method: .metric.method, p99: .value[1]}'

# 2. Check database slow queries
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT pid, now() - query_start AS duration, left(query, 100) AS query_preview
   FROM pg_stat_activity
   WHERE state = 'active' AND now() - query_start > interval '1 second'
   ORDER BY duration DESC;"

# 3. Check PgBouncer wait times
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW POOLS;"

# 4. Check container resource usage
docker stats --no-stream goodgo-api goodgo-postgres goodgo-redis goodgo-pgbouncer

# 5. Check Redis latency
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" --latency-history -i 3

# 6. Check application logs for errors
docker logs --tail=200 --since=5m goodgo-api 2>&1 | grep -i "error\|timeout\|slow"

Resolution:

# 1. If DB slow queries: terminate them (scoped to the application database, excluding this session)
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE datname = '${DB_NAME}'
     AND state = 'active'
     AND now() - query_start > interval '30 seconds'
     AND pid <> pg_backend_pid();"

# 2. If connection pool exhaustion — see Section 3.1

# 3. If Redis is slow — restart
docker compose -f docker-compose.prod.yml restart redis

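# 4. If the API container was OOM-killed: restart it. A restart alone does not change
#    the 1 GB limit; to raise it, edit the api memory limit in docker-compose.prod.yml
#    and recreate with: docker compose -f docker-compose.prod.yml up -d --no-deps api
docker compose -f docker-compose.prod.yml restart api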

# 5. If specific endpoint is the bottleneck — check Loki logs:
# Grafana > Explore > Loki > {container_name="goodgo-api"} |= "slow"

3.5 Payment Callback Failures

Symptoms:

Users report payments stuck in "pending" state
VNPay/MoMo/ZaloPay IPN callbacks returning errors
Payment reconciliation mismatches

Diagnosis:

# 1. Check payment callback logs
docker logs goodgo-api 2>&1 | grep -i "payment\|callback\|vnpay\|momo\|zalopay" | tail -50

# 2. Check for pending payments in DB
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "SELECT id, provider, status, \"amountVND\", \"createdAt\"
   FROM \"Payment\"
   WHERE status = 'PENDING'
   AND \"createdAt\" > now() - interval '24 hours'
   ORDER BY \"createdAt\" DESC
   LIMIT 20;"

# 3. Verify callback URL is reachable from external networks
curl -sf https://your-domain.com/api/payments/vnpay/callback && echo "Callback URL reachable"

# 4. Check if API is receiving callbacks (via Loki)
# Grafana > Explore > Loki > {container_name="goodgo-api"} |= "callback" |= "payment"

Resolution:

# 1. If callbacks are timing out — check API health and restart if needed
docker compose -f docker-compose.prod.yml restart api

# 2. If VNPay signature verification fails — verify VNPAY_* env vars
docker compose -f docker-compose.prod.yml exec api printenv | grep VNPAY

# 3. For stuck payments — manual reconciliation
#    Check VNPay/MoMo merchant portal for actual transaction status
#    Update payment status in DB if confirmed paid:
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
  "UPDATE \"Payment\" SET status = 'COMPLETED', \"updatedAt\" = now()
   WHERE id = '<payment-id>' AND status = 'PENDING';"

# 4. If callbacks are not reaching the server — check:
#    - Firewall rules (port 3001 or reverse proxy port must be open)
#    - SSL certificate validity
#    - DNS resolution
#    - Payment provider webhook configuration (correct callback URL)

Important: The payment callback handler uses idempotent processing with atomic state transitions. Replaying a callback is safe and will not duplicate payments.

3.6 Disk Space Alerts

Symptoms:

Containers failing to start or crashing
PostgreSQL refusing writes (PANIC: could not write to file)
Docker daemon running out of space

Diagnosis:

# Host disk usage
df -h

# Docker disk usage
docker system df
docker system df -v

# Check individual volume sizes
for vol in $(docker volume ls -q | grep goodgo); do
  echo -n "$vol: "
  docker run --rm -v "${vol}:/data" alpine du -sh /data 2>/dev/null
done

# Check backup volume specifically
docker exec goodgo-pg-backup du -sh /backups/
docker exec goodgo-pg-backup ls -lht /backups/
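
To spot a nearly full filesystem without reading the whole `df` table, the output can be filtered against a threshold. This is a minimal sketch: `disk_over` is a hypothetical helper and the default of 85% is an assumed value, not a documented alert threshold. It expects POSIX-format output (`df -P`) and does not handle mount points that contain spaces.

```bash
# List mount points at or above a usage threshold from `df -P` output on stdin.
disk_over() {
  threshold=${1:-85}
  awk -v t="$threshold" '
    NR > 1 {                          # skip the header row
      pct = $5; sub(/%/, "", pct)     # "91%" -> 91
      if (pct + 0 >= t) print $6, pct "%"
    }
  '
}
```

Usage: `df -P | disk_over 90`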

Resolution:

# 1. Clean up Docker artifacts
docker system prune -f          # Remove stopped containers, unused networks, dangling images
docker image prune -a -f        # Remove ALL unused images (careful in prod)

# 2. Clean old backups (if retention not working)
docker exec goodgo-pg-backup find /backups -name "goodgo_*.sql.gz" -mtime +7 -delete

# 3. Clean Prometheus data (if too large)
# Prometheus retention is 30d (prod) / 15d (dev), configured via --storage.tsdb.retention.time
# To force compaction (the /-/quit endpoint only responds when Prometheus runs with --web.enable-lifecycle):
curl -sf -XPOST http://localhost:9090/-/quit  # Graceful shutdown triggers compaction
docker compose -f docker-compose.prod.yml start prometheus

# 4. Clean Loki data (15-day retention)
# Loki handles its own cleanup via compactor. If urgent:
docker compose -f docker-compose.prod.yml restart loki

# 5. Truncate Docker container logs
sudo truncate -s 0 $(docker inspect --format='{{.LogPath}}' goodgo-api)
# Or for all containers:
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'

Prevention: All production containers use json-file logging with max-size: 10m and max-file: 3-5. Backup retention is 7 days (configurable via BACKUP_RETENTION_DAYS).

3.7 MinIO / Object Storage Failure

Symptoms:

Image/file uploads fail
Property photos not loading
MinIO console inaccessible at port 9001

Diagnosis:

docker logs --tail=50 goodgo-minio
docker exec goodgo-minio mc ready local
docker exec goodgo-minio mc admin info local

Resolution:

# 1. Restart MinIO
docker compose -f docker-compose.prod.yml restart minio

# 2. If data volume corrupted
docker compose -f docker-compose.prod.yml stop minio
docker volume rm goodgo-platform-ai_minio_data  # WARNING: data loss
docker compose -f docker-compose.prod.yml up -d minio
# Recreate buckets via API or admin console

3.8 AI Services Unavailable

Symptoms:

AI-powered features (AVM, property descriptions) fail
GET /health on port 8000 fails
API logs show AI service connection timeouts

Diagnosis:

docker logs --tail=50 goodgo-ai-services
curl -sf http://localhost:8000/health
docker stats --no-stream goodgo-ai-services

Resolution:

# 1. Restart AI services
docker compose -f docker-compose.prod.yml restart ai-services

# 2. Check rate limits (default: 60/minute)
docker compose -f docker-compose.prod.yml exec ai-services printenv | grep AI_RATE_LIMIT

# 3. If OOM — the service has 1 GB limit; may need to increase for large models

Graceful Degradation: AI features are optional. The API should handle AI service unavailability gracefully and return non-AI results.

3.9 Log Pipeline Failure (Loki/Promtail)

Symptoms:

Grafana log explorer returns empty results
Promtail container unhealthy or crash-looping
Loki returning 503

Diagnosis:

docker logs --tail=50 goodgo-loki
docker logs --tail=50 goodgo-promtail
curl -sf http://localhost:3100/ready && echo "Loki ready" || echo "Loki NOT ready"

Resolution:

# 1. Restart the pipeline
docker compose -f docker-compose.prod.yml restart loki promtail

# 2. If Loki data corrupted
docker compose -f docker-compose.prod.yml stop loki promtail
docker volume rm goodgo-platform-ai_loki_data
docker compose -f docker-compose.prod.yml up -d loki promtail
# Historical logs are lost but new logs will flow immediately

# 3. If Promtail can't access Docker socket
ls -la /var/run/docker.sock
# Ensure the promtail container has the Docker socket mounted

3.10 5xx Error Rate Spike

Symptoms:

Prometheus alert ApiErrorRate5xxHigh fires (> 1% 5xx for 5 min)
Users reporting errors

Diagnosis:

# Check which endpoints are returning 5xx
curl -s "http://localhost:9090/api/v1/query" --data-urlencode \
  'query=topk(10, sum(rate(http_requests_total{job="goodgo-api", status_code=~"5.."}[5m])) by (route, method))' \
  | jq '.data.result'

# Check API error logs
docker logs --tail=200 --since=5m goodgo-api 2>&1 | grep -i "error\|exception\|500"

# Check all dependency health
curl -sf http://localhost:3001/health/ready | jq .

Resolution:

If DB-related: see Section 3.1
If Redis-related: see Section 3.2
If recent deployment: see Section 4.4
If unknown: restart API and investigate logs

4. Recovery Procedures

4.1 Database Restore from Backup

Automated backups run daily at 02:00 UTC via the pg-backup container. Retention: 7 days. Format: pg_dump --format=custom --compress=6.

Automated verification runs daily at 04:00 UTC — restores to an isolated test database, verifies table existence, row counts, checksums, PostGIS extension, indexes, and enums. Reports are written to /backups/verify-latest.json.

List Available Backups

# sh -c makes the glob expand inside the container rather than on the host
docker exec goodgo-pg-backup sh -c 'ls -lht /backups/goodgo_*.sql.gz'

Create an On-Demand Backup

docker exec goodgo-pg-backup /scripts/pg-backup.sh

Full Restore Procedure

# 1. Stop application services
docker compose -f docker-compose.prod.yml stop api web ai-services

# 2. (Production) Stop PgBouncer to prevent stale connections
docker compose -f docker-compose.prod.yml stop pgbouncer

# 3. Run the restore script
docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
# The script will:
#   - Terminate active DB connections
#   - DROP and recreate the database
#   - Restore from the backup file

# 4. Verify the restore
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c '\dt'
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c 'SELECT count(*) FROM "User";'

# 5. Apply any pending migrations
docker compose -f docker-compose.prod.yml exec api npx prisma migrate deploy

# 6. Restart all services
docker compose -f docker-compose.prod.yml up -d

# 7. Verify application health
curl -sf http://localhost:3001/health/ready | jq .

Verify a Backup Without Restoring

# Run verification against latest backup (creates temp DB, drops it after)
docker compose run --rm pg-verify-backup

# Or verify a specific backup file
docker exec goodgo-pg-backup /scripts/pg-verify-backup.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz

# Check latest verification report
docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .

RPO/RTO:

RPO: ≤ 24 hours (daily backups; consider WAL archiving for lower RPO)
RTO: ~15 minutes (local volume), ~30 minutes (off-site)

4.2 Redis Cache Flush & Warm-up

# Flush all Redis data
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" FLUSHALL

# Verify flush
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" DBSIZE
# Should return: (integer) 0

Warm-up: Redis uses allkeys-lru eviction. Cache warms naturally as users make requests. No manual warm-up script is needed — cache misses fall through to PostgreSQL.

When to flush:

After database restore (stale cache references)
After data corruption at the application level
After schema changes that alter cached data structures

4.3 Rolling Restart Procedures

Single Service Restart (Zero Downtime)

# API — the --wait flag ensures health check passes before moving on
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api

# Web
docker compose -f docker-compose.prod.yml up -d --no-deps --wait web

# AI Services
docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services

Full Stack Rolling Restart

# Data services first (order matters for dependency chain)
docker compose -f docker-compose.prod.yml restart redis
docker compose -f docker-compose.prod.yml restart typesense

# Wait for data services to be healthy
sleep 10

# Connection pooling
docker compose -f docker-compose.prod.yml restart pgbouncer
sleep 5

# Application services
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services

# Verify
curl -sf http://localhost:3001/health/ready | jq .
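
The fixed `sleep` calls above are a guess at how long each service needs. A more reliable option is to poll Docker's health status until the container reports `healthy`. The sketch below assumes the container defines a health check (per the Service Inventory, pg-backup and promtail do not); `wait_healthy` is a hypothetical helper and the 60-second default is an assumption.

```bash
# Poll a container's health status until it is healthy or the timeout (seconds) expires.
wait_healthy() {
  container=$1
  timeout=${2:-60}
  elapsed=0
  while :; do
    # Empty status means the container is missing or has no health check
    status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null) || status=""
    [ "$status" = "healthy" ] && return 0
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timeout: $container is '${status:-unknown}'" >&2
      return 1
    fi
    sleep 2
    elapsed=$((elapsed + 2))
  done
}
```

Usage: `docker compose -f docker-compose.prod.yml restart redis && wait_healthy goodgo-redis 30`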

Emergency: Restart Everything

docker compose -f docker-compose.prod.yml down
docker compose -f docker-compose.prod.yml up -d --wait

4.4 Rollback Deployment

The CI/CD pipeline (.github/workflows/deploy.yml) supports automatic rollback if production smoke tests fail. For manual rollback:

Quick Rollback (Revert to Previous Images)

# SSH into production host
ssh deploy@$PRODUCTION_HOST

cd ~/goodgo

# Stop current app containers
docker compose -f docker-compose.prod.yml down api web ai-services

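# The previous images are still cached locally.
# Restart without pulling. This only reverts the release if the tag referenced in
# docker-compose.prod.yml still resolves to the previous image on this host;
# otherwise pin IMAGE_TAG as described under "Rollback to a Specific Git Commit / Image Tag".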
docker compose -f docker-compose.prod.yml up -d --wait api web ai-services

# Verify
curl -sf http://localhost:3001/health && echo "Rollback successful"

Rollback to a Specific Git Commit / Image Tag

# Set the target tag (git SHA)
export IMAGE_TAG=<previous-commit-sha>
export REGISTRY_URL=ghcr.io/goodgo

# Pull specific version
docker compose -f docker-compose.prod.yml pull api web ai-services

# Deploy
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services

# Verify
curl -sf http://localhost:3001/health/ready | jq .

Rollback Database Migrations

# WARNING: Prisma does not support automatic down-migrations.
# For migration rollback, restore from the pre-migration backup:

# 1. Stop application
docker compose -f docker-compose.prod.yml stop api web ai-services pgbouncer

# 2. Restore from backup taken before the migration
docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/<pre-migration-backup>.sql.gz

# 3. Deploy the previous code version (older IMAGE_TAG)
export IMAGE_TAG=<previous-commit-sha>
docker compose -f docker-compose.prod.yml up -d --wait

4.5 Typesense Reindex from PostgreSQL

If Typesense data is lost or corrupted, rebuild the search index from PostgreSQL:

# 1. Ensure Typesense is running and healthy
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health

# 2. Run reindex
docker compose -f docker-compose.prod.yml exec api npx ts-node scripts/typesense-reindex.ts
# Or from host:
pnpm run typesense:reindex

# 3. Verify collections
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/collections | jq '.[].name'

4.6 Full Host Recovery

For complete host failure or migration to a new server:

# 1. Provision new host with Docker + Docker Compose
# Requirements: Docker >= 24, Docker Compose v2, 8 GB RAM minimum

# 2. Clone repository and configure
git clone <repo-url> ~/goodgo && cd ~/goodgo
cp .env.example .env
# Edit .env with production secrets (from secrets manager)

# 3. Restore PostgreSQL backup from off-site storage
# Transfer backup file to the new host
scp backups/goodgo_latest.sql.gz deploy@newhost:~/goodgo/backups/

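# 4. Start infrastructure services (pg-backup must be running for the restore script)
docker compose -f docker-compose.prod.yml up -d postgres redis typesense minio pg-backup

# 5. Wait for PostgreSQL to be ready, copy the backup into the pg_backups volume, then restore
docker compose -f docker-compose.prod.yml exec postgres pg_isready
docker cp ~/goodgo/backups/goodgo_latest.sql.gz goodgo-pg-backup:/backups/
docker exec goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_latest.sql.gz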

# 6. Start application services
docker compose -f docker-compose.prod.yml up -d

# 7. Run migrations (if backup predates latest code)
docker compose -f docker-compose.prod.yml exec api npx prisma migrate deploy

# 8. Rebuild Typesense index
pnpm run typesense:reindex

# 9. Flush Redis (stale cache from old host)
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" FLUSHALL

# 10. Verify everything
curl -sf http://localhost:3001/health/ready | jq .
curl -sf http://localhost:3000 > /dev/null && echo "Web OK"

# Expected RTO: ~60 minutes (depends on backup transfer speed)

5. Escalation Matrix

Severity	Condition	First Responder	Escalation	SLA
P0 — Critical	Full outage, data loss, payment corruption	On-call SRE	CTO + CEO within 15 min	Acknowledge: 5 min, Resolve: 1 hour
P1 — High	Partial outage, SLO breach (p99 > 3s), 5xx > 5%	On-call SRE	Engineering lead within 30 min	Acknowledge: 15 min, Resolve: 4 hours
P2 — Medium	Degraded performance, single service down (non-critical), p99 > 1s	On-call SRE	Team lead next business day	Acknowledge: 1 hour, Resolve: 24 hours
P3 — Low	Cosmetic issues, monitoring gaps, non-urgent improvements	Assigned engineer	Sprint planning	Next sprint
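
The metric-based rows of the matrix can be expressed as a small classifier, for example to label a page before it reaches the on-call channel. This is a sketch under stated assumptions: `classify_severity` is a hypothetical helper, it covers only the p99 and 5xx thresholds from the P1 and P2 rows, and the `OK` label is an illustrative placeholder. P0 and P3 depend on conditions that these two metrics cannot express.

```bash
# Map p99 latency (seconds) and 5xx rate (percent) to a severity from the matrix above.
classify_severity() {
  awk -v p99="$1" -v err="$2" 'BEGIN {
    if (p99 > 3 || err > 5) print "P1"   # SLO breach or 5xx > 5%
    else if (p99 > 1)       print "P2"   # degraded performance
    else                    print "OK"   # below both thresholds
  }'
}
```

Usage: `classify_severity 1.4 0.3` prints `P2`.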

Contact Channels

Role	Channel
On-call SRE	Slack `#sre-oncall` + PagerDuty
Engineering Lead	Slack `#engineering`
CTO	Slack DM / Phone (see PagerDuty)
Payment Issues	Slack `#payments` + VNPay/MoMo support portals
Infrastructure	Slack `#infrastructure`

Slack Notifications

The deploy pipeline automatically notifies #deployments (via SLACK_WEBHOOK_URL) on:

Production deploy success
Staging smoke test failure
Production rollback triggered

6. Monitoring Dashboards

All dashboards are provisioned automatically via monitoring/grafana/provisioning/ and are available in the GoodGo folder in Grafana.

Dashboard	Grafana Path	Purpose
API Overview	`api-overview`	Request rates, status codes, active connections
API Latency	`api-latency`	p50/p95/p99 latency by endpoint, latency heatmaps
Database	`database`	PostgreSQL connections, query performance, PgBouncer stats
Search	`search`	Typesense query rates, latency, index sizes
Business Metrics	`business-metrics`	Listings, inquiries, payments, user registrations
Web Vitals	`web-vitals`	Core Web Vitals (LCP, FID, CLS), page load times
Logs	`logs`	Loki log explorer with filters by service, level, correlation ID

Access: http://localhost:3002 (default credentials in .env: GRAFANA_ADMIN_USER / GRAFANA_ADMIN_PASSWORD)

Data Sources:

Prometheus (http://prometheus:9090) — Metrics (default)
Loki (http://loki:3100) — Logs, with correlation ID linking to Prometheus
Alertmanager (http://alertmanager:9093) — Alert state and silences

7. Useful PromQL Queries

API Performance

# Overall p99 latency
histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket{job="goodgo-api"}[5m])) by (le))

# Per-endpoint p99 latency (top 10 slowest)
topk(10, histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket{job="goodgo-api"}[5m])) by (le, route, method)))

# Request rate by status code
sum(rate(http_requests_total{job="goodgo-api"}[5m])) by (status_code)

# 5xx error percentage
(sum(rate(http_requests_total{job="goodgo-api", status_code=~"5.."}[5m])) / sum(rate(http_requests_total{job="goodgo-api"}[5m]))) * 100

Database

# Active connections
pg_stat_activity_count{datname="goodgo", state="active"}

# Connection pool utilization (if PgBouncer metrics are scraped)
# Manual check via: SHOW POOLS in PgBouncer admin console

Infrastructure

# Container memory usage
container_memory_usage_bytes{name=~"goodgo-.*"}

# Container CPU usage
rate(container_cpu_usage_seconds_total{name=~"goodgo-.*"}[5m])

8. Environment Quick Reference

Key Environment Variables

Variable	Required	Description
`DATABASE_URL`	Yes	PostgreSQL via PgBouncer (`postgresql://user:pass@pgbouncer:6432/db`)
`DATABASE_URL_DIRECT`	Yes (prod)	Direct PostgreSQL for migrations (`postgresql://user:pass@postgres:5432/db`)
`JWT_SECRET`	Yes	JWT signing secret
`JWT_REFRESH_SECRET`	Yes	Refresh token signing secret
`REDIS_URL`	Yes	Redis connection (`redis://:password@redis:6379`)
`REDIS_PASSWORD`	Yes (prod)	Redis auth password
`TYPESENSE_API_KEY`	Yes	Typesense admin API key
`MINIO_ACCESS_KEY`	Yes	MinIO root user
`MINIO_SECRET_KEY`	Yes	MinIO root password
`VNPAY_*`	Yes	VNPay payment gateway configuration
`AI_API_KEY`	Yes	AI services authentication
`GRAFANA_ADMIN_USER`	Yes (prod)	Grafana admin username
`GRAFANA_ADMIN_PASSWORD`	Yes (prod)	Grafana admin password
`PGBOUNCER_POOL_SIZE`	No	PgBouncer pool size (default: 20)
`PGBOUNCER_MAX_CLIENT_CONN`	No	Max PgBouncer client connections (default: 200)
`BACKUP_RETENTION_DAYS`	No	Backup retention period (default: 7)
`IMAGE_TAG`	No (prod)	Container image tag (default: `latest`)

Port Map

Port	Service	Exposed
3000	Web (Next.js)	External
3001	API (NestJS)	External
3002	Grafana	External (admin only)
5432	PostgreSQL	Internal
6432	PgBouncer	Internal
6379	Redis	Internal
8000	AI Services	Internal
8108	Typesense	Internal
9000	MinIO API	Internal
9001	MinIO Console	Internal
9090	Prometheus	Internal
9093	Alertmanager	Internal
3100	Loki	Internal

Docker Volumes

Volume	Service	Purpose
`pgdata`	PostgreSQL	Database files
`redis_data`	Redis	AOF persistence
`typesense_data`	Typesense	Search index data
`minio_data`	MinIO	Object storage (images, files)
`pg_backups`	pg-backup	Database backup files
`loki_data`	Loki	Log storage (15-day retention)
`prometheus_data`	Prometheus	Metrics (30-day retention prod / 15-day dev)
`grafana_data`	Grafana	Dashboard state, user preferences

9. Disaster Recovery Validation

Automated Verification

Backup verification runs daily at 04:00 UTC inside the pg-backup container. It restores the latest backup to an isolated test database and checks:

Table existence (all 22 Prisma models)
Row count comparison against live database
Data checksums on critical tables (User, Property, Listing, Payment, Subscription, Transaction, Plan)
PostGIS extension availability
Index count match
Enum type count match

Check latest verification report:

docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .

Check verification logs:

docker exec goodgo-pg-backup cat /var/log/pg-verify.log

Manual DR Validation Procedure

Run this quarterly (or after major schema changes) to validate the full DR process end-to-end.

Step 1: Verify Backups Exist and Are Recent

# List backups with timestamps and sizes (sh -c so the glob expands inside the container)
docker exec goodgo-pg-backup sh -c 'ls -lht /backups/goodgo_*.sql.gz'

# Find the latest backup and confirm from its timestamp that it is < 25 hours old
LATEST=$(docker exec goodgo-pg-backup sh -c 'ls -t /backups/goodgo_*.sql.gz | head -1')
echo "Latest backup: $LATEST"
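
The commands above leave the age comparison to the reader. A minimal sketch of an explicit freshness check follows. The helper names `backup_is_fresh` and `check_latest_backup` are hypothetical, and the sketch assumes `stat -c %Y` (GNU coreutils or BusyBox) is available inside the pg-backup container.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of the GoodGo scripts.

MAX_AGE_HOURS=25

# backup_is_fresh MTIME NOW [MAX_HOURS]
# Succeeds when the backup's age (NOW - MTIME, in epoch seconds) is under the limit.
backup_is_fresh() {
  bif_age=$(( $2 - $1 ))
  bif_limit=$(( ${3:-$MAX_AGE_HOURS} * 3600 ))
  [ "$bif_age" -lt "$bif_limit" ]
}

# check_latest_backup
# Looks up the newest backup inside the container and reports whether it is fresh.
# Assumption: `stat -c %Y` works in the pg-backup image.
check_latest_backup() {
  clb_latest=$(docker exec goodgo-pg-backup sh -c 'ls -t /backups/goodgo_*.sql.gz | head -1') || return 2
  clb_mtime=$(docker exec goodgo-pg-backup stat -c %Y "$clb_latest") || return 2
  if backup_is_fresh "$clb_mtime" "$(date +%s)"; then
    echo "OK: $clb_latest is under ${MAX_AGE_HOURS}h old"
  else
    echo "STALE: $clb_latest is ${MAX_AGE_HOURS}h old or more"
    return 1
  fi
}
```

After sourcing the file, `check_latest_backup` returns a non-zero status for a stale or missing backup, so it can gate a cron job or CI step.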

Step 2: Run Verification Against Latest Backup

# Automated verification (creates temp DB, validates, drops)
docker exec -e REPORT_FILE=/backups/verify-latest.json goodgo-pg-backup \
  /scripts/pg-verify-backup.sh

# Review results
docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .

Expected output: all checks pass, and the restore completes in under 60 seconds for a typical dataset.

Step 3: Test Full Restore (Staging Only)

⚠️ WARNING: Only perform this on a staging or isolated environment. Never on production.

# 1. Create a separate test environment
docker compose -f docker-compose.yml -p goodgo-dr-test up -d postgres

# 2. Wait for PostgreSQL to be ready (poll until pg_isready succeeds)
until docker exec goodgo-dr-test-postgres-1 pg_isready; do sleep 2; done

# 3. Run restore against the test environment
PGHOST=localhost PGPORT=<test-port> PGUSER=goodgo PGPASSWORD=<password> \
  /scripts/pg-restore.sh /backups/<latest-backup>.sql.gz

# 4. Verify key tables
docker exec goodgo-dr-test-postgres-1 psql -U goodgo -d goodgo -c \
  "SELECT count(*) FROM \"User\"; SELECT count(*) FROM \"Property\"; SELECT count(*) FROM \"Listing\";"

# 5. Clean up test environment
docker compose -f docker-compose.yml -p goodgo-dr-test down -v

Step 4: Validate Service Recovery Chain

Test that all services can start from a clean state with restored data:

# 1. Note current service status
docker compose -f docker-compose.prod.yml ps --format "table {{.Name}}\t{{.Status}}\t{{.Health}}"

# 2. Restart all services in dependency order
docker compose -f docker-compose.prod.yml restart postgres
sleep 10  # Wait for PostgreSQL

docker compose -f docker-compose.prod.yml restart pgbouncer redis typesense
sleep 10  # Wait for data services

docker compose -f docker-compose.prod.yml restart api web ai-services
sleep 15  # Wait for application services

# 3. Verify all health checks
curl -sf http://localhost:3001/health/ready | jq .
curl -sf http://localhost:3000 > /dev/null && echo "Web OK"
curl -sf http://localhost:9090/-/healthy && echo "Prometheus OK"
curl -sf http://localhost:9093/-/healthy && echo "Alertmanager OK"
curl -sf http://localhost:3002/api/health | jq .
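
The fixed `sleep` calls above are a guess at how long each tier needs. A hedged alternative is to poll a readiness command with a bounded number of attempts. `wait_until` below is a hypothetical helper, not something shipped with the platform.

```shell
#!/bin/sh
# Sketch only: `wait_until` is a hypothetical helper, not part of the GoodGo scripts.

# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most ATTEMPTS times, sleeping DELAY_SECONDS
# between tries. Returns 0 on the first success, 1 if every attempt fails.
wait_until() {
  wu_attempts="$1"; wu_delay="$2"; shift 2
  wu_try=1
  while [ "$wu_try" -le "$wu_attempts" ]; do
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    wu_try=$(( wu_try + 1 ))
    [ "$wu_try" -le "$wu_attempts" ] && sleep "$wu_delay"
  done
  return 1
}

# Example (commented out so that sourcing this file has no side effects):
#   docker compose -f docker-compose.prod.yml restart postgres
#   wait_until 30 2 docker compose -f docker-compose.prod.yml exec -T postgres pg_isready
#   docker compose -f docker-compose.prod.yml restart api web ai-services
#   wait_until 30 2 curl -sf http://localhost:3001/health/ready
```

Polling proceeds as soon as a service is ready and fails loudly when it never becomes ready, which a fixed sleep cannot do.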

Step 5: Validate Alerting Pipeline

# 1. Check Prometheus is loading alert rules
curl -sf http://localhost:9090/api/v1/rules | jq '.data.groups | length'
# Expected: 7 groups

# 2. Check current alerts (count should be 0 if healthy)
curl -sf http://localhost:9090/api/v1/alerts | jq '.data.alerts | length'

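# 3. Check Prometheus has discovered Alertmanager, then check Alertmanager cluster status
curl -sf http://localhost:9090/api/v1/alertmanagers | jq '.data.activeAlertmanagers'
curl -sf http://localhost:9093/api/v2/status | jq '.cluster'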

# 4. Verify Alertmanager config is loaded
curl -sf http://localhost:9093/api/v2/status | jq '.config'
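
The checks above confirm that rules and configuration are loaded, but they do not exercise delivery to Slack. One way to do that is to post a synthetic alert to Alertmanager's `POST /api/v2/alerts` endpoint. In the sketch below the function names, the alert name `DRValidationTest`, and the `service` label value are assumptions for illustration. It also assumes the routing tree matches on a `severity` label, which should be checked against `alertmanager.yml`.

```shell
#!/bin/sh
# Sketch only: function names and alert labels are assumptions for illustration.

# build_test_alert ALERTNAME SEVERITY
# Prints a JSON array holding one alert, the body format of POST /api/v2/alerts.
build_test_alert() {
  printf '[{"labels":{"alertname":"%s","severity":"%s","service":"dr-validation"},' "$1" "$2"
  printf '"annotations":{"summary":"DR validation test alert, safe to ignore"}}]\n'
}

# send_test_alert [ALERTNAME] [SEVERITY]
# Posts the alert to Alertmanager. No endsAt is set, so Alertmanager resolves
# the alert on its own after its configured resolve_timeout.
send_test_alert() {
  build_test_alert "${1:-DRValidationTest}" "${2:-warning}" |
    curl -sf -X POST -H 'Content-Type: application/json' --data @- \
      "${ALERTMANAGER_URL:-http://localhost:9093}/api/v2/alerts"
}
```

Running `send_test_alert DRValidationTest critical` and watching the Slack channel covers the "receiving test alerts" item in the checklist. Announce the test in the channel first so that nobody treats it as a real page.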

DR Validation Checklist

Use this checklist during quarterly DR reviews:

Latest backup is < 25 hours old
Automated verification report shows all checks passed
Manual restore to test DB succeeds with correct row counts
Full service restart completes within RTO target (< 30 min)
All health endpoints respond after restart
Prometheus alert rules are loaded (7 groups)
Alertmanager is reachable and configured
Slack notification channel is receiving test alerts
Grafana dashboards show data after restart
Typesense search returns results after restart

RPO/RTO Summary

Metric	Target	Actual (Measured)	Notes
RPO	≤ 24 hours	~24h (daily at 02:00 UTC)	Reduce with WAL archiving
RTO — Local backup	≤ 15 minutes	Measure during DR test	Restore + service restart
RTO — Off-site backup	≤ 30 minutes	Measure during DR test	Add transfer time
RTO — Full host recovery	≤ 60 minutes	Measure during DR test	New host + restore + deploy
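
The "Measure during DR test" cells can be filled in by timing each recovery step. A minimal sketch follows, assuming wall-clock seconds are precise enough. `measure_rto` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: `measure_rto` is a hypothetical helper, not part of the GoodGo scripts.

# measure_rto TARGET_SECONDS COMMAND [ARGS...]
# Times COMMAND in wall-clock seconds and prints the result. Succeeds only
# when COMMAND succeeds and the elapsed time is within TARGET_SECONDS.
measure_rto() {
  mr_target="$1"; shift
  mr_start=$(date +%s)
  mr_status=0
  "$@" || mr_status=$?
  mr_elapsed=$(( $(date +%s) - mr_start ))
  echo "elapsed=${mr_elapsed}s target=${mr_target}s exit=${mr_status}"
  [ "$mr_status" -eq 0 ] && [ "$mr_elapsed" -le "$mr_target" ]
}

# Example: time a local restore against the 15-minute (900 s) target.
#   measure_rto 900 /scripts/pg-restore.sh /backups/<latest-backup>.sql.gz
```

Record the printed `elapsed` value in the "Actual (Measured)" column after each quarterly test.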

Appendix: Alert Rules Reference

API & Error Alerts

Alert	Expression	Severity	Duration
`ApiLatencyP99High`	p99 > 1s	Warning	5 min
`ApiEndpointLatencyP99High`	Per-route p99 > 2s	Warning	5 min
`ApiLatencyP99Critical`	p99 > 3s (SLO breach)	Critical	3 min
`ApiErrorRate5xxHigh`	5xx rate > 1%	Warning	5 min
`ApiErrorRate5xxCritical`	5xx rate > 5%	Critical	3 min
`ApiNoTraffic`	Request rate = 0	Warning	10 min

Database Alerts

Alert	Expression	Severity	Duration
`PostgresActiveConnectionsHigh`	Active connections > 15	Warning	5 min
`PostgresConnectionPoolCritical`	Total connections > 180	Critical	2 min
`PostgresSlowQueries`	Lock-waiting queries > 5	Warning	5 min
`PostgresDown`	API scrape target down	Critical	1 min

Redis Alerts

Alert	Expression	Severity	Duration
`RedisMemoryHigh`	Memory usage > 80%	Warning	5 min
`RedisMemoryCritical`	Memory usage > 95%	Critical	2 min
`RedisConnectedClientsHigh`	Clients > 150	Warning	5 min
`RedisRejectedConnections`	Rejected connections > 0	Critical	1 min

Container Resource Alerts

Alert	Expression	Severity	Duration
`ContainerRestartLoop`	> 3 restarts in 15 min	Critical	5 min
`ContainerMemoryHigh`	Memory > 85% of limit	Warning	5 min
`ContainerCPUThrottled`	CPU throttle rate > 0.5s/s	Warning	10 min

Disk & Infrastructure Alerts

Alert	Expression	Severity	Duration
`HostDiskUsageHigh`	Root disk > 80%	Warning	10 min
`HostDiskUsageCritical`	Root disk > 90%	Critical	5 min
`ApiHealthCheckFailing`	Health probe fails	Critical	2 min
`PrometheusTargetDown`	Scrape target down	Warning	5 min

Backup Alerts

Alert	Expression	Severity	Duration
`BackupTooOld`	Last backup > 25 hours ago	Warning	5 min
`BackupVerificationFailed`	Verify result = fail	Warning	1 min

Alert Routing

Alerts are routed via Alertmanager (`monitoring/alertmanager/alertmanager.yml`):

Channel	Routes	Repeat Interval
`#sre-oncall` (Slack)	All warning alerts	4 hours
`#sre-oncall` (Slack)	All critical alerts (priority)	1 hour
`#infrastructure` (Slack)	Backup-related alerts	6 hours

Inhibition: Warning alerts are suppressed when a critical alert for the same service is already firing.

Alert rules are defined in `monitoring/prometheus/alert-rules.yml` and evaluated every 15 seconds.
