GoodGo Platform — Production Runbook
Audience: On-call SRE, DevOps engineers, and platform operators.
Last updated: 2026-04-11
Table of Contents
- Service Inventory
- Health Checks
- Common Incidents
- Recovery Procedures
- Escalation Matrix
- Monitoring Dashboards
- Useful PromQL Queries
- Environment Quick Reference
- Disaster Recovery Validation
1. Service Inventory
Production Services (docker-compose.prod.yml)
| Service | Image | Port | Resource Limits | Health Check |
|---|---|---|---|---|
| api (NestJS) | ghcr.io/goodgo/goodgo-api | 3001 | 1 CPU / 1 GB | GET /health (node fetch) |
| web (Next.js) | ghcr.io/goodgo/goodgo-web | 3000 | 0.5 CPU / 512 MB | GET / (node fetch) |
| ai-services (FastAPI) | ghcr.io/goodgo/goodgo-ai-services | 8000 | 1 CPU / 1 GB | GET /health (httpx) |
| postgres | postgis/postgis:16-3.4 | 5432 (internal) | 2 CPU / 2 GB, shm=256m | pg_isready |
| pgbouncer | edoburu/pgbouncer:1.23.1-p2 | 6432 (internal) | 0.5 CPU / 256 MB | pg_isready -p 6432 |
| redis | redis:7-alpine | 6379 (internal) | 0.5 CPU / 768 MB | redis-cli ping |
| typesense | typesense/typesense:27.1 | 8108 (internal) | 1 CPU / 1 GB | curl /health |
| minio | minio/minio:latest | 9000/9001 (internal) | 0.5 CPU / 1 GB | mc ready local |
| pg-backup | postgis/postgis:16-3.4 | - | 0.5 CPU / 512 MB | - (cron daemon) |
| loki | grafana/loki:3.0.0 | 3100 (internal) | 0.5 CPU / 512 MB | wget /ready |
| promtail | grafana/promtail:3.0.0 | - | 0.25 CPU / 256 MB | - |
| prometheus | prom/prometheus:v2.51.0 | 9090 (internal) | 0.5 CPU / 1 GB | wget /-/healthy |
| grafana | grafana/grafana:10.4.1 | 3002 (external) | 0.5 CPU / 512 MB | wget /api/health |
| alertmanager | prom/alertmanager:v0.27.0 | 9093 (internal) | 0.25 CPU / 256 MB | wget /-/healthy |
Development-Only Services (docker-compose.yml)
Development uses the same data and monitoring services but runs API/Web on the host. The pg-backup service also runs in dev with default credentials.
Service Dependency Chain
web --> api --> pgbouncer --> postgres
            |-> redis
            |-> typesense
            |-> minio
            |-> ai-services

grafana --> prometheus --> alertmanager
        |-> loki --> promtail (Docker socket)

pg-backup --> postgres
2. Health Checks
Application Health Endpoints
| Endpoint | Type | Checks | Expected Response |
|---|---|---|---|
| GET /health | Liveness | Process is running | 200 { status: "ok" } |
| GET /health/ready | Readiness | PostgreSQL + Redis | 200 { status: "ok", info: { database: ..., redis: ... } } |
| GET /health/db | Database only | PostgreSQL connectivity | 200 { status: "ok", info: { database: ... } } |
| GET /health/redis | Redis only | Redis connectivity | 200 { status: "ok", info: { redis: ... } } |
Verify All Services Are Healthy
# Quick check — all containers
docker compose -f docker-compose.prod.yml ps --format "table {{.Name}}\t{{.Status}}\t{{.Health}}"
# API liveness
curl -sf http://localhost:3001/health && echo "API OK" || echo "API FAIL"
# API readiness (DB + Redis)
curl -sf http://localhost:3001/health/ready | jq .
# Individual dependency checks
curl -sf http://localhost:3001/health/db | jq .
curl -sf http://localhost:3001/health/redis | jq .
# Typesense
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health
# MinIO
docker exec goodgo-minio mc ready local && echo "MinIO OK"
# AI Services
curl -sf http://localhost:8000/health && echo "AI OK" || echo "AI FAIL"
# PostgreSQL (direct)
docker exec goodgo-postgres pg_isready -U ${DB_USER} -d ${DB_NAME}
# PgBouncer
docker exec goodgo-pgbouncer pg_isready -h 127.0.0.1 -p 6432 -U ${DB_USER}
# Redis
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" ping
# Prometheus
curl -sf http://localhost:9090/-/healthy && echo "Prometheus OK"
# Loki
curl -sf http://localhost:3100/ready && echo "Loki OK"
# Grafana
curl -sf http://localhost:3002/api/health | jq .
# Alertmanager
curl -sf http://localhost:9093/-/healthy && echo "Alertmanager OK"
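The HTTP checks above can be folded into a single sweep that prints one line per endpoint. The following is a minimal sketch, not a script from the repository: the function names check_http and sweep_endpoints are illustrative, and the URLs are copied from the commands in this section. Typesense is left out because its check needs the API key header.

```shell
# Sketch only: check_http and sweep_endpoints are illustrative names.

# Returns 0 when the URL answers with a 2xx status within 5 seconds.
check_http() {
  curl -sf --max-time 5 -o /dev/null "$1"
}

# Prints one OK/FAIL line per endpoint; returns non-zero if any check failed.
sweep_endpoints() {
  local failures=0 name url
  while IFS='|' read -r name url; do
    if check_http "$url" </dev/null; then
      echo "OK   $name"
    else
      echo "FAIL $name ($url)"
      failures=$((failures + 1))
    fi
  done <<'EOF'
api-liveness|http://localhost:3001/health
api-readiness|http://localhost:3001/health/ready
ai-services|http://localhost:8000/health
prometheus|http://localhost:9090/-/healthy
loki|http://localhost:3100/ready
grafana|http://localhost:3002/api/health
alertmanager|http://localhost:9093/-/healthy
EOF
  [ "$failures" -eq 0 ]
}
```

A possible call is `sweep_endpoints || echo "at least one service is unhealthy"`. Keeping the probe in its own function makes it easy to swap curl for another client.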
Container Resource Usage
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.NetIO}}"
3. Common Incidents
3.1 Database Connection Pool Exhaustion
Symptoms:
- API returns 503 or hangs on requests
- /health/ready returns unhealthy for database
- PgBouncer logs: "no more connections allowed" or "query_wait_timeout"
- Prometheus: spike in pg_stat_activity active connections
Diagnosis:
# Check PgBouncer pool status
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW POOLS;"
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW CLIENTS;"
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW STATS;"
# Check PostgreSQL active connections
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
"SELECT state, count(*) FROM pg_stat_activity WHERE datname = '${DB_NAME}' GROUP BY state;"
# Identify long-running queries
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
"SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE datname = '${DB_NAME}' AND state != 'idle'
ORDER BY duration DESC
LIMIT 10;"
Resolution:
# 1. Kill long-running queries (> 5 minutes)
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
"SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = '${DB_NAME}'
AND state != 'idle'
AND now() - query_start > interval '5 minutes'
AND pid <> pg_backend_pid();"
# 2. If pool is fully exhausted, restart PgBouncer
docker compose -f docker-compose.prod.yml restart pgbouncer
# 3. If issue persists, increase pool size temporarily
# Edit PGBOUNCER_POOL_SIZE in .env, then:
docker compose -f docker-compose.prod.yml up -d --no-deps pgbouncer
PgBouncer Configuration Reference:
- Pool mode: transaction (connections returned to pool after each transaction)
- Default pool size: 20 server connections per user/db pair
- Max client connections: 200
- Reserve pool: 5 extra connections (after 3s wait)
- Query wait timeout: 120s (error if client waits this long)
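With the defaults listed above, one user/db pair can hold at most 20 + 5 = 25 server connections, while up to 200 clients may be attached to PgBouncer. The sketch below turns that arithmetic into two small helpers for reading SHOW POOLS output. The function names are hypothetical; sv_active and cl_waiting are the column names that SHOW POOLS reports.

```shell
# Hypothetical helper: percentage of server connections in use for one pool.
# Arguments: sv_active, pool size (default 20), reserve pool size (default 5).
pool_utilization_pct() {
  local sv_active="$1" pool_size="${2:-20}" reserve="${3:-5}"
  echo $(( sv_active * 100 / (pool_size + reserve) ))
}

# Hypothetical helper: a pool is treated as saturated once clients are queuing.
# Argument: cl_waiting.
pool_is_saturated() {
  [ "$1" -gt 0 ]
}
```

For example, `pool_utilization_pct 20` prints 80: the regular pool is full and only the reserve connections remain. A non-zero cl_waiting for longer than a few seconds is usually the more direct sign of exhaustion.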
3.2 Redis Connection Failure
Symptoms:
- /health/redis returns unhealthy
- Increased API response times (cache misses hitting DB)
- API logs show Redis connection errors
Diagnosis:
# Check Redis container
docker logs --tail=50 goodgo-redis
# Test connectivity
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" ping
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO server
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO clients
Resolution:
# 1. Restart Redis (data persisted via AOF)
docker compose -f docker-compose.prod.yml restart redis
# 2. If OOM — check memory usage
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory | grep used_memory_human
# Max memory is 512 MB (prod), eviction policy: allkeys-lru
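# 3. If AOF is corrupted (use a one-off container: docker exec cannot target a stopped container)
docker compose -f docker-compose.prod.yml stop redis
docker compose -f docker-compose.prod.yml run --rm --no-deps redis redis-check-aof --fix /data/appendonly.aof
# On Redis 7 the AOF usually lives under /data/appendonlydir/; pass the manifest file there if /data/appendonly.aof does not exist
docker compose -f docker-compose.prod.yml start redis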
Graceful Degradation: The API is designed to continue operating when Redis is unavailable. Cache misses fall through to PostgreSQL. Performance will degrade but functionality is preserved. Redis is non-critical for core operations.
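To judge how close Redis is to its 512 MB limit, the used_memory and maxmemory fields of INFO memory can be compared directly. This is a rough sketch; redis_memory_pct is an illustrative name, and the function only assumes the standard field:value layout that INFO prints.

```shell
# Reads `INFO memory` output on stdin and prints used memory as a
# percentage of maxmemory, or "unlimited" when maxmemory is 0 or absent.
redis_memory_pct() {
  # INFO lines end in CRLF, so strip carriage returns before parsing.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max > 0) printf "%d\n", used * 100 / max
      else print "unlimited"
    }'
}
```

One way to call it: `docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory | redis_memory_pct`. With allkeys-lru, a value that stays near 100 means keys are being evicted continuously.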
3.3 Typesense Unavailable
Symptoms:
- Search functionality returns errors or falls back to basic DB search
- curl http://localhost:8108/health fails
- API logs show Typesense connection timeouts
Diagnosis:
# Check container status
docker logs --tail=50 goodgo-typesense
# Check health
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health
# Check collections
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/collections | jq .
# Check disk space for Typesense data volume
docker system df -v | grep typesense
Resolution:
# 1. Restart Typesense
docker compose -f docker-compose.prod.yml restart typesense
# 2. If data is corrupted — rebuild from PostgreSQL
docker compose -f docker-compose.prod.yml stop typesense
docker volume rm goodgo-platform-ai_typesense_data
docker compose -f docker-compose.prod.yml up -d typesense
# Wait for healthy, then reindex:
docker compose -f docker-compose.prod.yml exec api npx ts-node scripts/typesense-reindex.ts
# Or: pnpm run typesense:reindex
# 3. If volume backup exists — restore
docker compose -f docker-compose.prod.yml stop typesense
docker run --rm -v goodgo-platform-ai_typesense_data:/data -v $(pwd)/backups:/backup \
alpine sh -c "rm -rf /data/* && tar xzf /backup/typesense_YYYYMMDD_HHMMSS.tar.gz -C /data"
docker compose -f docker-compose.prod.yml start typesense
Fallback Behavior: When Typesense is unavailable, property search falls back to PostgreSQL full-text search with PostGIS geo queries. Search quality degrades but core functionality works.
3.4 High API Latency
Symptoms:
- Prometheus alert ApiLatencyP99High fires (p99 > 1s for 5 min)
- Critical alert ApiLatencyP99Critical fires (p99 > 3s for 3 min; SLO breach)
- Users report slow page loads
Diagnosis:
# 1. Check which endpoints are slow
# Grafana: GoodGo API Latency dashboard
# Or via PromQL:
curl -s "http://localhost:9090/api/v1/query" --data-urlencode \
'query=topk(10, histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket[5m])) by (le, route, method)))' \
| jq '.data.result[] | {route: .metric.route, method: .metric.method, p99: .value[1]}'
# 2. Check database slow queries
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
"SELECT pid, now() - query_start AS duration, left(query, 100) AS query_preview
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '1 second'
ORDER BY duration DESC;"
# 3. Check PgBouncer wait times
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW POOLS;"
# 4. Check container resource usage
docker stats --no-stream goodgo-api goodgo-postgres goodgo-redis goodgo-pgbouncer
# 5. Check Redis latency
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" --latency-history -i 3
# 6. Check application logs for errors
docker logs --tail=200 --since=5m goodgo-api 2>&1 | grep -i "error\|timeout\|slow"
Resolution:
# 1. If DB slow queries — terminate them
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
"SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = '${DB_NAME}'
AND state = 'active'
AND now() - query_start > interval '30 seconds'
AND pid <> pg_backend_pid();"
# 2. If connection pool exhaustion — see Section 3.1
# 3. If Redis is slow — restart
docker compose -f docker-compose.prod.yml restart redis
# 4. If API container OOM — restart with more memory
docker compose -f docker-compose.prod.yml restart api
# 5. If specific endpoint is the bottleneck — check Loki logs:
# Grafana > Explore > Loki > {container_name="goodgo-api"} |= "slow"
3.5 Payment Callback Failures
Symptoms:
- Users report payments stuck in "pending" state
- VNPay/MoMo/ZaloPay IPN callbacks returning errors
- Payment reconciliation mismatches
Diagnosis:
# 1. Check payment callback logs
docker logs goodgo-api 2>&1 | grep -i "payment\|callback\|vnpay\|momo\|zalopay" | tail -50
# 2. Check for pending payments in DB
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
"SELECT id, provider, status, \"amountVND\", \"createdAt\"
FROM \"Payment\"
WHERE status = 'PENDING'
AND \"createdAt\" > now() - interval '24 hours'
ORDER BY \"createdAt\" DESC
LIMIT 20;"
# 3. Verify callback URL is reachable from external networks
curl -sf https://your-domain.com/api/payments/vnpay/callback && echo "Callback URL reachable"
# 4. Check if API is receiving callbacks (via Loki)
# Grafana > Explore > Loki > {container_name="goodgo-api"} |= "callback" |= "payment"
Resolution:
# 1. If callbacks are timing out — check API health and restart if needed
docker compose -f docker-compose.prod.yml restart api
# 2. If VNPay signature verification fails — verify VNPAY_* env vars
docker compose -f docker-compose.prod.yml exec api printenv | grep VNPAY
# 3. For stuck payments — manual reconciliation
# Check VNPay/MoMo merchant portal for actual transaction status
# Update payment status in DB if confirmed paid:
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
"UPDATE \"Payment\" SET status = 'COMPLETED', \"updatedAt\" = now()
WHERE id = '<payment-id>' AND status = 'PENDING';"
# 4. If callbacks are not reaching the server — check:
# - Firewall rules (port 3001 or reverse proxy port must be open)
# - SSL certificate validity
# - DNS resolution
# - Payment provider webhook configuration (correct callback URL)
Important: The payment callback handler uses idempotent processing with atomic state transitions. Replaying a callback is safe and will not duplicate payments.
3.6 Disk Space Alerts
Symptoms:
- Containers failing to start or crashing
- PostgreSQL refusing writes (PANIC: could not write to file)
- Docker daemon running out of space
Diagnosis:
# Host disk usage
df -h
# Docker disk usage
docker system df
docker system df -v
# Check individual volume sizes
for vol in $(docker volume ls -q | grep goodgo); do
echo -n "$vol: "
docker run --rm -v "${vol}:/data" alpine du -sh /data 2>/dev/null
done
# Check backup volume specifically
docker exec goodgo-pg-backup du -sh /backups/
docker exec goodgo-pg-backup ls -lht /backups/
Resolution:
# 1. Clean up Docker artifacts
docker system prune -f # Remove stopped containers, unused networks, dangling images
docker image prune -a -f # Remove ALL unused images (careful in prod)
# 2. Clean old backups (if retention not working)
docker exec goodgo-pg-backup find /backups -name "goodgo_*.sql.gz" -mtime +7 -delete
# 3. Clean Prometheus data (if too large)
# Prometheus retention is 30d (prod) / 15d (dev) — configured via --storage.tsdb.retention.time
# To force compaction:
curl -sf -XPOST http://localhost:9090/-/quit # Graceful shutdown triggers compaction (endpoint requires --web.enable-lifecycle)
docker compose -f docker-compose.prod.yml start prometheus
# 4. Clean Loki data (15-day retention)
# Loki handles its own cleanup via compactor. If urgent:
docker compose -f docker-compose.prod.yml restart loki
# 5. Truncate Docker container logs
sudo truncate -s 0 $(docker inspect --format='{{.LogPath}}' goodgo-api)
# Or for all containers:
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'
Prevention: All production containers use json-file logging with max-size: 10m and max-file: 3-5. Backup retention is 7 days (configurable via BACKUP_RETENTION_DAYS).
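A simple threshold check on df output can catch a filling disk before containers start failing. The sketch below is illustrative only: disks_over_threshold is a hypothetical name, and the 85% default is an assumed threshold, not a value taken from the alert rules.

```shell
# Reads `df -P` output on stdin and prints every mount point whose usage
# is at or above the threshold (percent, default 85 as an assumption).
# Mount points containing spaces are not handled.
disks_over_threshold() {
  local threshold="${1:-85}"
  awk -v limit="$threshold" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)
    if (pct + 0 >= limit + 0) print $6, pct "%"
  }'
}
```

A possible call is `df -P | disks_over_threshold 85`. Empty output means every filesystem is below the threshold.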
3.7 MinIO / Object Storage Failure
Symptoms:
- Image/file uploads fail
- Property photos not loading
- MinIO console inaccessible at port 9001
Diagnosis:
docker logs --tail=50 goodgo-minio
docker exec goodgo-minio mc ready local
docker exec goodgo-minio mc admin info local
Resolution:
# 1. Restart MinIO
docker compose -f docker-compose.prod.yml restart minio
# 2. If data volume corrupted
docker compose -f docker-compose.prod.yml stop minio
docker volume rm goodgo-platform-ai_minio_data # WARNING: data loss
docker compose -f docker-compose.prod.yml up -d minio
# Recreate buckets via API or admin console
3.8 AI Services Unavailable
Symptoms:
- AI-powered features (AVM, property descriptions) fail
- GET /health on port 8000 fails
- API logs show AI service connection timeouts
Diagnosis:
docker logs --tail=50 goodgo-ai-services
curl -sf http://localhost:8000/health
docker stats --no-stream goodgo-ai-services
Resolution:
# 1. Restart AI services
docker compose -f docker-compose.prod.yml restart ai-services
# 2. Check rate limits (default: 60/minute)
docker compose -f docker-compose.prod.yml exec ai-services printenv | grep AI_RATE_LIMIT
# 3. If OOM — the service has 1 GB limit; may need to increase for large models
Graceful Degradation: AI features are optional. The API should handle AI service unavailability gracefully and return non-AI results.
3.9 Log Pipeline Failure (Loki/Promtail)
Symptoms:
- Grafana log explorer returns empty results
- Promtail container unhealthy or crash-looping
- Loki returning 503
Diagnosis:
docker logs --tail=50 goodgo-loki
docker logs --tail=50 goodgo-promtail
curl -sf http://localhost:3100/ready && echo "Loki ready" || echo "Loki NOT ready"
Resolution:
# 1. Restart the pipeline
docker compose -f docker-compose.prod.yml restart loki promtail
# 2. If Loki data corrupted
docker compose -f docker-compose.prod.yml stop loki promtail
docker volume rm goodgo-platform-ai_loki_data
docker compose -f docker-compose.prod.yml up -d loki promtail
# Historical logs are lost but new logs will flow immediately
# 3. If Promtail can't access Docker socket
ls -la /var/run/docker.sock
# Ensure the promtail container has the Docker socket mounted
3.10 5xx Error Rate Spike
Symptoms:
- Prometheus alert ApiErrorRate5xxHigh fires (> 1% 5xx for 5 min)
- Users reporting errors
Diagnosis:
# Check which endpoints are returning 5xx
curl -s "http://localhost:9090/api/v1/query" --data-urlencode \
'query=topk(10, sum(rate(http_requests_total{job="goodgo-api", status_code=~"5.."}[5m])) by (route, method))' \
| jq '.data.result'
# Check API error logs
docker logs --tail=200 --since=5m goodgo-api 2>&1 | grep -i "error\|exception\|500"
# Check all dependency health
curl -sf http://localhost:3001/health/ready | jq .
Resolution:
- If DB-related: see Section 3.1
- If Redis-related: see Section 3.2
- If recent deployment: see Section 4.4
- If unknown: restart API and investigate logs
4. Recovery Procedures
4.1 Database Restore from Backup
Automated backups run daily at 02:00 UTC via the pg-backup container. Retention: 7 days. Format: pg_dump --format=custom --compress=6.
Automated verification runs daily at 04:00 UTC — restores to an isolated test database, verifies table existence, row counts, checksums, PostGIS extension, indexes, and enums. Reports are written to /backups/verify-latest.json.
List Available Backups
# sh -c lets the glob expand inside the container rather than on the host
docker exec goodgo-pg-backup sh -c 'ls -lht /backups/goodgo_*.sql.gz'
Create an On-Demand Backup
docker exec goodgo-pg-backup /scripts/pg-backup.sh
Full Restore Procedure
# 1. Stop application services
docker compose -f docker-compose.prod.yml stop api web ai-services
# 2. (Production) Stop PgBouncer to prevent stale connections
docker compose -f docker-compose.prod.yml stop pgbouncer
# 3. Run the restore script
docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
# The script will:
# - Terminate active DB connections
# - DROP and recreate the database
# - Restore from the backup file
# 4. Verify the restore
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c '\dt'
docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c 'SELECT count(*) FROM "User";'
# 5. Apply any pending migrations
docker compose -f docker-compose.prod.yml exec api npx prisma migrate deploy
# 6. Restart all services
docker compose -f docker-compose.prod.yml up -d
# 7. Verify application health
curl -sf http://localhost:3001/health/ready | jq .
Verify a Backup Without Restoring
# Run verification against latest backup (creates temp DB, drops it after)
docker compose run --rm pg-verify-backup
# Or verify a specific backup file
docker exec goodgo-pg-backup /scripts/pg-verify-backup.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
# Check latest verification report
docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .
RPO/RTO:
- RPO: ≤ 24 hours (daily backups; consider WAL archiving for lower RPO)
- RTO: ~15 minutes (local volume), ~30 minutes (off-site)
4.2 Redis Cache Flush & Warm-up
# Flush all Redis data
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" FLUSHALL
# Verify flush
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" DBSIZE
# Should return: (integer) 0
Warm-up: Redis uses allkeys-lru eviction. Cache warms naturally as users make requests. No manual warm-up script is needed — cache misses fall through to PostgreSQL.
When to flush:
- After database restore (stale cache references)
- After data corruption at the application level
- After schema changes that alter cached data structures
4.3 Rolling Restart Procedures
Single Service Restart (Zero Downtime)
# API — the --wait flag ensures health check passes before moving on
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
# Web
docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
# AI Services
docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services
Full Stack Rolling Restart
# Data services first (order matters for dependency chain)
docker compose -f docker-compose.prod.yml restart redis
docker compose -f docker-compose.prod.yml restart typesense
# Wait for data services to be healthy
sleep 10
# Connection pooling
docker compose -f docker-compose.prod.yml restart pgbouncer
sleep 5
# Application services
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services
# Verify
curl -sf http://localhost:3001/health/ready | jq .
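The fixed sleeps in the sequence above can be replaced by polling until a container actually reports healthy. This is a sketch under assumptions: wait_until and container_healthy are illustrative names, and the health status is read from docker inspect, which only works for services that define a health check.

```shell
# Retries a command until it succeeds or the timeout (seconds) elapses.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  local timeout="$1" interval="$2" deadline
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "Timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Succeeds when Docker reports the container's health check as "healthy".
container_healthy() {
  [ "$(docker inspect --format '{{.State.Health.Status}}' "$1" 2>/dev/null)" = "healthy" ]
}
```

For example, `wait_until 60 2 container_healthy goodgo-redis` could stand in for `sleep 10` after restarting Redis. It returns as soon as the service is ready and fails loudly if it never becomes healthy.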
Emergency: Restart Everything
docker compose -f docker-compose.prod.yml down
docker compose -f docker-compose.prod.yml up -d --wait
4.4 Rollback Deployment
The CI/CD pipeline (.github/workflows/deploy.yml) supports automatic rollback if production smoke tests fail. For manual rollback:
Quick Rollback (Revert to Previous Images)
# SSH into production host
ssh deploy@$PRODUCTION_HOST
cd ~/goodgo
# Stop current app containers
docker compose -f docker-compose.prod.yml down api web ai-services
# The previous images are still cached locally
# Restart without pulling — uses last-known-good images
docker compose -f docker-compose.prod.yml up -d --wait api web ai-services
# Verify
curl -sf http://localhost:3001/health && echo "Rollback successful"
Rollback to a Specific Git Commit / Image Tag
# Set the target tag (git SHA)
export IMAGE_TAG=<previous-commit-sha>
export REGISTRY_URL=ghcr.io/goodgo
# Pull specific version
docker compose -f docker-compose.prod.yml pull api web ai-services
# Deploy
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services
# Verify
curl -sf http://localhost:3001/health/ready | jq .
Rollback Database Migrations
# WARNING: Prisma does not support automatic down-migrations.
# For migration rollback, restore from the pre-migration backup:
# 1. Stop application
docker compose -f docker-compose.prod.yml stop api web ai-services pgbouncer
# 2. Restore from backup taken before the migration
docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/<pre-migration-backup>.sql.gz
# 3. Deploy the previous code version (older IMAGE_TAG)
export IMAGE_TAG=<previous-commit-sha>
docker compose -f docker-compose.prod.yml up -d --wait
4.5 Typesense Reindex from PostgreSQL
If Typesense data is lost or corrupted, rebuild the search index from PostgreSQL:
# 1. Ensure Typesense is running and healthy
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health
# 2. Run reindex
docker compose -f docker-compose.prod.yml exec api npx ts-node scripts/typesense-reindex.ts
# Or from host:
pnpm run typesense:reindex
# 3. Verify collections
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/collections | jq '.[].name'
4.6 Full Host Recovery
For complete host failure or migration to a new server:
# 1. Provision new host with Docker + Docker Compose
# Requirements: Docker >= 24, Docker Compose v2, 8 GB RAM minimum
# 2. Clone repository and configure
git clone <repo-url> ~/goodgo && cd ~/goodgo
cp .env.example .env
# Edit .env with production secrets (from secrets manager)
# 3. Restore PostgreSQL backup from off-site storage
# Transfer backup file to the new host
scp backups/goodgo_latest.sql.gz deploy@newhost:~/goodgo/backups/
# 4. Start infrastructure services (pg-backup must be running for the restore script in step 5)
docker compose -f docker-compose.prod.yml up -d postgres redis typesense minio pg-backup
# 5. Wait for PostgreSQL to be ready, then restore
docker compose -f docker-compose.prod.yml exec postgres pg_isready
docker exec goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_latest.sql.gz
# 6. Start application services
docker compose -f docker-compose.prod.yml up -d
# 7. Run migrations (if backup predates latest code)
docker compose -f docker-compose.prod.yml exec api npx prisma migrate deploy
# 8. Rebuild Typesense index
pnpm run typesense:reindex
# 9. Flush Redis (stale cache from old host)
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" FLUSHALL
# 10. Verify everything
curl -sf http://localhost:3001/health/ready | jq .
curl -sf http://localhost:3000 > /dev/null && echo "Web OK"
# Expected RTO: ~60 minutes (depends on backup transfer speed)
5. Escalation Matrix
| Severity | Condition | First Responder | Escalation | SLA |
|---|---|---|---|---|
| P0 — Critical | Full outage, data loss, payment corruption | On-call SRE | CTO + CEO within 15 min | Acknowledge: 5 min, Resolve: 1 hour |
| P1 — High | Partial outage, SLO breach (p99 > 3s), 5xx > 5% | On-call SRE | Engineering lead within 30 min | Acknowledge: 15 min, Resolve: 4 hours |
| P2 — Medium | Degraded performance, single service down (non-critical), p99 > 1s | On-call SRE | Team lead next business day | Acknowledge: 1 hour, Resolve: 24 hours |
| P3 — Low | Cosmetic issues, monitoring gaps, non-urgent improvements | Assigned engineer | Sprint planning | Next sprint |
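The numeric conditions in the matrix can be expressed as a small classifier for use in scripts or alert annotations. This is a sketch only: classify_severity is a hypothetical name, and it covers just the p99 and 5xx thresholds from the table. P0 and P3 depend on human judgment and are not modeled.

```shell
# Maps p99 latency (seconds) and 5xx rate (percent) to a severity label
# using the thresholds from the escalation matrix.
# Arguments: p99_seconds, error_rate_percent.
classify_severity() {
  awk -v p99="$1" -v err="$2" 'BEGIN {
    if (p99 > 3 || err > 5) print "P1"
    else if (p99 > 1)       print "P2"
    else                    print "none"
  }'
}
```

For instance, `classify_severity 1.5 0.5` prints P2, since p99 is above 1s but neither P1 condition is met.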
Contact Channels
| Role | Channel |
|---|---|
| On-call SRE | Slack #sre-oncall + PagerDuty |
| Engineering Lead | Slack #engineering |
| CTO | Slack DM / Phone (see PagerDuty) |
| Payment Issues | Slack #payments + VNPay/MoMo support portals |
| Infrastructure | Slack #infrastructure |
Slack Notifications
The deploy pipeline automatically notifies #deployments (via SLACK_WEBHOOK_URL) on:
- Production deploy success
- Staging smoke test failure
- Production rollback triggered
6. Monitoring Dashboards
All dashboards are provisioned automatically via monitoring/grafana/provisioning/ and are available in the GoodGo folder in Grafana.
| Dashboard | Grafana Path | Purpose |
|---|---|---|
| API Overview | api-overview | Request rates, status codes, active connections |
| API Latency | api-latency | p50/p95/p99 latency by endpoint, latency heatmaps |
| Database | database | PostgreSQL connections, query performance, PgBouncer stats |
| Search | search | Typesense query rates, latency, index sizes |
| Business Metrics | business-metrics | Listings, inquiries, payments, user registrations |
| Web Vitals | web-vitals | Core Web Vitals (LCP, FID, CLS), page load times |
| Logs | logs | Loki log explorer with filters by service, level, correlation ID |
Access: http://localhost:3002 (default credentials in .env: GRAFANA_ADMIN_USER / GRAFANA_ADMIN_PASSWORD)
Data Sources:
- Prometheus (http://prometheus:9090): Metrics (default)
- Loki (http://loki:3100): Logs, with correlation ID linking to Prometheus
- Alertmanager (http://alertmanager:9093): Alert state and silences
7. Useful PromQL Queries
API Performance
# Overall p99 latency
histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket{job="goodgo-api"}[5m])) by (le))
# Per-endpoint p99 latency (top 10 slowest)
topk(10, histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket{job="goodgo-api"}[5m])) by (le, route, method)))
# Request rate by status code
sum(rate(http_requests_total{job="goodgo-api"}[5m])) by (status_code)
# 5xx error percentage
(sum(rate(http_requests_total{job="goodgo-api", status_code=~"5.."}[5m])) / sum(rate(http_requests_total{job="goodgo-api"}[5m]))) * 100
Database
# Active connections
pg_stat_activity_count{datname="goodgo", state="active"}
# Connection pool utilization (if PgBouncer metrics are scraped)
# Manual check via: SHOW POOLS in PgBouncer admin console
Infrastructure
# Container memory usage
container_memory_usage_bytes{name=~"goodgo-.*"}
# Container CPU usage
rate(container_cpu_usage_seconds_total{name=~"goodgo-.*"}[5m])
8. Environment Quick Reference
Key Environment Variables
| Variable | Required | Description |
|---|---|---|
| DATABASE_URL | Yes | PostgreSQL via PgBouncer (postgresql://user:pass@pgbouncer:6432/db) |
| DATABASE_URL_DIRECT | Yes (prod) | Direct PostgreSQL for migrations (postgresql://user:pass@postgres:5432/db) |
| JWT_SECRET | Yes | JWT signing secret |
| JWT_REFRESH_SECRET | Yes | Refresh token signing secret |
| REDIS_URL | Yes | Redis connection (redis://:password@redis:6379) |
| REDIS_PASSWORD | Yes (prod) | Redis auth password |
| TYPESENSE_API_KEY | Yes | Typesense admin API key |
| MINIO_ACCESS_KEY | Yes | MinIO root user |
| MINIO_SECRET_KEY | Yes | MinIO root password |
| VNPAY_* | Yes | VNPay payment gateway configuration |
| AI_API_KEY | Yes | AI services authentication |
| GRAFANA_ADMIN_USER | Yes (prod) | Grafana admin username |
| GRAFANA_ADMIN_PASSWORD | Yes (prod) | Grafana admin password |
| PGBOUNCER_POOL_SIZE | No | PgBouncer pool size (default: 20) |
| PGBOUNCER_MAX_CLIENT_CONN | No | Max PgBouncer client connections (default: 200) |
| BACKUP_RETENTION_DAYS | No | Backup retention period (default: 7) |
| IMAGE_TAG | No (prod) | Container image tag (default: latest) |
Port Map
| Port | Service | Exposed |
|---|---|---|
| 3000 | Web (Next.js) | External |
| 3001 | API (NestJS) | External |
| 3002 | Grafana | External (admin only) |
| 5432 | PostgreSQL | Internal |
| 6432 | PgBouncer | Internal |
| 6379 | Redis | Internal |
| 8000 | AI Services | Internal |
| 8108 | Typesense | Internal |
| 9000 | MinIO API | Internal |
| 9001 | MinIO Console | Internal |
| 9090 | Prometheus | Internal |
| 9093 | Alertmanager | Internal |
| 3100 | Loki | Internal |
Docker Volumes
| Volume | Service | Purpose |
|---|---|---|
| pgdata | PostgreSQL | Database files |
| redis_data | Redis | AOF persistence |
| typesense_data | Typesense | Search index data |
| minio_data | MinIO | Object storage (images, files) |
| pg_backups | pg-backup | Database backup files |
| loki_data | Loki | Log storage (15-day retention) |
| prometheus_data | Prometheus | Metrics (30-day retention prod / 15-day dev) |
| grafana_data | Grafana | Dashboard state, user preferences |
9. Disaster Recovery Validation
Automated Verification
Backup verification runs daily at 04:00 UTC inside the pg-backup container. It restores the latest backup to an isolated test database and checks:
- Table existence (all 22 Prisma models)
- Row count comparison against live database
- Data checksums on critical tables (User, Property, Listing, Payment, Subscription, Transaction, Plan)
- PostGIS extension availability
- Index count match
- Enum type count match
Check latest verification report:
docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .
Check verification logs:
docker exec goodgo-pg-backup cat /var/log/pg-verify.log
Manual DR Validation Procedure
Run this quarterly (or after major schema changes) to validate the full DR process end-to-end.
Step 1: Verify Backups Exist and Are Recent
# List backups with timestamps and sizes
# (sh -c lets the glob expand inside the container rather than on the host)
docker exec goodgo-pg-backup sh -c 'ls -lht /backups/goodgo_*.sql.gz'
# Verify latest backup is < 25 hours old
LATEST=$(docker exec goodgo-pg-backup sh -c 'ls -t /backups/goodgo_*.sql.gz | head -1')
echo "Latest backup: $LATEST"
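The 25-hour freshness threshold can also be checked mechanically rather than by eye. The following is a minimal sketch, not part of the existing scripts: the helper names `backup_age_hours` and `backup_is_fresh` are hypothetical, and the usage comment assumes `stat -c %Y` is available inside the pg-backup container.

```shell
# backup_age_hours MTIME_EPOCH NOW_EPOCH
# Prints the whole number of hours between the two epoch timestamps.
backup_age_hours() {
  echo $(( ($2 - $1) / 3600 ))
}

# backup_is_fresh MTIME_EPOCH NOW_EPOCH [MAX_HOURS]
# Succeeds (exit 0) if the backup is strictly younger than MAX_HOURS (default 25).
backup_is_fresh() {
  [ $(( $2 - $1 )) -lt $(( ${3:-25} * 3600 )) ]
}

# Usage (assumes stat exists in the container; LATEST comes from the step above):
#   MTIME=$(docker exec goodgo-pg-backup stat -c %Y "$LATEST")
#   if backup_is_fresh "$MTIME" "$(date +%s)"; then
#     echo "Backup is fresh"
#   else
#     echo "Backup is STALE: $(backup_age_hours "$MTIME" "$(date +%s)") hours old"
#   fi
```

The default of 25 hours mirrors the checklist threshold, which leaves one hour of slack on top of the daily schedule.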
Step 2: Run Verification Against Latest Backup
# Automated verification (creates temp DB, validates, drops)
docker exec -e REPORT_FILE=/backups/verify-latest.json goodgo-pg-backup \
/scripts/pg-verify-backup.sh
# Review results
docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .
Expected output: all checks pass, and the restore completes in < 60 seconds for a typical dataset.
Step 3: Test Full Restore (Staging Only)
⚠️ WARNING: Only perform this on a staging or isolated environment. Never on production.
# 1. Create a separate test environment
docker compose -f docker-compose.yml -p goodgo-dr-test up -d postgres
# 2. Wait for PostgreSQL to be ready (poll until pg_isready succeeds)
until docker exec goodgo-dr-test-postgres-1 pg_isready; do sleep 2; done
# 3. Run restore against the test environment
PGHOST=localhost PGPORT=<test-port> PGUSER=goodgo PGPASSWORD=<password> \
/scripts/pg-restore.sh /backups/<latest-backup>.sql.gz
# 4. Verify key tables
docker exec goodgo-dr-test-postgres-1 psql -U goodgo -d goodgo -c \
"SELECT count(*) FROM \"User\"; SELECT count(*) FROM \"Property\"; SELECT count(*) FROM \"Listing\";"
# 5. Clean up test environment
docker compose -f docker-compose.yml -p goodgo-dr-test down -v
Step 4: Validate Service Recovery Chain
Test that all services can start from a clean state with restored data:
# 1. Note current service status
docker compose -f docker-compose.prod.yml ps --format "table {{.Name}}\t{{.Status}}\t{{.Health}}"
# 2. Restart all services in dependency order
docker compose -f docker-compose.prod.yml restart postgres
sleep 10 # Wait for PostgreSQL
docker compose -f docker-compose.prod.yml restart pgbouncer redis typesense
sleep 10 # Wait for data services
docker compose -f docker-compose.prod.yml restart api web ai-services
sleep 15 # Wait for application services
# 3. Verify all health checks
curl -sf http://localhost:3001/health/ready | jq .
curl -sf http://localhost:3000 > /dev/null && echo "Web OK"
curl -sf http://localhost:9090/-/healthy && echo "Prometheus OK"
curl -sf http://localhost:9093/-/healthy && echo "Alertmanager OK"
curl -sf http://localhost:3002/api/health | jq .
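The fixed `sleep` intervals above are a guess at how long each tier takes to come up. As an alternative, a small polling helper can wait until a health check actually passes. This is a sketch only: `retry` is a hypothetical helper, not something shipped with the platform.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  retry_attempts="$1"
  retry_delay="$2"
  shift 2
  retry_i=1
  while [ "$retry_i" -le "$retry_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the final failure
    if [ "$retry_i" -lt "$retry_attempts" ]; then
      sleep "$retry_delay"
    fi
    retry_i=$((retry_i + 1))
  done
  return 1
}

# Usage (same endpoints as the health checks above):
#   retry 15 2 curl -sf http://localhost:3001/health/ready > /dev/null && echo "API OK"
#   retry 15 2 curl -sf http://localhost:3000 > /dev/null && echo "Web OK"
```

Polling keeps the restart as short as the services allow, which also gives a more honest RTO measurement than fixed sleeps.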
Step 5: Validate Alerting Pipeline
# 1. Check Prometheus is loading alert rules
curl -sf http://localhost:9090/api/v1/rules | jq '.data.groups | length'
# Expected: 7 groups
# 2. Check current alerts (should be empty if healthy)
curl -sf http://localhost:9090/api/v1/alerts | jq '.data.alerts | length'
# 3. Check that Prometheus has discovered Alertmanager (expect a non-empty list)
curl -sf http://localhost:9090/api/v1/alertmanagers | jq '.data.activeAlertmanagers'
# 4. Verify Alertmanager config is loaded
curl -sf http://localhost:9093/api/v2/status | jq '.config'
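To exercise the Slack leg of the pipeline, one option is to post a synthetic alert directly to Alertmanager's v2 API. The sketch below only builds the JSON payload; `build_test_alert` and the alert name `DRValidationTest` are hypothetical, and it assumes routing in alertmanager.yml keys off the `severity` label. Check the actual route matchers before relying on it.

```shell
# build_test_alert ALERTNAME SEVERITY
# Prints a JSON array suitable for POST /api/v2/alerts on Alertmanager.
# Arguments are inserted verbatim, so keep them to plain alphanumeric values.
build_test_alert() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"DR validation test alert"}}]\n' "$1" "$2"
}

# Usage (port 9093 as in the health checks above):
#   build_test_alert DRValidationTest warning |
#     curl -sf -X POST -H 'Content-Type: application/json' \
#       --data @- http://localhost:9093/api/v2/alerts
```

Because the payload sets no `endsAt`, Alertmanager treats the alert as resolved once its configured resolve timeout passes without the alert being re-sent, so no cleanup step should be needed.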
DR Validation Checklist
Use this checklist during quarterly DR reviews:
- Latest backup is < 25 hours old
- Automated verification report shows all checks passed
- Manual restore to test DB succeeds with correct row counts
- Full service restart completes within RTO target (< 30 min)
- All health endpoints respond after restart
- Prometheus alert rules are loaded (7 groups)
- Alertmanager is reachable and configured
- Slack notification channel is receiving test alerts
- Grafana dashboards show data after restart
- Typesense search returns results after restart
RPO/RTO Summary
| Metric | Target | Actual (Measured) | Notes |
|---|---|---|---|
| RPO | ≤ 24 hours | ~24h (daily at 02:00 UTC) | Reduce with WAL archiving |
| RTO — Local backup | ≤ 15 minutes | Measure during DR test | Restore + service restart |
| RTO — Off-site backup | ≤ 30 minutes | Measure during DR test | Add transfer time |
| RTO — Full host recovery | ≤ 60 minutes | Measure during DR test | New host + restore + deploy |
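The "Measure during DR test" cells can be filled in by timing the restore and restart steps. A minimal sketch, assuming hypothetical helpers `within_rto` and `format_duration` that are not part of the existing scripts:

```shell
# within_rto ELAPSED_SECONDS TARGET_MINUTES
# Succeeds (exit 0) if the elapsed time is at or under the target.
within_rto() {
  [ "$1" -le $(( $2 * 60 )) ]
}

# format_duration SECONDS
# Prints a duration as "<minutes>m <seconds>s" for recording in the table.
format_duration() {
  printf '%dm %ds\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

# Usage during a DR test:
#   start=$(date +%s)
#   ... run the restore and service restart steps ...
#   elapsed=$(( $(date +%s) - start ))
#   if within_rto "$elapsed" 15; then
#     echo "RTO met: $(format_duration "$elapsed")"
#   else
#     echo "RTO MISSED: $(format_duration "$elapsed")"
#   fi
```

The comparison uses "at or under" to match the ≤ targets in the table.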
Appendix: Alert Rules Reference
API & Error Alerts
| Alert | Expression | Severity | Duration |
|---|---|---|---|
| ApiLatencyP99High | p99 > 1s | Warning | 5 min |
| ApiEndpointLatencyP99High | Per-route p99 > 2s | Warning | 5 min |
| ApiLatencyP99Critical | p99 > 3s (SLO breach) | Critical | 3 min |
| ApiErrorRate5xxHigh | 5xx rate > 1% | Warning | 5 min |
| ApiErrorRate5xxCritical | 5xx rate > 5% | Critical | 3 min |
| ApiNoTraffic | Request rate = 0 | Warning | 10 min |
Database Alerts
| Alert | Expression | Severity | Duration |
|---|---|---|---|
| PostgresActiveConnectionsHigh | Active connections > 15 | Warning | 5 min |
| PostgresConnectionPoolCritical | Total connections > 180 | Critical | 2 min |
| PostgresSlowQueries | Lock-waiting queries > 5 | Warning | 5 min |
| PostgresDown | API scrape target down | Critical | 1 min |
Redis Alerts
| Alert | Expression | Severity | Duration |
|---|---|---|---|
| RedisMemoryHigh | Memory usage > 80% | Warning | 5 min |
| RedisMemoryCritical | Memory usage > 95% | Critical | 2 min |
| RedisConnectedClientsHigh | Clients > 150 | Warning | 5 min |
| RedisRejectedConnections | Rejected connections > 0 | Critical | 1 min |
Container Resource Alerts
| Alert | Expression | Severity | Duration |
|---|---|---|---|
| ContainerRestartLoop | > 3 restarts in 15 min | Critical | 5 min |
| ContainerMemoryHigh | Memory > 85% of limit | Warning | 5 min |
| ContainerCPUThrottled | CPU throttle rate > 0.5s/s | Warning | 10 min |
Disk & Infrastructure Alerts
| Alert | Expression | Severity | Duration |
|---|---|---|---|
| HostDiskUsageHigh | Root disk > 80% | Warning | 10 min |
| HostDiskUsageCritical | Root disk > 90% | Critical | 5 min |
| ApiHealthCheckFailing | Health probe fails | Critical | 2 min |
| PrometheusTargetDown | Scrape target down | Warning | 5 min |
Backup Alerts
| Alert | Expression | Severity | Duration |
|---|---|---|---|
| BackupTooOld | Last backup > 25 hours ago | Warning | 5 min |
| BackupVerificationFailed | Verify result = fail | Warning | 1 min |
Alert Routing
Alerts are routed via Alertmanager (monitoring/alertmanager/alertmanager.yml):
| Channel | Routes | Repeat Interval |
|---|---|---|
| #sre-oncall (Slack) | All warning alerts | 4 hours |
| #sre-oncall (Slack) | All critical alerts (priority) | 1 hour |
| #infrastructure (Slack) | Backup-related alerts | 6 hours |
Inhibition: Warning alerts are suppressed when a critical alert for the same service is already firing.
Alert rules are defined in monitoring/prometheus/alert-rules.yml and evaluated every 15 seconds.