# GoodGo Platform - Production Runbook

> **Audience:** On-call SRE, DevOps engineers, and platform operators.
> **Last updated:** 2026-04-11

---

## Table of Contents

1. [Service Inventory](#1-service-inventory)
2. [Health Checks](#2-health-checks)
3. [Common Incidents](#3-common-incidents)
   - [3.1 Database Connection Pool Exhaustion](#31-database-connection-pool-exhaustion)
   - [3.2 Redis Connection Failure](#32-redis-connection-failure)
   - [3.3 Typesense Unavailable](#33-typesense-unavailable)
   - [3.4 High API Latency](#34-high-api-latency)
   - [3.5 Payment Callback Failures](#35-payment-callback-failures)
   - [3.6 Disk Space Alerts](#36-disk-space-alerts)
   - [3.7 MinIO / Object Storage Failure](#37-minio--object-storage-failure)
   - [3.8 AI Services Unavailable](#38-ai-services-unavailable)
   - [3.9 Log Pipeline Failure (Loki/Promtail)](#39-log-pipeline-failure-lokipromtail)
   - [3.10 5xx Error Rate Spike](#310-5xx-error-rate-spike)
4. [Recovery Procedures](#4-recovery-procedures)
   - [4.1 Database Restore from Backup](#41-database-restore-from-backup)
   - [4.2 Redis Cache Flush & Warm-up](#42-redis-cache-flush--warm-up)
   - [4.3 Rolling Restart Procedures](#43-rolling-restart-procedures)
   - [4.4 Rollback Deployment](#44-rollback-deployment)
   - [4.5 Typesense Reindex from PostgreSQL](#45-typesense-reindex-from-postgresql)
   - [4.6 Full Host Recovery](#46-full-host-recovery)
5. [Escalation Matrix](#5-escalation-matrix)
6. [Monitoring Dashboards](#6-monitoring-dashboards)
7. [Useful PromQL Queries](#7-useful-promql-queries)
8. [Environment Quick Reference](#8-environment-quick-reference)

---

## 1. Service Inventory

### Production Services (`docker-compose.prod.yml`)

| Service | Image | Port | Resource Limits | Health Check |
|---------|-------|------|-----------------|--------------|
| **api** (NestJS) | `ghcr.io/goodgo/goodgo-api` | 3001 | 1 CPU / 1 GB | `GET /health` (node fetch) |
| **web** (Next.js) | `ghcr.io/goodgo/goodgo-web` | 3000 | 0.5 CPU / 512 MB | `GET /` (node fetch) |
| **ai-services** (FastAPI) | `ghcr.io/goodgo/goodgo-ai-services` | 8000 | 1 CPU / 1 GB | `GET /health` (httpx) |
| **postgres** | `postgis/postgis:16-3.4` | 5432 (internal) | 2 CPU / 2 GB, shm=256m | `pg_isready` |
| **pgbouncer** | `edoburu/pgbouncer:1.23.1-p2` | 6432 (internal) | 0.5 CPU / 256 MB | `pg_isready -p 6432` |
| **redis** | `redis:7-alpine` | 6379 (internal) | 0.5 CPU / 768 MB | `redis-cli ping` |
| **typesense** | `typesense/typesense:27.1` | 8108 (internal) | 1 CPU / 1 GB | `curl /health` |
| **minio** | `minio/minio:latest` | 9000/9001 (internal) | 0.5 CPU / 1 GB | `mc ready local` |
| **pg-backup** | `postgis/postgis:16-3.4` | n/a | 0.5 CPU / 512 MB | n/a (cron daemon) |
| **loki** | `grafana/loki:3.0.0` | 3100 (internal) | 0.5 CPU / 512 MB | `wget /ready` |
| **promtail** | `grafana/promtail:3.0.0` | n/a | 0.25 CPU / 256 MB | n/a |
| **prometheus** | `prom/prometheus:v2.51.0` | 9090 (internal) | 0.5 CPU / 1 GB | `wget /-/healthy` |
| **grafana** | `grafana/grafana:10.4.1` | 3002 (external) | 0.5 CPU / 512 MB | `wget /api/health` |

### Development-Only Services (`docker-compose.yml`)

Development uses the same data and monitoring services but runs API/Web on the host. The `pg-backup` service also runs in dev with default credentials.

### Service Dependency Chain

```
web --> api --> pgbouncer --> postgres
            |-> redis
            |-> typesense
            |-> minio
            |-> ai-services

grafana --> prometheus
        |-> loki --> promtail (Docker socket)

pg-backup --> postgres
```

---

## 2. Health Checks

### Application Health Endpoints

| Endpoint | Type | Checks | Expected Response |
|----------|------|--------|-------------------|
| `GET /health` | Liveness | Process is running | `200 { status: "ok" }` |
| `GET /health/ready` | Readiness | PostgreSQL + Redis | `200 { status: "ok", info: { database: ..., redis: ... } }` |
| `GET /health/db` | Database only | PostgreSQL connectivity | `200 { status: "ok", info: { database: ... } }` |
| `GET /health/redis` | Redis only | Redis connectivity | `200 { status: "ok", info: { redis: ... } }` |

### Verify All Services Are Healthy

```bash
# Quick check: all containers
docker compose -f docker-compose.prod.yml ps --format "table {{.Name}}\t{{.Status}}\t{{.Health}}"

# API liveness
curl -sf http://localhost:3001/health && echo "API OK" || echo "API FAIL"

# API readiness (DB + Redis)
curl -sf http://localhost:3001/health/ready | jq .

# Individual dependency checks
curl -sf http://localhost:3001/health/db | jq .
curl -sf http://localhost:3001/health/redis | jq .

# Typesense
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health

# MinIO
docker exec goodgo-minio mc ready local && echo "MinIO OK"

# AI Services
curl -sf http://localhost:8000/health && echo "AI OK" || echo "AI FAIL"

# PostgreSQL (direct)
docker exec goodgo-postgres pg_isready -U "${DB_USER}" -d "${DB_NAME}"

# PgBouncer
docker exec goodgo-pgbouncer pg_isready -h 127.0.0.1 -p 6432 -U "${DB_USER}"

# Redis
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" ping

# Prometheus
curl -sf http://localhost:9090/-/healthy && echo "Prometheus OK"

# Loki
curl -sf http://localhost:3100/ready && echo "Loki OK"

# Grafana
curl -sf http://localhost:3002/api/health | jq .
```

### Container Resource Usage

```bash
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.NetIO}}"
```

---

## 3. Common Incidents

### 3.1 Database Connection Pool Exhaustion

**Symptoms:**
- API returns 503 or hangs on requests
- `/health/ready` returns unhealthy for `database`
- PgBouncer logs: `no more connections allowed` or `query_wait_timeout`
- Prometheus: spike in `pg_stat_activity` active connections

**Diagnosis:**

```bash
# Check PgBouncer pool status
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW POOLS;"
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW CLIENTS;"
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW STATS;"

# Check PostgreSQL active connections
docker exec goodgo-postgres psql -U "${DB_USER}" -d "${DB_NAME}" -c \
  "SELECT state, count(*)
   FROM pg_stat_activity
   WHERE datname = '${DB_NAME}'
   GROUP BY state;"

# Identify long-running queries
docker exec goodgo-postgres psql -U "${DB_USER}" -d "${DB_NAME}" -c \
  "SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
   FROM pg_stat_activity
   WHERE datname = '${DB_NAME}' AND state != 'idle'
   ORDER BY duration DESC
   LIMIT 10;"
```

**Resolution:**

```bash
# 1. Kill long-running queries (> 5 minutes)
docker exec goodgo-postgres psql -U "${DB_USER}" -d "${DB_NAME}" -c \
  "SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE datname = '${DB_NAME}'
     AND state != 'idle'
     AND now() - query_start > interval '5 minutes'
     AND pid <> pg_backend_pid();"

# 2. If pool is fully exhausted, restart PgBouncer
docker compose -f docker-compose.prod.yml restart pgbouncer

# 3. If issue persists, increase pool size temporarily
# Edit PGBOUNCER_POOL_SIZE in .env, then:
docker compose -f docker-compose.prod.yml up -d --no-deps pgbouncer
```

**PgBouncer Configuration Reference:**
- Pool mode: `transaction` (connections returned to pool after each transaction)
- Default pool size: 20 server connections per user/db pair
- Max client connections: 200
- Reserve pool: 5 extra connections (after 3s wait)
- Query wait timeout: 120s (error if client waits this long)

### 3.2 Redis Connection Failure

**Symptoms:**
- `/health/redis` returns unhealthy
- Increased API response times (cache misses hitting DB)
- API logs show Redis connection errors

**Diagnosis:**

```bash
# Check Redis container
docker logs --tail=50 goodgo-redis

# Test connectivity
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" ping
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO server
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO clients
```

**Resolution:**

```bash
# 1. Restart Redis (data persisted via AOF)
docker compose -f docker-compose.prod.yml restart redis

# 2. If OOM, check memory usage
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory | grep used_memory_human
# Max memory is 512 MB (prod), eviction policy: allkeys-lru

# 3. If AOF is corrupted, repair it while the server is stopped
docker compose -f docker-compose.prod.yml stop redis
# `docker exec` cannot run in a stopped container, so use a one-off container
# that mounts the same data volume. Redis 7 typically keeps the AOF under
# /data/appendonlydir/; adjust the path if /data/appendonly.aof does not exist.
docker compose -f docker-compose.prod.yml run --rm --no-deps redis \
  redis-check-aof --fix /data/appendonly.aof
docker compose -f docker-compose.prod.yml start redis
```

**Graceful Degradation:** The API is designed to continue operating when Redis is unavailable. Cache misses fall through to PostgreSQL. Performance will degrade but functionality is preserved. Redis is non-critical for core operations.

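
For the OOM check in the Redis resolution steps, it can help to see memory use as a share of the configured limit rather than as a raw figure. The sketch below is a minimal illustration, assuming the standard `used_memory` and `maxmemory` fields of `INFO memory`; the function name `redis_mem_pct` is hypothetical and not part of the GoodGo repository.

```bash
# Hypothetical helper (not part of the GoodGo repo): reads `INFO memory`
# output on stdin and prints used_memory as a whole percentage of maxmemory.
redis_mem_pct() {
  # INFO output may use CRLF line endings, so strip \r before parsing.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max > 0) printf "%d\n", used * 100 / max
      else print "no-limit"   # maxmemory 0 means no cap is configured
    }'
}

# Usage on the production host (assumes the container name used above):
# docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory | redis_mem_pct
```

A value that stays close to 100 suggests the `allkeys-lru` policy is evicting keys continuously, which would show up as a lower cache hit rate rather than as errors.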
### 3.3 Typesense Unavailable

**Symptoms:**
- Search functionality returns errors or falls back to basic DB search
- `curl http://localhost:8108/health` fails
- API logs show Typesense connection timeouts

**Diagnosis:**

```bash
# Check container status
docker logs --tail=50 goodgo-typesense

# Check health
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health

# Check collections
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/collections | jq .

# Check disk space for Typesense data volume
docker system df -v | grep typesense
```

**Resolution:**

```bash
# 1. Restart Typesense
docker compose -f docker-compose.prod.yml restart typesense

# 2. If data is corrupted, rebuild from PostgreSQL
docker compose -f docker-compose.prod.yml stop typesense
# Remove the stopped container first: a volume that is still referenced
# by a container (even a stopped one) cannot be removed.
docker compose -f docker-compose.prod.yml rm -f typesense
docker volume rm goodgo-platform-ai_typesense_data
docker compose -f docker-compose.prod.yml up -d typesense
# Wait for healthy, then reindex:
docker compose -f docker-compose.prod.yml exec api npx ts-node scripts/typesense-reindex.ts
# Or: pnpm run typesense:reindex

# 3. If volume backup exists, restore it
docker compose -f docker-compose.prod.yml stop typesense
docker run --rm -v goodgo-platform-ai_typesense_data:/data -v "$(pwd)/backups:/backup" \
  alpine sh -c "rm -rf /data/* && tar xzf /backup/typesense_YYYYMMDD_HHMMSS.tar.gz -C /data"
docker compose -f docker-compose.prod.yml start typesense
```

**Fallback Behavior:** When Typesense is unavailable, property search falls back to PostgreSQL full-text search with PostGIS geo queries. Search quality degrades but core functionality works.

### 3.4 High API Latency

**Symptoms:**
- Prometheus alert `ApiLatencyP99High` fires (p99 > 1s for 5 min)
- Critical alert `ApiLatencyP99Critical` fires (p99 > 3s for 3 min, an SLO breach)
- Users report slow page loads

**Diagnosis:**

```bash
# 1. Check which endpoints are slow
# Grafana: GoodGo API Latency dashboard
# Or via PromQL:
curl -s "http://localhost:9090/api/v1/query" --data-urlencode \
  'query=topk(10, histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket[5m])) by (le, route, method)))' \
  | jq '.data.result[] | {route: .metric.route, method: .metric.method, p99: .value[1]}'

# 2. Check database slow queries
docker exec goodgo-postgres psql -U "${DB_USER}" -d "${DB_NAME}" -c \
  "SELECT pid, now() - query_start AS duration, left(query, 100) AS query_preview
   FROM pg_stat_activity
   WHERE state = 'active' AND now() - query_start > interval '1 second'
   ORDER BY duration DESC;"

# 3. Check PgBouncer wait times
docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW POOLS;"

# 4. Check container resource usage
docker stats --no-stream goodgo-api goodgo-postgres goodgo-redis goodgo-pgbouncer

# 5. Check Redis latency (runs until interrupted with Ctrl-C)
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" --latency-history -i 3

# 6. Check application logs for errors
docker logs --tail=200 --since=5m goodgo-api 2>&1 | grep -i "error\|timeout\|slow"
```

**Resolution:**

```bash
# 1. If DB slow queries, terminate them
docker exec goodgo-postgres psql -U "${DB_USER}" -d "${DB_NAME}" -c \
  "SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE state = 'active'
     AND now() - query_start > interval '30 seconds'
     AND pid <> pg_backend_pid();"

# 2. If connection pool exhaustion, see Section 3.1

# 3. If Redis is slow, restart it
docker compose -f docker-compose.prod.yml restart redis

# 4. If the API container was OOM-killed, restart it. A restart alone does not
#    add memory; raise the limit in docker-compose.prod.yml if OOM kills recur.
docker compose -f docker-compose.prod.yml restart api

# 5. If specific endpoint is the bottleneck, check Loki logs:
# Grafana > Explore > Loki > {container_name="goodgo-api"} |= "slow"
```

### 3.5 Payment Callback Failures

**Symptoms:**
- Users report payments stuck in "pending" state
- VNPay/MoMo/ZaloPay IPN callbacks returning errors
- Payment reconciliation mismatches

**Diagnosis:**

```bash
# 1. Check payment callback logs
docker logs goodgo-api 2>&1 | grep -i "payment\|callback\|vnpay\|momo\|zalopay" | tail -50

# 2. Check for pending payments in DB
docker exec goodgo-postgres psql -U "${DB_USER}" -d "${DB_NAME}" -c \
  "SELECT id, provider, status, \"amountVND\", \"createdAt\"
   FROM \"Payment\"
   WHERE status = 'PENDING' AND \"createdAt\" > now() - interval '24 hours'
   ORDER BY \"createdAt\" DESC
   LIMIT 20;"

# 3. Verify callback URL is reachable from external networks
curl -sf https://your-domain.com/api/payments/vnpay/callback && echo "Callback URL reachable"

# 4. Check if API is receiving callbacks (via Loki)
# Grafana > Explore > Loki > {container_name="goodgo-api"} |= "callback" |= "payment"
```

**Resolution:**

```bash
# 1. If callbacks are timing out, check API health and restart if needed
docker compose -f docker-compose.prod.yml restart api

# 2. If VNPay signature verification fails, verify VNPAY_* env vars
docker compose -f docker-compose.prod.yml exec api printenv | grep VNPAY

# 3. For stuck payments, reconcile manually
# Check VNPay/MoMo merchant portal for actual transaction status
# Update payment status in DB if confirmed paid:
docker exec goodgo-postgres psql -U "${DB_USER}" -d "${DB_NAME}" -c \
  "UPDATE \"Payment\"
   SET status = 'COMPLETED', \"updatedAt\" = now()
   WHERE id = '' AND status = 'PENDING';"

# 4. If callbacks are not reaching the server, check:
# - Firewall rules (port 3001 or reverse proxy port must be open)
# - SSL certificate validity
# - DNS resolution
# - Payment provider webhook configuration (correct callback URL)
```

**Important:** The payment callback handler uses idempotent processing with atomic state transitions. Replaying a callback is safe and will not duplicate payments.

### 3.6 Disk Space Alerts

**Symptoms:**
- Containers failing to start or crashing
- PostgreSQL refusing writes (`PANIC: could not write to file`)
- Docker daemon running out of space

**Diagnosis:**

```bash
# Host disk usage
df -h

# Docker disk usage
docker system df
docker system df -v

# Check individual volume sizes
for vol in $(docker volume ls -q | grep goodgo); do
  echo -n "$vol: "
  docker run --rm -v "${vol}:/data" alpine du -sh /data 2>/dev/null
done

# Check backup volume specifically
docker exec goodgo-pg-backup du -sh /backups/
docker exec goodgo-pg-backup ls -lht /backups/
```

**Resolution:**

```bash
# 1. Clean up Docker artifacts
docker system prune -f      # Remove stopped containers, unused networks, dangling images
docker image prune -a -f    # Remove ALL unused images (careful in prod)

# 2. Clean old backups (if retention not working)
docker exec goodgo-pg-backup find /backups -name "goodgo_*.sql.gz" -mtime +7 -delete

# 3. Clean Prometheus data (if too large)
# Prometheus retention is 30d (prod) / 15d (dev), configured via --storage.tsdb.retention.time
# To force compaction (the /-/quit endpoint requires --web.enable-lifecycle):
curl -sf -XPOST http://localhost:9090/-/quit    # Graceful shutdown triggers compaction
docker compose -f docker-compose.prod.yml start prometheus

# 4. Clean Loki data (15-day retention)
# Loki handles its own cleanup via compactor. If urgent:
docker compose -f docker-compose.prod.yml restart loki

# 5. Truncate Docker container logs
sudo truncate -s 0 "$(docker inspect --format='{{.LogPath}}' goodgo-api)"
# Or for all containers:
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'
```

**Prevention:** All production containers use `json-file` logging with `max-size: 10m` and `max-file: 3-5`. Backup retention is 7 days (configurable via `BACKUP_RETENTION_DAYS`).

### 3.7 MinIO / Object Storage Failure

**Symptoms:**
- Image/file uploads fail
- Property photos not loading
- MinIO console inaccessible at port 9001

**Diagnosis:**

```bash
docker logs --tail=50 goodgo-minio
docker exec goodgo-minio mc ready local
docker exec goodgo-minio mc admin info local
```

**Resolution:**

```bash
# 1. Restart MinIO
docker compose -f docker-compose.prod.yml restart minio

# 2. If data volume corrupted
docker compose -f docker-compose.prod.yml stop minio
# Remove the stopped container so the volume is no longer in use
docker compose -f docker-compose.prod.yml rm -f minio
docker volume rm goodgo-platform-ai_minio_data    # WARNING: data loss
docker compose -f docker-compose.prod.yml up -d minio
# Recreate buckets via API or admin console
```

### 3.8 AI Services Unavailable

**Symptoms:**
- AI-powered features (AVM, property descriptions) fail
- `GET /health` on port 8000 fails
- API logs show AI service connection timeouts

**Diagnosis:**

```bash
docker logs --tail=50 goodgo-ai-services
curl -sf http://localhost:8000/health
docker stats --no-stream goodgo-ai-services
```

**Resolution:**

```bash
# 1. Restart AI services
docker compose -f docker-compose.prod.yml restart ai-services

# 2. Check rate limits (default: 60/minute)
docker compose -f docker-compose.prod.yml exec ai-services printenv | grep AI_RATE_LIMIT

# 3. If OOM: the service has a 1 GB limit; may need to increase for large models
```

**Graceful Degradation:** AI features are optional. The API should handle AI service unavailability gracefully and return non-AI results.

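
Most resolutions in this section end with a restart, and a restarted service usually needs some time before its health check passes. The sketch below shows one way to poll instead of guessing a fixed sleep; `wait_until` is a hypothetical helper name, and the attempt count and delay in the usage comment are illustrative assumptions, not tuned values.

```bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  _wu_attempts=$1
  _wu_delay=$2
  shift 2
  _wu_i=1
  while [ "$_wu_i" -le "$_wu_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0            # command succeeded
    fi
    # Sleep between attempts, but not after the final one.
    if [ "$_wu_i" -lt "$_wu_attempts" ]; then
      sleep "$_wu_delay"
    fi
    _wu_i=$((_wu_i + 1))
  done
  return 1                # still failing after all attempts
}

# Example: allow ai-services roughly a minute to become healthy after a restart.
# docker compose -f docker-compose.prod.yml restart ai-services
# wait_until 12 5 curl -sf http://localhost:8000/health || echo "AI still unhealthy"
```

The same helper works with any of the health commands from Section 2, since it only looks at the exit status of the command it is given.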
### 3.9 Log Pipeline Failure (Loki/Promtail)

**Symptoms:**
- Grafana log explorer returns empty results
- Promtail container unhealthy or crash-looping
- Loki returning 503

**Diagnosis:**

```bash
docker logs --tail=50 goodgo-loki
docker logs --tail=50 goodgo-promtail
curl -sf http://localhost:3100/ready && echo "Loki ready" || echo "Loki NOT ready"
```

**Resolution:**

```bash
# 1. Restart the pipeline
docker compose -f docker-compose.prod.yml restart loki promtail

# 2. If Loki data corrupted
docker compose -f docker-compose.prod.yml stop loki promtail
# Remove the stopped Loki container so the volume is no longer in use
docker compose -f docker-compose.prod.yml rm -f loki
docker volume rm goodgo-platform-ai_loki_data
docker compose -f docker-compose.prod.yml up -d loki promtail
# Historical logs are lost but new logs will flow immediately

# 3. If Promtail can't access Docker socket
ls -la /var/run/docker.sock
# Ensure the promtail container has the Docker socket mounted
```

### 3.10 5xx Error Rate Spike

**Symptoms:**
- Prometheus alert `ApiErrorRate5xxHigh` fires (> 1% 5xx for 5 min)
- Users reporting errors

**Diagnosis:**

```bash
# Check which endpoints are returning 5xx
curl -s "http://localhost:9090/api/v1/query" --data-urlencode \
  'query=topk(10, sum(rate(http_requests_total{job="goodgo-api", status_code=~"5.."}[5m])) by (route, method))' \
  | jq '.data.result'

# Check API error logs
docker logs --tail=200 --since=5m goodgo-api 2>&1 | grep -i "error\|exception\|500"

# Check all dependency health
curl -sf http://localhost:3001/health/ready | jq .
```

**Resolution:**

1. If DB-related: see [Section 3.1](#31-database-connection-pool-exhaustion)
2. If Redis-related: see [Section 3.2](#32-redis-connection-failure)
3. If recent deployment: see [Section 4.4](#44-rollback-deployment)
4. If unknown: restart API and investigate logs

---

## 4. Recovery Procedures

### 4.1 Database Restore from Backup

**Automated backups run daily at 02:00 UTC** via the `pg-backup` container. Retention: 7 days. Format: `pg_dump --format=custom --compress=6`.

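
When a restore has to happen quickly, picking the newest backup by eye from a long listing is error-prone. A minimal sketch, assuming the `goodgo_YYYYMMDD_HHMMSS.sql.gz` naming used elsewhere in this runbook and that all names share one directory; `latest_backup` is a hypothetical helper, not a script shipped in `/scripts`. Because the timestamp is zero-padded, a plain text sort is also a chronological sort.

```bash
# Hypothetical helper: print the newest backup name from a list on stdin.
# Lines that do not match the backup naming pattern are ignored.
latest_backup() {
  grep -E '^(.*/)?goodgo_[0-9]{8}_[0-9]{6}\.sql\.gz$' | LC_ALL=C sort | tail -n 1
}

# Usage (assumes the backup container name used in this runbook):
# docker exec goodgo-pg-backup ls /backups | latest_backup
```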
**Automated verification runs daily at 04:00 UTC.** It restores the latest backup to an isolated test database and verifies table existence, row counts, checksums, the PostGIS extension, indexes, and enums. Reports are written to `/backups/verify-latest.json`.

#### List Available Backups

```bash
# The glob must expand inside the container, so run it through a shell there
docker exec goodgo-pg-backup sh -c 'ls -lht /backups/goodgo_*.sql.gz'
```

#### Create an On-Demand Backup

```bash
docker exec goodgo-pg-backup /scripts/pg-backup.sh
```

#### Full Restore Procedure

```bash
# 1. Stop application services
docker compose -f docker-compose.prod.yml stop api web ai-services

# 2. (Production) Stop PgBouncer to prevent stale connections
docker compose -f docker-compose.prod.yml stop pgbouncer

# 3. Run the restore script
docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
# The script will:
# - Terminate active DB connections
# - DROP and recreate the database
# - Restore from the backup file

# 4. Verify the restore
docker exec goodgo-postgres psql -U "${DB_USER}" -d "${DB_NAME}" -c '\dt'
docker exec goodgo-postgres psql -U "${DB_USER}" -d "${DB_NAME}" -c 'SELECT count(*) FROM "User";'

# 5. Restart all services
docker compose -f docker-compose.prod.yml up -d

# 6. Apply any pending migrations (`exec` needs the api container to be running)
docker compose -f docker-compose.prod.yml exec api npx prisma migrate deploy

# 7. Verify application health
curl -sf http://localhost:3001/health/ready | jq .
```

#### Verify a Backup Without Restoring

```bash
# Run verification against latest backup (creates temp DB, drops it after)
docker compose run --rm pg-verify-backup

# Or verify a specific backup file
docker exec goodgo-pg-backup /scripts/pg-verify-backup.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz

# Check latest verification report
docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .
```

**RPO/RTO:**
- RPO: ≤ 24 hours (daily backups; consider WAL archiving for lower RPO)
- RTO: ~15 minutes (local volume), ~30 minutes (off-site)

### 4.2 Redis Cache Flush & Warm-up

```bash
# Flush all Redis data
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" FLUSHALL

# Verify flush
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" DBSIZE
# Should return: (integer) 0
```

**Warm-up:** Redis uses `allkeys-lru` eviction. Cache warms naturally as users make requests. No manual warm-up script is needed because cache misses fall through to PostgreSQL.

**When to flush:**
- After database restore (stale cache references)
- After data corruption at the application level
- After schema changes that alter cached data structures

### 4.3 Rolling Restart Procedures

#### Single Service Restart (Zero Downtime)

```bash
# API: the --wait flag ensures the health check passes before moving on
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api

# Web
docker compose -f docker-compose.prod.yml up -d --no-deps --wait web

# AI Services
docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services
```

#### Full Stack Rolling Restart

```bash
# Data services first (order matters for dependency chain)
docker compose -f docker-compose.prod.yml restart redis
docker compose -f docker-compose.prod.yml restart typesense

# Wait for data services to be healthy
sleep 10

# Connection pooling
docker compose -f docker-compose.prod.yml restart pgbouncer
sleep 5

# Application services
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services

# Verify
curl -sf http://localhost:3001/health/ready | jq .
```

#### Emergency: Restart Everything

```bash
docker compose -f docker-compose.prod.yml down
docker compose -f docker-compose.prod.yml up -d --wait
```

### 4.4 Rollback Deployment

The CI/CD pipeline (`.github/workflows/deploy.yml`) supports automatic rollback if production smoke tests fail. For manual rollback:

#### Quick Rollback (Revert to Previous Images)

```bash
# SSH into production host
ssh deploy@$PRODUCTION_HOST
cd ~/goodgo

# Stop current app containers
docker compose -f docker-compose.prod.yml down api web ai-services

# The previous images are still cached locally
# Restart without pulling: uses last-known-good images
docker compose -f docker-compose.prod.yml up -d --wait api web ai-services

# Verify
curl -sf http://localhost:3001/health && echo "Rollback successful"
```

#### Rollback to a Specific Git Commit / Image Tag

```bash
# Set the target tag (git SHA)
export IMAGE_TAG=
export REGISTRY_URL=ghcr.io/goodgo

# Pull specific version
docker compose -f docker-compose.prod.yml pull api web ai-services

# Deploy
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services

# Verify
curl -sf http://localhost:3001/health/ready | jq .
```

#### Rollback Database Migrations

```bash
# WARNING: Prisma does not support automatic down-migrations.
# For migration rollback, restore from the pre-migration backup:

# 1. Stop application
docker compose -f docker-compose.prod.yml stop api web ai-services pgbouncer

# 2. Restore from backup taken before the migration
docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/.sql.gz

# 3. Deploy the previous code version (older IMAGE_TAG)
export IMAGE_TAG=
docker compose -f docker-compose.prod.yml up -d --wait
```

### 4.5 Typesense Reindex from PostgreSQL

If Typesense data is lost or corrupted, rebuild the search index from PostgreSQL:

```bash
# 1. Ensure Typesense is running and healthy
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health

# 2. Run reindex
docker compose -f docker-compose.prod.yml exec api npx ts-node scripts/typesense-reindex.ts
# Or from host: pnpm run typesense:reindex

# 3. Verify collections
curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/collections | jq '.[].name'
```

### 4.6 Full Host Recovery

For complete host failure or migration to a new server:

```bash
# 1. Provision new host with Docker + Docker Compose
# Requirements: Docker >= 24, Docker Compose v2, 8 GB RAM minimum

# 2. Clone repository and configure
git clone ~/goodgo && cd ~/goodgo
cp .env.example .env
# Edit .env with production secrets (from secrets manager)

# 3. Restore PostgreSQL backup from off-site storage
# Transfer backup file to the new host
scp backups/goodgo_latest.sql.gz deploy@newhost:~/goodgo/backups/

# 4. Start infrastructure services (pg-backup is needed for the restore in step 5)
docker compose -f docker-compose.prod.yml up -d postgres redis typesense minio pg-backup

# 5. Wait for PostgreSQL to be ready, then restore
docker compose -f docker-compose.prod.yml exec postgres pg_isready
docker exec goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_latest.sql.gz

# 6. Start application services
docker compose -f docker-compose.prod.yml up -d

# 7. Run migrations (if backup predates latest code)
docker compose -f docker-compose.prod.yml exec api npx prisma migrate deploy

# 8. Rebuild Typesense index
pnpm run typesense:reindex

# 9. Flush Redis (stale cache from old host)
docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" FLUSHALL

# 10. Verify everything
curl -sf http://localhost:3001/health/ready | jq .
curl -sf http://localhost:3000 > /dev/null && echo "Web OK"

# Expected RTO: ~60 minutes (depends on backup transfer speed)
```

---

## 5. Escalation Matrix

| Severity | Condition | First Responder | Escalation | SLA |
|----------|-----------|-----------------|------------|-----|
| **P0 - Critical** | Full outage, data loss, payment corruption | On-call SRE | CTO + CEO within 15 min | Acknowledge: 5 min, Resolve: 1 hour |
| **P1 - High** | Partial outage, SLO breach (p99 > 3s), 5xx > 5% | On-call SRE | Engineering lead within 30 min | Acknowledge: 15 min, Resolve: 4 hours |
| **P2 - Medium** | Degraded performance, single service down (non-critical), p99 > 1s | On-call SRE | Team lead next business day | Acknowledge: 1 hour, Resolve: 24 hours |
| **P3 - Low** | Cosmetic issues, monitoring gaps, non-urgent improvements | Assigned engineer | Sprint planning | Next sprint |

### Contact Channels

| Role | Channel |
|------|---------|
| On-call SRE | Slack `#sre-oncall` + PagerDuty |
| Engineering Lead | Slack `#engineering` |
| CTO | Slack DM / Phone (see PagerDuty) |
| Payment Issues | Slack `#payments` + VNPay/MoMo support portals |
| Infrastructure | Slack `#infrastructure` |

### Slack Notifications

The deploy pipeline automatically notifies `#deployments` (via `SLACK_WEBHOOK_URL`) on:
- Production deploy success
- Staging smoke test failure
- Production rollback triggered

---

## 6. Monitoring Dashboards

All dashboards are provisioned automatically via `monitoring/grafana/provisioning/` and are available in the **GoodGo** folder in Grafana.

| Dashboard | Grafana Path | Purpose |
|-----------|--------------|---------|
| **API Overview** | `api-overview` | Request rates, status codes, active connections |
| **API Latency** | `api-latency` | p50/p95/p99 latency by endpoint, latency heatmaps |
| **Database** | `database` | PostgreSQL connections, query performance, PgBouncer stats |
| **Search** | `search` | Typesense query rates, latency, index sizes |
| **Business Metrics** | `business-metrics` | Listings, inquiries, payments, user registrations |
| **Web Vitals** | `web-vitals` | Core Web Vitals (LCP, FID, CLS), page load times |
| **Logs** | `logs` | Loki log explorer with filters by service, level, correlation ID |

**Access:** `http://localhost:3002` (default credentials in `.env`: `GRAFANA_ADMIN_USER` / `GRAFANA_ADMIN_PASSWORD`)

**Data Sources:**
- **Prometheus** (`http://prometheus:9090`): metrics (default)
- **Loki** (`http://loki:3100`): logs, with correlation ID linking to Prometheus

---

## 7. Useful PromQL Queries

### API Performance

```promql
# Overall p99 latency
histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket{job="goodgo-api"}[5m])) by (le))

# Per-endpoint p99 latency (top 10 slowest)
topk(10, histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket{job="goodgo-api"}[5m])) by (le, route, method)))

# Request rate by status code
sum(rate(http_requests_total{job="goodgo-api"}[5m])) by (status_code)

# 5xx error percentage
(
  sum(rate(http_requests_total{job="goodgo-api", status_code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{job="goodgo-api"}[5m]))
) * 100
```

### Database

```promql
# Active connections
pg_stat_activity_count{datname="goodgo", state="active"}

# Connection pool utilization (if PgBouncer metrics are scraped)
# Manual check via: SHOW POOLS in PgBouncer admin console
```

### Infrastructure

```promql
# Container memory usage
container_memory_usage_bytes{name=~"goodgo-.*"}

# Container CPU usage
rate(container_cpu_usage_seconds_total{name=~"goodgo-.*"}[5m])
```

---

## 8. Environment Quick Reference

### Key Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `DATABASE_URL` | Yes | PostgreSQL via PgBouncer (`postgresql://user:pass@pgbouncer:6432/db`) |
| `DATABASE_URL_DIRECT` | Yes (prod) | Direct PostgreSQL for migrations (`postgresql://user:pass@postgres:5432/db`) |
| `JWT_SECRET` | Yes | JWT signing secret |
| `JWT_REFRESH_SECRET` | Yes | Refresh token signing secret |
| `REDIS_URL` | Yes | Redis connection (`redis://:password@redis:6379`) |
| `REDIS_PASSWORD` | Yes (prod) | Redis auth password |
| `TYPESENSE_API_KEY` | Yes | Typesense admin API key |
| `MINIO_ACCESS_KEY` | Yes | MinIO root user |
| `MINIO_SECRET_KEY` | Yes | MinIO root password |
| `VNPAY_*` | Yes | VNPay payment gateway configuration |
| `AI_API_KEY` | Yes | AI services authentication |
| `GRAFANA_ADMIN_USER` | Yes (prod) | Grafana admin username |
| `GRAFANA_ADMIN_PASSWORD` | Yes (prod) | Grafana admin password |
| `PGBOUNCER_POOL_SIZE` | No | PgBouncer pool size (default: 20) |
| `PGBOUNCER_MAX_CLIENT_CONN` | No | Max PgBouncer client connections (default: 200) |
| `BACKUP_RETENTION_DAYS` | No | Backup retention period (default: 7) |
| `IMAGE_TAG` | No (prod) | Container image tag (default: `latest`) |

### Port Map

| Port | Service | Exposed |
|------|---------|---------|
| 3000 | Web (Next.js) | External |
| 3001 | API (NestJS) | External |
| 3002 | Grafana | External (admin only) |
| 5432 | PostgreSQL | Internal |
| 6432 | PgBouncer | Internal |
| 6379 | Redis | Internal |
| 8000 | AI Services | Internal |
| 8108 | Typesense | Internal |
| 9000 | MinIO API | Internal |
| 9001 | MinIO Console | Internal |
| 9090 | Prometheus | Internal |
| 3100 | Loki | Internal |

### Docker Volumes

| Volume | Service | Purpose |
|--------|---------|---------|
| `pgdata` | PostgreSQL | Database files |
| `redis_data` | Redis | AOF persistence |
| `typesense_data` | Typesense | Search index data |
| `minio_data` | MinIO | Object storage (images, files) |
| `pg_backups` | pg-backup | Database backup files |
| `loki_data` | Loki | Log storage (15-day retention) |
| `prometheus_data` | Prometheus | Metrics (30-day retention prod / 15-day dev) |
| `grafana_data` | Grafana | Dashboard state, user preferences |

---

## Appendix: Alert Rules Reference

| Alert | Expression | Severity | Duration |
|-------|-----------|----------|----------|
| `ApiLatencyP99High` | p99 > 1s | Warning | 5 min |
| `ApiEndpointLatencyP99High` | Per-route p99 > 2s | Warning | 5 min |
| `ApiLatencyP99Critical` | p99 > 3s (SLO breach) | Critical | 3 min |
| `ApiErrorRate5xxHigh` | 5xx rate > 1% | Warning | 5 min |

Alert rules are defined in `monitoring/prometheus/alert-rules.yml` and evaluated every 15 seconds.
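
During triage it can be useful to check a pair of readings against the thresholds in the table above without opening Grafana. The following is a rough sketch only: `alert_level` is a hypothetical helper, it covers the overall p99 and 5xx thresholds but not the per-route rule, and it ignores the duration each alert must hold before firing, so it is not a substitute for the rules in `alert-rules.yml`.

```bash
# Hypothetical helper mirroring the overall thresholds in the alert table.
# Usage: alert_level <p99-seconds> <5xx-percent>
alert_level() {
  awk -v p99="$1" -v err="$2" 'BEGIN {
    if (p99 > 3)                 print "critical"   # ApiLatencyP99Critical
    else if (p99 > 1 || err > 1) print "warning"    # ApiLatencyP99High, ApiErrorRate5xxHigh
    else                         print "ok"
  }'
}

# Example with values read from the PromQL queries in Section 7:
# alert_level 1.4 0.2
```

The comparisons are strict, matching the `>` in the table, so a p99 of exactly 1s is still reported as `ok`.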