diff --git a/docs/RUNBOOK.md b/docs/RUNBOOK.md
new file mode 100644
index 0000000..a1e75c9
--- /dev/null
+++ b/docs/RUNBOOK.md
@@ -0,0 +1,975 @@
+# GoodGo Platform - Production Runbook
+
+> **Audience:** On-call SRE, DevOps engineers, and platform operators.
+> **Last updated:** 2026-04-11
+
+---
+
+## Table of Contents
+
+1. [Service Inventory](#1-service-inventory)
+2. [Health Checks](#2-health-checks)
+3. [Common Incidents](#3-common-incidents)
+   - [3.1 Database Connection Pool Exhaustion](#31-database-connection-pool-exhaustion)
+   - [3.2 Redis Connection Failure](#32-redis-connection-failure)
+   - [3.3 Typesense Unavailable](#33-typesense-unavailable)
+   - [3.4 High API Latency](#34-high-api-latency)
+   - [3.5 Payment Callback Failures](#35-payment-callback-failures)
+   - [3.6 Disk Space Alerts](#36-disk-space-alerts)
+   - [3.7 MinIO / Object Storage Failure](#37-minio--object-storage-failure)
+   - [3.8 AI Services Unavailable](#38-ai-services-unavailable)
+   - [3.9 Log Pipeline Failure (Loki/Promtail)](#39-log-pipeline-failure-lokipromtail)
+   - [3.10 5xx Error Rate Spike](#310-5xx-error-rate-spike)
+4. [Recovery Procedures](#4-recovery-procedures)
+   - [4.1 Database Restore from Backup](#41-database-restore-from-backup)
+   - [4.2 Redis Cache Flush & Warm-up](#42-redis-cache-flush--warm-up)
+   - [4.3 Rolling Restart Procedures](#43-rolling-restart-procedures)
+   - [4.4 Rollback Deployment](#44-rollback-deployment)
+   - [4.5 Typesense Reindex from PostgreSQL](#45-typesense-reindex-from-postgresql)
+   - [4.6 Full Host Recovery](#46-full-host-recovery)
+5. [Escalation Matrix](#5-escalation-matrix)
+6. [Monitoring Dashboards](#6-monitoring-dashboards)
+7. [Useful PromQL Queries](#7-useful-promql-queries)
+8. [Environment Quick Reference](#8-environment-quick-reference)
+
+---
+
+## 1. Service Inventory
+
+### Production Services (`docker-compose.prod.yml`)
+
+| Service | Image | Port | Resource Limits | Health Check |
+|---------|-------|------|-----------------|--------------|
+| **api** (NestJS) | `ghcr.io/goodgo/goodgo-api` | 3001 | 1 CPU / 1 GB | `GET /health` (node fetch) |
+| **web** (Next.js) | `ghcr.io/goodgo/goodgo-web` | 3000 | 0.5 CPU / 512 MB | `GET /` (node fetch) |
+| **ai-services** (FastAPI) | `ghcr.io/goodgo/goodgo-ai-services` | 8000 | 1 CPU / 1 GB | `GET /health` (httpx) |
+| **postgres** | `postgis/postgis:16-3.4` | 5432 (internal) | 2 CPU / 2 GB, shm=256m | `pg_isready` |
+| **pgbouncer** | `edoburu/pgbouncer:1.23.1-p2` | 6432 (internal) | 0.5 CPU / 256 MB | `pg_isready -p 6432` |
+| **redis** | `redis:7-alpine` | 6379 (internal) | 0.5 CPU / 768 MB | `redis-cli ping` |
+| **typesense** | `typesense/typesense:27.1` | 8108 (internal) | 1 CPU / 1 GB | `curl /health` |
+| **minio** | `minio/minio:latest` | 9000/9001 (internal) | 0.5 CPU / 1 GB | `mc ready local` |
+| **pg-backup** | `postgis/postgis:16-3.4` | n/a | 0.5 CPU / 512 MB | none (cron daemon) |
+| **loki** | `grafana/loki:3.0.0` | 3100 (internal) | 0.5 CPU / 512 MB | `wget /ready` |
+| **promtail** | `grafana/promtail:3.0.0` | n/a | 0.25 CPU / 256 MB | none |
+| **prometheus** | `prom/prometheus:v2.51.0` | 9090 (internal) | 0.5 CPU / 1 GB | `wget /-/healthy` |
+| **grafana** | `grafana/grafana:10.4.1` | 3002 (external) | 0.5 CPU / 512 MB | `wget /api/health` |
+
+### Development-Only Services (`docker-compose.yml`)
+
+Development uses the same data and monitoring services but runs API/Web on the host. The `pg-backup` service also runs in dev with default credentials.
+
+### Service Dependency Chain
+
+```
+web --> api --> pgbouncer --> postgres
+          |-> redis
+          |-> typesense
+          |-> minio
+          |-> ai-services
+
+grafana --> prometheus
+   |-> loki --> promtail (Docker socket)
+
+pg-backup --> postgres
+```
+
+---
+
+## 2. Health Checks
+
+### Application Health Endpoints
+
+| Endpoint | Type | Checks | Expected Response |
+|----------|------|--------|-------------------|
+| `GET /health` | Liveness | Process is running | `200 { status: "ok" }` |
+| `GET /health/ready` | Readiness | PostgreSQL + Redis | `200 { status: "ok", info: { database: ..., redis: ... } }` |
+| `GET /health/db` | Database only | PostgreSQL connectivity | `200 { status: "ok", info: { database: ... } }` |
+| `GET /health/redis` | Redis only | Redis connectivity | `200 { status: "ok", info: { redis: ... } }` |
+
+### Verify All Services Are Healthy
+
+```bash
+# Quick check: all containers
+docker compose -f docker-compose.prod.yml ps --format "table {{.Name}}\t{{.Status}}\t{{.Health}}"
+
+# API liveness
+curl -sf http://localhost:3001/health && echo "API OK" || echo "API FAIL"
+
+# API readiness (DB + Redis)
+curl -sf http://localhost:3001/health/ready | jq .
+
+# Individual dependency checks
+curl -sf http://localhost:3001/health/db | jq .
+curl -sf http://localhost:3001/health/redis | jq .
+
+# Typesense
+curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health
+
+# MinIO
+docker exec goodgo-minio mc ready local && echo "MinIO OK"
+
+# AI Services
+curl -sf http://localhost:8000/health && echo "AI OK" || echo "AI FAIL"
+
+# PostgreSQL (direct)
+docker exec goodgo-postgres pg_isready -U ${DB_USER} -d ${DB_NAME}
+
+# PgBouncer
+docker exec goodgo-pgbouncer pg_isready -h 127.0.0.1 -p 6432 -U ${DB_USER}
+
+# Redis
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" ping
+
+# Prometheus
+curl -sf http://localhost:9090/-/healthy && echo "Prometheus OK"
+
+# Loki
+curl -sf http://localhost:3100/ready && echo "Loki OK"
+
+# Grafana
+curl -sf http://localhost:3002/api/health | jq .
+```
+
+### Container Resource Usage
+
+```bash
+docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.NetIO}}"
+```
+
+---
+
+## 3. Common Incidents
+
+### 3.1 Database Connection Pool Exhaustion
+
+**Symptoms:**
+- API returns 503 or hangs on requests
+- `/health/ready` returns unhealthy for `database`
+- PgBouncer logs: `no more connections allowed` or `query_wait_timeout`
+- Prometheus: spike in `pg_stat_activity` active connections
+
+**Diagnosis:**
+
+```bash
+# Check PgBouncer pool status
+docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW POOLS;"
+docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW CLIENTS;"
+docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW STATS;"
+
+# Check PostgreSQL active connections
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
+  "SELECT state, count(*) FROM pg_stat_activity WHERE datname = '${DB_NAME}' GROUP BY state;"
+
+# Identify long-running queries
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
+  "SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
+   FROM pg_stat_activity
+   WHERE datname = '${DB_NAME}' AND state != 'idle'
+   ORDER BY duration DESC
+   LIMIT 10;"
+```
+
+**Resolution:**
+
+```bash
+# 1. Kill long-running queries (> 5 minutes)
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
+  "SELECT pg_terminate_backend(pid)
+   FROM pg_stat_activity
+   WHERE datname = '${DB_NAME}'
+     AND state != 'idle'
+     AND now() - query_start > interval '5 minutes'
+     AND pid <> pg_backend_pid();"
+
+# 2. If pool is fully exhausted, restart PgBouncer
+docker compose -f docker-compose.prod.yml restart pgbouncer
+
+# 3. If issue persists, increase pool size temporarily
+# Edit PGBOUNCER_POOL_SIZE in .env, then:
+docker compose -f docker-compose.prod.yml up -d --no-deps pgbouncer
+```
+
+**PgBouncer Configuration Reference:**
+- Pool mode: `transaction` (connections returned to pool after each transaction)
+- Default pool size: 20 server connections per user/db pair
+- Max client connections: 200
+- Reserve pool: 5 extra connections (after 3s wait)
+- Query wait timeout: 120s (error if client waits this long)
+
+### 3.2 Redis Connection Failure
+
+**Symptoms:**
+- `/health/redis` returns unhealthy
+- Increased API response times (cache misses hitting DB)
+- API logs show Redis connection errors
+
+**Diagnosis:**
+
+```bash
+# Check Redis container
+docker logs --tail=50 goodgo-redis
+
+# Test connectivity
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" ping
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO server
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO clients
+```
+
+**Resolution:**
+
+```bash
+# 1. Restart Redis (data persisted via AOF)
+docker compose -f docker-compose.prod.yml restart redis
+
+# 2. If OOM: check memory usage
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory | grep used_memory_human
+# Max memory is 512 MB (prod), eviction policy: allkeys-lru
+
+# 3. If AOF is corrupted: repair it from a one-off container (`docker exec` cannot reach a stopped container).
+#    Redis 7 keeps the AOF under /data/appendonlydir/ by default, so adjust the path if needed.
+docker compose -f docker-compose.prod.yml stop redis
+docker compose -f docker-compose.prod.yml run --rm --no-deps redis redis-check-aof --fix /data/appendonly.aof
+docker compose -f docker-compose.prod.yml start redis
+```
+
+**Graceful Degradation:** The API is designed to continue operating when Redis is unavailable. Cache misses fall through to PostgreSQL. Performance will degrade but functionality is preserved. Redis is non-critical for core operations.
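
For the OOM check in step 2 of the Redis resolution, it can help to express memory use as a share of the configured limit rather than reading `used_memory_human` by eye. The sketch below parses `INFO memory` output for that purpose. `redis_mem_pct` is a hypothetical helper name, not something shipped in the repository, and it assumes `maxmemory` is set (512 MB in prod, per the note above).

```bash
#!/bin/sh
# Sketch only: redis_mem_pct is a hypothetical helper, not part of the repo.
# Reads `redis-cli INFO memory` output on stdin and prints used_memory as an
# integer percentage of maxmemory, or "unlimited" when maxmemory is 0 or unset.
redis_mem_pct() {
  # Redis ends each INFO line with CRLF, so strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max > 0) printf "%d\n", used * 100 / max
      else print "unlimited"
    }'
}

# Example against the running container (same command as in the diagnosis steps):
#   docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory | redis_mem_pct
```

A value that stays close to 100 means `allkeys-lru` is evicting continuously, which usually shows up as a lower cache hit rate rather than as errors.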
+
+### 3.3 Typesense Unavailable
+
+**Symptoms:**
+- Search functionality returns errors or falls back to basic DB search
+- `curl http://localhost:8108/health` fails
+- API logs show Typesense connection timeouts
+
+**Diagnosis:**
+
+```bash
+# Check container status
+docker logs --tail=50 goodgo-typesense
+
+# Check health
+curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health
+
+# Check collections
+curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/collections | jq .
+
+# Check disk space for Typesense data volume
+docker system df -v | grep typesense
+```
+
+**Resolution:**
+
+```bash
+# 1. Restart Typesense
+docker compose -f docker-compose.prod.yml restart typesense
+
+# 2. If data is corrupted: rebuild from PostgreSQL
+docker compose -f docker-compose.prod.yml stop typesense
+docker volume rm goodgo-platform-ai_typesense_data
+docker compose -f docker-compose.prod.yml up -d typesense
+# Wait for healthy, then reindex:
+docker compose -f docker-compose.prod.yml exec api npx ts-node scripts/typesense-reindex.ts
+# Or: pnpm run typesense:reindex
+
+# 3. If volume backup exists: restore
+docker compose -f docker-compose.prod.yml stop typesense
+docker run --rm -v goodgo-platform-ai_typesense_data:/data -v $(pwd)/backups:/backup \
+  alpine sh -c "rm -rf /data/* && tar xzf /backup/typesense_YYYYMMDD_HHMMSS.tar.gz -C /data"
+docker compose -f docker-compose.prod.yml start typesense
+```
+
+**Fallback Behavior:** When Typesense is unavailable, property search falls back to PostgreSQL full-text search with PostGIS geo queries. Search quality degrades but core functionality works.
+
+### 3.4 High API Latency
+
+**Symptoms:**
+- Prometheus alert `ApiLatencyP99High` fires (p99 > 1s for 5 min)
+- Critical alert `ApiLatencyP99Critical` fires (p99 > 3s for 3 min, an SLO breach)
+- Users report slow page loads
+
+**Diagnosis:**
+
+```bash
+# 1. Check which endpoints are slow
+# Grafana: GoodGo API Latency dashboard
+# Or via PromQL:
+curl -s "http://localhost:9090/api/v1/query" --data-urlencode \
+  'query=topk(10, histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket[5m])) by (le, route, method)))' \
+  | jq '.data.result[] | {route: .metric.route, method: .metric.method, p99: .value[1]}'
+
+# 2. Check database slow queries
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
+  "SELECT pid, now() - query_start AS duration, left(query, 100) AS query_preview
+   FROM pg_stat_activity
+   WHERE state = 'active' AND now() - query_start > interval '1 second'
+   ORDER BY duration DESC;"
+
+# 3. Check PgBouncer wait times
+docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW POOLS;"
+
+# 4. Check container resource usage
+docker stats --no-stream goodgo-api goodgo-postgres goodgo-redis goodgo-pgbouncer
+
+# 5. Check Redis latency
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" --latency-history -i 3
+
+# 6. Check application logs for errors
+docker logs --tail=200 --since=5m goodgo-api 2>&1 | grep -i "error\|timeout\|slow"
+```
+
+**Resolution:**
+
+```bash
+# 1. If DB slow queries: terminate them
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
+  "SELECT pg_terminate_backend(pid)
+   FROM pg_stat_activity
+   WHERE state = 'active' AND now() - query_start > interval '30 seconds';"
+
+# 2. If connection pool exhaustion: see Section 3.1
+
+# 3. If Redis is slow: restart
+docker compose -f docker-compose.prod.yml restart redis
+
+# 4. If API container OOM: restart it (to raise the 1 GB limit, edit docker-compose.prod.yml and recreate with `up -d --no-deps api`)
+docker compose -f docker-compose.prod.yml restart api
+
+# 5. If specific endpoint is the bottleneck, check Loki logs:
+# Grafana > Explore > Loki > {container_name="goodgo-api"} |= "slow"
+```
+
+### 3.5 Payment Callback Failures
+
+**Symptoms:**
+- Users report payments stuck in "pending" state
+- VNPay/MoMo/ZaloPay IPN callbacks returning errors
+- Payment reconciliation mismatches
+
+**Diagnosis:**
+
+```bash
+# 1. Check payment callback logs
+docker logs goodgo-api 2>&1 | grep -i "payment\|callback\|vnpay\|momo\|zalopay" | tail -50
+
+# 2. Check for pending payments in DB
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
+  "SELECT id, provider, status, \"amountVND\", \"createdAt\"
+   FROM \"Payment\"
+   WHERE status = 'PENDING'
+     AND \"createdAt\" > now() - interval '24 hours'
+   ORDER BY \"createdAt\" DESC
+   LIMIT 20;"
+
+# 3. Verify callback URL is reachable from external networks
+curl -sf https://your-domain.com/api/payments/vnpay/callback && echo "Callback URL reachable"
+
+# 4. Check if API is receiving callbacks (via Loki)
+# Grafana > Explore > Loki > {container_name="goodgo-api"} |= "callback" |= "payment"
+```
+
+**Resolution:**
+
+```bash
+# 1. If callbacks are timing out: check API health and restart if needed
+docker compose -f docker-compose.prod.yml restart api
+
+# 2. If VNPay signature verification fails: verify VNPAY_* env vars
+docker compose -f docker-compose.prod.yml exec api printenv | grep VNPAY
+
+# 3. For stuck payments: manual reconciliation
+# Check VNPay/MoMo merchant portal for actual transaction status
+# Update payment status in DB if confirmed paid:
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
+  "UPDATE \"Payment\" SET status = 'COMPLETED', \"updatedAt\" = now()
+   WHERE id = '' AND status = 'PENDING';"
+
+# 4. If callbacks are not reaching the server, check:
+# - Firewall rules (port 3001 or reverse proxy port must be open)
+# - SSL certificate validity
+# - DNS resolution
+# - Payment provider webhook configuration (correct callback URL)
+```
+
+**Important:** The payment callback handler uses idempotent processing with atomic state transitions. Replaying a callback is safe and will not duplicate payments.
+
+### 3.6 Disk Space Alerts
+
+**Symptoms:**
+- Containers failing to start or crashing
+- PostgreSQL refusing writes (`PANIC: could not write to file`)
+- Docker daemon running out of space
+
+**Diagnosis:**
+
+```bash
+# Host disk usage
+df -h
+
+# Docker disk usage
+docker system df
+docker system df -v
+
+# Check individual volume sizes
+for vol in $(docker volume ls -q | grep goodgo); do
+  echo -n "$vol: "
+  docker run --rm -v "${vol}:/data" alpine du -sh /data 2>/dev/null
+done
+
+# Check backup volume specifically
+docker exec goodgo-pg-backup du -sh /backups/
+docker exec goodgo-pg-backup ls -lht /backups/
+```
+
+**Resolution:**
+
+```bash
+# 1. Clean up Docker artifacts
+docker system prune -f  # Remove stopped containers, unused networks, dangling images
+docker image prune -a -f  # Remove ALL unused images (careful in prod)
+
+# 2. Clean old backups (if retention not working)
+docker exec goodgo-pg-backup find /backups -name "goodgo_*.sql.gz" -mtime +7 -delete
+
+# 3. Clean Prometheus data (if too large)
+# Prometheus retention is 30d (prod) / 15d (dev), configured via --storage.tsdb.retention.time
+# To force compaction:
+curl -sf -XPOST http://localhost:9090/-/quit  # Graceful shutdown triggers compaction (endpoint requires --web.enable-lifecycle)
+docker compose -f docker-compose.prod.yml start prometheus
+
+# 4. Clean Loki data (15-day retention)
+# Loki handles its own cleanup via compactor. If urgent:
+docker compose -f docker-compose.prod.yml restart loki
+
+# 5. Truncate Docker container logs
+sudo truncate -s 0 $(docker inspect --format='{{.LogPath}}' goodgo-api)
+# Or for all containers:
+sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'
+```
+
+**Prevention:** All production containers use `json-file` logging with `max-size: 10m` and `max-file: 3-5`. Backup retention is 7 days (configurable via `BACKUP_RETENTION_DAYS`).
+
+### 3.7 MinIO / Object Storage Failure
+
+**Symptoms:**
+- Image/file uploads fail
+- Property photos not loading
+- MinIO console inaccessible at port 9001
+
+**Diagnosis:**
+
+```bash
+docker logs --tail=50 goodgo-minio
+docker exec goodgo-minio mc ready local
+docker exec goodgo-minio mc admin info local
+```
+
+**Resolution:**
+
+```bash
+# 1. Restart MinIO
+docker compose -f docker-compose.prod.yml restart minio
+
+# 2. If data volume corrupted
+docker compose -f docker-compose.prod.yml stop minio
+docker volume rm goodgo-platform-ai_minio_data  # WARNING: data loss
+docker compose -f docker-compose.prod.yml up -d minio
+# Recreate buckets via API or admin console
+```
+
+### 3.8 AI Services Unavailable
+
+**Symptoms:**
+- AI-powered features (AVM, property descriptions) fail
+- `GET /health` on port 8000 fails
+- API logs show AI service connection timeouts
+
+**Diagnosis:**
+
+```bash
+docker logs --tail=50 goodgo-ai-services
+curl -sf http://localhost:8000/health
+docker stats --no-stream goodgo-ai-services
+```
+
+**Resolution:**
+
+```bash
+# 1. Restart AI services
+docker compose -f docker-compose.prod.yml restart ai-services
+
+# 2. Check rate limits (default: 60/minute)
+docker compose -f docker-compose.prod.yml exec ai-services printenv | grep AI_RATE_LIMIT
+
+# 3. If OOM: the service has a 1 GB limit; it may need to be increased for large models
+```
+
+**Graceful Degradation:** AI features are optional. The API should handle AI service unavailability gracefully and return non-AI results.
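
After a restart such as step 1 of the AI services resolution, a single `curl` right away can fail simply because the container is still starting. One way to avoid a false alarm is to poll the health endpoint a bounded number of times. The helper below is a minimal sketch of that idea; `wait_until` is a hypothetical name and the attempt count and delay in the example are arbitrary choices, not values taken from the compose file.

```bash
#!/bin/sh
# Sketch only: wait_until is a hypothetical helper, not part of the repo.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
# Runs the command until it succeeds or the attempts are used up.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # No point sleeping after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example: allow the AI service up to 10 checks, 3 seconds apart.
#   docker compose -f docker-compose.prod.yml restart ai-services
#   wait_until 10 3 curl -sf -o /dev/null http://localhost:8000/health \
#     && echo "AI OK" || echo "AI FAIL"
```

The same helper works for any of the health commands in Section 2, since it only looks at the exit status of the command it is given.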
+
+### 3.9 Log Pipeline Failure (Loki/Promtail)
+
+**Symptoms:**
+- Grafana log explorer returns empty results
+- Promtail container unhealthy or crash-looping
+- Loki returning 503
+
+**Diagnosis:**
+
+```bash
+docker logs --tail=50 goodgo-loki
+docker logs --tail=50 goodgo-promtail
+curl -sf http://localhost:3100/ready && echo "Loki ready" || echo "Loki NOT ready"
+```
+
+**Resolution:**
+
+```bash
+# 1. Restart the pipeline
+docker compose -f docker-compose.prod.yml restart loki promtail
+
+# 2. If Loki data corrupted
+docker compose -f docker-compose.prod.yml stop loki promtail
+docker volume rm goodgo-platform-ai_loki_data
+docker compose -f docker-compose.prod.yml up -d loki promtail
+# Historical logs are lost but new logs will flow immediately
+
+# 3. If Promtail can't access Docker socket
+ls -la /var/run/docker.sock
+# Ensure the promtail container has the Docker socket mounted
+```
+
+### 3.10 5xx Error Rate Spike
+
+**Symptoms:**
+- Prometheus alert `ApiErrorRate5xxHigh` fires (> 1% 5xx for 5 min)
+- Users reporting errors
+
+**Diagnosis:**
+
+```bash
+# Check which endpoints are returning 5xx
+curl -s "http://localhost:9090/api/v1/query" --data-urlencode \
+  'query=topk(10, sum(rate(http_requests_total{job="goodgo-api", status_code=~"5.."}[5m])) by (route, method))' \
+  | jq '.data.result'
+
+# Check API error logs
+docker logs --tail=200 --since=5m goodgo-api 2>&1 | grep -i "error\|exception\|500"
+
+# Check all dependency health
+curl -sf http://localhost:3001/health/ready | jq .
+```
+
+**Resolution:**
+1. If DB-related: see [Section 3.1](#31-database-connection-pool-exhaustion)
+2. If Redis-related: see [Section 3.2](#32-redis-connection-failure)
+3. If recent deployment: see [Section 4.4](#44-rollback-deployment)
+4. If unknown: restart API and investigate logs
+
+---
+
+## 4. Recovery Procedures
+
+### 4.1 Database Restore from Backup
+
+**Automated backups run daily at 02:00 UTC** via the `pg-backup` container. Retention: 7 days. Format: `pg_dump --format=custom --compress=6`.
+
+**Automated verification runs daily at 04:00 UTC**: it restores to an isolated test database, then verifies table existence, row counts, checksums, the PostGIS extension, indexes, and enums. Reports are written to `/backups/verify-latest.json`.
+
+#### List Available Backups
+
+```bash
+docker exec goodgo-pg-backup ls -lht /backups/goodgo_*.sql.gz
+```
+
+#### Create an On-Demand Backup
+
+```bash
+docker exec goodgo-pg-backup /scripts/pg-backup.sh
+```
+
+#### Full Restore Procedure
+
+```bash
+# 1. Stop application services
+docker compose -f docker-compose.prod.yml stop api web ai-services
+
+# 2. (Production) Stop PgBouncer to prevent stale connections
+docker compose -f docker-compose.prod.yml stop pgbouncer
+
+# 3. Run the restore script
+docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
+# The script will:
+# - Terminate active DB connections
+# - DROP and recreate the database
+# - Restore from the backup file
+
+# 4. Verify the restore
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c '\dt'
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c 'SELECT count(*) FROM "User";'
+
+# 5. Restart all services
+docker compose -f docker-compose.prod.yml up -d
+
+# 6. Apply any pending migrations (`exec` needs the api container running, hence after the restart)
+docker compose -f docker-compose.prod.yml exec api npx prisma migrate deploy
+
+# 7. Verify application health
+curl -sf http://localhost:3001/health/ready | jq .
+```
+
+#### Verify a Backup Without Restoring
+
+```bash
+# Run verification against latest backup (creates temp DB, drops it after)
+docker compose run --rm pg-verify-backup
+
+# Or verify a specific backup file
+docker exec goodgo-pg-backup /scripts/pg-verify-backup.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
+
+# Check latest verification report
+docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .
+```
+
+**RPO/RTO:**
+- RPO: ≤ 24 hours (daily backups; consider WAL archiving for lower RPO)
+- RTO: ~15 minutes (local volume), ~30 minutes (off-site)
+
+### 4.2 Redis Cache Flush & Warm-up
+
+```bash
+# Flush all Redis data
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" FLUSHALL
+
+# Verify flush
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" DBSIZE
+# Should return: (integer) 0
+```
+
+**Warm-up:** Redis uses `allkeys-lru` eviction. Cache warms naturally as users make requests. No manual warm-up script is needed because cache misses fall through to PostgreSQL.
+
+**When to flush:**
+- After database restore (stale cache references)
+- After data corruption at the application level
+- After schema changes that alter cached data structures
+
+### 4.3 Rolling Restart Procedures
+
+#### Single Service Restart (Zero Downtime)
+
+```bash
+# API: the --wait flag ensures the health check passes before moving on
+docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
+
+# Web
+docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
+
+# AI Services
+docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services
+```
+
+#### Full Stack Rolling Restart
+
+```bash
+# Data services first (order matters for dependency chain)
+docker compose -f docker-compose.prod.yml restart redis
+docker compose -f docker-compose.prod.yml restart typesense
+
+# Wait for data services to be healthy
+sleep 10
+
+# Connection pooling
+docker compose -f docker-compose.prod.yml restart pgbouncer
+sleep 5
+
+# Application services
+docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
+docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
+docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services
+
+# Verify
+curl -sf http://localhost:3001/health/ready | jq .
+```
+
+#### Emergency: Restart Everything
+
+```bash
+docker compose -f docker-compose.prod.yml down
+docker compose -f docker-compose.prod.yml up -d --wait
+```
+
+### 4.4 Rollback Deployment
+
+The CI/CD pipeline (`.github/workflows/deploy.yml`) supports automatic rollback if production smoke tests fail. For manual rollback:
+
+#### Quick Rollback (Revert to Previous Images)
+
+```bash
+# SSH into production host
+ssh deploy@$PRODUCTION_HOST
+
+cd ~/goodgo
+
+# Stop current app containers
+docker compose -f docker-compose.prod.yml down api web ai-services
+
+# The previous images are still cached locally
+# Restart without pulling, which uses the last-known-good images
+docker compose -f docker-compose.prod.yml up -d --wait api web ai-services
+
+# Verify
+curl -sf http://localhost:3001/health && echo "Rollback successful"
+```
+
+#### Rollback to a Specific Git Commit / Image Tag
+
+```bash
+# Set the target tag (git SHA)
+export IMAGE_TAG=
+export REGISTRY_URL=ghcr.io/goodgo
+
+# Pull specific version
+docker compose -f docker-compose.prod.yml pull api web ai-services
+
+# Deploy
+docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
+docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
+docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services
+
+# Verify
+curl -sf http://localhost:3001/health/ready | jq .
+```
+
+#### Rollback Database Migrations
+
+```bash
+# WARNING: Prisma does not support automatic down-migrations.
+# For migration rollback, restore from the pre-migration backup:
+
+# 1. Stop application
+docker compose -f docker-compose.prod.yml stop api web ai-services pgbouncer
+
+# 2. Restore from backup taken before the migration
+docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/.sql.gz
+
+# 3. Deploy the previous code version (older IMAGE_TAG)
+export IMAGE_TAG=
+docker compose -f docker-compose.prod.yml up -d --wait
+```
+
+### 4.5 Typesense Reindex from PostgreSQL
+
+If Typesense data is lost or corrupted, rebuild the search index from PostgreSQL:
+
+```bash
+# 1. Ensure Typesense is running and healthy
+curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health
+
+# 2. Run reindex
+docker compose -f docker-compose.prod.yml exec api npx ts-node scripts/typesense-reindex.ts
+# Or from host:
+pnpm run typesense:reindex
+
+# 3. Verify collections
+curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/collections | jq '.[].name'
+```
+
+### 4.6 Full Host Recovery
+
+For complete host failure or migration to a new server:
+
+```bash
+# 1. Provision new host with Docker + Docker Compose
+# Requirements: Docker >= 24, Docker Compose v2, 8 GB RAM minimum
+
+# 2. Clone repository and configure
+git clone ~/goodgo && cd ~/goodgo
+cp .env.example .env
+# Edit .env with production secrets (from secrets manager)
+
+# 3. Restore PostgreSQL backup from off-site storage
+# Transfer backup file to the new host
+scp backups/goodgo_latest.sql.gz deploy@newhost:~/goodgo/backups/
+
+# 4. Start infrastructure services (pg-backup is needed for the restore script in step 5)
+docker compose -f docker-compose.prod.yml up -d postgres redis typesense minio pg-backup
+
+# 5. Wait for PostgreSQL to be ready, then restore
+# If /backups is the pg_backups named volume, copy the file in first:
+#   docker cp backups/goodgo_latest.sql.gz goodgo-pg-backup:/backups/
+docker compose -f docker-compose.prod.yml exec postgres pg_isready
+docker exec goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_latest.sql.gz
+
+# 6. Start application services
+docker compose -f docker-compose.prod.yml up -d
+
+# 7. Run migrations (if backup predates latest code)
+docker compose -f docker-compose.prod.yml exec api npx prisma migrate deploy
+
+# 8. Rebuild Typesense index
+pnpm run typesense:reindex
+
+# 9. Flush Redis (stale cache from old host)
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" FLUSHALL
+
+# 10. Verify everything
+curl -sf http://localhost:3001/health/ready | jq .
+curl -sf http://localhost:3000 > /dev/null && echo "Web OK"
+
+# Expected RTO: ~60 minutes (depends on backup transfer speed)
+```
+
+---
+
+## 5. Escalation Matrix
+
+| Severity | Condition | First Responder | Escalation | SLA |
+|----------|-----------|-----------------|------------|-----|
+| **P0 - Critical** | Full outage, data loss, payment corruption | On-call SRE | CTO + CEO within 15 min | Acknowledge: 5 min, Resolve: 1 hour |
+| **P1 - High** | Partial outage, SLO breach (p99 > 3s), 5xx > 5% | On-call SRE | Engineering lead within 30 min | Acknowledge: 15 min, Resolve: 4 hours |
+| **P2 - Medium** | Degraded performance, single service down (non-critical), p99 > 1s | On-call SRE | Team lead next business day | Acknowledge: 1 hour, Resolve: 24 hours |
+| **P3 - Low** | Cosmetic issues, monitoring gaps, non-urgent improvements | Assigned engineer | Sprint planning | Next sprint |
+
+### Contact Channels
+
+| Role | Channel |
+|------|---------|
+| On-call SRE | Slack `#sre-oncall` + PagerDuty |
+| Engineering Lead | Slack `#engineering` |
+| CTO | Slack DM / Phone (see PagerDuty) |
+| Payment Issues | Slack `#payments` + VNPay/MoMo support portals |
+| Infrastructure | Slack `#infrastructure` |
+
+### Slack Notifications
+
+The deploy pipeline automatically notifies `#deployments` (via `SLACK_WEBHOOK_URL`) on:
+- Production deploy success
+- Staging smoke test failure
+- Production rollback triggered
+
+---
+
+## 6. Monitoring Dashboards
+
+All dashboards are provisioned automatically via `monitoring/grafana/provisioning/` and are available in the **GoodGo** folder in Grafana.
+
+| Dashboard | Grafana Path | Purpose |
+|-----------|--------------|---------|
+| **API Overview** | `api-overview` | Request rates, status codes, active connections |
+| **API Latency** | `api-latency` | p50/p95/p99 latency by endpoint, latency heatmaps |
+| **Database** | `database` | PostgreSQL connections, query performance, PgBouncer stats |
+| **Search** | `search` | Typesense query rates, latency, index sizes |
+| **Business Metrics** | `business-metrics` | Listings, inquiries, payments, user registrations |
+| **Web Vitals** | `web-vitals` | Core Web Vitals (LCP, FID, CLS), page load times |
+| **Logs** | `logs` | Loki log explorer with filters by service, level, correlation ID |
+
+**Access:** `http://localhost:3002` (default credentials in `.env`: `GRAFANA_ADMIN_USER` / `GRAFANA_ADMIN_PASSWORD`)
+
+**Data Sources:**
+- **Prometheus** (`http://prometheus:9090`): Metrics (default)
+- **Loki** (`http://loki:3100`): Logs, with correlation ID linking to Prometheus
+
+---
+
+## 7. Useful PromQL Queries
+
+### API Performance
+
+```promql
+# Overall p99 latency
+histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket{job="goodgo-api"}[5m])) by (le))
+
+# Per-endpoint p99 latency (top 10 slowest)
+topk(10, histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket{job="goodgo-api"}[5m])) by (le, route, method)))
+
+# Request rate by status code
+sum(rate(http_requests_total{job="goodgo-api"}[5m])) by (status_code)
+
+# 5xx error percentage
+(sum(rate(http_requests_total{job="goodgo-api", status_code=~"5.."}[5m])) / sum(rate(http_requests_total{job="goodgo-api"}[5m]))) * 100
+```
+
+### Database
+
+```promql
+# Active connections
+pg_stat_activity_count{datname="goodgo", state="active"}
+
+# Connection pool utilization (if PgBouncer metrics are scraped)
+# Manual check via: SHOW POOLS in PgBouncer admin console
+```
+
+### Infrastructure
+
+```promql
+# Container memory usage
+container_memory_usage_bytes{name=~"goodgo-.*"}
+
+# Container CPU usage
+rate(container_cpu_usage_seconds_total{name=~"goodgo-.*"}[5m])
+```
+
+---
+
+## 8. Environment Quick Reference
+
+### Key Environment Variables
+
+| Variable | Required | Description |
+|----------|----------|-------------|
+| `DATABASE_URL` | Yes | PostgreSQL via PgBouncer (`postgresql://user:pass@pgbouncer:6432/db`) |
+| `DATABASE_URL_DIRECT` | Yes (prod) | Direct PostgreSQL for migrations (`postgresql://user:pass@postgres:5432/db`) |
+| `JWT_SECRET` | Yes | JWT signing secret |
+| `JWT_REFRESH_SECRET` | Yes | Refresh token signing secret |
+| `REDIS_URL` | Yes | Redis connection (`redis://:password@redis:6379`) |
+| `REDIS_PASSWORD` | Yes (prod) | Redis auth password |
+| `TYPESENSE_API_KEY` | Yes | Typesense admin API key |
+| `MINIO_ACCESS_KEY` | Yes | MinIO root user |
+| `MINIO_SECRET_KEY` | Yes | MinIO root password |
+| `VNPAY_*` | Yes | VNPay payment gateway configuration |
+| `AI_API_KEY` | Yes | AI services authentication |
+| `GRAFANA_ADMIN_USER` | Yes (prod) | Grafana admin username |
+| `GRAFANA_ADMIN_PASSWORD` | Yes (prod) | Grafana admin password |
+| `PGBOUNCER_POOL_SIZE` | No | PgBouncer pool size (default: 20) |
+| `PGBOUNCER_MAX_CLIENT_CONN` | No | Max PgBouncer client connections (default: 200) |
+| `BACKUP_RETENTION_DAYS` | No | Backup retention period (default: 7) |
+| `IMAGE_TAG` | No (prod) | Container image tag (default: `latest`) |
+
+### Port Map
+
+| Port | Service | Exposed |
+|------|---------|---------|
+| 3000 | Web (Next.js) | External |
+| 3001 | API (NestJS) | External |
+| 3002 | Grafana | External (admin only) |
+| 5432 | PostgreSQL | Internal |
+| 6432 | PgBouncer | Internal |
+| 6379 | Redis | Internal |
+| 8000 | AI Services | Internal |
+| 8108 | Typesense | Internal |
+| 9000 | MinIO API | Internal |
+| 9001 | MinIO Console | Internal |
+| 9090 | Prometheus | Internal |
+| 3100 | Loki | Internal |
+
+### Docker Volumes
+
+| Volume | Service | Purpose |
+|--------|---------|---------|
+| `pgdata` | PostgreSQL | Database files |
+| `redis_data` | Redis | AOF persistence |
+| `typesense_data` | Typesense | Search index data |
+| `minio_data` | MinIO | Object storage (images, files) |
+| `pg_backups` | pg-backup | Database backup files |
+| `loki_data` | Loki | Log storage (15-day retention) |
+| `prometheus_data` | Prometheus | Metrics (30-day retention prod / 15-day dev) |
+| `grafana_data` | Grafana | Dashboard state, user preferences |
+
+---
+
+## Appendix: Alert Rules Reference
+
+| Alert | Expression | Severity | Duration |
+|-------|-----------|----------|----------|
+| `ApiLatencyP99High` | p99 > 1s | Warning | 5 min |
+| `ApiEndpointLatencyP99High` | Per-route p99 > 2s | Warning | 5 min |
+| `ApiLatencyP99Critical` | p99 > 3s (SLO breach) | Critical | 3 min |
+| `ApiErrorRate5xxHigh` | 5xx rate > 1% | Warning | 5 min |
+
+Alert rules are defined in `monitoring/prometheus/alert-rules.yml` and evaluated every 15 seconds.
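
When triaging by hand, the thresholds in the alert table can be applied to a value read from Prometheus or Grafana. The sketch below is one way to do that from a shell. `classify` is a hypothetical helper, not part of the repository or of `alert-rules.yml`, and it ignores the `for` duration that the real rules apply, so it only describes an instantaneous reading.

```bash
#!/bin/sh
# Sketch only: classify is a hypothetical helper, not part of the repo.
# Usage: classify <value> <warning_threshold> [critical_threshold]
# Prints ok, warning, critical, or unknown (for non-numeric input such as NaN).
classify() {
  awk -v v="$1" -v warn="$2" -v crit="${3:-}" 'BEGIN {
    # Prometheus can return NaN and jq can print null for an empty result.
    if (v !~ /^[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?$/) { print "unknown"; exit }
    if (crit != "" && v + 0 > crit + 0) print "critical"
    else if (v + 0 > warn + 0) print "warning"
    else print "ok"
  }'
}

# Example: overall p99 in seconds against the 1s / 3s thresholds from the table.
#   p99=$(curl -s "http://localhost:9090/api/v1/query" --data-urlencode \
#     'query=histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket{job="goodgo-api"}[5m])) by (le))' \
#     | jq -r '.data.result[0].value[1]')
#   classify "$p99" 1 3
#
# Example: 5xx percentage against the 1% warning threshold.
#   classify "$pct_5xx" 1
```

Keeping the thresholds as arguments avoids hard-coding a second copy of the alert rules; the values passed in the examples are the ones listed in the table above.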