docs: add production operational runbook

Create comprehensive docs/RUNBOOK.md covering all 14 production services, health checks, 10 common incident scenarios with diagnosis/resolution, recovery procedures (DB restore, Redis flush, rolling restart, rollback), escalation matrix, monitoring dashboards, and PromQL queries.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-11 00:28:02 +07:00
parent 45e48c063c
commit f27b13f712
1 changed files with 975 additions and 0 deletions
--- a/docs/RUNBOOK.md
+++ b/docs/RUNBOOK.md
@@ -0,0 +1,975 @@
+# GoodGo Platform — Production Runbook
+
+> **Audience:** On-call SRE, DevOps engineers, and platform operators.
+> **Last updated:** 2026-04-11
+
+---
+
+## Table of Contents
+
+1. [Service Inventory](#1-service-inventory)
+2. [Health Checks](#2-health-checks)
+3. [Common Incidents](#3-common-incidents)
+   - [3.1 Database Connection Pool Exhaustion](#31-database-connection-pool-exhaustion)
+   - [3.2 Redis Connection Failure](#32-redis-connection-failure)
+   - [3.3 Typesense Unavailable](#33-typesense-unavailable)
+   - [3.4 High API Latency](#34-high-api-latency)
+   - [3.5 Payment Callback Failures](#35-payment-callback-failures)
+   - [3.6 Disk Space Alerts](#36-disk-space-alerts)
+   - [3.7 MinIO / Object Storage Failure](#37-minio--object-storage-failure)
+   - [3.8 AI Services Unavailable](#38-ai-services-unavailable)
+   - [3.9 Log Pipeline Failure (Loki/Promtail)](#39-log-pipeline-failure-lokipromtail)
+   - [3.10 5xx Error Rate Spike](#310-5xx-error-rate-spike)
+4. [Recovery Procedures](#4-recovery-procedures)
+   - [4.1 Database Restore from Backup](#41-database-restore-from-backup)
+   - [4.2 Redis Cache Flush & Warm-up](#42-redis-cache-flush--warm-up)
+   - [4.3 Rolling Restart Procedures](#43-rolling-restart-procedures)
+   - [4.4 Rollback Deployment](#44-rollback-deployment)
+   - [4.5 Typesense Reindex from PostgreSQL](#45-typesense-reindex-from-postgresql)
+   - [4.6 Full Host Recovery](#46-full-host-recovery)
+5. [Escalation Matrix](#5-escalation-matrix)
+6. [Monitoring Dashboards](#6-monitoring-dashboards)
+7. [Useful PromQL Queries](#7-useful-promql-queries)
+8. [Environment Quick Reference](#8-environment-quick-reference)
+
+---
+
+## 1. Service Inventory
+
+### Production Services (`docker-compose.prod.yml`)
+
+| Service | Image | Port | Resource Limits | Health Check |
+|---------|-------|------|-----------------|--------------|
+| **api** (NestJS) | `ghcr.io/goodgo/goodgo-api` | 3001 | 1 CPU / 1 GB | `GET /health` (node fetch) |
+| **web** (Next.js) | `ghcr.io/goodgo/goodgo-web` | 3000 | 0.5 CPU / 512 MB | `GET /` (node fetch) |
+| **ai-services** (FastAPI) | `ghcr.io/goodgo/goodgo-ai-services` | 8000 | 1 CPU / 1 GB | `GET /health` (httpx) |
+| **postgres** | `postgis/postgis:16-3.4` | 5432 (internal) | 2 CPU / 2 GB, shm=256m | `pg_isready` |
+| **pgbouncer** | `edoburu/pgbouncer:1.23.1-p2` | 6432 (internal) | 0.5 CPU / 256 MB | `pg_isready -p 6432` |
+| **redis** | `redis:7-alpine` | 6379 (internal) | 0.5 CPU / 768 MB | `redis-cli ping` |
+| **typesense** | `typesense/typesense:27.1` | 8108 (internal) | 1 CPU / 1 GB | `curl /health` |
+| **minio** | `minio/minio:latest` | 9000/9001 (internal) | 0.5 CPU / 1 GB | `mc ready local` |
+| **pg-backup** | `postgis/postgis:16-3.4` | — | 0.5 CPU / 512 MB | — (cron daemon) |
+| **loki** | `grafana/loki:3.0.0` | 3100 (internal) | 0.5 CPU / 512 MB | `wget /ready` |
+| **promtail** | `grafana/promtail:3.0.0` | — | 0.25 CPU / 256 MB | — |
+| **prometheus** | `prom/prometheus:v2.51.0` | 9090 (internal) | 0.5 CPU / 1 GB | `wget /-/healthy` |
+| **grafana** | `grafana/grafana:10.4.1` | 3002 (external) | 0.5 CPU / 512 MB | `wget /api/health` |
+
+### Development-Only Services (`docker-compose.yml`)
+
+Development uses the same data and monitoring services but runs API/Web on the host. The `pg-backup` service also runs in dev with default credentials.
+
+### Service Dependency Chain
+
+```
+web --> api --> pgbouncer --> postgres
+                  |-> redis
+                  |-> typesense
+                  |-> minio
+                  |-> ai-services
+
+grafana --> prometheus
+        |-> loki --> promtail (Docker socket)
+
+pg-backup --> postgres
+```
+
+---
+
+## 2. Health Checks
+
+### Application Health Endpoints
+
+| Endpoint | Type | Checks | Expected Response |
+|----------|------|--------|-------------------|
+| `GET /health` | Liveness | Process is running | `200 { status: "ok" }` |
+| `GET /health/ready` | Readiness | PostgreSQL + Redis | `200 { status: "ok", info: { database: ..., redis: ... } }` |
+| `GET /health/db` | Database only | PostgreSQL connectivity | `200 { status: "ok", info: { database: ... } }` |
+| `GET /health/redis` | Redis only | Redis connectivity | `200 { status: "ok", info: { redis: ... } }` |
+
+### Verify All Services Are Healthy
+
+```bash
+# Quick check — all containers
+docker compose -f docker-compose.prod.yml ps --format "table {{.Name}}\t{{.Status}}\t{{.Health}}"
+
+# API liveness
+curl -sf http://localhost:3001/health && echo "API OK" || echo "API FAIL"
+
+# API readiness (DB + Redis)
+curl -sf http://localhost:3001/health/ready | jq .
+
+# Individual dependency checks
+curl -sf http://localhost:3001/health/db | jq .
+curl -sf http://localhost:3001/health/redis | jq .
+
+# Typesense
+curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health
+
+# MinIO
+docker exec goodgo-minio mc ready local && echo "MinIO OK"
+
+# AI Services
+curl -sf http://localhost:8000/health && echo "AI OK" || echo "AI FAIL"
+
+# PostgreSQL (direct)
+docker exec goodgo-postgres pg_isready -U ${DB_USER} -d ${DB_NAME}
+
+# PgBouncer
+docker exec goodgo-pgbouncer pg_isready -h 127.0.0.1 -p 6432 -U ${DB_USER}
+
+# Redis
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" ping
+
+# Prometheus
+curl -sf http://localhost:9090/-/healthy && echo "Prometheus OK"
+
+# Loki
+curl -sf http://localhost:3100/ready && echo "Loki OK"
+
+# Grafana
+curl -sf http://localhost:3002/api/health | jq .
+```
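+
+The HTTP checks above can be wrapped in a small loop that gives a single pass/fail result during an incident. This is a minimal sketch, not something that ships in the repository: the function name `check_http_endpoints` and the `name=url` argument format are assumptions made for illustration, and the URLs in the usage comment are the ones listed above.
+
+```bash
+# Hypothetical helper: probe each "name=url" pair and print OK/FAIL per endpoint.
+# Returns non-zero if any endpoint fails, so it can gate a script.
+check_http_endpoints() {
+  local failed=0 pair name url
+  for pair in "$@"; do
+    name=${pair%%=*}   # text before the first "="
+    url=${pair#*=}     # text after the first "="
+    if curl -sf -o /dev/null --max-time 5 "$url"; then
+      echo "OK    ${name}"
+    else
+      echo "FAIL  ${name}"
+      failed=1
+    fi
+  done
+  return "$failed"
+}
+
+# Usage:
+#   check_http_endpoints \
+#     api=http://localhost:3001/health/ready \
+#     ai-services=http://localhost:8000/health \
+#     prometheus=http://localhost:9090/-/healthy \
+#     loki=http://localhost:3100/ready
+```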
+
+### Container Resource Usage
+
+```bash
+docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.NetIO}}"
+```
+
+---
+
+## 3. Common Incidents
+
+### 3.1 Database Connection Pool Exhaustion
+
+**Symptoms:**
+- API returns 503 or hangs on requests
+- `/health/ready` returns unhealthy for `database`
+- PgBouncer logs: `no more connections allowed` or `query_wait_timeout`
+- Prometheus: spike in `pg_stat_activity` active connections
+
+**Diagnosis:**
+
+```bash
+# Check PgBouncer pool status
+docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW POOLS;"
+docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW CLIENTS;"
+docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW STATS;"
+
+# Check PostgreSQL active connections
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
+  "SELECT state, count(*) FROM pg_stat_activity WHERE datname = '${DB_NAME}' GROUP BY state;"
+
+# Identify long-running queries
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
+  "SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
+   FROM pg_stat_activity
+   WHERE datname = '${DB_NAME}' AND state != 'idle'
+   ORDER BY duration DESC
+   LIMIT 10;"
+```
+
+**Resolution:**
+
+```bash
+# 1. Kill long-running queries (> 5 minutes)
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
+  "SELECT pg_terminate_backend(pid)
+   FROM pg_stat_activity
+   WHERE datname = '${DB_NAME}'
+     AND state != 'idle'
+     AND now() - query_start > interval '5 minutes'
+     AND pid <> pg_backend_pid();"
+
+# 2. If pool is fully exhausted, restart PgBouncer
+docker compose -f docker-compose.prod.yml restart pgbouncer
+
+# 3. If issue persists, increase pool size temporarily
+#    Edit PGBOUNCER_POOL_SIZE in .env, then:
+docker compose -f docker-compose.prod.yml up -d --no-deps pgbouncer
+```
+
+**PgBouncer Configuration Reference:**
+- Pool mode: `transaction` (connections returned to pool after each transaction)
+- Default pool size: 20 server connections per user/db pair
+- Max client connections: 200
+- Reserve pool: 5 extra connections (after 3s wait)
+- Query wait timeout: 120s (error if client waits this long)
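+
+Taken together, these settings mean that for a single user/database pair at most 20 + 5 = 25 server connections serve up to 200 clients, so as many as 175 clients can be queued at once. The `cl_waiting` column of `SHOW POOLS;` shows how many are queued right now. The sketch below sums that column; the function name `pgbouncer_waiting_clients` is hypothetical, and it assumes unaligned psql output (`-A -F '|'`) so the column can be located by its header.
+
+```bash
+# Hypothetical helper: sum cl_waiting across all pools.
+# Reads `SHOW POOLS;` output on stdin in unaligned, pipe-separated form.
+pgbouncer_waiting_clients() {
+  awk -F'|' '
+    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "cl_waiting") col = i; next }
+    col && NF >= col && $col ~ /^[0-9]+$/ { total += $col }
+    END { print total + 0 }
+  '
+}
+
+# Usage:
+#   docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer \
+#     -A -F '|' -c "SHOW POOLS;" | pgbouncer_waiting_clients
+```
+
+A value that stays above zero for more than a few seconds suggests the pool is the bottleneck rather than PostgreSQL itself.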
+
+### 3.2 Redis Connection Failure
+
+**Symptoms:**
+- `/health/redis` returns unhealthy
+- Increased API response times (cache misses hitting DB)
+- API logs show Redis connection errors
+
+**Diagnosis:**
+
+```bash
+# Check Redis container
+docker logs --tail=50 goodgo-redis
+
+# Test connectivity
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" ping
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO server
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO clients
+```
+
+**Resolution:**
+
+```bash
+# 1. Restart Redis (data persisted via AOF)
+docker compose -f docker-compose.prod.yml restart redis
+
+# 2. If OOM — check memory usage
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory | grep used_memory_human
+# Max memory is 512 MB (prod), eviction policy: allkeys-lru
+
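+# 3. If AOF is corrupted: repair it from a one-off container
+#    (docker exec cannot run inside a stopped container)
+docker compose -f docker-compose.prod.yml stop redis
+# Redis 7 keeps the AOF as several files under appendonlydir/; pass the manifest to the tool.
+# The path assumes the default appenddirname and appendfilename settings.
+docker compose -f docker-compose.prod.yml run --rm --no-deps redis \
+  redis-check-aof --fix /data/appendonlydir/appendonly.aof.manifest
+docker compose -f docker-compose.prod.yml start redis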
+```
+
+**Graceful Degradation:** The API is designed to continue operating when Redis is unavailable. Cache misses fall through to PostgreSQL. Performance will degrade but functionality is preserved. Redis is non-critical for core operations.
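+
+For the OOM check in the resolution steps, it can help to express memory use as a share of the configured `maxmemory` (512 MB in production). The following is a sketch under that assumption; `redis_memory_pct` is a hypothetical name, and it reads the `used_memory` and `maxmemory` fields that `INFO memory` reports.
+
+```bash
+# Hypothetical helper: print used_memory as a whole-number percentage of maxmemory.
+# Reads `INFO memory` output on stdin (redis-cli emits CRLF line endings).
+redis_memory_pct() {
+  tr -d '\r' | awk -F: '
+    $1 == "used_memory" { used = $2 }
+    $1 == "maxmemory"   { max = $2 }
+    END {
+      if (max > 0) printf "%d\n", used * 100 / max
+      else print "unlimited"   # maxmemory 0 means no limit is configured
+    }
+  '
+}
+
+# Usage:
+#   docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" INFO memory | redis_memory_pct
+```
+
+With `allkeys-lru`, a value near 100 is not an error by itself; it means Redis is evicting keys to stay under the limit.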
+
+### 3.3 Typesense Unavailable
+
+**Symptoms:**
+- Search functionality returns errors or falls back to basic DB search
+- `curl http://localhost:8108/health` fails
+- API logs show Typesense connection timeouts
+
+**Diagnosis:**
+
+```bash
+# Check container status
+docker logs --tail=50 goodgo-typesense
+
+# Check health
+curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health
+
+# Check collections
+curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/collections | jq .
+
+# Check disk space for Typesense data volume
+docker system df -v | grep typesense
+```
+
+**Resolution:**
+
+```bash
+# 1. Restart Typesense
+docker compose -f docker-compose.prod.yml restart typesense
+
+# 2. If data is corrupted — rebuild from PostgreSQL
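+docker compose -f docker-compose.prod.yml stop typesense
+# Remove the stopped container first; Docker refuses to delete a volume that a container still references
+docker compose -f docker-compose.prod.yml rm -f typesense
+docker volume rm goodgo-platform-ai_typesense_data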
+docker compose -f docker-compose.prod.yml up -d typesense
+# Wait for healthy, then reindex:
+docker compose -f docker-compose.prod.yml exec api npx ts-node scripts/typesense-reindex.ts
+# Or: pnpm run typesense:reindex
+
+# 3. If volume backup exists — restore
+docker compose -f docker-compose.prod.yml stop typesense
+docker run --rm -v goodgo-platform-ai_typesense_data:/data -v $(pwd)/backups:/backup \
+  alpine sh -c "rm -rf /data/* && tar xzf /backup/typesense_YYYYMMDD_HHMMSS.tar.gz -C /data"
+docker compose -f docker-compose.prod.yml start typesense
+```
+
+**Fallback Behavior:** When Typesense is unavailable, property search falls back to PostgreSQL full-text search with PostGIS geo queries. Search quality degrades but core functionality works.
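+
+The "wait for healthy, then reindex" step in the rebuild procedure can be scripted instead of watched by hand. This is a minimal sketch: `wait_for_url` is a hypothetical helper, and the attempt count and delay are arbitrary defaults rather than values taken from the project.
+
+```bash
+# Hypothetical helper: poll a URL until it answers with a 2xx status.
+# Arguments: url [attempts] [delay_seconds]
+wait_for_url() {
+  local url=$1 attempts=${2:-30} delay=${3:-2} i
+  for ((i = 1; i <= attempts; i++)); do
+    if curl -sf -o /dev/null "$url"; then
+      echo "healthy after ${i} attempt(s): ${url}"
+      return 0
+    fi
+    sleep "$delay"
+  done
+  echo "timed out waiting for ${url}" >&2
+  return 1
+}
+
+# Usage:
+#   wait_for_url http://localhost:8108/health && \
+#     docker compose -f docker-compose.prod.yml exec api npx ts-node scripts/typesense-reindex.ts
+```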
+
+### 3.4 High API Latency
+
+**Symptoms:**
+- Prometheus alert `ApiLatencyP99High` fires (p99 > 1s for 5 min)
+- Critical alert `ApiLatencyP99Critical` fires (p99 > 3s for 3 min — SLO breach)
+- Users report slow page loads
+
+**Diagnosis:**
+
+```bash
+# 1. Check which endpoints are slow
+# Grafana: GoodGo API Latency dashboard
+# Or via PromQL:
+curl -s "http://localhost:9090/api/v1/query" --data-urlencode \
+  'query=topk(10, histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket[5m])) by (le, route, method)))' \
+  | jq '.data.result[] | {route: .metric.route, method: .metric.method, p99: .value[1]}'
+
+# 2. Check database slow queries
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
+  "SELECT pid, now() - query_start AS duration, left(query, 100) AS query_preview
+   FROM pg_stat_activity
+   WHERE state = 'active' AND now() - query_start > interval '1 second'
+   ORDER BY duration DESC;"
+
+# 3. Check PgBouncer wait times
+docker exec goodgo-pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_admin pgbouncer -c "SHOW POOLS;"
+
+# 4. Check container resource usage
+docker stats --no-stream goodgo-api goodgo-postgres goodgo-redis goodgo-pgbouncer
+
+# 5. Check Redis latency (samples continuously; press Ctrl-C to stop)
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" --latency-history -i 3
+
+# 6. Check application logs for errors
+docker logs --tail=200 --since=5m goodgo-api 2>&1 | grep -i "error\|timeout\|slow"
+```
+
+**Resolution:**
+
+```bash
+# 1. If DB slow queries — terminate them
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
+  "SELECT pg_terminate_backend(pid)
+   FROM pg_stat_activity
+   WHERE datname = '${DB_NAME}'
+     AND state = 'active'
+     AND now() - query_start > interval '30 seconds'
+     AND pid <> pg_backend_pid();"
+
+# 2. If connection pool exhaustion — see Section 3.1
+
+# 3. If Redis is slow — restart
+docker compose -f docker-compose.prod.yml restart redis
+
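+# 4. If the API container was OOM-killed, restart it. A restart keeps the current 1 GB limit;
+#    to raise it, edit the api memory limit in docker-compose.prod.yml and recreate with `up -d --no-deps api`
+docker compose -f docker-compose.prod.yml restart api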
+
+# 5. If specific endpoint is the bottleneck — check Loki logs:
+# Grafana > Explore > Loki > {container_name="goodgo-api"} |= "slow"
+```
+
+### 3.5 Payment Callback Failures
+
+**Symptoms:**
+- Users report payments stuck in "pending" state
+- VNPay/MoMo/ZaloPay IPN callbacks returning errors
+- Payment reconciliation mismatches
+
+**Diagnosis:**
+
+```bash
+# 1. Check payment callback logs
+docker logs goodgo-api 2>&1 | grep -i "payment\|callback\|vnpay\|momo\|zalopay" | tail -50
+
+# 2. Check for pending payments in DB
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
+  "SELECT id, provider, status, \"amountVND\", \"createdAt\"
+   FROM \"Payment\"
+   WHERE status = 'PENDING'
+   AND \"createdAt\" > now() - interval '24 hours'
+   ORDER BY \"createdAt\" DESC
+   LIMIT 20;"
+
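+# 3. Verify callback URL is reachable from external networks (run this from outside the host).
+#    Any HTTP status, even a 4xx for a request without gateway parameters, proves reachability; 000 means no connection
+curl -s -o /dev/null -w "%{http_code}\n" https://your-domain.com/api/payments/vnpay/callback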
+
+# 4. Check if API is receiving callbacks (via Loki)
+# Grafana > Explore > Loki > {container_name="goodgo-api"} |= "callback" |= "payment"
+```
+
+**Resolution:**
+
+```bash
+# 1. If callbacks are timing out — check API health and restart if needed
+docker compose -f docker-compose.prod.yml restart api
+
+# 2. If VNPay signature verification fails — verify VNPAY_* env vars
+docker compose -f docker-compose.prod.yml exec api printenv | grep VNPAY
+
+# 3. For stuck payments — manual reconciliation
+#    Check VNPay/MoMo merchant portal for actual transaction status
+#    Update payment status in DB if confirmed paid:
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c \
+  "UPDATE \"Payment\" SET status = 'COMPLETED', \"updatedAt\" = now()
+   WHERE id = '<payment-id>' AND status = 'PENDING';"
+
+# 4. If callbacks are not reaching the server — check:
+#    - Firewall rules (port 3001 or reverse proxy port must be open)
+#    - SSL certificate validity
+#    - DNS resolution
+#    - Payment provider webhook configuration (correct callback URL)
+```
+
+**Important:** The payment callback handler uses idempotent processing with atomic state transitions. Replaying a callback is safe and will not duplicate payments.
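+
+The manual `UPDATE` in the resolution steps relies on the same guard (`AND status = 'PENDING'`), so it changes at most one row and is a no-op if a callback has already completed the payment. psql prints a command tag such as `UPDATE 1` that makes this checkable. The sketch below is an illustration only; `confirm_single_update` is a hypothetical helper.
+
+```bash
+# Hypothetical helper: read psql output on stdin and succeed only if exactly one row was updated.
+confirm_single_update() {
+  local tag
+  tag=$(grep -E '^UPDATE [0-9]+$' | tail -n 1)
+  if [ "$tag" = "UPDATE 1" ]; then
+    echo "payment updated"
+  else
+    # UPDATE 0 usually means the payment was no longer PENDING or the id was wrong
+    echo "unexpected result: ${tag:-no UPDATE tag}" >&2
+    return 1
+  fi
+}
+
+# Usage:
+#   docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c "UPDATE ... ;" | confirm_single_update
+```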
+
+### 3.6 Disk Space Alerts
+
+**Symptoms:**
+- Containers failing to start or crashing
+- PostgreSQL refusing writes (`PANIC: could not write to file`)
+- Docker daemon running out of space
+
+**Diagnosis:**
+
+```bash
+# Host disk usage
+df -h
+
+# Docker disk usage
+docker system df
+docker system df -v
+
+# Check individual volume sizes
+for vol in $(docker volume ls -q | grep goodgo); do
+  echo -n "$vol: "
+  docker run --rm -v "${vol}:/data" alpine du -sh /data 2>/dev/null
+done
+
+# Check backup volume specifically
+docker exec goodgo-pg-backup du -sh /backups/
+docker exec goodgo-pg-backup ls -lht /backups/
+```
+
+**Resolution:**
+
+```bash
+# 1. Clean up Docker artifacts
+docker system prune -f          # Remove stopped containers, unused networks, dangling images
+docker image prune -a -f        # Remove ALL unused images, including the cached previous release that the quick rollback in Section 4.4 relies on
+
+# 2. Clean old backups (if retention not working)
+docker exec goodgo-pg-backup find /backups -name "goodgo_*.sql.gz" -mtime +7 -delete
+
+# 3. Clean Prometheus data (if too large)
+# Prometheus retention is 30d (prod) / 15d (dev) — configured via --storage.tsdb.retention.time
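+# Expired blocks are removed automatically; stopping Prometheus does not reclaim extra space,
+# and the /-/quit endpoint only works when --web.enable-lifecycle is set.
+# To shrink the data sooner, lower --storage.tsdb.retention.time (or set --storage.tsdb.retention.size)
+# in docker-compose.prod.yml, then recreate the container:
+docker compose -f docker-compose.prod.yml up -d --no-deps prometheus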
+
+# 4. Clean Loki data (15-day retention)
+# Loki handles its own cleanup via compactor. If urgent:
+docker compose -f docker-compose.prod.yml restart loki
+
+# 5. Truncate Docker container logs
+sudo truncate -s 0 $(docker inspect --format='{{.LogPath}}' goodgo-api)
+# Or for all containers:
+sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'
+```
+
+**Prevention:** All production containers use `json-file` logging with `max-size: 10m` and `max-file: 3-5`. Backup retention is 7 days (configurable via `BACKUP_RETENTION_DAYS`).
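+
+To catch a filling disk before containers start failing, the `df` check from the diagnosis steps can be reduced to the filesystems above a threshold. This is a sketch only: `filesystems_over` is a hypothetical name, the 85% default is an arbitrary choice, and mount points containing spaces are not handled.
+
+```bash
+# Hypothetical helper: list mount points at or above a usage threshold.
+# Reads `df -P` output on stdin; argument: threshold percent (default 85).
+filesystems_over() {
+  local limit=${1:-85}
+  awk -v limit="$limit" '
+    NR > 1 {
+      use = $5
+      sub(/%/, "", use)                 # "90%" -> "90"
+      if (use + 0 >= limit) print $6 " " $5
+    }
+  '
+}
+
+# Usage:
+#   df -P | filesystems_over 85
+```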
+
+### 3.7 MinIO / Object Storage Failure
+
+**Symptoms:**
+- Image/file uploads fail
+- Property photos not loading
+- MinIO console inaccessible at port 9001
+
+**Diagnosis:**
+
+```bash
+docker logs --tail=50 goodgo-minio
+docker exec goodgo-minio mc ready local
+docker exec goodgo-minio mc admin info local
+```
+
+**Resolution:**
+
+```bash
+# 1. Restart MinIO
+docker compose -f docker-compose.prod.yml restart minio
+
+# 2. If data volume corrupted
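+docker compose -f docker-compose.prod.yml stop minio
+# Remove the stopped container first; Docker refuses to delete a volume that a container still references
+docker compose -f docker-compose.prod.yml rm -f minio
+docker volume rm goodgo-platform-ai_minio_data  # WARNING: data loss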
+docker compose -f docker-compose.prod.yml up -d minio
+# Recreate buckets via API or admin console
+```
+
+### 3.8 AI Services Unavailable
+
+**Symptoms:**
+- AI-powered features (AVM, property descriptions) fail
+- `GET /health` on port 8000 fails
+- API logs show AI service connection timeouts
+
+**Diagnosis:**
+
+```bash
+docker logs --tail=50 goodgo-ai-services
+curl -sf http://localhost:8000/health
+docker stats --no-stream goodgo-ai-services
+```
+
+**Resolution:**
+
+```bash
+# 1. Restart AI services
+docker compose -f docker-compose.prod.yml restart ai-services
+
+# 2. Check rate limits (default: 60/minute)
+docker compose -f docker-compose.prod.yml exec ai-services printenv | grep AI_RATE_LIMIT
+
+# 3. If OOM — the service has 1 GB limit; may need to increase for large models
+```
+
+**Graceful Degradation:** AI features are optional. The API should handle AI service unavailability gracefully and return non-AI results.
+
+### 3.9 Log Pipeline Failure (Loki/Promtail)
+
+**Symptoms:**
+- Grafana log explorer returns empty results
+- Promtail container exited or crash-looping (it has no health check, so it never reports `unhealthy`)
+- Loki returning 503
+
+**Diagnosis:**
+
+```bash
+docker logs --tail=50 goodgo-loki
+docker logs --tail=50 goodgo-promtail
+curl -sf http://localhost:3100/ready && echo "Loki ready" || echo "Loki NOT ready"
+```
+
+**Resolution:**
+
+```bash
+# 1. Restart the pipeline
+docker compose -f docker-compose.prod.yml restart loki promtail
+
+# 2. If Loki data corrupted
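+docker compose -f docker-compose.prod.yml stop loki promtail
+# Remove the stopped Loki container first; Docker refuses to delete a volume that a container still references
+docker compose -f docker-compose.prod.yml rm -f loki
+docker volume rm goodgo-platform-ai_loki_data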
+docker compose -f docker-compose.prod.yml up -d loki promtail
+# Historical logs are lost but new logs will flow immediately
+
+# 3. If Promtail can't access Docker socket
+ls -la /var/run/docker.sock
+# Ensure the promtail container has the Docker socket mounted
+```
+
+### 3.10 5xx Error Rate Spike
+
+**Symptoms:**
+- Prometheus alert `ApiErrorRate5xxHigh` fires (> 1% 5xx for 5 min)
+- Users reporting errors
+
+**Diagnosis:**
+
+```bash
+# Check which endpoints are returning 5xx
+curl -s "http://localhost:9090/api/v1/query" --data-urlencode \
+  'query=topk(10, sum(rate(http_requests_total{job="goodgo-api", status_code=~"5.."}[5m])) by (route, method))' \
+  | jq '.data.result'
+
+# Check API error logs
+docker logs --tail=200 --since=5m goodgo-api 2>&1 | grep -i "error\|exception\|500"
+
+# Check all dependency health
+curl -sf http://localhost:3001/health/ready | jq .
+```
+
+**Resolution:**
+1. If DB-related: see [Section 3.1](#31-database-connection-pool-exhaustion)
+2. If Redis-related: see [Section 3.2](#32-redis-connection-failure)
+3. If recent deployment: see [Section 4.4](#44-rollback-deployment)
+4. If unknown: restart API and investigate logs
+
+---
+
+## 4. Recovery Procedures
+
+### 4.1 Database Restore from Backup
+
+**Automated backups run daily at 02:00 UTC** via the `pg-backup` container. Retention: 7 days. Format: `pg_dump --format=custom --compress=6`.
+
+**Automated verification runs daily at 04:00 UTC** — restores to an isolated test database, verifies table existence, row counts, checksums, PostGIS extension, indexes, and enums. Reports are written to `/backups/verify-latest.json`.
+
+#### List Available Backups
+
+```bash
+docker exec goodgo-pg-backup ls -lht /backups/goodgo_*.sql.gz
+```
+
+#### Create an On-Demand Backup
+
+```bash
+docker exec goodgo-pg-backup /scripts/pg-backup.sh
+```
+
+#### Full Restore Procedure
+
+```bash
+# 1. Stop application services
+docker compose -f docker-compose.prod.yml stop api web ai-services
+
+# 2. (Production) Stop PgBouncer to prevent stale connections
+docker compose -f docker-compose.prod.yml stop pgbouncer
+
+# 3. Run the restore script
+docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
+# The script will:
+#   - Terminate active DB connections
+#   - DROP and recreate the database
+#   - Restore from the backup file
+
+# 4. Verify the restore
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c '\dt'
+docker exec goodgo-postgres psql -U ${DB_USER} -d ${DB_NAME} -c 'SELECT count(*) FROM "User";'
+
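+# 5. Apply any pending migrations. The api container is stopped at this point, so use a one-off container
+#    (this assumes Prisma connects through DATABASE_URL_DIRECT, since PgBouncer is also stopped)
+docker compose -f docker-compose.prod.yml run --rm --no-deps api npx prisma migrate deploy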
+
+# 6. Restart all services
+docker compose -f docker-compose.prod.yml up -d
+
+# 7. Verify application health
+curl -sf http://localhost:3001/health/ready | jq .
+```
+
+#### Verify a Backup Without Restoring
+
+```bash
+# Run verification against latest backup (creates temp DB, drops it after)
+docker compose run --rm pg-verify-backup
+
+# Or verify a specific backup file
+docker exec goodgo-pg-backup /scripts/pg-verify-backup.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
+
+# Check latest verification report
+docker exec goodgo-pg-backup cat /backups/verify-latest.json | jq .
+```
+
+**RPO/RTO:**
+- RPO: ≤ 24 hours (daily backups; consider WAL archiving for lower RPO)
+- RTO: ~15 minutes (local volume), ~30 minutes (off-site)
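+
+The 24-hour RPO only holds if the daily backup actually ran. As a rough check, the newest `goodgo_*.sql.gz` file should be under about 26 hours old (24 hours plus some slack for the job's runtime). The sketch below assumes it runs somewhere the backup directory is visible, for example inside the `pg-backup` container; `backup_is_fresh` is a hypothetical helper and the 26-hour default is an assumption.
+
+```bash
+# Hypothetical helper: succeed if at least one backup file in the directory
+# was modified within the last max_hours hours.
+# Arguments: backup_dir [max_hours]
+backup_is_fresh() {
+  local dir=$1 max_hours=${2:-26} recent
+  recent=$(find "$dir" -maxdepth 1 -name 'goodgo_*.sql.gz' -mmin "-$((max_hours * 60))")
+  [ -n "$recent" ]
+}
+
+# Usage:
+#   backup_is_fresh /backups 26 || echo "No backup in the last 26 hours: RPO at risk"
+```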
+
+### 4.2 Redis Cache Flush & Warm-up
+
+```bash
+# Flush all Redis data
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" FLUSHALL
+
+# Verify flush
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" DBSIZE
+# Should return: (integer) 0
+```
+
+**Warm-up:** Redis uses `allkeys-lru` eviction. Cache warms naturally as users make requests. No manual warm-up script is needed — cache misses fall through to PostgreSQL.
+
+**When to flush:**
+- After database restore (stale cache references)
+- After data corruption at the application level
+- After schema changes that alter cached data structures
+
+### 4.3 Rolling Restart Procedures
+
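+#### Single Service Restart (Brief Downtime)
+
+```bash
+# API: --force-recreate restarts the container even when image and config are unchanged
+# (plain `up -d` is a no-op in that case); --wait blocks until the health check passes.
+# With a single replica, requests fail briefly while the container is replaced.
+docker compose -f docker-compose.prod.yml up -d --no-deps --force-recreate --wait api
+
+# Web
+docker compose -f docker-compose.prod.yml up -d --no-deps --force-recreate --wait web
+
+# AI Services
+docker compose -f docker-compose.prod.yml up -d --no-deps --force-recreate --wait ai-services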
+```
+
+#### Full Stack Rolling Restart
+
+```bash
+# Data services first (order matters for dependency chain)
+docker compose -f docker-compose.prod.yml restart redis
+docker compose -f docker-compose.prod.yml restart typesense
+
+# Wait for data services to be healthy
+sleep 10
+
+# Connection pooling
+docker compose -f docker-compose.prod.yml restart pgbouncer
+sleep 5
+
+# Application services
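+# --force-recreate is needed for an actual restart; plain `up -d` skips containers whose image and config are unchanged
+docker compose -f docker-compose.prod.yml up -d --no-deps --force-recreate --wait api
+docker compose -f docker-compose.prod.yml up -d --no-deps --force-recreate --wait web
+docker compose -f docker-compose.prod.yml up -d --no-deps --force-recreate --wait ai-services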
+
+# Verify
+curl -sf http://localhost:3001/health/ready | jq .
+```
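+
+The fixed `sleep` calls in the sequence above are a guess at how long each service needs. Since the data services define health checks (see Section 1), a script can wait on the reported health status instead. This is a minimal sketch; `wait_healthy` is a hypothetical helper, and the 2-second poll interval and 60-second default timeout are assumptions.
+
+```bash
+# Hypothetical helper: block until a container reports "healthy" or the timeout expires.
+# Arguments: container_name [timeout_seconds]
+wait_healthy() {
+  local name=$1 timeout=${2:-60} waited=0 status
+  while [ "$waited" -lt "$timeout" ]; do
+    status=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null)
+    if [ "$status" = "healthy" ]; then
+      echo "${name} is healthy"
+      return 0
+    fi
+    sleep 2
+    waited=$((waited + 2))
+  done
+  echo "${name} not healthy after ${timeout}s (last status: ${status:-unknown})" >&2
+  return 1
+}
+
+# Usage:
+#   docker compose -f docker-compose.prod.yml restart redis && wait_healthy goodgo-redis
+#   docker compose -f docker-compose.prod.yml restart pgbouncer && wait_healthy goodgo-pgbouncer
+```
+
+This only works for containers that define a health check; for `promtail` and `pg-backup` the status field is absent.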
+
+#### Emergency: Restart Everything
+
+```bash
+docker compose -f docker-compose.prod.yml down
+docker compose -f docker-compose.prod.yml up -d --wait
+```
+
+### 4.4 Rollback Deployment
+
+The CI/CD pipeline (`.github/workflows/deploy.yml`) supports automatic rollback if production smoke tests fail. For manual rollback:
+
+#### Quick Rollback (Revert to Previous Images)
+
+```bash
+# SSH into production host
+ssh deploy@$PRODUCTION_HOST
+
+cd ~/goodgo
+
+# Stop current app containers
+docker compose -f docker-compose.prod.yml down api web ai-services
+
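+# Restart without pulling. This reverts only if IMAGE_TAG still resolves to the previous
+# release in the local image cache; if the failed deploy reused the same tag (for example
+# `latest`), use the tag-based rollback below instead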
+docker compose -f docker-compose.prod.yml up -d --wait api web ai-services
+
+# Verify
+curl -sf http://localhost:3001/health && echo "Rollback successful"
+```
+
+#### Rollback to a Specific Git Commit / Image Tag
+
+```bash
+# Set the target tag (git SHA)
+export IMAGE_TAG=<previous-commit-sha>
+export REGISTRY_URL=ghcr.io/goodgo
+
+# Pull specific version
+docker compose -f docker-compose.prod.yml pull api web ai-services
+
+# Deploy
+docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
+docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
+docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services
+
+# Verify
+curl -sf http://localhost:3001/health/ready | jq .
+```
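+
+Because `IMAGE_TAG` defaults to `latest` when unset or mistyped, it may be worth validating the value before pulling. The sketch below assumes image tags are git SHAs in short or full lowercase hex form, as the comment above suggests; `is_git_sha` is a hypothetical helper.
+
+```bash
+# Hypothetical helper: succeed if the argument looks like a git SHA (7 to 40 lowercase hex characters).
+is_git_sha() {
+  [[ $1 =~ ^[0-9a-f]{7,40}$ ]]
+}
+
+# Usage:
+#   is_git_sha "$IMAGE_TAG" || { echo "IMAGE_TAG is not a commit SHA: '${IMAGE_TAG}'" >&2; exit 1; }
+```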
+
+#### Rollback Database Migrations
+
+```bash
+# WARNING: Prisma does not support automatic down-migrations.
+# For migration rollback, restore from the pre-migration backup:
+
+# 1. Stop application
+docker compose -f docker-compose.prod.yml stop api web ai-services pgbouncer
+
+# 2. Restore from backup taken before the migration
+docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/<pre-migration-backup>.sql.gz
+
+# 3. Deploy the previous code version (older IMAGE_TAG)
+export IMAGE_TAG=<previous-commit-sha>
+docker compose -f docker-compose.prod.yml up -d --wait
+```
+
+### 4.5 Typesense Reindex from PostgreSQL
+
+If Typesense data is lost or corrupted, rebuild the search index from PostgreSQL:
+
+```bash
+# 1. Ensure Typesense is running and healthy
+curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/health
+
+# 2. Run reindex
+docker compose -f docker-compose.prod.yml exec api npx ts-node scripts/typesense-reindex.ts
+# Or from host:
+pnpm run typesense:reindex
+
+# 3. Verify collections
+curl -sf -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" http://localhost:8108/collections | jq '.[].name'
+```
+
+### 4.6 Full Host Recovery
+
+For complete host failure or migration to a new server:
+
+```bash
+# 1. Provision new host with Docker + Docker Compose
+# Requirements: Docker >= 24, Docker Compose v2, 8 GB RAM minimum
+
+# 2. Clone repository and configure
+git clone <repo-url> ~/goodgo && cd ~/goodgo
+cp .env.example .env
+# Edit .env with production secrets (from secrets manager)
+
+# 3. Restore PostgreSQL backup from off-site storage
+# Transfer backup file to the new host
+scp backups/goodgo_latest.sql.gz deploy@newhost:~/goodgo/backups/
+
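+# 4. Start infrastructure services (pg-backup is needed for the restore in step 5)
+docker compose -f docker-compose.prod.yml up -d postgres redis typesense minio pg-backup
+
+# 5. Wait for PostgreSQL to be ready, then restore
+docker compose -f docker-compose.prod.yml exec postgres pg_isready
+# If /backups is the pg_backups named volume (see Section 8), copy the transferred file into it first
+docker cp ~/goodgo/backups/goodgo_latest.sql.gz goodgo-pg-backup:/backups/
+docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_latest.sql.gz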
+
+# 6. Start application services
+docker compose -f docker-compose.prod.yml up -d
+
+# 7. Run migrations (if backup predates latest code)
+docker compose -f docker-compose.prod.yml exec api npx prisma migrate deploy
+
+# 8. Rebuild Typesense index
+pnpm run typesense:reindex
+
+# 9. Flush Redis (stale cache from old host)
+docker exec goodgo-redis redis-cli -a "${REDIS_PASSWORD}" FLUSHALL
+
+# 10. Verify everything
+curl -sf http://localhost:3001/health/ready | jq .
+curl -sf http://localhost:3000 > /dev/null && echo "Web OK"
+
+# Expected RTO: ~60 minutes (depends on backup transfer speed)
+```
+
+---
+
+## 5. Escalation Matrix
+
+| Severity | Condition | First Responder | Escalation | SLA |
+|----------|-----------|-----------------|------------|-----|
+| **P0 — Critical** | Full outage, data loss, payment corruption | On-call SRE | CTO + CEO within 15 min | Acknowledge: 5 min, Resolve: 1 hour |
+| **P1 — High** | Partial outage, SLO breach (p99 > 3s), 5xx > 5% | On-call SRE | Engineering lead within 30 min | Acknowledge: 15 min, Resolve: 4 hours |
+| **P2 — Medium** | Degraded performance, single service down (non-critical), p99 > 1s | On-call SRE | Team lead next business day | Acknowledge: 1 hour, Resolve: 24 hours |
+| **P3 — Low** | Cosmetic issues, monitoring gaps, non-urgent improvements | Assigned engineer | Sprint planning | Next sprint |
+
+### Contact Channels
+
+| Role | Channel |
+|------|---------|
+| On-call SRE | Slack `#sre-oncall` + PagerDuty |
+| Engineering Lead | Slack `#engineering` |
+| CTO | Slack DM / Phone (see PagerDuty) |
+| Payment Issues | Slack `#payments` + VNPay/MoMo support portals |
+| Infrastructure | Slack `#infrastructure` |
+
+### Slack Notifications
+
+The deploy pipeline automatically notifies `#deployments` (via `SLACK_WEBHOOK_URL`) on:
+- Production deploy success
+- Staging smoke test failure
+- Production rollback triggered
+
+---
+
+## 6. Monitoring Dashboards
+
+All dashboards are provisioned automatically via `monitoring/grafana/provisioning/` and are available in the **GoodGo** folder in Grafana.
+
+| Dashboard | Grafana Path | Purpose |
+|-----------|--------------|---------|
+| **API Overview** | `api-overview` | Request rates, status codes, active connections |
+| **API Latency** | `api-latency` | p50/p95/p99 latency by endpoint, latency heatmaps |
+| **Database** | `database` | PostgreSQL connections, query performance, PgBouncer stats |
+| **Search** | `search` | Typesense query rates, latency, index sizes |
+| **Business Metrics** | `business-metrics` | Listings, inquiries, payments, user registrations |
+| **Web Vitals** | `web-vitals` | Core Web Vitals (LCP, FID, CLS), page load times |
+| **Logs** | `logs` | Loki log explorer with filters by service, level, correlation ID |
+
+**Access:** `http://localhost:3002` (default credentials in `.env`: `GRAFANA_ADMIN_USER` / `GRAFANA_ADMIN_PASSWORD`)
+
+**Data Sources:**
+- **Prometheus** (`http://prometheus:9090`) — Metrics (default)
+- **Loki** (`http://loki:3100`) — Logs, with correlation ID linking to Prometheus
+
+---
+
+## 7. Useful PromQL Queries
+
+### API Performance
+
+```promql
+# Overall p99 latency
+histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket{job="goodgo-api"}[5m])) by (le))
+
+# Per-endpoint p99 latency (top 10 slowest)
+topk(10, histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket{job="goodgo-api"}[5m])) by (le, route, method)))
+
+# Request rate by status code
+sum(rate(http_requests_total{job="goodgo-api"}[5m])) by (status_code)
+
+# 5xx error percentage
+(sum(rate(http_requests_total{job="goodgo-api", status_code=~"5.."}[5m])) / sum(rate(http_requests_total{job="goodgo-api"}[5m]))) * 100
+```
+
+### Database
+
+```promql
+# Active connections
+pg_stat_activity_count{datname="goodgo", state="active"}
+
+# Connection pool utilization (if PgBouncer metrics are scraped)
+# Manual check via: SHOW POOLS in PgBouncer admin console
+```
+
+### Infrastructure
+
+```promql
+# Container memory usage
+container_memory_usage_bytes{name=~"goodgo-.*"}
+
+# Container CPU usage
+rate(container_cpu_usage_seconds_total{name=~"goodgo-.*"}[5m])
+```
+
+---
+
+## 8. Environment Quick Reference
+
+### Key Environment Variables
+
+| Variable | Required | Description |
+|----------|----------|-------------|
+| `DATABASE_URL` | Yes | PostgreSQL via PgBouncer (`postgresql://user:pass@pgbouncer:6432/db`) |
+| `DATABASE_URL_DIRECT` | Yes (prod) | Direct PostgreSQL for migrations (`postgresql://user:pass@postgres:5432/db`) |
+| `JWT_SECRET` | Yes | JWT signing secret |
+| `JWT_REFRESH_SECRET` | Yes | Refresh token signing secret |
+| `REDIS_URL` | Yes | Redis connection (`redis://:password@redis:6379`) |
+| `REDIS_PASSWORD` | Yes (prod) | Redis auth password |
+| `TYPESENSE_API_KEY` | Yes | Typesense admin API key |
+| `MINIO_ACCESS_KEY` | Yes | MinIO root user |
+| `MINIO_SECRET_KEY` | Yes | MinIO root password |
+| `VNPAY_*` | Yes | VNPay payment gateway configuration |
+| `AI_API_KEY` | Yes | AI services authentication |
+| `GRAFANA_ADMIN_USER` | Yes (prod) | Grafana admin username |
+| `GRAFANA_ADMIN_PASSWORD` | Yes (prod) | Grafana admin password |
+| `PGBOUNCER_POOL_SIZE` | No | PgBouncer pool size (default: 20) |
+| `PGBOUNCER_MAX_CLIENT_CONN` | No | Max PgBouncer client connections (default: 200) |
+| `BACKUP_RETENTION_DAYS` | No | Backup retention period (default: 7) |
+| `IMAGE_TAG` | No (prod) | Container image tag (default: `latest`) |
+
+### Port Map
+
+| Port | Service | Exposed |
+|------|---------|---------|
+| 3000 | Web (Next.js) | External |
+| 3001 | API (NestJS) | External |
+| 3002 | Grafana | External (admin only) |
+| 5432 | PostgreSQL | Internal |
+| 6432 | PgBouncer | Internal |
+| 6379 | Redis | Internal |
+| 8000 | AI Services | Internal |
+| 8108 | Typesense | Internal |
+| 9000 | MinIO API | Internal |
+| 9001 | MinIO Console | Internal |
+| 9090 | Prometheus | Internal |
+| 3100 | Loki | Internal |
+
+### Docker Volumes
+
+| Volume | Service | Purpose |
+|--------|---------|---------|
+| `pgdata` | PostgreSQL | Database files |
+| `redis_data` | Redis | AOF persistence |
+| `typesense_data` | Typesense | Search index data |
+| `minio_data` | MinIO | Object storage (images, files) |
+| `pg_backups` | pg-backup | Database backup files |
+| `loki_data` | Loki | Log storage (15-day retention) |
+| `prometheus_data` | Prometheus | Metrics (30-day retention prod / 15-day dev) |
+| `grafana_data` | Grafana | Dashboard state, user preferences |
+
+---
+
+## Appendix: Alert Rules Reference
+
+| Alert | Expression | Severity | Duration |
+|-------|-----------|----------|----------|
+| `ApiLatencyP99High` | p99 > 1s | Warning | 5 min |
+| `ApiEndpointLatencyP99High` | Per-route p99 > 2s | Warning | 5 min |
+| `ApiLatencyP99Critical` | p99 > 3s (SLO breach) | Critical | 3 min |
+| `ApiErrorRate5xxHigh` | 5xx rate > 1% | Warning | 5 min |
+
+Alert rules are defined in `monitoring/prometheus/alert-rules.yml` and evaluated every 15 seconds.