# GoodGo Platform: Operational Infrastructure Runbook

**Last Updated:** April 11, 2026
**Version:** 1.0
**Purpose:** Complete infrastructure reference for ops teams, SREs, and on-call engineers

---

## Table of Contents

1. [Executive Summary](#executive-summary)
2. [Services Architecture](#services-architecture)
3. [Docker Compose Specifications](#docker-compose-specifications)
4. [Database Layer](#database-layer)
5. [Caching & Search](#caching--search)
6. [Monitoring & Observability](#monitoring--observability)
7. [Payment Integration](#payment-integration)
8. [Health Checks](#health-checks)
9. [Environment Variables](#environment-variables)
10. [Backup & Recovery](#backup--recovery)
11. [Deployment Pipeline](#deployment-pipeline)
12. [Troubleshooting Guide](#troubleshooting-guide)

---

## Executive Summary

**GoodGo Platform** is a monorepo real estate marketplace built with:

- **Frontend:** Next.js (TypeScript)
- **Backend API:** NestJS (TypeScript)
- **AI Services:** Python/FastAPI
- **Database:** PostgreSQL 16 + PostGIS
- **Cache:** Redis 7
- **Search:** Typesense 27.1
- **Object Storage:** MinIO (S3-compatible)
- **Monitoring:** Prometheus + Grafana + Loki + Promtail
- **Message Queue:** Built-in CQRS/Event Bus (NestJS)

**Total Services in Production:** 12+ (detailed below)

---

## Services Architecture

### Service Inventory

| Service | Image | Port | Purpose | Health Check (retries x interval) | Dependencies |
|---------|-------|------|---------|-----------------------------------|--------------|
| **api** | `goodgo-api:latest` | 3001 | NestJS REST API | `GET /health` (3x30s) | postgres, redis, typesense, pgbouncer |
| **web** | `goodgo-web:latest` | 3000 | Next.js frontend | `GET /` (3x30s) | api |
| **ai-services** | `goodgo-ai-services:latest` | 8000 | Python FastAPI (price estimation, NLP) | `GET /health` (3x30s) | n/a |
| **postgres** | `postgis/postgis:16-3.4` | 5432 | Primary database | `pg_isready` (5x10s) | n/a |
| **pgbouncer** | `edoburu/pgbouncer:1.23.1-p2` | 6432 | Connection pooling (transaction mode) | `pg_isready` (5x10s) | postgres |
| **redis** | `redis:7-alpine` | 6379 | Cache + session store | `PING` (5x10s) | n/a |
| **typesense** | `typesense/typesense:27.1` | 8108 | Full-text search index | `GET /health` (5x10s) | n/a |
| **minio** | `minio/minio:latest` | 9000/9001 | Object storage + console | `mc ready local` (5x10s) | n/a |
| **loki** | `grafana/loki:3.0.0` | 3100 | Log aggregation | `GET /ready` (5x15s) | n/a |
| **promtail** | `grafana/promtail:3.0.0` | 9080 | Log shipper | (depends on loki healthy) | loki |
| **prometheus** | `prom/prometheus:v2.51.0` | 9090 | Metrics scraper | `GET /-/healthy` (3x15s) | n/a |
| **grafana** | `grafana/grafana:10.4.1` | 3002 | Dashboards + alerting | `GET /api/health` (3x15s) | prometheus, loki |
| **pg-backup** | `postgis/postgis:16-3.4` | n/a | Automated backup cron | depends_on postgres | postgres |

### Network & Volumes

- **Network:** Docker bridge network `goodgo-net`
- **Volumes:**
  - `pgdata`: PostgreSQL data files
  - `redis_data`: Redis append-only file (AOF)
  - `typesense_data`: Search index
  - `minio_data`: Object storage
  - `pg_backups`: Daily database backups (retention: 7 days)
  - `loki_data`: Log chunks (retention: 15 days)
  - `prometheus_data`: Metrics TSDB (retention: 30 days in prod, 15 days in dev)
  - `grafana_data`: Dashboards, datasource configs

---

## Docker Compose Specifications

### Development Environment (`docker-compose.yml`)

**12 Services (minimal dependencies, no resource limits)**

```yaml
# Summary of each service's configuration (not a literal Compose file)
services:
  postgres: PostGIS 16, port 5432, healthcheck: pg_isready (30s start-period)
  redis: Alpine 7, port 6379, maxmemory: 256mb LRU, AOF enabled
  typesense: v27.1, port 8108, CORS enabled, healthcheck /health
  minio: latest, ports 9000 (API) / 9001 (console)
  ai-services: Custom Python build, port 8000
  pg-backup: Automated daily dumps at 02:00 UTC, cron retention cleanup
  pg-verify-backup: On-demand backup restore verification (profile: tools)
  loki: v3.0.0, port 3100, 15-day retention, 2h compaction delay
  promtail: v3.0.0, Docker socket instrumentation, Pino JSON parsing
  prometheus: v2.51.0, port 9090, 15-day retention, lifecycle API enabled
  grafana: v10.4.1, port 3002, datasources pre-provisioned
```

**Key Differences from Prod:**

- No resource limits (services may use all available CPU/memory)
- Smaller retention windows (7-15 days)
- PostgreSQL on port 5432 (direct, no pgbouncer)
- loki/prometheus/grafana on alternate ports

### Production Environment (`docker-compose.prod.yml`)

**14 Services (with pgbouncer, resource limits, rolling updates)**

```yaml
# Summary of each service's configuration (not a literal Compose file)
services:
  api: NestJS, resource limits: 1.0 CPU / 1g memory
  web: Next.js, resource limits: 0.5 CPU / 512m memory
  ai-services: Python, resource limits: 1.0 CPU / 1g memory
  postgres: PostGIS, resource limits: 2.0 CPU / 2g memory
  pgbouncer: Connection pool (NEW), 20 default connections, transaction mode
  redis: 7-alpine, resource limits: 0.5 CPU / 768m memory, password auth
  typesense: 27.1, resource limits: 1.0 CPU / 1g memory
  minio: latest, resource limits: 0.5 CPU / 1g memory
  loki: v3.0.0, resource limits: 0.5 CPU / 512m memory
  promtail: v3.0.0, resource limits: 0.25 CPU / 256m memory
  prometheus: v2.51.0, resource limits: 0.5 CPU / 1g memory, 30-day retention
  grafana: v10.4.1, resource limits: 0.5 CPU / 512m memory
  pg-backup: Same as dev
```

**Production-Specific Flags:**

- `read_only: true` on app containers (api, web, ai-services)
- `tmpfs: [/tmp]` for runtime temp files
- `security_opt: [no-new-privileges:true]`
- `logging: json-file` with 10m max-size, rotating 3-5 files
- **PgBouncer inserted between apps ↔ Postgres** (port 6432)
- Secrets management: `GRAFANA_ADMIN_USER`, `GRAFANA_ADMIN_PASSWORD` from Docker secrets
- Redis requires password authentication

### CI/E2E Environment (`docker-compose.ci.yml`)

**Minimal 4 Services (tmpfs for speed)**

```yaml
# Summary of each service's configuration (not a literal Compose file)
services:
  postgres: goodgo_test DB, tmpfs (/var/lib/postgresql/data)
  redis: --save "" --appendonly no (no persistence)
  typesense: tmpfs (/data)
  minio: tmpfs (/data)
```

**Used by:**

- GitHub Actions E2E test suite
- Local `docker compose -f docker-compose.ci.yml up --wait`

---

## Database Layer

### PostgreSQL + PostGIS

**Version:** PostgreSQL 16 with PostGIS 3.4 (image `postgis/postgis:16-3.4`)
**Schema:** 22 Prisma models + Prisma migration tracking

#### Prisma Schema Models

1. **Auth:** User, RefreshToken, OAuthAccount, Agent
2. **Listings:** Property, PropertyMedia, Listing
3. **Search:** SavedSearch
4. **Transactions:** Transaction, Inquiry, Lead
5. **Payments:** Payment (with PaymentProvider enum: VNPAY, MOMO, ZALOPAY, BANK_TRANSFER)
6. **Subscriptions:** Plan, Subscription, UsageRecord
7. **Analytics:** Valuation, MarketIndex
8. **Notifications:** NotificationLog, NotificationPreference
9. **Audit:** AdminAuditLog
10. **Reviews:** Review

#### Key Database Features

- **PostGIS Geometry:** Property.location (Point, SRID 4326) with GIST index
- **Enums:** UserRole, KYCStatus, PropertyType, TransactionType, ListingStatus, Direction, OAuthProvider, TransactionStatus, LeadStatus, PaymentProvider, PaymentStatus, PaymentType, PlanTier, SubscriptionStatus, NotificationChannel, NotificationStatus, AdminAction, AuditTargetType
- **Compound Indexes:** Query optimization on (role, isActive, createdAt), (sellerId, status, publishedAt), (userId, status, createdAt), etc.
- **Constraints:** Unique idempotency key on Payment (userId, provider, idempotencyKey)

#### Connection Pooling: PgBouncer

**Dev Mode (docker-compose.yml):**

- Apps connect directly to `postgres:5432`
- No pooling overhead

**Prod Mode (docker-compose.prod.yml):**

- Apps connect to `pgbouncer:6432`
- **Pool Mode:** `transaction` (connections returned after each transaction)
- **Pool Size:** 20 connections (default, tunable via `PGBOUNCER_POOL_SIZE`)
- **Max Client Conn:** 200 (tunable via `PGBOUNCER_MAX_CLIENT_CONN`)
- **Reserve Pool:** 5 connections (fallback when the pool is exhausted)
- **Timeouts:**
  - server_connect_timeout: 15s
  - server_idle_timeout: 600s
  - server_lifetime: 3600s (connection recycle)
  - query_wait_timeout: 120s
  - query_timeout: 0 (disabled)
- **Admin Console:** pgbouncer_admin user (password via PGBOUNCER_ADMIN_PASSWORD env var)
- **Stats Console:** pgbouncer_stats user (password via PGBOUNCER_STATS_PASSWORD env var)

**Migration Workaround:**

- The API has two connection-string env vars, because PgBouncer's transaction pooling does not support the session-level features that schema migrations need:
  - `DATABASE_URL` → pgbouncer:6432 (normal queries)
  - `DATABASE_URL_DIRECT` → postgres:5432 (migrations, introspection, DDL)
- `RUN_MIGRATIONS=true` switches the app to `DATABASE_URL_DIRECT` for `prisma migrate deploy`

#### Backup Strategy

**Automated Backups:**

- **Schedule:** Daily at 02:00 UTC (cron inside pg-backup container)
- **Format:** Custom format with gzip compression (level 6)
- **Retention:** 7 days (configurable via BACKUP_RETENTION_DAYS)
- **Location:** `pg_backups` volume (mount to persistent storage in prod)
- **File Pattern:** `goodgo_YYYYMMDD_HHMMSS.sql.gz`
- **Restore Script:** `/scripts/backup/pg-restore.sh` (manual restore)
- **Verification Script:** `/scripts/backup/pg-verify-backup.sh` (automated E2E verification)

**Verification Process (runs weekly):**

1. Restores latest backup to isolated test database (`goodgo_verify_`)
2. Verifies all 22 tables exist
3. Compares row counts between source and restored DB
4. Checksums critical tables (User, Property, Listing, Payment, Subscription, Transaction, Plan, _prisma_migrations)
5. Checks PostGIS extension, indexes, enum types
6. Generates JSON report with pass/fail result
7. **Cleanup:** Drops test DB on exit (unless SKIP_CLEANUP=1)
8. **Exit Codes:** 0=pass, 1=checks failed, 2=setup error

**CI/CD Backup Verification:**

- GitHub Action: `.github/workflows/backup-verify.yml`
- Runs weekly on Sundays at 05:00 UTC
- Also manually triggerable with skip_cleanup option
- Uploads JSON report as artifact

---

## Caching & Search

### Redis

**Image:** `redis:7-alpine`
**Port:** 6379

**Production Configuration:**

```bash
# --appendonly yes      AOF persistence (logs write operations)
# --requirepass         Authentication required
# --maxmemory 512mb     Max memory limit (prod)
# --maxmemory-policy    LRU eviction when full
redis-server \
  --appendonly yes \
  --requirepass "${REDIS_PASSWORD}" \
  --maxmemory 512mb \
  --maxmemory-policy allkeys-lru
```

**Development Configuration:**

```bash
redis-server \
  --appendonly yes \
  --maxmemory 256mb \
  --maxmemory-policy allkeys-lru
```

**ioredis Client Configuration:**

```typescript
// From RedisService in apps/api/src/modules/shared/infrastructure/redis.service.ts
{
  host: process.env.REDIS_HOST ?? 'localhost',
  port: Number(process.env.REDIS_PORT ?? 6379),
  password: process.env.REDIS_PASSWORD ?? undefined,
  lazyConnect: true,        // App starts even if Redis unavailable
  enableReadyCheck: false,  // Prevents "Redis is not ready" errors during transient outages
  maxRetriesPerRequest: 1,  // Fail fast (single retry, no exponential backoff)
  retryStrategy(times: number): number {
    return Math.min(times * 1000, 5000); // 1s → 2s → 3s → 4s → 5s → 5s...
  }
}
```

**Graceful Degradation:**

- Cache misses don't fail the application
- CacheService catches Redis errors and returns a cache miss
- App serves data directly from PostgreSQL if Redis is down
- Health check at `GET /health/redis` warns but doesn't fail readiness probe

**Use Cases:**

- Session storage
- Cache layer for expensive queries
- Rate limiting (if implemented)
- Real-time counters

---

### Typesense

**Image:** `typesense/typesense:27.1`
**Port:** 8108 (HTTP only, internal Docker network)
**API Key:** `${TYPESENSE_API_KEY}` (must be set in .env)

**Collection Schema:**

```
Collection Name: "listings"
Fields:
  - listingId (string)
  - propertyId (string)
  - title (string, searchable, highlights)
  - description (string, searchable, highlights)
  - propertyType (string, faceted)
  - transactionType (string, faceted: SALE/RENT)
  - priceVND (int64, sortable)
  - pricePerM2 (float, optional)
  - areaM2 (float)
  - bedrooms (int32, faceted)
  - bathrooms (int32, faceted)
  - floors (int32)
  - direction (string, faceted: NORTH/SOUTH/EAST/WEST/etc)
  - address (string)
  - ward (string, faceted)
  - district (string, faceted)
  - city (string, faceted)
  - location (geopoint), for radius search
  - agentId (string)
  - sellerId (string)
  - status (string, faceted: ACTIVE/SOLD/DRAFT/etc)
  - publishedAt (int64, sortable)
  - viewCount (int32)
  - saveCount (int32)
  - projectName (string, faceted)
  - amenities (string[], faceted)
```

**Search Features:**

- **Full-text search** on: title, description, address, district, city, projectName
- **Query weights:** title=5, description=3, address=2, district=2, city=1, projectName=2
- **Filtering:** propertyType, transactionType, bedrooms, district, city, status, amenities
- **Geo-search:** radius-based queries (lat, lng, km)
- **Sorting:** price (asc/desc), distance (asc from geopoint), date (desc), relevance
- **Highlights:** HTML marks on matched terms in title and description
- **Facets:** Return aggregated counts for filtering

**TypesenseSearchRepository (`apps/api/src/modules/search/infrastructure/services/typesense-search.repository.ts`):**

- `ensureCollection()`: Creates schema if it does not exist
- `dropCollection()`: Cleanup (testing only)
- `indexDocument(doc)`: Upsert single document
- `indexDocuments(docs)`: Bulk import with error reporting
- `removeDocument(id)`: Delete by ID
- `search(params)`: Execute search with filters, sort, pagination

**Graceful Degradation:**

- If Typesense is down, search falls back to PostgreSQL full-text search
- TypesenseClientService implements retry logic with exponential backoff
- Health check at `GET /health` returns JSON status

---

## Monitoring & Observability

### Prometheus

**Image:** `prom/prometheus:v2.51.0`
**Port:** 9090
**Retention:** 15 days (dev), 30 days (prod)
**Lifecycle API:** Enabled (`--web.enable-lifecycle`)

**Scrape Targets (`monitoring/prometheus/prometheus.yml`):**

```yaml
scrape_configs:
  - job_name: goodgo-api
    metrics_path: /metrics
    static_configs:
      - targets: ['host.docker.internal:3001']  # Dev (API on host)
      - targets: ['api:3001']                   # Prod (API in container)
        labels:
          service: goodgo-api
          environment: production  # "development" in dev
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
```

**Expected Metrics from API:**

- `goodgo_api_request_duration_seconds_bucket{le, route, method}`: Request latency histogram
- `http_requests_total{status_code, job}`: Request count by status code
- Custom business metrics (if implemented in NestJS @prometheus decorators)

### Alert Rules (`monitoring/prometheus/alert-rules.yml`)

**Latency Alerts:**

1. **ApiLatencyP99High** (warning)
   - Trigger: p99 latency > 1s for 5 minutes
   - Dashboard: `/d/goodgo-api-latency/goodgo-api-latency`
   - Runbook: `https://docs.goodgo.vn/runbooks/api-latency-high`
2. **ApiEndpointLatencyP99High** (warning)
   - Trigger: Per-endpoint p99 > 2s for 5 minutes
   - Annotates: method, route labels
3. **ApiLatencyP99Critical** (critical, SLO breach)
   - Trigger: p99 latency > 3s for 3 minutes
   - Escalation required
   - Runbook: `https://docs.goodgo.vn/runbooks/api-latency-critical`

**Error Rate Alert:**

1. **ApiErrorRate5xxHigh** (warning)
   - Trigger: 5xx error rate > 1% for 5 minutes
   - Uses: `(5xx errors / total requests) * 100`

### Grafana

**Image:** `grafana/grafana:10.4.1`
**Port:** 3002
**Auth:** Admin user/password from secrets (prod) or env vars (dev)

**Pre-Provisioned Datasources:**

- Prometheus (default, primary)
- Loki (with derived fields for correlationId linkage)

**Dashboards:**

1. `api-latency.json`: API p99/p95/p50, route breakdown, slow endpoints
2. `api-overview.json`: Request rate, error rate, uptime status
3. `database.json`: Query latency, connection pool utilization, slow queries
4. `logs.json`: Log volume, error logs, trace links to Prometheus
5. `search.json`: Typesense query latency, indexing rate, collection size
6. `web-vitals.json`: Frontend Core Web Vitals (if client-side instrumentation)
7. `business-metrics.json`: Listings created, payments processed, user signups

**Admin Console Access:**

- URL: `http://localhost:3002` (dev) or the port set by `${GRAFANA_PORT}` (prod)
- Default user: `admin` (change password on first login)
- Self-signup disabled (`GF_USERS_ALLOW_SIGN_UP: false`)

### Loki & Promtail (Log Aggregation)

**Loki:** `grafana/loki:3.0.0`, port 3100

**Configuration:**

```yaml
schema:
  - from: 2024-01-01
    store: tsdb
    schema: v13
limits:
  max_entries_limit_per_query: 5000
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 6
retention: 360h  # 15 days
```

**Promtail:** `grafana/promtail:3.0.0`

**Configuration:**

- Scrapes Docker logs from `goodgo-net` bridge network
- Parses **Pino JSON** structured logs
- Extracts labels: level, context, component, service
- Structured metadata: method, url, statusCode, correlationId, duration
- Derives timestamp from Pino output (RFC3339Nano)

**Expected Log Format (Pino; `level` 30 is `info`):**

```json
{
  "level": 30,
  "time": "2026-04-11T10:30:00Z",
  "msg": "POST /api/listings",
  "correlationId": "abc-123-def",
  "context": "ListingController",
  "component": "api",
  "method": "POST",
  "url": "/api/listings",
  "statusCode": 201,
  "duration": 150
}
```

---

## Payment Integration

### Supported Payment Providers

**Enum:** `PaymentProvider` (Prisma)

- `VNPAY`: VNPay (Vietnam payment gateway)
- `MOMO`: MoMo (Vietnam mobile wallet)
- `ZALOPAY`: ZaloPay (Vietnam digital wallet)
- `BANK_TRANSFER`: Manual bank transfer (offline)

### Payment Flow & Callback Handling

**Database Schema (Payment Model):**

```prisma
model Payment {
  id             String          @id @default(cuid())
  userId         String
  transactionId  String?
  provider       PaymentProvider
  type           PaymentType     // SUBSCRIPTION, LISTING_FEE, DEPOSIT, FEATURED_LISTING
  amountVND      BigInt
  status         PaymentStatus   // PENDING, PROCESSING, COMPLETED, FAILED, REFUNDED
  providerTxId   String?         // External transaction ID from VNPay/MoMo/ZaloPay
  callbackData   Json?           // Raw callback payload (for audit)
  idempotencyKey String?
  // Prevent duplicate payments (userId, provider, idempotencyKey unique)
  createdAt      DateTime        @default(now())
  updatedAt      DateTime        @updatedAt
}

enum PaymentStatus {
  PENDING
  PROCESSING
  COMPLETED
  FAILED
  REFUNDED
}

enum PaymentType {
  SUBSCRIPTION
  LISTING_FEE
  DEPOSIT
  FEATURED_LISTING
}
```

**Command Handler: `HandleCallbackHandler`** (`apps/api/src/modules/payments/application/commands/handle-callback/handle-callback.handler.ts`)

1. **Callback Signature Verification:**
   - Uses `PAYMENT_GATEWAY_FACTORY` to route to the correct provider (VNPay/MoMo/ZaloPay)
   - `Gateway.verifyCallback()` validates the HMAC signature
   - Throws `ValidationException` if the signature is invalid
2. **Idempotent Status Transition:**
   - Only updates payments in state `PENDING` or `PROCESSING`
   - Atomically transitions to `COMPLETED` or `FAILED`
   - If already in a terminal state (COMPLETED/FAILED/REFUNDED), returns the existing status (idempotent)
   - Logs a warning if the payment is not found
3. **Domain Event Publishing:**
   - Reconstructs domain entity from repository
   - Emits `PaymentCompletedEvent` or `PaymentFailedEvent`
   - Event bus publishes events to subscribers (e.g., subscription creation, listing activation)
4. **Response:**

   ```typescript
   {
     paymentId: string,
     status: PaymentStatus,
     isSuccess: boolean
   }
   ```

**Payment Gateway Interface (`payment-gateway.interface.ts`):**

```typescript
// Generic type arguments were missing from the source listing.
// Those marked "assumed" are inferred from names and may differ from the real file.
interface IPaymentGateway {
  readonly provider: PaymentProvider
  createPaymentUrl(params: CreatePaymentUrlParams): Promise<string> // assumed
  verifyCallback(data: Record<string, unknown>): CallbackVerifyResult // assumed
  refund(params: RefundParams): Promise<RefundResult>
}

interface CreatePaymentUrlParams {
  orderId: string
  amountVND: bigint
  description: string
  returnUrl: string
  ipAddress: string
}

interface CallbackVerifyResult {
  isValid: boolean
  orderId: string
  providerTxId: string
  isSuccess: boolean
  rawData: Record<string, unknown> // assumed
}

interface RefundParams {
  providerTxId: string
  amountVND: bigint
  reason: string
}

interface RefundResult {
  success: boolean
  refundTxId: string | null
}
```

### Environment Variables

**VNPay:**

```env
VNPAY_TMN_CODE=
VNPAY_HASH_SECRET=
VNPAY_BASE_URL=https://sandbox.vnpayment.vn/paymentv2/vpcpay.html
VNPAY_API_URL=https://sandbox.vnpayment.vn/merchant_webapi/api/transaction
```

**MoMo:**

```env
MOMO_PARTNER_CODE=
MOMO_ACCESS_KEY=
MOMO_SECRET_KEY=
MOMO_ENDPOINT=https://test-payment.momo.vn/v2/gateway/api
```

**ZaloPay:**

```env
ZALOPAY_APP_ID=
ZALOPAY_KEY1=
ZALOPAY_KEY2=
ZALOPAY_ENDPOINT=https://sb-openapi.zalopay.vn/v2
```

### Race Condition & Idempotency Protection

**Problem:** Multiple callbacks may arrive for the same payment (network retries, duplicate notifications)

**Solution:**

1. **Unique Idempotency Key:** `Payment_idempotency_unique(userId, provider, idempotencyKey)`
   - Prevents duplicate payment records
   - Generated by client/API before creating the payment
2. **Atomic Status Update:** `paymentRepo.updateIfStatus(orderId, ['PENDING', 'PROCESSING'], newStatus)`
   - Only updates if the current status is in the allowed list
   - Returns the updated entity, or null if already terminal
3. **Terminal State Check:** If already COMPLETED/FAILED/REFUNDED, the handler returns the existing state
   - No re-triggering of domain events
   - No double billing or duplicate transactions

---

## Health Checks

### API Health Endpoints

**Health Controller** (`apps/api/src/modules/health/health.controller.ts`)

1. **GET /health**: Liveness Probe (always 200 if process running)
   - Uses: `@HealthCheck()` on empty probe list
   - Response: `{ "status": "ok", "timestamp": "..." }`
   - **Use Case:** Kubernetes/Docker liveness checks (confirms the process has started)
2. **GET /health/ready**: Readiness Probe (checks dependencies)
   - Checks: PostgreSQL + Redis connectivity
   - Response:

     ```json
     {
       "status": "ok",
       "checks": {
         "database": { "status": "up" },
         "redis": { "status": "up" }
       }
     }
     ```
   - **Use Case:** Load balancer, before accepting traffic
   - **Failure:** Returns 503 if any dependency is down
3. **GET /health/db**: Database Readiness Only
   - Checks: PostgreSQL connectivity via `SELECT 1` query
   - **Use Case:** Manual database troubleshooting
4. **GET /health/redis**: Redis Readiness Only
   - Checks: Redis PING command
   - **Use Case:** Manual Redis troubleshooting

### Health Check Implementations

**PrismaHealthIndicator** (`apps/api/src/modules/health/infrastructure/prisma.health.ts`):

```typescript
async isHealthy(key: string): Promise<HealthIndicatorResult> {
  try {
    await this.prisma.$queryRawUnsafe('SELECT 1');
    return this.getStatus(key, true);
  } catch {
    throw new HealthCheckError('Database check failed', this.getStatus(key, false));
  }
}
```

**RedisHealthIndicator** (`apps/api/src/modules/health/infrastructure/redis.health.ts`):

```typescript
async isHealthy(key: string): Promise<HealthIndicatorResult> {
  try {
    const client = this.redis.getClient();
    const pong = await client.ping();
    const isHealthy = pong === 'PONG';
    const result = this.getStatus(key, isHealthy);
    if (isHealthy) return result;
    throw new HealthCheckError('Redis ping failed', result);
  } catch (error) {
    // Re-throw our own failure unchanged; wrap anything else (e.g. connection errors)
    if (error instanceof HealthCheckError) throw error;
    throw new HealthCheckError('Redis check failed', this.getStatus(key, false));
  }
}
```

### Docker Container Health Checks

**API Container:**

```yaml
healthcheck:
  test: ['CMD', 'node', '-e', "fetch('http://localhost:3001/health').then(r => { if (!r.ok) throw 1 }).catch(() => process.exit(1))"]
  interval: 30s
  timeout: 5s
  retries: 5
  start_period: 30s
```

**Web Container:**

```yaml
healthcheck:
  test: ['CMD', 'node', '-e', "fetch('http://localhost:3000').then(r => { if (!r.ok) throw 1 }).catch(() => process.exit(1))"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 15s
```

**PostgreSQL:**

```yaml
healthcheck:
  test: ['CMD-SHELL', 'pg_isready -U ${DB_USER} -d ${DB_NAME}']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s
```

**Redis:**

```yaml
healthcheck:
  test: ['CMD', 'redis-cli', '-a', '${REDIS_PASSWORD}', 'ping']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 10s
```

**Typesense:**

```yaml
healthcheck:
  test: ['CMD', 'curl', '-sf', 'http://localhost:8108/health']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 15s
```

---

## Environment Variables

### Complete `.env.example` Reference

**PostgreSQL:**

```env
DB_HOST=localhost
DB_PORT=5432
DB_NAME=goodgo
DB_USER=goodgo
DB_PASSWORD=CHANGE_ME
DATABASE_URL=postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}?schema=public
DATABASE_URL_DIRECT=postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}?schema=public
```

**PgBouncer (Prod Only):**

```env
PGBOUNCER_POOL_SIZE=20
PGBOUNCER_MAX_CLIENT_CONN=200
PGBOUNCER_ADMIN_PASSWORD=CHANGE_ME
PGBOUNCER_STATS_PASSWORD=CHANGE_ME
```

**Redis:**

```env
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=
REDIS_URL=redis://${REDIS_HOST}:${REDIS_PORT}
```

**Typesense:**

```env
TYPESENSE_HOST=localhost
TYPESENSE_PORT=8108
TYPESENSE_PROTOCOL=http
TYPESENSE_API_KEY=CHANGE_ME
```

**MinIO:**

```env
MINIO_ENDPOINT=localhost
MINIO_PORT=9000
MINIO_CONSOLE_PORT=9001
MINIO_ACCESS_KEY=CHANGE_ME
MINIO_SECRET_KEY=CHANGE_ME
MINIO_BUCKET=goodgo-media
MINIO_USE_SSL=false
```

**NestJS API:**

```env
API_PORT=3000
PORT=3001
NODE_ENV=development
CORS_ORIGINS=http://localhost:3000,http://localhost:3001
```

**JWT / Authentication (REQUIRED):**

```env
JWT_SECRET=
JWT_EXPIRES_IN=15m
JWT_REFRESH_SECRET=
JWT_REFRESH_EXPIRES_IN=7d
```

**OAuth Providers:**

```env
GOOGLE_CLIENT_ID=
GOOGLE_CLIENT_SECRET=
GOOGLE_CALLBACK_URL=http://localhost:3001/auth/google/callback
ZALO_APP_ID=
ZALO_APP_SECRET=
ZALO_CALLBACK_URL=http://localhost:3001/auth/zalo/callback
FRONTEND_URL=http://localhost:3000
```

**Next.js Web:**

```env
NEXT_PUBLIC_API_URL=http://localhost:3000
WEB_PORT=3001
```

**AI Service (Python/FastAPI):**

```env
AI_SERVICE_PORT=8000
AI_SERVICE_URL=http://localhost:8000
CLAUDE_API_KEY=
AI_DEBUG=false
AI_LOG_LEVEL=info
```

**Map Integration:**

```env
NEXT_PUBLIC_MAPBOX_TOKEN=
```

**Payment Gateways:**

```env
VNPAY_TMN_CODE=
VNPAY_HASH_SECRET=
VNPAY_BASE_URL=https://sandbox.vnpayment.vn/paymentv2/vpcpay.html
VNPAY_API_URL=https://sandbox.vnpayment.vn/merchant_webapi/api/transaction
MOMO_PARTNER_CODE=
MOMO_ACCESS_KEY=
MOMO_SECRET_KEY=
MOMO_ENDPOINT=https://test-payment.momo.vn/v2/gateway/api
ZALOPAY_APP_ID=
ZALOPAY_KEY1=
ZALOPAY_KEY2=
ZALOPAY_ENDPOINT=https://sb-openapi.zalopay.vn/v2
```

**Email / SMTP:**

```env
SMTP_HOST=localhost
SMTP_PORT=1025
SMTP_USER=
SMTP_PASS=
SMTP_FROM=noreply@goodgo.vn
```

**Firebase Cloud Messaging (Optional):**

```env
FIREBASE_SERVICE_ACCOUNT=
```

**Sentry Error Tracking:**

```env
SENTRY_DSN=
NEXT_PUBLIC_SENTRY_DSN=
SENTRY_AUTH_TOKEN=
SENTRY_ORG=
SENTRY_PROJECT=
```

**KYC Field Encryption (REQUIRED Prod):**

```env
# 64 hex chars (32 bytes)
KYC_ENCRYPTION_KEY=
KYC_ENCRYPTION_KEY_VERSION=1
```

**Logging:**

```env
LOG_LEVEL=info
```

---

## Backup & Recovery

### Automated Daily Backups

**Service:** `pg-backup` container (runs inside docker compose)
**Backup Script:** `scripts/backup/pg-backup.sh`

```bash
# Daily cron job: 02:00 UTC
PGHOST=postgres \
PGPORT=5432 \
PGUSER=goodgo \
PGDATABASE=goodgo \
PGPASSWORD= \
BACKUP_DIR=/backups \
RETENTION_DAYS=7 \
/scripts/pg-backup.sh
```

**Behavior:**

1. Creates dump with `pg_dump --format=custom --compress=6`
2. Saves as `goodgo_YYYYMMDD_HHMMSS.sql.gz`
3. Prunes backups older than 7 days (configurable)
4. Logs to `/var/log/pg-backup.log`

**Restore from Backup:**

```bash
# Restore directly with pg_restore
docker compose -f docker-compose.prod.yml exec pg-backup bash -c \
  'pg_restore -h postgres -p 5432 -U goodgo -d goodgo \
   --clean --if-exists /backups/goodgo_20260410_020000.sql.gz'

# Or using restore script
docker compose -f docker-compose.prod.yml run --rm pg-verify-backup bash -c \
  'source /scripts/pg-restore.sh /backups/goodgo_20260410_020000.sql.gz'
```

### Backup Verification

**Service:** `pg-verify-backup` container (on-demand, profile: tools)
**Verification Script:** `scripts/backup/pg-verify-backup.sh`

```bash
# Usage:
docker compose -f docker-compose.prod.yml run --rm pg-verify-backup

# Or with options:
SKIP_CLEANUP=1 REPORT_FILE=/backups/verify-report.json \
  docker compose -f docker-compose.prod.yml run --rm pg-verify-backup
```

**Verification Steps:**

1. Creates isolated test database: `goodgo_verify_`
2. Enables PostGIS extension
3. Restores backup into test DB
4. Verifies all 22 tables exist
5. Compares row counts between source and restored DB
6. Checksums critical tables using MD5 hashes
7. Checks indexes, enum types
8. Generates JSON report with results
9. **Cleanup:** Drops test DB (unless SKIP_CLEANUP=1)

**JSON Report Structure:**

```json
{
  "timestamp": "2026-04-11T10:30:00Z",
  "backupFile": "/backups/goodgo_20260410_020000.sql.gz",
  "backupSize": "150M",
  "testDatabase": "goodgo_verify_20260411_103000",
  "restoreDurationSeconds": 45,
  "passed": 28,
  "failed": 0,
  "warnings": 2,
  "result": "pass",
  "checks": [
    { "check": "Database creation", "status": "pass", "detail": "Test database created" },
    { "check": "Restore", "status": "pass", "detail": "pg_restore completed cleanly in 45s" },
    { "check": "Table existence", "status": "pass", "detail": "All 22 expected tables present" },
    { "check": "Row counts", "status": "pass", "detail": "All tables match source database" },
    { "check": "Checksum: User identities", "status": "pass", "detail": "Hashes match (abc123def456...)" },
    ...
  ]
}
```

**GitHub Action Backup Verification:**

- File: `.github/workflows/backup-verify.yml`
- Schedule: Weekly on Sundays at 05:00 UTC
- Also: Manual trigger with skip_cleanup option
- Artifacts: Uploads JSON report, retained for 30 days

---

## Deployment Pipeline

### GitHub Actions CI/CD

**Workflows:**

1. `.github/workflows/ci.yml`: Lint, typecheck, test, build (on push/PR to master)
2. `.github/workflows/deploy.yml`: Build Docker images, deploy to staging/prod
3. `.github/workflows/e2e.yml`: E2E tests (spins up full docker-compose.ci.yml)
4. `.github/workflows/backup-verify.yml`: Weekly backup verification
5. `.github/workflows/security.yml`: Dependency scanning, SAST
6. `.github/workflows/codeql.yml`: GitHub CodeQL analysis
7. `.github/workflows/load-test.yml`: K6 load testing

### CI Pipeline (`ci.yml`)

**On:** `push master`, `pull_request master`
**Node:** 22
**Concurrency:** Cancel previous runs on same ref

**Jobs:**

1. **Lint → Typecheck → Test → Build**
   - Installs pnpm, Node 22
   - Runs linter (eslint)
   - Type checks (tsc)
   - Unit tests (jest)
   - Builds all apps (turbo)
   - PostgreSQL 16 service available (goodgo_test DB)
2. **E2E Tests** (depends on ci job)
   - Full docker-compose.ci.yml services (postgres, redis, typesense, minio)
   - Runs end-to-end test suite
   - Timeout: 20 minutes
   - Env vars: DATABASE_URL, JWT secrets, payment test codes

### Deploy Pipeline (`deploy.yml`)

**On:**

- `push master` (auto-deploys to staging)
- Manual `workflow_dispatch` (choose staging or production)

**Jobs:**

1. **Build API Image**
   - Builds: `goodgo-api:${IMAGE_TAG}`
   - Dockerfile: `apps/api/Dockerfile`
   - Registry: `ghcr.io/goodgo/goodgo-api`
   - Tags: git SHA, branch name, `latest` (on master)
2. **Build Web Image**
   - Builds: `goodgo-web:${IMAGE_TAG}`
   - Dockerfile: `apps/web/Dockerfile`
   - Registry: `ghcr.io/goodgo/goodgo-web`
3. **Build AI Services Image**
   - Builds: `goodgo-ai-services:${IMAGE_TAG}`
   - Context: `libs/ai-services/`
   - Registry: `ghcr.io/goodgo/goodgo-ai-services`
4. **Deploy to Staging**
   - Condition: `github.event_name == 'push' || inputs.environment == 'staging'`
   - SSH into staging host
   - Pulls new images from GHCR
   - **Rolling update** (zero downtime):

     ```bash
     docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
     docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
     docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services
     ```
   - Runs migrations: `docker compose exec api npx prisma migrate deploy`
   - Prunes old images
5. **Deploy to Production**
   - Only on manual `workflow_dispatch` with `environment: production`
   - Same steps as staging
   - Requires approval through the `production` GitHub environment's protection rules

### Dockerfile Multi-Stage Builds

**API (apps/api/Dockerfile):**

- **Base:** node:22-slim + pnpm 10.27.0
- **Deps:** Install locked dependencies (layer caching)
- **Build:** Compile TypeScript, generate Prisma client
- **Prune:** `pnpm deploy --prod` (removes dev deps, hoists prod deps)
- **Production:** Minimal image, dumb-init for signals, non-root user

**Web (apps/web/Dockerfile):**

- **Base:** node:22-slim + pnpm
- **Deps:** Install dependencies
- **Build:** `next build` → standalone output + static files
- **Production:** Copy .next/standalone, public, static assets

**AI Services (libs/ai-services/Dockerfile):**

- **Base:** python:3.12-slim
- **Install:** System deps (gcc, g++), dumb-init, FastAPI/XGBoost/underthesea
- **Models:** Pre-download underthesea ML models at build time
- **User:** Run as non-root appuser
- **CMD:** `uvicorn app.main:app --host 0.0.0.0 --port 8000`

---

## Troubleshooting Guide

### Check Service Status

```bash
# All services
docker compose -f docker-compose.prod.yml ps

# Single service
docker compose -f docker-compose.prod.yml ps api

# Get logs
docker compose -f docker-compose.prod.yml logs -f api --tail=100

# Health check status
docker compose -f docker-compose.prod.yml exec api curl http://localhost:3001/health
```

### Common Issues

#### 1. API Service Not Healthy (stuck in `starting` or `unhealthy` state)

**Symptoms:**

- `docker compose ps` shows `(health: starting)` for >2 minutes
- `docker compose logs api` shows connection errors

**Diagnosis:**

```bash
# Check API liveness
docker compose exec api curl http://localhost:3001/health

# Check readiness (includes DB + Redis checks)
docker compose exec api curl http://localhost:3001/health/ready

# Check specific dependencies
docker compose exec api curl http://localhost:3001/health/db
docker compose exec api curl http://localhost:3001/health/redis
```

**Solutions:**

- **PostgreSQL not ready:**

  ```bash
  docker compose ps postgres  # Should show (healthy)
  docker compose exec postgres pg_isready -U goodgo -d goodgo
  docker compose logs postgres --tail=50
  ```
- **Redis not ready:**

  ```bash
  docker compose exec redis redis-cli ping  # Should return PONG
  docker compose logs redis --tail=50
  ```
- **PgBouncer not ready (prod):**

  ```bash
  docker compose exec pgbouncer pg_isready -h 127.0.0.1 -p 6432 -U goodgo
  docker compose logs pgbouncer --tail=50
  ```
- **Database schema not initialized:**

  ```bash
  # Run migrations manually
  docker compose exec api npx prisma migrate deploy

  # Or check if schema exists
  docker compose exec postgres psql -U goodgo -d goodgo -c "\dt"
  ```

#### 2. Database Connection Pool Exhaustion

**Symptoms:**

- Errors: `Error: unable to get a connection from the pool after X s`
- Slow queries pile up
- API latency spikes

**Diagnosis:**

```bash
# Check pool stats (prod, PgBouncer); the admin console is the special "pgbouncer" database
docker compose exec pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_stats -d pgbouncer -c "SHOW STATS"
docker compose exec pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_stats -d pgbouncer -c "SHOW POOLS"

# Or query PostgreSQL directly
docker compose exec postgres psql -U goodgo -d goodgo -c "SELECT count(*) FROM pg_stat_activity"
```

**Solutions:**

- Increase `PGBOUNCER_POOL_SIZE` (default: 20)
- Increase `PGBOUNCER_MAX_CLIENT_CONN` (default: 200)
- Reduce long-running queries (add a query timeout)
- Check for idle server connections and tune `server_idle_timeout`

#### 3. Redis Connection Failures (Non-Fatal)

**Symptoms:**

- Logs: `Redis check failed` or `ECONNREFUSED`
- API still responds, but more slowly, reading directly from the database
- Health check `/health/ready` returns 503

**Expected Behavior:** Cache misses → app serves from database

**Diagnosis:**

```bash
# Check Redis availability (in prod, add: -a "$REDIS_PASSWORD")
docker compose exec redis redis-cli ping

# Check RedisService logs
docker compose logs api | grep -i redis
```

**Solutions:**

- Restart Redis: `docker compose restart redis`
- Check memory: `docker compose exec redis redis-cli info memory`
- If at `maxmemory`, increase the limit in docker-compose.yml and restart

#### 4. Typesense Search Not Indexing

**Symptoms:**

- Search returns 0 results
- Listings created but not searchable
- `/health` for typesense shows green, but collection is empty

**Diagnosis:**

```bash
# Check collection exists
curl http://localhost:8108/collections -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"

# Check collection stats
curl "http://localhost:8108/collections/listings" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" | jq .

# Check recent docs
curl "http://localhost:8108/collections/listings/documents/search?q=*" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" | jq '.found'
```

**Solutions:**

- Verify `TYPESENSE_API_KEY` matches the container env var
- Reindex all listings:

  ```bash
  docker compose exec api npx ts-node scripts/reindex-listings.ts
  ```
- If the collection is corrupted, drop and recreate it:

  ```bash
  curl -X DELETE "http://localhost:8108/collections/listings" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"

  # Then restart API service to recreate schema
  docker compose restart api
  ```

#### 5.
Payment Callback Failures **Symptoms:** - Payment status stuck in `PENDING` - Logs: `Invalid callback signature for provider=VNPAY` **Diagnosis:** ```bash # Check payment record in DB docker compose exec postgres psql -U goodgo -d goodgo -c \ "SELECT id, status, provider, \"providerTxId\", \"callbackData\" FROM \"Payment\" \ WHERE \"providerTxId\" = 'your-txid' ORDER BY \"createdAt\" DESC LIMIT 1;" # Check logs for callback handler docker compose logs api | grep -i "HandleCallbackHandler\|callback" ``` **Solutions:** - Verify payment gateway credentials (VNPAY_HASH_SECRET, MOMO_SECRET_KEY, etc.) - Manually verify callback signature (contact payment provider support) - Replay callback manually (if idempotent key available): ```bash curl -X POST http://localhost:3001/api/payments/callback \ -H "Content-Type: application/json" \ -d '{"provider":"VNPAY",...callback data...}' ``` #### 6. Backup Verification Fails **Symptoms:** - GitHub Action `.github/workflows/backup-verify.yml` fails - Restore test database shows mismatched row counts **Diagnosis:** ```bash # Run verification manually docker compose -f docker-compose.ci.yml up postgres docker compose -f docker-compose.ci.yml exec postgres \ /scripts/pg-verify-backup.sh /backups/goodgo_latest.sql.gz # Check JSON report cat /tmp/backups/verify-report.json | jq . ``` **Solutions:** - Check if backup file corrupt: `file goodgo_*.sql.gz` - Verify restore process: `pg_restore --verbose` - Check PostGIS extension availability: `psql -c "CREATE EXTENSION postgis;"` #### 7. 
Memory/CPU Pressure **Symptoms:** - OOM kills, container exits 137 - CPU throttling, latency spikes - Prometheus `container_memory_usage_bytes` near limit **Diagnosis:** ```bash # Check Docker stats docker stats --no-stream # Check limits in compose file docker compose config | grep -A3 "resources:" # Check actual memory usage docker inspect goodgo-api | jq '.HostConfig.Memory' ``` **Solutions:** - Increase resource limits in `docker-compose.prod.yml` - Reduce log verbosity (set LOG_LEVEL=warn) - Implement pagination for large queries - Scale horizontally (add more API replicas) ### Prometheus Queries for Debugging ```promql # API request latency p99 histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket[5m])) by (le)) # API error rate (5xx) (sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100 # Container memory usage container_memory_usage_bytes{name="goodgo-api"} # Container CPU usage rate(container_cpu_usage_seconds_total{name="goodgo-api"}[5m]) # PostgreSQL active queries pg_stat_activity_count{state="active"} # Redis memory usage redis_memory_used_bytes / 1024 / 1024 # in MB # Typesense collection size typesense_documents_count{collection="listings"} ``` ### Emergency Procedures **Full System Reset (dev only):** ```bash docker compose down -v # Remove all volumes! 
docker system prune -a docker compose up -d --wait docker compose exec api npx prisma db push docker compose exec api npx ts-node scripts/seed.ts ``` **Database Emergency Restore:** ```bash # Find latest backup ls -t /var/lib/docker/volumes/pg_backups/_data/goodgo_*.sql.gz | head -1 # Restore to new database pg_restore -h localhost -p 5432 -U goodgo -d goodgo_restored \ --clean --if-exists --verbose /path/to/backup.sql.gz # Verify restore psql -U goodgo -d goodgo_restored -c "SELECT count(*) FROM \"User\";" ``` **Force Kill Stuck Service:** ```bash # If health check broken docker compose kill api docker compose rm -f api docker compose up -d api ``` --- ## Appendix: Key File Locations ``` /Users/velikho/Desktop/WORKING/goodgo-platform-ai/ ├── docker-compose.yml # Dev environment ├── docker-compose.prod.yml # Prod environment (with pgbouncer, resource limits) ├── docker-compose.ci.yml # CI/E2E test environment ├── .env.example # Template for all required env vars │ ├── apps/ │ ├── api/ │ │ ├── Dockerfile # Multi-stage NestJS build │ │ ├── docker-entrypoint.sh # Startup script (migrations, app start) │ │ ├── src/ │ │ │ ├── modules/health/health.controller.ts │ │ │ ├── modules/payments/application/commands/handle-callback/ │ │ │ ├── modules/shared/infrastructure/redis.service.ts │ │ │ └── modules/search/infrastructure/services/typesense-search.repository.ts │ │ └── package.json │ │ │ └── web/ │ ├── Dockerfile # Multi-stage Next.js build │ └── package.json │ ├── libs/ │ └── ai-services/ │ ├── Dockerfile # Python FastAPI build │ ├── app/main.py # FastAPI app entry │ └── pyproject.toml │ ├── prisma/ │ └── schema.prisma # Complete Prisma schema (22 models) │ ├── infra/ │ └── pgbouncer/ │ ├── pgbouncer.ini # Connection pooling config │ ├── userlist.txt.template # User list (templated) │ └── entrypoint.sh # Env substitution script │ ├── scripts/ │ └── backup/ │ ├── pg-backup.sh # Daily backup automation │ ├── pg-verify-backup.sh # Restore verification │ └── pg-restore.sh # 
Manual restore script │ ├── monitoring/ │ ├── prometheus/ │ │ ├── prometheus.yml # Scrape config (goodgo-api metrics) │ │ └── alert-rules.yml # Latency + error rate alerts │ ├── loki/ │ │ └── loki-config.yml # Log aggregation config (15-day retention) │ ├── promtail/ │ │ └── promtail-config.yml # Log shipping (Pino JSON parsing) │ └── grafana/ │ ├── provisioning/ │ │ ├── datasources/datasource.yml │ │ └── dashboards/dashboard.yml │ └── dashboards/ │ ├── api-latency.json │ ├── api-overview.json │ ├── database.json │ ├── logs.json │ ├── search.json │ ├── web-vitals.json │ └── business-metrics.json │ └── .github/workflows/ ├── ci.yml # Lint, test, build ├── deploy.yml # Build images, deploy to staging/prod ├── e2e.yml # End-to-end tests ├── backup-verify.yml # Weekly backup verification ├── security.yml # Dependency/SAST scanning ├── codeql.yml # GitHub CodeQL └── load-test.yml # K6 load testing ``` --- ## Document Version History | Version | Date | Author | Changes | |---------|------|--------|---------| | 1.0 | 2026-04-11 | DevOps Team | Initial comprehensive runbook | --- **Last Updated:** April 11, 2026 **Maintained By:** GoodGo Platform SRE Team **Contact:** devops@goodgo.vn