goodgo-platform/docs/audits/INFRASTRUCTURE_RUNBOOK.md
Ho Ngoc Hai b8512ebff4 docs: consolidate audit and analysis reports into docs/audits/
Move 36 root-level audit/analysis documents and 7 web app audit documents
into docs/audits/ directory to declutter the project root. Remove stale
EXPLORATION_SUMMARY.txt.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-11 01:37:50 +07:00


GoodGo Platform — Operational Infrastructure Runbook

Last Updated: April 11, 2026
Version: 1.0
Purpose: Complete infrastructure reference for ops teams, SREs, and on-call engineers


Table of Contents

  1. Executive Summary
  2. Services Architecture
  3. Docker Compose Specifications
  4. Database Layer
  5. Caching & Search
  6. Monitoring & Observability
  7. Payment Integration
  8. Health Checks
  9. Environment Variables
  10. Backup & Recovery
  11. Deployment Pipeline
  12. Troubleshooting Guide

Executive Summary

GoodGo Platform is a monorepo real estate marketplace built with:

  • Frontend: Next.js (TypeScript)
  • Backend API: NestJS (TypeScript)
  • AI Services: Python/FastAPI
  • Database: PostgreSQL 16 + PostGIS
  • Cache: Redis 7
  • Search: Typesense 27.1
  • Object Storage: MinIO (S3-compatible)
  • Monitoring: Prometheus + Grafana + Loki + Promtail
  • Message Queue: Built-in CQRS/Event Bus (NestJS)

Total Services in Production: 12+ (detailed below)


Services Architecture

Service Inventory

| Service | Image | Port | Purpose | Health Check | Dependencies |
| --- | --- | --- | --- | --- | --- |
| api | goodgo-api:latest | 3001 | NestJS REST API | GET /health (3x30s) | postgres, redis, typesense, pgbouncer |
| web | goodgo-web:latest | 3000 | Next.js frontend | GET / (3x30s) | api |
| ai-services | goodgo-ai-services:latest | 8000 | Python FastAPI (price estimation, NLP) | GET /health (3x30s) | n/a |
| postgres | postgis/postgis:16-3.4 | 5432 | Primary database | pg_isready (5x10s) | n/a |
| pgbouncer | edoburu/pgbouncer:1.23.1-p2 | 6432 | Connection pooling (transaction mode) | pg_isready (5x10s) | postgres |
| redis | redis:7-alpine | 6379 | Cache + session store | PING (5x10s) | n/a |
| typesense | typesense/typesense:27.1 | 8108 | Full-text search index | GET /health (5x10s) | n/a |
| minio | minio/minio:latest | 9000/9001 | Object storage + console | mc ready local (5x10s) | n/a |
| loki | grafana/loki:3.0.0 | 3100 | Log aggregation | GET /ready (5x15s) | n/a |
| promtail | grafana/promtail:3.0.0 | 9080 | Log shipper | (depends on loki healthy) | loki |
| prometheus | prom/prometheus:v2.51.0 | 9090 | Metrics scraper | GET /-/healthy (3x15s) | n/a |
| grafana | grafana/grafana:10.4.1 | 3002 | Dashboards + alerting | GET /api/health (3x15s) | prometheus, loki |
| pg-backup | postgis/postgis:16-3.4 | n/a | Automated backup cron | depends_on postgres | postgres |

Network & Volumes

  • Network: Docker bridge network goodgo-net
  • Volumes:
    • pgdata — PostgreSQL data files
    • redis_data — Redis snapshot (AOF)
    • typesense_data — Search index
    • minio_data — Object storage
    • pg_backups — Database backups (daily retention: 7 days)
    • loki_data — Log chunks (retention: 15 days)
    • prometheus_data — Metrics TSDB (retention: 30 days in prod, 15 days in dev)
    • grafana_data — Dashboards, datasource configs

Docker Compose Specifications

Development Environment (docker-compose.yml)

11 Services listed below (minimal dependencies, no resource limits)

services:
  postgres:        PostGIS 16, port 5432, healthcheck: pg_isready (30s start-period)
  redis:           Alpine 7, port 6379, maxmemory: 256mb LRU, AOF enabled
  typesense:       v27.1, port 8108, CORS enabled, healthcheck /health
  minio:           latest, ports 9000 (API) / 9001 (console)
  ai-services:     Custom Python build, port 8000
  pg-backup:       Automated daily dumps at 02:00 UTC, cron retention cleanup
  pg-verify-backup: On-demand backup restore verification (profile: tools)
  loki:            v3.0.0, port 3100, 15-day retention, 2h compaction delay
  promtail:        v3.0.0, Docker socket instrumentation, Pino JSON parsing
  prometheus:      v2.51.0, port 9090, 15-day retention, lifecycle API enabled
  grafana:         v10.4.1, port 3002, datasources pre-provisioned

Key Differences from Prod:

  • No resource limits (use all available CPU/memory)
  • Smaller retention windows (7-15 days)
  • PostgreSQL on port 5432 (direct, no pgbouncer)
  • loki/prometheus/grafana on alternate ports

Production Environment (docker-compose.prod.yml)

13 Services listed below (with pgbouncer, resource limits, rolling updates)

services:
  api:             NestJS, resource limits: 1.0 CPU / 1g memory
  web:             Next.js, resource limits: 0.5 CPU / 512m memory
  ai-services:     Python, resource limits: 1.0 CPU / 1g memory
  postgres:        PostGIS, resource limits: 2.0 CPU / 2g memory
  pgbouncer:       Connection pool (NEW), 20 default connections, transaction mode
  redis:           7-alpine, resource limits: 0.5 CPU / 768m memory, password auth
  typesense:       27.1, resource limits: 1.0 CPU / 1g memory
  minio:           latest, resource limits: 0.5 CPU / 1g memory
  loki:            v3.0.0, resource limits: 0.5 CPU / 512m memory
  promtail:        v3.0.0, resource limits: 0.25 CPU / 256m memory
  prometheus:      v2.51.0, resource limits: 0.5 CPU / 1g memory, 30-day retention
  grafana:         v10.4.1, resource limits: 0.5 CPU / 512m memory
  pg-backup:       Same as dev

Production-Specific Flags:

  • read_only: true on app containers (api, web, ai-services)
  • tmpfs: [/tmp] for runtime temp files
  • security_opt: [no-new-privileges:true]
  • logging: json-file with 10m max-size, 3-5 files rotation
  • PgBouncer inserted between apps ↔ Postgres (port 6432)
  • Secrets management: GRAFANA_ADMIN_USER, GRAFANA_ADMIN_PASSWORD from Docker secrets
  • Redis requires password authentication

CI/E2E Environment (docker-compose.ci.yml)

Minimal 4 Services (tmpfs for speed)

services:
  postgres:        goodgo_test DB, tmpfs (/var/lib/postgresql/data)
  redis:          --save "" --appendonly no (no persistence)
  typesense:      tmpfs (/data)
  minio:          tmpfs (/data)

Used by:

  • GitHub Actions E2E test suite
  • Local docker compose -f docker-compose.ci.yml up --wait

Database Layer

PostgreSQL + PostGIS

Version: PostgreSQL 16 with PostGIS 3.4 (image postgis/postgis:16-3.4)
Schema: 22 Prisma models + Prisma migration tracking

Prisma Schema Models

  1. Auth: User, RefreshToken, OAuthAccount, Agent
  2. Listings: Property, PropertyMedia, Listing
  3. Search: SavedSearch
  4. Transactions: Transaction, Inquiry, Lead
  5. Payments: Payment (with PaymentProvider enum: VNPAY, MOMO, ZALOPAY, BANK_TRANSFER)
  6. Subscriptions: Plan, Subscription, UsageRecord
  7. Analytics: Valuation, MarketIndex
  8. Notifications: NotificationLog, NotificationPreference
  9. Audit: AdminAuditLog
  10. Reviews: Review

Key Database Features

  • PostGIS Geometry: Property.location (Point, SRID 4326) with GIST index
  • Enums: UserRole, KYCStatus, PropertyType, TransactionType, ListingStatus, Direction, OAuthProvider, TransactionStatus, LeadStatus, PaymentProvider, PaymentStatus, PaymentType, PlanTier, SubscriptionStatus, NotificationChannel, NotificationStatus, AdminAction, AuditTargetType
  • Compound Indexes: Query optimization on (role, isActive, createdAt), (sellerId, status, publishedAt), (userId, status, createdAt), etc.
  • Constraints: Unique idempotency key on Payment (userId, provider, idempotencyKey)

Connection Pooling: PgBouncer

Dev Mode (docker-compose.yml):

  • Apps connect directly to postgres:5432
  • No pooling overhead

Prod Mode (docker-compose.prod.yml):

  • Apps connect to pgbouncer:6432
  • Pool Mode: transaction (connections returned after each transaction)
  • Pool Size: 20 connections (default, tunable via PGBOUNCER_POOL_SIZE)
  • Max Client Conn: 200 (tunable via PGBOUNCER_MAX_CLIENT_CONN)
  • Reserve Pool: 5 connections (fallback when pool exhausted)
  • Timeouts:
    • server_connect_timeout: 15s
    • server_idle_timeout: 600s
    • server_lifetime: 3600s (connection recycle)
    • query_wait_timeout: 120s
    • query_timeout: 0 (disabled)
  • Admin Console: pgbouncer_admin user (password via PGBOUNCER_ADMIN_PASSWORD env var)
  • Stats Console: pgbouncer_stats user (password via PGBOUNCER_STATS_PASSWORD env var)

Migration Workaround:

  • API has two DATABASE_URL env vars:
    • DATABASE_URL → pgbouncer:6432 (normal queries)
    • DATABASE_URL_DIRECT → postgres:5432 (migrations, introspection, DDL)
  • RUN_MIGRATIONS=true switches app to use DATABASE_URL_DIRECT for prisma migrate deploy
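
The URL switch described above can be sketched as a small pure function. This is a minimal illustration, not the application's actual bootstrap code: the helper name `resolveDatabaseUrl` and the `Env` type are hypothetical, and only the three variable names come from this runbook.

```typescript
// Hypothetical helper: mirrors the DATABASE_URL / DATABASE_URL_DIRECT split.
type Env = Record<string, string | undefined>;

function resolveDatabaseUrl(env: Env): string {
  const runMigrations = env.RUN_MIGRATIONS === "true";
  // Migrations and DDL bypass PgBouncer; normal queries go through the pool.
  const name = runMigrations ? "DATABASE_URL_DIRECT" : "DATABASE_URL";
  const url = env[name];
  if (!url) {
    throw new Error(`${name} is not set`);
  }
  return url;
}
```

Failing fast on a missing variable keeps a migration run from silently going through the transaction-mode pool.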

Backup Strategy

Automated Backups:

  • Schedule: Daily at 02:00 UTC (cron inside pg-backup container)
  • Format: Custom format with gzip compression (level 6)
  • Retention: 7 days (configurable via BACKUP_RETENTION_DAYS)
  • Location: pg_backups volume (mount to persistent storage in prod)
  • File Pattern: goodgo_YYYYMMDD_HHMMSS.sql.gz
  • Restore Script: /scripts/backup/pg-restore.sh (manual restore)
  • Verification Script: /scripts/backup/pg-verify-backup.sh (automated E2E verification)

Verification Process (runs weekly):

  1. Restores latest backup to isolated test database (goodgo_verify_<timestamp>)
  2. Verifies all 22 tables exist
  3. Compares row counts between source and restored DB
  4. Checksums critical tables (User, Property, Listing, Payment, Subscription, Transaction, Plan, _prisma_migrations)
  5. Checks PostGIS extension, indexes, enum types
  6. Generates JSON report with pass/fail result
  7. Cleanup: Drops test DB on exit (unless SKIP_CLEANUP=1)
  8. Exit Codes: 0=pass, 1=checks failed, 2=setup error

CI/CD Backup Verification:

  • GitHub Action: .github/workflows/backup-verify.yml
  • Runs weekly Sundays 05:00 UTC
  • Also manually triggerable with skip_cleanup option
  • Uploads JSON report as artifact

Redis

Image: redis:7-alpine
Port: 6379

Production Configuration:

# AOF persistence, password authentication, 512 MB memory cap (prod),
# and LRU eviction across all keys once the cap is reached.
redis-server \
  --appendonly yes \
  --requirepass "${REDIS_PASSWORD}" \
  --maxmemory 512mb \
  --maxmemory-policy allkeys-lru

Development Configuration:

redis-server \
  --appendonly yes \
  --maxmemory 256mb \
  --maxmemory-policy allkeys-lru

ioredis Client Configuration:

// From RedisService in apps/api/src/modules/shared/infrastructure/redis.service.ts
{
  host: process.env.REDIS_HOST ?? 'localhost',
  port: Number(process.env.REDIS_PORT ?? 6379),
  password: process.env.REDIS_PASSWORD ?? undefined,
  lazyConnect: true,          // App starts even if Redis unavailable
  enableReadyCheck: false,    // Prevents "Redis is not ready" errors during transient outages
  maxRetriesPerRequest: 1,    // Fail fast (single retry, no exponential backoff)
  retryStrategy(times: number): number {
    return Math.min(times * 1000, 5000);  // 1s → 2s → 3s → 4s → 5s → 5s...
  }
}
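
To make the reconnection timing concrete, the sketch below reuses the same formula as the `retryStrategy` shown above and lists the resulting delays. The helper names are illustrative and not part of `RedisService`; the attempt counter is assumed to start at 1, as ioredis does.

```typescript
// Same rule as the retryStrategy above: linear growth, capped at 5 seconds.
function retryDelayMs(times: number): number {
  return Math.min(times * 1000, 5000);
}

// Delays for the first `attempts` reconnection attempts (numbered from 1).
function retrySchedule(attempts: number): number[] {
  return Array.from({ length: attempts }, (_, i) => retryDelayMs(i + 1));
}

// Total time spent waiting across those attempts.
function totalWaitMs(attempts: number): number {
  return retrySchedule(attempts).reduce((sum, delay) => sum + delay, 0);
}

console.log(retrySchedule(7)); // [1000, 2000, 3000, 4000, 5000, 5000, 5000]
console.log(totalWaitMs(7)); // 25000
```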

Graceful Degradation:

  • Cache misses don't fail the application
  • CacheService catches Redis errors and returns cache miss
  • App serves data directly from PostgreSQL if Redis down
  • Health check at GET /health/redis warns but doesn't fail readiness probe

Use Cases:

  • Session storage
  • Cache layer for expensive queries
  • Rate limiting (if implemented)
  • Real-time counters

Typesense

Image: typesense/typesense:27.1
Port: 8108 (HTTP only, internal Docker network)
API Key: ${TYPESENSE_API_KEY} (must be set in .env)

Collection Schema:

Collection Name: "listings"
Fields:
  - listingId (string)
  - propertyId (string)
  - title (string, searchable, highlights)
  - description (string, searchable, highlights)
  - propertyType (string, faceted)
  - transactionType (string, faceted: SALE/RENT)
  - priceVND (int64, sortable)
  - pricePerM2 (float, optional)
  - areaM2 (float)
  - bedrooms (int32, faceted)
  - bathrooms (int32, faceted)
  - floors (int32)
  - direction (string, faceted: NORTH/SOUTH/EAST/WEST/etc)
  - address (string)
  - ward (string, faceted)
  - district (string, faceted)
  - city (string, faceted)
  - location (geopoint) — for radius search
  - agentId (string)
  - sellerId (string)
  - status (string, faceted: ACTIVE/SOLD/DRAFT/etc)
  - publishedAt (int64, sortable)
  - viewCount (int32)
  - saveCount (int32)
  - projectName (string, faceted)
  - amenities (string[], faceted)

Search Features:

  • Full-text search on: title, description, address, district, city, projectName
  • Query weights: title=5, description=3, address=2, district=2, city=1, projectName=2
  • Filtering: propertyType, transactionType, bedrooms, district, city, status, amenities
  • Geo-search: radius-based queries (lat, lng, km)
  • Sorting: price (asc/desc), distance (asc from geopoint), date (desc), relevance
  • Highlights: HTML marks on matched terms in title and description
  • Facets: Return aggregated counts for filtering
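
As an illustration of how these features map onto Typesense search parameters, here is a minimal sketch that builds a parameter object from the fields and weights listed above. The input shape, the function name, and the default `status:=ACTIVE` filter are assumptions made for the example; filter values are not escaped, so values containing special characters would need extra handling.

```typescript
// Hypothetical input shape; the real repository may accept different options.
interface ListingSearchInput {
  text: string;
  city?: string;
  minBedrooms?: number;
  near?: { lat: number; lng: number; radiusKm: number };
  sort?: "price_asc" | "price_desc" | "newest" | "relevance";
}

const SORT_BY: Record<NonNullable<ListingSearchInput["sort"]>, string> = {
  price_asc: "priceVND:asc",
  price_desc: "priceVND:desc",
  newest: "publishedAt:desc",
  relevance: "_text_match:desc",
};

function buildSearchParams(input: ListingSearchInput) {
  // Assumption: only active listings are searchable by default.
  const filters: string[] = ["status:=ACTIVE"];
  if (input.city) filters.push(`city:=${input.city}`);
  if (input.minBedrooms !== undefined) filters.push(`bedrooms:>=${input.minBedrooms}`);
  if (input.near) {
    const { lat, lng, radiusKm } = input.near;
    filters.push(`location:(${lat}, ${lng}, ${radiusKm} km)`); // radius search
  }
  return {
    q: input.text || "*",
    query_by: "title,description,address,district,city,projectName",
    query_by_weights: "5,3,2,2,1,2", // same order as query_by
    filter_by: filters.join(" && "),
    sort_by: SORT_BY[input.sort ?? "relevance"],
    facet_by: "propertyType,transactionType,bedrooms,district,city",
  };
}
```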

TypesenseSearchRepository (apps/api/src/modules/search/infrastructure/services/typesense-search.repository.ts):

  • ensureCollection() — Creates schema if not exists
  • dropCollection() — Cleanup (testing only)
  • indexDocument(doc) — Upsert single document
  • indexDocuments(docs) — Bulk import with error reporting
  • removeDocument(id) — Delete by ID
  • search(params) — Execute search with filters, sort, pagination

Graceful Degradation:

  • If Typesense down, search falls back to PostgreSQL full-text search
  • TypesenseClientService implements retry logic with exponential backoff
  • Health check at GET /health returns JSON status

Monitoring & Observability

Prometheus

Image: prom/prometheus:v2.51.0
Port: 9090
Retention: 15 days (dev), 30 days (prod)
Lifecycle API: Enabled (--web.enable-lifecycle)

Scrape Targets (monitoring/prometheus/prometheus.yml):

scrape_configs:
  - job_name: goodgo-api
    metrics_path: /metrics
    static_configs:
      # Use the target group that matches the environment.
      - targets: ['host.docker.internal:3001']  # Dev (API on host)
        labels:
          service: goodgo-api
          environment: development
      - targets: ['api:3001']                   # Prod (API in container)
        labels:
          service: goodgo-api
          environment: production

  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

Expected Metrics from API:

  • goodgo_api_request_duration_seconds_bucket{le, route, method} — Request latency histogram
  • http_requests_total{status_code, job} — Request count by status code
  • Custom business metrics (if implemented via Prometheus decorators in NestJS)

Alert Rules (monitoring/prometheus/alert-rules.yml)

Latency Alerts:

  1. ApiLatencyP99High (warning)

    • Trigger: p99 latency > 1s for 5 minutes
    • Dashboard: /d/goodgo-api-latency/goodgo-api-latency
    • Runbook: https://docs.goodgo.vn/runbooks/api-latency-high
  2. ApiEndpointLatencyP99High (warning)

    • Trigger: Per-endpoint p99 > 2s for 5 minutes
    • Annotates: method, route labels
  3. ApiLatencyP99Critical (critical - SLO breach)

    • Trigger: p99 latency > 3s for 3 minutes
    • Escalation required
    • Runbook: https://docs.goodgo.vn/runbooks/api-latency-critical

Error Rate Alert:

  1. ApiErrorRate5xxHigh (warning)
    • Trigger: 5xx error rate > 1% for 5 minutes
    • Uses: (5xx errors / total requests) * 100
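
The alert conditions above combine a threshold with a hold duration. The sketch below approximates that logic in plain TypeScript for the 5xx rule; it is an illustration of the idea only, not how Prometheus evaluates rules internally, and all names in it are hypothetical.

```typescript
interface Sample {
  timestampSec: number;
  value: number;
}

// (5xx errors / total requests) * 100, guarding against an empty window.
function errorRatePercent(errors5xx: number, total: number): number {
  return total === 0 ? 0 : (errors5xx / total) * 100;
}

// True when samples (sorted oldest first) cover the whole trailing window
// and every sample inside it exceeds the threshold.
function breachedFor(samples: Sample[], threshold: number, forSec: number): boolean {
  if (samples.length === 0) return false;
  const latest = samples[samples.length - 1].timestampSec;
  const windowStart = latest - forSec;
  const coversWindow = samples[0].timestampSec <= windowStart;
  const inWindow = samples.filter((s) => s.timestampSec >= windowStart);
  return coversWindow && inWindow.every((s) => s.value > threshold);
}
```

With one sample per minute, `breachedFor(samples, 1, 300)` corresponds to "error rate above 1% for 5 minutes".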

Grafana

Image: grafana/grafana:10.4.1
Port: 3002
Auth: Admin user/password from secrets (prod) or env vars (dev)

Pre-Provisioned Datasources:

  • Prometheus (default, primary)
  • Loki (with derived fields for correlationId linkage)

Dashboards:

  1. api-latency.json — API p99/p95/p50, route breakdown, slow endpoints
  2. api-overview.json — Request rate, error rate, uptime status
  3. database.json — Query latency, connection pool utilization, slow queries
  4. logs.json — Log volume, error logs, trace links to Prometheus
  5. search.json — Typesense query latency, indexing rate, collection size
  6. web-vitals.json — Frontend Core Web Vitals (if client-side instrumentation)
  7. business-metrics.json — Listings created, payments processed, user signups

Admin Console Access:

  • URL: http://localhost:3002 (dev) or ${GRAFANA_PORT} (prod)
  • Default user: admin (change password on first login)
  • Non-signup mode (GF_USERS_ALLOW_SIGN_UP: false)

Loki & Promtail (Log Aggregation)

Loki: grafana/loki:3.0.0, port 3100

Configuration:

schema:
  - from: 2024-01-01
    store: tsdb
    schema: v13
limits:
  max_entries_limit_per_query: 5000
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 6
retention: 360h  # 15 days

Promtail: grafana/promtail:3.0.0

Configuration:

  • Scrapes Docker logs from goodgo-net bridge network
  • Parses Pino JSON structured logs
  • Extracts labels: level, context, component, service
  • Structured metadata: method, url, statusCode, correlationId, duration
  • Derives timestamp from Pino output (RFC3339Nano)

Expected Log Format (Pino):

{
  "level": 30,                    // info
  "time": "2026-04-11T10:30:00Z",
  "msg": "POST /api/listings",
  "correlationId": "abc-123-def",
  "context": "ListingController",
  "component": "api",
  "method": "POST",
  "url": "/api/listings",
  "statusCode": 201,
  "duration": 150
}
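
The split between labels and structured metadata can be sketched as follows. This is a plain TypeScript approximation of what the Promtail pipeline extracts, not the pipeline configuration itself; the function and type names are hypothetical, and the `service` label is left out because it is assumed to come from Docker metadata rather than from the log line.

```typescript
// Standard Pino numeric levels.
const PINO_LEVELS: Record<number, string> = {
  10: "trace", 20: "debug", 30: "info", 40: "warn", 50: "error", 60: "fatal",
};

interface ParsedLog {
  labels: { level: string; context: string; component: string };
  metadata: { method: string; url: string; statusCode: string; correlationId: string; duration: string };
  timestamp: string;
  message: string;
}

function parsePinoLine(line: string): ParsedLog {
  const entry = JSON.parse(line) as Record<string, unknown>;
  // Missing fields become empty strings so the shape stays stable.
  const str = (v: unknown): string => (v === undefined || v === null ? "" : String(v));
  return {
    labels: {
      level: PINO_LEVELS[Number(entry.level)] ?? "unknown",
      context: str(entry.context),
      component: str(entry.component),
    },
    metadata: {
      method: str(entry.method),
      url: str(entry.url),
      statusCode: str(entry.statusCode),
      correlationId: str(entry.correlationId),
      duration: str(entry.duration),
    },
    timestamp: str(entry.time),
    message: str(entry.msg),
  };
}
```

Keeping high-cardinality values such as `correlationId` out of the label set follows the label/metadata split described above.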

Payment Integration

Supported Payment Providers

Enum: PaymentProvider (Prisma)

  • VNPAY — VNPay (Vietnam payment gateway)
  • MOMO — MoMo (Vietnam mobile wallet)
  • ZALOPAY — ZaloPay (Vietnam digital wallet)
  • BANK_TRANSFER — Manual bank transfer (offline)

Payment Flow & Callback Handling

Database Schema (Payment Model):

model Payment {
  id            String @id @default(cuid())
  userId        String
  transactionId String?
  provider      PaymentProvider
  type          PaymentType  // SUBSCRIPTION, LISTING_FEE, DEPOSIT, FEATURED_LISTING
  amountVND     BigInt
  status        PaymentStatus  // PENDING, PROCESSING, COMPLETED, FAILED, REFUNDED
  providerTxId  String?  // External transaction ID from VNPay/MoMo/ZaloPay
  callbackData  Json?    // Raw callback payload (for audit)
  idempotencyKey String? // Prevent duplicate payments (userId, provider, idempotencyKey unique)
  createdAt     DateTime @default(now())
  updatedAt     DateTime @updatedAt
}

enum PaymentStatus {
  PENDING
  PROCESSING
  COMPLETED
  FAILED
  REFUNDED
}

enum PaymentType {
  SUBSCRIPTION
  LISTING_FEE
  DEPOSIT
  FEATURED_LISTING
}

Command Handler: HandleCallbackHandler (apps/api/src/modules/payments/application/commands/handle-callback/handle-callback.handler.ts)

  1. Callback Signature Verification:

    • Uses PAYMENT_GATEWAY_FACTORY to route to correct provider (VNPay/MoMo/ZaloPay)
    • Gateway.verifyCallback() validates HMAC signature
    • Throws ValidationException if signature invalid
  2. Idempotent Status Transition:

    • Only updates payments in state: PENDING or PROCESSING
    • Atomically transitions to COMPLETED or FAILED
    • If already in terminal state (COMPLETED/FAILED/REFUNDED), returns existing status (idempotent)
    • Logs warning if payment not found
  3. Domain Event Publishing:

    • Reconstructs domain entity from repository
    • Emits PaymentCompletedEvent or PaymentFailedEvent
    • Event bus publishes events to subscribers (e.g., subscription creation, listing activation)
  4. Response:

    {
      paymentId: string,
      status: PaymentStatus,
      isSuccess: boolean
    }
    

Payment Gateway Interface (payment-gateway.interface.ts):

interface IPaymentGateway {
  readonly provider: PaymentProvider
  createPaymentUrl(params: CreatePaymentUrlParams): Promise<CreatePaymentUrlResult>
  verifyCallback(data: Record<string, string>): CallbackVerifyResult
  refund(params: RefundParams): Promise<RefundResult>
}

interface CreatePaymentUrlParams {
  orderId: string
  amountVND: bigint
  description: string
  returnUrl: string
  ipAddress: string
}

interface CallbackVerifyResult {
  isValid: boolean
  orderId: string
  providerTxId: string
  isSuccess: boolean
  rawData: Record<string, unknown>
}

interface RefundParams {
  providerTxId: string
  amountVND: bigint
  reason: string
}

interface RefundResult {
  success: boolean
  refundTxId: string | null
}
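
For readers unfamiliar with HMAC callback verification, the sketch below shows the general technique behind a `verifyCallback()` implementation. It is a generic example only: the canonical string layout (sorted `key=value` pairs joined with `&`) and the SHA-256 digest are assumptions, and each real provider (VNPay, MoMo, ZaloPay) defines its own field order and hash algorithm.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed canonical form: key=value pairs sorted by key, joined with "&".
function canonicalize(data: Record<string, string>): string {
  return Object.keys(data)
    .sort()
    .map((key) => `${key}=${data[key]}`)
    .join("&");
}

function sign(data: Record<string, string>, secret: string): string {
  return createHmac("sha256", secret).update(canonicalize(data)).digest("hex");
}

function verifySignature(data: Record<string, string>, signature: string, secret: string): boolean {
  const expected = Buffer.from(sign(data, secret), "hex");
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The constant-time comparison avoids leaking how many leading characters of a forged signature happened to match.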

Environment Variables

VNPay:

VNPAY_TMN_CODE=<merchant terminal code>
VNPAY_HASH_SECRET=<HMAC secret key>
VNPAY_BASE_URL=https://sandbox.vnpayment.vn/paymentv2/vpcpay.html
VNPAY_API_URL=https://sandbox.vnpayment.vn/merchant_webapi/api/transaction

MoMo:

MOMO_PARTNER_CODE=<partner code>
MOMO_ACCESS_KEY=<access key>
MOMO_SECRET_KEY=<secret key>
MOMO_ENDPOINT=https://test-payment.momo.vn/v2/gateway/api

ZaloPay:

ZALOPAY_APP_ID=<app ID>
ZALOPAY_KEY1=<key 1 (for creating payments)>
ZALOPAY_KEY2=<key 2 (for callback verification)>
ZALOPAY_ENDPOINT=https://sb-openapi.zalopay.vn/v2

Race Condition & Idempotency Protection

Problem: Multiple callbacks may arrive for same payment (network retries, duplicate notifications)

Solution:

  1. Unique Idempotency Key: Payment_idempotency_unique(userId, provider, idempotencyKey)

    • Prevents duplicate payment records
    • Generated by client/API before creating payment
  2. Atomic Status Update: paymentRepo.updateIfStatus(orderId, ['PENDING', 'PROCESSING'], newStatus)

    • Only updates if current status in allowed list
    • Returns updated entity or null if already terminal
  3. Terminal State Check: If already COMPLETED/FAILED/REFUNDED, handler returns existing state

    • No re-triggering of domain events
    • No double billing or duplicate transactions
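
The three safeguards above can be illustrated with an in-memory model. This is a sketch under stated assumptions: `InMemoryPaymentRepo` and `handleCallback` are hypothetical stand-ins for the real repository and `HandleCallbackHandler`, and a database-backed `updateIfStatus` would rely on a conditional `UPDATE ... WHERE status IN (...)` for atomicity rather than a map lookup.

```typescript
type PaymentStatus = "PENDING" | "PROCESSING" | "COMPLETED" | "FAILED" | "REFUNDED";

interface PaymentRecord {
  id: string;
  status: PaymentStatus;
}

class InMemoryPaymentRepo {
  private readonly rows = new Map<string, PaymentRecord>();

  insert(row: PaymentRecord): void {
    this.rows.set(row.id, { ...row });
  }

  find(id: string): PaymentRecord | null {
    const row = this.rows.get(id);
    return row ? { ...row } : null;
  }

  // Compare-and-set: update only when the current status is in the allowed list.
  updateIfStatus(id: string, allowed: PaymentStatus[], next: PaymentStatus): PaymentRecord | null {
    const row = this.rows.get(id);
    if (!row || !allowed.includes(row.status)) return null;
    row.status = next;
    return { ...row };
  }
}

function handleCallback(repo: InMemoryPaymentRepo, orderId: string, isSuccess: boolean, events: string[]) {
  const next: PaymentStatus = isSuccess ? "COMPLETED" : "FAILED";
  const updated = repo.updateIfStatus(orderId, ["PENDING", "PROCESSING"], next);
  if (updated) {
    // Events are emitted only on the first successful transition.
    events.push(isSuccess ? "PaymentCompletedEvent" : "PaymentFailedEvent");
    return { paymentId: updated.id, status: updated.status, isSuccess };
  }
  const existing = repo.find(orderId);
  if (!existing) return null; // the real handler logs a warning here
  // Already terminal: return the stored state without re-emitting events.
  return { paymentId: existing.id, status: existing.status, isSuccess: existing.status === "COMPLETED" };
}
```

A duplicate callback therefore gets the stored status back, even if its own payload disagrees with the first one.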

Health Checks

API Health Endpoints

Health Controller (apps/api/src/modules/health/health.controller.ts)

  1. GET /health — Liveness Probe (always 200 if process running)

    • Uses: @HealthCheck() on empty probe list
    • Response: { "status": "ok", "timestamp": "..." }
    • Use Case: Kubernetes/Docker liveness probe (confirms the process is running)
  2. GET /health/ready — Readiness Probe (checks dependencies)

    • Checks: PostgreSQL + Redis connectivity
    • Response:
      {
        "status": "ok",
        "checks": {
          "database": { "status": "up" },
          "redis": { "status": "up" }
        }
      }
      
    • Use Case: Load balancer, before accepting traffic
    • Failure: Returns 503 if any dependency down
  3. GET /health/db — Database Readiness Only

    • Checks: PostgreSQL connectivity via SELECT 1 query
    • Use Case: Manual database troubleshooting
  4. GET /health/redis — Redis Readiness Only

    • Checks: Redis PING command
    • Use Case: Manual Redis troubleshooting
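
The readiness behaviour (200 when every dependency is up, 503 otherwise) can be sketched as a small aggregation function. The names are hypothetical, and the `"error"` status string for the failing case is an assumption, since only the healthy response body is shown above.

```typescript
type CheckStatus = "up" | "down";

interface ReadinessResponse {
  httpStatus: 200 | 503;
  body: {
    status: "ok" | "error";
    checks: Record<string, { status: CheckStatus }>;
  };
}

// Input: one boolean per dependency, e.g. { database: true, redis: false }.
function buildReadiness(results: Record<string, boolean>): ReadinessResponse {
  const checks: Record<string, { status: CheckStatus }> = {};
  for (const [name, healthy] of Object.entries(results)) {
    checks[name] = { status: healthy ? "up" : "down" };
  }
  const allUp = Object.values(results).every(Boolean);
  return {
    httpStatus: allUp ? 200 : 503,
    body: { status: allUp ? "ok" : "error", checks },
  };
}
```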

Health Check Implementations

PrismaHealthIndicator (apps/api/src/modules/health/infrastructure/prisma.health.ts):

async isHealthy(key: string): Promise<HealthIndicatorResult> {
  try {
    await this.prisma.$queryRawUnsafe('SELECT 1');
    return this.getStatus(key, true);
  } catch {
    throw new HealthCheckError('Database check failed', this.getStatus(key, false));
  }
}

RedisHealthIndicator (apps/api/src/modules/health/infrastructure/redis.health.ts):

async isHealthy(key: string): Promise<HealthIndicatorResult> {
  try {
    const client = this.redis.getClient();
    const pong = await client.ping();
    const isHealthy = pong === 'PONG';
    const result = this.getStatus(key, isHealthy);
    if (isHealthy) return result;
    throw new HealthCheckError('Redis ping failed', result);
  } catch (error) {
    if (error instanceof HealthCheckError) throw error;
    throw new HealthCheckError('Redis check failed', this.getStatus(key, false));
  }
}

Docker Container Health Checks

API Container:

healthcheck:
  test: ['CMD', 'node', '-e', "fetch('http://localhost:3001/health').then(r => { if (!r.ok) throw 1 }).catch(() => process.exit(1))"]
  interval: 30s
  timeout: 5s
  retries: 5
  start_period: 30s

Web Container:

healthcheck:
  test: ['CMD', 'node', '-e', "fetch('http://localhost:3000').then(r => { if (!r.ok) throw 1 }).catch(() => process.exit(1))"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 15s

PostgreSQL:

healthcheck:
  test: ['CMD-SHELL', 'pg_isready -U ${DB_USER} -d ${DB_NAME}']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s

Redis:

healthcheck:
  test: ['CMD', 'redis-cli', '-a', '${REDIS_PASSWORD}', 'ping']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 10s

Typesense:

healthcheck:
  test: ['CMD', 'curl', '-sf', 'http://localhost:8108/health']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 15s

Environment Variables

Complete .env.example Reference

PostgreSQL:

DB_HOST=localhost
DB_PORT=5432
DB_NAME=goodgo
DB_USER=goodgo
DB_PASSWORD=CHANGE_ME
DATABASE_URL=postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}?schema=public
DATABASE_URL_DIRECT=postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}?schema=public

PgBouncer (Prod Only):

PGBOUNCER_POOL_SIZE=20
PGBOUNCER_MAX_CLIENT_CONN=200
PGBOUNCER_ADMIN_PASSWORD=CHANGE_ME
PGBOUNCER_STATS_PASSWORD=CHANGE_ME

Redis:

REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=
REDIS_URL=redis://${REDIS_HOST}:${REDIS_PORT}

Typesense:

TYPESENSE_HOST=localhost
TYPESENSE_PORT=8108
TYPESENSE_PROTOCOL=http
TYPESENSE_API_KEY=CHANGE_ME

MinIO:

MINIO_ENDPOINT=localhost
MINIO_PORT=9000
MINIO_CONSOLE_PORT=9001
MINIO_ACCESS_KEY=CHANGE_ME
MINIO_SECRET_KEY=CHANGE_ME
MINIO_BUCKET=goodgo-media
MINIO_USE_SSL=false

NestJS API:

API_PORT=3001
PORT=3001
NODE_ENV=development
CORS_ORIGINS=http://localhost:3000,http://localhost:3001

JWT / Authentication (REQUIRED):

JWT_SECRET=<generate with: openssl rand -base64 48>
JWT_EXPIRES_IN=15m
JWT_REFRESH_SECRET=<generate with: openssl rand -base64 48>
JWT_REFRESH_EXPIRES_IN=7d

OAuth Providers:

GOOGLE_CLIENT_ID=
GOOGLE_CLIENT_SECRET=
GOOGLE_CALLBACK_URL=http://localhost:3001/auth/google/callback

ZALO_APP_ID=
ZALO_APP_SECRET=
ZALO_CALLBACK_URL=http://localhost:3001/auth/zalo/callback

FRONTEND_URL=http://localhost:3000

Next.js Web:

NEXT_PUBLIC_API_URL=http://localhost:3001
WEB_PORT=3000

AI Service (Python/FastAPI):

AI_SERVICE_PORT=8000
AI_SERVICE_URL=http://localhost:8000
CLAUDE_API_KEY=
AI_DEBUG=false
AI_LOG_LEVEL=info

Map Integration:

NEXT_PUBLIC_MAPBOX_TOKEN=

Payment Gateways:

VNPAY_TMN_CODE=
VNPAY_HASH_SECRET=
VNPAY_BASE_URL=https://sandbox.vnpayment.vn/paymentv2/vpcpay.html
VNPAY_API_URL=https://sandbox.vnpayment.vn/merchant_webapi/api/transaction

MOMO_PARTNER_CODE=
MOMO_ACCESS_KEY=
MOMO_SECRET_KEY=
MOMO_ENDPOINT=https://test-payment.momo.vn/v2/gateway/api

ZALOPAY_APP_ID=
ZALOPAY_KEY1=
ZALOPAY_KEY2=
ZALOPAY_ENDPOINT=https://sb-openapi.zalopay.vn/v2

Email / SMTP:

SMTP_HOST=localhost
SMTP_PORT=1025
SMTP_USER=
SMTP_PASS=
SMTP_FROM=noreply@goodgo.vn

Firebase Cloud Messaging (Optional):

FIREBASE_SERVICE_ACCOUNT=

Sentry Error Tracking:

SENTRY_DSN=
NEXT_PUBLIC_SENTRY_DSN=
SENTRY_AUTH_TOKEN=
SENTRY_ORG=
SENTRY_PROJECT=

KYC Field Encryption (REQUIRED Prod):

KYC_ENCRYPTION_KEY=<generate with: openssl rand -hex 32> # 64 hex chars (32 bytes)
KYC_ENCRYPTION_KEY_VERSION=1

Logging:

LOG_LEVEL=info
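
A startup check for the variables marked REQUIRED above might look like the sketch below. It is an illustration rather than the project's actual validation: the function name, the choice of which variables to check, and the production-only rules are assumptions drawn from the notes in this section.

```typescript
type Env = Record<string, string | undefined>;

// Returns a list of human-readable problems; an empty list means the check passed.
function validateEnv(env: Env): string[] {
  const problems: string[] = [];
  for (const name of ["DATABASE_URL", "JWT_SECRET", "JWT_REFRESH_SECRET"]) {
    const value = env[name];
    if (!value || value.includes("CHANGE_ME")) {
      problems.push(`${name} is missing or still a placeholder`);
    }
  }
  if (env.NODE_ENV === "production") {
    // 32 bytes encoded as 64 hex characters (openssl rand -hex 32).
    if (!/^[0-9a-fA-F]{64}$/.test(env.KYC_ENCRYPTION_KEY ?? "")) {
      problems.push("KYC_ENCRYPTION_KEY must be 64 hex characters");
    }
    if (!env.REDIS_PASSWORD) {
      problems.push("REDIS_PASSWORD is required in production");
    }
  }
  return problems;
}
```

Returning every problem at once, instead of throwing on the first, lets an operator fix a new environment in a single pass.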

Backup & Recovery

Automated Daily Backups

Service: pg-backup container (runs inside docker compose)

Backup Script: scripts/backup/pg-backup.sh

# Daily cron job: 02:00 UTC
PGHOST=postgres \
PGPORT=5432 \
PGUSER=goodgo \
PGDATABASE=goodgo \
PGPASSWORD=<secret> \
BACKUP_DIR=/backups \
RETENTION_DAYS=7 \
  /scripts/pg-backup.sh

Behavior:

  1. Creates dump with pg_dump --format=custom --compress=6
  2. Saves as goodgo_YYYYMMDD_HHMMSS.sql.gz
  3. Prunes backups older than 7 days (configurable)
  4. Logs to /var/log/pg-backup.log
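
The naming and pruning rules can be sketched in TypeScript as follows. This is not a port of `pg-backup.sh`: the helper names are hypothetical, the timestamp in the file name is assumed to be UTC, and pruning here is decided from the file name, whereas the shell script may well use file modification times.

```typescript
const BACKUP_PATTERN = /^goodgo_(\d{4})(\d{2})(\d{2})_(\d{2})(\d{2})(\d{2})\.sql\.gz$/;

// Builds goodgo_YYYYMMDD_HHMMSS.sql.gz from a UTC timestamp.
function backupFileName(at: Date): string {
  const pad = (n: number, width = 2): string => String(n).padStart(width, "0");
  const date = `${pad(at.getUTCFullYear(), 4)}${pad(at.getUTCMonth() + 1)}${pad(at.getUTCDate())}`;
  const time = `${pad(at.getUTCHours())}${pad(at.getUTCMinutes())}${pad(at.getUTCSeconds())}`;
  return `goodgo_${date}_${time}.sql.gz`;
}

// Parses the timestamp back out; returns null for files that do not match.
function backupTimestamp(name: string): Date | null {
  const match = BACKUP_PATTERN.exec(name);
  if (!match) return null;
  const [year, month, day, hour, minute, second] = match.slice(1).map(Number);
  return new Date(Date.UTC(year, month - 1, day, hour, minute, second));
}

// Lists backups older than the retention window; unrelated files are ignored.
function expiredBackups(names: string[], now: Date, retentionDays = 7): string[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return names.filter((name) => {
    const taken = backupTimestamp(name);
    return taken !== null && taken.getTime() < cutoff;
  });
}
```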

Restore from Backup:

# Direct restore with pg_restore (custom-format dump)
docker compose -f docker-compose.prod.yml exec pg-backup bash -c \
  'pg_restore -h postgres -p 5432 -U goodgo -d goodgo \
   --clean --if-exists /backups/goodgo_20260410_020000.sql.gz'

# Or using restore script
docker compose -f docker-compose.prod.yml run --rm pg-verify-backup bash -c \
  'source /scripts/pg-restore.sh /backups/goodgo_20260410_020000.sql.gz'

Backup Verification

Service: pg-verify-backup container (on-demand, profile: tools)

Verification Script: scripts/backup/pg-verify-backup.sh

# Usage:
docker compose -f docker-compose.prod.yml run --rm pg-verify-backup

# Or with options:
SKIP_CLEANUP=1 REPORT_FILE=/backups/verify-report.json \
  docker compose -f docker-compose.prod.yml run --rm pg-verify-backup

Verification Steps:

  1. Creates isolated test database: goodgo_verify_<timestamp>
  2. Enables PostGIS extension
  3. Restores backup into test DB
  4. Verifies all 22 tables exist
  5. Compares row counts between source and restored
  6. Checksums critical tables using MD5 hashes
  7. Checks indexes, enum types
  8. Generates JSON report with results
  9. Cleanup: Drops test DB (unless SKIP_CLEANUP=1)

JSON Report Structure:

{
  "timestamp": "2026-04-11T10:30:00Z",
  "backupFile": "/backups/goodgo_20260410_020000.sql.gz",
  "backupSize": "150M",
  "testDatabase": "goodgo_verify_20260411_103000",
  "restoreDurationSeconds": 45,
  "passed": 28,
  "failed": 0,
  "warnings": 2,
  "result": "pass",
  "checks": [
    { "check": "Database creation", "status": "pass", "detail": "Test database created" },
    { "check": "Restore", "status": "pass", "detail": "pg_restore completed cleanly in 45s" },
    { "check": "Table existence", "status": "pass", "detail": "All 22 expected tables present" },
    { "check": "Row counts", "status": "pass", "detail": "All tables match source database" },
    { "check": "Checksum: User identities", "status": "pass", "detail": "Hashes match (abc123def456...)" },
    ...
  ]
}
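
A consumer of this report, for example a CI step or an alerting hook, could type and summarize it as sketched below. The interface mirrors the fields in the example above; the `"fail"` and `"warn"` status values and the `summarizeReport` helper are assumptions, since the sample only shows passing checks.

```typescript
interface VerifyCheck {
  check: string;
  status: "pass" | "fail" | "warn"; // only "pass" appears in the sample report
  detail: string;
}

interface VerifyReport {
  timestamp: string;
  backupFile: string;
  backupSize: string;
  testDatabase: string;
  restoreDurationSeconds: number;
  passed: number;
  failed: number;
  warnings: number;
  result: "pass" | "fail";
  checks: VerifyCheck[];
}

// Maps a report onto the documented exit codes: 0 = pass, 1 = checks failed.
function summarizeReport(report: VerifyReport): { ok: boolean; exitCode: 0 | 1; failedChecks: string[] } {
  const failedChecks = report.checks.filter((c) => c.status === "fail").map((c) => c.check);
  const ok = report.result === "pass" && report.failed === 0 && failedChecks.length === 0;
  return { ok, exitCode: ok ? 0 : 1, failedChecks };
}
```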

GitHub Action Backup Verification:

  • File: .github/workflows/backup-verify.yml
  • Schedule: Weekly Sundays 05:00 UTC
  • Also: Manual trigger with skip_cleanup option
  • Artifacts: Uploads JSON report for 30 days

Deployment Pipeline

GitHub Actions CI/CD

Workflows:

  1. .github/workflows/ci.yml — Lint, typecheck, test, build (on push/PR to master)
  2. .github/workflows/deploy.yml — Build Docker images, deploy to staging/prod
  3. .github/workflows/e2e.yml — E2E tests (spins up full docker-compose.ci.yml)
  4. .github/workflows/backup-verify.yml — Weekly backup verification
  5. .github/workflows/security.yml — Dependency scanning, SAST
  6. .github/workflows/codeql.yml — GitHub CodeQL analysis
  7. .github/workflows/load-test.yml — K6 load testing

CI Pipeline (ci.yml)

On: push master, pull_request master
Node: 22
Concurrency: Cancel previous runs on same ref

Jobs:

  1. Lint → Typecheck → Test → Build

    • Installs pnpm, Node 22
    • Runs linter (eslint)
    • Type checks (tsc)
    • Unit tests (jest)
    • Builds all apps (turbo)
    • PostgreSQL 16 service available (goodgo_test DB)
  2. E2E Tests (depends on ci job)

    • Full docker-compose.ci.yml services (postgres, redis, typesense, minio)
    • Runs end-to-end test suite
    • Timeout: 20 minutes
    • Env vars: DATABASE_URL, JWT secrets, payment test codes

Deploy Pipeline (deploy.yml)

On:

  • push master (auto-deploys to staging)
  • Manual workflow_dispatch (choose staging or production)

Jobs:

  1. Build API Image

    • Builds: goodgo-api:${IMAGE_TAG}
    • Dockerfile: apps/api/Dockerfile
    • Registry: ghcr.io/goodgo/goodgo-api
    • Tags: git SHA, branch name, latest (on master)
  2. Build Web Image

    • Builds: goodgo-web:${IMAGE_TAG}
    • Dockerfile: apps/web/Dockerfile
    • Registry: ghcr.io/goodgo/goodgo-web
  3. Build AI Services Image

    • Builds: goodgo-ai-services:${IMAGE_TAG}
    • Context: libs/ai-services/
    • Registry: ghcr.io/goodgo/goodgo-ai-services
  4. Deploy to Staging

    • Condition: github.event_name == 'push' || inputs.environment == 'staging'
    • SSH into staging host
    • Pulls new images from GHCR
    • Rolling update (zero downtime):
      docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
      docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
      docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services
      
    • Runs migrations: docker compose exec api npx prisma migrate deploy
    • Prunes old images
  5. Deploy to Production

    • Only on manual workflow_dispatch with environment: production
    • Same steps as staging
    • Requires environment: production approval (GitHub security)
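
The trigger conditions for the two deploy jobs can be restated as a small decision function. This is a sketch of the logic described above, not the workflow file itself; the `Trigger` type and `deployTargets` name are hypothetical.

```typescript
type Environment = "staging" | "production";

interface Trigger {
  eventName: "push" | "workflow_dispatch";
  inputEnvironment?: Environment; // only set for manual runs
}

function deployTargets(trigger: Trigger): Environment[] {
  const targets: Environment[] = [];
  // Staging: github.event_name == 'push' || inputs.environment == 'staging'
  if (trigger.eventName === "push" || trigger.inputEnvironment === "staging") {
    targets.push("staging");
  }
  // Production: manual dispatch only, and still subject to environment approval.
  if (trigger.eventName === "workflow_dispatch" && trigger.inputEnvironment === "production") {
    targets.push("production");
  }
  return targets;
}
```

A push to master therefore never reaches production on its own.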

Dockerfile Multi-Stage Builds

API (apps/api/Dockerfile):

  • Base: node:22-slim + pnpm 10.27.0
  • Deps: Install locked dependencies (layer caching)
  • Build: Compile TypeScript, generate Prisma client
  • Prune: pnpm deploy --prod (copies the package into a standalone directory with production dependencies only)
  • Production: Minimal image, dumb-init for signals, non-root user

Web (apps/web/Dockerfile):

  • Base: node:22-slim + pnpm
  • Deps: Install dependencies
  • Build: next build → standalone output + static files
  • Production: Copy .next/standalone, public, static assets

AI Services (libs/ai-services/Dockerfile):

  • Base: python:3.12-slim
  • Install: System deps (gcc, g++), dumb-init, FastAPI/XGBoost/underthesea
  • Models: Pre-download underthesea ML models at build time
  • User: Run as non-root appuser
  • CMD: uvicorn app.main:app --host 0.0.0.0 --port 8000

Troubleshooting Guide

Check Service Status

# All services
docker compose -f docker-compose.prod.yml ps

# Single service
docker compose -f docker-compose.prod.yml ps api

# Get logs
docker compose -f docker-compose.prod.yml logs -f api --tail=100

# Health check status
docker compose -f docker-compose.prod.yml exec api curl http://localhost:3001/health

Common Issues

1. API Service Not Healthy (stuck in "starting" or "unhealthy" state)

Symptoms:

  • docker compose ps shows (health: starting) for >2 minutes
  • docker compose logs api shows connection errors

Diagnosis:

# Check API liveness
docker compose exec api curl http://localhost:3001/health

# Check readiness (includes DB + Redis checks)
docker compose exec api curl http://localhost:3001/health/ready

# Check specific dependencies
docker compose exec api curl http://localhost:3001/health/db
docker compose exec api curl http://localhost:3001/health/redis

Solutions:

  • PostgreSQL not ready:

    docker compose ps postgres  # Should show (healthy)
    docker compose exec postgres pg_isready -U goodgo -d goodgo
    docker compose logs postgres --tail=50
    
  • Redis not ready:

    docker compose exec redis redis-cli ping  # Should return PONG
    docker compose logs redis --tail=50
    
  • PgBouncer not ready (prod):

    docker compose exec pgbouncer pg_isready -h 127.0.0.1 -p 6432 -U goodgo
    docker compose logs pgbouncer --tail=50
    
  • Database schema not initialized:

    # Run migrations manually
    docker compose exec api npx prisma migrate deploy
    # Or check if schema exists
    docker compose exec postgres psql -U goodgo -d goodgo -c "\dt"
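
The checks above can be chained so that dependencies are probed before the API itself. The sketch below is one way to do that. The helper names retry and check_api_stack are hypothetical, and the attempt counts and delays simply mirror the health-check settings in the service inventory (5x10s for the data stores, 3x30s for the API).

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  retry_attempts="$1"; retry_delay="$2"; shift 2
  retry_n=1
  while [ "$retry_n" -le "$retry_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ok after $retry_n attempt(s): $*"
      return 0
    fi
    retry_n=$((retry_n + 1))
    # Sleep only if another attempt follows.
    [ "$retry_n" -le "$retry_attempts" ] && sleep "$retry_delay"
  done
  echo "FAILED after $retry_attempts attempt(s): $*" >&2
  return 1
}

# Probe dependencies first, then API readiness. -T disables the pseudo-TTY
# so the commands also work from cron or CI.
check_api_stack() {
  retry 5 10 docker compose exec -T postgres pg_isready -U goodgo -d goodgo &&
  retry 5 10 docker compose exec -T redis redis-cli ping &&
  retry 3 30 docker compose exec -T api curl -fsS http://localhost:3001/health/ready
}
```

The first FAILED line printed by check_api_stack names the dependency to investigate.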
    

2. Database Connection Pool Exhaustion

Symptoms:

  • Errors: Error: unable to get a connection from the pool after X s
  • Slow queries pile up
  • API latency spikes

Diagnosis:

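# Check pool stats (prod, PgBouncer). The admin console is the virtual "pgbouncer" database;
# in SHOW POOLS, cl_waiting > 0 means clients are queueing for a server connection.
docker compose exec pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_stats -d pgbouncer -c "SHOW POOLS;"
docker compose exec pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_stats -d pgbouncer -c "SHOW STATS;"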

# Or query PostgreSQL directly
docker compose exec postgres psql -U goodgo -d goodgo -c "SELECT count(*) FROM pg_stat_activity"

Solutions:

  • Increase PGBOUNCER_POOL_SIZE (default: 20)
  • Increase PGBOUNCER_MAX_CLIENT_CONN (default: 200)
  • Reduce long-running queries (add query timeout)
  • Check for idle server connections and tune server_idle_timeout in pgbouncer.ini so they are released sooner
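
Before raising pool limits, it helps to know how close PostgreSQL itself is to max_connections. The sketch below turns the pg_stat_activity count from the diagnosis step into a percentage. The function name conn_utilization and the 80% warning threshold are illustrative choices, not values taken from this runbook.

```shell
#!/bin/sh
# conn_utilization USED MAX -> prints an integer percentage and a verdict.
# The 80% threshold is an assumption for illustration.
conn_utilization() {
  cu_used="$1"; cu_max="$2"
  if [ "$cu_max" -le 0 ]; then
    echo "invalid max_connections: $cu_max" >&2
    return 2
  fi
  cu_pct=$((cu_used * 100 / cu_max))
  if [ "$cu_pct" -ge 80 ]; then
    echo "${cu_pct}% WARN"
    return 1
  fi
  echo "${cu_pct}% OK"
}

# Example wiring (needs the running stack):
#   used=$(docker compose exec -T postgres psql -U goodgo -d goodgo -Atc "SELECT count(*) FROM pg_stat_activity")
#   max=$(docker compose exec -T postgres psql -U goodgo -d goodgo -Atc "SHOW max_connections")
#   conn_utilization "$used" "$max"
```

If the result is already high, raising PGBOUNCER_POOL_SIZE alone will not help, because PostgreSQL will start refusing the extra server connections.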

3. Redis Connection Failures (Non-Fatal)

Symptoms:

  • Logs: Redis check failed or ECONNREFUSED
  • API still responds, but more slowly, because reads fall through to the database
  • Health check /health/ready returns 503

Expected Behavior: Cache misses → app serves from database

Diagnosis:

# Check Redis availability
docker compose exec redis redis-cli ping

# Check RedisService logs
docker compose logs api | grep -i redis

Solutions:

  • Restart Redis: docker compose restart redis
  • Check memory: docker compose exec redis redis-cli info memory
  • If at maxmemory, increase in docker-compose.yml and restart
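
To judge whether Redis is actually at its limit, the used_memory and maxmemory fields of the info memory output can be compared. The following is a minimal sketch; redis_mem_pct is a hypothetical helper name. It reads the INFO text on stdin, so it can be tried on a saved dump as well as on live output.

```shell
#!/bin/sh
# redis_mem_pct: reads `redis-cli info memory` output on stdin and prints
# used_memory as a percentage of maxmemory. INFO lines end in CRLF, so
# carriage returns are stripped first.
redis_mem_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "no maxmemory limit set"; exit 0 }
      printf "%d%% of maxmemory used\n", used * 100 / max
    }'
}

# Example wiring (needs the running stack):
#   docker compose exec -T redis redis-cli info memory | redis_mem_pct
```

A maxmemory of 0 means Redis has no configured limit, in which case memory pressure shows up at the container limit instead.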

4. Typesense Search Not Indexing

Symptoms:

  • Search returns 0 results
  • Listings created but not searchable
  • Typesense /health reports OK, but the collection is empty

Diagnosis:

# Check collection exists
curl http://localhost:8108/collections -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"

# Check collection stats
curl "http://localhost:8108/collections/listings" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" | jq .

# Check recent docs
curl "http://localhost:8108/collections/listings/documents/search?q=*" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" | jq '.found'

Solutions:

  • Verify TYPESENSE_API_KEY matches container env var
  • Reindex all listings:
    docker compose exec api npx ts-node scripts/reindex-listings.ts
    
  • If collection corrupted, drop and recreate:
    curl -X DELETE "http://localhost:8108/collections/listings" \
      -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"
    # Then restart API service to recreate schema
    docker compose restart api
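
After a reindex, one way to confirm that the index caught up is to compare the database row count with the num_documents field that Typesense returns for the collection. The sketch below does only the comparison. The function name index_drift, the tolerance argument, and the "Listing" table name in the wiring comment are assumptions; if only some listings are meant to be indexed (for example published ones), the database query needs the matching filter.

```shell
#!/bin/sh
# index_drift DB_COUNT INDEX_COUNT [TOLERANCE]
# Exit 1 when the two counts differ by more than TOLERANCE (default 0).
index_drift() {
  id_db="$1"; id_idx="$2"; id_tol="${3:-0}"
  id_diff=$((id_db - id_idx))
  [ "$id_diff" -lt 0 ] && id_diff=$((0 - id_diff))
  if [ "$id_diff" -gt "$id_tol" ]; then
    echo "DRIFT: database=$id_db index=$id_idx (diff $id_diff)"
    return 1
  fi
  echo "IN SYNC: database=$id_db index=$id_idx"
}

# Example wiring (table name "Listing" is assumed, not confirmed by this runbook):
#   db=$(docker compose exec -T postgres psql -U goodgo -d goodgo -Atc 'SELECT count(*) FROM "Listing"')
#   idx=$(curl -s "http://localhost:8108/collections/listings" \
#         -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" | jq '.num_documents')
#   index_drift "$db" "$idx" 10
```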
    

5. Payment Callback Failures

Symptoms:

  • Payment status stuck in PENDING
  • Logs: Invalid callback signature for provider=VNPAY

Diagnosis:

# Check payment record in DB
docker compose exec postgres psql -U goodgo -d goodgo -c \
  "SELECT id, status, provider, \"providerTxId\", \"callbackData\" FROM \"Payment\" \
   WHERE \"providerTxId\" = 'your-txid' ORDER BY \"createdAt\" DESC LIMIT 1;"

# Check logs for callback handler
docker compose logs api | grep -i "HandleCallbackHandler\|callback"

Solutions:

  • Verify payment gateway credentials (VNPAY_HASH_SECRET, MOMO_SECRET_KEY, etc.)
  • Manually verify callback signature (contact payment provider support)
  • Replay callback manually (only if the handler is idempotent or an idempotency key is available):
    curl -X POST http://localhost:3001/api/payments/callback \
      -H "Content-Type: application/json" \
      -d '{"provider":"VNPAY",...callback data...}'
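
When checking a signature by hand, the comparison itself can be reproduced locally with openssl. The sketch below assumes the provider signs a canonical string of the callback fields with HMAC-SHA512 and sends the digest as hex. Both the algorithm and the way the canonical string is built (field order, encoding) are provider-specific, are not defined in this runbook, and must be taken from the provider's documentation. The function names are hypothetical.

```shell
#!/bin/sh
# hmac_sha512_hex SECRET DATA -> lowercase hex digest.
# Note: the secret is passed on the command line and is visible in the
# process list; use this only on a trusted host.
hmac_sha512_hex() {
  printf '%s' "$2" | openssl dgst -sha512 -hmac "$1" | awk '{print $NF}'
}

# verify_signature SECRET DATA RECEIVED_HEX
# Exit 0 when the computed and received digests match (case-insensitive).
verify_signature() {
  vs_expected="$(hmac_sha512_hex "$1" "$2")"
  vs_received="$(printf '%s' "$3" | tr 'A-F' 'a-f')"
  [ -n "$vs_expected" ] && [ "$vs_expected" = "$vs_received" ]
}

# Example (placeholders only):
#   verify_signature "$VNPAY_HASH_SECRET" "<canonical string from the callback>" "<received hash>" \
#     && echo "signature OK" || echo "signature MISMATCH"
```

A mismatch with the correct secret usually points at a difference in how the canonical string was assembled rather than at the hash itself.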
    

6. Backup Verification Fails

Symptoms:

  • GitHub Action .github/workflows/backup-verify.yml fails
  • Restore test database shows mismatched row counts

Diagnosis:

# Run verification manually
docker compose -f docker-compose.ci.yml up -d --wait postgres
docker compose -f docker-compose.ci.yml exec postgres \
  /scripts/pg-verify-backup.sh /backups/goodgo_latest.sql.gz

# Check JSON report
cat /tmp/backups/verify-report.json | jq .

Solutions:

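  • Check whether the backup file is corrupt: gzip -t goodgo_*.sql.gz
  • Verify the restore path by hand. Assuming the .sql.gz file is a gzipped plain-SQL dump, pg_restore cannot read it; stream it into psql instead: gunzip -c /path/to/backup.sql.gz | psql -U goodgo -d goodgo_restored
  • Check PostGIS extension availability: psql -c "CREATE EXTENSION IF NOT EXISTS postgis;"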
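
A quick file-level check before attempting any restore can be sketched as follows. It assumes the backups are gzipped plain-SQL dumps, as the .sql.gz extension suggests; check_backup is a hypothetical helper and is not part of pg-verify-backup.sh.

```shell
#!/bin/sh
# check_backup FILE
# Verifies that FILE exists, is non-empty, is a valid gzip stream, and
# reports how many CREATE TABLE statements the dump contains.
check_backup() {
  cb_file="$1"
  if [ ! -s "$cb_file" ]; then
    echo "missing or empty: $cb_file" >&2
    return 1
  fi
  if ! gzip -t "$cb_file" 2>/dev/null; then
    echo "corrupt gzip stream: $cb_file" >&2
    return 1
  fi
  # A count of 0 suggests a truncated or data-only dump.
  cb_tables=$(gzip -dc "$cb_file" | grep -c '^CREATE TABLE')
  echo "ok: $cb_file ($cb_tables CREATE TABLE statements)"
}
```

Comparing the reported table count with the number of models in the Prisma schema gives a rough sanity check, although it does not replace a full restore test.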

7. Memory/CPU Pressure

Symptoms:

  • OOM kills, container exits 137
  • CPU throttling, latency spikes
  • Prometheus container_memory_usage_bytes near limit

Diagnosis:

# Check Docker stats
docker stats --no-stream

# Check limits in compose file
docker compose config | grep -A3 "resources:"

# Check the configured memory limit in bytes (0 = unlimited); live usage is in docker stats above
docker inspect goodgo-api | jq '.[0].HostConfig.Memory'

Solutions:

  • Increase resource limits in docker-compose.prod.yml
  • Reduce log verbosity (set LOG_LEVEL=warn)
  • Implement pagination for large queries
  • Scale horizontally (add more API replicas)
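
Exit code 137 is 128 + 9, meaning the process was killed by SIGKILL, which is what the kernel OOM killer sends. The sketch below decodes container exit codes along those lines; explain_exit_code is a hypothetical helper name.

```shell
#!/bin/sh
# explain_exit_code CODE
# Codes above 128 mean "terminated by signal (CODE - 128)".
explain_exit_code() {
  ec="$1"
  if [ "$ec" -eq 0 ]; then
    echo "clean exit"
  elif [ "$ec" -gt 128 ]; then
    sig=$((ec - 128))
    case "$sig" in
      9)  echo "killed by SIGKILL (signal 9): OOM kill or docker kill" ;;
      15) echo "terminated by SIGTERM (signal 15): normal stop request" ;;
      *)  echo "terminated by signal $sig" ;;
    esac
  else
    echo "application error (exit code $ec)"
  fi
}

# Example wiring: Docker records both the exit code and whether the OOM
# killer was involved.
#   docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' goodgo-api
```

An exit code of 137 with OOMKilled=false usually means something else sent the SIGKILL, for example docker kill or a stop timeout.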

Prometheus Queries for Debugging

# API request latency p99
histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket[5m])) by (le))

# API error rate (5xx)
(sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100

# Container memory usage
container_memory_usage_bytes{name="goodgo-api"}

# Container CPU usage
rate(container_cpu_usage_seconds_total{name="goodgo-api"}[5m])

# PostgreSQL active queries
pg_stat_activity_count{state="active"}

# Redis memory usage
redis_memory_used_bytes / 1024 / 1024  # in MB

# Typesense collection size
typesense_documents_count{collection="listings"}

Emergency Procedures

Full System Reset (dev only):

docker compose down -v  # Remove all volumes!
docker system prune -a
docker compose up -d --wait
docker compose exec api npx prisma db push
docker compose exec api npx ts-node scripts/seed.ts

Database Emergency Restore:

# Find latest backup
ls -t /var/lib/docker/volumes/pg_backups/_data/goodgo_*.sql.gz | head -1

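# Restore to a new database. A gzipped plain-SQL dump must be streamed into psql;
# pg_restore only reads custom, directory, and tar archives.
createdb -h localhost -p 5432 -U goodgo goodgo_restored
gunzip -c /path/to/backup.sql.gz | psql -h localhost -p 5432 -U goodgo -d goodgo_restored -v ON_ERROR_STOP=1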

# Verify restore
psql -U goodgo -d goodgo_restored -c "SELECT count(*) FROM \"User\";"

Force Kill Stuck Service:

# If health check broken
docker compose kill api
docker compose rm -f api
docker compose up -d api

Appendix: Key File Locations

goodgo-platform/                    # Repository root
├── docker-compose.yml              # Dev environment
├── docker-compose.prod.yml         # Prod environment (with pgbouncer, resource limits)
├── docker-compose.ci.yml           # CI/E2E test environment
├── .env.example                    # Template for all required env vars
│
├── apps/
│   ├── api/
│   │   ├── Dockerfile              # Multi-stage NestJS build
│   │   ├── docker-entrypoint.sh    # Startup script (migrations, app start)
│   │   ├── src/
│   │   │   ├── modules/health/health.controller.ts
│   │   │   ├── modules/payments/application/commands/handle-callback/
│   │   │   ├── modules/shared/infrastructure/redis.service.ts
│   │   │   └── modules/search/infrastructure/services/typesense-search.repository.ts
│   │   └── package.json
│   │
│   └── web/
│       ├── Dockerfile              # Multi-stage Next.js build
│       └── package.json
│
├── libs/
│   └── ai-services/
│       ├── Dockerfile              # Python FastAPI build
│       ├── app/main.py             # FastAPI app entry
│       └── pyproject.toml
│
├── prisma/
│   └── schema.prisma               # Complete Prisma schema (22 models)
│
├── infra/
│   └── pgbouncer/
│       ├── pgbouncer.ini           # Connection pooling config
│       ├── userlist.txt.template   # User list (templated)
│       └── entrypoint.sh           # Env substitution script
│
├── scripts/
│   └── backup/
│       ├── pg-backup.sh            # Daily backup automation
│       ├── pg-verify-backup.sh     # Restore verification
│       └── pg-restore.sh           # Manual restore script
│
├── monitoring/
│   ├── prometheus/
│   │   ├── prometheus.yml          # Scrape config (goodgo-api metrics)
│   │   └── alert-rules.yml         # Latency + error rate alerts
│   ├── loki/
│   │   └── loki-config.yml         # Log aggregation config (15-day retention)
│   ├── promtail/
│   │   └── promtail-config.yml     # Log shipping (Pino JSON parsing)
│   └── grafana/
│       ├── provisioning/
│       │   ├── datasources/datasource.yml
│       │   └── dashboards/dashboard.yml
│       └── dashboards/
│           ├── api-latency.json
│           ├── api-overview.json
│           ├── database.json
│           ├── logs.json
│           ├── search.json
│           ├── web-vitals.json
│           └── business-metrics.json
│
└── .github/workflows/
    ├── ci.yml                      # Lint, test, build
    ├── deploy.yml                  # Build images, deploy to staging/prod
    ├── e2e.yml                     # End-to-end tests
    ├── backup-verify.yml           # Weekly backup verification
    ├── security.yml                # Dependency/SAST scanning
    ├── codeql.yml                  # GitHub CodeQL
    └── load-test.yml               # K6 load testing

Document Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2026-04-11 | DevOps Team | Initial comprehensive runbook |

Last Updated: April 11, 2026
Maintained By: GoodGo Platform SRE Team
Contact: devops@goodgo.vn