GoodGo Platform — Operational Infrastructure Runbook

Last Updated: April 11, 2026
Version: 1.0
Purpose: Complete infrastructure reference for ops teams, SREs, and on-call engineers

Executive Summary
Services Architecture
Docker Compose Specifications
Database Layer
Caching & Search
Monitoring & Observability
Payment Integration
Health Checks
Environment Variables
Backup & Recovery
Deployment Pipeline
Troubleshooting Guide

Executive Summary

GoodGo Platform is a monorepo real estate marketplace built with:

Frontend: Next.js (TypeScript)
Backend API: NestJS (TypeScript)
AI Services: Python/FastAPI
Database: PostgreSQL 16 + PostGIS
Cache: Redis 7
Search: Typesense 27.1
Object Storage: MinIO (S3-compatible)
Monitoring: Prometheus + Grafana + Loki + Promtail
Message Queue: Built-in CQRS/Event Bus (NestJS)

Total Services in Production: 12+ (detailed below)

Services Architecture

Service Inventory

Service	Image	Port	Purpose	Health Check (retries x interval)	Dependencies
api	`goodgo-api:latest`	3001	NestJS REST API	`GET /health` (3x30s)	postgres, redis, typesense, pgbouncer
web	`goodgo-web:latest`	3000	Next.js frontend	`GET /` (3x30s)	api
ai-services	`goodgo-ai-services:latest`	8000	Python FastAPI (price estimation, NLP)	`GET /health` (3x30s)	n/a
postgres	`postgis/postgis:16-3.4`	5432	Primary database	`pg_isready` (5x10s)	n/a
pgbouncer	`edoburu/pgbouncer:1.23.1-p2`	6432	Connection pooling (transaction mode)	`pg_isready` (5x10s)	postgres
redis	`redis:7-alpine`	6379	Cache + session store	`PING` (5x10s)	n/a
typesense	`typesense/typesense:27.1`	8108	Full-text search index	`GET /health` (5x10s)	n/a
minio	`minio/minio:latest`	9000/9001	Object storage + console	`mc ready local` (5x10s)	n/a
loki	`grafana/loki:3.0.0`	3100	Log aggregation	`GET /ready` (5x15s)	n/a
promtail	`grafana/promtail:3.0.0`	9080	Log shipper	(depends on loki healthy)	loki
prometheus	`prom/prometheus:v2.51.0`	9090	Metrics scraper	`GET /-/healthy` (3x15s)	n/a
grafana	`grafana/grafana:10.4.1`	3002	Dashboards + alerting	`GET /api/health` (3x15s)	prometheus, loki
pg-backup	`postgis/postgis:16-3.4`	—	Automated backup cron	depends_on postgres	postgres

Network & Volumes

Network: Docker bridge network goodgo-net
Volumes:
- pgdata — PostgreSQL data files
- redis_data — Redis persistence files (AOF)
- typesense_data — Search index
- minio_data — Object storage
- pg_backups — Database backups (daily retention: 7 days)
- loki_data — Log chunks (retention: 15 days)
- prometheus_data — Metrics TSDB (retention: 30 days in prod, 15 days in dev)
- grafana_data — Dashboards, datasource configs

Docker Compose Specifications

Development Environment (`docker-compose.yml`)

11 Services (minimal dependencies, no resource limits)

services:
  postgres:        PostGIS 16, port 5432, healthcheck: pg_isready (30s start-period)
  redis:           Alpine 7, port 6379, maxmemory: 256mb LRU, AOF enabled
  typesense:       v27.1, port 8108, CORS enabled, healthcheck /health
  minio:           latest, ports 9000 (API) / 9001 (console)
  ai-services:     Custom Python build, port 8000
  pg-backup:       Automated daily dumps at 02:00 UTC, cron retention cleanup
  pg-verify-backup: On-demand backup restore verification (profile: tools)
  loki:            v3.0.0, port 3100, 15-day retention, 2h compaction delay
  promtail:        v3.0.0, Docker socket instrumentation, Pino JSON parsing
  prometheus:      v2.51.0, port 9090, 15-day retention, lifecycle API enabled
  grafana:         v10.4.1, port 3002, datasources pre-provisioned

Key Differences from Prod:

No resource limits (use all available CPU/memory)
Smaller retention windows (7-15 days)
PostgreSQL on port 5432 (direct, no pgbouncer)
loki/prometheus/grafana on alternate ports

Production Environment (`docker-compose.prod.yml`)

13 Services (with pgbouncer, resource limits, rolling updates)

services:
  api:             NestJS, resource limits: 1.0 CPU / 1g memory
  web:             Next.js, resource limits: 0.5 CPU / 512m memory
  ai-services:     Python, resource limits: 1.0 CPU / 1g memory
  postgres:        PostGIS, resource limits: 2.0 CPU / 2g memory
  pgbouncer:       Connection pool (NEW), 20 default connections, transaction mode
  redis:           7-alpine, resource limits: 0.5 CPU / 768m memory, password auth
  typesense:       27.1, resource limits: 1.0 CPU / 1g memory
  minio:           latest, resource limits: 0.5 CPU / 1g memory
  loki:            v3.0.0, resource limits: 0.5 CPU / 512m memory
  promtail:        v3.0.0, resource limits: 0.25 CPU / 256m memory
  prometheus:      v2.51.0, resource limits: 0.5 CPU / 1g memory, 30-day retention
  grafana:         v10.4.1, resource limits: 0.5 CPU / 512m memory
  pg-backup:       Same as dev

Production-Specific Flags:

read_only: true on app containers (api, web, ai-services)
tmpfs: [/tmp] for runtime temp files
security_opt: [no-new-privileges:true]
logging: json-file with 10m max-size, 3-5 files rotation
PgBouncer inserted between apps ↔ Postgres (port 6432)
Secrets management: GRAFANA_ADMIN_USER, GRAFANA_ADMIN_PASSWORD from Docker secrets
Redis requires password authentication

CI/E2E Environment (`docker-compose.ci.yml`)

Minimal 4 Services (tmpfs for speed)

services:
  postgres:        goodgo_test DB, tmpfs (/var/lib/postgresql/data)
  redis:          --save "" --appendonly no (no persistence)
  typesense:      tmpfs (/data)
  minio:          tmpfs (/data)

Used by:

GitHub Actions E2E test suite
Local docker compose -f docker-compose.ci.yml up --wait

Database Layer

PostgreSQL + PostGIS

Version: PostgreSQL 16 with PostGIS 3.4 (image postgis/postgis:16-3.4)
Schema: 22 Prisma models + Prisma migration tracking

Prisma Schema Models

Auth: User, RefreshToken, OAuthAccount, Agent
Listings: Property, PropertyMedia, Listing
Search: SavedSearch
Transactions: Transaction, Inquiry, Lead
Payments: Payment (with PaymentProvider enum: VNPAY, MOMO, ZALOPAY, BANK_TRANSFER)
Subscriptions: Plan, Subscription, UsageRecord
Analytics: Valuation, MarketIndex
Notifications: NotificationLog, NotificationPreference
Audit: AdminAuditLog
Reviews: Review

Key Database Features

PostGIS Geometry: Property.location (Point, SRID 4326) with GIST index
Enums: UserRole, KYCStatus, PropertyType, TransactionType, ListingStatus, Direction, OAuthProvider, TransactionStatus, LeadStatus, PaymentProvider, PaymentStatus, PaymentType, PlanTier, SubscriptionStatus, NotificationChannel, NotificationStatus, AdminAction, AuditTargetType
Compound Indexes: Query optimization on (role, isActive, createdAt), (sellerId, status, publishedAt), (userId, status, createdAt), etc.
Constraints: Unique idempotency key on Payment (userId, provider, idempotencyKey)

Connection Pooling: PgBouncer

Dev Mode (docker-compose.yml):

Apps connect directly to postgres:5432
No pooling overhead

Prod Mode (docker-compose.prod.yml):

Apps connect to pgbouncer:6432
Pool Mode: transaction (connections returned after each transaction)
Pool Size: 20 connections (default, tunable via PGBOUNCER_POOL_SIZE)
Max Client Conn: 200 (tunable via PGBOUNCER_MAX_CLIENT_CONN)
Reserve Pool: 5 connections (fallback when pool exhausted)
Timeouts:
- server_connect_timeout: 15s
- server_idle_timeout: 600s
- server_lifetime: 3600s (connection recycle)
- query_wait_timeout: 120s
- query_timeout: 0 (disabled)
Admin Console: pgbouncer_admin user (password via PGBOUNCER_ADMIN_PASSWORD env var)
Stats Console: pgbouncer_stats user (password via PGBOUNCER_STATS_PASSWORD env var)

Migration Workaround:

API has two DATABASE_URL env vars, because Prisma migrations need a direct connection rather than a transaction-mode PgBouncer pool:
- DATABASE_URL → pgbouncer:6432 (normal queries)
- DATABASE_URL_DIRECT → postgres:5432 (migrations, introspection, DDL)
RUN_MIGRATIONS=true switches app to use DATABASE_URL_DIRECT for prisma migrate deploy
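
The URL selection described above can be sketched as a small pure function. This is a minimal illustration, not the project's actual bootstrap code: `selectDatabaseUrl` and the `DbEnv` type are hypothetical names, and the sample connection strings are placeholders. Only the three variable names come from this runbook.

```typescript
// Hypothetical helper: picks the connection string for the current task.
// Only DATABASE_URL, DATABASE_URL_DIRECT and RUN_MIGRATIONS come from the runbook.
type DbEnv = {
  DATABASE_URL?: string;
  DATABASE_URL_DIRECT?: string;
  RUN_MIGRATIONS?: string;
};

function selectDatabaseUrl(env: DbEnv): string {
  const migrating = env.RUN_MIGRATIONS === "true";
  const url = migrating ? env.DATABASE_URL_DIRECT : env.DATABASE_URL;
  if (!url) {
    const missing = migrating ? "DATABASE_URL_DIRECT" : "DATABASE_URL";
    throw new Error(missing + " is not set");
  }
  return url;
}

// Placeholder credentials, for illustration only.
const prodEnv: DbEnv = {
  DATABASE_URL: "postgresql://goodgo:secret@pgbouncer:6432/goodgo",
  DATABASE_URL_DIRECT: "postgresql://goodgo:secret@postgres:5432/goodgo",
};

// Normal queries go through the pool; migrations bypass it.
console.log(selectDatabaseUrl(prodEnv));
console.log(selectDatabaseUrl({ ...prodEnv, RUN_MIGRATIONS: "true" }));
```

Failing loudly when the selected variable is missing is a deliberate choice in this sketch: silently falling back to the pooled URL during a migration would hide the misconfiguration.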

Backup Strategy

Automated Backups:

Schedule: Daily at 02:00 UTC (cron inside pg-backup container)
Format: Custom format with gzip compression (level 6)
Retention: 7 days (configurable via BACKUP_RETENTION_DAYS)
Location: pg_backups volume (mount to persistent storage in prod)
File Pattern: goodgo_YYYYMMDD_HHMMSS.sql.gz (a pg_dump custom-format archive despite the extension; restore with pg_restore)
Restore Script: /scripts/backup/pg-restore.sh (manual restore)
Verification Script: /scripts/backup/pg-verify-backup.sh (automated E2E verification)
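
The file pattern and retention rule above can be illustrated with a short sketch. It assumes the timestamp in the file name is UTC and that expiry is decided from the name; the real pg-backup.sh may instead prune by file modification time. `parseBackupTime` and `expiredBackups` are hypothetical names.

```typescript
// Parses the timestamp embedded in a backup file name such as
// goodgo_20260410_020000.sql.gz. Returns null for names that do not match.
function parseBackupTime(fileName: string): Date | null {
  const m = /^goodgo_(\d{4})(\d{2})(\d{2})_(\d{2})(\d{2})(\d{2})\.sql\.gz$/.exec(fileName);
  if (!m) return null;
  const [year, month, day, hour, minute, second] = m.slice(1).map(Number);
  // Assumption: timestamps are UTC, matching the 02:00 UTC schedule.
  return new Date(Date.UTC(year, month - 1, day, hour, minute, second));
}

// Returns the backups older than retentionDays relative to `now`.
function expiredBackups(fileNames: string[], now: Date, retentionDays: number): string[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return fileNames.filter((name) => {
    const taken = parseBackupTime(name);
    return taken !== null && taken.getTime() < cutoff;
  });
}

const backupFiles = [
  "goodgo_20260401_020000.sql.gz",
  "goodgo_20260410_020000.sql.gz",
  "notes.txt",
];
const checkedAt = new Date(Date.UTC(2026, 3, 11, 3, 0, 0)); // 2026-04-11 03:00 UTC
console.log(expiredBackups(backupFiles, checkedAt, 7)); // only the 2026-04-01 backup
```

Files that do not match the pattern are never reported as expired, so unrelated files in the backup volume are left alone.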

Verification Process (runs weekly):

Restores latest backup to isolated test database (goodgo_verify_<timestamp>)
Verifies all 22 tables exist
Compares row counts between source and restored DB
Checksums critical tables (User, Property, Listing, Payment, Subscription, Transaction, Plan, _prisma_migrations)
Checks PostGIS extension, indexes, enum types
Generates JSON report with pass/fail result
Cleanup: Drops test DB on exit (unless SKIP_CLEANUP=1)
Exit Codes: 0=pass, 1=checks failed, 2=setup error

CI/CD Backup Verification:

GitHub Action: .github/workflows/backup-verify.yml
Runs weekly Sundays 05:00 UTC
Also manually triggerable with skip_cleanup option
Uploads JSON report as artifact

Caching & Search

Redis

Image: redis:7-alpine
Port: 6379

Production Configuration:

# --appendonly yes: AOF persistence (write operations are logged)
# --requirepass: authentication required
# --maxmemory 512mb: memory cap (prod)
# --maxmemory-policy allkeys-lru: LRU eviction when full
redis-server \
  --appendonly yes \
  --requirepass ${REDIS_PASSWORD} \
  --maxmemory 512mb \
  --maxmemory-policy allkeys-lru

Development Configuration:

redis-server \
  --appendonly yes \
  --maxmemory 256mb \
  --maxmemory-policy allkeys-lru

ioredis Client Configuration:

// From RedisService in apps/api/src/modules/shared/infrastructure/redis.service.ts
{
  host: process.env.REDIS_HOST ?? 'localhost',
  port: Number(process.env.REDIS_PORT ?? 6379),
  password: process.env.REDIS_PASSWORD ?? undefined,
  lazyConnect: true,          // App starts even if Redis unavailable
  enableReadyCheck: false,    // Prevents "Redis is not ready" errors during transient outages
  maxRetriesPerRequest: 1,    // Fail fast (single retry, no exponential backoff)
  retryStrategy(times: number): number {
    return Math.min(times * 1000, 5000);  // 1s → 2s → 3s → 4s → 5s → 5s...
  }
}

Graceful Degradation:

Cache misses don't fail the application
CacheService catches Redis errors and returns cache miss
App serves data directly from PostgreSQL if Redis down
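Health check at GET /health/redis reports Redis as down without affecting liveness (GET /health); GET /health/ready also checks Redis and returns 503 while it is unavailable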
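
The degradation behaviour above can be sketched as a cache-aside helper that treats any cache error as a miss. This is an illustration under stated assumptions, not the project's CacheService: the `Cache` interface, `getOrLoad`, and the simulated outage are all hypothetical.

```typescript
// Hypothetical minimal cache interface; the real CacheService API may differ.
interface Cache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// Treats any cache error as a miss so the caller still gets data from the loader.
async function getOrLoad(
  cache: Cache,
  key: string,
  ttlSeconds: number,
  load: () => Promise<string>,
): Promise<string> {
  try {
    const hit = await cache.get(key);
    if (hit !== null) return hit;
  } catch {
    // Cache unavailable: fall through and read from the source of truth.
  }
  const value = await load();
  try {
    await cache.set(key, value, ttlSeconds);
  } catch {
    // Best effort only; a failed cache write must not fail the request.
  }
  return value;
}

// Simulates a Redis outage: every call rejects.
const brokenCache: Cache = {
  get: () => Promise.reject(new Error("ECONNREFUSED")),
  set: () => Promise.reject(new Error("ECONNREFUSED")),
};

getOrLoad(brokenCache, "listing:42", 60, async () => "from-postgres").then(
  (value) => console.log(value), // prints "from-postgres"
);
```

Swallowing the error on both the read and the write path keeps request latency bounded by the database call alone, at the cost of hiding cache failures from the caller, so logging inside the catch blocks would be a sensible addition.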

Use Cases:

Session storage
Cache layer for expensive queries
Rate limiting (if implemented)
Real-time counters

Typesense

Image: typesense/typesense:27.1
Port: 8108 (HTTP only, internal Docker network)
API Key: ${TYPESENSE_API_KEY} (must be set in .env)

Collection Schema:

Collection Name: "listings"
Fields:
  - listingId (string)
  - propertyId (string)
  - title (string, searchable, highlights)
  - description (string, searchable, highlights)
  - propertyType (string, faceted)
  - transactionType (string, faceted: SALE/RENT)
  - priceVND (int64, sortable)
  - pricePerM2 (float, optional)
  - areaM2 (float)
  - bedrooms (int32, faceted)
  - bathrooms (int32, faceted)
  - floors (int32)
  - direction (string, faceted: NORTH/SOUTH/EAST/WEST/etc)
  - address (string)
  - ward (string, faceted)
  - district (string, faceted)
  - city (string, faceted)
  - location (geopoint) — for radius search
  - agentId (string)
  - sellerId (string)
  - status (string, faceted: ACTIVE/SOLD/DRAFT/etc)
  - publishedAt (int64, sortable)
  - viewCount (int32)
  - saveCount (int32)
  - projectName (string, faceted)
  - amenities (string[], faceted)

Search Features:

Full-text search on: title, description, address, district, city, projectName
Query weights: title=5, description=3, address=2, district=2, city=1, projectName=2
Filtering: propertyType, transactionType, bedrooms, district, city, status, amenities
Geo-search: radius-based queries (lat, lng, km)
Sorting: price (asc/desc), distance (asc from geopoint), date (desc), relevance
Highlights: HTML marks on matched terms in title and description
Facets: Return aggregated counts for filtering
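
To make the weights, filters, and geo options above concrete, here is a hedged sketch of how they might map onto Typesense search parameters. `ListingQuery`, `buildSearchParams`, the default `status:=ACTIVE` filter, the facet selection, and the page size are assumptions for illustration; the real TypesenseSearchRepository may build its request differently.

```typescript
// Hypothetical input shape; field names mirror the collection schema.
interface ListingQuery {
  text: string;
  city?: string;
  minBedrooms?: number;
  near?: { lat: number; lng: number; radiusKm: number };
}

// Builds Typesense search parameters using the query weights listed above.
function buildSearchParams(query: ListingQuery): Record<string, string | number> {
  // Assumption: only active listings are searchable by default.
  const filters: string[] = ["status:=ACTIVE"];
  if (query.city !== undefined) filters.push(`city:=${query.city}`);
  if (query.minBedrooms !== undefined) filters.push(`bedrooms:>=${query.minBedrooms}`);
  if (query.near !== undefined) {
    const { lat, lng, radiusKm } = query.near;
    filters.push(`location:(${lat}, ${lng}, ${radiusKm} km)`);
  }
  return {
    q: query.text,
    query_by: "title,description,address,district,city,projectName",
    query_by_weights: "5,3,2,2,1,2",
    filter_by: filters.join(" && "),
    // Nearest first for geo queries, otherwise newest first.
    sort_by: query.near
      ? `location(${query.near.lat}, ${query.near.lng}):asc`
      : "publishedAt:desc",
    facet_by: "propertyType,district,bedrooms",
    per_page: 20,
  };
}

console.log(buildSearchParams({ text: "apartment", city: "Hanoi", minBedrooms: 2 }));
```

The order of `query_by_weights` must match the order of fields in `query_by`, which is why both are kept side by side in one place.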

TypesenseSearchRepository (apps/api/src/modules/search/infrastructure/services/typesense-search.repository.ts):

ensureCollection() — Creates schema if not exists
dropCollection() — Cleanup (testing only)
indexDocument(doc) — Upsert single document
indexDocuments(docs) — Bulk import with error reporting
removeDocument(id) — Delete by ID
search(params) — Execute search with filters, sort, pagination

Graceful Degradation:

If Typesense down, search falls back to PostgreSQL full-text search
TypesenseClientService implements retry logic with exponential backoff
Health check at GET /health returns JSON status

Monitoring & Observability

Prometheus

Image: prom/prometheus:v2.51.0
Port: 9090
Retention: 15 days (dev), 30 days (prod)
Lifecycle API: Enabled (--web.enable-lifecycle)

Scrape Targets (monitoring/prometheus/prometheus.yml):

scrape_configs:
  - job_name: goodgo-api
    metrics_path: /metrics
    static_configs:
      # Dev: API runs on the host. In prod, the target is 'api:3001' (API in container).
      - targets: ['host.docker.internal:3001']
        labels:
          service: goodgo-api
          environment: development  # 'production' in prod

  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

Expected Metrics from API:

goodgo_api_request_duration_seconds_bucket{le, route, method} — Request latency histogram
http_requests_total{status_code, job} — Request count by status code
Custom business metrics (if implemented in NestJS @prometheus decorators)

Alert Rules (`monitoring/prometheus/alert-rules.yml`)

Latency Alerts:

ApiLatencyP99High (warning)
- Trigger: p99 latency > 1s for 5 minutes
- Dashboard: /d/goodgo-api-latency/goodgo-api-latency
- Runbook: https://docs.goodgo.vn/runbooks/api-latency-high
ApiEndpointLatencyP99High (warning)
- Trigger: Per-endpoint p99 > 2s for 5 minutes
- Annotates: method, route labels
ApiLatencyP99Critical (critical - SLO breach)
- Trigger: p99 latency > 3s for 3 minutes
- Escalation required
- Runbook: https://docs.goodgo.vn/runbooks/api-latency-critical

Error Rate Alert:

ApiErrorRate5xxHigh (warning)
- Trigger: 5xx error rate > 1% for 5 minutes
- Uses: (5xx errors / total requests) * 100
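
The error-rate rule can be worked through numerically. The sketch below is not Prometheus itself: it only mirrors the formula and the "sustained for 5 minutes" idea, assuming one evaluation per minute. `RateSample`, `errorRatePercent`, and `shouldFire` are hypothetical names.

```typescript
// One evaluation: request counts observed over the rate window.
interface RateSample {
  total: number;        // all requests in the window
  serverErrors: number; // 5xx responses in the window
}

// (5xx errors / total requests) * 100; zero traffic counts as 0%.
function errorRatePercent(s: RateSample): number {
  return s.total === 0 ? 0 : (s.serverErrors / s.total) * 100;
}

// Fires only when each of the most recent `forSamples` evaluations breaches the threshold.
function shouldFire(samples: RateSample[], thresholdPercent: number, forSamples: number): boolean {
  if (samples.length < forSamples) return false;
  return samples.slice(-forSamples).every((s) => errorRatePercent(s) > thresholdPercent);
}

// Five consecutive one-minute evaluations at roughly 1.5% errors: the alert fires.
const sustained: RateSample[] = [1, 2, 3, 4, 5].map(() => ({ total: 1000, serverErrors: 15 }));
console.log(shouldFire(sustained, 1, 5)); // true

// A single healthy minute inside the window resets the condition.
const interrupted: RateSample[] = sustained.slice(0, 4).concat([{ total: 1000, serverErrors: 5 }]);
console.log(shouldFire(interrupted, 1, 5)); // false
```

Requiring every evaluation in the window to breach the threshold is what keeps a short error burst from paging anyone.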

Grafana

Image: grafana/grafana:10.4.1
Port: 3002
Auth: Admin user/password from secrets (prod) or env vars (dev)

Pre-Provisioned Datasources:

Prometheus (default, primary)
Loki (with derived fields for correlationId linkage)

Dashboards:

api-latency.json — API p99/p95/p50, route breakdown, slow endpoints
api-overview.json — Request rate, error rate, uptime status
database.json — Query latency, connection pool utilization, slow queries
logs.json — Log volume, error logs, trace links to Prometheus
search.json — Typesense query latency, indexing rate, collection size
web-vitals.json — Frontend Core Web Vitals (if client-side instrumentation)
business-metrics.json — Listings created, payments processed, user signups

Admin Console Access:

URL: http://localhost:3002 (dev) or port ${GRAFANA_PORT} on the production host (prod)
Default user: admin (change password on first login)
Non-signup mode (GF_USERS_ALLOW_SIGN_UP: false)

Loki & Promtail (Log Aggregation)

Loki: grafana/loki:3.0.0, port 3100

Configuration:

schema:
  - from: 2024-01-01
    store: tsdb
    schema: v13
limits:
  max_entries_limit_per_query: 5000
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 6
retention: 360h (15 days)

Promtail: grafana/promtail:3.0.0

Configuration:

Scrapes Docker logs from goodgo-net bridge network
Parses Pino JSON structured logs
Extracts labels: level, context, component, service
Structured metadata: method, url, statusCode, correlationId, duration
Derives timestamp from Pino output (RFC3339Nano)

Expected Log Format (Pino):

{
  "level": 30,                    // info
  "time": "2026-04-11T10:30:00Z",
  "msg": "POST /api/listings",
  "correlationId": "abc-123-def",
  "context": "ListingController",
  "component": "api",
  "method": "POST",
  "url": "/api/listings",
  "statusCode": 201,
  "duration": 150
}
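
As a rough illustration of the label and metadata split described above, the following sketch parses one log line of this shape. It is not Promtail's pipeline configuration: `parsePinoLine` and `ParsedLog` are hypothetical, and the `service` label is left out because it does not appear in the log line itself. The numeric level mapping is Pino's default.

```typescript
// Pino's default numeric levels mapped to their names.
const PINO_LEVELS: Record<number, string> = {
  10: "trace",
  20: "debug",
  30: "info",
  40: "warn",
  50: "error",
  60: "fatal",
};

interface ParsedLog {
  labels: { level: string; context: string; component: string };
  metadata: {
    method?: string;
    url?: string;
    statusCode?: number;
    correlationId?: string;
    duration?: number;
  };
  message: string;
}

// Splits one Pino JSON line into low-cardinality labels and per-request metadata.
function parsePinoLine(line: string): ParsedLog {
  const entry = JSON.parse(line);
  const levelName = PINO_LEVELS[entry.level];
  return {
    labels: {
      level: levelName !== undefined ? levelName : "unknown",
      context: String(entry.context),
      component: String(entry.component),
    },
    metadata: {
      method: entry.method,
      url: entry.url,
      statusCode: entry.statusCode,
      correlationId: entry.correlationId,
      duration: entry.duration,
    },
    message: String(entry.msg),
  };
}

const sampleLine =
  '{"level":30,"time":"2026-04-11T10:30:00Z","msg":"POST /api/listings",' +
  '"correlationId":"abc-123-def","context":"ListingController","component":"api",' +
  '"method":"POST","url":"/api/listings","statusCode":201,"duration":150}';

const parsedLine = parsePinoLine(sampleLine);
console.log(parsedLine.labels.level, parsedLine.metadata.statusCode); // info 201
```

High-cardinality values such as correlationId stay in metadata rather than labels, which keeps the number of Loki streams small.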

Payment Integration

Supported Payment Providers

Enum: PaymentProvider (Prisma)

VNPAY — VNPay (Vietnam payment gateway)
MOMO — MoMo (Vietnam mobile wallet)
ZALOPAY — ZaloPay (Vietnam digital wallet)
BANK_TRANSFER — Manual bank transfer (offline)

Payment Flow & Callback Handling

Database Schema (Payment Model):

model Payment {
  id            String @id @default(cuid())
  userId        String
  transactionId String?
  provider      PaymentProvider
  type          PaymentType  // SUBSCRIPTION, LISTING_FEE, DEPOSIT, FEATURED_LISTING
  amountVND     BigInt
  status        PaymentStatus  // PENDING, PROCESSING, COMPLETED, FAILED, REFUNDED
  providerTxId  String?  // External transaction ID from VNPay/MoMo/ZaloPay
  callbackData  Json?    // Raw callback payload (for audit)
  idempotencyKey String? // Prevent duplicate payments (userId, provider, idempotencyKey unique)
  createdAt     DateTime @default(now())
  updatedAt     DateTime @updatedAt
}

enum PaymentStatus {
  PENDING
  PROCESSING
  COMPLETED
  FAILED
  REFUNDED
}

enum PaymentType {
  SUBSCRIPTION
  LISTING_FEE
  DEPOSIT
  FEATURED_LISTING
}

Command Handler: HandleCallbackHandler (apps/api/src/modules/payments/application/commands/handle-callback/handle-callback.handler.ts)

Callback Signature Verification:
- Uses PAYMENT_GATEWAY_FACTORY to route to correct provider (VNPay/MoMo/ZaloPay)
- Gateway.verifyCallback() validates HMAC signature
- Throws ValidationException if signature invalid
Idempotent Status Transition:
- Only updates payments in state: PENDING or PROCESSING
- Atomically transitions to COMPLETED or FAILED
- If already in terminal state (COMPLETED/FAILED/REFUNDED), returns existing status (idempotent)
- Logs warning if payment not found
Domain Event Publishing:
- Reconstructs domain entity from repository
- Emits PaymentCompletedEvent or PaymentFailedEvent
- Event bus publishes events to subscribers (e.g., subscription creation, listing activation)

Response:

{
  paymentId: string,
  status: PaymentStatus,
  isSuccess: boolean
}

Payment Gateway Interface (payment-gateway.interface.ts):

interface IPaymentGateway {
  readonly provider: PaymentProvider
  createPaymentUrl(params: CreatePaymentUrlParams): Promise<CreatePaymentUrlResult>
  verifyCallback(data: Record<string, string>): CallbackVerifyResult
  refund(params: RefundParams): Promise<RefundResult>
}

interface CreatePaymentUrlParams {
  orderId: string
  amountVND: bigint
  description: string
  returnUrl: string
  ipAddress: string
}

interface CallbackVerifyResult {
  isValid: boolean
  orderId: string
  providerTxId: string
  isSuccess: boolean
  rawData: Record<string, unknown>
}

interface RefundParams {
  providerTxId: string
  amountVND: bigint
  reason: string
}

interface RefundResult {
  success: boolean
  refundTxId: string | null
}

Environment Variables

VNPay:

VNPAY_TMN_CODE=<merchant terminal code>
VNPAY_HASH_SECRET=<HMAC secret key>
VNPAY_BASE_URL=https://sandbox.vnpayment.vn/paymentv2/vpcpay.html
VNPAY_API_URL=https://sandbox.vnpayment.vn/merchant_webapi/api/transaction

MoMo:

MOMO_PARTNER_CODE=<partner code>
MOMO_ACCESS_KEY=<access key>
MOMO_SECRET_KEY=<secret key>
MOMO_ENDPOINT=https://test-payment.momo.vn/v2/gateway/api

ZaloPay:

ZALOPAY_APP_ID=<app ID>
ZALOPAY_KEY1=<key 1 (for creating payments)>
ZALOPAY_KEY2=<key 2 (for callback verification)>
ZALOPAY_ENDPOINT=https://sb-openapi.zalopay.vn/v2

Race Condition & Idempotency Protection

Problem: Multiple callbacks may arrive for same payment (network retries, duplicate notifications)

Solution:

Unique Idempotency Key: Payment_idempotency_unique(userId, provider, idempotencyKey)
- Prevents duplicate payment records
- Generated by client/API before creating payment
Atomic Status Update: paymentRepo.updateIfStatus(orderId, ['PENDING', 'PROCESSING'], newStatus)
- Only updates if current status in allowed list
- Returns updated entity or null if already terminal
Terminal State Check: If already COMPLETED/FAILED/REFUNDED, handler returns existing state
- No re-triggering of domain events
- No double billing or duplicate transactions
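
The conditional update and terminal-state check above can be sketched with an in-memory stand-in for the repository. This is an illustration only: `InMemoryPaymentRepo` and `handleCallback` are hypothetical, and a real repository would perform the check and write in one conditional database UPDATE so the transition is atomic across processes.

```typescript
type PaymentStatus = "PENDING" | "PROCESSING" | "COMPLETED" | "FAILED" | "REFUNDED";

interface Payment {
  id: string;
  status: PaymentStatus;
}

// In-memory stand-in for the payment repository (hypothetical).
class InMemoryPaymentRepo {
  private rows: Record<string, Payment> = {};

  insert(p: Payment): void {
    this.rows[p.id] = { ...p };
  }

  find(id: string): Payment | null {
    const row = this.rows[id];
    return row ? { ...row } : null;
  }

  // Updates only when the current status is in `allowed`; returns null otherwise.
  updateIfStatus(id: string, allowed: PaymentStatus[], next: PaymentStatus): Payment | null {
    const row = this.rows[id];
    if (!row || allowed.indexOf(row.status) === -1) return null;
    row.status = next;
    return { ...row };
  }
}

// `transitioned` tells the caller whether this call performed the state change
// and therefore whether a domain event should be published.
function handleCallback(
  repo: InMemoryPaymentRepo,
  orderId: string,
  isSuccess: boolean,
): { status: PaymentStatus | null; transitioned: boolean } {
  const next: PaymentStatus = isSuccess ? "COMPLETED" : "FAILED";
  const updated = repo.updateIfStatus(orderId, ["PENDING", "PROCESSING"], next);
  if (updated) return { status: updated.status, transitioned: true };
  const existing = repo.find(orderId);
  return { status: existing ? existing.status : null, transitioned: false };
}

const repo = new InMemoryPaymentRepo();
repo.insert({ id: "order-1", status: "PENDING" });

const firstCallback = handleCallback(repo, "order-1", true);
const duplicateCallback = handleCallback(repo, "order-1", false);
console.log(firstCallback.status, firstCallback.transitioned);         // COMPLETED true
console.log(duplicateCallback.status, duplicateCallback.transitioned); // COMPLETED false
```

Note that the duplicate callback cannot downgrade a completed payment to FAILED, even when its payload disagrees with the first one.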

Health Checks

API Health Endpoints

Health Controller (apps/api/src/modules/health/health.controller.ts)

GET /health — Liveness Probe (always 200 if process running)
- Uses: @HealthCheck() on empty probe list
- Response: { "status": "ok", "timestamp": "..." }
- Use Case: Kubernetes/Docker liveness check (process is up, including initial startup)
GET /health/ready — Readiness Probe (checks dependencies)
- Checks: PostgreSQL + Redis connectivity
- Response:
```
{
  "status": "ok",
  "checks": {
    "database": { "status": "up" },
    "redis": { "status": "up" }
  }
}
```
- Use Case: Load balancer, before accepting traffic
- Failure: Returns 503 if any dependency down
GET /health/db — Database Readiness Only
- Checks: PostgreSQL connectivity via SELECT 1 query
- Use Case: Manual database troubleshooting
GET /health/redis — Redis Readiness Only
- Checks: Redis PING command
- Use Case: Manual Redis troubleshooting
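
The readiness rule above (200 only when every dependency is up, 503 otherwise) can be sketched as a pure function. `buildReadiness` is a hypothetical helper, not the controller's code, and the `"error"` status value for the failing case is an assumption about the response shape.

```typescript
type CheckStatus = "up" | "down";

interface ReadinessResponse {
  httpStatus: 200 | 503;
  body: {
    status: "ok" | "error";
    checks: Record<string, { status: CheckStatus }>;
  };
}

// Combines individual dependency results into one readiness answer.
function buildReadiness(results: Record<string, boolean>): ReadinessResponse {
  const checks: Record<string, { status: CheckStatus }> = {};
  let allUp = true;
  for (const name of Object.keys(results)) {
    const up = results[name];
    checks[name] = { status: up ? "up" : "down" };
    if (!up) allUp = false;
  }
  return {
    httpStatus: allUp ? 200 : 503,
    body: { status: allUp ? "ok" : "error", checks },
  };
}

console.log(buildReadiness({ database: true, redis: true }).httpStatus);  // 200
console.log(buildReadiness({ database: true, redis: false }).httpStatus); // 503
```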

Health Check Implementations

PrismaHealthIndicator (apps/api/src/modules/health/infrastructure/prisma.health.ts):

async isHealthy(key: string): Promise<HealthIndicatorResult> {
  try {
    await this.prisma.$queryRawUnsafe('SELECT 1');
    return this.getStatus(key, true);
  } catch {
    throw new HealthCheckError('Database check failed', this.getStatus(key, false));
  }
}

RedisHealthIndicator (apps/api/src/modules/health/infrastructure/redis.health.ts):

async isHealthy(key: string): Promise<HealthIndicatorResult> {
  try {
    const client = this.redis.getClient();
    const pong = await client.ping();
    const isHealthy = pong === 'PONG';
    const result = this.getStatus(key, isHealthy);
    if (isHealthy) return result;
    throw new HealthCheckError('Redis ping failed', result);
  } catch (error) {
    if (error instanceof HealthCheckError) throw error;
    throw new HealthCheckError('Redis check failed', this.getStatus(key, false));
  }
}

Docker Container Health Checks

API Container:

healthcheck:
  test: ['CMD', 'node', '-e', "fetch('http://localhost:3001/health').then(r => { if (!r.ok) throw 1 }).catch(() => process.exit(1))"]
  interval: 30s
  timeout: 5s
  retries: 5
  start_period: 30s

Web Container:

healthcheck:
  test: ['CMD', 'node', '-e', "fetch('http://localhost:3000').then(r => { if (!r.ok) throw 1 }).catch(() => process.exit(1))"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 15s

PostgreSQL:

healthcheck:
  test: ['CMD-SHELL', 'pg_isready -U ${DB_USER} -d ${DB_NAME}']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s

Redis:

healthcheck:
  test: ['CMD', 'redis-cli', '-a', '${REDIS_PASSWORD}', 'ping']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 10s

Typesense:

healthcheck:
  test: ['CMD', 'curl', '-sf', 'http://localhost:8108/health']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 15s

Environment Variables

Complete `.env.example` Reference

PostgreSQL:

DB_HOST=localhost
DB_PORT=5432
DB_NAME=goodgo
DB_USER=goodgo
DB_PASSWORD=CHANGE_ME
DATABASE_URL=postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}?schema=public
DATABASE_URL_DIRECT=postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}?schema=public

PgBouncer (Prod Only):

PGBOUNCER_POOL_SIZE=20
PGBOUNCER_MAX_CLIENT_CONN=200
PGBOUNCER_ADMIN_PASSWORD=CHANGE_ME
PGBOUNCER_STATS_PASSWORD=CHANGE_ME

Redis:

REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=
REDIS_URL=redis://${REDIS_HOST}:${REDIS_PORT}

Typesense:

TYPESENSE_HOST=localhost
TYPESENSE_PORT=8108
TYPESENSE_PROTOCOL=http
TYPESENSE_API_KEY=CHANGE_ME

MinIO:

MINIO_ENDPOINT=localhost
MINIO_PORT=9000
MINIO_CONSOLE_PORT=9001
MINIO_ACCESS_KEY=CHANGE_ME
MINIO_SECRET_KEY=CHANGE_ME
MINIO_BUCKET=goodgo-media
MINIO_USE_SSL=false

NestJS API:

API_PORT=3001
PORT=3001
NODE_ENV=development
CORS_ORIGINS=http://localhost:3000,http://localhost:3001

JWT / Authentication (REQUIRED):

JWT_SECRET=<generate with: openssl rand -base64 48>
JWT_EXPIRES_IN=15m
JWT_REFRESH_SECRET=<generate with: openssl rand -base64 48>
JWT_REFRESH_EXPIRES_IN=7d

OAuth Providers:

GOOGLE_CLIENT_ID=
GOOGLE_CLIENT_SECRET=
GOOGLE_CALLBACK_URL=http://localhost:3001/auth/google/callback

ZALO_APP_ID=
ZALO_APP_SECRET=
ZALO_CALLBACK_URL=http://localhost:3001/auth/zalo/callback

FRONTEND_URL=http://localhost:3000

Next.js Web:

NEXT_PUBLIC_API_URL=http://localhost:3001
WEB_PORT=3000

AI Service (Python/FastAPI):

AI_SERVICE_PORT=8000
AI_SERVICE_URL=http://localhost:8000
CLAUDE_API_KEY=
AI_DEBUG=false
AI_LOG_LEVEL=info

Map Integration:

NEXT_PUBLIC_MAPBOX_TOKEN=

Payment Gateways:

VNPAY_TMN_CODE=
VNPAY_HASH_SECRET=
VNPAY_BASE_URL=https://sandbox.vnpayment.vn/paymentv2/vpcpay.html
VNPAY_API_URL=https://sandbox.vnpayment.vn/merchant_webapi/api/transaction

MOMO_PARTNER_CODE=
MOMO_ACCESS_KEY=
MOMO_SECRET_KEY=
MOMO_ENDPOINT=https://test-payment.momo.vn/v2/gateway/api

ZALOPAY_APP_ID=
ZALOPAY_KEY1=
ZALOPAY_KEY2=
ZALOPAY_ENDPOINT=https://sb-openapi.zalopay.vn/v2

Email / SMTP:

SMTP_HOST=localhost
SMTP_PORT=1025
SMTP_USER=
SMTP_PASS=
SMTP_FROM=noreply@goodgo.vn

Firebase Cloud Messaging (Optional):

FIREBASE_SERVICE_ACCOUNT=

Sentry Error Tracking:

SENTRY_DSN=
NEXT_PUBLIC_SENTRY_DSN=
SENTRY_AUTH_TOKEN=
SENTRY_ORG=
SENTRY_PROJECT=

KYC Field Encryption (REQUIRED Prod):

KYC_ENCRYPTION_KEY=<generate with: openssl rand -hex 32> # 64 hex chars (32 bytes)
KYC_ENCRYPTION_KEY_VERSION=1

Logging:

LOG_LEVEL=info

Backup & Recovery

Automated Daily Backups

Service: pg-backup container (runs inside docker compose)

Backup Script: scripts/backup/pg-backup.sh

# Daily cron job: 02:00 UTC
PGHOST=postgres \
PGPORT=5432 \
PGUSER=goodgo \
PGDATABASE=goodgo \
PGPASSWORD=<secret> \
BACKUP_DIR=/backups \
RETENTION_DAYS=7 \
  /scripts/pg-backup.sh

Behavior:

Creates dump with pg_dump --format=custom --compress=6
Saves as goodgo_YYYYMMDD_HHMMSS.sql.gz
Prunes backups older than 7 days (configurable)
Logs to /var/log/pg-backup.log

Restore from Backup:

# Direct restore with pg_restore (non-interactive; --clean drops existing objects first)
docker compose -f docker-compose.prod.yml exec pg-backup bash -c \
  'pg_restore -h postgres -p 5432 -U goodgo -d goodgo \
   --clean --if-exists /backups/goodgo_20260410_020000.sql.gz'

# Or using restore script
docker compose -f docker-compose.prod.yml run --rm pg-verify-backup bash -c \
  'source /scripts/pg-restore.sh /backups/goodgo_20260410_020000.sql.gz'

Backup Verification

Service: pg-verify-backup container (on-demand, profile: tools)

Verification Script: scripts/backup/pg-verify-backup.sh

# Usage:
docker compose -f docker-compose.prod.yml run --rm pg-verify-backup

# Or with options:
SKIP_CLEANUP=1 REPORT_FILE=/backups/verify-report.json \
  docker compose -f docker-compose.prod.yml run --rm pg-verify-backup

Verification Steps:

Creates isolated test database: goodgo_verify_<timestamp>
Enables PostGIS extension
Restores backup into test DB
Verifies all 22 tables exist
Compares row counts between source and restored
Checksums critical tables using MD5 hashes
Checks indexes, enum types
Generates JSON report with results
Cleanup: Drops test DB (unless SKIP_CLEANUP=1)

JSON Report Structure:

{
  "timestamp": "2026-04-11T10:30:00Z",
  "backupFile": "/backups/goodgo_20260410_020000.sql.gz",
  "backupSize": "150M",
  "testDatabase": "goodgo_verify_20260411_103000",
  "restoreDurationSeconds": 45,
  "passed": 28,
  "failed": 0,
  "warnings": 2,
  "result": "pass",
  "checks": [
    { "check": "Database creation", "status": "pass", "detail": "Test database created" },
    { "check": "Restore", "status": "pass", "detail": "pg_restore completed cleanly in 45s" },
    { "check": "Table existence", "status": "pass", "detail": "All 22 expected tables present" },
    { "check": "Row counts", "status": "pass", "detail": "All tables match source database" },
    { "check": "Checksum: User identities", "status": "pass", "detail": "Hashes match (abc123def456...)" },
    ...
  ]
}
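
A report of this shape can be reduced to the documented exit codes with a few lines. The sketch below is illustrative only: `exitCodeFor` and `failingChecks` are hypothetical helpers, a missing report is assumed to stand for a setup error, and the `"fail"` status value is an assumption since the sample shows only passing checks.

```typescript
interface VerifyCheck {
  check: string;
  status: string;
  detail: string;
}

interface VerifyReport {
  passed: number;
  failed: number;
  warnings: number;
  result: string;
  checks: VerifyCheck[];
}

// Maps a report to the documented exit codes: 0 = pass, 1 = checks failed.
// Assumption: a setup error (exit code 2) happens before any report exists,
// so it is modelled here as a null report.
function exitCodeFor(report: VerifyReport | null): 0 | 1 | 2 {
  if (report === null) return 2;
  return report.failed === 0 && report.result === "pass" ? 0 : 1;
}

// Names of checks that failed, for a one-line on-call summary.
function failingChecks(report: VerifyReport): string[] {
  return report.checks.filter((c) => c.status === "fail").map((c) => c.check);
}

const okReport: VerifyReport = {
  passed: 28,
  failed: 0,
  warnings: 2,
  result: "pass",
  checks: [{ check: "Restore", status: "pass", detail: "pg_restore completed cleanly in 45s" }],
};

console.log(exitCodeFor(okReport)); // 0: warnings alone do not fail the run
console.log(exitCodeFor(null));     // 2
```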

GitHub Action Backup Verification:

File: .github/workflows/backup-verify.yml
Schedule: Weekly Sundays 05:00 UTC
Also: Manual trigger with skip_cleanup option
Artifacts: Uploads JSON report for 30 days

Deployment Pipeline

GitHub Actions CI/CD

Workflows:

.github/workflows/ci.yml — Lint, typecheck, test, build (on push/PR to master)
.github/workflows/deploy.yml — Build Docker images, deploy to staging/prod
.github/workflows/e2e.yml — E2E tests (spins up full docker-compose.ci.yml)
.github/workflows/backup-verify.yml — Weekly backup verification
.github/workflows/security.yml — Dependency scanning, SAST
.github/workflows/codeql.yml — GitHub CodeQL analysis
.github/workflows/load-test.yml — K6 load testing

CI Pipeline (`ci.yml`)

On: push master, pull_request master
Node: 22
Concurrency: Cancel previous runs on same ref

Jobs:

Lint → Typecheck → Test → Build
- Installs pnpm, Node 22
- Runs linter (eslint)
- Type checks (tsc)
- Unit tests (jest)
- Builds all apps (turbo)
- PostgreSQL 16 service available (goodgo_test DB)
E2E Tests (depends on ci job)
- Full docker-compose.ci.yml services (postgres, redis, typesense, minio)
- Runs end-to-end test suite
- Timeout: 20 minutes
- Env vars: DATABASE_URL, JWT secrets, payment test codes

Deploy Pipeline (`deploy.yml`)

On:

push master (auto-deploys to staging)
Manual workflow_dispatch (choose staging or production)

Jobs:

Build API Image
- Builds: goodgo-api:${IMAGE_TAG}
- Dockerfile: apps/api/Dockerfile
- Registry: ghcr.io/goodgo/goodgo-api
- Tags: git SHA, branch name, latest (on master)
Build Web Image
- Builds: goodgo-web:${IMAGE_TAG}
- Dockerfile: apps/web/Dockerfile
- Registry: ghcr.io/goodgo/goodgo-web
Build AI Services Image
- Builds: goodgo-ai-services:${IMAGE_TAG}
- Context: libs/ai-services/
- Registry: ghcr.io/goodgo/goodgo-ai-services
Deploy to Staging
- Condition: github.event_name == 'push' || inputs.environment == 'staging'
- SSH into staging host
- Pulls new images from GHCR
- Rolling update (zero downtime):
```
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services
```
- Runs migrations: docker compose exec api npx prisma migrate deploy
- Prunes old images
Deploy to Production
- Only on manual workflow_dispatch with environment: production
- Same steps as staging
- Requires approval on the production environment (GitHub environment protection rules)
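
The trigger conditions for the two deploy jobs can be summarised in a small decision function. This is a sketch of the logic as described in this runbook, not the workflow file itself; `Trigger` and `deployTarget` are hypothetical names.

```typescript
type Environment = "staging" | "production";

interface Trigger {
  eventName: "push" | "workflow_dispatch";
  inputEnvironment?: Environment; // only set for manual runs
}

// Mirrors the documented conditions:
//   staging:    event is a push, or the manual input is "staging"
//   production: manual run with input "production" only
function deployTarget(t: Trigger): Environment | null {
  if (t.eventName === "push" || t.inputEnvironment === "staging") return "staging";
  if (t.eventName === "workflow_dispatch" && t.inputEnvironment === "production") {
    return "production";
  }
  return null;
}

console.log(deployTarget({ eventName: "push" }));                                            // staging
console.log(deployTarget({ eventName: "workflow_dispatch", inputEnvironment: "production" })); // production
```

A push can never reach production in this model; production always requires an explicit manual choice.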

Dockerfile Multi-Stage Builds

API (apps/api/Dockerfile):

Base: node:22-slim + pnpm 10.27.0
Deps: Install locked dependencies (layer caching)
Build: Compile TypeScript, generate Prisma client
Prune: pnpm deploy --prod (removes dev deps, hoists prod deps)
Production: Minimal image, dumb-init for signals, non-root user

Web (apps/web/Dockerfile):

Base: node:22-slim + pnpm
Deps: Install dependencies
Build: next build → standalone output + static files
Production: Copy .next/standalone, public, static assets

AI Services (libs/ai-services/Dockerfile):

Base: python:3.12-slim
Install: System deps (gcc, g++), dumb-init, FastAPI/XGBoost/underthesea
Models: Pre-download underthesea ML models at build time
User: Run as non-root appuser
CMD: uvicorn app.main:app --host 0.0.0.0 --port 8000

Troubleshooting Guide

Check Service Status

# All services
docker compose -f docker-compose.prod.yml ps

# Single service
docker compose -f docker-compose.prod.yml ps api

# Get logs
docker compose -f docker-compose.prod.yml logs -f api --tail=100

# Health check status
docker compose -f docker-compose.prod.yml exec api curl http://localhost:3001/health

Common Issues

1. API Service Not Healthy (stuck in "starting" or "unhealthy" state)

Symptoms:

docker compose ps shows (health: starting) for >2 minutes, or (unhealthy)
docker compose logs api shows connection errors

Diagnosis:

# Check API liveness
docker compose exec api curl http://localhost:3001/health

# Check readiness (includes DB + Redis checks)
docker compose exec api curl http://localhost:3001/health/ready

# Check specific dependencies
docker compose exec api curl http://localhost:3001/health/db
docker compose exec api curl http://localhost:3001/health/redis

Solutions:

PostgreSQL not ready:

docker compose ps postgres  # Should show (healthy)
docker compose exec postgres pg_isready -U goodgo -d goodgo
docker compose logs postgres --tail=50

Redis not ready:

docker compose exec redis redis-cli ping  # Should return PONG
docker compose logs redis --tail=50

PgBouncer not ready (prod):

docker compose exec pgbouncer pg_isready -h 127.0.0.1 -p 6432 -U goodgo
docker compose logs pgbouncer --tail=50

Database schema not initialized:

# Run migrations manually
docker compose exec api npx prisma migrate deploy
# Or check if schema exists
docker compose exec postgres psql -U goodgo -d goodgo -c "\dt"
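
The diagnosis order above (liveness, readiness, then individual dependencies) can be automated. The sketch below assumes the API port 3001 is reachable from wherever the script runs; `probe` and `first_failure` are hypothetical helper names, and only the four endpoint paths come from this runbook.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

ENDPOINTS = ["/health", "/health/ready", "/health/db", "/health/redis"]


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code, or 0 if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 from a failing readiness check
    except (urllib.error.URLError, OSError):
        return 0


def probe(base_url: str, fetch: Callable[[str], int] = http_status) -> Dict[str, int]:
    """Query every health endpoint and map its path to the status code."""
    return {path: fetch(base_url + path) for path in ENDPOINTS}


def first_failure(results: Dict[str, int]) -> str:
    """Return the first endpoint that did not answer 200, or "" if all passed."""
    for path in ENDPOINTS:
        if results.get(path) != 200:
            return path
    return ""
```

For example, `first_failure(probe("http://localhost:3001"))` returning `/health/db` points at PostgreSQL or PgBouncer rather than Redis.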

2. Database Connection Pool Exhaustion

Symptoms:

Errors: Error: unable to get a connection from the pool after X s
Slow queries pile up
API latency spikes

Diagnosis:

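# Check pool stats (prod, PgBouncer); admin commands run against the special "pgbouncer" database
docker compose exec pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_stats -d pgbouncer -c "SHOW STATS;"
docker compose exec pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_stats -d pgbouncer -c "SHOW POOLS;"  # cl_waiting > 0 means clients are queuing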

# Or query PostgreSQL directly
docker compose exec postgres psql -U goodgo -d goodgo -c "SELECT count(*) FROM pg_stat_activity"

Solutions:

Increase PGBOUNCER_POOL_SIZE (default: 20)
Increase PGBOUNCER_MAX_CLIENT_CONN (default: 200)
Reduce long-running queries (add query timeout)
Check for idle connections: server_idle_timeout
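
Before raising the pool size, it helps to check that the new value still fits under the PostgreSQL connection limit. The arithmetic below is a sketch under stated assumptions: PostgreSQL runs with its stock defaults (max_connections = 100, superuser_reserved_connections = 3), the pool size applies per database/user pair, and the function names are hypothetical.

```python
def max_server_connections(pool_size: int, db_user_pairs: int = 1, reserve_pool: int = 0) -> int:
    """Upper bound on connections PgBouncer may open to PostgreSQL."""
    return (pool_size + reserve_pool) * db_user_pairs


def fits_postgres(
    pool_size: int,
    pg_max_connections: int = 100,  # assumption: PostgreSQL default
    reserved: int = 3,              # assumption: superuser_reserved_connections default
    db_user_pairs: int = 1,
    reserve_pool: int = 0,
) -> bool:
    """True if the pool can be fully used without hitting max_connections."""
    needed = max_server_connections(pool_size, db_user_pairs, reserve_pool)
    return needed <= pg_max_connections - reserved


def multiplexing_ratio(max_client_conn: int, pool_size: int, db_user_pairs: int = 1) -> float:
    """How many client connections share each server connection at full load."""
    return max_client_conn / max_server_connections(pool_size, db_user_pairs)
```

With the runbook defaults (pool size 20, max client connections 200), `multiplexing_ratio(200, 20)` is 10.0, so sustained waits usually point to long transactions holding server connections rather than to the client limit.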

3. Redis Connection Failures (Non-Fatal)

Symptoms:

Logs: Redis check failed or ECONNREFUSED
But API still responds with slower database reads
Health check /health/ready returns 503

Expected Behavior: Cache misses → app serves from database

Diagnosis:

# Check Redis availability
docker compose exec redis redis-cli ping

# Check RedisService logs
docker compose logs api | grep -i redis

Solutions:

Restart Redis: docker compose restart redis
Check memory: docker compose exec redis redis-cli info memory
If at maxmemory, increase in docker-compose.yml and restart
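
To decide whether Redis is actually at its memory ceiling, the output of `redis-cli info memory` can be parsed for `used_memory` and `maxmemory`. This is a minimal sketch with hypothetical function names; it treats `maxmemory:0` as "no limit configured".

```python
from typing import Dict, Optional


def parse_info(text: str) -> Dict[str, str]:
    """Parse the key:value lines of a Redis INFO section, skipping headers."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_usage_ratio(info_text: str) -> Optional[float]:
    """Return used_memory / maxmemory, or None when no limit is set."""
    fields = parse_info(info_text)
    maxmemory = int(fields.get("maxmemory", "0"))
    if maxmemory == 0:
        return None
    return int(fields["used_memory"]) / maxmemory
```

A ratio close to 1.0 suggests evictions or write errors are likely, depending on the configured eviction policy.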

4. Typesense Search Not Indexing

Symptoms:

Search returns 0 results
Listings created but not searchable
/health for typesense shows green, but collection empty

Diagnosis:

# Check collection exists
curl http://localhost:8108/collections -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"

# Check collection stats
curl "http://localhost:8108/collections/listings" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" | jq .

# Check recent docs
curl "http://localhost:8108/collections/listings/documents/search?q=*" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" | jq '.found'

Solutions:

Verify TYPESENSE_API_KEY matches container env var

Reindex all listings:

docker compose exec api npx ts-node scripts/reindex-listings.ts

If collection corrupted, drop and recreate:

curl -X DELETE "http://localhost:8108/collections/listings" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"
# Then restart API service to recreate schema
docker compose restart api
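
A quick way to confirm an indexing gap is to compare the `num_documents` field returned by the collection endpoint with the number of listings in PostgreSQL. The sketch below is illustrative only: the function names and the tolerance are hypothetical, and the database count is assumed to come from a separate query against whichever table backs listings.

```python
import json


def index_gap(collection_json: str, db_listing_count: int) -> int:
    """Return how many database listings are missing from the Typesense collection."""
    collection = json.loads(collection_json)
    return db_listing_count - int(collection["num_documents"])


def needs_reindex(gap: int, tolerance: int = 0) -> bool:
    """True when more documents are missing than the allowed tolerance."""
    return gap > tolerance
```

A small positive gap right after a burst of new listings may only reflect indexing lag, so a non-zero tolerance can avoid unnecessary full reindexes.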

5. Payment Callback Failures

Symptoms:

Payment status stuck in PENDING
Logs: Invalid callback signature for provider=VNPAY

Diagnosis:

# Check payment record in DB
docker compose exec postgres psql -U goodgo -d goodgo -c \
  "SELECT id, status, provider, \"providerTxId\", \"callbackData\" FROM \"Payment\" \
   WHERE \"providerTxId\" = 'your-txid' ORDER BY \"createdAt\" DESC LIMIT 1;"

# Check logs for callback handler
docker compose logs api | grep -i "HandleCallbackHandler\|callback"

Solutions:

Verify payment gateway credentials (VNPAY_HASH_SECRET, MOMO_SECRET_KEY, etc.)
Manually verify callback signature (contact payment provider support)

Replay the callback manually (only if an idempotency key is available, so the payment is not processed twice):

curl -X POST http://localhost:3001/api/payments/callback \
  -H "Content-Type: application/json" \
  -d '{"provider":"VNPAY",...callback data...}'
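
When a callback is rejected with an invalid signature, recomputing the signature locally shows whether the secret or the payload is at fault. The sketch below assumes a VNPAY-style scheme: HMAC-SHA512 over the parameters sorted by key, URL-encoded, and joined with `&`, excluding the hash fields. The parameter names and the encoding are assumptions to confirm against the provider's specification; this is not the platform's callback handler.

```python
import hashlib
import hmac
from typing import Mapping
from urllib.parse import quote_plus

# Assumption: these fields carry the signature itself and are not signed.
EXCLUDED = {"vnp_SecureHash", "vnp_SecureHashType"}


def sign(params: Mapping[str, str], secret: str) -> str:
    """Compute a hex HMAC-SHA512 over the sorted, URL-encoded parameters."""
    items = sorted((k, v) for k, v in params.items() if k not in EXCLUDED)
    hash_data = "&".join(f"{k}={quote_plus(str(v))}" for k, v in items)
    return hmac.new(secret.encode("utf-8"), hash_data.encode("utf-8"), hashlib.sha512).hexdigest()


def verify(params: Mapping[str, str], secret: str) -> bool:
    """Compare the received signature with the expected one in constant time."""
    received = params.get("vnp_SecureHash", "").lower()
    expected = sign(params, secret)
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

If `verify` fails with the value of `VNPAY_HASH_SECRET` from the running container but passes with the secret shown in the provider dashboard, the environment variable is stale.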

6. Backup Verification Fails

Symptoms:

GitHub Action .github/workflows/backup-verify.yml fails
Restore test database shows mismatched row counts

Diagnosis:

# Run verification manually
docker compose -f docker-compose.ci.yml up -d --wait postgres
docker compose -f docker-compose.ci.yml exec postgres \
  /scripts/pg-verify-backup.sh /backups/goodgo_latest.sql.gz

# Check JSON report
cat /tmp/backups/verify-report.json | jq .

Solutions:

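Check whether the backup file is corrupt: gzip -t goodgo_*.sql.gz (file goodgo_*.sql.gz only confirms the file type)
Verify the restore process with verbose output: pg_restore --verbose
Check PostGIS extension availability: psql -c "CREATE EXTENSION IF NOT EXISTS postgis;"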
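
A corrupt or truncated archive can be detected without restoring it by streaming the whole file through a gzip reader. This is a minimal sketch using only the standard library; `gzip_is_intact` is a hypothetical helper, and it checks the compression layer only, not the SQL inside.

```python
import gzip
import zlib


def gzip_is_intact(path: str, chunk_size: int = 1 << 20) -> bool:
    """Read the archive to the end; any decode, CRC, or I/O error means it is unusable."""
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass  # discard data; only the integrity checks matter here
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers missing files and bad gzip headers or checksums;
        # EOFError is raised when the stream is truncated.
        return False
```

Reading in chunks keeps memory use flat, so the check works on multi-gigabyte dumps.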

7. Memory/CPU Pressure

Symptoms:

OOM kills, container exits 137
CPU throttling, latency spikes
Prometheus container_memory_usage_bytes near limit

Diagnosis:

# Check Docker stats
docker stats --no-stream

# Check limits in compose file
docker compose config | grep -A3 "resources:"

# Check the memory limit applied to the container (bytes; 0 means no limit)
docker inspect goodgo-api | jq '.[0].HostConfig.Memory'

Solutions:

Increase resource limits in docker-compose.prod.yml
Reduce log verbosity (set LOG_LEVEL=warn)
Implement pagination for large queries
Scale horizontally (add more API replicas)
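
To see which containers are closest to their memory limit before changing any setting, the JSON form of the stats output can be filtered by threshold. The sketch assumes the input comes from `docker stats --no-stream --format '{{json .}}'` (one JSON object per line, with `Name` and `MemPerc` fields); the 85% threshold and the function names are hypothetical.

```python
import json
from typing import List, Tuple


def parse_percent(value: str) -> float:
    """Convert a string such as '91.20%' to 91.2."""
    return float(value.strip().rstrip("%"))


def hot_containers(stats_output: str, mem_threshold: float = 85.0) -> List[Tuple[str, float]]:
    """Return (name, memory %) for containers at or above the threshold, highest first."""
    hot = []
    for line in stats_output.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        mem = parse_percent(row["MemPerc"])
        if mem >= mem_threshold:
            hot.append((row["Name"], mem))
    return sorted(hot, key=lambda item: item[1], reverse=True)
```

Containers that appear here repeatedly are the first candidates for a higher limit or for the pagination and log-level changes listed above.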

Prometheus Queries for Debugging

# API request latency p99
histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket[5m])) by (le))

# API error rate (5xx)
(sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100

# Container memory usage
container_memory_usage_bytes{name="goodgo-api"}

# Container CPU usage
rate(container_cpu_usage_seconds_total{name="goodgo-api"}[5m])

# PostgreSQL active queries
pg_stat_activity_count{state="active"}

# Redis memory usage
redis_memory_used_bytes / 1024 / 1024  # in MB

# Typesense collection size
typesense_documents_count{collection="listings"}
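
The queries above can also be run from a script through the Prometheus HTTP API (`/api/v1/query`). The sketch below assumes Prometheus is reachable at a base URL such as `http://localhost:9090` (the default port, not confirmed by this runbook) and that the query returns an instant vector; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List, Tuple


def query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL with the expression safely encoded."""
    encoded = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{encoded}"


def parse_vector(body: str) -> List[Tuple[Dict[str, str], float]]:
    """Extract (labels, value) pairs from an instant-vector response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(payload.get("error", "query failed"))
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def instant_query(base_url: str, promql: str, timeout: float = 5.0):
    """Run the query against a live Prometheus and return the parsed samples."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=timeout) as resp:
        return parse_vector(resp.read().decode("utf-8"))
```

For example, `instant_query("http://localhost:9090", 'container_memory_usage_bytes{name="goodgo-api"}')` returns the current memory sample for the API container, if that metric is being scraped.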

Emergency Procedures

Full System Reset (dev only):

docker compose down -v  # Remove all volumes!
docker system prune -a
docker compose up -d --wait
docker compose exec api npx prisma db push
docker compose exec api npx ts-node scripts/seed.ts

Database Emergency Restore:

# Find latest backup
ls -t /var/lib/docker/volumes/pg_backups/_data/goodgo_*.sql.gz | head -1

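# Restore to a new database. pg_restore cannot read a gzipped file directly.
# Assuming the dump is plain SQL (as the .sql.gz extension suggests), pipe it to psql;
# if pg-backup.sh writes a custom-format archive, pipe into pg_restore -d goodgo_restored instead.
createdb -h localhost -p 5432 -U goodgo goodgo_restored
gunzip -c /path/to/backup.sql.gz | psql -h localhost -p 5432 -U goodgo -d goodgo_restored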

# Verify restore
psql -U goodgo -d goodgo_restored -c "SELECT count(*) FROM \"User\";"

Force Kill Stuck Service:

# If health check broken
docker compose kill api
docker compose rm -f api
docker compose up -d api

Appendix: Key File Locations

goodgo-platform-ai/
├── docker-compose.yml              # Dev environment
├── docker-compose.prod.yml         # Prod environment (with pgbouncer, resource limits)
├── docker-compose.ci.yml           # CI/E2E test environment
├── .env.example                    # Template for all required env vars
│
├── apps/
│   ├── api/
│   │   ├── Dockerfile              # Multi-stage NestJS build
│   │   ├── docker-entrypoint.sh    # Startup script (migrations, app start)
│   │   ├── src/
│   │   │   ├── modules/health/health.controller.ts
│   │   │   ├── modules/payments/application/commands/handle-callback/
│   │   │   ├── modules/shared/infrastructure/redis.service.ts
│   │   │   └── modules/search/infrastructure/services/typesense-search.repository.ts
│   │   └── package.json
│   │
│   └── web/
│       ├── Dockerfile              # Multi-stage Next.js build
│       └── package.json
│
├── libs/
│   └── ai-services/
│       ├── Dockerfile              # Python FastAPI build
│       ├── app/main.py             # FastAPI app entry
│       └── pyproject.toml
│
├── prisma/
│   └── schema.prisma               # Complete Prisma schema (22 models)
│
├── infra/
│   └── pgbouncer/
│       ├── pgbouncer.ini           # Connection pooling config
│       ├── userlist.txt.template   # User list (templated)
│       └── entrypoint.sh           # Env substitution script
│
├── scripts/
│   └── backup/
│       ├── pg-backup.sh            # Daily backup automation
│       ├── pg-verify-backup.sh     # Restore verification
│       └── pg-restore.sh           # Manual restore script
│
├── monitoring/
│   ├── prometheus/
│   │   ├── prometheus.yml          # Scrape config (goodgo-api metrics)
│   │   └── alert-rules.yml         # Latency + error rate alerts
│   ├── loki/
│   │   └── loki-config.yml         # Log aggregation config (15-day retention)
│   ├── promtail/
│   │   └── promtail-config.yml     # Log shipping (Pino JSON parsing)
│   └── grafana/
│       ├── provisioning/
│       │   ├── datasources/datasource.yml
│       │   └── dashboards/dashboard.yml
│       └── dashboards/
│           ├── api-latency.json
│           ├── api-overview.json
│           ├── database.json
│           ├── logs.json
│           ├── search.json
│           ├── web-vitals.json
│           └── business-metrics.json
│
└── .github/workflows/
    ├── ci.yml                      # Lint, test, build
    ├── deploy.yml                  # Build images, deploy to staging/prod
    ├── e2e.yml                     # End-to-end tests
    ├── backup-verify.yml           # Weekly backup verification
    ├── security.yml                # Dependency/SAST scanning
    ├── codeql.yml                  # GitHub CodeQL
    └── load-test.yml               # K6 load testing

Document Version History

Version	Date	Author	Changes
1.0	2026-04-11	DevOps Team	Initial comprehensive runbook

Last Updated: April 11, 2026
Maintained By: GoodGo Platform SRE Team
Contact: devops@goodgo.vn
