GoodGo Platform — Operational Infrastructure Runbook
Last Updated: April 11, 2026
Version: 1.0
Purpose: Complete infrastructure reference for ops teams, SREs, and on-call engineers
Table of Contents
- Executive Summary
- Services Architecture
- Docker Compose Specifications
- Database Layer
- Caching & Search
- Monitoring & Observability
- Payment Integration
- Health Checks
- Environment Variables
- Backup & Recovery
- Deployment Pipeline
- Troubleshooting Guide
Executive Summary
GoodGo Platform is a monorepo real estate marketplace built with:
- Frontend: Next.js (TypeScript)
- Backend API: NestJS (TypeScript)
- AI Services: Python/FastAPI
- Database: PostgreSQL 16 + PostGIS
- Cache: Redis 7
- Search: Typesense 27.1
- Object Storage: MinIO (S3-compatible)
- Monitoring: Prometheus + Grafana + Loki + Promtail
- Message Queue: Built-in CQRS/Event Bus (NestJS)
Total Services in Production: 12+ (detailed below)
Services Architecture
Service Inventory
| Service | Image | Port | Purpose | Health Check | Dependencies |
|---|---|---|---|---|---|
| api | goodgo-api:latest | 3001 | NestJS REST API | GET /health (3x30s) | postgres, redis, typesense, pgbouncer |
| web | goodgo-web:latest | 3000 | Next.js frontend | GET / (3x30s) | api |
| ai-services | goodgo-ai-services:latest | 8000 | Python FastAPI (price estimation, NLP) | GET /health (3x30s) | n/a |
| postgres | postgis/postgis:16-3.4 | 5432 | Primary database | pg_isready (5x10s) | n/a |
| pgbouncer | edoburu/pgbouncer:1.23.1-p2 | 6432 | Connection pooling (transaction mode) | pg_isready (5x10s) | postgres |
| redis | redis:7-alpine | 6379 | Cache + session store | PING (5x10s) | n/a |
| typesense | typesense/typesense:27.1 | 8108 | Full-text search index | GET /health (5x10s) | n/a |
| minio | minio/minio:latest | 9000/9001 | Object storage + console | mc ready local (5x10s) | n/a |
| loki | grafana/loki:3.0.0 | 3100 | Log aggregation | GET /ready (5x15s) | n/a |
| promtail | grafana/promtail:3.0.0 | 9080 | Log shipper | (depends on loki healthy) | loki |
| prometheus | prom/prometheus:v2.51.0 | 9090 | Metrics scraper | GET /-/healthy (3x15s) | n/a |
| grafana | grafana/grafana:10.4.1 | 3002 | Dashboards + alerting | GET /api/health (3x15s) | prometheus, loki |
| pg-backup | postgis/postgis:16-3.4 | n/a | Automated backup cron | depends_on postgres | postgres |
Network & Volumes
- Network: Docker bridge network goodgo-net
- Volumes:
  - pgdata: PostgreSQL data files
  - redis_data: Redis append-only file (AOF)
  - typesense_data: Search index
  - minio_data: Object storage
  - pg_backups: Database backups (daily, 7-day retention)
  - loki_data: Log chunks (retention: 15 days)
  - prometheus_data: Metrics TSDB (retention: 30 days in prod, 15 days in dev)
  - grafana_data: Dashboards, datasource configs
Docker Compose Specifications
Development Environment (docker-compose.yml)
12 Services (minimal dependencies, no resource limits)
services:
postgres: PostGIS 16, port 5432, healthcheck: pg_isready (30s start-period)
redis: Alpine 7, port 6379, maxmemory: 256mb LRU, AOF enabled
typesense: v27.1, port 8108, CORS enabled, healthcheck /health
minio: latest, ports 9000 (API) / 9001 (console)
ai-services: Custom Python build, port 8000
pg-backup: Automated daily dumps at 02:00 UTC, cron retention cleanup
pg-verify-backup: On-demand backup restore verification (profile: tools)
loki: v3.0.0, port 3100, 15-day retention, 2h compaction delay
promtail: v3.0.0, Docker socket instrumentation, Pino JSON parsing
prometheus: v2.51.0, port 9090, 15-day retention, lifecycle API enabled
grafana: v10.4.1, port 3002, datasources pre-provisioned
Key Differences from Prod:
- No resource limits (use all available CPU/memory)
- Smaller retention windows (7-15 days)
- PostgreSQL on port 5432 (direct, no pgbouncer)
- loki/prometheus/grafana on alternate ports
Production Environment (docker-compose.prod.yml)
14 Services (with pgbouncer, resource limits, rolling updates)
services:
api: NestJS, resource limits: 1.0 CPU / 1g memory
web: Next.js, resource limits: 0.5 CPU / 512m memory
ai-services: Python, resource limits: 1.0 CPU / 1g memory
postgres: PostGIS, resource limits: 2.0 CPU / 2g memory
pgbouncer: Connection pool (NEW), 20 default connections, transaction mode
redis: 7-alpine, resource limits: 0.5 CPU / 768m memory, password auth
typesense: 27.1, resource limits: 1.0 CPU / 1g memory
minio: latest, resource limits: 0.5 CPU / 1g memory
loki: v3.0.0, resource limits: 0.5 CPU / 512m memory
promtail: v3.0.0, resource limits: 0.25 CPU / 256m memory
prometheus: v2.51.0, resource limits: 0.5 CPU / 1g memory, 30-day retention
grafana: v10.4.1, resource limits: 0.5 CPU / 512m memory
pg-backup: Same as dev
Production-Specific Flags:
- read_only: true on app containers (api, web, ai-services)
- tmpfs: [/tmp] for runtime temp files
- security_opt: [no-new-privileges:true]
- logging: json-file with 10m max-size, 3-5 files rotation
- PgBouncer inserted between apps ↔ Postgres (port 6432)
- Secrets management: GRAFANA_ADMIN_USER, GRAFANA_ADMIN_PASSWORD from Docker secrets
- Redis requires password authentication
CI/E2E Environment (docker-compose.ci.yml)
Minimal 4 Services (tmpfs for speed)
services:
postgres: goodgo_test DB, tmpfs (/var/lib/postgresql/data)
redis: --save "" --appendonly no (no persistence)
typesense: tmpfs (/data)
minio: tmpfs (/data)
Used by:
- GitHub Actions E2E test suite
- Local runs: docker compose -f docker-compose.ci.yml up --wait
Database Layer
PostgreSQL + PostGIS
Version: PostgreSQL 16 with PostGIS 3.4 (image postgis/postgis:16-3.4)
Schema: 22 Prisma models + Prisma migration tracking
Prisma Schema Models
- Auth: User, RefreshToken, OAuthAccount, Agent
- Listings: Property, PropertyMedia, Listing
- Search: SavedSearch
- Transactions: Transaction, Inquiry, Lead
- Payments: Payment (with PaymentProvider enum: VNPAY, MOMO, ZALOPAY, BANK_TRANSFER)
- Subscriptions: Plan, Subscription, UsageRecord
- Analytics: Valuation, MarketIndex
- Notifications: NotificationLog, NotificationPreference
- Audit: AdminAuditLog
- Reviews: Review
Key Database Features
- PostGIS Geometry: Property.location (Point, SRID 4326) with GIST index
- Enums: UserRole, KYCStatus, PropertyType, TransactionType, ListingStatus, Direction, OAuthProvider, TransactionStatus, LeadStatus, PaymentProvider, PaymentStatus, PaymentType, PlanTier, SubscriptionStatus, NotificationChannel, NotificationStatus, AdminAction, AuditTargetType
- Compound Indexes: Query optimization on (role, isActive, createdAt), (sellerId, status, publishedAt), (userId, status, createdAt), etc.
- Constraints: Unique idempotency key on Payment (userId, provider, idempotencyKey)
Connection Pooling: PgBouncer
Dev Mode (docker-compose.yml):
- Apps connect directly to postgres:5432
- No pooling overhead
Prod Mode (docker-compose.prod.yml):
- Apps connect to pgbouncer:6432
- Pool Mode: transaction (connections returned after each transaction)
- Pool Size: 20 connections (default, tunable via PGBOUNCER_POOL_SIZE)
- Max Client Conn: 200 (tunable via PGBOUNCER_MAX_CLIENT_CONN)
- Reserve Pool: 5 connections (fallback when pool exhausted)
- Timeouts:
  - server_connect_timeout: 15s
  - server_idle_timeout: 600s
  - server_lifetime: 3600s (connection recycle)
  - query_wait_timeout: 120s
  - query_timeout: 0 (disabled)
- Admin Console: pgbouncer_admin user (password via PGBOUNCER_ADMIN_PASSWORD env var)
- Stats Console: pgbouncer_stats user (password via PGBOUNCER_STATS_PASSWORD env var)
Migration Workaround:
- API has two DATABASE_URL env vars:
  - DATABASE_URL → pgbouncer:6432 (normal queries)
  - DATABASE_URL_DIRECT → postgres:5432 (migrations, introspection, DDL)
- RUN_MIGRATIONS=true switches the app to DATABASE_URL_DIRECT for prisma migrate deploy
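The switch between the pooled and direct connection strings can be sketched as below. This is a minimal illustration, not the project's actual startup code: the helper name `resolveDatabaseUrl` and the `Env` type are hypothetical, and only the three variable names come from this runbook.

```typescript
// Hypothetical helper (not from the codebase): choose the connection string
// the way the runbook describes the RUN_MIGRATIONS switch.
type Env = Record<string, string | undefined>;

function resolveDatabaseUrl(env: Env): string {
  // Migrations and DDL bypass PgBouncer and go straight to Postgres.
  const wantsDirect = env.RUN_MIGRATIONS === "true";
  const name = wantsDirect ? "DATABASE_URL_DIRECT" : "DATABASE_URL";
  const url = env[name];
  if (!url) {
    throw new Error(`${name} is not set`);
  }
  return url;
}

// Example values only; real credentials come from the environment.
const exampleEnv: Env = {
  DATABASE_URL: "postgresql://goodgo:secret@pgbouncer:6432/goodgo",
  DATABASE_URL_DIRECT: "postgresql://goodgo:secret@postgres:5432/goodgo",
};

console.log(resolveDatabaseUrl({ ...exampleEnv, RUN_MIGRATIONS: "true" }));
console.log(resolveDatabaseUrl(exampleEnv));
```

Failing loudly when the selected variable is missing avoids silently running migrations through the pooler.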
Backup Strategy
Automated Backups:
- Schedule: Daily at 02:00 UTC (cron inside pg-backup container)
- Format: Custom format with gzip compression (level 6)
- Retention: 7 days (configurable via BACKUP_RETENTION_DAYS)
- Location: pg_backups volume (mount to persistent storage in prod)
- File Pattern: goodgo_YYYYMMDD_HHMMSS.sql.gz
- Restore Script: /scripts/backup/pg-restore.sh (manual restore)
- Verification Script: /scripts/backup/pg-verify-backup.sh (automated E2E verification)
Verification Process (runs weekly):
- Restores latest backup to isolated test database (goodgo_verify_<timestamp>)
- Verifies all 22 tables exist
- Compares row counts between source and restored DB
- Checksums critical tables (User, Property, Listing, Payment, Subscription, Transaction, Plan, _prisma_migrations)
- Checks PostGIS extension, indexes, enum types
- Generates JSON report with pass/fail result
- Cleanup: Drops test DB on exit (unless SKIP_CLEANUP=1)
- Exit Codes: 0=pass, 1=checks failed, 2=setup error
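The row-count comparison and its mapping to exit codes can be illustrated as follows. The real check lives in the shell script; this is only a sketch of the logic, and the names `compareRowCounts`, `exitCodeFor`, and the sample counts are hypothetical.

```typescript
// Sketch only: compares per-table row counts and maps the outcome to the
// documented exit codes (0 = pass, 1 = checks failed).
type RowCounts = Record<string, number>;

interface Mismatch {
  table: string;
  source: number;
  restored: number | undefined; // undefined when the table is missing
}

function compareRowCounts(source: RowCounts, restored: RowCounts): Mismatch[] {
  const mismatches: Mismatch[] = [];
  for (const [table, count] of Object.entries(source)) {
    if (restored[table] !== count) {
      mismatches.push({ table, source: count, restored: restored[table] });
    }
  }
  return mismatches;
}

function exitCodeFor(mismatches: Mismatch[]): 0 | 1 {
  return mismatches.length === 0 ? 0 : 1;
}

// Made-up counts for illustration.
const sourceCounts: RowCounts = { User: 1200, Property: 5400, Payment: 310 };
const restoredCounts: RowCounts = { User: 1200, Property: 5399, Payment: 310 };
const found = compareRowCounts(sourceCounts, restoredCounts);
console.log(found, exitCodeFor(found));
```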
CI/CD Backup Verification:
- GitHub Action: .github/workflows/backup-verify.yml
- Runs weekly Sundays 05:00 UTC
- Also manually triggerable with skip_cleanup option
- Uploads JSON report as artifact
Caching & Search
Redis
Image: redis:7-alpine
Port: 6379
Production Configuration:
# AOF persistence, password authentication required,
# 512 MB memory cap (prod), LRU eviction when the cap is reached.
# Comments sit above the command: a comment after a trailing backslash breaks the line continuation.
redis-server \
--appendonly yes \
--requirepass ${REDIS_PASSWORD} \
--maxmemory 512mb \
--maxmemory-policy allkeys-lru
Development Configuration:
redis-server \
--appendonly yes \
--maxmemory 256mb \
--maxmemory-policy allkeys-lru
ioredis Client Configuration:
// From RedisService in apps/api/src/modules/shared/infrastructure/redis.service.ts
{
host: process.env.REDIS_HOST ?? 'localhost',
port: Number(process.env.REDIS_PORT ?? 6379),
password: process.env.REDIS_PASSWORD ?? undefined,
lazyConnect: true, // App starts even if Redis unavailable
enableReadyCheck: false, // Prevents "Redis is not ready" errors during transient outages
maxRetriesPerRequest: 1, // Fail fast (single retry, no exponential backoff)
retryStrategy(times: number): number {
return Math.min(times * 1000, 5000); // 1s → 2s → 3s → 4s → 5s → 5s...
}
}
Graceful Degradation:
- Cache misses don't fail the application
- CacheService catches Redis errors and returns cache miss
- App serves data directly from PostgreSQL if Redis down
- Health check at GET /health/redis warns but doesn't fail readiness probe
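The degradation behavior above follows the cache-aside pattern with errors treated as misses. The sketch below shows that pattern under stated assumptions: `KeyValueCache` and `getOrLoad` are hypothetical names, not the project's CacheService API, and values are assumed to be JSON-serializable.

```typescript
// Hypothetical minimal cache interface; the real CacheService may differ.
interface KeyValueCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

async function getOrLoad<T>(
  cache: KeyValueCache,
  key: string,
  ttlSeconds: number,
  load: () => Promise<T>,
): Promise<T> {
  try {
    const hit = await cache.get(key);
    if (hit !== null) return JSON.parse(hit) as T;
  } catch {
    // Redis unavailable: fall through and treat it as a cache miss.
  }
  const value = await load(); // e.g. a PostgreSQL query
  try {
    await cache.set(key, JSON.stringify(value), ttlSeconds);
  } catch {
    // Failing to populate the cache must not fail the request.
  }
  return value;
}

// Usage with a cache that always fails, simulating a Redis outage.
const downCache: KeyValueCache = {
  get: async () => { throw new Error("ECONNREFUSED"); },
  set: async () => { throw new Error("ECONNREFUSED"); },
};
getOrLoad(downCache, "listing:1", 60, async () => ({ id: 1 })).then((v) => console.log(v));
```

The trade-off is that an outage shifts the full read load onto PostgreSQL, so latency rises while availability is preserved.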
Use Cases:
- Session storage
- Cache layer for expensive queries
- Rate limiting (if implemented)
- Real-time counters
Typesense
Image: typesense/typesense:27.1
Port: 8108 (HTTP only, internal Docker network)
API Key: ${TYPESENSE_API_KEY} (must be set in .env)
Collection Schema:
Collection Name: "listings"
Fields:
- listingId (string)
- propertyId (string)
- title (string, searchable, highlights)
- description (string, searchable, highlights)
- propertyType (string, faceted)
- transactionType (string, faceted: SALE/RENT)
- priceVND (int64, sortable)
- pricePerM2 (float, optional)
- areaM2 (float)
- bedrooms (int32, faceted)
- bathrooms (int32, faceted)
- floors (int32)
- direction (string, faceted: NORTH/SOUTH/EAST/WEST/etc)
- address (string)
- ward (string, faceted)
- district (string, faceted)
- city (string, faceted)
- location (geopoint) — for radius search
- agentId (string)
- sellerId (string)
- status (string, faceted: ACTIVE/SOLD/DRAFT/etc)
- publishedAt (int64, sortable)
- viewCount (int32)
- saveCount (int32)
- projectName (string, faceted)
- amenities (string[], faceted)
Search Features:
- Full-text search on: title, description, address, district, city, projectName
- Query weights: title=5, description=3, address=2, district=2, city=1, projectName=2
- Filtering: propertyType, transactionType, bedrooms, district, city, status, amenities
- Geo-search: radius-based queries (lat, lng, km)
- Sorting: price (asc/desc), distance (asc from geopoint), date (desc), relevance
- Highlights: HTML marks on matched terms in title and description
- Facets: Return aggregated counts for filtering
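As an illustration of how these features map onto Typesense search parameters, the sketch below builds a parameter object from the fields and weights listed above. `buildSearchParams` and `ListingSearchInput` are hypothetical names, the default `status:=ACTIVE` filter and the facet selection are assumptions, and filter values are not escaped here.

```typescript
// Hypothetical input shape; the real repository's search params may differ.
interface ListingSearchInput {
  query: string;
  city?: string;
  transactionType?: "SALE" | "RENT";
  near?: { lat: number; lng: number; radiusKm: number };
}

function buildSearchParams(input: ListingSearchInput): Record<string, string> {
  // Assumption: only ACTIVE listings are searchable by default.
  const filters: string[] = ["status:=ACTIVE"];
  if (input.city) filters.push(`city:=${input.city}`);
  if (input.transactionType) filters.push(`transactionType:=${input.transactionType}`);
  if (input.near) {
    const { lat, lng, radiusKm } = input.near;
    // Typesense geo filter: field:(lat, lng, radius km)
    filters.push(`location:(${lat}, ${lng}, ${radiusKm} km)`);
  }
  return {
    q: input.query,
    query_by: "title,description,address,district,city,projectName",
    query_by_weights: "5,3,2,2,1,2", // same order as query_by
    filter_by: filters.join(" && "),
    // Nearest first for geo queries, otherwise newest first.
    sort_by: input.near
      ? `location(${input.near.lat}, ${input.near.lng}):asc`
      : "publishedAt:desc",
    facet_by: "propertyType,district,city",
    highlight_fields: "title,description",
  };
}

console.log(buildSearchParams({
  query: "apartment",
  city: "Da Nang",
  near: { lat: 16.05, lng: 108.2, radiusKm: 5 },
}));
```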
TypesenseSearchRepository (apps/api/src/modules/search/infrastructure/services/typesense-search.repository.ts):
- ensureCollection(): Creates schema if not exists
- dropCollection(): Cleanup (testing only)
- indexDocument(doc): Upsert single document
- indexDocuments(docs): Bulk import with error reporting
- removeDocument(id): Delete by ID
- search(params): Execute search with filters, sort, pagination
Graceful Degradation:
- If Typesense down, search falls back to PostgreSQL full-text search
- TypesenseClientService implements retry logic with exponential backoff
- Health check at GET /health returns JSON status
Monitoring & Observability
Prometheus
Image: prom/prometheus:v2.51.0
Port: 9090
Retention: 15 days (dev), 30 days (prod)
Lifecycle API: Enabled (--web.enable-lifecycle)
Scrape Targets (monitoring/prometheus/prometheus.yml):
scrape_configs:
  - job_name: goodgo-api
    metrics_path: /metrics
    static_configs:
      # Dev (API on host): use host.docker.internal:3001
      # Prod (API in container): use api:3001
      - targets: ['api:3001']
        labels:
          service: goodgo-api
          environment: production  # or development
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
Expected Metrics from API:
- goodgo_api_request_duration_seconds_bucket{le, route, method}: Request latency histogram
- http_requests_total{status_code, job}: Request count by status code
- Custom business metrics (if implemented in NestJS @prometheus decorators)
Alert Rules (monitoring/prometheus/alert-rules.yml)
Latency Alerts:
- ApiLatencyP99High (warning)
  - Trigger: p99 latency > 1s for 5 minutes
  - Dashboard: /d/goodgo-api-latency/goodgo-api-latency
  - Runbook: https://docs.goodgo.vn/runbooks/api-latency-high
- ApiEndpointLatencyP99High (warning)
  - Trigger: Per-endpoint p99 > 2s for 5 minutes
  - Annotates: method, route labels
- ApiLatencyP99Critical (critical, SLO breach)
  - Trigger: p99 latency > 3s for 3 minutes
  - Escalation required
  - Runbook: https://docs.goodgo.vn/runbooks/api-latency-critical
Error Rate Alert:
- ApiErrorRate5xxHigh (warning)
  - Trigger: 5xx error rate > 1% for 5 minutes
  - Uses: (5xx errors / total requests) * 100
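The arithmetic behind the error-rate trigger can be checked by hand with a small helper. This is only a worked example of the formula above, not the actual alert rule; the function names and the sample counts are made up.

```typescript
// Worked example of (5xx errors / total requests) * 100.
function errorRatePercent(fiveXxCount: number, totalCount: number): number {
  if (totalCount === 0) return 0; // no traffic means nothing to alert on
  return (fiveXxCount / totalCount) * 100;
}

// The alert fires only if every sample in the window is above the threshold,
// mirroring the "for 5 minutes" condition.
function breachesForWholeWindow(ratesPercent: number[], thresholdPercent: number): boolean {
  return ratesPercent.length > 0 && ratesPercent.every((r) => r > thresholdPercent);
}

// Hypothetical per-minute samples: [5xx count, total requests].
const samples: Array<[number, number]> = [[12, 1000], [15, 1000], [30, 1200], [14, 900], [20, 1100]];
const rates = samples.map(([errors, total]) => errorRatePercent(errors, total));
console.log(rates, breachesForWholeWindow(rates, 1));
```

A single healthy minute inside the window resets the condition, which is why short error bursts do not page.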
Grafana
Image: grafana/grafana:10.4.1
Port: 3002
Auth: Admin user/password from secrets (prod) or env vars (dev)
Pre-Provisioned Datasources:
- Prometheus (default, primary)
- Loki (with derived fields for correlationId linkage)
Dashboards:
- api-latency.json: API p99/p95/p50, route breakdown, slow endpoints
- api-overview.json: Request rate, error rate, uptime status
- database.json: Query latency, connection pool utilization, slow queries
- logs.json: Log volume, error logs, trace links to Prometheus
- search.json: Typesense query latency, indexing rate, collection size
- web-vitals.json: Frontend Core Web Vitals (if client-side instrumentation is enabled)
- business-metrics.json: Listings created, payments processed, user signups
Admin Console Access:
- URL: http://localhost:3002 (dev) or ${GRAFANA_PORT} (prod)
- Default user: admin (change password on first login)
- Non-signup mode (GF_USERS_ALLOW_SIGN_UP: false)
Loki & Promtail (Log Aggregation)
Loki: grafana/loki:3.0.0, port 3100
Configuration:
schema:
- from: 2024-01-01
store: tsdb
schema: v13
limits:
max_entries_limit_per_query: 5000
ingestion_rate_mb: 4
ingestion_burst_size_mb: 6
retention: 360h (15 days)
Promtail: grafana/promtail:3.0.0
Configuration:
- Scrapes Docker logs from goodgo-net bridge network
- Parses Pino JSON structured logs
- Extracts labels: level, context, component, service
- Structured metadata: method, url, statusCode, correlationId, duration
- Derives timestamp from Pino output (RFC3339Nano)
Expected Log Format (Pino):
{
"level": 30, // info
"time": "2026-04-11T10:30:00Z",
"msg": "POST /api/listings",
"correlationId": "abc-123-def",
"context": "ListingController",
"component": "api",
"method": "POST",
"url": "/api/listings",
"statusCode": 201,
"duration": 150
}
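To make the label extraction concrete, the sketch below parses one Pino line and derives the labels listed above. It only mirrors what the Promtail pipeline is described as doing; `extractLabels` is a hypothetical helper, and the numeric-to-name mapping uses Pino's default levels.

```typescript
// Pino's default numeric levels.
const PINO_LEVELS: Record<number, string> = {
  10: "trace", 20: "debug", 30: "info", 40: "warn", 50: "error", 60: "fatal",
};

interface LogLabels {
  level: string;
  context: string;
  component: string;
}

function extractLabels(line: string): LogLabels | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return null; // not JSON, e.g. a raw stack trace line
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const entry = parsed as Record<string, unknown>;
  const level =
    typeof entry.level === "number" ? (PINO_LEVELS[entry.level] ?? "unknown") : "unknown";
  return {
    level,
    context: String(entry.context ?? ""),
    component: String(entry.component ?? ""),
  };
}

const sampleLine =
  '{"level":30,"time":"2026-04-11T10:30:00Z","msg":"POST /api/listings","context":"ListingController","component":"api"}';
console.log(extractLabels(sampleLine));
```

Keeping labels to low-cardinality fields such as level and component, with correlationId left as structured metadata, matches the split described in the configuration above.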
Payment Integration
Supported Payment Providers
Enum: PaymentProvider (Prisma)
- VNPAY: VNPay (Vietnam payment gateway)
- MOMO: MoMo (Vietnam mobile wallet)
- ZALOPAY: ZaloPay (Vietnam digital wallet)
- BANK_TRANSFER: Manual bank transfer (offline)
Payment Flow & Callback Handling
Database Schema (Payment Model):
model Payment {
  id             String          @id @default(cuid())
  userId         String
  transactionId  String?
  provider       PaymentProvider
  type           PaymentType     // SUBSCRIPTION, LISTING_FEE, DEPOSIT, FEATURED_LISTING
  amountVND      BigInt
  status         PaymentStatus   // PENDING, PROCESSING, COMPLETED, FAILED, REFUNDED
  providerTxId   String?         // External transaction ID from VNPay/MoMo/ZaloPay
  callbackData   Json?           // Raw callback payload (for audit)
  idempotencyKey String?         // Prevent duplicate payments (userId, provider, idempotencyKey unique)
  createdAt      DateTime        @default(now())
  updatedAt      DateTime        @updatedAt
}
enum PaymentStatus {
  PENDING
  PROCESSING
  COMPLETED
  FAILED
  REFUNDED
}
enum PaymentType {
  SUBSCRIPTION
  LISTING_FEE
  DEPOSIT
  FEATURED_LISTING
}
Command Handler: HandleCallbackHandler
(apps/api/src/modules/payments/application/commands/handle-callback/handle-callback.handler.ts)
- Callback Signature Verification:
  - Uses PAYMENT_GATEWAY_FACTORY to route to correct provider (VNPay/MoMo/ZaloPay)
  - Gateway.verifyCallback() validates HMAC signature
  - Throws ValidationException if signature invalid
- Idempotent Status Transition:
  - Only updates payments in state: PENDING or PROCESSING
  - Atomically transitions to COMPLETED or FAILED
  - If already in terminal state (COMPLETED/FAILED/REFUNDED), returns existing status (idempotent)
  - Logs warning if payment not found
- Domain Event Publishing:
  - Reconstructs domain entity from repository
  - Emits PaymentCompletedEvent or PaymentFailedEvent
  - Event bus publishes events to subscribers (e.g., subscription creation, listing activation)
- Response: { paymentId: string, status: PaymentStatus, isSuccess: boolean }
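The signature check in the first step can be sketched generically as below. This is not any provider's actual signing scheme: the canonical string (sorted `key=value` pairs joined by `&`), the SHA-256 algorithm, the `signature` field name, and both function names are assumptions for illustration. Each gateway defines its own field order, algorithm, and encoding.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed canonical form: sorted key=value pairs joined by "&".
function signParams(params: Record<string, string>, secret: string): string {
  const canonical = Object.keys(params)
    .sort()
    .map((key) => `${key}=${params[key]}`)
    .join("&");
  return createHmac("sha256", secret).update(canonical).digest("hex");
}

function verifyCallbackSignature(
  data: Record<string, string>,
  secret: string,
  signatureField = "signature", // hypothetical field name
): boolean {
  const { [signatureField]: received, ...rest } = data;
  if (!received) return false;
  const expected = Buffer.from(signParams(rest, secret), "hex");
  const actual = Buffer.from(received, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}

const payload = { orderId: "order-1", amount: "500000", resultCode: "0" };
const signed = { ...payload, signature: signParams(payload, "test-secret") };
console.log(verifyCallbackSignature(signed, "test-secret"));
```

Using a constant-time comparison instead of `===` avoids leaking how many leading characters of a forged signature were correct.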
Payment Gateway Interface (payment-gateway.interface.ts):
interface IPaymentGateway {
readonly provider: PaymentProvider
createPaymentUrl(params: CreatePaymentUrlParams): Promise<CreatePaymentUrlResult>
verifyCallback(data: Record<string, string>): CallbackVerifyResult
refund(params: RefundParams): Promise<RefundResult>
}
interface CreatePaymentUrlParams {
orderId: string
amountVND: bigint
description: string
returnUrl: string
ipAddress: string
}
interface CallbackVerifyResult {
isValid: boolean
orderId: string
providerTxId: string
isSuccess: boolean
rawData: Record<string, unknown>
}
interface RefundParams {
providerTxId: string
amountVND: bigint
reason: string
}
interface RefundResult {
success: boolean
refundTxId: string | null
}
Environment Variables
VNPay:
VNPAY_TMN_CODE=<merchant terminal code>
VNPAY_HASH_SECRET=<HMAC secret key>
VNPAY_BASE_URL=https://sandbox.vnpayment.vn/paymentv2/vpcpay.html
VNPAY_API_URL=https://sandbox.vnpayment.vn/merchant_webapi/api/transaction
MoMo:
MOMO_PARTNER_CODE=<partner code>
MOMO_ACCESS_KEY=<access key>
MOMO_SECRET_KEY=<secret key>
MOMO_ENDPOINT=https://test-payment.momo.vn/v2/gateway/api
ZaloPay:
ZALOPAY_APP_ID=<app ID>
ZALOPAY_KEY1=<key 1 (for creating payments)>
ZALOPAY_KEY2=<key 2 (for callback verification)>
ZALOPAY_ENDPOINT=https://sb-openapi.zalopay.vn/v2
Race Condition & Idempotency Protection
Problem: Multiple callbacks may arrive for same payment (network retries, duplicate notifications)
Solution:
- Unique Idempotency Key: Payment_idempotency_unique (userId, provider, idempotencyKey)
  - Prevents duplicate payment records
  - Generated by client/API before creating payment
- Atomic Status Update: paymentRepo.updateIfStatus(orderId, ['PENDING', 'PROCESSING'], newStatus)
  - Only updates if current status in allowed list
  - Returns updated entity or null if already terminal
- Terminal State Check: If already COMPLETED/FAILED/REFUNDED, handler returns existing state
  - No re-triggering of domain events
  - No double billing or duplicate transactions
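The conditional update can be illustrated with an in-memory stand-in for the repository. This is a sketch of the semantics only: `InMemoryPaymentRepo` is hypothetical, and in a real database the same guard would typically be a single UPDATE whose WHERE clause restricts the current status, so that concurrent callbacks cannot both succeed.

```typescript
type PaymentStatus = "PENDING" | "PROCESSING" | "COMPLETED" | "FAILED" | "REFUNDED";

interface PaymentRecord {
  id: string;
  status: PaymentStatus;
}

// Hypothetical in-memory repository used only to demonstrate the semantics.
class InMemoryPaymentRepo {
  private readonly payments = new Map<string, PaymentRecord>();

  save(payment: PaymentRecord): void {
    this.payments.set(payment.id, { ...payment });
  }

  // Returns the updated record, or null when the payment is missing or
  // its current status is not in the allowed list (already terminal).
  updateIfStatus(id: string, allowed: PaymentStatus[], next: PaymentStatus): PaymentRecord | null {
    const current = this.payments.get(id);
    if (!current || !allowed.includes(current.status)) return null;
    current.status = next;
    return { ...current };
  }
}

const repo = new InMemoryPaymentRepo();
repo.save({ id: "pay-1", status: "PENDING" });

// First callback wins; the duplicate is a no-op, so no events fire twice.
const first = repo.updateIfStatus("pay-1", ["PENDING", "PROCESSING"], "COMPLETED");
const duplicate = repo.updateIfStatus("pay-1", ["PENDING", "PROCESSING"], "COMPLETED");
console.log(first, duplicate);
```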
Health Checks
API Health Endpoints
Health Controller (apps/api/src/modules/health/health.controller.ts)
- GET /health: Liveness Probe (always 200 if process running)
  - Uses: @HealthCheck() on empty probe list
  - Response: { "status": "ok", "timestamp": "..." }
  - Use Case: Kubernetes liveness probe and Docker container healthcheck (initial startup)
- GET /health/ready: Readiness Probe (checks dependencies)
  - Checks: PostgreSQL + Redis connectivity
  - Response: { "status": "ok", "checks": { "database": { "status": "up" }, "redis": { "status": "up" } } }
  - Use Case: Load balancer, before accepting traffic
  - Failure: Returns 503 if any dependency down
- GET /health/db: Database Readiness Only
  - Checks: PostgreSQL connectivity via SELECT 1 query
  - Use Case: Manual database troubleshooting
- GET /health/redis: Redis Readiness Only
  - Checks: Redis PING command
  - Use Case: Manual Redis troubleshooting
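How the readiness endpoint could fold individual probes into one response is sketched below. It is framework-free and does not use the health-check library's API; `readiness` and the probe map are hypothetical, and only the response shape and the 503-on-failure rule come from the description above.

```typescript
type CheckStatus = "up" | "down";

interface ReadinessResult {
  httpStatus: 200 | 503;
  body: { status: "ok" | "error"; checks: Record<string, { status: CheckStatus }> };
}

// Each probe resolves when the dependency is reachable and rejects otherwise.
async function readiness(probes: Record<string, () => Promise<void>>): Promise<ReadinessResult> {
  const checks: Record<string, { status: CheckStatus }> = {};
  await Promise.all(
    Object.entries(probes).map(async ([name, probe]) => {
      try {
        await probe();
        checks[name] = { status: "up" };
      } catch {
        checks[name] = { status: "down" };
      }
    }),
  );
  const allUp = Object.values(checks).every((c) => c.status === "up");
  return { httpStatus: allUp ? 200 : 503, body: { status: allUp ? "ok" : "error", checks } };
}

// Simulated probes: database reachable, Redis refusing connections.
readiness({
  database: async () => {},
  redis: async () => { throw new Error("ECONNREFUSED"); },
}).then((result) => console.log(result.httpStatus, JSON.stringify(result.body)));
```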
Health Check Implementations
PrismaHealthIndicator (apps/api/src/modules/health/infrastructure/prisma.health.ts):
async isHealthy(key: string): Promise<HealthIndicatorResult> {
try {
await this.prisma.$queryRawUnsafe('SELECT 1');
return this.getStatus(key, true);
} catch {
throw new HealthCheckError('Database check failed', this.getStatus(key, false));
}
}
RedisHealthIndicator (apps/api/src/modules/health/infrastructure/redis.health.ts):
async isHealthy(key: string): Promise<HealthIndicatorResult> {
try {
const client = this.redis.getClient();
const pong = await client.ping();
const isHealthy = pong === 'PONG';
const result = this.getStatus(key, isHealthy);
if (isHealthy) return result;
throw new HealthCheckError('Redis ping failed', result);
} catch (error) {
if (error instanceof HealthCheckError) throw error;
throw new HealthCheckError('Redis check failed', this.getStatus(key, false));
}
}
Docker Container Health Checks
API Container:
healthcheck:
  test: ['CMD', 'node', '-e', "fetch('http://localhost:3001/health').then(r => { if (!r.ok) throw 1 }).catch(() => process.exit(1))"]
  interval: 30s
  timeout: 5s
  retries: 5
  start_period: 30s
Web Container:
healthcheck:
  test: ['CMD', 'node', '-e', "fetch('http://localhost:3000').then(r => { if (!r.ok) throw 1 }).catch(() => process.exit(1))"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 15s
PostgreSQL:
healthcheck:
  test: ['CMD-SHELL', 'pg_isready -U ${DB_USER} -d ${DB_NAME}']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s
Redis:
healthcheck:
  test: ['CMD', 'redis-cli', '-a', '${REDIS_PASSWORD}', 'ping']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 10s
Typesense:
healthcheck:
  test: ['CMD', 'curl', '-sf', 'http://localhost:8108/health']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 15s
Environment Variables
Complete .env.example Reference
PostgreSQL:
DB_HOST=localhost
DB_PORT=5432
DB_NAME=goodgo
DB_USER=goodgo
DB_PASSWORD=CHANGE_ME
DATABASE_URL=postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}?schema=public
DATABASE_URL_DIRECT=postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}?schema=public
PgBouncer (Prod Only):
PGBOUNCER_POOL_SIZE=20
PGBOUNCER_MAX_CLIENT_CONN=200
PGBOUNCER_ADMIN_PASSWORD=CHANGE_ME
PGBOUNCER_STATS_PASSWORD=CHANGE_ME
Redis:
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=
REDIS_URL=redis://${REDIS_HOST}:${REDIS_PORT}
Typesense:
TYPESENSE_HOST=localhost
TYPESENSE_PORT=8108
TYPESENSE_PROTOCOL=http
TYPESENSE_API_KEY=CHANGE_ME
MinIO:
MINIO_ENDPOINT=localhost
MINIO_PORT=9000
MINIO_CONSOLE_PORT=9001
MINIO_ACCESS_KEY=CHANGE_ME
MINIO_SECRET_KEY=CHANGE_ME
MINIO_BUCKET=goodgo-media
MINIO_USE_SSL=false
NestJS API:
API_PORT=3000
PORT=3001
NODE_ENV=development
CORS_ORIGINS=http://localhost:3000,http://localhost:3001
JWT / Authentication (REQUIRED):
JWT_SECRET=<generate with: openssl rand -base64 48>
JWT_EXPIRES_IN=15m
JWT_REFRESH_SECRET=<generate with: openssl rand -base64 48>
JWT_REFRESH_EXPIRES_IN=7d
OAuth Providers:
GOOGLE_CLIENT_ID=
GOOGLE_CLIENT_SECRET=
GOOGLE_CALLBACK_URL=http://localhost:3001/auth/google/callback
ZALO_APP_ID=
ZALO_APP_SECRET=
ZALO_CALLBACK_URL=http://localhost:3001/auth/zalo/callback
FRONTEND_URL=http://localhost:3000
Next.js Web:
NEXT_PUBLIC_API_URL=http://localhost:3000
WEB_PORT=3001
AI Service (Python/FastAPI):
AI_SERVICE_PORT=8000
AI_SERVICE_URL=http://localhost:8000
CLAUDE_API_KEY=
AI_DEBUG=false
AI_LOG_LEVEL=info
Map Integration:
NEXT_PUBLIC_MAPBOX_TOKEN=
Payment Gateways:
VNPAY_TMN_CODE=
VNPAY_HASH_SECRET=
VNPAY_BASE_URL=https://sandbox.vnpayment.vn/paymentv2/vpcpay.html
VNPAY_API_URL=https://sandbox.vnpayment.vn/merchant_webapi/api/transaction
MOMO_PARTNER_CODE=
MOMO_ACCESS_KEY=
MOMO_SECRET_KEY=
MOMO_ENDPOINT=https://test-payment.momo.vn/v2/gateway/api
ZALOPAY_APP_ID=
ZALOPAY_KEY1=
ZALOPAY_KEY2=
ZALOPAY_ENDPOINT=https://sb-openapi.zalopay.vn/v2
Email / SMTP:
SMTP_HOST=localhost
SMTP_PORT=1025
SMTP_USER=
SMTP_PASS=
SMTP_FROM=noreply@goodgo.vn
Firebase Cloud Messaging (Optional):
FIREBASE_SERVICE_ACCOUNT=
Sentry Error Tracking:
SENTRY_DSN=
NEXT_PUBLIC_SENTRY_DSN=
SENTRY_AUTH_TOKEN=
SENTRY_ORG=
SENTRY_PROJECT=
KYC Field Encryption (REQUIRED Prod):
KYC_ENCRYPTION_KEY=<generate with: openssl rand -hex 32> # 64 hex chars (32 bytes)
KYC_ENCRYPTION_KEY_VERSION=1
Logging:
LOG_LEVEL=info
Backup & Recovery
Automated Daily Backups
Service: pg-backup container (runs inside docker compose)
Backup Script: scripts/backup/pg-backup.sh
# Daily cron job: 02:00 UTC
PGHOST=postgres \
PGPORT=5432 \
PGUSER=goodgo \
PGDATABASE=goodgo \
PGPASSWORD=<secret> \
BACKUP_DIR=/backups \
RETENTION_DAYS=7 \
/scripts/pg-backup.sh
Behavior:
- Creates dump with pg_dump --format=custom --compress=6
- Saves as goodgo_YYYYMMDD_HHMMSS.sql.gz
- Prunes backups older than 7 days (configurable)
- Logs to /var/log/pg-backup.log
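The pruning rule can be reasoned about with the sketch below, which derives each backup's age from the timestamp in its file name. This is an assumption for illustration: the actual shell script may prune by file modification time instead, and `backupTimestamp` and `expiredBackups` are hypothetical names.

```typescript
// Matches the documented pattern goodgo_YYYYMMDD_HHMMSS.sql.gz
const BACKUP_NAME = /^goodgo_(\d{4})(\d{2})(\d{2})_(\d{2})(\d{2})(\d{2})\.sql\.gz$/;

function backupTimestamp(fileName: string): Date | null {
  const match = BACKUP_NAME.exec(fileName);
  if (!match) return null;
  const [year, month, day, hour, minute, second] = match.slice(1).map(Number);
  // Backups are scheduled in UTC, so interpret the name as UTC.
  return new Date(Date.UTC(year, month - 1, day, hour, minute, second));
}

function expiredBackups(fileNames: string[], now: Date, retentionDays: number): string[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return fileNames.filter((name) => {
    const taken = backupTimestamp(name);
    // Files that do not match the pattern are never pruned.
    return taken !== null && taken.getTime() < cutoff;
  });
}

const files = [
  "goodgo_20260403_020000.sql.gz",
  "goodgo_20260405_020000.sql.gz",
  "goodgo_20260410_020000.sql.gz",
  "notes.txt",
];
console.log(expiredBackups(files, new Date("2026-04-11T02:30:00Z"), 7));
```

Ignoring files that do not match the naming pattern keeps an overly broad cleanup from deleting unrelated files in the backup volume.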
Restore from Backup:
# Interactive restore prompt
docker compose -f docker-compose.prod.yml exec pg-backup bash -c \
'pg_restore -h postgres -p 5432 -U goodgo -d goodgo \
--clean --if-exists /backups/goodgo_20260410_020000.sql.gz'
# Or using restore script
docker compose -f docker-compose.prod.yml run --rm pg-verify-backup bash -c \
'source /scripts/pg-restore.sh /backups/goodgo_20260410_020000.sql.gz'
Backup Verification
Service: pg-verify-backup container (on-demand, profile: tools)
Verification Script: scripts/backup/pg-verify-backup.sh
# Usage:
docker compose -f docker-compose.prod.yml run --rm pg-verify-backup
# Or with options:
SKIP_CLEANUP=1 REPORT_FILE=/backups/verify-report.json \
docker compose -f docker-compose.prod.yml run --rm pg-verify-backup
Verification Steps:
- Creates isolated test database: goodgo_verify_<timestamp>
- Enables PostGIS extension
- Restores backup into test DB
- Verifies all 22 tables exist
- Compares row counts between source and restored
- Checksums critical tables using MD5 hashes
- Checks indexes, enum types
- Generates JSON report with results
- Cleanup: Drops test DB (unless SKIP_CLEANUP=1)
JSON Report Structure:
{
"timestamp": "2026-04-11T10:30:00Z",
"backupFile": "/backups/goodgo_20260410_020000.sql.gz",
"backupSize": "150M",
"testDatabase": "goodgo_verify_20260411_103000",
"restoreDurationSeconds": 45,
"passed": 28,
"failed": 0,
"warnings": 2,
"result": "pass",
"checks": [
{ "check": "Database creation", "status": "pass", "detail": "Test database created" },
{ "check": "Restore", "status": "pass", "detail": "pg_restore completed cleanly in 45s" },
{ "check": "Table existence", "status": "pass", "detail": "All 22 expected tables present" },
{ "check": "Row counts", "status": "pass", "detail": "All tables match source database" },
{ "check": "Checksum: User identities", "status": "pass", "detail": "Hashes match (abc123def456...)" },
...
]
}
GitHub Action Backup Verification:
- File: .github/workflows/backup-verify.yml
- Schedule: Weekly Sundays 05:00 UTC
- Also: Manual trigger with skip_cleanup option
- Artifacts: Uploads JSON report for 30 days
Deployment Pipeline
GitHub Actions CI/CD
Workflows:
- .github/workflows/ci.yml: Lint, typecheck, test, build (on push/PR to master)
- .github/workflows/deploy.yml: Build Docker images, deploy to staging/prod
- .github/workflows/e2e.yml: E2E tests (spins up full docker-compose.ci.yml)
- .github/workflows/backup-verify.yml: Weekly backup verification
- .github/workflows/security.yml: Dependency scanning, SAST
- .github/workflows/codeql.yml: GitHub CodeQL analysis
- .github/workflows/load-test.yml: K6 load testing
CI Pipeline (ci.yml)
On: push master, pull_request master
Node: 22
Concurrency: Cancel previous runs on same ref
Jobs:
- Lint → Typecheck → Test → Build
  - Installs pnpm, Node 22
  - Runs linter (eslint)
  - Type checks (tsc)
  - Unit tests (jest)
  - Builds all apps (turbo)
  - PostgreSQL 16 service available (goodgo_test DB)
- E2E Tests (depends on ci job)
  - Full docker-compose.ci.yml services (postgres, redis, typesense, minio)
  - Runs end-to-end test suite
  - Timeout: 20 minutes
  - Env vars: DATABASE_URL, JWT secrets, payment test codes
Deploy Pipeline (deploy.yml)
On:
- push master (auto-deploys to staging)
- Manual workflow_dispatch (choose staging or production)
Jobs:
- Build API Image
  - Builds: goodgo-api:${IMAGE_TAG}
  - Dockerfile: apps/api/Dockerfile
  - Registry: ghcr.io/goodgo/goodgo-api
  - Tags: git SHA, branch name, latest (on master)
- Build Web Image
  - Builds: goodgo-web:${IMAGE_TAG}
  - Dockerfile: apps/web/Dockerfile
  - Registry: ghcr.io/goodgo/goodgo-web
- Build AI Services Image
  - Builds: goodgo-ai-services:${IMAGE_TAG}
  - Context: libs/ai-services/
  - Registry: ghcr.io/goodgo/goodgo-ai-services
- Deploy to Staging
  - Condition: github.event_name == 'push' || inputs.environment == 'staging'
  - SSH into staging host
  - Pulls new images from GHCR
  - Rolling update (zero downtime):
      docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
      docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
      docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services
  - Runs migrations: docker compose exec api npx prisma migrate deploy
  - Prunes old images
- Deploy to Production
  - Only on manual workflow_dispatch with environment: production
  - Same steps as staging
  - Requires environment: production approval (GitHub security)
Dockerfile Multi-Stage Builds
API (apps/api/Dockerfile):
- Base: node:22-slim + pnpm 10.27.0
- Deps: Install locked dependencies (layer caching)
- Build: Compile TypeScript, generate Prisma client
- Prune: pnpm deploy --prod (removes dev deps, hoists prod deps)
- Production: Minimal image, dumb-init for signals, non-root user
Web (apps/web/Dockerfile):
- Base: node:22-slim + pnpm
- Deps: Install dependencies
- Build: next build → standalone output + static files
- Production: Copy .next/standalone, public, static assets
AI Services (libs/ai-services/Dockerfile):
- Base: python:3.12-slim
- Install: System deps (gcc, g++), dumb-init, FastAPI/XGBoost/underthesea
- Models: Pre-download underthesea ML models at build time
- User: Run as non-root appuser
- CMD: uvicorn app.main:app --host 0.0.0.0 --port 8000
Troubleshooting Guide
Check Service Status
# All services
docker compose -f docker-compose.prod.yml ps
# Single service
docker compose -f docker-compose.prod.yml ps api
# Get logs
docker compose -f docker-compose.prod.yml logs -f api --tail=100
# Health check status
docker compose -f docker-compose.prod.yml exec api curl http://localhost:3001/health
Common Issues
1. API Service Not Healthy (stuck in "health-check-failed" state)
Symptoms:
- docker compose ps shows (health: starting) for >2 minutes
- docker compose logs api shows connection errors
Diagnosis:
# Check API liveness
docker compose exec api curl http://localhost:3001/health
# Check readiness (includes DB + Redis checks)
docker compose exec api curl http://localhost:3001/health/ready
# Check specific dependencies
docker compose exec api curl http://localhost:3001/health/db
docker compose exec api curl http://localhost:3001/health/redis
Solutions:
- PostgreSQL not ready:
docker compose ps postgres  # Should show (healthy)
docker compose exec postgres pg_isready -U goodgo -d goodgo
docker compose logs postgres --tail=50
- Redis not ready:
docker compose exec redis redis-cli ping  # Should return PONG
docker compose logs redis --tail=50
- PgBouncer not ready (prod):
docker compose exec pgbouncer pg_isready -h 127.0.0.1 -p 6432 -U goodgo
docker compose logs pgbouncer --tail=50
- Database schema not initialized:
# Run migrations manually
docker compose exec api npx prisma migrate deploy
# Or check if schema exists
docker compose exec postgres psql -U goodgo -d goodgo -c "\dt"
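When a dependency is slow to come up, polling the readiness endpoint against a deadline is usually less error-prone than re-running curl by hand. The helper below is a minimal sketch and not part of the repository: wait_until is a hypothetical name, and the 120-second budget in the example only mirrors the ">2 minutes" symptom listed above.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it exits 0 or roughly
# TIMEOUT_SECONDS of sleeping has elapsed (command runtime is not counted).
# Returns 0 on success, 1 on timeout.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 0
}

# Example (assumes the api container ships curl, as the commands above do):
#   wait_until 120 5 docker compose exec -T api \
#     curl -fsS http://localhost:3001/health/ready \
#     && echo "api ready" || docker compose logs api --tail=50
```

The -f flag makes curl exit non-zero on an HTTP error status, so a 503 from /health/ready counts as "not ready yet" rather than success.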
2. Database Connection Pool Exhaustion
Symptoms:
- Errors: Error: unable to get a connection from the pool after X s
- Slow queries pile up
- API latency spikes
Diagnosis:
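# Check pool state (prod, PgBouncer). Admin commands must target the virtual "pgbouncer" database.
# In SHOW POOLS, cl_waiting > 0 means clients are queuing for a server connection.
docker compose exec pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_stats -d pgbouncer -c "SHOW POOLS"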
# Or query PostgreSQL directly
docker compose exec postgres psql -U goodgo -d goodgo -c "SELECT count(*) FROM pg_stat_activity"
Solutions:
- Increase PGBOUNCER_POOL_SIZE (default: 20)
- Increase PGBOUNCER_MAX_CLIENT_CONN (default: 200)
- Reduce long-running queries (add query timeout)
- Check for idle connections: server_idle_timeout
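A bare count(*) on pg_stat_activity does not show whether connections are doing work or merely being held open. Grouping by the state column makes sessions stuck in "idle in transaction" stand out, and those are a common cause of exhaustion under transaction pooling. The following is a sketch under that assumption; count_by_state is a hypothetical helper name, not something the repository provides.

```shell
# count_by_state (hypothetical helper): reads one connection state per line
# on stdin and prints "<count><TAB><state>", highest count first.
# Blank lines are ignored.
count_by_state() {
  awk 'NF { n[$0]++ } END { for (s in n) printf "%d\t%s\n", n[s], s }' | sort -rn
}

# Example: feed it the state column in unaligned, tuples-only mode (-At):
#   docker compose exec -T postgres psql -U goodgo -d goodgo -At \
#     -c "SELECT state FROM pg_stat_activity WHERE state IS NOT NULL" \
#     | count_by_state
```

If "idle in transaction" dominates the output, raising the pool size will likely only postpone the problem. The application is holding transactions open.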
3. Redis Connection Failures (Non-Fatal)
Symptoms:
- Logs: Redis check failed or ECONNREFUSED
- But API still responds with slower database reads
- Health check /health/ready returns 503
Expected Behavior: Cache misses → app serves from database
Diagnosis:
# Check Redis availability
docker compose exec redis redis-cli ping
# Check RedisService logs
docker compose logs api | grep -i redis
Solutions:
- Restart Redis: docker compose restart redis
- Check memory: docker compose exec redis redis-cli info memory
- If at maxmemory, increase in docker-compose.yml and restart
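To judge how close Redis is to its limit, the used_memory and maxmemory fields of INFO memory can be turned into a percentage. This is a minimal sketch: redis_memory_pct is a hypothetical helper, and it treats maxmemory:0 as "no limit configured", which is how Redis reports an unset limit.

```shell
# redis_memory_pct (hypothetical helper): reads `redis-cli info memory`
# output on stdin and prints used_memory as a whole-number percentage of
# maxmemory, or "unlimited" when maxmemory is 0 or absent.
redis_memory_pct() {
  # INFO lines end in CRLF, so strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "unlimited"; exit }
      printf "%d\n", (used * 100) / max
    }'
}

# Example:
#   docker compose exec -T redis redis-cli info memory | redis_memory_pct
```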
4. Typesense Search Not Indexing
Symptoms:
- Search returns 0 results
- Listings created but not searchable
- /health for Typesense shows green, but the collection is empty
Diagnosis:
# Check collection exists
curl http://localhost:8108/collections -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"
# Check collection stats
curl "http://localhost:8108/collections/listings" \
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" | jq .
# Check recent docs
curl "http://localhost:8108/collections/listings/documents/search?q=*" \
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" | jq '.found'
Solutions:
- Verify TYPESENSE_API_KEY matches container env var
- Reindex all listings: docker compose exec api npx ts-node scripts/reindex-listings.ts
- If collection corrupted, drop and recreate:
curl -X DELETE "http://localhost:8108/collections/listings" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"
# Then restart API service to recreate schema
docker compose restart api
5. Payment Callback Failures
Symptoms:
- Payment status stuck in PENDING
- Logs: Invalid callback signature for provider=VNPAY
Diagnosis:
# Check payment record in DB
docker compose exec postgres psql -U goodgo -d goodgo -c \
"SELECT id, status, provider, \"providerTxId\", \"callbackData\" FROM \"Payment\" \
WHERE \"providerTxId\" = 'your-txid' ORDER BY \"createdAt\" DESC LIMIT 1;"
# Check logs for callback handler
docker compose logs api | grep -i "HandleCallbackHandler\|callback"
Solutions:
- Verify payment gateway credentials (VNPAY_HASH_SECRET, MOMO_SECRET_KEY, etc.)
- Manually verify callback signature (contact payment provider support)
- Replay callback manually (if idempotent key available):
curl -X POST http://localhost:3001/api/payments/callback \
  -H "Content-Type: application/json" \
  -d '{"provider":"VNPAY",...callback data...}'
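For the manual signature check, it can help to recompute the digest locally before escalating to the provider. The sketch below assumes the provider signs a canonical string with HMAC-SHA512 keyed by the hash secret. That assumption, and the exact canonicalisation of the callback parameters, should be confirmed against the provider's documentation and the handler under modules/payments/application/commands/handle-callback/. Both function names are hypothetical.

```shell
# hmac_sha512_hex SECRET DATA (hypothetical helper)
# Prints the hex HMAC-SHA512 of DATA keyed with SECRET. The digest is the
# last field of openssl's output regardless of the label it prints first.
# Note: SECRET is visible in the process list while openssl runs.
hmac_sha512_hex() {
  printf '%s' "$2" | openssl dgst -sha512 -hmac "$1" | awk '{ print $NF }'
}

# signatures_match EXPECTED ACTUAL (hypothetical helper)
# Case-insensitive comparison of two hex digests; exit status 0 on match.
signatures_match() {
  [ "$(printf '%s' "$1" | tr 'A-F' 'a-f')" = "$(printf '%s' "$2" | tr 'A-F' 'a-f')" ]
}

# Example (canonical_query and received_hash are placeholders that you fill
# from the stored callbackData):
#   expected=$(hmac_sha512_hex "$VNPAY_HASH_SECRET" "$canonical_query")
#   signatures_match "$expected" "$received_hash" && echo valid || echo MISMATCH
```

A mismatch against a freshly computed digest usually points to a wrong secret or a different parameter ordering rather than a tampered callback.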
6. Backup Verification Fails
Symptoms:
- GitHub Action .github/workflows/backup-verify.yml fails
- Restore test database shows mismatched row counts
Diagnosis:
# Run verification manually
docker compose -f docker-compose.ci.yml up -d --wait postgres
docker compose -f docker-compose.ci.yml exec postgres \
/scripts/pg-verify-backup.sh /backups/goodgo_latest.sql.gz
# Check JSON report
cat /tmp/backups/verify-report.json | jq .
Solutions:
- Check whether the backup file is corrupt: file goodgo_*.sql.gz
- Verify restore process: pg_restore --verbose
- Check PostGIS extension availability: psql -c "CREATE EXTENSION IF NOT EXISTS postgis;"
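The file command only inspects the header, so a truncated archive can still look like valid gzip. A full-stream integrity test catches that case before any restore is attempted. This is a minimal sketch; check_backup_archive is a hypothetical helper and is not one of the scripts under scripts/backup/.

```shell
# check_backup_archive PATH (hypothetical helper)
# Returns 0 and prints "ok: PATH" when the file exists, is non-empty, and
# gzip can decompress the whole stream; returns 2 for a missing or empty
# file and 1 for a corrupt or truncated stream.
check_backup_archive() {
  archive=$1
  if [ ! -s "$archive" ]; then
    echo "missing or empty: $archive" >&2
    return 2
  fi
  if ! gzip -t "$archive" 2>/dev/null; then
    echo "corrupt gzip stream: $archive" >&2
    return 1
  fi
  echo "ok: $archive"
}

# Example:
#   check_backup_archive /backups/goodgo_latest.sql.gz
```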
7. Memory/CPU Pressure
Symptoms:
- OOM kills, container exits 137
- CPU throttling, latency spikes
- Prometheus container_memory_usage_bytes near limit
Diagnosis:
# Check Docker stats
docker stats --no-stream
# Check limits in compose file
docker compose config | grep -A3 "resources:"
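# Check the memory limit applied to the container (bytes; 0 means no limit).
# docker inspect returns a JSON array, so index the first element.
docker inspect goodgo-api | jq '.[0].HostConfig.Memory'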
Solutions:
- Increase resource limits in docker-compose.prod.yml
- Reduce log verbosity (set LOG_LEVEL=warn)
- Implement pagination for large queries
- Scale horizontally (add more API replicas)
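Exit code 137 in the symptoms above is 128 + 9, meaning the process was terminated by signal 9 (SIGKILL). The kernel OOM killer is one sender of SIGKILL, but a forced docker kill or a stop timeout produces the same code, so it is worth confirming the cause. The helper below is a small sketch of the arithmetic; explain_exit_code is a hypothetical name.

```shell
# explain_exit_code CODE (hypothetical helper)
# Exit codes above 128 conventionally encode 128 + signal number.
explain_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "exit $code: terminated by signal $((code - 128))"
  else
    echo "exit $code: process exited on its own"
  fi
}

# Example: read the exit code and Docker's OOM flag for the API container,
# then decode the exit code.
#   docker inspect --format '{{.State.ExitCode}} oom={{.State.OOMKilled}}' goodgo-api
#   explain_exit_code 137
```

If OOMKilled is false while the exit code is 137, raising memory limits is unlikely to help. Look for a manual kill or a shutdown timeout instead.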
Prometheus Queries for Debugging
# API request latency p99
histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket[5m])) by (le))
# API error rate (5xx)
(sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
# Container memory usage
container_memory_usage_bytes{name="goodgo-api"}
# Container CPU usage
rate(container_cpu_usage_seconds_total{name="goodgo-api"}[5m])
# PostgreSQL active queries
pg_stat_activity_count{state="active"}
# Redis memory usage
redis_memory_used_bytes / 1024 / 1024 # in MB
# Typesense collection size
typesense_documents_count{collection="listings"}
Emergency Procedures
Full System Reset (dev only):
docker compose down -v # Remove all volumes!
docker system prune -a
docker compose up -d --wait
docker compose exec api npx prisma db push
docker compose exec api npx ts-node scripts/seed.ts
Database Emergency Restore:
# Find latest backup
ls -t /var/lib/docker/volumes/pg_backups/_data/goodgo_*.sql.gz | head -1
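# Restore to a new database. pg_restore cannot read a gzip-compressed file directly,
# so decompress on the fly. For a plain-SQL dump (as the .sql.gz name suggests):
createdb -h localhost -p 5432 -U goodgo goodgo_restored
gunzip -c /path/to/backup.sql.gz | psql -h localhost -p 5432 -U goodgo -d goodgo_restored
# If the dump was taken in custom format (pg_dump -Fc) and then gzipped, use instead:
# gunzip -c /path/to/backup.sql.gz | pg_restore -h localhost -p 5432 -U goodgo -d goodgo_restored --clean --if-exists --verbose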
# Verify restore
psql -U goodgo -d goodgo_restored -c "SELECT count(*) FROM \"User\";"
Force Kill Stuck Service:
# If health check broken
docker compose kill api
docker compose rm -f api
docker compose up -d api
Appendix: Key File Locations
goodgo-platform-ai/                 # Repository root
├── docker-compose.yml # Dev environment
├── docker-compose.prod.yml # Prod environment (with pgbouncer, resource limits)
├── docker-compose.ci.yml # CI/E2E test environment
├── .env.example # Template for all required env vars
│
├── apps/
│ ├── api/
│ │ ├── Dockerfile # Multi-stage NestJS build
│ │ ├── docker-entrypoint.sh # Startup script (migrations, app start)
│ │ ├── src/
│ │ │ ├── modules/health/health.controller.ts
│ │ │ ├── modules/payments/application/commands/handle-callback/
│ │ │ ├── modules/shared/infrastructure/redis.service.ts
│ │ │ └── modules/search/infrastructure/services/typesense-search.repository.ts
│ │ └── package.json
│ │
│ └── web/
│ ├── Dockerfile # Multi-stage Next.js build
│ └── package.json
│
├── libs/
│ └── ai-services/
│ ├── Dockerfile # Python FastAPI build
│ ├── app/main.py # FastAPI app entry
│ └── pyproject.toml
│
├── prisma/
│ └── schema.prisma # Complete Prisma schema (22 models)
│
├── infra/
│ └── pgbouncer/
│ ├── pgbouncer.ini # Connection pooling config
│ ├── userlist.txt.template # User list (templated)
│ └── entrypoint.sh # Env substitution script
│
├── scripts/
│ └── backup/
│ ├── pg-backup.sh # Daily backup automation
│ ├── pg-verify-backup.sh # Restore verification
│ └── pg-restore.sh # Manual restore script
│
├── monitoring/
│ ├── prometheus/
│ │ ├── prometheus.yml # Scrape config (goodgo-api metrics)
│ │ └── alert-rules.yml # Latency + error rate alerts
│ ├── loki/
│ │ └── loki-config.yml # Log aggregation config (15-day retention)
│ ├── promtail/
│ │ └── promtail-config.yml # Log shipping (Pino JSON parsing)
│ └── grafana/
│ ├── provisioning/
│ │ ├── datasources/datasource.yml
│ │ └── dashboards/dashboard.yml
│ └── dashboards/
│ ├── api-latency.json
│ ├── api-overview.json
│ ├── database.json
│ ├── logs.json
│ ├── search.json
│ ├── web-vitals.json
│ └── business-metrics.json
│
└── .github/workflows/
├── ci.yml # Lint, test, build
├── deploy.yml # Build images, deploy to staging/prod
├── e2e.yml # End-to-end tests
├── backup-verify.yml # Weekly backup verification
├── security.yml # Dependency/SAST scanning
├── codeql.yml # GitHub CodeQL
└── load-test.yml # K6 load testing
Document Version History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-04-11 | DevOps Team | Initial comprehensive runbook |
Last Updated: April 11, 2026
Maintained By: GoodGo Platform SRE Team
Contact: devops@goodgo.vn