goodgo-platform/docs/audits/INFRASTRUCTURE_RUNBOOK.md

# GoodGo Platform — Operational Infrastructure Runbook

**Last Updated:** April 11, 2026
**Version:** 1.0
**Purpose:** Complete infrastructure reference for ops teams, SREs, and on-call engineers

---

## Table of Contents

1. [Executive Summary](#executive-summary)
2. [Services Architecture](#services-architecture)
3. [Docker Compose Specifications](#docker-compose-specifications)
4. [Database Layer](#database-layer)
5. [Caching & Search](#caching--search)
6. [Monitoring & Observability](#monitoring--observability)
7. [Payment Integration](#payment-integration)
8. [Health Checks](#health-checks)
9. [Environment Variables](#environment-variables)
10. [Backup & Recovery](#backup--recovery)
11. [Deployment Pipeline](#deployment-pipeline)
12. [Troubleshooting Guide](#troubleshooting-guide)

---

## Executive Summary

**GoodGo Platform** is a monorepo real estate marketplace built with:
- **Frontend:** Next.js (TypeScript)
- **Backend API:** NestJS (TypeScript)
- **AI Services:** Python/FastAPI
- **Database:** PostgreSQL 16 + PostGIS
- **Cache:** Redis 7
- **Search:** Typesense 27.1
- **Object Storage:** MinIO (S3-compatible)
- **Monitoring:** Prometheus + Grafana + Loki + Promtail
- **Message Queue:** Built-in CQRS/Event Bus (NestJS)

**Total Services in Production:** 12+ (detailed below)

---

## Services Architecture

### Service Inventory

| Service | Image | Port | Purpose | Health Check | Dependencies |
|---------|-------|------|---------|--------------|--------------|
| **api** | `goodgo-api:latest` | 3001 | NestJS REST API | `GET /health` (3x30s) | postgres, redis, typesense, pgbouncer |
| **web** | `goodgo-web:latest` | 3000 | Next.js frontend | `GET /` (3x30s) | api |
| **ai-services** | `goodgo-ai-services:latest` | 8000 | Python FastAPI (price estimation, NLP) | `GET /health` (3x30s) | n/a |
| **postgres** | `postgis/postgis:16-3.4` | 5432 | Primary database | `pg_isready` (5x10s) | n/a |
| **pgbouncer** | `edoburu/pgbouncer:1.23.1-p2` | 6432 | Connection pooling (transaction mode) | `pg_isready` (5x10s) | postgres |
| **redis** | `redis:7-alpine` | 6379 | Cache + session store | `PING` (5x10s) | n/a |
| **typesense** | `typesense/typesense:27.1` | 8108 | Full-text search index | `GET /health` (5x10s) | n/a |
| **minio** | `minio/minio:latest` | 9000/9001 | Object storage + console | `mc ready local` (5x10s) | n/a |
| **loki** | `grafana/loki:3.0.0` | 3100 | Log aggregation | `GET /ready` (5x15s) | n/a |
| **promtail** | `grafana/promtail:3.0.0` | 9080 | Log shipper | (depends on loki healthy) | loki |
| **prometheus** | `prom/prometheus:v2.51.0` | 9090 | Metrics scraper | `GET /-/healthy` (3x15s) | n/a |
| **grafana** | `grafana/grafana:10.4.1` | 3002 | Dashboards + alerting | `GET /api/health` (3x15s) | prometheus, loki |
| **pg-backup** | `postgis/postgis:16-3.4` | — | Automated backup cron | depends_on postgres | postgres |

### Network & Volumes

- **Network:** Docker bridge network `goodgo-net`
- **Volumes:**
  - `pgdata` — PostgreSQL data files
  - `redis_data` — Redis append-only file (AOF) persistence
  - `typesense_data` — Search index
  - `minio_data` — Object storage
  - `pg_backups` — Database backups (daily dumps, 7-day retention)
  - `loki_data` — Log chunks (retention: 15 days)
  - `prometheus_data` — Metrics TSDB (retention: 30 days in prod, 15 days in dev)
  - `grafana_data` — Dashboards, datasource configs

---

## Docker Compose Specifications

### Development Environment (`docker-compose.yml`)

**12 Services (minimal dependencies, no resource limits)**

```yaml
services:
  postgres:        PostGIS 16, port 5432, healthcheck: pg_isready (30s start-period)
  redis:           Alpine 7, port 6379, maxmemory: 256mb LRU, AOF enabled
  typesense:       v27.1, port 8108, CORS enabled, healthcheck /health
  minio:           latest, ports 9000 (API) / 9001 (console)
  ai-services:     Custom Python build, port 8000
  pg-backup:       Automated daily dumps at 02:00 UTC, cron retention cleanup
  pg-verify-backup: On-demand backup restore verification (profile: tools)
  loki:            v3.0.0, port 3100, 15-day retention, 2h compaction delay
  promtail:        v3.0.0, Docker socket instrumentation, Pino JSON parsing
  prometheus:      v2.51.0, port 9090, 15-day retention, lifecycle API enabled
  grafana:         v10.4.1, port 3002, datasources pre-provisioned
```

**Key Differences from Prod:**
- No resource limits (use all available CPU/memory)
- Smaller retention windows (7-15 days)
- PostgreSQL on port 5432 (direct, no pgbouncer)
- loki/prometheus/grafana on alternate ports

### Production Environment (`docker-compose.prod.yml`)

**14 Services (with pgbouncer, resource limits, rolling updates)**

```yaml
services:
  api:             NestJS, resource limits: 1.0 CPU / 1g memory
  web:             Next.js, resource limits: 0.5 CPU / 512m memory
  ai-services:     Python, resource limits: 1.0 CPU / 1g memory
  postgres:        PostGIS, resource limits: 2.0 CPU / 2g memory
  pgbouncer:       Connection pool (NEW), 20 default connections, transaction mode
  redis:           7-alpine, resource limits: 0.5 CPU / 768m memory, password auth
  typesense:       27.1, resource limits: 1.0 CPU / 1g memory
  minio:           latest, resource limits: 0.5 CPU / 1g memory
  loki:            v3.0.0, resource limits: 0.5 CPU / 512m memory
  promtail:        v3.0.0, resource limits: 0.25 CPU / 256m memory
  prometheus:      v2.51.0, resource limits: 0.5 CPU / 1g memory, 30-day retention
  grafana:         v10.4.1, resource limits: 0.5 CPU / 512m memory
  pg-backup:       Same as dev
```

**Production-Specific Flags:**
- `read_only: true` on app containers (api, web, ai-services)
- `tmpfs: [/tmp]` for runtime temp files
- `security_opt: [no-new-privileges:true]`
- `logging: json-file` with 10m max-size, 3-5 files rotation
- **PgBouncer inserted between apps ↔ Postgres** (port 6432)
- Secrets management: `GRAFANA_ADMIN_USER`, `GRAFANA_ADMIN_PASSWORD` from Docker secrets
- Redis requires password authentication

### CI/E2E Environment (`docker-compose.ci.yml`)

**Minimal 4 Services (tmpfs for speed)**

```yaml
services:
  postgres:        goodgo_test DB, tmpfs (/var/lib/postgresql/data)
  redis:          --save "" --appendonly no (no persistence)
  typesense:      tmpfs (/data)
  minio:          tmpfs (/data)
```

**Used by:**
- GitHub Actions E2E test suite
- Local `docker compose -f docker-compose.ci.yml up --wait`

---

## Database Layer

### PostgreSQL + PostGIS

**Version:** PostgreSQL 16 with PostGIS 3.4 (image `postgis/postgis:16-3.4`)
**Schema:** 22 Prisma models + Prisma migration tracking

#### Prisma Schema Models

1. **Auth:** User, RefreshToken, OAuthAccount, Agent
2. **Listings:** Property, PropertyMedia, Listing
3. **Search:** SavedSearch
4. **Transactions:** Transaction, Inquiry, Lead
5. **Payments:** Payment (with PaymentProvider enum: VNPAY, MOMO, ZALOPAY, BANK_TRANSFER)
6. **Subscriptions:** Plan, Subscription, UsageRecord
7. **Analytics:** Valuation, MarketIndex
8. **Notifications:** NotificationLog, NotificationPreference
9. **Audit:** AdminAuditLog
10. **Reviews:** Review

#### Key Database Features

- **PostGIS Geometry:** Property.location (Point, SRID 4326) with GIST index
- **Enums:** UserRole, KYCStatus, PropertyType, TransactionType, ListingStatus, Direction, OAuthProvider, TransactionStatus, LeadStatus, PaymentProvider, PaymentStatus, PaymentType, PlanTier, SubscriptionStatus, NotificationChannel, NotificationStatus, AdminAction, AuditTargetType
- **Compound Indexes:** Query optimization on (role, isActive, createdAt), (sellerId, status, publishedAt), (userId, status, createdAt), etc.
- **Constraints:** Unique idempotency key on Payment (userId, provider, idempotencyKey)

#### Connection Pooling: PgBouncer

**Dev Mode (docker-compose.yml):**
- Apps connect directly to `postgres:5432`
- No pooling overhead

**Prod Mode (docker-compose.prod.yml):**
- Apps connect to `pgbouncer:6432`
- **Pool Mode:** `transaction` (connections returned after each transaction)
- **Pool Size:** 20 connections (default, tunable via `PGBOUNCER_POOL_SIZE`)
- **Max Client Conn:** 200 (tunable via `PGBOUNCER_MAX_CLIENT_CONN`)
- **Reserve Pool:** 5 connections (fallback when pool exhausted)
- **Timeouts:**
  - server_connect_timeout: 15s
  - server_idle_timeout: 600s
  - server_lifetime: 3600s (connection recycle)
  - query_wait_timeout: 120s
  - query_timeout: 0 (disabled)
- **Admin Console:** pgbouncer_admin user (password via PGBOUNCER_ADMIN_PASSWORD env var)
- **Stats Console:** pgbouncer_stats user (password via PGBOUNCER_STATS_PASSWORD env var)

**Migration Workaround:**
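- The API has two database URL env vars, because Prisma migrations need a direct (non-pooled) connection rather than one routed through PgBouncer's transaction mode: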
  - `DATABASE_URL` → pgbouncer:6432 (normal queries)
  - `DATABASE_URL_DIRECT` → postgres:5432 (migrations, introspection, DDL)
- `RUN_MIGRATIONS=true` switches app to use DATABASE_URL_DIRECT for `prisma migrate deploy`
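
The URL selection described above can be sketched as a small pure function. This is a minimal illustration, not the application's actual code: the helper name `resolveDatabaseUrl` and the `Env` type are hypothetical, and only the variable names come from this runbook.

```typescript
// Hypothetical helper: picks the connection string for the current task.
type Env = Record<string, string | undefined>;

function resolveDatabaseUrl(env: Env): string {
  const runMigrations = env.RUN_MIGRATIONS === "true";
  // Migrations and DDL bypass PgBouncer and talk to Postgres directly.
  const url = runMigrations ? env.DATABASE_URL_DIRECT : env.DATABASE_URL;
  if (!url) {
    const missing = runMigrations ? "DATABASE_URL_DIRECT" : "DATABASE_URL";
    throw new Error(`${missing} is not set`);
  }
  return url;
}
```

Failing loudly when the selected variable is missing avoids silently running a migration through the pooled connection.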

#### Backup Strategy

**Automated Backups:**
- **Schedule:** Daily at 02:00 UTC (cron inside pg-backup container)
- **Format:** Custom format with gzip compression (level 6)
- **Retention:** 7 days (configurable via BACKUP_RETENTION_DAYS)
- **Location:** `pg_backups` volume (mount to persistent storage in prod)
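- **File Pattern:** `goodgo_YYYYMMDD_HHMMSS.sql.gz` (despite the extension, this is a `pg_dump` custom-format archive; restore it with `pg_restore`, not `gunzip | psql`)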
- **Restore Script:** `/scripts/backup/pg-restore.sh` (manual restore)
- **Verification Script:** `/scripts/backup/pg-verify-backup.sh` (automated E2E verification)

**Verification Process (runs weekly):**
1. Restores latest backup to isolated test database (`goodgo_verify_<timestamp>`)
2. Verifies all 22 tables exist
3. Compares row counts between source and restored DB
4. Checksums critical tables (User, Property, Listing, Payment, Subscription, Transaction, Plan, _prisma_migrations)
5. Checks PostGIS extension, indexes, enum types
6. Generates JSON report with pass/fail result
7. **Cleanup:** Drops test DB on exit (unless SKIP_CLEANUP=1)
8. **Exit Codes:** 0=pass, 1=checks failed, 2=setup error
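
A wrapper that calls the verification script can branch on these exit codes. The sketch below is illustrative only; the `VerifyOutcome` type and the function name are hypothetical and not part of the script.

```typescript
// Hypothetical mapping of the documented exit codes to named outcomes.
type VerifyOutcome = "pass" | "checks_failed" | "setup_error" | "unknown";

function interpretVerifyExitCode(code: number): VerifyOutcome {
  switch (code) {
    case 0:
      return "pass";
    case 1:
      return "checks_failed"; // the backup restored, but a check did not pass
    case 2:
      return "setup_error"; // the verification itself could not run
    default:
      return "unknown";
  }
}
```

Distinguishing `checks_failed` from `setup_error` matters for on-call: the first points at the backup, the second at the verification environment.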

**CI/CD Backup Verification:**
- GitHub Action: `.github/workflows/backup-verify.yml`
- Runs weekly Sundays 05:00 UTC
- Also manually triggerable with skip_cleanup option
- Uploads JSON report as artifact

---

## Caching & Search

### Redis

**Image:** `redis:7-alpine`
**Port:** 6379

**Production Configuration:**
```bash
# AOF persistence, password auth, 512 MB memory cap, LRU eviction when full.
# Comments cannot follow a trailing backslash, so they are kept up here.
redis-server \
  --appendonly yes \
  --requirepass "${REDIS_PASSWORD}" \
  --maxmemory 512mb \
  --maxmemory-policy allkeys-lru
```

**Development Configuration:**
```bash
redis-server \
  --appendonly yes \
  --maxmemory 256mb \
  --maxmemory-policy allkeys-lru
```

**ioredis Client Configuration:**
```typescript
// From RedisService in apps/api/src/modules/shared/infrastructure/redis.service.ts
{
  host: process.env.REDIS_HOST ?? 'localhost',
  port: Number(process.env.REDIS_PORT ?? 6379),
  password: process.env.REDIS_PASSWORD ?? undefined,
  lazyConnect: true,          // App starts even if Redis unavailable
  enableReadyCheck: false,    // Prevents "Redis is not ready" errors during transient outages
  maxRetriesPerRequest: 1,    // Fail fast (single retry, no exponential backoff)
  retryStrategy(times: number): number {
    return Math.min(times * 1000, 5000);  // 1s → 2s → 3s → 4s → 5s → 5s...
  }
}
```
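
To make the reconnection schedule concrete, the formula from `retryStrategy` can be evaluated on its own. The function name `retryDelayMs` is introduced here for illustration only.

```typescript
// Same formula as the retryStrategy shown above, extracted for illustration.
function retryDelayMs(times: number): number {
  return Math.min(times * 1000, 5000);
}

// Delays (in ms) for the first seven reconnection attempts.
const delays = Array.from({ length: 7 }, (_, i) => retryDelayMs(i + 1));
console.log(delays.join(", ")); // 1000, 2000, 3000, 4000, 5000, 5000, 5000
```

The delay grows linearly and is capped at 5 seconds, so a long Redis outage results in one reconnection attempt every 5 seconds.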

**Graceful Degradation:**
- Cache misses don't fail the application
- CacheService catches Redis errors and returns cache miss
- App serves data directly from PostgreSQL if Redis down
- Health check at `GET /health/redis` warns but doesn't fail readiness probe
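
The degradation behavior above follows the cache-aside pattern with errors treated as misses. The sketch below assumes a minimal `KeyValueCache` interface and a `getOrLoad` helper, both hypothetical; the real `CacheService` API may differ.

```typescript
// Hypothetical minimal cache interface; the real CacheService API may differ.
interface KeyValueCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

// Cache-aside read that treats any cache error as a miss.
async function getOrLoad(
  cache: KeyValueCache,
  key: string,
  load: () => Promise<string>,
): Promise<string> {
  try {
    const hit = await cache.get(key);
    if (hit !== null) return hit;
  } catch {
    // Redis unavailable: fall through to the source of truth.
  }
  const value = await load();
  try {
    await cache.set(key, value);
  } catch {
    // Best-effort write-back; a failed cache write must not fail the request.
  }
  return value;
}
```

With this shape, a Redis outage costs latency (every read goes to PostgreSQL) but not availability.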

**Use Cases:**
- Session storage
- Cache layer for expensive queries
- Rate limiting (if implemented)
- Real-time counters

---

### Typesense

**Image:** `typesense/typesense:27.1`
**Port:** 8108 (HTTP only, internal Docker network)
**API Key:** `${TYPESENSE_API_KEY}` (must be set in .env)

**Collection Schema:**
```
Collection Name: "listings"
Fields:
  - listingId (string)
  - propertyId (string)
  - title (string, searchable, highlights)
  - description (string, searchable, highlights)
  - propertyType (string, faceted)
  - transactionType (string, faceted: SALE/RENT)
  - priceVND (int64, sortable)
  - pricePerM2 (float, optional)
  - areaM2 (float)
  - bedrooms (int32, faceted)
  - bathrooms (int32, faceted)
  - floors (int32)
  - direction (string, faceted: NORTH/SOUTH/EAST/WEST/etc)
  - address (string)
  - ward (string, faceted)
  - district (string, faceted)
  - city (string, faceted)
  - location (geopoint) — for radius search
  - agentId (string)
  - sellerId (string)
  - status (string, faceted: ACTIVE/SOLD/DRAFT/etc)
  - publishedAt (int64, sortable)
  - viewCount (int32)
  - saveCount (int32)
  - projectName (string, faceted)
  - amenities (string[], faceted)
```

**Search Features:**
- **Full-text search** on: title, description, address, district, city, projectName
- **Query weights:** title=5, description=3, address=2, district=2, city=1, projectName=2
- **Filtering:** propertyType, transactionType, bedrooms, district, city, status, amenities
- **Geo-search:** radius-based queries (lat, lng, km)
- **Sorting:** price (asc/desc), distance (asc from geopoint), date (desc), relevance
- **Highlights:** HTML marks on matched terms in title and description
- **Facets:** Return aggregated counts for filtering
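
As an illustration of how these features map onto a Typesense query, the sketch below builds a search parameter object using Typesense's `q`, `query_by`, `query_by_weights`, `filter_by`, and `sort_by` parameters. The `ListingQuery` shape and `buildSearchParams` name are hypothetical, the default sort order is an assumption, and user input is not escaped here, which a real implementation would need to handle.

```typescript
// Hypothetical input shape; field names follow the "listings" schema above.
interface ListingQuery {
  text: string;
  city?: string;
  minBedrooms?: number;
  near?: { lat: number; lng: number; radiusKm: number };
}

// Builds Typesense search parameters for the documented fields and weights.
function buildSearchParams(query: ListingQuery): Record<string, string> {
  const filters: string[] = [];
  if (query.city !== undefined) filters.push(`city:=${query.city}`);
  if (query.minBedrooms !== undefined) filters.push(`bedrooms:>=${query.minBedrooms}`);

  // Assumed default: relevance first, then newest listings.
  let sortBy = "_text_match:desc,publishedAt:desc";
  if (query.near) {
    const { lat, lng, radiusKm } = query.near;
    filters.push(`location:(${lat}, ${lng}, ${radiusKm} km)`); // radius filter
    sortBy = `location(${lat}, ${lng}):asc`; // nearest first
  }

  const params: Record<string, string> = {
    q: query.text,
    query_by: "title,description,address,district,city,projectName",
    query_by_weights: "5,3,2,2,1,2",
    sort_by: sortBy,
  };
  if (filters.length > 0) params.filter_by = filters.join(" && ");
  return params;
}
```

The order of `query_by_weights` must match the order of `query_by`, which is why both strings are defined side by side.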

**TypesenseSearchRepository (`apps/api/src/modules/search/infrastructure/services/typesense-search.repository.ts`):**
- `ensureCollection()` — Creates schema if not exists
- `dropCollection()` — Cleanup (testing only)
- `indexDocument(doc)` — Upsert single document
- `indexDocuments(docs)` — Bulk import with error reporting
- `removeDocument(id)` — Delete by ID
- `search(params)` — Execute search with filters, sort, pagination

**Graceful Degradation:**
- If Typesense down, search falls back to PostgreSQL full-text search
- TypesenseClientService implements retry logic with exponential backoff
- Health check at `GET /health` returns JSON status
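
The retry-then-fallback behavior can be sketched as follows. The base delay, cap, attempt count, and function names are assumptions for illustration; this runbook does not state the values used by `TypesenseClientService`.

```typescript
// Exponential backoff: base * 2^attempt, capped. Values are illustrative.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 5000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Tries the primary search a few times, then falls back (e.g. to PostgreSQL).
async function searchWithFallback<T>(
  primary: () => Promise<T>,
  fallback: () => Promise<T>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await primary();
    } catch {
      // Wait before the next attempt, but not after the last one.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  return fallback();
}
```

Injecting `sleep` keeps the helper testable without real delays.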

---

## Monitoring & Observability

### Prometheus

**Image:** `prom/prometheus:v2.51.0`
**Port:** 9090
**Retention:** 15 days (dev), 30 days (prod)
**Lifecycle API:** Enabled (`--web.enable-lifecycle`)

**Scrape Targets (`monitoring/prometheus/prometheus.yml`):**
```yaml
scrape_configs:
  - job_name: goodgo-api
    metrics_path: /metrics
    static_configs:
      # Dev: API runs on the host. In prod, the target is 'api:3001' (API in container).
      - targets: ['host.docker.internal:3001']
        labels:
          service: goodgo-api
          environment: development  # 'production' in prod

  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
```

**Expected Metrics from API:**
- `goodgo_api_request_duration_seconds_bucket{le, route, method}` — Request latency histogram
- `http_requests_total{status_code, job}` — Request count by status code
- Custom business metrics (if implemented in NestJS @prometheus decorators)

### Alert Rules (`monitoring/prometheus/alert-rules.yml`)

**Latency Alerts:**
1. **ApiLatencyP99High** (warning)
   - Trigger: p99 latency > 1s for 5 minutes
   - Dashboard: `/d/goodgo-api-latency/goodgo-api-latency`
   - Runbook: `https://docs.goodgo.vn/runbooks/api-latency-high`

2. **ApiEndpointLatencyP99High** (warning)
   - Trigger: Per-endpoint p99 > 2s for 5 minutes
   - Annotates: method, route labels

3. **ApiLatencyP99Critical** (critical - SLO breach)
   - Trigger: p99 latency > 3s for 3 minutes
   - Escalation required
   - Runbook: `https://docs.goodgo.vn/runbooks/api-latency-critical`

**Error Rate Alert:**
1. **ApiErrorRate5xxHigh** (warning)
   - Trigger: 5xx error rate > 1% for 5 minutes
   - Uses: `(5xx errors / total requests) * 100`
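
The arithmetic behind the error-rate alert is simple enough to check by hand. The sketch below only illustrates the formula and the 1% threshold; the function name is hypothetical, and the "for 5 minutes" condition is evaluated by Prometheus, not by code like this.

```typescript
// (5xx errors / total requests) * 100, guarding against division by zero.
function errorRatePercent(errors5xx: number, totalRequests: number): number {
  if (totalRequests === 0) return 0; // no traffic: report 0% rather than NaN
  return (errors5xx / totalRequests) * 100;
}

// Example: 12 server errors out of 1000 requests is above the 1% threshold.
const rate = errorRatePercent(12, 1000);
const firing = rate > 1;
```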

### Grafana

**Image:** `grafana/grafana:10.4.1`
**Port:** 3002
**Auth:** Admin user/password from secrets (prod) or env vars (dev)

**Pre-Provisioned Datasources:**
- Prometheus (default, primary)
- Loki (with derived fields for correlationId linkage)

**Dashboards:**
1. `api-latency.json` — API p99/p95/p50, route breakdown, slow endpoints
2. `api-overview.json` — Request rate, error rate, uptime status
3. `database.json` — Query latency, connection pool utilization, slow queries
4. `logs.json` — Log volume, error logs, trace links to Prometheus
5. `search.json` — Typesense query latency, indexing rate, collection size
6. `web-vitals.json` — Frontend Core Web Vitals (if client-side instrumentation)
7. `business-metrics.json` — Listings created, payments processed, user signups

**Admin Console Access:**
- URL: `http://localhost:3002` (dev) or `http://<host>:${GRAFANA_PORT}` (prod)
- Default user: `admin` (change password on first login)
- Non-signup mode (`GF_USERS_ALLOW_SIGN_UP: false`)

### Loki & Promtail (Log Aggregation)

**Loki:** `grafana/loki:3.0.0`, port 3100

**Configuration:**
```yaml
schema:
  - from: 2024-01-01
    store: tsdb
    schema: v13
limits:
  max_entries_limit_per_query: 5000
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 6
retention: 360h  # 15 days
```

**Promtail:** `grafana/promtail:3.0.0`

**Configuration:**
- Scrapes Docker logs from `goodgo-net` bridge network
- Parses **Pino JSON** structured logs
- Extracts labels: level, context, component, service
- Structured metadata: method, url, statusCode, correlationId, duration
- Derives timestamp from Pino output (RFC3339Nano)

**Expected Log Format (Pino; `level` 30 = info):**
```json
{
  "level": 30,
  "time": "2026-04-11T10:30:00Z",
  "msg": "POST /api/listings",
  "correlationId": "abc-123-def",
  "context": "ListingController",
  "component": "api",
  "method": "POST",
  "url": "/api/listings",
  "statusCode": 201,
  "duration": 150
}
```
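
When reading raw container logs without Grafana, a small parser for this format can help. The sketch below uses Pino's default numeric levels; the `ParsedLog` type and `parsePinoLine` name are hypothetical, and the field selection mirrors the example above.

```typescript
// Pino's default numeric levels.
const PINO_LEVELS: Record<number, string> = {
  10: "trace",
  20: "debug",
  30: "info",
  40: "warn",
  50: "error",
  60: "fatal",
};

interface ParsedLog {
  level: string;
  msg: string;
  correlationId?: string;
  statusCode?: number;
  duration?: number;
}

// Returns null for lines that are not JSON objects (e.g. startup banners).
function parsePinoLine(line: string): ParsedLog | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const raw = parsed as Record<string, unknown>;
  const levelNum = typeof raw.level === "number" ? raw.level : NaN;
  return {
    level: PINO_LEVELS[levelNum] ?? "unknown",
    msg: typeof raw.msg === "string" ? raw.msg : "",
    correlationId: typeof raw.correlationId === "string" ? raw.correlationId : undefined,
    statusCode: typeof raw.statusCode === "number" ? raw.statusCode : undefined,
    duration: typeof raw.duration === "number" ? raw.duration : undefined,
  };
}
```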

---

## Payment Integration

### Supported Payment Providers

**Enum:** `PaymentProvider` (Prisma)
- `VNPAY` — VNPay (Vietnam payment gateway)
- `MOMO` — MoMo (Vietnam mobile wallet)
- `ZALOPAY` — ZaloPay (Vietnam digital wallet)
- `BANK_TRANSFER` — Manual bank transfer (offline)

### Payment Flow & Callback Handling

**Database Schema (Payment Model):**
```prisma
model Payment {
  id             String          @id @default(cuid())
  userId         String
  transactionId  String?
  provider       PaymentProvider
  type           PaymentType     // SUBSCRIPTION, LISTING_FEE, DEPOSIT, FEATURED_LISTING
  amountVND      BigInt
  status         PaymentStatus   // PENDING, PROCESSING, COMPLETED, FAILED, REFUNDED
  providerTxId   String?         // External transaction ID from VNPay/MoMo/ZaloPay
  callbackData   Json?           // Raw callback payload (for audit)
  idempotencyKey String?         // Prevents duplicate payments
  createdAt      DateTime        @default(now())
  updatedAt      DateTime        @updatedAt

  @@unique([userId, provider, idempotencyKey])
}

enum PaymentStatus {
  PENDING
  PROCESSING
  COMPLETED
  FAILED
  REFUNDED
}

enum PaymentType {
  SUBSCRIPTION
  LISTING_FEE
  DEPOSIT
  FEATURED_LISTING
}
```

**Command Handler: `HandleCallbackHandler`**
(`apps/api/src/modules/payments/application/commands/handle-callback/handle-callback.handler.ts`)

1. **Callback Signature Verification:**
   - Uses `PAYMENT_GATEWAY_FACTORY` to route to correct provider (VNPay/MoMo/ZaloPay)
   - Gateway.verifyCallback() validates HMAC signature
   - Throws `ValidationException` if signature invalid

2. **Idempotent Status Transition:**
   - Only updates payments in state: `PENDING` or `PROCESSING`
   - Atomically transitions to `COMPLETED` or `FAILED`
   - If already in terminal state (COMPLETED/FAILED/REFUNDED), returns existing status (idempotent)
   - Logs warning if payment not found

3. **Domain Event Publishing:**
   - Reconstructs domain entity from repository
   - Emits `PaymentCompletedEvent` or `PaymentFailedEvent`
   - Event bus publishes events to subscribers (e.g., subscription creation, listing activation)

4. **Response:**
   ```typescript
   {
     paymentId: string,
     status: PaymentStatus,
     isSuccess: boolean
   }
   ```
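
Step 1 relies on HMAC verification. A generic sketch using Node's `crypto` module follows. The canonical form (sorted `key=value` pairs joined by `&`), the SHA-256 digest, and the `signature` field name are assumptions for illustration; each provider (VNPay, MoMo, ZaloPay) defines its own signing scheme, and the real gateways must follow those specifications.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical canonical form: sorted key=value pairs joined by "&".
function canonicalize(fields: Record<string, string>): string {
  return Object.keys(fields)
    .sort()
    .map((key) => `${key}=${fields[key]}`)
    .join("&");
}

function sign(fields: Record<string, string>, secret: string): string {
  return createHmac("sha256", secret).update(canonicalize(fields)).digest("hex");
}

// Verifies a callback whose signature travels in a (hypothetical) `signature` field.
function verifyCallbackSignature(data: Record<string, string>, secret: string): boolean {
  const { signature, ...fields } = data;
  if (!signature) return false;
  const expected = Buffer.from(sign(fields, secret), "hex");
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

Using `timingSafeEqual` instead of `===` avoids leaking how many leading bytes of the signature matched.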

**Payment Gateway Interface (`payment-gateway.interface.ts`):**
```typescript
interface IPaymentGateway {
  readonly provider: PaymentProvider
  createPaymentUrl(params: CreatePaymentUrlParams): Promise<CreatePaymentUrlResult>
  verifyCallback(data: Record<string, string>): CallbackVerifyResult
  refund(params: RefundParams): Promise<RefundResult>
}

interface CreatePaymentUrlParams {
  orderId: string
  amountVND: bigint
  description: string
  returnUrl: string
  ipAddress: string
}

interface CallbackVerifyResult {
  isValid: boolean
  orderId: string
  providerTxId: string
  isSuccess: boolean
  rawData: Record<string, unknown>
}

interface RefundParams {
  providerTxId: string
  amountVND: bigint
  reason: string
}

interface RefundResult {
  success: boolean
  refundTxId: string | null
}
```

### Payment Environment Variables

**VNPay:**
```env
VNPAY_TMN_CODE=<merchant terminal code>
VNPAY_HASH_SECRET=<HMAC secret key>
VNPAY_BASE_URL=https://sandbox.vnpayment.vn/paymentv2/vpcpay.html
VNPAY_API_URL=https://sandbox.vnpayment.vn/merchant_webapi/api/transaction
```

**MoMo:**
```env
MOMO_PARTNER_CODE=<partner code>
MOMO_ACCESS_KEY=<access key>
MOMO_SECRET_KEY=<secret key>
MOMO_ENDPOINT=https://test-payment.momo.vn/v2/gateway/api
```

**ZaloPay:**
```env
ZALOPAY_APP_ID=<app ID>
ZALOPAY_KEY1=<key 1 (for creating payments)>
ZALOPAY_KEY2=<key 2 (for callback verification)>
ZALOPAY_ENDPOINT=https://sb-openapi.zalopay.vn/v2
```

### Race Condition & Idempotency Protection

**Problem:** Multiple callbacks may arrive for same payment (network retries, duplicate notifications)

**Solution:**
1. **Unique Idempotency Key:** `Payment_idempotency_unique(userId, provider, idempotencyKey)`
   - Prevents duplicate payment records
   - Generated by client/API before creating payment

2. **Atomic Status Update:** `paymentRepo.updateIfStatus(orderId, ['PENDING', 'PROCESSING'], newStatus)`
   - Only updates if current status in allowed list
   - Returns updated entity or null if already terminal

3. **Terminal State Check:** If already COMPLETED/FAILED/REFUNDED, handler returns existing state
   - No re-triggering of domain events
   - No double billing or duplicate transactions
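
The conditional update and terminal-state check can be illustrated with an in-memory stand-in for the repository. This is a sketch under stated assumptions: `InMemoryPaymentRepo` and this simplified `handleCallback` are hypothetical, and a real repository would perform the compare-and-set as a single conditional database update.

```typescript
type PaymentStatus = "PENDING" | "PROCESSING" | "COMPLETED" | "FAILED" | "REFUNDED";

// In-memory stand-in for the payment repository (illustration only).
class InMemoryPaymentRepo {
  private readonly statuses = new Map<string, PaymentStatus>();

  create(id: string): void {
    this.statuses.set(id, "PENDING");
  }

  get(id: string): PaymentStatus | undefined {
    return this.statuses.get(id);
  }

  // Compare-and-set: update only if the current status is in `allowed`.
  updateIfStatus(id: string, allowed: PaymentStatus[], next: PaymentStatus): PaymentStatus | null {
    const current = this.statuses.get(id);
    if (current === undefined || allowed.indexOf(current) === -1) return null;
    this.statuses.set(id, next);
    return next;
  }
}

// Returns the names of events that would be published; empty for duplicates.
function handleCallback(repo: InMemoryPaymentRepo, id: string, isSuccess: boolean): string[] {
  const next: PaymentStatus = isSuccess ? "COMPLETED" : "FAILED";
  const updated = repo.updateIfStatus(id, ["PENDING", "PROCESSING"], next);
  if (updated === null) return []; // already terminal or unknown: no events
  return [isSuccess ? "PaymentCompletedEvent" : "PaymentFailedEvent"];
}
```

Because only the first callback wins the compare-and-set, a late or duplicated callback cannot flip a completed payment to failed or re-emit events.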

---

## Health Checks

### API Health Endpoints

**Health Controller** (`apps/api/src/modules/health/health.controller.ts`)

1. **GET /health** — Liveness Probe (always 200 if process running)
   - Uses: `@HealthCheck()` on empty probe list
   - Response: `{ "status": "ok", "timestamp": "..." }`
   - **Use Case:** Kubernetes liveness probe / Docker container healthcheck (confirms the process is up)

2. **GET /health/ready** — Readiness Probe (checks dependencies)
   - Checks: PostgreSQL + Redis connectivity
   - Response:
     ```json
     {
       "status": "ok",
       "checks": {
         "database": { "status": "up" },
         "redis": { "status": "up" }
       }
     }
     ```
   - **Use Case:** Load balancer, before accepting traffic
   - **Failure:** Returns 503 if any dependency down

3. **GET /health/db** — Database Readiness Only
   - Checks: PostgreSQL connectivity via `SELECT 1` query
   - **Use Case:** Manual database troubleshooting

4. **GET /health/redis** — Redis Readiness Only
   - Checks: Redis PING command
   - **Use Case:** Manual Redis troubleshooting
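
The readiness contract (200 when every dependency is up, 503 otherwise) can be summarized in a few lines. The sketch below is not the controller's implementation; `aggregateReadiness` is a hypothetical name, and the `"error"` status value for the failure case is an assumption, since only the success payload is shown above.

```typescript
type CheckStatus = "up" | "down";

interface ReadinessResult {
  httpStatus: 200 | 503;
  body: { status: "ok" | "error"; checks: Record<string, { status: CheckStatus }> };
}

// Combines per-dependency results into one readiness response.
function aggregateReadiness(checks: Record<string, boolean>): ReadinessResult {
  const detail: Record<string, { status: CheckStatus }> = {};
  let allUp = true;
  for (const dependency of Object.keys(checks)) {
    const up = checks[dependency];
    detail[dependency] = { status: up ? "up" : "down" };
    if (!up) allUp = false;
  }
  return {
    httpStatus: allUp ? 200 : 503,
    body: { status: allUp ? "ok" : "error", checks: detail },
  };
}
```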

### Health Check Implementations

**PrismaHealthIndicator** (`apps/api/src/modules/health/infrastructure/prisma.health.ts`):
```typescript
async isHealthy(key: string): Promise<HealthIndicatorResult> {
  try {
    await this.prisma.$queryRawUnsafe('SELECT 1');
    return this.getStatus(key, true);
  } catch {
    throw new HealthCheckError('Database check failed', this.getStatus(key, false));
  }
}
```

**RedisHealthIndicator** (`apps/api/src/modules/health/infrastructure/redis.health.ts`):
```typescript
async isHealthy(key: string): Promise<HealthIndicatorResult> {
  try {
    const client = this.redis.getClient();
    const pong = await client.ping();
    const isHealthy = pong === 'PONG';
    const result = this.getStatus(key, isHealthy);
    if (isHealthy) return result;
    throw new HealthCheckError('Redis ping failed', result);
  } catch (error) {
    if (error instanceof HealthCheckError) throw error;
    throw new HealthCheckError('Redis check failed', this.getStatus(key, false));
  }
}
```

### Docker Container Health Checks

**API Container:**
```yaml
healthcheck:
  test: ['CMD', 'node', '-e', "fetch('http://localhost:3001/health').then(r => { if (!r.ok) throw 1 }).catch(() => process.exit(1))"]
  interval: 30s
  timeout: 5s
  retries: 5
  start_period: 30s
```

**Web Container:**
```yaml
healthcheck:
  test: ['CMD', 'node', '-e', "fetch('http://localhost:3000').then(r => { if (!r.ok) throw 1 }).catch(() => process.exit(1))"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 15s
```

**PostgreSQL:**
```yaml
healthcheck:
  test: ['CMD-SHELL', 'pg_isready -U ${DB_USER} -d ${DB_NAME}']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s
```

**Redis:**
```yaml
healthcheck:
  test: ['CMD', 'redis-cli', '-a', '${REDIS_PASSWORD}', 'ping']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 10s
```

**Typesense:**
```yaml
healthcheck:
  test: ['CMD', 'curl', '-sf', 'http://localhost:8108/health']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 15s
```

---

## Environment Variables

### Complete `.env.example` Reference

**PostgreSQL:**
```env
DB_HOST=localhost
DB_PORT=5432
DB_NAME=goodgo
DB_USER=goodgo
DB_PASSWORD=CHANGE_ME
DATABASE_URL=postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}?schema=public
DATABASE_URL_DIRECT=postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}?schema=public
```
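
Passwords containing URL-reserved characters (such as `@`, `:`, or `/`) must be percent-encoded before they are embedded in a connection URL. A minimal sketch of building the URL safely follows; the `DbParts` type and `buildDatabaseUrl` helper are hypothetical.

```typescript
interface DbParts {
  user: string;
  password: string;
  host: string;
  port: number;
  name: string;
}

// Percent-encodes credentials so reserved characters do not break URL parsing.
function buildDatabaseUrl(parts: DbParts): string {
  const user = encodeURIComponent(parts.user);
  const password = encodeURIComponent(parts.password);
  return `postgresql://${user}:${password}@${parts.host}:${parts.port}/${parts.name}?schema=public`;
}
```

For example, a password of `p@ss/w:rd` becomes `p%40ss%2Fw%3Ard` in the resulting URL.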

**PgBouncer (Prod Only):**
```env
PGBOUNCER_POOL_SIZE=20
PGBOUNCER_MAX_CLIENT_CONN=200
PGBOUNCER_ADMIN_PASSWORD=CHANGE_ME
PGBOUNCER_STATS_PASSWORD=CHANGE_ME
```

**Redis:**
```env
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=
REDIS_URL=redis://${REDIS_HOST}:${REDIS_PORT}
```

**Typesense:**
```env
TYPESENSE_HOST=localhost
TYPESENSE_PORT=8108
TYPESENSE_PROTOCOL=http
TYPESENSE_API_KEY=CHANGE_ME
```

**MinIO:**
```env
MINIO_ENDPOINT=localhost
MINIO_PORT=9000
MINIO_CONSOLE_PORT=9001
MINIO_ACCESS_KEY=CHANGE_ME
MINIO_SECRET_KEY=CHANGE_ME
MINIO_BUCKET=goodgo-media
MINIO_USE_SSL=false
```

**NestJS API:**
```env
API_PORT=3001
PORT=3001
NODE_ENV=development
CORS_ORIGINS=http://localhost:3000,http://localhost:3001
```

**JWT / Authentication (REQUIRED):**
```env
JWT_SECRET=<generate with: openssl rand -base64 48>
JWT_EXPIRES_IN=15m
JWT_REFRESH_SECRET=<generate with: openssl rand -base64 48>
JWT_REFRESH_EXPIRES_IN=7d
```

**OAuth Providers:**
```env
GOOGLE_CLIENT_ID=
GOOGLE_CLIENT_SECRET=
GOOGLE_CALLBACK_URL=http://localhost:3001/auth/google/callback

ZALO_APP_ID=
ZALO_APP_SECRET=
ZALO_CALLBACK_URL=http://localhost:3001/auth/zalo/callback

FRONTEND_URL=http://localhost:3000
```

**Next.js Web:**
```env
NEXT_PUBLIC_API_URL=http://localhost:3001
WEB_PORT=3000
```

**AI Service (Python/FastAPI):**
```env
AI_SERVICE_PORT=8000
AI_SERVICE_URL=http://localhost:8000
CLAUDE_API_KEY=
AI_DEBUG=false
AI_LOG_LEVEL=info
```

**Map Integration:**
```env
NEXT_PUBLIC_MAPBOX_TOKEN=
```

**Payment Gateways:**
```env
VNPAY_TMN_CODE=
VNPAY_HASH_SECRET=
VNPAY_BASE_URL=https://sandbox.vnpayment.vn/paymentv2/vpcpay.html
VNPAY_API_URL=https://sandbox.vnpayment.vn/merchant_webapi/api/transaction

MOMO_PARTNER_CODE=
MOMO_ACCESS_KEY=
MOMO_SECRET_KEY=
MOMO_ENDPOINT=https://test-payment.momo.vn/v2/gateway/api

ZALOPAY_APP_ID=
ZALOPAY_KEY1=
ZALOPAY_KEY2=
ZALOPAY_ENDPOINT=https://sb-openapi.zalopay.vn/v2
```

**Email / SMTP:**
```env
SMTP_HOST=localhost
SMTP_PORT=1025
SMTP_USER=
SMTP_PASS=
SMTP_FROM=noreply@goodgo.vn
```

**Firebase Cloud Messaging (Optional):**
```env
FIREBASE_SERVICE_ACCOUNT=
```

**Sentry Error Tracking:**
```env
SENTRY_DSN=
NEXT_PUBLIC_SENTRY_DSN=
SENTRY_AUTH_TOKEN=
SENTRY_ORG=
SENTRY_PROJECT=
```

**KYC Field Encryption (REQUIRED in Prod):**
```env
KYC_ENCRYPTION_KEY=<generate with: openssl rand -hex 32> # 64 hex chars (32 bytes)
KYC_ENCRYPTION_KEY_VERSION=1
```
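
Since a malformed key would only surface when the first KYC field is encrypted, it can be worth validating the format at startup. The check below is a sketch; the function name is hypothetical and the application may validate differently.

```typescript
// A valid key is exactly 64 hexadecimal characters (32 bytes).
function isValidKycKey(key: string | undefined): boolean {
  return typeof key === "string" && /^[0-9a-fA-F]{64}$/.test(key);
}
```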

**Logging:**
```env
LOG_LEVEL=info
```

---

## Backup & Recovery

### Automated Daily Backups

**Service:** `pg-backup` container (runs inside docker compose)

**Backup Script:** `scripts/backup/pg-backup.sh`

```bash
# Daily cron job: 02:00 UTC
PGHOST=postgres \
PGPORT=5432 \
PGUSER=goodgo \
PGDATABASE=goodgo \
PGPASSWORD=<secret> \
BACKUP_DIR=/backups \
RETENTION_DAYS=7 \
  /scripts/pg-backup.sh
```

**Behavior:**
1. Creates dump with `pg_dump --format=custom --compress=6`
2. Saves as `goodgo_YYYYMMDD_HHMMSS.sql.gz`
3. Prunes backups older than 7 days (configurable)
4. Logs to `/var/log/pg-backup.log`
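
The naming and pruning rules can be sketched in code. This is an illustration, not the contents of `pg-backup.sh`: the function names are hypothetical, and it assumes the timestamp in the file name is UTC and that pruning is decided from the file name rather than the file's modification time.

```typescript
const BACKUP_NAME = /^goodgo_(\d{4})(\d{2})(\d{2})_(\d{2})(\d{2})(\d{2})\.sql\.gz$/;

// Formats a file name following the goodgo_YYYYMMDD_HHMMSS.sql.gz pattern.
function backupFileName(at: Date): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  const date = `${at.getUTCFullYear()}${pad(at.getUTCMonth() + 1)}${pad(at.getUTCDate())}`;
  const time = `${pad(at.getUTCHours())}${pad(at.getUTCMinutes())}${pad(at.getUTCSeconds())}`;
  return `goodgo_${date}_${time}.sql.gz`;
}

// Extracts the (assumed UTC) timestamp from a backup file name.
function parseBackupTime(fileName: string): Date | null {
  const match = BACKUP_NAME.exec(fileName);
  if (!match) return null;
  const [y, mo, d, h, mi, s] = match.slice(1).map(Number);
  return new Date(Date.UTC(y, mo - 1, d, h, mi, s));
}

// Lists backups older than the retention window; unrelated files are ignored.
function expiredBackups(files: string[], now: Date, retentionDays = 7): string[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return files.filter((file) => {
    const taken = parseBackupTime(file);
    return taken !== null && taken.getTime() < cutoff;
  });
}
```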

**Restore from Backup:**

```bash
# Direct restore with pg_restore (destructive: --clean drops existing objects first)
docker compose -f docker-compose.prod.yml exec pg-backup bash -c \
  'pg_restore -h postgres -p 5432 -U goodgo -d goodgo \
   --clean --if-exists /backups/goodgo_20260410_020000.sql.gz'

# Or using restore script
docker compose -f docker-compose.prod.yml run --rm pg-verify-backup bash -c \
  'source /scripts/pg-restore.sh /backups/goodgo_20260410_020000.sql.gz'
```

### Backup Verification

**Service:** `pg-verify-backup` container (on-demand, profile: tools)

**Verification Script:** `scripts/backup/pg-verify-backup.sh`

```bash
# Usage:
docker compose -f docker-compose.prod.yml run --rm pg-verify-backup

# Or with options:
SKIP_CLEANUP=1 REPORT_FILE=/backups/verify-report.json \
  docker compose -f docker-compose.prod.yml run --rm pg-verify-backup
```

**Verification Steps:**
1. Creates isolated test database: `goodgo_verify_<timestamp>`
2. Enables PostGIS extension
3. Restores backup into test DB
4. Verifies all 22 tables exist
5. Compares row counts between source and restored
6. Checksums critical tables using MD5 hashes
7. Checks indexes, enum types
8. Generates JSON report with results
9. **Cleanup:** Drops test DB (unless SKIP_CLEANUP=1)

**JSON Report Structure:**
```json
{
  "timestamp": "2026-04-11T10:30:00Z",
  "backupFile": "/backups/goodgo_20260410_020000.sql.gz",
  "backupSize": "150M",
  "testDatabase": "goodgo_verify_20260411_103000",
  "restoreDurationSeconds": 45,
  "passed": 28,
  "failed": 0,
  "warnings": 2,
  "result": "pass",
  "checks": [
    { "check": "Database creation", "status": "pass", "detail": "Test database created" },
    { "check": "Restore", "status": "pass", "detail": "pg_restore completed cleanly in 45s" },
    { "check": "Table existence", "status": "pass", "detail": "All 22 expected tables present" },
    { "check": "Row counts", "status": "pass", "detail": "All tables match source database" },
    { "check": "Checksum: User identities", "status": "pass", "detail": "Hashes match (abc123def456...)" },
    ...
  ]
}
```
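
A consumer of this report (for example, an alerting hook) might type and summarize it as follows. The interfaces mirror the fields shown above; the `"fail"` and `"warn"` status values and both function names are assumptions, since the sample only shows passing checks.

```typescript
interface VerifyCheck {
  check: string;
  status: "pass" | "fail" | "warn"; // "fail"/"warn" are assumed values
  detail: string;
}

interface VerifyReport {
  result: "pass" | "fail";
  passed: number;
  failed: number;
  warnings: number;
  checks: VerifyCheck[];
}

// One line per failed check, suitable for an alert message.
function failedChecks(report: VerifyReport): string[] {
  return report.checks
    .filter((c) => c.status === "fail")
    .map((c) => `${c.check}: ${c.detail}`);
}

// Treats the backup as verified only if the verdict and the counters agree.
function isVerified(report: VerifyReport): boolean {
  return report.result === "pass" && report.failed === 0;
}
```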

**GitHub Action Backup Verification:**
- File: `.github/workflows/backup-verify.yml`
- Schedule: Weekly Sundays 05:00 UTC
- Also: Manual trigger with skip_cleanup option
- Artifacts: Uploads JSON report for 30 days

---

## Deployment Pipeline

### GitHub Actions CI/CD

**Workflows:**
1. `.github/workflows/ci.yml` — Lint, typecheck, test, build (on push/PR to master)
2. `.github/workflows/deploy.yml` — Build Docker images, deploy to staging/prod
3. `.github/workflows/e2e.yml` — E2E tests (spins up full docker-compose.ci.yml)
4. `.github/workflows/backup-verify.yml` — Weekly backup verification
5. `.github/workflows/security.yml` — Dependency scanning, SAST
6. `.github/workflows/codeql.yml` — GitHub CodeQL analysis
7. `.github/workflows/load-test.yml` — K6 load testing

### CI Pipeline (`ci.yml`)

**On:** `push master`, `pull_request master`
**Node:** 22
**Concurrency:** Cancel previous runs on same ref

**Jobs:**
1. **Lint → Typecheck → Test → Build**
   - Installs pnpm, Node 22
   - Runs linter (eslint)
   - Type checks (tsc)
   - Unit tests (jest)
   - Builds all apps (turbo)
   - PostgreSQL 16 service available (goodgo_test DB)

2. **E2E Tests** (depends on ci job)
   - Full docker-compose.ci.yml services (postgres, redis, typesense, minio)
   - Runs end-to-end test suite
   - Timeout: 20 minutes
   - Env vars: DATABASE_URL, JWT secrets, payment test codes

### Deploy Pipeline (`deploy.yml`)

**On:**
- `push master` (auto-deploys to staging)
- Manual `workflow_dispatch` (choose staging or production)

**Jobs:**
1. **Build API Image**
   - Builds: `goodgo-api:${IMAGE_TAG}`
   - Dockerfile: `apps/api/Dockerfile`
   - Registry: `ghcr.io/goodgo/goodgo-api`
   - Tags: git SHA, branch name, `latest` (on master)

2. **Build Web Image**
   - Builds: `goodgo-web:${IMAGE_TAG}`
   - Dockerfile: `apps/web/Dockerfile`
   - Registry: `ghcr.io/goodgo/goodgo-web`

3. **Build AI Services Image**
   - Builds: `goodgo-ai-services:${IMAGE_TAG}`
   - Context: `libs/ai-services/`
   - Registry: `ghcr.io/goodgo/goodgo-ai-services`

4. **Deploy to Staging**
   - Condition: `github.event_name == 'push' || inputs.environment == 'staging'`
   - SSH into staging host
   - Pulls new images from GHCR
   - **Rolling update** (one service at a time; `--wait` blocks until each service is running and healthy):
     ```bash
     docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
     docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
     docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services
     ```
   - Runs migrations: `docker compose exec api npx prisma migrate deploy`
   - Prunes old images

5. **Deploy to Production**
   - Only on manual `workflow_dispatch` with `environment: production`
   - Same steps as staging
   - Requires approval through the protection rules of the GitHub `production` environment
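
The per-service update sequence can be wrapped so that a failed health wait stops the rollout before the next service is touched. This is a sketch only, not the contents of `deploy.yml`; the `rolling_update` function and the overridable `COMPOSE` variable are assumptions introduced for illustration.

```bash
# Hypothetical wrapper around the update commands shown above.
# COMPOSE can be overridden (for example COMPOSE=echo) to dry-run the loop.
COMPOSE=${COMPOSE:-"docker compose -f docker-compose.prod.yml"}

# usage: rolling_update SERVICE [SERVICE...]
rolling_update() (
  for svc in "$@"; do
    echo "updating ${svc}"
    # $COMPOSE is intentionally unquoted so it splits into command + flags.
    # A non-zero exit (container never became healthy) aborts the rollout.
    $COMPOSE up -d --no-deps --wait "$svc" || {
      echo "rollout stopped at ${svc}" >&2
      exit 1
    }
  done
)

# Example:
#   rolling_update api web ai-services
```

Running the function body in a subshell keeps `svc` out of the caller's environment and lets `exit 1` act as an early return.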

### Dockerfile Multi-Stage Builds

**API (apps/api/Dockerfile):**
- **Base:** node:22-slim + pnpm 10.27.0
- **Deps:** Install locked dependencies (layer caching)
- **Build:** Compile TypeScript, generate Prisma client
- **Prune:** `pnpm deploy --prod` (removes dev deps, hoists prod deps)
- **Production:** Minimal image, dumb-init for signals, non-root user

**Web (apps/web/Dockerfile):**
- **Base:** node:22-slim + pnpm
- **Deps:** Install dependencies
- **Build:** `next build` → standalone output + static files
- **Production:** Copy .next/standalone, public, static assets

**AI Services (libs/ai-services/Dockerfile):**
- **Base:** python:3.12-slim
- **Install:** System deps (gcc, g++), dumb-init, FastAPI/XGBoost/underthesea
- **Models:** Pre-download underthesea ML models at build time
- **User:** Run as non-root appuser
- **CMD:** `uvicorn app.main:app --host 0.0.0.0 --port 8000`

---

## Troubleshooting Guide

### Check Service Status

```bash
# All services
docker compose -f docker-compose.prod.yml ps

# Single service
docker compose -f docker-compose.prod.yml ps api

# Get logs
docker compose -f docker-compose.prod.yml logs -f api --tail=100

# Health check status
docker compose -f docker-compose.prod.yml exec api curl http://localhost:3001/health
```

### Common Issues

#### 1. API Service Not Healthy (stuck in `starting` or `unhealthy` state)

**Symptoms:**
- `docker compose ps` shows `(health: starting)` for >2 minutes
- `docker compose logs api` shows connection errors

**Diagnosis:**
```bash
# Check API liveness
docker compose exec api curl http://localhost:3001/health

# Check readiness (includes DB + Redis checks)
docker compose exec api curl http://localhost:3001/health/ready

# Check specific dependencies
docker compose exec api curl http://localhost:3001/health/db
docker compose exec api curl http://localhost:3001/health/redis
```

**Solutions:**

- **PostgreSQL not ready:**
  ```bash
  docker compose ps postgres  # Should show (healthy)
  docker compose exec postgres pg_isready -U goodgo -d goodgo
  docker compose logs postgres --tail=50
  ```

- **Redis not ready:**
  ```bash
  docker compose exec redis redis-cli ping  # Should return PONG
  docker compose logs redis --tail=50
  ```

- **PgBouncer not ready (prod):**
  ```bash
  docker compose exec pgbouncer pg_isready -h 127.0.0.1 -p 6432 -U goodgo
  docker compose logs pgbouncer --tail=50
  ```

- **Database schema not initialized:**
  ```bash
  # Run migrations manually
  docker compose exec api npx prisma migrate deploy
  # Or check if schema exists
  docker compose exec postgres psql -U goodgo -d goodgo -c "\dt"
  ```
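
When a dependency is merely slow to start, polling the readiness endpoint is often more useful than a single probe. A minimal sketch of a generic retry loop follows; `retry` is a hypothetical helper, and the 12 attempts at 10 seconds in the example are chosen only to match the two-minute symptom window above.

```bash
# Retry a command until it succeeds or the attempts run out.
# usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() (
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after ${n} attempts: $*" >&2
      exit 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
)

# Example: poll readiness for up to about two minutes
#   retry 12 10 docker compose exec -T api curl -fsS http://localhost:3001/health/ready
```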

#### 2. Database Connection Pool Exhaustion

**Symptoms:**
- Errors: `Error: unable to get a connection from the pool after X s`
- Slow queries pile up
- API latency spikes

**Diagnosis:**
```bash
# Check pool stats (prod, PgBouncer admin console; requires the virtual "pgbouncer" database)
docker compose exec pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_stats -d pgbouncer -c "SHOW POOLS;" -c "SHOW STATS;"

# Or query PostgreSQL directly
docker compose exec postgres psql -U goodgo -d goodgo -c "SELECT count(*) FROM pg_stat_activity"
```

**Solutions:**
- Increase `PGBOUNCER_POOL_SIZE` (default: 20)
- Increase `PGBOUNCER_MAX_CLIENT_CONN` (default: 200)
- Reduce long-running queries (add query timeout)
- Release idle server connections sooner by lowering `server_idle_timeout` in `pgbouncer.ini`
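
In `SHOW POOLS` output, a non-zero `cl_waiting` means clients are queued for a server connection, which is the most direct signal of pool exhaustion. The sketch below sums that column, assuming unaligned pipe-separated `psql -A` output with a header row; `waiting_clients` is a hypothetical helper.

```bash
# Sum the cl_waiting column of `SHOW POOLS` output read from stdin.
# The column is located by header name because its position differs
# between PgBouncer versions.
waiting_clients() {
  awk -F'|' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "cl_waiting") col = i; next }
    col && NF >= col { total += $col }   # skips the "(N rows)" footer
    END { print total + 0 }
  '
}

# Example:
#   docker compose exec -T pgbouncer psql -h 127.0.0.1 -p 6432 \
#     -U pgbouncer_stats -d pgbouncer -A -c "SHOW POOLS;" | waiting_clients
```

A result that stays above zero across several samples suggests raising the pool size or shortening queries; a brief spike usually does not.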

#### 3. Redis Connection Failures (Non-Fatal)

**Symptoms:**
- Logs: `Redis check failed` or `ECONNREFUSED`
- The API still responds, but more slowly because reads fall back to the database
- Health check `/health/ready` returns 503

**Expected Behavior:** Cache misses → app serves from database

**Diagnosis:**
```bash
# Check Redis availability
docker compose exec redis redis-cli ping

# Check RedisService logs
docker compose logs api | grep -i redis
```

**Solutions:**
- Restart Redis: `docker compose restart redis`
- Check memory: `docker compose exec redis redis-cli info memory`
- If usage is at `maxmemory`, raise the limit in `docker-compose.yml` and restart Redis
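
To judge how close Redis is to its limit, the `used_memory` and `maxmemory` fields of `INFO memory` can be turned into a percentage. This is a small sketch with a hypothetical helper name, `redis_memory_pct`; a `maxmemory` of 0 means no limit is configured.

```bash
# Read `redis-cli info memory` output on stdin and print used memory
# as a whole-number percentage of maxmemory.
redis_memory_pct() {
  # INFO lines end in CRLF, so strip the carriage returns first
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max > 0) printf "%d\n", used * 100 / max
      else print "unlimited"
    }'
}

# Example:
#   docker compose exec -T redis redis-cli info memory | redis_memory_pct
```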

#### 4. Typesense Search Not Indexing

**Symptoms:**
- Search returns 0 results
- Listings created but not searchable
- Typesense `/health` reports OK, but the collection is empty

**Diagnosis:**
```bash
# Check collection exists
curl http://localhost:8108/collections -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"

# Check collection stats
curl "http://localhost:8108/collections/listings" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" | jq .

# Check recent docs
curl "http://localhost:8108/collections/listings/documents/search?q=*" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" | jq '.found'
```

**Solutions:**
- Verify `TYPESENSE_API_KEY` matches container env var
- Reindex all listings:
  ```bash
  docker compose exec api npx ts-node scripts/reindex-listings.ts
  ```
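- If the collection is corrupted, drop it, let the API recreate the schema, then reindex:
  ```bash
  curl -X DELETE "http://localhost:8108/collections/listings" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"
  # Then restart API service to recreate schema
  docker compose restart api
  # The recreated collection starts empty, so repopulate it
  docker compose exec api npx ts-node scripts/reindex-listings.ts
  ```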
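
A quick way to decide whether a reindex is needed is to compare the database row count with the `num_documents` value Typesense reports for the collection. The sketch below only does the comparison; `index_drift` is a hypothetical helper, and the `"Listing"` table name in the example is an assumption about the Prisma schema.

```bash
# Compare the database row count with the Typesense document count.
# usage: index_drift DB_COUNT TYPESENSE_COUNT
index_drift() {
  db_count=$1
  ts_count=$2
  if [ "$db_count" -gt "$ts_count" ]; then
    echo "missing from index: $((db_count - ts_count))"
    return 1
  elif [ "$db_count" -lt "$ts_count" ]; then
    echo "stale in index: $((ts_count - db_count))"
    return 1
  fi
  echo "in sync: ${ts_count}"
}

# Example (table name "Listing" is assumed, not confirmed):
#   db=$(docker compose exec -T postgres psql -U goodgo -d goodgo -At \
#        -c 'SELECT count(*) FROM "Listing"')
#   ts=$(curl -s "http://localhost:8108/collections/listings" \
#        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" | jq '.num_documents')
#   index_drift "$db" "$ts"
```

If only some listings are meant to be searchable (for example published ones), the database query would need the matching filter before the counts are comparable.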

#### 5. Payment Callback Failures

**Symptoms:**
- Payment status stuck in `PENDING`
- Logs: `Invalid callback signature for provider=VNPAY`

**Diagnosis:**
```bash
# Check payment record in DB
docker compose exec postgres psql -U goodgo -d goodgo -c \
  "SELECT id, status, provider, \"providerTxId\", \"callbackData\" FROM \"Payment\" \
   WHERE \"providerTxId\" = 'your-txid' ORDER BY \"createdAt\" DESC LIMIT 1;"

# Check logs for callback handler
docker compose logs api | grep -i "HandleCallbackHandler\|callback"
```

**Solutions:**
- Verify payment gateway credentials (VNPAY_HASH_SECRET, MOMO_SECRET_KEY, etc.)
- Manually verify callback signature (contact payment provider support)
- Replay the callback manually (only if an idempotency key is available):
  ```bash
  curl -X POST http://localhost:3001/api/payments/callback \
    -H "Content-Type: application/json" \
    -d '{"provider":"VNPAY",...callback data...}'
  ```
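
Before escalating to the provider, the signature can be recomputed locally and compared with the one received. The sketch below assumes the provider signs a canonical payload string with HMAC-SHA512 and a shared secret; the algorithm, the canonicalization rules, and the helper names `hmac_sha512_hex` and `signature_matches` are all assumptions to confirm against the provider's documentation and the callback handler code.

```bash
# Compute a lowercase hex HMAC-SHA512 of DATA keyed with SECRET.
# usage: hmac_sha512_hex SECRET DATA
# Note: the secret is visible in the process list while openssl runs,
# so use this only on a trusted host.
hmac_sha512_hex() {
  printf '%s' "$2" | openssl dgst -sha512 -hmac "$1" | awk '{ print $NF }'
}

# Succeeds when RECEIVED equals the recomputed signature (hex case ignored).
# usage: signature_matches SECRET DATA RECEIVED
signature_matches() {
  expected=$(hmac_sha512_hex "$1" "$2")
  received=$(printf '%s' "$3" | tr 'A-F' 'a-f')
  [ -n "$expected" ] && [ "$expected" = "$received" ]
}

# Example (payload string and variable names are placeholders):
#   signature_matches "$VNPAY_HASH_SECRET" "$canonical_payload" "$received_hash" \
#     && echo "signature ok" || echo "signature mismatch"
```

A mismatch with the correct secret usually points at the canonical string (field order, URL encoding) rather than at the key itself.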

#### 6. Backup Verification Fails

**Symptoms:**
- GitHub Action `.github/workflows/backup-verify.yml` fails
- Restore test database shows mismatched row counts

**Diagnosis:**
```bash
# Run verification manually (start postgres detached so the next command can run)
docker compose -f docker-compose.ci.yml up -d --wait postgres
docker compose -f docker-compose.ci.yml exec postgres \
  /scripts/pg-verify-backup.sh /backups/goodgo_latest.sql.gz

# Check JSON report
jq . /tmp/backups/verify-report.json
```

**Solutions:**
- Check whether the backup file is corrupt: `gzip -t goodgo_*.sql.gz`
- Rerun the restore with verbose output to find the failing object: `pg_restore --verbose`
- Check PostGIS extension availability: `psql -c "SELECT * FROM pg_available_extensions WHERE name = 'postgis';"`
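
The file-level checks can be bundled into one pre-restore test. This is a minimal sketch with a hypothetical helper name, `check_backup`; it only proves the gzip stream is intact, not that the SQL inside restores cleanly.

```bash
# Basic integrity checks for a gzip-compressed dump.
# usage: check_backup FILE
check_backup() {
  file=$1
  # -s is true only for an existing, non-empty file
  [ -s "$file" ] || { echo "missing or empty: ${file}" >&2; return 1; }
  # gzip -t decompresses and verifies the CRC without writing any output
  gzip -t "$file" 2>/dev/null || { echo "corrupt gzip stream: ${file}" >&2; return 1; }
  echo "ok: ${file}"
}

# Example:
#   check_backup /backups/goodgo_latest.sql.gz
```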

#### 7. Memory/CPU Pressure

**Symptoms:**
- OOM kills, container exits 137
- CPU throttling, latency spikes
- Prometheus `container_memory_usage_bytes` near limit

**Diagnosis:**
```bash
# Check Docker stats
docker stats --no-stream

# Check limits in compose file
docker compose config | grep -A3 "resources:"

# Check the configured memory limit in bytes (0 = no limit)
docker inspect goodgo-api | jq '.[0].HostConfig.Memory'
```

**Solutions:**
- Increase resource limits in `docker-compose.prod.yml`
- Reduce log verbosity (set LOG_LEVEL=warn)
- Implement pagination for large queries
- Scale horizontally (add more API replicas)
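
An exit status above 128 conventionally encodes 128 plus the signal number, so 137 corresponds to signal 9 (SIGKILL), which is what the kernel OOM killer sends. The sketch below decodes a status; `explain_exit` is a hypothetical helper.

```bash
# Translate a container exit status into a readable cause.
# usage: explain_exit STATUS
explain_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    # kill -l N prints the name of signal number N
    echo "killed by signal ${sig} ($(kill -l "$sig"))"
  else
    echo "exited with status ${code}"
  fi
}

# Example: SIGKILL alone does not prove an OOM kill; Docker records that separately.
#   explain_exit 137
#   docker inspect --format '{{.State.OOMKilled}}' goodgo-api
```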

### Prometheus Queries for Debugging

```promql
# API request latency p99
histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket[5m])) by (le))

# API error rate (5xx)
(sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100

# Container memory usage
container_memory_usage_bytes{name="goodgo-api"}

# Container CPU usage
rate(container_cpu_usage_seconds_total{name="goodgo-api"}[5m])

# PostgreSQL active queries
pg_stat_activity_count{state="active"}

# Redis memory usage
redis_memory_used_bytes / 1024 / 1024  # in MB

# Typesense collection size
typesense_documents_count{collection="listings"}
```

### Emergency Procedures

**Full System Reset (dev only):**
```bash
docker compose down -v  # Remove all volumes!
docker system prune -a
docker compose up -d --wait
docker compose exec api npx prisma db push
docker compose exec api npx ts-node scripts/seed.ts
```
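
Because the reset deletes every volume, it can help to put a guard in front of it. The following is a sketch only: `guard_dev_reset` is a hypothetical helper, and using `NODE_ENV` as the environment marker is an assumption about how hosts are labelled.

```bash
# Refuse destructive resets unless the environment is clearly a dev one.
# usage: guard_dev_reset [ENV_NAME]   (defaults to $NODE_ENV, assumed marker)
guard_dev_reset() {
  env_name=${1:-${NODE_ENV:-unknown}}
  case "$env_name" in
    development|dev|local)
      return 0
      ;;
    *)
      echo "refusing reset: environment is '${env_name}'" >&2
      return 1
      ;;
  esac
}

# Example:
#   guard_dev_reset && docker compose down -v
```

An unset or unrecognized environment is treated as unsafe, so the guard fails closed.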

**Database Emergency Restore:**
```bash
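# Find latest backup
ls -t /var/lib/docker/volumes/pg_backups/_data/goodgo_*.sql.gz | head -1

# Create the target database (the restore does not create it)
createdb -h localhost -p 5432 -U goodgo goodgo_restored

# Restore to new database. A gzip-compressed plain-SQL dump (which the
# .sql.gz name suggests) must be piped through psql; pg_restore reads only
# custom, directory, or tar archives. For such an archive, run pg_restore
# with the same connection flags instead.
gunzip -c /path/to/backup.sql.gz | \
  psql -h localhost -p 5432 -U goodgo -d goodgo_restored -v ON_ERROR_STOP=1

# Verify restore
psql -h localhost -p 5432 -U goodgo -d goodgo_restored -c "SELECT count(*) FROM \"User\";"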
```

**Force Kill Stuck Service:**
```bash
# If health check broken
docker compose kill api
docker compose rm -f api
docker compose up -d api
```

---

## Appendix: Key File Locations

```
goodgo-platform/
├── docker-compose.yml              # Dev environment
├── docker-compose.prod.yml         # Prod environment (with pgbouncer, resource limits)
├── docker-compose.ci.yml           # CI/E2E test environment
├── .env.example                    # Template for all required env vars
│
├── apps/
│   ├── api/
│   │   ├── Dockerfile              # Multi-stage NestJS build
│   │   ├── docker-entrypoint.sh    # Startup script (migrations, app start)
│   │   ├── src/
│   │   │   ├── modules/health/health.controller.ts
│   │   │   ├── modules/payments/application/commands/handle-callback/
│   │   │   ├── modules/shared/infrastructure/redis.service.ts
│   │   │   └── modules/search/infrastructure/services/typesense-search.repository.ts
│   │   └── package.json
│   │
│   └── web/
│       ├── Dockerfile              # Multi-stage Next.js build
│       └── package.json
│
├── libs/
│   └── ai-services/
│       ├── Dockerfile              # Python FastAPI build
│       ├── app/main.py             # FastAPI app entry
│       └── pyproject.toml
│
├── prisma/
│   └── schema.prisma               # Complete Prisma schema (22 models)
│
├── infra/
│   └── pgbouncer/
│       ├── pgbouncer.ini           # Connection pooling config
│       ├── userlist.txt.template   # User list (templated)
│       └── entrypoint.sh           # Env substitution script
│
├── scripts/
│   └── backup/
│       ├── pg-backup.sh            # Daily backup automation
│       ├── pg-verify-backup.sh     # Restore verification
│       └── pg-restore.sh           # Manual restore script
│
├── monitoring/
│   ├── prometheus/
│   │   ├── prometheus.yml          # Scrape config (goodgo-api metrics)
│   │   └── alert-rules.yml         # Latency + error rate alerts
│   ├── loki/
│   │   └── loki-config.yml         # Log aggregation config (15-day retention)
│   ├── promtail/
│   │   └── promtail-config.yml     # Log shipping (Pino JSON parsing)
│   └── grafana/
│       ├── provisioning/
│       │   ├── datasources/datasource.yml
│       │   └── dashboards/dashboard.yml
│       └── dashboards/
│           ├── api-latency.json
│           ├── api-overview.json
│           ├── database.json
│           ├── logs.json
│           ├── search.json
│           ├── web-vitals.json
│           └── business-metrics.json
│
└── .github/workflows/
    ├── ci.yml                      # Lint, test, build
    ├── deploy.yml                  # Build images, deploy to staging/prod
    ├── e2e.yml                     # End-to-end tests
    ├── backup-verify.yml           # Weekly backup verification
    ├── security.yml                # Dependency/SAST scanning
    ├── codeql.yml                  # GitHub CodeQL
    └── load-test.yml               # K6 load testing
```

---

## Document Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2026-04-11 | DevOps Team | Initial comprehensive runbook |

---

**Maintained By:** GoodGo Platform SRE Team
**Contact:** devops@goodgo.vn