# GoodGo Platform — Operational Infrastructure Runbook

**Last Updated:** April 11, 2026
**Version:** 1.0
**Purpose:** Complete infrastructure reference for ops teams, SREs, and on-call engineers

---

## Table of Contents

1. [Executive Summary](#executive-summary)
2. [Services Architecture](#services-architecture)
3. [Docker Compose Specifications](#docker-compose-specifications)
4. [Database Layer](#database-layer)
5. [Caching & Search](#caching--search)
6. [Monitoring & Observability](#monitoring--observability)
7. [Payment Integration](#payment-integration)
8. [Health Checks](#health-checks)
9. [Environment Variables](#environment-variables)
10. [Backup & Recovery](#backup--recovery)
11. [Deployment Pipeline](#deployment-pipeline)
12. [Troubleshooting Guide](#troubleshooting-guide)

---

## Executive Summary

**GoodGo Platform** is a monorepo real estate marketplace built with:

- **Frontend:** Next.js (TypeScript)
- **Backend API:** NestJS (TypeScript)
- **AI Services:** Python/FastAPI
- **Database:** PostgreSQL 16 + PostGIS
- **Cache:** Redis 7
- **Search:** Typesense 27.1
- **Object Storage:** MinIO (S3-compatible)
- **Monitoring:** Prometheus + Grafana + Loki + Promtail
- **Message Queue:** Built-in CQRS/Event Bus (NestJS)

**Total Services in Production:** 12+ (detailed below)

---

## Services Architecture

### Service Inventory

| Service | Image | Port | Purpose | Health Check (retries × interval) | Dependencies |
|---------|-------|------|---------|-----------------------------------|--------------|
| **api** | `goodgo-api:latest` | 3001 | NestJS REST API | `GET /health` (3x30s) | postgres, redis, typesense, pgbouncer |
| **web** | `goodgo-web:latest` | 3000 | Next.js frontend | `GET /` (3x30s) | api |
| **ai-services** | `goodgo-ai-services:latest` | 8000 | Python FastAPI (price estimation, NLP) | `GET /health` (3x30s) | n/a |
| **postgres** | `postgis/postgis:16-3.4` | 5432 | Primary database | `pg_isready` (5x10s) | n/a |
| **pgbouncer** | `edoburu/pgbouncer:1.23.1-p2` | 6432 | Connection pooling (transaction mode) | `pg_isready` (5x10s) | postgres |
| **redis** | `redis:7-alpine` | 6379 | Cache + session store | `PING` (5x10s) | n/a |
| **typesense** | `typesense/typesense:27.1` | 8108 | Full-text search index | `GET /health` (5x10s) | n/a |
| **minio** | `minio/minio:latest` | 9000/9001 | Object storage + console | `mc ready local` (5x10s) | n/a |
| **loki** | `grafana/loki:3.0.0` | 3100 | Log aggregation | `GET /ready` (5x15s) | n/a |
| **promtail** | `grafana/promtail:3.0.0` | 9080 | Log shipper | (depends on loki healthy) | loki |
| **prometheus** | `prom/prometheus:v2.51.0` | 9090 | Metrics scraper | `GET /-/healthy` (3x15s) | n/a |
| **grafana** | `grafana/grafana:10.4.1` | 3002 | Dashboards + alerting | `GET /api/health` (3x15s) | prometheus, loki |
| **pg-backup** | `postgis/postgis:16-3.4` | — | Automated backup cron | depends_on postgres | postgres |

### Network & Volumes

- **Network:** Docker bridge network `goodgo-net`
- **Volumes:**
  - `pgdata` — PostgreSQL data files
  - `redis_data` — Redis append-only file (AOF) persistence data
  - `typesense_data` — Search index
  - `minio_data` — Object storage
  - `pg_backups` — Database backups (daily, retained for 7 days)
  - `loki_data` — Log chunks (retention: 15 days)
  - `prometheus_data` — Metrics TSDB (retention: 30 days in prod, 15 days in dev)
  - `grafana_data` — Dashboards, datasource configs

---

## Docker Compose Specifications

### Development Environment (`docker-compose.yml`)

**12 Services (minimal dependencies, no resource limits)**

```yaml
# Per-service summary, not literal Compose syntax
services:
  postgres: PostGIS 16, port 5432, healthcheck: pg_isready (30s start-period)
  redis: Alpine 7, port 6379, maxmemory: 256mb LRU, AOF enabled
  typesense: v27.1, port 8108, CORS enabled, healthcheck /health
  minio: latest, ports 9000 (API) / 9001 (console)
  ai-services: Custom Python build, port 8000
  pg-backup: Automated daily dumps at 02:00 UTC, cron retention cleanup
  pg-verify-backup: On-demand backup restore verification (profile: tools)
  loki: v3.0.0, port 3100, 15-day retention, 2h compaction delay
  promtail: v3.0.0, Docker socket instrumentation, Pino JSON parsing
  prometheus: v2.51.0, port 9090, 15-day retention, lifecycle API enabled
  grafana: v10.4.1, port 3002, datasources pre-provisioned
```

**Key Differences from Prod:**

- No resource limits (use all available CPU/memory)
- Smaller retention windows (7-15 days)
- PostgreSQL on port 5432 (direct, no pgbouncer)
- loki/prometheus/grafana on alternate ports

### Production Environment (`docker-compose.prod.yml`)

**14 Services (with pgbouncer, resource limits, rolling updates)**

```yaml
# Per-service summary, not literal Compose syntax
services:
  api: NestJS, resource limits: 1.0 CPU / 1g memory
  web: Next.js, resource limits: 0.5 CPU / 512m memory
  ai-services: Python, resource limits: 1.0 CPU / 1g memory
  postgres: PostGIS, resource limits: 2.0 CPU / 2g memory
  pgbouncer: Connection pool (NEW), 20 default connections, transaction mode
  redis: 7-alpine, resource limits: 0.5 CPU / 768m memory, password auth
  typesense: 27.1, resource limits: 1.0 CPU / 1g memory
  minio: latest, resource limits: 0.5 CPU / 1g memory
  loki: v3.0.0, resource limits: 0.5 CPU / 512m memory
  promtail: v3.0.0, resource limits: 0.25 CPU / 256m memory
  prometheus: v2.51.0, resource limits: 0.5 CPU / 1g memory, 30-day retention
  grafana: v10.4.1, resource limits: 0.5 CPU / 512m memory
  pg-backup: Same as dev
```

**Production-Specific Flags:**

- `read_only: true` on app containers (api, web, ai-services)
- `tmpfs: [/tmp]` for runtime temp files
- `security_opt: [no-new-privileges:true]`
- `logging: json-file` with 10m max-size, 3-5 files rotation
- **PgBouncer inserted between apps ↔ Postgres** (port 6432)
- Secrets management: `GRAFANA_ADMIN_USER`, `GRAFANA_ADMIN_PASSWORD` from Docker secrets
- Redis requires password authentication

### CI/E2E Environment (`docker-compose.ci.yml`)

**Minimal 4 Services (tmpfs for speed)**

```yaml
services:
  postgres: goodgo_test DB, tmpfs (/var/lib/postgresql/data)
  redis: --save "" --appendonly no (no persistence)
  typesense: tmpfs (/data)
  minio: tmpfs (/data)
```

**Used by:**

- GitHub Actions E2E test suite
- Local `docker compose -f docker-compose.ci.yml up --wait`

---

## Database Layer

### PostgreSQL + PostGIS

**Version:** PostgreSQL 16 with PostGIS 3.4 (image `postgis/postgis:16-3.4`)
**Schema:** 22 Prisma models + Prisma migration tracking

#### Prisma Schema Models

1. **Auth:** User, RefreshToken, OAuthAccount, Agent
2. **Listings:** Property, PropertyMedia, Listing
3. **Search:** SavedSearch
4. **Transactions:** Transaction, Inquiry, Lead
5. **Payments:** Payment (with PaymentProvider enum: VNPAY, MOMO, ZALOPAY, BANK_TRANSFER)
6. **Subscriptions:** Plan, Subscription, UsageRecord
7. **Analytics:** Valuation, MarketIndex
8. **Notifications:** NotificationLog, NotificationPreference
9. **Audit:** AdminAuditLog
10. **Reviews:** Review

#### Key Database Features

- **PostGIS Geometry:** Property.location (Point, SRID 4326) with GIST index
- **Enums:** UserRole, KYCStatus, PropertyType, TransactionType, ListingStatus, Direction, OAuthProvider, TransactionStatus, LeadStatus, PaymentProvider, PaymentStatus, PaymentType, PlanTier, SubscriptionStatus, NotificationChannel, NotificationStatus, AdminAction, AuditTargetType
- **Compound Indexes:** Query optimization on (role, isActive, createdAt), (sellerId, status, publishedAt), (userId, status, createdAt), etc.
- **Constraints:** Unique idempotency key on Payment (userId, provider, idempotencyKey)

#### Connection Pooling: PgBouncer

**Dev Mode (docker-compose.yml):**

- Apps connect directly to `postgres:5432`
- No pooling overhead

**Prod Mode (docker-compose.prod.yml):**

- Apps connect to `pgbouncer:6432`
- **Pool Mode:** `transaction` (connections returned after each transaction)
- **Pool Size:** 20 connections (default, tunable via `PGBOUNCER_POOL_SIZE`)
- **Max Client Conn:** 200 (tunable via `PGBOUNCER_MAX_CLIENT_CONN`)
- **Reserve Pool:** 5 connections (fallback when pool exhausted)
- **Timeouts:**
  - server_connect_timeout: 15s
  - server_idle_timeout: 600s
  - server_lifetime: 3600s (connection recycle)
  - query_wait_timeout: 120s
  - query_timeout: 0 (disabled)
- **Admin Console:** pgbouncer_admin user (password via PGBOUNCER_ADMIN_PASSWORD env var)
- **Stats Console:** pgbouncer_stats user (password via PGBOUNCER_STATS_PASSWORD env var)

**Migration Workaround:**

- API has two DATABASE_URL env vars:
  - `DATABASE_URL` → pgbouncer:6432 (normal queries)
  - `DATABASE_URL_DIRECT` → postgres:5432 (migrations, introspection, DDL)
- `RUN_MIGRATIONS=true` switches app to use DATABASE_URL_DIRECT for `prisma migrate deploy`

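
The URL selection described in the migration workaround can be sketched as below. This is a minimal illustration, not the API's actual startup code: the helper name `resolveDatabaseUrl` and the `Env` type are hypothetical, and only the three variable names come from this runbook.

```typescript
// Hypothetical helper (not taken from the codebase): choose the connection
// string based on RUN_MIGRATIONS, as described in the migration workaround.
type Env = Record<string, string | undefined>;

function resolveDatabaseUrl(env: Env): string {
  // Migrations and other DDL go straight to Postgres; normal traffic goes through PgBouncer.
  const useDirect = env.RUN_MIGRATIONS === "true";
  const name = useDirect ? "DATABASE_URL_DIRECT" : "DATABASE_URL";
  const url = env[name];
  if (!url) {
    throw new Error(`${name} is not set`);
  }
  return url;
}
```

Failing loudly when the selected variable is missing avoids silently running migrations through the transaction-mode pool.
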
#### Backup Strategy

**Automated Backups:**

- **Schedule:** Daily at 02:00 UTC (cron inside pg-backup container)
- **Format:** Custom format with gzip compression (level 6)
- **Retention:** 7 days (configurable via BACKUP_RETENTION_DAYS)
- **Location:** `pg_backups` volume (mount to persistent storage in prod)
- **File Pattern:** `goodgo_YYYYMMDD_HHMMSS.sql.gz`
- **Restore Script:** `/scripts/backup/pg-restore.sh` (manual restore)
- **Verification Script:** `/scripts/backup/pg-verify-backup.sh` (automated E2E verification)

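
To make the file pattern and retention rule concrete, the sketch below parses backup names and lists those past the retention window. It is illustrative only: the real `pg-backup.sh` may prune by file modification time instead, and the function names plus the assumption that the name's timestamp is UTC are mine, not the source's.

```typescript
// Matches the documented pattern goodgo_YYYYMMDD_HHMMSS.sql.gz
const BACKUP_NAME = /^goodgo_(\d{4})(\d{2})(\d{2})_(\d{2})(\d{2})(\d{2})\.sql\.gz$/;

// Returns the timestamp encoded in the file name (assumed UTC), or null if the name does not match.
function backupTimestamp(fileName: string): Date | null {
  const m = BACKUP_NAME.exec(fileName);
  if (!m) return null;
  const [y, mo, d, h, mi, s] = m.slice(1).map(Number);
  return new Date(Date.UTC(y, mo - 1, d, h, mi, s));
}

// Lists backups older than the retention window; unrelated files are never selected.
function expiredBackups(fileNames: string[], now: Date, retentionDays = 7): string[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return fileNames.filter((name) => {
    const ts = backupTimestamp(name);
    return ts !== null && ts.getTime() < cutoff;
  });
}
```

Ignoring names that do not match the pattern keeps a pruning job from deleting anything it did not create.
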
**Verification Process (runs weekly):**

1. Restores latest backup to isolated test database (`goodgo_verify_<timestamp>`)
2. Verifies all 22 tables exist
3. Compares row counts between source and restored DB
4. Checksums critical tables (User, Property, Listing, Payment, Subscription, Transaction, Plan, _prisma_migrations)
5. Checks PostGIS extension, indexes, enum types
6. Generates JSON report with pass/fail result
7. **Cleanup:** Drops test DB on exit (unless SKIP_CLEANUP=1)
8. **Exit Codes:** 0=pass, 1=checks failed, 2=setup error

**CI/CD Backup Verification:**

- GitHub Action: `.github/workflows/backup-verify.yml`
- Runs weekly Sundays 05:00 UTC
- Also manually triggerable with skip_cleanup option
- Uploads JSON report as artifact

---

## Caching & Search

### Redis

**Image:** `redis:7-alpine`
**Port:** 6379

**Production Configuration:**

```bash
# --appendonly yes          AOF persistence
# --requirepass             authentication required
# --maxmemory 512mb         max memory limit (prod)
# --maxmemory-policy        LRU eviction across all keys when full
redis-server \
  --appendonly yes \
  --requirepass "${REDIS_PASSWORD}" \
  --maxmemory 512mb \
  --maxmemory-policy allkeys-lru
```

**Development Configuration:**

```bash
redis-server \
  --appendonly yes \
  --maxmemory 256mb \
  --maxmemory-policy allkeys-lru
```

**ioredis Client Configuration:**

```typescript
// From RedisService in apps/api/src/modules/shared/infrastructure/redis.service.ts
{
  host: process.env.REDIS_HOST ?? 'localhost',
  port: Number(process.env.REDIS_PORT ?? 6379),
  password: process.env.REDIS_PASSWORD ?? undefined,
  lazyConnect: true,        // Connect on first command, so the app starts even if Redis is unavailable
  enableReadyCheck: false,  // Prevents "Redis is not ready" errors during transient outages
  maxRetriesPerRequest: 1,  // Fail fast: a pending command is rejected after a single retry
  retryStrategy(times: number): number {
    // Reconnect delay grows linearly and is capped at 5s: 1s → 2s → 3s → 4s → 5s → 5s...
    return Math.min(times * 1000, 5000);
  }
}
```

**Graceful Degradation:**

- Cache misses don't fail the application
- CacheService catches Redis errors and returns cache miss
- App serves data directly from PostgreSQL if Redis down
- Health check at `GET /health/redis` warns but doesn't fail readiness probe

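
The degradation behavior listed above can be sketched as a cache-aside helper that treats any cache error as a miss. This is an illustration under assumptions, not the actual `CacheService`: the `Cache` interface and `getOrLoad` are hypothetical names, and values are assumed to be JSON-serializable.

```typescript
// Hypothetical minimal cache contract; a real Redis client would sit behind it.
interface Cache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

async function getOrLoad<T>(
  cache: Cache,
  key: string,
  ttlSeconds: number,
  load: () => Promise<T>, // e.g. a PostgreSQL query
): Promise<T> {
  try {
    const hit = await cache.get(key);
    if (hit !== null) return JSON.parse(hit) as T;
  } catch {
    // Cache unavailable or value unreadable: treat as a miss and fall through.
  }
  const value = await load();
  try {
    await cache.set(key, JSON.stringify(value), ttlSeconds);
  } catch {
    // Failing to populate the cache must not fail the request.
  }
  return value;
}
```

With this shape, a Redis outage costs latency (every call hits the loader) but not availability.
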
**Use Cases:**

- Session storage
- Cache layer for expensive queries
- Rate limiting (if implemented)
- Real-time counters

---

### Typesense

**Image:** `typesense/typesense:27.1`
**Port:** 8108 (HTTP only, internal Docker network)
**API Key:** `${TYPESENSE_API_KEY}` (must be set in .env)

**Collection Schema:**

```
Collection Name: "listings"
Fields:
  - listingId (string)
  - propertyId (string)
  - title (string, searchable, highlights)
  - description (string, searchable, highlights)
  - propertyType (string, faceted)
  - transactionType (string, faceted: SALE/RENT)
  - priceVND (int64, sortable)
  - pricePerM2 (float, optional)
  - areaM2 (float)
  - bedrooms (int32, faceted)
  - bathrooms (int32, faceted)
  - floors (int32)
  - direction (string, faceted: NORTH/SOUTH/EAST/WEST/etc)
  - address (string)
  - ward (string, faceted)
  - district (string, faceted)
  - city (string, faceted)
  - location (geopoint) — for radius search
  - agentId (string)
  - sellerId (string)
  - status (string, faceted: ACTIVE/SOLD/DRAFT/etc)
  - publishedAt (int64, sortable)
  - viewCount (int32)
  - saveCount (int32)
  - projectName (string, faceted)
  - amenities (string[], faceted)
```

**Search Features:**

- **Full-text search** on: title, description, address, district, city, projectName
- **Query weights:** title=5, description=3, address=2, district=2, city=1, projectName=2
- **Filtering:** propertyType, transactionType, bedrooms, district, city, status, amenities
- **Geo-search:** radius-based queries (lat, lng, km)
- **Sorting:** price (asc/desc), distance (asc from geopoint), date (desc), relevance
- **Highlights:** HTML marks on matched terms in title and description
- **Facets:** Return aggregated counts for filtering

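
As a rough illustration of how these features map onto Typesense search parameters, the sketch below builds a parameter object by hand. The parameter names (`q`, `query_by`, `query_by_weights`, `filter_by`, `sort_by`, `facet_by`, `highlight_fields`) are standard Typesense search parameters; the `ListingSearchInput` shape, the `buildSearchParams` name, and the hard-coded `status:=ACTIVE` filter are assumptions, not the repository's actual implementation.

```typescript
// Hypothetical input shape for illustration only.
interface ListingSearchInput {
  query: string;
  city?: string;
  minBedrooms?: number;
  near?: { lat: number; lng: number; radiusKm: number };
}

function buildSearchParams(input: ListingSearchInput): Record<string, string> {
  // Assumption: only active listings are searchable.
  const filters: string[] = ["status:=ACTIVE"];
  if (input.city) filters.push(`city:=${input.city}`);
  if (input.minBedrooms !== undefined) filters.push(`bedrooms:>=${input.minBedrooms}`);
  if (input.near) {
    const { lat, lng, radiusKm } = input.near;
    filters.push(`location:(${lat}, ${lng}, ${radiusKm} km)`); // radius geo-filter
  }
  return {
    q: input.query,
    // Field order must line up with the weights: title=5, description=3, address=2, district=2, city=1, projectName=2
    query_by: "title,description,address,district,city,projectName",
    query_by_weights: "5,3,2,2,1,2",
    filter_by: filters.join(" && "),
    // Nearest first for geo queries, otherwise by text relevance.
    sort_by: input.near
      ? `location(${input.near.lat}, ${input.near.lng}):asc`
      : "_text_match:desc",
    facet_by: "propertyType,district,city",
    highlight_fields: "title,description",
  };
}
```

Filter values containing special characters would need escaping before being interpolated; that is left out here for brevity.
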
**TypesenseSearchRepository (`apps/api/src/modules/search/infrastructure/services/typesense-search.repository.ts`):**

- `ensureCollection()` — Creates schema if not exists
- `dropCollection()` — Cleanup (testing only)
- `indexDocument(doc)` — Upsert single document
- `indexDocuments(docs)` — Bulk import with error reporting
- `removeDocument(id)` — Delete by ID
- `search(params)` — Execute search with filters, sort, pagination

**Graceful Degradation:**

- If Typesense down, search falls back to PostgreSQL full-text search
- TypesenseClientService implements retry logic with exponential backoff
- Health check at `GET /health` returns JSON status

---

## Monitoring & Observability

### Prometheus

**Image:** `prom/prometheus:v2.51.0`
**Port:** 9090
**Retention:** 15 days (dev), 30 days (prod)
**Lifecycle API:** Enabled (`--web.enable-lifecycle`)

**Scrape Targets (`monitoring/prometheus/prometheus.yml`):**

```yaml
scrape_configs:
  - job_name: goodgo-api
    metrics_path: /metrics
    static_configs:
      # Dev: API runs on the host. In prod the target is 'api:3001' (API in container).
      - targets: ['host.docker.internal:3001']
        labels:
          service: goodgo-api
          environment: development  # 'production' in prod

  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
```

**Expected Metrics from API:**

- `goodgo_api_request_duration_seconds_bucket{le, route, method}` — Request latency histogram
- `http_requests_total{status_code, job}` — Request count by status code
- Custom business metrics (if implemented in NestJS @prometheus decorators)

### Alert Rules (`monitoring/prometheus/alert-rules.yml`)

**Latency Alerts:**

1. **ApiLatencyP99High** (warning)
   - Trigger: p99 latency > 1s for 5 minutes
   - Dashboard: `/d/goodgo-api-latency/goodgo-api-latency`
   - Runbook: `https://docs.goodgo.vn/runbooks/api-latency-high`

2. **ApiEndpointLatencyP99High** (warning)
   - Trigger: Per-endpoint p99 > 2s for 5 minutes
   - Annotates: method, route labels

3. **ApiLatencyP99Critical** (critical - SLO breach)
   - Trigger: p99 latency > 3s for 3 minutes
   - Escalation required
   - Runbook: `https://docs.goodgo.vn/runbooks/api-latency-critical`

**Error Rate Alert:**

1. **ApiErrorRate5xxHigh** (warning)
   - Trigger: 5xx error rate > 1% for 5 minutes
   - Uses: `(5xx errors / total requests) * 100`

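
The error-rate arithmetic and the "for 5 minutes" condition can be worked through in a few lines. This is only a model of the rule's logic, not how Prometheus evaluates it internally; the `Sample` type and both function names are hypothetical.

```typescript
// One evaluation of the rule: counts observed over the rate window.
interface Sample {
  count5xx: number;
  total: number;
}

// (5xx errors / total requests) * 100, with an idle service treated as 0%.
function errorRatePercent({ count5xx, total }: Sample): number {
  return total === 0 ? 0 : (count5xx * 100) / total;
}

// The alert fires only if every evaluation in the hold period breaches the threshold.
function breachesForWholeWindow(window: Sample[], thresholdPercent = 1): boolean {
  return window.length > 0 && window.every((s) => errorRatePercent(s) > thresholdPercent);
}
```

For example, 3 failures out of 200 requests is 1.5%, which breaches the 1% threshold, but a single clean evaluation inside the hold period resets the alert.
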
### Grafana

**Image:** `grafana/grafana:10.4.1`
**Port:** 3002
**Auth:** Admin user/password from secrets (prod) or env vars (dev)

**Pre-Provisioned Datasources:**

- Prometheus (default, primary)
- Loki (with derived fields for correlationId linkage)

**Dashboards:**

1. `api-latency.json` — API p99/p95/p50, route breakdown, slow endpoints
2. `api-overview.json` — Request rate, error rate, uptime status
3. `database.json` — Query latency, connection pool utilization, slow queries
4. `logs.json` — Log volume, error logs, trace links to Prometheus
5. `search.json` — Typesense query latency, indexing rate, collection size
6. `web-vitals.json` — Frontend Core Web Vitals (if client-side instrumentation)
7. `business-metrics.json` — Listings created, payments processed, user signups

**Admin Console Access:**

- URL: `http://localhost:3002` (dev), or the port set in `${GRAFANA_PORT}` (prod)
- Default user: `admin` (change password on first login)
- Self-signup disabled (`GF_USERS_ALLOW_SIGN_UP: false`)

### Loki & Promtail (Log Aggregation)

**Loki:** `grafana/loki:3.0.0`, port 3100

**Configuration:**

```yaml
schema:
  - from: 2024-01-01
    store: tsdb
    schema: v13
limits:
  max_entries_limit_per_query: 5000
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 6
retention: 360h  # 15 days
```

**Promtail:** `grafana/promtail:3.0.0`

**Configuration:**

- Scrapes Docker logs from `goodgo-net` bridge network
- Parses **Pino JSON** structured logs
- Extracts labels: level, context, component, service
- Structured metadata: method, url, statusCode, correlationId, duration
- Derives timestamp from Pino output (RFC3339Nano)

**Expected Log Format (Pino):**

```json
{
  "level": 30,
  "time": "2026-04-11T10:30:00Z",
  "msg": "POST /api/listings",
  "correlationId": "abc-123-def",
  "context": "ListingController",
  "component": "api",
  "method": "POST",
  "url": "/api/listings",
  "statusCode": 201,
  "duration": 150
}
```

Pino's numeric `level` 30 corresponds to `info`.

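
To show how a line in this format splits into the labels and structured metadata listed above, here is a small parser sketch. It mirrors the documented field lists but is not Promtail's pipeline configuration; `parsePinoLine` and `ParsedLog` are hypothetical names. The level table uses Pino's default numeric levels.

```typescript
// Pino's default numeric levels.
const PINO_LEVELS: Record<number, string> = {
  10: "trace", 20: "debug", 30: "info", 40: "warn", 50: "error", 60: "fatal",
};

// Field lists taken from the Promtail configuration summary above ("level" is handled separately).
const LABEL_KEYS = ["context", "component", "service"] as const;
const METADATA_KEYS = ["method", "url", "statusCode", "correlationId", "duration"] as const;

interface ParsedLog {
  timestamp: string;
  labels: Record<string, string>;
  metadata: Record<string, string>;
}

function parsePinoLine(line: string): ParsedLog {
  const entry = JSON.parse(line) as Record<string, unknown>;
  const labels: Record<string, string> = {
    level: PINO_LEVELS[entry.level as number] ?? "unknown",
  };
  for (const key of LABEL_KEYS) {
    if (entry[key] !== undefined) labels[key] = String(entry[key]);
  }
  const metadata: Record<string, string> = {};
  for (const key of METADATA_KEYS) {
    if (entry[key] !== undefined) metadata[key] = String(entry[key]);
  }
  return { timestamp: String(entry.time), labels, metadata };
}
```

High-cardinality fields such as `correlationId` stay in metadata rather than labels, which keeps the number of Loki streams small.
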
---

## Payment Integration

### Supported Payment Providers

**Enum:** `PaymentProvider` (Prisma)

- `VNPAY` — VNPay (Vietnam payment gateway)
- `MOMO` — MoMo (Vietnam mobile wallet)
- `ZALOPAY` — ZaloPay (Vietnam digital wallet)
- `BANK_TRANSFER` — Manual bank transfer (offline)

### Payment Flow & Callback Handling

**Database Schema (Payment Model):**

```prisma
model Payment {
  id             String          @id @default(cuid())
  userId         String
  transactionId  String?
  provider       PaymentProvider
  type           PaymentType     // SUBSCRIPTION, LISTING_FEE, DEPOSIT, FEATURED_LISTING
  amountVND      BigInt
  status         PaymentStatus   // PENDING, PROCESSING, COMPLETED, FAILED, REFUNDED
  providerTxId   String?         // External transaction ID from VNPay/MoMo/ZaloPay
  callbackData   Json?           // Raw callback payload (for audit)
  idempotencyKey String?         // Prevent duplicate payments (userId, provider, idempotencyKey unique)
  createdAt      DateTime        @default(now())
  updatedAt      DateTime        @updatedAt
}

enum PaymentStatus {
  PENDING
  PROCESSING
  COMPLETED
  FAILED
  REFUNDED
}

enum PaymentType {
  SUBSCRIPTION
  LISTING_FEE
  DEPOSIT
  FEATURED_LISTING
}
```

**Command Handler: `HandleCallbackHandler`**
(`apps/api/src/modules/payments/application/commands/handle-callback/handle-callback.handler.ts`)

1. **Callback Signature Verification:**
   - Uses `PAYMENT_GATEWAY_FACTORY` to route to correct provider (VNPay/MoMo/ZaloPay)
   - Gateway.verifyCallback() validates HMAC signature
   - Throws `ValidationException` if signature invalid

2. **Idempotent Status Transition:**
   - Only updates payments in state: `PENDING` or `PROCESSING`
   - Atomically transitions to `COMPLETED` or `FAILED`
   - If already in terminal state (COMPLETED/FAILED/REFUNDED), returns existing status (idempotent)
   - Logs warning if payment not found

3. **Domain Event Publishing:**
   - Reconstructs domain entity from repository
   - Emits `PaymentCompletedEvent` or `PaymentFailedEvent`
   - Event bus publishes events to subscribers (e.g., subscription creation, listing activation)

4. **Response:**
   ```typescript
   {
     paymentId: string,
     status: PaymentStatus,
     isSuccess: boolean
   }
   ```

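
The signature check in step 1 follows the usual HMAC pattern, sketched below with Node's built-in `crypto` module. This is a generic illustration, not any provider's real scheme: the canonicalization (sorted `key=value` pairs), the SHA-256 choice, and the function names are assumptions, and each gateway defines its own field order, encoding, and hash algorithm.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical canonical form: keys sorted, joined as key=value pairs.
function canonicalize(fields: Record<string, string>): string {
  return Object.keys(fields)
    .sort()
    .map((key) => `${key}=${fields[key]}`)
    .join("&");
}

// Hex-encoded HMAC-SHA256 of the canonical string (algorithm is an assumption).
function sign(fields: Record<string, string>, secret: string): string {
  return createHmac("sha256", secret).update(canonicalize(fields)).digest("hex");
}

function verifySignature(
  fields: Record<string, string>,
  signature: string,
  secret: string,
): boolean {
  const expected = Buffer.from(sign(fields, secret), "hex");
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The constant-time comparison avoids leaking how many leading bytes of a forged signature were correct.
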
**Payment Gateway Interface (`payment-gateway.interface.ts`):**

```typescript
interface IPaymentGateway {
  readonly provider: PaymentProvider
  createPaymentUrl(params: CreatePaymentUrlParams): Promise<CreatePaymentUrlResult>
  verifyCallback(data: Record<string, string>): CallbackVerifyResult
  refund(params: RefundParams): Promise<RefundResult>
}

interface CreatePaymentUrlParams {
  orderId: string
  amountVND: bigint
  description: string
  returnUrl: string
  ipAddress: string
}

interface CallbackVerifyResult {
  isValid: boolean
  orderId: string
  providerTxId: string
  isSuccess: boolean
  rawData: Record<string, unknown>
}

interface RefundParams {
  providerTxId: string
  amountVND: bigint
  reason: string
}

interface RefundResult {
  success: boolean
  refundTxId: string | null
}
```

### Environment Variables

**VNPay:**

```env
VNPAY_TMN_CODE=<merchant terminal code>
VNPAY_HASH_SECRET=<HMAC secret key>
VNPAY_BASE_URL=https://sandbox.vnpayment.vn/paymentv2/vpcpay.html
VNPAY_API_URL=https://sandbox.vnpayment.vn/merchant_webapi/api/transaction
```

**MoMo:**

```env
MOMO_PARTNER_CODE=<partner code>
MOMO_ACCESS_KEY=<access key>
MOMO_SECRET_KEY=<secret key>
MOMO_ENDPOINT=https://test-payment.momo.vn/v2/gateway/api
```

**ZaloPay:**

```env
ZALOPAY_APP_ID=<app ID>
ZALOPAY_KEY1=<key 1 (for creating payments)>
ZALOPAY_KEY2=<key 2 (for callback verification)>
ZALOPAY_ENDPOINT=https://sb-openapi.zalopay.vn/v2
```

### Race Condition & Idempotency Protection

**Problem:** Multiple callbacks may arrive for same payment (network retries, duplicate notifications)

**Solution:**

1. **Unique Idempotency Key:** `Payment_idempotency_unique(userId, provider, idempotencyKey)`
   - Prevents duplicate payment records
   - Generated by client/API before creating payment

2. **Atomic Status Update:** `paymentRepo.updateIfStatus(orderId, ['PENDING', 'PROCESSING'], newStatus)`
   - Only updates if current status in allowed list
   - Returns updated entity or null if already terminal

3. **Terminal State Check:** If already COMPLETED/FAILED/REFUNDED, handler returns existing state
   - No re-triggering of domain events
   - No double billing or duplicate transactions

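
The compare-and-set behavior behind `updateIfStatus` can be modeled with an in-memory repository, as below. This is a sketch of the semantics only: `InMemoryPaymentRepo` and `applyCallback` are hypothetical, and in the real system atomicity would presumably come from a single conditional database update rather than a JavaScript `Map`.

```typescript
type PaymentStatus = "PENDING" | "PROCESSING" | "COMPLETED" | "FAILED" | "REFUNDED";

interface PaymentRecord {
  id: string;
  status: PaymentStatus;
}

// Hypothetical stand-in for the payment repository.
class InMemoryPaymentRepo {
  private readonly rows = new Map<string, PaymentRecord>();

  save(row: PaymentRecord): void {
    this.rows.set(row.id, { ...row });
  }

  // Updates only when the current status is in the allowed list; otherwise returns null.
  updateIfStatus(id: string, allowed: PaymentStatus[], next: PaymentStatus): PaymentRecord | null {
    const row = this.rows.get(id);
    if (!row || !allowed.includes(row.status)) return null;
    row.status = next;
    return { ...row };
  }
}

// Returns true if this callback performed the transition (and therefore published an event).
function applyCallback(
  repo: InMemoryPaymentRepo,
  orderId: string,
  isSuccess: boolean,
  publish: (eventName: string) => void,
): boolean {
  const next: PaymentStatus = isSuccess ? "COMPLETED" : "FAILED";
  const updated = repo.updateIfStatus(orderId, ["PENDING", "PROCESSING"], next);
  if (updated === null) return false; // already terminal or unknown: no event re-published
  publish(isSuccess ? "PaymentCompletedEvent" : "PaymentFailedEvent");
  return true;
}
```

Delivering the same callback twice therefore publishes exactly one event, which is what protects downstream subscribers from double activation.
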
---

## Health Checks

### API Health Endpoints

**Health Controller** (`apps/api/src/modules/health/health.controller.ts`)

1. **GET /health** — Liveness Probe (always 200 if process running)
   - Uses: `@HealthCheck()` on empty probe list
   - Response: `{ "status": "ok", "timestamp": "..." }`
   - **Use Case:** Kubernetes/Docker liveness checks (confirms the process is up, not that dependencies are reachable)

2. **GET /health/ready** — Readiness Probe (checks dependencies)
   - Checks: PostgreSQL + Redis connectivity
   - Response:
     ```json
     {
       "status": "ok",
       "checks": {
         "database": { "status": "up" },
         "redis": { "status": "up" }
       }
     }
     ```
   - **Use Case:** Load balancer, before accepting traffic
   - **Failure:** Returns 503 if any dependency down

3. **GET /health/db** — Database Readiness Only
   - Checks: PostgreSQL connectivity via `SELECT 1` query
   - **Use Case:** Manual database troubleshooting

4. **GET /health/redis** — Redis Readiness Only
   - Checks: Redis PING command
   - **Use Case:** Manual Redis troubleshooting

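
The readiness rule (200 only when every dependency is up, 503 otherwise) can be sketched independently of NestJS. The names `readiness`, `CheckFn`, and `ReadinessResult` are hypothetical, and the `"error"` status string for the failure case is an assumption, since this runbook only shows the `"ok"` response.

```typescript
// A check resolves when the dependency is reachable and rejects otherwise.
type CheckFn = () => Promise<void>;

interface ReadinessResult {
  httpStatus: 200 | 503;
  body: {
    status: "ok" | "error";
    checks: Record<string, { status: "up" | "down" }>;
  };
}

async function readiness(checks: Record<string, CheckFn>): Promise<ReadinessResult> {
  const results: Record<string, { status: "up" | "down" }> = {};
  // Run all checks concurrently; a failing check never rejects the overall promise.
  await Promise.all(
    Object.entries(checks).map(async ([name, check]) => {
      try {
        await check();
        results[name] = { status: "up" };
      } catch {
        results[name] = { status: "down" };
      }
    }),
  );
  const allUp = Object.values(results).every((r) => r.status === "up");
  return {
    httpStatus: allUp ? 200 : 503,
    body: { status: allUp ? "ok" : "error", checks: results },
  };
}
```

A caller would pass one function per dependency, for example a `SELECT 1` query for the database and a `PING` for Redis.
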
### Health Check Implementations

**PrismaHealthIndicator** (`apps/api/src/modules/health/infrastructure/prisma.health.ts`):

```typescript
async isHealthy(key: string): Promise<HealthIndicatorResult> {
  try {
    await this.prisma.$queryRawUnsafe('SELECT 1');
    return this.getStatus(key, true);
  } catch {
    throw new HealthCheckError('Database check failed', this.getStatus(key, false));
  }
}
```

**RedisHealthIndicator** (`apps/api/src/modules/health/infrastructure/redis.health.ts`):

```typescript
async isHealthy(key: string): Promise<HealthIndicatorResult> {
  try {
    const client = this.redis.getClient();
    const pong = await client.ping();
    const isHealthy = pong === 'PONG';
    const result = this.getStatus(key, isHealthy);
    if (isHealthy) return result;
    throw new HealthCheckError('Redis ping failed', result);
  } catch (error) {
    if (error instanceof HealthCheckError) throw error;
    throw new HealthCheckError('Redis check failed', this.getStatus(key, false));
  }
}
```

### Docker Container Health Checks

**API Container:**

```yaml
healthcheck:
  test: ['CMD', 'node', '-e', "fetch('http://localhost:3001/health').then(r => { if (!r.ok) throw 1 }).catch(() => process.exit(1))"]
  interval: 30s
  timeout: 5s
  retries: 5
  start_period: 30s
```

**Web Container:**

```yaml
healthcheck:
  test: ['CMD', 'node', '-e', "fetch('http://localhost:3000').then(r => { if (!r.ok) throw 1 }).catch(() => process.exit(1))"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 15s
```

**PostgreSQL:**

```yaml
healthcheck:
  test: ['CMD-SHELL', 'pg_isready -U ${DB_USER} -d ${DB_NAME}']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s
```

**Redis:**

```yaml
healthcheck:
  test: ['CMD', 'redis-cli', '-a', '${REDIS_PASSWORD}', 'ping']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 10s
```

**Typesense:**

```yaml
healthcheck:
  test: ['CMD', 'curl', '-sf', 'http://localhost:8108/health']
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 15s
```

---

## Environment Variables

### Complete `.env.example` Reference

**PostgreSQL:**

```env
DB_HOST=localhost
DB_PORT=5432
DB_NAME=goodgo
DB_USER=goodgo
DB_PASSWORD=CHANGE_ME
DATABASE_URL=postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}?schema=public
DATABASE_URL_DIRECT=postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}?schema=public
```

**PgBouncer (Prod Only):**

```env
PGBOUNCER_POOL_SIZE=20
PGBOUNCER_MAX_CLIENT_CONN=200
PGBOUNCER_ADMIN_PASSWORD=CHANGE_ME
PGBOUNCER_STATS_PASSWORD=CHANGE_ME
```

**Redis:**

```env
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=
REDIS_URL=redis://${REDIS_HOST}:${REDIS_PORT}
```

**Typesense:**

```env
TYPESENSE_HOST=localhost
TYPESENSE_PORT=8108
TYPESENSE_PROTOCOL=http
TYPESENSE_API_KEY=CHANGE_ME
```

**MinIO:**

```env
MINIO_ENDPOINT=localhost
MINIO_PORT=9000
MINIO_CONSOLE_PORT=9001
MINIO_ACCESS_KEY=CHANGE_ME
MINIO_SECRET_KEY=CHANGE_ME
MINIO_BUCKET=goodgo-media
MINIO_USE_SSL=false
```

**NestJS API:**

```env
API_PORT=3000
PORT=3001
NODE_ENV=development
CORS_ORIGINS=http://localhost:3000,http://localhost:3001
```

**JWT / Authentication (REQUIRED):**

```env
JWT_SECRET=<generate with: openssl rand -base64 48>
JWT_EXPIRES_IN=15m
JWT_REFRESH_SECRET=<generate with: openssl rand -base64 48>
JWT_REFRESH_EXPIRES_IN=7d
```

**OAuth Providers:**

```env
GOOGLE_CLIENT_ID=
GOOGLE_CLIENT_SECRET=
GOOGLE_CALLBACK_URL=http://localhost:3001/auth/google/callback

ZALO_APP_ID=
ZALO_APP_SECRET=
ZALO_CALLBACK_URL=http://localhost:3001/auth/zalo/callback

FRONTEND_URL=http://localhost:3000
```

**Next.js Web:**

```env
NEXT_PUBLIC_API_URL=http://localhost:3001
WEB_PORT=3000
```

**AI Service (Python/FastAPI):**

```env
AI_SERVICE_PORT=8000
AI_SERVICE_URL=http://localhost:8000
CLAUDE_API_KEY=
AI_DEBUG=false
AI_LOG_LEVEL=info
```

**Map Integration:**

```env
NEXT_PUBLIC_MAPBOX_TOKEN=
```

**Payment Gateways:**

```env
VNPAY_TMN_CODE=
VNPAY_HASH_SECRET=
VNPAY_BASE_URL=https://sandbox.vnpayment.vn/paymentv2/vpcpay.html
VNPAY_API_URL=https://sandbox.vnpayment.vn/merchant_webapi/api/transaction

MOMO_PARTNER_CODE=
MOMO_ACCESS_KEY=
MOMO_SECRET_KEY=
MOMO_ENDPOINT=https://test-payment.momo.vn/v2/gateway/api

ZALOPAY_APP_ID=
ZALOPAY_KEY1=
ZALOPAY_KEY2=
ZALOPAY_ENDPOINT=https://sb-openapi.zalopay.vn/v2
```

**Email / SMTP:**

```env
SMTP_HOST=localhost
SMTP_PORT=1025
SMTP_USER=
SMTP_PASS=
SMTP_FROM=noreply@goodgo.vn
```

**Firebase Cloud Messaging (Optional):**

```env
FIREBASE_SERVICE_ACCOUNT=
```

**Sentry Error Tracking:**

```env
SENTRY_DSN=
NEXT_PUBLIC_SENTRY_DSN=
SENTRY_AUTH_TOKEN=
SENTRY_ORG=
SENTRY_PROJECT=
```

**KYC Field Encryption (REQUIRED Prod):**

```env
KYC_ENCRYPTION_KEY=<generate with: openssl rand -hex 32> # 64 hex chars (32 bytes)
KYC_ENCRYPTION_KEY_VERSION=1
```

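
Because a malformed `KYC_ENCRYPTION_KEY` is easy to miss until the first encryption call, a startup check along these lines can catch it early. This is a sketch under assumptions: `parseKycKey` is a hypothetical helper, and the runbook does not say how the API actually validates or consumes the key.

```typescript
// Validates the documented format (64 hex characters) and decodes it to 32 bytes.
function parseKycKey(hex: string | undefined): Uint8Array {
  if (!hex || !/^[0-9a-fA-F]{64}$/.test(hex)) {
    throw new Error("KYC_ENCRYPTION_KEY must be exactly 64 hex characters (32 bytes)");
  }
  const bytes = new Uint8Array(32);
  for (let i = 0; i < 32; i++) {
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}
```

Calling this once at boot with the environment value turns a silent misconfiguration into an immediate, descriptive failure.
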
**Logging:**

```env
LOG_LEVEL=info
```

---

## Backup & Recovery

### Automated Daily Backups

**Service:** `pg-backup` container (runs inside docker compose)

**Backup Script:** `scripts/backup/pg-backup.sh`

```bash
# Daily cron job: 02:00 UTC
PGHOST=postgres \
PGPORT=5432 \
PGUSER=goodgo \
PGDATABASE=goodgo \
PGPASSWORD=<secret> \
BACKUP_DIR=/backups \
RETENTION_DAYS=7 \
/scripts/pg-backup.sh
```

**Behavior:**

1. Creates dump with `pg_dump --format=custom --compress=6`
2. Saves as `goodgo_YYYYMMDD_HHMMSS.sql.gz`
3. Prunes backups older than 7 days (configurable)
4. Logs to `/var/log/pg-backup.log`

**Restore from Backup:**

```bash
# Restore directly with pg_restore (--clean drops existing objects before recreating them)
docker compose -f docker-compose.prod.yml exec pg-backup bash -c \
  'pg_restore -h postgres -p 5432 -U goodgo -d goodgo \
  --clean --if-exists /backups/goodgo_20260410_020000.sql.gz'

# Or using restore script
docker compose -f docker-compose.prod.yml run --rm pg-verify-backup bash -c \
  'source /scripts/pg-restore.sh /backups/goodgo_20260410_020000.sql.gz'
```

### Backup Verification

**Service:** `pg-verify-backup` container (on-demand, profile: tools)

**Verification Script:** `scripts/backup/pg-verify-backup.sh`

```bash
# Usage:
docker compose -f docker-compose.prod.yml run --rm pg-verify-backup

# Or with options (passed into the container with -e):
docker compose -f docker-compose.prod.yml run --rm \
  -e SKIP_CLEANUP=1 -e REPORT_FILE=/backups/verify-report.json \
  pg-verify-backup
```

**Verification Steps:**
|
|
1. Creates isolated test database: `goodgo_verify_<timestamp>`
|
|
2. Enables PostGIS extension
|
|
3. Restores backup into test DB
|
|
4. Verifies all 22 tables exist
|
|
5. Compares row counts between source and restored
|
|
6. Checksums critical tables using MD5 hashes
|
|
7. Checks indexes, enum types
|
|
8. Generates JSON report with results
|
|
9. **Cleanup:** Drops test DB (unless SKIP_CLEANUP=1)
|
|
|
|
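
Step 5 above (row-count comparison) is the easiest one to reproduce by hand. The sketch below assumes two tab-separated listings of `table<TAB>rows`, one taken from the source database and one from the restored test database; how those listings are produced, and the function name `compare_row_counts`, are assumptions and not taken from `pg-verify-backup.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare two "table<TAB>rows" listings.
# Prints one line per table whose counts differ (or that is missing on one side)
# and returns non-zero if any difference is found.
compare_row_counts() {
  local source_file="$1" restored_file="$2"
  local mismatches
  mismatches="$(
    LC_ALL=C join -t $'\t' -a 1 -a 2 -e MISSING -o 0,1.2,2.2 \
      <(LC_ALL=C sort -t $'\t' -k1,1 "$source_file") \
      <(LC_ALL=C sort -t $'\t' -k1,1 "$restored_file") |
      awk -F '\t' '$2 != $3 { print $1 ": source=" $2 " restored=" $3 }'
  )"
  if [[ -n "$mismatches" ]]; then
    printf '%s\n' "$mismatches"
    return 1
  fi
}
```

A table present on only one side shows up with the value `MISSING`, which makes dropped or unexpected tables visible in the same pass.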

**JSON Report Structure:**
```json
{
  "timestamp": "2026-04-11T10:30:00Z",
  "backupFile": "/backups/goodgo_20260410_020000.sql.gz",
  "backupSize": "150M",
  "testDatabase": "goodgo_verify_20260411_103000",
  "restoreDurationSeconds": 45,
  "passed": 28,
  "failed": 0,
  "warnings": 2,
  "result": "pass",
  "checks": [
    { "check": "Database creation", "status": "pass", "detail": "Test database created" },
    { "check": "Restore", "status": "pass", "detail": "pg_restore completed cleanly in 45s" },
    { "check": "Table existence", "status": "pass", "detail": "All 22 expected tables present" },
    { "check": "Row counts", "status": "pass", "detail": "All tables match source database" },
    { "check": "Checksum: User identities", "status": "pass", "detail": "Hashes match (abc123def456...)" },
    ...
  ]
}
```

**GitHub Action Backup Verification:**
- File: `.github/workflows/backup-verify.yml`
- Schedule: Weekly Sundays 05:00 UTC
- Also: Manual trigger with skip_cleanup option
- Artifacts: Uploads JSON report for 30 days

---

## Deployment Pipeline

### GitHub Actions CI/CD

**Workflows:**
1. `.github/workflows/ci.yml`: Lint, typecheck, test, build (on push/PR to master)
2. `.github/workflows/deploy.yml`: Build Docker images, deploy to staging/prod
3. `.github/workflows/e2e.yml`: E2E tests (spins up full docker-compose.ci.yml)
4. `.github/workflows/backup-verify.yml`: Weekly backup verification
5. `.github/workflows/security.yml`: Dependency scanning, SAST
6. `.github/workflows/codeql.yml`: GitHub CodeQL analysis
7. `.github/workflows/load-test.yml`: K6 load testing

### CI Pipeline (`ci.yml`)

**On:** `push master`, `pull_request master`
**Node:** 22
**Concurrency:** Cancel previous runs on same ref

**Jobs:**
1. **Lint → Typecheck → Test → Build**
   - Installs pnpm, Node 22
   - Runs linter (eslint)
   - Type checks (tsc)
   - Unit tests (jest)
   - Builds all apps (turbo)
   - PostgreSQL 16 service available (goodgo_test DB)

2. **E2E Tests** (depends on ci job)
   - Full docker-compose.ci.yml services (postgres, redis, typesense, minio)
   - Runs end-to-end test suite
   - Timeout: 20 minutes
   - Env vars: DATABASE_URL, JWT secrets, payment test codes

### Deploy Pipeline (`deploy.yml`)

**On:**
- `push master` (auto-deploys to staging)
- Manual `workflow_dispatch` (choose staging or production)

**Jobs:**
1. **Build API Image**
   - Builds: `goodgo-api:${IMAGE_TAG}`
   - Dockerfile: `apps/api/Dockerfile`
   - Registry: `ghcr.io/goodgo/goodgo-api`
   - Tags: git SHA, branch name, `latest` (on master)

2. **Build Web Image**
   - Builds: `goodgo-web:${IMAGE_TAG}`
   - Dockerfile: `apps/web/Dockerfile`
   - Registry: `ghcr.io/goodgo/goodgo-web`

3. **Build AI Services Image**
   - Builds: `goodgo-ai-services:${IMAGE_TAG}`
   - Context: `libs/ai-services/`
   - Registry: `ghcr.io/goodgo/goodgo-ai-services`

4. **Deploy to Staging**
   - Condition: `github.event_name == 'push' || inputs.environment == 'staging'`
   - SSH into staging host
   - Pulls new images from GHCR
   - **Rolling update**, one service at a time (`--wait` blocks until the service reports healthy; a single-replica service is briefly unavailable while its container is recreated):
     ```bash
     docker compose -f docker-compose.prod.yml up -d --no-deps --wait api
     docker compose -f docker-compose.prod.yml up -d --no-deps --wait web
     docker compose -f docker-compose.prod.yml up -d --no-deps --wait ai-services
     ```
   - Runs migrations: `docker compose exec api npx prisma migrate deploy`
   - Prunes old images

5. **Deploy to Production**
   - Only on manual `workflow_dispatch` with `environment: production`
   - Same steps as staging
   - Requires approval through the `production` GitHub environment protection rules
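
The three rolling-update commands in the staging job can be wrapped in a loop that stops at the first service that fails to become healthy. This is a sketch only: `rolling_update` is a hypothetical function name, and the actual `deploy.yml` may sequence the steps differently.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the rolling-update commands shown above.
# Usage: rolling_update <compose-file> <service> [<service> ...]
rolling_update() {
  local compose_file="$1"
  shift
  local svc
  for svc in "$@"; do
    echo "Updating ${svc}..."
    # --wait makes compose block until the service's health check passes.
    if ! docker compose -f "$compose_file" up -d --no-deps --wait "$svc"; then
      echo "Update of ${svc} failed; stopping rollout" >&2
      return 1
    fi
  done
}
```

For example, `rolling_update docker-compose.prod.yml api web ai-services` reproduces the order used above and leaves later services untouched if an earlier one fails.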

### Dockerfile Multi-Stage Builds

**API (apps/api/Dockerfile):**
- **Base:** node:22-slim + pnpm 10.27.0
- **Deps:** Install locked dependencies (layer caching)
- **Build:** Compile TypeScript, generate Prisma client
- **Prune:** `pnpm deploy --prod` (removes dev deps, hoists prod deps)
- **Production:** Minimal image, dumb-init for signals, non-root user

**Web (apps/web/Dockerfile):**
- **Base:** node:22-slim + pnpm
- **Deps:** Install dependencies
- **Build:** `next build` → standalone output + static files
- **Production:** Copy .next/standalone, public, static assets

**AI Services (libs/ai-services/Dockerfile):**
- **Base:** python:3.12-slim
- **Install:** System deps (gcc, g++), dumb-init, FastAPI/XGBoost/underthesea
- **Models:** Pre-download underthesea ML models at build time
- **User:** Run as non-root appuser
- **CMD:** `uvicorn app.main:app --host 0.0.0.0 --port 8000`

---

## Troubleshooting Guide

### Check Service Status

```bash
# All services
docker compose -f docker-compose.prod.yml ps

# Single service
docker compose -f docker-compose.prod.yml ps api

# Get logs
docker compose -f docker-compose.prod.yml logs -f api --tail=100

# Health check status
docker compose -f docker-compose.prod.yml exec api curl http://localhost:3001/health
```

### Common Issues

#### 1. API Service Not Healthy (stuck in `starting` or `unhealthy` state)

**Symptoms:**
- `docker compose ps` shows `(health: starting)` for more than 2 minutes, or `(unhealthy)`
- `docker compose logs api` shows connection errors

**Diagnosis:**
```bash
# Check API liveness
docker compose exec api curl http://localhost:3001/health

# Check readiness (includes DB + Redis checks)
docker compose exec api curl http://localhost:3001/health/ready

# Check specific dependencies
docker compose exec api curl http://localhost:3001/health/db
docker compose exec api curl http://localhost:3001/health/redis
```
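
When a service is still starting, polling a health endpoint by hand gets tedious. A small retry loop, sketched below, can wrap any of the checks above; `wait_until` is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary illustration values.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a check command until it succeeds.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_until 20 3 docker compose exec api curl -fsS http://localhost:3001/health/ready` retries the readiness check for up to a minute; `curl -f` makes HTTP error responses count as failures.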

**Solutions:**

- **PostgreSQL not ready:**
  ```bash
  docker compose ps postgres  # Should show (healthy)
  docker compose exec postgres pg_isready -U goodgo -d goodgo
  docker compose logs postgres --tail=50
  ```

- **Redis not ready:**
  ```bash
  docker compose exec redis redis-cli ping  # Should return PONG
  docker compose logs redis --tail=50
  ```

- **PgBouncer not ready (prod):**
  ```bash
  docker compose exec pgbouncer pg_isready -h 127.0.0.1 -p 6432 -U goodgo
  docker compose logs pgbouncer --tail=50
  ```

- **Database schema not initialized:**
  ```bash
  # Run migrations manually
  docker compose exec api npx prisma migrate deploy
  # Or check if schema exists
  docker compose exec postgres psql -U goodgo -d goodgo -c "\dt"
  ```

#### 2. Database Connection Pool Exhaustion

**Symptoms:**
- Errors: `Error: unable to get a connection from the pool after X s`
- Slow queries pile up
- API latency spikes

**Diagnosis:**
```bash
# Check pool stats (prod, PgBouncer). Admin commands must be sent to the
# virtual "pgbouncer" database; SHOW POOLS lists waiting clients (cl_waiting).
docker compose exec pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_stats -d pgbouncer -c "SHOW POOLS"
docker compose exec pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_stats -d pgbouncer -c "SHOW STATS"

# Or query PostgreSQL directly
docker compose exec postgres psql -U goodgo -d goodgo -c "SELECT count(*) FROM pg_stat_activity"
```

**Solutions:**
- Increase `PGBOUNCER_POOL_SIZE` (default: 20)
- Increase `PGBOUNCER_MAX_CLIENT_CONN` (default: 200)
- Reduce long-running queries (add a query timeout, e.g. PostgreSQL `statement_timeout`)
- Check for idle server connections and tune `server_idle_timeout`

#### 3. Redis Connection Failures (Non-Fatal)

**Symptoms:**
- Logs: `Redis check failed` or `ECONNREFUSED`
- The API still responds, but more slowly, because reads fall through to the database
- Health check `/health/ready` returns 503

**Expected Behavior:** Cache misses → app serves from database

**Diagnosis:**
```bash
# Check Redis availability
docker compose exec redis redis-cli ping

# Check RedisService logs
docker compose logs api | grep -i redis
```

**Solutions:**
- Restart Redis: `docker compose restart redis`
- Check memory: `docker compose exec redis redis-cli info memory`
- If usage is at `maxmemory`, raise the limit in docker-compose.yml and restart
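
To judge how close Redis is to its limit, the `used_memory` and `maxmemory` fields of `info memory` can be turned into a percentage. The helper below is a sketch with a hypothetical name (`redis_memory_percent`); it only parses text on stdin and makes no assumption about how Redis is configured in this stack.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `redis-cli info memory` output on stdin and prints
# used_memory as a whole-number percentage of maxmemory, or "unlimited" when
# maxmemory is 0 (no limit configured).
redis_memory_percent() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "unlimited"; exit }
      printf "%d\n", (used / max) * 100
    }'
}
```

Usage: `docker compose exec redis redis-cli info memory | redis_memory_percent`. The `tr -d '\r'` is there because Redis terminates `INFO` lines with CRLF.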

#### 4. Typesense Search Not Indexing

**Symptoms:**
- Search returns 0 results
- Listings are created but not searchable
- `/health` for Typesense shows green, but the collection is empty

**Diagnosis:**
```bash
# Check collection exists
curl http://localhost:8108/collections -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"

# Check collection stats
curl "http://localhost:8108/collections/listings" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" | jq .

# Check recent docs
curl "http://localhost:8108/collections/listings/documents/search?q=*" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" | jq '.found'
```

**Solutions:**
- Verify `TYPESENSE_API_KEY` matches container env var
- Reindex all listings:
  ```bash
  docker compose exec api npx ts-node scripts/reindex-listings.ts
  ```
- If the collection is corrupted, drop and recreate it:
  ```bash
  curl -X DELETE "http://localhost:8108/collections/listings" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"
  # Then restart API service to recreate schema
  docker compose restart api
  ```

#### 5. Payment Callback Failures

**Symptoms:**
- Payment status stuck in `PENDING`
- Logs: `Invalid callback signature for provider=VNPAY`

**Diagnosis:**
```bash
# Check payment record in DB
docker compose exec postgres psql -U goodgo -d goodgo -c \
  "SELECT id, status, provider, \"providerTxId\", \"callbackData\" FROM \"Payment\" \
   WHERE \"providerTxId\" = 'your-txid' ORDER BY \"createdAt\" DESC LIMIT 1;"

# Check logs for callback handler
docker compose logs api | grep -i "HandleCallbackHandler\|callback"
```

**Solutions:**
- Verify payment gateway credentials (VNPAY_HASH_SECRET, MOMO_SECRET_KEY, etc.)
- Manually verify the callback signature (contact payment provider support if it still does not match)
- Replay the callback manually (if an idempotency key is available):
  ```bash
  curl -X POST http://localhost:3001/api/payments/callback \
    -H "Content-Type: application/json" \
    -d '{"provider":"VNPAY",...callback data...}'
  ```
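
Manual signature verification amounts to recomputing the HMAC over the signed string with the shared secret and comparing it with the value the provider sent. The sketch below uses HMAC-SHA512 purely as an assumption; the algorithm, the exact string that gets signed, and the parameter ordering are provider-specific and must come from the provider's documentation. Both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for checking a callback signature by hand.
# ASSUMPTION: the provider signs with HMAC-SHA512 and sends a hex digest.

# Print the hex HMAC-SHA512 of $2 using secret $1.
hmac_sha512_hex() {
  local secret="$1" data="$2"
  printf '%s' "$data" | openssl dgst -sha512 -hmac "$secret" -r | cut -d ' ' -f 1
}

# Succeed if the recomputed digest equals the received one (case-insensitive,
# since hex digests may arrive in upper or lower case).
signature_matches() {
  local secret="$1" data="$2" received="$3"
  local expected
  expected="$(hmac_sha512_hex "$secret" "$data")"
  [[ "${expected,,}" == "${received,,}" ]]
}
```

Note that passing the secret as a command-line argument exposes it in the process list, so run this only on a trusted host and prefer a throwaway shell session.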

#### 6. Backup Verification Fails

**Symptoms:**
- GitHub Action `.github/workflows/backup-verify.yml` fails
- Restore test database shows mismatched row counts

**Diagnosis:**
```bash
# Run verification manually (start postgres in the background and wait until healthy)
docker compose -f docker-compose.ci.yml up -d --wait postgres
docker compose -f docker-compose.ci.yml exec postgres \
  /scripts/pg-verify-backup.sh /backups/goodgo_latest.sql.gz

# Check JSON report
cat /tmp/backups/verify-report.json | jq .
```

**Solutions:**
- Check whether the backup file is corrupt: `file goodgo_*.sql.gz`, or list its contents with `pg_restore --list <file>`
- Verify restore process: `pg_restore --verbose`
- Check PostGIS extension availability: `psql -c "CREATE EXTENSION IF NOT EXISTS postgis;"`

#### 7. Memory/CPU Pressure

**Symptoms:**
- OOM kills (container exits with code 137)
- CPU throttling, latency spikes
- Prometheus `container_memory_usage_bytes` near limit

**Diagnosis:**
```bash
# Check Docker stats (live memory and CPU usage)
docker stats --no-stream

# Check limits in compose file
docker compose config | grep -A3 "resources:"

# Check the configured memory limit in bytes (0 = unlimited)
docker inspect goodgo-api | jq '.[0].HostConfig.Memory'
```

**Solutions:**
- Increase resource limits in `docker-compose.prod.yml`
- Reduce log verbosity (set LOG_LEVEL=warn)
- Implement pagination for large queries
- Scale horizontally (add more API replicas)
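
Exit code 137 in the symptoms above is 128 + 9, meaning the process was terminated by signal 9 (SIGKILL), which is what the kernel OOM killer sends. The arithmetic can be sketched as follows; `describe_exit_code` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decode a container exit code.
# Codes above 128 mean "terminated by signal (code - 128)".
describe_exit_code() {
  local code="$1"
  if (( code > 128 )); then
    local sig=$(( code - 128 ))
    printf 'killed by signal %d (%s)\n' "$sig" "$(kill -l "$sig")"
  else
    printf 'exited with status %d\n' "$code"
  fi
}
```

A SIGKILL is not always an OOM kill (`docker kill` sends the same signal), so confirm with `docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' goodgo-api`.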

### Prometheus Queries for Debugging

```promql
# API request latency p99
histogram_quantile(0.99, sum(rate(goodgo_api_request_duration_seconds_bucket[5m])) by (le))

# API error rate (5xx)
(sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100

# Container memory usage
container_memory_usage_bytes{name="goodgo-api"}

# Container CPU usage
rate(container_cpu_usage_seconds_total{name="goodgo-api"}[5m])

# PostgreSQL active queries
pg_stat_activity_count{state="active"}

# Redis memory usage
redis_memory_used_bytes / 1024 / 1024  # in MB

# Typesense collection size
typesense_documents_count{collection="listings"}
```

### Emergency Procedures

**Full System Reset (dev only):**
```bash
docker compose down -v  # Remove all volumes!
docker system prune -a
docker compose up -d --wait
docker compose exec api npx prisma db push
docker compose exec api npx ts-node scripts/seed.ts
```

**Database Emergency Restore:**
```bash
# Find latest backup
ls -t /var/lib/docker/volumes/pg_backups/_data/goodgo_*.sql.gz | head -1

# Create the target database (pg_restore -d needs it to exist), then restore into it
createdb -h localhost -p 5432 -U goodgo goodgo_restored
pg_restore -h localhost -p 5432 -U goodgo -d goodgo_restored \
  --clean --if-exists --verbose /path/to/backup.sql.gz

# Verify restore
psql -h localhost -p 5432 -U goodgo -d goodgo_restored -c "SELECT count(*) FROM \"User\";"
```
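
`ls -t` picks the newest file by modification time, which can mislead if backups were copied or touched. Because the file names embed a sortable timestamp, the latest backup can also be chosen by name. A minimal sketch, with `latest_backup` as a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the backup in directory $1 with the newest
# timestamp in its name (goodgo_YYYYMMDD_HHMMSS.sql.gz sorts chronologically).
latest_backup() {
  local dir="$1"
  find "$dir" -maxdepth 1 -type f -name 'goodgo_*.sql.gz' | LC_ALL=C sort | tail -n 1
}
```

Usage: `latest_backup /var/lib/docker/volumes/pg_backups/_data`. The function prints nothing when the directory holds no matching files.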

**Force Kill Stuck Service:**
```bash
# If health check broken
docker compose kill api
docker compose rm -f api
docker compose up -d api
```

---

## Appendix: Key File Locations

```
/Users/velikho/Desktop/WORKING/goodgo-platform-ai/
├── docker-compose.yml              # Dev environment
├── docker-compose.prod.yml         # Prod environment (with pgbouncer, resource limits)
├── docker-compose.ci.yml           # CI/E2E test environment
├── .env.example                    # Template for all required env vars
│
├── apps/
│   ├── api/
│   │   ├── Dockerfile              # Multi-stage NestJS build
│   │   ├── docker-entrypoint.sh    # Startup script (migrations, app start)
│   │   ├── src/
│   │   │   ├── modules/health/health.controller.ts
│   │   │   ├── modules/payments/application/commands/handle-callback/
│   │   │   ├── modules/shared/infrastructure/redis.service.ts
│   │   │   └── modules/search/infrastructure/services/typesense-search.repository.ts
│   │   └── package.json
│   │
│   └── web/
│       ├── Dockerfile              # Multi-stage Next.js build
│       └── package.json
│
├── libs/
│   └── ai-services/
│       ├── Dockerfile              # Python FastAPI build
│       ├── app/main.py             # FastAPI app entry
│       └── pyproject.toml
│
├── prisma/
│   └── schema.prisma               # Complete Prisma schema (22 models)
│
├── infra/
│   └── pgbouncer/
│       ├── pgbouncer.ini           # Connection pooling config
│       ├── userlist.txt.template   # User list (templated)
│       └── entrypoint.sh           # Env substitution script
│
├── scripts/
│   └── backup/
│       ├── pg-backup.sh            # Daily backup automation
│       ├── pg-verify-backup.sh     # Restore verification
│       └── pg-restore.sh           # Manual restore script
│
├── monitoring/
│   ├── prometheus/
│   │   ├── prometheus.yml          # Scrape config (goodgo-api metrics)
│   │   └── alert-rules.yml         # Latency + error rate alerts
│   ├── loki/
│   │   └── loki-config.yml         # Log aggregation config (15-day retention)
│   ├── promtail/
│   │   └── promtail-config.yml     # Log shipping (Pino JSON parsing)
│   └── grafana/
│       ├── provisioning/
│       │   ├── datasources/datasource.yml
│       │   └── dashboards/dashboard.yml
│       └── dashboards/
│           ├── api-latency.json
│           ├── api-overview.json
│           ├── database.json
│           ├── logs.json
│           ├── search.json
│           ├── web-vitals.json
│           └── business-metrics.json
│
└── .github/workflows/
    ├── ci.yml                      # Lint, test, build
    ├── deploy.yml                  # Build images, deploy to staging/prod
    ├── e2e.yml                     # End-to-end tests
    ├── backup-verify.yml           # Weekly backup verification
    ├── security.yml                # Dependency/SAST scanning
    ├── codeql.yml                  # GitHub CodeQL
    └── load-test.yml               # K6 load testing
```

---

## Document Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2026-04-11 | DevOps Team | Initial comprehensive runbook |

---

**Last Updated:** April 11, 2026
**Maintained By:** GoodGo Platform SRE Team
**Contact:** devops@goodgo.vn