goodgo-platform/docs/deployment.md
Ho Ngoc Hai 64c6074735 feat(devops): add staging auto-deploy pipeline on develop branch
- Trigger deploy workflow on push to `develop` branch (in addition to `master`)
- Add `staging-latest` Docker image tag for develop branch builds
- Add `rollback-staging` job: auto-reverts to previous images on smoke test failure
- Add Slack success notification for staging deploys (previously only failure was notified)
- Record pre-deploy image digests for rollback capability
- Update deployment docs with CI/CD pipeline details, rollback procedures, and required secrets

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-11 01:18:37 +07:00


Deployment Guide

Overview

GoodGo Platform AI consists of four deployable services:

| Service | Technology | Default Port |
|---|---|---|
| API | NestJS (Node.js) | 3001 |
| Web | Next.js | 3000 |
| AI Services | FastAPI (Python) | 8000 |
| Infrastructure | Docker Compose | Various |

Prerequisites

  • Docker Engine 24+ & Docker Compose v2
  • Node.js 22 LTS
  • pnpm 10.27+
  • Python 3.12 (for AI services, if running outside Docker)

Environment Configuration

Copy .env.example to .env and configure all required values:

cp .env.example .env

Required Variables

| Variable | Description | Example |
|---|---|---|
| `DATABASE_URL` | PostgreSQL connection string | `postgresql://user:pass@host:5432/goodgo` |
| `JWT_SECRET` | JWT signing key (min 32 chars) | Generate with `openssl rand -hex 32` |
| `JWT_REFRESH_SECRET` | Refresh token signing key | Generate with `openssl rand -hex 32` |
| `REDIS_URL` | Redis connection string | `redis://localhost:6379` |
| `TYPESENSE_API_KEY` | Typesense admin API key | Generate a secure random key |
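
Before starting the services, it can help to fail fast when a required variable is missing. The following is a minimal sketch, assuming the variables are already present in the current shell. The function name `check_required_env` is hypothetical and not part of the repository.

```shell
# Hypothetical helper: verify that the required variables listed above are set.
# Prints each problem to stderr and returns non-zero if any check fails.
check_required_env() {
  problems=0
  for name in DATABASE_URL JWT_SECRET JWT_REFRESH_SECRET REDIS_URL TYPESENSE_API_KEY; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      problems=1
    fi
  done
  # The table above asks for a JWT_SECRET of at least 32 characters.
  if [ -n "${JWT_SECRET:-}" ] && [ "${#JWT_SECRET}" -lt 32 ]; then
    echo "JWT_SECRET is shorter than 32 characters" >&2
    problems=1
  fi
  return "$problems"
}
```

One way to use it, assuming `.env` contains only plain `KEY=value` lines that a shell can source: `set -a; . ./.env; set +a; check_required_env`.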

Optional Variables

| Variable | Description | Default |
|---|---|---|
| `API_PORT` | API server port | 3001 |
| `WEB_PORT` | Web app port | 3000 |
| `NODE_ENV` | Environment mode | development |
| `CORS_ORIGINS` | Allowed CORS origins | |
| `CLAUDE_API_KEY` | Claude API key (for content moderation) | |
| `NEXT_PUBLIC_MAPBOX_TOKEN` | Mapbox token (for maps) | |
| `VNPAY_*`, `MOMO_*`, `ZALOPAY_*` | Payment gateway credentials | |

Infrastructure Setup (Docker Compose)

Start all infrastructure services:

docker compose up -d

This starts:

  • PostgreSQL 16 + PostGIS 3.4 (port 5432)
  • Redis 7 (port 6379)
  • Typesense 27 (port 8108)
  • MinIO (API: 9000, Console: 9001)
  • AI Services (port 8000)
  • pg-backup — automated daily PostgreSQL backups at 02:00 UTC with verification at 04:00 UTC
  • Loki (port 3100) — log aggregation
  • Promtail — log collection agent (ships container logs to Loki)
  • Prometheus (port 9090)
  • Grafana (port 3002) — dashboards for metrics and logs

Verify all services are healthy:

docker compose ps

All services include health checks. Wait until all show healthy status.
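
Waiting for the healthy status can be scripted. The sketch below polls the health status of every container in the Compose project. It assumes it is run from the directory containing the Compose file and that every service defines a health check, as stated above. The function name `wait_for_healthy` and its defaults are hypothetical.

```shell
# Hypothetical helper: poll until every container in the Compose project
# reports "healthy", or give up after a number of attempts.
# Usage: wait_for_healthy [attempts] [delay_seconds]
wait_for_healthy() {
  attempts=${1:-30}
  delay=${2:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    total=0
    unhealthy=0
    for id in $(docker compose ps -q); do
      total=$((total + 1))
      # A container without a health check yields "none" and counts as not healthy.
      status=$(docker inspect --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' "$id")
      [ "$status" = "healthy" ] || unhealthy=$((unhealthy + 1))
    done
    if [ "$total" -gt 0 ] && [ "$unhealthy" -eq 0 ]; then
      echo "all services healthy"
      return 0
    fi
    echo "attempt $i/$attempts: $unhealthy of $total container(s) not healthy yet" >&2
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

As an alternative, `docker compose up -d --wait` (the flag used in the rollback commands later in this guide) blocks until the services are running or healthy.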

Database Setup

# Generate Prisma client
pnpm db:generate

# Apply migrations
pnpm db:migrate:deploy

# Seed initial data (optional)
pnpm db:seed

Building for Production

API (NestJS)

cd apps/api
pnpm build

Output: apps/api/dist/

Run in production (from the repository root):

NODE_ENV=production PORT=3001 node apps/api/dist/main.js

Web (Next.js)

cd apps/web
pnpm build

Output: apps/web/.next/

Run in production:

NODE_ENV=production pnpm --filter web start

AI Services (FastAPI)

The AI service runs in Docker via docker compose. To build separately:

cd libs/ai-services
docker build -t goodgo-ai-services .
docker run -p 8000:8000 --env-file ../../.env goodgo-ai-services

Production Checklist

Security

  • Set strong, unique JWT_SECRET and JWT_REFRESH_SECRET (min 32 characters)
  • Set NODE_ENV=production
  • Configure CORS_ORIGINS to only allow your domain(s)
  • Change default database passwords
  • Change default MinIO credentials (MINIO_USER, MINIO_PASSWORD)
  • Change default Grafana credentials (GRAFANA_ADMIN_USER, GRAFANA_ADMIN_PASSWORD)
  • Use a strong, unique TYPESENSE_API_KEY
  • Enable SSL/TLS termination (reverse proxy)
  • Set MINIO_USE_SSL=true if MinIO is exposed publicly

Database

  • Run pnpm db:migrate:deploy (not db:migrate:dev)
  • Enable PostgreSQL connection pooling (PgBouncer recommended)
  • Configure automated backups
  • Set appropriate max_connections in PostgreSQL config

Monitoring

  • Verify Prometheus is scraping the API metrics endpoint (GET /api/v1/metrics, per the Health Checks table)
  • Import Grafana dashboards from monitoring/grafana/dashboards/
  • Set up alerting rules for error rates and latency

Performance

  • Configure Redis maxmemory and eviction policy
  • Set appropriate Typesense --memory-limit
  • Enable gzip/brotli compression in reverse proxy
  • Configure CDN for static assets (Next.js /_next/static/)

Health Checks

| Service | Endpoint | Expected Response |
|---|---|---|
| API | `GET /health` | `{"status": "ok"}` |
| API (Swagger) | `GET /api/v1/docs` | Swagger UI page |
| API (Metrics) | `GET /api/v1/metrics` | Prometheus metrics |
| AI Services | `GET /health` | `{"status": "ok"}` |
| Typesense | `GET /health` | `{"ok": true}` |
| Loki | `GET /ready` | 200 OK |
| Redis | `redis-cli ping` | PONG |
| PostgreSQL | `pg_isready -h host -p 5432` | Exit code 0 |
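
The HTTP checks in this table can be run in one pass. The sketch below assumes a local setup using the default ports listed earlier in this guide. The function name `check_http_health` and the `API_URL`, `AI_URL`, `TYPESENSE_URL`, and `LOKI_URL` override variables are hypothetical.

```shell
# Hypothetical helper: probe the HTTP health endpoints from the table above.
# The base URLs are assumptions for a local setup; adjust hosts and ports.
check_http_health() {
  api_url=${API_URL:-http://localhost:3001}
  ai_url=${AI_URL:-http://localhost:8000}
  typesense_url=${TYPESENSE_URL:-http://localhost:8108}
  loki_url=${LOKI_URL:-http://localhost:3100}
  failed=0
  for endpoint in "$api_url/health" "$ai_url/health" "$typesense_url/health" "$loki_url/ready"; do
    # -f: treat HTTP errors as failures, -s: silent, --max-time: never hang.
    if curl -fs --max-time 5 -o /dev/null "$endpoint"; then
      echo "ok   $endpoint"
    else
      echo "FAIL $endpoint"
      failed=1
    fi
  done
  return "$failed"
}
```

The function only checks that each endpoint answers with a successful HTTP status; it does not compare the response body with the expected values in the table.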

Scaling Considerations

Horizontal Scaling

  • API: Stateless — scale with multiple instances behind a load balancer
  • Web: Stateless — scale with multiple instances or deploy to Vercel/Cloudflare
  • AI Services: CPU-bound — scale based on valuation request volume
  • Redis: Use Redis Cluster for high availability
  • PostgreSQL: Read replicas for query-heavy workloads
                    ┌─────────────┐
                    │ Load Balancer│
                    │ (nginx/ALB)  │
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
        ┌─────▼──┐  ┌─────▼──┐  ┌─────▼──┐
        │ API #1 │  │ API #2 │  │ API #N │
        └────────┘  └────────┘  └────────┘
              │            │            │
              └────────────┼────────────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
        ┌─────▼──┐  ┌─────▼──┐  ┌─────▼─────┐
        │  PG    │  │ Redis  │  │ Typesense  │
        │Primary │  │Cluster │  │  Cluster   │
        │+ Replica│  │        │  │            │
        └────────┘  └────────┘  └────────────┘

CI/CD Pipeline

Branch Strategy

| Branch | Deploy Target | Trigger | Notes |
|---|---|---|---|
| `develop` | Staging | Auto (push) | Every merge to `develop` auto-deploys to staging |
| `master` | Staging | Auto (push) | Master push also deploys to staging for verification |
| Manual | Staging/Production | `workflow_dispatch` | Manual trigger via GitHub Actions UI |

Staging Auto-Deploy Flow

Push to develop → Build images → Deploy to staging → Smoke tests → ✅ / Rollback
  1. Build: Docker images for API, Web, and AI Services are built and pushed to GHCR with staging-latest tag
  2. Deploy: Images are pulled and services are updated via rolling restart (zero-downtime)
  3. Verify: Health check polls $STAGING_URL/health for up to 100 seconds
  4. Smoke test: scripts/smoke-test.sh runs against the staging URL, checking health probes, core API endpoints, search, and auth
  5. Notify: Slack notification on success or failure
  6. Rollback: If smoke tests fail, automatic rollback restores previous container images
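
The Verify step (step 3) can be pictured as a bounded polling loop. The sketch below is not the workflow's actual implementation: the function name `poll_health` is hypothetical, and the split into 20 attempts with a 5-second pause is an assumption chosen to add up to roughly 100 seconds of waiting.

```shell
# Hypothetical sketch of the "Verify" step: poll a health URL until it responds
# or the attempt budget is used up. 20 attempts x 5 seconds = 100 seconds of
# waiting, not counting the time each request itself takes.
# Usage: poll_health <url> [attempts] [delay_seconds]
poll_health() {
  url=$1
  attempts=${2:-20}
  delay=${3:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fs --max-time 5 -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}
```

In a pipeline step this would be called as `poll_health "$STAGING_URL/health"`, with a non-zero exit status marking the deploy as failed.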

Notifications

Deploy status notifications are sent to Slack via the SLACK_WEBHOOK_URL secret:

| Event | Channel | Content |
|---|---|---|
| Staging smoke tests pass | Slack | Commit SHA, branch, link to run |
| Staging smoke tests fail | Slack | 🚨 Commit SHA, branch, link to run |
| Staging rollback triggered | Slack | ⚠️ Commit SHA, reason, link to run |
| Production deploy success | Slack | Commit SHA, branch |
| Production rollback triggered | Slack | ⚠️ Commit SHA, reason, link to run |

Required Secrets

| Secret | Environment | Description |
|---|---|---|
| `STAGING_HOST` | staging | Staging server hostname/IP |
| `STAGING_USER` | staging | SSH user for staging deploys |
| `STAGING_SSH_KEY` | staging | SSH private key for staging |
| `STAGING_URL` | staging | Staging base URL (e.g., https://staging.goodgo.vn) |
| `PRODUCTION_HOST` | production | Production server hostname/IP |
| `PRODUCTION_USER` | production | SSH user for production deploys |
| `PRODUCTION_SSH_KEY` | production | SSH private key for production |
| `PRODUCTION_URL` | production | Production base URL |
| `SLACK_WEBHOOK_URL` | both | Slack incoming webhook URL |

Rollback

Automatic Rollback (Staging)

The staging pipeline includes automatic rollback when smoke tests fail:

  1. Pre-deploy: Current container image digests are recorded before deployment
  2. Smoke test failure: If scripts/smoke-test.sh exits non-zero, the rollback-staging job triggers
  3. Rollback execution: Containers are stopped and restarted with previous images
  4. Verification: Health check confirms the rollback succeeded
  5. Notification: Slack notification reports the rollback with links to the failed run
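
Steps 1 and 3 could look roughly like the sketch below. This is an illustration, not the workflow's actual code: the function names `record_images` and `restore_images` are hypothetical, it records local image IDs rather than registry digests, it assumes one container per service, and the `<registry>/<service>:<tag>` image naming scheme is an assumption.

```shell
# Hypothetical sketch: print the image ID each service is currently running,
# one "service image-id" pair per line, so a later rollback knows what to restore.
# Usage: record_images [compose_file] > previous-images.txt
record_images() {
  compose_file=${1:-docker-compose.prod.yml}
  for service in api web ai-services; do
    # Assumption: one container per service; take the first ID if there are more.
    cid=$(docker compose -f "$compose_file" ps -q "$service" | head -n 1)
    [ -n "$cid" ] || continue
    # .Image is the ID (sha256:...) of the image the container was created from.
    image=$(docker inspect --format '{{.Image}}' "$cid")
    echo "$service $image"
  done
}

# Hypothetical sketch: point <registry>/<service>:<tag> back at each recorded
# image ID so that a following "docker compose up -d" starts the previous images.
# Usage: restore_images <file_from_record_images> <registry> <tag>
restore_images() {
  while read -r service image; do
    docker tag "$image" "$2/$service:$3"
  done < "$1"
}
```

Restoring this way only works while the previous images are still present on the host, which is why the recording has to happen before the new images are pulled.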

Automatic Rollback (Production)

Same mechanism as staging — smoke test failure triggers rollback-production.

Manual Rollback

To manually roll back a staging or production deployment:

Option 1: Re-deploy a known-good commit

# Trigger a deploy of a specific commit via GitHub Actions
gh workflow run deploy.yml \
  --ref <known-good-commit-or-branch> \
  -f environment=staging

Option 2: SSH rollback (emergency)

# SSH into the staging/production server
ssh deploy@<host>

cd ~/goodgo

# Stop the current services
docker compose -f docker-compose.prod.yml down api web ai-services

# Restart with the previous image layers still cached locally
docker compose -f docker-compose.prod.yml up -d --wait api web ai-services

# Verify health
curl -sf http://localhost:3001/health

Option 3: Pin to a specific image tag

ssh deploy@<host>
cd ~/goodgo

# Set IMAGE_TAG to a known-good SHA
export IMAGE_TAG=<known-good-commit-sha>
export REGISTRY_URL=ghcr.io/<owner>

# Pull and restart with the pinned tag
docker compose -f docker-compose.prod.yml pull api web ai-services
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api web ai-services

Database Rollback

Prisma does not support automatic down migrations. If a migration must be reverted:

  1. Identify the migration in prisma/migrations/
  2. Write a manual SQL rollback script
  3. Apply via psql or a migration tool
  4. Update the _prisma_migrations table so that Prisma no longer treats the migration as applied
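
For step 4, one possible approach is to delete the reverted migration's bookkeeping row. The sketch below only prints the SQL statement. The function name `prisma_forget_migration_sql` is hypothetical, and deleting the row (rather than marking it some other way) is an assumption that should be reviewed before it is run against a real database.

```shell
# Hypothetical helper: print the SQL that removes one migration's bookkeeping row.
# Assumption: deleting the row is the desired way to update _prisma_migrations.
# Usage: prisma_forget_migration_sql <migration-directory-name>
prisma_forget_migration_sql() {
  # Double any single quotes so the name is a valid SQL string literal.
  name=$(printf '%s' "$1" | sed "s/'/''/g")
  printf "DELETE FROM _prisma_migrations WHERE migration_name = '%s';\n" "$name"
}
```

A possible workflow is to append the statement to the manual rollback script, for example `prisma_forget_migration_sql <migration-name> >> rollback.sql`, and apply everything in one transaction with `psql "$DATABASE_URL" -v ON_ERROR_STOP=1 --single-transaction -f rollback.sql`. The file name `rollback.sql` is a placeholder.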

Always test migrations against a staging database before production deployment.

Post-Rollback Checklist

  • Verify health endpoints respond: GET /health, GET /ready
  • Run smoke tests manually: ./scripts/smoke-test.sh <url>
  • Check application logs: docker compose -f docker-compose.prod.yml logs --tail=100 api web
  • Confirm Grafana dashboards show normal metrics
  • Notify the team via Slack about the rollback and root cause