# Deployment Guide

## Overview

GoodGo Platform AI consists of four deployable services:

| Service | Technology | Default Port |
|---------|-----------|-------------|
| **API** | NestJS (Node.js) | 3001 |
| **Web** | Next.js | 3000 |
| **AI Services** | FastAPI (Python) | 8000 |
| **Infrastructure** | Docker Compose | Various |

## Prerequisites

- Docker Engine 24+ and Docker Compose v2
- Node.js 22 LTS
- pnpm 10.27+
- Python 3.12 (for AI services, if running outside Docker)

## Environment Configuration

Copy `.env.example` to `.env` and configure all required values:

```bash
cp .env.example .env
```

### Required Variables

| Variable | Description | Example |
|----------|-------------|---------|
| `DATABASE_URL` | PostgreSQL connection string | `postgresql://user:pass@host:5432/goodgo` |
| `JWT_SECRET` | JWT signing key (min 32 chars) | Generate with `openssl rand -hex 32` |
| `JWT_REFRESH_SECRET` | Refresh token signing key | Generate with `openssl rand -hex 32` |
| `REDIS_URL` | Redis connection string | `redis://localhost:6379` |
| `TYPESENSE_API_KEY` | Typesense admin API key | Generate a secure random key |

### Optional Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `API_PORT` | API server port | `3001` |
| `WEB_PORT` | Web app port | `3000` |
| `NODE_ENV` | Environment mode | `development` |
| `CORS_ORIGINS` | Allowed CORS origins | (none) |
| `CLAUDE_API_KEY` | Claude API key (for content moderation) | (none) |
| `NEXT_PUBLIC_MAPBOX_TOKEN` | Mapbox token (for maps) | (none) |
| `VNPAY_*`, `MOMO_*`, `ZALOPAY_*` | Payment gateway credentials | (none) |

## Infrastructure Setup (Docker Compose)

Start all infrastructure services:

```bash
docker compose up -d
```

This starts:

- **PostgreSQL 16 + PostGIS 3.4** (port 5432)
- **Redis 7** (port 6379)
- **Typesense 27** (port 8108)
- **MinIO** (API: 9000, Console: 9001)
- **AI Services** (port 8000)
- **pg-backup**: automated daily PostgreSQL backups at 02:00 UTC, with verification at 04:00 UTC
- **Loki**
(port 3100): log aggregation
- **Promtail**: log collection agent (ships container logs to Loki)
- **Prometheus** (port 9090)
- **Grafana** (port 3002): dashboards for metrics and logs

Verify all services are healthy:

```bash
docker compose ps
```

All services include health checks. Wait until all show `healthy` status.

## Database Setup

```bash
# Generate Prisma client
pnpm db:generate

# Apply migrations
pnpm db:migrate:deploy

# Seed initial data (optional)
pnpm db:seed
```

## Building for Production

### API (NestJS)

```bash
cd apps/api
pnpm build
```

Output: `apps/api/dist/`

Run in production:

```bash
NODE_ENV=production PORT=3001 node apps/api/dist/main.js
```

### Web (Next.js)

```bash
cd apps/web
pnpm build
```

Output: `apps/web/.next/`

Run in production:

```bash
NODE_ENV=production pnpm --filter web start
```

### AI Services (FastAPI)

The AI service runs in Docker via `docker compose`. To build separately:

```bash
cd libs/ai-services
docker build -t goodgo-ai-services .
docker run -p 8000:8000 --env-file ../../.env goodgo-ai-services
```

## Production Checklist

### Security

- [ ] Set strong, unique `JWT_SECRET` and `JWT_REFRESH_SECRET` (min 32 characters)
- [ ] Set `NODE_ENV=production`
- [ ] Configure `CORS_ORIGINS` to only allow your domain(s)
- [ ] Change default database passwords
- [ ] Change default MinIO credentials (`MINIO_USER`, `MINIO_PASSWORD`)
- [ ] Change default Grafana credentials (`GRAFANA_ADMIN_USER`, `GRAFANA_ADMIN_PASSWORD`)
- [ ] Use a strong, unique `TYPESENSE_API_KEY`
- [ ] Enable SSL/TLS termination (reverse proxy)
- [ ] Set `MINIO_USE_SSL=true` if MinIO is exposed publicly

### Database

- [ ] Run `pnpm db:migrate:deploy` (not `db:migrate:dev`)
- [ ] Enable PostgreSQL connection pooling (PgBouncer recommended)
- [ ] Configure automated backups
- [ ] Set appropriate `max_connections` in PostgreSQL config

### Monitoring

- [ ] Verify Prometheus is scraping the `/metrics` endpoint
- [ ] Import Grafana dashboards from
`monitoring/grafana/dashboards/`
- [ ] Set up alerting rules for error rates and latency

### Performance

- [ ] Configure Redis `maxmemory` and eviction policy
- [ ] Set appropriate Typesense `--memory-limit`
- [ ] Enable gzip/brotli compression in reverse proxy
- [ ] Configure CDN for static assets (Next.js `/_next/static/`)

## Health Checks

| Service | Endpoint | Expected Response |
|---------|----------|-------------------|
| API | `GET /health` | `{"status": "ok"}` |
| API (Swagger) | `GET /api/v1/docs` | Swagger UI page |
| API (Metrics) | `GET /api/v1/metrics` | Prometheus metrics |
| AI Services | `GET /health` | `{"status": "ok"}` |
| Typesense | `GET /health` | `{"ok": true}` |
| Loki | `GET /ready` | 200 OK |
| Redis | `redis-cli ping` | `PONG` |
| PostgreSQL | `pg_isready -h host -p 5432` | Exit code 0 |

## Scaling Considerations

### Horizontal Scaling

- **API**: Stateless; scale with multiple instances behind a load balancer
- **Web**: Stateless; scale with multiple instances or deploy to Vercel/Cloudflare
- **AI Services**: CPU-bound; scale based on valuation request volume
- **Redis**: Use Redis Cluster for high availability
- **PostgreSQL**: Read replicas for query-heavy workloads

### Recommended Architecture (Production)

```
            ┌─────────────┐
            │Load Balancer│
            │ (nginx/ALB) │
            └──────┬──────┘
                   │
      ┌────────────┼────────────┐
      │            │            │
┌─────▼──┐   ┌─────▼──┐   ┌─────▼──┐
│ API #1 │   │ API #2 │   │ API #N │
└────────┘   └────────┘   └────────┘
      │            │            │
      └────────────┼────────────┘
                   │
      ┌────────────┼────────────┐
      │            │            │
┌─────▼──┐   ┌─────▼──┐   ┌─────▼─────┐
│   PG   │   │ Redis  │   │ Typesense │
│Primary │   │Cluster │   │  Cluster  │
│+Replica│   │        │   │           │
└────────┘   └────────┘   └───────────┘
```

## CI/CD Pipeline

### Branch Strategy

| Branch | Deploy Target | Trigger | Notes |
|--------|--------------|---------|-------|
| `develop` | Staging | Auto (push) | Every merge to `develop` auto-deploys to staging |
| `master` | Staging | Auto (push) | Master push also deploys to staging for verification |
| Manual | Staging/Production | `workflow_dispatch` | Manual trigger via GitHub Actions UI |

### Staging Auto-Deploy Flow

```
Push to develop → Build images → Tag rollback → Deploy to staging → Smoke tests → Cleanup / Rollback
```

1. **Build**: Docker images for API, Web, and AI Services are built and pushed to GHCR with the `staging-latest` tag
2. **Tag rollback**: Current running images are tagged as `:rollback` before new images are pulled
3. **Deploy**: New images are pulled and services are updated via rolling restart (zero-downtime)
4. **Verify**: Health check polls `$STAGING_URL/health` for up to 100 seconds
5. **Smoke test**: `scripts/smoke-test.sh` runs against the staging URL, checking health probes, core API endpoints, search, and auth
6. **Cleanup**: On success, `:rollback` tags are removed and `docker image prune` cleans up old layers
7. **Notify**: Slack notification on success or failure
8. **Rollback**: If smoke tests fail, automatic rollback restores the `:rollback` tagged images

### Notifications

Deploy status notifications are sent to Slack via the `SLACK_WEBHOOK_URL` secret:

| Event | Channel | Content |
|-------|---------|---------|
| Staging smoke tests pass | Slack | ✅ Commit SHA, branch, link to run |
| Staging smoke tests fail | Slack | 🚨 Commit SHA, branch, link to run |
| Staging rollback triggered | Slack | ⚠️ Commit SHA, reason, link to run |
| Production deploy success | Slack | ✅ Commit SHA, branch |
| Production rollback triggered | Slack | ⚠️ Commit SHA, reason, link to run |

### Required Secrets

| Secret | Environment | Description |
|--------|-------------|-------------|
| `STAGING_HOST` | staging | Staging server hostname/IP |
| `STAGING_USER` | staging | SSH user for staging deploys |
| `STAGING_SSH_KEY` | staging | SSH private key for staging |
| `STAGING_URL` | staging | Staging base URL (e.g., `https://staging.goodgo.vn`) |
| `PRODUCTION_HOST` | production | Production server hostname/IP |
| `PRODUCTION_USER` | production | SSH user for production deploys |
| `PRODUCTION_SSH_KEY` | production | SSH private key for production |
| `PRODUCTION_URL` | production | Production base URL |
| `SLACK_WEBHOOK_URL` | both | Slack incoming webhook URL |

## Rollback

### Rollback Safety Mechanism

The deploy pipeline uses **explicit `:rollback` image tags** to guarantee safe rollbacks. Here's how it works:

1. **Before pulling new images**: The current running images are tagged as `goodgo-api:rollback`, `goodgo-web:rollback`, and `goodgo-ai-services:rollback`
2. **After pulling new images**: Services are updated with the new images via rolling restart
3. **After smoke tests pass**: The `:rollback` tags are removed and `docker image prune` cleans up old layers
4. **If smoke tests fail**: The `:rollback` tagged images are used to restore the previous version

This ensures that `docker image prune` never deletes the images needed for rollback, because:

- Image pruning only happens **after** smoke tests pass
- The `:rollback` tags keep the previous images pinned even if pruning were to run accidentally

### Automatic Rollback (Staging)

The staging pipeline includes automatic rollback when smoke tests fail:

1. **Pre-deploy**: Current container images are tagged with the `:rollback` suffix before new images are pulled
2. **Smoke test failure**: If `scripts/smoke-test.sh` exits non-zero, the `rollback-staging` job triggers
3. **Rollback execution**: Containers are stopped and restarted using the `:rollback` tagged images
4. **Verification**: Health check confirms the rollback succeeded
5. **Notification**: Slack notification reports the rollback with links to the failed run

### Automatic Rollback (Production)

Same mechanism as staging: a smoke test failure triggers `rollback-production` using the `:rollback` tagged images.

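
The tag, deploy, smoke-test, and cleanup-or-rollback sequence described in this section can be sketched as a few shell functions. This is an illustration under stated assumptions, not the actual pipeline code: the function names, the `DOCKER` and `SMOKE_TEST` overrides, the `CURRENT_TAG` variable, the unqualified local image names, and passing the base URL as the first argument to the smoke test script are all hypothetical.

```bash
#!/usr/bin/env bash
# Sketch of the tag -> deploy -> smoke test -> cleanup/rollback sequence.
# Assumptions (not taken from the real pipeline): the DOCKER, SMOKE_TEST and
# CURRENT_TAG variables, the function names, and the local image naming.

DOCKER="${DOCKER:-docker}"
SMOKE_TEST="${SMOKE_TEST:-./scripts/smoke-test.sh}"
CURRENT_TAG="${CURRENT_TAG:-staging-latest}"
SERVICES="goodgo-api goodgo-web goodgo-ai-services"
COMPOSE_FILE="docker-compose.prod.yml"

# Pin the currently deployed image of each service under the :rollback tag.
tag_rollback() {
  local svc
  for svc in $SERVICES; do
    $DOCKER tag "$svc:$CURRENT_TAG" "$svc:rollback" || return 1
  done
}

# Point the deploy tag back at the pinned :rollback image.
restore_rollback() {
  local svc
  for svc in $SERVICES; do
    $DOCKER tag "$svc:rollback" "$svc:$CURRENT_TAG" || return 1
  done
}

# Remove the :rollback tags, then prune dangling layers.
cleanup_rollback() {
  local svc
  for svc in $SERVICES; do
    $DOCKER rmi "$svc:rollback" || return 1
  done
  $DOCKER image prune -f
}

# deploy_with_rollback <base-url>
# Returns 0 after a clean deploy, 1 if the deploy was rolled back.
deploy_with_rollback() {
  local url="$1"
  tag_rollback || return 1
  $DOCKER compose -f "$COMPOSE_FILE" pull api web ai-services || return 1
  $DOCKER compose -f "$COMPOSE_FILE" up -d --wait api web ai-services
  if "$SMOKE_TEST" "$url"; then
    # Pruning happens only after the smoke tests pass.
    cleanup_rollback
  else
    restore_rollback
    $DOCKER compose -f "$COMPOSE_FILE" up -d --wait api web ai-services
    return 1
  fi
}
```

Setting `DOCKER=echo` turns every Docker call into a printed line, which gives a dry run of the sequence without touching any containers.
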
### Manual Rollback

To manually roll back a staging or production deployment:

#### Option 1: Re-deploy a known-good commit

```bash
# Trigger a deploy of a specific commit via GitHub Actions
gh workflow run deploy.yml \
  --ref \
  -f environment=staging
```

#### Option 2: SSH rollback using :rollback tags (fastest)

```bash
# SSH into the staging/production server
ssh deploy@
cd ~/goodgo

# Stop current services
docker compose -f docker-compose.prod.yml stop api web ai-services

# Verify :rollback images exist
docker image inspect goodgo-api:rollback > /dev/null 2>&1 && echo "API rollback available"
docker image inspect goodgo-web:rollback > /dev/null 2>&1 && echo "Web rollback available"
docker image inspect goodgo-ai-services:rollback > /dev/null 2>&1 && echo "AI rollback available"

# Restart services (compose picks up cached/rollback images)
docker compose -f docker-compose.prod.yml up -d --wait api web ai-services

# Verify health
curl -sf http://localhost:3001/health && echo "Rollback successful"
```

> **Note:** The `:rollback` tags are only available until the next successful deploy cleans them up. If you need to roll back to an older version, use Option 3 below.

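
Because the `:rollback` tags can disappear after a successful deploy, it may be worth confirming that all three images exist before stopping anything. The following is a minimal preflight sketch; the function name `check_rollback_images` and the `DOCKER` override are illustrative assumptions and are not part of the repository's scripts.

```bash
#!/usr/bin/env bash
# Preflight sketch: confirm every :rollback image exists before stopping services.
# The DOCKER override and the function name are hypothetical.

DOCKER="${DOCKER:-docker}"

# Prints each missing :rollback image and returns non-zero if any is absent.
check_rollback_images() {
  local missing=0 svc
  for svc in goodgo-api goodgo-web goodgo-ai-services; do
    if ! $DOCKER image inspect "$svc:rollback" > /dev/null 2>&1; then
      echo "missing: $svc:rollback"
      missing=1
    fi
  done
  return "$missing"
}
```

One way to use it is to gate the stop step, for example `check_rollback_images && docker compose -f docker-compose.prod.yml stop api web ai-services`, so running services are left alone when no rollback image is available.
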
#### Option 3: Pin to a specific image tag

```bash
ssh deploy@
cd ~/goodgo

# Set IMAGE_TAG to a known-good SHA
export IMAGE_TAG=
export REGISTRY_URL=ghcr.io/

# Pull and restart with the pinned tag
docker compose -f docker-compose.prod.yml pull api web ai-services
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api web ai-services
```

#### Option 4: Use deploy-production.sh (built-in rollback)

The manual deploy script (`scripts/deploy-production.sh`) has integrated rollback support:

- Automatically tags `:rollback` images before pulling
- Runs health checks and smoke tests
- Rolls back automatically using the `:rollback` tags if either fails
- Only prunes images after smoke tests pass

```bash
ssh ubuntu@185.225.232.65
cd ~/goodgo
./scripts/deploy-production.sh [image-tag]
```

### Database Rollback

Prisma does not support automatic down migrations. If a migration must be reverted:

1. Identify the migration in `prisma/migrations/`
2. Write a manual SQL rollback script
3. Apply it via `psql` or a migration tool
4. Update the `_prisma_migrations` table

Always test migrations against a staging database before production deployment.

### Post-Rollback Checklist

- [ ] Verify health endpoints respond: `GET /health`, `GET /ready`
- [ ] Run smoke tests manually: `./scripts/smoke-test.sh `
- [ ] Check application logs: `docker compose -f docker-compose.prod.yml logs --tail=100 api web`
- [ ] Confirm Grafana dashboards show normal metrics
- [ ] Notify the team via Slack about the rollback and root cause
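
The first checklist item can be scripted as a small polling helper, since services may need a few seconds to come back after a rollback. This is a sketch only: the function name `wait_for_endpoint`, the default of 20 attempts, and the 5-second delay are assumptions chosen for illustration, not values taken from the pipeline.

```bash
#!/usr/bin/env bash
# Sketch of a post-rollback health poll. The function name, attempt count,
# and delay are hypothetical; only the endpoint paths come from the checklist.

# wait_for_endpoint <url> [attempts] [delay-seconds]
wait_for_endpoint() {
  local url="$1" attempts="${2:-20}" delay="${3:-5}" i
  for ((i = 1; i <= attempts; i++)); do
    # -s silences progress output; -f makes HTTP errors (status >= 400) fail.
    if curl -sf "$url" > /dev/null; then
      echo "ok: $url (attempt $i)"
      return 0
    fi
    # Wait between attempts, but not after the last one.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  echo "failed: $url after $attempts attempts" >&2
  return 1
}
```

For example, `wait_for_endpoint http://localhost:3001/health` polls the API health endpoint on the server itself; the same call can be repeated for any other endpoint in the checklist.
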