Previously, `docker image prune` ran immediately after deploying new containers, potentially deleting the old images needed for rollback if smoke tests subsequently failed.

Now the deploy pipeline:
1. Tags current images as :rollback before pulling new versions
2. Only runs `docker image prune` after smoke tests pass
3. Uses explicit :rollback tags for rollback instead of relying on Docker layer cache (which is fragile)

Applied to:
- scripts/deploy-production.sh (manual deploy script)
- .github/workflows/deploy.yml (staging + production CI jobs)
- docs/deployment.md (updated rollback documentation)

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Deployment Guide
Overview
GoodGo Platform AI consists of four deployable services:
| Service | Technology | Default Port |
|---|---|---|
| API | NestJS (Node.js) | 3001 |
| Web | Next.js | 3000 |
| AI Services | FastAPI (Python) | 8000 |
| Infrastructure | Docker Compose | Various |
Prerequisites
- Docker Engine 24+ & Docker Compose v2
- Node.js 22 LTS
- pnpm 10.27+
- Python 3.12 (for AI services, if running outside Docker)
Environment Configuration
Copy .env.example to .env and configure all required values:
cp .env.example .env
Required Variables
| Variable | Description | Example |
|---|---|---|
| `DATABASE_URL` | PostgreSQL connection string | postgresql://user:pass@host:5432/goodgo |
| `JWT_SECRET` | JWT signing key (min 32 chars) | Generate with `openssl rand -hex 32` |
| `JWT_REFRESH_SECRET` | Refresh token signing key | Generate with `openssl rand -hex 32` |
| `REDIS_URL` | Redis connection string | redis://localhost:6379 |
| `TYPESENSE_API_KEY` | Typesense admin API key | Generate a secure random key |
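A pre-flight check can catch missing or weak values before any service starts. The sketch below is illustrative only: `check_required_env` is a hypothetical helper, not a script from the repository. It checks the five required variables and the 32-character minimum for both JWT secrets.

```shell
# Pre-flight check for the required variables listed above.
# check_required_env is a hypothetical helper, not part of the repository.
check_required_env() {
  problems=0

  for name in DATABASE_URL JWT_SECRET JWT_REFRESH_SECRET REDIS_URL TYPESENSE_API_KEY; do
    # Indirect lookup that works in POSIX sh: read the variable called $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      problems=1
    fi
  done

  # Both JWT secrets must be at least 32 characters long.
  for name in JWT_SECRET JWT_REFRESH_SECRET; do
    eval "value=\${$name:-}"
    if [ -n "$value" ] && [ "${#value}" -lt 32 ]; then
      echo "too short (<32 chars): $name" >&2
      problems=1
    fi
  done

  return "$problems"
}
```

One possible use, assuming `.env` contains only plain `KEY=value` lines that a shell can source: `set -a; . ./.env; set +a; check_required_env && echo "environment OK"`.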
Optional Variables
| Variable | Description | Default |
|---|---|---|
| `API_PORT` | API server port | 3001 |
| `WEB_PORT` | Web app port | 3000 |
| `NODE_ENV` | Environment mode | development |
| `CORS_ORIGINS` | Allowed CORS origins | — |
| `CLAUDE_API_KEY` | Claude API key (for content moderation) | — |
| `NEXT_PUBLIC_MAPBOX_TOKEN` | Mapbox token (for maps) | — |
| `VNPAY_*`, `MOMO_*`, `ZALOPAY_*` | Payment gateway credentials | — |
Infrastructure Setup (Docker Compose)
Start all infrastructure services:
docker compose up -d
This starts:
- PostgreSQL 16 + PostGIS 3.4 (port 5432)
- Redis 7 (port 6379)
- Typesense 27 (port 8108)
- MinIO (API: 9000, Console: 9001)
- AI Services (port 8000)
- pg-backup — automated daily PostgreSQL backups at 02:00 UTC with verification at 04:00 UTC
- Loki (port 3100) — log aggregation
- Promtail — log collection agent (ships container logs to Loki)
- Prometheus (port 9090)
- Grafana (port 3002) — dashboards for metrics and logs
Verify all services are healthy:
docker compose ps
All services include health checks. Wait until all show healthy status.
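Checking `docker compose ps` by hand works, but the wait can also be scripted. A minimal sketch, assuming every container defines a health check as stated above; `wait_for_healthy`, the 120-second timeout, and the 5-second interval are illustrative choices, not part of the repository:

```shell
# Wait until every container in the compose project reports "healthy".
# wait_for_healthy is a hypothetical helper; timeout and interval are assumptions.
wait_for_healthy() {
  timeout_s="${1:-120}"
  interval_s="${2:-5}"
  elapsed=0
  while :; do
    total=0
    unhealthy=0
    for cid in $(docker compose ps -q); do
      total=$((total + 1))
      # Containers without a health check report "none" and count as not healthy.
      status=$(docker inspect --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' "$cid")
      [ "$status" = "healthy" ] || unhealthy=$((unhealthy + 1))
    done
    if [ "$total" -gt 0 ] && [ "$unhealthy" -eq 0 ]; then
      echo "all $total containers healthy"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout_s" ]; then
      echo "timed out: $unhealthy of $total containers not healthy" >&2
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
}
```

Calling `wait_for_healthy 180 5` after `docker compose up -d` would wait up to three minutes. The `--wait` flag of `docker compose up`, used in the rollback commands later in this guide, may be enough on its own.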
Database Setup
# Generate Prisma client
pnpm db:generate
# Apply migrations
pnpm db:migrate:deploy
# Seed initial data (optional)
pnpm db:seed
Building for Production
API (NestJS)
cd apps/api
pnpm build
Output: apps/api/dist/
Run in production:
NODE_ENV=production PORT=3001 node apps/api/dist/main.js
Web (Next.js)
cd apps/web
pnpm build
Output: apps/web/.next/
Run in production:
NODE_ENV=production pnpm --filter web start
AI Services (FastAPI)
The AI service runs in Docker via docker compose. To build separately:
cd libs/ai-services
docker build -t goodgo-ai-services .
docker run -p 8000:8000 --env-file ../../.env goodgo-ai-services
Production Checklist
Security
- Set strong, unique `JWT_SECRET` and `JWT_REFRESH_SECRET` (min 32 characters)
- Set `NODE_ENV=production`
- Configure `CORS_ORIGINS` to only allow your domain(s)
- Change default database passwords
- Change default MinIO credentials (`MINIO_USER`, `MINIO_PASSWORD`)
- Change default Grafana credentials (`GRAFANA_ADMIN_USER`, `GRAFANA_ADMIN_PASSWORD`)
- Use a strong, unique `TYPESENSE_API_KEY`
- Enable SSL/TLS termination (reverse proxy)
- Set `MINIO_USE_SSL=true` if MinIO is exposed publicly
Database
- Run `pnpm db:migrate:deploy` (not `db:migrate:dev`)
- Enable PostgreSQL connection pooling (PgBouncer recommended)
- Configure automated backups
- Set appropriate `max_connections` in PostgreSQL config
Monitoring
- Verify Prometheus is scraping `/metrics` endpoint
- Import Grafana dashboards from `monitoring/grafana/dashboards/`
- Set up alerting rules for error rates and latency
Performance
- Configure Redis `maxmemory` and eviction policy
- Set appropriate Typesense `--memory-limit`
- Enable gzip/brotli compression in reverse proxy
- Configure CDN for static assets (Next.js `/_next/static/`)
Health Checks
| Service | Endpoint | Expected Response |
|---|---|---|
| API | `GET /health` | {"status": "ok"} |
| API (Swagger) | `GET /api/v1/docs` | Swagger UI page |
| API (Metrics) | `GET /api/v1/metrics` | Prometheus metrics |
| AI Services | `GET /health` | {"status": "ok"} |
| Typesense | `GET /health` | {"ok": true} |
| Loki | `GET /ready` | 200 OK |
| Redis | `redis-cli ping` | PONG |
| PostgreSQL | `pg_isready -h host -p 5432` | Exit code 0 |
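The HTTP rows of the table can be checked in one pass. The following is a sketch rather than a script from the repository: `check_http_health` is a hypothetical name, all services are assumed to be reachable on `localhost`, and the ports are taken from the service lists earlier in this guide.

```shell
# One-shot check of the HTTP health endpoints from the table above.
# Assumption: every service is reachable on localhost at its documented port.
check_http_health() {
  failed=0
  while read -r name url; do
    # -f makes curl exit non-zero on HTTP errors; the response body is discarded.
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      echo "ok    $name"
    else
      echo "FAIL  $name ($url)"
      failed=1
    fi
  done <<EOF
api http://localhost:3001/health
ai-services http://localhost:8000/health
typesense http://localhost:8108/health
loki http://localhost:3100/ready
EOF
  return "$failed"
}
```

The function returns non-zero if any endpoint fails, so it can gate a later step, for example `check_http_health || exit 1`. Redis and PostgreSQL are left to `redis-cli ping` and `pg_isready` as listed in the table.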
Scaling Considerations
Horizontal Scaling
- API: Stateless — scale with multiple instances behind a load balancer
- Web: Stateless — scale with multiple instances or deploy to Vercel/Cloudflare
- AI Services: CPU-bound — scale based on valuation request volume
- Redis: Use Redis Cluster for high availability
- PostgreSQL: Read replicas for query-heavy workloads
Recommended Architecture (Production)
               ┌───────────────┐
               │ Load Balancer │
               │  (nginx/ALB)  │
               └───────┬───────┘
                       │
         ┌─────────────┼─────────────┐
         │             │             │
    ┌────▼────┐   ┌────▼────┐   ┌────▼────┐
    │ API #1  │   │ API #2  │   │ API #N  │
    └────┬────┘   └────┬────┘   └────┬────┘
         │             │             │
         └─────────────┼─────────────┘
                       │
         ┌─────────────┼─────────────┐
         │             │             │
    ┌────▼────┐   ┌────▼────┐  ┌─────▼─────┐
    │   PG    │   │  Redis  │  │ Typesense │
    │ Primary │   │ Cluster │  │  Cluster  │
    │+ Replica│   │         │  │           │
    └─────────┘   └─────────┘  └───────────┘
CI/CD Pipeline
Branch Strategy
| Branch | Deploy Target | Trigger | Notes |
|---|---|---|---|
| `develop` | Staging | Auto (push) | Every merge to develop auto-deploys to staging |
| `master` | Staging | Auto (push) | Master push also deploys to staging for verification |
| Manual | Staging/Production | `workflow_dispatch` | Manual trigger via GitHub Actions UI |
Staging Auto-Deploy Flow
Push to develop → Build images → Tag rollback → Deploy to staging → Smoke tests → Cleanup / Rollback
- Build: Docker images for API, Web, and AI Services are built and pushed to GHCR with `staging-latest` tag
- Tag rollback: Current running images are tagged as `:rollback` before new images are pulled
- Deploy: New images are pulled and services are updated via rolling restart (zero-downtime)
- Verify: Health check polls `$STAGING_URL/health` for up to 100 seconds
- Smoke test: `scripts/smoke-test.sh` runs against the staging URL, checking health probes, core API endpoints, search, and auth
- Cleanup: On success, `:rollback` tags are removed and `docker image prune` cleans up old layers
- Notify: Slack notification on success or failure
- Rollback: If smoke tests fail, automatic rollback restores the `:rollback` tagged images
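The Verify step above can be pictured as a bounded polling loop. This is a sketch, not the workflow's actual implementation: `poll_health` is a hypothetical helper, and splitting the budget into 20 attempts with a 5-second pause is an assumption that only approximates the 100 seconds mentioned above.

```shell
# Poll a health URL until it answers successfully or the attempts run out.
# poll_health is a hypothetical helper; the attempts/delay split is an assumption.
poll_health() {
  url="$1"
  attempts="${2:-20}"
  delay_s="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f turns HTTP error statuses into a non-zero exit code.
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay_s"
  done
  echo "not healthy after $attempts attempts: $url" >&2
  return 1
}
```

For staging this might be called as `poll_health "$STAGING_URL/health"`. A non-zero return is the signal to skip the smoke tests and go straight to rollback.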
Notifications
Deploy status notifications are sent to Slack via SLACK_WEBHOOK_URL secret:
| Event | Channel | Content |
|---|---|---|
| Staging smoke tests pass | Slack | ✅ Commit SHA, branch, link to run |
| Staging smoke tests fail | Slack | 🚨 Commit SHA, branch, link to run |
| Staging rollback triggered | Slack | ⚠️ Commit SHA, reason, link to run |
| Production deploy success | Slack | ✅ Commit SHA, branch |
| Production rollback triggered | Slack | ⚠️ Commit SHA, reason, link to run |
Required Secrets
| Secret | Environment | Description |
|---|---|---|
| `STAGING_HOST` | staging | Staging server hostname/IP |
| `STAGING_USER` | staging | SSH user for staging deploys |
| `STAGING_SSH_KEY` | staging | SSH private key for staging |
| `STAGING_URL` | staging | Staging base URL (e.g., https://staging.goodgo.vn) |
| `PRODUCTION_HOST` | production | Production server hostname/IP |
| `PRODUCTION_USER` | production | SSH user for production deploys |
| `PRODUCTION_SSH_KEY` | production | SSH private key for production |
| `PRODUCTION_URL` | production | Production base URL |
| `SLACK_WEBHOOK_URL` | both | Slack incoming webhook URL |
Rollback
Rollback Safety Mechanism
The deploy pipeline uses explicit :rollback image tags to guarantee safe rollbacks. Here's how it works:
- Before pulling new images: The current running images are tagged as `goodgo-api:rollback`, `goodgo-web:rollback`, and `goodgo-ai-services:rollback`
- After pulling new images: Services are updated with the new images via rolling restart
- After smoke tests pass: The `:rollback` tags are removed and `docker image prune` cleans up old layers
- If smoke tests fail: The `:rollback` tagged images are used to restore the previous version
This ensures that docker image prune never deletes the images needed for rollback, because:
- Image pruning only happens after smoke tests pass
- The `:rollback` tags keep the previous images pinned even if pruning were to run accidentally
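The lifecycle of the `:rollback` tags can be sketched as four small functions. This is not the contents of `scripts/deploy-production.sh` or `deploy.yml`; all function names are hypothetical. It also assumes that `docker-compose.prod.yml` resolves each image as `$REGISTRY_URL/goodgo-<service>:$IMAGE_TAG`, which this guide does not confirm.

```shell
# Sketch of the :rollback tag lifecycle described above (hypothetical helpers).
# Assumption: compose resolves images as $REGISTRY_URL/goodgo-<service>:$IMAGE_TAG.
COMPOSE_FILE="${COMPOSE_FILE:-docker-compose.prod.yml}"
SERVICES="api web ai-services"

# Tag the image behind each running container as goodgo-<service>:rollback.
tag_rollback_images() {
  for svc in $SERVICES; do
    cid=$(docker compose -f "$COMPOSE_FILE" ps -q "$svc" | head -n 1)
    [ -n "$cid" ] || { echo "no running container for $svc" >&2; return 1; }
    image_id=$(docker inspect --format '{{.Image}}' "$cid") || return 1
    docker tag "$image_id" "goodgo-$svc:rollback" || return 1
  done
}

# Point the tag that compose uses back at the saved image, then restart.
restore_rollback_images() {
  for svc in $SERVICES; do
    docker tag "goodgo-$svc:rollback" "$REGISTRY_URL/goodgo-$svc:$IMAGE_TAG" || return 1
  done
  docker compose -f "$COMPOSE_FILE" up -d --no-deps --wait $SERVICES
}

# Remove the rollback tags and prune dangling layers; call only after smoke tests pass.
cleanup_rollback_images() {
  for svc in $SERVICES; do
    docker rmi "goodgo-$svc:rollback" >/dev/null 2>&1 || true
  done
  docker image prune -f
}

# Run one deploy. $1 is the smoke-test command line,
# e.g. "./scripts/smoke-test.sh https://staging.goodgo.vn".
deploy_with_rollback() {
  smoke_cmd="$1"
  tag_rollback_images || return 1
  docker compose -f "$COMPOSE_FILE" pull $SERVICES || return 1
  if docker compose -f "$COMPOSE_FILE" up -d --no-deps --wait $SERVICES && sh -c "$smoke_cmd"; then
    cleanup_rollback_images
  else
    echo "deploy failed, restoring :rollback images" >&2
    restore_rollback_images
    return 1
  fi
}
```

The ordering is the point of the sketch: `docker image prune` is reachable only through the success branch, so a failed smoke test always finds the `:rollback` images still present.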
Automatic Rollback (Staging)
The staging pipeline includes automatic rollback when smoke tests fail:
- Pre-deploy: Current container images are tagged with `:rollback` suffix before new images are pulled
- Smoke test failure: If `scripts/smoke-test.sh` exits non-zero, the `rollback-staging` job triggers
- Rollback execution: Containers are stopped and restarted using the `:rollback` tagged images
- Verification: Health check confirms the rollback succeeded
- Notification: Slack notification reports the rollback with links to the failed run
Automatic Rollback (Production)
Same mechanism as staging — smoke test failure triggers rollback-production using the :rollback tagged images.
Manual Rollback
To manually roll back a staging or production deployment:
Option 1: Re-deploy a known-good commit
# Trigger a deploy of a specific commit via GitHub Actions
gh workflow run deploy.yml \
--ref <known-good-commit-or-branch> \
-f environment=staging
Option 2: SSH rollback using :rollback tags (fastest)
# SSH into the staging/production server
ssh deploy@<host>
cd ~/goodgo
# Verify :rollback images exist before stopping anything
docker image inspect goodgo-api:rollback > /dev/null 2>&1 && echo "API rollback available"
docker image inspect goodgo-web:rollback > /dev/null 2>&1 && echo "Web rollback available"
docker image inspect goodgo-ai-services:rollback > /dev/null 2>&1 && echo "AI rollback available"
# Stop current services
docker compose -f docker-compose.prod.yml stop api web ai-services
# Restart services (compose picks up cached/rollback images)
docker compose -f docker-compose.prod.yml up -d --wait api web ai-services
# Verify health
curl -sf http://localhost:3001/health && echo "Rollback successful"
Note: The `:rollback` tags are only available until the next successful deploy cleans them up. If you need to roll back to an older version, use Option 3 below.
Option 3: Pin to a specific image tag
ssh deploy@<host>
cd ~/goodgo
# Set IMAGE_TAG to a known-good SHA
export IMAGE_TAG=<known-good-commit-sha>
export REGISTRY_URL=ghcr.io/<owner>
# Pull and restart with the pinned tag
docker compose -f docker-compose.prod.yml pull api web ai-services
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api web ai-services
Option 4: Use deploy-production.sh (built-in rollback)
The manual deploy script (scripts/deploy-production.sh) has integrated rollback support:
- Automatically tags `:rollback` images before pulling
- Runs health checks and smoke tests
- Rolls back automatically using `:rollback` tags if either fails
- Only prunes images after smoke tests pass
ssh ubuntu@185.225.232.65
cd ~/goodgo
./scripts/deploy-production.sh [image-tag]
Database Rollback
Prisma does not support automatic down migrations. If a migration must be reverted:
- Identify the migration in `prisma/migrations/`
- Write a manual SQL rollback script
- Apply via `psql` or a migration tool
- Update `_prisma_migrations` table
Always test migrations against a staging database before production deployment.
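As an illustration of the manual steps above, the sketch below applies a hand-written rollback script with `psql` and then removes the migration's bookkeeping row. `revert_migration` is a hypothetical helper. Deleting the row is only one way to update `_prisma_migrations` and is an assumption about what the update should be; the rollback SQL itself has to be written for the specific migration.

```shell
# Sketch of a manual migration revert (hypothetical helper, not from the repository).
revert_migration() {
  migration_name="$1"   # directory name under prisma/migrations/
  rollback_sql="$2"     # hand-written SQL file that undoes the migration

  [ -f "$rollback_sql" ] || { echo "rollback script not found: $rollback_sql" >&2; return 1; }

  # Apply the rollback SQL in one transaction and stop at the first error.
  psql "$DATABASE_URL" -v ON_ERROR_STOP=1 --single-transaction -f "$rollback_sql" || return 1

  # Assumption: removing the row makes Prisma treat the migration as not applied.
  # :'name' is psql variable interpolation with SQL literal quoting.
  psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -v name="$migration_name" <<'SQL'
DELETE FROM "_prisma_migrations" WHERE migration_name = :'name';
SQL
}
```

A call might look like `revert_migration <migration-directory-name> ./rollback.sql`. If `DATABASE_URL` carries Prisma-specific query parameters such as `?schema=public`, `psql` may reject the URI, so a plain connection string may be needed here.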
Post-Rollback Checklist
- Verify health endpoints respond: `GET /health`, `GET /ready`
- Run smoke tests manually: `./scripts/smoke-test.sh <url>`
- Check application logs: `docker compose -f docker-compose.prod.yml logs --tail=100 api web`
- Confirm Grafana dashboards show normal metrics
- Notify the team via Slack about the rollback and root cause
- Notify the team via Slack about the rollback and root cause