Ho Ngoc Hai 20b79acf08 fix(deploy): tag rollback images before pull, prune after smoke test

Previously, `docker image prune` ran immediately after deploying new
containers, potentially deleting the old images needed for rollback
if smoke tests subsequently failed. Now the deploy pipeline:

1. Tags current images as :rollback before pulling new versions
2. Only runs `docker image prune` after smoke tests pass
3. Uses explicit :rollback tags for rollback instead of relying on
   Docker layer cache (which is fragile)

Applied to:
- scripts/deploy-production.sh (manual deploy script)
- .github/workflows/deploy.yml (staging + production CI jobs)
- docs/deployment.md (updated rollback documentation)

Co-Authored-By: Paperclip <noreply@paperclip.ing>

2026-04-15 11:17:32 +07:00

Deployment Guide

Overview

GoodGo Platform AI consists of four deployable services:

Service	Technology	Default Port
API	NestJS (Node.js)	3001
Web	Next.js	3000
AI Services	FastAPI (Python)	8000
Infrastructure	Docker Compose	Various

Prerequisites

Docker Engine 24+ & Docker Compose v2
Node.js 22 LTS
pnpm 10.27+
Python 3.12 (for AI services, if running outside Docker)

Environment Configuration

Copy .env.example to .env and configure all required values:

cp .env.example .env

Required Variables

Variable	Description	Example
`DATABASE_URL`	PostgreSQL connection string	`postgresql://user:pass@host:5432/goodgo`
`JWT_SECRET`	JWT signing key (min 32 chars)	Generate with `openssl rand -hex 32`
`JWT_REFRESH_SECRET`	Refresh token signing key	Generate with `openssl rand -hex 32`
`REDIS_URL`	Redis connection string	`redis://localhost:6379`
`TYPESENSE_API_KEY`	Typesense admin API key	Generate a secure random key
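
A pre-flight check can catch a missing or weak value before any service starts. The following is a minimal sketch, not part of the repository: `check_required_env` is a hypothetical helper that walks the variables in the table above and applies the 32-character minimum stated for `JWT_SECRET`.

```shell
# Sketch only: check_required_env is a hypothetical helper, not a repo script.
check_required_env() {
  missing=0
  for name in DATABASE_URL JWT_SECRET JWT_REFRESH_SECRET REDIS_URL TYPESENSE_API_KEY; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  # The table asks for a JWT_SECRET of at least 32 characters.
  if [ -n "${JWT_SECRET:-}" ] && [ "${#JWT_SECRET}" -lt 32 ]; then
    echo "JWT_SECRET is shorter than 32 characters" >&2
    missing=1
  fi
  return "$missing"
}
```

One possible use is to load `.env` into the shell and then run `check_required_env || exit 1` before starting the API.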

Optional Variables

Variable	Description	Default
`API_PORT`	API server port	`3001`
`WEB_PORT`	Web app port	`3000`
`NODE_ENV`	Environment mode	`development`
`CORS_ORIGINS`	Allowed CORS origins	—
`CLAUDE_API_KEY`	Claude API key (for content moderation)	—
`NEXT_PUBLIC_MAPBOX_TOKEN`	Mapbox token (for maps)	—
`VNPAY_*`, `MOMO_*`, `ZALOPAY_*`	Payment gateway credentials	—

Infrastructure Setup (Docker Compose)

Start all infrastructure services:

docker compose up -d

This starts:

PostgreSQL 16 + PostGIS 3.4 (port 5432)
Redis 7 (port 6379)
Typesense 27 (port 8108)
MinIO (API: 9000, Console: 9001)
AI Services (port 8000)
pg-backup — automated daily PostgreSQL backups at 02:00 UTC with verification at 04:00 UTC
Loki (port 3100) — log aggregation
Promtail — log collection agent (ships container logs to Loki)
Prometheus (port 9090)
Grafana (port 3002) — dashboards for metrics and logs

Verify all services are healthy:

docker compose ps

All services include health checks. Wait until all show healthy status.

Database Setup

# Generate Prisma client
pnpm db:generate

# Apply migrations
pnpm db:migrate:deploy

# Seed initial data (optional)
pnpm db:seed

Building for Production

API (NestJS)

cd apps/api
pnpm build

Output: apps/api/dist/

Run in production:

NODE_ENV=production PORT=3001 node apps/api/dist/main.js

Web (Next.js)

cd apps/web
pnpm build

Output: apps/web/.next/

Run in production:

NODE_ENV=production pnpm --filter web start

AI Services (FastAPI)

The AI service runs in Docker via docker compose. To build separately:

cd libs/ai-services
docker build -t goodgo-ai-services .
docker run -p 8000:8000 --env-file ../../.env goodgo-ai-services

Production Checklist

Security

Set strong, unique JWT_SECRET and JWT_REFRESH_SECRET (min 32 characters)
Set NODE_ENV=production
Configure CORS_ORIGINS to only allow your domain(s)
Change default database passwords
Change default MinIO credentials (MINIO_USER, MINIO_PASSWORD)
Change default Grafana credentials (GRAFANA_ADMIN_USER, GRAFANA_ADMIN_PASSWORD)
Use a strong, unique TYPESENSE_API_KEY
Enable SSL/TLS termination (reverse proxy)
Set MINIO_USE_SSL=true if MinIO is exposed publicly

Database

Run pnpm db:migrate:deploy (not db:migrate:dev)
Enable PostgreSQL connection pooling (PgBouncer recommended)
Configure automated backups
Set appropriate max_connections in PostgreSQL config

Monitoring

Verify Prometheus is scraping /metrics endpoint
Import Grafana dashboards from monitoring/grafana/dashboards/
Set up alerting rules for error rates and latency

Performance

Configure Redis maxmemory and eviction policy
Set appropriate Typesense --memory-limit
Enable gzip/brotli compression in reverse proxy
Configure CDN for static assets (Next.js /_next/static/)

Health Checks

Service	Endpoint	Expected Response
API	`GET /health`	`{"status": "ok"}`
API (Swagger)	`GET /api/v1/docs`	Swagger UI page
API (Metrics)	`GET /api/v1/metrics`	Prometheus metrics
AI Services	`GET /health`	`{"status": "ok"}`
Typesense	`GET /health`	`{"ok": true}`
Loki	`GET /ready`	200 OK
Redis	`redis-cli ping`	`PONG`
PostgreSQL	`pg_isready -h host -p 5432`	Exit code 0
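
The HTTP rows of the table can be checked in one pass. This is a rough sketch under stated assumptions: the base URLs are guesses built from the default ports in this guide, and `check_http` and `check_all` are hypothetical helpers. The Redis and PostgreSQL rows use their own CLIs and are left out.

```shell
# Assumed base URLs, derived from the default ports listed in this guide.
API_URL="${API_URL:-http://localhost:3001}"
AI_URL="${AI_URL:-http://localhost:8000}"
TYPESENSE_URL="${TYPESENSE_URL:-http://localhost:8108}"
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# Hypothetical helper: prints OK/FAIL for one endpoint, non-zero on failure.
# curl -f turns HTTP error statuses into a failing exit code.
check_http() {
  if curl -sf -o /dev/null --max-time 5 "$2"; then
    echo "OK   $1"
  else
    echo "FAIL $1"
    return 1
  fi
}

# Runs every check even if an earlier one fails, then reports overall status.
check_all() {
  failed=0
  check_http "API"         "$API_URL/health"       || failed=1
  check_http "AI Services" "$AI_URL/health"        || failed=1
  check_http "Typesense"   "$TYPESENSE_URL/health" || failed=1
  check_http "Loki"        "$LOKI_URL/ready"       || failed=1
  return "$failed"
}
```

Calling `check_all` prints one line per service and returns non-zero if any endpoint failed, so it can gate a later step.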

Scaling Considerations

Horizontal Scaling

API: Stateless — scale with multiple instances behind a load balancer
Web: Stateless — scale with multiple instances or deploy to Vercel/Cloudflare
AI Services: CPU-bound — scale based on valuation request volume
Redis: Use Redis Cluster for high availability
PostgreSQL: Read replicas for query-heavy workloads

Recommended Architecture (Production)

                    ┌─────────────┐
                    │ Load Balancer│
                    │ (nginx/ALB)  │
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
        ┌─────▼──┐  ┌─────▼──┐  ┌─────▼──┐
        │ API #1 │  │ API #2 │  │ API #N │
        └────────┘  └────────┘  └────────┘
              │            │            │
              └────────────┼────────────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
        ┌─────▼──┐  ┌─────▼──┐  ┌─────▼─────┐
        │  PG    │  │ Redis  │  │ Typesense  │
        │Primary │  │Cluster │  │  Cluster   │
        │+ Replica│  │        │  │            │
        └────────┘  └────────┘  └────────────┘

CI/CD Pipeline

Branch Strategy

Branch	Deploy Target	Trigger	Notes
`develop`	Staging	Auto (push)	Every merge to `develop` auto-deploys to staging
`master`	Staging	Auto (push)	Master push also deploys to staging for verification
Manual	Staging/Production	`workflow_dispatch`	Manual trigger via GitHub Actions UI

Staging Auto-Deploy Flow

Push to develop → Build images → Tag rollback → Deploy to staging → Smoke tests → Cleanup / Rollback

Build: Docker images for API, Web, and AI Services are built and pushed to GHCR with staging-latest tag
Tag rollback: Current running images are tagged as :rollback before new images are pulled
Deploy: New images are pulled and services are updated via rolling restart (zero-downtime)
Verify: Health check polls $STAGING_URL/health for up to 100 seconds
Smoke test: scripts/smoke-test.sh runs against the staging URL, checking health probes, core API endpoints, search, and auth
Cleanup: On success, :rollback tags are removed and docker image prune cleans up old layers
Notify: Slack notification on success or failure
Rollback: If smoke tests fail, automatic rollback restores the :rollback tagged images
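
The Verify step above amounts to polling with a time budget. A minimal sketch follows, assuming the 100 seconds are split into 20 attempts 5 seconds apart (the guide only states the total). `wait_for_health` is a hypothetical helper, not the workflow's actual step.

```shell
# Hypothetical helper: poll a health URL until it answers or attempts run out.
# Defaults of 20 attempts x 5 s are an assumption that adds up to 100 s.
wait_for_health() {
  url=$1
  attempts=${2:-20}
  interval=${3:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -sf -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
}
```

For example, `wait_for_health "$STAGING_URL/health" || exit 1` would stop a pipeline step when the service never becomes healthy.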

Notifications

Deploy status notifications are sent to Slack via SLACK_WEBHOOK_URL secret:

Event	Channel	Content
Staging smoke tests pass	Slack	✅ Commit SHA, branch, link to run
Staging smoke tests fail	Slack	🚨 Commit SHA, branch, link to run
Staging rollback triggered	Slack	⚠️ Commit SHA, reason, link to run
Production deploy success	Slack	✅ Commit SHA, branch
Production rollback triggered	Slack	⚠️ Commit SHA, reason, link to run

Required Secrets

Secret	Environment	Description
`STAGING_HOST`	staging	Staging server hostname/IP
`STAGING_USER`	staging	SSH user for staging deploys
`STAGING_SSH_KEY`	staging	SSH private key for staging
`STAGING_URL`	staging	Staging base URL (e.g., `https://staging.goodgo.vn`)
`PRODUCTION_HOST`	production	Production server hostname/IP
`PRODUCTION_USER`	production	SSH user for production deploys
`PRODUCTION_SSH_KEY`	production	SSH private key for production
`PRODUCTION_URL`	production	Production base URL
`SLACK_WEBHOOK_URL`	both	Slack incoming webhook URL

Rollback

Rollback Safety Mechanism

The deploy pipeline uses explicit :rollback image tags to guarantee safe rollbacks. Here's how it works:

Before pulling new images: The current running images are tagged as goodgo-api:rollback, goodgo-web:rollback, and goodgo-ai-services:rollback
After pulling new images: Services are updated with the new images via rolling restart
After smoke tests pass: The :rollback tags are removed and docker image prune cleans up old layers
If smoke tests fail: The :rollback tagged images are used to restore the previous version

This ensures that docker image prune never deletes the images needed for rollback, because:

Image pruning only happens after smoke tests pass
The :rollback tags keep the previous images pinned even if pruning were to run accidentally
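
The tagging and cleanup halves of this mechanism can be sketched with plain Docker commands. This is an illustration under assumptions, not the contents of `scripts/deploy-production.sh`: `tag_rollback` and `cleanup_after_success` are hypothetical names, and the caller is assumed to know the reference of the currently deployed image.

```shell
# Image names as listed in this guide.
SERVICES="goodgo-api goodgo-web goodgo-ai-services"

# Pin the currently deployed image of one service under the :rollback tag.
# $1 = image name, $2 = reference of the image that is running now (assumed
# to be supplied by the caller).
tag_rollback() {
  docker tag "$2" "$1:rollback"
}

# After smoke tests pass: drop the pins, then prune dangling images.
cleanup_after_success() {
  for image in $SERVICES; do
    # Ignore errors so a missing :rollback tag does not abort the cleanup.
    docker rmi "$image:rollback" >/dev/null 2>&1 || true
  done
  docker image prune -f
}
```

Because `docker image prune` without `-a` only removes dangling (untagged) images, an image that still carries a `:rollback` tag survives a prune that runs too early.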

Automatic Rollback (Staging)

The staging pipeline includes automatic rollback when smoke tests fail:

Pre-deploy: Current container images are tagged :rollback before new images are pulled
Smoke test failure: If scripts/smoke-test.sh exits non-zero, the rollback-staging job triggers
Rollback execution: Containers are stopped and restarted using the :rollback tagged images
Verification: Health check confirms the rollback succeeded
Notification: Slack notification reports the rollback with links to the failed run
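
The control flow of these steps is small enough to sketch in a few lines. `deploy_with_rollback` is a hypothetical wrapper, not the `rollback-staging` job itself. It takes the deploy, smoke-test, and rollback commands as arguments and triggers the rollback only when the smoke test exits non-zero.

```shell
# Hypothetical wrapper: $1 = deploy command, $2 = smoke-test command,
# $3 = rollback command. Arguments are left unquoted on purpose so that a
# command string with arguments is split into words.
deploy_with_rollback() {
  deploy_cmd=$1
  smoke_cmd=$2
  rollback_cmd=$3

  # A failed deploy is reported as-is; this sketch does not roll it back.
  $deploy_cmd || return 1

  if $smoke_cmd; then
    echo "smoke tests passed"
    return 0
  fi

  echo "smoke tests failed, restoring :rollback images" >&2
  $rollback_cmd
  # Still report failure so the pipeline marks the deploy as unsuccessful.
  return 1
}
```

A call might look like `deploy_with_rollback ./deploy.sh "scripts/smoke-test.sh $STAGING_URL" ./rollback.sh`, where `deploy.sh` and `rollback.sh` are placeholder names.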

Automatic Rollback (Production)

Same mechanism as staging — smoke test failure triggers rollback-production using the :rollback tagged images.

Manual Rollback

To manually roll back a staging or production deployment:

Option 1: Re-deploy a known-good commit

# Trigger a deploy of a specific commit via GitHub Actions
gh workflow run deploy.yml \
  --ref <known-good-commit-or-branch> \
  -f environment=staging

Option 2: SSH rollback using :rollback tags (fastest)

# SSH into the staging/production server
ssh deploy@<host>
cd ~/goodgo

# Verify :rollback images exist before stopping anything
docker image inspect goodgo-api:rollback > /dev/null 2>&1 && echo "API rollback available"
docker image inspect goodgo-web:rollback > /dev/null 2>&1 && echo "Web rollback available"
docker image inspect goodgo-ai-services:rollback > /dev/null 2>&1 && echo "AI rollback available"

# Stop current services
docker compose -f docker-compose.prod.yml stop api web ai-services

# Restart services. Compose starts whatever image the compose file resolves to,
# so confirm it points at the :rollback images (retag them first if it does not).
docker compose -f docker-compose.prod.yml up -d --wait api web ai-services

# Verify health
curl -sf http://localhost:3001/health && echo "Rollback successful"

Note: The :rollback tags are only available until the next successful deploy cleans them up. If you need to roll back to an older version, use Option 3 below.

Option 3: Pin to a specific image tag

ssh deploy@<host>
cd ~/goodgo

# Set IMAGE_TAG to a known-good SHA
export IMAGE_TAG=<known-good-commit-sha>
export REGISTRY_URL=ghcr.io/<owner>

# Pull and restart with the pinned tag
docker compose -f docker-compose.prod.yml pull api web ai-services
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api web ai-services

Option 4: Use deploy-production.sh (built-in rollback)

The manual deploy script (scripts/deploy-production.sh) has integrated rollback support:

Automatically tags :rollback images before pulling
Runs health checks and smoke tests
Automatically rolls back using :rollback tags if either fails
Only prunes images after smoke tests pass

ssh ubuntu@185.225.232.65
cd ~/goodgo
./scripts/deploy-production.sh [image-tag]

Database Rollback

Prisma does not support automatic down migrations. If a migration must be reverted:

Identify the migration in prisma/migrations/
Write a manual SQL rollback script
Apply via psql or a migration tool
Update _prisma_migrations table

Always test migrations against a staging database before production deployment.
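
The manual steps above could be scripted roughly as follows. This is a sketch under assumptions: `revert_migration` is a hypothetical helper, deleting the bookkeeping row is only one way to "update" `_prisma_migrations`, and `DATABASE_URL` is assumed to be a URI that `psql` accepts (a Prisma-specific `?schema=` parameter would need to be stripped first).

```shell
# Hypothetical helper, not a repo script.
# $1 = migration directory name under prisma/migrations/
# $2 = hand-written SQL file that undoes that migration
revert_migration() {
  migration_name=$1
  rollback_sql=$2

  # ON_ERROR_STOP makes psql exit non-zero at the first failing statement.
  psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f "$rollback_sql" || return 1

  # Assumption: removing the row is the desired bookkeeping update, so that
  # Prisma no longer treats the migration as applied. The name is passed as a
  # psql variable and quoted with :'name' rather than pasted into the SQL.
  psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -v name="$migration_name" <<'SQL'
DELETE FROM _prisma_migrations WHERE migration_name = :'name';
SQL
}
```

Running this against a staging copy first, as recommended above, shows whether the rollback SQL and the bookkeeping change leave `pnpm db:migrate:deploy` in a consistent state.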

Post-Rollback Checklist

Verify health endpoints respond: GET /health, GET /ready
Run smoke tests manually: ./scripts/smoke-test.sh <url>
Check application logs: docker compose -f docker-compose.prod.yml logs --tail=100 api web
Confirm Grafana dashboards show normal metrics
Notify the team via Slack about the rollback and root cause
