# Deployment Guide

## Overview

GoodGo Platform AI consists of four deployable services:

| Service | Technology | Default Port |
|---------|-----------|-------------|
| **API** | NestJS (Node.js) | 3001 |
| **Web** | Next.js | 3000 |
| **AI Services** | FastAPI (Python) | 8000 |
| **Infrastructure** | Docker Compose | Various |

## Prerequisites

- Docker Engine 24+ & Docker Compose v2
- Node.js 22 LTS
- pnpm 10.27+
- Python 3.12 (for AI services, if running outside Docker)

## Environment Configuration

Copy `.env.example` to `.env` and configure all required values:

```bash
cp .env.example .env
```

### Required Variables

| Variable | Description | Example |
|----------|-------------|---------|
| `DATABASE_URL` | PostgreSQL connection string | `postgresql://user:pass@host:5432/goodgo` |
| `JWT_SECRET` | JWT signing key (min 32 chars) | Generate with `openssl rand -hex 32` |
| `JWT_REFRESH_SECRET` | Refresh token signing key | Generate with `openssl rand -hex 32` |
| `REDIS_URL` | Redis connection string | `redis://localhost:6379` |
| `TYPESENSE_API_KEY` | Typesense admin API key | Generate a secure random key |

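A pre-flight check can catch a missing variable before any service starts. The following is a minimal sketch, not part of the project: the function name `check_required_env` and its messages are illustrative, and it only enforces the 32-character minimum that the table states for `JWT_SECRET`.

```bash
#!/usr/bin/env bash
# Sketch: fail fast when a required variable is unset or JWT_SECRET is too short.
# check_required_env is a hypothetical helper, not an existing project script.

check_required_env() {
  local failed=0 name
  for name in DATABASE_URL JWT_SECRET JWT_REFRESH_SECRET REDIS_URL TYPESENSE_API_KEY; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      echo "missing: $name" >&2
      failed=1
    fi
  done
  # The table above asks for at least 32 characters in JWT_SECRET.
  if [ -n "${JWT_SECRET:-}" ] && [ "${#JWT_SECRET}" -lt 32 ]; then
    echo "JWT_SECRET is shorter than 32 characters" >&2
    failed=1
  fi
  return "$failed"
}
```

Assuming `.env` contains only plain `KEY=value` lines that a shell can source, one way to use it is `set -a; source .env; set +a; check_required_env`.
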
### Optional Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `API_PORT` | API server port | `3001` |
| `WEB_PORT` | Web app port | `3000` |
| `NODE_ENV` | Environment mode | `development` |
| `CORS_ORIGINS` | Allowed CORS origins | (none) |
| `CLAUDE_API_KEY` | Claude API key (for content moderation) | (none) |
| `NEXT_PUBLIC_MAPBOX_TOKEN` | Mapbox token (for maps) | (none) |
| `VNPAY_*`, `MOMO_*`, `ZALOPAY_*` | Payment gateway credentials | (none) |

## Infrastructure Setup (Docker Compose)

Start all infrastructure services:

```bash
docker compose up -d
```

This starts:

- **PostgreSQL 16 + PostGIS 3.4** (port 5432)
- **Redis 7** (port 6379)
- **Typesense 27** (port 8108)
- **MinIO** (API: 9000, Console: 9001)
- **AI Services** (port 8000)
- **pg-backup**: automated daily PostgreSQL backups at 02:00 UTC with verification at 04:00 UTC
- **Loki** (port 3100): log aggregation
- **Promtail**: log collection agent (ships container logs to Loki)
- **Prometheus** (port 9090)
- **Grafana** (port 3002): dashboards for metrics and logs

Verify all services are healthy:

```bash
docker compose ps
```

All services include health checks. Wait until all show `healthy` status.

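Reading the `docker compose ps` table by eye gets tedious with this many containers. As a rough sketch, the two helpers below list only the containers that are not yet `healthy`. The function names are hypothetical; the sketch assumes it runs from the directory that holds the compose file and relies on `docker inspect` exposing `.State.Health.Status` for containers that define a health check.

```bash
#!/usr/bin/env bash
# Sketch: list compose containers whose health status is not "healthy".
# container_health and not_healthy are illustrative names, not project scripts.

# Print "<container-name> <status>" for every container in the compose project.
container_health() {
  local ids
  ids=$(docker compose ps -q)
  [ -n "$ids" ] || return 0
  # Word splitting of the ID list is intended here, so $ids stays unquoted.
  docker inspect \
    --format '{{.Name}} {{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' \
    $ids
}

# Keep only the lines from stdin whose status column is not exactly "healthy".
not_healthy() {
  grep -v ' healthy$' || true
}
```

Running `container_health | not_healthy` prints nothing once every container reports `healthy`; any remaining line names a container that is still starting, unhealthy, or has no health check.
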
## Database Setup

```bash
# Generate Prisma client
pnpm db:generate

# Apply migrations
pnpm db:migrate:deploy

# Seed initial data (optional)
pnpm db:seed
```

## Building for Production

### API (NestJS)

```bash
cd apps/api
pnpm build
```

Output: `apps/api/dist/`

Run in production:

```bash
NODE_ENV=production PORT=3001 node apps/api/dist/main.js
```

### Web (Next.js)

```bash
cd apps/web
pnpm build
```

Output: `apps/web/.next/`

Run in production:

```bash
NODE_ENV=production pnpm --filter web start
```

### AI Services (FastAPI)

The AI service runs in Docker via `docker compose`. To build separately:

```bash
cd libs/ai-services
docker build -t goodgo-ai-services .
docker run -p 8000:8000 --env-file ../../.env goodgo-ai-services
```

## Production Checklist

### Security

- [ ] Set strong, unique `JWT_SECRET` and `JWT_REFRESH_SECRET` (min 32 characters)
- [ ] Set `NODE_ENV=production`
- [ ] Configure `CORS_ORIGINS` to only allow your domain(s)
- [ ] Change default database passwords
- [ ] Change default MinIO credentials (`MINIO_USER`, `MINIO_PASSWORD`)
- [ ] Change default Grafana credentials (`GRAFANA_ADMIN_USER`, `GRAFANA_ADMIN_PASSWORD`)
- [ ] Use a strong, unique `TYPESENSE_API_KEY`
- [ ] Enable SSL/TLS termination (reverse proxy)
- [ ] Set `MINIO_USE_SSL=true` if MinIO is exposed publicly

### Database

- [ ] Run `pnpm db:migrate:deploy` (not `db:migrate:dev`)
- [ ] Enable PostgreSQL connection pooling (PgBouncer recommended)
- [ ] Configure automated backups
- [ ] Set appropriate `max_connections` in PostgreSQL config

### Monitoring

- [ ] Verify Prometheus is scraping `/metrics` endpoint
- [ ] Import Grafana dashboards from `monitoring/grafana/dashboards/`
- [ ] Set up alerting rules for error rates and latency

### Performance

- [ ] Configure Redis `maxmemory` and eviction policy
- [ ] Set appropriate Typesense `--memory-limit`
- [ ] Enable gzip/brotli compression in reverse proxy
- [ ] Configure CDN for static assets (Next.js `/_next/static/`)

## Health Checks

| Service | Endpoint | Expected Response |
|---------|----------|-------------------|
| API | `GET /health` | `{"status": "ok"}` |
| API (Swagger) | `GET /api/v1/docs` | Swagger UI page |
| API (Metrics) | `GET /api/v1/metrics` | Prometheus metrics |
| AI Services | `GET /health` | `{"status": "ok"}` |
| Typesense | `GET /health` | `{"ok": true}` |
| Loki | `GET /ready` | 200 OK |
| Redis | `redis-cli ping` | `PONG` |
| PostgreSQL | `pg_isready -h host -p 5432` | Exit code 0 |

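The API and AI Services rows of the table can be checked from a script rather than by hand. The sketch below is one possible shape, with hypothetical function names; it checks the response body as well as the HTTP status, and it does not cover Typesense, whose body is `{"ok": true}`.

```bash
#!/usr/bin/env bash
# Sketch: verify that a /health endpoint answers with "status": "ok".
# body_has_status_ok and check_health are illustrative names.

# Succeeds when the given JSON text contains "status": "ok" (any spacing).
body_has_status_ok() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# check_health BASE_URL: fetch BASE_URL/health and validate the body.
check_health() {
  local body
  # -s silences progress output, -f turns HTTP errors into a non-zero exit.
  body=$(curl -sf --max-time 5 "$1/health") || return 1
  body_has_status_ok "$body"
}
```

With the default ports from this guide, `check_health http://localhost:3001` would probe the API and `check_health http://localhost:8000` the AI Services.
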
## Scaling Considerations

### Horizontal Scaling

- **API**: Stateless; scale with multiple instances behind a load balancer
- **Web**: Stateless; scale with multiple instances or deploy to Vercel/Cloudflare
- **AI Services**: CPU-bound; scale based on valuation request volume
- **Redis**: Use Redis Cluster for high availability
- **PostgreSQL**: Read replicas for query-heavy workloads

### Recommended Architecture (Production)

```
           ┌───────────────┐
           │ Load Balancer │
           │  (nginx/ALB)  │
           └───────┬───────┘
                   │
      ┌────────────┼────────────┐
      │            │            │
┌─────▼──┐   ┌─────▼──┐   ┌─────▼──┐
│ API #1 │   │ API #2 │   │ API #N │
└────────┘   └────────┘   └────────┘
      │            │            │
      └────────────┼────────────┘
                   │
      ┌────────────┼────────────┐
      │            │            │
┌─────▼───┐  ┌─────▼──┐   ┌─────▼─────┐
│   PG    │  │ Redis  │   │ Typesense │
│ Primary │  │Cluster │   │  Cluster  │
│+ Replica│  │        │   │           │
└─────────┘  └────────┘   └───────────┘
```

## CI/CD Pipeline

### Branch Strategy

| Branch | Deploy Target | Trigger | Notes |
|--------|--------------|---------|-------|
| `develop` | Staging | Auto (push) | Every merge to `develop` auto-deploys to staging |
| `master` | Staging | Auto (push) | Master push also deploys to staging for verification |
| Manual | Staging/Production | `workflow_dispatch` | Manual trigger via GitHub Actions UI |

### Staging Auto-Deploy Flow

```
Push to develop → Build images → Deploy to staging → Smoke tests → ✅ / Rollback
```

1. **Build**: Docker images for API, Web, and AI Services are built and pushed to GHCR with `staging-latest` tag
2. **Deploy**: Images are pulled and services are updated via rolling restart (zero-downtime)
3. **Verify**: Health check polls `$STAGING_URL/health` for up to 100 seconds
4. **Smoke test**: `scripts/smoke-test.sh` runs against the staging URL, checking health probes, core API endpoints, search, and auth
5. **Notify**: Slack notification on success or failure
6. **Rollback**: If smoke tests fail, automatic rollback restores previous container images

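The polling in the Verify step can be pictured as a bounded retry loop. The sketch below is illustrative only: `retry` is a hypothetical helper, and the split into 20 attempts 5 seconds apart is an assumption chosen to match the stated window of about 100 seconds, not a value taken from the workflow.

```bash
#!/usr/bin/env bash
# Sketch: run a command until it succeeds or the attempts run out.
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]   (hypothetical helper)

retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # No need to sleep after the final attempt.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

Under those assumptions the Verify step would look like `retry 20 5 curl -sf "$STAGING_URL/health"`, which gives up with a non-zero exit code after roughly 100 seconds.
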
### Notifications

Deploy status notifications are sent to Slack via the `SLACK_WEBHOOK_URL` secret:

| Event | Channel | Content |
|-------|---------|---------|
| Staging smoke tests pass | Slack | ✅ Commit SHA, branch, link to run |
| Staging smoke tests fail | Slack | 🚨 Commit SHA, branch, link to run |
| Staging rollback triggered | Slack | ⚠️ Commit SHA, reason, link to run |
| Production deploy success | Slack | ✅ Commit SHA, branch |
| Production rollback triggered | Slack | ⚠️ Commit SHA, reason, link to run |

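For testing the webhook outside the pipeline, a message can be posted by hand. This is a minimal sketch: the function names and the message wording are made up for illustration, and the payload uses only the plain `text` field that Slack incoming webhooks accept. The workflow's real message format may differ.

```bash
#!/usr/bin/env bash
# Sketch: post a deploy status message to a Slack incoming webhook.
# slack_payload and notify_slack are illustrative names.

# slack_payload STATUS SHA BRANCH RUN_URL: print the JSON request body.
slack_payload() {
  # Assumes the arguments contain no double quotes or backslashes,
  # so no JSON escaping is performed.
  printf '{"text":"%s staging deploy: commit %s on %s (%s)"}\n' "$1" "$2" "$3" "$4"
}

# notify_slack STATUS SHA BRANCH RUN_URL: send the message to SLACK_WEBHOOK_URL.
notify_slack() {
  curl -sf -X POST -H 'Content-Type: application/json' \
    --data "$(slack_payload "$@")" "$SLACK_WEBHOOK_URL"
}
```

Calling `slack_payload` on its own prints the body without sending anything, which is handy for checking the message before pointing it at a real channel.
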
### Required Secrets

| Secret | Environment | Description |
|--------|-------------|-------------|
| `STAGING_HOST` | staging | Staging server hostname/IP |
| `STAGING_USER` | staging | SSH user for staging deploys |
| `STAGING_SSH_KEY` | staging | SSH private key for staging |
| `STAGING_URL` | staging | Staging base URL (e.g., `https://staging.goodgo.vn`) |
| `PRODUCTION_HOST` | production | Production server hostname/IP |
| `PRODUCTION_USER` | production | SSH user for production deploys |
| `PRODUCTION_SSH_KEY` | production | SSH private key for production |
| `PRODUCTION_URL` | production | Production base URL |
| `SLACK_WEBHOOK_URL` | both | Slack incoming webhook URL |

## Rollback

### Automatic Rollback (Staging)

The staging pipeline includes automatic rollback when smoke tests fail:

1. **Pre-deploy**: Current container image digests are recorded before deployment
2. **Smoke test failure**: If `scripts/smoke-test.sh` exits non-zero, the `rollback-staging` job triggers
3. **Rollback execution**: Containers are stopped and restarted with previous images
4. **Verification**: Health check confirms the rollback succeeded
5. **Notification**: Slack notification reports the rollback with links to the failed run

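The pre-deploy step can be pictured as writing one line per service to a small state file that the rollback job later reads. The sketch below is an assumption about the shape of such a step, not the workflow's actual code: all function names and the `COMPOSE_FILE_PATH` variable are hypothetical, and it records the local image ID reported by `docker inspect` rather than a registry digest.

```bash
#!/usr/bin/env bash
# Sketch: remember which image each service runs before a deploy.
# All names here are illustrative; the real workflow may record this differently.

COMPOSE_FILE_PATH=${COMPOSE_FILE_PATH:-docker-compose.prod.yml}

# Print the image ID (sha256:...) of the container currently running a service.
current_image() {
  local cid
  cid=$(docker compose -f "$COMPOSE_FILE_PATH" ps -q "$1")
  docker inspect --format '{{.Image}}' "$cid"
}

# record_images FILE SERVICE...: write one "service=image" line per service.
record_images() {
  local out=$1 svc
  shift
  : > "$out"
  for svc in "$@"; do
    printf '%s=%s\n' "$svc" "$(current_image "$svc")" >> "$out"
  done
}

# previous_image FILE SERVICE: print the image recorded for SERVICE.
previous_image() {
  # Compare the whole key so "api" does not also match a name like "api-worker".
  awk -F= -v svc="$2" '$1 == svc { print substr($0, length(svc) + 2) }' "$1"
}
```

A deploy script could call `record_images pre-deploy.images api web ai-services` before pulling new images, and a rollback could look up `previous_image pre-deploy.images api` to know what to restore.
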
### Automatic Rollback (Production)

Production uses the same mechanism as staging: a smoke test failure triggers the `rollback-production` job.

### Manual Rollback

To manually roll back a staging or production deployment:

#### Option 1: Re-deploy a known-good commit

```bash
# Trigger a deploy of a specific commit via GitHub Actions
gh workflow run deploy.yml \
  --ref <known-good-commit-or-branch> \
  -f environment=staging
```

#### Option 2: SSH rollback (emergency)

```bash
# SSH into the staging/production server
ssh deploy@<host>

cd ~/goodgo

# Stop the current services
docker compose -f docker-compose.prod.yml down api web ai-services

# Restart with the previous image layers still cached locally
docker compose -f docker-compose.prod.yml up -d --wait api web ai-services

# Verify health
curl -sf http://localhost:3001/health
```

#### Option 3: Pin to a specific image tag

```bash
ssh deploy@<host>
cd ~/goodgo

# Set IMAGE_TAG to a known-good SHA
export IMAGE_TAG=<known-good-commit-sha>
export REGISTRY_URL=ghcr.io/<owner>

# Pull and restart with the pinned tag
docker compose -f docker-compose.prod.yml pull api web ai-services
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api web ai-services
```

### Database Rollback

Prisma does not support automatic down migrations. If a migration must be reverted:

1. Identify the migration in `prisma/migrations/`
2. Write a manual SQL rollback script that undoes its changes
3. Apply the script via `psql` or a migration tool
4. Update the `_prisma_migrations` table so that it no longer lists the reverted migration as applied

Always test migrations against a staging database before production deployment.

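Steps 3 and 4 can be combined so that the schema change and the bookkeeping update succeed or fail together. The sketch below is one interpretation of step 4, namely deleting the migration's row from `_prisma_migrations`; the function names are hypothetical. It also assumes `DATABASE_URL` is a URI that `psql` accepts as-is, so Prisma-specific query parameters may need to be removed first.

```bash
#!/usr/bin/env bash
# Sketch: apply a hand-written rollback script and clear the migration record.
# forget_migration_sql and revert_migration are illustrative names.

# forget_migration_sql NAME: print SQL that removes NAME from _prisma_migrations.
forget_migration_sql() {
  # Accept only letters, digits, and underscores to keep the name safe to embed.
  if ! [[ $1 =~ ^[A-Za-z0-9_]+$ ]]; then
    echo "unexpected migration name: $1" >&2
    return 1
  fi
  printf 'DELETE FROM "_prisma_migrations" WHERE migration_name = '\''%s'\'';\n' "$1"
}

# revert_migration NAME ROLLBACK_SQL_FILE: run both steps in one transaction.
revert_migration() {
  local stmt
  stmt=$(forget_migration_sql "$1") || return 1
  # ON_ERROR_STOP aborts on the first error; --single-transaction rolls
  # everything back if any statement fails. "-f -" reads SQL from stdin.
  { cat "$2"; printf '%s\n' "$stmt"; } |
    psql "$DATABASE_URL" -v ON_ERROR_STOP=1 --single-transaction -f -
}
```

Printing the statement with `forget_migration_sql <migration-name>` first is a cheap way to review what will run. Note that once the row is gone, a later `pnpm db:migrate:deploy` will try to apply the migration again if its directory still exists.
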
### Post-Rollback Checklist

- [ ] Verify health endpoints respond: `GET /health`, `GET /ready`
- [ ] Run smoke tests manually: `./scripts/smoke-test.sh <url>`
- [ ] Check application logs: `docker compose -f docker-compose.prod.yml logs --tail=100 api web`
- [ ] Confirm Grafana dashboards show normal metrics
- [ ] Notify the team via Slack about the rollback and root cause