# Deployment Guide

## Overview

GoodGo Platform AI consists of three application services plus a Docker Compose infrastructure stack:

| Service | Technology | Default Port |
|---------|-----------|-------------|
| **API** | NestJS (Node.js) | 3001 |
| **Web** | Next.js | 3000 |
| **AI Services** | FastAPI (Python) | 8000 |
| **Infrastructure** | Docker Compose | Various |

## Prerequisites

- Docker Engine 24+ & Docker Compose v2
- Node.js 22 LTS
- pnpm 10.27+
- Python 3.12 (for AI services, if running outside Docker)

## Environment Configuration

Copy `.env.example` to `.env` and configure all required values:

```bash
cp .env.example .env
```

### Required Variables

| Variable | Description | Example |
|----------|-------------|---------|
| `DATABASE_URL` | PostgreSQL connection string | `postgresql://user:pass@host:5432/goodgo` |
| `JWT_SECRET` | JWT signing key (min 32 chars) | Generate with `openssl rand -hex 32` |
| `JWT_REFRESH_SECRET` | Refresh token signing key | Generate with `openssl rand -hex 32` |
| `REDIS_URL` | Redis connection string | `redis://localhost:6379` |
| `TYPESENSE_API_KEY` | Typesense admin API key | Generate a secure random key |
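
A pre-flight check can catch missing values before any service starts. The sketch below is a hypothetical helper (`check_required_env` is not part of the repository) that verifies the five variables from the table are set and that `JWT_SECRET` meets the 32-character minimum:

```bash
# Hypothetical helper: fail fast when a required variable is missing.
check_required_env() {
  local missing=0 name value
  for name in DATABASE_URL JWT_SECRET JWT_REFRESH_SECRET REDIS_URL TYPESENSE_API_KEY; do
    eval "value=\${$name:-}"   # indirect lookup that also works in POSIX sh
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  # The table above requires JWT_SECRET to be at least 32 characters.
  if [ -n "${JWT_SECRET:-}" ] && [ "${#JWT_SECRET}" -lt 32 ]; then
    echo "JWT_SECRET is shorter than 32 characters" >&2
    missing=1
  fi
  return "$missing"
}
```

One way to use it is `set -a; . ./.env; set +a; check_required_env`, which only works when `.env` contains plain `KEY=value` lines that the shell can parse.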

### Optional Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `API_PORT` | API server port | `3001` |
| `WEB_PORT` | Web app port | `3000` |
| `NODE_ENV` | Environment mode | `development` |
| `CORS_ORIGINS` | Allowed CORS origins | — |
| `CLAUDE_API_KEY` | Claude API key (for content moderation) | — |
| `NEXT_PUBLIC_MAPBOX_TOKEN` | Mapbox token (for maps) | — |
| `VNPAY_*`, `MOMO_*`, `ZALOPAY_*` | Payment gateway credentials | — |

## Infrastructure Setup (Docker Compose)

Start all infrastructure services:

```bash
docker compose up -d
```

This starts:

- **PostgreSQL 16 + PostGIS 3.4** (port 5432)
- **Redis 7** (port 6379)
- **Typesense 27** (port 8108)
- **MinIO** (API: 9000, Console: 9001)
- **AI Services** (port 8000)
- **pg-backup** — automated daily PostgreSQL backups at 02:00 UTC with verification at 04:00 UTC
- **Loki** (port 3100) — log aggregation
- **Promtail** — log collection agent (ships container logs to Loki)
- **Prometheus** (port 9090)
- **Grafana** (port 3002) — dashboards for metrics and logs

Verify all services are healthy:

```bash
docker compose ps
```

All services include health checks. Wait until all show `healthy` status.
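
To script the wait instead of re-running `docker compose ps` by hand, the health status that Docker records for each container can be polled. This is a minimal sketch: `wait_until_healthy` is a hypothetical helper, and the timeout is a single budget shared across all containers passed in.

```bash
# Hypothetical helper: wait_until_healthy <timeout-seconds> <container>...
wait_until_healthy() {
  local timeout_s="$1" elapsed=0 container status
  shift
  for container in "$@"; do
    while :; do
      # Empty when the container does not exist or defines no health check.
      status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)
      [ "$status" = "healthy" ] && break
      if [ "$elapsed" -ge "$timeout_s" ]; then
        echo "timed out waiting for $container (last status: ${status:-unknown})" >&2
        return 1
      fi
      sleep 2
      elapsed=$((elapsed + 2))
    done
    echo "$container is healthy"
  done
}
```

For example, `wait_until_healthy 120 $(docker compose ps -q)` waits up to two minutes for every container in the project. On Compose versions that support it, `docker compose up -d --wait` achieves a similar result without a helper.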

## Database Setup

```bash
# Generate Prisma client
pnpm db:generate

# Apply migrations
pnpm db:migrate:deploy

# Seed initial data (optional)
pnpm db:seed
```

## Building for Production

### API (NestJS)

```bash
cd apps/api
pnpm build
```

Output: `apps/api/dist/`

Run in production from the repository root (the path below is relative to it):

```bash
NODE_ENV=production PORT=3001 node apps/api/dist/main.js
```

### Web (Next.js)

```bash
cd apps/web
pnpm build
```

Output: `apps/web/.next/`

Run in production:

```bash
NODE_ENV=production pnpm --filter web start
```

### AI Services (FastAPI)

The AI service runs in Docker via `docker compose`. To build and run it separately:

```bash
cd libs/ai-services
docker build -t goodgo-ai-services .
docker run -p 8000:8000 --env-file ../../.env goodgo-ai-services
```

## Production Checklist

### Security

- [ ] Set strong, unique `JWT_SECRET` and `JWT_REFRESH_SECRET` (min 32 characters)
- [ ] Set `NODE_ENV=production`
- [ ] Configure `CORS_ORIGINS` to only allow your domain(s)
- [ ] Change default database passwords
- [ ] Change default MinIO credentials (`MINIO_USER`, `MINIO_PASSWORD`)
- [ ] Change default Grafana credentials (`GRAFANA_ADMIN_USER`, `GRAFANA_ADMIN_PASSWORD`)
- [ ] Use a strong, unique `TYPESENSE_API_KEY`
- [ ] Enable SSL/TLS termination (reverse proxy)
- [ ] Set `MINIO_USE_SSL=true` if MinIO is exposed publicly

### Database

- [ ] Run `pnpm db:migrate:deploy` (not `db:migrate:dev`)
- [ ] Enable PostgreSQL connection pooling (PgBouncer recommended)
- [ ] Configure automated backups
- [ ] Set appropriate `max_connections` in PostgreSQL config

### Monitoring

- [ ] Verify Prometheus is scraping `/metrics` endpoint
- [ ] Import Grafana dashboards from `monitoring/grafana/dashboards/`
- [ ] Set up alerting rules for error rates and latency

### Performance

- [ ] Configure Redis `maxmemory` and eviction policy
- [ ] Set appropriate Typesense `--memory-limit`
- [ ] Enable gzip/brotli compression in reverse proxy
- [ ] Configure CDN for static assets (Next.js `/_next/static/`)

## Health Checks

| Service | Endpoint | Expected Response |
|---------|----------|-------------------|
| API | `GET /health` | `{"status": "ok"}` |
| API (Swagger) | `GET /api/v1/docs` | Swagger UI page |
| API (Metrics) | `GET /api/v1/metrics` | Prometheus metrics |
| AI Services | `GET /health` | `{"status": "ok"}` |
| Typesense | `GET /health` | `{"ok": true}` |
| Loki | `GET /ready` | 200 OK |
| Redis | `redis-cli ping` | `PONG` |
| PostgreSQL | `pg_isready -h host -p 5432` | Exit code 0 |
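
The HTTP rows of this table can be checked in one pass. The sketch below assumes every service is reachable on `localhost` at the ports listed earlier in this guide; `check_http` and `check_all` are hypothetical helper names.

```bash
# Hypothetical helper: check_http <name> <url>; prints one status line per endpoint.
check_http() {
  if curl -sf --max-time 5 -o /dev/null "$2"; then
    echo "ok   $1"
  else
    echo "FAIL $1"
    return 1
  fi
}

# Runs every check, then reports failure if any endpoint was down.
check_all() {
  local failed=0
  check_http "api"         "http://localhost:3001/health" || failed=1
  check_http "ai-services" "http://localhost:8000/health" || failed=1
  check_http "typesense"   "http://localhost:8108/health" || failed=1
  check_http "loki"        "http://localhost:3100/ready"  || failed=1
  return "$failed"
}
```

Because `check_all` keeps going after a failure, a single run shows every unhealthy service rather than only the first one.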

## Scaling Considerations

### Horizontal Scaling

- **API**: Stateless — scale with multiple instances behind a load balancer
- **Web**: Stateless — scale with multiple instances or deploy to Vercel/Cloudflare
- **AI Services**: CPU-bound — scale based on valuation request volume
- **Redis**: Use Redis Cluster for high availability
- **PostgreSQL**: Read replicas for query-heavy workloads

### Recommended Architecture (Production)

```
                    ┌─────────────┐
                    │ Load Balancer│
                    │ (nginx/ALB)  │
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
        ┌─────▼──┐  ┌─────▼──┐  ┌─────▼──┐
        │ API #1 │  │ API #2 │  │ API #N │
        └────────┘  └────────┘  └────────┘
              │            │            │
              └────────────┼────────────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
        ┌─────▼──┐  ┌─────▼──┐  ┌─────▼─────┐
        │  PG    │  │ Redis  │  │ Typesense  │
        │Primary │  │Cluster │  │  Cluster   │
        │+ Replica│  │        │  │            │
        └────────┘  └────────┘  └────────────┘
```

## CI/CD Pipeline

### Branch Strategy

| Branch | Deploy Target | Trigger | Notes |
|--------|--------------|---------|-------|
| `develop` | Staging | Auto (push) | Every merge to `develop` auto-deploys to staging |
| `master` | Staging | Auto (push) | Master push also deploys to staging for verification |
| Manual | Staging/Production | `workflow_dispatch` | Manual trigger via GitHub Actions UI |

### Staging Auto-Deploy Flow

```
Push to develop → Build images → Tag rollback → Deploy to staging → Smoke tests → Cleanup / Rollback
```

1. **Build**: Docker images for API, Web, and AI Services are built and pushed to GHCR with `staging-latest` tag
2. **Tag rollback**: Current running images are tagged as `:rollback` before new images are pulled
3. **Deploy**: New images are pulled and services are updated via rolling restart (zero-downtime)
4. **Verify**: Health check polls `$STAGING_URL/health` for up to 100 seconds
5. **Smoke test**: `scripts/smoke-test.sh` runs against the staging URL, checking health probes, core API endpoints, search, and auth
6. **Cleanup**: On success, `:rollback` tags are removed and `docker image prune` cleans up old layers
7. **Notify**: Slack notification on success or failure
8. **Rollback**: If smoke tests fail, automatic rollback restores the `:rollback` tagged images
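
The verify step (step 4) amounts to a bounded retry loop. The following is a sketch rather than the workflow's actual implementation: `poll_health` is a hypothetical name, and splitting the 100 seconds into 20 attempts at 5-second intervals is an assumption.

```bash
# Hypothetical helper: poll_health <url> [attempts] [interval-seconds]
# Defaults give 20 x 5 s = roughly 100 s; the split into attempts is an assumption.
poll_health() {
  local url="$1" attempts="${2:-20}" interval="${3:-5}" i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # No need to sleep after the final attempt.
    [ "$i" -lt "$attempts" ] && sleep "$interval"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
}
```

Called as `poll_health "$STAGING_URL/health"`, a non-zero exit status would be the signal to stop the pipeline before the smoke tests run.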

### Notifications

Deploy status notifications are sent to Slack via `SLACK_WEBHOOK_URL` secret:

| Event | Channel | Content |
|-------|---------|---------|
| Staging smoke tests pass | Slack | ✅ Commit SHA, branch, link to run |
| Staging smoke tests fail | Slack | 🚨 Commit SHA, branch, link to run |
| Staging rollback triggered | Slack | ⚠️ Commit SHA, reason, link to run |
| Production deploy success | Slack | ✅ Commit SHA, branch |
| Production rollback triggered | Slack | ⚠️ Commit SHA, reason, link to run |
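
For manual deploys or local testing, the same kind of message can be posted with `curl`, since Slack incoming webhooks accept a JSON body with a `text` field. The helper names and the message layout below are illustrative assumptions, not the format the workflow produces.

```bash
# Hypothetical helper: slack_payload <status> <sha> <branch> <run-url>
# Builds the JSON body; inputs are assumed to contain no quotes or backslashes.
slack_payload() {
  printf '{"text":"%s: %s on %s (%s)"}\n' "$1" "$2" "$3" "$4"
}

# Posts the message to the incoming webhook stored in SLACK_WEBHOOK_URL.
notify_slack() {
  slack_payload "$@" | curl -sf -X POST -H 'Content-Type: application/json' \
    --data-binary @- "$SLACK_WEBHOOK_URL"
}
```

Keeping payload construction separate from the HTTP call makes the message format easy to inspect without sending anything.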

### Required Secrets

| Secret | Environment | Description |
|--------|-------------|-------------|
| `STAGING_HOST` | staging | Staging server hostname/IP |
| `STAGING_USER` | staging | SSH user for staging deploys |
| `STAGING_SSH_KEY` | staging | SSH private key for staging |
| `STAGING_URL` | staging | Staging base URL (e.g., `https://staging.goodgo.vn`) |
| `PRODUCTION_HOST` | production | Production server hostname/IP |
| `PRODUCTION_USER` | production | SSH user for production deploys |
| `PRODUCTION_SSH_KEY` | production | SSH private key for production |
| `PRODUCTION_URL` | production | Production base URL |
| `SLACK_WEBHOOK_URL` | both | Slack incoming webhook URL |

## Rollback

### Rollback Safety Mechanism

The deploy pipeline uses **explicit `:rollback` image tags** to guarantee safe rollbacks. Here's how it works:

1. **Before pulling new images**: The current running images are tagged as `goodgo-api:rollback`, `goodgo-web:rollback`, and `goodgo-ai-services:rollback`
2. **After pulling new images**: Services are updated with the new images via rolling restart
3. **After smoke tests pass**: The `:rollback` tags are removed and `docker image prune` cleans up old layers
4. **If smoke tests fail**: The `:rollback` tagged images are used to restore the previous version

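This ensures that `docker image prune` never deletes the images needed for rollback, because:
- Image pruning only happens **after** smoke tests pass
- Plain `docker image prune` removes only dangling (untagged) images, so the `:rollback` tags keep the previous images pinned even if pruning runs accidentally. This protection does not extend to `docker image prune -a`, which also removes tagged images that no container is using.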
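
The tagging step itself is small. Here is a sketch, assuming the deployed images are named `<registry-url>/goodgo-<service>:<tag>` (the `:rollback` names come from this guide, the source naming is an assumption, and `tag_rollback` is a hypothetical helper):

```bash
# Hypothetical helper: tag_rollback <registry-url> <current-tag>
# Assumes images are named <registry-url>/goodgo-<service>:<tag>.
tag_rollback() {
  local registry="$1" tag="$2" service
  for service in api web ai-services; do
    # Abort on the first failure so a partial set of tags is noticed.
    docker tag "$registry/goodgo-$service:$tag" "goodgo-$service:rollback" || return 1
    echo "pinned goodgo-$service:rollback"
  done
}
```

Running this before `docker compose pull` matters: once the new image is pulled under the same tag, the old image becomes untagged and is then eligible for pruning.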

### Automatic Rollback (Staging)

The staging pipeline includes automatic rollback when smoke tests fail:

1. **Pre-deploy**: Current container images are tagged with `:rollback` suffix before new images are pulled
2. **Smoke test failure**: If `scripts/smoke-test.sh` exits non-zero, the `rollback-staging` job triggers
3. **Rollback execution**: Containers are stopped and restarted using the `:rollback` tagged images
4. **Verification**: Health check confirms the rollback succeeded
5. **Notification**: Slack notification reports the rollback with links to the failed run
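
The control flow behind these steps can be sketched as a single function. The pipeline itself runs as separate GitHub Actions jobs, so this is only an illustration of the decision logic; `deploy_new`, `smoke_test`, and `restore_prev` are hypothetical hooks that the caller defines.

```bash
# Hypothetical orchestration: the three hooks are supplied by the caller.
#   deploy_new    pulls and starts the new images
#   smoke_test    exits non-zero on failure (for example scripts/smoke-test.sh)
#   restore_prev  restarts services from the :rollback images
deploy_with_rollback() {
  deploy_new || { echo "deploy failed" >&2; return 1; }
  if smoke_test; then
    echo "deploy verified"
    return 0
  fi
  echo "smoke tests failed, rolling back" >&2
  restore_prev || { echo "rollback failed, manual intervention needed" >&2; return 2; }
  return 1
}
```

The distinct exit codes separate "rolled back cleanly" (1) from "rollback itself failed" (2), which is the case that needs a human.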

### Automatic Rollback (Production)

Same mechanism as staging — smoke test failure triggers `rollback-production` using the `:rollback` tagged images.

### Manual Rollback

To manually roll back a staging or production deployment:

#### Option 1: Re-deploy a known-good commit

```bash
# Trigger a deploy of a specific commit via GitHub Actions
gh workflow run deploy.yml \
  --ref <known-good-commit-or-branch> \
  -f environment=staging
```

#### Option 2: SSH rollback using :rollback tags (fastest)

```bash
# SSH into the staging/production server
ssh deploy@<host>
cd ~/goodgo

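# Verify :rollback images exist before stopping anything.
# If any line reports MISSING, do not continue; use Option 3 instead.
docker image inspect goodgo-api:rollback > /dev/null 2>&1 && echo "API rollback available" || echo "API rollback MISSING"
docker image inspect goodgo-web:rollback > /dev/null 2>&1 && echo "Web rollback available" || echo "Web rollback MISSING"
docker image inspect goodgo-ai-services:rollback > /dev/null 2>&1 && echo "AI rollback available" || echo "AI rollback MISSING"

# Stop current services
docker compose -f docker-compose.prod.yml stop api web ai-services

# Restart services. Compose starts whatever image docker-compose.prod.yml references,
# so confirm that reference resolves to the :rollback image (re-tag with `docker tag` if it does not).
docker compose -f docker-compose.prod.yml up -d --wait api web ai-services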

# Verify health
curl -sf http://localhost:3001/health && echo "Rollback successful"
```

> **Note:** The `:rollback` tags are only available until the next successful deploy cleans them up. If you need to roll back to an older version, use Option 3 below.

#### Option 3: Pin to a specific image tag

```bash
ssh deploy@<host>
cd ~/goodgo

# Set IMAGE_TAG to a known-good SHA
export IMAGE_TAG=<known-good-commit-sha>
export REGISTRY_URL=ghcr.io/<owner>

# Pull and restart with the pinned tag
docker compose -f docker-compose.prod.yml pull api web ai-services
docker compose -f docker-compose.prod.yml up -d --no-deps --wait api web ai-services
```

#### Option 4: Use deploy-production.sh (built-in rollback)

The manual deploy script (`scripts/deploy-production.sh`) has integrated rollback support:
- Automatically tags `:rollback` images before pulling
- Runs health checks and smoke tests
- Rolls back automatically using `:rollback` tags if either fails
- Only prunes images after smoke tests pass

```bash
ssh ubuntu@185.225.232.65
cd ~/goodgo
./scripts/deploy-production.sh [image-tag]
```

### Database Rollback

Prisma does not support automatic down migrations. If a migration must be reverted:

1. Identify the migration in `prisma/migrations/`
2. Write a manual SQL rollback script
3. Apply via `psql` or a migration tool
4. Update the `_prisma_migrations` table so that it no longer records the reverted migration as applied
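
Steps 3 and 4 can be combined so that the schema change and the bookkeeping either both apply or both fail. This sketch makes several assumptions: `revert_migration` is a hypothetical helper, the rollback SQL file is one you have written and reviewed by hand, and deleting the row from `_prisma_migrations` is the update you want (Prisma will then treat the migration as unapplied and re-run it on the next deploy unless its folder is also removed).

```bash
# Hypothetical helper: revert_migration <migration-name> <rollback.sql>
# Assumes DATABASE_URL is set and the SQL file was written and reviewed by hand.
revert_migration() {
  local name="$1" sql_file="$2"
  [ -f "$sql_file" ] || { echo "no such file: $sql_file" >&2; return 1; }
  # One transaction: the rollback SQL and the bookkeeping apply together or not at all.
  psql "$DATABASE_URL" -v ON_ERROR_STOP=1 --single-transaction \
    -v migration="$name" -f "$sql_file" -f - <<'SQL'
DELETE FROM _prisma_migrations WHERE migration_name = :'migration';
SQL
}
```

Note that `psql` rejects connection-string query parameters it does not recognize, so a Prisma-style suffix such as `?schema=public` would need to be stripped from `DATABASE_URL` first.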

Always test migrations against a staging database before production deployment.

### Post-Rollback Checklist

- [ ] Verify health endpoints respond: `GET /health`, `GET /ready`
- [ ] Run smoke tests manually: `./scripts/smoke-test.sh <url>`
- [ ] Check application logs: `docker compose -f docker-compose.prod.yml logs --tail=100 api web`
- [ ] Confirm Grafana dashboards show normal metrics
- [ ] Notify the team via Slack about the rollback and root cause