feat(devops): add staging auto-deploy pipeline on develop branch

- Trigger deploy workflow on push to `develop` branch (in addition to `master`)
- Add `staging-latest` Docker image tag for develop branch builds
- Add `rollback-staging` job: auto-reverts to previous images on smoke test failure
- Add Slack success notification for staging deploys (previously only failure was notified)
- Record pre-deploy image digests for rollback capability
- Update deployment docs with CI/CD pipeline details, rollback procedures, and required secrets

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-11 01:18:37 +07:00
parent 0593d40098
commit 64c6074735
2 changed files with 224 additions and 11 deletions
--- a/docs/deployment.md
+++ b/docs/deployment.md
@@ -214,11 +214,116 @@ docker run -p 8000:8000 --env-file ../../.env goodgo-ai-services
        └────────┘  └────────┘  └────────────┘
 ```

+## CI/CD Pipeline
+
+### Branch Strategy
+
+| Branch | Deploy Target | Trigger | Notes |
+|--------|--------------|---------|-------|
+| `develop` | Staging | Auto (push) | Every merge to `develop` auto-deploys to staging |
+| `master` | Staging | Auto (push) | Master push also deploys to staging for verification |
+| Manual | Staging/Production | `workflow_dispatch` | Manual trigger via GitHub Actions UI |
+
+### Staging Auto-Deploy Flow
+
+```
+Push to develop → Build images → Deploy to staging → Smoke tests → ✅ / Rollback
+```
+
+1. **Build**: Docker images for API, Web, and AI Services are built and pushed to GHCR with the `staging-latest` tag
+2. **Deploy**: Images are pulled and services are updated via rolling restart (zero-downtime)
+3. **Verify**: Health check polls `$STAGING_URL/health` for up to 100 seconds
+4. **Smoke test**: `scripts/smoke-test.sh` runs against the staging URL, checking health probes, core API endpoints, search, and auth
+5. **Notify**: Slack notification on success or failure
+6. **Rollback**: If smoke tests fail, automatic rollback restores previous container images
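The verify step in the list above can be sketched as a small polling loop. This is an illustration under stated assumptions, not the workflow's actual step: the function name `wait_for_health` is hypothetical, and the split of the budget into 10 probes with a 5-second timeout and a 5-second pause (about 95 seconds worst case) is assumed. Only the `$STAGING_URL/health` endpoint and the "up to 100 seconds" limit come from the docs.

```shell
# Hypothetical helper (not taken from the workflow): poll a health URL until
# it answers successfully or the retry budget is exhausted.
# Assumed defaults: 10 probes, 5 s timeout each, 5 s pause between probes.
wait_for_health() {
  local url="$1" attempts="${2:-10}" interval="${3:-5}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -s hides progress output; -f makes curl exit non-zero on HTTP errors (>= 400).
    if curl -sf --max-time 5 "$url" >/dev/null; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    # Pause before the next probe, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  echo "unhealthy after ${attempts} attempt(s)" >&2
  return 1
}
```

A call such as `wait_for_health "$STAGING_URL/health"` returns non-zero when the service never becomes healthy, which lets a pipeline step fail and hand over to rollback.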
+
+### Notifications
+
+Deploy status notifications are sent to Slack via the `SLACK_WEBHOOK_URL` secret:
+
+| Event | Channel | Content |
+|-------|---------|---------|
+| Staging smoke tests pass | Slack | ✅ Commit SHA, branch, link to run |
+| Staging smoke tests fail | Slack | 🚨 Commit SHA, branch, link to run |
+| Staging rollback triggered | Slack | ⚠️ Commit SHA, reason, link to run |
+| Production deploy success | Slack | ✅ Commit SHA, branch |
+| Production rollback triggered | Slack | ⚠️ Commit SHA, reason, link to run |
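As a rough sketch of how one of the notifications in the table might be sent, the snippet below builds a minimal Slack incoming-webhook body and posts it with `curl`. The function names `slack_payload` and `notify_slack` and the message wording are hypothetical; the workflow may well use a dedicated action instead. The sketch assumes the arguments contain no double quotes or backslashes, since it performs no JSON escaping.

```shell
# Hypothetical helper: build the JSON body for a Slack incoming webhook.
# Assumes arguments are free of double quotes and backslashes (true for
# commit SHAs and run URLs), because no JSON escaping is done here.
slack_payload() {
  local icon="$1" status="$2" sha="$3" branch="$4" run_url="$5"
  printf '{"text":"%s Staging deploy %s: %s on %s (%s)"}\n' \
    "$icon" "$status" "$sha" "$branch" "$run_url"
}

# Hypothetical sender: POST the payload to the webhook in SLACK_WEBHOOK_URL.
# `--data @-` tells curl to read the request body from standard input.
notify_slack() {
  slack_payload "$@" | curl -sf -X POST \
    -H 'Content-Type: application/json' \
    --data @- "$SLACK_WEBHOOK_URL"
}
```

Keeping payload construction separate from the HTTP call makes the message format easy to check without contacting Slack.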
+
+### Required Secrets
+
+| Secret | Environment | Description |
+|--------|-------------|-------------|
+| `STAGING_HOST` | staging | Staging server hostname/IP |
+| `STAGING_USER` | staging | SSH user for staging deploys |
+| `STAGING_SSH_KEY` | staging | SSH private key for staging |
+| `STAGING_URL` | staging | Staging base URL (e.g., `https://staging.goodgo.vn`) |
+| `PRODUCTION_HOST` | production | Production server hostname/IP |
+| `PRODUCTION_USER` | production | SSH user for production deploys |
+| `PRODUCTION_SSH_KEY` | production | SSH private key for production |
+| `PRODUCTION_URL` | production | Production base URL |
+| `SLACK_WEBHOOK_URL` | both | Slack incoming webhook URL |
+
 ## Rollback

-### Application Rollback
+### Automatic Rollback (Staging)

-Deploy the previous container image or build artifact. The API and Web are stateless — no rollback-specific steps needed.
+The staging pipeline includes automatic rollback when smoke tests fail:
+
+1. **Pre-deploy**: Current container image digests are recorded before deployment
+2. **Smoke test failure**: If `scripts/smoke-test.sh` exits non-zero, the `rollback-staging` job triggers
+3. **Rollback execution**: Containers are stopped and restarted with previous images
+4. **Verification**: Health check confirms the rollback succeeded
+5. **Notification**: Slack notification reports the rollback with links to the failed run
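The pre-deploy recording step in the list above could look roughly like the following. This is a sketch under assumptions, not the pipeline's code: the function name `record_image_ids` is hypothetical, it records local image IDs rather than registry digests, and it assumes one container per service. The compose file name and service names are the ones used in the manual rollback commands.

```shell
# Hypothetical helper: print "service=image-id" for each running service so
# the values can be saved before a deploy and consulted during a rollback.
record_image_ids() {
  local compose_file="$1"
  shift
  local svc cid
  for svc in "$@"; do
    # `docker compose ps -q` prints the container ID(s) of the service;
    # only the first is used (assumes a single container per service).
    cid=$(docker compose -f "$compose_file" ps -q "$svc" | head -n 1)
    if [ -n "$cid" ]; then
      # .Image is the ID (sha256:...) of the image the container was created from.
      echo "${svc}=$(docker inspect --format '{{.Image}}' "$cid")"
    fi
  done
}
```

For example, `record_image_ids docker-compose.prod.yml api web ai-services > pre-deploy-images.txt` would write one line per running service; the output file name is only illustrative.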
+
+### Automatic Rollback (Production)
+
+Same mechanism as staging — smoke test failure triggers `rollback-production`.
+
+### Manual Rollback
+
+To manually roll back a staging or production deployment:
+
+#### Option 1: Re-deploy a known-good commit
+
+```bash
+# Trigger a deploy of a specific commit via GitHub Actions
+gh workflow run deploy.yml \
+  --ref <known-good-commit-or-branch> \
+  -f environment=staging
+```
+
+#### Option 2: SSH rollback (emergency)
+
+```bash
+# SSH into the staging/production server
+ssh deploy@<host>
+
+cd ~/goodgo
+
+# Stop the current services
+docker compose -f docker-compose.prod.yml down api web ai-services
+
+# Restart with the previous image layers still cached locally
+docker compose -f docker-compose.prod.yml up -d --wait api web ai-services
+
+# Verify health
+curl -sf http://localhost:3001/health
+```
+
+#### Option 3: Pin to a specific image tag
+
+```bash
+ssh deploy@<host>
+cd ~/goodgo
+
+# Set IMAGE_TAG to a known-good SHA
+export IMAGE_TAG=<known-good-commit-sha>
+export REGISTRY_URL=ghcr.io/<owner>
+
+# Pull and restart with the pinned tag
+docker compose -f docker-compose.prod.yml pull api web ai-services
+docker compose -f docker-compose.prod.yml up -d --no-deps --wait api web ai-services
+```

 ### Database Rollback

@@ -230,3 +335,11 @@ Prisma does not support automatic down migrations. If a migration must be revert
 4. Update `_prisma_migrations` table

 Always test migrations against a staging database before production deployment.
+
+### Post-Rollback Checklist
+
+- [ ] Verify health endpoints respond: `GET /health`, `GET /ready`
+- [ ] Run smoke tests manually: `./scripts/smoke-test.sh <url>`
+- [ ] Check application logs: `docker compose -f docker-compose.prod.yml logs --tail=100 api web`
+- [ ] Confirm Grafana dashboards show normal metrics
+- [ ] Notify the team via Slack about the rollback and root cause