fix(deploy): tag rollback images before pull, prune after smoke test

Previously, `docker image prune` ran immediately after deploying new containers, potentially deleting the old images needed for rollback if smoke tests subsequently failed.

Now the deploy pipeline:
1. Tags current images as :rollback before pulling new versions
2. Only runs `docker image prune` after smoke tests pass
3. Uses explicit :rollback tags for rollback instead of relying on Docker layer cache (which is fragile)

Applied to:
- scripts/deploy-production.sh (manual deploy script)
- .github/workflows/deploy.yml (staging + production CI jobs)
- docs/deployment.md (updated rollback documentation)

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@@ -227,15 +227,17 @@ docker run -p 8000:8000 --env-file ../../.env goodgo-ai-services
 ### Staging Auto-Deploy Flow

 ```
-Push to develop → Build images → Deploy to staging → Smoke tests → ✅ / Rollback
+Push to develop → Build images → Tag rollback → Deploy to staging → Smoke tests → Cleanup / Rollback
 ```

 1. **Build**: Docker images for API, Web, and AI Services are built and pushed to GHCR with `staging-latest` tag
-2. **Deploy**: Images are pulled and services are updated via rolling restart (zero-downtime)
-3. **Verify**: Health check polls `$STAGING_URL/health` for up to 100 seconds
-4. **Smoke test**: `scripts/smoke-test.sh` runs against the staging URL, checking health probes, core API endpoints, search, and auth
-5. **Notify**: Slack notification on success or failure
-6. **Rollback**: If smoke tests fail, automatic rollback restores previous container images
+2. **Tag rollback**: Current running images are tagged as `:rollback` before new images are pulled
+3. **Deploy**: New images are pulled and services are updated via rolling restart (zero-downtime)
+4. **Verify**: Health check polls `$STAGING_URL/health` for up to 100 seconds
+5. **Smoke test**: `scripts/smoke-test.sh` runs against the staging URL, checking health probes, core API endpoints, search, and auth
+6. **Cleanup**: On success, `:rollback` tags are removed and `docker image prune` cleans up old layers
+7. **Notify**: Slack notification on success or failure
+8. **Rollback**: If smoke tests fail, automatic rollback restores the `:rollback` tagged images

 ### Notifications

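The **Verify** step in the hunk above polls a health endpoint for a bounded time. A minimal sketch of such a loop follows. The function name `wait_for_health` and the defaults (20 attempts, 5 seconds apart, approximating the documented 100-second budget) are assumptions for illustration, not values taken from the workflow file.

```bash
# Poll a health endpoint until it responds or the attempts run out.
# Usage: wait_for_health <url> [attempts] [delay_seconds]
# Hypothetical helper: the defaults below are assumed, not read from deploy.yml.
wait_for_health() {
  local url="$1" attempts="${2:-20}" delay="${3:-5}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -s: no progress output, -f: treat HTTP status >= 400 as a failure
    if curl -sf --max-time 5 "$url" > /dev/null; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Wait before the next attempt, but not after the last one
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "health check failed after $attempts attempts" >&2
  return 1
}
```

A deploy job could call it as `wait_for_health "$STAGING_URL/health" || exit 1`, so a service that never becomes healthy fails the run before the smoke tests start.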
@@ -265,19 +267,32 @@ Deploy status notifications are sent to Slack via `SLACK_WEBHOOK_URL` secret:

 ## Rollback

+### Rollback Safety Mechanism
+
+The deploy pipeline uses **explicit `:rollback` image tags** to guarantee safe rollbacks. Here's how it works:
+
+1. **Before pulling new images**: The current running images are tagged as `goodgo-api:rollback`, `goodgo-web:rollback`, and `goodgo-ai-services:rollback`
+2. **After pulling new images**: Services are updated with the new images via rolling restart
+3. **After smoke tests pass**: The `:rollback` tags are removed and `docker image prune` cleans up old layers
+4. **If smoke tests fail**: The `:rollback`-tagged images are used to restore the previous version
+
+This ensures that `docker image prune` never deletes the images needed for rollback, because:
+- Image pruning only happens **after** smoke tests pass
+- The `:rollback` tags keep the previous images pinned even if pruning were to run accidentally, since plain `docker image prune` only removes dangling (untagged) images
+
 ### Automatic Rollback (Staging)

 The staging pipeline includes automatic rollback when smoke tests fail:

-1. **Pre-deploy**: Current container image digests are recorded before deployment
+1. **Pre-deploy**: Current container images are tagged `:rollback` before new images are pulled
 2. **Smoke test failure**: If `scripts/smoke-test.sh` exits non-zero, the `rollback-staging` job triggers
-3. **Rollback execution**: Containers are stopped and restarted with previous images
+3. **Rollback execution**: Containers are stopped and restarted using the `:rollback`-tagged images
 4. **Verification**: Health check confirms the rollback succeeded
 5. **Notification**: Slack notification reports the rollback with links to the failed run

 ### Automatic Rollback (Production)

-Same mechanism as staging: smoke test failure triggers `rollback-production`.
+Same mechanism as staging: smoke test failure triggers `rollback-production` using the `:rollback`-tagged images.

 ### Manual Rollback

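The four steps of the Rollback Safety Mechanism in the hunk above can be sketched as a few shell functions. This is a minimal sketch under stated assumptions, not the contents of `scripts/deploy-production.sh`: the variables `SERVICES`, `COMPOSE_FILE`, `COMPOSE_TAG`, and `SMOKE_CMD` and all function names are hypothetical, and it assumes the compose file references images as `goodgo-<service>:<tag>`. Because `docker compose up` starts whatever image reference the compose file names, the rollback step here retags each `:rollback` image to that reference before restarting.

```bash
# Sketch only: every name below is an illustrative assumption.
SERVICES="api web ai-services"
COMPOSE_FILE="${COMPOSE_FILE:-docker-compose.prod.yml}"
COMPOSE_TAG="${COMPOSE_TAG:-latest}"            # tag the compose file is assumed to reference
SMOKE_CMD="${SMOKE_CMD:-scripts/smoke-test.sh}"

# Step 1: pin the image behind each running container as goodgo-<svc>:rollback.
tag_rollback_images() {
  local svc cid image_id
  for svc in $SERVICES; do
    cid="$(docker compose -f "$COMPOSE_FILE" ps -q "$svc" | head -n 1)"
    [ -n "$cid" ] || continue                   # service not running, nothing to pin
    image_id="$(docker inspect --format '{{.Image}}' "$cid")" || return 1
    docker tag "$image_id" "goodgo-${svc}:rollback" || return 1
  done
}

# Step 4: point the compose tag back at the pinned images and restart.
rollback() {
  local svc
  for svc in $SERVICES; do
    docker image inspect "goodgo-${svc}:rollback" > /dev/null 2>&1 || {
      echo "no rollback image for ${svc}" >&2
      return 1
    }
    docker tag "goodgo-${svc}:rollback" "goodgo-${svc}:${COMPOSE_TAG}" || return 1
  done
  docker compose -f "$COMPOSE_FILE" up -d --wait $SERVICES
}

# Step 3: drop the pins and prune; only called after smoke tests pass.
cleanup_after_success() {
  local svc
  for svc in $SERVICES; do
    docker image rm "goodgo-${svc}:rollback" > /dev/null 2>&1 || true
  done
  docker image prune -f
}

deploy() {
  tag_rollback_images || return 1
  docker compose -f "$COMPOSE_FILE" pull $SERVICES || return 1
  # Step 2: roll out the new images, then gate cleanup on the smoke tests.
  if docker compose -f "$COMPOSE_FILE" up -d --wait $SERVICES && $SMOKE_CMD; then
    cleanup_after_success
  else
    rollback
    return 1
  fi
}
```

With these functions sourced, `deploy` returns non-zero whenever it had to roll back, so a CI job can mark the run as failed while the previous version keeps serving.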
@@ -292,24 +307,30 @@ gh workflow run deploy.yml \
   -f environment=staging
 ```

-#### Option 2: SSH rollback (emergency)
+#### Option 2: SSH rollback using :rollback tags (fastest)

 ```bash
 # SSH into the staging/production server
 ssh deploy@<host>

 cd ~/goodgo

-# Stop the current services
-docker compose -f docker-compose.prod.yml down api web ai-services
+# Stop current services
+docker compose -f docker-compose.prod.yml stop api web ai-services

-# Restart with the previous image layers still cached locally
+# Verify :rollback images exist
+docker image inspect goodgo-api:rollback > /dev/null 2>&1 && echo "API rollback available"
+docker image inspect goodgo-web:rollback > /dev/null 2>&1 && echo "Web rollback available"
+docker image inspect goodgo-ai-services:rollback > /dev/null 2>&1 && echo "AI rollback available"
+
+# Restart services (compose picks up cached/rollback images)
 docker compose -f docker-compose.prod.yml up -d --wait api web ai-services

 # Verify health
-curl -sf http://localhost:3001/health
+curl -sf http://localhost:3001/health && echo "Rollback successful"
 ```

+> **Note:** The `:rollback` tags are only available until the next successful deploy cleans them up. If you need to roll back to an older version, use Option 3 below.
+
 #### Option 3: Pin to a specific image tag

 ```bash
@@ -325,6 +346,20 @@ docker compose -f docker-compose.prod.yml pull api web ai-services
 docker compose -f docker-compose.prod.yml up -d --no-deps --wait api web ai-services
 ```

+#### Option 4: Use deploy-production.sh (built-in rollback)
+
+The manual deploy script (`scripts/deploy-production.sh`) has integrated rollback support:
+- Automatically tags `:rollback` images before pulling
+- Runs health checks and smoke tests
+- Rolls back automatically using the `:rollback` tags if either fails
+- Only prunes images after smoke tests pass
+
+```bash
+ssh ubuntu@185.225.232.65
+cd ~/goodgo
+./scripts/deploy-production.sh [image-tag]
+```
+
 ### Database Rollback

 Prisma does not support automatic down migrations. If a migration must be reverted: