Backup & Disaster Recovery
RTO / RPO Targets
| Metric | Target | Notes |
|---|---|---|
| RPO (Recovery Point Objective) | ≤ 24 hours | Daily backups at 02:00 UTC; worst case is a full day of data loss |
| RTO (Recovery Time Objective) | ≤ 30 minutes | Restore from local volume backup; longer if retrieving from off-site |
To reduce RPO further, consider WAL archiving for PostgreSQL (continuous point-in-time recovery) and more frequent backup schedules.
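The 24-hour RPO only holds while the daily job keeps producing backups. A minimal sketch of a freshness check follows, assuming it runs wherever the backup directory is mounted and that backup files start with `goodgo_`. The function name `backup_is_fresh` is hypothetical and not part of the existing scripts.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if DIR holds a backup modified within
# the last MAX_HOURS hours (default 24, matching the RPO target).
# Usage: backup_is_fresh DIR [MAX_HOURS]
backup_is_fresh() {
    dir=$1
    max_minutes=$(( ${2:-24} * 60 ))
    # -mmin -N matches files modified less than N minutes ago.
    recent=$(find "$dir" -type f -name 'goodgo_*' -mmin "-$max_minutes" 2>/dev/null | head -n 1)
    # Any match means the newest backup is inside the RPO window.
    [ -n "$recent" ]
}
```

A monitoring job could call `backup_is_fresh /backups 24 || echo "RPO at risk"` and raise an alert on failure.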
PostgreSQL Backup
Overview
Automated daily PostgreSQL backups run inside the pg-backup Docker container using pg_dump in custom format with compression. Backups are stored in the pg_backups Docker volume.
Backup Configuration
| Setting | Default | Environment Variable |
|---|---|---|
| Schedule | Daily at 02:00 UTC | Cron in pg-backup service |
| Retention | 7 days | BACKUP_RETENTION_DAYS |
| Format | Custom (pg_dump --format=custom) | — |
| Compression | Level 6 | — |
| Storage | pg_backups Docker volume | — |
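The retention setting implies a pruning step after each backup. The actual pg-backup.sh is not shown here, so the following is only a sketch of how such a step might look, assuming backup files start with `goodgo_`. The function name `prune_old_backups` is hypothetical.

```shell
#!/bin/sh
# Hypothetical pruning step; the real pg-backup.sh may work differently.
# Usage: prune_old_backups DIR [DAYS]
prune_old_backups() {
    dir=$1
    # Fall back to BACKUP_RETENTION_DAYS, then to the documented default of 7.
    days=${2:-${BACKUP_RETENTION_DAYS:-7}}
    # -mtime +N matches files whose age, rounded down to whole days, exceeds N.
    # Each deleted path is printed so the run leaves a trace in the log.
    find "$dir" -type f -name 'goodgo_*' -mtime "+$days" -print -delete
}
```

Restricting the match to the `goodgo_*` prefix keeps unrelated files in the same directory from being removed.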
Listing Backups
docker exec goodgo-pg-backup ls -lh /backups/
Manual Backup
docker exec goodgo-pg-backup /scripts/pg-backup.sh
Restore Procedure
1. Identify the backup to restore
docker exec goodgo-pg-backup ls -lht /backups/
2. Stop application services
docker compose stop ai-services
# Stop any NestJS API processes
3. Run restore
docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
The restore script will:
- Terminate active database connections
- Drop and recreate the database
- Restore from the selected backup
4. Verify restore
docker exec goodgo-postgres psql -U goodgo -d goodgo -c '\dt'
docker exec goodgo-postgres psql -U goodgo -d goodgo -c 'SELECT count(*) FROM "User";'
5. Run Prisma migrations (if needed)
pnpm prisma migrate deploy
6. Restart services
docker compose up -d
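Step 1 of the restore procedure relies on reading the listing by eye. As a sketch, the newest backup can also be picked from the file names, assuming they all follow the goodgo_YYYYMMDD_HHMMSS pattern used in the examples. The function name `latest_backup` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the newest backup name from a listing on stdin.
# Zero-padded timestamps in the name sort chronologically as plain text,
# so no file metadata is needed.
latest_backup() {
    grep -E '^goodgo_[0-9]{8}_[0-9]{6}\.' | LC_ALL=C sort | tail -n 1
}

# Example (not run here):
#   name=$(docker exec goodgo-pg-backup ls /backups/ | latest_backup)
#   docker exec -it goodgo-pg-backup /scripts/pg-restore.sh "/backups/$name"
```

Sorting by name rather than modification time keeps the result stable even if files were copied back from off-site storage with new timestamps.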
Backup Verification
Check the backup log:
docker exec goodgo-pg-backup cat /var/log/pg-backup.log
Verify backup integrity without restoring:
docker exec goodgo-pg-backup pg_restore --list /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
Redis Backup & Restore
Redis is configured with AOF persistence (--appendonly yes). Data is stored in the redis_data Docker volume.
Manual Snapshot
# Trigger an RDB snapshot
docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE
# Check completion: LASTSAVE prints the Unix time of the last successful save; repeat until the value changes
docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" LASTSAVE
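BGSAVE returns immediately, so a script has to poll LASTSAVE until the timestamp moves forward. The following is a minimal sketch of that loop. The function names `redis_lastsave` and `wait_for_save` are hypothetical, and the 60-second default timeout is an assumption.

```shell
#!/bin/sh
# Print the Unix time of the last successful save (kept separate so it can be stubbed).
redis_lastsave() {
    docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" --no-auth-warning LASTSAVE
}

# Hypothetical helper: wait until LASTSAVE exceeds BEFORE, polling once per second.
# Usage: wait_for_save BEFORE [TIMEOUT_SECONDS]
wait_for_save() {
    before=$1
    timeout=${2:-60}
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        now=$(redis_lastsave)
        # An empty reply (for example Redis unreachable) counts as "not saved yet".
        if [ "${now:-0}" -gt "$before" ]; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}

# Example (not run here):
#   before=$(redis_lastsave)
#   docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE
#   wait_for_save "$before" 120 && echo "snapshot complete"
```

LASTSAVE has one-second resolution, so a snapshot that finishes within the same second as the previous save would not be detected by this comparison.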
Backup the Volume
# Stop Redis to ensure consistent snapshot
docker compose stop redis
# Copy volume data to a backup location
docker run --rm -v goodgo-platform-ai_redis_data:/data -v $(pwd)/backups:/backup \
alpine tar czf /backup/redis_$(date +%Y%m%d_%H%M%S).tar.gz -C /data .
docker compose start redis
Restore Redis
docker compose stop redis
# Clear existing data and restore
docker run --rm -v goodgo-platform-ai_redis_data:/data -v $(pwd)/backups:/backup \
alpine sh -c "rm -rf /data/* && tar xzf /backup/redis_YYYYMMDD_HHMMSS.tar.gz -C /data"
docker compose start redis
Note: Redis is used as a cache with allkeys-lru eviction. Full data loss is non-critical because the API repopulates cache entries on demand. Restore is only necessary if session data or queue state must be preserved.
Typesense Backup & Restore
Typesense data is stored in the typesense_data Docker volume.
Create Snapshot
# Typesense built-in snapshot API (the endpoint expects a POST request)
curl -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
"http://localhost:8108/operations/snapshot?snapshot_path=/data/snapshots/$(date +%Y%m%d)"
Backup the Volume
docker compose stop typesense
docker run --rm -v goodgo-platform-ai_typesense_data:/data -v $(pwd)/backups:/backup \
alpine tar czf /backup/typesense_$(date +%Y%m%d_%H%M%S).tar.gz -C /data .
docker compose start typesense
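The Redis and Typesense volume backups follow the same stop, archive, start pattern. As a sketch, the steps could be wrapped in one function, assuming the volume is always named goodgo-platform-ai_<service>_data as in both examples. The function names `archive_name` and `backup_volume` are hypothetical.

```shell
#!/bin/sh
# Build a timestamped archive name such as redis_20240101_020000.tar.gz.
archive_name() {
    printf '%s_%s.tar.gz\n' "$1" "$(date +%Y%m%d_%H%M%S)"
}

# Hypothetical wrapper around the documented stop / tar / start sequence.
# Usage: backup_volume SERVICE   (for example: backup_volume typesense)
backup_volume() {
    service=$1
    name=$(archive_name "$service")
    mkdir -p "$(pwd)/backups"
    docker compose stop "$service" || return 1
    docker run --rm \
        -v "goodgo-platform-ai_${service}_data:/data" \
        -v "$(pwd)/backups:/backup" \
        alpine tar czf "/backup/$name" -C /data .
    status=$?
    # Restart the service even when the archive step failed.
    docker compose start "$service"
    return "$status"
}
```

Capturing the tar exit status before restarting means a failed archive is still reported to the caller while the service comes back up either way.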
Restore Typesense
docker compose stop typesense
docker run --rm -v goodgo-platform-ai_typesense_data:/data -v $(pwd)/backups:/backup \
alpine sh -c "rm -rf /data/* && tar xzf /backup/typesense_YYYYMMDD_HHMMSS.tar.gz -C /data"
docker compose start typesense
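Both volume restore procedures delete the existing data before extracting the archive, so a truncated or corrupt archive would leave the volume empty. A minimal sketch of a pre-check follows. The function name `archive_is_valid` is hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-check: succeed only if FILE is a non-empty, fully readable
# gzip-compressed tar archive. Listing the archive reads it end to end, so
# truncation and gzip checksum errors are caught before any data is deleted.
# Usage: archive_is_valid FILE
archive_is_valid() {
    [ -s "$1" ] && tar tzf "$1" >/dev/null 2>&1
}

# Example (not run here):
#   archive_is_valid backups/typesense_YYYYMMDD_HHMMSS.tar.gz || exit 1
#   docker compose stop typesense
```

Running the check on the host before stopping the service keeps the outage window limited to the restore itself.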
Rebuild from Source
If backup is unavailable, Typesense can be rebuilt by re-indexing from PostgreSQL:
# After Typesense restarts with empty data:
pnpm run typesense:reindex
Disaster Recovery Runbook
Scenario 1: PostgreSQL Failure (container crash or data corruption)
- Assess: docker logs goodgo-postgres (check for corruption or OOM)
- Stop services: docker compose stop api ai-services
- Attempt restart: docker compose restart postgres
- If restart fails (data corruption):
  - Stop the database: docker compose stop postgres
  - Remove the stopped container, which otherwise keeps the volume in use: docker compose rm -f postgres
  - Remove volume: docker volume rm goodgo-platform-ai_postgres_data
  - Recreate: docker compose up -d postgres
  - Restore from backup: docker exec goodgo-pg-backup /scripts/pg-restore.sh /backups/<latest>.sql.gz
  - Verify: docker exec goodgo-postgres psql -U goodgo -d goodgo -c '\dt'
  - Run migrations: pnpm prisma migrate deploy
- Restart services: docker compose up -d
- Verify: Check API health at http://localhost:3000/health
Expected RTO: ~15 minutes (local backup), ~30 minutes (off-site backup)
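The final verification step can be scripted so the runbook only proceeds once the API responds. The following is a minimal sketch, assuming curl is available on the host. The function names `probe_health` and `wait_until_healthy`, the 30-attempt default, and the 2-second interval are all assumptions.

```shell
#!/bin/sh
# Probe a URL once; fails on connection errors and on HTTP status 400 or above.
# Kept separate so it can be stubbed or swapped for another client.
probe_health() {
    curl -fsS -o /dev/null --max-time 5 "$1"
}

# Hypothetical helper: retry the probe until it succeeds or ATTEMPTS run out.
# Usage: wait_until_healthy URL [ATTEMPTS]
wait_until_healthy() {
    url=$1
    attempts=${2:-30}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if probe_health "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep 2
    done
    return 1
}

# Example (not run here):
#   docker compose up -d
#   wait_until_healthy http://localhost:3000/health 30 || echo "API did not recover"
```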
Scenario 2: Service Crash (API, AI-services, or Web)
- Check logs: docker compose logs --tail=100 <service>
- Restart service: docker compose restart <service>
- If crash loops:
  - Check resource limits: docker stats
  - Check environment variables: docker compose config
  - Roll back to previous image tag if recent deployment caused the issue
- Verify: Check health endpoints
Expected RTO: ~5 minutes
Scenario 3: Full Host Failure
- Provision new host with Docker + Docker Compose
- Clone the repo and set up .env from secrets manager
- Pull images: docker compose pull
- Restore PostgreSQL backup from off-site storage
- Start all services: docker compose up -d
- Trigger Typesense reindex: pnpm run typesense:reindex
- Verify all health endpoints
Expected RTO: ~60 minutes (depends on backup transfer speed). This exceeds the 30-minute RTO target, which assumes a restore from the local volume backup.
Scenario 4: Data Corruption (application-level)
- Identify the scope: which tables/records are affected
- If limited: restore specific tables using pg_restore --table=<name>
- If widespread: full restore from last known good backup
- Invalidate Redis cache: docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" FLUSHALL
- Reindex Typesense: pnpm run typesense:reindex
Log Aggregation
Logs are aggregated via Loki + Promtail and viewable in Grafana:
- Grafana: http://localhost:3002 (dashboard: "GoodGo - Logs")
- Loki: http://localhost:3100
- Log retention: 15 days (configured in monitoring/loki/loki-config.yml)