goodgo-platform/docs/backup-restore.md
Ho Ngoc Hai 6a40ab4555 docs: comprehensive backup & DR documentation
- Added RTO/RPO targets (RPO ≤24h, RTO ≤30m)
- Added Redis backup/restore procedures (volume + BGSAVE)
- Added Typesense backup/restore + rebuild from source
- Added DR runbook: DB failure, service crash, host failure, data corruption
- Restructured doc with clear sections per service

Ref: TEC-1572

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-09 08:41:40 +07:00

Backup & Disaster Recovery

RTO / RPO Targets

| Metric | Target | Notes |
|---|---|---|
| RPO (Recovery Point Objective) | ≤ 24 hours | Daily backups at 02:00 UTC; worst case is a full day of data loss |
| RTO (Recovery Time Objective) | ≤ 30 minutes | Restore from local volume backup; longer if retrieving from off-site |

To reduce RPO further, consider WAL archiving for PostgreSQL (continuous point-in-time recovery) and more frequent backup schedules.
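
One way to keep the RPO target honest is to alert when the newest backup is older than 24 hours. The sketch below is illustrative only: the function name `within_rpo` and the choice to pass Unix epoch timestamps are assumptions, not part of the existing backup scripts.

```shell
# Hypothetical helper (not part of the existing scripts).
# Succeeds if a backup taken at $1 (Unix epoch seconds) is no older than
# $3 hours (default 24, the RPO target) when checked at time $2.
within_rpo() {
  backup_epoch="$1"
  now_epoch="$2"
  max_hours="${3:-24}"
  age_seconds=$(( now_epoch - backup_epoch ))
  [ "$age_seconds" -le $(( max_hours * 3600 )) ]
}
```

With GNU `stat`, the modification time of a backup file could be supplied as `"$(stat -c %Y <backup-file>)"` and the current time as `"$(date +%s)"`; a non-zero exit status would then mean the RPO target has been missed.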


PostgreSQL Backup

Overview

Automated daily PostgreSQL backups run inside the pg-backup Docker container using pg_dump in custom format with compression enabled. Backups are stored in the pg_backups Docker volume.

Backup Configuration

| Setting | Default | Environment Variable |
|---|---|---|
| Schedule | Daily at 02:00 UTC | Cron in pg-backup service |
| Retention | 7 days | BACKUP_RETENTION_DAYS |
| Format | Custom (pg_dump --format=custom) | |
| Compression | Level 6 | |
| Storage | pg_backups Docker volume | |
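
The contents of /scripts/pg-backup.sh are not shown in this document. As a rough illustration of how a retention setting like BACKUP_RETENTION_DAYS is commonly enforced, the following sketch prunes old archives with `find`. The function name `prune_old_backups` is hypothetical, and the real script may work differently.

```shell
# Hypothetical sketch of a retention step; the real pg-backup.sh may differ.
# Deletes goodgo_*.sql.gz files directly inside $1 that are older than the
# retention window and prints each deleted path.
# Note: find rounds file age down to whole days, so "-mtime +7" matches
# files that are at least 8 days old.
prune_old_backups() {
  backup_dir="$1"
  retention_days="${2:-${BACKUP_RETENTION_DAYS:-7}}"
  find "$backup_dir" -maxdepth 1 -type f -name 'goodgo_*.sql.gz' \
    -mtime +"$retention_days" -print -delete
}
```

Inside the container this would be called as `prune_old_backups /backups`, falling back to 7 days when BACKUP_RETENTION_DAYS is unset.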

Listing Backups

docker exec goodgo-pg-backup ls -lh /backups/

Manual Backup

docker exec goodgo-pg-backup /scripts/pg-backup.sh

Restore Procedure

1. Identify the backup to restore

docker exec goodgo-pg-backup ls -lht /backups/

2. Stop application services

docker compose stop ai-services
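# Also stop any NestJS API processes that connect to the database
# (Scenario 1 in the runbook uses: docker compose stop api ai-services)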

3. Run restore

docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz

The restore script will:

  • Terminate active database connections
  • Drop and recreate the database
  • Restore from the selected backup

4. Verify restore

docker exec goodgo-postgres psql -U goodgo -d goodgo -c '\dt'
docker exec goodgo-postgres psql -U goodgo -d goodgo -c 'SELECT count(*) FROM "User";'

5. Run Prisma migrations (if needed)

pnpm prisma migrate deploy

6. Restart services

docker compose up -d

Backup Verification

Check the backup log:

docker exec goodgo-pg-backup cat /var/log/pg-backup.log

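Check that a backup archive is readable without restoring it (this lists the archive's table of contents; it does not validate the table data itself):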

docker exec goodgo-pg-backup pg_restore --list /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
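
A cheap additional pre-check is the file signature: archives written by pg_dump in custom format begin with the five bytes `PGDMP`. This is worth checking here because the `.sql.gz` suffix suggests gzipped SQL, while the configuration above specifies custom format. The helper name `is_pg_custom_dump` is hypothetical.

```shell
# Hypothetical helper (not part of the existing scripts).
# Succeeds if the file starts with the "PGDMP" signature that pg_dump
# writes at the beginning of custom-format archives.
# od is used so that binary bytes never pass through a shell variable raw.
is_pg_custom_dump() {
  signature="$(head -c 5 "$1" 2>/dev/null | od -An -c | tr -d ' \n')"
  [ "$signature" = "PGDMP" ]
}
```

A failing check does not prove the backup is unusable (it may simply be in a different format), but it is a signal to inspect the file before relying on pg_restore.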

Redis Backup & Restore

Redis is configured with AOF persistence (--appendonly yes). Data is stored in the redis_data Docker volume.

Manual Snapshot

# Trigger an RDB snapshot
docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE

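# Check completion: LASTSAVE returns the Unix timestamp of the last
# successful save, so the value changes once BGSAVE has finished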
docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" LASTSAVE
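
Because LASTSAVE only reports a timestamp, waiting for a snapshot means polling until that value changes. A minimal sketch, assuming the container name and password variable used above; the function names `redis_cmd` and `wait_for_bgsave` are hypothetical.

```shell
# Hypothetical wrapper around redis-cli; adjust to the environment.
redis_cmd() {
  docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" "$@"
}

# Trigger BGSAVE and poll LASTSAVE until it changes or $1 seconds pass
# (default 60). Returns 0 when the save completed, 1 on timeout.
# Caveat: LASTSAVE has one-second resolution, so a save that finishes in
# the same second as the previous one would not be detected.
wait_for_bgsave() {
  timeout="${1:-60}"
  waited=0
  before="$(redis_cmd LASTSAVE)"
  redis_cmd BGSAVE >/dev/null
  while [ "$waited" -lt "$timeout" ]; do
    now="$(redis_cmd LASTSAVE)"
    if [ "$now" != "$before" ]; then
      return 0
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 1
}
```

Calling `wait_for_bgsave 120 || echo "snapshot did not finish in time"` before copying the volume would avoid archiving a half-written dump.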

Backup the Volume

# Stop Redis to ensure consistent snapshot
docker compose stop redis

# Copy volume data to a backup location
docker run --rm -v goodgo-platform-ai_redis_data:/data -v "$(pwd)/backups:/backup" \
  alpine tar czf /backup/redis_$(date +%Y%m%d_%H%M%S).tar.gz -C /data .

docker compose start redis

Restore Redis

docker compose stop redis

# Clear existing data and restore
docker run --rm -v goodgo-platform-ai_redis_data:/data -v "$(pwd)/backups:/backup" \
  alpine sh -c "rm -rf /data/* && tar xzf /backup/redis_YYYYMMDD_HHMMSS.tar.gz -C /data"

docker compose start redis

Note: Redis is used as a cache with allkeys-lru eviction. Full data loss is non-critical — the API will repopulate cache entries on demand. Restore is only necessary if session data or queue state must be preserved.
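
The restore command above deletes the existing volume contents before extracting, so a corrupt archive would leave the volume empty. A small pre-check, sketched here with the hypothetical helper `archive_is_restorable`, confirms the tarball can be read end to end before anything is removed.

```shell
# Hypothetical helper (not part of the existing scripts).
# Succeeds only if $1 is a non-empty gzip-compressed tar archive whose
# full member listing can be read without errors.
archive_is_restorable() {
  archive="$1"
  [ -s "$archive" ] || return 1
  tar -tzf "$archive" >/dev/null 2>&1
}
```

Running it on the host against the file in ./backups before the restore step is enough; the same check would apply to any of the .tar.gz volume archives.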


Typesense Backup & Restore

Typesense data is stored in the typesense_data Docker volume.

Create Snapshot

# Typesense built-in snapshot API (the endpoint requires a POST request)
curl -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  "http://localhost:8108/operations/snapshot?snapshot_path=/data/snapshots/$(date +%Y%m%d)"

Backup the Volume

docker compose stop typesense

docker run --rm -v goodgo-platform-ai_typesense_data:/data -v "$(pwd)/backups:/backup" \
  alpine tar czf /backup/typesense_$(date +%Y%m%d_%H%M%S).tar.gz -C /data .

docker compose start typesense

Restore Typesense

docker compose stop typesense

docker run --rm -v goodgo-platform-ai_typesense_data:/data -v "$(pwd)/backups:/backup" \
  alpine sh -c "rm -rf /data/* && tar xzf /backup/typesense_YYYYMMDD_HHMMSS.tar.gz -C /data"

docker compose start typesense

Rebuild from Source

If backup is unavailable, Typesense can be rebuilt by re-indexing from PostgreSQL:

# After Typesense restarts with empty data:
pnpm run typesense:reindex

Disaster Recovery Runbook

Scenario 1: PostgreSQL Failure (container crash or data corruption)

  1. Assess: docker logs goodgo-postgres — check for corruption or OOM
  2. Stop services: docker compose stop api ai-services
  3. Attempt restart: docker compose restart postgres
  4. If restart fails (data corruption):
    • docker compose stop postgres
    • Remove the stopped container so the volume is no longer in use: docker compose rm -f postgres
    • Remove volume: docker volume rm goodgo-platform-ai_postgres_data
    • Recreate: docker compose up -d postgres
    • Restore from backup: docker exec goodgo-pg-backup /scripts/pg-restore.sh /backups/<latest>.sql.gz
    • Verify: docker exec goodgo-postgres psql -U goodgo -d goodgo -c '\dt'
    • Run migrations: pnpm prisma migrate deploy
  5. Restart services: docker compose up -d
  6. Verify: Check API health at http://localhost:3000/health

Expected RTO: ~15 minutes (local backup), ~30 minutes (off-site backup)
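
The final verification step can be scripted so that recovery time is measured rather than eyeballed. The sketch below polls a health URL until it responds or a timeout expires. The names `probe` and `wait_healthy` are hypothetical, and treating any successful HTTP status as healthy is an assumption.

```shell
# Hypothetical probe: succeeds when the URL answers with a non-error
# HTTP status within 5 seconds.
probe() {
  curl -fsS -o /dev/null --max-time 5 "$1"
}

# Poll $1 every $3 seconds (default 2) for up to $2 seconds (default 120).
# Returns 0 as soon as the probe succeeds, 1 if the timeout is reached.
wait_healthy() {
  url="$1"
  timeout="${2:-120}"
  interval="${3:-2}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if probe "$url"; then
      return 0
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  return 1
}
```

For example, `wait_healthy http://localhost:3000/health 300` after `docker compose up -d` would block until the API reports healthy or five minutes have passed.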

Scenario 2: Service Crash (API, AI-services, or Web)

  1. Check logs: docker compose logs --tail=100 <service>
  2. Restart service: docker compose restart <service>
  3. If crash loops:
    • Check resource limits: docker stats
    • Check environment variables: docker compose config
    • Roll back to previous image tag if recent deployment caused the issue
  4. Verify: Check health endpoints

Expected RTO: ~5 minutes

Scenario 3: Full Host Failure

  1. Provision new host with Docker + Docker Compose
  2. Clone the repo and set up .env from secrets manager
  3. Pull images: docker compose pull
  4. Restore PostgreSQL backup from off-site storage
  5. Start all services: docker compose up -d
  6. Trigger Typesense reindex: pnpm run typesense:reindex
  7. Verify all health endpoints

Expected RTO: ~60 minutes, depending on backup transfer speed. This exceeds the ≤ 30 minute RTO target, which assumes a restore from a local volume backup.

Scenario 4: Data Corruption (application-level)

  1. Identify the scope: which tables/records are affected
  2. If limited: restore specific tables using pg_restore --table=<name> (this restores only the named table and does not restore other objects that the table depends on)
  3. If widespread: full restore from last known good backup
  4. Invalidate Redis cache: docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" FLUSHALL
  5. Reindex Typesense: pnpm run typesense:reindex
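
For step 3, the "last known good" backup is the newest one taken before the corruption started. Since backup names follow the goodgo_YYYYMMDD_HHMMSS.sql.gz pattern, it can be selected by timestamp alone. A minimal sketch; the function name `last_good_backup` and the cutoff argument format are assumptions.

```shell
# Hypothetical helper (not part of the existing scripts).
# Prints the newest goodgo_YYYYMMDD_HHMMSS.sql.gz file in directory $1
# whose timestamp is strictly earlier than the cutoff $2, given in the
# same YYYYMMDD_HHMMSS form. Returns 1 if no backup qualifies.
last_good_backup() {
  dir="$1"
  cutoff="$(printf '%s' "$2" | tr -d '_')"
  best=""
  # The glob expands in sorted order, so the last match kept is the newest.
  for f in "$dir"/goodgo_*.sql.gz; do
    [ -e "$f" ] || continue
    stamp="${f##*/goodgo_}"
    stamp="$(printf '%s' "${stamp%.sql.gz}" | tr -d '_')"
    if [ "$stamp" -lt "$cutoff" ] 2>/dev/null; then
      best="$f"
    fi
  done
  [ -n "$best" ] && printf '%s\n' "$best"
}
```

If corruption is traced to a deployment at 15:00 UTC on 2 April 2026, `last_good_backup /backups 20260402_150000` would select the 02:00 backup from that morning, provided it is still within the retention window.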

Log Aggregation

Logs are aggregated via Loki + Promtail and viewable in Grafana: