Ho Ngoc Hai 6a40ab4555 docs: comprehensive backup & DR documentation

- Added RTO/RPO targets (RPO ≤24h, RTO ≤30m)
- Added Redis backup/restore procedures (volume + BGSAVE)
- Added Typesense backup/restore + rebuild from source
- Added DR runbook: DB failure, service crash, host failure, data corruption
- Restructured doc with clear sections per service

Ref: TEC-1572

Co-Authored-By: Paperclip <noreply@paperclip.ing>

2026-04-09 08:41:40 +07:00

Backup & Disaster Recovery

RTO / RPO Targets

Metric	Target	Notes
RPO (Recovery Point Objective)	≤ 24 hours	Daily backups at 02:00 UTC; worst case is a full day of data loss
RTO (Recovery Time Objective)	≤ 30 minutes	Restore from local volume backup; longer if retrieving from off-site

To reduce RPO further, consider WAL archiving for PostgreSQL (continuous point-in-time recovery) and more frequent backup schedules.
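As a rough illustration of the RPO target, the check below compares the age of a backup against a 24-hour limit. The function name `backup_within_rpo` and its arguments are hypothetical; this is a sketch, not part of the existing backup tooling.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a backup's age is within the RPO.
# $1 = backup modification time (Unix epoch seconds)
# $2 = current time (Unix epoch seconds)
# $3 = RPO in hours (defaults to the 24-hour target)
backup_within_rpo() {
  backup_epoch=$1
  now_epoch=$2
  rpo_hours=${3:-24}
  age_seconds=$((now_epoch - backup_epoch))
  [ "$age_seconds" -le $((rpo_hours * 3600)) ]
}

# Example (assumes a `stat` that supports `-c %Y` where the backup file is visible):
#   backup_within_rpo "$(stat -c %Y /backups/<file>)" "$(date +%s)" \
#     || echo "newest backup is older than the RPO" >&2
```

Passing both timestamps as arguments keeps the comparison independent of where the backup file lives, so the same check could be fed from a monitoring job.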

PostgreSQL Backup

Overview

Automated daily PostgreSQL backups run inside the pg-backup service (container goodgo-pg-backup) using pg_dump in custom format with compression. Backups are stored in the pg_backups Docker volume.

Backup Configuration

Setting	Default	Environment Variable
Schedule	Daily at 02:00 UTC	Cron in `pg-backup` service
Retention	7 days	`BACKUP_RETENTION_DAYS`
Format	Custom (`pg_dump --format=custom`)	—
Compression	Level 6	—
Storage	`pg_backups` Docker volume	—
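The settings in the table could translate into a dump-and-prune cycle along the following lines. This is only a sketch: the contents of /scripts/pg-backup.sh are not shown in this document, and the function name `prune_old_backups` is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch; the real /scripts/pg-backup.sh may differ.

# Delete backups older than the retention window and print what was removed.
# $1 = backup directory
# $2 = retention in days (defaults to BACKUP_RETENTION_DAYS, then 7)
# Note: find's `-mtime +N` matches files strictly more than N whole days old.
prune_old_backups() {
  backup_dir=$1
  retention_days=${2:-${BACKUP_RETENTION_DAYS:-7}}
  find "$backup_dir" -type f -name 'goodgo_*.sql.gz' \
    -mtime +"$retention_days" -print -delete
}

# The dump step would look roughly like this (connection settings such as
# the host are omitted; user and database names are taken from the
# verification commands in this document):
#   pg_dump --format=custom --compress=6 -U goodgo -d goodgo \
#     --file="/backups/goodgo_$(date +%Y%m%d_%H%M%S).sql.gz"
#   prune_old_backups /backups
```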

Listing Backups

docker exec goodgo-pg-backup ls -lh /backups/

Manual Backup

docker exec goodgo-pg-backup /scripts/pg-backup.sh

Restore Procedure

1. Identify the backup to restore

docker exec goodgo-pg-backup ls -lht /backups/
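When the newest backup is the one to restore, the selection can be scripted. The helper below is a sketch under the assumption that backup files follow the goodgo_YYYYMMDD_HHMMSS.sql.gz naming used in this document; `latest_backup` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read backup file names on stdin, print the newest one.
# Relies on the goodgo_YYYYMMDD_HHMMSS naming, which sorts chronologically
# when sorted as plain text. Names that do not match are ignored.
latest_backup() {
  grep '^goodgo_[0-9]\{8\}_[0-9]\{6\}\.sql\.gz$' | sort | tail -n 1
}

# Example:
#   docker exec goodgo-pg-backup ls /backups/ | latest_backup
```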

2. Stop application services

docker compose stop ai-services
# Stop any NestJS API processes

3. Run restore

docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz

The restore script will:

- Terminate active database connections
- Drop and recreate the database
- Restore from the selected backup

4. Verify restore

docker exec goodgo-postgres psql -U goodgo -d goodgo -c '\dt'
docker exec goodgo-postgres psql -U goodgo -d goodgo -c 'SELECT count(*) FROM "User";'

5. Run Prisma migrations (if needed)

pnpm prisma migrate deploy

6. Restart services

docker compose up -d

Backup Verification

Check the backup log:

docker exec goodgo-pg-backup cat /var/log/pg-backup.log

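Verify that the backup is readable without restoring it. pg_restore --list prints the archive's table of contents and fails if the file is not a valid pg_dump archive: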

docker exec goodgo-pg-backup pg_restore --list /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
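A listing that parses can still describe an empty archive, so one possible extra check is to count the table-data entries in it. The helper name `count_table_data_entries` is hypothetical, and the threshold used in the example is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: count TABLE DATA entries in `pg_restore --list` output
# read from stdin. Lines starting with ';' are comments in the listing.
count_table_data_entries() {
  grep -v '^;' | grep -c ' TABLE DATA '
}

# Example: treat a backup with zero table-data entries as suspect.
#   n=$(docker exec goodgo-pg-backup pg_restore --list /backups/<file> \
#         | count_table_data_entries)
#   [ "$n" -gt 0 ] || echo "backup contains no table data" >&2
```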

Redis Backup & Restore

Redis is configured with AOF persistence (--appendonly yes). Data is stored in the redis_data Docker volume.

Manual Snapshot

# Trigger an RDB snapshot
docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE

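# Check for completion: LASTSAVE prints the Unix time of the last successful save,
# so re-run it until the value is newer than it was before BGSAVE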
docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" LASTSAVE
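Because BGSAVE returns immediately, a script needs to poll LASTSAVE to know when the snapshot has been written. The following is a minimal sketch; `redis_lastsave` and `wait_for_bgsave` are hypothetical names, and the 30-attempt default is an arbitrary assumption.

```shell
#!/bin/sh
# Hypothetical wrapper around the LASTSAVE command shown above.
redis_lastsave() {
  docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" LASTSAVE
}

# Poll until LASTSAVE moves past the value recorded before BGSAVE.
# $1 = LASTSAVE value taken before BGSAVE
# $2 = maximum number of attempts, one second apart (default 30)
# Returns 0 once a newer save is visible, 1 on timeout, 2 if the query fails.
wait_for_bgsave() {
  before=$1
  attempts=${2:-30}
  while [ "$attempts" -gt 0 ]; do
    current=$(redis_lastsave) || return 2
    if [ "$current" -gt "$before" ]; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 1
  done
  return 1
}

# Example:
#   before=$(redis_lastsave)
#   docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE
#   wait_for_bgsave "$before" || echo "BGSAVE did not complete in time" >&2
```

LASTSAVE has one-second resolution, so a snapshot that finishes within the same second as the previous save would not be detected by this comparison.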

Backup the Volume

# Stop Redis to ensure a consistent snapshot
docker compose stop redis

# Copy volume data to a backup location
docker run --rm -v goodgo-platform-ai_redis_data:/data -v "$(pwd)/backups":/backup \
  alpine tar czf /backup/redis_$(date +%Y%m%d_%H%M%S).tar.gz -C /data .

docker compose start redis
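The tar-based volume backup above follows the same pattern for every service, so it could be wrapped in one helper. This is a sketch: `backup_volume` and the `DOCKER` override are hypothetical, and mounting the volume read-only (`:ro`) is an added precaution rather than something the original command does.

```shell
#!/bin/sh
# Hypothetical helper that generalizes the tar-based volume backup.
# $1 = Docker volume name
# $2 = archive name prefix
# $3 = host directory for archives (defaults to ./backups)
# Set DOCKER=echo to print the command instead of running it (dry run).
backup_volume() {
  volume=$1
  prefix=$2
  dest=${3:-$(pwd)/backups}
  stamp=$(date +%Y%m%d_%H%M%S)
  ${DOCKER:-docker} run --rm \
    -v "${volume}:/data:ro" \
    -v "${dest}:/backup" \
    alpine tar czf "/backup/${prefix}_${stamp}.tar.gz" -C /data .
}

# Example (stop the service first, as described above):
#   docker compose stop redis
#   backup_volume goodgo-platform-ai_redis_data redis
#   docker compose start redis
```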

Restore Redis

docker compose stop redis

# Clear existing data and restore
docker run --rm -v goodgo-platform-ai_redis_data:/data -v "$(pwd)/backups":/backup \
  alpine sh -c "rm -rf /data/* && tar xzf /backup/redis_YYYYMMDD_HHMMSS.tar.gz -C /data"

docker compose start redis

Note: Redis is used as a cache with allkeys-lru eviction. Full data loss is non-critical — the API will repopulate cache entries on demand. Restore is only necessary if session data or queue state must be preserved.

Typesense Backup & Restore

Typesense data is stored in the typesense_data Docker volume.

Create Snapshot

# Typesense built-in snapshot API (must be sent as a POST request)
curl -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  "http://localhost:8108/operations/snapshot?snapshot_path=/data/snapshots/$(date +%Y%m%d)"

Backup the Volume

docker compose stop typesense

docker run --rm -v goodgo-platform-ai_typesense_data:/data -v "$(pwd)/backups":/backup \
  alpine tar czf /backup/typesense_$(date +%Y%m%d_%H%M%S).tar.gz -C /data .

docker compose start typesense

Restore Typesense

docker compose stop typesense

docker run --rm -v goodgo-platform-ai_typesense_data:/data -v "$(pwd)/backups":/backup \
  alpine sh -c "rm -rf /data/* && tar xzf /backup/typesense_YYYYMMDD_HHMMSS.tar.gz -C /data"

docker compose start typesense

Rebuild from Source

If backup is unavailable, Typesense can be rebuilt by re-indexing from PostgreSQL:

# After Typesense restarts with empty data:
pnpm run typesense:reindex
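Reindexing will fail if Typesense is not yet accepting requests, so a script may want to wait for the service first. The sketch below assumes Typesense answers on its /health endpoint at the port used elsewhere in this document; `wait_until_healthy` and the `CURL` override are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: poll an HTTP endpoint until it returns a success status.
# $1 = URL
# $2 = maximum number of attempts (default 30)
# $3 = delay between attempts in seconds (default 2)
# Set CURL to substitute another command for curl (useful in tests).
wait_until_healthy() {
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  while [ "$attempts" -gt 0 ]; do
    # -f makes curl exit non-zero on HTTP errors such as 503
    if ${CURL:-curl} -fsS -o /dev/null "$url"; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1
}

# Example:
#   wait_until_healthy http://localhost:8108/health \
#     && pnpm run typesense:reindex
```

The same helper could be pointed at the API health endpoint when verifying a recovery.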

Disaster Recovery Runbook

Scenario 1: PostgreSQL Failure (container crash or data corruption)

1. Assess: docker logs goodgo-postgres (check for corruption or OOM)
2. Stop services: docker compose stop api ai-services
3. Attempt restart: docker compose restart postgres
4. If restart fails (data corruption):
   - docker compose stop postgres
   - Remove the stopped container so the volume is no longer in use: docker compose rm -f postgres
   - Remove volume: docker volume rm goodgo-platform-ai_postgres_data
   - Recreate: docker compose up -d postgres
   - Restore from backup: docker exec goodgo-pg-backup /scripts/pg-restore.sh /backups/<latest>.sql.gz
   - Verify: docker exec goodgo-postgres psql -U goodgo -d goodgo -c '\dt'
   - Run migrations: pnpm prisma migrate deploy
5. Restart services: docker compose up -d
6. Verify: Check API health at http://localhost:3000/health

Expected RTO: ~15 minutes (local backup), ~30 minutes (off-site backup)

Scenario 2: Service Crash (API, AI-services, or Web)

1. Check logs: docker compose logs --tail=100 <service>
2. Restart service: docker compose restart <service>
3. If crash loops:
   - Check resource limits: docker stats
   - Check environment variables: docker compose config
   - Roll back to previous image tag if recent deployment caused the issue
4. Verify: Check health endpoints

Expected RTO: ~5 minutes

Scenario 3: Full Host Failure

1. Provision new host with Docker + Docker Compose
2. Clone the repo and set up .env from secrets manager
3. Pull images: docker compose pull
4. Restore PostgreSQL backup from off-site storage
5. Start all services: docker compose up -d
6. Trigger Typesense reindex: pnpm run typesense:reindex
7. Verify all health endpoints

Expected RTO: ~60 minutes (depends on backup transfer speed). This exceeds the 30-minute RTO target, which assumes a restore from a local volume backup.

Scenario 4: Data Corruption (application-level)

1. Identify the scope: which tables/records are affected
2. If limited: restore specific tables using pg_restore --table=<name>
3. If widespread: full restore from last known good backup
4. Invalidate Redis cache: docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" FLUSHALL
5. Reindex Typesense: pnpm run typesense:reindex
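For the limited case, a single-table restore could be wrapped as shown below. This is a sketch under several assumptions: the backup is a custom-format archive, the command runs somewhere that can reach the database (connection settings such as the host are omitted), and the names `restore_table_data` and `PG_RESTORE` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: restore the data of one table from a pg_dump archive.
# $1 = table name
# $2 = path to the backup file
# Set PG_RESTORE=echo to print the arguments instead of running pg_restore.
restore_table_data() {
  table=$1
  backup_file=$2
  ${PG_RESTORE:-pg_restore} \
    --username=goodgo --dbname=goodgo \
    --data-only --table="$table" \
    "$backup_file"
}

# Example (dry run first, then for real):
#   PG_RESTORE=echo restore_table_data User /backups/<file>
#   restore_table_data User /backups/<file>
```

With --data-only, rows are added to the existing table, so corrupted rows usually need to be removed beforehand to avoid key conflicts. pg_restore --table also does not restore objects that the table depends on.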

Log Aggregation

Logs are aggregated via Loki + Promtail and viewable in Grafana:

- Grafana: http://localhost:3002 (dashboard: "GoodGo - Logs")
- Loki: http://localhost:3100
- Log retention: 15 days (configured in monitoring/loki/loki-config.yml)
