From 6a40ab45554d30630fe9f1e7be2a0cfa5acc9297 Mon Sep 17 00:00:00 2001
From: Ho Ngoc Hai
Date: Thu, 9 Apr 2026 08:41:40 +0700
Subject: [PATCH] docs: comprehensive backup & DR documentation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Added RTO/RPO targets (RPO ≤24h, RTO ≤30m)
- Added Redis backup/restore procedures (volume + BGSAVE)
- Added Typesense backup/restore + rebuild from source
- Added DR runbook: DB failure, service crash, host failure, data corruption
- Restructured doc with clear sections per service

Ref: TEC-1572

Co-Authored-By: Paperclip
---
 docs/backup-restore.md | 162 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 155 insertions(+), 7 deletions(-)

diff --git a/docs/backup-restore.md b/docs/backup-restore.md
index 5c283ad..05c6e81 100644
--- a/docs/backup-restore.md
+++ b/docs/backup-restore.md
@@ -1,6 +1,19 @@
-# Database Backup & Restore Procedures
+# Backup & Disaster Recovery
 
-## Overview
+## RTO / RPO Targets
+
+| Metric | Target | Notes |
+|--------|--------|-------|
+| **RPO** (Recovery Point Objective) | ≤ 24 hours | Daily backups at 02:00 UTC; worst case is a full day of data loss |
+| **RTO** (Recovery Time Objective) | ≤ 30 minutes | Restore from local volume backup; longer if retrieving from off-site |
+
+> To reduce RPO further, consider WAL archiving for PostgreSQL (continuous point-in-time recovery) and more frequent backup schedules.
+
+---
+
+## PostgreSQL Backup
+
+### Overview
 
 Automated daily PostgreSQL backups run inside the `pg-backup` Docker container using `pg_dump` with custom format compression. Backups are stored in the `pg_backups` Docker volume.
 
@@ -85,13 +98,148 @@ Verify backup integrity without restoring:
 docker exec goodgo-pg-backup pg_restore --list /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
 ```
 
-## Disaster Recovery
+---
 
-For complete data loss (volume destroyed):
+## Redis Backup & Restore
 
-1. Retrieve backup from external storage (if configured)
-2. Recreate the `pg_backups` volume and copy backup file in
-3. Follow the restore procedure above
+Redis is configured with AOF persistence (`--appendonly yes`). Data is stored in the `redis_data` Docker volume.
+
+### Manual Snapshot
+
+```bash
+# Trigger an RDB snapshot
+docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE
+
+# Wait for completion
+docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" LASTSAVE
+```
+
+### Backup the Volume
+
+```bash
+# Stop Redis to ensure consistent snapshot
+docker compose stop redis
+
+# Copy volume data to a backup location
+docker run --rm -v goodgo-platform-ai_redis_data:/data -v $(pwd)/backups:/backup \
+  alpine tar czf /backup/redis_$(date +%Y%m%d_%H%M%S).tar.gz -C /data .
+
+docker compose start redis
+```
+
+### Restore Redis
+
+```bash
+docker compose stop redis
+
+# Clear existing data and restore
+docker run --rm -v goodgo-platform-ai_redis_data:/data -v $(pwd)/backups:/backup \
+  alpine sh -c "rm -rf /data/* && tar xzf /backup/redis_YYYYMMDD_HHMMSS.tar.gz -C /data"
+
+docker compose start redis
+```
+
+> **Note:** Redis is used as a cache with `allkeys-lru` eviction. Full data loss is non-critical — the API will repopulate cache entries on demand. Restore is only necessary if session data or queue state must be preserved.
+
+---
+
+## Typesense Backup & Restore
+
+Typesense data is stored in the `typesense_data` Docker volume.
+
+### Create Snapshot
+
+```bash
+# Typesense built-in snapshot API
+curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
+  "http://localhost:8108/operations/snapshot?snapshot_path=/data/snapshots/$(date +%Y%m%d)"
+```
+
+### Backup the Volume
+
+```bash
+docker compose stop typesense
+
+docker run --rm -v goodgo-platform-ai_typesense_data:/data -v $(pwd)/backups:/backup \
+  alpine tar czf /backup/typesense_$(date +%Y%m%d_%H%M%S).tar.gz -C /data .
+
+docker compose start typesense
+```
+
+### Restore Typesense
+
+```bash
+docker compose stop typesense
+
+docker run --rm -v goodgo-platform-ai_typesense_data:/data -v $(pwd)/backups:/backup \
+  alpine sh -c "rm -rf /data/* && tar xzf /backup/typesense_YYYYMMDD_HHMMSS.tar.gz -C /data"
+
+docker compose start typesense
+```
+
+### Rebuild from Source
+
+If backup is unavailable, Typesense can be rebuilt by re-indexing from PostgreSQL:
+
+```bash
+# After Typesense restarts with empty data:
+pnpm run typesense:reindex
+```
+
+---
+
+## Disaster Recovery Runbook
+
+### Scenario 1: PostgreSQL Failure (container crash or data corruption)
+
+1. **Assess:** `docker logs goodgo-postgres` — check for corruption or OOM
+2. **Stop services:** `docker compose stop api ai-services`
+3. **Attempt restart:** `docker compose restart postgres`
+4. If restart fails (data corruption):
+   - `docker compose stop postgres`
+   - Remove volume: `docker volume rm goodgo-platform-ai_postgres_data`
+   - Recreate: `docker compose up -d postgres`
+   - Restore from backup: `docker exec goodgo-pg-backup /scripts/pg-restore.sh /backups/.sql.gz`
+   - Verify: `docker exec goodgo-postgres psql -U goodgo -d goodgo -c '\dt'`
+   - Run migrations: `pnpm prisma migrate deploy`
+5. **Restart services:** `docker compose up -d`
+6. **Verify:** Check API health at `http://localhost:3000/health`
+
+**Expected RTO:** ~15 minutes (local backup), ~30 minutes (off-site backup)
+
+### Scenario 2: Service Crash (API, AI-services, or Web)
+
+1. **Check logs:** `docker compose logs --tail=100 `
+2. **Restart service:** `docker compose restart `
+3. If crash loops:
+   - Check resource limits: `docker stats`
+   - Check environment variables: `docker compose config`
+   - Roll back to previous image tag if recent deployment caused the issue
+4. **Verify:** Check health endpoints
+
+**Expected RTO:** ~5 minutes
+
+### Scenario 3: Full Host Failure
+
+1. Provision new host with Docker + Docker Compose
+2. Clone the repo and set up `.env` from secrets manager
+3. Pull images: `docker compose pull`
+4. Restore PostgreSQL backup from off-site storage
+5. Start all services: `docker compose up -d`
+6. Trigger Typesense reindex: `pnpm run typesense:reindex`
+7. Verify all health endpoints
+
+**Expected RTO:** ~60 minutes (depends on backup transfer speed)
+
+### Scenario 4: Data Corruption (application-level)
+
+1. Identify the scope: which tables/records are affected
+2. If limited: restore specific tables using `pg_restore --table=`
+3. If widespread: full restore from last known good backup
+4. Invalidate Redis cache: `docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" FLUSHALL`
+5. Reindex Typesense: `pnpm run typesense:reindex`
+
+---
 
 ## Log Aggregation
 
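
Follow-up note on the RPO target: the patch states an RPO of ≤ 24 hours, which assumes the daily backup actually ran. A minimal sketch of how that could be checked is shown here. The function name `rpo_status` and its arguments are hypothetical and are not part of the patch or of any script in the repository; it only does the age arithmetic on two Unix timestamps.

```bash
# rpo_status BACKUP_EPOCH NOW_EPOCH [MAX_AGE_HOURS]
# Hypothetical helper: prints "ok" when the newest backup is no older than
# the RPO window, otherwise "stale". Timestamps are seconds since the epoch.
rpo_status() {
  backup_epoch=$1
  now_epoch=$2
  max_hours=${3:-24}   # default mirrors the documented RPO of 24 hours
  age=$(( now_epoch - backup_epoch ))

  # A negative age (backup "from the future") is treated as stale,
  # since it usually indicates a clock or input problem.
  if [ "$age" -ge 0 ] && [ "$age" -le $(( max_hours * 3600 )) ]; then
    echo ok
  else
    echo stale
  fi
}
```

One possible use, assuming GNU `stat` is available where the backup file is readable: pass the file's modification time (`stat -c %Y` on the newest file in the backup volume) as the first argument and `date +%s` as the second, then alert when the result is `stale`.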