# Backup & Disaster Recovery

## RTO / RPO Targets

| Metric | Target | Notes |
|--------|--------|-------|
| **RPO** (Recovery Point Objective) | ≤ 24 hours | Daily backups at 02:00 UTC; worst case is a full day of data loss |
| **RTO** (Recovery Time Objective) | ≤ 30 minutes | Restore from local volume backup; longer if retrieving from off-site |

> To reduce RPO further, consider WAL archiving for PostgreSQL (continuous point-in-time recovery) and more frequent backup schedules.

---

## PostgreSQL Backup

### Overview

Automated daily PostgreSQL backups run inside the `pg-backup` Docker container using `pg_dump` in custom format with compression. Backups are stored in the `pg_backups` Docker volume.

## Backup Configuration

| Setting | Default | Environment Variable |
|---------|---------|---------------------|
| Schedule | Daily at 02:00 UTC | Cron in `pg-backup` service |
| Retention | 7 days | `BACKUP_RETENTION_DAYS` |
| Format | Custom (`pg_dump --format=custom`) | n/a |
| Compression | Level 6 | n/a |
| Storage | `pg_backups` Docker volume | n/a |

## Listing Backups

```bash
docker exec goodgo-pg-backup ls -lh /backups/
```

## Manual Backup

```bash
docker exec goodgo-pg-backup /scripts/pg-backup.sh
```

## Restore Procedure

### 1. Identify the backup to restore

```bash
# List backups, newest first
docker exec goodgo-pg-backup ls -lht /backups/
```

### 2. Stop application services

```bash
docker compose stop ai-services
# Also stop any running NestJS API processes
```

### 3. Run restore

```bash
docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
```

The restore script will:

- Terminate active database connections
- Drop and recreate the database
- Restore from the selected backup

### 4. Verify restore

```bash
# List tables, then spot-check a row count
docker exec goodgo-postgres psql -U goodgo -d goodgo -c '\dt'
docker exec goodgo-postgres psql -U goodgo -d goodgo -c 'SELECT count(*) FROM "User";'
```

### 5. Run Prisma migrations (if needed)

```bash
pnpm prisma migrate deploy
```

### 6. Restart services

```bash
docker compose up -d
```

## Backup Verification

Check the backup log:

```bash
docker exec goodgo-pg-backup cat /var/log/pg-backup.log
```

Verify backup integrity without restoring:

```bash
docker exec goodgo-pg-backup pg_restore --list /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
```

---

## Redis Backup & Restore

Redis is configured with AOF persistence (`--appendonly yes`). Data is stored in the `redis_data` Docker volume.

### Manual Snapshot

```bash
# Trigger an RDB snapshot in the background
docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE

# Check for completion: LASTSAVE prints the Unix time of the last successful
# save, so the value changes once the background save has finished
docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" LASTSAVE
```

### Backup the Volume

```bash
# Stop Redis to ensure a consistent snapshot
docker compose stop redis

# Copy volume data to a backup location
docker run --rm -v goodgo-platform-ai_redis_data:/data -v "$(pwd)/backups:/backup" \
  alpine tar czf /backup/redis_$(date +%Y%m%d_%H%M%S).tar.gz -C /data .

docker compose start redis
```

### Restore Redis

```bash
docker compose stop redis

# Clear existing data and restore
docker run --rm -v goodgo-platform-ai_redis_data:/data -v "$(pwd)/backups:/backup" \
  alpine sh -c "rm -rf /data/* && tar xzf /backup/redis_YYYYMMDD_HHMMSS.tar.gz -C /data"

docker compose start redis
```

> **Note:** Redis is used as a cache with `allkeys-lru` eviction. Full data loss is non-critical: the API repopulates cache entries on demand. Restore is only necessary if session data or queue state must be preserved.

---

## Typesense Backup & Restore

Typesense data is stored in the `typesense_data` Docker volume.
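
The Redis and Typesense restore commands in this document take an archive named `<service>_YYYYMMDD_HHMMSS.tar.gz`, so the operator has to look up the newest file first. As a minimal sketch, a small shell function could do that lookup. `latest_backup` is a hypothetical helper, not part of the repository's scripts, and it assumes the archives sit in a local `backups/` directory as in the volume backup commands.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the newest archive named
# <prefix>_YYYYMMDD_HHMMSS.tar.gz in the given directory.
latest_backup() {
  local dir="$1" prefix="$2" newest="" f
  # Glob results are sorted lexically; with zero-padded timestamps that
  # matches chronological order, so the last existing match is the newest.
  for f in "$dir"/"$prefix"_*.tar.gz; do
    [ -e "$f" ] && newest="$f"
  done
  if [ -z "$newest" ]; then
    echo "no ${prefix} archive found in ${dir}" >&2
    return 1
  fi
  printf '%s\n' "$newest"
}
```

For example, `latest_backup ./backups typesense` prints the path of the most recent Typesense archive, or returns a non-zero status if none exists, which lets a restore script stop before it clears the volume.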
### Create Snapshot

```bash
# Typesense built-in snapshot API (the endpoint expects a POST request)
curl -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  "http://localhost:8108/operations/snapshot?snapshot_path=/data/snapshots/$(date +%Y%m%d)"
```

### Backup the Volume

```bash
# Stop Typesense to ensure a consistent copy
docker compose stop typesense

docker run --rm -v goodgo-platform-ai_typesense_data:/data -v "$(pwd)/backups:/backup" \
  alpine tar czf /backup/typesense_$(date +%Y%m%d_%H%M%S).tar.gz -C /data .

docker compose start typesense
```

### Restore Typesense

```bash
docker compose stop typesense

# Clear existing data and restore
docker run --rm -v goodgo-platform-ai_typesense_data:/data -v "$(pwd)/backups:/backup" \
  alpine sh -c "rm -rf /data/* && tar xzf /backup/typesense_YYYYMMDD_HHMMSS.tar.gz -C /data"

docker compose start typesense
```

### Rebuild from Source

If no backup is available, Typesense can be rebuilt by re-indexing from PostgreSQL:

```bash
# After Typesense restarts with empty data:
pnpm run typesense:reindex
```

---

## Disaster Recovery Runbook

### Scenario 1: PostgreSQL Failure (container crash or data corruption)

1. **Assess:** run `docker logs goodgo-postgres` and check for corruption or out-of-memory (OOM) errors
2. **Stop services:** `docker compose stop api ai-services`
3. **Attempt restart:** `docker compose restart postgres`
4. If restart fails (data corruption):
   - `docker compose stop postgres`
   - Remove the stopped container so the volume is no longer in use: `docker compose rm -f postgres`
   - Remove volume: `docker volume rm goodgo-platform-ai_postgres_data`
   - Recreate: `docker compose up -d postgres`
   - Restore from backup: `docker exec goodgo-pg-backup /scripts/pg-restore.sh /backups/.sql.gz`
   - Verify: `docker exec goodgo-postgres psql -U goodgo -d goodgo -c '\dt'`
   - Run migrations: `pnpm prisma migrate deploy`
5. **Restart services:** `docker compose up -d`
6. **Verify:** Check API health at `http://localhost:3000/health`

**Expected RTO:** ~15 minutes (local backup), ~30 minutes (off-site backup)

### Scenario 2: Service Crash (API, AI-services, or Web)

1. **Check logs:** `docker compose logs --tail=100 `
2. **Restart service:** `docker compose restart `
3. If the service crash-loops:
   - Check resource limits: `docker stats`
   - Check environment variables: `docker compose config`
   - Roll back to the previous image tag if a recent deployment caused the issue
4. **Verify:** Check health endpoints

**Expected RTO:** ~5 minutes

### Scenario 3: Full Host Failure

1. Provision a new host with Docker + Docker Compose
2. Clone the repo and set up `.env` from the secrets manager
3. Pull images: `docker compose pull`
4. Restore the PostgreSQL backup from off-site storage
5. Start all services: `docker compose up -d`
6. Trigger Typesense reindex: `pnpm run typesense:reindex`
7. Verify all health endpoints

**Expected RTO:** ~60 minutes (depends on backup transfer speed; this exceeds the 30-minute target, which assumes a local backup is available)

### Scenario 4: Data Corruption (application-level)

1. Identify the scope: which tables/records are affected
2. If limited: restore specific tables using `pg_restore --table=`
3. If widespread: full restore from the last known good backup
4. Invalidate Redis cache: `docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" FLUSHALL`
5. Reindex Typesense: `pnpm run typesense:reindex`

---

## Log Aggregation

Logs are aggregated via Loki + Promtail and viewable in Grafana:

- **Grafana**: http://localhost:3002 (dashboard: "GoodGo - Logs")
- **Loki**: http://localhost:3100
- **Log retention**: 15 days (configured in `monitoring/loki/loki-config.yml`)
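
During an incident it can be quicker to query Loki directly than to open the Grafana dashboard. The following is a sketch only, using Loki's HTTP `query_range` endpoint. `loki_errors` is a hypothetical helper, and the `container` label is an assumption: the labels actually available depend on how Promtail is configured in this stack.

```bash
#!/usr/bin/env bash
# Hypothetical helper: fetch recent log lines containing "error" from Loki.
# ASSUMPTION: Promtail attaches a "container" label; adjust the selector
# to whatever labels your Promtail configuration actually sets.
loki_errors() {
  local container="$1" hours="${2:-1}"
  local loki_url="${LOKI_URL:-http://localhost:3100}"
  # Loki accepts the range start as a Unix timestamp in nanoseconds.
  local start_ns="$(( $(date +%s) - hours * 3600 ))000000000"
  # -G sends the url-encoded parameters as a GET query string.
  curl -sG "${loki_url}/loki/api/v1/query_range" \
    --data-urlencode "query={container=\"${container}\"} |= \"error\"" \
    --data-urlencode "start=${start_ns}" \
    --data-urlencode "limit=100"
}
```

For example, `loki_errors goodgo-postgres 2` requests up to 100 matching lines from the last two hours as JSON. Only logs inside the 15-day retention window can be returned.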