docs: comprehensive backup & DR documentation

- Added RTO/RPO targets (RPO ≤24h, RTO ≤30m)
- Added Redis backup/restore procedures (volume + BGSAVE)
- Added Typesense backup/restore + rebuild from source
- Added DR runbook: DB failure, service crash, host failure, data corruption
- Restructured doc with clear sections per service

Ref: TEC-1572

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Ho Ngoc Hai
2026-04-09 08:41:40 +07:00
parent a8e1a438b9
commit 6a40ab4555


@@ -1,6 +1,19 @@
# Database Backup & Restore Procedures
# Backup & Disaster Recovery
## Overview
## RTO / RPO Targets
| Metric | Target | Notes |
|--------|--------|-------|
| **RPO** (Recovery Point Objective) | ≤ 24 hours | Daily backups at 02:00 UTC; worst case is a full day of data loss |
| **RTO** (Recovery Time Objective) | ≤ 30 minutes | Restore from local volume backup; longer if retrieving from off-site |
> To reduce RPO further, consider WAL archiving for PostgreSQL (continuous point-in-time recovery) and more frequent backup schedules.
---
## PostgreSQL Backup
### Overview
Automated daily PostgreSQL backups run inside the `pg-backup` Docker container using `pg_dump` with custom format compression. Backups are stored in the `pg_backups` Docker volume.
@@ -85,13 +98,148 @@ Verify backup integrity without restoring:
docker exec goodgo-pg-backup pg_restore --list /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
```
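Integrity aside, the newest backup should also fall inside the 24-hour RPO window. The following is a minimal sketch of such a freshness check, not part of the shipped tooling. It assumes the backup files are reachable on the host in a directory passed as an argument, that they follow the `goodgo_YYYYMMDD_HHMMSS.sql.gz` naming shown above, and that GNU `stat` is available. The helper names `newest_backup` and `is_fresh` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and the host-side backup directory are assumptions.

# Print the newest backup in directory $1, judged by the timestamp in the file name.
# Glob results are sorted, so the last match carries the latest timestamp.
# Returns non-zero if no matching file exists.
newest_backup() {
  local dir=$1 f newest=""
  for f in "$dir"/goodgo_*.sql.gz; do
    [ -e "$f" ] || continue            # an unmatched glob stays literal; skip it
    newest=$f
  done
  [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Return 0 if file $1 was modified within the last $2 hours.
is_fresh() {
  local file=$1 max_hours=$2 now mtime
  now=$(date +%s)
  mtime=$(stat -c %Y "$file")          # GNU stat; BSD/macOS would need: stat -f %m
  [ $(( now - mtime )) -le $(( max_hours * 3600 )) ]
}

# Example (hypothetical path):
#   latest=$(newest_backup /var/backups/goodgo) && is_fresh "$latest" 24 || echo "RPO at risk"
```

Sorting by file name rather than modification time keeps the result stable even if files were copied or touched after creation.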
## Disaster Recovery
---
For complete data loss (volume destroyed):
## Redis Backup & Restore
1. Retrieve backup from external storage (if configured)
2. Recreate the `pg_backups` volume and copy backup file in
3. Follow the restore procedure above
Redis is configured with AOF persistence (`--appendonly yes`). Data is stored in the `redis_data` Docker volume.
### Manual Snapshot
```bash
# Trigger an RDB snapshot
docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE
# Check completion: LASTSAVE prints the Unix time of the last successful save, which changes once BGSAVE finishes
docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" LASTSAVE
```
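`BGSAVE` returns immediately, so a script that needs the finished snapshot has to poll. A minimal sketch of that loop follows, assuming the container name and `REDIS_PASSWORD` variable used above. `redis_cmd` and `wait_for_bgsave` are hypothetical helper names. Because `LASTSAVE` has one-second resolution, a snapshot that completes in the same second as the previous one would not be detected by this comparison.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are assumptions, not part of the existing scripts.

# Hypothetical wrapper around the redis-cli invocation used above.
redis_cmd() {
  docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" --no-auth-warning "$@"
}

# Trigger BGSAVE and wait up to $1 seconds (default 60) for LASTSAVE to change.
wait_for_bgsave() {
  local timeout=${1:-60} before after waited=0
  before=$(redis_cmd LASTSAVE) || return 1
  redis_cmd BGSAVE >/dev/null || return 1
  while [ "$waited" -lt "$timeout" ]; do
    after=$(redis_cmd LASTSAVE) || return 1
    if [ "$after" != "$before" ]; then
      return 0                      # a newer snapshot has been written
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1                          # timed out without seeing a new snapshot
}

# Example:
#   wait_for_bgsave 120 || echo "BGSAVE did not finish in time"
```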
### Backup the Volume
```bash
# Stop Redis to ensure consistent snapshot
docker compose stop redis
# Copy volume data to a backup location
docker run --rm -v goodgo-platform-ai_redis_data:/data -v "$(pwd)/backups:/backup" \
alpine tar czf /backup/redis_$(date +%Y%m%d_%H%M%S).tar.gz -C /data .
docker compose start redis
```
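A volume archive is only useful if it can be read back. As a hedged sketch, the helper below lists the tarball to confirm it is readable and stores a SHA-256 checksum beside it for later comparison (for example after copying off-site). `verify_archive` is a hypothetical name, and `sha256sum` from GNU coreutils is assumed to be installed on the host.

```bash
#!/usr/bin/env bash
# Sketch only: verify_archive is a hypothetical helper.

# Check that archive $1 is a readable gzip tarball and record its checksum
# next to it as "$1.sha256". Returns non-zero if the archive cannot be listed.
verify_archive() {
  local archive=$1
  tar tzf "$archive" >/dev/null 2>&1 || return 1
  sha256sum "$archive" > "$archive.sha256"
}

# Example:
#   verify_archive backups/redis_YYYYMMDD_HHMMSS.tar.gz
#   sha256sum -c backups/redis_YYYYMMDD_HHMMSS.tar.gz.sha256   # re-check later
```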
### Restore Redis
```bash
docker compose stop redis
# Clear existing data and restore
docker run --rm -v goodgo-platform-ai_redis_data:/data -v "$(pwd)/backups:/backup" \
alpine sh -c "rm -rf /data/* && tar xzf /backup/redis_YYYYMMDD_HHMMSS.tar.gz -C /data"
docker compose start redis
```
> **Note:** Redis is used as a cache with `allkeys-lru` eviction. Full data loss is non-critical — the API will repopulate cache entries on demand. Restore is only necessary if session data or queue state must be preserved.
---
## Typesense Backup & Restore
Typesense data is stored in the `typesense_data` Docker volume.
### Create Snapshot
```bash
# Typesense built-in snapshot API
curl -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
"http://localhost:8108/operations/snapshot?snapshot_path=/data/snapshots/$(date +%Y%m%d)"
```
### Backup the Volume
```bash
docker compose stop typesense
docker run --rm -v goodgo-platform-ai_typesense_data:/data -v "$(pwd)/backups:/backup" \
alpine tar czf /backup/typesense_$(date +%Y%m%d_%H%M%S).tar.gz -C /data .
docker compose start typesense
```
### Restore Typesense
```bash
docker compose stop typesense
docker run --rm -v goodgo-platform-ai_typesense_data:/data -v "$(pwd)/backups:/backup" \
alpine sh -c "rm -rf /data/* && tar xzf /backup/typesense_YYYYMMDD_HHMMSS.tar.gz -C /data"
docker compose start typesense
```
### Rebuild from Source
If backup is unavailable, Typesense can be rebuilt by re-indexing from PostgreSQL:
```bash
# After Typesense restarts with empty data:
pnpm run typesense:reindex
```
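Reindexing against a Typesense instance that is still starting will fail, so it can help to wait for the server first. The sketch below is one way to do that, assuming Typesense listens on `localhost:8108` as in the snapshot command above and answers its `/health` endpoint. `wait_until` and `typesense_healthy` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are assumptions.

# Retry a command roughly once per second until it succeeds or $1 attempts have failed.
wait_until() {
  local timeout=$1 waited=0
  shift
  until "$@"; do
    waited=$((waited + 1))
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}

# Probe: succeeds once Typesense answers its health endpoint with a 2xx status.
typesense_healthy() {
  curl -fsS "http://localhost:8108/health" >/dev/null 2>&1
}

# Example:
#   wait_until 60 typesense_healthy && pnpm run typesense:reindex
```

`wait_until` takes any command as its probe, so the same helper can wrap other readiness checks.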
---
## Disaster Recovery Runbook
### Scenario 1: PostgreSQL Failure (container crash or data corruption)
1. **Assess:** `docker logs goodgo-postgres` — check for corruption or OOM
2. **Stop services:** `docker compose stop api ai-services`
3. **Attempt restart:** `docker compose restart postgres`
4. If restart fails (data corruption):
- `docker compose stop postgres`
- Remove volume: `docker volume rm goodgo-platform-ai_postgres_data`
- Recreate: `docker compose up -d postgres`
- Restore from backup: `docker exec goodgo-pg-backup /scripts/pg-restore.sh /backups/<latest>.sql.gz`
- Verify: `docker exec goodgo-postgres psql -U goodgo -d goodgo -c '\dt'`
- Run migrations: `pnpm prisma migrate deploy`
5. **Restart services:** `docker compose up -d`
6. **Verify:** Check API health at `http://localhost:3000/health`
**Expected RTO:** ~15 minutes (local backup), ~30 minutes (off-site backup)
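The assessment in step 1 can be partly scripted. The following is a rough triage sketch, not a diagnosis: it scans log text for a few well-known PostgreSQL messages and prints a coarse verdict. The pattern list is illustrative and far from exhaustive, and `classify_pg_log` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch only: patterns are examples; a clean verdict does not prove the data is healthy.

# Read PostgreSQL log text on stdin and print one of:
# "possible-corruption", "possible-oom", or "unknown".
classify_pg_log() {
  local log
  log=$(cat)
  if printf '%s\n' "$log" | grep -Eqi 'invalid page in block|could not read block|PANIC:'; then
    echo possible-corruption
  elif printf '%s\n' "$log" | grep -Eqi 'out of memory'; then
    echo possible-oom
  else
    echo unknown
  fi
}

# Example:
#   docker logs --tail=500 goodgo-postgres 2>&1 | classify_pg_log
#   # Docker also records whether the kernel OOM killer stopped the container:
#   docker inspect --format '{{.State.OOMKilled}}' goodgo-postgres
```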
### Scenario 2: Service Crash (API, AI-services, or Web)
1. **Check logs:** `docker compose logs --tail=100 <service>`
2. **Restart service:** `docker compose restart <service>`
3. If crash loops:
- Check resource limits: `docker stats`
- Check environment variables: `docker compose config`
- Roll back to previous image tag if recent deployment caused the issue
4. **Verify:** Check health endpoints
**Expected RTO:** ~5 minutes
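To tell a one-off crash from a crash loop, Docker's own restart counter is a quick signal. The sketch below assumes the services run under a restart policy (the counter only tracks restarts Docker performed itself) and uses `goodgo-api` as a hypothetical container name. `is_crash_looping` is likewise a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch only: container name and threshold are assumptions.

# Return 0 if container $1 has been restarted at least $2 times (default 3)
# according to Docker's counter, 1 if not, and 2 if the container cannot be inspected.
is_crash_looping() {
  local container=$1 threshold=${2:-3} count
  count=$(docker inspect --format '{{.RestartCount}}' "$container") || return 2
  [ "$count" -ge "$threshold" ]
}

# Example:
#   if is_crash_looping goodgo-api 3; then
#     echo "crash loop suspected: check limits, env, and the last deployment"
#   fi
```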
### Scenario 3: Full Host Failure
1. Provision new host with Docker + Docker Compose
2. Clone the repo and set up `.env` from secrets manager
3. Pull images: `docker compose pull`
4. Restore PostgreSQL backup from off-site storage
5. Start all services: `docker compose up -d`
6. Trigger Typesense reindex: `pnpm run typesense:reindex`
7. Verify all health endpoints
**Expected RTO:** ~60 minutes (depends on backup transfer speed). This exceeds the ≤ 30 minute RTO target, which assumes a restore from the local volume backup.
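Since the estimate hinges on transfer speed, a back-of-the-envelope calculation helps when sizing off-site storage. The sketch below uses integer arithmetic and ignores protocol overhead and restore time. The backup size and link speed in the example are made-up numbers, and `transfer_minutes` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch only: figures are illustrative, not measurements from this platform.

# Estimate whole minutes needed to transfer $1 megabytes over a link of $2 megabits per second.
transfer_minutes() {
  local size_mb=$1 link_mbps=$2 seconds
  seconds=$(( (size_mb * 8 + link_mbps - 1) / link_mbps ))   # round up to whole seconds
  echo $(( (seconds + 59) / 60 ))                             # round up to whole minutes
}

# Example with made-up numbers: a 20 GB backup over a 100 Mbit/s link
#   transfer_minutes 20480 100   # prints 28
```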
### Scenario 4: Data Corruption (application-level)
1. Identify the scope: which tables/records are affected
2. If limited: restore specific tables using `pg_restore --table=<name>`
3. If widespread: full restore from last known good backup
4. Invalidate Redis cache: `docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" FLUSHALL`
5. Reindex Typesense: `pnpm run typesense:reindex`
---
## Log Aggregation