docs: comprehensive backup & DR documentation
- Added RTO/RPO targets (RPO ≤24h, RTO ≤30m)
- Added Redis backup/restore procedures (volume + BGSAVE)
- Added Typesense backup/restore + rebuild from source
- Added DR runbook: DB failure, service crash, host failure, data corruption
- Restructured doc with clear sections per service

Ref: TEC-1572
Co-Authored-By: Paperclip <noreply@paperclip.ing>
@@ -1,6 +1,19 @@
# Database Backup & Restore Procedures

# Backup & Disaster Recovery

## Overview

## RTO / RPO Targets

| Metric | Target | Notes |
|--------|--------|-------|
| **RPO** (Recovery Point Objective) | ≤ 24 hours | Daily backups at 02:00 UTC; worst case is a full day of data loss |
| **RTO** (Recovery Time Objective) | ≤ 30 minutes | Restore from local volume backup; longer if retrieving from off-site |

> To reduce RPO further, consider WAL archiving for PostgreSQL (continuous point-in-time recovery) and more frequent backup schedules.
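
The 24-hour RPO only holds while the daily job keeps producing files, so a freshness check is a natural companion to the table. The following is a minimal sketch, not part of the documented tooling: the function name `backup_within_rpo` and the idea of running it against the backup directory are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch: succeed only if a directory holds a file newer than the RPO window.
# backup_within_rpo is a hypothetical helper, not an existing script in the repo.
# Usage: backup_within_rpo <backup_dir> [max_age_hours]   (default: 24)
backup_within_rpo() {
  local dir="$1" max_hours="${2:-24}" recent
  # -mmin -N matches files modified less than N minutes ago
  recent="$(find "$dir" -maxdepth 1 -type f -mmin "-$((max_hours * 60))" 2>/dev/null | head -n 1)"
  [ -n "$recent" ]
}

# Example (path is an assumption):
#   backup_within_rpo /backups 24 || echo "RPO breached: no backup in the last 24h" >&2
```

A check like this could run from cron or a monitoring probe; where it runs depends on where the `pg_backups` volume is mounted.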

---

## PostgreSQL Backup

### Overview

Automated daily PostgreSQL backups run inside the `pg-backup` Docker container using `pg_dump` in its compressed custom format. Backups are stored in the `pg_backups` Docker volume.

@@ -85,13 +98,148 @@ Verify backup integrity without restoring:

```bash
docker exec goodgo-pg-backup pg_restore --list /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
```

## Disaster Recovery

For complete data loss (volume destroyed):

1. Retrieve backup from external storage (if configured)
2. Recreate the `pg_backups` volume and copy the backup file into it
3. Follow the restore procedure above

---

## Redis Backup & Restore

Redis is configured with AOF persistence (`--appendonly yes`). Data is stored in the `redis_data` Docker volume.

### Manual Snapshot

```bash
# Trigger an RDB snapshot in the background
docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE

# Check for completion: LASTSAVE prints the Unix time of the last successful save,
# which advances once the background save has finished
docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" LASTSAVE
```
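
`BGSAVE` returns immediately and `LASTSAVE` only reports a timestamp, so waiting for the snapshot means polling until that timestamp moves past the value seen before the save. The sketch below shows one way to do that; `wait_for_save` is a hypothetical helper, and the 60-second timeout in the example is an arbitrary assumption.

```bash
#!/usr/bin/env bash
# Sketch: poll a "last save time" command until it exceeds a baseline.
# Usage: wait_for_save <baseline_ts> <timeout_seconds> <command...>
# Returns 0 when a newer save is seen, 1 on timeout, 2 if the command fails.
wait_for_save() {
  local baseline="$1" timeout="$2" waited=0 current
  shift 2
  while [ "$waited" -le "$timeout" ]; do
    current="$("$@")" || return 2
    if [ "$current" -gt "$baseline" ]; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}

# Example, reusing the commands from the block above:
#   before="$(docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" LASTSAVE)"
#   docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE
#   wait_for_save "$before" 60 docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" LASTSAVE
```

Because `LASTSAVE` has one-second resolution, a save that completes within the same second as the previous one will not be detected by this comparison.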

### Backup the Volume

```bash
# Stop Redis to ensure a consistent snapshot
docker compose stop redis

# Copy volume data to a backup location
docker run --rm -v goodgo-platform-ai_redis_data:/data -v "$(pwd)/backups":/backup \
  alpine tar czf "/backup/redis_$(date +%Y%m%d_%H%M%S).tar.gz" -C /data .

docker compose start redis
```
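
The volume copy above does not confirm that the resulting tarball is readable. As a rough illustration, the same `tar` step can be wrapped so that the archive is listed immediately after it is written. `archive_dir` is a hypothetical helper that works on plain host directories rather than Docker volumes.

```bash
#!/usr/bin/env bash
# Sketch: archive a directory into a timestamped tarball and verify it.
# Usage: archive_dir <source_dir> <backup_dir> <prefix>   (prints the archive path)
archive_dir() {
  local src="$1" dest="$2" prefix="$3" archive
  archive="$dest/${prefix}_$(date +%Y%m%d_%H%M%S).tar.gz"
  mkdir -p "$dest" || return 1
  tar czf "$archive" -C "$src" . || return 1
  # Listing the archive reads it end to end, which catches truncated or corrupt files
  tar tzf "$archive" >/dev/null || return 1
  printf '%s\n' "$archive"
}

# Example (paths are assumptions):
#   archive_dir /var/lib/redis-copy ./backups redis
```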

### Restore Redis

```bash
docker compose stop redis

# Clear existing data and restore
docker run --rm -v goodgo-platform-ai_redis_data:/data -v "$(pwd)/backups":/backup \
  alpine sh -c "rm -rf /data/* && tar xzf /backup/redis_YYYYMMDD_HHMMSS.tar.gz -C /data"

docker compose start redis
```

> **Note:** Redis is used as a cache with `allkeys-lru` eviction. Full data loss is non-critical because the API repopulates cache entries on demand. Restore is only necessary if session data or queue state must be preserved.

---

## Typesense Backup & Restore

Typesense data is stored in the `typesense_data` Docker volume.

### Create Snapshot

```bash
# Typesense built-in snapshot API (the endpoint expects a POST request)
curl -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  "http://localhost:8108/operations/snapshot?snapshot_path=/data/snapshots/$(date +%Y%m%d)"
```
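
A bare `curl` call exits with status 0 even when the API rejects the request, so a scripted snapshot should inspect the response body. The sketch below assumes the endpoint answers with a JSON body containing `"success": true` on success, as the Typesense documentation describes. The helper names `snapshot_url`, `snapshot_succeeded`, and `take_snapshot` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: build the snapshot URL, call it, and check the reported result.

# Print the snapshot endpoint for a base URL and a target path inside the container.
snapshot_url() {
  printf '%s/operations/snapshot?snapshot_path=%s\n' "$1" "$2"
}

# Succeed only if the response body reports success (assumed response shape).
snapshot_succeeded() {
  printf '%s' "$1" | grep -Eq '"success"[[:space:]]*:[[:space:]]*true'
}

# POST the request and fail loudly on anything other than a success response.
take_snapshot() {
  local base="$1" path="$2" body
  body="$(curl -sS -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    "$(snapshot_url "$base" "$path")")" || return 1
  snapshot_succeeded "$body" || { echo "snapshot failed: $body" >&2; return 1; }
}

# Example:
#   take_snapshot http://localhost:8108 "/data/snapshots/$(date +%Y%m%d)"
```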

### Backup the Volume

```bash
docker compose stop typesense

docker run --rm -v goodgo-platform-ai_typesense_data:/data -v "$(pwd)/backups":/backup \
  alpine tar czf "/backup/typesense_$(date +%Y%m%d_%H%M%S).tar.gz" -C /data .

docker compose start typesense
```

### Restore Typesense

```bash
docker compose stop typesense

docker run --rm -v goodgo-platform-ai_typesense_data:/data -v "$(pwd)/backups":/backup \
  alpine sh -c "rm -rf /data/* && tar xzf /backup/typesense_YYYYMMDD_HHMMSS.tar.gz -C /data"

docker compose start typesense
```
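
The restore commands wipe the volume before extracting, so a corrupt archive would leave the service with no data at all. One possible safeguard, sketched here on plain directories, is to validate the tarball before deleting anything. `restore_dir` is a hypothetical helper and is not part of the documented procedure.

```bash
#!/usr/bin/env bash
# Sketch: restore a tarball into a directory, refusing to wipe it if the
# archive cannot be read.
# Usage: restore_dir <archive.tar.gz> <target_dir>
restore_dir() {
  local archive="$1" target="$2"
  # Validate first: on failure the existing data is left untouched
  if ! tar tzf "$archive" >/dev/null 2>&1; then
    echo "unreadable archive: $archive" >&2
    return 1
  fi
  # Remove everything inside the target, including dotfiles that `rm -rf dir/*` skips
  find "$target" -mindepth 1 -delete || return 1
  tar xzf "$archive" -C "$target"
}

# Example (paths are assumptions):
#   restore_dir ./backups/typesense_YYYYMMDD_HHMMSS.tar.gz /mnt/typesense_data
```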

### Rebuild from Source

If a backup is unavailable, Typesense can be rebuilt by re-indexing from PostgreSQL:

```bash
# After Typesense restarts with empty data:
pnpm run typesense:reindex
```

---

## Disaster Recovery Runbook

### Scenario 1: PostgreSQL Failure (container crash or data corruption)

1. **Assess:** `docker logs goodgo-postgres` (check for corruption or OOM)
2. **Stop services:** `docker compose stop api ai-services`
3. **Attempt restart:** `docker compose restart postgres`
4. If restart fails (data corruption):
   - `docker compose stop postgres`
   - Remove the stopped container so the volume is no longer in use: `docker compose rm -f postgres`
   - Remove volume: `docker volume rm goodgo-platform-ai_postgres_data`
   - Recreate: `docker compose up -d postgres`
   - Restore from backup: `docker exec goodgo-pg-backup /scripts/pg-restore.sh /backups/<latest>.sql.gz`
   - Verify: `docker exec goodgo-postgres psql -U goodgo -d goodgo -c '\dt'`
   - Run migrations: `pnpm prisma migrate deploy`
5. **Restart services:** `docker compose up -d`
6. **Verify:** Check API health at `http://localhost:3000/health`

**Expected RTO:** ~15 minutes (local backup), ~30 minutes (off-site backup)
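
The restore step refers to `<latest>.sql.gz` without saying how to pick it. Since the backup names embed a sortable `YYYYMMDD_HHMMSS` timestamp, a small helper can select the newest file. This is a sketch only: `latest_backup` is a hypothetical function, and it must run somewhere the backup directory is visible.

```bash
#!/usr/bin/env bash
# Sketch: print the newest goodgo_YYYYMMDD_HHMMSS.sql.gz in a directory.
# Returns 1 (and prints nothing) when no backup exists.
latest_backup() {
  local dir="$1" newest="" f
  for f in "$dir"/goodgo_*.sql.gz; do
    [ -e "$f" ] || continue
    # Glob results are sorted, so the last match carries the latest timestamp
    newest="$f"
  done
  [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Example (assumes the helper is available inside the backup container):
#   /scripts/pg-restore.sh "$(latest_backup /backups)"
```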

### Scenario 2: Service Crash (API, AI-services, or Web)

1. **Check logs:** `docker compose logs --tail=100 <service>`
2. **Restart service:** `docker compose restart <service>`
3. If the service crash-loops:
   - Check resource usage against limits: `docker stats`
   - Check environment variables: `docker compose config`
   - Roll back to the previous image tag if a recent deployment caused the issue
4. **Verify:** Check health endpoints

**Expected RTO:** ~5 minutes
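
A restarted service usually needs a few seconds before its health endpoint responds, so the verify step is easier to script with a bounded retry. The following sketch uses a hypothetical helper named `retry_until`; the attempt count and delay in the example are arbitrary assumptions.

```bash
#!/usr/bin/env bash
# Sketch: run a command until it succeeds, up to a fixed number of attempts.
# Usage: retry_until <attempts> <delay_seconds> <command...>
retry_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Pause between attempts, but not after the final one
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example: wait up to about a minute for the API health endpoint
#   retry_until 30 2 curl -fsS -o /dev/null http://localhost:3000/health
```

With `curl -f`, HTTP error statuses count as failures, so the loop only ends early on a successful response.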

### Scenario 3: Full Host Failure

1. Provision a new host with Docker + Docker Compose
2. Clone the repo and set up `.env` from the secrets manager
3. Pull images: `docker compose pull`
4. Retrieve the PostgreSQL backup from off-site storage and restore it (see Scenario 1, step 4; the database and backup containers must be running for the restore)
5. Start all services: `docker compose up -d`
6. Trigger Typesense reindex: `pnpm run typesense:reindex`
7. Verify all health endpoints

**Expected RTO:** ~60 minutes (depends on backup transfer speed)
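
Since this estimate hinges on transfer speed, it can help to work the arithmetic for a concrete backup size. The helper below is a sketch with a hypothetical name, and the figures in the example are illustrative, not measurements from this deployment.

```bash
#!/usr/bin/env bash
# Sketch: estimate whole minutes (rounded up) needed to download a backup.
# Usage: transfer_minutes <size_mb> <throughput_mb_per_s>
transfer_minutes() {
  local size_mb="$1" rate="$2" seconds
  # Ceiling division: seconds = ceil(size / rate), minutes = ceil(seconds / 60)
  seconds=$(( (size_mb + rate - 1) / rate ))
  echo $(( (seconds + 59) / 60 ))
}

# Example: a 6000 MB backup at 10 MB/s takes 600 s
#   transfer_minutes 6000 10   # prints 10
```

The result covers only the download; provisioning, image pulls, the restore itself, and the Typesense reindex add to the total.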

### Scenario 4: Data Corruption (application-level)

1. Identify the scope: which tables/records are affected
2. If limited: restore specific tables using `pg_restore --table=<name>`
3. If widespread: full restore from the last known good backup
4. Invalidate Redis cache: `docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" FLUSHALL`
5. Reindex Typesense: `pnpm run typesense:reindex`
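
For the limited case in step 2, `pg_restore` accepts `--table` more than once, so several tables can be restored in one pass. The sketch below assembles that command. `restore_tables` and the `PG_RESTORE` override are hypothetical, and the database name `goodgo` in the example is taken from the `psql` command in Scenario 1. Note that with `--table`, `pg_restore` does not restore objects the selected tables depend on, and clearing the corrupted rows before loading is left to the operator.

```bash
#!/usr/bin/env bash
# Sketch: restore selected tables from a custom-format dump.
# Usage: restore_tables <dump_file> <database> <table> [<table>...]
# PG_RESTORE may be set to a different command or prefix (illustrative assumption).
restore_tables() {
  local dump="$1" db="$2" t
  local args=()
  shift 2
  for t in "$@"; do
    args+=("--table=$t")
  done
  # Left unquoted on purpose so a multi-word prefix such as "docker exec ..." splits into words
  ${PG_RESTORE:-pg_restore} --dbname="$db" "${args[@]}" "$dump"
}

# Example: preview the command by substituting echo for pg_restore
#   PG_RESTORE=echo restore_tables /backups/<latest>.sql.gz goodgo users orders
```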

---

## Log Aggregation