# Backup & Disaster Recovery

## RTO / RPO Targets

| Metric | Target | Notes |
|--------|--------|-------|
| **RPO** (Recovery Point Objective) | ≤ 24 hours | Daily backups at 02:00 UTC; worst case is a full day of data loss |
| **RTO** (Recovery Time Objective) | ≤ 30 minutes | Restore from local volume backup; longer if retrieving from off-site |

> To reduce RPO further, consider WAL archiving for PostgreSQL (continuous point-in-time recovery) and more frequent backup schedules.
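
As a rough illustration of how the RPO target could be monitored, the sketch below checks whether a backup file was written within the last 24 hours. The function name `backup_within_rpo` is hypothetical and not part of the existing scripts; it assumes a `find` that supports `-mmin` (GNU, BSD, and BusyBox do).

```bash
# Hypothetical helper: succeeds if FILE was modified within MAX_HOURS (default 24).
backup_within_rpo() {
  local file="$1" max_hours="${2:-24}"
  # find prints the path only when its mtime is newer than the cutoff.
  [ -n "$(find "$file" -maxdepth 0 -mmin "-$((max_hours * 60))" 2>/dev/null)" ]
}

# Example (path is illustrative):
#   backup_within_rpo /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz 24 || echo "RPO at risk"
```

A missing file is treated the same as a stale one, which is usually the safer default for an alert.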

---

## PostgreSQL Backup

### Overview

Automated daily PostgreSQL backups run inside the `pg-backup` Docker container using `pg_dump` in custom format with compression. Backups are stored in the `pg_backups` Docker volume.

## Backup Configuration

| Setting | Default | Environment Variable |
|---------|---------|---------------------|
| Schedule | Daily at 02:00 UTC | Cron in `pg-backup` service |
| Retention | 7 days | `BACKUP_RETENTION_DAYS` |
| Format | Custom (`pg_dump --format=custom`) | — |
| Compression | Level 6 | — |
| Storage | `pg_backups` Docker volume | — |
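
The retention setting implies that old dumps are pruned. How `pg-backup.sh` actually does this is not shown in this document; the following is only a sketch of one common approach, assuming dumps are named `goodgo_*.sql.gz` as in the restore examples. `prune_old_backups` is a hypothetical name.

```bash
# Hypothetical helper: delete dumps in DIR older than DAYS days.
# DAYS defaults to BACKUP_RETENTION_DAYS, or 7 if that is unset.
prune_old_backups() {
  local dir="$1" days="${2:-${BACKUP_RETENTION_DAYS:-7}}"
  # -mmin is used instead of -mtime to avoid find's whole-day rounding.
  find "$dir" -maxdepth 1 -type f -name 'goodgo_*.sql.gz' \
    -mmin "+$((days * 1440))" -delete
}

# Example: prune_old_backups /backups 7
```

Only files matching the dump pattern are touched, so unrelated files in the same directory are left alone.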

## Listing Backups

```bash
docker exec goodgo-pg-backup ls -lh /backups/
```

## Manual Backup

```bash
docker exec goodgo-pg-backup /scripts/pg-backup.sh
```

## Restore Procedure

### 1. Identify the backup to restore

```bash
docker exec goodgo-pg-backup ls -lht /backups/
```
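
When the restore is scripted rather than done by hand, the newest dump can be picked by name instead of by eye. This is a minimal sketch, assuming the `goodgo_YYYYMMDD_HHMMSS.sql.gz` naming used in the restore command; `latest_backup` is a hypothetical helper.

```bash
# Hypothetical helper: print the newest dump in DIR, judged by the timestamp in its name.
latest_backup() {
  local dir="$1"
  # goodgo_YYYYMMDD_HHMMSS names sort chronologically, so the last entry is the newest.
  find "$dir" -maxdepth 1 -type f -name 'goodgo_*.sql.gz' | sort | tail -n 1
}

# Example: latest_backup /backups
```

The function prints nothing when no dump is found, so callers should check for an empty result before restoring.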

### 2. Stop application services

```bash
docker compose stop ai-services
# Also stop any NestJS API processes (e.g. `docker compose stop api` when the API runs under Compose)
```

### 3. Run restore

```bash
docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
```

The restore script will:
- Terminate active database connections
- Drop and recreate the database
- Restore from the selected backup

### 4. Verify restore

```bash
docker exec goodgo-postgres psql -U goodgo -d goodgo -c '\dt'
docker exec goodgo-postgres psql -U goodgo -d goodgo -c 'SELECT count(*) FROM "User";'
```

### 5. Run Prisma migrations (if needed)

```bash
pnpm prisma migrate deploy
```

### 6. Restart services

```bash
docker compose up -d
```

## Backup Verification

Check the backup log:

```bash
docker exec goodgo-pg-backup cat /var/log/pg-backup.log
```

Verify that a backup archive is readable without restoring it (this lists the archive's table of contents):

```bash
docker exec goodgo-pg-backup pg_restore --list /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
```

---

## Redis Backup & Restore

Redis is configured with AOF persistence (`--appendonly yes`). Data is stored in the `redis_data` Docker volume.

### Manual Snapshot

```bash
# Trigger an RDB snapshot
docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE

# Check for completion: LASTSAVE prints the Unix time of the last successful save,
# which changes once the background save has finished
docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" LASTSAVE
```
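
Because `BGSAVE` returns immediately, a script has to poll `LASTSAVE` to know when the snapshot is done. The sketch below shows one way to do that; `lastsave` and `wait_for_save_after` are hypothetical helper names, and the one-second polling interval is an arbitrary choice.

```bash
# Print the Unix time of the last successful save (wraps the LASTSAVE call shown above).
lastsave() {
  docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" LASTSAVE
}

# Hypothetical helper: poll until LASTSAVE moves past BEFORE, or give up after TRIES polls.
wait_for_save_after() {
  local before="$1" tries="${2:-30}" now
  while [ "$tries" -gt 0 ]; do
    now=$(lastsave) || return 1
    if [ "$now" -gt "$before" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}

# Example:
#   before=$(lastsave)
#   docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE
#   wait_for_save_after "$before" 60 || echo "snapshot did not complete in time"
```

`LASTSAVE` has one-second resolution, so a snapshot that finishes in the same second as the previous save would not be detected by this comparison.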

### Backup the Volume

```bash
# Stop Redis to ensure consistent snapshot
docker compose stop redis

# Copy volume data to a backup location
docker run --rm -v goodgo-platform-ai_redis_data:/data -v $(pwd)/backups:/backup \
  alpine tar czf /backup/redis_$(date +%Y%m%d_%H%M%S).tar.gz -C /data .

docker compose start redis
```

### Restore Redis

```bash
docker compose stop redis

# Clear existing data and restore
docker run --rm -v goodgo-platform-ai_redis_data:/data -v $(pwd)/backups:/backup \
  alpine sh -c "rm -rf /data/* && tar xzf /backup/redis_YYYYMMDD_HHMMSS.tar.gz -C /data"

docker compose start redis
```
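
The restore command deletes the existing volume contents before unpacking, so it may be worth confirming that the archive is readable first. The check below is a sketch, not part of the documented procedure; `archive_ok` is a hypothetical helper, and the same check would apply to a Typesense volume archive.

```bash
# Hypothetical helper: succeed only if ARCHIVE is non-empty and tar can list it end to end.
archive_ok() {
  local archive="$1"
  [ -s "$archive" ] && tar tzf "$archive" >/dev/null 2>&1
}

# Example (run on the host, before clearing the volume):
#   archive_ok backups/redis_YYYYMMDD_HHMMSS.tar.gz || echo "archive unreadable, aborting"
```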

> **Note:** Redis is used as a cache with `allkeys-lru` eviction. Full data loss is non-critical: the API will repopulate cache entries on demand. Restore is only necessary if session data or queue state must be preserved.

---

## Typesense Backup & Restore

Typesense data is stored in the `typesense_data` Docker volume.

### Create Snapshot

```bash
# Typesense built-in snapshot API (the endpoint expects a POST request)
curl -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  "http://localhost:8108/operations/snapshot?snapshot_path=/data/snapshots/$(date +%Y%m%d)"
```

### Backup the Volume

```bash
docker compose stop typesense

docker run --rm -v goodgo-platform-ai_typesense_data:/data -v $(pwd)/backups:/backup \
  alpine tar czf /backup/typesense_$(date +%Y%m%d_%H%M%S).tar.gz -C /data .

docker compose start typesense
```

### Restore Typesense

```bash
docker compose stop typesense

docker run --rm -v goodgo-platform-ai_typesense_data:/data -v $(pwd)/backups:/backup \
  alpine sh -c "rm -rf /data/* && tar xzf /backup/typesense_YYYYMMDD_HHMMSS.tar.gz -C /data"

docker compose start typesense
```

### Rebuild from Source

If no backup is available, Typesense can be rebuilt by re-indexing from PostgreSQL:

```bash
# After Typesense restarts with empty data:
pnpm run typesense:reindex
```

---

## Disaster Recovery Runbook

### Scenario 1: PostgreSQL Failure (container crash or data corruption)

1. **Assess:** `docker logs goodgo-postgres` (check for corruption or OOM)
2. **Stop services:** `docker compose stop api ai-services`
3. **Attempt restart:** `docker compose restart postgres`
4. If restart fails (data corruption):
   - `docker compose stop postgres`
   - Remove volume: `docker volume rm goodgo-platform-ai_postgres_data`
   - Recreate: `docker compose up -d postgres`
   - Restore from backup: `docker exec goodgo-pg-backup /scripts/pg-restore.sh /backups/<latest>.sql.gz`
   - Verify: `docker exec goodgo-postgres psql -U goodgo -d goodgo -c '\dt'`
   - Run migrations: `pnpm prisma migrate deploy`
5. **Restart services:** `docker compose up -d`
6. **Verify:** Check API health at `http://localhost:3000/health`

**Expected RTO:** ~15 minutes (local backup), ~30 minutes (off-site backup)

### Scenario 2: Service Crash (API, AI-services, or Web)

1. **Check logs:** `docker compose logs --tail=100 <service>`
2. **Restart service:** `docker compose restart <service>`
3. If crash loops:
   - Check resource limits: `docker stats`
   - Check environment variables: `docker compose config`
   - Roll back to previous image tag if recent deployment caused the issue
4. **Verify:** Check health endpoints

**Expected RTO:** ~5 minutes
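
A restarted service may need a few seconds before its health endpoint responds, so the verification step is easier to script as a retry loop. The sketch below assumes `curl` is available on the host; `probe` and `wait_healthy` are hypothetical helpers, and the retry count and delay are arbitrary defaults.

```bash
# Probe one URL; succeeds unless the request fails or the server returns HTTP 4xx/5xx.
probe() {
  curl --fail --silent --max-time 5 --output /dev/null "$1"
}

# Hypothetical helper: retry the probe up to TRIES times, DELAY seconds apart.
wait_healthy() {
  local url="$1" tries="${2:-10}" delay="${3:-3}"
  while [ "$tries" -gt 0 ]; do
    if probe "$url"; then
      return 0
    fi
    tries=$((tries - 1))
    if [ "$tries" -gt 0 ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example, using the API health URL from Scenario 1:
#   wait_healthy http://localhost:3000/health 10 3 || echo "API still unhealthy"
```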

### Scenario 3: Full Host Failure

1. Provision new host with Docker + Docker Compose
2. Clone the repo and set up `.env` from secrets manager
3. Pull images: `docker compose pull`
4. Restore PostgreSQL backup from off-site storage
5. Start all services: `docker compose up -d`
6. Trigger Typesense reindex: `pnpm run typesense:reindex`
7. Verify all health endpoints

**Expected RTO:** ~60 minutes (depends on backup transfer speed). This exceeds the 30-minute RTO target, which assumes a restore from the local volume backup.

### Scenario 4: Data Corruption (application-level)

1. Identify the scope: which tables/records are affected
2. If limited: restore specific tables using `pg_restore --table=<name>`
3. If widespread: full restore from last known good backup
4. Invalidate Redis cache: `docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" FLUSHALL`
5. Reindex Typesense: `pnpm run typesense:reindex`

---

## Log Aggregation

Logs are aggregated via Loki + Promtail and viewable in Grafana:

- **Grafana**: http://localhost:3002 (dashboard: "GoodGo - Logs")
- **Loki**: http://localhost:3100
- **Log retention**: 15 days (configured in `monitoring/loki/loki-config.yml`)