
# Backup & Disaster Recovery

## RTO / RPO Targets

| Metric | Target | Notes |
|--------|--------|-------|
| **RPO** (Recovery Point Objective) | ≤ 24 hours | Daily backups at 02:00 UTC; worst case is a full day of data loss |
| **RTO** (Recovery Time Objective) | ≤ 30 minutes | Applies to a restore from the local volume backup; retrieving an off-site backup or rebuilding the host takes longer (about 60 minutes, see Scenario 3) |

> To reduce RPO further, consider WAL archiving for PostgreSQL (continuous point-in-time recovery) and more frequent backup schedules.

---

## PostgreSQL Backup

### Overview

Automated daily PostgreSQL backups run inside the `pg-backup` Docker container, which calls `pg_dump` with the compressed custom format. Backups are stored in the `pg_backups` Docker volume.

## Backup Configuration

| Setting | Default | Environment Variable |
|---------|---------|---------------------|
| Schedule | Daily at 02:00 UTC | Cron in `pg-backup` service |
| Retention | 7 days | `BACKUP_RETENTION_DAYS` |
| Format | Custom (`pg_dump --format=custom`) | — |
| Compression | Level 6 | — |
| Storage | `pg_backups` Docker volume | — |

## Listing Backups

```bash
docker exec goodgo-pg-backup ls -lh /backups/
```
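
With the 7-day retention from the configuration table, older files are expected to drop out of this listing. The backup script itself is not shown here, so the following is only a sketch of how such pruning is commonly done; `prune_backups` is a hypothetical helper and the `goodgo_*.sql.gz` pattern is taken from the file names used elsewhere in this document.

```bash
# Hypothetical helper: delete backups in directory $1 older than $2 days.
# -mtime +N matches files whose age, truncated to whole days, exceeds N.
prune_backups() {
  dir=$1
  days=${2:-${BACKUP_RETENTION_DAYS:-7}}
  find "$dir" -type f -name 'goodgo_*.sql.gz' -mtime +"$days" -delete
}
```

For example, `prune_backups /backups` would apply the `BACKUP_RETENTION_DAYS` value, falling back to 7.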

## Manual Backup

```bash
docker exec goodgo-pg-backup /scripts/pg-backup.sh
```

## Restore Procedure

### 1. Identify the backup to restore

```bash
docker exec goodgo-pg-backup ls -lht /backups/
```
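
If the newest backup is wanted, it can also be selected by name rather than by modification time, which stays correct after files have been copied between hosts. A minimal sketch, assuming the `goodgo_YYYYMMDD_HHMMSS.sql.gz` naming shown in this document; `latest_backup` is a hypothetical helper.

```bash
# Hypothetical helper: read file names on stdin, print the newest backup.
# goodgo_YYYYMMDD_HHMMSS names sort chronologically as plain text.
latest_backup() {
  grep '^goodgo_[0-9]\{8\}_[0-9]\{6\}\.sql\.gz$' | sort | tail -n 1
}
```

Usage: `docker exec goodgo-pg-backup ls /backups/ | latest_backup`.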

### 2. Stop application services

```bash
docker compose stop ai-services
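# Also stop any NestJS API processes that connect to the database
# (for example `docker compose stop api` if the API runs under Compose)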
```

### 3. Run restore

```bash
docker exec -it goodgo-pg-backup /scripts/pg-restore.sh /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
```

The restore script will:
- Terminate active database connections
- Drop and recreate the database
- Restore from the selected backup

### 4. Verify restore

```bash
docker exec goodgo-postgres psql -U goodgo -d goodgo -c '\dt'
docker exec goodgo-postgres psql -U goodgo -d goodgo -c 'SELECT count(*) FROM "User";'
```

### 5. Run Prisma migrations (if needed)

```bash
pnpm prisma migrate deploy
```

### 6. Restart services

```bash
docker compose up -d
```

## Backup Verification

Check the backup log:

```bash
docker exec goodgo-pg-backup cat /var/log/pg-backup.log
```
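
Beyond reading the log, the age of the newest backup can be compared with the 24-hour RPO target. A minimal sketch; `backup_within_rpo` is a hypothetical helper that takes the backup's modification time and the current time as Unix seconds.

```bash
# Hypothetical helper: succeed if a backup taken at $1 (Unix seconds)
# is at most $3 hours old at time $2 (Unix seconds). Default: 24 hours.
backup_within_rpo() {
  backup_epoch=$1
  now_epoch=$2
  max_hours=${3:-24}
  age_seconds=$((now_epoch - backup_epoch))
  [ "$age_seconds" -le $((max_hours * 3600)) ]
}
```

Assuming `stat -c %Y` is available in the container, a call might look like `backup_within_rpo "$(docker exec goodgo-pg-backup stat -c %Y /backups/<file>)" "$(date +%s)"`.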

Check that a backup is readable without restoring it (this lists the archive's table of contents; it does not validate the table data):

```bash
docker exec goodgo-pg-backup pg_restore --list /backups/goodgo_YYYYMMDD_HHMMSS.sql.gz
```
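
To turn the listing into a pass/fail check, the number of entries can be counted. A minimal sketch; `count_toc_entries` is a hypothetical helper, and it relies on `pg_restore --list` writing its header as lines that start with `;`.

```bash
# Hypothetical helper: count archive entries in `pg_restore --list` output
# read from stdin. Lines starting with ';' are header comments, not entries.
count_toc_entries() {
  grep -v '^;' | grep -c '[^[:space:]]'
}
```

Usage: `docker exec goodgo-pg-backup pg_restore --list /backups/<file> | count_toc_entries`. A result of 0 suggests an empty or unusable archive.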

---

## Redis Backup & Restore

Redis is configured with AOF persistence (`--appendonly yes`). Data is stored in the `redis_data` Docker volume.

### Manual Snapshot

```bash
# Trigger an RDB snapshot
docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" BGSAVE

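# Check completion: LASTSAVE does not wait. It prints the Unix time of the
# last successful save, which changes once the background save has finished.
docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" LASTSAVE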
```
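
Because `LASTSAVE` only reports a timestamp, waiting for the snapshot means polling until that timestamp changes. A minimal sketch; `lastsave` and `wait_for_save` are hypothetical helpers and `POLL_SECONDS` is an assumed tuning variable.

```bash
# Read the Unix time of the last successful save from the Redis container.
lastsave() {
  docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" LASTSAVE
}

# Hypothetical helper: poll until LASTSAVE differs from $1, up to $2 tries.
# Prints the new timestamp on success; returns 1 if it never changes.
wait_for_save() {
  previous=$1
  tries=${2:-30}
  while [ "$tries" -gt 0 ]; do
    current=$(lastsave)
    if [ "$current" != "$previous" ]; then
      echo "$current"
      return 0
    fi
    tries=$((tries - 1))
    sleep "${POLL_SECONDS:-1}"
  done
  return 1
}
```

A typical sequence would record `before=$(lastsave)`, issue `BGSAVE`, then call `wait_for_save "$before"`.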

### Backup the Volume

```bash
# Stop Redis to ensure a consistent snapshot
docker compose stop redis

# Copy volume data to a backup location (the quotes keep paths with spaces intact)
docker run --rm -v goodgo-platform-ai_redis_data:/data -v "$(pwd)/backups":/backup \
  alpine tar czf /backup/redis_$(date +%Y%m%d_%H%M%S).tar.gz -C /data .

docker compose start redis
```

### Restore Redis

```bash
docker compose stop redis

# Clear existing data and restore
docker run --rm -v goodgo-platform-ai_redis_data:/data -v "$(pwd)/backups":/backup \
  alpine sh -c "rm -rf /data/* && tar xzf /backup/redis_YYYYMMDD_HHMMSS.tar.gz -C /data"

docker compose start redis
```
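
Since the restore command clears the volume before extracting, it may be worth confirming that the archive is readable first. A minimal sketch; `archive_ok` is a hypothetical helper and is not part of the existing scripts.

```bash
# Hypothetical helper: succeed only if $1 is a non-empty file that tar can
# list as a gzip-compressed archive.
archive_ok() {
  archive=$1
  [ -s "$archive" ] && tar -tzf "$archive" >/dev/null 2>&1
}
```

For example, `archive_ok backups/redis_YYYYMMDD_HHMMSS.tar.gz || echo "archive unreadable, aborting"` before stopping Redis.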

> **Note:** Redis is used as a cache with `allkeys-lru` eviction. Full data loss is non-critical — the API will repopulate cache entries on demand. Restore is only necessary if session data or queue state must be preserved.

---

## Typesense Backup & Restore

Typesense data is stored in the `typesense_data` Docker volume.

### Create Snapshot

```bash
# Typesense built-in snapshot API (the endpoint expects a POST request)
curl -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  "http://localhost:8108/operations/snapshot?snapshot_path=/data/snapshots/$(date +%Y%m%d)"
```
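
The snapshot endpoint answers with a small JSON body, documented by Typesense as `{"success": true}` when the snapshot was written. A minimal sketch for checking that in a script; `snapshot_succeeded` is a hypothetical helper.

```bash
# Hypothetical helper: read the snapshot API response on stdin and
# succeed only if it reports "success": true.
snapshot_succeeded() {
  grep -q '"success"[[:space:]]*:[[:space:]]*true'
}
```

Piping the response of the snapshot request into `snapshot_succeeded` gives an exit status that a cron job or wrapper script can act on.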

### Backup the Volume

```bash
docker compose stop typesense

docker run --rm -v goodgo-platform-ai_typesense_data:/data -v "$(pwd)/backups":/backup \
  alpine tar czf /backup/typesense_$(date +%Y%m%d_%H%M%S).tar.gz -C /data .

docker compose start typesense
```

### Restore Typesense

```bash
docker compose stop typesense

docker run --rm -v goodgo-platform-ai_typesense_data:/data -v "$(pwd)/backups":/backup \
  alpine sh -c "rm -rf /data/* && tar xzf /backup/typesense_YYYYMMDD_HHMMSS.tar.gz -C /data"

docker compose start typesense
```

### Rebuild from Source

If no backup is available, Typesense can be rebuilt by re-indexing from PostgreSQL:

```bash
# After Typesense restarts with empty data:
pnpm run typesense:reindex
```

---

## Disaster Recovery Runbook

### Scenario 1: PostgreSQL Failure (container crash or data corruption)

1. **Assess:** `docker logs goodgo-postgres` — check for corruption or OOM
2. **Stop services:** `docker compose stop api ai-services`
3. **Attempt restart:** `docker compose restart postgres`
4. If restart fails (data corruption):
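   - Stop and remove the container so its volume can be deleted: `docker compose rm -sf postgres`
   - Remove the volume (this permanently deletes the current database files): `docker volume rm goodgo-platform-ai_postgres_data`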
   - Recreate: `docker compose up -d postgres`
   - Restore from backup: `docker exec goodgo-pg-backup /scripts/pg-restore.sh /backups/<latest>.sql.gz`
   - Verify: `docker exec goodgo-postgres psql -U goodgo -d goodgo -c '\dt'`
   - Run migrations: `pnpm prisma migrate deploy`
5. **Restart services:** `docker compose up -d`
6. **Verify:** Check API health at `http://localhost:3000/health`

**Expected RTO:** ~15 minutes (local backup), ~30 minutes (off-site backup)
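
The final health check can be scripted so that the runbook waits for the API instead of checking once. A minimal sketch; `probe` and `wait_healthy` are hypothetical helpers and `POLL_SECONDS` is an assumed tuning variable.

```bash
# Probe a URL; fails if the request fails or the server returns 4xx/5xx.
probe() {
  curl -fsS -o /dev/null --max-time 5 "$1"
}

# Hypothetical helper: retry probe on URL $1 up to $2 times.
wait_healthy() {
  url=$1
  attempts=${2:-30}
  while [ "$attempts" -gt 0 ]; do
    if probe "$url"; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "${POLL_SECONDS:-2}"
  done
  echo "not healthy after retries: $url" >&2
  return 1
}
```

Usage: `wait_healthy http://localhost:3000/health 30`.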

### Scenario 2: Service Crash (API, AI-services, or Web)

1. **Check logs:** `docker compose logs --tail=100 <service>`
2. **Restart service:** `docker compose restart <service>`
3. If crash loops:
   - Check resource limits: `docker stats`
   - Check environment variables: `docker compose config`
   - Roll back to previous image tag if recent deployment caused the issue
4. **Verify:** Check health endpoints

**Expected RTO:** ~5 minutes

### Scenario 3: Full Host Failure

1. Provision new host with Docker + Docker Compose
2. Clone the repo and set up `.env` from secrets manager
3. Pull images: `docker compose pull`
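4. Start the database and backup containers: `docker compose up -d postgres pg-backup`
5. Copy the PostgreSQL backup from off-site storage into the backup container (`docker cp <file> goodgo-pg-backup:/backups/`) and restore it with `/scripts/pg-restore.sh`
6. Start all services: `docker compose up -d`
7. Trigger Typesense reindex: `pnpm run typesense:reindex`
8. Verify all health endpoints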

**Expected RTO:** ~60 minutes (depends on backup transfer speed)

### Scenario 4: Data Corruption (application-level)

1. Identify the scope: which tables/records are affected
2. If limited: restore specific tables using `pg_restore --table=<name>`
3. If widespread: full restore from last known good backup
4. Invalidate Redis cache: `docker exec goodgo-redis redis-cli -a "$REDIS_PASSWORD" FLUSHALL`
5. Reindex Typesense: `pnpm run typesense:reindex`
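
For the limited case in step 2, a table-level restore might be assembled as below. This is a sketch under assumptions: `table_restore_cmd` is a hypothetical helper that only prints the command, and `--data-only` is an added choice that presumes the table still exists. Note that with `--table`, `pg_restore` does not restore other objects the table depends on, and a data-only load adds rows to whatever the table already contains.

```bash
# Hypothetical helper: print the pg_restore command that reloads only the
# rows of table $2 into database $1 from dump $3. Nothing is executed.
table_restore_cmd() {
  db=$1
  table=$2
  dump=$3
  printf 'pg_restore --data-only --table=%s -d %s %s\n' "$table" "$db" "$dump"
}
```

Printing the command first makes it easy to review before running it inside the `pg-backup` container.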

---

## Log Aggregation

Logs are aggregated via Loki + Promtail and viewable in Grafana:

- **Grafana**: http://localhost:3002 (dashboard: "GoodGo - Logs")
- **Loki**: http://localhost:3100
- **Log retention**: 15 days (configured in `monitoring/loki/loki-config.yml`)
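
During an incident, logs can also be pulled straight from Loki's HTTP API without opening Grafana. A minimal sketch; `loki_query` is a hypothetical helper, `LOKI_URL` is an assumed override variable, and the label used in the example selector depends on the Promtail configuration, which is not shown here.

```bash
# Hypothetical helper: fetch up to $2 recent log lines matching LogQL
# selector $1 from Loki's query_range endpoint (defaults to the last hour).
loki_query() {
  selector=$1
  limit=${2:-20}
  curl -G -sS "${LOKI_URL:-http://localhost:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=$selector" \
    --data-urlencode "limit=$limit"
}
```

Usage, assuming a `container` label exists: `loki_query '{container="goodgo-postgres"}' 50`.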