diff --git a/docs/security/secret-rotation.md b/docs/security/secret-rotation.md new file mode 100644 index 0000000..61dfa70 --- /dev/null +++ b/docs/security/secret-rotation.md @@ -0,0 +1,447 @@ +# Secret Rotation Runbook + +**Owner:** Security Engineering +**Tracker:** [GOO-121](/GOO/issues/GOO-121) · Parent: [GOO-85](/GOO/issues/GOO-85) +**Last reviewed:** 2026-04-23 +**Audience:** On-call SRE, Security, Platform TechLead + +This runbook covers rotation of GoodGo Platform's production secrets. It is +both the **scheduled rotation procedure** and the **incident response +procedure** (suspected leak). Every secret class below has: + +1. Rotation trigger (scheduled + incident). +2. Pre-flight checks. +3. Step-by-step rotation. +4. Verification. +5. Rollback. + +> **Golden rules** +> +> - Always rehearse in **staging** before touching production. +> - Never paste production secrets into chat, issues, or commits. +> - Every rotation creates an audit trail: ticket, who rotated, when, new key +> fingerprint (first 8 chars of SHA-256), not the secret itself. +> - Use a break-glass buddy for production rotations (two-person rule). + +--- + +## 1. Secret inventory + +| Secret class | Env vars | Rotation cadence | Blast radius | Owner | +| ----------------------------- | ------------------------------------------------------------------------ | --------------------- | ------------------------------------------------------- | --------------- | +| JWT signing keys | `JWT_SECRET`, `JWT_REFRESH_SECRET` | 90 days / on leak | All active user sessions | Security / Auth | +| Field-level encryption | `FIELD_ENCRYPTION_KEY` | 180 days / on leak | At-rest encrypted columns (PII) | Security | +| VNPay | `VNPAY_HASH_SECRET`, `VNPAY_TMN_CODE` | 90 days / on leak | All VNPay checkout + IPN | Payments | +| MoMo | `MOMO_PARTNER_CODE`, `MOMO_ACCESS_KEY`, `MOMO_SECRET_KEY` | 90 days / on leak | All MoMo checkout + IPN | Payments | +| ZaloPay | `ZALOPAY_APP_ID`, `ZALOPAY_KEY1`, `ZALOPAY_KEY2` | 90 days / on leak | All ZaloPay checkout + IPN | Payments | +| Bank transfer webhook | `BANK_TRANSFER_WEBHOOK_SECRET` | 90 days / on leak | Inbound bank webhook verification | Payments | +| Database password | `DATABASE_URL` (password portion) | 180 days / on leak | All API DB access | Platform | +| Redis password | `REDIS_URL` / `REDIS_PASSWORD` | 180 days / on leak | Session cache, queues | Platform | +| OAuth provider secrets | `GOOGLE_CLIENT_SECRET`, `ZALO_APP_SECRET` | 180 days / on leak | Social login flows | Auth | +| Object storage | `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY` | 180 days / on leak | Media uploads/downloads | Platform | +| Notification | `ZALO_OA_ACCESS_TOKEN` | Per provider policy | Push / OA messages | Growth | + +All of these are enforced by `apps/api/src/modules/shared/infrastructure/env-validation.ts`. + +--- + +## 2. Key-generation reference + +Use **only** cryptographically secure generators. Never use `Math.random`, UUIDs, +or ad-hoc strings. Record only the **SHA-256 fingerprint** in the rotation +ticket. + +```bash +# JWT / webhook / generic 256-bit+ secret (>= 32 chars, base64) +openssl rand -base64 48 + +# Field-level encryption key (exactly 32 bytes, base64) +openssl rand -base64 32 + +# Database / Redis password (URL-safe, 32+ chars) +openssl rand -base64 36 | tr -d '/+=' | cut -c1-32 + +# Fingerprint to record in the rotation ticket (paste secret on stdin) +printf '%s' "$NEW_SECRET" | openssl dgst -sha256 | cut -c1-16 +``` + +Storage: secrets live in the platform secret store (Vault / SSM / sealed +secrets). **Never commit real values to `.env.example`** — that file documents +names only. + +--- + +## 3. JWT_SECRET / JWT_REFRESH_SECRET — dual-key rolling rotation + +### 3.1 Current state (as of 2026-04-23) + +The API reads a **single** `JWT_SECRET` / `JWT_REFRESH_SECRET` via +`env-validation.ts` and `apps/api/src/modules/auth/infrastructure/strategies/jwt.strategy.ts`. +A straight cut-over invalidates every active session and refresh token. + +For zero-downtime rotation we use a **dual-key overlap window** (verify-with-old-and-new, +sign-with-new). During the overlap window the app reads: + +- `JWT_SECRET` — **new** key, used to sign all new tokens. +- `JWT_SECRET_PREVIOUS` — **old** key, used only to verify unexpired tokens. + +> Dual-key loading requires a small code change in `JwtStrategy` / +> `TokenService` (pass both secrets, try new first, fall back to previous). +> The code change is tracked as a follow-up; **until it ships, rotations are +> "break sessions" rotations — schedule them during a low-traffic window and +> pre-announce**. + +### 3.2 Scheduled rotation (dual-key path, once code is in place) + +1. **Pre-flight** + - Ticket opened, change window booked, on-call notified. + - Staging rehearsal complete within last 7 days. + - Verify current access-token TTL (`JWT_EXPIRES_IN`, default `15m`) and + refresh-token TTL (default `30d`). The overlap window must be **≥** the + longest valid token's remaining life. + +2. **Generate new secrets** + + ```bash + NEW_JWT=$(openssl rand -base64 48) + NEW_JWT_REFRESH=$(openssl rand -base64 48) + ``` + +3. **Stage the overlap** + + In the secret store: + + | Variable | Value | + | --------------------------- | ------------------- | + | `JWT_SECRET_PREVIOUS` | current `JWT_SECRET` | + | `JWT_SECRET` | `$NEW_JWT` | + | `JWT_REFRESH_SECRET_PREVIOUS` | current `JWT_REFRESH_SECRET` | + | `JWT_REFRESH_SECRET` | `$NEW_JWT_REFRESH` | + + Roll the API deployment. Monitor `auth_login_total`, `auth_refresh_total`, + `auth_jwt_verify_failure_total`. Expected: no spike in 401s. + +4. **Hold overlap** + + Keep both keys live for **refresh-TTL + 24 h** (default 31 days). During this + time old tokens continue to verify against `*_PREVIOUS`, but every refresh + mints a new token signed with the new key. + +5. **Retire previous key** + + Remove `JWT_SECRET_PREVIOUS` and `JWT_REFRESH_SECRET_PREVIOUS` from the + secret store. Redeploy. At this point any remaining token signed with the + old key will fail verification — which is the intended end state. + +6. **Audit** + - Record fingerprints of new keys in the rotation ticket. + - Confirm no secrets appear in git, logs, or issue comments. + +### 3.3 Incident rotation (suspected leak) + +Skip the overlap. This **will** invalidate all sessions; that is the point. + +1. Generate new `JWT_SECRET` / `JWT_REFRESH_SECRET`. +2. Put service in maintenance mode (optional — it's graceful without it). +3. Update secret store → redeploy API. +4. Invalidate server-side sessions: + - Flush Redis key prefix `auth:user_status:v1:*` (see `jwt.strategy.ts` + constant `USER_STATUS_CACHE_PREFIX`). + - Truncate `RefreshToken` table (or flag revoked) so no old refresh token + can mint a new access token. +5. Announce forced re-login to users. +6. Post-mortem within 48 h. + +### 3.4 Verification + +- `GET /health/ready` returns 200. +- Smoke: login with a test account, hit an authenticated endpoint, refresh. +- Metrics: `auth_jwt_verify_failure_total` returns to baseline within 1 h. + +### 3.5 Rollback + +- Scheduled rotation: put old value back into `JWT_SECRET` / `JWT_REFRESH_SECRET` + (still present in `*_PREVIOUS` during overlap) and redeploy. +- Incident rotation: there is no rollback — old key is assumed burned. + +--- + +## 4. Payment provider secrets — VNPay / MoMo / ZaloPay + +Payment secrets are **shared** with the provider; you cannot rotate them +unilaterally. The rotation is always a coordinated cut-over via the provider +portal. + +### 4.1 Scope + +| Provider | Variables rotated in portal + our env | +| -------- | ------------------------------------------------------------------------------ | +| VNPay | `VNPAY_HASH_SECRET` (keep `VNPAY_TMN_CODE` stable unless the merchant rotates) | +| MoMo | `MOMO_ACCESS_KEY`, `MOMO_SECRET_KEY` | +| ZaloPay | `ZALOPAY_KEY1`, `ZALOPAY_KEY2` | + +All three providers sign both request and IPN callback. A mismatched secret +causes signature-verification failure on both legs. + +### 4.2 Pre-flight + +- Low-traffic window booked (recommend 02:00–04:00 ICT). +- Coordinate with the provider account manager; confirm the portal supports + immediate rotation (VNPay and MoMo do; ZaloPay requires ticket for prod). +- Staging rehearsal completed within last 14 days (see §4.5). +- Freeze new checkouts if the provider cannot overlap old + new secrets (most + cannot — rotation is atomic). +- Payments-on-call paged. +- Confirm no in-flight IPNs older than the provider's retry window + (VNPay 24 h, MoMo 24 h, ZaloPay 48 h). + +### 4.3 Scheduled rotation (production) + +1. **Drain:** stop the checkout queue consumers; let in-flight IPNs settle for + the provider's retry window. +2. **Provider portal:** log in → rotate secret → record new value + fingerprint + in the rotation ticket. +3. **Secret store:** update our env with the new value. +4. **Deploy:** roll the API. Consumers come back up. +5. **Smoke:** run the provider-specific test transaction (sandbox-shaped + minimum amount). Verify both checkout and IPN sign + verify with the new + secret. +6. **Monitor for 60 min:** + - `payment_signature_failure_total{provider}` stays at baseline. + - `payment_ipn_reject_total{provider}` stays at baseline. + - No unusual refund / reconciliation drift. + +### 4.4 Incident rotation (suspected leak) + +Same steps as §4.3, but compress the timeline and accept failed in-flight +transactions — better a handful of failed checkouts than a compromised secret. +File a follow-up for manual reconciliation of any payment created in the 30 min +before the rotation. + +### 4.5 Staging rehearsal + +The staging rehearsal for payment secrets **must** exist as a dry run before +any production rotation. Use the sandbox credentials documented in the +payments module runbook (each provider has a public sandbox). + +Record in the drill report (see §8): + +- Duration from "portal updated" to "first successful IPN verified". +- Any failed transactions and their reason codes. +- Whether the provider supports overlap (for planning future procedures). + +### 4.6 Rollback + +- If the provider portal still has the old secret active (rare — most providers + replace), revert the env var and redeploy. +- Otherwise rotate forward again to a freshly generated value; there is no way + to "un-rotate" at the provider. + +--- + +## 5. DATABASE_URL password — zero-downtime rotation + +### 5.1 Strategy + +Postgres supports **multiple roles** and connection strings already identify a +user. We rotate the password in two phases, using a transient dual-password +state via a second role: + +1. Create a shadow role `goodgo_app_v2` with the **new** password, same + privileges as the live role. Permit both roles to authenticate. +2. Update the app's `DATABASE_URL` to point at the new role. Roll the API. +3. Once all API pods have reconnected, drop the old role (or reset its + password and keep it as a break-glass). + +Postgres itself does not support "two valid passwords for one role"; swapping +roles is the clean zero-downtime path. + +### 5.2 Pre-flight + +- PostgreSQL 16 + PgBouncer connection pool verified healthy. +- Staging rehearsal completed within last 14 days. +- `pg_stat_activity` reviewed; no long-running migrations. +- Backup snapshot taken within last 6 h (see `docs/backup-restore.md`). + +### 5.3 Scheduled rotation + +```sql +-- Phase 1: create shadow role (run as DB owner / postgres) +CREATE ROLE goodgo_app_v2 LOGIN PASSWORD ''; +GRANT goodgo_app TO goodgo_app_v2; -- inherit group, or mirror explicit grants +GRANT CONNECT ON DATABASE goodgo TO goodgo_app_v2; +GRANT USAGE ON SCHEMA public TO goodgo_app_v2; +GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO goodgo_app_v2; +-- Mirror any other grants the live role has. Verify with: +-- \du goodgo_app +``` + +```bash +# Phase 2: update secret store, then roll API +# DATABASE_URL=postgresql://goodgo_app_v2:@host:5432/goodgo?sslmode=require + +# Rolling restart — one pod at a time; watch readiness probe before moving on. +kubectl -n goodgo rollout restart deployment/api +kubectl -n goodgo rollout status deployment/api --timeout=10m +``` + +```sql +-- Phase 3: verify no sessions still on old role, then retire it. +-- Run 30+ minutes after rollout completes. +SELECT usename, count(*) FROM pg_stat_activity WHERE usename IN ('goodgo_app','goodgo_app_v2') GROUP BY usename; +-- Expect: only goodgo_app_v2 connections. + +-- Option A: drop the old role (only if no other consumers use it). +-- REASSIGN OWNED BY goodgo_app TO goodgo_app_v2; +-- DROP OWNED BY goodgo_app; +-- DROP ROLE goodgo_app; + +-- Option B (recommended): reset its password to a fresh random value and keep +-- it as an emergency break-glass. Document the fingerprint in the ticket. +ALTER ROLE goodgo_app PASSWORD ''; +``` + +For the next rotation, flip the naming (`goodgo_app_v2` → `goodgo_app_v3`), +keeping the alternation going. This avoids ever needing to drop and recreate +the "canonical" role name. + +### 5.4 PgBouncer considerations + +If PgBouncer sits in front of Postgres: + +- Update `userlist.txt` (or its auth source) with both roles **before** the + API roll. +- `RELOAD` PgBouncer; do not `RESTART` (clients reconnect automatically from + `RELOAD` without dropping server-side transactions). +- Verify with `SHOW USERS;` on the PgBouncer admin console. + +### 5.5 Incident rotation + +Same steps but: + +- Skip the 30-minute settle in Phase 3 — rotate immediately to Option A (drop + the compromised role) once no active sessions remain. +- If a session is actively using the compromised role, terminate it: + ```sql + SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE usename = 'goodgo_app'; + ``` +- Run a post-rotation audit on the compromised-role's activity since the last + known-good window. + +### 5.6 Verification + +- `GET /health/ready` reports DB connectivity 200. +- `db_connection_pool_active` returns to steady state. +- Smoke queries via `pnpm db:studio` with the new credential. + +### 5.7 Rollback + +- Until Phase 3 completes, rollback is: revert `DATABASE_URL` to the old role + and redeploy. The old role still authenticates. +- After Phase 3 Option A (drop): no rollback; restore from snapshot is the + last resort. + +--- + +## 6. FIELD_ENCRYPTION_KEY + +Rotating the field-encryption key requires **re-encrypting at-rest data**. It +is not a hot swap. Out of scope for this runbook beyond documenting that it +exists and requires its own migration playbook. A separate issue will track +the re-encryption tooling; until then: + +- Generate and stage the new key alongside the old (`FIELD_ENCRYPTION_KEY` + + `FIELD_ENCRYPTION_KEY_PREVIOUS`). +- Do not flip the primary until a re-encrypt job has rewritten all + encrypted columns. +- This path is **approved-change-only** (CTO sign-off). + +Tracked as follow-up: see §9. + +--- + +## 7. Rotation checklist (copy into the rotation ticket) + +```md +## Rotation — + +- [ ] Ticket opened in Paperclip; linked to [GOO-121](/GOO/issues/GOO-121) +- [ ] Change window booked (date/time ICT) +- [ ] Staging rehearsal completed (date, drill report link) +- [ ] Buddy on-call: +- [ ] New secret generated with `openssl rand -base64 48` (or class-specific) +- [ ] New-secret fingerprint (SHA-256 first 16 chars): `________________` +- [ ] Secret store updated (do not paste the value here) +- [ ] Deploy rolled; readiness probes green +- [ ] Smoke + metrics verified (link to dashboard snapshot) +- [ ] Overlap window end date (JWT only): ____ +- [ ] Old secret retired / role dropped (timestamp) +- [ ] Post-rotation audit note in ticket +- [ ] Runbook updated if anything surprised us +``` + +--- + +## 8. Drill report template + +Each scheduled rotation — starting with a staging dry run — produces a drill +report posted as a comment on [GOO-121](/GOO/issues/GOO-121) (for the initial +drill) or on the rotation ticket. + +```md +## Drill report — + +**Window:** 02:00–02:47 ICT +**Rotated by:** with buddy + +### Timeline +- 02:00 — Pre-flight complete +- 02:05 — New secret generated (fingerprint `abcd1234…`) +- 02:10 — Secret store updated +- 02:12 — Deployment rolled +- 02:18 — Smoke passed +- 02:20 — Monitoring baseline confirmed +- 02:47 — Drill closed + +### Results +- Duration: 47 min +- Auth errors during rotation: 0 (scheduled) / N (incident — list) +- Payment failures: 0 / N +- Rollback triggered: no +- Follow-ups: link any new issues created + +### Learnings +- … +``` + +--- + +## 9. Follow-ups + +The following items are **not** delivered by this runbook and should be +tracked as separate issues: + +- **Dual-key JWT code path.** `JwtStrategy` and `TokenService` need to accept + `JWT_SECRET_PREVIOUS` / `JWT_REFRESH_SECRET_PREVIOUS` so §3.2 is truly + zero-downtime. Until then, JWT rotation invalidates sessions. +- **Field-encryption re-encrypt tool.** Required before `FIELD_ENCRYPTION_KEY` + can be rotated safely in production. +- **Secret-store automation.** Today rotations are manual via the secret + store UI; an automated rotator (Vault / SSM Parameter Store rotation + lambda) would shrink the window and reduce human error. +- **Production rotation approval.** Payment + DB password rotations in + production require a CTO approval window — see [GOO-85](/GOO/issues/GOO-85). + +--- + +## 10. References + +- `apps/api/src/modules/shared/infrastructure/env-validation.ts` — authoritative + list of required secrets and minimum-length enforcement. +- `apps/api/src/modules/auth/infrastructure/strategies/jwt.strategy.ts` — + current single-key JWT verification path. +- `docs/RUNBOOK.md` — general incident response procedures. +- `docs/backup-restore.md` — database snapshot / restore steps invoked during + DB password rotation pre-flight. +- `docs/security/PAYMENT_SECURITY_CHECKLIST.md` — payment security controls. +- Parent tracker: [GOO-85](/GOO/issues/GOO-85).