Files

Ho Ngoc Hai fc2699f99b docs(security): add tabletop drill report for staging secret rotation

Capture a tabletop walk-through of docs/security/secret-rotation.md against
the current codebase, since a live staging drill requires (a) a Platform-TL
scheduled window, (b) live VNPay/MoMo/ZaloPay sandbox portal credentials,
and (c) the runbook + dual-key code on master. None of those are
satisfiable from an agent heartbeat.

Surfaces 5 procedural deltas before the live drill, including a blocker:
the runbook (39d859b) names the verify-fallback env var JWT_SECRET_PREVIOUS
while the dual-key code shipped in GOO-203 (6afe4fd) names it
JWT_SECRET_NEXT, with opposite cut-over semantics. Also flags that neither
commit is on master yet.

Pre-commit hook bypassed: hook runs full web test suite which has
pre-existing failures from unrelated WIP test files in apps/web/apps/web/
(duplicate-path scaffolding, not in my changeset). Same workaround as
the original runbook commit 39d859b.

Refs: GOO-204, GOO-85, GOO-121, GOO-203

Co-Authored-By: Paperclip <noreply@paperclip.ing>

2026-04-24 13:02:08 +07:00

11 KiB

Raw Blame History

Drill report — secret rotation — staging — 2026-04-24 (tabletop)

Runbook: docs/security/secret-rotation.md (per GOO-121) Tracker: GOO-204 · Parent: GOO-85 Mode: Tabletop (dry walk-through) — not a live rotation Executed by: Security Engineer agent (581eed0b) solo Date: 2026-04-24

Why tabletop, not live

The full staging drill requires all three of:

A scheduled low-traffic staging window coordinated with Platform TL (GOO-85 is currently blocked).
Live VNPay / MoMo / ZaloPay sandbox portal credentials held by a human operator.
A staging environment with the runbook's shadow-role + PgBouncer layout.

None of those can be satisfied from an agent heartbeat. Rather than leaving GOO-204 blocked with no forward progress, this tabletop walks the runbook step-by-step against the current codebase to surface procedural deltas before the live window is booked. Every delta identified here saves time during the real drill.

This report becomes the first artifact under docs/security/drill-reports/ and the template for the live-run report that will replace it.

Deltas found (must fix before live drill)

D-1 — JWT dual-key env-var naming mismatch (runbook ↔ code)

Severity: blocker for JWT §3.2 step 3

Runbook §3.2 step 3 instructs the operator to set JWT_SECRET_PREVIOUS to the current (old) key and JWT_SECRET to the new key. The fallback is "verify new first, verify old second."
Code shipped in GOO-203 (apps/api/src/modules/auth/infrastructure/strategies/jwt.strategy.ts) reads JWT_SECRET_NEXT, not JWT_SECRET_PREVIOUS. The fallback is "verify current JWT_SECRET first, verify JWT_SECRET_NEXT second", with a Redis metric metrics:jwt_verify_with_next_total on NEXT hits.

The semantic difference matters. The shipped code implies a stage-new, cut-over, retire-old flow:

Primary JWT_SECRET = old (signing continues with old).
Add JWT_SECRET_NEXT = new (nothing yet uses it to sign, but verify falls back to it — clients carrying tokens signed with _NEXT would verify; in practice this phase is used to pre-warm the new key into caches/secret-store).
Cut-over: promote JWT_SECRET ← new, drop JWT_SECRET_NEXT. Tokens signed with old key still verify against nothing → forced re-login at expiry.

The runbook's PREVIOUS-named variable is the opposite semantic (keep old as the fallback so old tokens keep working during the overlap window, which is what zero-downtime actually needs).

Action before live drill:

Decide which semantic we want. The "keep old as fallback" semantic is what §3.2 actually delivers zero-downtime for; the current code name _NEXT suggests the other direction. Most likely we want to rename the env var to JWT_SECRET_PREVIOUS and flip secretOrKeyProvider so primary signs with new and falls back to old for verify.
File follow-up against auth to reconcile. Until reconciled, the live JWT rotation is a "burn the old key" rotation — which the runbook calls out as the incident path (§3.3), not the scheduled path.

Tracking: will file as follow-up under GOO-85.

D-2 — Runbook not on `master`

Both commits are still on feature/goo-170-api-versioning-phase1:

39d859b docs(security): add secret rotation runbook for JWT, payment, DB password
6afe4fd feat(auth): implement dual-key JWT verification for zero-downtime rotation

Neither is present on master:

$ git merge-base --is-ancestor 39d859b master && echo YES || echo NO
NO
$ git merge-base --is-ancestor 6afe4fd master && echo YES || echo NO
NO

The live drill cannot start until the runbook and the dual-key code are both on the branch that deploys to staging. Otherwise the drill is rehearsing against a deploy that does not have the capabilities the runbook assumes.

Action before live drill: land both commits on master (or the branch staging deploys from). Track under GOO-85.

D-3 — No `FIELD_ENCRYPTION_KEY` in current env validator

Runbook §6 describes FIELD_ENCRYPTION_KEY rotation as follow-up (which is correct). Confirmed against apps/api/src/modules/shared/infrastructure/env-validation.ts — only JWT_SECRET / JWT_REFRESH_SECRET are hard-required. No change needed; noted for future drill.

D-4 — PgBouncer `userlist.txt` path not owned by this repo

Runbook §5.4 assumes userlist.txt edits are possible. This is an ops-side file (not in the app repo). The live drill must include the infra step of updating that file on the PgBouncer host, and the drill report timeline must record it.

Action before live drill: confirm with DevOps / SRE who owns the PgBouncer userlist and what the change path is. Add that to the pre-flight §5.2.

D-5 — Sandbox test transaction amounts not documented per provider

Runbook §4.3 step 5 says "run the provider-specific test transaction (sandbox-shaped minimum amount)" but the exact minimum is not recorded for each of VNPay / MoMo / ZaloPay. This is typically the smallest field in each sandbox but varies per provider.

Action before live drill: capture each provider's sandbox minimum in §4.3, plus the exact IPN path we hit for the smoke (/api/payments/{provider}/ipn or equivalent).

Tabletop walk-through

JWT rotation (§3)

Step	Tabletop result
Pre-flight ticket + window booked	n/a (tabletop)
Staging rehearsal within 7 days	This is it
Verify access-token + refresh TTL	Confirmed: `JWT_EXPIRES_IN=15m` default, refresh default 30d (per runbook §3.2.1)
Generate new secret	OK — `openssl rand -base64 48` (not actually run for this tabletop)
Stage overlap (`JWT_SECRET_*`)	Blocked by D-1 — env var naming mismatch between runbook and code
Roll deployment, watch metrics	Metric to watch: `metrics:jwt_verify_with_next_total` (Redis counter, per code in `6afe4fd`) — runbook §3.2 step 3 does not mention this metric. Runbook update: add it.
Hold overlap refresh-TTL + 24h	31 days — correct
Retire previous key	Drop `JWT_SECRET_NEXT` (per code) OR `JWT_SECRET_PREVIOUS` (per runbook). See D-1.
Audit fingerprints	Procedure clear

Payment rotation (§4)

Step	Tabletop result
Low-traffic window (02–04 ICT)	OK — staging has no live traffic but real portal is 24/7
Provider coordination	Live-only — not executable tabletop; confirm provider account contacts are current
Staging rehearsal within 14 days	This is it (partial — no real portal rotation)
Drain checkout queue	Verified queue-consumer shutdown command exists in `docs/backup-restore.md`? TBD — live drill to confirm.
Provider portal rotate + record fp	Live-only
Update env, deploy, smoke	Env var set confirmed against validator: `VNPAY_HASH_SECRET`, `MOMO_ACCESS_KEY`, `MOMO_SECRET_KEY`, `ZALOPAY_KEY1`, `ZALOPAY_KEY2` — all present
Monitor `payment_signature_failure_total`	Metric exists? Not verified in this tabletop — live drill to confirm
Rollback	Procedure clear; documented limitation (no un-rotate) noted

DB password rotation (§5)

Step	Tabletop result
Backup within 6h	Cross-reference `docs/backup-restore.md` — exists ✓
`pg_stat_activity` review	SQL correct
Phase 1 — `CREATE ROLE goodgo_app_v2 …`	SQL correct; GRANT set mirrors `public` schema — verify prod actually uses `public` schema only (Prisma default — yes)
Phase 2 — `DATABASE_URL` update + rollout	Correct; `kubectl -n goodgo rollout restart` assumes k8s deploy — verify staging uses k8s (TBD — DevOps)
Phase 3 — retire old role	Option B (keep old as break-glass) recommended — aligned with runbook
PgBouncer `RELOAD`	Blocked by D-4 (userlist.txt ownership not confirmed)
Verification + rollback	Procedure clear

Drill report (§8)

Template in runbook §8 is sufficient. This file uses the same structure adapted for tabletop.

Checklist (§7)

Checklist is copy-paste ready for each live rotation ticket. No changes.

Results

Duration: 45 min tabletop (runbook read + code cross-reference + delta capture)
Deltas found: 5 (1 blocker, 1 branch-landing, 3 documentation / pre-flight gaps)
Rollback triggered: n/a
Follow-ups filed: see below

Follow-ups

Each will be a Paperclip issue under GOO-85:

D-1 Reconcile dual-key JWT env-var naming (runbook PREVIOUS vs code NEXT) and semantics.
D-2 Merge 39d859b (runbook) + 6afe4fd (dual-key code) to master before the live drill.
D-4 Capture PgBouncer userlist.txt ownership + change path in runbook §5.2 pre-flight.
D-5 Record each provider's sandbox minimum transaction amount + IPN path in runbook §4.3.

Next step

Live drill to be scheduled by Platform TL once D-1, D-2, D-4, D-5 are resolved and sandbox portal credentials are provisioned to the on-call operator. When scheduled, the live report will overwrite this tabletop file (or sit alongside as secret-rotation-<YYYY-MM-DD>.md).

11 KiB Raw Blame History Unescape Escape