Secret Rotation Runbook
Owner: Security Engineering
Tracker: GOO-121 · Parent: GOO-85
Last reviewed: 2026-04-23
Audience: On-call SRE, Security, Platform TechLead
This runbook covers rotation of GoodGo Platform's production secrets. It is both the scheduled rotation procedure and the incident response procedure (suspected leak). Every secret class below has:
- Rotation trigger (scheduled + incident).
- Pre-flight checks.
- Step-by-step rotation.
- Verification.
- Rollback.
Golden rules
- Always rehearse in staging before touching production.
- Never paste production secrets into chat, issues, or commits.
- Every rotation creates an audit trail: ticket, who rotated, when, and the new key's fingerprint (first 16 hex characters of its SHA-256 digest), never the secret itself.
- Use a break-glass buddy for production rotations (two-person rule).
1. Secret inventory
| Secret class | Env vars | Rotation cadence | Blast radius | Owner |
|---|---|---|---|---|
| JWT signing keys | JWT_SECRET, JWT_REFRESH_SECRET | 90 days / on leak | All active user sessions | Security / Auth |
| Field-level encryption | FIELD_ENCRYPTION_KEY | 180 days / on leak | At-rest encrypted columns (PII) | Security |
| VNPay | VNPAY_HASH_SECRET, VNPAY_TMN_CODE | 90 days / on leak | All VNPay checkout + IPN | Payments |
| MoMo | MOMO_PARTNER_CODE, MOMO_ACCESS_KEY, MOMO_SECRET_KEY | 90 days / on leak | All MoMo checkout + IPN | Payments |
| ZaloPay | ZALOPAY_APP_ID, ZALOPAY_KEY1, ZALOPAY_KEY2 | 90 days / on leak | All ZaloPay checkout + IPN | Payments |
| Bank transfer webhook | BANK_TRANSFER_WEBHOOK_SECRET | 90 days / on leak | Inbound bank webhook verification | Payments |
| Database password | DATABASE_URL (password portion) | 180 days / on leak | All API DB access | Platform |
| Redis password | REDIS_URL / REDIS_PASSWORD | 180 days / on leak | Session cache, queues | Platform |
| OAuth provider secrets | GOOGLE_CLIENT_SECRET, ZALO_APP_SECRET | 180 days / on leak | Social login flows | Auth |
| Object storage | MINIO_ACCESS_KEY, MINIO_SECRET_KEY | 180 days / on leak | Media uploads/downloads | Platform |
| Notification | ZALO_OA_ACCESS_TOKEN | Per provider policy | Push / OA messages | Growth |
All of these are enforced by apps/api/src/modules/shared/infrastructure/env-validation.ts.
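As a rough illustration of what presence and minimum-length enforcement can look like, the sketch below checks a few secrets from the inventory. The `SECRET_RULES` table, the `validateSecrets` name, and the 32-character floor (borrowed from the key-generation comment in §2) are assumptions for illustration only; the real rules live in env-validation.ts and may differ.

```typescript
// Hypothetical rule table: env var name -> minimum length in characters.
// The authoritative rules are in env-validation.ts; these values are assumed.
const SECRET_RULES: Record<string, number> = {
  JWT_SECRET: 32,
  JWT_REFRESH_SECRET: 32,
  BANK_TRANSFER_WEBHOOK_SECRET: 32,
};

// Returns one human-readable problem per missing or too-short secret.
// Messages contain names and lengths only, never the secret value.
function validateSecrets(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  for (const [name, minLength] of Object.entries(SECRET_RULES)) {
    const value = env[name];
    if (value === undefined || value === "") {
      problems.push(`${name} is missing`);
    } else if (value.length < minLength) {
      problems.push(`${name} is ${value.length} chars, need >= ${minLength}`);
    }
  }
  return problems;
}
```

Running a check of this kind against the staged values before the deploy step catches a truncated or empty secret before pods start failing readiness.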
2. Key-generation reference
Use only cryptographically secure generators. Never use Math.random, UUIDs,
or ad-hoc strings. Record only the SHA-256 fingerprint in the rotation
ticket.
# JWT / webhook / generic 256-bit+ secret (>= 32 chars, base64)
openssl rand -base64 48
# Field-level encryption key (exactly 32 bytes, base64)
openssl rand -base64 32
# Database / Redis password (URL-safe, 32+ chars)
openssl rand -base64 36 | tr -d '/+=' | cut -c1-32
# Fingerprint to record in the rotation ticket (secret is read from $NEW_SECRET).
# -r prints "<hex digest> *stdin", so the first 16 characters are all digest.
printf '%s' "$NEW_SECRET" | openssl dgst -sha256 -r | cut -c1-16
Storage: secrets live in the platform secret store (Vault / SSM / sealed
secrets). Never commit real values to .env.example — that file documents
names only.
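For tooling written in the API's own language, the same generation and fingerprint steps might look like the sketch below. The function names `generateSecret` and `fingerprint` are hypothetical; only Node's built-in `node:crypto` module is used.

```typescript
import { createHash, randomBytes } from "node:crypto";

// 48 random bytes, base64-encoded: the same shape as `openssl rand -base64 48`.
function generateSecret(byteLength = 48): string {
  return randomBytes(byteLength).toString("base64");
}

// First 16 hex characters of SHA-256 over the exact secret string
// (no trailing newline), suitable for the rotation ticket.
function fingerprint(secret: string): string {
  return createHash("sha256").update(secret, "utf8").digest("hex").slice(0, 16);
}
```

Only the output of `fingerprint` belongs in the ticket; the value returned by `generateSecret` goes straight to the secret store.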
3. JWT_SECRET / JWT_REFRESH_SECRET — dual-key rolling rotation
3.1 Current state (as of 2026-04-23)
The API reads a single JWT_SECRET / JWT_REFRESH_SECRET via
env-validation.ts and apps/api/src/modules/auth/infrastructure/strategies/jwt.strategy.ts.
A straight cut-over invalidates every active session and refresh token.
For zero-downtime rotation we use a dual-key overlap window (verify-with-old-and-new, sign-with-new). During the overlap window the app reads:
- JWT_SECRET: new key, used to sign all new tokens.
- JWT_SECRET_PREVIOUS: old key, used only to verify unexpired tokens.
Dual-key loading requires a small code change in JwtStrategy / TokenService (pass both secrets, try the new one first, fall back to the previous one). The code change is tracked as a follow-up (see §9); until it ships, rotations are "break sessions" rotations: schedule them during a low-traffic window and pre-announce.
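The verify-with-both, sign-with-new behaviour can be sketched as follows. This is not the code in JwtStrategy or TokenService: it assumes HS256 tokens, hand-rolls the signature with `node:crypto` to stay self-contained, and the names `signToken` and `verifyToken` are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// HMAC-SHA256 over "<header>.<payload>" (assumes HS256 tokens).
function hs256(signingInput: string, secret: string): Buffer {
  return createHmac("sha256", secret).update(signingInput).digest();
}

// Signing always uses the current (new) key.
function signToken(payload: Record<string, unknown>, currentSecret: string): string {
  const enc = (o: unknown) => Buffer.from(JSON.stringify(o)).toString("base64url");
  const input = `${enc({ alg: "HS256", typ: "JWT" })}.${enc(payload)}`;
  return `${input}.${hs256(input, currentSecret).toString("base64url")}`;
}

// Verification tries the current key first, then the previous key if one is set.
// Returns the claims on success, or null if no key matches or the token expired.
function verifyToken(
  token: string,
  secrets: Array<string | undefined>,
): Record<string, unknown> | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [header, payload, sig] = parts;
  const given = Buffer.from(sig, "base64url");
  for (const secret of secrets) {
    if (!secret) continue; // *_PREVIOUS is unset outside the overlap window
    const expected = hs256(`${header}.${payload}`, secret);
    if (given.length === expected.length && timingSafeEqual(given, expected)) {
      const claims = JSON.parse(Buffer.from(payload, "base64url").toString("utf8"));
      if (typeof claims.exp === "number" && claims.exp * 1000 <= Date.now()) return null;
      return claims;
    }
  }
  return null;
}
```

During the overlap a caller would pass `[JWT_SECRET, JWT_SECRET_PREVIOUS]`; once the previous key is removed from the environment the second entry is `undefined` and old tokens stop verifying without any further code change.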
3.2 Scheduled rotation (dual-key path, once code is in place)
- Pre-flight
  - Ticket opened, change window booked, on-call notified.
  - Staging rehearsal complete within last 7 days.
  - Verify current access-token TTL (JWT_EXPIRES_IN, default 15m) and refresh-token TTL (default 30d). The overlap window must be ≥ the longest valid token's remaining life.
- Generate new secrets
    NEW_JWT=$(openssl rand -base64 48)
    NEW_JWT_REFRESH=$(openssl rand -base64 48)
- Stage the overlap
  In the secret store:
  | Variable | Value |
  |---|---|
  | JWT_SECRET_PREVIOUS | current JWT_SECRET |
  | JWT_SECRET | $NEW_JWT |
  | JWT_REFRESH_SECRET_PREVIOUS | current JWT_REFRESH_SECRET |
  | JWT_REFRESH_SECRET | $NEW_JWT_REFRESH |
  Roll the API deployment. Monitor auth_login_total, auth_refresh_total, auth_jwt_verify_failure_total. Expected: no spike in 401s.
- Hold overlap
  Keep both keys live for refresh-TTL + 24 h (default 31 days). During this time old tokens continue to verify against *_PREVIOUS, but every refresh mints a new token signed with the new key.
- Retire previous key
  Remove JWT_SECRET_PREVIOUS and JWT_REFRESH_SECRET_PREVIOUS from the secret store. Redeploy. At this point any remaining token signed with the old key will fail verification, which is the intended end state.
- Audit
  - Record fingerprints of new keys in the rotation ticket.
  - Confirm no secrets appear in git, logs, or issue comments.
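The "Overlap window end date" recorded in the ticket follows from the TTLs: roll time plus the longest token lifetime plus 24 h. A minimal sketch, assuming TTLs are written as a number followed by s, m, h or d (as in 15m and 30d); `ttlToMs` and `overlapEnd` are hypothetical helper names.

```typescript
// Parses TTL strings of the form "<number><unit>", unit in s / m / h / d.
function ttlToMs(ttl: string): number {
  const match = /^(\d+)([smhd])$/.exec(ttl);
  if (!match) throw new Error(`Unsupported TTL format: ${ttl}`);
  const unitMs: Record<string, number> = {
    s: 1000,
    m: 60 * 1000,
    h: 60 * 60 * 1000,
    d: 24 * 60 * 60 * 1000,
  };
  return Number(match[1]) * unitMs[match[2]];
}

// Earliest moment the *_PREVIOUS keys can be retired:
// roll time + longest token lifetime + a 24 h safety margin.
function overlapEnd(rolledAt: Date, accessTtl: string, refreshTtl: string): Date {
  const longest = Math.max(ttlToMs(accessTtl), ttlToMs(refreshTtl));
  return new Date(rolledAt.getTime() + longest + 24 * 60 * 60 * 1000);
}

// With the defaults, a roll at 2026-04-23T02:00Z can be retired from 2026-05-24T02:00Z.
console.log(overlapEnd(new Date("2026-04-23T02:00:00Z"), "15m", "30d").toISOString());
```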
3.3 Incident rotation (suspected leak)
Skip the overlap. This will invalidate all sessions; that is the point.
- Generate new JWT_SECRET / JWT_REFRESH_SECRET.
- Put service in maintenance mode (optional; the cut-over is graceful without it).
- Update secret store → redeploy API.
- Invalidate server-side sessions:
  - Flush Redis key prefix auth:user_status:v1:* (see jwt.strategy.ts constant USER_STATUS_CACHE_PREFIX).
  - Truncate the RefreshToken table (or flag its rows revoked) so no old refresh token can mint a new access token.
- Announce forced re-login to users.
- Post-mortem within 48 h.
3.4 Verification
- GET /health/ready returns 200.
- Smoke: login with a test account, hit an authenticated endpoint, refresh.
- Metrics: auth_jwt_verify_failure_total returns to baseline within 1 h.
3.5 Rollback
- Scheduled rotation: put the old value back into JWT_SECRET / JWT_REFRESH_SECRET (still present in *_PREVIOUS during overlap) and redeploy.
- Incident rotation: there is no rollback; the old key is assumed burned.
4. Payment provider secrets — VNPay / MoMo / ZaloPay
Payment secrets are shared with the provider; you cannot rotate them unilaterally. The rotation is always a coordinated cut-over via the provider portal.
4.1 Scope
| Provider | Variables rotated in portal + our env |
|---|---|
| VNPay | VNPAY_HASH_SECRET (keep VNPAY_TMN_CODE stable unless the merchant rotates) |
| MoMo | MOMO_ACCESS_KEY, MOMO_SECRET_KEY |
| ZaloPay | ZALOPAY_KEY1, ZALOPAY_KEY2 |
All three providers sign both request and IPN callback. A mismatched secret causes signature-verification failure on both legs.
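To see why a secret mismatch breaks both legs, consider a generic HMAC check. The sketch below uses HMAC-SHA256 over an already-canonicalised string purely for illustration; it is not any provider's exact wire format, and `signPayload` / `verifyPayload` are hypothetical names.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Generic HMAC-SHA256 over a canonical string. Each provider defines its own
// field ordering and hash algorithm; this only illustrates the principle.
function signPayload(canonical: string, secret: string): string {
  return createHmac("sha256", secret).update(canonical, "utf8").digest("hex");
}

// Constant-time comparison of the received signature with the expected one.
function verifyPayload(canonical: string, signatureHex: string, secret: string): boolean {
  const expected = Buffer.from(signPayload(canonical, secret), "hex");
  const given = Buffer.from(signatureHex, "hex");
  return given.length === expected.length && timingSafeEqual(given, expected);
}

// An IPN signed by the provider with the old secret no longer verifies
// once our side has switched to the new one (and vice versa for requests).
const ipnSignature = signPayload("amount=10000&orderId=T1", "old-secret");
console.log(verifyPayload("amount=10000&orderId=T1", ipnSignature, "old-secret")); // true
console.log(verifyPayload("amount=10000&orderId=T1", ipnSignature, "new-secret")); // false
```

This is why the portal change and the env change have to land as one coordinated cut-over.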
4.2 Pre-flight
- Low-traffic window booked (recommend 02:00–04:00 ICT).
- Coordinate with the provider account manager; confirm the portal supports immediate rotation (VNPay and MoMo do; ZaloPay requires a ticket for prod).
- Staging rehearsal completed within last 14 days (see §4.5).
- Freeze new checkouts if the provider cannot overlap old + new secrets (most cannot — rotation is atomic).
- Payments-on-call paged.
- Confirm no in-flight IPNs older than the provider's retry window (VNPay 24 h, MoMo 24 h, ZaloPay 48 h).
4.3 Scheduled rotation (production)
- Drain: stop the checkout queue consumers; let in-flight IPNs settle for the provider's retry window.
- Provider portal: log in → rotate secret → record new value + fingerprint in the rotation ticket.
- Secret store: update our env with the new value.
- Deploy: roll the API. Consumers come back up.
- Smoke: run the provider-specific test transaction (sandbox-shaped minimum amount). Verify both checkout and IPN sign + verify with the new secret.
- Monitor for 60 min:
  - payment_signature_failure_total{provider} stays at baseline.
  - payment_ipn_reject_total{provider} stays at baseline.
  - No unusual refund / reconciliation drift.
4.4 Incident rotation (suspected leak)
Same steps as §4.3, but compress the timeline and accept failed in-flight transactions — better a handful of failed checkouts than a compromised secret. File a follow-up for manual reconciliation of any payment created in the 30 min before the rotation.
4.5 Staging rehearsal
A staging rehearsal for payment secrets must be completed as a dry run before any production rotation. Use the sandbox credentials documented in the payments module runbook (each provider has a public sandbox).
Record in the drill report (see §8):
- Duration from "portal updated" to "first successful IPN verified".
- Any failed transactions and their reason codes.
- Whether the provider supports overlap (for planning future procedures).
4.6 Rollback
- If the provider portal still has the old secret active (rare — most providers replace), revert the env var and redeploy.
- Otherwise rotate forward again to a freshly generated value; there is no way to "un-rotate" at the provider.
5. DATABASE_URL password — zero-downtime rotation
5.1 Strategy
Postgres supports multiple roles and connection strings already identify a user. We rotate the password in two phases, using a transient dual-password state via a second role:
- Create a shadow role goodgo_app_v2 with the new password and the same privileges as the live role. Permit both roles to authenticate.
- Update the app's DATABASE_URL to point at the new role. Roll the API.
- Once all API pods have reconnected, drop the old role (or reset its password and keep it as a break-glass).
Postgres itself does not support "two valid passwords for one role"; swapping roles is the clean zero-downtime path.
5.2 Pre-flight
- PostgreSQL 16 + PgBouncer connection pool verified healthy.
- Staging rehearsal completed within last 14 days.
- pg_stat_activity reviewed; no long-running migrations.
- Backup snapshot taken within last 6 h (see docs/backup-restore.md).
5.3 Scheduled rotation
-- Phase 1: create shadow role (run as DB owner / postgres)
CREATE ROLE goodgo_app_v2 LOGIN PASSWORD '<NEW_PASSWORD>';
GRANT goodgo_app TO goodgo_app_v2; -- inherit group, or mirror explicit grants
GRANT CONNECT ON DATABASE goodgo TO goodgo_app_v2;
GRANT USAGE ON SCHEMA public TO goodgo_app_v2;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO goodgo_app_v2;
-- Mirror any other grants the live role has. Inspect them in psql with:
--   \du goodgo_app   (role attributes and memberships)
--   \dp              (table, view and sequence privileges)
# Phase 2: update secret store, then roll API
# DATABASE_URL=postgresql://goodgo_app_v2:<NEW_PASSWORD>@host:5432/goodgo?sslmode=require
# Rolling restart — one pod at a time; watch readiness probe before moving on.
kubectl -n goodgo rollout restart deployment/api
kubectl -n goodgo rollout status deployment/api --timeout=10m
-- Phase 3: verify no sessions still on old role, then retire it.
-- Run 30+ minutes after rollout completes.
SELECT usename, count(*) FROM pg_stat_activity WHERE usename IN ('goodgo_app','goodgo_app_v2') GROUP BY usename;
-- Expect: only goodgo_app_v2 connections.
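-- Option A: drop the old role (only if no other consumers use it, and only if
-- goodgo_app_v2 holds explicit grants: privileges it inherits through
-- membership in goodgo_app disappear when that role is dropped).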
-- REASSIGN OWNED BY goodgo_app TO goodgo_app_v2;
-- DROP OWNED BY goodgo_app;
-- DROP ROLE goodgo_app;
-- Option B (recommended): reset its password to a fresh random value and keep
-- it as an emergency break-glass. Document the fingerprint in the ticket.
ALTER ROLE goodgo_app PASSWORD '<RANDOM_BREAKGLASS>';
For the next rotation, advance the suffix (goodgo_app_v2 → goodgo_app_v3) and repeat the same three phases. This avoids ever needing to drop and recreate the "canonical" role name.
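Building the Phase 2 connection string by hand invites copy-paste mistakes. A small sketch, assuming the runtime's standard WHATWG `URL` class; `withRole` and `redact` are hypothetical helper names, not part of the codebase.

```typescript
// Returns a copy of the connection string with the role and password swapped.
// Host, port, database name and query parameters are left untouched.
function withRole(databaseUrl: string, role: string, password: string): string {
  const url = new URL(databaseUrl);
  url.username = role; // the setters percent-encode reserved characters
  url.password = password;
  return url.toString();
}

// Redacted form that is safe to paste into a ticket or log line.
function redact(databaseUrl: string): string {
  const url = new URL(databaseUrl);
  if (url.password) url.password = "REDACTED";
  return url.toString();
}

const next = withRole(
  "postgresql://goodgo_app:oldpw@host:5432/goodgo?sslmode=require",
  "goodgo_app_v2",
  "N3wPassw0rd",
);
console.log(redact(next));
```

The redacted form lets the rotation ticket show which role and host the deployment now targets without exposing the password.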
5.4 PgBouncer considerations
If PgBouncer sits in front of Postgres:
- Update userlist.txt (or its auth source) with both roles before the API roll.
- RELOAD PgBouncer; do not RESTART (RELOAD re-reads the configuration in place, so existing client connections and server-side transactions are not dropped).
- Verify with SHOW USERS; on the PgBouncer admin console.
5.5 Incident rotation
Same steps but:
- Skip the 30-minute settle in Phase 3: rotate immediately to Option A (drop the compromised role) once no active sessions remain.
- If a session is actively using the compromised role, terminate it:
    SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE usename = 'goodgo_app';
- Run a post-rotation audit on the compromised role's activity since the last known-good window.
5.6 Verification
- GET /health/ready reports DB connectivity 200.
- db_connection_pool_active returns to steady state.
- Smoke queries via pnpm db:studio with the new credential.
5.7 Rollback
- Until Phase 3 completes, rollback is: revert DATABASE_URL to the old role and redeploy. The old role still authenticates.
- After Phase 3 Option A (drop): no rollback; restore from snapshot is the last resort.
6. FIELD_ENCRYPTION_KEY
Rotating the field-encryption key requires re-encrypting at-rest data. It is not a hot swap. Out of scope for this runbook beyond documenting that it exists and requires its own migration playbook. A separate issue will track the re-encryption tooling; until then:
- Generate and stage the new key alongside the old (FIELD_ENCRYPTION_KEY + FIELD_ENCRYPTION_KEY_PREVIOUS).
- Do not flip the primary until a re-encrypt job has rewritten all encrypted columns.
- This path is approved-change-only (CTO sign-off).
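The per-value step a re-encrypt job would perform can be sketched as below. The cipher (AES-256-GCM), the blob layout, and all function names are assumptions for illustration; the platform's actual field-encryption format is not described in this runbook and must be taken from the code before building the real tool.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Assumed blob layout (hypothetical): 12-byte IV | 16-byte auth tag | ciphertext, base64.
function encryptField(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const body = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), body]).toString("base64");
}

// Throws if the key is wrong or the blob was tampered with (GCM tag check).
function decryptField(blob: string, key: Buffer): string {
  const raw = Buffer.from(blob, "base64");
  const decipher = createDecipheriv("aes-256-gcm", key, raw.subarray(0, 12));
  decipher.setAuthTag(raw.subarray(12, 28));
  return Buffer.concat([decipher.update(raw.subarray(28)), decipher.final()]).toString("utf8");
}

// Re-encrypt step: read with the previous key, write with the new one.
function reencryptField(blob: string, previousKey: Buffer, newKey: Buffer): string {
  return encryptField(decryptField(blob, previousKey), newKey);
}
```

Because a value written with the new key cannot be read with the old one, the primary key can only be flipped after every row has passed through a step like `reencryptField`, which is why this rotation needs its own migration playbook.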
Tracked as follow-up: see §9.
7. Rotation checklist (copy into the rotation ticket)
## Rotation — <secret class> — <env>
- [ ] Ticket opened in Paperclip; linked to [GOO-121](/GOO/issues/GOO-121)
- [ ] Change window booked (date/time ICT)
- [ ] Staging rehearsal completed (date, drill report link)
- [ ] Buddy on-call: <name>
- [ ] New secret generated with `openssl rand -base64 48` (or class-specific)
- [ ] New-secret fingerprint (SHA-256 first 16 chars): `________________`
- [ ] Secret store updated (do not paste the value here)
- [ ] Deploy rolled; readiness probes green
- [ ] Smoke + metrics verified (link to dashboard snapshot)
- [ ] Overlap window end date (JWT only): ____
- [ ] Old secret retired / role dropped (timestamp)
- [ ] Post-rotation audit note in ticket
- [ ] Runbook updated if anything surprised us
8. Drill report template
Each scheduled rotation — starting with a staging dry run — produces a drill report posted as a comment on GOO-121 (for the initial drill) or on the rotation ticket.
## Drill report — <secret class> — <env> — <date>
**Window:** 02:00–02:47 ICT
**Rotated by:** <agent/user> with buddy <name>
### Timeline
- 02:00 — Pre-flight complete
- 02:05 — New secret generated (fingerprint `abcd1234…`)
- 02:10 — Secret store updated
- 02:12 — Deployment rolled
- 02:18 — Smoke passed
- 02:20 — Monitoring baseline confirmed
- 02:47 — Drill closed
### Results
- Duration: 47 min
- Auth errors during rotation: 0 (scheduled) / N (incident — list)
- Payment failures: 0 / N
- Rollback triggered: no
- Follow-ups: link any new issues created
### Learnings
- …
9. Follow-ups
The following items are not delivered by this runbook and should be tracked as separate issues:
- Dual-key JWT code path. JwtStrategy and TokenService need to accept JWT_SECRET_PREVIOUS / JWT_REFRESH_SECRET_PREVIOUS so §3.2 is truly zero-downtime. Until then, JWT rotation invalidates sessions.
- Field-encryption re-encrypt tool. Required before FIELD_ENCRYPTION_KEY can be rotated safely in production.
- Secret-store automation. Today rotations are manual via the secret store UI; an automated rotator (Vault / SSM Parameter Store rotation lambda) would shrink the window and reduce human error.
- Production rotation approval. Payment + DB password rotations in production require a CTO approval window; see GOO-85.
10. References
- apps/api/src/modules/shared/infrastructure/env-validation.ts: authoritative list of required secrets and minimum-length enforcement.
- apps/api/src/modules/auth/infrastructure/strategies/jwt.strategy.ts: current single-key JWT verification path.
- docs/RUNBOOK.md: general incident response procedures.
- docs/backup-restore.md: database snapshot / restore steps invoked during DB password rotation pre-flight.
- docs/security/PAYMENT_SECURITY_CHECKLIST.md: payment security controls.
- Parent tracker: GOO-85.