# GoodGo Platform -- Production Deployment Checklist

> Version: 1.0
> Last updated: 2026-03-06
> Owner: DevOps + CTO
> Domain: goodgo.vn (production), admin.goodgo.vn (admin panel)

---

## Pre-Deployment

- [ ] All E2E tests passing on staging (Playwright + functional tests)
- [ ] Security audit completed (rate limiting, input validation, RLS)
- [ ] Database migrations reviewed and tested on staging (EF Core)
- [ ] Secrets rotated (JWT signing keys, DB passwords, API keys, MinIO credentials)
- [ ] SSL/TLS certificates configured (goodgo.vn, api.goodgo.vn, admin.goodgo.vn)
- [ ] DNS records configured (A/CNAME for all subdomains)
- [ ] CDN configured for static assets (Blazor WASM _framework/, images)
- [ ] Backup strategy verified (daily PostgreSQL backups via Neon, point-in-time recovery)
- [ ] Load testing completed on staging (target: 100 concurrent users minimum)
- [ ] Rollback plan reviewed and approved by CTO

---

## Infrastructure

### Kubernetes Cluster (RKE2)

- [ ] K8s cluster provisioned and healthy (minimum 3 nodes)
- [ ] Namespace `production` created
- [ ] Resource limits set per service (256Mi-512Mi mem, 250m-500m CPU)
- [ ] HPA (Horizontal Pod Autoscaler) configured (min 2, max 10 replicas)
- [ ] PersistentVolumeClaims provisioned for MinIO and Redis
- [ ] Ingress + TLS configured via Traefik IngressClass
- [ ] Network policies enforced (service-to-service only, deny external by default)
- [ ] Node affinity / anti-affinity rules for HA (spread pods across nodes)

### External Services

- [ ] Neon PostgreSQL production database provisioned
- [ ] Redis production instance running (persistence enabled, AOF + RDB)
- [ ] RabbitMQ production cluster (mirrored queues, 2+ nodes)
- [ ] MinIO production buckets created with proper access policies
- [ ] Traefik v3 gateway deployed with production TLS config

---

## Services (repeat per service)

> 8 core services: iam, merchant, order, fnb-engine, wallet, catalog, inventory, chat

### Per-Service Checklist

- [ ] Docker image tagged with commit SHA (NEVER use :latest)
- [ ] Image pushed to Docker Hub (goodgo/{service}:{sha})
- [ ] Environment variables set in K8s Secrets (not ConfigMaps for sensitive data)
- [ ] Health checks responding: `/health/live` (liveness), `/health/ready` (readiness)
- [ ] Database migrated (EF Core migrations applied via `dotnet ef database update`)
- [ ] Seed data loaded (if applicable)
- [ ] Connection string pointing to Neon PostgreSQL production
- [ ] Redis connection string configured
- [ ] RabbitMQ connection configured
- [ ] API versioning header `X-Api-Version` tested
- [ ] Logging level set to `Information` (not `Debug`)
- [ ] Serilog structured logging outputting to stdout (for Promtail collection)

### Service-Specific

| Service | Extra Checks |
|---------|-------------|
| iam-service | JWT signing key (RS256) deployed, OIDC discovery endpoint live, MFA configured |
| merchant-service | Subscription plans seeded, shop lifecycle tested |
| order-service | SignalR PosHub accessible, Redis backplane connected, MessagePack configured |
| fnb-engine | Kitchen ticket flow tested, inventory deduction verified |
| wallet-service | VNPay production credentials configured, IPN callback URL registered |
| catalog-service | Product categories seeded |
| inventory-service | Reorder level alerts configured |
| chat-service | SignalR hub accessible, Redis backplane connected |

---

## Monitoring

- [ ] Prometheus deployed and scraping all 8 services on `/metrics`
- [ ] Grafana deployed with GoodGo Overview dashboard loaded
- [ ] Alert rules active in Prometheus (service down, high error rate, high latency, DB pool, disk, memory)
- [ ] Alert notifications configured (Slack channel #goodgo-alerts and/or PagerDuty)
- [ ] Loki deployed and receiving logs from all containers via Promtail
- [ ] Structured logging (Serilog JSON) verified in Loki queries
- [ ] Grafana Loki datasource configured and queryable
- [ ] Dashboard access restricted (admin credentials changed from defaults)

---

## Security

### Authentication & Authorization

- [ ] JWT signing key rotated from staging key (RS256 key pair)
- [ ] OIDC discovery endpoint (`/.well-known/openid-configuration`) returns production issuer
- [ ] Token expiry configured (access: 15min, refresh: 7 days)
- [ ] RBAC policies verified (Admin, Owner, Staff, Customer roles)

### Network & Transport

- [ ] CORS configured (allow only goodgo.vn, admin.goodgo.vn origins)
- [ ] HTTPS enforced (HTTP -> HTTPS redirect via Traefik middleware)
- [ ] Security headers configured via Traefik middleware:
  - `Strict-Transport-Security: max-age=63072000; includeSubDomains; preload`
  - `Content-Security-Policy: default-src 'self'`
  - `X-Frame-Options: DENY`
  - `X-Content-Type-Options: nosniff`
  - `Referrer-Policy: strict-origin-when-cross-origin`

### Rate Limiting

- [ ] Auth endpoints: 10 requests/min (brute force protection)
- [ ] Payment endpoints: 30 requests/min
- [ ] General API: 100 requests/min
- [ ] SignalR hub: 500 requests/min

### Data Protection

- [ ] Row-Level Security (RLS) policies applied on all tenant databases
- [ ] Database user has minimal required permissions (no SUPERUSER)
- [ ] MinIO buckets have proper ACLs (private by default, signed URLs for access)
- [ ] No secrets in environment variables visible via K8s describe (use Secrets, not ConfigMaps)
- [ ] Sensitive fields excluded from Serilog logging (passwords, tokens, card numbers)

---

## Rollback Plan

- [ ] Previous Docker images retained in Docker Hub (at least 5 recent tags)
- [ ] Database rollback migration scripts prepared and tested
- [ ] Feature flags configured for new features (can disable without redeploy)
- [ ] Canary deployment strategy documented:
  1. Deploy to 1 replica first
  2. Monitor error rate for 10 minutes
  3. If error rate < 1%, proceed to full rollout
  4. If error rate > 5%, auto-rollback via K8s rollout undo
- [ ] `kubectl rollout undo` command documented per service
- [ ] Communication plan for downtime (status page, Slack notification)

---

## Post-Deployment Verification

### Smoke Tests (within 30 minutes)

- [ ] IAM: Login flow works (email + password)
- [ ] IAM: Token refresh works
- [ ] IAM: MFA enrollment works
- [ ] Merchant: Shop creation works
- [ ] Order: Create order -> add items -> submit
- [ ] Order: Pay order (cash flow)
- [ ] FnB: Kitchen ticket appears on KDS
- [ ] Wallet: VNPay payment redirect works (sandbox -> production)
- [ ] Catalog: Product listing loads
- [ ] Inventory: Stock levels queryable
- [ ] Chat: SignalR connection established
- [ ] Storage: File upload + signed URL access

### Functional Verification (within 2 hours)

- [ ] Full Karaoke POS workflow (room select -> order -> pay -> close)
- [ ] Full Restaurant POS workflow (table -> order -> kitchen -> serve -> pay)
- [ ] QR code menu accessible from customer phone
- [ ] EOD report generates correctly with real data
- [ ] Multi-browser session (concurrent POS users on same shop)

### Monitoring Verification (within 24 hours)

- [ ] Monitor error rates (target: < 0.1% 5xx)
- [ ] Monitor p95 latency (target: < 500ms)
- [ ] Monitor SignalR connection stability (no unexpected disconnects)
- [ ] Verify Grafana dashboards show live data
- [ ] Verify alert rules fire correctly (test with synthetic failure if needed)
- [ ] Review Loki logs for any unhandled exceptions
- [ ] Verify PostgreSQL connection pool utilization is healthy (< 50%)

---

## Sign-Off

| Role | Name | Date | Approved |
|------|------|------|:--------:|
| CTO | | | [ ] |
| Tech Lead | | | [ ] |
| DevOps Lead | | | [ ] |
| QA Lead | | | [ ] |

---

*This checklist must be completed and signed off before production traffic is routed to the new deployment.*
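
### Appendix: Canary Decision Sketch

The canary thresholds in the Rollback Plan (promote below 1% errors, roll back above 5%) can be sketched as a small decision function. This is an illustrative sketch only, not the team's actual automation: the names `CanaryAction` and `canary_decision` are hypothetical, and the checklist does not say what happens between 1% and 5% or when the canary receives no traffic, so the sketch assumes a manual `HOLD` in both cases.

```python
from enum import Enum


class CanaryAction(Enum):
    PROMOTE = "promote"    # proceed to full rollout
    ROLLBACK = "rollback"  # e.g. trigger `kubectl rollout undo`
    HOLD = "hold"          # assumption: not defined by the checklist


# Thresholds taken from the canary steps in the Rollback Plan.
PROMOTE_BELOW = 0.01   # error rate < 1%
ROLLBACK_ABOVE = 0.05  # error rate > 5%


def canary_decision(total_requests: int, failed_requests: int) -> CanaryAction:
    """Map request counts from the 10-minute canary window to an action."""
    if total_requests <= 0:
        # No traffic means no evidence; holding here is an assumption.
        return CanaryAction.HOLD
    error_rate = failed_requests / total_requests
    if error_rate > ROLLBACK_ABOVE:
        return CanaryAction.ROLLBACK
    if error_rate < PROMOTE_BELOW:
        return CanaryAction.PROMOTE
    # Between 1% and 5% the checklist is silent; this sketch asks for a human.
    return CanaryAction.HOLD


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(canary_decision(10_000, 20))
    print(canary_decision(10_000, 800))
    print(canary_decision(10_000, 300))
```

Keeping the undefined middle band as an explicit `HOLD` makes the gap visible, so the team can decide whether 1% to 5% should extend the observation window or trigger a rollback.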