GoodGo Platform -- Production Deployment Checklist

Version: 1.0
Last updated: 2026-03-06
Owner: DevOps + CTO
Domain: goodgo.vn (production), admin.goodgo.vn (admin panel)


Pre-Deployment

  • All E2E tests passing on staging (Playwright + functional tests)
  • Security audit completed (rate limiting, input validation, RLS)
  • Database migrations reviewed and tested on staging (EF Core)
  • Secrets rotated (JWT signing keys, DB passwords, API keys, MinIO credentials)
  • SSL/TLS certificates configured (goodgo.vn, api.goodgo.vn, admin.goodgo.vn)
  • DNS records configured (A/CNAME for all subdomains)
  • CDN configured for static assets (Blazor WASM _framework/, images)
  • Backup strategy verified (daily PostgreSQL backups via Neon, point-in-time recovery)
  • Load testing completed on staging (target: 100 concurrent users minimum)
  • Rollback plan reviewed and approved by CTO

Infrastructure

Kubernetes Cluster (RKE2)

  • K8s cluster provisioned and healthy (minimum 3 nodes)
  • Namespace production created
  • Resource limits set per service (256Mi-512Mi mem, 250m-500m CPU)
  • HPA (Horizontal Pod Autoscaler) configured (min 2, max 10 replicas)
  • PersistentVolumeClaims provisioned for MinIO and Redis
  • Ingress + TLS configured via Traefik IngressClass
  • Network policies enforced (service-to-service only, deny external by default)
  • Node affinity / anti-affinity rules for HA (spread pods across nodes)
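
The resource and autoscaling bounds above lend themselves to an automated pre-rollout check. The following is a minimal Python sketch, assuming the manifests have already been parsed into dictionaries (for example from YAML). The function names are hypothetical, and reading "min 2, max 10" as "at least 2, at most 10" is an assumption.

```python
import re

# Bounds taken from the checklist: 256Mi-512Mi memory, 250m-500m CPU.
MEM_RANGE_MI = (256, 512)
CPU_RANGE_M = (250, 500)


def parse_memory_mi(value: str) -> int:
    """Parse a Kubernetes memory quantity given in Mi or Gi into MiB."""
    match = re.fullmatch(r"(\d+)(Mi|Gi)", value)
    if not match:
        raise ValueError(f"unsupported memory quantity: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount * 1024 if unit == "Gi" else amount


def parse_cpu_m(value: str) -> int:
    """Parse a Kubernetes CPU quantity ('500m' or '0.5') into millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return round(float(value) * 1000)


def check_container_limits(container: dict) -> list[str]:
    """Return a list of problems with one container's resource limits."""
    name = container.get("name", "?")
    limits = container.get("resources", {}).get("limits", {})
    if "memory" not in limits or "cpu" not in limits:
        return [f"{name}: missing cpu/memory limits"]
    problems = []
    mem = parse_memory_mi(limits["memory"])
    cpu = parse_cpu_m(limits["cpu"])
    if not MEM_RANGE_MI[0] <= mem <= MEM_RANGE_MI[1]:
        problems.append(f"{name}: memory limit {mem}Mi outside {MEM_RANGE_MI}")
    if not CPU_RANGE_M[0] <= cpu <= CPU_RANGE_M[1]:
        problems.append(f"{name}: cpu limit {cpu}m outside {CPU_RANGE_M}")
    return problems


def check_hpa(hpa: dict) -> list[str]:
    """Check a HorizontalPodAutoscaler spec against the min 2 / max 10 rule."""
    spec = hpa.get("spec", {})
    problems = []
    # Kubernetes defaults minReplicas to 1 when it is not set.
    if spec.get("minReplicas", 1) < 2:
        problems.append("minReplicas must be at least 2")
    max_replicas = spec.get("maxReplicas")
    if max_replicas is None or max_replicas > 10:
        problems.append("maxReplicas must be set and at most 10")
    return problems
```

An empty list from both functions means the manifest satisfies the two corresponding checklist items; any returned string can be surfaced directly in CI output.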

External Services

  • Neon PostgreSQL production database provisioned
  • Redis production instance running (persistence enabled, AOF + RDB)
  • RabbitMQ production cluster running (mirrored queues, 2+ nodes; on RabbitMQ 4.x use quorum queues, since classic mirrored queues were removed in 4.0)
  • MinIO production buckets created with proper access policies
  • Traefik v3 gateway deployed with production TLS config

Services (repeat per service)

8 core services: iam, merchant, order, fnb-engine, wallet, catalog, inventory, chat

Per-Service Checklist

  • Docker image tagged with commit SHA (NEVER use :latest)
  • Image pushed to Docker Hub (goodgo/{service}:{sha})
  • Environment variables set in K8s Secrets (not ConfigMaps for sensitive data)
  • Health checks responding: /health/live (liveness), /health/ready (readiness)
  • Database migrated (EF Core migrations applied via dotnet ef database update)
  • Seed data loaded (if applicable)
  • Connection string pointing to Neon PostgreSQL production
  • Redis connection string configured
  • RabbitMQ connection configured
  • API versioning header X-Api-Version tested
  • Logging level set to Information (not Debug)
  • Serilog structured logging outputting to stdout (for Promtail collection)

Service-Specific

| Service | Extra Checks |
|---|---|
| iam-service | JWT signing key (RS256) deployed, OIDC discovery endpoint live, MFA configured |
| merchant-service | Subscription plans seeded, shop lifecycle tested |
| order-service | SignalR PosHub accessible, Redis backplane connected, MessagePack configured |
| fnb-engine | Kitchen ticket flow tested, inventory deduction verified |
| wallet-service | VNPay production credentials configured, IPN callback URL registered |
| catalog-service | Product categories seeded |
| inventory-service | Reorder level alerts configured |
| chat-service | SignalR hub accessible, Redis backplane connected |

Monitoring

  • Prometheus deployed and scraping all 8 services on /metrics
  • Grafana deployed with GoodGo Overview dashboard loaded
  • Alert rules active in Prometheus (service down, high error rate, high latency, DB pool, disk, memory)
  • Alert notifications configured (Slack channel #goodgo-alerts and/or PagerDuty)
  • Loki deployed and receiving logs from all containers via Promtail
  • Structured logging (Serilog JSON) verified in Loki queries
  • Grafana Loki datasource configured and queryable
  • Dashboard access restricted (admin credentials changed from defaults)
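
To confirm that Prometheus is scraping all 8 services, one option is to inspect the JSON returned by its `/api/v1/targets` HTTP API. The sketch below works on an already-fetched response body. It assumes each service is scraped under a `job` label equal to its short name from the service list; that naming is an assumption, not something this checklist specifies.

```python
# Short service names from the checklist; using them as Prometheus job
# labels is an assumption made for this sketch.
EXPECTED_SERVICES = (
    "iam", "merchant", "order", "fnb-engine",
    "wallet", "catalog", "inventory", "chat",
)


def unhealthy_services(targets_response: dict, expected=EXPECTED_SERVICES) -> list[str]:
    """List expected services that have no scrape target with health 'up'."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    up_jobs = {
        target.get("labels", {}).get("job")
        for target in active
        if target.get("health") == "up"
    }
    return [service for service in expected if service not in up_jobs]
```

An empty result means every expected service has at least one healthy target; a service with several replicas passes as soon as one of them is up, which may or may not be strict enough for a given rollout.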

Security

Authentication & Authorization

  • JWT signing key rotated from staging key (RS256 key pair)
  • OIDC discovery endpoint (/.well-known/openid-configuration) returns production issuer
  • Token expiry configured (access: 15min, refresh: 7 days)
  • RBAC policies verified (Admin, Owner, Staff, Customer roles)
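
The token expiry and issuer items can be spot-checked by decoding a freshly issued access token. The following Python sketch reads the payload without verifying the signature, so it is suitable only for inspecting configuration, never for authentication. It assumes the token carries both `iat` and `exp` claims, which RFC 7519 defines but does not require.

```python
import base64
import json

ACCESS_TOKEN_TTL_SECONDS = 15 * 60  # access: 15min, from the checklist


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments use unpadded base64url; restore padding before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def token_lifetime_seconds(token: str) -> int:
    """Lifetime implied by the token's exp and iat claims, in seconds."""
    claims = decode_jwt_payload(token)
    return claims["exp"] - claims["iat"]


def has_expected_lifetime(token: str, expected: int = ACCESS_TOKEN_TTL_SECONDS) -> bool:
    """True if the token lifetime matches the configured access token TTL."""
    return token_lifetime_seconds(token) == expected
```

The same decoded payload also exposes the `iss` claim, which can be compared against the production issuer returned by the OIDC discovery endpoint.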

Network & Transport

  • CORS configured (allow only goodgo.vn, admin.goodgo.vn origins)
  • HTTPS enforced (HTTP -> HTTPS redirect via Traefik middleware)
  • Security headers configured via Traefik middleware:
    • Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
    • Content-Security-Policy: default-src 'self'
    • X-Frame-Options: DENY
    • X-Content-Type-Options: nosniff
    • Referrer-Policy: strict-origin-when-cross-origin
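
The header values listed above can be asserted against a live response. Here is a minimal Python sketch that compares a mapping of response headers with the expected values; the `https://` scheme on the allowed origins is an assumption based on HTTPS being enforced, and the function names are illustrative.

```python
# Expected values copied from the checklist.
EXPECTED_HEADERS = {
    "Strict-Transport-Security": "max-age=63072000; includeSubDomains; preload",
    "Content-Security-Policy": "default-src 'self'",
    "X-Frame-Options": "DENY",
    "X-Content-Type-Options": "nosniff",
    "Referrer-Policy": "strict-origin-when-cross-origin",
}

# Assumed: origins are served over HTTPS only.
ALLOWED_ORIGINS = {"https://goodgo.vn", "https://admin.goodgo.vn"}


def header_problems(response_headers: dict[str, str]) -> list[str]:
    """Return a description of every missing or mismatched security header."""
    # HTTP header names are case-insensitive, so normalize before comparing.
    actual = {name.lower(): value.strip() for name, value in response_headers.items()}
    problems = []
    for name, expected in EXPECTED_HEADERS.items():
        got = actual.get(name.lower())
        if got is None:
            problems.append(f"missing {name}")
        elif got != expected:
            problems.append(f"{name}: expected {expected!r}, got {got!r}")
    return problems


def is_origin_allowed(origin: str) -> bool:
    """True only for the two origins the CORS policy should accept."""
    return origin in ALLOWED_ORIGINS
```

The comparison is an exact string match, which is deliberately strict; a deployment that legitimately extends the Content-Security-Policy would need the expected value updated accordingly.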

Rate Limiting

  • Auth endpoints: 10 requests/min (brute force protection)
  • Payment endpoints: 30 requests/min
  • General API: 100 requests/min
  • SignalR hub: 500 requests/min
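
The checklist does not say where these limits are enforced (gateway middleware or application code), so the following is only an illustration of the limits themselves. It is a fixed-window counter in Python, a simpler technique than the sliding-window or token-bucket limiters a production gateway would typically use. The category keys and class name are hypothetical.

```python
import time
from collections import defaultdict

# Requests per minute, from the checklist; the category keys are illustrative.
LIMITS_PER_MINUTE = {"auth": 10, "payment": 30, "general": 100, "signalr": 500}


class FixedWindowLimiter:
    """Count requests per client and category within fixed 60-second windows."""

    def __init__(self, limits=LIMITS_PER_MINUTE, window_seconds=60, clock=time.monotonic):
        self.limits = limits
        self.window_seconds = window_seconds
        self.clock = clock
        # (client, category, window index) -> hits; old windows are never
        # evicted in this sketch, so it is not suitable for long-running use.
        self.counts = defaultdict(int)

    def allow(self, client_id: str, category: str) -> bool:
        """Record a request and return False once the window's limit is reached."""
        window = int(self.clock() // self.window_seconds)
        key = (client_id, category, window)
        if self.counts[key] >= self.limits[category]:
            return False
        self.counts[key] += 1
        return True
```

With the default limits, the eleventh `allow("client-a", "auth")` call inside one window returns `False`, and the counter resets when the next window begins.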

Data Protection

  • Row-Level Security (RLS) policies applied on all tenant databases
  • Database user has minimal required permissions (no SUPERUSER)
  • MinIO buckets have proper ACLs (private by default, signed URLs for access)
  • No secret values visible in plain text via kubectl describe (inject them from Secrets, not ConfigMaps or literal env values)
  • Sensitive fields excluded from Serilog logging (passwords, tokens, card numbers)
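
Excluding sensitive fields from structured logs amounts to masking known keys before an event is serialized. In the services themselves this would be configured in Serilog; the Python sketch below only illustrates the idea. The key names in `SENSITIVE_KEYS` are hypothetical examples for passwords, tokens, and card numbers.

```python
# Hypothetical field names; the real list depends on the services' log events.
SENSITIVE_KEYS = {"password", "token", "access_token", "refresh_token", "card_number"}


def redact(event, mask="***"):
    """Return a copy of a log event with sensitive values masked.

    Walks nested dictionaries and lists; key matching is case-insensitive.
    """
    if isinstance(event, dict):
        return {
            key: mask if str(key).lower() in SENSITIVE_KEYS else redact(value, mask)
            for key, value in event.items()
        }
    if isinstance(event, list):
        return [redact(item, mask) for item in event]
    return event
```

Matching on key names alone does not catch secrets embedded in free-text messages, so a check like this complements, rather than replaces, a review of what the services log.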

Rollback Plan

  • Previous Docker images retained in Docker Hub (at least 5 recent tags)
  • Database rollback migration scripts prepared and tested
  • Feature flags configured for new features (can disable without redeploy)
  • Canary deployment strategy documented:
    1. Deploy to 1 replica first
    2. Monitor error rate for 10 minutes
    3. If error rate < 1%, proceed to full rollout
    4. If error rate > 5%, auto-rollback via kubectl rollout undo
  • kubectl rollout undo command documented per service
  • Communication plan for downtime (status page, Slack notification)
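
The canary thresholds above can be expressed as a small decision function. This Python sketch makes one assumption explicit: the checklist does not say what happens when the error rate falls between 1% and 5%, so the code returns "hold" in that band. The Deployment name passed to `rollback_command` is also a placeholder.

```python
def canary_decision(total_requests: int, failed_requests: int) -> str:
    """Map canary traffic counts to 'promote', 'rollback', or 'hold'."""
    if total_requests <= 0:
        return "hold"  # no traffic observed yet, nothing to judge
    error_rate = failed_requests / total_requests
    if error_rate < 0.01:
        return "promote"   # below 1%: proceed to full rollout
    if error_rate > 0.05:
        return "rollback"  # above 5%: undo the rollout
    return "hold"          # 1% to 5%: undefined in the checklist (assumption)


def rollback_command(deployment: str, namespace: str = "production") -> list[str]:
    """Build the kubectl rollout undo command for one service's Deployment."""
    return [
        "kubectl", "rollout", "undo",
        f"deployment/{deployment}",
        "--namespace", namespace,
    ]
```

Returning the command as an argument list keeps it ready for `subprocess.run` without shell quoting concerns, and makes it easy to print into the per-service rollback documentation.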

Post-Deployment Verification

Smoke Tests (within 30 minutes)

  • IAM: Login flow works (email + password)
  • IAM: Token refresh works
  • IAM: MFA enrollment works
  • Merchant: Shop creation works
  • Order: Create order -> add items -> submit
  • Order: Pay order (cash flow)
  • FnB: Kitchen ticket appears on KDS
  • Wallet: VNPay payment redirect works (endpoint switched from sandbox to production)
  • Catalog: Product listing loads
  • Inventory: Stock levels queryable
  • Chat: SignalR connection established
  • Storage: File upload + signed URL access

Functional Verification (within 2 hours)

  • Full Karaoke POS workflow (room select -> order -> pay -> close)
  • Full Restaurant POS workflow (table -> order -> kitchen -> serve -> pay)
  • QR code menu accessible from customer phone
  • EOD report generates correctly with real data
  • Multi-browser sessions work (concurrent POS users on the same shop)

Monitoring Verification (within 24 hours)

  • Monitor error rates (target: < 0.1% 5xx)
  • Monitor p95 latency (target: < 500ms)
  • Monitor SignalR connection stability (no unexpected disconnects)
  • Verify Grafana dashboards show live data
  • Verify alert rules fire correctly (test with synthetic failure if needed)
  • Review Loki logs for any unhandled exceptions
  • Verify PostgreSQL connection pool utilization is healthy (< 50%)
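
The error-rate and latency targets above are normally read from Grafana, but the arithmetic behind them is simple enough to sketch. The Python below computes a 5xx rate and a nearest-rank p95 from raw samples; the nearest-rank method is one of several percentile definitions and may differ slightly from what a Prometheus histogram query reports.

```python
import math


def error_rate_5xx(status_codes: list[int]) -> float:
    """Fraction of responses with a 5xx status code."""
    if not status_codes:
        return 0.0
    failures = sum(1 for code in status_codes if 500 <= code <= 599)
    return failures / len(status_codes)


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_targets(status_codes: list[int], latencies_ms: list[float]) -> bool:
    """Check the targets from the checklist: < 0.1% 5xx and p95 < 500 ms."""
    return (
        error_rate_5xx(status_codes) < 0.001
        and percentile_nearest_rank(latencies_ms, 95) < 500
    )
```

Both comparisons are strict, matching the "<" in the stated targets, so a 5xx rate of exactly 0.1% does not pass.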

Sign-Off

| Role | Name | Date | Approved |
|---|---|---|---|
| CTO | | | [ ] |
| Tech Lead | | | [ ] |
| DevOps Lead | | | [ ] |
| QA Lead | | | [ ] |

This checklist must be completed and signed off before production traffic is routed to the new deployment.