📚 GoodGo Platform — Infrastructure Documentation

This directory contains three comprehensive operational documents for the GoodGo Platform infrastructure.

📖 Documentation Files

1. INFRASTRUCTURE_RUNBOOK.md (1,458 lines)

→ Read this for a complete operational reference

Comprehensive guide covering:

  • Executive summary (12+ services overview)
  • Complete service inventory with ports, health checks, dependencies
  • Docker Compose specifications (dev, prod, CI environments)
  • Database layer (PostgreSQL 16 + PostGIS, 22 Prisma models)
  • Connection pooling (PgBouncer configuration, transaction mode)
  • Backup & recovery strategies (daily automated backups, verification)
  • Caching & search (Redis graceful degradation, Typesense full-text)
  • Monitoring & observability (Prometheus, Grafana dashboards, Loki logs)
  • Payment integration (VNPay, MoMo, ZaloPay, callback handling)
  • Health checks (liveness, readiness, dependency-specific probes)
  • Complete environment variables reference
  • Deployment pipeline (GitHub Actions CI/CD, Docker builds)
  • Detailed troubleshooting guide with 7+ common issues
  • Emergency procedures and Prometheus queries

Use when: Creating runbooks, investigating outages, onboarding new ops team members


2. INFRASTRUCTURE_QUICK_REFERENCE.md (222 lines)

→ Read this for quick lookup

Quick reference covering:

  • 🚀 Quick start commands (dev, prod, CI)
  • 📊 Service map with ports and health checks
  • 🗄️ Database overview (backup schedule, connection pooling)
  • 💾 Cache & search summary (Redis, Typesense features)
  • 📈 Monitoring dashboard links
  • 💳 Payment gateway summary
  • 🏥 Health endpoint reference
  • 🔐 Critical environment variables
  • 📦 Deployment container images
  • 🆘 Common troubleshooting steps (5 quick fixes)
  • 📝 Key file locations and links
  • 📞 Common Docker commands

Use when: Debugging quickly, on-call shift lookup, quick health checks


3. INFRASTRUCTURE_AUDIT.md (1,246 lines)

→ Read this for a complete audit trail of what was explored

Detailed audit including:

  • Raw configuration file contents
  • Line-by-line analysis of each service
  • Environment variable specifications
  • Payment callback flow diagram (text)
  • Health check implementation details
  • Backup verification workflow
  • CI/CD pipeline stages

Use when: Verifying infrastructure documentation accuracy, compliance audits


🎯 Quick Navigation

By Role

🔧 DevOps/SRE Engineer

  1. Start: INFRASTRUCTURE_QUICK_REFERENCE.md (5 min overview)
  2. Deep dive: INFRASTRUCTURE_RUNBOOK.md (sections 2-3, 7, 11)
  3. Reference: INFRASTRUCTURE_AUDIT.md (for raw configs)

💼 Engineering Manager/Tech Lead

  1. Start: INFRASTRUCTURE_RUNBOOK.md (section 1: Executive Summary)
  2. Details: INFRASTRUCTURE_RUNBOOK.md (sections 2-6, 10)

🚀 On-Call Engineer

  1. Start: INFRASTRUCTURE_QUICK_REFERENCE.md (entire document)
  2. Troubleshoot: INFRASTRUCTURE_RUNBOOK.md (section 12)
  3. Debug: INFRASTRUCTURE_AUDIT.md (raw configs if needed)

👤 New Team Member

  1. Start: INFRASTRUCTURE_QUICK_REFERENCE.md (overview)
  2. Learn: INFRASTRUCTURE_RUNBOOK.md (sections 1-6)
  3. Practice: Use common commands from Quick Reference

🔍 Common Questions & Where to Find Answers

| Question | Document | Section |
| --- | --- | --- |
| "How many services are running?" | Runbook | 1. Executive Summary |
| "What ports do I need to know?" | Quick Reference | 📊 Service Map |
| "How is the database backed up?" | Runbook | 8. Backup & Recovery |
| "Payment callback failed, what now?" | Runbook | 12. Troubleshooting (Payment Callback) |
| "Redis is down, will the app work?" | Runbook | 5. Caching & Search (Graceful Degradation) |
| "How do I restart a service?" | Quick Reference | 📞 Common Commands |
| "What's the monitoring setup?" | Runbook | 6. Monitoring & Observability |
| "Where are environment variables?" | Runbook | 9. Environment Variables |
| "How do I deploy to production?" | Runbook | 11. Deployment Pipeline |
| "What does a health check do?" | Runbook | 7. Health Checks |

📊 Infrastructure at a Glance

Development Environment
├── 12 Services (no resource limits)
├── PostgreSQL 16 + PostGIS (5432)
├── Redis 7 (6379, 256MB)
├── Typesense 27.1 (8108)
├── Prometheus (9090, 15-day retention)
├── Grafana (3002, 7 dashboards)
├── Loki (3100, 15-day logs)
└── API/Web/AI services

Production Environment
├── 14 Services (with resource limits, security hardening)
├── PgBouncer (6432, 20-connection pool)
├── PostgreSQL 16 + PostGIS (5432)
├── Redis 7 (6379, 512MB, password auth)
├── Typesense 27.1 (8108)
├── Prometheus (9090, 30-day retention)
├── Grafana (3002, secrets management)
├── Loki (3100, 15-day logs)
└── API/Web/AI services (zero-downtime deployments)

CI/E2E Environment
├── 4 Services (tmpfs for speed)
├── PostgreSQL test DB
├── Redis (no persistence)
└── Typesense + MinIO (tmpfs)

goodgo-platform-ai/
├── README_INFRASTRUCTURE.md (THIS FILE)
├── INFRASTRUCTURE_RUNBOOK.md (Complete reference)
├── INFRASTRUCTURE_QUICK_REFERENCE.md (Quick lookup)
├── INFRASTRUCTURE_AUDIT.md (Detailed audit)
│
├── docker-compose.yml (Dev environment)
├── docker-compose.prod.yml (Production)
├── docker-compose.ci.yml (Testing)
│
├── .env.example (Environment variables template)
├── prisma/schema.prisma (Data model, 22 Prisma models)
│
├── infra/pgbouncer/ (Connection pooling)
├── monitoring/ (Prometheus, Grafana, Loki configs)
├── scripts/backup/ (Backup and verification scripts)
│
└── .github/workflows/ (CI/CD pipelines)
    ├── ci.yml (Lint → Test → Build)
    ├── deploy.yml (Build images, deploy)
    ├── e2e.yml (End-to-end tests)
    ├── backup-verify.yml (Weekly backup verification)
    └── security.yml (Dependency scanning)
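
The three Compose files listed above map one-to-one onto the dev, production, and CI environments. A minimal sketch of a helper that picks the right file is shown here; the function names `compose_file_for` and `compose_ps` are hypothetical and not part of the repository, and the short environment names `dev`, `prod`, and `ci` are an assumed convention.

```shell
# Map an environment name to the Compose file listed in the tree above.
# compose_file_for and compose_ps are illustrative helpers, not repo scripts.
compose_file_for() {
  case "$1" in
    dev)  echo "docker-compose.yml" ;;
    prod) echo "docker-compose.prod.yml" ;;
    ci)   echo "docker-compose.ci.yml" ;;
    *)    echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Show container status for one environment, e.g. `compose_ps prod`.
compose_ps() {
  file="$(compose_file_for "$1")" || return 1
  docker compose -f "$file" ps
}
```

Run from the repository root, `compose_ps dev` would be equivalent to `docker compose -f docker-compose.yml ps`.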

🆘 Immediate Help

"The API is down. What do I check?"

  1. Read: INFRASTRUCTURE_QUICK_REFERENCE.md → 🆘 Troubleshooting
  2. Quick commands:
    docker compose ps api
    docker compose logs api --tail=50
    curl http://localhost:3001/health/ready
    
  3. If still stuck: See INFRASTRUCTURE_RUNBOOK.md → 12. Troubleshooting
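
After restarting or redeploying the API, it can help to poll the readiness endpoint instead of re-running `curl` by hand. The following is a sketch only: `wait_ready` is a hypothetical helper, and the attempt count, delay, and 5-second timeout are illustrative defaults rather than values taken from the runbook.

```shell
# Poll the readiness endpoint until it responds successfully or attempts run out.
# Arguments: URL (default from this README), attempts, delay in seconds.
wait_ready() {
  url="${1:-http://localhost:3001/health/ready}"
  attempts="${2:-5}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP >= 400 as failure, -s: quiet, --max-time: avoid hanging
    if curl -fs --max-time 5 -o /dev/null "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "not ready after $attempts attempt(s)" >&2
  return 1
}
```

For example, `wait_ready http://localhost:3001/health/ready 10 3` waits up to roughly 30 seconds before giving up with a non-zero exit status.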

"I need to deploy to production"

  1. Read: INFRASTRUCTURE_QUICK_REFERENCE.md → 📦 Deployment
  2. Then: INFRASTRUCTURE_RUNBOOK.md → 11. Deployment Pipeline
  3. Review: .github/workflows/deploy.yml for actual steps

"The database is slow"

  1. Read: INFRASTRUCTURE_RUNBOOK.md → 4. Database Layer (Connection Pooling)
  2. Check: INFRASTRUCTURE_QUICK_REFERENCE.md → 🆘 "Database connection pooling full?"
  3. Query: Use Prometheus queries from INFRASTRUCTURE_RUNBOOK.md
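
When pool exhaustion is suspected, PgBouncer's admin console can report per-pool client and server counts through its `SHOW POOLS;` command. The sketch below assumes PgBouncer listens on port 6432 as listed in this README; the host, the admin user name `pgbouncer`, and the helper name `pgbouncer_pools` are assumptions, so check the actual values in infra/pgbouncer/ before use.

```shell
# Print PgBouncer pool statistics from its admin console.
# Host and admin user are assumed defaults; port 6432 comes from this README.
pgbouncer_pools() {
  host="${1:-localhost}"
  user="${2:-pgbouncer}"   # hypothetical admin user
  # "pgbouncer" is the special admin database name used by PgBouncer
  psql -h "$host" -p 6432 -U "$user" -d pgbouncer -c 'SHOW POOLS;'
}
```

A persistently non-zero `cl_waiting` value in the output suggests clients are queuing for a server connection.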

"How do I restore from backup?"

  1. Read: INFRASTRUCTURE_RUNBOOK.md → 8. Backup & Recovery
  2. Steps: "Restore from Backup" section with exact commands

📈 Key Metrics & SLOs

From INFRASTRUCTURE_RUNBOOK.md monitoring section:

| Metric | Warning | Critical | Source |
| --- | --- | --- | --- |
| API p99 latency | > 1s (5min) | > 3s (3min) | Prometheus histogram |
| API p99/endpoint | > 2s (5min) | N/A | Prometheus |
| 5xx error rate | > 1% (5min) | N/A | Prometheus |
| Database response | Monitored | Monitored | Grafana dashboard |
| Redis availability | Graceful fallback | Graceful fallback | App continues on DB |
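
The numeric thresholds above can be expressed as a small classifier, which may be handy in ad hoc scripts. This is a sketch under stated assumptions: `classify_p99` and `classify_5xx_rate` are hypothetical helpers, the observed value has to be obtained separately (for example from Prometheus), and the sustained-duration windows (5min, 3min) are not modelled here.

```shell
# Classify an observed overall API p99 latency (seconds) against the table above.
classify_p99() {
  awk -v v="$1" 'BEGIN {
    if (v > 3) print "critical";       # > 3s is the critical threshold
    else if (v > 1) print "warning";   # > 1s is the warning threshold
    else print "ok";
  }'
}

# Classify an observed 5xx error rate (percent); only a warning level is defined.
classify_5xx_rate() {
  awk -v v="$1" 'BEGIN {
    if (v > 1) print "warning";
    else print "ok";
  }'
}
```

For instance, `classify_p99 1.5` falls in the warning band, while `classify_p99 3.2` is critical.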

Dashboards available at http://localhost:3002 (Grafana):

  • API Latency
  • API Overview
  • Database Metrics
  • Logs & Errors
  • Search Analytics
  • Web Vitals
  • Business Metrics

🔐 Security Notes

From INFRASTRUCTURE_RUNBOOK.md environment variables section:

CRITICAL (Production):

  • JWT_SECRET must be ≥32 characters (generate: openssl rand -base64 48)
  • KYC_ENCRYPTION_KEY must be 64 hex chars (generate: openssl rand -hex 32)
  • All payment gateway credentials must be rotated regularly
  • Redis requires password authentication in production
  • Docker containers run as a non-root user (node)
  • Application containers use read-only filesystems
  • The no-new-privileges security option is set
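
The two secret-format rules above are easy to verify before a deploy. A minimal sketch follows, assuming the secrets are available as shell variables; `check_jwt_secret` and `check_kyc_key` are hypothetical helper names and not part of the platform.

```shell
# Succeeds when the argument is at least 32 characters long (JWT_SECRET rule).
check_jwt_secret() {
  [ "${#1}" -ge 32 ]
}

# Succeeds when the argument is exactly 64 hexadecimal characters
# (KYC_ENCRYPTION_KEY rule).
check_kyc_key() {
  [ "${#1}" -eq 64 ] || return 1
  case "$1" in
    *[!0-9a-fA-F]*) return 1 ;;   # contains a non-hex character
    *) return 0 ;;
  esac
}
```

A pre-deploy script could then run `check_jwt_secret "$JWT_SECRET" || echo "JWT_SECRET too short" >&2` and the equivalent check for `KYC_ENCRYPTION_KEY`.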

📞 Escalation Path

  1. Immediate Issue? → INFRASTRUCTURE_QUICK_REFERENCE.md
  2. Complex Problem? → INFRASTRUCTURE_RUNBOOK.md section 12
  3. Need Audit Trail? → INFRASTRUCTURE_AUDIT.md
  4. Still Stuck? → Check .github/workflows/ or git history

📝 Document Updates

These documents were generated on April 11, 2026 from a complete infrastructure audit of the GoodGo Platform monorepo.

To keep up-to-date:

  • Update these docs when adding new services
  • Review monitoring configs after infrastructure changes
  • Test backup restore procedures monthly (backup verification already runs weekly via backup-verify.yml)
  • Update runbooks based on incident postmortems

🎓 Learning Path

For new team members:

  1. Day 1: Read INFRASTRUCTURE_QUICK_REFERENCE.md (30 min)
  2. Day 2: Read INFRASTRUCTURE_RUNBOOK.md sections 1-3 (1 hour)
  3. Day 3: Practice commands from Quick Reference with mentor
  4. Day 4: Read INFRASTRUCTURE_RUNBOOK.md sections 4-7 (1.5 hours)
  5. Day 5: Read INFRASTRUCTURE_RUNBOOK.md sections 8-12 (1.5 hours)
  6. Week 2: Shadow on-call engineer, practice troubleshooting
  7. Week 3: Take on-call shift

Last Updated: April 11, 2026
Version: 1.0
Maintainers: GoodGo Platform SRE Team


For questions or updates to this documentation, contact: devops@goodgo.vn