# 📚 GoodGo Platform: Infrastructure Documentation

This directory contains **three comprehensive operational documents** for the GoodGo Platform infrastructure.

## 📖 Documentation Files

### 1. **INFRASTRUCTURE_RUNBOOK.md** (1,458 lines)

**→ Read this for complete operational reference**

Comprehensive guide covering:

- ✅ Executive summary (12+ services overview)
- ✅ Complete service inventory with ports, health checks, dependencies
- ✅ Docker Compose specifications (dev, prod, CI environments)
- ✅ Database layer (PostgreSQL 16 + PostGIS, 22 Prisma models)
- ✅ Connection pooling (PgBouncer configuration, transaction mode)
- ✅ Backup & recovery strategies (daily automated backups, verification)
- ✅ Caching & search (Redis graceful degradation, Typesense full-text)
- ✅ Monitoring & observability (Prometheus, Grafana dashboards, Loki logs)
- ✅ Payment integration (VNPay, MoMo, ZaloPay, callback handling)
- ✅ Health checks (liveness, readiness, dependency-specific probes)
- ✅ Complete environment variables reference
- ✅ Deployment pipeline (GitHub Actions CI/CD, Docker builds)
- ✅ Detailed troubleshooting guide with 7+ common issues
- ✅ Emergency procedures and Prometheus queries

**Use when:** Creating runbooks, investigating outages, onboarding new ops team members

---

### 2. **INFRASTRUCTURE_QUICK_REFERENCE.md** (222 lines)

**→ Read this for quick lookup**

Quick reference covering:

- 🚀 Quick start commands (dev, prod, CI)
- 📊 Service map with ports and health checks
- 🗄️ Database overview (backup schedule, connection pooling)
- 💾 Cache & search summary (Redis, Typesense features)
- 📈 Monitoring dashboard links
- 💳 Payment gateway summary
- 🏥 Health endpoint reference
- 🔐 Critical environment variables
- 📦 Deployment container images
- 🆘 Common troubleshooting steps (5 quick fixes)
- 📝 Key file locations and links
- 📞 Common Docker commands

**Use when:** Debugging quickly, on-call shift lookup, quick health checks

---

### 3. **INFRASTRUCTURE_AUDIT.md** (1,246 lines)

**→ Read this for complete audit trail of what was explored**

Detailed audit including:

- Raw configuration file contents
- Line-by-line analysis of each service
- Environment variable specifications
- Payment callback flow diagram (text)
- Health check implementation details
- Backup verification workflow
- CI/CD pipeline stages

**Use when:** Verifying infrastructure documentation accuracy, compliance audits

---

## 🎯 Quick Navigation

### By Role

**🔧 DevOps/SRE Engineer**

1. Start: INFRASTRUCTURE_QUICK_REFERENCE.md (5 min overview)
2. Deep dive: INFRASTRUCTURE_RUNBOOK.md (sections 2-3, 7, 11)
3. Reference: INFRASTRUCTURE_AUDIT.md (for raw configs)

**💼 Engineering Manager/Tech Lead**

1. Start: INFRASTRUCTURE_RUNBOOK.md (section 1: Executive Summary)
2. Details: INFRASTRUCTURE_RUNBOOK.md (sections 2-6, 10)

**🚀 On-Call Engineer**

1. Start: INFRASTRUCTURE_QUICK_REFERENCE.md (entire document)
2. Troubleshoot: INFRASTRUCTURE_RUNBOOK.md (section 12)
3. Debug: INFRASTRUCTURE_AUDIT.md (raw logs/configs if needed)

**👤 New Team Member**

1. Start: INFRASTRUCTURE_QUICK_REFERENCE.md (overview)
2. Learn: INFRASTRUCTURE_RUNBOOK.md (sections 1-6)
3. Practice: Use common commands from Quick Reference

---

## 🔍 Common Questions & Where to Find Answers

| Question | Document | Section |
|----------|----------|---------|
| "How many services are running?" | Runbook | 1. Executive Summary |
| "What ports do I need to know?" | Quick Reference | 📊 Service Map |
| "How is the database backed up?" | Runbook | 8. Backup & Recovery |
| "Payment callback failed, what now?" | Runbook | 12. Troubleshooting (Payment Callback) |
| "Redis is down, will the app work?" | Runbook | 5. Caching & Search (Graceful Degradation) |
| "How do I restart a service?" | Quick Reference | 📞 Common Commands |
| "What's the monitoring setup?" | Runbook | 6. Monitoring & Observability |
| "Where are environment variables?" | Runbook | 9. Environment Variables |
| "How do I deploy to production?" | Runbook | 11. Deployment Pipeline |
| "What does a health check do?" | Runbook | 7. Health Checks |

---

## 📊 Infrastructure at a Glance

```
Development Environment
├── 12 Services (no resource limits)
├── PostgreSQL 16 + PostGIS (5432)
├── Redis 7 (6379, 256MB)
├── Typesense 27.1 (8108)
├── Prometheus (9090, 15-day retention)
├── Grafana (3002, 7 dashboards)
├── Loki (3100, 15-day logs)
└── API/Web/AI services

Production Environment
├── 14 Services (with resource limits, security hardening)
├── PgBouncer (6432, 20-connection pool)
├── PostgreSQL 16 + PostGIS (5432)
├── Redis 7 (6379, 512MB, password auth)
├── Typesense 27.1 (8108)
├── Prometheus (9090, 30-day retention)
├── Grafana (3002, secrets management)
├── Loki (3100, 15-day logs)
└── API/Web/AI services (zero-downtime deployments)

CI/E2E Environment
├── 4 Services (tmpfs for speed)
├── PostgreSQL test DB
├── Redis (no persistence)
└── Typesense + MinIO (tmpfs)
```

---

## 🔗 Related Files in Repository

```
goodgo-platform-ai/
├── README_INFRASTRUCTURE.md (THIS FILE)
├── INFRASTRUCTURE_RUNBOOK.md (Complete reference)
├── INFRASTRUCTURE_QUICK_REFERENCE.md (Quick lookup)
├── INFRASTRUCTURE_AUDIT.md (Detailed audit)
│
├── docker-compose.yml (Dev environment)
├── docker-compose.prod.yml (Production)
├── docker-compose.ci.yml (Testing)
│
├── .env.example (Environment variables template)
├── prisma/schema.prisma (Data model, 22 Prisma models)
│
├── infra/pgbouncer/ (Connection pooling)
├── monitoring/ (Prometheus, Grafana, Loki configs)
├── scripts/backup/ (Backup and verification scripts)
│
└── .github/workflows/ (CI/CD pipelines)
    ├── ci.yml (Lint → Test → Build)
    ├── deploy.yml (Build images, deploy)
    ├── e2e.yml (End-to-end tests)
    ├── backup-verify.yml (Weekly backup verification)
    └── security.yml (Dependency scanning)
```

---

## 🆘 Immediate Help

### "The API is down. What do I check?"

1. Read: INFRASTRUCTURE_QUICK_REFERENCE.md → 🆘 Troubleshooting
2. Quick commands:
   ```bash
   docker compose ps api
   docker compose logs api --tail=50
   curl http://localhost:3001/health/ready
   ```
3. If still stuck: See INFRASTRUCTURE_RUNBOOK.md → 12. Troubleshooting

### "I need to deploy to production"

1. Read: INFRASTRUCTURE_QUICK_REFERENCE.md → 📦 Deployment
2. Then: INFRASTRUCTURE_RUNBOOK.md → 11. Deployment Pipeline
3. Review: `.github/workflows/deploy.yml` for actual steps

### "The database is slow"

1. Read: INFRASTRUCTURE_RUNBOOK.md → 4. Database Layer (Connection Pooling)
2. Check: INFRASTRUCTURE_QUICK_REFERENCE.md → 🆘 "Database connection pooling full?"
3. Query: Use Prometheus queries from INFRASTRUCTURE_RUNBOOK.md

### "How do I restore from backup?"

1. Read: INFRASTRUCTURE_RUNBOOK.md → 8. Backup & Recovery
2. Steps: "Restore from Backup" section with exact commands

---

## 📈 Key Metrics & SLOs

From INFRASTRUCTURE_RUNBOOK.md monitoring section:

| Metric | Warning | Critical | Source |
|--------|---------|----------|--------|
| API p99 latency | > 1s (5min) | > 3s (3min) | Prometheus histogram |
| API p99/endpoint | > 2s (5min) | N/A | Prometheus |
| 5xx error rate | > 1% (5min) | N/A | Prometheus |
| Database response | Monitored | Monitored | Grafana dashboard |
| Redis availability | Graceful fallback | Graceful fallback | App continues on DB |

Dashboards available at `http://localhost:3002` (Grafana):

- API Latency
- API Overview
- Database Metrics
- Logs & Errors
- Search Analytics
- Web Vitals
- Business Metrics

---

## 🔐 Security Notes

From INFRASTRUCTURE_RUNBOOK.md environment variables section:

**CRITICAL (Production):**

- JWT_SECRET must be ≥32 characters (generate: `openssl rand -base64 48`)
- KYC_ENCRYPTION_KEY must be 64 hex chars (generate: `openssl rand -hex 32`)
- All payment gateway credentials must be rotated regularly
- Redis requires password authentication in production
- Docker containers run as non-root (node user)
- Read-only filesystems for application containers
- No new privileges flag set

---

## 📞 Escalation Path

1. **Immediate Issue?** → INFRASTRUCTURE_QUICK_REFERENCE.md
2. **Complex Problem?** → INFRASTRUCTURE_RUNBOOK.md section 12
3. **Need Audit Trail?** → INFRASTRUCTURE_AUDIT.md
4. **Still Stuck?** → Check .github/workflows/ or git history

---

## 📝 Document Updates

These documents were generated on **April 11, 2026** from a complete infrastructure audit of the GoodGo Platform monorepo.

**To keep up-to-date:**

- Update these docs when adding new services
- Review monitoring configs after infrastructure changes
- Test backup procedures monthly (already automated)
- Update runbooks based on incident postmortems

---

## 🎓 Learning Path

**For new team members:**

1. **Day 1:** Read INFRASTRUCTURE_QUICK_REFERENCE.md (30 min)
2. **Day 2:** Read INFRASTRUCTURE_RUNBOOK.md sections 1-3 (1 hour)
3. **Day 3:** Practice commands from Quick Reference with mentor
4. **Day 4:** Read INFRASTRUCTURE_RUNBOOK.md sections 4-7 (1.5 hours)
5. **Day 5:** Read INFRASTRUCTURE_RUNBOOK.md sections 8-12 (1.5 hours)
6. **Week 2:** Shadow on-call engineer, practice troubleshooting
7. **Week 3:** Take on-call shift

---

**Last Updated:** April 11, 2026

**Version:** 1.0

**Maintainers:** GoodGo Platform SRE Team

---

*For questions or updates to this documentation, contact: devops@goodgo.vn*
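
The secret-format rules listed under Security Notes (JWT_SECRET of at least 32 characters, KYC_ENCRYPTION_KEY of exactly 64 hex characters) lend themselves to a quick pre-deploy check. The following is a minimal sketch, not a script from the GoodGo repository: the function names `check_jwt_secret`, `check_kyc_key`, and `check_secrets` are hypothetical, and it assumes both values are exported in the current shell environment.

```bash
#!/usr/bin/env bash
# Sketch only: validates the two secret formats described in Security Notes.
# All function names here are hypothetical and not part of the repository.

# JWT_SECRET must be at least 32 characters long.
check_jwt_secret() {
  local secret="$1"
  [ "${#secret}" -ge 32 ]
}

# KYC_ENCRYPTION_KEY must be exactly 64 hexadecimal characters.
check_kyc_key() {
  local key="$1"
  [ "${#key}" -eq 64 ] || return 1
  case "$key" in
    *[!0-9a-fA-F]*) return 1 ;;  # reject any non-hex character
  esac
  return 0
}

# Check the values currently set in the environment and report each problem.
check_secrets() {
  local failed=0
  if ! check_jwt_secret "${JWT_SECRET:-}"; then
    echo "JWT_SECRET must be at least 32 characters" >&2
    failed=1
  fi
  if ! check_kyc_key "${KYC_ENCRYPTION_KEY:-}"; then
    echo "KYC_ENCRYPTION_KEY must be exactly 64 hex characters" >&2
    failed=1
  fi
  return "$failed"
}
```

One way to use it: source the file, export values generated with `openssl rand -base64 48` and `openssl rand -hex 32`, then run `check_secrets` and treat a non-zero exit status as a reason to stop the deploy. The check covers format only; it says nothing about whether a secret has been rotated or leaked.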