📚 GoodGo Platform — Infrastructure Documentation
This directory contains three comprehensive operational documents for the GoodGo Platform infrastructure.
📖 Documentation Files
1. INFRASTRUCTURE_RUNBOOK.md (1,458 lines)
→ Read this for complete operational reference
Comprehensive guide covering:
- ✅ Executive summary (12+ services overview)
- ✅ Complete service inventory with ports, health checks, dependencies
- ✅ Docker Compose specifications (dev, prod, CI environments)
- ✅ Database layer (PostgreSQL 16 + PostGIS, 22 Prisma models)
- ✅ Connection pooling (PgBouncer configuration, transaction mode)
- ✅ Backup & recovery strategies (daily automated backups, verification)
- ✅ Caching & search (Redis graceful degradation, Typesense full-text)
- ✅ Monitoring & observability (Prometheus, Grafana dashboards, Loki logs)
- ✅ Payment integration (VNPay, MoMo, ZaloPay, callback handling)
- ✅ Health checks (liveness, readiness, dependency-specific probes)
- ✅ Complete environment variables reference
- ✅ Deployment pipeline (GitHub Actions CI/CD, Docker builds)
- ✅ Detailed troubleshooting guide with 7+ common issues
- ✅ Emergency procedures and Prometheus queries
Use when: Creating runbooks, investigating outages, onboarding new ops team members
2. INFRASTRUCTURE_QUICK_REFERENCE.md (222 lines)
→ Read this for quick lookup
Quick reference covering:
- 🚀 Quick start commands (dev, prod, CI)
- 📊 Service map with ports and health checks
- 🗄️ Database overview (backup schedule, connection pooling)
- 💾 Cache & search summary (Redis, Typesense features)
- 📈 Monitoring dashboard links
- 💳 Payment gateway summary
- 🏥 Health endpoint reference
- 🔐 Critical environment variables
- 📦 Deployment container images
- 🆘 Common troubleshooting steps (5 quick fixes)
- 📝 Key file locations and links
- 📞 Common Docker commands
Use when: Debugging quickly, on-call shift lookup, quick health checks
3. INFRASTRUCTURE_AUDIT.md (1,246 lines)
→ Read this for the complete audit trail of what was explored
Detailed audit including:
- Raw configuration file contents
- Line-by-line analysis of each service
- Environment variable specifications
- Payment callback flow diagram (text)
- Health check implementation details
- Backup verification workflow
- CI/CD pipeline stages
Use when: Verifying infrastructure documentation accuracy, compliance audits
🎯 Quick Navigation
By Role
🔧 DevOps/SRE Engineer
- Start: INFRASTRUCTURE_QUICK_REFERENCE.md (5 min overview)
- Deep dive: INFRASTRUCTURE_RUNBOOK.md (sections 2-3, 7, 11)
- Reference: INFRASTRUCTURE_AUDIT.md (for raw configs)
💼 Engineering Manager/Tech Lead
- Start: INFRASTRUCTURE_RUNBOOK.md (section 1: Executive Summary)
- Details: INFRASTRUCTURE_RUNBOOK.md (sections 2-6, 10)
🚀 On-Call Engineer
- Start: INFRASTRUCTURE_QUICK_REFERENCE.md (entire document)
- Troubleshoot: INFRASTRUCTURE_RUNBOOK.md (section 12)
- Debug: INFRASTRUCTURE_AUDIT.md (raw logs/configs if needed)
👤 New Team Member
- Start: INFRASTRUCTURE_QUICK_REFERENCE.md (overview)
- Learn: INFRASTRUCTURE_RUNBOOK.md (sections 1-6)
- Practice: Use common commands from Quick Reference
🔍 Common Questions & Where to Find Answers
| Question | Document | Section |
|---|---|---|
| "How many services are running?" | Runbook | 1. Executive Summary |
| "What ports do I need to know?" | Quick Reference | 📊 Service Map |
| "How is the database backed up?" | Runbook | 8. Backup & Recovery |
| "Payment callback failed, what now?" | Runbook | 12. Troubleshooting (Payment Callback) |
| "Redis is down, will the app work?" | Runbook | 5. Caching & Search (Graceful Degradation) |
| "How do I restart a service?" | Quick Reference | 📞 Common Commands |
| "What's the monitoring setup?" | Runbook | 6. Monitoring & Observability |
| "Where are environment variables?" | Runbook | 9. Environment Variables |
| "How do I deploy to production?" | Runbook | 11. Deployment Pipeline |
| "What does a health check do?" | Runbook | 7. Health Checks |
📊 Infrastructure at a Glance
Development Environment
├── 12 Services (no resource limits)
├── PostgreSQL 16 + PostGIS (5432)
├── Redis 7 (6379, 256MB)
├── Typesense 27.1 (8108)
├── Prometheus (9090, 15-day retention)
├── Grafana (3002, 7 dashboards)
├── Loki (3100, 15-day logs)
└── API/Web/AI services
Production Environment
├── 14 Services (with resource limits, security hardening)
├── PgBouncer (6432, 20-connection pool)
├── PostgreSQL 16 + PostGIS (5432)
├── Redis 7 (6379, 512MB, password auth)
├── Typesense 27.1 (8108)
├── Prometheus (9090, 30-day retention)
├── Grafana (3002, secrets management)
├── Loki (3100, 15-day logs)
└── API/Web/AI services (zero-downtime deployments)
CI/E2E Environment
├── 4 Services (tmpfs for speed)
├── PostgreSQL test DB
├── Redis (no persistence)
└── Typesense + MinIO (tmpfs)
🔗 Related Files in Repository
goodgo-platform-ai/
├── README_INFRASTRUCTURE.md (THIS FILE)
├── INFRASTRUCTURE_RUNBOOK.md (Complete reference)
├── INFRASTRUCTURE_QUICK_REFERENCE.md (Quick lookup)
├── INFRASTRUCTURE_AUDIT.md (Detailed audit)
│
├── docker-compose.yml (Dev environment)
├── docker-compose.prod.yml (Production)
├── docker-compose.ci.yml (Testing)
│
├── .env.example (Environment variables template)
├── prisma/schema.prisma (Data model, 22 Prisma models)
│
├── infra/pgbouncer/ (Connection pooling)
├── monitoring/ (Prometheus, Grafana, Loki configs)
├── scripts/backup/ (Backup and verification scripts)
│
└── .github/workflows/ (CI/CD pipelines)
    ├── ci.yml (Lint → Test → Build)
    ├── deploy.yml (Build images, deploy)
    ├── e2e.yml (End-to-end tests)
    ├── backup-verify.yml (Weekly backup verification)
    └── security.yml (Dependency scanning)
🆘 Immediate Help
"The API is down. What do I check?"
- Read: INFRASTRUCTURE_QUICK_REFERENCE.md → 🆘 Troubleshooting
- Quick commands:
    docker compose ps api
    docker compose logs api --tail=50
    curl http://localhost:3001/health/ready
- If still stuck: See INFRASTRUCTURE_RUNBOOK.md → 12. Troubleshooting
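After a restart, the readiness check usually has to be repeated until the service responds. The sketch below is one way to wrap the curl command above in a retry loop. The wait_until helper, its attempt count, and its delay are hypothetical and do not come from the GoodGo repository.

```shell
#!/bin/sh
# Sketch only: retry a command until it succeeds or attempts run out.
# wait_until is a hypothetical helper, not part of the GoodGo repo.
#
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Run the command; stop as soon as it exits with status 0.
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only when another attempt is still to come.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (not executed here): poll the readiness endpoint 10 times, 3s apart.
# curl -f makes HTTP errors (status 400 and above) count as failures.
#   wait_until 10 3 curl -fsS http://localhost:3001/health/ready
```

If the loop returns non-zero, the service never became ready within the window, and the logs from docker compose logs api are the next place to look.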
"I need to deploy to production"
- Read: INFRASTRUCTURE_QUICK_REFERENCE.md → 📦 Deployment
- Then: INFRASTRUCTURE_RUNBOOK.md → 11. Deployment Pipeline
- Review: .github/workflows/deploy.yml for actual steps
"The database is slow"
- Read: INFRASTRUCTURE_RUNBOOK.md → 4. Database Layer (Connection Pooling)
- Check: INFRASTRUCTURE_QUICK_REFERENCE.md → 🆘 "Database connection pooling full?"
- Query: Use Prometheus queries from INFRASTRUCTURE_RUNBOOK.md
"How do I restore from backup?"
- Read: INFRASTRUCTURE_RUNBOOK.md → 8. Backup & Recovery
- Steps: "Restore from Backup" section with exact commands
📈 Key Metrics & SLOs
From INFRASTRUCTURE_RUNBOOK.md monitoring section:
| Metric | Warning | Critical | Source |
|---|---|---|---|
| API p99 latency | > 1s (5min) | > 3s (3min) | Prometheus histogram |
| API p99 latency (per endpoint) | > 2s (5min) | N/A | Prometheus |
| 5xx error rate | > 1% (5min) | N/A | Prometheus |
| Database response | Monitored | Monitored | Grafana dashboard |
| Redis availability | Graceful fallback | Graceful fallback | App continues on DB |
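To make the first row of the table concrete, the sketch below maps a measured API p99 latency to the alert level the table implies. It is illustrative only: the classify_p99_ms function is hypothetical, it takes the latency in milliseconds to avoid floating-point comparison in shell, and it ignores the 5-minute and 3-minute sustain windows that the real alert rules apply.

```shell
#!/bin/sh
# Sketch only: classify an API p99 latency against the table's thresholds.
# classify_p99_ms is a hypothetical helper, not part of the GoodGo repo.
# Thresholds from the table: warning above 1s, critical above 3s.
#
# Usage: classify_p99_ms <latency_in_milliseconds>
classify_p99_ms() {
  if [ "$1" -gt 3000 ]; then
    echo critical
  elif [ "$1" -gt 1000 ]; then
    echo warning
  else
    echo ok
  fi
}
```

The comparisons are strict, so a p99 of exactly 1000 ms is still reported as ok, matching the "> 1s" wording in the table.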
Dashboards available at http://localhost:3002 (Grafana):
- API Latency
- API Overview
- Database Metrics
- Logs & Errors
- Search Analytics
- Web Vitals
- Business Metrics
🔐 Security Notes
From INFRASTRUCTURE_RUNBOOK.md environment variables section:
CRITICAL (Production):
- JWT_SECRET must be ≥32 characters (generate: openssl rand -base64 48)
- KYC_ENCRYPTION_KEY must be 64 hex chars (generate: openssl rand -hex 32)
- All payment gateway credentials must be rotated regularly
- Redis requires password authentication in production
- Docker containers run as non-root (node user)
- Read-only filesystems for application containers
- No new privileges flag set
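The two secret-format rules above can be checked before a deploy. The following is a minimal sketch, assuming the values are available as shell strings. The function names jwt_secret_ok and kyc_key_ok are hypothetical and are not part of the GoodGo repository.

```shell
#!/bin/sh
# Sketch only: validate the secret formats listed in the security notes.
# Both helpers are hypothetical; they are not part of the GoodGo repo.

# Succeeds when the value is at least 32 characters long.
jwt_secret_ok() {
  [ "${#1}" -ge 32 ]
}

# Succeeds when the value is exactly 64 hexadecimal characters.
kyc_key_ok() {
  [ "${#1}" -eq 64 ] || return 1
  case "$1" in
    # Reject the value if any character falls outside 0-9, a-f, A-F.
    *[!0-9a-fA-F]*) return 1 ;;
  esac
  return 0
}

# Example (not executed here), assuming both variables are exported:
#   jwt_secret_ok "$JWT_SECRET" || echo "JWT_SECRET is too short" >&2
#   kyc_key_ok "$KYC_ENCRYPTION_KEY" || echo "KYC_ENCRYPTION_KEY is malformed" >&2
```

Values produced by the generate commands above pass both checks: openssl rand -base64 48 emits 64 characters, and openssl rand -hex 32 emits 64 hex digits.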
📞 Escalation Path
- Immediate Issue? → INFRASTRUCTURE_QUICK_REFERENCE.md
- Complex Problem? → INFRASTRUCTURE_RUNBOOK.md section 12
- Need Audit Trail? → INFRASTRUCTURE_AUDIT.md
- Still Stuck? → Check .github/workflows/ or git history
📝 Document Updates
These documents were generated on April 11, 2026 from a complete infrastructure audit of the GoodGo Platform monorepo.
To keep these documents up to date:
- Update these docs when adding new services
- Review monitoring configs after infrastructure changes
- Test backup procedures monthly (verification is already automated by the weekly backup-verify.yml workflow)
- Update runbooks based on incident postmortems
🎓 Learning Path
For new team members:
- Day 1: Read INFRASTRUCTURE_QUICK_REFERENCE.md (30 min)
- Day 2: Read INFRASTRUCTURE_RUNBOOK.md sections 1-3 (1 hour)
- Day 3: Practice commands from Quick Reference with mentor
- Day 4: Read INFRASTRUCTURE_RUNBOOK.md sections 4-7 (1.5 hours)
- Day 5: Read INFRASTRUCTURE_RUNBOOK.md sections 8-12 (1.5 hours)
- Week 2: Shadow on-call engineer, practice troubleshooting
- Week 3: Take on-call shift
Last Updated: April 11, 2026
Version: 1.0
Maintainers: GoodGo Platform SRE Team
For questions or updates to this documentation, contact: devops@goodgo.vn