# 📚 GoodGo Platform — Infrastructure Documentation

This directory contains **three comprehensive operational documents** for the GoodGo Platform infrastructure.

## 📖 Documentation Files

### 1. **INFRASTRUCTURE_RUNBOOK.md** (1,458 lines)
**→ Read this for complete operational reference**

Comprehensive guide covering:
- ✅ Executive summary (12+ services overview)
- ✅ Complete service inventory with ports, health checks, dependencies
- ✅ Docker Compose specifications (dev, prod, CI environments)
- ✅ Database layer (PostgreSQL 16 + PostGIS, 22 Prisma models)
- ✅ Connection pooling (PgBouncer configuration, transaction mode)
- ✅ Backup & recovery strategies (daily automated backups, verification)
- ✅ Caching & search (Redis graceful degradation, Typesense full-text)
- ✅ Monitoring & observability (Prometheus, Grafana dashboards, Loki logs)
- ✅ Payment integration (VNPay, MoMo, ZaloPay, callback handling)
- ✅ Health checks (liveness, readiness, dependency-specific probes)
- ✅ Complete environment variables reference
- ✅ Deployment pipeline (GitHub Actions CI/CD, Docker builds)
- ✅ Detailed troubleshooting guide with 7+ common issues
- ✅ Emergency procedures and Prometheus queries

**Use when:** Creating runbooks, investigating outages, onboarding new ops team members

---

### 2. **INFRASTRUCTURE_QUICK_REFERENCE.md** (222 lines)
**→ Read this for quick lookup**

Quick reference covering:
- 🚀 Quick start commands (dev, prod, CI)
- 📊 Service map with ports and health checks
- 🗄️ Database overview (backup schedule, connection pooling)
- 💾 Cache & search summary (Redis, Typesense features)
- 📈 Monitoring dashboard links
- 💳 Payment gateway summary
- 🏥 Health endpoint reference
- 🔐 Critical environment variables
- 📦 Deployment container images
- 🆘 Common troubleshooting steps (5 quick fixes)
- 📝 Key file locations and links
- 📞 Common Docker commands

**Use when:** Debugging quickly, on-call shift lookup, quick health checks

---

### 3. **INFRASTRUCTURE_AUDIT.md** (1,246 lines)
**→ Read this for the complete audit trail of what was explored**

Detailed audit including:
- Raw configuration file contents
- Line-by-line analysis of each service
- Environment variable specifications
- Payment callback flow diagram (text)
- Health check implementation details
- Backup verification workflow
- CI/CD pipeline stages

**Use when:** Verifying infrastructure documentation accuracy, compliance audits

---

## 🎯 Quick Navigation

### By Role

**🔧 DevOps/SRE Engineer**
1. Start: INFRASTRUCTURE_QUICK_REFERENCE.md (5 min overview)
2. Deep dive: INFRASTRUCTURE_RUNBOOK.md (sections 2-3, 7, 11)
3. Reference: INFRASTRUCTURE_AUDIT.md (for raw configs)

**💼 Engineering Manager/Tech Lead**
1. Start: INFRASTRUCTURE_RUNBOOK.md (section 1: Executive Summary)
2. Details: INFRASTRUCTURE_RUNBOOK.md (sections 2-6, 10)

**🚀 On-Call Engineer**
1. Start: INFRASTRUCTURE_QUICK_REFERENCE.md (entire document)
2. Troubleshoot: INFRASTRUCTURE_RUNBOOK.md (section 12)
3. Debug: INFRASTRUCTURE_AUDIT.md (raw logs/configs if needed)

**👤 New Team Member**
1. Start: INFRASTRUCTURE_QUICK_REFERENCE.md (overview)
2. Learn: INFRASTRUCTURE_RUNBOOK.md (sections 1-6)
3. Practice: Use common commands from Quick Reference

---

## 🔍 Common Questions & Where to Find Answers

| Question | Document | Section |
|----------|----------|---------|
| "How many services are running?" | Runbook | 1. Executive Summary |
| "What ports do I need to know?" | Quick Reference | 📊 Service Map |
| "How is the database backed up?" | Runbook | 8. Backup & Recovery |
| "Payment callback failed, what now?" | Runbook | 12. Troubleshooting (Payment Callback) |
| "Redis is down, will the app work?" | Runbook | 5. Caching & Search (Graceful Degradation) |
| "How do I restart a service?" | Quick Reference | 📞 Common Commands |
| "What's the monitoring setup?" | Runbook | 6. Monitoring & Observability |
| "Where are environment variables?" | Runbook | 9. Environment Variables |
| "How do I deploy to production?" | Runbook | 11. Deployment Pipeline |
| "What does a health check do?" | Runbook | 7. Health Checks |

---

## 📊 Infrastructure at a Glance

```
Development Environment
├── 12 Services (no resource limits)
├── PostgreSQL 16 + PostGIS (5432)
├── Redis 7 (6379, 256MB)
├── Typesense 27.1 (8108)
├── Prometheus (9090, 15-day retention)
├── Grafana (3002, 7 dashboards)
├── Loki (3100, 15-day logs)
└── API/Web/AI services

Production Environment
├── 14 Services (with resource limits, security hardening)
├── PgBouncer (6432, 20-connection pool)
├── PostgreSQL 16 + PostGIS (5432)
├── Redis 7 (6379, 512MB, password auth)
├── Typesense 27.1 (8108)
├── Prometheus (9090, 30-day retention)
├── Grafana (3002, secrets management)
├── Loki (3100, 15-day logs)
└── API/Web/AI services (zero-downtime deployments)

CI/E2E Environment
├── 4 Services (tmpfs for speed)
├── PostgreSQL test DB
├── Redis (no persistence)
└── Typesense + MinIO (tmpfs)
```

---

## 🔗 Related Files in Repository

```
goodgo-platform-ai/
├── README_INFRASTRUCTURE.md          (THIS FILE)
├── INFRASTRUCTURE_RUNBOOK.md         (Complete reference)
├── INFRASTRUCTURE_QUICK_REFERENCE.md (Quick lookup)
├── INFRASTRUCTURE_AUDIT.md           (Detailed audit)
│
├── docker-compose.yml                (Dev environment)
├── docker-compose.prod.yml           (Production)
├── docker-compose.ci.yml             (Testing)
│
├── .env.example                      (Environment variables template)
├── prisma/schema.prisma              (Data model, 22 Prisma models)
│
├── infra/pgbouncer/                  (Connection pooling)
├── monitoring/                       (Prometheus, Grafana, Loki configs)
├── scripts/backup/                   (Backup and verification scripts)
│
└── .github/workflows/                (CI/CD pipelines)
    ├── ci.yml                        (Lint → Test → Build)
    ├── deploy.yml                    (Build images, deploy)
    ├── e2e.yml                       (End-to-end tests)
    ├── backup-verify.yml             (Weekly backup verification)
    └── security.yml                  (Dependency scanning)
```

---

## 🆘 Immediate Help

### "The API is down. What do I check?"
1. Read: INFRASTRUCTURE_QUICK_REFERENCE.md → 🆘 Troubleshooting
2. Quick commands:
   ```bash
   docker compose ps api
   docker compose logs api --tail=50
   curl http://localhost:3001/health/ready
   ```
3. If still stuck: See INFRASTRUCTURE_RUNBOOK.md → 12. Troubleshooting

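
The quick commands above can be folded into one small triage helper. This is a minimal sketch rather than the project's own tooling: `classify_status` and `triage_api` are hypothetical names, the service name `api`, port 3001, and the `/health/ready` path come from the commands above, and treating every non-2xx response as "not ready" is an assumption.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of the repository.

# Map an HTTP status code to a short verdict.
# curl reports 000 when it cannot connect at all.
classify_status() {
  case "$1" in
    2??)     echo "ready" ;;
    000|"")  echo "unreachable" ;;
    *)       echo "not-ready" ;;
  esac
}

# Probe the readiness endpoint; on failure, show container state and recent logs.
triage_api() {
  local url="${1:-http://localhost:3001/health/ready}"
  local code verdict
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")" || true
  verdict="$(classify_status "$code")"
  echo "readiness: ${verdict} (HTTP ${code:-000})"
  if [ "$verdict" != "ready" ]; then
    docker compose ps api
    docker compose logs api --tail=50
  fi
}
```

After sourcing the file, run `triage_api` with no arguments to probe the default URL, or pass a different readiness URL as the first argument.
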
### "I need to deploy to production"
1. Read: INFRASTRUCTURE_QUICK_REFERENCE.md → 📦 Deployment
2. Then: INFRASTRUCTURE_RUNBOOK.md → 11. Deployment Pipeline
3. Review: `.github/workflows/deploy.yml` for actual steps

### "The database is slow"
1. Read: INFRASTRUCTURE_RUNBOOK.md → 4. Database Layer (Connection Pooling)
2. Check: INFRASTRUCTURE_QUICK_REFERENCE.md → 🆘 "Database connection pooling full?"
3. Query: Use Prometheus queries from INFRASTRUCTURE_RUNBOOK.md

### "How do I restore from backup?"
1. Read: INFRASTRUCTURE_RUNBOOK.md → 8. Backup & Recovery
2. Steps: "Restore from Backup" section with exact commands

---

## 📈 Key Metrics & SLOs

From INFRASTRUCTURE_RUNBOOK.md monitoring section:

| Metric | Warning | Critical | Source |
|--------|---------|----------|--------|
| API p99 latency | > 1s (5min) | > 3s (3min) | Prometheus histogram |
| API p99 latency (per endpoint) | > 2s (5min) | N/A | Prometheus |
| 5xx error rate | > 1% (5min) | N/A | Prometheus |
| Database response | Monitored | Monitored | Grafana dashboard |
| Redis availability | Graceful fallback | Graceful fallback | App continues on DB |

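
To make the latency and error-rate thresholds in the table above concrete, here is a rough sketch that classifies a single sample against them. `classify_p99` and `classify_5xx_rate` are hypothetical helper names, and the sketch deliberately ignores the 5-minute and 3-minute windows from the table, which an alerting rule would presumably enforce. Call them as `classify_p99 1.4` (seconds) or `classify_5xx_rate 0.3` (percent).

```bash
#!/usr/bin/env bash
# Sketch only: single-sample classification, no time windows.

# classify_p99 SECONDS -> ok | warning | critical
# Thresholds from the table: warning above 1s, critical above 3s.
classify_p99() {
  awk -v v="$1" 'BEGIN {
    if (v > 3)      print "critical"
    else if (v > 1) print "warning"
    else            print "ok"
  }'
}

# classify_5xx_rate PERCENT -> ok | warning
# The table defines only a warning level (above 1%).
classify_5xx_rate() {
  awk -v v="$1" 'BEGIN { print ((v > 1) ? "warning" : "ok") }'
}
```
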
Dashboards available at `http://localhost:3002` (Grafana):
- API Latency
- API Overview
- Database Metrics
- Logs & Errors
- Search Analytics
- Web Vitals
- Business Metrics

---

## 🔐 Security Notes

From INFRASTRUCTURE_RUNBOOK.md environment variables section:

**CRITICAL (Production):**
- JWT_SECRET must be ≥32 characters (generate: `openssl rand -base64 48`)
- KYC_ENCRYPTION_KEY must be 64 hex chars (generate: `openssl rand -hex 32`)
- All payment gateway credentials must be rotated regularly
- Redis requires password authentication in production
- Docker containers run as non-root (node user)
- Read-only filesystems for application containers
- No new privileges flag set

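
The two secret-format rules above are easy to check before a deploy. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the secrets are assumed to be exported as the environment variables `JWT_SECRET` and `KYC_ENCRYPTION_KEY`. Calling `validate_secrets` returns non-zero and prints a message for each rule that fails.

```bash
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of the repository.

# Succeeds if the value is at least 32 characters long.
check_jwt_secret() {
  [ "${#1}" -ge 32 ]
}

# Succeeds if the value is exactly 64 hexadecimal characters.
check_kyc_key() {
  [[ "$1" =~ ^[0-9a-fA-F]{64}$ ]]
}

# Check both variables from the environment; report every failure.
validate_secrets() {
  local failed=0
  check_jwt_secret "${JWT_SECRET:-}" \
    || { echo "JWT_SECRET must be at least 32 characters" >&2; failed=1; }
  check_kyc_key "${KYC_ENCRYPTION_KEY:-}" \
    || { echo "KYC_ENCRYPTION_KEY must be 64 hex characters" >&2; failed=1; }
  return "$failed"
}
```
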
---

## 📞 Escalation Path

1. **Immediate Issue?** → INFRASTRUCTURE_QUICK_REFERENCE.md
2. **Complex Problem?** → INFRASTRUCTURE_RUNBOOK.md section 12
3. **Need Audit Trail?** → INFRASTRUCTURE_AUDIT.md
4. **Still Stuck?** → Check `.github/workflows/` or git history

---

## 📝 Document Updates

These documents were generated on **April 11, 2026** from a complete infrastructure audit of the GoodGo Platform monorepo.

**To keep up-to-date:**
- Update these docs when adding new services
- Review monitoring configs after infrastructure changes
- Test backup procedures monthly (verification is already automated weekly via `backup-verify.yml`)
- Update runbooks based on incident postmortems

---

## 🎓 Learning Path

**For new team members:**
1. **Day 1:** Read INFRASTRUCTURE_QUICK_REFERENCE.md (30 min)
2. **Day 2:** Read INFRASTRUCTURE_RUNBOOK.md sections 1-3 (1 hour)
3. **Day 3:** Practice commands from Quick Reference with mentor
4. **Day 4:** Read INFRASTRUCTURE_RUNBOOK.md sections 4-7 (1.5 hours)
5. **Day 5:** Read INFRASTRUCTURE_RUNBOOK.md sections 8-12 (1.5 hours)
6. **Week 2:** Shadow on-call engineer, practice troubleshooting
7. **Week 3:** Take on-call shift

---

**Last Updated:** April 11, 2026
**Version:** 1.0
**Maintainers:** GoodGo Platform SRE Team

---

*For questions or updates to this documentation, contact: devops@goodgo.vn*