goodgo-platform/docs/audits/README_INFRASTRUCTURE.md
Ho Ngoc Hai b8512ebff4 docs: consolidate audit and analysis reports into docs/audits/
Move 36 root-level audit/analysis documents and 7 web app audit documents
into docs/audits/ directory to declutter the project root. Remove stale
EXPLORATION_SUMMARY.txt.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-11 01:37:50 +07:00


# 📚 GoodGo Platform — Infrastructure Documentation
This directory contains **three comprehensive operational documents** for the GoodGo Platform infrastructure.
## 📖 Documentation Files
### 1. **INFRASTRUCTURE_RUNBOOK.md** (1,458 lines)
**→ Read this for complete operational reference**
Comprehensive guide covering:
- ✅ Executive summary (12+ services overview)
- ✅ Complete service inventory with ports, health checks, dependencies
- ✅ Docker Compose specifications (dev, prod, CI environments)
- ✅ Database layer (PostgreSQL 16 + PostGIS, 22 Prisma models)
- ✅ Connection pooling (PgBouncer configuration, transaction mode)
- ✅ Backup & recovery strategies (daily automated backups, verification)
- ✅ Caching & search (Redis graceful degradation, Typesense full-text)
- ✅ Monitoring & observability (Prometheus, Grafana dashboards, Loki logs)
- ✅ Payment integration (VNPay, MoMo, ZaloPay, callback handling)
- ✅ Health checks (liveness, readiness, dependency-specific probes)
- ✅ Complete environment variables reference
- ✅ Deployment pipeline (GitHub Actions CI/CD, Docker builds)
- ✅ Detailed troubleshooting guide with 7+ common issues
- ✅ Emergency procedures and Prometheus queries
**Use when:** Creating runbooks, investigating outages, onboarding new ops team members
---
### 2. **INFRASTRUCTURE_QUICK_REFERENCE.md** (222 lines)
**→ Read this for quick lookup**
Quick reference covering:
- 🚀 Quick start commands (dev, prod, CI)
- 📊 Service map with ports and health checks
- 🗄️ Database overview (backup schedule, connection pooling)
- 💾 Cache & search summary (Redis, Typesense features)
- 📈 Monitoring dashboard links
- 💳 Payment gateway summary
- 🏥 Health endpoint reference
- 🔐 Critical environment variables
- 📦 Deployment container images
- 🆘 Common troubleshooting steps (5 quick fixes)
- 📝 Key file locations and links
- 📞 Common Docker commands
**Use when:** Debugging quickly, on-call shift lookup, quick health checks
---
### 3. **INFRASTRUCTURE_AUDIT.md** (1,246 lines)
**→ Read this for the complete audit trail of what was examined**
Detailed audit including:
- Raw configuration file contents
- Line-by-line analysis of each service
- Environment variable specifications
- Payment callback flow diagram (text)
- Health check implementation details
- Backup verification workflow
- CI/CD pipeline stages
**Use when:** Verifying infrastructure documentation accuracy, compliance audits
---
## 🎯 Quick Navigation
### By Role
**🔧 DevOps/SRE Engineer**
1. Start: INFRASTRUCTURE_QUICK_REFERENCE.md (5 min overview)
2. Deep dive: INFRASTRUCTURE_RUNBOOK.md (sections 2-3, 7, 11)
3. Reference: INFRASTRUCTURE_AUDIT.md (for raw configs)
**💼 Engineering Manager/Tech Lead**
1. Start: INFRASTRUCTURE_RUNBOOK.md (section 1: Executive Summary)
2. Details: INFRASTRUCTURE_RUNBOOK.md (sections 2-6, 10)
**🚀 On-Call Engineer**
1. Start: INFRASTRUCTURE_QUICK_REFERENCE.md (entire document)
2. Troubleshoot: INFRASTRUCTURE_RUNBOOK.md (section 12)
3. Debug: INFRASTRUCTURE_AUDIT.md (raw logs/configs if needed)
**👤 New Team Member**
1. Start: INFRASTRUCTURE_QUICK_REFERENCE.md (overview)
2. Learn: INFRASTRUCTURE_RUNBOOK.md (sections 1-6)
3. Practice: Use common commands from Quick Reference
---
## 🔍 Common Questions & Where to Find Answers
| Question | Document | Section |
|----------|----------|---------|
| "How many services are running?" | Runbook | 1. Executive Summary |
| "What ports do I need to know?" | Quick Reference | 📊 Service Map |
| "How is the database backed up?" | Runbook | 8. Backup & Recovery |
| "Payment callback failed, what now?" | Runbook | 12. Troubleshooting (Payment Callback) |
| "Redis is down, will the app work?" | Runbook | 5. Caching & Search (Graceful Degradation) |
| "How do I restart a service?" | Quick Reference | 📞 Common Commands |
| "What's the monitoring setup?" | Runbook | 6. Monitoring & Observability |
| "Where are environment variables?" | Runbook | 9. Environment Variables |
| "How do I deploy to production?" | Runbook | 11. Deployment Pipeline |
| "What does a health check do?" | Runbook | 7. Health Checks |
---
## 📊 Infrastructure at a Glance
```
Development Environment
├── 12 Services (no resource limits)
├── PostgreSQL 16 + PostGIS (5432)
├── Redis 7 (6379, 256MB)
├── Typesense 27.1 (8108)
├── Prometheus (9090, 15-day retention)
├── Grafana (3002, 7 dashboards)
├── Loki (3100, 15-day logs)
└── API/Web/AI services
Production Environment
├── 14 Services (with resource limits, security hardening)
├── PgBouncer (6432, 20-connection pool)
├── PostgreSQL 16 + PostGIS (5432)
├── Redis 7 (6379, 512MB, password auth)
├── Typesense 27.1 (8108)
├── Prometheus (9090, 30-day retention)
├── Grafana (3002, secrets management)
├── Loki (3100, 15-day logs)
└── API/Web/AI services (zero-downtime deployments)
CI/E2E Environment
├── 4 Services (tmpfs for speed)
├── PostgreSQL test DB
├── Redis (no persistence)
└── Typesense + MinIO (tmpfs)
```
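The port assignments in the overview above can be kept in a small lookup helper for use in scripts. This is only a sketch: the service labels (`postgres`, `grafana`, and so on) are names chosen for illustration and may not match the actual Compose service names, while the port numbers are taken from the overview.

```bash
# Hypothetical helper: map a service label to the port listed in the overview.
# The labels are illustrative and are NOT confirmed Compose service names.
port_for() {
  case "$1" in
    postgres)   echo 5432 ;;
    pgbouncer)  echo 6432 ;;   # production only
    redis)      echo 6379 ;;
    typesense)  echo 8108 ;;
    prometheus) echo 9090 ;;
    grafana)    echo 3002 ;;
    loki)       echo 3100 ;;
    *)          echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Print one "label port" line per requested service.
list_ports() {
  for svc in "$@"; do
    printf '%s %s\n' "$svc" "$(port_for "$svc")"
  done
}
```

For example, `list_ports postgres redis grafana` prints the three label and port pairs, one per line.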
---
## 🔗 Related Files in Repository
```
goodgo-platform-ai/
├── README_INFRASTRUCTURE.md (THIS FILE)
├── INFRASTRUCTURE_RUNBOOK.md (Complete reference)
├── INFRASTRUCTURE_QUICK_REFERENCE.md (Quick lookup)
├── INFRASTRUCTURE_AUDIT.md (Detailed audit)
├── docker-compose.yml (Dev environment)
├── docker-compose.prod.yml (Production)
├── docker-compose.ci.yml (Testing)
├── .env.example (Environment variables template)
├── prisma/schema.prisma (Data model, 22 Prisma models)
├── infra/pgbouncer/ (Connection pooling)
├── monitoring/ (Prometheus, Grafana, Loki configs)
├── scripts/backup/ (Backup and verification scripts)
└── .github/workflows/ (CI/CD pipelines)
    ├── ci.yml (Lint → Test → Build)
    ├── deploy.yml (Build images, deploy)
    ├── e2e.yml (End-to-end tests)
    ├── backup-verify.yml (Weekly backup verification)
    └── security.yml (Dependency scanning)
```
---
## 🆘 Immediate Help
### "The API is down. What do I check?"
1. Read: INFRASTRUCTURE_QUICK_REFERENCE.md → 🆘 Troubleshooting
2. Quick commands:
```bash
docker compose ps api
docker compose logs api --tail=50
curl http://localhost:3001/health/ready
```
3. If still stuck: See INFRASTRUCTURE_RUNBOOK.md → 12. Troubleshooting
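After restarting the API, the readiness check from the quick commands above can be wrapped in a retry loop instead of being run by hand. The following is a minimal sketch: the function name `wait_ready` and the default attempt count and delay are assumptions made for illustration, and only the URL comes from this document.

```bash
# Poll the readiness endpoint until it answers with a 2xx status or attempts run out.
# The default URL matches the quick command above; attempts and delay are illustrative.
wait_ready() {
  url="${1:-http://localhost:3001/health/ready}"
  attempts="${2:-10}"
  delay="${3:-3}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: fail on HTTP errors, -s: silent, --max-time: do not hang on a dead socket
    if curl -fs --max-time 5 -o /dev/null "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}
```

Calling `wait_ready` with no arguments polls the local API for up to ten attempts. A non-zero exit status means the service never became ready, which is the point to move on to the runbook's troubleshooting section.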
### "I need to deploy to production"
1. Read: INFRASTRUCTURE_QUICK_REFERENCE.md → 📦 Deployment
2. Then: INFRASTRUCTURE_RUNBOOK.md → 11. Deployment Pipeline
3. Review: `.github/workflows/deploy.yml` for actual steps
### "The database is slow"
1. Read: INFRASTRUCTURE_RUNBOOK.md → 4. Database Layer (Connection Pooling)
2. Check: INFRASTRUCTURE_QUICK_REFERENCE.md → 🆘 "Database connection pooling full?"
3. Query: Use Prometheus queries from INFRASTRUCTURE_RUNBOOK.md
### "How do I restore from backup?"
1. Read: INFRASTRUCTURE_RUNBOOK.md → 8. Backup & Recovery
2. Steps: Follow the "Restore from Backup" subsection, which lists the exact commands
---
## 📈 Key Metrics & SLOs
From INFRASTRUCTURE_RUNBOOK.md monitoring section:
| Metric | Warning | Critical | Source |
|--------|---------|----------|--------|
| API p99 latency (overall) | > 1s for 5 min | > 3s for 3 min | Prometheus histogram |
| API p99 latency (per endpoint) | > 2s for 5 min | N/A | Prometheus |
| 5xx error rate | > 1% for 5 min | N/A | Prometheus |
| Database response | Monitored | Monitored | Grafana dashboard |
| Redis availability | Graceful fallback | Graceful fallback | App continues on DB |
Dashboards available at `http://localhost:3002` (Grafana):
- API Latency
- API Overview
- Database Metrics
- Logs & Errors
- Search Analytics
- Web Vitals
- Business Metrics
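The overall API p99 thresholds in the table above can be expressed as a small classifier, together with a helper that builds a matching PromQL expression. This is a sketch under stated assumptions: the metric name `http_request_duration_seconds_bucket` is hypothetical (take the real one from the runbook), and the classifier looks at a single value only, so it does not model the 5-minute and 3-minute durations that the alerts require.

```bash
# Classify an overall API p99 latency (in seconds) against the table's thresholds:
# warning above 1s, critical above 3s. The alert duration is not modeled here.
classify_p99() {
  awk -v v="$1" 'BEGIN {
    if (v > 3)      print "critical"
    else if (v > 1) print "warning"
    else            print "ok"
  }'
}

# Build a PromQL expression for p99 latency over a time window.
# ASSUMPTION: the default histogram metric name is hypothetical.
p99_query() {
  metric="${1:-http_request_duration_seconds_bucket}"
  window="${2:-5m}"
  printf 'histogram_quantile(0.99, sum(rate(%s[%s])) by (le))\n' "$metric" "$window"
}
```

The expression can be sent to the Prometheus HTTP API, for example with `curl -G http://localhost:9090/api/v1/query --data-urlencode "query=$(p99_query)"`, and the returned value passed to `classify_p99`.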
---
## 🔐 Security Notes
From INFRASTRUCTURE_RUNBOOK.md environment variables section:
**CRITICAL (Production):**
- JWT_SECRET must be ≥32 characters (generate: `openssl rand -base64 48`)
- KYC_ENCRYPTION_KEY must be 64 hex chars (generate: `openssl rand -hex 32`)
- All payment gateway credentials must be rotated regularly
- Redis requires password authentication in production
- Docker containers run as non-root (`node` user)
- Application containers use read-only filesystems
- The `no-new-privileges` security option is set
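The two format rules above for `JWT_SECRET` and `KYC_ENCRYPTION_KEY` can be checked before a deployment. The sketch below is one possible approach, not part of the platform's tooling: all three function names are hypothetical, and the checks cover length and character set only, not the strength of the secret.

```bash
# Return 0 if the value has at least 32 characters (the JWT_SECRET rule above).
valid_jwt_secret() {
  [ "${#1}" -ge 32 ]
}

# Return 0 if the value is exactly 64 hexadecimal characters
# (the KYC_ENCRYPTION_KEY rule above).
valid_kyc_key() {
  [ "${#1}" -eq 64 ] || return 1
  case "$1" in
    *[!0-9a-fA-F]*) return 1 ;;   # contains a non-hex character
    *)              return 0 ;;
  esac
}

# Hypothetical pre-deploy check; returns non-zero on the first failure.
check_secrets() {
  valid_jwt_secret "${JWT_SECRET:-}" ||
    { echo "JWT_SECRET must be at least 32 characters" >&2; return 1; }
  valid_kyc_key "${KYC_ENCRYPTION_KEY:-}" ||
    { echo "KYC_ENCRYPTION_KEY must be 64 hex characters" >&2; return 1; }
  echo "secrets look well-formed"
}
```

With both variables exported in the current shell, `check_secrets` prints a confirmation or names the variable that failed, without echoing the secret itself.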
---
## 📞 Escalation Path
1. **Immediate Issue?** → INFRASTRUCTURE_QUICK_REFERENCE.md
2. **Complex Problem?** → INFRASTRUCTURE_RUNBOOK.md section 12
3. **Need Audit Trail?** → INFRASTRUCTURE_AUDIT.md
4. **Still Stuck?** → Check `.github/workflows/` or the git history
---
## 📝 Document Updates
These documents were generated on **April 11, 2026** from a complete infrastructure audit of the GoodGo Platform monorepo.
**To keep up-to-date:**
- Update these docs when adding new services
- Review monitoring configs after infrastructure changes
- Test backup restore procedures monthly (backup verification itself is already automated and runs weekly via `backup-verify.yml`)
- Update runbooks based on incident postmortems
---
## 🎓 Learning Path
**For new team members:**
1. **Day 1:** Read INFRASTRUCTURE_QUICK_REFERENCE.md (30 min)
2. **Day 2:** Read INFRASTRUCTURE_RUNBOOK.md sections 1-3 (1 hour)
3. **Day 3:** Practice commands from Quick Reference with mentor
4. **Day 4:** Read INFRASTRUCTURE_RUNBOOK.md sections 4-7 (1.5 hours)
5. **Day 5:** Read INFRASTRUCTURE_RUNBOOK.md sections 8-12 (1.5 hours)
6. **Week 2:** Shadow on-call engineer, practice troubleshooting
7. **Week 3:** Take on-call shift
---
**Last Updated:** April 11, 2026
**Version:** 1.0
**Maintainers:** GoodGo Platform SRE Team
---
*For questions or updates to this documentation, contact: devops@goodgo.vn*