# System Design

> Overall architecture of the GoodGo Microservices Platform

## Overview Diagram

```mermaid
graph TD
    subgraph "Client Layer"
        Web[Web App<br/>Next.js]
        Mobile[Mobile App<br/>Flutter]
    end

    subgraph "API Gateway Layer"
        Traefik[Traefik<br/>API Gateway]
    end

    subgraph "Services Layer"
        IAM[IAM Service<br/>Auth & RBAC]
        Future1[Future Service 1]
        Future2[Future Service 2]
    end

    subgraph "Infrastructure Layer"
        DB[(Neon PostgreSQL<br/>Primary Database)]
        Cache[(Redis<br/>Cache & Session)]
        Kafka[Apache Kafka<br/>Event Streaming]
    end

    subgraph "Observability Layer"
        Prom[Prometheus<br/>Metrics]
        Loki[Loki<br/>Logs]
        Jaeger[Jaeger<br/>Tracing]
        Grafana[Grafana<br/>Dashboards]
    end

    Web --> Traefik
    Mobile --> Traefik

    Traefik --> IAM
    Traefik --> Future1
    Traefik --> Future2

    IAM --> DB
    IAM --> Cache
    IAM --> Kafka
    Future1 --> DB
    Future1 --> Cache
    Future1 --> Kafka
    Future2 --> DB
    Future2 --> Cache
    Future2 --> Kafka

    IAM -.->|metrics| Prom
    Future1 -.->|metrics| Prom
    Future2 -.->|metrics| Prom
    IAM -.->|logs| Loki
    Future1 -.->|logs| Loki
    Future2 -.->|logs| Loki
    IAM -.->|traces| Jaeger
    Future1 -.->|traces| Jaeger
    Future2 -.->|traces| Jaeger

    Prom --> Grafana
    Loki --> Grafana
    Jaeger --> Grafana

    style Traefik fill:#e1f5ff
    style DB fill:#f0e1ff
    style Cache fill:#fff4e1
    style Kafka fill:#d4edda
    style Grafana fill:#ffe1e1
```

## Architecture Description

GoodGo Platform is built on a microservices architecture with the following principles:

**Core Principles**:
1. **Service Independence**: Each service has its own database and can be deployed independently
2. **API Gateway Pattern**: Traefik handles routing, load balancing, and cross-cutting concerns
3. **Shared Libraries**: Common functionality is extracted into shared packages (`@goodgo/*`)
4. **Infrastructure as Code**: All infrastructure configuration is version-controlled
5. **Observability First**: Full metrics, logging, and distributed tracing

**Technology Stack**:
- **Frontend**: Next.js 14+ (App Router), Flutter 3.x
- **Backend**: Node.js 20+, TypeScript 5+, Express
- **Database**: Neon PostgreSQL (serverless)
- **Cache**: Redis (multi-layer caching)
- **Message Broker**: Apache Kafka
- **API Gateway**: Traefik
- **Observability**: Prometheus, Grafana, Loki, Jaeger

## System Context

```mermaid
C4Context
    title System Context Diagram for GoodGo Platform

    Person(user, "User", "End users accessing the platform")
    Person(admin, "Admin", "System administrators")
    Person(developer, "Developer", "Platform developers")

    System(platform, "GoodGo Platform", "Microservices platform for business applications")

    System_Ext(neon, "Neon PostgreSQL", "Serverless PostgreSQL database")
    System_Ext(redis, "Redis", "In-memory cache and session store")
    System_Ext(kafka, "Apache Kafka", "Event streaming platform")
    System_Ext(monitoring, "Monitoring Stack", "Prometheus + Grafana + Loki + Jaeger")

    Rel(user, platform, "Uses", "HTTPS")
    Rel(admin, platform, "Manages", "HTTPS")
    Rel(developer, platform, "Develops & Deploys", "Git, CI/CD")

    Rel(platform, neon, "Stores data", "PostgreSQL Protocol")
    Rel(platform, redis, "Caches data", "Redis Protocol")
    Rel(platform, kafka, "Publishes/Consumes events", "Kafka Protocol")
    Rel(platform, monitoring, "Sends metrics, logs, traces", "HTTP, gRPC")
```

## Components

### Frontend Layer

#### Web App (Next.js)

**Description**: Web application built with Next.js 14+ and the App Router

**Key features**:
- Server-side rendering (SSR) and static site generation (SSG)
- API routes for the BFF (Backend for Frontend) pattern
- Optimized image loading with `next/image`
- Built-in routing and code splitting
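
As a rough illustration of the BFF pattern named in the feature list, a Next.js route handler could forward the caller's token to the IAM service and return only the fields the UI needs. The TypeScript sketch below is not taken from the GoodGo codebase: the IAM address, the `/users/me` path, the response shape, and the `makeProfileHandler` name are all assumptions made for illustration.

```typescript
// Hypothetical BFF handler, e.g. for app/api/profile/route.ts (path is an assumption).
type Fetcher = (url: string, init?: RequestInit) => Promise<Response>;

// Assumed internal address of the IAM service; not taken from the source document.
const IAM_BASE_URL = "http://iam-service:3000";

// Small helper that builds a JSON response with the given status code.
function jsonResponse(body: unknown, status: number): Response {
  return new Response(JSON.stringify(body), {
    status,
    headers: { "content-type": "application/json" },
  });
}

// Builds a GET handler. The fetcher is injectable so the handler can be
// exercised without a running IAM service.
function makeProfileHandler(fetcher: Fetcher = fetch) {
  return async function GET(request: Request): Promise<Response> {
    const auth = request.headers.get("authorization");
    if (!auth) {
      return jsonResponse({ error: "unauthorized" }, 401);
    }

    // Forward the caller's credentials to the (assumed) IAM endpoint.
    const upstream = await fetcher(`${IAM_BASE_URL}/users/me`, {
      headers: { authorization: auth },
    });
    if (!upstream.ok) {
      return jsonResponse({ error: "upstream_error" }, 502);
    }

    // Return only what the web client needs, dropping any other fields.
    const user = (await upstream.json()) as { id: string; email: string };
    return jsonResponse({ id: user.id, email: user.email }, 200);
  };
}

// In a real route file the handler would be exported, for example:
//   export const GET = makeProfileHandler();
```

Injecting the fetcher keeps the handler testable in isolation; whether the BFF should call services directly or go back through Traefik is a design choice this sketch leaves open.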

**Technologies**:
- Next.js 14+, React 18+, TypeScript
- Tailwind CSS, Zustand (state management)
- `@goodgo/http-client`, `@goodgo/types`

**File location**: `apps/web-client/`

#### Mobile App (Flutter)

**Description**: Cross-platform mobile application built with Flutter

**Key features**:
- Cross-platform (iOS, Android)
- Native performance
- Provider pattern for state management
- Offline-first with local storage

**Technologies**:
- Flutter 3.x, Dart
- Provider, Dio (HTTP client)

**File location**: `apps/mobile-client/`

### API Gateway Layer

#### Traefik

**Description**: Reverse proxy and API gateway that handles routing, load balancing, and SSL termination

**Key features**:
- Dynamic service discovery
- Automatic HTTPS with Let's Encrypt
- Load balancing and health checks
- Rate limiting and circuit breaker
- Middleware chains (CORS, auth, logging)

**Technologies**:
- Traefik 2.x
- Docker labels for dynamic configuration

**File location**: `infra/traefik/`

### Services Layer

#### IAM Service

**Description**: Identity and Access Management service that handles authentication and authorization

**Key features**:
- JWT authentication (RS256)
- RBAC (Role-Based Access Control)
- ABAC (Attribute-Based Access Control)
- Event sourcing for the audit trail
- Zero-trust device validation

**Technologies**:
- Node.js, Express, TypeScript
- Prisma ORM, bcrypt, jsonwebtoken
- `@goodgo/logger`, `@goodgo/tracing`

**File location**: `services/iam-service/`

#### Future Services

**Description**: Services to be developed in the future

**Planned**:
- Payment Service - Payment processing
- Order Service - Order management
- Notification Service - Sending notifications
- Analytics Service - Data analytics

### Infrastructure Layer

#### Neon PostgreSQL

**Description**:
Serverless PostgreSQL database with auto-scaling

**Key features**:
- Serverless with auto-scaling
- Branching for development/staging
- Point-in-time recovery
- Connection pooling

**File location**: Database schemas live in each service (`services/*/prisma/schema.prisma`)

#### Redis

**Description**: In-memory cache and session store

**Key features**:
- Multi-layer caching (L1: Memory, L2: Redis)
- Session storage
- Rate limiting counters
- Pub/Sub for real-time features

**File location**: `infra/redis/`

#### Apache Kafka

**Description**: Event streaming platform for asynchronous communication

**Key features**:
- Event-driven architecture
- Event sourcing
- Eventual consistency
- Dead letter queue (DLQ)

**File location**: `infra/kafka/`

## Data Flow

```mermaid
sequenceDiagram
    participant Client
    participant Traefik as API Gateway
    participant Service
    participant Cache as Redis
    participant DB as PostgreSQL
    participant Kafka

    Client->>Traefik: HTTPS Request
    Traefik->>Traefik: Rate Limiting
    Traefik->>Traefik: JWT Validation
    Traefik->>Service: Route to Service

    Service->>Cache: Check Cache
    alt Cache Hit
        Cache-->>Service: Return Cached Data
    else Cache Miss
        Service->>DB: Query Database
        DB-->>Service: Return Data
        Service->>Cache: Store in Cache (TTL: 5min)
    end

    Service->>Service: Process Business Logic
    Service->>DB: Update Data (if needed)
    Service->>Kafka: Publish Event (async)

    Service-->>Traefik: Response
    Traefik-->>Client: HTTPS Response

    Note over Kafka: Event consumers process asynchronously
```

**Detailed Explanation**:
1. **Request**: The client sends an HTTPS request to Traefik
2. **Gateway Processing**: Traefik performs rate limiting and JWT validation
3. **Routing**: Traefik routes the request to the appropriate service
4. **Cache Check**: The service checks the L1 (memory) cache, then the L2 (Redis) cache
5. **Database Query**: On a cache miss, the service queries PostgreSQL
6. **Cache Update**: The result is stored in the cache with an appropriate TTL
7. **Business Logic**: The service runs its business logic
8. **Event Publishing**: Domain events are published to Kafka asynchronously
9. **Response**: The response is returned to the client via Traefik

## Database Architecture

```mermaid
erDiagram
    User ||--o{ Session : has
    User ||--o{ UserRole : has
    User ||--o{ UserPermission : has
    User ||--o{ MFADevice : has
    User ||--o{ AuditEvent : triggers

    Role ||--o{ UserRole : assigned_to
    Role ||--o{ RolePermission : has

    Permission ||--o{ RolePermission : granted_to
    Permission ||--o{ UserPermission : granted_to

    Organization ||--o{ User : contains
    Organization ||--o{ Role : defines

    User {
        string id PK
        string email UK
        string passwordHash
        string organizationId FK
        boolean mfaEnabled
        datetime createdAt
        datetime updatedAt
    }

    Session {
        string id PK
        string userId FK
        string refreshTokenHash
        string deviceFingerprint
        string ipAddress
        datetime expiresAt
        datetime createdAt
    }

    Role {
        string id PK
        string name
        string organizationId FK
        int hierarchy
        datetime createdAt
    }

    Permission {
        string id PK
        string resource
        string action
        string scope
        datetime createdAt
    }

    AuditEvent {
        string id PK
        string userId FK
        string eventType
        json eventData
        datetime timestamp
    }
```

**Description**:
- **Database per Service**: Each service owns its own database schema
- **Shared Database**: Services currently share one Neon PostgreSQL instance, with each service's schema isolated through Prisma
- **Event Sourcing**: Audit events record all important changes
- **Soft Delete**: Records are marked with a `deletedAt` field instead of being hard-deleted

## Design Decisions

### Decision 1: Microservices Architecture

**Context**: Each business domain needs to scale and deploy independently

**Decision**: Use a microservices architecture with the database-per-service pattern

**Consequences**:
- ✅ **Positive**:
  - Independent scaling per service based on demand
  - Independent deployment, which reduces release risk
  - Fault isolation: one service failing does not take down the entire system
  - Technology flexibility: each service can use a different tech stack
- ❌ **Negative**:
  - More complex than a monolith (distributed systems challenges)
  - Eventual consistency instead of strong consistency
  - Complex distributed transactions (Saga pattern)
  - Operational overhead (monitoring, deployment)

**Alternatives**: Monolith, Modular Monolith

---

### Decision 2: Traefik as API Gateway

**Context**: Need a reverse proxy, load balancing, SSL termination, and service discovery

**Decision**: Use Traefik instead of Kong, NGINX, or AWS API Gateway

**Consequences**:
- ✅ **Positive**:
  - Automatic service discovery with Docker labels
  - Dynamic configuration without restarts
  - Built-in Let's Encrypt support
  - Native Kubernetes integration
  - Built-in metrics and tracing
- ❌ **Negative**:
  - Steeper learning curve than NGINX
  - Smaller plugin ecosystem than Kong
  - Smaller community than NGINX

**Alternatives**: Kong, NGINX, AWS API Gateway, Envoy

---

### Decision 3: Neon PostgreSQL (Serverless)

**Context**: Need a database with auto-scaling and branching that is cost-effective for development

**Decision**: Use Neon PostgreSQL (serverless) instead of self-hosted PostgreSQL or AWS RDS

**Consequences**:
- ✅ **Positive**:
  - Auto-scaling
based on usage
  - Database branching for dev/staging
  - Pay-per-use pricing model
  - Automatic backups and point-in-time recovery
  - No infrastructure management
- ❌ **Negative**:
  - Vendor lock-in
  - Cold start latency (mitigated by connection pooling)
  - Limited control over database configuration

**Alternatives**: Self-hosted PostgreSQL, AWS RDS, Google Cloud SQL

## Performance Characteristics

| Metric | Target | Notes |
|--------|--------|-------|
| **API Response Time (P95)** | < 200ms | Excluding external API calls |
| **API Response Time (P99)** | < 500ms | Peak load scenarios |
| **Throughput** | 1000 req/s | Per service instance |
| **Database Query Time (P95)** | < 50ms | Simple queries with indexes |
| **Cache Hit Rate (L1)** | > 40% | In-memory cache |
| **Cache Hit Rate (L2)** | > 80% | Redis cache |
| **Event Publish Latency (P95)** | < 10ms | Kafka fire-and-forget |
| **Service Availability** | > 99.9% | Monthly uptime target |
| **Error Rate** | < 1% | 4xx + 5xx errors |

**Performance Optimizations**:
- Multi-layer caching (L1: Memory, L2: Redis)
- Connection pooling for the database
- Pagination for list endpoints (max 100 items)
- Database indexes for frequently queried fields
- Async event publishing (fire-and-forget)
- CDN for static assets (Next.js)

## Security Considerations

**Authentication**:
- JWT with RS256 (asymmetric signing)
- Access token: 15-minute expiry
- Refresh token: 7-day expiry, rotated on use
- httpOnly cookies for token storage
- MFA support (TOTP, backup codes)

**Authorization**:
- RBAC (Role-Based Access Control)
- ABAC (Attribute-Based Access Control)
- Permission format: `resource:action:scope`
- Permission caching (5 min TTL)
- Zero-trust device validation

**Network Security**:
- TLS 1.2+ enforcement
- HTTPS-only (HSTS headers)
- Rate limiting: 100 req/15min (standard), 10 req/hour (strict)
- CORS whitelist loaded from environment variables
- Network policies (Kubernetes)

**Data Protection**:
- AES-256-GCM encryption for PII at rest
- bcrypt (cost 12) for password hashing
- SHA-256 hashing for tokens before storage
- Database encryption at rest (Neon)
- TLS in transit for all connections

**Secrets Management**:
- Kubernetes secrets for production
- Environment variable validation with Zod
- No hardcoded secrets in code
- Quarterly secret rotation

**Audit Trail**:
- Event sourcing for all auth events
- 7-year retention
for compliance
- Immutable audit logs
- Correlation IDs for request tracing

## Deployment

```mermaid
graph TD
    subgraph "Kubernetes Cluster"
        subgraph "Ingress"
            LB[Load Balancer<br/>External IP]
            Traefik[Traefik Pods<br/>Replicas: 2]
        end

        subgraph "Services"
            IAM[IAM Service Pods<br/>Replicas: 2-10 HPA]
            Service1[Service 1 Pods<br/>Replicas: 2-10 HPA]
            Service2[Service 2 Pods<br/>Replicas: 2-10 HPA]
        end

        subgraph "Infrastructure"
            Redis[Redis Cluster<br/>3 Masters + 3 Slaves]
            Kafka[Kafka Cluster<br/>3 Brokers]
        end

        subgraph "Observability"
            Prom[Prometheus<br/>Replicas: 2]
            Loki[Loki<br/>Replicas: 2]
            Jaeger[Jaeger<br/>Replicas: 2]
            Grafana[Grafana<br/>Replicas: 2]
        end
    end

    subgraph "External"
        DB[(Neon PostgreSQL<br/>Serverless)]
    end

    LB --> Traefik
    Traefik --> IAM
    Traefik --> Service1
    Traefik --> Service2

    IAM --> Redis
    IAM --> Kafka
    IAM --> DB
    Service1 --> Redis
    Service1 --> Kafka
    Service1 --> DB
    Service2 --> Redis
    Service2 --> Kafka
    Service2 --> DB

    IAM -.->|metrics| Prom
    Service1 -.->|metrics| Prom
    Service2 -.->|metrics| Prom
    IAM -.->|logs| Loki
    Service1 -.->|logs| Loki
    Service2 -.->|logs| Loki
    IAM -.->|traces| Jaeger
    Service1 -.->|traces| Jaeger
    Service2 -.->|traces| Jaeger

    Prom --> Grafana
    Loki --> Grafana
    Jaeger --> Grafana

    style LB fill:#e1f5ff
    style DB fill:#f0e1ff
    style Redis fill:#fff4e1
    style Kafka fill:#d4edda
    style Grafana fill:#ffe1e1
```

### Deployment Strategy

**Deployment Strategy**:
- Rolling updates (maxSurge: 1, maxUnavailable: 0)
- Zero-downtime deployments
- Blue-green deployment for major releases
- Canary deployment for high-risk changes

**Auto-scaling**:
- Horizontal Pod Autoscaler (HPA)
- Min replicas: 2
- Max replicas: 10
- Target CPU: 70%
- Target Memory: 80%

**Resource
Allocation**:

| Service | Requests | Limits |
|---------|----------|--------|
| **Microservices** | 256Mi RAM, 250m CPU | 512Mi RAM, 500m CPU |
| **Traefik** | 512Mi RAM, 500m CPU | 1Gi RAM, 1000m CPU |
| **Redis** | 2Gi RAM, 1 CPU | 4Gi RAM, 2 CPU |
| **Prometheus** | 4Gi RAM, 2 CPU | 8Gi RAM, 4 CPU |

**Health Checks**:
- Liveness probe: `/health/live` (Kubernetes restarts the container on failure)
- Readiness probe: `/health/ready` (Kubernetes removes the pod from load balancing on failure)
- Startup probe: `/health/live` (initial delay 30s)

**Environments**:
- **Local**: Docker Compose
- **Staging**: Kubernetes cluster (shared)
- **Production**: Kubernetes cluster (dedicated)

## Monitoring & Observability

### Key Metrics

**Application Metrics**:
- `http_requests_total` - Total HTTP requests (counter)
- `http_request_duration_seconds` - Request duration (histogram)
- `http_requests_active` - Active requests (gauge)
- `cache_hits_total` / `cache_misses_total` - Cache performance
- `db_query_duration_seconds` - Database query duration

**Infrastructure Metrics**:
- CPU usage, Memory usage per pod
- Network I/O, Disk I/O
- Pod restart count
- Node resource utilization

**Business Metrics**:
- User registrations per day
- Login success/failure rate
- API usage by endpoint
- Error rate by service

**Health Checks**:
- `/health/live` - Liveness probe (service running?)
- `/health/ready` - Readiness probe (ready for traffic?)
- `/metrics` - Prometheus metrics endpoint

**Alerting Rules**:

```yaml
# High error rate: more than 5% of requests per job return 5xx
- alert: HighErrorRate
  expr: |
    sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      / sum by (job) (rate(http_requests_total[5m])) > 0.05
  for: 2m
  labels:
    severity: warning

# High latency: P95 request duration above 500 ms
- alert: HighLatency
  expr: |
    histogram_quantile(0.95,
      sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
  for: 5m
  labels:
    severity: warning

# Service down: scrape target is unreachable
- alert: ServiceDown
  expr: up == 0
  for: 1m
  labels:
    severity: critical

# High memory usage: above 85% of the container memory limit
- alert: HighMemoryUsage
  expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.85
  for: 5m
  labels:
    severity: warning
```

**Logging**:
- Structured JSON logging with Winston
- Correlation IDs for request tracing
- Log levels: error, warn, info, debug
- Log aggregation with Loki
- 7-day retention

**Distributed Tracing**:
- OpenTelemetry instrumentation
- Jaeger backend
- Trace sampling: 10% in production, 100% in staging
- Span attributes: service, operation, user_id, correlation_id

## Related Documentation

- [Event-Driven Architecture](./event-driven-architecture.md) - Event-driven architecture
- [Caching Architecture](./caching-architecture.md) - Caching strategy
- [Security Architecture](./security-architecture.md) - Security architecture
- [Observability Architecture](./observability-architecture.md) - Observability
- [Data Consistency Patterns](./data-consistency-patterns.md) - Data consistency patterns
- [Microservices Communication](./microservices-communication.md) - Microservices communication

## References

- [Microservices Patterns](https://microservices.io/patterns/index.html) - Microservices pattern catalog
- [Twelve-Factor App](https://12factor.net/) - Best practices for cloud-native apps
- [C4 Model](https://c4model.com/) - Software architecture diagrams
- [Kubernetes Documentation](https://kubernetes.io/docs/) - Kubernetes official docs
- [Traefik
Documentation](https://doc.traefik.io/traefik/) - Traefik official docs

---

**Last Updated**: 2026-01-07

**Authors**: GoodGo Architecture Team

**Reviewers**: GoodGo Development Team