System Design

Overall architecture of the GoodGo Microservices Platform.

Overview Diagram

graph TD
    subgraph "Client Layer"
        Web[Web App<br/>Next.js]
        Mobile[Mobile App<br/>Flutter]
    end
    
    subgraph "API Gateway Layer"
        Traefik[Traefik<br/>API Gateway]
    end
    
    subgraph "Services Layer"
        IAM[IAM Service<br/>Auth & RBAC]
        Future1[Future Service 1]
        Future2[Future Service 2]
    end
    
    subgraph "Infrastructure Layer"
        DB[(Neon PostgreSQL<br/>Primary Database)]
        Cache[(Redis<br/>Cache & Session)]
        Kafka[Apache Kafka<br/>Event Streaming]
    end
    
    subgraph "Observability Layer"
        Prom[Prometheus<br/>Metrics]
        Loki[Loki<br/>Logs]
        Jaeger[Jaeger<br/>Tracing]
        Grafana[Grafana<br/>Dashboards]
    end
    
    Web --> Traefik
    Mobile --> Traefik
    
    Traefik --> IAM
    Traefik --> Future1
    Traefik --> Future2
    
    IAM --> DB
    IAM --> Cache
    IAM --> Kafka
    
    Future1 --> DB
    Future1 --> Cache
    Future1 --> Kafka
    
    Future2 --> DB
    Future2 --> Cache
    Future2 --> Kafka
    
    IAM -.->|metrics| Prom
    Future1 -.->|metrics| Prom
    Future2 -.->|metrics| Prom
    
    IAM -.->|logs| Loki
    Future1 -.->|logs| Loki
    Future2 -.->|logs| Loki
    
    IAM -.->|traces| Jaeger
    Future1 -.->|traces| Jaeger
    Future2 -.->|traces| Jaeger
    
    Prom --> Grafana
    Loki --> Grafana
    Jaeger --> Grafana
    
    style Traefik fill:#e1f5ff
    style DB fill:#f0e1ff
    style Cache fill:#fff4e1
    style Kafka fill:#d4edda
    style Grafana fill:#ffe1e1

Architecture Description

GoodGo Platform is built on a microservices architecture that follows these principles:

Core Principles:

  1. Service Independence: Each service has its own database and can be deployed independently
  2. API Gateway Pattern: Traefik handles routing, load balancing, and cross-cutting concerns
  3. Shared Libraries: Common functionality is extracted into shared packages (@goodgo/*)
  4. Infrastructure as Code: All infrastructure configuration is kept under version control
  5. Observability First: Every service emits metrics, logs, and distributed traces

Technology Stack:

  • Frontend: Next.js 14+ (App Router), Flutter 3.x
  • Backend: Node.js 20+, TypeScript 5+, Express
  • Database: Neon PostgreSQL (serverless)
  • Cache: Redis (multi-layer caching)
  • Message Broker: Apache Kafka
  • API Gateway: Traefik
  • Observability: Prometheus, Grafana, Loki, Jaeger

System Context

C4Context
    title GoodGo Platform System Context Diagram
    
    Person(user, "User", "End users accessing the platform")
    Person(admin, "Admin", "System administrators")
    Person(developer, "Developer", "Platform developers")
    
    System(platform, "GoodGo Platform", "Microservices platform for business applications")
    
    System_Ext(neon, "Neon PostgreSQL", "Serverless PostgreSQL database")
    System_Ext(redis, "Redis", "In-memory cache and session store")
    System_Ext(kafka, "Apache Kafka", "Event streaming platform")
    System_Ext(monitoring, "Monitoring Stack", "Prometheus + Grafana + Loki + Jaeger")
    
    Rel(user, platform, "Uses", "HTTPS")
    Rel(admin, platform, "Manages", "HTTPS")
    Rel(developer, platform, "Develops & Deploys", "Git, CI/CD")
    
    Rel(platform, neon, "Stores data", "PostgreSQL Protocol")
    Rel(platform, redis, "Caches data", "Redis Protocol")
    Rel(platform, kafka, "Publishes/Consumes events", "Kafka Protocol")
    Rel(platform, monitoring, "Sends metrics, logs, traces", "HTTP, gRPC")

Components

Frontend Layer

Web App (Next.js)

Description: Web application built with Next.js 14+ and the App Router

Key features:

  • Server-side rendering (SSR) and Static Site Generation (SSG)
  • API routes for the BFF (Backend for Frontend) pattern
  • Optimized image loading with next/image
  • Built-in routing and code splitting

Technologies:

  • Next.js 14+, React 18+, TypeScript
  • Tailwind CSS, Zustand (state management)
  • @goodgo/http-client, @goodgo/types

File location: apps/web-client/

Mobile App (Flutter)

Description: Cross-platform mobile application built with Flutter

Key features:

  • Cross-platform (iOS, Android)
  • Native performance
  • Provider pattern for state management
  • Offline-first with local storage

Technologies:

  • Flutter 3.x, Dart
  • Provider, Dio (HTTP client)

File location: apps/mobile-client/

API Gateway Layer

Traefik

Description: Reverse proxy and API gateway that handles routing, load balancing, and SSL termination

Key features:

  • Dynamic service discovery
  • Automatic HTTPS with Let's Encrypt
  • Load balancing and health checks
  • Rate limiting and circuit breaker
  • Middleware chains (CORS, auth, logging)

Technologies:

  • Traefik 2.x
  • Docker labels for dynamic configuration

File location: infra/traefik/

Services Layer

IAM Service

Description: Identity and Access Management service that handles authentication and authorization

Key features:

  • JWT authentication (RS256)
  • RBAC (Role-Based Access Control)
  • ABAC (Attribute-Based Access Control)
  • Event sourcing for the audit trail
  • Zero-trust device validation

Technologies:

  • Node.js, Express, TypeScript
  • Prisma ORM, bcrypt, jsonwebtoken
  • @goodgo/logger, @goodgo/tracing

File location: services/iam-service/

Future Services

Description: Services to be developed in the future

Planned:

  • Payment Service - Payment processing
  • Order Service - Order management
  • Notification Service - Sending notifications
  • Analytics Service - Data analytics

Infrastructure Layer

Neon PostgreSQL

Description: Serverless PostgreSQL database with auto-scaling

Key features:

  • Serverless with auto-scaling
  • Branching for development/staging
  • Point-in-time recovery
  • Connection pooling

File location: Database schemas live in each service (services/*/prisma/schema.prisma)

Redis

Description: In-memory cache and session store

Key features:

  • Multi-layer caching (L1: Memory, L2: Redis)
  • Session storage
  • Rate limiting counters
  • Pub/Sub for real-time features

File location: infra/redis/

Apache Kafka

Description: Event streaming platform for asynchronous communication

Key features:

  • Event-driven architecture
  • Event sourcing
  • Eventual consistency
  • Dead letter queue (DLQ)

File location: infra/kafka/
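
As a rough illustration of the dead letter queue (DLQ) item above, the sketch below retries a failing event handler a few times and then parks the event on a DLQ. The `DomainEvent` shape, the `processWithDlq` helper, and the retry limit of 3 are assumptions made for illustration; a plain array stands in for the DLQ topic and no Kafka client API is used.

```typescript
// Hypothetical event shape; the real GoodGo event schema is not shown in this document.
interface DomainEvent {
  type: string;
  payload: unknown;
}

type Handler = (event: DomainEvent) => Promise<void>;

// Tries the handler up to maxAttempts times. If every attempt fails, the event
// is pushed to the dead letter queue instead of blocking the consumer.
async function processWithDlq(
  event: DomainEvent,
  handler: Handler,
  deadLetters: DomainEvent[],
  maxAttempts = 3,
): Promise<"processed" | "dead-lettered"> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await handler(event);
      return "processed";
    } catch {
      // Ignore the error and retry until the attempts run out.
    }
  }
  deadLetters.push(event);
  return "dead-lettered";
}
```

A production consumer would likely add backoff between attempts and record the failure reason alongside the dead-lettered event.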

Data Flow

sequenceDiagram
    participant Client
    participant Traefik as API Gateway
    participant Service
    participant Cache as Redis
    participant DB as PostgreSQL
    participant Kafka
    
    Client->>Traefik: HTTPS Request
    Traefik->>Traefik: Rate Limiting
    Traefik->>Traefik: JWT Validation
    Traefik->>Service: Route to Service
    
    Service->>Cache: Check Cache
    alt Cache Hit
        Cache-->>Service: Return Cached Data
    else Cache Miss
        Service->>DB: Query Database
        DB-->>Service: Return Data
        Service->>Cache: Store in Cache (TTL: 5min)
    end
    
    Service->>Service: Process Business Logic
    Service->>DB: Update Data (if needed)
    Service->>Kafka: Publish Event (async)
    
    Service-->>Traefik: Response
    Traefik-->>Client: HTTPS Response
    
    Note over Kafka: Event consumers process asynchronously

Detailed explanation:

  1. Request: The client sends an HTTPS request to Traefik
  2. Gateway Processing: Traefik performs rate limiting and JWT validation
  3. Routing: Traefik routes the request to the appropriate service
  4. Cache Check: The service checks the L1 (memory) cache, then the L2 (Redis) cache
  5. Database Query: On a cache miss, the service queries PostgreSQL
  6. Cache Update: The service stores the result in the cache with an appropriate TTL (5 minutes in the diagram)
  7. Business Logic: The service processes the business logic and updates the database if needed
  8. Event Publishing: The service publishes domain events to Kafka (async)
  9. Response: The service returns the response to the client via Traefik
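
The cache steps above (L1, then L2, then the database) can be sketched as a small read-through helper. This is a minimal sketch under assumptions: `L2Store`, `FakeL2`, and `TwoLevelCache` are hypothetical names, the in-memory `FakeL2` stands in for Redis so the example runs without a server, and the 300-second TTL mirrors the 5-minute value in the sequence diagram.

```typescript
// Minimal interface for the L2 store; a real Redis client would sit behind it.
interface L2Store {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// In-memory stand-in for Redis (the TTL argument is accepted but ignored here).
class FakeL2 implements L2Store {
  private data = new Map<string, string>();
  async get(key: string): Promise<string | null> {
    return this.data.get(key) ?? null;
  }
  async set(key: string, value: string, _ttlSeconds: number): Promise<void> {
    this.data.set(key, value);
  }
}

class TwoLevelCache {
  private l1 = new Map<string, { value: string; expiresAt: number }>();
  private l2: L2Store;
  private ttlSeconds: number;

  constructor(l2: L2Store, ttlSeconds = 300) {
    this.l2 = l2;
    this.ttlSeconds = ttlSeconds;
  }

  // Checks L1, then L2, then falls back to the loader (the database query).
  async getOrLoad(key: string, loader: () => Promise<string>): Promise<string> {
    const hit = this.l1.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value;

    const fromL2 = await this.l2.get(key);
    if (fromL2 !== null) {
      this.remember(key, fromL2);
      return fromL2;
    }

    const loaded = await loader();
    await this.l2.set(key, loaded, this.ttlSeconds);
    this.remember(key, loaded);
    return loaded;
  }

  // Stores a value in the process-local L1 map with an expiry timestamp.
  private remember(key: string, value: string): void {
    this.l1.set(key, { value, expiresAt: Date.now() + this.ttlSeconds * 1000 });
  }
}
```

With this shape, a second instance of the service that shares the same L2 store gets an L2 hit for keys another instance has already loaded, which is the main reason to keep both layers.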

Database Architecture

erDiagram
    User ||--o{ Session : has
    User ||--o{ UserRole : has
    User ||--o{ UserPermission : has
    User ||--o{ MFADevice : has
    User ||--o{ AuditEvent : triggers
    
    Role ||--o{ UserRole : assigned_to
    Role ||--o{ RolePermission : has
    
    Permission ||--o{ RolePermission : granted_to
    Permission ||--o{ UserPermission : granted_to
    
    Organization ||--o{ User : contains
    Organization ||--o{ Role : defines
    
    User {
        string id PK
        string email UK
        string passwordHash
        string organizationId FK
        boolean mfaEnabled
        datetime createdAt
        datetime updatedAt
    }
    
    Session {
        string id PK
        string userId FK
        string refreshTokenHash
        string deviceFingerprint
        string ipAddress
        datetime expiresAt
        datetime createdAt
    }
    
    Role {
        string id PK
        string name
        string organizationId FK
        int hierarchy
        datetime createdAt
    }
    
    Permission {
        string id PK
        string resource
        string action
        string scope
        datetime createdAt
    }
    
    AuditEvent {
        string id PK
        string userId FK
        string eventType
        json eventData
        datetime timestamp
    }

Description:

  • Database per Service: Each service has its own database schema
  • Shared Database: Services currently share one Neon PostgreSQL database, with each service's schema isolated through Prisma
  • Event Sourcing: Audit events record all important changes
  • Soft Delete: Records are marked with a deletedAt field instead of being hard-deleted
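
To make the soft-delete convention concrete, here is a minimal sketch that works on plain objects. The `SoftDeletable` shape and both helper names are hypothetical; only the `deletedAt` field comes from the description above, and no Prisma API is used.

```typescript
// Hypothetical record shape; only deletedAt is taken from the document.
interface SoftDeletable {
  id: string;
  deletedAt: Date | null;
}

// Marks a record as deleted without removing it from storage.
// Returns a copy so the caller's original object is left unchanged.
function softDelete<T extends SoftDeletable>(record: T, now: Date = new Date()): T {
  return { ...record, deletedAt: now };
}

// Default read path: hide anything that has been soft-deleted.
function onlyActive<T extends SoftDeletable>(records: T[]): T[] {
  return records.filter((r) => r.deletedAt === null);
}
```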

Design Decisions

Decision 1: Microservices Architecture

Context: Each business domain needs to scale and deploy independently

Decision: Use a microservices architecture with the database-per-service pattern

Consequences:

  • Positive:
    • Independent scaling per service based on demand
    • Independent deployment, reduced release risk
    • Fault isolation: one service failure doesn't affect the entire system
    • Technology flexibility: each service can use a different tech stack
  • Negative:
    • More complex than a monolith (distributed systems challenges)
    • Eventual consistency instead of strong consistency
    • Complex distributed transactions (Saga pattern)
    • Operational overhead (monitoring, deployment)

Alternatives: Monolith, Modular Monolith


Decision 2: Traefik as API Gateway

Context: Need a reverse proxy, load balancing, SSL termination, and service discovery

Decision: Use Traefik instead of Kong, NGINX, or AWS API Gateway

Consequences:

  • Positive:
    • Auto service discovery with Docker labels
    • Dynamic configuration without restart
    • Built-in Let's Encrypt support
    • Native Kubernetes integration
    • Built-in metrics and tracing
  • Negative:
    • Higher learning curve than NGINX
    • Smaller plugin ecosystem than Kong
    • Smaller community than NGINX

Alternatives: Kong, NGINX, AWS API Gateway, Envoy


Decision 3: Neon PostgreSQL (Serverless)

Context: Need a database that offers auto-scaling and branching and is cost-effective for development

Decision: Use Neon PostgreSQL (serverless) instead of self-hosted PostgreSQL or AWS RDS

Consequences:

  • Positive:
    • Auto-scaling based on usage
    • Database branching for dev/staging
    • Pay-per-use pricing model
    • Automatic backups and point-in-time recovery
    • No infrastructure management
  • Negative:
    • Vendor lock-in
    • Cold start latency (mitigated by connection pooling)
    • Limited control over database configuration

Alternatives: Self-hosted PostgreSQL, AWS RDS, Google Cloud SQL

Performance Characteristics

| Metric | Target | Notes |
|---|---|---|
| API Response Time (P95) | < 200ms | Excluding external API calls |
| API Response Time (P99) | < 500ms | Peak load scenarios |
| Throughput | 1000 req/s | Per service instance |
| Database Query Time (P95) | < 50ms | Simple queries with indexes |
| Cache Hit Rate (L1) | > 40% | In-memory cache |
| Cache Hit Rate (L2) | > 80% | Redis cache |
| Event Publish Latency (P95) | < 10ms | Kafka fire-and-forget |
| Service Availability | > 99.9% | Monthly uptime target |
| Error Rate | < 1% | 4xx + 5xx errors |
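
To put the availability target in concrete terms, the small helper below converts an availability fraction into a monthly downtime budget. The function name and the 30-day month are assumptions for illustration.

```typescript
// Converts an availability target (e.g. 0.999) into allowed downtime in minutes.
function allowedDowntimeMinutes(availability: number, daysInMonth = 30): number {
  const totalMinutes = daysInMonth * 24 * 60;
  return (1 - availability) * totalMinutes;
}
```

For a 30-day month, a 99.9% target leaves roughly 43.2 minutes of downtime.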

Performance optimizations:

  • Multi-layer caching (L1: Memory, L2: Redis)
  • Connection pooling for database
  • Pagination for list endpoints (max 100 items)
  • Database indexes for frequently queried fields
  • Async event publishing (fire-and-forget)
  • CDN for static assets (Next.js)
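
The pagination cap can be enforced at the edge of each list endpoint. The sketch below is one possible shape: `parsePageParams` and the default page size of 20 are assumptions, while the maximum of 100 comes from the list above.

```typescript
const MAX_PAGE_SIZE = 100; // from the "max 100 items" rule
const DEFAULT_PAGE_SIZE = 20; // assumed default, not stated in this document

interface PageParams {
  limit: number;
  offset: number;
}

// Parses untrusted query-string values into a safe limit/offset pair.
function parsePageParams(rawLimit?: string, rawPage?: string): PageParams {
  const limit = Number.parseInt(rawLimit ?? "", 10);
  const page = Number.parseInt(rawPage ?? "", 10);
  const safeLimit =
    Number.isNaN(limit) || limit < 1 ? DEFAULT_PAGE_SIZE : Math.min(limit, MAX_PAGE_SIZE);
  const safePage = Number.isNaN(page) || page < 1 ? 1 : page;
  return { limit: safeLimit, offset: (safePage - 1) * safeLimit };
}
```

Clamping instead of rejecting oversized limits keeps clients working while still bounding query cost; returning a 400 error would be an equally reasonable choice.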

Security Considerations

Authentication:

  • JWT with RS256 (asymmetric signing)
  • Access token: 15 minutes expiry
  • Refresh token: 7 days expiry, rotation on use
  • httpOnly cookies for token storage
  • MFA support (TOTP, backup codes)

Authorization:

  • RBAC (Role-Based Access Control)
  • ABAC (Attribute-Based Access Control)
  • Permission format: resource:action:scope
  • Permission caching (5 min TTL)
  • Zero-trust device validation
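
A permission check against the resource:action:scope format could look like the sketch below. The helper names are hypothetical, and treating `*` as a wildcard segment is an assumption; the document does not say whether wildcards are supported.

```typescript
// Returns true when a granted permission covers the required one.
// Both strings use the resource:action:scope format.
// ASSUMPTION: "*" in a granted segment matches any value.
function permissionCovers(granted: string, required: string): boolean {
  const g = granted.split(":");
  const r = required.split(":");
  if (g.length !== 3 || r.length !== 3) return false;
  return g.every((part, i) => part === "*" || part === r[i]);
}

// A user is authorized if any of their granted permissions covers the request.
function hasPermission(grantedList: string[], required: string): boolean {
  return grantedList.some((granted) => permissionCovers(granted, required));
}
```

For example, a user holding `users:*:org` would pass a check for `users:delete:org` but not for `orders:read:all`.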

Network Security:

  • TLS 1.2+ enforcement
  • HTTPS-only (HSTS headers)
  • Rate limiting: 100 req/15min (standard), 10 req/hour (strict)
  • CORS whitelist from environment variables
  • Network policies (Kubernetes)
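
The two rate-limit tiers can be modeled with a fixed-window counter. This is an in-memory sketch under assumptions: the class name is hypothetical, the algorithm (fixed window rather than sliding window or token bucket) is a guess, and a deployment with several instances would keep the counters in a shared store such as Redis rather than in process memory.

```typescript
// Fixed-window counter: at most `limit` requests per `windowMs` for each key.
class FixedWindowRateLimiter {
  private windows = new Map<string, { start: number; count: number }>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed; `now` is injectable for testing.
  allow(key: string, now: number = Date.now()): boolean {
    const current = this.windows.get(key);
    if (!current || now - current.start >= this.windowMs) {
      // First request for this key, or the previous window has expired.
      this.windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}

// The two tiers listed above.
const standardLimiter = new FixedWindowRateLimiter(100, 15 * 60 * 1000); // 100 req / 15 min
const strictLimiter = new FixedWindowRateLimiter(10, 60 * 60 * 1000); // 10 req / hour
```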

Data Protection:

  • AES-256-GCM encryption for PII at rest
  • bcrypt (cost 12) for password hashing
  • SHA-256 hashing for tokens before storage
  • Database encryption at rest (Neon)
  • TLS in-transit for all connections
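
Hashing tokens before storage means a leaked database row cannot be replayed as a token. The sketch below uses Node's built-in `node:crypto` module; the function names are hypothetical, and the link to the `refreshTokenHash` column in the Session entity is an assumption about how the two fit together.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hashes a token so that only the digest is persisted
// (for example in Session.refreshTokenHash).
function hashToken(token: string): string {
  return createHash("sha256").update(token, "utf8").digest("hex");
}

// Compares a presented token against a stored digest in constant time.
function tokenMatches(presented: string, storedHash: string): boolean {
  const a = Buffer.from(hashToken(presented), "hex");
  const b = Buffer.from(storedHash, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

An unsalted SHA-256 is reasonable here only because tokens are long random values; passwords still go through bcrypt as listed above.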

Secrets Management:

  • Kubernetes secrets for production
  • Environment variables validation with Zod
  • No hardcoded secrets in code
  • Quarterly secret rotation

Audit Trail:

  • Event sourcing for all auth events
  • 7-year retention for compliance
  • Immutable audit logs
  • Correlation IDs for request tracing

Deployment

graph TD
    subgraph "Kubernetes Cluster"
        subgraph "Ingress"
            LB[Load Balancer<br/>External IP]
            Traefik[Traefik Pods<br/>Replicas: 2]
        end
        
        subgraph "Services"
            IAM[IAM Service Pods<br/>Replicas: 2-10 HPA]
            Service1[Service 1 Pods<br/>Replicas: 2-10 HPA]
            Service2[Service 2 Pods<br/>Replicas: 2-10 HPA]
        end
        
        subgraph "Infrastructure"
            Redis[Redis Cluster<br/>3 Masters + 3 Slaves]
            Kafka[Kafka Cluster<br/>3 Brokers]
        end
        
        subgraph "Observability"
            Prom[Prometheus<br/>Replicas: 2]
            Loki[Loki<br/>Replicas: 2]
            Jaeger[Jaeger<br/>Replicas: 2]
            Grafana[Grafana<br/>Replicas: 2]
        end
    end
    
    subgraph "External"
        DB[(Neon PostgreSQL<br/>Serverless)]
    end
    
    LB --> Traefik
    Traefik --> IAM
    Traefik --> Service1
    Traefik --> Service2
    
    IAM --> Redis
    IAM --> Kafka
    IAM --> DB
    
    Service1 --> Redis
    Service1 --> Kafka
    Service1 --> DB
    
    Service2 --> Redis
    Service2 --> Kafka
    Service2 --> DB
    
    IAM -.->|metrics| Prom
    Service1 -.->|metrics| Prom
    Service2 -.->|metrics| Prom
    
    IAM -.->|logs| Loki
    Service1 -.->|logs| Loki
    Service2 -.->|logs| Loki
    
    IAM -.->|traces| Jaeger
    Service1 -.->|traces| Jaeger
    Service2 -.->|traces| Jaeger
    
    Prom --> Grafana
    Loki --> Grafana
    Jaeger --> Grafana
    
    style LB fill:#e1f5ff
    style DB fill:#f0e1ff
    style Redis fill:#fff4e1
    style Kafka fill:#d4edda
    style Grafana fill:#ffe1e1

Deployment Strategy:

  • Rolling updates (maxSurge: 1, maxUnavailable: 0)
  • Zero-downtime deployments
  • Blue-green deployment for major releases
  • Canary deployment for high-risk changes

Auto-scaling:

  • Horizontal Pod Autoscaler (HPA)
    • Min replicas: 2
    • Max replicas: 10
    • Target CPU: 70%
    • Target Memory: 80%
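
The Horizontal Pod Autoscaler derives its replica count from the ratio of current to target utilization, takes the largest proposal across metrics, and clamps the result to the configured bounds. The sketch below reproduces that arithmetic with the values listed above; the function and type names are hypothetical, and the autoscaler's tolerance band and stabilization windows are left out.

```typescript
interface MetricReading {
  current: number; // observed utilization, in percent
  target: number; // target utilization, in percent
}

// desired = ceil(currentReplicas * current / target), taking the largest
// proposal across metrics, then clamping to [minReplicas, maxReplicas].
function desiredReplicas(
  currentReplicas: number,
  metrics: MetricReading[],
  minReplicas = 2,
  maxReplicas = 10,
): number {
  const proposals = metrics.map((m) => Math.ceil(currentReplicas * (m.current / m.target)));
  const wanted = Math.max(...proposals);
  return Math.min(maxReplicas, Math.max(minReplicas, wanted));
}
```

For example, 4 replicas at 90% CPU (target 70%) and 60% memory (target 80%) yield proposals of 6 and 3, so the service scales to 6.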

Resource Allocation:

| Service | Requests | Limits |
|---|---|---|
| Microservices | 256Mi RAM, 250m CPU | 512Mi RAM, 500m CPU |
| Traefik | 512Mi RAM, 500m CPU | 1Gi RAM, 1000m CPU |
| Redis | 2Gi RAM, 1 CPU | 4Gi RAM, 2 CPU |
| Prometheus | 4Gi RAM, 2 CPU | 8Gi RAM, 4 CPU |

Health Checks:

  • Liveness probe: /health/live (K8s restarts if fails)
  • Readiness probe: /health/ready (K8s removes from LB if fails)
  • Startup probe: /health/live (initial delay 30s)
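
The difference between the two probes is easiest to see in code: liveness only says the process can answer, while readiness depends on downstream checks. The sketch below is framework-free; the `Check` type, the result shape, and the idea of passing named dependency checks are assumptions, and wiring the results to `/health/live` and `/health/ready` routes is left to the HTTP layer.

```typescript
type Check = () => Promise<boolean>;

interface HealthResult {
  status: number; // HTTP status code the endpoint would return
  body: { status: string; failed: string[] };
}

// Liveness: the process is up and able to answer.
function liveness(): HealthResult {
  return { status: 200, body: { status: "ok", failed: [] } };
}

// Readiness: every dependency check must pass, otherwise report 503
// so Kubernetes stops routing traffic to this pod.
async function readiness(checks: Record<string, Check>): Promise<HealthResult> {
  const failed: string[] = [];
  for (const [name, check] of Object.entries(checks)) {
    const ok = await check().catch(() => false);
    if (!ok) failed.push(name);
  }
  return failed.length === 0
    ? { status: 200, body: { status: "ready", failed } }
    : { status: 503, body: { status: "not ready", failed } };
}
```

Keeping dependency checks out of the liveness probe avoids restart loops when a shared dependency such as Redis is briefly unavailable.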

Environments:

  • Local: Docker Compose
  • Staging: Kubernetes cluster (shared)
  • Production: Kubernetes cluster (dedicated)

Monitoring & Observability

Application Metrics:

  • http_requests_total - Total HTTP requests (counter)
  • http_request_duration_seconds - Request duration (histogram)
  • http_requests_active - Active requests (gauge)
  • cache_hits_total / cache_misses_total - Cache performance
  • db_query_duration_seconds - Database query duration

Infrastructure Metrics:

  • CPU usage, Memory usage per pod
  • Network I/O, Disk I/O
  • Pod restart count
  • Node resource utilization

Business Metrics:

  • User registrations per day
  • Login success/failure rate
  • API usage by endpoint
  • Error rate by service

Health Checks:

  • /health/live - Liveness probe (service running?)
  • /health/ready - Readiness probe (ready for traffic?)
  • /metrics - Prometheus metrics endpoint

Alerting Rules:

# High error rate: more than 5% of requests return 5xx
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 2m
  labels:
    severity: warning

# High latency: P95 above 500ms
- alert: HighLatency
  expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
  for: 5m
  labels:
    severity: warning

# Service down
- alert: ServiceDown
  expr: up == 0
  for: 1m
  labels:
    severity: critical

# High memory usage
- alert: HighMemoryUsage
  expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.85
  for: 5m
  labels:
    severity: warning

Logging:

  • Structured JSON logging with Winston
  • Correlation IDs for request tracing
  • Log levels: error, warn, info, debug
  • Log aggregation with Loki
  • 7-day log retention

Distributed Tracing:

  • OpenTelemetry instrumentation
  • Jaeger backend
  • Trace sampling: 10% in production, 100% in staging
  • Span attributes: service, operation, user_id, correlation_id
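
Ratio-based sampling works best when the decision is derived from the trace ID, so that every service handling the same trace agrees on whether to keep it. The sketch below illustrates that idea with plain arithmetic; it is not the OpenTelemetry sampler API, and the function name and the per-environment map are assumptions.

```typescript
// Decides whether to sample a trace from its hex ID, so every service that
// sees the same trace ID reaches the same decision.
function shouldSample(traceIdHex: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Use the first 32 bits of the ID as a bucket in the range 0 .. 2^32 - 1.
  const bucket = Number.parseInt(traceIdHex.slice(0, 8), 16);
  if (Number.isNaN(bucket)) return false;
  return bucket < ratio * 0x100000000;
}

// Sampling ratios from the list above.
const SAMPLING_RATIO: Record<string, number> = { production: 0.1, staging: 1.0 };
```

In practice the OpenTelemetry SDK ships its own ratio-based sampler, which would normally be configured instead of hand-rolling one.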

Last Updated: 2026-01-07
Authors: GoodGo Architecture Team
Reviewers: GoodGo Development Team