# Observability Architecture

> Comprehensive observability with metrics, logging, and tracing.

## Overview Diagram

```mermaid
graph TD
    subgraph "Services"
        Service1[Service A]
        Service2[Service B]
    end

    subgraph "Metrics"
        Service1 -->|/metrics| Prom[Prometheus]
        Service2 -->|/metrics| Prom
        Prom --> Grafana[Grafana Dashboards]
    end

    subgraph "Logging"
        Service1 -->|JSON Logs| Loki
        Service2 -->|JSON Logs| Loki
        Loki --> GrafanaLogs[Grafana Log Explorer]
    end

    subgraph "Tracing"
        Service1 -->|Spans| Jaeger
        Service2 -->|Spans| Jaeger
        Jaeger --> JaegerUI[Jaeger UI]
    end

    style Prom fill:#d4edda
    style Loki fill:#fff4e1
    style Jaeger fill:#e1f5ff
```

## System Context

```mermaid
C4Context
    title Observability System Context

    Person(dev, "Developer", "Uses dashboards to monitor system")
    Person(sre, "SRE", "Manages infrastructure & alerts")

    System(obs, "Observability Stack", "Prometheus, Loki, Jaeger, Grafana")

    System_Ext(service, "Microservices", "Sends telemetry data")
    System_Ext(k8s, "Kubernetes", "Sends cluster metrics")

    Rel(dev, obs, "Views Dashboards", "HTTPS")
    Rel(sre, obs, "Configures Alerts", "HTTPS")
    Rel(service, obs, "Push/Pull Telemetry", "HTTP/gRPC")
    Rel(k8s, obs, "Exposes Metrics", "HTTP")
```

### Context Description

- **Observability Stack**: Central point for collecting and visualizing telemetry (Prometheus, Loki, Jaeger, Grafana).
- **Microservices**: Send logs, metrics, and traces (OpenTelemetry).
- **Developer/SRE**: Use Grafana to monitor system health and debug.

## Three Pillars of Observability

### 1. Metrics (Prometheus + Grafana)

```mermaid
graph LR
    Service[Service] -->|Expose /metrics| Prom[Prometheus]
    Prom -->|Scrape every 15s| Metrics[Time Series DB]
    Metrics --> Grafana[Grafana]
    Grafana --> Dashboard1[Request Dashboard]
    Grafana --> Dashboard2[Error Dashboard]
    Grafana --> Dashboard3[Performance Dashboard]

    style Prom fill:#d4edda
    style Grafana fill:#e1f5ff
```

Metrics are numerical measurements sampled over time, such as requests per second, latency, and error counts.

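To make "requests per second" concrete: a counter only ever increases, so a per-second rate is the difference between two scraped samples divided by the time between them. The sketch below illustrates that arithmetic for two scrapes 15 seconds apart. `Sample` and `perSecondRate` are hypothetical names used only for illustration, and the reset handling is a simplification; Prometheus's own `rate()` also extrapolates across the query window.

```typescript
// Hypothetical shape of one scraped counter sample (not a prom-client type).
interface Sample {
  timestampMs: number; // scrape time in milliseconds
  value: number;       // cumulative counter value at that time
}

// Per-second rate between two samples of a monotonically increasing counter.
// If the value went backwards, assume the process restarted and the counter
// began again from zero, so the increase is simply the current value.
function perSecondRate(prev: Sample, curr: Sample): number {
  const elapsedSeconds = (curr.timestampMs - prev.timestampMs) / 1000;
  if (elapsedSeconds <= 0) {
    throw new RangeError('samples must be in increasing time order');
  }
  const increase = curr.value >= prev.value ? curr.value - prev.value : curr.value;
  return increase / elapsedSeconds;
}

// Two scrapes 15 seconds apart: the counter moved from 1200 to 1500 requests.
const requestRate = perSecondRate(
  { timestampMs: 0, value: 1200 },
  { timestampMs: 15_000, value: 1500 },
);
console.log(requestRate); // 20 (requests per second)
```
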
**Implementation**:

```typescript
import { Counter, Histogram, Gauge } from 'prom-client';
import type { Request, Response, NextFunction } from 'express';

// HTTP request metrics
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});

export const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status']
});

export const activeRequests = new Gauge({
  name: 'http_requests_active',
  help: 'Number of active HTTP requests'
});

// Express-style middleware to track metrics
export function metricsMiddleware(req: Request, res: Response, next: NextFunction) {
  const start = Date.now();
  activeRequests.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Prefer the matched route pattern so path parameters do not inflate label cardinality.
    const labels = {
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode
    };

    httpRequestDuration.observe(labels, duration);
    httpRequestTotal.inc(labels);
    activeRequests.dec();
  });

  next();
}
```

### 2. Logging (Winston + Loki)

```mermaid
sequenceDiagram
    participant Service
    participant Winston as Winston Logger
    participant Loki
    participant Grafana

    Service->>Winston: Log event
    Winston->>Winston: Format JSON
    Winston->>Winston: Add metadata (correlation ID, trace ID)
    Winston->>Loki: Push logs
    Loki->>Loki: Index & store

    User->>Grafana: Query logs
    Grafana->>Loki: LogQL query
    Loki-->>Grafana: Log results
```

Structured logging attaches a correlation ID to every log entry so that all entries for one request can be found together.

**Implementation**:

```typescript
import { randomUUID } from 'node:crypto';
import type { Request, Response, NextFunction } from 'express';
import winston from 'winston';

// Extend the Express request type with the fields this middleware attaches.
declare global {
  namespace Express {
    interface Request {
      correlationId?: string;
      logger?: winston.Logger;
    }
  }
}

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: process.env.SERVICE_NAME || 'unknown-service',
    environment: process.env.NODE_ENV || 'development'
  },
  transports: [
    new winston.transports.Console(),
    // Loki transport (if configured)
  ]
});

// Logger middleware
export function loggerMiddleware(req: Request, res: Response, next: NextFunction) {
  const start = Date.now();

  // Reuse the caller's correlation ID when present; otherwise generate one.
  const header = req.headers['x-correlation-id'];
  const correlationId = (Array.isArray(header) ? header[0] : header) || randomUUID();
  const requestLogger = logger.child({ correlationId });
  req.correlationId = correlationId;
  req.logger = requestLogger;

  requestLogger.info('Incoming request', {
    method: req.method,
    path: req.path,
    ip: req.ip
  });

  res.on('finish', () => {
    requestLogger.info('Request completed', {
      method: req.method,
      path: req.path,
      status: res.statusCode,
      duration: Date.now() - start // milliseconds
    });
  });

  next();
}
```

### 3. Tracing (OpenTelemetry + Jaeger)

```mermaid
graph LR
    Request[Incoming Request] --> Trace[Create Trace]
    Trace --> SpanA[Span: HTTP Request]
    SpanA --> SpanB[Span: DB Query]
    SpanA --> SpanC[Span: Cache Check]
    SpanA --> SpanD[Span: External API]

    SpanB --> Jaeger[Jaeger]
    SpanC --> Jaeger
    SpanD --> Jaeger

    Jaeger --> Timeline[Trace Timeline]

    style Trace fill:#e1f5ff
    style Jaeger fill:#d4edda
```

Distributed tracing follows a single request as it moves across services.

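Following a request across services depends on each hop forwarding the trace identity. By default OpenTelemetry propagates it in the W3C Trace Context `traceparent` HTTP header. The sketch below shows that header's layout using two hypothetical helpers, `formatTraceparent` and `parseTraceparent`; they are illustrative only, since in practice the OpenTelemetry SDK's propagator reads and writes the header for you.

```typescript
// Hypothetical helpers illustrating the W3C Trace Context `traceparent` layout:
//   version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
interface TraceParent {
  traceId: string;  // identifies the whole distributed trace
  spanId: string;   // identifies the caller's span (the parent of the next span)
  sampled: boolean; // lowest bit of the flags byte
}

// Only version "00" is accepted in this sketch.
const TRACEPARENT_PATTERN = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: TraceParent): string {
  const flags = ctx.sampled ? '01' : '00';
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

// Example header value taken from the W3C Trace Context specification.
const incoming = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(incoming?.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
```

A downstream service keeps the same `traceId` and sends its own span ID as the new parent, which is how Jaeger can stitch spans from different services into one timeline.
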
**Implementation**:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

// Run `fn` inside a new span and record its outcome on that span.
export function traced<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const tracer = trace.getTracer('app');

  // startActiveSpan makes the span current, so spans created inside `fn` nest under it.
  return tracer.startActiveSpan(name, async (span) => {
    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : String(error)
      });
      span.recordException(error instanceof Error ? error : String(error));
      throw error;
    } finally {
      span.end();
    }
  });
}

// Usage (`userRepository` is defined elsewhere in the application)
async function getUserWithTracing(userId: string) {
  return traced('getUserById', () => userRepository.findById(userId));
}
```

## Health Checks

```typescript
// Liveness probe: is the service process running?
app.get('/health/live', (req, res) => {
  res.json({ status: 'ok', timestamp: new Date().toISOString() });
});

// Readiness probe: is the service ready to receive traffic?
// checkRedis and checkDiskSpace are assumed to follow the same pattern as checkDatabase.
app.get('/health/ready', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    disk: await checkDiskSpace()
  };

  const ready = Object.values(checks).every(check => check === true);

  res.status(ready ? 200 : 503).json({
    ready,
    checks,
    timestamp: new Date().toISOString()
  });
});

async function checkDatabase(): Promise<boolean> {
  try {
    await prisma.$queryRaw`SELECT 1`;
    return true;
  } catch {
    return false;
  }
}
```

## Alerting Rules

```yaml
# Prometheus alerting rules
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # High error rate: share of 5xx responses among all requests
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (> 5%)"

      # High latency: P95 computed from bucket rates, not raw cumulative buckets
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
```

## Performance Characteristics

### Performance Targets

| Metric | Target | Notes |
|--------|--------|-------|
| **Metric Scrape Interval** | 15s | Critical services |
| **Log Ingestion Latency** | < 1s | Time from emit to queryable |
| **Trace Sampling Rate** | 10% | Production (100% in Dev/Staging) |
| **Dashboard Load Time** | < 2s | P95 latency |
| **Alert Evaluation** | Every 1m | Evaluation interval |
| **Retention Policy** | 14 days | Logs & Traces (Metrics: 30 days) |

## Security Considerations

### Observability Security

- **Log Scrubbing**: Automatically scrub PII (emails, SSNs, credit card numbers) and secrets from logs before ingestion.
- **Access Control**: Grafana is integrated with OAuth2/OIDC, with Viewer/Editor/Admin roles.
- **Network Policy**: Allow traffic to the ingestion ports (9090, 3100, 14268) only from internal namespaces.
- **TLS**: Encrypt traffic between agents and collectors.

## Deployment

```mermaid
graph TD
    subgraph "Kubernetes Monitoring Namespace"
        Grafana[Grafana]
        Prom[Prometheus Server]
        Loki[Loki Gateway]
        Jaeger[Jaeger Collector]
    end

    subgraph "App Namespace"
        App[Application Pods]
        Agent[Grafana Agent / Promtail]
    end

    App -->|Push Logs| Agent
    Agent -->|Push| Loki

    Prom -->|Pull Metrics| App
    Prom -->|Pull Metrics| Agent

    App -->|Push Traces| Jaeger

    Grafana --> Prom
    Grafana --> Loki
    Grafana --> Jaeger

    style Grafana fill:#ffe1e1
    style Prom fill:#d4edda
    style Loki fill:#fff4e1
    style Jaeger fill:#e1f5ff
```

**Deployment Description**:

- **Agent**: Promtail or Grafana Agent runs as a DaemonSet or sidecar to collect logs.
- **Pull Model**: Prometheus scrapes metrics from `/metrics` endpoints.
- **Push Model**: Traces and logs are pushed to collectors.
- **Resources**: In production, the monitoring stack runs on dedicated nodes so it does not affect the main workload.

## Related Documentation

- [System Design](./system-design.md) - Overall architecture
- [Caching Architecture](./caching-architecture.md) - Cache metrics

---

**Last Updated**: 2026-01-07
**Authors**: GoodGo Architecture Team