# Observability Architecture

> **Note**: Comprehensive observability with metrics, logging, and tracing.

## Overview Diagram

```mermaid
graph TD
    subgraph "Services"
        Service1[Service A]
        Service2[Service B]
    end

    subgraph "Metrics"
        Service1 -->|/metrics| Prom[Prometheus]
        Service2 -->|/metrics| Prom
        Prom --> Grafana[Grafana<br/>Dashboards]
    end

    subgraph "Logging"
        Service1 -->|JSON Logs| Loki
        Service2 -->|JSON Logs| Loki
        Loki --> GrafanaLogs[Grafana<br/>Log Explorer]
    end

    subgraph "Tracing"
        Service1 -->|Spans| Jaeger
        Service2 -->|Spans| Jaeger
        Jaeger --> JaegerUI[Jaeger UI]
    end

    classDef service fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef metrics fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef logging fill:#C05621,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef tracing fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef dashboard fill:#4A5568,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;

    class Service1,Service2 service;
    class Prom metrics;
    class Loki logging;
    class Jaeger,JaegerUI tracing;
    class Grafana,GrafanaLogs dashboard;
```

## System Context

```mermaid
C4Context
    title Observability Context Diagram

    Person(dev, "Developer", "Uses dashboards to monitor system")
    Person(sre, "SRE", "Manages infrastructure & alerts")

    System(obs, "Observability Stack", "Prometheus, Loki, Jaeger, Grafana")

    System_Ext(service, "Microservices", "Sends telemetry data")
    System_Ext(k8s, "Kubernetes", "Sends cluster metrics")

    Rel(dev, obs, "Views Dashboards", "HTTPS")
    Rel(sre, obs, "Configures Alerts", "HTTPS")
    Rel(service, obs, "Push/Pull Telemetry", "HTTP/gRPC")
    Rel(k8s, obs, "Exposes Metrics", "HTTP")

    UpdateElementStyle(dev, $fontColor="white", $bgColor="#2D3748", $borderColor="white")
    UpdateElementStyle(sre, $fontColor="white", $bgColor="#2D3748", $borderColor="white")
    UpdateElementStyle(obs, $fontColor="white", $bgColor="#2C5282", $borderColor="white")
    UpdateElementStyle(service, $fontColor="white", $bgColor="#4A5568", $borderColor="white")
    UpdateElementStyle(k8s, $fontColor="white", $bgColor="#4A5568", $borderColor="white")
```

### Context Description

- **Observability Stack**: The central point for collecting and visualizing telemetry (Prometheus, Loki, Jaeger, Grafana).
- **Microservices**: Emit logs, metrics, and traces (OpenTelemetry).
- **Developer/SRE**: Use Grafana to monitor system health and to debug.

## The Three Pillars of Observability

### 1. Metrics (Prometheus + Grafana)

```mermaid
graph LR
    Service[Service] -->|Expose /metrics| Prom[Prometheus]
    Prom -->|Scrape every 15s| Metrics[Time Series DB]
    Metrics --> Grafana[Grafana]
    Grafana --> Dashboard1[Request Dashboard]
    Grafana --> Dashboard2[Error Dashboard]
    Grafana --> Dashboard3[Performance Dashboard]

    classDef default fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef prom fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef grafana fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;

    class Prom prom;
    class Grafana grafana;
```

**Description**: Numeric measurements over time (requests/sec, latency, errors).

**Implementation**:

```typescript
import type { NextFunction, Request, Response } from 'express';
import { Counter, Gauge, Histogram } from 'prom-client';

// HTTP request metrics
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5],
});

export const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

export const activeRequests = new Gauge({
  name: 'http_requests_active',
  help: 'Number of active HTTP requests',
});

// Middleware that tracks request count, duration, and in-flight requests
export function metricsMiddleware(req: Request, res: Response, next: NextFunction): void {
  const start = Date.now();
  activeRequests.inc();

  // 'finish' fires once the response has been fully handed off to the OS.
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const labels = {
      method: req.method,
      // Falls back to the raw path when no route matched (for example 404s),
      // which can inflate label cardinality.
      route: req.route?.path || req.path,
      status: res.statusCode,
    };
    httpRequestDuration.observe(labels, duration);
    httpRequestTotal.inc(labels);
  });

  // 'close' fires after 'finish' and also when the client disconnects early,
  // so the in-flight gauge is decremented exactly once per request.
  res.on('close', () => activeRequests.dec());

  next();
}
```

### 2. Logging (Serilog + Loki)

```mermaid
sequenceDiagram
    participant Service
    participant Serilog as Serilog Logger
    participant Loki
    participant Grafana
    participant User

    Service->>Serilog: Log event
    Serilog->>Serilog: Format JSON
    Serilog->>Serilog: Add metadata<br/>(correlation ID, trace ID)
    Serilog->>Loki: Push logs
    Loki->>Loki: Index & store

    User->>Grafana: Query logs
    Grafana->>Loki: LogQL query
    Loki-->>Grafana: Log results
```

**Description**: Structured logging with correlation IDs for tracking requests.

**Implementation (.NET)**:

```csharp
// Requires: using Serilog; using Serilog.Context; using Serilog.Formatting.Json;
//           using Serilog.Sinks.Grafana.Loki; using System.Diagnostics;

// Program.cs - Serilog configuration
// serviceName and environment are assumed to be defined earlier in Program.cs.
builder.Host.UseSerilog((context, config) => config
    .ReadFrom.Configuration(context.Configuration)
    .Enrich.FromLogContext()
    .Enrich.WithProperty("Service", serviceName)
    .Enrich.WithProperty("Environment", environment)
    .WriteTo.Console(new JsonFormatter())
    .WriteTo.GrafanaLoki(
        "http://loki:3100",
        labels: new[] { new LokiLabel { Key = "app", Value = serviceName } }
    ));

// Middleware - Add correlation ID
public class CorrelationIdMiddleware
{
    private readonly RequestDelegate _next;
    private readonly ILogger<CorrelationIdMiddleware> _logger;

    public CorrelationIdMiddleware(RequestDelegate next, ILogger<CorrelationIdMiddleware> logger)
    {
        _next = next;
        _logger = logger;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        // Reuse the caller's correlation ID when present; otherwise generate one.
        var correlationId = context.Request.Headers["X-Correlation-Id"].FirstOrDefault()
            ?? Guid.NewGuid().ToString();

        context.Items["CorrelationId"] = correlationId;
        context.Response.Headers["X-Correlation-Id"] = correlationId;

        // Every log event written inside this scope carries the CorrelationId property.
        using (LogContext.PushProperty("CorrelationId", correlationId))
        {
            _logger.LogInformation("Request started: {Method} {Path}",
                context.Request.Method, context.Request.Path);

            var sw = Stopwatch.StartNew();
            await _next(context);
            sw.Stop();

            _logger.LogInformation("Request completed: {StatusCode} in {Duration}ms",
                context.Response.StatusCode, sw.ElapsedMilliseconds);
        }
    }
}
```

### 3. Tracing (OpenTelemetry + Jaeger)

```mermaid
graph LR
    Request[Incoming Request] --> Trace[Create Trace]
    Trace --> SpanA[Span: HTTP Request]
    SpanA --> SpanB[Span: DB Query]
    SpanA --> SpanC[Span: Cache Check]
    SpanA --> SpanD[Span: External API]
    SpanB --> Jaeger[Jaeger]
    SpanC --> Jaeger
    SpanD --> Jaeger
    Jaeger --> Timeline[Trace Timeline]

    classDef default fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef trace fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef jaeger fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;

    class Trace trace;
    class Jaeger jaeger;
```

**Description**: Distributed tracing to follow requests across services.

> [!NOTE]
> Distributed tracing with Jaeger is planned but not yet deployed. Correlation IDs are currently used for request tracking.

**Implementation (.NET with OpenTelemetry)**:

```csharp
// Program.cs - OpenTelemetry configuration (planned)
// Note: the OpenTelemetry.Exporter.Jaeger package is deprecated upstream in favor
// of OTLP, which recent Jaeger versions ingest natively.
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddSource("UserService") // hypothetical name; must match the ActivitySource below
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddJaegerExporter(options =>
        {
            options.AgentHost = "jaeger";
            options.AgentPort = 6831;
        }));

// Manual span creation
// Hypothetical source name, shown only so the example is complete.
private static readonly ActivitySource ActivitySource = new("UserService");

public async Task<User?> GetUserByIdAsync(Guid userId, CancellationToken ct)
{
    // StartActivity returns null when no listener samples this source, hence the "?." calls.
    using var activity = ActivitySource.StartActivity("GetUserById");
    activity?.SetTag("user.id", userId.ToString());

    try
    {
        var user = await _context.Users.FindAsync([userId], ct);
        activity?.SetStatus(ActivityStatusCode.Ok);
        return user;
    }
    catch (Exception ex)
    {
        activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
        throw;
    }
}
```

## Health Checks

```csharp
// Health check registration (.NET); must run before builder.Build()
builder.Services.AddHealthChecks()
    .AddNpgSql(connectionString, name: "database", tags: new[] { "ready" })
    .AddRedis(redisConnectionString, name: "redis", tags: new[] { "ready" });

// Full report: runs every registered check
app.MapHealthChecks("/health", new HealthCheckOptions
{
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

// Liveness: the predicate excludes every check, so the endpoint reports
// healthy whenever the process is able to respond
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false
});

// Readiness: runs only the checks tagged "ready"
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready")
});
```

## Alerting Rules

```yaml
# Prometheus alerting rules
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # High error rate: share of 5xx responses among all requests, per job
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (> 5%)"

      # High latency: P95 computed from per-bucket rates over 5 minutes
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
```

## Performance Characteristics

### Performance Targets

| Metric | Target | Notes |
|--------|--------|-------|
| **Metric Scrape Interval** | 15s | Critical services |
| **Log Ingestion Latency** | < 1s | Time from emit to queryable |
| **Trace Sampling Rate** | 10% | Production (100% in Dev/Staging) |
| **Dashboard Load Time** | < 2s | P95 Latency |
| **Alert Evaluation** | Every 1m | Evaluation interval |
| **Retention Policy** | 14 days | Logs & Traces (Metrics: 30 days) |

## Security Considerations

### Observability Security

- **Log Scrubbing**: PII (emails, SSNs, credit card numbers) and secrets are removed from logs automatically before ingestion.
- **Access Control**: Grafana is integrated with OAuth2/OIDC, with Viewer/Editor/Admin roles.
- **Network Policy**: Only traffic from internal namespaces may reach the ingestion ports (9090, 3100, 14268).
- **TLS**: Traffic between agents and collectors is encrypted.
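
As an illustration of the log scrubbing item above, here is a minimal TypeScript sketch of a redaction step that could run before a log event is shipped. The source does not describe the actual scrubber, so everything here is an assumption: the names `scrubText`, `scrubLogEvent`, `REDACTION_RULES`, and `SECRET_KEYS` are hypothetical, the regular expressions are deliberately simple, and the list of secret key names is a placeholder. A production scrubber would need patterns tuned to the real data.

```typescript
// Illustrative patterns only (assumed, not taken from the actual pipeline).
const REDACTION_RULES: Array<{ label: string; pattern: RegExp }> = [
  { label: 'EMAIL', pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: 'SSN', pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
  // 13 to 16 digits, optionally separated by single spaces or hyphens.
  { label: 'CARD', pattern: /\b\d(?:[ -]?\d){12,15}\b/g },
];

// Hypothetical field names whose values are always treated as secrets.
const SECRET_KEYS = new Set(['password', 'token', 'authorization', 'apikey', 'secret']);

// Replaces every match of every rule with a labeled placeholder.
export function scrubText(text: string): string {
  return REDACTION_RULES.reduce(
    (acc, rule) => acc.replace(rule.pattern, `[REDACTED_${rule.label}]`),
    text,
  );
}

// Walks a structured log event recursively: secret-named fields are masked
// entirely, string values are pattern-scrubbed, other values pass through.
export function scrubLogEvent(value: unknown, key = ''): unknown {
  if (SECRET_KEYS.has(key.toLowerCase())) return '[REDACTED_SECRET]';
  if (typeof value === 'string') return scrubText(value);
  if (Array.isArray(value)) return value.map((item) => scrubLogEvent(item));
  if (value !== null && typeof value === 'object') {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const k of Object.keys(source)) {
      out[k] = scrubLogEvent(source[k], k);
    }
    return out;
  }
  return value;
}
```

Masking by field name first and by pattern second means a secret is removed even when its value matches none of the patterns. Pattern-based redaction can still miss or over-match values, so it is best treated as a safety net rather than a guarantee.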
## Deployment

```mermaid
graph TD
    subgraph "Kubernetes Monitoring Namespace"
        Grafana[Grafana]
        Prom[Prometheus Server]
        Loki[Loki Gateway]
        Jaeger[Jaeger Collector]
    end

    subgraph "App Namespace"
        App[Application Pods]
        Agent[Grafana Agent / Promtail]
    end

    App -->|Push Logs| Agent
    Agent -->|Push| Loki
    Prom -->|Pull Metrics| App
    Prom -->|Pull Metrics| Agent
    App -->|Push Traces| Jaeger

    Grafana --> Prom
    Grafana --> Loki
    Grafana --> Jaeger

    classDef k8s fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef app fill:#4A5568,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef grafana fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef loki fill:#C05621,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef jaeger fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef prom fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;

    class Grafana grafana;
    class Loki loki;
    class Jaeger jaeger;
    class Prom prom;
    class App,Agent app;
```

**Deployment Description**:

- **Agent**: Promtail or Grafana Agent runs as a DaemonSet or sidecar to collect logs.
- **Pull Model**: Prometheus scrapes metrics from `/metrics` endpoints.
- **Push Model**: Traces and logs are pushed to collectors.
- **Resources**: In production the monitoring stack runs on dedicated nodes so that it does not affect the main workload.

## Related Documents

- [System Design](./system-design.md) - Overall architecture
- [Caching Architecture](./caching-architecture.md) - Cache metrics

## Quick Tips

### Mermaid Common Issues

| Issue | Solution |
|-------|----------|
| **Parse Error** | Check for special characters like `()` or `[]` inside node text without quotes. Use `"text"` for complex strings. |
| **Color Not Showing** | Ensure `style` or `classDef` definitions are correct and IDs match. |
| **Arrow Direction** | `TD` = Top-Down, `LR` = Left-Right. Choose appropriately for layout. |

### Color Pattern Quick Reference

| Element | Color | Hex | Use Case |
|---------|-------|-----|----------|
| **Primary** | Dark Blue | `#2D3748` | System components, core services |
| **Secondary** | Grey | `#4A5568` | Supporting modules, libraries |
| **Accent** | Blue | `#2C5282` | Databases, external APIs |
| **Highlight** | Teal | `#285E61` | User interactions, highlights |
| **Success** | Green | `#2F855A` | Successful states, active |
| **Warning** | Orange | `#C05621` | Warning/Caution states |
| **Error** | Red | `#C53030` | Error states, failures |

### Visual Indicators

| Indicator | Meaning |
|-----------|---------|
| 🟢 | Safe / Recommended |
| 🟡 | Warning / Caution |
| 🔴 | Danger / Anti-pattern |
| 💡 | Tip / Best Practice |

---

**Last Updated**: 2026-01-14

**Author**: GoodGo Architecture Team