pos-system/docs/en/architecture/observability-architecture.md
Ho Ngoc Hai 3ed499ef7c docs: Update architecture documentation for GoodGo Platform
- Translated and revised architecture documents to enhance clarity and accessibility for both English and Vietnamese audiences.
- Improved diagrams and descriptions for caching, data consistency, event-driven architecture, microservices communication, observability, and security architecture.
- Ensured consistent formatting and terminology across all documents to facilitate better understanding and navigation.
- Added quick tips and troubleshooting sections to assist developers in implementing and managing the architecture effectively.
2026-01-14 13:07:19 +07:00

Observability Architecture

Note

Comprehensive observability with metrics, logging, and tracing.

Overview Diagram

graph TD
    subgraph "Services"
        Service1[Service A]
        Service2[Service B]
    end
    
    subgraph "Metrics"
        Service1 -->|/metrics| Prom[Prometheus]
        Service2 -->|/metrics| Prom
        Prom --> Grafana[Grafana<br/>Dashboards]
    end
    
    subgraph "Logging"
        Service1 -->|JSON Logs| Loki
        Service2 -->|JSON Logs| Loki
        Loki --> GrafanaLogs[Grafana<br/>Log Explorer]
    end
    
    subgraph "Tracing"
        Service1 -->|Spans| Jaeger
        Service2 -->|Spans| Jaeger
        Jaeger --> JaegerUI[Jaeger UI]
    end
    
    classDef service fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef metrics fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef logging fill:#C05621,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef tracing fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef dashboard fill:#4A5568,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;

    class Service1,Service2 service;
    class Prom metrics;
    class Loki logging;
    class Jaeger,JaegerUI tracing;
    class Grafana,GrafanaLogs dashboard;

System Context

C4Context
    title Observability Context Diagram

    Person(dev, "Developer", "Uses dashboards to monitor system")
    Person(sre, "SRE", "Manages infrastructure & alerts")
    
    System(obs, "Observability Stack", "Prometheus, Loki, Jaeger, Grafana")
    
    System_Ext(service, "Microservices", "Sends telemetry data")
    System_Ext(k8s, "Kubernetes", "Sends cluster metrics")
    
    Rel(dev, obs, "Views Dashboards", "HTTPS")
    Rel(sre, obs, "Configures Alerts", "HTTPS")
    Rel(service, obs, "Push/Pull Telemetry", "HTTP/gRPC")
    Rel(k8s, obs, "Exposes Metrics", "HTTP")

    UpdateElementStyle(dev, $fontColor="white", $bgColor="#2D3748", $borderColor="white")
    UpdateElementStyle(sre, $fontColor="white", $bgColor="#2D3748", $borderColor="white")
    UpdateElementStyle(obs, $fontColor="white", $bgColor="#2C5282", $borderColor="white")
    UpdateElementStyle(service, $fontColor="white", $bgColor="#4A5568", $borderColor="white")
    UpdateElementStyle(k8s, $fontColor="white", $bgColor="#4A5568", $borderColor="white")

Context Description

  • Observability Stack: The central point for collecting and visualizing telemetry (Prometheus, Loki, Jaeger, Grafana).
  • Microservices: Send logs, metrics, and traces (OpenTelemetry).
  • Developer/SRE: Use Grafana to monitor system health and to debug.

The Three Pillars of Observability

1. Metrics (Prometheus + Grafana)

graph LR
    Service[Service] -->|Expose /metrics| Prom[Prometheus]
    Prom -->|Scrape every 15s| Metrics[Time Series DB]
    Metrics --> Grafana[Grafana]
    Grafana --> Dashboard1[Request Dashboard]
    Grafana --> Dashboard2[Error Dashboard]
    Grafana --> Dashboard3[Performance Dashboard]
    
    classDef default fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef prom fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef grafana fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;

    class Prom prom;
    class Grafana grafana;

Description: Numeric measurements over time (requests/sec, latency, errors).

Implementation:

import { Counter, Histogram, Gauge } from 'prom-client';

// HTTP request metrics
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});

export const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status']
});

export const activeRequests = new Gauge({
  name: 'http_requests_active',
  help: 'Number of active HTTP requests'
});

// Middleware that records request metrics
export function metricsMiddleware(req, res, next) {
  const start = Date.now();
  activeRequests.inc();
  
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    
    httpRequestDuration.observe(
      { method: req.method, route: req.route?.path || req.path, status: res.statusCode },
      duration
    );
    
    httpRequestTotal.inc({
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode
    });
    
    activeRequests.dec();
  });
  
  next();
}
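
The bucket boundaries configured above limit how precisely a percentile such as P95 can be estimated from the histogram. As a rough illustration, the sketch below mimics the linear interpolation that PromQL's `histogram_quantile` applies to cumulative bucket counts. The `Bucket` type, the `estimateQuantile` name, and the sample counts are assumptions made for this example and are not part of the platform code; edge cases are simplified.

```typescript
// Cumulative histogram bucket: `count` observations were <= `le` seconds.
interface Bucket {
  le: number; // upper bound; use Infinity for the +Inf bucket
  count: number; // cumulative count
}

// Estimate quantile q (0..1) by linear interpolation inside the bucket
// that contains the target rank. Simplified sketch of the idea behind
// histogram_quantile, not a full reimplementation.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  if (sorted.length === 0) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  for (let i = 0; i < sorted.length; i++) {
    if (sorted[i].count >= rank) {
      const upper = sorted[i].le;
      const lower = i === 0 ? 0 : sorted[i - 1].le;
      const below = i === 0 ? 0 : sorted[i - 1].count;
      // Rank falls in the +Inf bucket: report the highest finite bound.
      if (!isFinite(upper)) return lower;
      const inBucket = sorted[i].count - below;
      if (inBucket === 0) return upper;
      return lower + (upper - lower) * ((rank - below) / inBucket);
    }
  }
  return sorted[sorted.length - 1].le;
}

// Hypothetical scrape: 100 requests, most of them faster than 0.5s.
const sampleBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: Infinity, count: 100 },
];
const p95 = estimateQuantile(0.95, sampleBuckets);
```

Because the estimate is interpolated between bucket bounds, a threshold that matters for alerting (for example 1s) is best placed exactly on a bucket boundary.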

2. Logging (Serilog + Loki)

sequenceDiagram
    participant Service
    participant Serilog as Serilog Logger
    participant Loki
    participant Grafana
    
    Service->>Serilog: Log event
    Serilog->>Serilog: Format JSON
    Serilog->>Serilog: Add metadata<br/>(correlation ID, trace ID)
    Serilog->>Loki: Push logs
    Loki->>Loki: Index & store
    
    User->>Grafana: Query logs
    Grafana->>Loki: LogQL query
    Loki-->>Grafana: Log results

Description: Structured logging with correlation IDs so that requests can be traced across log lines.

Implementation (.NET):

// Program.cs - Serilog configuration
builder.Host.UseSerilog((context, config) => config
    .ReadFrom.Configuration(context.Configuration)
    .Enrich.FromLogContext()
    .Enrich.WithProperty("Service", serviceName)
    .Enrich.WithProperty("Environment", environment)
    .WriteTo.Console(new JsonFormatter())
    .WriteTo.GrafanaLoki(
        "http://loki:3100",
        labels: new [] { new LokiLabel { Key = "app", Value = serviceName } }
    ));

// Middleware - Add correlation ID
public class CorrelationIdMiddleware
{
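    private readonly RequestDelegate _next;
    private readonly ILogger<CorrelationIdMiddleware> _logger;

    // Dependencies are supplied by ASP.NET Core when the middleware is registered.
    public CorrelationIdMiddleware(RequestDelegate next, ILogger<CorrelationIdMiddleware> logger)
    {
        _next = next;
        _logger = logger;
    }
    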
    public async Task InvokeAsync(HttpContext context)
    {
        var correlationId = context.Request.Headers["X-Correlation-Id"].FirstOrDefault()
            ?? Guid.NewGuid().ToString();
        
        context.Items["CorrelationId"] = correlationId;
        context.Response.Headers["X-Correlation-Id"] = correlationId;
        
        using (LogContext.PushProperty("CorrelationId", correlationId))
        {
            _logger.LogInformation("Request started: {Method} {Path}",
                context.Request.Method, context.Request.Path);
            
            var sw = Stopwatch.StartNew();
            await _next(context);
            sw.Stop();
            
            _logger.LogInformation("Request completed: {StatusCode} in {Duration}ms",
                context.Response.StatusCode, sw.ElapsedMilliseconds);
        }
    }
}
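
For services built on Node.js rather than .NET, the same correlation ID rule (reuse the incoming `X-Correlation-Id`, otherwise generate one) might look roughly like the sketch below. The names `HeaderMap`, `resolveCorrelationId`, `randomHexId`, and `logLine` are hypothetical, and the default generator is deliberately simple and not cryptographically secure.

```typescript
// HTTP header names are case-insensitive, so compare them in lower case.
type HeaderMap = { [name: string]: string | undefined };

const CORRELATION_HEADER = "x-correlation-id";

// Non-cryptographic ID generator used only as a default for this sketch.
function randomHexId(): string {
  let id = "";
  for (let i = 0; i < 32; i++) {
    id += Math.floor(Math.random() * 16).toString(16);
  }
  return id;
}

// Reuse the caller's correlation ID when present, otherwise mint a new one.
function resolveCorrelationId(
  headers: HeaderMap,
  generate: () => string = randomHexId
): string {
  const names = Object.keys(headers);
  for (let i = 0; i < names.length; i++) {
    const value = headers[names[i]];
    if (names[i].toLowerCase() === CORRELATION_HEADER && value && value.trim() !== "") {
      return value.trim();
    }
  }
  return generate();
}

// Attach the ID to a structured (JSON) log line for the current request.
function logLine(correlationId: string, message: string): string {
  return JSON.stringify({ correlationId: correlationId, message: message });
}
```

Passing the generator as a parameter keeps the function easy to test and lets a service substitute a stronger ID source.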

3. Tracing (OpenTelemetry + Jaeger)

graph LR
    Request[Incoming Request] --> Trace[Create Trace]
    Trace --> SpanA[Span: HTTP Request]
    SpanA --> SpanB[Span: DB Query]
    SpanA --> SpanC[Span: Cache Check]
    SpanA --> SpanD[Span: External API]
    
    SpanB --> Jaeger[Jaeger]
    SpanC --> Jaeger
    SpanD --> Jaeger
    
    Jaeger --> Timeline[Trace Timeline]
    
    classDef default fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef trace fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef jaeger fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;

    class Trace trace;
    class Jaeger jaeger;

Description: Distributed tracing to follow requests across services.

Note

Distributed tracing with Jaeger is planned but not yet deployed. Request tracking currently relies on correlation IDs.

Implementation (.NET with OpenTelemetry):

// Program.cs - OpenTelemetry configuration (planned)
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddJaegerExporter(options =>
        {
            options.AgentHost = "jaeger";
            options.AgentPort = 6831;
        }));

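// Manual span creation
// Assumption: the source name "GoodGo.Users" is illustrative; it must also be
// registered with tracing.AddSource(...) for these spans to be exported.
private static readonly ActivitySource ActivitySource = new("GoodGo.Users");

public async Task<User?> GetUserByIdAsync(Guid userId, CancellationToken ct)
{
    using var activity = ActivitySource.StartActivity("GetUserById");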
    activity?.SetTag("user.id", userId.ToString());
    
    try
    {
        var user = await _context.Users.FindAsync([userId], ct);
        activity?.SetStatus(ActivityStatusCode.Ok);
        return user;
    }
    catch (Exception ex)
    {
        activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
        throw;
    }
}
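
Spans from different services join the same trace only if the trace context travels with each request. Assuming the W3C Trace Context `traceparent` header is used for propagation (the default in OpenTelemetry SDKs), the sketch below parses and formats that header. The `TraceContext` type and both function names are hypothetical, and only version `00` is handled.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the trace-flags field
}

// Version "00" layout: 00-<trace-id>-<parent-id>-<trace-flags>
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (match === null) return null;
  const traceId = match[1];
  const parentId = match[2];
  // All-zero IDs are invalid according to the W3C Trace Context specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  const flags = parseInt(match[3], 16);
  return { traceId: traceId, parentId: parentId, sampled: (flags & 1) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  return "00-" + ctx.traceId + "-" + ctx.parentId + "-" + (ctx.sampled ? "01" : "00");
}
```

A service that receives an invalid or missing header would normally start a new trace rather than reject the request.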

Health Checks

// Health check (.NET)
app.MapHealthChecks("/health", new HealthCheckOptions
{
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false // Liveness: no checks run, so the endpoint reports healthy whenever the process can respond
});

app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready")
});

// Health check registration (must run before builder.Build())
builder.Services.AddHealthChecks()
    .AddNpgSql(connectionString, name: "database", tags: new[] { "ready" })
    .AddRedis(redisConnectionString, name: "redis", tags: new[] { "ready" });
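
The difference between the liveness and readiness endpoints above comes down to which registered checks the predicate selects. A minimal model of that selection logic is sketched below; the `HealthCheck` shape and the function names are assumptions for illustration and do not mirror the ASP.NET Core API.

```typescript
type HealthStatus = "Healthy" | "Unhealthy";

interface HealthCheck {
  name: string;
  tags: string[];
  run: () => boolean; // true when the dependency is reachable
}

// Run only the checks accepted by the predicate. With no checks selected
// the result is Healthy, which is how the liveness endpoint behaves.
function evaluate(
  checks: HealthCheck[],
  predicate: (check: HealthCheck) => boolean
): HealthStatus {
  const selected = checks.filter(predicate);
  for (let i = 0; i < selected.length; i++) {
    if (!selected[i].run()) return "Unhealthy";
  }
  return "Healthy";
}

// Liveness selects nothing; readiness selects checks tagged "ready".
const liveness = (checks: HealthCheck[]): HealthStatus =>
  evaluate(checks, () => false);
const readiness = (checks: HealthCheck[]): HealthStatus =>
  evaluate(checks, (c) => c.tags.indexOf("ready") !== -1);
```

With this split, a database outage takes a pod out of the load balancer (readiness fails) without causing the orchestrator to restart it (liveness still passes).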

Alerting Rules

# Prometheus alerting rules
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # High error rate (5xx share of all requests)
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} (> 5%)"
      
      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"
      
      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"

Performance Characteristics

Performance Targets

| Metric | Target | Notes |
|--------|--------|-------|
| Metric Scrape Interval | 15s | Critical services |
| Log Ingestion Latency | < 1s | Time from emit to queryable |
| Trace Sampling Rate | 10% | Production (100% in Dev/Staging) |
| Dashboard Load Time | < 2s | P95 latency |
| Alert Evaluation | Every 1m | Evaluation interval |
| Retention Policy | 14 days | Logs & Traces (Metrics: 30 days) |
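
A 10% trace sampling rate works best when the keep/drop decision is derived from the trace ID, so that every service handling the same trace decides the same way. The sketch below shows one such ratio-based decision. The function names and the environment string `"production"` are assumptions, and real OpenTelemetry samplers may derive the number from the trace ID differently.

```typescript
// Decide deterministically from the trace ID so that all services that
// see the same trace make the same keep/drop decision.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Treat the last 8 hex characters (32 bits) as a uniform number.
  const tail = traceId.slice(-8);
  const value = parseInt(tail, 16);
  if (isNaN(value)) return false;
  return value < ratio * 0x100000000;
}

// 10% in production, 100% elsewhere, as listed in the targets table.
function samplingRatioFor(environment: string): number {
  return environment === "production" ? 0.1 : 1.0;
}
```

For example, `shouldSample(traceId, samplingRatioFor("production"))` keeps roughly one trace in ten when trace IDs are uniformly random.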

Security Considerations

Observability Security

  • Log Scrubbing: Automatically strips PII (emails, SSNs, credit card numbers) and secrets from logs before ingestion.
  • Access Control: Grafana is integrated with OAuth2/OIDC, with Viewer/Editor/Admin roles.
  • Network Policy: Only traffic from internal namespaces may reach the ingestion ports (9090, 3100, 14268).
  • TLS: Encrypts traffic between agents and collectors.
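
Log scrubbing is commonly implemented as a redaction pass over each message before it leaves the service. The sketch below is one possible shape for that pass. The regular expressions, the placeholder strings, and the `scrub` name are illustrative assumptions; production PII detection needs broader and more carefully validated rules.

```typescript
// Illustrative patterns only; they will miss cases and may over-match.
const SECRET = /\b(password|token|secret|api[_-]?key)\s*[=:]\s*\S+/gi;
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const CARD = /\b\d(?:[ -]?\d){12,15}\b/g; // 13 to 16 digits, optional separators
const SSN = /\b\d{3}-\d{2}-\d{4}\b/g;

// Replace sensitive values with fixed placeholders before ingestion.
function scrub(message: string): string {
  return message
    .replace(SECRET, "$1=[REDACTED]")
    .replace(EMAIL, "[REDACTED_EMAIL]")
    .replace(CARD, "[REDACTED_CARD]")
    .replace(SSN, "[REDACTED_SSN]");
}
```

Running the pass inside the service, rather than only in the log pipeline, means unredacted values never reach the network.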

Deployment

graph TD
    subgraph "Kubernetes Monitoring Namespace"
        Grafana[Grafana]
        Prom[Prometheus Server]
        Loki[Loki Gateway]
        Jaeger[Jaeger Collector]
    end
    
    subgraph "App Namespace"
        App[Application Pods]
        Agent[Grafana Agent / Promtail]
    end
    
    App -->|Push Logs| Agent
    Agent -->|Push| Loki
    
    Prom -->|Pull Metrics| App
    Prom -->|Pull Metrics| Agent
    
    App -->|Push Traces| Jaeger
    
    Grafana --> Prom
    Grafana --> Loki
    Grafana --> Jaeger
    
    classDef k8s fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef app fill:#4A5568,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef grafana fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef loki fill:#C05621,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef jaeger fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef prom fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;

    class Grafana grafana;
    class Loki loki;
    class Jaeger jaeger;
    class Prom prom;
    class App,Agent app;

Deployment Description:

  • Agent: Promtail or Grafana Agent runs as a DaemonSet or sidecar to collect logs.
  • Pull Model: Prometheus scrapes metrics from the /metrics endpoints.
  • Push Model: Traces and logs are pushed to the collectors.
  • Resources: In production, the monitoring stack runs on dedicated nodes so that it does not affect the main workload.
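
In the pull model, each service only has to render its current metric values as text when Prometheus requests `/metrics`. Libraries such as prom-client do this automatically; the sketch below shows roughly what the text exposition format looks like for a counter. The `CounterSeries` type and the function names are hypothetical.

```typescript
interface CounterSeries {
  labels: { [name: string]: string };
  value: number;
}

// Escape backslash, double quote, and newline inside label values.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render one counter family: HELP line, TYPE line, then one line per series.
function renderCounter(name: string, help: string, series: CounterSeries[]): string {
  const lines = ["# HELP " + name + " " + help, "# TYPE " + name + " counter"];
  for (let i = 0; i < series.length; i++) {
    const labels = series[i].labels;
    const pairs = Object.keys(labels).map(
      (key) => key + '="' + escapeLabelValue(labels[key]) + '"'
    );
    const selector = pairs.length > 0 ? "{" + pairs.join(",") + "}" : "";
    lines.push(name + selector + " " + series[i].value);
  }
  return lines.join("\n") + "\n";
}
```

Because the endpoint only reports current values, a missed scrape loses resolution but not counts, which is one reason the pull model tolerates brief network interruptions.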

Related Documents

Quick Tips

Mermaid Common Issues

| Issue | Solution |
|-------|----------|
| Parse Error | Check for special characters like () or [] inside node text without quotes. Use "text" for complex strings. |
| Color Not Showing | Ensure style or classDef definitions are correct and IDs match. |
| Arrow Direction | TD = Top-Down, LR = Left-Right. Choose appropriately for layout. |

Color Pattern Quick Reference

| Element | Color | Hex | Use Case |
|---------|-------|-----|----------|
| Primary | Dark Blue | #2D3748 | System components, core services |
| Secondary | Grey | #4A5568 | Supporting modules, libraries |
| Accent | Blue | #2C5282 | Databases, external APIs |
| Highlight | Teal | #285E61 | User interactions, highlights |
| Success | Green | #2F855A | Successful states, active |
| Warning | Orange | #C05621 | Warning/Caution states |
| Error | Red | #C53030 | Error states, failures |

Visual Indicators

| Indicator | Meaning |
|-----------|---------|
| 🟢 | Safe / Recommended |
| 🟡 | Warning / Caution |
| 🔴 | Danger / Anti-pattern |
| 💡 | Tip / Best Practice |

Last Updated: 2026-01-14
Author: GoodGo Architecture Team