Ho Ngoc Hai 495618ded7 docs: Add Vietnamese-language documentation on security architecture, event-driven design, data consistency, observability, and caching, and update the existing guides and architecture documents.

2026-01-07 10:22:42 +07:00

Kiến trúc Khả năng Quan sát / Observability Architecture

VI: Khả năng quan sát toàn diện với metrics, logging và tracing

EN: Comprehensive observability with metrics, logging, and tracing

Sơ đồ Tổng quan / Overview Diagram

graph TD
    subgraph "Services"
        Service1[Service A]
        Service2[Service B]
    end
    
    subgraph "Metrics"
        Service1 -->|/metrics| Prom[Prometheus]
        Service2 -->|/metrics| Prom
        Prom --> Grafana[Grafana<br/>Dashboards]
    end
    
    subgraph "Logging"
        Service1 -->|JSON Logs| Loki
        Service2 -->|JSON Logs| Loki
        Loki --> GrafanaLogs[Grafana<br/>Log Explorer]
    end
    
    subgraph "Tracing"
        Service1 -->|Spans| Jaeger
        Service2 -->|Spans| Jaeger
        Jaeger --> JaegerUI[Jaeger UI]
    end
    
    style Prom fill:#d4edda
    style Loki fill:#fff4e1
    style Jaeger fill:#e1f5ff


## Bối cảnh Hệ thống / System Context

```mermaid
C4Context
    title Sơ đồ Bối cảnh Khả năng Quan sát / Observability System Context

    Person(dev, "Developer", "Uses dashboards to monitor system")
    Person(sre, "SRE", "Manages infrastructure & alerts")

    System(obs, "Observability Stack", "Prometheus, Loki, Jaeger, Grafana")

    System_Ext(service, "Microservices", "Sends telemetry data")
    System_Ext(k8s, "Kubernetes", "Sends cluster metrics")

    Rel(dev, obs, "Views Dashboards", "HTTPS")
    Rel(sre, obs, "Configures Alerts", "HTTPS")
    Rel(service, obs, "Push/Pull Telemetry", "HTTP/gRPC")
    Rel(k8s, obs, "Exposes Metrics", "HTTP")
```

VI Mô tả Bối cảnh

- Observability Stack: Trung tâm thu thập và hiển thị dữ liệu (Prometheus, Loki, Jaeger, Grafana).
- Microservices: Gửi logs, metrics và traces (OpenTelemetry).
- Developer/SRE: Sử dụng Grafana để theo dõi sức khỏe hệ thống và debug.

EN Context Description

- Observability Stack: Central collection and visualization (Prometheus, Loki, Jaeger, Grafana).
- Microservices: Send logs, metrics, and traces (OpenTelemetry).
- Developer/SRE: Use Grafana to monitor system health and debug.

Ba Trụ cột Khả năng Quan sát / Three Pillars of Observability

1. Metrics (Prometheus + Grafana)

graph LR
    Service[Service] -->|Expose /metrics| Prom[Prometheus]
    Prom -->|Scrape every 15s| Metrics[Time Series DB]
    Metrics --> Grafana[Grafana]
    Grafana --> Dashboard1[Request Dashboard]
    Grafana --> Dashboard2[Error Dashboard]
    Grafana --> Dashboard3[Performance Dashboard]
    
    style Prom fill:#d4edda
    style Grafana fill:#e1f5ff

VI: Các phép đo số theo thời gian (requests/sec, latency, errors).

EN: Numerical measurements over time (requests/sec, latency, errors).

Triển khai / Implementation:

import { Counter, Histogram, Gauge } from 'prom-client';

// VI: HTTP request metrics
// EN: HTTP request metrics
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});

export const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status']
});

export const activeRequests = new Gauge({
  name: 'http_requests_active',
  help: 'Number of active HTTP requests'
});

// VI: Middleware để track metrics
// EN: Middleware to track metrics
export function metricsMiddleware(req, res, next) {
  const start = Date.now();
  activeRequests.inc();

  // 'close' fires after a normal finish and also when the client aborts,
  // so the active-requests gauge is always decremented.
  res.once('close', () => {
    const duration = (Date.now() - start) / 1000;
    // Prefer the matched route pattern; raw paths such as /users/123 would
    // create one time series per URL.
    const labels = {
      method: req.method,
      route: req.route?.path ?? 'unmatched',
      status: res.statusCode
    };

    httpRequestDuration.observe(labels, duration);
    httpRequestTotal.inc(labels);
    activeRequests.dec();
  });

  next();
}
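
To make the bucket list above concrete, here is a minimal sketch of how a Prometheus-style histogram counts observations. In the Prometheus exposition format buckets are cumulative: an observation is counted in every bucket whose upper bound (`le`) is greater than or equal to the value, plus the implicit `+Inf` bucket. The `cumulativeBucketCounts` helper is hypothetical and only illustrates the counting rule; `prom-client` handles this internally.

```typescript
// Bucket upper bounds (in seconds), copied from the histogram definition above.
const BUCKETS: number[] = [0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5];

// Hypothetical helper: cumulative count per `le` bound, Prometheus-style.
function cumulativeBucketCounts(
  observations: number[],
  buckets: number[] = BUCKETS
): Map<string, number> {
  const result = new Map<string, number>();
  for (const le of buckets) result.set(String(le), 0);
  result.set('+Inf', 0);

  for (const value of observations) {
    for (const le of buckets) {
      // Cumulative: every bucket at or above the value is incremented.
      if (value <= le) result.set(String(le), (result.get(String(le)) ?? 0) + 1);
    }
    result.set('+Inf', (result.get('+Inf') ?? 0) + 1);
  }
  return result;
}

// Three requests: 30 ms, 300 ms and 7 s.
const bucketCounts = cumulativeBucketCounts([0.03, 0.3, 7]);
console.log(Object.fromEntries(bucketCounts));
```

The 7 s request is counted only in `+Inf`. That is why the largest finite bound (5 s here) caps what `histogram_quantile` can report, and why the bounds should bracket the latencies that matter for alerting.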

2. Logging (Winston + Loki)

sequenceDiagram
    participant Service
    participant Winston as Winston Logger
    participant Loki
    participant Grafana
    
    Service->>Winston: Log event
    Winston->>Winston: Format JSON
    Winston->>Winston: Add metadata<br/>(correlation ID, trace ID)
    Winston->>Loki: Push logs
    Loki->>Loki: Index & store
    
    User->>Grafana: Query logs
    Grafana->>Loki: LogQL query
    Loki-->>Grafana: Log results

VI: Structured logging với correlation IDs để tracing requests.

EN: Structured logging with correlation IDs for request tracing.

Triển khai / Implementation:

import { randomUUID } from 'node:crypto';
import winston from 'winston';

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: process.env.SERVICE_NAME || 'unknown-service',
    environment: process.env.NODE_ENV || 'development'
  },
  transports: [
    new winston.transports.Console(),
    // VI: Loki transport (nếu configured)
    // EN: Loki transport (if configured)
  ]
});

// VI: Logger middleware
// EN: Logger middleware
export function loggerMiddleware(req, res, next) {
  // Captured here so the completion log can report the elapsed time.
  const start = Date.now();
  // Reuse the caller's correlation ID when present, otherwise create one.
  const correlationId = req.headers['x-correlation-id'] || randomUUID();

  req.correlationId = correlationId;
  req.logger = logger.child({ correlationId });

  req.logger.info('Incoming request', {
    method: req.method,
    path: req.path,
    ip: req.ip
  });

  res.on('finish', () => {
    req.logger.info('Request completed', {
      method: req.method,
      path: req.path,
      status: res.statusCode,
      duration: Date.now() - start // milliseconds
    });
  });

  next();
}
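
Logs from several services can only be joined on a correlation ID if each service forwards the ID it received. The sketch below shows that rule in isolation: reuse the incoming `x-correlation-id` header, create an ID only when it is missing, and attach the same ID to outgoing calls. The helper names `resolveCorrelationId` and `withCorrelationId` and the injected `generate` function are assumptions made for illustration, not part of the original code.

```typescript
const CORRELATION_HEADER = 'x-correlation-id';

type IncomingHeaders = Record<string, string | string[] | undefined>;

// Hypothetical helper: reuse the caller's correlation ID, otherwise create one.
// `generate` is injected so the sketch has no dependencies; in a Node.js
// service it could be `randomUUID` from `node:crypto`.
function resolveCorrelationId(headers: IncomingHeaders, generate: () => string): string {
  const raw = headers[CORRELATION_HEADER];
  const value = Array.isArray(raw) ? raw[0] : raw;
  return value !== undefined && value.trim() !== '' ? value : generate();
}

// Hypothetical helper: headers to attach to every outgoing call made while
// handling the request, so downstream logs carry the same ID.
function withCorrelationId(
  correlationId: string,
  headers: Record<string, string> = {}
): Record<string, string> {
  return { ...headers, [CORRELATION_HEADER]: correlationId };
}

// Stand-in ID generator, for the example only.
let generatedCount = 0;
const nextId = (): string => `generated-${++generatedCount}`;

const reusedId = resolveCorrelationId({ 'x-correlation-id': 'abc-123' }, nextId);
const createdId = resolveCorrelationId({}, nextId);
console.log(reusedId, createdId, withCorrelationId(reusedId, { accept: 'application/json' }));
```

Passing the result of `withCorrelationId` as the headers of each downstream HTTP call keeps one ID across the whole request chain, which is what makes a single LogQL filter on `correlationId` useful.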

3. Tracing (OpenTelemetry + Jaeger)

graph LR
    Request[Incoming Request] --> Trace[Create Trace]
    Trace --> SpanA[Span: HTTP Request]
    SpanA --> SpanB[Span: DB Query]
    SpanA --> SpanC[Span: Cache Check]
    SpanA --> SpanD[Span: External API]
    
    SpanB --> Jaeger[Jaeger]
    SpanC --> Jaeger
    SpanD --> Jaeger
    
    Jaeger --> Timeline[Trace Timeline]
    
    style Trace fill:#e1f5ff
    style Jaeger fill:#d4edda

VI: Distributed tracing để track requests giữa các services.

EN: Distributed tracing to track requests across services.

Triển khai / Implementation:

import { trace, SpanStatusCode } from '@opentelemetry/api';

// VI: Tạo traced function
// EN: Create traced function
export function traced<T>(
  name: string,
  fn: () => Promise<T>
): Promise<T> {
  const tracer = trace.getTracer('app');

  // startActiveSpan makes the span current while fn runs, so spans created
  // inside fn (DB query, cache check, external API) become its children.
  return tracer.startActiveSpan(name, async (span) => {
    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : String(error)
      });
      span.recordException(error instanceof Error ? error : String(error));
      throw error;
    } finally {
      span.end();
    }
  });
}

// VI: Sử dụng
// EN: Usage
// User and userRepository come from the application code (not shown here).
export async function getUserWithTracing(userId: string): Promise<User> {
  return traced('getUserById', () => userRepository.findById(userId));
}
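
For spans created in different services to end up in the same trace, the trace context has to travel with each request. With OpenTelemetry this is typically done through the W3C Trace Context `traceparent` header, and the instrumentation libraries handle it automatically. The sketch below parses and formats that header by hand only to show what it carries; `parseTraceparent` and `formatTraceparent` are hypothetical names, and only version `00` of the header is handled.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters, shared by the whole trace
  spanId: string;   // 16 lowercase hex characters, the caller's span
  sampled: boolean; // bit 0 of the trace-flags byte
}

// version "00" - trace-id - parent-id - trace-flags
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (match === null) return null;
  const traceId = match[1] ?? '';
  const spanId = match[2] ?? '';
  const flags = match[3] ?? '00';
  // All-zero IDs are invalid in the W3C Trace Context format.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

// Example header value taken from the W3C Trace Context specification.
const exampleHeader = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const incomingContext = parseTraceparent(exampleHeader);
console.log(incomingContext);
```

A downstream service keeps the `traceId`, creates its own span ID, and records the received `spanId` as the parent. That parent link is what lets Jaeger draw the spans as one timeline.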

Kiểm tra Sức khỏe / Health Checks

// VI: Liveness probe - service có đang chạy không?
// EN: Liveness probe - is service running?
app.get('/health/live', (req, res) => {
  res.json({ status: 'ok', timestamp: new Date().toISOString() });
});

// VI: Readiness probe - service có sẵn sàng nhận traffic không?
// EN: Readiness probe - is service ready for traffic?
app.get('/health/ready', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    disk: await checkDiskSpace()
  };
  
  const ready = Object.values(checks).every(check => check === true);
  
  res.status(ready ? 200 : 503).json({
    ready,
    checks,
    timestamp: new Date().toISOString()
  });
});

async function checkDatabase(): Promise<boolean> {
  try {
    await prisma.$queryRaw`SELECT 1`;
    return true;
  } catch {
    return false;
  }
}
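
A readiness endpoint is itself called with a timeout by the orchestrator, so a dependency check that hangs can make the whole probe fail without saying which dependency was at fault. One way to guard against that, sketched here under the assumption that each check returns `Promise<boolean>` like `checkDatabase` above, is to bound every check with a timeout and run the checks in parallel. `withTimeout`, the stand-in checks, and the timeout values are illustrative and not part of the original code.

```typescript
// Hypothetical helper: run one dependency check, but report "not ready"
// (false) if it throws, rejects, or takes longer than `timeoutMs`.
function withTimeout(check: () => Promise<boolean>, timeoutMs: number): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => resolve(false), timeoutMs);
    Promise.resolve()
      .then(check)
      .then(
        (ok) => { clearTimeout(timer); resolve(ok); },
        () => { clearTimeout(timer); resolve(false); }
      );
  });
}

// Stand-in checks for the example; real ones would query the dependency.
const healthyCheck = async (): Promise<boolean> => true;
const hangingCheck = (): Promise<boolean> =>
  new Promise<boolean>(() => { /* never settles */ });

// Run all checks in parallel; each one is bounded by its own timeout.
async function readiness(): Promise<{ ready: boolean; checks: Record<string, boolean> }> {
  const [database, redis] = await Promise.all([
    withTimeout(healthyCheck, 500),
    withTimeout(hangingCheck, 50),
  ]);
  const checks = { database, redis };
  return { ready: Object.values(checks).every(Boolean), checks };
}

readiness().then((result) => console.log(result));
```

In this sketch the hanging check resolves to `false` once its 50 ms budget is spent, so the response still names the failing dependency instead of timing out as a whole.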

Quy tắc Cảnh báo / Alerting Rules

# VI: Prometheus alerting rules
# EN: Prometheus alerting rules
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # VI: Tỷ lệ lỗi cao
      # EN: High error rate
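      - alert: HighErrorRate
        # Ratio of 5xx requests to all requests, per scrape job
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (> 5%)"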
      
      # VI: Độ trễ cao
      # EN: High latency
      - alert: HighLatency
        # histogram_quantile needs per-second bucket rates aggregated by le
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"
      
      # VI: Service down
      # EN: Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
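
The HighErrorRate description speaks of an error rate above 5%, which is a ratio: 5xx requests per second divided by all requests per second. The sketch below reproduces that arithmetic on two counter snapshots. It is a simplified stand-in for PromQL `rate()`: it ignores counter resets and the extrapolation Prometheus applies at window edges, and the type and function names are hypothetical.

```typescript
interface CounterSnapshot {
  timestampSeconds: number;
  total: number;        // http_requests_total summed over all statuses
  serverErrors: number; // http_requests_total with a 5xx status
}

// Per-second increase between two counter values (simplified rate()).
function perSecondRate(startValue: number, endValue: number, seconds: number): number {
  if (seconds <= 0) throw new RangeError('window must be positive');
  return (endValue - startValue) / seconds;
}

// Share of requests in the window that ended with a 5xx status.
function errorRatio(from: CounterSnapshot, to: CounterSnapshot): number {
  const windowSeconds = to.timestampSeconds - from.timestampSeconds;
  const totalRate = perSecondRate(from.total, to.total, windowSeconds);
  const errorRate = perSecondRate(from.serverErrors, to.serverErrors, windowSeconds);
  return totalRate === 0 ? 0 : errorRate / totalRate;
}

// A 5-minute window with 6000 requests, 420 of them 5xx.
const windowStart: CounterSnapshot = { timestampSeconds: 0, total: 10000, serverErrors: 100 };
const windowEnd: CounterSnapshot = { timestampSeconds: 300, total: 16000, serverErrors: 520 };

const ratio = errorRatio(windowStart, windowEnd);
console.log(`error ratio: ${(ratio * 100).toFixed(1)}%`, ratio > 0.05 ? 'ALERT' : 'ok');
```

Here the service handles 20 requests per second and 1.4 of them fail, a ratio of about 7%. Comparing the raw 5xx rate with 0.05 would instead fire at one error every 20 seconds regardless of traffic, which is why the threshold only matches the "5%" wording when it is applied to the ratio.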

Đặc điểm Hiệu suất / Performance Characteristics

VI: Mục tiêu Hiệu suất

| Chỉ số / Metric | Mục tiêu / Target | Ghi chú / Notes |
|---|---|---|
| Metric Scrape Interval | 15s | Critical services |
| Log Ingestion Latency | < 1s | Time from emit to queryable |
| Trace Sampling Rate | 10% | Production (100% in Dev/Staging) |
| Dashboard Load Time | < 2s | P95 Latency |
| Alert Evaluation | Every 1m | Evaluation interval |
| Retention Policy | 14 days | Logs & Traces (Metrics: 30 days) |

EN: Performance Targets

| Metric | Target | Notes |
|---|---|---|
| Metric Scrape Interval | 15s | Critical services |
| Log Ingestion Latency | < 1s | Time from emit to queryable |
| Trace Sampling Rate | 10% | Production (100% in Dev/Staging) |
| Dashboard Load Time | < 2s | P95 Latency |
| Alert Evaluation | Every 1m | Evaluation interval |
| Retention Policy | 14 days | Logs & Traces (Metrics: 30 days) |
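
A 10% trace sampling rate is usually decided from the trace ID rather than a random draw per service, so that every service reaches the same decision for the same trace and traces are kept or dropped as a whole. The sketch below illustrates that idea. It follows the spirit of ratio-based samplers such as OpenTelemetry's `TraceIdRatioBased`, but it is not that sampler's exact algorithm, and `shouldSample` is a hypothetical name.

```typescript
// Decide deterministically from the trace ID, so all services agree.
function shouldSample(traceId: string, ratio: number): boolean {
  if (!/^[0-9a-f]{32}$/.test(traceId)) {
    throw new TypeError('traceId must be 32 lowercase hex characters');
  }
  if (ratio <= 0) return false;
  if (ratio >= 1) return true;
  // Treat the last 8 hex characters (32 bits) as a number in [0, 2^32)
  // and keep the trace if it falls below ratio * 2^32.
  const bucket = parseInt(traceId.slice(-8), 16);
  return bucket < Math.floor(ratio * 2 ** 32);
}

// Same trace ID, same answer, no matter which service asks.
const sampleTraceId = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(shouldSample(sampleTraceId, 0.1), shouldSample(sampleTraceId, 1));
```

Setting the ratio to `1` corresponds to the 100% rate listed for Dev/Staging, and `0.1` to the production target.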

Cân nhắc Bảo mật / Security Considerations

VI: Bảo mật Observability

- Log Scrubbing: Tự động loại bỏ PII (emails, ssn, credit cards) và secrets khỏi logs trước khi ingestion.
- Access Control: Grafana integrated với OAuth2/OIDC, phân quyền Viewer/Editor/Admin.
- Network Policy: Chỉ cho phép traffic từ namespace nội bộ tới các cổng ingestion (9090, 3100, 14268).
- TLS: Mã hóa traffic giữa agents và collectors.

EN: Observability Security

- Log Scrubbing: Automatically scrub PII (emails, ssn, credit cards) and secrets from logs before ingestion.
- Access Control: Grafana integrated with OAuth2/OIDC, roles for Viewer/Editor/Admin.
- Network Policy: Allow traffic only from internal namespaces to ingestion ports (9090, 3100, 14268).
- TLS: Encrypt traffic between agents and collectors.
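
Log scrubbing can be sketched as a list of pattern replacements applied to each message before it leaves the process. The patterns below are illustrative assumptions, not production-grade PII detection: the card pattern matches any run of 13 to 16 digits with optional space or hyphen separators and performs no checksum validation, and the SSN pattern covers only the US `NNN-NN-NNNN` layout. `SCRUB_RULES` and `scrub` are hypothetical names.

```typescript
// Each rule replaces one class of sensitive value with a labelled marker.
const SCRUB_RULES: ReadonlyArray<{ pattern: RegExp; replacement: string }> = [
  { pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, replacement: '[REDACTED:EMAIL]' },
  { pattern: /\b\d{3}-\d{2}-\d{4}\b/g, replacement: '[REDACTED:SSN]' },
  { pattern: /\b(?:\d[ -]?){12,15}\d\b/g, replacement: '[REDACTED:CARD]' },
  // Keep the key name and separator, redact only the value.
  {
    pattern: /\b(password|token|api[_-]?key|secret)(\s*[=:]\s*)\S+/gi,
    replacement: '$1$2[REDACTED:SECRET]',
  },
];

// Apply every rule in order and return the cleaned message.
function scrub(message: string): string {
  return SCRUB_RULES.reduce(
    (text, rule) => text.replace(rule.pattern, rule.replacement),
    message
  );
}

console.log(
  scrub('user alice@example.com paid with 4111 1111 1111 1111, ssn 123-45-6789, password=hunter2')
);
```

In a Winston setup such a function could run inside a custom format placed before `winston.format.json()`, so the scrubbed text is what reaches both the console and Loki. Regex-based scrubbing is best-effort, so it complements rather than replaces the habit of not logging sensitive fields in the first place.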

Triển khai / Deployment

graph TD
    subgraph "Kubernetes Monitoring Namespace"
        Grafana[Grafana]
        Prom[Prometheus Server]
        Loki[Loki Gateway]
        Jaeger[Jaeger Collector]
    end
    
    subgraph "App Namespace"
        App[Application Pods]
        Agent[Grafana Agent / Promtail]
    end
    
    App -->|Push Logs| Agent
    Agent -->|Push| Loki
    
    Prom -->|Pull Metrics| App
    Prom -->|Pull Metrics| Agent
    
    App -->|Push Traces| Jaeger
    
    Grafana --> Prom
    Grafana --> Loki
    Grafana --> Jaeger
    
    style Grafana fill:#ffe1e1
    style Prom fill:#d4edda
    style Loki fill:#fff4e1
    style Jaeger fill:#e1f5ff

VI Mô tả Triển khai:

- Agent: Promtail hoặc Grafana Agent chạy như DaemonSet hoặc Sidecar để thu thập logs.
- Pull Model: Prometheus scrape metrics từ endpoints /metrics.
- Push Model: Traces và Logs được push tới collectors.
- Resources: Dedicated nodes cho monitoring stack trong production để tránh ảnh hưởng workload chính.

EN Deployment Description:

- Agent: Promtail or Grafana Agent runs as DaemonSet or Sidecar to collect logs.
- Pull Model: Prometheus scrapes metrics from /metrics endpoints.
- Push Model: Traces and Logs are pushed to collectors.
- Resources: Dedicated nodes for monitoring stack in production to prevent impact on main workload.

Tài liệu Liên quan / Related Documentation

- System Design - Kiến trúc tổng thể / Overall architecture
- Caching Architecture - Cache metrics

Cập nhật Lần cuối / Last Updated: 2026-01-07
Tác giả / Authors: GoodGo Architecture Team
