Observability Architecture

Note: Comprehensive observability with metrics, logging, and tracing.

Overview Diagram

graph TD
    subgraph "Services"
        Service1[Service A]
        Service2[Service B]
    end
    
    subgraph "Metrics"
        Service1 -->|/metrics| Prom[Prometheus]
        Service2 -->|/metrics| Prom
        Prom --> Grafana[Grafana<br/>Dashboards]
    end
    
    subgraph "Logging"
        Service1 -->|JSON Logs| Loki
        Service2 -->|JSON Logs| Loki
        Loki --> GrafanaLogs[Grafana<br/>Log Explorer]
    end
    
    subgraph "Tracing"
        Service1 -->|Spans| Jaeger
        Service2 -->|Spans| Jaeger
        Jaeger --> JaegerUI[Jaeger UI]
    end
    
    classDef service fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef metrics fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef logging fill:#C05621,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef tracing fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef dashboard fill:#4A5568,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;

    class Service1,Service2 service;
    class Prom metrics;
    class Loki logging;
    class Jaeger,JaegerUI tracing;
    class Grafana,GrafanaLogs dashboard;

System Context

C4Context
    title Observability System Context

    Person(dev, "Developer", "Uses dashboards to monitor system")
    Person(sre, "SRE", "Manages infrastructure & alerts")
    
    System(obs, "Observability Stack", "Prometheus, Loki, Jaeger, Grafana")
    
    System_Ext(service, "Microservices", "Sends telemetry data")
    System_Ext(k8s, "Kubernetes", "Sends cluster metrics")
    
    Rel(dev, obs, "Views Dashboards", "HTTPS")
    Rel(sre, obs, "Configures Alerts", "HTTPS")
    Rel(service, obs, "Push/Pull Telemetry", "HTTP/gRPC")
    Rel(k8s, obs, "Exposes Metrics", "HTTP")

    UpdateElementStyle(dev, $fontColor="white", $bgColor="#2D3748", $borderColor="white")
    UpdateElementStyle(sre, $fontColor="white", $bgColor="#2D3748", $borderColor="white")
    UpdateElementStyle(obs, $fontColor="white", $bgColor="#2C5282", $borderColor="white")
    UpdateElementStyle(service, $fontColor="white", $bgColor="#4A5568", $borderColor="white")
    UpdateElementStyle(k8s, $fontColor="white", $bgColor="#4A5568", $borderColor="white")

Context Description

  • Observability Stack: Central hub for collecting and displaying telemetry (Prometheus, Loki, Jaeger, Grafana).
  • Microservices: Send logs, metrics, and traces (OpenTelemetry).
  • Kubernetes: Exposes cluster metrics over HTTP.
  • Developer/SRE: Use Grafana to monitor system health and debug issues; SREs also configure alerts.

Three Pillars of Observability

1. Metrics (Prometheus + Grafana)

graph LR
    Service[Service] -->|Expose /metrics| Prom[Prometheus]
    Prom -->|Scrape every 15s| Metrics[Time Series DB]
    Metrics --> Grafana[Grafana]
    Grafana --> Dashboard1[Request Dashboard]
    Grafana --> Dashboard2[Error Dashboard]
    Grafana --> Dashboard3[Performance Dashboard]
    
    classDef default fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef prom fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef grafana fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;

    class Prom prom;
    class Grafana grafana;

Description: Numerical measurements over time (requests/sec, latency, errors).

Implementation:

import { Counter, Histogram, Gauge } from 'prom-client';

// HTTP request metrics
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});

export const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status']
});

export const activeRequests = new Gauge({
  name: 'http_requests_active',
  help: 'Number of active HTTP requests'
});

// Middleware to track metrics for every HTTP request
export function metricsMiddleware(req, res, next) {
  const start = Date.now();
  activeRequests.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;

    // Prefer the matched route pattern (e.g. /orders/:id); falling back to the
    // raw path for unmatched requests can inflate label cardinality
    const labels = {
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode
    };

    httpRequestDuration.observe(labels, duration);
    httpRequestTotal.inc(labels);
    activeRequests.dec();
  });

  next();
}
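
The histogram above stores observations in cumulative buckets, and a P95 value is later estimated from those buckets rather than from raw samples. The sketch below illustrates that idea with linear interpolation inside the bucket that contains the target rank. It is a simplified model for intuition, not Prometheus's actual implementation; `cumulativeCounts` and `estimateQuantile` are hypothetical names, and the bucket bounds are copied from the histogram definition above.

```typescript
// Upper bounds copied from the histogram definition; the +Inf bucket is implicit.
const BUCKETS = [0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5];

// Bucket i counts every observation <= BUCKETS[i] (cumulative, like the `le` label).
function cumulativeCounts(observations: number[]): number[] {
  return BUCKETS.map(le => observations.filter(v => v <= le).length);
}

// Estimate quantile q (0..1) by interpolating linearly inside the bucket
// that holds the target rank. `total` includes observations above the last bound.
function estimateQuantile(q: number, counts: number[], total: number): number {
  if (total === 0) return NaN;
  const rank = q * total;
  for (let i = 0; i < BUCKETS.length; i++) {
    if (counts[i] >= rank) {
      const lower = i === 0 ? 0 : BUCKETS[i - 1];
      const below = i === 0 ? 0 : counts[i - 1];
      const inBucket = counts[i] - below;
      if (inBucket === 0) return BUCKETS[i];
      return lower + (BUCKETS[i] - lower) * ((rank - below) / inBucket);
    }
  }
  // The rank falls in the implicit +Inf bucket: report the highest finite bound.
  return BUCKETS[BUCKETS.length - 1];
}
```

Because the estimate can never be finer than the bucket width, bucket bounds are best placed around the latency thresholds that matter for alerting.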

2. Logging (Winston + Loki)

sequenceDiagram
    participant Service
    participant Winston as Winston Logger
    participant Loki
    participant Grafana
    
    Service->>Winston: Log event
    Winston->>Winston: Format JSON
    Winston->>Winston: Add metadata<br/>(correlation ID, trace ID)
    Winston->>Loki: Push logs
    Loki->>Loki: Index & store
    
    User->>Grafana: Query logs
    Grafana->>Loki: LogQL query
    Loki-->>Grafana: Log results

Description: Structured logging with correlation IDs for request tracing.

Implementation:

import { randomUUID } from 'node:crypto';
import winston from 'winston';

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: process.env.SERVICE_NAME || 'unknown-service',
    environment: process.env.NODE_ENV || 'development'
  },
  transports: [
    new winston.transports.Console(),
    // Loki transport (if configured)
  ]
});

// Logger middleware: attaches a child logger carrying the correlation ID
export function loggerMiddleware(req, res, next) {
  // Reuse the caller's correlation ID when present; otherwise generate one
  const correlationId = req.headers['x-correlation-id'] || randomUUID();
  // Capture the start time locally so the duration below is always defined
  const startTime = Date.now();
  
  req.correlationId = correlationId;
  req.logger = logger.child({ correlationId });
  
  req.logger.info('Incoming request', {
    method: req.method,
    path: req.path,
    ip: req.ip
  });
  
  res.on('finish', () => {
    req.logger.info('Request completed', {
      method: req.method,
      path: req.path,
      status: res.statusCode,
      duration: Date.now() - startTime // milliseconds
    });
  });
  
  next();
}
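
The middleware above makes the correlation ID available wherever `req` is in scope. Code deeper in the call stack, and outgoing calls to other services, also need it. One possible approach, sketched here, carries the ID in Node's `AsyncLocalStorage`. The helper names (`runWithCorrelationId`, `currentCorrelationId`, `outgoingHeaders`) are hypothetical and not part of the original codebase.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

interface RequestContext {
  correlationId: string;
}

// One store per process; each request gets its own isolated context.
const requestContext = new AsyncLocalStorage<RequestContext>();

// Run `fn` with a correlation ID bound to the current async call chain.
function runWithCorrelationId<T>(incoming: string | undefined, fn: () => T): T {
  const correlationId = incoming && incoming.length > 0 ? incoming : randomUUID();
  return requestContext.run({ correlationId }, fn);
}

// Read the ID from anywhere below the request handler; undefined outside a request.
function currentCorrelationId(): string | undefined {
  return requestContext.getStore()?.correlationId;
}

// Build headers for a downstream call so the next service logs the same ID.
function outgoingHeaders(base: Record<string, string> = {}): Record<string, string> {
  const id = currentCorrelationId();
  return id ? { ...base, 'x-correlation-id': id } : { ...base };
}
```

In a middleware, the call to `next()` would be wrapped in `runWithCorrelationId(...)`, so every log line and downstream request in that request's lifetime can share one ID.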

3. Tracing (OpenTelemetry + Jaeger)

graph LR
    Request[Incoming Request] --> Trace[Create Trace]
    Trace --> SpanA[Span: HTTP Request]
    SpanA --> SpanB[Span: DB Query]
    SpanA --> SpanC[Span: Cache Check]
    SpanA --> SpanD[Span: External API]
    
    SpanB --> Jaeger[Jaeger]
    SpanC --> Jaeger
    SpanD --> Jaeger
    
    Jaeger --> Timeline[Trace Timeline]
    
    classDef default fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef trace fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef jaeger fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;

    class Trace trace;
    class Jaeger jaeger;

Description: Distributed tracing to track requests across services.

Implementation:

import { trace, SpanStatusCode } from '@opentelemetry/api';

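// Create traced function
export function traced<T>(
  name: string,
  fn: () => Promise<T>
): Promise<T> {
  const tracer = trace.getTracer('app');

  // startActiveSpan makes the span current while the callback runs,
  // so spans created inside fn are recorded as its children
  return tracer.startActiveSpan(name, async span => {
    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : String(error)
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      // Always end the span, on success or failure
      span.end();
    }
  });
}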

// Usage (assumes `userRepository` and the `User` type are defined elsewhere)
async function getUserWithTracing(userId: string): Promise<User> {
  return traced('getUserById', async () => {
    return await userRepository.findById(userId);
  });
}
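
Tracking a request across services depends on passing the trace context along with every outgoing call. OpenTelemetry's default propagation uses the W3C Trace Context `traceparent` header, and in practice the SDK reads and writes it for you. The sketch below only illustrates the header's shape for version `00`; `parseTraceparent` and `formatTraceparent` are hypothetical helper names.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  spanId: string;   // 16 lowercase hex characters (the caller's span)
  sampled: boolean; // lowest bit of the trace-flags byte
}

// version "00" - trace-id - parent-id - trace-flags
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for anything that is not a valid version-00 header.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid according to the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Serialize the context for the next hop.
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}
```

The receiving service keeps the `traceId`, creates a new span whose parent is `spanId`, and forwards a header with its own span ID, which is how Jaeger can assemble one timeline from many services.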

Health Checks

// Liveness probe - is service running?
app.get('/health/live', (req, res) => {
  res.json({ status: 'ok', timestamp: new Date().toISOString() });
});

// Readiness probe - is service ready for traffic?
app.get('/health/ready', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    disk: await checkDiskSpace()
  };
  
  const ready = Object.values(checks).every(check => check === true);
  
  res.status(ready ? 200 : 503).json({
    ready,
    checks,
    timestamp: new Date().toISOString()
  });
});

async function checkDatabase(): Promise<boolean> {
  try {
    await prisma.$queryRaw`SELECT 1`;
    return true;
  } catch {
    return false;
  }
}
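
A readiness probe should answer quickly even when a dependency hangs, otherwise the probe itself times out and hides which check failed. The sketch below shows one way to bound each check with a deadline and to run the checks in parallel. `checkWithTimeout` and `runReadinessChecks` are hypothetical names, and the 2000 ms default is an assumed value rather than one taken from this system.

```typescript
// Resolves to false if the check throws or does not finish before the deadline.
async function checkWithTimeout(
  check: () => Promise<boolean>,
  timeoutMs: number
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<boolean>(resolve => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });
  try {
    return await Promise.race([check(), deadline]);
  } catch {
    return false;
  } finally {
    // Clear the timer so it does not keep the event loop busy after the race.
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Runs all checks concurrently and reports each result by name.
async function runReadinessChecks(
  checks: Record<string, () => Promise<boolean>>,
  timeoutMs = 2000 // assumed default
): Promise<{ ready: boolean; checks: Record<string, boolean> }> {
  const names = Object.keys(checks);
  const results = await Promise.all(
    names.map(name => checkWithTimeout(checks[name], timeoutMs))
  );
  const report: Record<string, boolean> = {};
  names.forEach((name, i) => {
    report[name] = results[i];
  });
  return { ready: results.every(Boolean), checks: report };
}
```

The returned object has the same `ready` and `checks` fields as the response body of the readiness handler above, so it could be passed to `res.json` directly.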

Alerting Rules

# Prometheus alerting rules
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} (> 5%)"
      
      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"
      
      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
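
The `for:` clause in each rule means an alert does not fire on the first failing evaluation: it stays pending until the expression has been true continuously for that duration, and it resets as soon as the expression becomes false. The sketch below models those transitions for illustration. It is a simplified state machine, not Prometheus code, and `evaluateAlert` is a hypothetical name.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

interface AlertTracker {
  state: AlertState;
  activeSince: number | null; // timestamp (ms) when the condition first became true
}

// One evaluation cycle: returns the tracker's next state.
function evaluateAlert(
  tracker: AlertTracker,
  conditionTrue: boolean,
  nowMs: number,
  forMs: number
): AlertTracker {
  // Any false evaluation resets the alert, even if it was already firing.
  if (!conditionTrue) return { state: 'inactive', activeSince: null };
  const activeSince = tracker.activeSince ?? nowMs;
  const state: AlertState = nowMs - activeSince >= forMs ? 'firing' : 'pending';
  return { state, activeSince };
}
```

With `for: 2m`, a condition that first becomes true at some evaluation moves the alert to pending and only promotes it to firing at the first evaluation at least two minutes later, which filters out short spikes.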

Performance Targets

Performance Goals

| Metric | Target | Notes |
| --- | --- | --- |
| Metric Scrape Interval | 15s | Critical services |
| Log Ingestion Latency | < 1s | Time from emit to queryable |
| Trace Sampling Rate | 10% | Production (100% in Dev/Staging) |
| Dashboard Load Time | < 2s | P95 latency |
| Alert Evaluation | Every 1m | Evaluation interval |
| Retention Policy | 14 days | Logs & Traces (Metrics: 30 days) |
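
A 10% trace sampling rate works best when every service reaches the same decision for the same trace, otherwise traces arrive with missing spans. A common way to get that is to derive the decision from the trace ID instead of a random number. The sketch below shows the idea; it is similar in spirit to ratio-based samplers in tracing SDKs but is not their implementation, and `shouldSample` and `samplingRatioFor` are hypothetical names.

```typescript
// Ratios from the targets table: 10% in production, 100% elsewhere.
function samplingRatioFor(environment: string): number {
  return environment === 'production' ? 0.1 : 1;
}

// Deterministic decision: the same trace ID always yields the same answer.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Read the first 8 hex characters as an unsigned 32-bit integer in [0, 2^32).
  const prefix = parseInt(traceId.slice(0, 8), 16);
  if (Number.isNaN(prefix)) return false;
  return prefix < ratio * 0x100000000;
}
```

Since trace IDs are expected to be randomly generated, roughly one in ten IDs falls below the threshold at a ratio of 0.1.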

Security Considerations

Observability Security

  • Log Scrubbing: Automatically remove PII (emails, SSNs, credit card numbers) and secrets from logs before ingestion.
  • Access Control: Grafana integrated with OAuth2/OIDC, with Viewer/Editor/Admin roles.
  • Network Policy: Only allow traffic from internal namespace to ingestion ports (9090, 3100, 14268).
  • TLS: Encrypt traffic between agents and collectors.
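
As an illustration of the log scrubbing item above, the sketch below redacts values under sensitive keys and masks emails, SSN-shaped strings, and likely card numbers in any string field of a plain JSON-like object. The key list and regular expressions are assumptions chosen for the example and are deliberately crude: they can over-redact (for instance long digit runs) and will miss formats they do not describe. `scrub` is a hypothetical name.

```typescript
const REDACTED = '[REDACTED]';

// Keys whose values are always dropped (illustrative list, compared in lowercase).
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'secret', 'apikey']);

const PATTERNS: RegExp[] = [
  /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, // email addresses
  /\b\d{3}-\d{2}-\d{4}\b/g,                           // US SSN format
  /\b(?:\d[ -]?){13,16}\b/g,                          // 13 to 16 digits, likely a card number
];

function scrubString(value: string): string {
  return PATTERNS.reduce((text, pattern) => text.replace(pattern, REDACTED), value);
}

// Recursively scrub strings, arrays, and plain objects; other values pass through.
function scrub(value: unknown): unknown {
  if (typeof value === 'string') return scrubString(value);
  if (Array.isArray(value)) return value.map(scrub);
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? REDACTED : scrub(inner);
    }
    return out;
  }
  return value;
}
```

A function like this could be applied to log metadata inside a custom logger format so that scrubbing happens in the service, before any transport sends the entry onward.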

Deployment

graph TD
    subgraph "Kubernetes Monitoring Namespace"
        Grafana[Grafana]
        Prom[Prometheus Server]
        Loki[Loki Gateway]
        Jaeger[Jaeger Collector]
    end
    
    subgraph "App Namespace"
        App[Application Pods]
        Agent[Grafana Agent / Promtail]
    end
    
    App -->|Push Logs| Agent
    Agent -->|Push| Loki
    
    Prom -->|Pull Metrics| App
    Prom -->|Pull Metrics| Agent
    
    App -->|Push Traces| Jaeger
    
    Grafana --> Prom
    Grafana --> Loki
    Grafana --> Jaeger
    
    classDef k8s fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef app fill:#4A5568,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef grafana fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef loki fill:#C05621,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef jaeger fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef prom fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;

    class Grafana grafana;
    class Loki loki;
    class Jaeger jaeger;
    class Prom prom;
    class App,Agent app;

Deployment Description:

  • Agent: Promtail or Grafana Agent runs as a DaemonSet or sidecar to collect logs.
  • Pull Model: Prometheus scrapes metrics from /metrics endpoints.
  • Push Model: Traces and logs are pushed to collectors.
  • Resources: In production, the monitoring stack runs on dedicated nodes to avoid impacting the main workload.
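
In the pull model, each scrape is an HTTP GET whose response body is plain text in the Prometheus exposition format. The metrics library in a service normally generates that text. The sketch below renders a counter by hand only to show what Prometheus reads from `/metrics`; `renderCounter` and `escapeLabelValue` are hypothetical names.

```typescript
interface Sample {
  labels: Record<string, string>;
  value: number;
}

// Label values must escape backslashes, double quotes, and newlines.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

// Render one counter family: HELP line, TYPE line, then one line per label set.
function renderCounter(name: string, help: string, samples: Sample[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const pairs = Object.entries(labels)
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(',');
    lines.push(pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}
```

Every distinct combination of label values becomes its own line and its own time series, which is why labels such as `route` should stay low in cardinality.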

Quick Tips

Mermaid Common Issues

| Issue | Solution |
| --- | --- |
| Parse Error | Check for special characters like () or [] inside node text without quotes. Use "text" for complex strings. |
| Color Not Showing | Ensure style or classDef definitions are correct and IDs match. |
| Arrow Direction | TD = Top-Down, LR = Left-Right. Choose appropriately for layout. |

Color Pattern Quick Reference

| Element | Color | Hex | Use Case |
| --- | --- | --- | --- |
| Primary | Dark Blue | #2D3748 | System components, core services |
| Secondary | Grey | #4A5568 | Supporting modules, libraries |
| Accent | Blue | #2C5282 | Databases, external APIs |
| Highlight | Teal | #285E61 | User interactions, highlights |
| Success | Green | #2F855A | Successful states, active |
| Warning | Orange | #C05621 | Warning/Caution states |
| Error | Red | #C53030 | Error states, failures |

Visual Indicators

| Indicator | Meaning |
| --- | --- |
| 🟢 | Safe / Recommended |
| 🟡 | Warning / Caution |
| 🔴 | Danger / Anti-pattern |
| 💡 | Tip / Best Practice |

Last Updated: 2026-01-10
Author: GoodGo Architecture Team