Ho Ngoc Hai b89e07f4cb feat(docs): Update Mermaid diagrams and enhance Vietnamese architecture documentation

- Changed the Mermaid theme from 'default' to 'base' for improved visual consistency.
- Expanded the architecture documentation by adding new sections on caching, data consistency, and observability patterns.
- Enhanced existing diagrams with clearer color coding and class definitions for better readability.
- Removed outdated service communication documentation to streamline content.
- Included quick tips and common issues sections to assist users in navigating the documentation effectively.

2026-01-10 17:42:19 +07:00

Observability Architecture

Note: Comprehensive observability with metrics, logging, and tracing

Overview Diagram

graph TD
    subgraph "Services"
        Service1[Service A]
        Service2[Service B]
    end
    
    subgraph "Metrics"
        Service1 -->|/metrics| Prom[Prometheus]
        Service2 -->|/metrics| Prom
        Prom --> Grafana[Grafana<br/>Dashboards]
    end
    
    subgraph "Logging"
        Service1 -->|JSON Logs| Loki
        Service2 -->|JSON Logs| Loki
        Loki --> GrafanaLogs[Grafana<br/>Log Explorer]
    end
    
    subgraph "Tracing"
        Service1 -->|Spans| Jaeger
        Service2 -->|Spans| Jaeger
        Jaeger --> JaegerUI[Jaeger UI]
    end
    
    classDef service fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef metrics fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef logging fill:#C05621,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef tracing fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef dashboard fill:#4A5568,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;

    class Service1,Service2 service;
    class Prom metrics;
    class Loki logging;
    class Jaeger,JaegerUI tracing;
    class Grafana,GrafanaLogs dashboard;

System Context

C4Context
    title Observability System Context

    Person(dev, "Developer", "Uses dashboards to monitor system")
    Person(sre, "SRE", "Manages infrastructure & alerts")
    
    System(obs, "Observability Stack", "Prometheus, Loki, Jaeger, Grafana")
    
    System_Ext(service, "Microservices", "Sends telemetry data")
    System_Ext(k8s, "Kubernetes", "Sends cluster metrics")
    
    Rel(dev, obs, "Views Dashboards", "HTTPS")
    Rel(sre, obs, "Configures Alerts", "HTTPS")
    Rel(service, obs, "Push/Pull Telemetry", "HTTP/gRPC")
    Rel(k8s, obs, "Exposes Metrics", "HTTP")

    UpdateElementStyle(dev, $fontColor="white", $bgColor="#2D3748", $borderColor="white")
    UpdateElementStyle(sre, $fontColor="white", $bgColor="#2D3748", $borderColor="white")
    UpdateElementStyle(obs, $fontColor="white", $bgColor="#2C5282", $borderColor="white")
    UpdateElementStyle(service, $fontColor="white", $bgColor="#4A5568", $borderColor="white")
    UpdateElementStyle(k8s, $fontColor="white", $bgColor="#4A5568", $borderColor="white")

Context Description

Observability Stack: Central hub for collecting and displaying data (Prometheus, Loki, Jaeger, Grafana).
Microservices: Send logs, metrics, and traces (OpenTelemetry).
Developer/SRE: Use Grafana to monitor system health and debug.

Three Pillars of Observability

1. Metrics (Prometheus + Grafana)

graph LR
    Service[Service] -->|Expose /metrics| Prom[Prometheus]
    Prom -->|Scrape every 15s| Metrics[Time Series DB]
    Metrics --> Grafana[Grafana]
    Grafana --> Dashboard1[Request Dashboard]
    Grafana --> Dashboard2[Error Dashboard]
    Grafana --> Dashboard3[Performance Dashboard]
    
    classDef default fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef prom fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef grafana fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;

    class Prom prom;
    class Grafana grafana;

Description: Numerical measurements over time (requests/sec, latency, errors).

Implementation:

import { Counter, Histogram, Gauge } from 'prom-client';

// HTTP request metrics
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});

export const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status']
});

export const activeRequests = new Gauge({
  name: 'http_requests_active',
  help: 'Number of active HTTP requests'
});

// Middleware to track metrics
export function metricsMiddleware(req, res, next) {
  const start = Date.now();
  activeRequests.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;

    // Build the label set once so the histogram and the counter always agree.
    // req.route is only set when a route matched; falling back to the raw path
    // keeps unmatched requests visible but can inflate label cardinality.
    const labels = {
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode
    };

    httpRequestDuration.observe(labels, duration);
    httpRequestTotal.inc(labels);
    activeRequests.dec();
  });

  next();
}
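
To make the `buckets` option above more concrete, here is a minimal sketch of how a Prometheus-style histogram counts observations cumulatively. It is an illustration only and not prom-client's internal implementation; `observeAll` and `HistogramSnapshot` are hypothetical names introduced for this example.

```typescript
// Bucket upper bounds, copied from the histogram definition above (seconds).
const BUCKETS = [0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5];

interface HistogramSnapshot {
  // counts[i] = observations <= BUCKETS[i]; the final entry is the +Inf bucket.
  counts: number[];
  sum: number;
  count: number;
}

// Hypothetical helper: fold a list of durations into cumulative bucket counts.
function observeAll(durations: number[], bounds: number[] = BUCKETS): HistogramSnapshot {
  const counts = new Array<number>(bounds.length + 1).fill(0);
  let sum = 0;
  for (const d of durations) {
    sum += d;
    // Cumulative: one observation increments every bucket it fits under.
    bounds.forEach((le, i) => {
      if (d <= le) counts[i] += 1;
    });
    counts[bounds.length] += 1; // +Inf counts every observation
  }
  return { counts, sum, count: durations.length };
}

const snapshot = observeAll([0.004, 0.03, 0.03, 0.2, 0.7, 3]);
console.log(snapshot.counts); // [0, 1, 3, 3, 4, 5, 5, 6, 6]
```

Because buckets are cumulative, a percentile can only be estimated to the resolution of the bounds, so the bounds should bracket the latencies you intend to alert on.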

2. Logging (Winston + Loki)

sequenceDiagram
    participant Service
    participant Winston as Winston Logger
    participant Loki
    participant Grafana
    
    Service->>Winston: Log event
    Winston->>Winston: Format JSON
    Winston->>Winston: Add metadata<br/>(correlation ID, trace ID)
    Winston->>Loki: Push logs
    Loki->>Loki: Index & store
    
    User->>Grafana: Query logs
    Grafana->>Loki: LogQL query
    Loki-->>Grafana: Log results

Description: Structured logging with correlation IDs for request tracing.

Implementation:

import winston from 'winston';

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: process.env.SERVICE_NAME || 'unknown-service',
    environment: process.env.NODE_ENV || 'development'
  },
  transports: [
    new winston.transports.Console(),
    // Loki transport (if configured)
  ]
});

import { randomUUID } from 'node:crypto';

// Logger middleware
export function loggerMiddleware(req, res, next) {
  // Reuse the caller's correlation ID when present, otherwise create one.
  const correlationId = req.headers['x-correlation-id'] || randomUUID();
  const start = Date.now();

  req.correlationId = correlationId;
  req.logger = logger.child({ correlationId });

  req.logger.info('Incoming request', {
    method: req.method,
    path: req.path,
    ip: req.ip
  });

  res.on('finish', () => {
    req.logger.info('Request completed', {
      method: req.method,
      path: req.path,
      status: res.statusCode,
      duration: Date.now() - start // milliseconds
    });
  });

  next();
}
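
The "Push logs" step in the sequence diagram can be illustrated with the JSON body that Loki's push endpoint (`POST /loki/api/v1/push`) accepts. The sketch below only builds that body; `toLokiPushBody` and `LogEntry` are hypothetical names, and in practice a Winston transport or an agent would normally do this for you.

```typescript
// One Loki stream: a label set plus [timestamp, line] pairs.
interface LokiStream {
  stream: Record<string, string>; // index labels (keep these low-cardinality)
  values: [string, string][];     // [unix time in nanoseconds as a string, log line]
}

interface LokiPushBody {
  streams: LokiStream[];
}

// Hypothetical shape of an already-structured log entry.
interface LogEntry {
  timestampMs: number;
  message: string;
  [key: string]: unknown;
}

function toLokiPushBody(labels: Record<string, string>, entries: LogEntry[]): LokiPushBody {
  const values = entries.map((entry): [string, string] => {
    const { timestampMs, ...rest } = entry;
    // Milliseconds to nanoseconds by appending six zeros, which avoids
    // floating-point precision loss on large integers.
    const ns = String(Math.trunc(timestampMs)) + "000000";
    return [ns, JSON.stringify(rest)];
  });
  return { streams: [{ stream: labels, values }] };
}

const body = toLokiPushBody(
  { service: "user-service", environment: "development" },
  [{ timestampMs: 1700000000000, message: "Incoming request", correlationId: "abc-123" }]
);
console.log(JSON.stringify(body, null, 2));
```

Note that the correlation ID travels inside the log line rather than as a label. Per-request values as labels would create one stream per request, which Loki handles poorly.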

3. Tracing (OpenTelemetry + Jaeger)

graph LR
    Request[Incoming Request] --> Trace[Create Trace]
    Trace --> SpanA[Span: HTTP Request]
    SpanA --> SpanB[Span: DB Query]
    SpanA --> SpanC[Span: Cache Check]
    SpanA --> SpanD[Span: External API]
    
    SpanB --> Jaeger[Jaeger]
    SpanC --> Jaeger
    SpanD --> Jaeger
    
    Jaeger --> Timeline[Trace Timeline]
    
    classDef default fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef trace fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef jaeger fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;

    class Trace trace;
    class Jaeger jaeger;

Description: Distributed tracing to track requests across services.

Implementation:

import { trace, SpanStatusCode } from '@opentelemetry/api';

// Run fn inside a new span. startActiveSpan makes the span the active one
// for the duration of fn, so spans created inside fn become its children.
export function traced<T>(
  name: string,
  fn: () => Promise<T>
): Promise<T> {
  const tracer = trace.getTracer('app');

  return tracer.startActiveSpan(name, async (span) => {
    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message });
      span.recordException(error instanceof Error ? error : message);
      throw error;
    } finally {
      // Always end the span, even when fn throws synchronously.
      span.end();
    }
  });
}

// Usage
async function getUserWithTracing(userId: string): Promise<User> {
  return traced('getUserById', () => userRepository.findById(userId));
}
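
Tracking a request across services depends on passing trace context between them. OpenTelemetry's default propagator uses the W3C `traceparent` header, and the sketch below shows what that header carries. It handles only the version `00` layout, and `parseTraceparent` and `formatTraceparent` are hypothetical helpers for illustration; the SDK's propagators do this automatically.

```typescript
interface TraceContext {
  traceId: string;      // 32 lowercase hex characters
  parentSpanId: string; // 16 lowercase hex characters
  sampled: boolean;     // bit 0 of the trace-flags byte
}

// version - trace-id - parent-id - trace-flags
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const version = match[1];
  const traceId = match[2];
  const parentSpanId = match[3];
  const flags = match[4];
  if (version === "ff") return null; // reserved, invalid version
  // All-zero IDs are invalid according to the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentSpanId}-${ctx.sampled ? "01" : "00"}`;
}

const incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const ctx = parseTraceparent(incoming);
console.log(ctx);
```

A downstream service that receives this header continues the same `traceId`, which is what lets Jaeger stitch spans from different services into one timeline.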

Health Checks

// Liveness probe - is service running?
app.get('/health/live', (req, res) => {
  res.json({ status: 'ok', timestamp: new Date().toISOString() });
});

// Readiness probe - is service ready for traffic?
app.get('/health/ready', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    disk: await checkDiskSpace()
  };
  
  const ready = Object.values(checks).every(check => check === true);
  
  res.status(ready ? 200 : 503).json({
    ready,
    checks,
    timestamp: new Date().toISOString()
  });
});

// Database check; checkRedis() and checkDiskSpace() are assumed to follow the same pattern.
async function checkDatabase(): Promise<boolean> {
  try {
    await prisma.$queryRaw`SELECT 1`;
    return true;
  } catch {
    return false;
  }
}
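
A readiness probe should answer quickly even when a dependency hangs. The following sketch, which assumes each check is a function returning `Promise<boolean>`, runs checks in parallel with a per-check timeout. `withTimeout`, `runChecks`, and `summarize` are hypothetical helpers and not part of the code above.

```typescript
type Check = () => Promise<boolean>;

// Resolve to false if the check throws, rejects, or exceeds timeoutMs.
function withTimeout(check: Check, timeoutMs: number): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => resolve(false), timeoutMs);
    Promise.resolve()
      .then(check)
      .then(
        (ok) => { clearTimeout(timer); resolve(ok); },
        () => { clearTimeout(timer); resolve(false); }
      );
  });
}

// Run all checks concurrently so the probe takes as long as the slowest check.
async function runChecks(
  checks: Record<string, Check>,
  timeoutMs = 2000
): Promise<Record<string, boolean>> {
  const names = Object.keys(checks);
  const results = await Promise.all(names.map((name) => withTimeout(checks[name], timeoutMs)));
  const out: Record<string, boolean> = {};
  names.forEach((name, i) => { out[name] = results[i]; });
  return out;
}

// Turn raw results into the values a readiness endpoint would return.
function summarize(results: Record<string, boolean>): { ready: boolean; httpStatus: number; failed: string[] } {
  const failed = Object.keys(results).filter((name) => !results[name]);
  const ready = failed.length === 0;
  return { ready, httpStatus: ready ? 200 : 503, failed };
}

const summary = summarize({ database: true, redis: false, disk: true });
console.log(summary); // { ready: false, httpStatus: 503, failed: ['redis'] }
```

The 2000 ms default is an arbitrary illustrative value; it should stay below the probe timeout configured in Kubernetes.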

Alerting Rules

# Prometheus alerting rules
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # High error rate
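      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, so the threshold is a percentage
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error ratio is {{ $value | humanizePercentage }} (> 5%)"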
      
      # High latency
      - alert: HighLatency
        # histogram_quantile needs per-bucket rates aggregated by the le label
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"
      
      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
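
The HighErrorRate description speaks of a 5% threshold, which is a ratio of failed requests to all requests rather than a raw count of errors per second. The arithmetic can be sketched as follows. This is a simplification (Prometheus' `rate()` also extrapolates to the window boundaries), and `perSecondRate` and `errorRatio` are hypothetical helpers.

```typescript
interface CounterSample {
  timestampSec: number;
  value: number; // monotonically increasing counter value
}

// Average per-second increase between two samples of a counter.
function perSecondRate(first: CounterSample, last: CounterSample): number {
  const elapsed = last.timestampSec - first.timestampSec;
  if (elapsed <= 0) return 0;
  const delta = last.value - first.value;
  // A negative delta means the counter was reset (for example, a restart).
  return (delta >= 0 ? delta : last.value) / elapsed;
}

function errorRatio(errorsPerSec: number, totalPerSec: number): number {
  return totalPerSec === 0 ? 0 : errorsPerSec / totalPerSec;
}

// Five-minute window: 18 errors out of 18,000 requests.
const errorsPerSec = perSecondRate(
  { timestampSec: 0, value: 100 },
  { timestampSec: 300, value: 118 }
);
const totalPerSec = perSecondRate(
  { timestampSec: 0, value: 12000 },
  { timestampSec: 300, value: 30000 }
);
const ratio = errorRatio(errorsPerSec, totalPerSec);
console.log({ errorsPerSec, totalPerSec, ratio });
```

In this example the raw error rate is 0.06 errors per second, which exceeds 0.05, yet only 0.1% of requests failed. Comparing the ratio against 0.05 is what matches a "5%" threshold.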

Performance Targets

Performance Goals

Metric	Target	Notes
Metric Scrape Interval	15s	Critical services
Log Ingestion Latency	< 1s	Time from emit to queryable
Trace Sampling Rate	10%	Production (100% in Dev/Staging)
Dashboard Load Time	< 2s	P95 Latency
Alert Evaluation	Every 1m	Evaluation interval
Retention Policy	14 days	Logs & Traces (Metrics: 30 days)
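
The trace sampling targets in the table (10% in production, 100% elsewhere) can be illustrated with a deterministic sampler keyed on the trace ID, so that every service makes the same decision for the same trace. This is a simplified sketch and not the algorithm used by the OpenTelemetry SDK's ratio-based sampler; `shouldSample` and `ratioFor` are hypothetical names.

```typescript
// Map an environment name to the sampling ratio from the table above.
function ratioFor(environment: string): number {
  return environment === "production" ? 0.1 : 1.0;
}

// Decide from the first 32 bits of the trace ID, treated as a uniform number.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  const prefix = parseInt(traceId.slice(0, 8), 16); // 0 .. 2^32 - 1
  if (Number.isNaN(prefix)) return false;
  return prefix < ratio * 0x100000000;
}

const lowId = "00000001" + "a".repeat(24);
const highId = "ffffffff" + "a".repeat(24);

console.log(shouldSample(lowId, ratioFor("production")));  // true
console.log(shouldSample(highId, ratioFor("production"))); // false
console.log(shouldSample(highId, ratioFor("staging")));    // true
```

Because the decision depends only on the trace ID, a trace is either kept in full or dropped in full, instead of being sampled independently per service.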

Security Considerations

Observability Security

Log Scrubbing: Automatically remove PII (emails, SSNs, credit card numbers) and secrets from logs before ingestion.
Access Control: Grafana is integrated with OAuth2/OIDC and uses Viewer/Editor/Admin roles.
Network Policy: Only allow traffic from internal namespaces to the stack's ports (9090 for Prometheus, 3100 for Loki, 14268 for the Jaeger collector).
TLS: Encrypt traffic between agents and collectors.
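
As an illustration of the log scrubbing item, the sketch below redacts a few common patterns from a structured log entry before it is written. The regular expressions are deliberately simple assumptions (US-style SSNs, 16-digit card numbers) and will not catch every format; `scrub` and the placeholder tokens are hypothetical.

```typescript
// Value patterns to redact, applied in order.
const VALUE_PATTERNS: [RegExp, string][] = [
  [/\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b/g, "[CARD]"],
  [/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]"],
  [/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, "[EMAIL]"],
];

// Field names whose values are always dropped, whatever they contain.
const SECRET_KEY = /password|secret|token|authorization|api[-_]?key/i;

function scrubString(text: string): string {
  return VALUE_PATTERNS.reduce((acc, [pattern, replacement]) => acc.replace(pattern, replacement), text);
}

// Recursively scrub strings, arrays, and plain objects; other values pass through.
function scrub(value: unknown): unknown {
  if (typeof value === "string") return scrubString(value);
  if (Array.isArray(value)) return value.map(scrub);
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(record)) {
      out[key] = SECRET_KEY.test(key) ? "[REDACTED]" : scrub(record[key]);
    }
    return out;
  }
  return value;
}

const scrubbed = scrub({
  message: "user a@b.com ssn 123-45-6789 card 4111 1111 1111 1111",
  password: "hunter2",
  status: 200,
}) as Record<string, unknown>;
console.log(scrubbed);
```

One way to apply this would be inside a custom Winston format placed before `winston.format.json()`, so that nothing unscrubbed reaches a transport.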

Deployment

graph TD
    subgraph "Kubernetes Monitoring Namespace"
        Grafana[Grafana]
        Prom[Prometheus Server]
        Loki[Loki Gateway]
        Jaeger[Jaeger Collector]
    end
    
    subgraph "App Namespace"
        App[Application Pods]
        Agent[Grafana Agent / Promtail]
    end
    
    App -->|Push Logs| Agent
    Agent -->|Push| Loki
    
    Prom -->|Pull Metrics| App
    Prom -->|Pull Metrics| Agent
    
    App -->|Push Traces| Jaeger
    
    Grafana --> Prom
    Grafana --> Loki
    Grafana --> Jaeger
    
    classDef k8s fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef app fill:#4A5568,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef grafana fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef loki fill:#C05621,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef jaeger fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
    classDef prom fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;

    class Grafana grafana;
    class Loki loki;
    class Jaeger jaeger;
    class Prom prom;
    class App,Agent app;

Deployment Description:

Agent: Promtail or Grafana Agent runs as a DaemonSet or sidecar to collect logs.
Pull Model: Prometheus scrapes metrics from each service's /metrics endpoint.
Push Model: Traces and logs are pushed to their collectors.
Resources: In production, the monitoring stack runs on dedicated nodes so that it does not compete with the main workload.
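
To show what Prometheus actually receives in the pull model, here is a minimal sketch that renders a counter in the Prometheus text exposition format. In a real service prom-client's registry produces this output; `renderCounter` and `CounterSeries` are hypothetical names used only for illustration.

```typescript
interface CounterSeries {
  labels: Record<string, string>;
  value: number;
}

// Escape backslashes, double quotes, and newlines inside label values.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

function renderCounter(name: string, help: string, series: CounterSeries[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const s of series) {
    const labels = Object.keys(s.labels)
      .map((key) => `${key}="${escapeLabelValue(s.labels[key])}"`)
      .join(",");
    lines.push(labels ? `${name}{${labels}} ${s.value}` : `${name} ${s.value}`);
  }
  return lines.join("\n") + "\n";
}

const exposition = renderCounter("http_requests_total", "Total HTTP requests", [
  { labels: { method: "GET", route: "/users/:id", status: "200" }, value: 42 },
]);
console.log(exposition);
```

Each scrape returns the current counter values, and Prometheus derives rates from the difference between successive scrapes.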

Related Documentation

System Design - Overall architecture
Caching Architecture - Cache metrics

Quick Tips

Mermaid Common Issues

Issue	Solution
Parse Error	Check for special characters like `()` or `[]` inside node text without quotes. Use `"text"` for complex strings.
Color Not Showing	Ensure `style` or `classDef` definitions are correct and IDs match.
Arrow Direction	`TD` = Top-Down, `LR` = Left-Right. Choose appropriately for layout.

Color Pattern Quick Reference

Element	Color	Hex	Use Case
Primary	Dark Blue	`#2D3748`	System components, core services
Secondary	Grey	`#4A5568`	Supporting modules, libraries
Accent	Blue	`#2C5282`	Databases, external APIs
Highlight	Teal	`#285E61`	User interactions, highlights
Success	Green	`#2F855A`	Successful states, active
Warning	Orange	`#C05621`	Warning/Caution states
Error	Red	`#C53030`	Error states, failures

Visual Indicators

Indicator	Meaning
🟢	Safe / Recommended
🟡	Warning / Caution
🔴	Danger / Anti-pattern
💡	Tip / Best Practice

Last Updated: 2026-01-10
Author: GoodGo Architecture Team
