pos-system/docs/en/skills/observability-monitoring.md
Ho Ngoc Hai b104fafa85 Refactor auth-service to iam-service and update related documentation
- Renamed auth-service to iam-service across various files for consistency.
- Updated Dockerfiles, deployment configurations, and documentation to reflect the service name change.
- Enhanced testing commands in documentation to point to the new iam-service.
- Removed outdated auth-service files and configurations to streamline the project structure.
- Improved bilingual documentation for clarity on the new service structure and usage.
2025-12-30 20:54:21 +07:00

Observability & Monitoring / Khả Năng Quan Sát & Giám Sát

EN: Observability and monitoring patterns for GoodGo microservices. Use when adding metrics, implementing logging, setting up tracing, creating health checks, or debugging production issues.

VI: Các pattern observability và monitoring cho microservices GoodGo. Sử dụng khi thêm metrics, triển khai logging, thiết lập tracing, tạo health checks, hoặc debug các vấn đề production.

Overview / Tổng Quan

EN: This skill covers the three pillars of observability (logs, metrics, traces) and how to implement them in GoodGo microservices. It includes structured logging, Prometheus metrics, distributed tracing with OpenTelemetry, health checks, and error tracking.

VI: Skill này bao gồm ba trụ cột của observability (logs, metrics, traces) và cách triển khai chúng trong microservices GoodGo. Nó bao gồm structured logging, Prometheus metrics, distributed tracing với OpenTelemetry, health checks, và error tracking.

When to Use / Khi Nào Sử Dụng

EN: Use this skill when:

  • Setting up logging infrastructure
  • Implementing metrics collection
  • Adding distributed tracing
  • Creating health check endpoints
  • Setting up monitoring dashboards
  • Debugging production issues
  • Implementing alerting rules
  • Analyzing performance bottlenecks

VI: Sử dụng skill này khi:

  • Thiết lập hạ tầng logging
  • Triển khai thu thập metrics
  • Thêm distributed tracing
  • Tạo health check endpoints
  • Thiết lập monitoring dashboards
  • Debug các vấn đề production
  • Triển khai alerting rules
  • Phân tích performance bottlenecks

Key Concepts / Khái Niệm Chính

Three Pillars of Observability / Ba Trụ Cột Của Observability

EN:

  1. Logs: Event records for debugging and auditing
  2. Metrics: Numerical measurements over time (counters, gauges, histograms)
  3. Traces: Request flow across services (distributed tracing)

VI:

  1. Logs: Bản ghi sự kiện để debug và audit
  2. Metrics: Đo lường số học theo thời gian (counters, gauges, histograms)
  3. Traces: Luồng request qua các services (distributed tracing)

Tech Stack / Công Nghệ

EN:

  • Logging: @goodgo/logger (Pino-based structured logging)
  • Metrics: Prometheus + Grafana
  • Tracing: OpenTelemetry + Jaeger (@goodgo/tracing)
  • Correlation IDs: Request tracking across services

VI:

  • Logging: @goodgo/logger (structured logging dựa trên Pino)
  • Metrics: Prometheus + Grafana
  • Tracing: OpenTelemetry + Jaeger (@goodgo/tracing)
  • Correlation IDs: Theo dõi request qua các services

Common Patterns / Các Pattern Thường Dùng

Structured Logging / Logging Có Cấu Trúc

EN: Use structured logging with correlation IDs for request tracking.

VI: Sử dụng structured logging với correlation IDs để theo dõi request.

Example from codebase: services/iam-service/src/middlewares/logger.middleware.ts

import { Request, Response, NextFunction } from 'express';
import { logger } from '@goodgo/logger';
import { getCorrelationId, getRequestId } from './correlation.middleware';

export const requestLogger = (req: Request, res: Response, next: NextFunction): void => {
  // Skip detailed logging for health checks and metrics
  if (req.path.startsWith('/health') || req.path.startsWith('/metrics')) {
    return next();
  }

  const start = Date.now();

  res.on('finish', () => {
    const duration = Date.now() - start;
    const correlationId = getCorrelationId(req);
    const requestId = getRequestId(req);

    logger.info('Request processed / Request đã xử lý', {
      correlationId,
      requestId,
      method: req.method,
      path: req.path,
      query: req.query, // may carry tokens or PII; redact or omit such fields
      statusCode: res.statusCode,
      duration: `${duration}ms`,
      contentLength: res.get('Content-Length') || 0,
      userAgent: req.get('User-Agent'),
      ip: req.ip,
      userId: (req as any).user?.userId,
    });
  });

  next();
};

Correlation IDs / Correlation IDs

EN: Use correlation IDs to track requests across services.

VI: Sử dụng correlation IDs để theo dõi request qua các services.

Example from codebase: services/iam-service/src/middlewares/correlation.middleware.ts

import { Request, Response, NextFunction } from 'express';
import { randomUUID } from 'crypto';
import { logger } from '@goodgo/logger';

export const CORRELATION_ID_HEADER = 'x-correlation-id';
export const REQUEST_ID_HEADER = 'x-request-id';

export const correlationMiddleware = (
  options: {
    headerName?: string;
    generateId?: () => string;
    skipPaths?: string[];
  } = {}
) => {
  const {
    headerName = CORRELATION_ID_HEADER,
    generateId = randomUUID,
    skipPaths = ['/health', '/metrics', '/favicon.ico'],
  } = options;

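  return (req: Request, res: Response, next: NextFunction) => {
    // Reuse the caller's correlation ID if present; otherwise generate one.
    // The header type allows string[], so take the first value in that case.
    const incoming = req.headers[headerName.toLowerCase()];
    const correlationId = (Array.isArray(incoming) ? incoming[0] : incoming) || generateId();
    const requestId = generateId();

    // Attach to request object (assumes Express's Request type is augmented
    // with correlationId and requestId elsewhere in the service)
    req.correlationId = correlationId;
    req.requestId = requestId;

    // Add to response headers
    res.setHeader(headerName, correlationId);
    res.setHeader(REQUEST_ID_HEADER, requestId);

    // Log request start, except for noisy paths such as health checks.
    // Assumption: skipPaths suppresses only this log; IDs are still assigned.
    const skipLog = skipPaths.some((prefix) => req.path.startsWith(prefix));
    if (!skipLog) {
      logger.info('Request started / Request bắt đầu', {
        correlationId,
        requestId,
        method: req.method,
        url: req.url,
        userAgent: req.get('User-Agent'),
        ip: req.ip,
      });
    }

    next();
  };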
};
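
The logger middleware imports `getCorrelationId` and `getRequestId` from this module, but the excerpt does not show them. A minimal sketch of what such helpers might look like, assuming they return the value attached by the middleware and fall back to the incoming header. `RequestLike` and `firstHeader` are hypothetical names, and the helpers are shown without `export` so the snippet runs on its own:

```typescript
// Hypothetical structural stand-in for Express's Request (assumption: the real
// service augments Request with correlationId and requestId).
interface RequestLike {
  headers: Record<string, string | string[] | undefined>;
  correlationId?: string;
  requestId?: string;
}

const CORRELATION_ID_HEADER = 'x-correlation-id';
const REQUEST_ID_HEADER = 'x-request-id';

// Node lower-cases incoming header names; the value may be a string or string[].
const firstHeader = (req: RequestLike, name: string): string | undefined => {
  const value = req.headers[name.toLowerCase()];
  return Array.isArray(value) ? value[0] : value;
};

// Prefer the ID attached by the middleware, then the raw header.
const getCorrelationId = (req: RequestLike): string | undefined =>
  req.correlationId ?? firstHeader(req, CORRELATION_ID_HEADER);

const getRequestId = (req: RequestLike): string | undefined =>
  req.requestId ?? firstHeader(req, REQUEST_ID_HEADER);
```

Returning `undefined` rather than a generated value keeps the helpers side-effect free; callers decide how to handle a request that never passed through the middleware.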

Metrics Collection / Thu Thập Metrics

EN: Expose Prometheus metrics for monitoring and alerting.

VI: Expose Prometheus metrics để monitoring và alerting.

Example from codebase: services/iam-service/src/middlewares/metrics.middleware.ts

import { Request, Response, NextFunction } from 'express';
import client from 'prom-client';

// Use prom-client's default global registry
const register = client.register;

// Collect default metrics
client.collectDefaultMetrics({ register });

// Create histogram for HTTP request duration.
// Correlation IDs are deliberately not a label: every distinct label value
// creates a new time series, so a per-request ID would grow cardinality
// without bound. Keep correlation IDs in logs and traces instead.
const httpRequestDurationSeconds = new client.Histogram({
    name: 'http_request_duration_seconds',
    help: 'Duration of HTTP requests in seconds',
    labelNames: ['method', 'route', 'status_code'],
    buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10],
});

// Create counter for total HTTP requests
const httpRequestsTotal = new client.Counter({
    name: 'http_requests_total',
    help: 'Total number of HTTP requests',
    labelNames: ['method', 'route', 'status_code'],
});

// Create gauge for active requests
const activeRequests = new client.Gauge({
    name: 'http_active_requests',
    help: 'Number of active HTTP requests',
});

// Create counter for HTTP request errors
const httpRequestErrors = new client.Counter({
    name: 'http_request_errors_total',
    help: 'Total number of HTTP request errors',
    labelNames: ['method', 'route', 'error_type'],
});

export const metricsMiddleware = (req: Request, res: Response, next: NextFunction) => {
    // Increment active requests
    activeRequests.inc();

    // Start timer
    const start = process.hrtime.bigint();

    res.on('finish', () => {
        // Decrement active requests
        activeRequests.dec();

        // Calculate duration (hrtime is in nanoseconds)
        const end = process.hrtime.bigint();
        const durationInSeconds = Number(end - start) / 1e9;

        // Normalize path to avoid high cardinality
        const route = normalizeRoutePath(req);

        // Record duration
        httpRequestDurationSeconds
            .labels(req.method, route, res.statusCode.toString())
            .observe(durationInSeconds);

        // Increment request counter
        httpRequestsTotal
            .labels(req.method, route, res.statusCode.toString())
            .inc();

        // Track errors
        if (res.statusCode >= 400) {
            const errorType = res.statusCode >= 500 ? 'server_error' : 'client_error';
            httpRequestErrors
                .labels(req.method, route, errorType)
                .inc();
        }
    });

    next();
};

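// Normalize route path to prevent high cardinality metrics
function normalizeRoutePath(req: Request): string {
    // Prefer the matched route pattern (e.g. /users/:id). baseUrl restores the
    // mount prefix when the handler lives on a nested router.
    if (req.route && req.route.path) {
        return `${req.baseUrl}${req.route.path}`;
    }

    let path = req.path;
    // Replace UUIDs and all-numeric path segments with placeholders
    path = path.replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, ':uuid');
    // Match whole segments only, so version prefixes such as /v1 stay intact
    path = path.replace(/\/\d+(?=\/|$)/g, '/:id');

    return path;
}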
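
Prometheus collects these metrics by scraping an HTTP endpoint, which the excerpt does not show. A minimal sketch of a handler for it follows. `createMetricsHandler`, `MetricsRegistry`, and `ResponseLike` are hypothetical names; the two interfaces are shaped so that prom-client's `register` (its `contentType` property and async `metrics()` method, as in prom-client 13 and later) and an Express response should satisfy them:

```typescript
// Hypothetical structural types so the sketch has no external dependencies.
interface MetricsRegistry {
  contentType: string;
  metrics(): Promise<string>;
}

interface ResponseLike {
  statusCode: number;
  setHeader(name: string, value: string): void;
  end(body?: string): void;
}

// Build a request handler that serves the registry in Prometheus text format.
const createMetricsHandler =
  (registry: MetricsRegistry) =>
  async (_req: unknown, res: ResponseLike): Promise<void> => {
    try {
      const body = await registry.metrics();
      res.statusCode = 200;
      res.setHeader('Content-Type', registry.contentType);
      res.end(body);
    } catch {
      // A failed scrape should not crash the service
      res.statusCode = 500;
      res.end('failed to collect metrics');
    }
  };
```

Assuming an Express `app`, this might be mounted as `app.get('/metrics', createMetricsHandler(client.register))`. The endpoint is usually kept off the public gateway, since it exposes internal route names and runtime details.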

Distributed Tracing / Distributed Tracing

EN: Use OpenTelemetry for distributed tracing across services.

VI: Sử dụng OpenTelemetry cho distributed tracing qua các services.

Example from codebase: packages/tracing/src/index.ts

import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

export interface TracingConfig {
  serviceName: string;
  jaegerEndpoint?: string;
  enabled?: boolean;
}

export const initTracing = (config: TracingConfig): NodeSDK | null => {
  if (config.enabled === false) {
    return null;
  }

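  // Create Jaeger exporter if endpoint is provided.
  // Note: @opentelemetry/exporter-jaeger is deprecated upstream; recent Jaeger
  // versions ingest OTLP directly, so an OTLP exporter is the usual replacement.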
  const jaegerExporter = config.jaegerEndpoint
    ? new JaegerExporter({
        endpoint: config.jaegerEndpoint,
      })
    : undefined;

  // Initialize OpenTelemetry NodeSDK with auto-instrumentations
  const sdk = new NodeSDK({
    resource: new Resource({
      [SemanticResourceAttributes.SERVICE_NAME]: config.serviceName,
    }),
    traceExporter: jaegerExporter,
    instrumentations: [getNodeAutoInstrumentations()],
  });

  // Start the tracing SDK
  sdk.start();

  return sdk;
};

Usage in service:

// services/iam-service/src/main.ts
import { initTracing } from '@goodgo/tracing';

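// Initialize tracing before express, http, or database clients are loaded;
// otherwise auto-instrumentation cannot patch them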
if (process.env.TRACING_ENABLED === 'true') {
  initTracing({
    serviceName: process.env.SERVICE_NAME || 'iam-service',
    jaegerEndpoint: process.env.JAEGER_ENDPOINT,
    enabled: true,
  });
}
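
`initTracing` returns the SDK instance, but the usage above discards it, so buffered spans may be lost when the process stops. A minimal sketch of a graceful shutdown hook, assuming only that the returned object has the `shutdown(): Promise<void>` method that `NodeSDK` provides. `Shutdownable`, `createTracingShutdown`, and `registerTracingShutdown` are hypothetical names:

```typescript
// Structural stand-in for NodeSDK: only shutdown() is needed here.
interface Shutdownable {
  shutdown(): Promise<void>;
}

// Returns a function that flushes and stops tracing at most once.
const createTracingShutdown = (sdk: Shutdownable | null): (() => Promise<void>) => {
  let pending: Promise<void> | undefined;
  return () => {
    if (!sdk) return Promise.resolve();
    if (!pending) {
      // Reuse the in-flight promise so repeated signals do not shut down twice
      pending = sdk.shutdown().catch((err: unknown) => {
        console.error('Tracing shutdown failed', err);
      });
    }
    return pending;
  };
};

// Hypothetical wiring: flush spans before exiting on SIGTERM or SIGINT.
const registerTracingShutdown = (sdk: Shutdownable | null): void => {
  const shutdown = createTracingShutdown(sdk);
  for (const signal of ['SIGTERM', 'SIGINT'] as const) {
    process.once(signal, () => {
      void shutdown().then(() => process.exit(0));
    });
  }
};
```

In `main.ts` this could wrap the existing call, for example `registerTracingShutdown(initTracing({ ... }))`. A service that already has its own shutdown sequence would call the function returned by `createTracingShutdown` from there instead of registering a second signal handler.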

Health Checks / Kiểm Tra Sức Khỏe

EN: Implement liveness and readiness probes for Kubernetes.

VI: Triển khai liveness và readiness probes cho Kubernetes.

Example from codebase: services/iam-service/src/modules/health/health.controller.ts

import { Request, Response } from 'express';
import { prisma } from '../../config/database.config';
import { ApiResponse } from '@goodgo/types';

export class HealthController {
  /**
   * EN: Basic liveness probe
   * VI: Kiểm tra liveness cơ bản
   */
  health = async (_req: Request, res: Response): Promise<void> => {
    const response: ApiResponse<{ status: string; timestamp: string }> = {
      success: true,
      data: {
        status: 'ok',
        timestamp: new Date().toISOString(),
      },
      timestamp: new Date().toISOString(),
    };

    res.json(response);
  };

  /**
   * EN: Readiness probe (checks database connection)
   * VI: Kiểm tra readiness (kiểm tra kết nối database)
   */
  ready = async (_req: Request, res: Response): Promise<void> => {
    try {
      // Check database connection
      await prisma.$queryRaw`SELECT 1`;
      res.json({
        success: true,
        data: { status: 'ready' },
        timestamp: new Date().toISOString(),
      });
    } catch (error) {
      // Return 503 if database is not ready
      res.status(503).json({
        success: false,
        error: {
          code: 'HEALTH_001',
          message: 'Service not ready',
        },
        timestamp: new Date().toISOString(),
      });
    }
  };

  /**
   * EN: Alias for health check
   * VI: Alias cho kiểm tra sức khỏe
   */
  live = async (_req: Request, res: Response): Promise<void> => {
    res.json({
      success: true,
      data: { status: 'live' },
      timestamp: new Date().toISOString(),
    });
  };
}
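
The readiness probe above checks a single dependency and has no time limit, so a hung database call could leave the probe waiting. A minimal sketch of one way to generalize it, assuming each dependency is wrapped in an async check function. `checkReadiness`, `withTimeout`, and the report shape are hypothetical, not part of the codebase:

```typescript
type Check = () => Promise<unknown>;

interface ReadinessReport {
  ready: boolean;
  checks: Record<string, 'ok' | 'failed'>;
}

// Reject if the check has not settled within timeoutMs.
const withTimeout = (check: Check, timeoutMs: number): Promise<unknown> =>
  new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error('timed out')), timeoutMs);
    Promise.resolve()
      .then(check)
      .then(
        (value) => { clearTimeout(timer); resolve(value); },
        (err) => { clearTimeout(timer); reject(err); },
      );
  });

// Run all checks in parallel; the service is ready only if every check passes.
const checkReadiness = async (
  checks: Record<string, Check>,
  timeoutMs = 2000,
): Promise<ReadinessReport> => {
  const names = Object.keys(checks);
  const outcomes = await Promise.all(
    names.map((name) => withTimeout(checks[name], timeoutMs).then(() => true, () => false)),
  );
  const report: ReadinessReport = { ready: true, checks: {} };
  names.forEach((name, i) => {
    report.checks[name] = outcomes[i] ? 'ok' : 'failed';
    if (!outcomes[i]) report.ready = false;
  });
  return report;
};
```

The `ready` handler could then call something like `checkReadiness({ database: () => prisma.$queryRaw`SELECT 1` })` and respond with 503 when `report.ready` is false. Note that the timeout only stops waiting; it does not cancel the underlying query.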

Best Practices / Thực Hành Tốt Nhất

Logging / Logging

EN:

  • Use structured logging (JSON format)
  • Include correlation IDs for request tracing
  • Log at appropriate levels (ERROR, WARN, INFO, DEBUG)
  • Avoid logging sensitive data (passwords, tokens, PII)
  • Use consistent log format across services

VI:

  • Sử dụng structured logging (định dạng JSON)
  • Bao gồm correlation IDs để theo dõi request
  • Log ở mức độ phù hợp (ERROR, WARN, INFO, DEBUG)
  • Tránh log dữ liệu nhạy cảm (mật khẩu, tokens, PII)
  • Sử dụng format log nhất quán giữa các services
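
To illustrate the rule about sensitive data, here is a minimal sketch of masking fields before they reach the logger. The key list and the `redact` helper are assumptions for illustration. Pino itself ships a `redact` option, but whether `@goodgo/logger` exposes it is not stated here:

```typescript
// Hypothetical list of keys whose values must never be logged.
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'cookie', 'secret']);

// Return a deep copy with sensitive values masked; the input is not mutated.
// Intended for plain JSON-like data without cycles.
const redact = (value: unknown): unknown => {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : redact(inner);
    }
    return out;
  }
  return value;
};
```

A call such as `logger.info('Login attempt', redact({ email, password }))` would then log the email and a masked password. Matching on exact key names is a deliberate simplification; it misses variants such as `accessToken` unless they are added to the set.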

Metrics / Metrics

EN:

  • Use standard metric types (Counter, Gauge, Histogram)
  • Keep cardinality low (avoid high-cardinality labels)
  • Define SLIs and SLOs for critical paths
  • Monitor business metrics, not just technical ones
  • Normalize route paths to prevent high cardinality

VI:

  • Sử dụng các loại metric chuẩn (Counter, Gauge, Histogram)
  • Giữ cardinality thấp (tránh high-cardinality labels)
  • Định nghĩa SLIs và SLOs cho các đường dẫn quan trọng
  • Giám sát business metrics, không chỉ technical metrics
  • Chuẩn hóa route paths để tránh high cardinality
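
As a worked example of defining an SLI and SLO, the sketch below computes an availability SLI from request counts and the share of error budget left for a given target. The function names and the sample numbers are illustrative assumptions, not GoodGo's actual objectives:

```typescript
interface SloPeriod {
  totalRequests: number;
  failedRequests: number;
}

// Availability SLI: fraction of requests that succeeded in the period.
const availability = ({ totalRequests, failedRequests }: SloPeriod): number =>
  totalRequests === 0 ? 1 : (totalRequests - failedRequests) / totalRequests;

// Fraction of the error budget still unspent. 1 means untouched, 0 means fully
// spent, and a negative value means the SLO has been violated.
const errorBudgetRemaining = (period: SloPeriod, sloTarget: number): number => {
  const allowedFailures = (1 - sloTarget) * period.totalRequests;
  if (allowedFailures === 0) return period.failedRequests === 0 ? 1 : -Infinity;
  return 1 - period.failedRequests / allowedFailures;
};
```

With 100,000 requests, 50 failures, and a 99.9% target, about 100 failures are allowed, so roughly half of the budget remains. The counts could come from `http_requests_total` split by `status_code`.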

Tracing / Tracing

EN:

  • Add traces for critical operations
  • Include relevant context in spans
  • Sample appropriately to control costs
  • Use distributed tracing for microservices
  • Propagate correlation IDs across service boundaries

VI:

  • Thêm traces cho các thao tác quan trọng
  • Bao gồm context liên quan trong spans
  • Sample phù hợp để kiểm soát chi phí
  • Sử dụng distributed tracing cho microservices
  • Truyền correlation IDs qua ranh giới service
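
Propagating a correlation ID means copying it onto every outbound request. A minimal sketch, assuming the `x-correlation-id` header name used by the correlation middleware; `withCorrelationId` is a hypothetical helper:

```typescript
const CORRELATION_ID_HEADER = 'x-correlation-id';

// Copy the caller's headers and add the current correlation ID, if there is one.
const withCorrelationId = (
  correlationId: string | undefined,
  headers: Record<string, string> = {},
): Record<string, string> =>
  correlationId ? { ...headers, [CORRELATION_ID_HEADER]: correlationId } : { ...headers };
```

An outbound call might then look like `fetch(url, { headers: withCorrelationId(req.correlationId, { accept: 'application/json' }) })`, where `url` stands for whichever downstream service is being called. This is separate from trace context: OpenTelemetry's HTTP auto-instrumentation propagates its own `traceparent` header, while the correlation ID is an application-level value that has to be forwarded explicitly.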

Alerting / Cảnh Báo

EN:

  • Alert on symptoms, not causes
  • Include runbook links in alerts
  • Avoid alert fatigue with proper thresholds
  • Test alerting rules regularly
  • Use correlation IDs in alert context

VI:

  • Cảnh báo về triệu chứng, không phải nguyên nhân
  • Bao gồm links runbook trong alerts
  • Tránh alert fatigue với thresholds phù hợp
  • Test alerting rules thường xuyên
  • Sử dụng correlation IDs trong alert context

Examples from Project / Ví Dụ Từ Dự Án

Logging Implementation / Triển Khai Logging

Metrics Implementation / Triển Khai Metrics

Tracing Implementation / Triển Khai Tracing

Health Checks / Health Checks

Quick Reference / Tham Khảo Nhanh

Log Levels / Mức Độ Log

EN:

  • ERROR: Errors that require immediate attention
  • WARN: Warnings that may indicate issues
  • INFO: Informational messages (default)
  • DEBUG: Detailed debugging information

VI:

  • ERROR: Lỗi cần chú ý ngay lập tức
  • WARN: Cảnh báo có thể chỉ ra vấn đề
  • INFO: Thông điệp thông tin (mặc định)
  • DEBUG: Thông tin debug chi tiết

Metric Types / Loại Metrics

EN:

  • Counter: Monotonically increasing value (e.g., request count)
  • Gauge: Value that can go up or down (e.g., active connections)
  • Histogram: Distribution of values (e.g., request duration)

VI:

  • Counter: Giá trị tăng đơn điệu (ví dụ: số lượng request)
  • Gauge: Giá trị có thể tăng hoặc giảm (ví dụ: kết nối đang hoạt động)
  • Histogram: Phân phối giá trị (ví dụ: thời lượng request)

Health Check Endpoints / Endpoints Health Check

EN:

  • /health or /health/live: Liveness probe (service is running)
  • /health/ready: Readiness probe (service is ready to accept traffic)

VI:

  • /health hoặc /health/live: Liveness probe (service đang chạy)
  • /health/ready: Readiness probe (service sẵn sàng nhận traffic)

Prometheus Queries / Truy Vấn Prometheus

# Request rate
rate(http_requests_total[5m])

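# Error rate (5xx responses per second)
sum(rate(http_requests_total{status_code=~"5.."}[5m]))

# Error ratio (share of requests that fail with 5xx)
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))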

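# 95th percentile latency (histogram_quantile needs the _bucket series, as a rate, grouped by le)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))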

# Active requests
http_active_requests
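
The percentile query relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts instead of from raw observations. The sketch below approximates that calculation to show why bucket boundaries limit precision. It is a simplified illustration under stated assumptions (buckets sorted by upper bound, the last one `+Infinity`, and `0 < q <= 1`), not Prometheus's actual implementation:

```typescript
// One cumulative bucket: how many observations were <= le.
interface Bucket {
  le: number;
  count: number;
}

const histogramQuantile = (q: number, buckets: Bucket[]): number => {
  if (buckets.length < 2) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  // Position of the requested quantile among all observations
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);

  // Rank lands in the +Inf bucket: report the highest finite bound
  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;

  // Interpolate linearly inside the bucket that contains the rank
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const lowerCount = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - lowerCount;
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - lowerCount) / inBucket);
};
```

For cumulative counts of 50, 90, 98, and 100 at bounds 0.1, 0.5, 1, and `+Infinity`, the 95th percentile falls in the 0.5 to 1 bucket and is interpolated to about 0.81 seconds. The true value could be anywhere in that bucket, which is why bucket bounds should bracket the latency targets being monitored.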

Resources / Tài Nguyên

Official Documentation / Tài Liệu Chính Thức

GoodGo Resources / Tài Nguyên GoodGo