# Kiến trúc Khả năng Quan sát / Observability Architecture

> **VI**: Khả năng quan sát toàn diện với metrics, logging và tracing
>
> **EN**: Comprehensive observability with metrics, logging, and tracing

## Sơ đồ Tổng quan / Overview Diagram

```mermaid
graph TD
    subgraph "Services"
        Service1[Service A]
        Service2[Service B]
    end

    subgraph "Metrics"
        Service1 -->|/metrics| Prom[Prometheus]
        Service2 -->|/metrics| Prom
        Prom --> Grafana[Grafana<br/>Dashboards]
    end

    subgraph "Logging"
        Service1 -->|JSON Logs| Loki
        Service2 -->|JSON Logs| Loki
        Loki --> GrafanaLogs[Grafana<br/>Log Explorer]
    end

    subgraph "Tracing"
        Service1 -->|Spans| Jaeger
        Service2 -->|Spans| Jaeger
        Jaeger --> JaegerUI[Jaeger UI]
    end

    style Prom fill:#d4edda
    style Loki fill:#fff4e1
    style Jaeger fill:#e1f5ff
```

## Bối cảnh Hệ thống / System Context

```mermaid
C4Context
    title Sơ đồ Bối cảnh Khả năng Quan sát / Observability System Context

    Person(dev, "Developer", "Uses dashboards to monitor system")
    Person(sre, "SRE", "Manages infrastructure & alerts")

    System(obs, "Observability Stack", "Prometheus, Loki, Jaeger, Grafana")

    System_Ext(service, "Microservices", "Sends telemetry data")
    System_Ext(k8s, "Kubernetes", "Sends cluster metrics")

    Rel(dev, obs, "Views Dashboards", "HTTPS")
    Rel(sre, obs, "Configures Alerts", "HTTPS")
    Rel(service, obs, "Push/Pull Telemetry", "HTTP/gRPC")
    Rel(k8s, obs, "Exposes Metrics", "HTTP")
```

### VI Mô tả Bối cảnh

- **Observability Stack**: Trung tâm thu thập và hiển thị dữ liệu (Prometheus, Loki, Jaeger, Grafana).
- **Microservices**: Gửi logs, metrics và traces (OpenTelemetry).
- **Developer/SRE**: Sử dụng Grafana để theo dõi sức khỏe hệ thống và debug.

### EN Context Description

- **Observability Stack**: Central point for collecting and visualizing telemetry (Prometheus, Loki, Jaeger, Grafana).
- **Microservices**: Emit logs, metrics, and traces, instrumented with OpenTelemetry.
- **Developer/SRE**: Use Grafana to monitor system health and to debug.

## Ba Trụ cột Khả năng Quan sát / Three Pillars of Observability

### 1. Metrics (Prometheus + Grafana)

```mermaid
graph LR
    Service[Service] -->|Expose /metrics| Prom[Prometheus]
    Prom -->|Scrape every 15s| Metrics[Time Series DB]
    Metrics --> Grafana[Grafana]
    Grafana --> Dashboard1[Request Dashboard]
    Grafana --> Dashboard2[Error Dashboard]
    Grafana --> Dashboard3[Performance Dashboard]

    style Prom fill:#d4edda
    style Grafana fill:#e1f5ff
```

**VI**: Các phép đo số theo thời gian (requests/sec, latency, errors).

**EN**: Numerical measurements over time (requests/sec, latency, errors).

**Triển khai / Implementation**:

```typescript
import { Counter, Histogram, Gauge } from 'prom-client';
// EN: Express types; the middleware signature and req.route are Express conventions
import type { Request, Response, NextFunction } from 'express';

// VI: HTTP request metrics
// EN: HTTP request metrics
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});

export const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status']
});

export const activeRequests = new Gauge({
  name: 'http_requests_active',
  help: 'Number of active HTTP requests'
});

// VI: Middleware để track metrics
// EN: Middleware to track metrics
export function metricsMiddleware(req: Request, res: Response, next: NextFunction): void {
  const start = Date.now();
  activeRequests.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;

    // EN: Both metrics share one label set, so build it once
    const labels = {
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode
    };

    httpRequestDuration.observe(labels, duration);
    httpRequestTotal.inc(labels);
    activeRequests.dec();
  });

  next();
}
```

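The middleware falls back to `req.path` when no route matched, so URLs that embed IDs can each create their own time series. The sketch below shows one way to reduce that label cardinality. It assumes a hypothetical `routeLabel` helper that is not part of prom-client, Express, or the original code, and it only collapses numeric and UUID segments.

```typescript
// Hypothetical helper (not part of prom-client or Express): derive a
// lower-cardinality `route` label for the HTTP metrics.
function routeLabel(matchedRoute: string | undefined, rawPath: string): string {
  // Prefer the route pattern (e.g. "/users/:id"): all matching URLs share it.
  if (matchedRoute) return matchedRoute;

  const uuid = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

  // No route matched (typically a 404): replace ID-like segments with ":id".
  const normalized = rawPath
    .split('/')
    .map(segment => (/^\d+$/.test(segment) || uuid.test(segment) ? ':id' : segment))
    .join('/');

  return normalized || '/';
}
```

In the middleware this would replace the `req.route?.path || req.path` expression, for example `route: routeLabel(req.route?.path, req.path)`. Paths made of arbitrary words are not collapsed, so a stricter variant could return a fixed label such as `unmatched` instead.
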
### 2. Logging (Winston + Loki)

```mermaid
sequenceDiagram
    participant Service
    participant Winston as Winston Logger
    participant Loki
    participant Grafana

    Service->>Winston: Log event
    Winston->>Winston: Format JSON
    Winston->>Winston: Add metadata<br/>(correlation ID, trace ID)
    Winston->>Loki: Push logs
    Loki->>Loki: Index & store

    User->>Grafana: Query logs
    Grafana->>Loki: LogQL query
    Loki-->>Grafana: Log results
```

**VI**: Structured logging với correlation IDs để tracing requests.

**EN**: Structured logging with correlation IDs for request tracing.

**Triển khai / Implementation**:

```typescript
import { randomUUID } from 'node:crypto';
import winston from 'winston';
import type { Request, Response, NextFunction } from 'express';

// EN: Extra request fields set by the middleware below
declare global {
  namespace Express {
    interface Request {
      correlationId?: string;
      logger?: winston.Logger;
    }
  }
}

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: process.env.SERVICE_NAME || 'unknown-service',
    environment: process.env.NODE_ENV || 'development'
  },
  transports: [
    new winston.transports.Console(),
    // VI: Loki transport (nếu configured)
    // EN: Loki transport (if configured)
  ]
});

// VI: Logger middleware
// EN: Logger middleware
export function loggerMiddleware(req: Request, res: Response, next: NextFunction): void {
  // EN: Captured here so the completion log can report a real duration
  const startTime = Date.now();
  // EN: Reuse the caller's correlation ID, or generate a new one
  const correlationId = req.get('x-correlation-id') || randomUUID();
  const requestLogger = logger.child({ correlationId });

  req.correlationId = correlationId;
  req.logger = requestLogger;

  requestLogger.info('Incoming request', {
    method: req.method,
    path: req.path,
    ip: req.ip
  });

  res.on('finish', () => {
    requestLogger.info('Request completed', {
      method: req.method,
      path: req.path,
      status: res.statusCode,
      duration: Date.now() - startTime // milliseconds
    });
  });

  next();
}
```

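Code that has no access to the request object (repositories, HTTP clients) cannot reach `req.logger` directly. One option, sketched here on the assumption that Node's built-in `AsyncLocalStorage` is acceptable in this codebase, is to bind the correlation ID to the async call chain. The names `runWithCorrelationId`, `currentCorrelationId`, and `outboundHeaders` are hypothetical and not part of the original middleware.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

interface RequestContext {
  correlationId: string;
}

const requestContext = new AsyncLocalStorage<RequestContext>();

// Run `fn` with a correlation ID bound to the current async call chain.
// Reuses the incoming ID when present, otherwise generates a new UUID.
function runWithCorrelationId<T>(incomingId: string | undefined, fn: () => T): T {
  const correlationId = incomingId || randomUUID();
  return requestContext.run({ correlationId }, fn);
}

// Read the ID anywhere downstream; undefined outside runWithCorrelationId.
function currentCorrelationId(): string | undefined {
  return requestContext.getStore()?.correlationId;
}

// Headers to attach to outbound calls so the next service logs the same ID.
function outboundHeaders(): Record<string, string> {
  const id = currentCorrelationId();
  return id ? { 'x-correlation-id': id } : {};
}
```

A middleware could wrap `next()` in `runWithCorrelationId(req.get('x-correlation-id'), next)`, after which any outbound HTTP call can spread `outboundHeaders()` into its request headers.
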
### 3. Tracing (OpenTelemetry + Jaeger)

```mermaid
graph LR
    Request[Incoming Request] --> Trace[Create Trace]
    Trace --> SpanA[Span: HTTP Request]
    SpanA --> SpanB[Span: DB Query]
    SpanA --> SpanC[Span: Cache Check]
    SpanA --> SpanD[Span: External API]

    SpanB --> Jaeger[Jaeger]
    SpanC --> Jaeger
    SpanD --> Jaeger

    Jaeger --> Timeline[Trace Timeline]

    style Trace fill:#e1f5ff
    style Jaeger fill:#d4edda
```

**VI**: Distributed tracing để track requests giữa các services.

**EN**: Distributed tracing to track requests across services.

**Triển khai / Implementation**:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('app');

// VI: Tạo traced function
// EN: Create traced function
export function traced<T>(
  name: string,
  fn: () => Promise<T>
): Promise<T> {
  // EN: startActiveSpan makes the span current while fn runs,
  // so spans created inside fn become its children
  return tracer.startActiveSpan(name, async (span) => {
    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : String(error)
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

// VI: Sử dụng
// EN: Usage (`User` and `userRepository` are assumed to be defined elsewhere)
async function getUserWithTracing(userId: string): Promise<User> {
  return traced('getUserById', () => userRepository.findById(userId));
}
```

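Tracking a request across services depends on passing trace context from one service to the next. OpenTelemetry SDKs normally handle this automatically, so the sketch below is only an illustration of the W3C `traceparent` header format that such propagation typically uses. `parseTraceparent` and `formatTraceparent` are illustrative names, not OpenTelemetry APIs, and only version `00` of the header is handled.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  spanId: string;   // 16 lowercase hex characters (the caller's span)
  sampled: boolean; // lowest bit of the trace-flags byte
}

// version "00" - trace-id - parent-id - trace-flags
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;

  const [, traceId, spanId, flags] = match;
  // All-zero trace or span IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;

  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}
```

A downstream service that receives this header creates its spans with the same `traceId` and uses `spanId` as the parent, which is how Jaeger can assemble one timeline from spans emitted by several services.
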
## Kiểm tra Sức khỏe / Health Checks

```typescript
// EN: `app` (Express), `prisma`, `checkRedis`, and `checkDiskSpace` are assumed
// to be defined elsewhere; the latter two follow the checkDatabase pattern.

// VI: Liveness probe - service có đang chạy không?
// EN: Liveness probe - is service running?
app.get('/health/live', (req, res) => {
  res.json({ status: 'ok', timestamp: new Date().toISOString() });
});

// VI: Readiness probe - service có sẵn sàng nhận traffic không?
// EN: Readiness probe - is service ready for traffic?
app.get('/health/ready', async (req, res) => {
  // EN: The checks are independent, so run them concurrently
  const [database, redis, disk] = await Promise.all([
    checkDatabase(),
    checkRedis(),
    checkDiskSpace()
  ]);
  const checks = { database, redis, disk };

  const ready = Object.values(checks).every(check => check === true);

  res.status(ready ? 200 : 503).json({
    ready,
    checks,
    timestamp: new Date().toISOString()
  });
});

async function checkDatabase(): Promise<boolean> {
  try {
    await prisma.$queryRaw`SELECT 1`;
    return true;
  } catch {
    return false;
  }
}
```

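A readiness check that hangs on a slow dependency can cause the probe itself to time out. As a sketch, assuming each check is a function returning `Promise<boolean>` like `checkDatabase`, the checks can be given a per-check deadline. `withTimeout` and `runChecks` are hypothetical helpers, and the 2-second default is an arbitrary illustrative value.

```typescript
type Check = () => Promise<boolean>;

// Resolve to false if the check throws, rejects, or does not settle within `ms`.
function withTimeout(check: Check, ms: number): Promise<boolean> {
  return new Promise<boolean>(resolve => {
    const timer = setTimeout(() => resolve(false), ms);
    Promise.resolve()
      .then(check)
      .then(ok => resolve(ok === true), () => resolve(false))
      .finally(() => clearTimeout(timer));
  });
}

// Run all checks concurrently and report each result by name.
async function runChecks(
  checks: Record<string, Check>,
  timeoutMs = 2000
): Promise<{ ready: boolean; results: Record<string, boolean> }> {
  const names = Object.keys(checks);
  const outcomes = await Promise.all(
    names.map(name => withTimeout(checks[name], timeoutMs))
  );

  const results: Record<string, boolean> = {};
  names.forEach((name, i) => {
    results[name] = outcomes[i];
  });

  return { ready: outcomes.every(Boolean), results };
}
```

The readiness handler could then call `runChecks({ database: checkDatabase, redis: checkRedis, disk: checkDiskSpace })` and return 503 whenever `ready` is false. The deadline should be shorter than the probe timeout configured in Kubernetes.
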
## Quy tắc Cảnh báo / Alerting Rules

```yaml
# VI: Prometheus alerting rules
# EN: Prometheus alerting rules
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # VI: Tỷ lệ lỗi cao
      # EN: High error rate (share of requests answered with a 5xx status)
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (> 5%)"

      # VI: Độ trễ cao
      # EN: High latency (P95 over the last 5 minutes)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"

      # VI: Service down
      # EN: Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
```

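The `HighLatency` rule relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw observations. The sketch below mimics the core idea in simplified form: find the bucket that contains the target rank and interpolate linearly inside it. It is not Prometheus's implementation, and the function name and sample counts are illustrative.

```typescript
interface Bucket {
  le: number;    // upper bound ("less than or equal"); Infinity for the last bucket
  count: number; // cumulative number of observations <= le
}

function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted.length > 0 ? sorted[sorted.length - 1].count : 0;
  if (total === 0 || q < 0 || q > 1) return NaN;

  // Rank of the target observation among all observations.
  const rank = q * total;
  const index = sorted.findIndex(b => b.count >= rank);
  const bucket = sorted[index];

  // Rank falls in the +Inf bucket: report the highest finite bound instead.
  if (bucket.le === Infinity) {
    return sorted.length > 1 ? sorted[sorted.length - 2].le : NaN;
  }

  const lowerBound = index === 0 ? 0 : sorted[index - 1].le;
  const lowerCount = index === 0 ? 0 : sorted[index - 1].count;
  if (bucket.count === lowerCount) return lowerBound;

  // Assume observations are spread evenly inside the bucket.
  const fraction = (rank - lowerCount) / (bucket.count - lowerCount);
  return lowerBound + (bucket.le - lowerBound) * fraction;
}
```

Because the estimate can never be finer than the bucket boundaries, the `buckets` array of `http_request_duration_seconds` determines how precisely the 1-second alert threshold can be evaluated.
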
## Đặc điểm Hiệu suất / Performance Characteristics

### VI: Mục tiêu Hiệu suất

| Chỉ số / Metric | Mục tiêu / Target | Ghi chú / Notes |
|-----------------|-------------------|-----------------|
| **Metric Scrape Interval** | 15s | Critical services |
| **Log Ingestion Latency** | < 1s | Time from emit to queryable |
| **Trace Sampling Rate** | 10% | Production (100% in Dev/Staging) |
| **Dashboard Load Time** | < 2s | P95 Latency |
| **Alert Evaluation** | Every 1m | Evaluation interval |
| **Retention Policy** | 14 days | Logs & Traces (Metrics: 30 days) |

### EN: Performance Targets

| Metric | Target | Notes |
|--------|--------|-------|
| **Metric Scrape Interval** | 15s | Critical services |
| **Log Ingestion Latency** | < 1s | Time from emit to queryable |
| **Trace Sampling Rate** | 10% | Production (100% in Dev/Staging) |
| **Dashboard Load Time** | < 2s | P95 Latency |
| **Alert Evaluation** | Every 1m | Evaluation interval |
| **Retention Policy** | 14 days | Logs & Traces (Metrics: 30 days) |

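The 10% production sampling rate would normally be configured in the OpenTelemetry SDK rather than written by hand. To illustrate why ratio-based sampling can keep whole traces intact, here is a sketch of a decision derived only from the trace ID, so every service reaches the same verdict for the same trace. The scheme (using the last 32 bits of the ID) is a simplification chosen for this example, not the algorithm of any specific SDK.

```typescript
// Decide whether to keep a trace, based only on its ID and the target ratio.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio <= 0) return false;
  if (ratio >= 1) return true;

  // Interpret the last 8 hex characters (32 bits) as a number in [0, 2^32).
  const value = parseInt(traceId.slice(-8), 16);
  // Malformed ID: drop the trace (an arbitrary choice for this sketch).
  if (Number.isNaN(value)) return false;

  // Trace IDs are random, so value / 2^32 is roughly uniform in [0, 1).
  return value / 2 ** 32 < ratio;
}
```

With `ratio` set to `0.1` in production and `1` in Dev/Staging, this matches the targets in the table: about one trace in ten is kept, and a kept trace is kept by every service that sees its ID.
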
## Cân nhắc Bảo mật / Security Considerations

### VI: Bảo mật Observability

- **Log Scrubbing**: Tự động loại bỏ PII (emails, SSN, credit cards) và secrets khỏi logs trước khi ingestion.
- **Access Control**: Grafana integrated với OAuth2/OIDC, phân quyền Viewer/Editor/Admin.
- **Network Policy**: Chỉ cho phép traffic từ namespace nội bộ tới các cổng ingestion (9090, 3100, 14268).
- **TLS**: Mã hóa traffic giữa agents và collectors.

### EN: Observability Security

- **Log Scrubbing**: Automatically remove PII (emails, SSNs, credit card numbers) and secrets from logs before ingestion.
- **Access Control**: Grafana authenticates through OAuth2/OIDC and assigns users the Viewer, Editor, or Admin role.
- **Network Policy**: Allow traffic to the ingestion ports (9090, 3100, 14268) only from internal namespaces.
- **TLS**: Encrypt traffic between agents and collectors.

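Log scrubbing can be applied inside the application, before a record reaches any transport. The following is a minimal sketch, assuming regex-based redaction is acceptable as a first line of defense. The patterns and the `scrub` helper are illustrative, will miss many real-world formats, and are not taken from the original codebase.

```typescript
// Illustrative patterns only; real deployments need a broader, tested set.
const REDACTIONS: Array<[RegExp, string]> = [
  [/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, '[REDACTED_EMAIL]'],
  [/\b\d{3}-\d{2}-\d{4}\b/g, '[REDACTED_SSN]'],       // US SSN, dashed form only
  [/\b\d(?:[ -]?\d){12,15}\b/g, '[REDACTED_CARD]'],   // 13 to 16 digit card-like numbers
];

// Field names whose values are always replaced, whatever they contain.
const SECRET_KEYS = /^(password|secret|token|authorization|api[_-]?key)$/i;

// Recursively redact strings, arrays, and plain objects; other values pass through.
function scrub(value: unknown): unknown {
  if (typeof value === 'string') {
    return REDACTIONS.reduce(
      (text, [pattern, label]) => text.replace(pattern, label),
      value
    );
  }
  if (Array.isArray(value)) return value.map(scrub);
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SECRET_KEYS.test(key) ? '[REDACTED]' : scrub(inner);
    }
    return out;
  }
  return value;
}
```

With Winston, such a function could be wrapped in a custom format placed ahead of `winston.format.json()` so that every record is scrubbed before serialization. Scrubbing in the agent or collector as well gives a second layer for logs that bypass the application logger.
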
## Triển khai / Deployment

```mermaid
graph TD
    subgraph "Kubernetes Monitoring Namespace"
        Grafana[Grafana]
        Prom[Prometheus Server]
        Loki[Loki Gateway]
        Jaeger[Jaeger Collector]
    end

    subgraph "App Namespace"
        App[Application Pods]
        Agent[Grafana Agent / Promtail]
    end

    App -->|Push Logs| Agent
    Agent -->|Push| Loki

    Prom -->|Pull Metrics| App
    Prom -->|Pull Metrics| Agent

    App -->|Push Traces| Jaeger

    Grafana --> Prom
    Grafana --> Loki
    Grafana --> Jaeger

    style Grafana fill:#ffe1e1
    style Prom fill:#d4edda
    style Loki fill:#fff4e1
    style Jaeger fill:#e1f5ff
```

**VI Mô tả Triển khai**:

- **Agent**: Promtail hoặc Grafana Agent chạy như DaemonSet hoặc Sidecar để thu thập logs.
- **Pull Model**: Prometheus scrape metrics từ endpoints `/metrics`.
- **Push Model**: Traces và Logs được push tới collectors.
- **Resources**: Dedicated nodes cho monitoring stack trong production để tránh ảnh hưởng workload chính.

**EN Deployment Description**:

- **Agent**: Promtail or Grafana Agent runs as a DaemonSet or sidecar to collect logs.
- **Pull Model**: Prometheus scrapes metrics from the `/metrics` endpoints.
- **Push Model**: Traces and logs are pushed to their collectors.
- **Resources**: In production, the monitoring stack runs on dedicated nodes so that it does not affect the main workload.

## Tài liệu Liên quan / Related Documentation

- [System Design](./system-design.md) - Kiến trúc tổng thể / Overall architecture
- [Caching Architecture](./caching-architecture.md) - Cache metrics

---

**Cập nhật Lần cuối / Last Updated**: 2026-01-07

**Tác giả / Authors**: GoodGo Architecture Team