pos-system/docs/vi/architecture/observability-architecture.md
Ho Ngoc Hai c851fd97eb docs: Revise architecture and template documentation for GoodGo Platform
- Updated the architecture documentation to enhance clarity with detailed diagrams and descriptions for the GoodGo Microservices Platform.
- Revised the .NET and Node.js template documentation to reflect new naming conventions, project structures, and setup instructions for local development.
- Improved the guide documentation with verification checklists, troubleshooting steps, and real-world examples to assist developers in deploying and managing services effectively.
- Ensured bilingual support in documentation to enhance accessibility for a wider audience.
2026-01-14 12:38:41 +07:00


# Observability Architecture
> **Note**: Comprehensive observability with metrics, logging, and tracing.
## Overview Diagram
```mermaid
graph TD
subgraph "Services"
Service1[Service A]
Service2[Service B]
end
subgraph "Metrics"
Service1 -->|/metrics| Prom[Prometheus]
Service2 -->|/metrics| Prom
Prom --> Grafana[Grafana<br/>Dashboards]
end
subgraph "Logging"
Service1 -->|JSON Logs| Loki
Service2 -->|JSON Logs| Loki
Loki --> GrafanaLogs[Grafana<br/>Log Explorer]
end
subgraph "Tracing"
Service1 -->|Spans| Jaeger
Service2 -->|Spans| Jaeger
Jaeger --> JaegerUI[Jaeger UI]
end
classDef service fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef metrics fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef logging fill:#C05621,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef tracing fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef dashboard fill:#4A5568,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
class Service1,Service2 service;
class Prom metrics;
class Loki logging;
class Jaeger,JaegerUI tracing;
class Grafana,GrafanaLogs dashboard;
```
## System Context
```mermaid
C4Context
title Observability Context Diagram
Person(dev, "Developer", "Uses dashboards to monitor system")
Person(sre, "SRE", "Manages infrastructure & alerts")
System(obs, "Observability Stack", "Prometheus, Loki, Jaeger, Grafana")
System_Ext(service, "Microservices", "Sends telemetry data")
System_Ext(k8s, "Kubernetes", "Sends cluster metrics")
Rel(dev, obs, "Views Dashboards", "HTTPS")
Rel(sre, obs, "Configures Alerts", "HTTPS")
Rel(service, obs, "Push/Pull Telemetry", "HTTP/gRPC")
Rel(k8s, obs, "Exposes Metrics", "HTTP")
UpdateElementStyle(dev, $fontColor="white", $bgColor="#2D3748", $borderColor="white")
UpdateElementStyle(sre, $fontColor="white", $bgColor="#2D3748", $borderColor="white")
UpdateElementStyle(obs, $fontColor="white", $bgColor="#2C5282", $borderColor="white")
UpdateElementStyle(service, $fontColor="white", $bgColor="#4A5568", $borderColor="white")
UpdateElementStyle(k8s, $fontColor="white", $bgColor="#4A5568", $borderColor="white")
```
### Context Description
- **Observability Stack**: The central place where telemetry is collected and visualized (Prometheus, Loki, Jaeger, Grafana).
- **Microservices**: Send logs, metrics, and traces (OpenTelemetry).
- **Developer/SRE**: Use Grafana to monitor system health and to debug.
## The Three Pillars of Observability
### 1. Metrics (Prometheus + Grafana)
```mermaid
graph LR
Service[Service] -->|Expose /metrics| Prom[Prometheus]
Prom -->|Scrape every 15s| Metrics[Time Series DB]
Metrics --> Grafana[Grafana]
Grafana --> Dashboard1[Request Dashboard]
Grafana --> Dashboard2[Error Dashboard]
Grafana --> Dashboard3[Performance Dashboard]
classDef default fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef prom fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef grafana fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
class Prom prom;
class Grafana grafana;
```
**Description**: Numeric measurements over time (requests/sec, latency, errors).
**Implementation**:
```typescript
import type { NextFunction, Request, Response } from 'express';
import { Counter, Histogram, Gauge } from 'prom-client';

// HTTP request metrics
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});

export const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status']
});

export const activeRequests = new Gauge({
  name: 'http_requests_active',
  help: 'Number of active HTTP requests'
});

// Middleware that records metrics for every request
export function metricsMiddleware(req: Request, res: Response, next: NextFunction): void {
  const start = Date.now();
  activeRequests.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Prefer the matched route pattern; fall back to the raw path when no route matched
    const labels = {
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode
    };
    httpRequestDuration.observe(labels, duration);
    httpRequestTotal.inc(labels);
    activeRequests.dec();
  });

  next();
}
```
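
The middleware above falls back to `req.path` when no route matched (for example on 404s), which can create one time series per distinct URL. A minimal sketch of one way to keep the `route` label bounded follows; `normalizeRoute` and its placeholder names are hypothetical and not part of the platform code.

```typescript
// Hypothetical helper: collapse variable path segments so the `route` label
// stays low-cardinality when no route pattern is available.
const UUID_SEGMENT = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_SEGMENT = /^\d+$/;

function normalizeRoute(path: string): string {
  // Drop the query string, if any
  const [pathname = ''] = path.split('?');
  const segments = pathname.split('/').map((segment) => {
    if (NUMERIC_SEGMENT.test(segment)) return ':id';
    if (UUID_SEGMENT.test(segment)) return ':uuid';
    return segment;
  });
  return segments.join('/') || '/';
}
```

Under this assumption, the fallback in the middleware could become `req.route?.path || normalizeRoute(req.path)`.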
### 2. Logging (Serilog + Loki)
```mermaid
sequenceDiagram
participant Service
participant Serilog as Serilog Logger
participant Loki
participant Grafana
Service->>Serilog: Log event
Serilog->>Serilog: Format JSON
Serilog->>Serilog: Add metadata<br/>(correlation ID, trace ID)
Serilog->>Loki: Push logs
Loki->>Loki: Index & store
User->>Grafana: Query logs
Grafana->>Loki: LogQL query
Loki-->>Grafana: Log results
```
**Description**: Structured logging with correlation IDs so that a request can be followed across log lines.
**Implementation (.NET)**:
```csharp
// Program.cs - Serilog configuration
builder.Host.UseSerilog((context, config) => config
    .ReadFrom.Configuration(context.Configuration)
    .Enrich.FromLogContext()
    .Enrich.WithProperty("Service", serviceName)
    .Enrich.WithProperty("Environment", environment)
    .WriteTo.Console(new JsonFormatter())
    .WriteTo.GrafanaLoki(
        "http://loki:3100",
        labels: new[] { new LokiLabel { Key = "app", Value = serviceName } }
    ));

// Middleware - adds a correlation ID to every request
public class CorrelationIdMiddleware
{
    private readonly RequestDelegate _next;
    private readonly ILogger<CorrelationIdMiddleware> _logger;

    // Dependencies are supplied by the ASP.NET Core middleware pipeline
    public CorrelationIdMiddleware(RequestDelegate next, ILogger<CorrelationIdMiddleware> logger)
    {
        _next = next;
        _logger = logger;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        // Reuse the caller's ID when present, otherwise generate a new one
        var correlationId = context.Request.Headers["X-Correlation-Id"].FirstOrDefault()
            ?? Guid.NewGuid().ToString();
        context.Items["CorrelationId"] = correlationId;
        context.Response.Headers["X-Correlation-Id"] = correlationId;

        // Every log event written inside this scope carries the CorrelationId property
        using (LogContext.PushProperty("CorrelationId", correlationId))
        {
            _logger.LogInformation("Request started: {Method} {Path}",
                context.Request.Method, context.Request.Path);
            var sw = Stopwatch.StartNew();
            await _next(context);
            sw.Stop();
            _logger.LogInformation("Request completed: {StatusCode} in {Duration}ms",
                context.Response.StatusCode, sw.ElapsedMilliseconds);
        }
    }
}
```
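
For Node.js services, the same reuse-or-generate rule for `X-Correlation-Id` might look like the sketch below. The names `resolveCorrelationId`, `logLine`, and `fallbackId` are hypothetical, and the ID generator is a stand-in for illustration only.

```typescript
type IncomingHeaders = Record<string, string | string[] | undefined>;

// Hypothetical ID generator; a real service would more likely use a UUID library.
function fallbackId(): string {
  return Math.random().toString(16).slice(2) + Date.now().toString(16);
}

// Reuse the caller's X-Correlation-Id when present, otherwise create one.
// Node.js lower-cases incoming header names, hence the lower-case key.
function resolveCorrelationId(
  headers: IncomingHeaders,
  generate: () => string = fallbackId,
): string {
  const incoming = headers['x-correlation-id'];
  const value = Array.isArray(incoming) ? incoming[0] : incoming;
  return value && value.trim() !== '' ? value : generate();
}

// Build one structured (JSON) log line that carries the correlation ID.
function logLine(
  correlationId: string,
  message: string,
  fields: Record<string, unknown> = {},
): string {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    correlationId,
    message,
    ...fields,
  });
}
```

Because each line is a JSON object with a `correlationId` field, all lines of one request can be selected in Grafana by filtering on that field.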
### 3. Tracing (OpenTelemetry + Jaeger)
```mermaid
graph LR
Request[Incoming Request] --> Trace[Create Trace]
Trace --> SpanA[Span: HTTP Request]
SpanA --> SpanB[Span: DB Query]
SpanA --> SpanC[Span: Cache Check]
SpanA --> SpanD[Span: External API]
SpanB --> Jaeger[Jaeger]
SpanC --> Jaeger
SpanD --> Jaeger
Jaeger --> Timeline[Trace Timeline]
classDef default fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef trace fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef jaeger fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
class Trace trace;
class Jaeger jaeger;
```
**Description**: Distributed tracing to follow requests across services.
> [!NOTE]
> Distributed tracing with Jaeger is planned but not yet deployed. For now, correlation IDs are used for request tracking.
**Implementation (.NET with OpenTelemetry)**:
```csharp
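// Program.cs - OpenTelemetry configuration (planned)
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        // Register the source used for manual spans; "UserService" is a placeholder name
        .AddSource("UserService")
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        // Note: the dedicated Jaeger exporter package is deprecated upstream;
        // Jaeger also accepts OTLP, so AddOtlpExporter is an alternative.
        .AddJaegerExporter(options =>
        {
            options.AgentHost = "jaeger";
            options.AgentPort = 6831;
        }));

// Manual span creation
// The source name must match the one passed to AddSource above (placeholder name)
private static readonly ActivitySource ActivitySource = new("UserService");

public async Task<User?> GetUserByIdAsync(Guid userId, CancellationToken ct)
{
    // StartActivity returns null when no listener is sampling, hence the ?. calls
    using var activity = ActivitySource.StartActivity("GetUserById");
    activity?.SetTag("user.id", userId.ToString());
    try
    {
        var user = await _context.Users.FindAsync([userId], ct);
        activity?.SetStatus(ActivityStatusCode.Ok);
        return user;
    }
    catch (Exception ex)
    {
        activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
        throw;
    }
}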
```
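
Tracing across services only works if each hop forwards the trace identity. OpenTelemetry propagates it by default in the W3C `traceparent` HTTP header. The sketch below parses that header to show what travels between services; `parseTraceparent` and `TraceContext` are illustrative names, and in practice the OpenTelemetry SDK handles this step.

```typescript
interface TraceContext {
  traceId: string;  // 32 hex characters, shared by every span in the trace
  parentId: string; // 16 hex characters, the caller's span ID
  sampled: boolean; // lowest bit of the trace-flags byte
}

// Shape of a traceparent header: version-traceId-parentId-flags
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;
  const [, version = '', traceId = '', parentId = '', flags = ''] = match;
  // Version ff and all-zero IDs are invalid per the W3C Trace Context spec.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

A service that receives a valid header creates its spans under the same `traceId`, which is what lets Jaeger assemble one timeline from spans emitted by different services.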
## Health Checks
```csharp
// Health check registration (must run before builder.Build())
builder.Services.AddHealthChecks()
    .AddNpgSql(connectionString, name: "database", tags: new[] { "ready" })
    .AddRedis(redisConnectionString, name: "redis", tags: new[] { "ready" });

// Health check endpoints (.NET)
app.MapHealthChecks("/health", new HealthCheckOptions
{
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    // Liveness: run no checks, so the endpoint reports Healthy whenever the process responds
    Predicate = _ => false
});
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    // Readiness: run only the dependency checks tagged "ready"
    Predicate = check => check.Tags.Contains("ready")
});
```
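
The split between `/health/live` and `/health/ready` comes down to which checks the predicate selects. The sketch below reproduces that selection logic in TypeScript as an illustration. `HealthCheck` and `evaluate` are hypothetical names that mirror, but do not come from, the ASP.NET Core health check API.

```typescript
type HealthStatus = 'Healthy' | 'Unhealthy';

interface HealthCheck {
  name: string;
  tags: string[];
  run: () => Promise<boolean>; // resolves true when the dependency is reachable
}

// Run only the checks selected by `predicate` and aggregate the result.
// With no selected checks the report is Healthy, which is what a liveness probe wants.
async function evaluate(
  checks: HealthCheck[],
  predicate: (check: HealthCheck) => boolean,
): Promise<{ status: HealthStatus; entries: Record<string, HealthStatus> }> {
  const entries: Record<string, HealthStatus> = {};
  for (const check of checks.filter(predicate)) {
    let ok = false;
    try {
      ok = await check.run();
    } catch {
      ok = false; // a throwing check counts as unhealthy
    }
    entries[check.name] = ok ? 'Healthy' : 'Unhealthy';
  }
  const allHealthy = Object.values(entries).every((s) => s === 'Healthy');
  const status: HealthStatus = allHealthy ? 'Healthy' : 'Unhealthy';
  return { status, entries };
}
```

With this model, `evaluate(checks, () => false)` corresponds to the liveness endpoint, and `evaluate(checks, (c) => c.tags.includes('ready'))` to the readiness endpoint. A failing database then takes the pod out of rotation without causing a restart.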
## Alerting Rules
```yaml
# Prometheus alerting rules
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # High error rate: share of 5xx responses among all responses, per job
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (> 5%)"
      # High latency: P95 computed from per-second bucket rates, per job
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"
      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
```
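
The `HighErrorRate` alert is described as a percentage of failed requests, and `HighLatency` relies on `histogram_quantile`. The sketch below shows the arithmetic behind both, assuming cumulative buckets like those exposed by `http_request_duration_seconds`. The function names are hypothetical, and the interpolation is a simplified version of what Prometheus does.

```typescript
interface Bucket {
  le: number;    // upper bound of the bucket in seconds (Infinity for +Inf)
  count: number; // cumulative count of observations <= le
}

// Estimate a quantile from cumulative buckets by interpolating linearly inside
// the bucket that contains the target rank.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1]?.count ?? 0;
  if (total === 0) return NaN;
  const rank = q * total;
  let lowerBound = 0;
  let lowerCount = 0;
  for (const bucket of sorted) {
    if (bucket.count >= rank) {
      // The +Inf bucket has no usable upper bound; fall back to the last finite one.
      if (!Number.isFinite(bucket.le)) return lowerBound;
      const inBucket = bucket.count - lowerCount;
      if (inBucket === 0) return bucket.le;
      return lowerBound + (bucket.le - lowerBound) * ((rank - lowerCount) / inBucket);
    }
    lowerBound = bucket.le;
    lowerCount = bucket.count;
  }
  return lowerBound;
}

// Share of 5xx responses among all responses in the same window.
function errorRatio(errors: number, total: number): number {
  return total === 0 ? 0 : errors / total;
}
```

Because the estimate can only interpolate between bucket boundaries, the bucket list in the metrics definition determines how precise a P95 alert threshold can be.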
## Performance Characteristics
### Performance Targets
| Metric | Target | Notes |
|--------|--------|-------|
| **Metric Scrape Interval** | 15s | Critical services |
| **Log Ingestion Latency** | < 1s | Time from emit to queryable |
| **Trace Sampling Rate** | 10% | Production (100% in Dev/Staging) |
| **Dashboard Load Time** | < 2s | P95 Latency |
| **Alert Evaluation** | Every 1m | Evaluation interval |
| **Retention Policy** | 14 days | Logs & Traces (Metrics: 30 days) |
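
The 10% trace sampling target only yields complete traces if every service makes the same keep-or-drop decision for a given trace. One common approach is to derive the decision from the trace ID itself. The sketch below illustrates the idea; `shouldSample` is a hypothetical function and does not reproduce the exact algorithm of any OpenTelemetry sampler.

```typescript
// Decide from the trace ID alone, so every service reaches the same verdict
// for a given trace without coordinating.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the last 8 hex characters as an unsigned 32-bit number.
  const bucket = parseInt(traceId.slice(-8), 16);
  if (Number.isNaN(bucket)) return false; // malformed ID: do not sample
  return bucket < ratio * 0x100000000;
}
```

Under these assumptions, production would call `shouldSample(traceId, 0.1)`, and Dev/Staging would use a ratio of `1`.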
## Security Considerations
### Observability Security
- **Log Scrubbing**: Automatically removes PII (emails, SSNs, credit card numbers) and secrets from logs before ingestion.
- **Access Control**: Grafana is integrated with OAuth2/OIDC, with Viewer/Editor/Admin roles.
- **Network Policy**: Only traffic from internal namespaces may reach the ingestion ports (9090, 3100, 14268).
- **TLS**: Encrypts traffic between agents and collectors.
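
As an illustration of the log scrubbing item, the sketch below masks emails, US-style SSNs, and card-like digit runs before a message is logged. The patterns and the `scrub` function are assumptions for demonstration. They will produce false positives (any 13 to 16 digit number looks like a card) and are not a vetted rule set.

```typescript
// Illustrative patterns only; production scrubbing usually needs a reviewed rule set.
const SCRUB_RULES: Array<{ pattern: RegExp; replacement: string }> = [
  { pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, replacement: '[EMAIL]' },
  { pattern: /\b\d{3}-\d{2}-\d{4}\b/g, replacement: '[SSN]' },
  // 13 to 16 digits, optionally separated by single spaces or hyphens
  { pattern: /\b\d(?:[ -]?\d){12,15}\b/g, replacement: '[CARD]' },
];

// Apply every rule in order and return the masked message.
function scrub(message: string): string {
  return SCRUB_RULES.reduce(
    (text, rule) => text.replace(rule.pattern, rule.replacement),
    message,
  );
}
```

Running the scrubber inside the service, before the log line is emitted, keeps sensitive values out of both the console output and Loki.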
## Deployment
```mermaid
graph TD
subgraph "Kubernetes Monitoring Namespace"
Grafana[Grafana]
Prom[Prometheus Server]
Loki[Loki Gateway]
Jaeger[Jaeger Collector]
end
subgraph "App Namespace"
App[Application Pods]
Agent[Grafana Agent / Promtail]
end
App -->|Push Logs| Agent
Agent -->|Push| Loki
Prom -->|Pull Metrics| App
Prom -->|Pull Metrics| Agent
App -->|Push Traces| Jaeger
Grafana --> Prom
Grafana --> Loki
Grafana --> Jaeger
classDef k8s fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef app fill:#4A5568,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef grafana fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef loki fill:#C05621,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef jaeger fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef prom fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
class Grafana grafana;
class Loki loki;
class Jaeger jaeger;
class Prom prom;
class App,Agent app;
```
**Deployment Description**:
- **Agent**: Promtail or Grafana Agent runs as a DaemonSet or sidecar to collect logs.
- **Pull Model**: Prometheus scrapes metrics from `/metrics` endpoints.
- **Push Model**: Traces and logs are pushed to collectors.
- **Resources**: Dedicated nodes host the monitoring stack in production so that it does not affect the main workload.
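
To make the push model concrete, the sketch below builds the JSON body that a log shipper sends to Loki's push endpoint (`POST /loki/api/v1/push`). `buildLokiPayload` and its input shape are hypothetical. In this architecture the agent or the Serilog sink would normally produce this payload, not application code.

```typescript
interface LokiPushPayload {
  streams: Array<{
    stream: Record<string, string>;  // label set identifying the stream
    values: Array<[string, string]>; // [timestamp in nanoseconds, log line]
  }>;
}

// Build the JSON body for Loki's push endpoint.
// Loki expects each timestamp as a string of Unix epoch nanoseconds.
function buildLokiPayload(
  labels: Record<string, string>,
  entries: Array<{ timestampMs: number; line: string }>,
): LokiPushPayload {
  return {
    streams: [
      {
        stream: labels,
        values: entries.map((e): [string, string] => [
          `${Math.trunc(e.timestampMs)}000000`, // milliseconds to nanoseconds
          e.line,
        ]),
      },
    ],
  };
}
```

Labels such as `app` should stay few and low-cardinality. Per-request details belong in the JSON log line, not in the stream labels.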
## Related Documentation
- [System Design](./system-design.md) - Overall architecture
- [Caching Architecture](./caching-architecture.md) - Cache metrics
## Quick Tips
### Mermaid Common Issues
| Issue | Solution |
|-------|----------|
| **Parse Error** | Check for special characters like `()` or `[]` inside node text without quotes. Use `"text"` for complex strings. |
| **Color Not Showing** | Ensure `style` or `classDef` definitions are correct and IDs match. |
| **Arrow Direction** | `TD` = Top-Down, `LR` = Left-Right. Choose appropriately for layout. |
### Color Pattern Quick Reference
| Element | Color | Hex | Use Case |
|---------|-------|-----|----------|
| **Primary** | Dark Blue | `#2D3748` | System components, core services |
| **Secondary** | Grey | `#4A5568` | Supporting modules, libraries |
| **Accent** | Blue | `#2C5282` | Databases, external APIs |
| **Highlight** | Teal | `#285E61` | User interactions, highlights |
| **Success** | Green | `#2F855A` | Successful states, active |
| **Warning** | Orange | `#C05621` | Warning/Caution states |
| **Error** | Red | `#C53030` | Error states, failures |
### Visual Indicators
| Indicator | Meaning |
|-----------|---------|
| 🟢 | Safe / Recommended |
| 🟡 | Warning / Caution |
| 🔴 | Danger / Anti-pattern |
| 💡 | Tip / Best Practice |
---
**Last Updated**: 2026-01-14
**Author**: GoodGo Architecture Team