pos-system/docs/en/skills/observability-monitoring.md
Ho Ngoc Hai 2640b351c3 Enhance documentation with detailed diagrams and structured flows
- Added request/response flow diagrams to api-design and api-gateway-advanced skills for better visualization of processes.
- Introduced configuration loading flow in configuration-management skill to clarify the configuration process.
- Included error propagation flow in error-handling-patterns skill to illustrate error handling across layers.
- Enhanced various skills with additional diagrams to improve understanding of complex concepts.

These updates aim to provide clearer guidance and improve the overall documentation experience for developers.
2026-01-01 23:22:54 +07:00


---
name: observability-monitoring
description: Observability and monitoring patterns for GoodGo microservices. Use when adding metrics, implementing logging, setting up tracing, creating health checks, or debugging production issues.
---
# Observability & Monitoring Patterns
## When to Use This Skill
Use this skill when:
- Setting up logging infrastructure
- Implementing metrics collection
- Adding distributed tracing
- Creating health check endpoints
- Setting up monitoring dashboards
- Debugging production issues
- Implementing alerting rules
- Analyzing performance bottlenecks
## Core Concepts
### Three Pillars of Observability
1. **Logs**: Event records for debugging
2. **Metrics**: Numerical measurements over time
3. **Traces**: Request flow across services
### Tech Stack
- **Logging**: Winston, Pino
- **Metrics**: Prometheus + Grafana
- **Tracing**: OpenTelemetry + Jaeger
- **APM**: Datadog or New Relic (optional)
### Observability Stack Architecture
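Each service emits all three signals. Logs are collected by Loki, metrics are scraped by Prometheus, and traces are exported to Jaeger; Grafana queries all three backends, and Prometheus forwards firing alerts to AlertManager: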
```mermaid
graph TB
subgraph "Application Layer"
App[Microservice]
end
subgraph "Three Pillars of Observability"
Logs[Logs<br/>Winston/Pino]
Metrics[Metrics<br/>Prometheus]
Traces[Traces<br/>OpenTelemetry]
end
subgraph "Aggregation & Storage"
Loki[Loki<br/>Log Aggregation]
Prom[Prometheus<br/>Metrics Storage]
Jaeger[Jaeger<br/>Trace Storage]
end
subgraph "Visualization & Alerting"
Grafana[Grafana<br/>Dashboards]
AlertManager[AlertManager<br/>Alerts]
end
App -->|Structured Logs| Logs
App -->|HTTP Metrics| Metrics
App -->|Distributed Spans| Traces
Logs -->|Collect| Loki
Metrics -->|Scrape /metrics| Prom
Traces -->|Export| Jaeger
Loki -->|Query| Grafana
Prom -->|Query| Grafana
Prom -->|Alerts| AlertManager
Jaeger -->|Query| Grafana
style App fill:#e1f5ff
style Logs fill:#fff4e1
style Metrics fill:#e1ffe1
style Traces fill:#ffe1f5
style Grafana fill:#e1e1ff
```
## Structured Logging
```typescript
// src/lib/logger.ts
import winston from 'winston';
import type { Request, Response, NextFunction } from 'express';

// JSON output with timestamps and error stack traces
const logFormat = winston.format.combine(
  winston.format.timestamp(),
  winston.format.errors({ stack: true }),
  winston.format.json()
);

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: logFormat,
  defaultMeta: {
    service: process.env.SERVICE_NAME || 'unknown',
    environment: process.env.NODE_ENV || 'development'
  },
  transports: [
    new winston.transports.Console({
      // Human-readable output locally, JSON everywhere else
      format: process.env.NODE_ENV === 'development'
        ? winston.format.combine(
            winston.format.colorize(),
            winston.format.simple()
          )
        : logFormat
    }),
    // Production: Send to log aggregation service
    ...(process.env.NODE_ENV === 'production'
      ? [new winston.transports.Http({
          host: 'logs.example.com',
          path: '/collect',
          ssl: true
        })]
      : [])
  ]
});

// Request logger middleware: emits one structured entry per completed request
export const requestLogger = (req: Request, res: Response, next: NextFunction) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start; // milliseconds
    logger.info('HTTP Request', {
      method: req.method,
      url: req.url,
      status: res.statusCode,
      duration,
      ip: req.ip,
      userAgent: req.get('user-agent'),
      correlationId: req.headers['x-correlation-id']
    });
  });
  next();
};
```
### Logging Flow
Every log entry for a request carries the same correlation ID, so the entries can be joined later in the log aggregator:
```mermaid
sequenceDiagram
participant Client
participant Service as Microservice
participant Logger as Winston/Pino Logger
participant Aggregator as Log Aggregator<br/>(Loki)
participant Dashboard as Grafana Dashboard
Client->>Service: HTTP Request<br/>(with x-correlation-id)
Service->>Service: Generate/Extract<br/>Correlation ID
Service->>Logger: Log Request Start<br/>{correlationId, method, url}
Service->>Service: Process Request
Service->>Logger: Log Business Event<br/>{correlationId, event, data}
Service->>Client: HTTP Response<br/>(with x-correlation-id)
Service->>Logger: Log Request End<br/>{correlationId, status, duration}
Logger->>Aggregator: Send Structured Logs<br/>(JSON format)
Aggregator->>Dashboard: Index & Store Logs
Dashboard->>Dashboard: Query by correlationId<br/>to trace request flow
```
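The "Generate/Extract Correlation ID" step in the diagram is not implemented by `requestLogger`, which only reads the incoming header. The following is a minimal sketch of that step, assuming the `x-correlation-id` header name from the diagram. The helper name `resolveCorrelationId`, the 128-character cap, and the injected `generate` callback are illustrative choices rather than existing project code.

```typescript
// Hypothetical helper: reuse the caller's correlation ID when present, otherwise mint one.
type HeaderValue = string | string[] | undefined;

// Assumed cap so an oversized client-supplied value cannot bloat every log line
const MAX_ID_LENGTH = 128;

export function resolveCorrelationId(
  headers: Record<string, HeaderValue>,
  generate: () => string
): string {
  const raw = headers['x-correlation-id'];
  // Node exposes repeated headers as arrays; take the first value
  const candidate = Array.isArray(raw) ? raw[0] : raw;
  const trimmed = candidate?.trim();
  if (trimmed && trimmed.length <= MAX_ID_LENGTH) {
    return trimmed;
  }
  // Missing, empty, or oversized header: fall back to a fresh ID
  return generate();
}
```

In an Express middleware this could be called with `req.headers` and a generator such as `randomUUID` from `node:crypto`, with the result echoed back through `res.setHeader('x-correlation-id', id)` before calling `next()`.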
## Metrics Collection
```typescript
// src/lib/metrics.ts
import { Registry, Counter, Histogram, Gauge } from 'prom-client';
import type { Request, Response, NextFunction } from 'express';

export const register = new Registry();

// HTTP metrics
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});

export const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status']
});

// Business metrics
export const userRegistrations = new Counter({
  name: 'user_registrations_total',
  help: 'Total number of user registrations',
  labelNames: ['type']
});

export const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Number of active users',
  labelNames: ['status']
});

// Register metrics
register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
register.registerMetric(userRegistrations);
register.registerMetric(activeUsers);

// Metrics middleware
export const metricsMiddleware = (req: Request, res: Response, next: NextFunction) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000; // seconds, matching the metric name
    // Use the matched route pattern; a fixed fallback keeps unmatched paths
    // (404s, scanners) from creating unbounded label values
    const route = req.route?.path || 'unmatched';
    httpRequestDuration
      .labels(req.method, route, res.statusCode.toString())
      .observe(duration);
    httpRequestTotal
      .labels(req.method, route, res.statusCode.toString())
      .inc();
  });
  next();
};

// Metrics endpoint
export const metricsHandler = async (req: Request, res: Response) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
};
```
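The `route` label is the main cardinality risk in the middleware above: any raw path containing an ID produces a new time series per ID. Where no matched route pattern is available, one option is to normalize the path before using it as a label. The sketch below is an assumption-laden illustration: `normalizeRouteLabel` is a hypothetical helper, and treating numeric and UUID segments as IDs is a heuristic that may not fit every URL scheme.

```typescript
// Hypothetical helper: collapse ID-like path segments so the label set stays bounded.
const UUID_SEGMENT = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_SEGMENT = /^\d+$/;

export function normalizeRouteLabel(path: string): string {
  // Drop the query string; it never belongs in a metric label
  const [pathname = ''] = path.split('?');
  const segments = pathname
    .split('/')
    .map((segment) =>
      NUMERIC_SEGMENT.test(segment) || UUID_SEGMENT.test(segment) ? ':id' : segment
    );
  // An empty input normalizes to the root path
  return segments.join('/') || '/';
}
```

With this heuristic, `/users/42/orders?page=2` and `/users/97/orders` both map to `/users/:id/orders`, so they share one time series per method and status.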
### Metrics Collection Flow
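Each service records metrics in an in-process registry and exposes them at `/metrics`. Prometheus scrapes that endpoint on an interval (15s in the diagram), stores the time series, and serves queries from Grafana and alert rule evaluation: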
```mermaid
graph LR
subgraph "Service Instance"
App[Application]
Middleware[Metrics Middleware]
Registry[Prometheus Registry]
Endpoint["/metrics Endpoint"]
end
subgraph "Metrics Types"
Counter[Counter<br/>http_requests_total]
Gauge[Gauge<br/>active_users]
Histogram[Histogram<br/>request_duration]
end
subgraph "Collection"
Prometheus[Prometheus<br/>Scraper]
end
subgraph "Storage & Query"
PromDB[(Prometheus<br/>Time Series DB)]
end
subgraph "Visualization"
Grafana[Grafana<br/>Dashboards]
Alerts[AlertManager<br/>Rules]
end
App -->|HTTP Request| Middleware
Middleware -->|Record| Counter
Middleware -->|Record| Histogram
App -->|Update| Gauge
Counter --> Registry
Gauge --> Registry
Histogram --> Registry
Registry --> Endpoint
Prometheus -->|Scrape every 15s| Endpoint
Prometheus -->|Store| PromDB
PromDB -->|Query| Grafana
PromDB -->|Evaluate| Alerts
Alerts -->|Trigger| Grafana
style App fill:#e1f5ff
style Prometheus fill:#ffe1e1
style Grafana fill:#e1e1ff
```
## Distributed Tracing
```typescript
// src/lib/tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
// Note: @opentelemetry/exporter-jaeger is deprecated upstream; recent Jaeger
// releases accept OTLP directly, so an OTLP exporter may be preferable for new setups.
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
import { trace, SpanStatusCode } from '@opentelemetry/api';

export const initTracing = () => {
  const jaegerExporter = new JaegerExporter({
    endpoint: process.env.JAEGER_ENDPOINT || 'http://localhost:14268/api/traces',
  });
  const sdk = new NodeSDK({
    resource: new Resource({
      [SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME || 'unknown',
      [SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
    }),
    traceExporter: jaegerExporter,
    instrumentations: [getNodeAutoInstrumentations()]
  });
  sdk.start();

  // Flush pending spans before the process exits
  process.on('SIGTERM', () => {
    sdk.shutdown()
      .then(() => console.log('Tracing terminated'))
      .catch((error) => console.log('Error terminating tracing', error))
      .finally(() => process.exit(0));
  });
};

// Custom span creation
export const tracedOperation = async <T>(name: string, fn: () => Promise<T>): Promise<T> => {
  const tracer = trace.getTracer('application');
  // startActiveSpan makes the span current, so spans created inside fn become its children
  return tracer.startActiveSpan(name, async (span) => {
    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      // Caught values are not guaranteed to be Error instances
      const err = error instanceof Error ? error : new Error(String(error));
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: err.message
      });
      span.recordException(err);
      throw error;
    } finally {
      span.end();
    }
  });
};
```
### Distributed Tracing Flow
Distributed tracing tracks requests across multiple services using OpenTelemetry:
```mermaid
sequenceDiagram
participant Client
participant Gateway as API Gateway
participant ServiceA as Service A<br/>(User Service)
participant ServiceB as Service B<br/>(Order Service)
participant DB as Database
participant Jaeger as Jaeger<br/>Collector
Client->>Gateway: Request<br/>(Trace ID: abc123)
Gateway->>Gateway: Create Root Span<br/>Span: gateway.request
Gateway->>ServiceA: HTTP Call<br/>(Trace ID: abc123,<br/>Span ID: span-1)
ServiceA->>ServiceA: Create Child Span<br/>Span: user.getById
ServiceA->>DB: Query User<br/>(Trace ID: abc123,<br/>Span ID: span-2)
DB-->>ServiceA: User Data
ServiceA->>ServiceA: End Span span-2
ServiceA->>ServiceB: HTTP Call<br/>(Trace ID: abc123,<br/>Span ID: span-3)
ServiceB->>ServiceB: Create Child Span<br/>Span: order.getByUserId
ServiceB->>DB: Query Orders<br/>(Trace ID: abc123,<br/>Span ID: span-4)
DB-->>ServiceB: Orders Data
ServiceB->>ServiceB: End Span span-4
ServiceB->>ServiceB: End Span span-3
ServiceB-->>ServiceA: Response
ServiceA->>ServiceA: End Span span-1
ServiceA-->>Gateway: Response
Gateway->>Gateway: End Span gateway.request
Gateway-->>Client: Final Response
ServiceA->>Jaeger: Export Spans<br/>(Trace ID: abc123)
ServiceB->>Jaeger: Export Spans<br/>(Trace ID: abc123)
Gateway->>Jaeger: Export Spans<br/>(Trace ID: abc123)
Note over Jaeger: All spans linked by<br/>Trace ID: abc123
```
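The trace ID and span ID that accompany each HTTP call in the diagram travel in a request header. With OpenTelemetry's default propagator this is the W3C Trace Context `traceparent` header, and auto-instrumentation reads and writes it without application code. The sketch below only illustrates the version `00` wire format for debugging purposes; `parseTraceparent`, `formatTraceparent`, and the `TraceContext` shape are hypothetical names, not part of the OpenTelemetry API.

```typescript
// Illustrative parser for the W3C `traceparent` header (version 00 only).
export interface TraceContext {
  traceId: string;       // 32 lowercase hex characters, shared by every span in the trace
  parentSpanId: string;  // 16 lowercase hex characters, the caller's span
  sampled: boolean;      // lowest bit of the trace-flags byte
}

// Layout: version "00", trace-id, parent-id, trace-flags, separated by dashes
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
const ALL_ZEROS = /^0+$/;

export function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;
  const [, traceId, parentSpanId, flags] = match;
  // All-zero IDs are defined as invalid
  if (ALL_ZEROS.test(traceId) || ALL_ZEROS.test(parentSpanId)) return null;
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

export function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentSpanId}-${ctx.sampled ? '01' : '00'}`;
}
```

A downstream service keeps the `traceId`, creates its own span ID, and sends that new span ID as the parent in outgoing calls, which is how Jaeger can link all spans under one trace.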
## Health Checks
```typescript
// src/modules/health/health.controller.ts
import type { Request, Response } from 'express';
import { PrismaClient } from '@prisma/client';
// Assumes ioredis; adjust the import if a different Redis client is used
import Redis from 'ioredis';

export class HealthController {
  constructor(
    private prisma: PrismaClient,
    private redis: Redis
  ) {}

  // Liveness probe - is the service running?
  async liveness(req: Request, res: Response) {
    res.json({
      status: 'ok',
      timestamp: new Date().toISOString()
    });
  }

  // Readiness probe - is the service ready for traffic?
  async readiness(req: Request, res: Response) {
    const checks = await this.runHealthChecks();
    // Only 'unhealthy' fails the probe; 'warning' (e.g. high memory) is reported but tolerated
    const isHealthy = Object.values(checks).every(check => check.status !== 'unhealthy');
    res.status(isHealthy ? 200 : 503).json({
      status: isHealthy ? 'ready' : 'not ready',
      checks,
      timestamp: new Date().toISOString()
    });
  }

  // Detailed health check
  async health(req: Request, res: Response) {
    const checks = await this.runHealthChecks();
    const isHealthy = Object.values(checks).every(check => check.status !== 'unhealthy');
    res.status(isHealthy ? 200 : 503).json({
      status: isHealthy ? 'healthy' : 'unhealthy',
      version: process.env.SERVICE_VERSION || '1.0.0',
      uptime: process.uptime(),
      checks,
      timestamp: new Date().toISOString()
    });
  }

  private async runHealthChecks() {
    const checks: Record<string, any> = {};

    // Database check
    try {
      const start = Date.now();
      await this.prisma.$queryRaw`SELECT 1`;
      checks.database = {
        status: 'healthy',
        responseTime: Date.now() - start
      };
    } catch (error) {
      checks.database = {
        status: 'unhealthy',
        error: error instanceof Error ? error.message : String(error)
      };
    }

    // Redis check
    try {
      const start = Date.now();
      await this.redis.ping();
      checks.redis = {
        status: 'healthy',
        responseTime: Date.now() - start
      };
    } catch (error) {
      checks.redis = {
        status: 'unhealthy',
        error: error instanceof Error ? error.message : String(error)
      };
    }

    // Memory check (values reported in MiB; 500 MiB heap threshold)
    const memUsage = process.memoryUsage();
    checks.memory = {
      status: memUsage.heapUsed < 500 * 1024 * 1024 ? 'healthy' : 'warning',
      heapUsed: Math.round(memUsage.heapUsed / 1024 / 1024),
      heapTotal: Math.round(memUsage.heapTotal / 1024 / 1024),
      rss: Math.round(memUsage.rss / 1024 / 1024)
    };
    return checks;
  }
}
```
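The memory check reports `warning` rather than `unhealthy`, which raises the question of how a warning should affect the probe result. One possible policy, sketched below, is a three-state summary in which only `unhealthy` takes the instance out of rotation. The function name `summarizeChecks` and the `degraded` state are assumptions for illustration; the right policy depends on how the orchestrator reacts to a failing probe.

```typescript
// Hypothetical policy: warnings degrade the reported status but keep the probe passing.
type CheckStatus = 'healthy' | 'warning' | 'unhealthy';
type OverallStatus = 'healthy' | 'degraded' | 'unhealthy';

export interface HealthSummary {
  overall: OverallStatus;
  httpStatus: 200 | 503;
}

export function summarizeChecks(
  checks: Record<string, { status: CheckStatus }>
): HealthSummary {
  const statuses = Object.values(checks).map((check) => check.status);
  // Any hard failure wins over everything else
  if (statuses.includes('unhealthy')) {
    return { overall: 'unhealthy', httpStatus: 503 };
  }
  // Warnings are surfaced in the body but still return 200
  if (statuses.includes('warning')) {
    return { overall: 'degraded', httpStatus: 200 };
  }
  return { overall: 'healthy', httpStatus: 200 };
}
```

Keeping the decision in one pure function makes the policy easy to unit test and to change without touching the individual dependency checks.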
## Error Tracking
```typescript
// src/lib/error-tracking.ts
import * as Sentry from '@sentry/node';
import type { Request, Response, NextFunction } from 'express';
import { logger } from './logger';

export const initErrorTracking = () => {
  if (process.env.SENTRY_DSN) {
    Sentry.init({
      dsn: process.env.SENTRY_DSN,
      environment: process.env.NODE_ENV,
      tracesSampleRate: 0.1,
      beforeSend(event, hint) {
        // Filter sensitive data
        if (event.request?.cookies) {
          delete event.request.cookies;
        }
        return event;
      }
    });
  }
};

// Error handler middleware (register after all routes)
export const errorHandler = (
  err: Error,
  req: Request,
  res: Response,
  next: NextFunction
) => {
  // Log error
  logger.error('Unhandled error', {
    error: err.message,
    stack: err.stack,
    url: req.url,
    method: req.method,
    correlationId: req.headers['x-correlation-id']
  });

  // Report to Sentry
  Sentry.captureException(err, {
    tags: {
      service: process.env.SERVICE_NAME
    },
    user: {
      // Assumes auth middleware has augmented Request with a `user` property
      id: req.user?.id
    }
  });

  // If the response has already started, delegate to Express's default handler
  if (res.headersSent) {
    return next(err);
  }

  // Send response
  res.status(500).json({
    success: false,
    error: {
      code: 'INTERNAL_ERROR',
      message: process.env.NODE_ENV === 'production'
        ? 'Internal server error'
        : err.message
    }
  });
};
```
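The `beforeSend` hook above strips cookies, but request bodies and log metadata can carry secrets under other keys. A minimal sketch of a recursive redaction helper follows. The key list, the `[REDACTED]` placeholder, the depth limit, and the name `redact` are all assumptions; it handles plain JSON-like data only and is not a substitute for a reviewed data-handling policy.

```typescript
// Hypothetical helper: mask values stored under sensitive-looking keys before logging or reporting.
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'cookie', 'secret']);
const MAX_DEPTH = 5; // assumed limit to avoid runaway recursion on deep or cyclic data

export function redact(value: unknown, depth = 0): unknown {
  // Primitives pass through unchanged
  if (value === null || typeof value !== 'object') return value;
  // Beyond the depth limit, drop the subtree rather than risk leaking it
  if (depth > MAX_DEPTH) return '[TRUNCATED]';
  if (Array.isArray(value)) {
    return value.map((item) => redact(item, depth + 1));
  }
  const result: Record<string, unknown> = {};
  for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
    // Key comparison is case-insensitive, so "Authorization" is also masked
    result[key] = SENSITIVE_KEYS.has(key.toLowerCase())
      ? '[REDACTED]'
      : redact(inner, depth + 1);
  }
  return result;
}
```

The helper returns a copy and leaves its input untouched, so it could be applied to metadata just before a `logger.error` call or inside `beforeSend` without affecting the live request object.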
## Performance Monitoring
```typescript
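// src/middlewares/performance.middleware.ts
import type { Request, Response, NextFunction } from 'express';
import { logger } from '../lib/logger';

const SLOW_REQUEST_THRESHOLD_MS = 1000;

export const performanceMiddleware = (req: Request, res: Response, next: NextFunction) => {
  const start = process.hrtime.bigint();
  res.on('finish', () => {
    const end = process.hrtime.bigint();
    const duration = Number(end - start) / 1e6; // Convert nanoseconds to milliseconds
    // Log slow requests
    if (duration > SLOW_REQUEST_THRESHOLD_MS) {
      logger.warn('Slow request detected', {
        method: req.method,
        url: req.url,
        duration,
        threshold: SLOW_REQUEST_THRESHOLD_MS
      });
    }
    // Headers are already sent when 'finish' fires, so an X-Response-Time header
    // cannot be set here; it would have to be written before the response is flushed.
  });
  next();
};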
```
## Grafana Dashboard Config
```json
{
  "dashboard": {
    "title": "Service Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "sum(rate(http_requests_total[5m]))"
        }]
      },
      {
        "title": "Request Duration (p95)",
        "targets": [{
          "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
        }]
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"
        }]
      },
      {
        "title": "Active Users",
        "targets": [{
          "expr": "active_users"
        }]
      }
    ]
  }
}
```
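The Request Duration panel depends on `histogram_quantile`, which does not see individual request timings. It estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket where the target rank falls, so accuracy depends on the bucket boundaries chosen in `metrics.ts`. The sketch below is a simplified illustration of that idea, not Prometheus's actual implementation; `estimateQuantile` and the `Bucket` shape are hypothetical.

```typescript
// One cumulative histogram bucket: `count` observations were <= `le`.
export interface Bucket {
  le: number;
  count: number;
}

// Simplified estimate. Buckets must be sorted by `le` ascending and end with le = Infinity.
export function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1) throw new RangeError('q must be between 0 and 1');
  if (buckets.length < 2) return NaN; // need at least one finite bucket plus +Inf
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  // First bucket whose cumulative count reaches the target rank
  const index = buckets.findIndex((bucket) => bucket.count >= rank);
  const upper = buckets[index];

  // Rank falls in the +Inf bucket: the best available answer is the highest finite bound
  if (upper.le === Infinity) return buckets[index - 1].le;

  const lowerBound = index === 0 ? 0 : buckets[index - 1].le;
  const lowerCount = index === 0 ? 0 : buckets[index - 1].count;
  const span = upper.count - lowerCount;
  // Assume observations are spread evenly within the bucket
  const fraction = span === 0 ? 0 : (rank - lowerCount) / span;
  return lowerBound + (upper.le - lowerBound) * fraction;
}
```

For example, with cumulative counts of 90 at `le=0.5` and 98 at `le=1` out of 100 observations, the 95th percentile lands in the 0.5 to 1 bucket and is estimated a little above 0.8 seconds. If most requests fall between two widely spaced boundaries, the estimate can be far from the true value, which is a reason to place bucket boundaries near the latency targets that matter.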
## Alerting Rules
```yaml
# prometheus/alerts.yml
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, so the threshold is a fraction (0.05 = 5%)
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for 5 minutes"
      - alert: HighLatency
        # histogram_quantile needs the per-bucket rate, aggregated by the le label
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is above 1s"
      - alert: ServiceDown
        expr: up{job="service"} == 0
        for: 1m
        annotations:
          summary: "Service is down"
          description: "Service has been down for 1 minute"
```
## Best Practices
1. **Logging**
   - Use structured logging (JSON format)
   - Include correlation IDs for request tracing
   - Log at appropriate levels (ERROR, WARN, INFO, DEBUG)
   - Avoid logging sensitive data
2. **Metrics**
   - Use standard metric types (Counter, Gauge, Histogram)
   - Keep cardinality low (avoid high-cardinality labels)
   - Define SLIs and SLOs for critical paths
   - Monitor business metrics, not just technical ones
3. **Tracing**
   - Add traces for critical operations
   - Include relevant context in spans
   - Sample appropriately to control costs
   - Use distributed tracing for microservices
4. **Alerting**
   - Alert on symptoms, not causes
   - Include runbook links in alerts
   - Avoid alert fatigue with proper thresholds
   - Test alerting rules regularly
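
Defining SLOs and choosing alert thresholds both come down to a small amount of arithmetic on the error budget. The sketch below works through it under assumed numbers: a 99.9% availability SLO over a 30-day window is an example, not a target stated by this project, and `errorBudgetMinutes` and `burnRate` are hypothetical helper names.

```typescript
// Error budget: the share of the window that may fail while still meeting the SLO.
export function errorBudgetMinutes(slo: number, windowDays: number): number {
  if (slo <= 0 || slo >= 1) throw new RangeError('slo must be between 0 and 1 (exclusive)');
  return (1 - slo) * windowDays * 24 * 60;
}

// Burn rate: how fast the budget is being spent relative to the sustainable pace.
// 1 means the budget runs out exactly at the end of the window; higher values run out sooner.
export function burnRate(observedErrorRatio: number, slo: number): number {
  if (slo <= 0 || slo >= 1) throw new RangeError('slo must be between 0 and 1 (exclusive)');
  return observedErrorRatio / (1 - slo);
}
```

With the assumed 99.9% SLO over 30 days, the budget is about 43 minutes of full outage, and a sustained 0.5% error ratio burns it roughly five times faster than sustainable. Alerting on burn rate rather than on a fixed error percentage is one way to follow the "alert on symptoms" guidance while keeping thresholds tied to the SLO.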