---
name: observability-monitoring
description: Observability and monitoring patterns for GoodGo microservices. Use when adding metrics, implementing logging, setting up tracing, creating health checks, or debugging production issues.
---

# Observability & Monitoring Patterns

## When to Use This Skill

Use this skill when:
- Setting up logging infrastructure
- Implementing metrics collection
- Adding distributed tracing
- Creating health check endpoints
- Setting up monitoring dashboards
- Debugging production issues
- Implementing alerting rules
- Analyzing performance bottlenecks

## Core Concepts

### Three Pillars of Observability
1. **Logs**: Event records for debugging
2. **Metrics**: Numerical measurements over time
3. **Traces**: Request flow across services

### Tech Stack
- **Logging**: Winston, Pino
- **Metrics**: Prometheus + Grafana
- **Tracing**: OpenTelemetry + Jaeger
- **APM**: DataDog or New Relic (optional)

### Observability Stack Architecture

The observability stack consists of three pillars working together to provide comprehensive visibility into system behavior:

```mermaid
graph TB
    subgraph "Application Layer"
        App[Microservice]
    end

    subgraph "Three Pillars of Observability"
        Logs[Logs<br/>Winston/Pino]
        Metrics[Metrics<br/>Prometheus]
        Traces[Traces<br/>OpenTelemetry]
    end

    subgraph "Aggregation & Storage"
        Loki[Loki<br/>Log Aggregation]
        Prom[Prometheus<br/>Metrics Storage]
        Jaeger[Jaeger<br/>Trace Storage]
    end

    subgraph "Visualization & Alerting"
        Grafana[Grafana<br/>Dashboards]
        AlertManager[AlertManager<br/>Alerts]
    end

    App -->|Structured Logs| Logs
    App -->|HTTP Metrics| Metrics
    App -->|Distributed Spans| Traces

    Logs -->|Collect| Loki
    Metrics -->|Scrape /metrics| Prom
    Traces -->|Export| Jaeger

    Loki -->|Query| Grafana
    Prom -->|Query| Grafana
    Prom -->|Alerts| AlertManager
    Jaeger -->|Query| Grafana

    style App fill:#e1f5ff
    style Logs fill:#fff4e1
    style Metrics fill:#e1ffe1
    style Traces fill:#ffe1f5
    style Grafana fill:#e1e1ff
```

## Structured Logging

```typescript
// src/lib/logger.ts
import winston from 'winston';
import type { Request, Response, NextFunction } from 'express';

// Shared JSON format: timestamp, error stack traces, structured output
const logFormat = winston.format.combine(
  winston.format.timestamp(),
  winston.format.errors({ stack: true }),
  winston.format.json()
);

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: logFormat,
  // Attached to every log entry
  defaultMeta: {
    service: process.env.SERVICE_NAME || 'unknown',
    environment: process.env.NODE_ENV || 'development'
  },
  transports: [
    new winston.transports.Console({
      // Human-readable output in development, JSON everywhere else
      format: process.env.NODE_ENV === 'development'
        ? winston.format.combine(
            winston.format.colorize(),
            winston.format.simple()
          )
        : logFormat
    }),
    // Production: Send to log aggregation service
    ...(process.env.NODE_ENV === 'production'
      ? [new winston.transports.Http({
          host: 'logs.example.com',
          path: '/collect',
          ssl: true
        })]
      : [])
  ]
});

// Request logger middleware
export const requestLogger = (req: Request, res: Response, next: NextFunction) => {
  const start = Date.now();

  // 'finish' fires once the response has been handed off to the client
  res.on('finish', () => {
    const duration = Date.now() - start; // milliseconds

    logger.info('HTTP Request', {
      method: req.method,
      url: req.url,
      status: res.statusCode,
      duration,
      ip: req.ip,
      userAgent: req.get('user-agent'),
      correlationId: req.headers['x-correlation-id']
    });
  });

  next();
};
```

### Logging Flow

The logging flow shows how each request is logged with a correlation ID and how those log entries reach the aggregator and dashboard:

```mermaid
sequenceDiagram
    participant Client
    participant Service as Microservice
    participant Logger as Winston/Pino Logger
    participant Aggregator as Log Aggregator<br/>(Loki)
    participant Dashboard as Grafana Dashboard

    Client->>Service: HTTP Request<br/>(with x-correlation-id)
    Service->>Service: Generate/Extract<br/>Correlation ID
    Service->>Logger: Log Request Start<br/>{correlationId, method, url}
    Service->>Service: Process Request
    Service->>Logger: Log Business Event<br/>{correlationId, event, data}
    Service->>Client: HTTP Response<br/>(with x-correlation-id)
    Service->>Logger: Log Request End<br/>{correlationId, status, duration}

    Logger->>Aggregator: Send Structured Logs<br/>(JSON format)
    Aggregator->>Dashboard: Index & Store Logs
    Dashboard->>Dashboard: Query by correlationId<br/>to trace request flow
```

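The "Generate/Extract Correlation ID" step in the diagram is not covered by the request logger, which only reads the incoming header. A minimal, framework-agnostic sketch of that step follows. The helper name `resolveCorrelationId` and the validation pattern are assumptions for illustration, not part of the existing codebase.

```typescript
import { randomUUID } from 'node:crypto';

// Header name taken from the request logger middleware.
const CORRELATION_HEADER = 'x-correlation-id';

type HeaderValue = string | string[] | undefined;

// Hypothetical helper: reuse the caller's correlation ID when it looks sane,
// otherwise generate a fresh one.
function resolveCorrelationId(headers: Record<string, HeaderValue>): string {
  const raw = headers[CORRELATION_HEADER];
  const candidate = Array.isArray(raw) ? raw[0] : raw;

  // Accept only short, printable IDs so untrusted input cannot bloat or
  // corrupt log entries (assumed policy; tune to your ID format).
  if (candidate && /^[A-Za-z0-9._-]{1,128}$/.test(candidate)) {
    return candidate;
  }
  return randomUUID();
}
```

In an Express middleware this would typically run first, store the result on the request, and echo it back in the `x-correlation-id` response header so that downstream calls and log entries share the same value.
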
## Metrics Collection

```typescript
// src/lib/metrics.ts
import { Registry, Counter, Histogram, Gauge } from 'prom-client';
import type { Request, Response, NextFunction } from 'express';

export const register = new Registry();

// HTTP metrics
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});

export const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status']
});

// Business metrics
export const userRegistrations = new Counter({
  name: 'user_registrations_total',
  help: 'Total number of user registrations',
  labelNames: ['type']
});

export const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Number of active users',
  labelNames: ['status']
});

// Register metrics
register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
register.registerMetric(userRegistrations);
register.registerMetric(activeUsers);

// Metrics middleware
export const metricsMiddleware = (req: Request, res: Response, next: NextFunction) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000; // seconds, matching the metric name
    // Prefer the route template (e.g. /users/:id). req.path is only a fallback
    // and can raise label cardinality for unmatched routes.
    const route = req.route?.path || req.path;

    httpRequestDuration
      .labels(req.method, route, res.statusCode.toString())
      .observe(duration);

    httpRequestTotal
      .labels(req.method, route, res.statusCode.toString())
      .inc();
  });

  next();
};

// Metrics endpoint (scraped by Prometheus)
export const metricsHandler = async (req: Request, res: Response) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
};
```

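The metrics middleware falls back to `req.path` when no route matched, so every distinct URL (for example each user ID) can become its own label value. One way to bound that, sketched here under the assumption that IDs are numeric or UUIDs, is to collapse such segments into placeholders. The helper name `normalizePath` and the placeholder names are hypothetical.

```typescript
const UUID_SEGMENT = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_SEGMENT = /^\d+$/;

// Hypothetical helper: turn a raw request path into a low-cardinality label.
function normalizePath(path: string): string {
  return path
    .split('?')[0] // drop any query string
    .split('/')
    .map((segment) => {
      if (NUMERIC_SEGMENT.test(segment)) return ':id';
      if (UUID_SEGMENT.test(segment)) return ':uuid';
      return segment;
    })
    .join('/');
}
```

With this in place, the fallback could read `req.route?.path || normalizePath(req.path)`. A stricter alternative is a single constant label such as `unmatched` for anything without a route template.
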
### Metrics Collection Flow

Each service records metrics in a registry and exposes them on `/metrics`, which Prometheus scrapes for dashboards and alerting:

```mermaid
graph LR
    subgraph "Service Instance"
        App[Application]
        Middleware[Metrics Middleware]
        Registry[Prometheus Registry]
        Endpoint["/metrics Endpoint"]
    end

    subgraph "Metrics Types"
        Counter[Counter<br/>http_requests_total]
        Gauge[Gauge<br/>active_users]
        Histogram[Histogram<br/>request_duration]
    end

    subgraph "Collection"
        Prometheus[Prometheus<br/>Scraper]
    end

    subgraph "Storage & Query"
        PromDB[(Prometheus<br/>Time Series DB)]
    end

    subgraph "Visualization"
        Grafana[Grafana<br/>Dashboards]
        Alerts[AlertManager<br/>Rules]
    end

    App -->|HTTP Request| Middleware
    Middleware -->|Record| Counter
    Middleware -->|Record| Histogram
    App -->|Update| Gauge

    Counter --> Registry
    Gauge --> Registry
    Histogram --> Registry
    Registry --> Endpoint

    Prometheus -->|Scrape every 15s| Endpoint
    Prometheus -->|Store| PromDB

    PromDB -->|Query| Grafana
    PromDB -->|Evaluate| Alerts
    Alerts -->|Trigger| Grafana

    style App fill:#e1f5ff
    style Prometheus fill:#ffe1e1
    style Grafana fill:#e1e1ff
```

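Prometheus histograms are cumulative: each observation increments every bucket whose upper bound (`le`) is at least the observed value, plus the implicit `+Inf` bucket. The sketch below mimics that bookkeeping in plain TypeScript to make the exported `_bucket` series easier to read. It is an illustration only, not how `prom-client` is implemented, and the type and function names are made up for the example.

```typescript
// Bucket upper bounds in seconds, copied from httpRequestDuration.
const BUCKETS = [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10];

interface HistogramState {
  bucketCounts: number[]; // cumulative count per `le` bound
  infCount: number;       // the implicit le="+Inf" bucket (total observations)
  sum: number;            // sum of all observed values
}

function createHistogram(): HistogramState {
  return { bucketCounts: BUCKETS.map(() => 0), infCount: 0, sum: 0 };
}

// Record one observation: every bucket with bound >= value is incremented.
function observe(state: HistogramState, value: number): void {
  BUCKETS.forEach((bound, i) => {
    if (value <= bound) state.bucketCounts[i] += 1;
  });
  state.infCount += 1;
  state.sum += value;
}
```

Because the buckets are cumulative, quantile queries operate on the `_bucket` series grouped by `le`, and a request slower than the largest bound (10s here) is only visible in `+Inf`.
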
## Distributed Tracing

```typescript
// src/lib/tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
import { trace, SpanStatusCode } from '@opentelemetry/api';

export const initTracing = () => {
  const jaegerExporter = new JaegerExporter({
    endpoint: process.env.JAEGER_ENDPOINT || 'http://localhost:14268/api/traces',
  });

  const sdk = new NodeSDK({
    resource: new Resource({
      [SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME || 'unknown',
      [SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
    }),
    traceExporter: jaegerExporter,
    instrumentations: [getNodeAutoInstrumentations()]
  });

  sdk.start();

  // Flush pending spans before the process exits
  process.on('SIGTERM', () => {
    sdk.shutdown()
      .then(() => console.log('Tracing terminated'))
      .catch((error) => console.error('Error terminating tracing', error))
      .finally(() => process.exit(0));
  });
};

// Custom span creation
export const tracedOperation = async <T>(name: string, fn: () => Promise<T>): Promise<T> => {
  const tracer = trace.getTracer('application');
  const span = tracer.startSpan(name);

  try {
    const result = await fn();
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    // `error` is `unknown` in strict TypeScript, so narrow it before use
    const err = error instanceof Error ? error : new Error(String(error));
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: err.message
    });
    span.recordException(err);
    throw error;
  } finally {
    span.end();
  }
};
```

### Distributed Tracing Flow

Distributed tracing tracks requests across multiple services using OpenTelemetry:

```mermaid
sequenceDiagram
    participant Client
    participant Gateway as API Gateway
    participant ServiceA as Service A<br/>(User Service)
    participant ServiceB as Service B<br/>(Order Service)
    participant DB as Database
    participant Jaeger as Jaeger<br/>Collector

    Client->>Gateway: Request<br/>(Trace ID: abc123)
    Gateway->>Gateway: Create Root Span<br/>Span: gateway.request
    Gateway->>ServiceA: HTTP Call<br/>(Trace ID: abc123,<br/>Span ID: span-1)

    ServiceA->>ServiceA: Create Child Span<br/>Span: user.getById
    ServiceA->>DB: Query User<br/>(Trace ID: abc123,<br/>Span ID: span-2)
    DB-->>ServiceA: User Data
    ServiceA->>ServiceA: End Span span-2
    ServiceA->>ServiceB: HTTP Call<br/>(Trace ID: abc123,<br/>Span ID: span-3)

    ServiceB->>ServiceB: Create Child Span<br/>Span: order.getByUserId
    ServiceB->>DB: Query Orders<br/>(Trace ID: abc123,<br/>Span ID: span-4)
    DB-->>ServiceB: Orders Data
    ServiceB->>ServiceB: End Span span-4
    ServiceB->>ServiceB: End Span span-3
    ServiceB-->>ServiceA: Response
    ServiceA->>ServiceA: End Span span-1
    ServiceA-->>Gateway: Response
    Gateway->>Gateway: End Span gateway.request
    Gateway-->>Client: Final Response

    ServiceA->>Jaeger: Export Spans<br/>(Trace ID: abc123)
    ServiceB->>Jaeger: Export Spans<br/>(Trace ID: abc123)
    Gateway->>Jaeger: Export Spans<br/>(Trace ID: abc123)

    Note over Jaeger: All spans linked by<br/>Trace ID: abc123
```

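The trace and span IDs in the diagram travel between services in a request header. OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header, and the auto-instrumentations normally handle it for you. The sketch below only illustrates the header format and how a child span keeps the trace ID while getting a new span ID. The type and function names are hypothetical.

```typescript
import { randomBytes } from 'node:crypto';

interface TraceContext {
  traceId: string;  // 32 hex chars, shared by every span in the trace
  spanId: string;   // 16 hex chars, identifies the calling span
  sampled: boolean; // lowest bit of the trace-flags field
}

// Serialize a context into a W3C `traceparent` header value (version 00).
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

// Parse an incoming header; returns null when it is malformed.
function parseTraceparent(header: string): TraceContext | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Continue the caller's trace with a new span ID, or start a new trace.
function childContext(incoming: string | undefined): TraceContext {
  const parent = incoming ? parseTraceparent(incoming) : null;
  return {
    traceId: parent?.traceId ?? randomBytes(16).toString('hex'),
    spanId: randomBytes(8).toString('hex'),
    sampled: parent?.sampled ?? true,
  };
}
```

Real IDs are random hex strings rather than the readable `abc123` and `span-1` placeholders used in the diagram.
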
## Health Checks

```typescript
// src/modules/health/health.controller.ts
import type { Request, Response } from 'express';
import { PrismaClient } from '@prisma/client';
import Redis from 'ioredis'; // assumption: ioredis; any client exposing ping() works

export class HealthController {
  constructor(
    private prisma: PrismaClient,
    private redis: Redis
  ) {}

  // Liveness probe - is the service running?
  async liveness(req: Request, res: Response) {
    res.json({
      status: 'ok',
      timestamp: new Date().toISOString()
    });
  }

  // Readiness probe - is the service ready for traffic?
  async readiness(req: Request, res: Response) {
    const checks = await this.runHealthChecks();
    const isHealthy = this.allPassing(checks);

    res.status(isHealthy ? 200 : 503).json({
      status: isHealthy ? 'ready' : 'not ready',
      checks,
      timestamp: new Date().toISOString()
    });
  }

  // Detailed health check
  async health(req: Request, res: Response) {
    const checks = await this.runHealthChecks();
    const isHealthy = this.allPassing(checks);

    res.status(isHealthy ? 200 : 503).json({
      status: isHealthy ? 'healthy' : 'unhealthy',
      version: process.env.SERVICE_VERSION || '1.0.0',
      uptime: process.uptime(),
      checks,
      timestamp: new Date().toISOString()
    });
  }

  // A 'warning' (e.g. high memory) is reported but does not fail the probe;
  // only an 'unhealthy' dependency returns 503.
  private allPassing(checks: Record<string, any>): boolean {
    return Object.values(checks).every(check => check.status !== 'unhealthy');
  }

  private async runHealthChecks() {
    const checks: Record<string, any> = {};

    // Database check
    try {
      const start = Date.now();
      await this.prisma.$queryRaw`SELECT 1`;
      checks.database = {
        status: 'healthy',
        responseTime: Date.now() - start
      };
    } catch (error) {
      checks.database = {
        status: 'unhealthy',
        error: error instanceof Error ? error.message : String(error)
      };
    }

    // Redis check
    try {
      const start = Date.now();
      await this.redis.ping();
      checks.redis = {
        status: 'healthy',
        responseTime: Date.now() - start
      };
    } catch (error) {
      checks.redis = {
        status: 'unhealthy',
        error: error instanceof Error ? error.message : String(error)
      };
    }

    // Memory check (values reported in MB; 500 MB heap threshold)
    const memUsage = process.memoryUsage();
    checks.memory = {
      status: memUsage.heapUsed < 500 * 1024 * 1024 ? 'healthy' : 'warning',
      heapUsed: Math.round(memUsage.heapUsed / 1024 / 1024),
      heapTotal: Math.round(memUsage.heapTotal / 1024 / 1024),
      rss: Math.round(memUsage.rss / 1024 / 1024)
    };

    return checks;
  }
}
```

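The dependency checks above wait for as long as the database or Redis call takes, so a hung connection can stall the probe itself. One possible mitigation, sketched here with a hypothetical helper named `checkWithTimeout`, is to race each probe against a timer and report a timeout as `unhealthy`. The result shape mirrors the controller's check objects; the timeout value is an assumption to tune per dependency.

```typescript
type CheckResult =
  | { status: 'healthy'; responseTime: number }
  | { status: 'unhealthy'; error: string };

// Run a dependency probe, failing it if it does not settle within timeoutMs.
async function checkWithTimeout(
  probe: () => Promise<unknown>,
  timeoutMs: number
): Promise<CheckResult> {
  const start = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Rejects after timeoutMs unless the timer is cleared first.
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs}ms`)), timeoutMs);
  });

  try {
    await Promise.race([probe(), timeout]);
    return { status: 'healthy', responseTime: Date.now() - start };
  } catch (error) {
    return {
      status: 'unhealthy',
      error: error instanceof Error ? error.message : String(error),
    };
  } finally {
    clearTimeout(timer); // avoid keeping the event loop alive
  }
}
```

Usage inside `runHealthChecks` might look like `checks.redis = await checkWithTimeout(() => this.redis.ping(), 2000);`. Note that the underlying call is not cancelled; the helper only stops waiting for it.
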
## Error Tracking

```typescript
// src/lib/error-tracking.ts
import * as Sentry from '@sentry/node';
import type { Request, Response, NextFunction } from 'express';
import { logger } from './logger';

export const initErrorTracking = () => {
  // Error reporting is enabled only when a DSN is configured
  if (process.env.SENTRY_DSN) {
    Sentry.init({
      dsn: process.env.SENTRY_DSN,
      environment: process.env.NODE_ENV,
      tracesSampleRate: 0.1,
      beforeSend(event) {
        // Filter sensitive data
        if (event.request?.cookies) {
          delete event.request.cookies;
        }
        return event;
      }
    });
  }
};

// Error handler middleware (register after all routes)
export const errorHandler = (
  err: Error,
  req: Request,
  res: Response,
  next: NextFunction
) => {
  // Log error
  logger.error('Unhandled error', {
    error: err.message,
    stack: err.stack,
    url: req.url,
    method: req.method,
    correlationId: req.headers['x-correlation-id']
  });

  // Report to Sentry
  Sentry.captureException(err, {
    tags: {
      service: process.env.SERVICE_NAME
    },
    // req.user assumes an auth middleware that augments the Express Request type
    user: {
      id: req.user?.id
    }
  });

  // If the response has already started, delegate to Express's default handler
  if (res.headersSent) {
    return next(err);
  }

  // Send response (hide internal details in production)
  res.status(500).json({
    success: false,
    error: {
      code: 'INTERNAL_ERROR',
      message: process.env.NODE_ENV === 'production'
        ? 'Internal server error'
        : err.message
    }
  });
};
```

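The `beforeSend` hook above strips cookies, but log metadata and error context can carry other secrets too. A minimal sketch of a recursive redaction helper follows. The function name `redact` and the key deny-list are assumptions for illustration, and the helper expects plain JSON-like data without circular references.

```typescript
// Hypothetical deny-list; adjust to the fields your services actually handle.
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'cookie', 'secret']);

// Return a deep copy of `value` with sensitive fields masked.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([key, inner]) => [
        key,
        // Match keys case-insensitively, e.g. "Authorization"
        SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : redact(inner),
      ])
    );
  }
  return value;
}
```

A helper like this could be applied to metadata before calling `logger.error`, or to `event.extra` inside `beforeSend`. Exact-match key lists miss variants such as `accessToken`, so treat the list as a starting point.
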
## Performance Monitoring

```typescript
// src/middlewares/performance.middleware.ts
import type { Request, Response, NextFunction } from 'express';
import onHeaders from 'on-headers'; // assumption: the on-headers package is installed
import { logger } from '../lib/logger';

const SLOW_REQUEST_THRESHOLD_MS = 1000;

export const performanceMiddleware = (req: Request, res: Response, next: NextFunction) => {
  const start = process.hrtime.bigint();
  // Nanoseconds converted to milliseconds
  const elapsedMs = () => Number(process.hrtime.bigint() - start) / 1000000;

  // Add to response header. Headers can no longer be changed once 'finish'
  // fires, so the value is set just before the headers are written.
  onHeaders(res, () => {
    res.setHeader('X-Response-Time', `${elapsedMs().toFixed(1)}ms`);
  });

  res.on('finish', () => {
    const duration = elapsedMs();

    // Log slow requests
    if (duration > SLOW_REQUEST_THRESHOLD_MS) {
      logger.warn('Slow request detected', {
        method: req.method,
        url: req.url,
        duration,
        threshold: SLOW_REQUEST_THRESHOLD_MS
      });
    }
  });

  next();
};
```

## Grafana Dashboard Config

```json
{
  "dashboard": {
    "title": "Service Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "rate(http_requests_total[5m])"
        }]
      },
      {
        "title": "Request Duration",
        "targets": [{
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
        }]
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
        }]
      },
      {
        "title": "Active Users",
        "targets": [{
          "expr": "active_users"
        }]
      }
    ]
  }
}
```

## Alerting Rules

```yaml
# prometheus/alerts.yml
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, so the 5% threshold is a percentage
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for 5 minutes"

      - alert: HighLatency
        # histogram_quantile needs the per-bucket rates grouped by `le`
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is above 1s"

      - alert: ServiceDown
        expr: up{job="service"} == 0
        for: 1m
        annotations:
          summary: "Service is down"
          description: "Service has been down for 1 minute"
```

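The `HighErrorRate` description speaks of a percentage, meaning the share of requests in the window that failed. The sketch below works through that arithmetic from two counter snapshots. The `CounterSnapshot` shape and the sample numbers are made up for illustration, and counter resets (which Prometheus's `rate()` handles) are ignored.

```typescript
interface CounterSnapshot {
  total: number;  // http_requests_total, all statuses
  errors: number; // http_requests_total with status 5xx
}

// Share of requests between two snapshots that ended in a 5xx response.
// Both counters cover the same window, so the window length cancels out.
function errorRatio(earlier: CounterSnapshot, later: CounterSnapshot): number {
  const requests = later.total - earlier.total;
  const errors = later.errors - earlier.errors;
  return requests > 0 ? errors / requests : 0;
}

const ratio = errorRatio({ total: 1000, errors: 10 }, { total: 1300, errors: 40 });
const shouldAlert = ratio > 0.05; // same threshold as the alert rule
```

Here 30 of 300 new requests failed, a ratio of 0.1, so the 5% threshold is exceeded. Comparing an absolute rate of errors per second against 0.05 would instead depend on traffic volume.
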
## Best Practices

1. **Logging**
   - Use structured logging (JSON format)
   - Include correlation IDs for request tracing
   - Log at appropriate levels (ERROR, WARN, INFO, DEBUG)
   - Avoid logging sensitive data

2. **Metrics**
   - Use standard metric types (Counter, Gauge, Histogram)
   - Keep cardinality low (avoid high-cardinality labels)
   - Define SLIs and SLOs for critical paths
   - Monitor business metrics, not just technical ones

3. **Tracing**
   - Add traces for critical operations
   - Include relevant context in spans
   - Sample appropriately to control costs
   - Use distributed tracing for microservices

4. **Alerting**
   - Alert on symptoms, not causes
   - Include runbook links in alerts
   - Avoid alert fatigue with proper thresholds
   - Test alerting rules regularly