| name | description |
|---|---|
| observability-monitoring | Observability and monitoring patterns for GoodGo microservices. Use when adding metrics, implementing logging, setting up tracing, creating health checks, or debugging production issues. |
Observability & Monitoring Patterns
When to Use This Skill
Use this skill when:
- Setting up logging infrastructure
- Implementing metrics collection
- Adding distributed tracing
- Creating health check endpoints
- Setting up monitoring dashboards
- Debugging production issues
- Implementing alerting rules
- Analyzing performance bottlenecks
Core Concepts
Three Pillars of Observability
- Logs: Event records for debugging
- Metrics: Numerical measurements over time
- Traces: Request flow across services
Tech Stack
- Logging: Winston, Pino
- Metrics: Prometheus + Grafana
- Tracing: OpenTelemetry + Jaeger
- APM: DataDog or New Relic (optional)
Observability Stack Architecture
The application emits logs, metrics, and traces. Each signal has its own aggregation backend, and Grafana and AlertManager sit on top of those backends for dashboards and alerts:
graph TB
subgraph "Application Layer"
App[Microservice]
end
subgraph "Three Pillars of Observability"
Logs[Logs<br/>Winston/Pino]
Metrics[Metrics<br/>Prometheus]
Traces[Traces<br/>OpenTelemetry]
end
subgraph "Aggregation & Storage"
Loki[Loki<br/>Log Aggregation]
Prom[Prometheus<br/>Metrics Storage]
Jaeger[Jaeger<br/>Trace Storage]
end
subgraph "Visualization & Alerting"
Grafana[Grafana<br/>Dashboards]
AlertManager[AlertManager<br/>Alerts]
end
App -->|Structured Logs| Logs
App -->|HTTP Metrics| Metrics
App -->|Distributed Spans| Traces
Logs -->|Collect| Loki
Metrics -->|Scrape /metrics| Prom
Traces -->|Export| Jaeger
Loki -->|Query| Grafana
Prom -->|Query| Grafana
Prom -->|Alerts| AlertManager
Jaeger -->|Query| Grafana
style App fill:#e1f5ff
style Logs fill:#fff4e1
style Metrics fill:#e1ffe1
style Traces fill:#ffe1f5
style Grafana fill:#e1e1ff
Structured Logging
// src/lib/logger.ts
import winston from 'winston';
import type { Request, Response, NextFunction } from 'express';
const logFormat = winston.format.combine(
winston.format.timestamp(),
winston.format.errors({ stack: true }),
winston.format.json()
);
export const logger = winston.createLogger({
level: process.env.LOG_LEVEL || 'info',
format: logFormat,
defaultMeta: {
service: process.env.SERVICE_NAME || 'unknown',
environment: process.env.NODE_ENV || 'development'
},
transports: [
new winston.transports.Console({
format: process.env.NODE_ENV === 'development'
? winston.format.combine(
winston.format.colorize(),
winston.format.simple()
)
: logFormat
}),
// Production: Send to log aggregation service
...(process.env.NODE_ENV === 'production'
? [new winston.transports.Http({
host: 'logs.example.com',
path: '/collect',
ssl: true
})]
: [])
]
});
// Request logger middleware
export const requestLogger = (req: Request, res: Response, next: NextFunction) => {
const start = Date.now();
res.on('finish', () => {
const duration = Date.now() - start;
logger.info('HTTP Request', {
method: req.method,
url: req.url,
status: res.statusCode,
duration,
ip: req.ip,
userAgent: req.get('user-agent'),
correlationId: req.headers['x-correlation-id']
});
});
next();
};
Logging Flow
Every log entry for a request carries the same correlation ID, so the request can be followed from the service through the log aggregator to the dashboard:
sequenceDiagram
participant Client
participant Service as Microservice
participant Logger as Winston/Pino Logger
participant Aggregator as Log Aggregator<br/>(Loki)
participant Dashboard as Grafana Dashboard
Client->>Service: HTTP Request<br/>(with x-correlation-id)
Service->>Service: Generate/Extract<br/>Correlation ID
Service->>Logger: Log Request Start<br/>{correlationId, method, url}
Service->>Service: Process Request
Service->>Logger: Log Business Event<br/>{correlationId, event, data}
Service->>Client: HTTP Response<br/>(with x-correlation-id)
Service->>Logger: Log Request End<br/>{correlationId, status, duration}
Logger->>Aggregator: Send Structured Logs<br/>(JSON format)
Aggregator->>Dashboard: Index & Store Logs
Dashboard->>Dashboard: Query by correlationId<br/>to trace request flow
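The "Generate/Extract Correlation ID" step in the diagram is not covered by the request logger shown earlier, which only reads the header. A minimal sketch of that step follows. It assumes headers arrive as a lowercase-keyed object, as Node exposes `req.headers`; `resolveCorrelationId` is a hypothetical helper name, not part of the original skill.

```typescript
import { randomUUID } from 'node:crypto';

type HeaderValue = string | string[] | undefined;

// Hypothetical helper: reuse the caller's correlation ID when one is
// present, otherwise generate a new one.
function resolveCorrelationId(headers: Record<string, HeaderValue>): string {
  const incoming = headers['x-correlation-id'];
  // Node may expose a repeated header as an array; take the first value.
  const candidate = Array.isArray(incoming) ? incoming[0] : incoming;
  const trimmed = candidate?.trim();
  return trimmed ? trimmed : randomUUID();
}

const reused = resolveCorrelationId({ 'x-correlation-id': 'abc-123' });
const generated = resolveCorrelationId({});
console.log(reused);           // abc-123
console.log(generated.length); // 36 (length of a UUID string)
```

In a middleware, the resolved value would typically be stored on the request and echoed in the `x-correlation-id` response header so that every later log call can include it.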
Metrics Collection
// src/lib/metrics.ts
import { Registry, Counter, Histogram, Gauge } from 'prom-client';
import type { Request, Response, NextFunction } from 'express';
export const register = new Registry();
// HTTP metrics
export const httpRequestDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status'],
buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});
export const httpRequestTotal = new Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status']
});
// Business metrics
export const userRegistrations = new Counter({
name: 'user_registrations_total',
help: 'Total number of user registrations',
labelNames: ['type']
});
export const activeUsers = new Gauge({
name: 'active_users',
help: 'Number of active users',
labelNames: ['status']
});
// Register metrics
register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
register.registerMetric(userRegistrations);
register.registerMetric(activeUsers);
// Metrics middleware
export const metricsMiddleware = (req: Request, res: Response, next: NextFunction) => {
const start = Date.now();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
const route = req.route?.path || req.path;
httpRequestDuration
.labels(req.method, route, res.statusCode.toString())
.observe(duration);
httpRequestTotal
.labels(req.method, route, res.statusCode.toString())
.inc();
});
next();
};
// Metrics endpoint
export const metricsHandler = async (req: Request, res: Response) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
};
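How the `buckets` option turns observations into counts can be hard to see behind the prom-client API. The sketch below imitates the idea in plain TypeScript, assuming cumulative buckets as Prometheus exposes them. `SketchHistogram` is a hypothetical class for illustration only and is not prom-client's implementation.

```typescript
// Illustrative only: a tiny cumulative histogram.
class SketchHistogram {
  private readonly upperBounds: number[];
  private readonly counts: number[];
  private sum = 0;
  private total = 0;

  constructor(upperBounds: number[]) {
    // Sort a copy so the buckets are always in ascending order.
    this.upperBounds = [...upperBounds].sort((a, b) => a - b);
    this.counts = this.upperBounds.map(() => 0);
  }

  observe(value: number): void {
    // Buckets are cumulative: a value counts toward every bucket whose
    // upper bound (le) is greater than or equal to it.
    this.upperBounds.forEach((le, i) => {
      if (value <= le) this.counts[i] += 1;
    });
    this.sum += value;
    this.total += 1;
  }

  snapshot(): { buckets: Record<string, number>; sum: number; count: number } {
    const buckets: Record<string, number> = {};
    this.upperBounds.forEach((le, i) => {
      buckets[String(le)] = this.counts[i];
    });
    buckets['+Inf'] = this.total; // the implicit +Inf bucket holds every observation
    return { buckets, sum: this.sum, count: this.total };
  }
}

const sketch = new SketchHistogram([0.1, 0.5, 1]);
[0.05, 0.3, 0.7, 2].forEach((seconds) => sketch.observe(seconds));
console.log(sketch.snapshot());
```

With these four observations the `0.1` bucket holds 1, `0.5` holds 2, `1` holds 3, and `+Inf` holds 4. The 2-second request is only visible in `+Inf`, which is why the bucket list should extend past the slowest latency you care about.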
Metrics Collection Flow
Metrics are collected from services and exposed to Prometheus for monitoring and alerting:
graph LR
subgraph "Service Instance"
App[Application]
Middleware[Metrics Middleware]
Registry[Prometheus Registry]
Endpoint[/metrics Endpoint]
end
subgraph "Metrics Types"
Counter[Counter<br/>http_requests_total]
Gauge[Gauge<br/>active_users]
Histogram[Histogram<br/>request_duration]
end
subgraph "Collection"
Prometheus[Prometheus<br/>Scraper]
end
subgraph "Storage & Query"
PromDB[(Prometheus<br/>Time Series DB)]
end
subgraph "Visualization"
Grafana[Grafana<br/>Dashboards]
Alerts[AlertManager<br/>Rules]
end
App -->|HTTP Request| Middleware
Middleware -->|Record| Counter
Middleware -->|Record| Histogram
App -->|Update| Gauge
Counter --> Registry
Gauge --> Registry
Histogram --> Registry
Registry --> Endpoint
Prometheus -->|Scrape every 15s| Endpoint
Prometheus -->|Store| PromDB
PromDB -->|Query| Grafana
PromDB -->|Evaluate| Alerts
Alerts -->|Trigger| Grafana
style App fill:#e1f5ff
style Prometheus fill:#ffe1e1
style Grafana fill:#e1e1ff
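What Prometheus receives when it scrapes the `/metrics` endpoint is plain text. The sketch below renders one counter in that text exposition format to make the shape concrete. `renderCounter` and `escapeLabelValue` are hypothetical helpers; in the real service `register.metrics()` produces this output.

```typescript
// Hypothetical helpers for illustration; prom-client generates this text itself.
function escapeLabelValue(value: string): string {
  // Label values escape backslash, double quote, and line feed.
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

function renderCounter(
  metricName: string,
  help: string,
  samples: Array<{ labels: Record<string, string>; value: number }>
): string {
  const lines = [`# HELP ${metricName} ${help}`, `# TYPE ${metricName} counter`];
  for (const sample of samples) {
    const labelText = Object.entries(sample.labels)
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(',');
    lines.push(`${metricName}{${labelText}} ${sample.value}`);
  }
  return lines.join('\n') + '\n';
}

console.log(
  renderCounter('http_requests_total', 'Total number of HTTP requests', [
    { labels: { method: 'GET', route: '/users/:id', status: '200' }, value: 42 },
  ])
);
// # HELP http_requests_total Total number of HTTP requests
// # TYPE http_requests_total counter
// http_requests_total{method="GET",route="/users/:id",status="200"} 42
```

Each distinct combination of label values becomes its own line, and therefore its own time series in Prometheus.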
Distributed Tracing
// src/lib/tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
export const initTracing = () => {
const jaegerExporter = new JaegerExporter({
endpoint: process.env.JAEGER_ENDPOINT || 'http://localhost:14268/api/traces',
});
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME || 'unknown',
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
}),
traceExporter: jaegerExporter,
instrumentations: [getNodeAutoInstrumentations()]
});
sdk.start();
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('Tracing terminated'))
.catch((error) => console.log('Error terminating tracing', error))
.finally(() => process.exit(0));
});
};
// Custom span creation
import { trace, SpanStatusCode } from '@opentelemetry/api';
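export const tracedOperation = async <T>(name: string, fn: () => Promise<T>): Promise<T> => {
  const tracer = trace.getTracer('application');
  const span = tracer.startSpan(name);
  try {
    const result = await fn();
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    // Caught values are `unknown` under strict TypeScript, so normalize first
    const err = error instanceof Error ? error : new Error(String(error));
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: err.message
    });
    span.recordException(err);
    throw error;
  } finally {
    // Always end the span, whether the operation succeeded or failed
    span.end();
  }
};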
Distributed Tracing Flow
Distributed tracing tracks requests across multiple services using OpenTelemetry:
sequenceDiagram
participant Client
participant Gateway as API Gateway
participant ServiceA as Service A<br/>(User Service)
participant ServiceB as Service B<br/>(Order Service)
participant DB as Database
participant Jaeger as Jaeger<br/>Collector
Client->>Gateway: Request<br/>(Trace ID: abc123)
Gateway->>Gateway: Create Root Span<br/>Span: gateway.request
Gateway->>ServiceA: HTTP Call<br/>(Trace ID: abc123,<br/>Span ID: span-1)
ServiceA->>ServiceA: Create Child Span<br/>Span: user.getById
ServiceA->>DB: Query User<br/>(Trace ID: abc123,<br/>Span ID: span-2)
DB-->>ServiceA: User Data
ServiceA->>ServiceA: End Span span-2
ServiceA->>ServiceB: HTTP Call<br/>(Trace ID: abc123,<br/>Span ID: span-3)
ServiceB->>ServiceB: Create Child Span<br/>Span: order.getByUserId
ServiceB->>DB: Query Orders<br/>(Trace ID: abc123,<br/>Span ID: span-4)
DB-->>ServiceB: Orders Data
ServiceB->>ServiceB: End Span span-4
ServiceB->>ServiceB: End Span span-3
ServiceB-->>ServiceA: Response
ServiceA->>ServiceA: End Span span-1
ServiceA-->>Gateway: Response
Gateway->>Gateway: End Span gateway.request
Gateway-->>Client: Final Response
ServiceA->>Jaeger: Export Spans<br/>(Trace ID: abc123)
ServiceB->>Jaeger: Export Spans<br/>(Trace ID: abc123)
Gateway->>Jaeger: Export Spans<br/>(Trace ID: abc123)
Note over Jaeger: All spans linked by<br/>Trace ID: abc123
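The Trace ID and Span ID in the diagram travel between services in a request header. OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header, and the auto-instrumentation reads and writes it for you. The sketch below only illustrates the header format; `parseTraceparent` and `childTraceparent` are hypothetical names, and only version `00` of the header is handled.

```typescript
import { randomBytes } from 'node:crypto';

interface TraceContext {
  traceId: string;  // 32 lowercase hex chars, shared by every span in the trace
  spanId: string;   // 16 lowercase hex chars, identifies the calling span
  sampled: boolean; // lowest bit of the trace-flags byte
}

// Parses a version-00 traceparent header; returns null if it is malformed.
function parseTraceparent(header: string): TraceContext | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid in the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Builds the header for an outgoing call: same trace ID, fresh span ID.
function childTraceparent(parent: TraceContext): string {
  const spanId = randomBytes(8).toString('hex');
  return `00-${parent.traceId}-${spanId}-${parent.sampled ? '01' : '00'}`;
}

const incoming = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
if (incoming) {
  console.log(incoming.traceId);           // 4bf92f3577b34da6a3ce929d0e0e4736
  console.log(childTraceparent(incoming)); // same trace ID, new span ID
}
```

Because every hop copies the trace ID and replaces only the span ID, the tracing backend can reassemble the spans exported by each service into one trace.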
Health Checks
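// src/modules/health/health.controller.ts
import type { Request, Response } from 'express';
import { PrismaClient } from '@prisma/client';
import Redis from 'ioredis'; // assumes the ioredis client; swap in the Redis library you use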
export class HealthController {
constructor(
private prisma: PrismaClient,
private redis: Redis
) {}
// Liveness probe - is the service running?
async liveness(req: Request, res: Response) {
res.json({
status: 'ok',
timestamp: new Date().toISOString()
});
}
// Readiness probe - is the service ready for traffic?
async readiness(req: Request, res: Response) {
const checks = await this.runHealthChecks();
// A 'warning' (for example, high memory use) is reported but does not fail the probe
const isHealthy = Object.values(checks).every(check => check.status !== 'unhealthy');
res.status(isHealthy ? 200 : 503).json({
status: isHealthy ? 'ready' : 'not ready',
checks,
timestamp: new Date().toISOString()
});
}
// Detailed health check
async health(req: Request, res: Response) {
const checks = await this.runHealthChecks();
// A 'warning' (for example, high memory use) is reported but does not fail the check
const isHealthy = Object.values(checks).every(check => check.status !== 'unhealthy');
res.status(isHealthy ? 200 : 503).json({
status: isHealthy ? 'healthy' : 'unhealthy',
version: process.env.SERVICE_VERSION || '1.0.0',
uptime: process.uptime(),
checks,
timestamp: new Date().toISOString()
});
}
private async runHealthChecks() {
const checks: Record<string, any> = {};
// Database check
try {
const start = Date.now();
await this.prisma.$queryRaw`SELECT 1`;
checks.database = {
status: 'healthy',
responseTime: Date.now() - start
};
} catch (error) {
checks.database = {
status: 'unhealthy',
error: error instanceof Error ? error.message : String(error)
};
}
// Redis check
try {
const start = Date.now();
await this.redis.ping();
checks.redis = {
status: 'healthy',
responseTime: Date.now() - start
};
} catch (error) {
checks.redis = {
status: 'unhealthy',
error: error instanceof Error ? error.message : String(error)
};
}
// Memory check
const memUsage = process.memoryUsage();
checks.memory = {
status: memUsage.heapUsed < 500 * 1024 * 1024 ? 'healthy' : 'warning',
heapUsed: Math.round(memUsage.heapUsed / 1024 / 1024),
heapTotal: Math.round(memUsage.heapTotal / 1024 / 1024),
rss: Math.round(memUsage.rss / 1024 / 1024)
};
return checks;
}
}
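Both probes reduce the per-dependency checks to a single HTTP status. The sketch below isolates that reduction as a pure function so the policy is easy to test. It assumes that only `unhealthy` should fail a probe and that `warning` is reported as degraded; that policy and the names `summarizeChecks` and `CheckResult` are assumptions for illustration.

```typescript
type CheckStatus = 'healthy' | 'warning' | 'unhealthy';

interface CheckResult {
  status: CheckStatus;
  responseTime?: number;
  error?: string;
}

interface HealthSummary {
  overall: 'healthy' | 'degraded' | 'unhealthy';
  httpStatus: 200 | 503;
  failing: string[];
}

// Assumption for this sketch: only 'unhealthy' fails the probe;
// 'warning' is surfaced as 'degraded' but still returns 200.
function summarizeChecks(checks: Record<string, CheckResult>): HealthSummary {
  const entries = Object.entries(checks);
  const failing = entries
    .filter(([, result]) => result.status === 'unhealthy')
    .map(([checkName]) => checkName);
  if (failing.length > 0) {
    return { overall: 'unhealthy', httpStatus: 503, failing };
  }
  const warned = entries.some(([, result]) => result.status === 'warning');
  return { overall: warned ? 'degraded' : 'healthy', httpStatus: 200, failing };
}

console.log(
  summarizeChecks({
    database: { status: 'healthy', responseTime: 4 },
    redis: { status: 'unhealthy', error: 'connection refused' },
    memory: { status: 'warning' },
  })
);
// { overall: 'unhealthy', httpStatus: 503, failing: [ 'redis' ] }
```

Returning the names of the failing checks makes the 503 response actionable without requiring a second request to the detailed health endpoint.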
Error Tracking
// src/lib/error-tracking.ts
import * as Sentry from '@sentry/node';
import type { Request, Response, NextFunction } from 'express';
import { logger } from './logger';
export const initErrorTracking = () => {
if (process.env.SENTRY_DSN) {
Sentry.init({
dsn: process.env.SENTRY_DSN,
environment: process.env.NODE_ENV,
tracesSampleRate: 0.1,
beforeSend(event, hint) {
// Filter sensitive data
if (event.request?.cookies) {
delete event.request.cookies;
}
return event;
}
});
}
};
// Error handler middleware
export const errorHandler = (
err: Error,
req: Request,
res: Response,
next: NextFunction
) => {
// Log error
logger.error('Unhandled error', {
error: err.message,
stack: err.stack,
url: req.url,
method: req.method,
correlationId: req.headers['x-correlation-id']
});
// Report to Sentry
Sentry.captureException(err, {
tags: {
service: process.env.SERVICE_NAME
},
user: {
id: req.user?.id
}
});
// Send response
res.status(500).json({
success: false,
error: {
code: 'INTERNAL_ERROR',
message: process.env.NODE_ENV === 'production'
? 'Internal server error'
: err.message
}
});
};
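The `beforeSend` hook removes cookies before an event leaves the process, but anything passed to the logger is written as is. A small redaction helper can cover both paths. The sketch below is one way to write it, assuming plain JSON-like data without circular references; `redactSensitive` and the key list are hypothetical and should be adapted to the fields your services handle.

```typescript
// Hypothetical key list; matching is case-insensitive on the property name.
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'cookie', 'secret']);

// Returns a deep copy with sensitive values masked; the input is not mutated.
// Assumes plain objects, arrays, and primitives (no circular references).
function redactSensitive(input: unknown): unknown {
  if (Array.isArray(input)) {
    return input.map((item) => redactSensitive(item));
  }
  if (input !== null && typeof input === 'object') {
    const output: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(input as Record<string, unknown>)) {
      output[key] = SENSITIVE_KEYS.has(key.toLowerCase())
        ? '[REDACTED]'
        : redactSensitive(value);
    }
    return output;
  }
  return input;
}

console.log(
  redactSensitive({
    user: 'alice',
    password: 'hunter2',
    headers: { Authorization: 'Bearer abc', accept: '*/*' },
  })
);
// { user: 'alice', password: '[REDACTED]',
//   headers: { Authorization: '[REDACTED]', accept: '*/*' } }
```

The same function could be applied to log metadata before calling `logger.error` and to request data inside `beforeSend`.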
Performance Monitoring
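// src/middlewares/performance.middleware.ts
import type { Request, Response, NextFunction } from 'express';
import onHeaders from 'on-headers'; // assumed dependency, used to hook header writing
import { logger } from '../lib/logger';

const SLOW_REQUEST_THRESHOLD_MS = 1000;

export const performanceMiddleware = (req: Request, res: Response, next: NextFunction) => {
  const start = process.hrtime.bigint();
  // Nanoseconds elapsed since the request started, converted to milliseconds
  const elapsedMs = () => Number(process.hrtime.bigint() - start) / 1_000_000;

  // Headers are already sent when 'finish' fires, so set the
  // response-time header just before the headers are written
  onHeaders(res, () => {
    res.setHeader('X-Response-Time', `${elapsedMs().toFixed(1)}ms`);
  });

  res.on('finish', () => {
    const duration = elapsedMs();
    // Log slow requests
    if (duration > SLOW_REQUEST_THRESHOLD_MS) {
      logger.warn('Slow request detected', {
        method: req.method,
        url: req.url,
        duration,
        threshold: SLOW_REQUEST_THRESHOLD_MS
      });
    }
  });

  next();
};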
Grafana Dashboard Config
{
"dashboard": {
"title": "Service Metrics",
"panels": [
{
"title": "Request Rate",
"targets": [{
"expr": "rate(http_requests_total[5m])"
}]
},
{
"title": "Request Duration",
"targets": [{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
}]
},
{
"title": "Error Rate",
"targets": [{
"expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
}]
},
{
"title": "Active Users",
"targets": [{
"expr": "active_users"
}]
}
]
}
}
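The "Request Duration" panel depends on `histogram_quantile`, which estimates a percentile from bucket counts instead of reading raw timings. The sketch below shows the linear interpolation behind that estimate in simplified form. `estimateQuantile` is a hypothetical function, not Prometheus source code, and it assumes sorted cumulative buckets with non-negative bounds that end in a `+Inf` bucket.

```typescript
interface Bucket {
  le: number;    // upper bound; use Infinity for the +Inf bucket
  count: number; // cumulative count of observations <= le
}

// Simplified sketch: returns NaN for inputs outside its stated assumptions.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (q <= 0 || q > 1 || buckets.length < 2) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const idx = buckets.findIndex((bucket) => bucket.count >= rank);

  // A rank that lands in +Inf cannot be interpolated;
  // fall back to the highest finite upper bound.
  if (idx === buckets.length - 1) return buckets[buckets.length - 2].le;

  const lowerBound = idx === 0 ? 0 : buckets[idx - 1].le;
  const countBelow = idx === 0 ? 0 : buckets[idx - 1].count;
  const countInBucket = buckets[idx].count - countBelow;
  // Assume observations are spread evenly inside the bucket.
  return lowerBound + (buckets[idx].le - lowerBound) * ((rank - countBelow) / countInBucket);
}

const latencyBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.3, count: 80 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.95, latencyBuckets)); // approximately 0.8125
```

In the example, the 95th of 100 observations falls in the 0.5 to 1 second bucket, which holds observations 91 through 98, so the estimate is 0.5 + 0.5 × 5/8. The result can only be as precise as the bucket boundaries, which is a reason to place boundaries close to the latency targets you alert on.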
Alerting Rules
# prometheus/alerts.yml
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, so the threshold is a fraction of traffic
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for 5 minutes"
      - alert: HighLatency
        # histogram_quantile needs per-bucket rates aggregated by le
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is above 1s"
      - alert: ServiceDown
        expr: up{job="service"} == 0
        for: 1m
        annotations:
          summary: "Service is down"
          description: "Service has been down for 1 minute"
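The `HighErrorRate` description speaks of a percentage, so the quantity of interest is the ratio of the 5xx rate to the total request rate. The worked arithmetic below uses made-up counter samples to show the difference between the two. `perSecondRate` is a hypothetical simplification; Prometheus `rate()` additionally handles counter resets and extrapolation.

```typescript
interface CounterSample {
  timestampSec: number;
  value: number;
}

// Simplified: average per-second increase between two samples.
function perSecondRate(first: CounterSample, last: CounterSample): number {
  const elapsed = last.timestampSec - first.timestampSec;
  return elapsed > 0 ? (last.value - first.value) / elapsed : 0;
}

// Five-minute window: 6000 requests in total, 360 of them answered with 5xx.
const totalRate = perSecondRate(
  { timestampSec: 0, value: 10000 },
  { timestampSec: 300, value: 16000 }
);
const errorRate = perSecondRate(
  { timestampSec: 0, value: 40 },
  { timestampSec: 300, value: 400 }
);
const errorRatio = errorRate / totalRate;

console.log(totalRate);  // 20 (requests per second)
console.log(errorRate);  // 1.2 (errors per second)
console.log(errorRatio); // about 0.06, i.e. 6% of traffic
```

Comparing the raw error rate (1.2 per second) with 0.05 would fire on any busy service, while the ratio (about 0.06) expresses the share of failing traffic regardless of volume.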
Best Practices
- Logging
  - Use structured logging (JSON format)
  - Include correlation IDs for request tracing
  - Log at appropriate levels (ERROR, WARN, INFO, DEBUG)
  - Avoid logging sensitive data
- Metrics
  - Use standard metric types (Counter, Gauge, Histogram)
  - Keep cardinality low (avoid high-cardinality labels)
  - Define SLIs and SLOs for critical paths
  - Monitor business metrics, not just technical ones
- Tracing
  - Add traces for critical operations
  - Include relevant context in spans
  - Sample appropriately to control costs
  - Use distributed tracing for microservices
- Alerting
  - Alert on symptoms, not causes
  - Include runbook links in alerts
  - Avoid alert fatigue with proper thresholds
  - Test alerting rules regularly
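The cardinality advice matters for the `route` label in particular: when no route pattern is available and the raw request path is used, every distinct URL becomes a new time series. The sketch below collapses ID-like path segments into placeholders. `normalizeRoute` and its two patterns are assumptions for illustration; real services may need additional rules.

```typescript
const UUID_SEGMENT = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_SEGMENT = /^\d+$/;

// Hypothetical helper: replaces numeric and UUID path segments with
// placeholders and drops the query string.
function normalizeRoute(path: string): string {
  const withoutQuery = path.split('?')[0];
  return withoutQuery
    .split('/')
    .map((segment) => {
      if (NUMERIC_SEGMENT.test(segment)) return ':id';
      if (UUID_SEGMENT.test(segment)) return ':uuid';
      return segment;
    })
    .join('/');
}

console.log(normalizeRoute('/users/123/orders/456?page=2')); // /users/:id/orders/:id
console.log(normalizeRoute('/health'));                      // /health
```

Applied before the label is recorded, this keeps the number of `route` values proportional to the number of endpoints instead of the number of resources.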