---
name: observability-monitoring
description: Observability and monitoring patterns for GoodGo microservices. Use when adding metrics, implementing logging, setting up tracing, creating health checks, or debugging production issues.
---

# Observability & Monitoring Patterns

## When to Use This Skill

Use this skill when:

- Setting up logging infrastructure
- Implementing metrics collection
- Adding distributed tracing
- Creating health check endpoints
- Setting up monitoring dashboards
- Debugging production issues
- Implementing alerting rules
- Analyzing performance bottlenecks

## Core Concepts

### Three Pillars of Observability

1. **Logs**: Event records for debugging
2. **Metrics**: Numerical measurements over time
3. **Traces**: Request flow across services

### Tech Stack

- **Logging**: Winston, Pino
- **Metrics**: Prometheus + Grafana
- **Tracing**: OpenTelemetry + Jaeger
- **APM**: DataDog or New Relic (optional)

### Observability Stack Architecture

The three pillars feed separate storage backends, and Grafana queries all of them to give a single view of system behavior:

```mermaid
graph TB
    subgraph "Application Layer"
        App[Microservice]
    end

    subgraph "Three Pillars of Observability"
        Logs[Logs<br/>Winston/Pino]
        Metrics[Metrics<br/>Prometheus]
        Traces[Traces<br/>OpenTelemetry]
    end

    subgraph "Aggregation & Storage"
        Loki[Loki<br/>Log Aggregation]
        Prom[Prometheus<br/>Metrics Storage]
        Jaeger[Jaeger<br/>Trace Storage]
    end

    subgraph "Visualization & Alerting"
        Grafana[Grafana<br/>Dashboards]
        AlertManager[AlertManager<br/>Alerts]
    end

    App -->|Structured Logs| Logs
    App -->|HTTP Metrics| Metrics
    App -->|Distributed Spans| Traces

    Logs -->|Collect| Loki
    Metrics -->|Scrape /metrics| Prom
    Traces -->|Export| Jaeger

    Loki -->|Query| Grafana
    Prom -->|Query| Grafana
    Prom -->|Alerts| AlertManager
    Jaeger -->|Query| Grafana

    style App fill:#e1f5ff
    style Logs fill:#fff4e1
    style Metrics fill:#e1ffe1
    style Traces fill:#ffe1f5
    style Grafana fill:#e1e1ff
```

## Structured Logging

```typescript
// src/lib/logger.ts
import winston from 'winston';
// Assumes an Express app; these types are used by the request logger below
import type { Request, Response, NextFunction } from 'express';

const logFormat = winston.format.combine(
  winston.format.timestamp(),
  winston.format.errors({ stack: true }),
  winston.format.json()
);

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: logFormat,
  defaultMeta: {
    service: process.env.SERVICE_NAME || 'unknown',
    environment: process.env.NODE_ENV || 'development'
  },
  transports: [
    new winston.transports.Console({
      // Human-readable output in development, JSON everywhere else
      format: process.env.NODE_ENV === 'development'
        ? winston.format.combine(
            winston.format.colorize(),
            winston.format.simple()
          )
        : logFormat
    }),
    // Production: Send to log aggregation service
    ...(process.env.NODE_ENV === 'production'
      ? [new winston.transports.Http({
          host: 'logs.example.com',
          path: '/collect',
          ssl: true
        })]
      : [])
  ]
});

// Request logger middleware
export const requestLogger = (req: Request, res: Response, next: NextFunction) => {
  const start = Date.now();

  // 'finish' fires once the response has been handed off, so status and duration are final
  res.on('finish', () => {
    const duration = Date.now() - start;

    logger.info('HTTP Request', {
      method: req.method,
      url: req.url,
      status: res.statusCode,
      duration,
      ip: req.ip,
      userAgent: req.get('user-agent'),
      correlationId: req.headers['x-correlation-id']
    });
  });

  next();
};
```

### Logging Flow

Each request is logged with a correlation ID, and the structured logs flow from the service to the aggregator and on to the dashboard:

```mermaid
sequenceDiagram
    participant Client
    participant Service as Microservice
    participant Logger as Winston/Pino Logger
    participant Aggregator as Log Aggregator<br/>(Loki)
    participant Dashboard as Grafana Dashboard

    Client->>Service: HTTP Request<br/>(with x-correlation-id)
    Service->>Service: Generate/Extract<br/>Correlation ID
    Service->>Logger: Log Request Start<br/>{correlationId, method, url}
    Service->>Service: Process Request
    Service->>Logger: Log Business Event<br/>{correlationId, event, data}
    Service->>Client: HTTP Response<br/>(with x-correlation-id)
    Service->>Logger: Log Request End<br/>{correlationId, status, duration}
    Logger->>Aggregator: Send Structured Logs<br/>(JSON format)
    Aggregator->>Dashboard: Index & Store Logs
    Dashboard->>Dashboard: Query by correlationId<br/>to trace request flow
```

## Metrics Collection

```typescript
// src/lib/metrics.ts
import { Registry, Counter, Histogram, Gauge } from 'prom-client';
// Assumes an Express app; these types are used by the middleware below
import type { Request, Response, NextFunction } from 'express';

export const register = new Registry();

// HTTP metrics
// `registers: [register]` attaches each metric to this registry
// instead of prom-client's global default registry.
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10],
  registers: [register]
});

export const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status'],
  registers: [register]
});

// Business metrics
export const userRegistrations = new Counter({
  name: 'user_registrations_total',
  help: 'Total number of user registrations',
  labelNames: ['type'],
  registers: [register]
});

export const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Number of active users',
  labelNames: ['status'],
  registers: [register]
});

// Metrics middleware
export const metricsMiddleware = (req: Request, res: Response, next: NextFunction) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Falls back to the raw path when no route matched, which can inflate label cardinality
    const route = req.route?.path || req.path;

    httpRequestDuration
      .labels(req.method, route, res.statusCode.toString())
      .observe(duration);

    httpRequestTotal
      .labels(req.method, route, res.statusCode.toString())
      .inc();
  });

  next();
};

// Metrics endpoint
export const metricsHandler = async (req: Request, res: Response) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
};
```

### Metrics Collection Flow

Each service records metrics in its registry and exposes them on `/metrics`, where Prometheus scrapes them for dashboards and alerting:

```mermaid
graph LR
    subgraph "Service Instance"
        App[Application]
        Middleware[Metrics Middleware]
        Registry[Prometheus Registry]
        Endpoint["/metrics Endpoint"]
    end

    subgraph "Metrics Types"
        Counter[Counter<br/>http_requests_total]
        Gauge[Gauge<br/>active_users]
        Histogram[Histogram<br/>request_duration]
    end

    subgraph "Collection"
        Prometheus[Prometheus<br/>Scraper]
    end

    subgraph "Storage & Query"
        PromDB[(Prometheus<br/>Time Series DB)]
    end

    subgraph "Visualization"
        Grafana[Grafana<br/>Dashboards]
        Alerts[AlertManager<br/>Rules]
    end

    App -->|HTTP Request| Middleware
    Middleware -->|Record| Counter
    Middleware -->|Record| Histogram
    App -->|Update| Gauge

    Counter --> Registry
    Gauge --> Registry
    Histogram --> Registry

    Registry --> Endpoint
    Prometheus -->|Scrape every 15s| Endpoint
    Prometheus -->|Store| PromDB

    PromDB -->|Query| Grafana
    PromDB -->|Evaluate| Alerts
    Alerts -->|Trigger| Grafana

    style App fill:#e1f5ff
    style Prometheus fill:#ffe1e1
    style Grafana fill:#e1e1ff
```

## Distributed Tracing

```typescript
// src/lib/tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
import { trace, SpanStatusCode } from '@opentelemetry/api';

export const initTracing = () => {
  const jaegerExporter = new JaegerExporter({
    endpoint: process.env.JAEGER_ENDPOINT || 'http://localhost:14268/api/traces',
  });

  const sdk = new NodeSDK({
    resource: new Resource({
      [SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME || 'unknown',
      [SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
    }),
    traceExporter: jaegerExporter,
    instrumentations: [getNodeAutoInstrumentations()]
  });

  sdk.start();

  // Flush pending spans before the process exits
  process.on('SIGTERM', () => {
    sdk.shutdown()
      .then(() => console.log('Tracing terminated'))
      .catch((error) => console.log('Error terminating tracing', error))
      .finally(() => process.exit(0));
  });
};

// Custom span creation
// Note: startSpan() does not make the span active in the current context;
// use tracer.startActiveSpan() if spans created inside `fn` should become children.
export const tracedOperation = async <T>(name: string, fn: () => Promise<T> | T): Promise<T> => {
  const tracer = trace.getTracer('application');
  const span = tracer.startSpan(name);

  try {
    const result = await fn();
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    // Catch variables are `unknown` under strict TypeScript, so narrow before use
    const err = error instanceof Error ? error : new Error(String(error));
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: err.message
    });
    span.recordException(err);
    throw error;
  }
  finally {
    span.end();
  }
};
```

### Distributed Tracing Flow

OpenTelemetry propagates a single trace ID across service boundaries, so the spans each service exports can be joined into one trace:

```mermaid
sequenceDiagram
    participant Client
    participant Gateway as API Gateway
    participant ServiceA as Service A<br/>(User Service)
    participant ServiceB as Service B<br/>(Order Service)
    participant DB as Database
    participant Jaeger as Jaeger<br/>Collector

    Client->>Gateway: Request<br/>(Trace ID: abc123)
    Gateway->>Gateway: Create Root Span<br/>Span: gateway.request

    Gateway->>ServiceA: HTTP Call<br/>(Trace ID: abc123,<br/>Span ID: span-1)
    ServiceA->>ServiceA: Create Child Span<br/>Span: user.getById
    ServiceA->>DB: Query User<br/>(Trace ID: abc123,<br/>Span ID: span-2)
    DB-->>ServiceA: User Data
    ServiceA->>ServiceA: End Span span-2

    ServiceA->>ServiceB: HTTP Call<br/>(Trace ID: abc123,<br/>Span ID: span-3)
    ServiceB->>ServiceB: Create Child Span<br/>Span: order.getByUserId
    ServiceB->>DB: Query Orders<br/>(Trace ID: abc123,<br/>Span ID: span-4)
    DB-->>ServiceB: Orders Data
    ServiceB->>ServiceB: End Span span-4
    ServiceB->>ServiceB: End Span span-3
    ServiceB-->>ServiceA: Response
    ServiceA->>ServiceA: End Span span-1
    ServiceA-->>Gateway: Response
    Gateway->>Gateway: End Span gateway.request
    Gateway-->>Client: Final Response

    ServiceA->>Jaeger: Export Spans<br/>(Trace ID: abc123)
    ServiceB->>Jaeger: Export Spans<br/>(Trace ID: abc123)
    Gateway->>Jaeger: Export Spans<br/>(Trace ID: abc123)

    Note over Jaeger: All spans linked by<br/>Trace ID: abc123
```

## Health Checks

```typescript
// src/modules/health/health.controller.ts
// Assumed imports: Express request types, Prisma, and ioredis for `Redis`
import type { Request, Response } from 'express';
import { PrismaClient } from '@prisma/client';
import Redis from 'ioredis';

interface CheckResult {
  status: 'healthy' | 'warning' | 'unhealthy';
  [detail: string]: unknown;
}

export class HealthController {
  constructor(
    private prisma: PrismaClient,
    private redis: Redis
  ) {}

  // Liveness probe - is the service running?
  async liveness(req: Request, res: Response) {
    res.json({
      status: 'ok',
      timestamp: new Date().toISOString()
    });
  }

  // Readiness probe - is the service ready for traffic?
  async readiness(req: Request, res: Response) {
    const checks = await this.runHealthChecks();
    // A 'warning' (such as high heap usage) is reported but does not fail the probe
    const isHealthy = Object.values(checks).every(check => check.status !== 'unhealthy');

    res.status(isHealthy ? 200 : 503).json({
      status: isHealthy ? 'ready' : 'not ready',
      checks,
      timestamp: new Date().toISOString()
    });
  }

  // Detailed health check
  async health(req: Request, res: Response) {
    const checks = await this.runHealthChecks();
    const isHealthy = Object.values(checks).every(check => check.status !== 'unhealthy');

    res.status(isHealthy ? 200 : 503).json({
      status: isHealthy ? 'healthy' : 'unhealthy',
      version: process.env.SERVICE_VERSION || '1.0.0',
      uptime: process.uptime(),
      checks,
      timestamp: new Date().toISOString()
    });
  }

  private async runHealthChecks() {
    const checks: Record<string, CheckResult> = {};

    // Database check
    try {
      const start = Date.now();
      await this.prisma.$queryRaw`SELECT 1`;
      checks.database = {
        status: 'healthy',
        responseTime: Date.now() - start
      };
    } catch (error) {
      checks.database = {
        status: 'unhealthy',
        error: error instanceof Error ? error.message : String(error)
      };
    }

    // Redis check
    try {
      const start = Date.now();
      await this.redis.ping();
      checks.redis = {
        status: 'healthy',
        responseTime: Date.now() - start
      };
    } catch (error) {
      checks.redis = {
        status: 'unhealthy',
        error: error instanceof Error ? error.message : String(error)
      };
    }

    // Memory check (values reported in MB; warns above 500 MB of heap)
    const memUsage = process.memoryUsage();
    checks.memory = {
      status: memUsage.heapUsed < 500 * 1024 * 1024 ?
        'healthy' : 'warning',
      heapUsed: Math.round(memUsage.heapUsed / 1024 / 1024),
      heapTotal: Math.round(memUsage.heapTotal / 1024 / 1024),
      rss: Math.round(memUsage.rss / 1024 / 1024)
    };

    return checks;
  }
}
```

## Error Tracking

```typescript
// src/lib/error-tracking.ts
import * as Sentry from '@sentry/node';
// Assumes an Express app and the logger from src/lib/logger.ts
import type { Request, Response, NextFunction } from 'express';
import { logger } from './logger';

export const initErrorTracking = () => {
  // Sentry stays disabled when no DSN is configured
  if (process.env.SENTRY_DSN) {
    Sentry.init({
      dsn: process.env.SENTRY_DSN,
      environment: process.env.NODE_ENV,
      tracesSampleRate: 0.1,
      beforeSend(event, hint) {
        // Filter sensitive data
        if (event.request?.cookies) {
          delete event.request.cookies;
        }
        return event;
      }
    });
  }
};

// Error handler middleware
export const errorHandler = (
  err: Error,
  req: Request,
  res: Response,
  next: NextFunction
) => {
  // Log error
  logger.error('Unhandled error', {
    error: err.message,
    stack: err.stack,
    url: req.url,
    method: req.method,
    correlationId: req.headers['x-correlation-id']
  });

  // Report to Sentry
  // `req.user` assumes an auth middleware has augmented the Express Request type
  Sentry.captureException(err, {
    tags: {
      service: process.env.SERVICE_NAME
    },
    user: {
      id: req.user?.id
    }
  });

  // If the response has already started, defer to Express's default handler
  if (res.headersSent) {
    return next(err);
  }

  // Send response; hide internal error details in production
  res.status(500).json({
    success: false,
    error: {
      code: 'INTERNAL_ERROR',
      message: process.env.NODE_ENV === 'production' ?
        'Internal server error' : err.message
    }
  });
};
```

## Performance Monitoring

```typescript
// src/middlewares/performance.middleware.ts
// Assumes an Express app and the logger from src/lib/logger.ts
import type { Request, Response, NextFunction } from 'express';
import { logger } from '../lib/logger';

export const performanceMiddleware = (req: Request, res: Response, next: NextFunction) => {
  const start = process.hrtime.bigint();

  res.on('finish', () => {
    const end = process.hrtime.bigint();
    const duration = Number(end - start) / 1_000_000; // Convert nanoseconds to milliseconds

    // Log slow requests
    if (duration > 1000) {
      logger.warn('Slow request detected', {
        method: req.method,
        url: req.url,
        duration,
        threshold: 1000
      });
    }

    // Headers are already sent when 'finish' fires, so calling res.set() here
    // would throw. To expose X-Response-Time, set it before the response is
    // written (for example with the `response-time` middleware).
  });

  next();
};
```

## Grafana Dashboard Config

```json
{
  "dashboard": {
    "title": "Service Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "rate(http_requests_total[5m])"
        }]
      },
      {
        "title": "Request Duration",
        "targets": [{
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
        }]
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
        }]
      },
      {
        "title": "Active Users",
        "targets": [{
          "expr": "active_users"
        }]
      }
    ]
  }
}
```

## Alerting Rules

```yaml
# prometheus/alerts.yml
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, so the 0.05 threshold means 5%
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for 5 minutes"

      - alert: HighLatency
        # histogram_quantile needs the per-bucket rates, aggregated by the `le` label
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is above 1s"

      - alert: ServiceDown
        expr: up{job="service"} == 0
        for: 1m
        annotations:
          summary: "Service is down"
          description: "Service has been down for 1 minute"
```

## Best Practices

1. **Logging**
   - Use structured logging (JSON format)
   - Include correlation IDs for request tracing
   - Log at appropriate levels (ERROR, WARN, INFO, DEBUG)
   - Avoid logging sensitive data

2. **Metrics**
   - Use standard metric types (Counter, Gauge, Histogram)
   - Keep cardinality low (avoid high-cardinality labels)
   - Define SLIs and SLOs for critical paths
   - Monitor business metrics, not just technical ones

3. **Tracing**
   - Add traces for critical operations
   - Include relevant context in spans
   - Sample appropriately to control costs
   - Use distributed tracing for microservices

4. **Alerting**
   - Alert on symptoms, not causes
   - Include runbook links in alerts
   - Avoid alert fatigue with proper thresholds
   - Test alerting rules regularly
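
### Example: Keeping the `route` Label Low-Cardinality

The "keep cardinality low" guidance above applies directly to the `route` label in the metrics middleware: when no Express route matches, the label falls back to the raw request path, so every distinct ID can create a new time series. The following is a minimal sketch of one way to limit that, assuming IDs appear as purely numeric or UUID-shaped path segments. `normalizeRouteLabel` is a hypothetical helper introduced here for illustration; it is not part of prom-client or Express.

```typescript
// Hypothetical helper (not part of prom-client or Express): collapses
// ID-like path segments so the `route` label stays low-cardinality.
// Assumption: IDs are either all digits or UUID-shaped.
const UUID_SEGMENT =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_SEGMENT = /^\d+$/;

export function normalizeRouteLabel(path: string): string {
  // Drop any query string; it should never end up in a label value.
  const queryStart = path.indexOf('?');
  const pathname = queryStart === -1 ? path : path.slice(0, queryStart);

  const segments = pathname
    .split('/')
    // Ignore empty segments produced by leading, trailing, or doubled slashes.
    .filter((segment) => segment.length > 0)
    // Replace each ID-like segment with a fixed placeholder.
    .map((segment) =>
      NUMERIC_SEGMENT.test(segment) || UUID_SEGMENT.test(segment)
        ? ':id'
        : segment
    );

  return '/' + segments.join('/');
}
```

In `metricsMiddleware`, the fallback could then become `req.route?.path || normalizeRouteLabel(req.path)`. Paths that carry other unbounded values (slugs, emails, tokens) would need additional rules, or a fixed label such as `unmatched`.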