---
name: observability-monitoring
description: Observability and monitoring patterns for GoodGo microservices. Use when adding metrics, implementing logging, setting up tracing, creating health checks, or debugging production issues.
---

# Observability & Monitoring Patterns

## When to Use This Skill

Use this skill when:
- Setting up logging infrastructure
- Implementing metrics collection
- Adding distributed tracing
- Creating health check endpoints
- Setting up monitoring dashboards
- Debugging production issues
- Implementing alerting rules
- Analyzing performance bottlenecks

## Core Concepts

### Three Pillars of Observability
1. **Logs**: Event records for debugging
2. **Metrics**: Numerical measurements over time
3. **Traces**: Request flow across services

### Tech Stack
- **Logging**: Winston, Pino
- **Metrics**: Prometheus + Grafana
- **Tracing**: OpenTelemetry + Jaeger
- **APM**: DataDog or New Relic (optional)

### Observability Stack Architecture

The observability stack consists of three pillars working together to provide comprehensive visibility into system behavior:

```mermaid
graph TB
    subgraph "Application Layer"
        App[Microservice]
    end

    subgraph "Three Pillars of Observability"
        Logs[Logs<br/>Winston/Pino]
        Metrics[Metrics<br/>Prometheus]
        Traces[Traces<br/>OpenTelemetry]
    end

    subgraph "Aggregation & Storage"
        Loki[Loki<br/>Log Aggregation]
        Prom[Prometheus<br/>Metrics Storage]
        Jaeger[Jaeger<br/>Trace Storage]
    end

    subgraph "Visualization & Alerting"
        Grafana[Grafana<br/>Dashboards]
        AlertManager[AlertManager<br/>Alerts]
    end

    App -->|Structured Logs| Logs
    App -->|HTTP Metrics| Metrics
    App -->|Distributed Spans| Traces

    Logs -->|Collect| Loki
    Metrics -->|Scrape /metrics| Prom
    Traces -->|Export| Jaeger

    Loki -->|Query| Grafana
    Prom -->|Query| Grafana
    Prom -->|Alerts| AlertManager
    Jaeger -->|Query| Grafana

    style App fill:#e1f5ff
    style Logs fill:#fff4e1
    style Metrics fill:#e1ffe1
    style Traces fill:#ffe1f5
    style Grafana fill:#e1e1ff
```

## Structured Logging

```typescript
// src/lib/logger.ts
import type { Request, Response, NextFunction } from 'express';
import winston from 'winston';

const logFormat = winston.format.combine(
  winston.format.timestamp(),
  winston.format.errors({ stack: true }),
  winston.format.json()
);

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: logFormat,
  defaultMeta: {
    service: process.env.SERVICE_NAME || 'unknown',
    environment: process.env.NODE_ENV || 'development'
  },
  transports: [
    new winston.transports.Console({
      format: process.env.NODE_ENV === 'development'
        ? winston.format.combine(
            winston.format.colorize(),
            winston.format.simple()
          )
        : logFormat
    }),
    // Production: Send to log aggregation service
    ...(process.env.NODE_ENV === 'production'
      ? [new winston.transports.Http({
          host: 'logs.example.com',
          path: '/collect',
          ssl: true
        })]
      : [])
  ]
});

// Request logger middleware
export const requestLogger = (req: Request, res: Response, next: NextFunction) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = Date.now() - start;

    logger.info('HTTP Request', {
      method: req.method,
      url: req.url,
      status: res.statusCode,
      duration,
      ip: req.ip,
      userAgent: req.get('user-agent'),
      correlationId: req.headers['x-correlation-id']
    });
  });

  next();
};
```

### Logging Flow

Each request carries a correlation ID that is attached to every log entry, so the entries for one request can be queried together after aggregation:

```mermaid
sequenceDiagram
    participant Client
    participant Service as Microservice
    participant Logger as Winston/Pino Logger
    participant Aggregator as Log Aggregator<br/>(Loki)
    participant Dashboard as Grafana Dashboard

    Client->>Service: HTTP Request<br/>(with x-correlation-id)
    Service->>Service: Generate/Extract<br/>Correlation ID
    Service->>Logger: Log Request Start<br/>{correlationId, method, url}
    Service->>Service: Process Request
    Service->>Logger: Log Business Event<br/>{correlationId, event, data}
    Service->>Client: HTTP Response<br/>(with x-correlation-id)
    Service->>Logger: Log Request End<br/>{correlationId, status, duration}

    Logger->>Aggregator: Send Structured Logs<br/>(JSON format)
    Aggregator->>Aggregator: Index & Store Logs
    Dashboard->>Aggregator: Query by correlationId<br/>to trace request flow
```
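
The "Generate/Extract Correlation ID" step in the diagram is not covered by the request logger, which only reads the header. A minimal sketch of that step follows, assuming the ID travels in `x-correlation-id`; the helper name `resolveCorrelationId` and the validation pattern are hypothetical, not part of the existing codebase.

```typescript
import { randomUUID } from 'node:crypto';

// Header name taken from the request logger; everything else here is illustrative.
const CORRELATION_HEADER = 'x-correlation-id';

// Hypothetical helper: reuse the caller's ID when it looks sane, otherwise mint a new one.
function resolveCorrelationId(
  headers: Record<string, string | string[] | undefined>
): string {
  const incoming = headers[CORRELATION_HEADER];
  const candidate = Array.isArray(incoming) ? incoming[0] : incoming;

  // Accept only short, printable IDs so untrusted input cannot bloat or break log lines.
  if (candidate && /^[\w.-]{1,128}$/.test(candidate)) {
    return candidate;
  }
  return randomUUID();
}
```

In an Express middleware the result would typically be written back to `req.headers['x-correlation-id']` and to the response header before calling `next()`, so the logger and downstream services see the same value.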

## Metrics Collection

```typescript
// src/lib/metrics.ts
import type { Request, Response, NextFunction } from 'express';
import { Registry, Counter, Histogram, Gauge } from 'prom-client';

export const register = new Registry();

// HTTP metrics
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});

export const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status']
});

// Business metrics
export const userRegistrations = new Counter({
  name: 'user_registrations_total',
  help: 'Total number of user registrations',
  labelNames: ['type']
});

export const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Number of active users',
  labelNames: ['status']
});

// Register metrics
register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
register.registerMetric(userRegistrations);
register.registerMetric(activeUsers);

// Metrics middleware
export const metricsMiddleware = (req: Request, res: Response, next: NextFunction) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route?.path || req.path;

    httpRequestDuration
      .labels(req.method, route, res.statusCode.toString())
      .observe(duration);

    httpRequestTotal
      .labels(req.method, route, res.statusCode.toString())
      .inc();
  });

  next();
};

// Metrics endpoint
export const metricsHandler = async (req: Request, res: Response) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
};
```
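
The middleware falls back to `req.path` when no route template matched, which can create one time series per distinct URL. One possible mitigation, sketched under the assumption that dynamic segments are numeric IDs or UUIDs, is to collapse them before using the path as a label. The function name `normalizeRouteLabel` and the placeholder names are hypothetical.

```typescript
const UUID_SEGMENT =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Hypothetical fallback for requests that matched no route template.
// Replaces ID-like segments with placeholders to keep label cardinality bounded.
function normalizeRouteLabel(path: string): string {
  // Drop any query string before splitting into segments.
  const [pathname = ''] = path.split('?');

  const normalized = pathname
    .split('/')
    .map((segment) => {
      if (/^\d+$/.test(segment)) return ':id';
      if (UUID_SEGMENT.test(segment)) return ':uuid';
      return segment;
    })
    .join('/');

  return normalized || '/';
}
```

With this in place the label expression could become `req.route?.path || normalizeRouteLabel(req.path)`. Whether these two patterns cover every dynamic segment depends on the service's URL scheme.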

### Metrics Collection Flow

Metrics are collected from services and exposed to Prometheus for monitoring and alerting:

```mermaid
graph LR
    subgraph "Service Instance"
        App[Application]
        Middleware[Metrics Middleware]
        Registry[Prometheus Registry]
        Endpoint["/metrics Endpoint"]
    end

    subgraph "Metrics Types"
        Counter[Counter<br/>http_requests_total]
        Gauge[Gauge<br/>active_users]
        Histogram[Histogram<br/>request_duration]
    end

    subgraph "Collection"
        Prometheus[Prometheus<br/>Scraper]
    end

    subgraph "Storage & Query"
        PromDB[(Prometheus<br/>Time Series DB)]
    end

    subgraph "Visualization"
        Grafana[Grafana<br/>Dashboards]
        Alerts[AlertManager<br/>Rules]
    end

    App -->|HTTP Request| Middleware
    Middleware -->|Record| Counter
    Middleware -->|Record| Histogram
    App -->|Update| Gauge

    Counter --> Registry
    Gauge --> Registry
    Histogram --> Registry
    Registry --> Endpoint

    Prometheus -->|Scrape every 15s| Endpoint
    Prometheus -->|Store| PromDB

    PromDB -->|Query| Grafana
    PromDB -->|Evaluate| Alerts
    Alerts -->|Trigger| Grafana

    style App fill:#e1f5ff
    style Prometheus fill:#ffe1e1
    style Grafana fill:#e1e1ff
```

## Distributed Tracing

```typescript
// src/lib/tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
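// Note: newer OpenTelemetry releases deprecate this exporter in favor of OTLP,
// which Jaeger accepts natively; consider an OTLP trace exporter for new services.
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';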

export const initTracing = () => {
  const jaegerExporter = new JaegerExporter({
    endpoint: process.env.JAEGER_ENDPOINT || 'http://localhost:14268/api/traces',
  });

  const sdk = new NodeSDK({
    resource: new Resource({
      [SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME || 'unknown',
      [SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
    }),
    traceExporter: jaegerExporter,
    instrumentations: [getNodeAutoInstrumentations()]
  });

  sdk.start();

  process.on('SIGTERM', () => {
    sdk.shutdown()
      .then(() => console.log('Tracing terminated'))
      .catch((error) => console.log('Error terminating tracing', error))
      .finally(() => process.exit(0));
  });
};

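// Custom span creation
import { trace, SpanStatusCode } from '@opentelemetry/api';

export const tracedOperation = async <T>(
  name: string,
  fn: () => Promise<T> | T
): Promise<T> => {
  const tracer = trace.getTracer('application');

  // startActiveSpan makes the span current, so spans created inside fn become its children
  return tracer.startActiveSpan(name, async (span) => {
    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      // Caught values are not guaranteed to be Error instances
      const err = error instanceof Error ? error : new Error(String(error));
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw error;
    } finally {
      span.end();
    }
  });
};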
```

### Distributed Tracing Flow

Distributed tracing tracks requests across multiple services using OpenTelemetry:

```mermaid
sequenceDiagram
    participant Client
    participant Gateway as API Gateway
    participant ServiceA as Service A<br/>(User Service)
    participant ServiceB as Service B<br/>(Order Service)
    participant DB as Database
    participant Jaeger as Jaeger<br/>Collector

    Client->>Gateway: Request<br/>(Trace ID: abc123)
    Gateway->>Gateway: Create Root Span<br/>Span: gateway.request
    Gateway->>ServiceA: HTTP Call<br/>(Trace ID: abc123,<br/>Span ID: span-1)

    ServiceA->>ServiceA: Create Child Span<br/>Span: user.getById
    ServiceA->>DB: Query User<br/>(Trace ID: abc123,<br/>Span ID: span-2)
    DB-->>ServiceA: User Data
    ServiceA->>ServiceA: End Span span-2
    ServiceA->>ServiceB: HTTP Call<br/>(Trace ID: abc123,<br/>Span ID: span-3)

    ServiceB->>ServiceB: Create Child Span<br/>Span: order.getByUserId
    ServiceB->>DB: Query Orders<br/>(Trace ID: abc123,<br/>Span ID: span-4)
    DB-->>ServiceB: Orders Data
    ServiceB->>ServiceB: End Span span-4
    ServiceB->>ServiceB: End Span span-3
    ServiceB-->>ServiceA: Response
    ServiceA->>ServiceA: End Span span-1
    ServiceA-->>Gateway: Response
    Gateway->>Gateway: End Span gateway.request
    Gateway-->>Client: Final Response

    ServiceA->>Jaeger: Export Spans<br/>(Trace ID: abc123)
    ServiceB->>Jaeger: Export Spans<br/>(Trace ID: abc123)
    Gateway->>Jaeger: Export Spans<br/>(Trace ID: abc123)

    Note over Jaeger: All spans linked by<br/>Trace ID: abc123
```
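
The trace and span IDs in the diagram travel between services in a request header. With OpenTelemetry's default propagator this is the W3C `traceparent` header, and auto-instrumentation normally reads and writes it without application code. The sketch below only illustrates what the header carries for version `00`; the helper names `parseTraceparent` and `formatTraceparent` are hypothetical.

```typescript
interface TraceContext {
  version: string;
  traceId: string;      // 32 lowercase hex characters
  parentSpanId: string; // 16 lowercase hex characters
  sampled: boolean;     // lowest bit of the trace-flags field
}

// Parses a version-00 header of the form: version-traceId-parentSpanId-flags
function parseTraceparent(header: string): TraceContext | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    header.trim()
  );
  if (!match) return null;

  const [, version = '', traceId = '', parentSpanId = '', flags = '00'] = match;

  // All-zero IDs and version "ff" are invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) {
    return null;
  }
  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Builds the header for an outgoing call: same trace ID, the caller's span as the new parent.
function formatTraceparent(context: TraceContext, currentSpanId: string): string {
  const flags = context.sampled ? '01' : '00';
  return `${context.version}-${context.traceId}-${currentSpanId}-${flags}`;
}
```

The trace ID stays constant across every hop while the span ID changes at each one, which is what lets Jaeger link all spans of a request into a single trace.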

## Health Checks

```typescript
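// src/modules/health/health.controller.ts
import type { Request, Response } from 'express';
import { PrismaClient } from '@prisma/client';
import Redis from 'ioredis'; // assumption: the Redis client is ioredis

export class HealthController {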
  constructor(
    private prisma: PrismaClient,
    private redis: Redis
  ) {}

  // Liveness probe - is the service running?
  async liveness(req: Request, res: Response) {
    res.json({
      status: 'ok',
      timestamp: new Date().toISOString()
    });
  }

  // Readiness probe - is the service ready for traffic?
  async readiness(req: Request, res: Response) {
    const checks = await this.runHealthChecks();
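    // Only 'unhealthy' fails the probe; a 'warning' (e.g. high memory) keeps the instance in rotation
    const isHealthy = Object.values(checks).every(check => check.status !== 'unhealthy');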

    res.status(isHealthy ? 200 : 503).json({
      status: isHealthy ? 'ready' : 'not ready',
      checks,
      timestamp: new Date().toISOString()
    });
  }

  // Detailed health check
  async health(req: Request, res: Response) {
    const checks = await this.runHealthChecks();
    // Only 'unhealthy' fails the check; a 'warning' is reported in the body but still returns 200
    const isHealthy = Object.values(checks).every(check => check.status !== 'unhealthy');

    res.status(isHealthy ? 200 : 503).json({
      status: isHealthy ? 'healthy' : 'unhealthy',
      version: process.env.SERVICE_VERSION || '1.0.0',
      uptime: process.uptime(),
      checks,
      timestamp: new Date().toISOString()
    });
  }

  private async runHealthChecks() {
    const checks: Record<string, any> = {};

    // Database check
    try {
      const start = Date.now();
      await this.prisma.$queryRaw`SELECT 1`;
      checks.database = {
        status: 'healthy',
        responseTime: Date.now() - start
      };
    } catch (error) {
      checks.database = {
        status: 'unhealthy',
        error: error instanceof Error ? error.message : String(error)
      };
    }

    // Redis check
    try {
      const start = Date.now();
      await this.redis.ping();
      checks.redis = {
        status: 'healthy',
        responseTime: Date.now() - start
      };
    } catch (error) {
      checks.redis = {
        status: 'unhealthy',
        error: error instanceof Error ? error.message : String(error)
      };
    }

    // Memory check
    const memUsage = process.memoryUsage();
    checks.memory = {
      status: memUsage.heapUsed < 500 * 1024 * 1024 ? 'healthy' : 'warning',
      heapUsed: Math.round(memUsage.heapUsed / 1024 / 1024),
      heapTotal: Math.round(memUsage.heapTotal / 1024 / 1024),
      rss: Math.round(memUsage.rss / 1024 / 1024)
    };

    return checks;
  }
}
```

## Error Tracking

```typescript
// src/lib/error-tracking.ts
import type { Request, Response, NextFunction } from 'express';
import * as Sentry from '@sentry/node';
import { logger } from './logger';

export const initErrorTracking = () => {
  if (process.env.SENTRY_DSN) {
    Sentry.init({
      dsn: process.env.SENTRY_DSN,
      environment: process.env.NODE_ENV,
      tracesSampleRate: 0.1,
      beforeSend(event, hint) {
        // Filter sensitive data
        if (event.request?.cookies) {
          delete event.request.cookies;
        }
        return event;
      }
    });
  }
};

// Error handler middleware
export const errorHandler = (
  err: Error,
  req: Request,
  res: Response,
  next: NextFunction
) => {
  // Log error
  logger.error('Unhandled error', {
    error: err.message,
    stack: err.stack,
    url: req.url,
    method: req.method,
    correlationId: req.headers['x-correlation-id']
  });

  // Report to Sentry
  Sentry.captureException(err, {
    tags: {
      service: process.env.SERVICE_NAME
    },
    user: {
      // Assumes an auth middleware has attached `user` to the request type
      id: req.user?.id
    }
  });

  // Send response
  res.status(500).json({
    success: false,
    error: {
      code: 'INTERNAL_ERROR',
      message: process.env.NODE_ENV === 'production'
        ? 'Internal server error'
        : err.message
    }
  });
};
```
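
The `beforeSend` hook strips cookies from Sentry events, but the metadata passed to `logger.error` is not filtered. A minimal sketch of a recursive redaction helper that could be applied to log metadata is shown below; the key list, the depth limit, and the name `redact` are assumptions chosen for illustration.

```typescript
// Hypothetical list of keys whose values should never reach logs (compared case-insensitively).
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'cookie', 'secret']);
const MAX_DEPTH = 5;

// Returns a copy of `value` with sensitive fields replaced; the input is not mutated.
function redact(value: unknown, depth = 0): unknown {
  if (value === null || typeof value !== 'object') return value;
  // Stop descending into very deep structures rather than risk leaking or looping.
  if (depth > MAX_DEPTH) return '[TRUNCATED]';

  if (Array.isArray(value)) {
    return value.map((item) => redact(item, depth + 1));
  }

  const output: Record<string, unknown> = {};
  for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
    output[key] = SENSITIVE_KEYS.has(key.toLowerCase())
      ? '[REDACTED]'
      : redact(inner, depth + 1);
  }
  return output;
}
```

A call such as `logger.error('Unhandled error', redact(meta) as object)` would apply it at the call site. Matching on key names alone will miss secrets stored under unexpected names, so this complements rather than replaces care about what gets logged.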

## Performance Monitoring

```typescript
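// src/middlewares/performance.middleware.ts
import type { Request, Response, NextFunction } from 'express';
import { logger } from '../lib/logger';

export const performanceMiddleware = (req: Request, res: Response, next: NextFunction) => {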
  const start = process.hrtime.bigint();

  res.on('finish', () => {
    const end = process.hrtime.bigint();
    const duration = Number(end - start) / 1000000; // Convert to milliseconds

    // Log slow requests
    if (duration > 1000) {
      logger.warn('Slow request detected', {
        method: req.method,
        url: req.url,
        duration,
        threshold: 1000
      });
    }

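    // Note: headers cannot be set here. By the time 'finish' fires they have
    // already been sent, so res.set() would throw. To expose X-Response-Time,
    // set it before the response is written (e.g. with the response-time package).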
  });

  next();
};
```

## Grafana Dashboard Config

```json
{
  "dashboard": {
    "title": "Service Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "rate(http_requests_total[5m])"
        }]
      },
      {
        "title": "Request Duration (p95)",
        "targets": [{
          "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
        }]
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
        }]
      },
      {
        "title": "Active Users",
        "targets": [{
          "expr": "active_users"
        }]
      }
    ]
  }
}
```
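
The Request Duration panel depends on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from individual observations. The sketch below is a simplified illustration of that estimate using linear interpolation inside the matching bucket; it is not Prometheus's implementation, and the `Bucket` type and `estimateQuantile` name are hypothetical.

```typescript
interface Bucket {
  le: number;    // upper bound of the bucket; the last one is Infinity
  count: number; // cumulative number of observations <= le
}

// Buckets must be sorted by ascending `le` and carry cumulative counts.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1 || buckets.length === 0) return NaN;

  const total = buckets.reduce((_, bucket) => bucket.count, 0);
  if (total === 0) return NaN;

  const rank = q * total;
  let previousLe = 0;
  let previousCount = 0;

  for (const bucket of buckets) {
    if (bucket.count >= rank) {
      // Inside the +Inf bucket there is no upper bound to interpolate towards.
      if (bucket.le === Infinity) return previousLe;
      if (bucket.count === previousCount) return previousLe;
      const fraction = (rank - previousCount) / (bucket.count - previousCount);
      return previousLe + (bucket.le - previousLe) * fraction;
    }
    previousLe = bucket.le;
    previousCount = bucket.count;
  }
  return NaN;
}
```

Because the estimate can never be finer than the bucket boundaries, the `buckets` array of `http_request_duration_seconds` should include bounds close to the latency thresholds that matter, such as the 1s alert threshold.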

## Alerting Rules

```yaml
# prometheus/alerts.yml
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
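        # Ratio of 5xx responses to all responses, so the 0.05 threshold means 5%
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05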
        for: 5m
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for 5 minutes"

      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is above 1s"

      - alert: ServiceDown
        expr: up{job="service"} == 0
        for: 1m
        annotations:
          summary: "Service is down"
          description: "Service has been down for 1 minute"
```

## Best Practices

1. **Logging**
   - Use structured logging (JSON format)
   - Include correlation IDs for request tracing
   - Log at appropriate levels (ERROR, WARN, INFO, DEBUG)
   - Avoid logging sensitive data

2. **Metrics**
   - Use standard metric types (Counter, Gauge, Histogram)
   - Keep cardinality low (avoid high-cardinality labels)
   - Define SLIs and SLOs for critical paths
   - Monitor business metrics, not just technical ones

3. **Tracing**
   - Add traces for critical operations
   - Include relevant context in spans
   - Sample appropriately to control costs
   - Use distributed tracing for microservices

4. **Alerting**
   - Alert on symptoms, not causes
   - Include runbook links in alerts
   - Avoid alert fatigue with proper thresholds
   - Test alerting rules regularly
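
The advice under Tracing to sample appropriately can be made concrete with a ratio-based decision derived from the trace ID, so that every service reaches the same verdict for a given trace. The sketch below is a simplified illustration of that idea, not the algorithm used by any particular SDK; the name `shouldSample` is hypothetical.

```typescript
// Hypothetical ratio sampler: keeps roughly `ratio` of traces, decided only from the trace ID.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio <= 0) return false;
  if (ratio >= 1) return true;

  // Interpret the first 8 hex characters as an unsigned 32-bit integer.
  const bucket = parseInt(traceId.slice(0, 8), 16);
  if (Number.isNaN(bucket)) return false;

  // Trace IDs are random, so this keeps approximately `ratio` of them.
  return bucket < ratio * 2 ** 32;
}
```

Since the decision is a pure function of the trace ID, a trace is either recorded by every participating service or by none, which avoids partial traces. In practice the OpenTelemetry SDK's built-in samplers would usually be configured instead of hand-rolling this logic.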