Observability Architecture
Note: Comprehensive observability with metrics, logging, and tracing
Overview Diagram
graph TD
subgraph "Services"
Service1[Service A]
Service2[Service B]
end
subgraph "Metrics"
Service1 -->|/metrics| Prom[Prometheus]
Service2 -->|/metrics| Prom
Prom --> Grafana[Grafana<br/>Dashboards]
end
subgraph "Logging"
Service1 -->|JSON Logs| Loki
Service2 -->|JSON Logs| Loki
Loki --> GrafanaLogs[Grafana<br/>Log Explorer]
end
subgraph "Tracing"
Service1 -->|Spans| Jaeger
Service2 -->|Spans| Jaeger
Jaeger --> JaegerUI[Jaeger UI]
end
classDef service fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef metrics fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef logging fill:#C05621,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef tracing fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef dashboard fill:#4A5568,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
class Service1,Service2 service;
class Prom metrics;
class Loki logging;
class Jaeger,JaegerUI tracing;
class Grafana,GrafanaLogs dashboard;
System Context
C4Context
title Observability System Context
Person(dev, "Developer", "Uses dashboards to monitor system")
Person(sre, "SRE", "Manages infrastructure & alerts")
System(obs, "Observability Stack", "Prometheus, Loki, Jaeger, Grafana")
System_Ext(service, "Microservices", "Sends telemetry data")
System_Ext(k8s, "Kubernetes", "Sends cluster metrics")
Rel(dev, obs, "Views Dashboards", "HTTPS")
Rel(sre, obs, "Configures Alerts", "HTTPS")
Rel(service, obs, "Push/Pull Telemetry", "HTTP/gRPC")
Rel(k8s, obs, "Exposes Metrics", "HTTP")
UpdateElementStyle(dev, $fontColor="white", $bgColor="#2D3748", $borderColor="white")
UpdateElementStyle(sre, $fontColor="white", $bgColor="#2D3748", $borderColor="white")
UpdateElementStyle(obs, $fontColor="white", $bgColor="#2C5282", $borderColor="white")
UpdateElementStyle(service, $fontColor="white", $bgColor="#4A5568", $borderColor="white")
UpdateElementStyle(k8s, $fontColor="white", $bgColor="#4A5568", $borderColor="white")
Context Description
- Observability Stack: Central hub for collecting and displaying data (Prometheus, Loki, Jaeger, Grafana).
- Microservices: Emit logs, metrics, and traces via OpenTelemetry.
- Developer/SRE: Use Grafana to monitor system health and debug.
Three Pillars of Observability
1. Metrics (Prometheus + Grafana)
graph LR
Service[Service] -->|Expose /metrics| Prom[Prometheus]
Prom -->|Scrape every 15s| Metrics[Time Series DB]
Metrics --> Grafana[Grafana]
Grafana --> Dashboard1[Request Dashboard]
Grafana --> Dashboard2[Error Dashboard]
Grafana --> Dashboard3[Performance Dashboard]
classDef default fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef prom fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef grafana fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
class Prom prom;
class Grafana grafana;
Description: Numerical measurements over time (requests/sec, latency, errors).
Implementation:
import { Counter, Histogram, Gauge } from 'prom-client';

// HTTP request metrics
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});

export const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status']
});

export const activeRequests = new Gauge({
  name: 'http_requests_active',
  help: 'Number of active HTTP requests'
});

// Middleware to track metrics
export function metricsMiddleware(req, res, next) {
  const start = Date.now();
  activeRequests.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Both metrics share the same label set, so build it once
    const labels = {
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode
    };
    httpRequestDuration.observe(labels, duration);
    httpRequestTotal.inc(labels);
    activeRequests.dec();
  });

  next();
}
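The middleware falls back to `req.path` when no route pattern matched, which can put raw IDs into the `route` label and inflate the number of time series. The sketch below shows one way to collapse ID-like segments first. `normalizeRoute` and its two patterns are hypothetical helpers, not part of `prom-client` or Express, and the set of patterns would need tuning per service.

```typescript
// Hypothetical helper: collapse ID-like path segments so the `route`
// label stays low-cardinality when no route pattern is available.
const UUID_SEGMENT =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_SEGMENT = /^\d+$/;

export function normalizeRoute(path: string): string {
  // Drop the query string, if any, before inspecting segments
  const queryStart = path.indexOf('?');
  const withoutQuery = queryStart === -1 ? path : path.slice(0, queryStart);

  return withoutQuery
    .split('/')
    .map(segment =>
      NUMERIC_SEGMENT.test(segment) || UUID_SEGMENT.test(segment)
        ? ':id'
        : segment
    )
    .join('/');
}
```

Under this assumption the label value would become `req.route?.path || normalizeRoute(req.path)`.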
2. Logging (Winston + Loki)
sequenceDiagram
participant Service
participant Winston as Winston Logger
participant Loki
participant Grafana
Service->>Winston: Log event
Winston->>Winston: Format JSON
Winston->>Winston: Add metadata<br/>(correlation ID, trace ID)
Winston->>Loki: Push logs
Loki->>Loki: Index & store
User->>Grafana: Query logs
Grafana->>Loki: LogQL query
Loki-->>Grafana: Log results
Description: Structured logging with correlation IDs for request tracing.
Implementation:
import { randomUUID } from 'node:crypto';
import winston from 'winston';

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: process.env.SERVICE_NAME || 'unknown-service',
    environment: process.env.NODE_ENV || 'development'
  },
  transports: [
    new winston.transports.Console(),
    // Loki transport (if configured)
  ]
});

// Logger middleware
export function loggerMiddleware(req, res, next) {
  // Reuse the caller's correlation ID if present, otherwise generate one
  const correlationId = req.headers['x-correlation-id'] || randomUUID();
  // Capture the start time here so the duration below is well defined
  const startTime = Date.now();

  req.correlationId = correlationId;
  req.logger = logger.child({ correlationId });

  req.logger.info('Incoming request', {
    method: req.method,
    path: req.path,
    ip: req.ip
  });

  res.on('finish', () => {
    req.logger.info('Request completed', {
      method: req.method,
      path: req.path,
      status: res.statusCode,
      duration: Date.now() - startTime // milliseconds
    });
  });

  next();
}
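For the correlation ID to link logs across services, it also has to be forwarded on outgoing calls. The following is a minimal sketch under the assumption that the same `x-correlation-id` header is used downstream. `resolveCorrelationId` and `withCorrelationId` are hypothetical helper names, and the header map type is simplified from what Node's HTTP server provides.

```typescript
import { randomUUID } from 'node:crypto';

type HeaderValue = string | string[] | undefined;

// Node lowercases incoming header names, so a lowercase key is enough
const CORRELATION_HEADER = 'x-correlation-id';

// Reuse the caller's ID when one was sent; otherwise mint a new one.
export function resolveCorrelationId(
  headers: Record<string, HeaderValue>
): string {
  const raw = headers[CORRELATION_HEADER];
  // A repeated header arrives as an array; take the first value
  const first = Array.isArray(raw) ? raw[0] : raw;
  const trimmed = first?.trim();
  return trimmed ? trimmed : randomUUID();
}

// Copy the ID onto the headers of a downstream call so both services
// log the same value.
export function withCorrelationId(
  headers: Record<string, string>,
  correlationId: string
): Record<string, string> {
  return { ...headers, [CORRELATION_HEADER]: correlationId };
}
```

A downstream request would then pass something like `withCorrelationId({ accept: 'application/json' }, req.correlationId)` as its headers.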
3. Tracing (OpenTelemetry + Jaeger)
graph LR
Request[Incoming Request] --> Trace[Create Trace]
Trace --> SpanA[Span: HTTP Request]
SpanA --> SpanB[Span: DB Query]
SpanA --> SpanC[Span: Cache Check]
SpanA --> SpanD[Span: External API]
SpanB --> Jaeger[Jaeger]
SpanC --> Jaeger
SpanD --> Jaeger
Jaeger --> Timeline[Trace Timeline]
classDef default fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef trace fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef jaeger fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
class Trace trace;
class Jaeger jaeger;
Description: Distributed tracing to track requests across services.
Implementation:
import { trace, SpanStatusCode } from '@opentelemetry/api';

// Create traced function
export function traced<T>(
  name: string,
  fn: () => Promise<T>
): Promise<T> {
  const tracer = trace.getTracer('app');
  // startActiveSpan makes the span current for the duration of fn,
  // so spans created inside fn are recorded as its children
  return tracer.startActiveSpan(name, async (span) => {
    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      const err = error instanceof Error ? error : new Error(String(error));
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: err.message
      });
      span.recordException(err);
      throw error;
    } finally {
      // Always end the span, including when fn throws synchronously
      span.end();
    }
  });
}

// Usage
async function getUserWithTracing(userId: string): Promise<User> {
  return traced('getUserById', async () => {
    return await userRepository.findById(userId);
  });
}
Health Checks
// Liveness probe - is service running?
app.get('/health/live', (req, res) => {
  res.json({ status: 'ok', timestamp: new Date().toISOString() });
});

// Readiness probe - is service ready for traffic?
app.get('/health/ready', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    disk: await checkDiskSpace()
  };
  const ready = Object.values(checks).every(check => check === true);

  res.status(ready ? 200 : 503).json({
    ready,
    checks,
    timestamp: new Date().toISOString()
  });
});

async function checkDatabase(): Promise<boolean> {
  try {
    await prisma.$queryRaw`SELECT 1`;
    return true;
  } catch {
    return false;
  }
}
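The readiness handler awaits every check, so a dependency that hangs would stall the probe itself. One possible way to bound each check is sketched below. `checkWithTimeout` is a hypothetical helper, not something the handler above already uses, and the timeout value is left to the caller.

```typescript
// Hypothetical helper: run a dependency probe and treat errors and
// slow responses alike as "not ready".
export async function checkWithTimeout(
  probe: () => Promise<unknown>,
  timeoutMs: number
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Rejects once the deadline passes; never resolves on its own
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error('health check timed out')),
      timeoutMs
    );
  });

  try {
    await Promise.race([probe(), deadline]);
    return true;
  } catch {
    return false;
  } finally {
    // Clear the timer so a fast probe does not leave it pending
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

With this in place, `checkDatabase` could be expressed as `checkWithTimeout(() => prisma.$queryRaw`SELECT 1`, 2000)`, where the 2000 ms budget is only an illustrative value.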
Alerting Rules
# Prometheus alerting rules
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # High error rate: share of 5xx responses among all responses, per job
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (> 5%)"
      # High latency: P95 computed from per-bucket rates, not raw counters
      - alert: HighLatency
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"
      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
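To make the `HighLatency` rule easier to reason about, the sketch below reproduces the idea behind `histogram_quantile` for classic histograms: find the bucket containing the target rank, then interpolate linearly inside it. It is a simplified illustration, not Prometheus source code. The `Bucket` type and `estimateQuantile` are hypothetical names, and only quantiles in the range (0, 1] are handled.

```typescript
interface Bucket {
  le: number;    // upper bound; Infinity for the +Inf bucket
  count: number; // cumulative observations with value <= le
}

// Estimate a quantile from cumulative buckets by linear interpolation.
export function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (!(q > 0 && q <= 1)) return NaN; // this sketch covers 0 < q <= 1 only

  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  // A +Inf bucket holding the total count is required
  if (!last || last.le !== Infinity || last.count === 0) return NaN;

  const rank = q * last.count;
  const index = sorted.findIndex(b => b.count >= rank);
  const bucket = sorted[index];

  if (bucket.le === Infinity) {
    // The quantile lies beyond the highest finite bound, so that bound
    // is the best available answer
    return sorted.length > 1 ? sorted[sorted.length - 2].le : NaN;
  }

  const lowerBound = index === 0 ? 0 : sorted[index - 1].le;
  const lowerCount = index === 0 ? 0 : sorted[index - 1].count;
  const inBucket = bucket.count - lowerCount;

  return lowerBound + (bucket.le - lowerBound) * ((rank - lowerCount) / inBucket);
}
```

Because the result is interpolated, its precision depends on the bucket boundaries: a P95 that falls in the 1s to 2s bucket can only be located approximately within that range, which is worth keeping in mind when choosing `buckets` near the alert threshold.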
Performance Targets
Performance Goals
| Metric | Target | Notes |
|---|---|---|
| Metric Scrape Interval | 15s | Critical services |
| Log Ingestion Latency | < 1s | Time from emit to queryable |
| Trace Sampling Rate | 10% | Production (100% in Dev/Staging) |
| Dashboard Load Time | < 2s | P95 Latency |
| Alert Evaluation | Every 1m | Evaluation interval |
| Retention Policy | 14 days | Logs & Traces (Metrics: 30 days) |
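The 10% production sampling rate in the table is typically applied deterministically per trace ID, so every service makes the same keep-or-drop decision for a given trace. The sketch below illustrates that idea with a hypothetical `shouldSample` function. It is not the OpenTelemetry SDK's algorithm, and a real deployment would more likely configure the SDK's own ratio-based sampler.

```typescript
// Hypothetical sampler: keep a trace when the first 32 bits of its
// hex trace ID fall below ratio * 2^32. The same ID always yields
// the same decision, regardless of which service evaluates it.
export function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;

  const prefix = Number.parseInt(traceId.slice(0, 8), 16);
  if (Number.isNaN(prefix)) return false; // malformed ID: drop the trace

  return prefix < ratio * 0x100000000;
}

// Example: the ratio could come from the environment, e.g. 0.1 in
// production and 1 in dev/staging (SAMPLE_RATIO is an assumed name).
export const sampleRatio = Number(process.env.SAMPLE_RATIO ?? '0.1');
```

This relies on trace IDs being uniformly random, which holds for IDs generated by standard tracing SDKs but not for hand-crafted ones.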
Security Considerations
Observability Security
- Log Scrubbing: Automatically remove PII (email addresses, SSNs, credit card numbers) and secrets from logs before ingestion.
- Access Control: Grafana integrated with OAuth2/OIDC, with Viewer/Editor/Admin roles.
- Network Policy: Only allow traffic from internal namespace to ingestion ports (9090, 3100, 14268).
- TLS: Encrypt traffic between agents and collectors.
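As an illustration of the log scrubbing item, the sketch below redacts a few PII patterns from a log payload before it is emitted. The `scrub` function and the `REDACTIONS` patterns are hypothetical and deliberately simple. They assume plain JSON-like values, and real PII detection usually needs broader rules plus field-name based redaction for secrets.

```typescript
// Hypothetical redaction rules: [pattern, replacement]
const REDACTIONS: Array<[RegExp, string]> = [
  [/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, '[REDACTED_EMAIL]'],
  [/\b\d{3}-\d{2}-\d{4}\b/g, '[REDACTED_SSN]'],
  // 13 to 16 digits, optionally separated by spaces or hyphens
  [/\b\d(?:[ -]?\d){12,15}\b/g, '[REDACTED_CARD]'],
];

// Recursively redact strings inside plain objects and arrays.
export function scrub(value: unknown): unknown {
  if (typeof value === 'string') {
    return REDACTIONS.reduce(
      (text, [pattern, label]) => text.replace(pattern, label),
      value
    );
  }
  if (Array.isArray(value)) {
    return value.map(scrub);
  }
  if (value !== null && typeof value === 'object') {
    const cleaned: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      cleaned[key] = scrub(inner);
    }
    return cleaned;
  }
  return value; // numbers, booleans, null, undefined pass through
}
```

A function like this could be applied to log metadata in a custom logger format step so that redaction happens in one place rather than at each call site.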
Deployment
graph TD
subgraph "Kubernetes Monitoring Namespace"
Grafana[Grafana]
Prom[Prometheus Server]
Loki[Loki Gateway]
Jaeger[Jaeger Collector]
end
subgraph "App Namespace"
App[Application Pods]
Agent[Grafana Agent / Promtail]
end
App -->|Push Logs| Agent
Agent -->|Push| Loki
Prom -->|Pull Metrics| App
Prom -->|Pull Metrics| Agent
App -->|Push Traces| Jaeger
Grafana --> Prom
Grafana --> Loki
Grafana --> Jaeger
classDef k8s fill:#2D3748,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef app fill:#4A5568,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef grafana fill:#2C5282,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef loki fill:#C05621,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef jaeger fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
classDef prom fill:#2F855A,stroke:#FFFFFF,stroke-width:2px,color:#FFFFFF;
class Grafana grafana;
class Loki loki;
class Jaeger jaeger;
class Prom prom;
class App,Agent app;
Deployment Description:
- Agent: Promtail or Grafana Agent runs as DaemonSet or Sidecar to collect logs.
- Pull Model: Prometheus scrapes metrics from `/metrics` endpoints.
- Push Model: Traces and Logs are pushed to collectors.
- Resources: Dedicated nodes for monitoring stack in production to avoid impacting main workload.
Related Documentation
- System Design - Overall architecture
- Caching Architecture - Cache metrics
Quick Tips
Mermaid Common Issues
| Issue | Solution |
|---|---|
| Parse Error | Check for special characters like () or [] inside node text without quotes. Use "text" for complex strings. |
| Color Not Showing | Ensure style or classDef definitions are correct and IDs match. |
| Arrow Direction | TD = Top-Down, LR = Left-Right. Choose appropriately for layout. |
Color Pattern Quick Reference
| Element | Color | Hex | Use Case |
|---|---|---|---|
| Primary | Dark Blue | #2D3748 | System components, core services |
| Secondary | Grey | #4A5568 | Supporting modules, libraries |
| Accent | Blue | #2C5282 | Databases, external APIs |
| Highlight | Teal | #285E61 | User interactions, highlights |
| Success | Green | #2F855A | Successful states, active |
| Warning | Orange | #C05621 | Warning/Caution states |
| Error | Red | #C53030 | Error states, failures |
Visual Indicators
| Indicator | Meaning |
|---|---|
| 🟢 | Safe / Recommended |
| 🟡 | Warning / Caution |
| 🔴 | Danger / Anti-pattern |
| 💡 | Tip / Best Practice |
Last Updated: 2026-01-10
Author: GoodGo Architecture Team