# Microservices Communication

> Patterns and protocols for inter-service communication

## Overview Diagram

```mermaid
graph TD
    Client[Client Apps] --> Gateway[API Gateway<br/>Traefik]

    Gateway --> ServiceA[Service A]
    Gateway --> ServiceB[Service B]

    ServiceA <-->|REST/HTTP| ServiceB
    ServiceA -->|Events| Kafka[Kafka Broker]
    ServiceB <-.->|Sub| Kafka

    ServiceA --> SD[Service Discovery<br/>Docker DNS / K8s DNS]
    ServiceB --> SD

    classDef blue fill:#253041,stroke:#4b6584,color:#ffffff
    classDef orange fill:#3a2e1e,stroke:#7a5f3c,color:#ffffff
    classDef green fill:#1e3a29,stroke:#3c7a52,color:#ffffff

    class Gateway blue
    class Kafka orange
    class SD green
```

## System Context

```mermaid
C4Context
    title System Context Diagram for GoodGo Microservices Communication

    Person(client_web, "Web Client", "Browser/Mobile App")
    Person(client_api, "API Consumer", "External API clients")

    System_Boundary(goodgo, "GoodGo Platform") {
        System(gateway, "API Gateway", "Traefik - Routes requests to services")
        System(services, "Microservices", "IAM, User, Order, Product services")
        System(kafka, "Event Bus", "Kafka - Async communication")
        System(discovery, "Service Discovery", "Docker DNS / K8s DNS")
    }

    System_Ext(db, "Database", "Neon PostgreSQL")
    System_Ext(cache, "Cache", "Redis")
    System_Ext(external_api, "External APIs", "Payment, Email, SMS")

    Rel(client_web, gateway, "Uses", "HTTPS")
    Rel(client_api, gateway, "Calls", "HTTPS/REST")
    Rel(gateway, services, "Routes to", "HTTP")
    Rel(services, kafka, "Pub/Sub", "Kafka Protocol")
    Rel(services, discovery, "Lookup", "DNS")
    Rel(services, db, "Reads/Writes", "PostgreSQL")
    Rel(services, cache, "Gets/Sets", "Redis Protocol")
    Rel(services, external_api, "Integrates", "HTTPS")
```

The GoodGo platform uses a microservices architecture in which all client requests flow through an API Gateway (Traefik), which routes each request to the appropriate microservice. Services communicate synchronously via REST/HTTP for request-response patterns and asynchronously via Kafka for event-driven workflows. Service discovery is handled by Docker DNS in local environments and by Kubernetes DNS in staging and production.

## Communication Protocols

### Protocol Comparison

| Protocol | Latency | Complexity | Use Case |
|----------|---------|------------|----------|
| **REST** | Medium | Low | External APIs, CRUD |
| **gRPC** | Low | High | Internal high-performance |
| **Events** | Async | Medium | Decoupled workflows |
| **GraphQL** | Medium | Medium | Complex data fetching |

### REST/HTTP Pattern

```mermaid
sequenceDiagram
    participant Client
    participant Gateway as API Gateway
    participant ServiceA as Service A
    participant ServiceB as Service B

    Client->>Gateway: GET /api/v1/users/123
    Gateway->>ServiceA: Forward Request
    ServiceA->>ServiceB: GET /internal/permissions/123
    ServiceB-->>ServiceA: Permissions
    ServiceA-->>Gateway: User + Permissions
    Gateway-->>Client: JSON Response
```

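Services call each other synchronously over HTTP/REST when the caller needs the result before it can respond, as with the permissions lookup in the diagram above.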

**Implementation**:
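```typescript
// Service-to-service HTTP client
import axios from 'axios';

// Illustrative shape only; substitute the shared User type.
export interface User {
  id: string;
}

export class UserServiceClient {
  // One shared axios instance applies the base URL, timeout, and auth header to every call
  private client = axios.create({
    baseURL: process.env.USER_SERVICE_URL,
    timeout: 5000, // fail after 5 seconds instead of hanging
    headers: {
      'x-service-auth': process.env.INTERNAL_API_KEY
    }
  });

  async getUser(userId: string): Promise<User> {
    // Encode the ID so unexpected characters cannot alter the request path
    const response = await this.client.get<User>(`/users/${encodeURIComponent(userId)}`);
    return response.data;
  }
}
```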

### Event-Driven Pattern

```mermaid
sequenceDiagram
    participant ServiceA
    participant Kafka
    participant ServiceB
    participant ServiceC

    ServiceA->>Kafka: Publish: user.created
    Kafka->>ServiceB: Deliver event
    Kafka->>ServiceC: Deliver event

    par Parallel Processing
        ServiceB->>ServiceB: Send welcome email
        ServiceC->>ServiceC: Create user profile
    end
```

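Services communicate asynchronously by publishing events to Kafka. The producer does not wait for consumers: in the diagram above, Service B and Service C each receive `user.created` and process it in parallel.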
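
The fan-out in the diagram above can be modelled with a small in-memory stand-in for a topic. This is a sketch for reasoning about the pattern, not a Kafka client: `InMemoryTopic`, `DomainEvent`, and the group names are hypothetical, and delivery here is synchronous, whereas Kafka consumers pull events on their own schedule.

```typescript
// In-memory stand-in for a broker topic; NOT a Kafka client.
interface DomainEvent<T> {
  type: string;       // e.g. "user.created"
  occurredAt: string; // ISO-8601 timestamp
  payload: T;
}

type Handler<T> = (event: DomainEvent<T>) => void;

class InMemoryTopic<T> {
  // One handler per consumer group: every group sees every event.
  private groups = new Map<string, Handler<T>>();

  subscribe(groupId: string, handler: Handler<T>): void {
    this.groups.set(groupId, handler);
  }

  publish(type: string, payload: T): DomainEvent<T> {
    const event: DomainEvent<T> = {
      type,
      occurredAt: new Date().toISOString(),
      payload,
    };
    this.groups.forEach((handler) => handler(event));
    return event;
  }
}

// Usage: two independent consumer groups react to the same event.
const userEvents = new InMemoryTopic<{ userId: string }>();
const handled: string[] = [];
userEvents.subscribe('email-service', (e) => handled.push(`email:${e.payload.userId}`));
userEvents.subscribe('profile-service', (e) => handled.push(`profile:${e.payload.userId}`));
userEvents.publish('user.created', { userId: '123' });
```

The publisher knows nothing about who consumes the event, which is what keeps the producing service decoupled from the email and profile workflows.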

### Service Discovery

**Local (Docker Compose)**:
```text
# Services discover via Docker DNS
http://service-name:port
http://iam-service:3001
```

**Kubernetes**:
```text
# Services discover via K8s DNS
http://service-name.namespace.svc.cluster.local:port
http://iam-service.default.svc.cluster.local:3001
```
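
A service can hide the difference between the two naming schemes behind one helper. The following is a minimal sketch; `resolveServiceUrl`, `Runtime`, and `ServiceAddress` are hypothetical names, and the `default` namespace fallback is an assumption taken from the example URL above.

```typescript
type Runtime = 'docker' | 'kubernetes';

interface ServiceAddress {
  name: string;
  port: number;
  namespace?: string; // Kubernetes only; assumed to fall back to "default"
}

function resolveServiceUrl(runtime: Runtime, svc: ServiceAddress): string {
  if (runtime === 'docker') {
    // Docker DNS resolves the Compose service name directly.
    return `http://${svc.name}:${svc.port}`;
  }
  // Kubernetes DNS: <service>.<namespace>.svc.cluster.local
  const namespace = svc.namespace ?? 'default';
  return `http://${svc.name}.${namespace}.svc.cluster.local:${svc.port}`;
}

// Usage
const localUrl = resolveServiceUrl('docker', { name: 'iam-service', port: 3001 });
const clusterUrl = resolveServiceUrl('kubernetes', { name: 'iam-service', port: 3001 });
```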

## API Gateway Pattern

```mermaid
graph LR
    Client --> Gateway[API Gateway<br/>Traefik]

    subgraph "Gateway Features"
        Gateway --> Route[Routing]
        Gateway --> LB[Load Balancing]
        Gateway --> Auth[Authentication]
        Gateway --> Rate[Rate Limiting]
        Gateway --> CORS
    end

    Route --> Service1[Service 1]
    Route --> Service2[Service 2]
    LB --> Service1A[Instance A]
    LB --> Service1B[Instance B]

    classDef blue fill:#253041,stroke:#4b6584,color:#ffffff
    class Gateway blue
```

The gateway is the single entry point for all client requests and handles routing, load balancing, authentication, rate limiting, and CORS.
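
Rate limiting at a gateway is commonly described in terms of a token bucket: a steady refill rate plus a burst allowance. The sketch below illustrates that general idea only. It is not Traefik's implementation, and `TokenBucket`, `ratePerSecond`, and `burst` are hypothetical names chosen for the example.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;

  constructor(
    private readonly ratePerSecond: number, // steady-state requests per second
    private readonly burst: number,         // maximum tokens that can accumulate
    nowMs: number = Date.now(),
  ) {
    this.tokens = burst; // start full so an initial burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true if the request may proceed, false if it should be rejected.
  tryConsume(nowMs: number = Date.now()): boolean {
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSeconds * this.ratePerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Usage with explicit timestamps so the behaviour is deterministic.
const bucket = new TokenBucket(1, 2, 0);
const decisions = [
  bucket.tryConsume(0),    // burst token 1
  bucket.tryConsume(0),    // burst token 2
  bucket.tryConsume(0),    // bucket empty
  bucket.tryConsume(1000), // one token refilled after 1 s
];
```

A rejected request would typically be answered with HTTP 429 so that clients can back off.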

## Performance Characteristics

Performance expectations and optimization strategies for inter-service communication.

| Metric | Target | Notes |
|--------|--------|-------|
| **REST API Response Time** | < 100ms | P95 for internal service-to-service calls |
| **Event Publishing Latency** | < 50ms | Time to publish to Kafka |
| **Service Discovery Lookup** | < 10ms | DNS resolution time |
| **Gateway Routing Overhead** | < 20ms | Additional latency added by Traefik |
| **Throughput** | 10,000 req/s | Per service instance |
| **Kafka Event Processing** | < 500ms | P95 end-to-end event processing |

**Optimization Strategies**:
- **Connection Pooling**: Reuse HTTP connections between services
- **Circuit Breaker**: Prevent cascading failures with the Opossum library
- **Retry with Backoff**: Exponential backoff for transient failures
- **Compression**: Enable gzip for large payloads
- **Caching**: Cache service discovery results and responses
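
The retry strategy listed above can be sketched as follows. This is a minimal illustration, not the platform's implementation: `retryWithBackoff` and `backoffDelayMs` are hypothetical helpers, and the base delay (100 ms), cap (5 s), and attempt count (3) are assumed values.

```typescript
// Delay doubles with each attempt and is capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 100, maxMs = 5000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  // Injectable so callers (and tests) can replace the real timer.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Wait before the next attempt, but not after the final failure.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

This sketch retries on every error. In practice only transient failures (timeouts, connection resets, HTTP 503) should be retried, and adding random jitter to the delay helps avoid synchronized retry storms.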

## Security Considerations

Security measures for protecting inter-service communication.

### Service-to-Service Authentication

- **Internal API Keys**: Services authenticate using `x-service-auth` header
- **JWT Tokens**: For user context propagation between services
- **Mutual TLS (mTLS)**: Optional for production environments (Kubernetes service mesh)
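
On the receiving side, the `x-service-auth` value has to be compared against the expected key. A minimal sketch of that check, assuming a single shared key and using Node's `crypto` module, could look like this; `isValidServiceKey` is a hypothetical helper name.

```typescript
import { createHash, timingSafeEqual } from 'crypto';

function isValidServiceKey(provided: string | undefined, expected: string): boolean {
  if (!provided) return false;
  // Hash both values so the buffers have equal length, which timingSafeEqual requires.
  const providedDigest = createHash('sha256').update(provided).digest();
  const expectedDigest = createHash('sha256').update(expected).digest();
  // Constant-time comparison avoids leaking how many leading bytes matched.
  return timingSafeEqual(providedDigest, expectedDigest);
}

// Usage inside a request handler (header lookup shown as a plain object for brevity)
const incomingHeaders: Record<string, string | undefined> = { 'x-service-auth': 'secret' };
const authorized = isValidServiceKey(incomingHeaders['x-service-auth'], 'secret');
```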

### Network Security

- **Network Policies**: Kubernetes NetworkPolicies restrict service-to-service traffic
- **Service Mesh**: Istio/Linkerd for advanced security policies (optional)
- **Private Networks**: Services communicate within private VPC/cluster network

### Data Protection

- **Encryption in Transit**: TLS 1.2+ for all external communication
- **Event Payload Encryption**: Sensitive data encrypted before publishing to Kafka
- **API Gateway**: Traefik handles TLS termination and request validation
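
Encrypting sensitive fields before publishing can be sketched with Node's built-in `crypto` module. The example assumes AES-256-GCM with a 32-byte key supplied by the caller; the algorithm choice, the `EncryptedPayload` shape, and both function names are assumptions for illustration, and key management is out of scope here.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'crypto';

interface EncryptedPayload {
  iv: string;         // base64, 12-byte nonce, unique per message
  authTag: string;    // base64, GCM authentication tag
  ciphertext: string; // base64
}

function encryptPayload(plaintext: string, key: Buffer): EncryptedPayload {
  const iv = randomBytes(12);
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  return {
    iv: iv.toString('base64'),
    authTag: cipher.getAuthTag().toString('base64'),
    ciphertext: ciphertext.toString('base64'),
  };
}

function decryptPayload(payload: EncryptedPayload, key: Buffer): string {
  const decipher = createDecipheriv('aes-256-gcm', key, Buffer.from(payload.iv, 'base64'));
  decipher.setAuthTag(Buffer.from(payload.authTag, 'base64'));
  // final() throws if the ciphertext or tag was tampered with.
  const plaintext = Buffer.concat([
    decipher.update(Buffer.from(payload.ciphertext, 'base64')),
    decipher.final(),
  ]);
  return plaintext.toString('utf8');
}

// Usage: round-trip a sensitive field with a throwaway key.
const demoKey = randomBytes(32);
const encrypted = encryptPayload(JSON.stringify({ email: 'user@example.com' }), demoKey);
const decrypted = decryptPayload(encrypted, demoKey);
```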

### Security Best Practices

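```typescript
// Service client with authentication
import axios from 'axios';
import https from 'https';
import { randomUUID } from 'crypto';

export class SecureServiceClient {
  private client = axios.create({
    baseURL: process.env.SERVICE_URL,
    timeout: 5000,
    headers: {
      'x-service-auth': process.env.INTERNAL_API_KEY
    },
    httpsAgent: new https.Agent({
      rejectUnauthorized: true // Verify TLS certificates
    })
  });

  constructor() {
    // Set the correlation ID per request. Setting it in axios.create()
    // would reuse a single ID for every call made by this client.
    // When handling an incoming request, forward its ID instead of generating one.
    this.client.interceptors.request.use((config) => {
      config.headers['x-correlation-id'] = randomUUID();
      return config;
    });
  }
}
```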

## Deployment

How microservices communication is deployed and scaled across environments.

```mermaid
graph TD
    subgraph "Production Cluster"
        LB[Load Balancer] --> Gateway[API Gateway<br/>3 replicas]

        Gateway --> ServiceA1[Service A<br/>Instance 1]
        Gateway --> ServiceA2[Service A<br/>Instance 2]
        Gateway --> ServiceB1[Service B<br/>Instance 1]
        Gateway --> ServiceB2[Service B<br/>Instance 2]

        ServiceA1 & ServiceA2 --> Kafka[Kafka Cluster<br/>3 brokers]
        ServiceB1 & ServiceB2 --> Kafka

        ServiceA1 & ServiceA2 --> DB[(PostgreSQL<br/>Primary + Replica)]
        ServiceB1 & ServiceB2 --> DB

        ServiceA1 & ServiceA2 --> Redis[(Redis Cluster<br/>3 nodes)]
        ServiceB1 & ServiceB2 --> Redis
    end

    classDef blue fill:#253041,stroke:#4b6584,color:#ffffff
    classDef orange fill:#3a2e1e,stroke:#7a5f3c,color:#ffffff
    classDef green fill:#1e3a29,stroke:#3c7a52,color:#ffffff
    classDef red fill:#3a1e1e,stroke:#7a3c3c,color:#ffffff

    class Gateway blue
    class Kafka orange
    class DB green
    class Redis red
```

### Deployment Environments

| Environment | Gateway | Services | Kafka | Service Discovery |
|-------------|---------|----------|-------|-------------------|
| **Local** | Traefik (Docker) | Single instance per service | Single broker | Docker DNS |
| **Staging** | Traefik (2 replicas) | 2 replicas per service | 3 brokers | Kubernetes DNS |
| **Production** | Traefik (3+ replicas) | 3+ replicas per service | 5+ brokers | Kubernetes DNS + Service Mesh |

### Scaling Strategy

- **Horizontal Pod Autoscaler (HPA)**: Auto-scale based on CPU/memory
- **Kafka Partitions**: Scale event processing by increasing partitions
- **Load Balancing**: Kubernetes Service load balances across pod replicas
- **Gateway Scaling**: Traefik scales independently from backend services
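
The Horizontal Pod Autoscaler's core calculation, as documented by Kubernetes, is `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`. The sketch below reproduces that formula to make scaling decisions easier to reason about. The `minReplicas` and `maxReplicas` defaults are hypothetical, and the real controller also applies a tolerance band and stabilization windows that are not modelled here.

```typescript
function desiredReplicas(
  currentReplicas: number,
  currentMetric: number, // e.g. observed average CPU utilization in percent
  targetMetric: number,  // e.g. target average CPU utilization in percent
  minReplicas = 1,
  maxReplicas = 10,
): number {
  const raw = Math.ceil(currentReplicas * (currentMetric / targetMetric));
  // Clamp to the configured bounds.
  return Math.min(maxReplicas, Math.max(minReplicas, raw));
}

// Usage: 3 pods at 90% CPU against a 60% target scale out to 5.
const scaledOut = desiredReplicas(3, 90, 60);
// 4 pods at 30% CPU against a 60% target scale in to 2.
const scaledIn = desiredReplicas(4, 30, 60);
```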

## Monitoring & Observability

How to monitor and observe microservices communication.

### Key Metrics

**Service-to-Service Metrics**:
- `http_request_duration_seconds` - Request latency histogram
- `http_requests_total` - Total requests counter
- `http_request_errors_total` - Failed requests counter
- `service_client_timeout_total` - Timeout counter

**Gateway Metrics**:
- `traefik_service_requests_total` - Requests per service
- `traefik_service_request_duration_seconds` - Routing latency
- `traefik_service_retries_total` - Retry attempts

**Kafka Metrics**:
- `kafka_producer_record_send_total` - Events published
- `kafka_consumer_lag` - Consumer lag
- `kafka_consumer_records_consumed_total` - Events consumed
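
Consumer lag is the number of events a consumer group has not yet processed: for each partition, the log end offset minus the group's committed offset. A minimal sketch of that arithmetic follows; `PartitionOffsets` and `totalConsumerLag` are hypothetical names, and how the offsets are obtained depends on the Kafka client or exporter in use.

```typescript
interface PartitionOffsets {
  partition: number;
  logEndOffset: number;    // offset of the next event to be written
  committedOffset: number; // next offset the consumer group will read
}

function totalConsumerLag(offsets: PartitionOffsets[]): number {
  return offsets.reduce(
    // Guard against transient negative values when offsets are sampled at different times.
    (sum, p) => sum + Math.max(0, p.logEndOffset - p.committedOffset),
    0,
  );
}

// Usage: partition 0 is 10 events behind, partition 1 is caught up.
const lag = totalConsumerLag([
  { partition: 0, logEndOffset: 100, committedOffset: 90 },
  { partition: 1, logEndOffset: 50, committedOffset: 50 },
]);
```

Lag that keeps growing indicates that consumers cannot keep up, which is the signal for adding partitions or consumer instances.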

### Health Checks

**Service Endpoints**:
```typescript
// Liveness - is service running?
app.get('/health/live', (req, res) => {
  res.json({ status: 'ok', timestamp: new Date().toISOString() });
});

// Readiness - can service handle traffic?
app.get('/health/ready', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    kafka: await checkKafka()
  };

  const healthy = Object.values(checks).every(c => c);
  res.status(healthy ? 200 : 503).json({ ready: healthy, checks });
});
```

**Kubernetes Probes**:
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
```

### Distributed Tracing

- **OpenTelemetry**: Instrument all service-to-service calls
- **Jaeger**: Visualize distributed traces
- **Correlation IDs**: Propagate via `x-correlation-id` header for request tracking
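
Correlation ID propagation means reusing the ID from the incoming request on every outgoing call and generating one only when a request arrives without it. A minimal sketch, with `HeaderMap`, `resolveCorrelationId`, and `outgoingHeaders` as hypothetical names:

```typescript
import { randomUUID } from 'crypto';

const CORRELATION_HEADER = 'x-correlation-id';

type HeaderMap = Record<string, string | string[] | undefined>;

function resolveCorrelationId(incoming: HeaderMap): string {
  const value = incoming[CORRELATION_HEADER];
  const first = Array.isArray(value) ? value[0] : value;
  // Reuse the caller's ID when present; otherwise this service starts the trace.
  return first && first.trim() !== '' ? first : randomUUID();
}

// Headers to attach to every downstream call made while handling this request.
function outgoingHeaders(incoming: HeaderMap): Record<string, string> {
  return { [CORRELATION_HEADER]: resolveCorrelationId(incoming) };
}

// Usage
const forwarded = outgoingHeaders({ 'x-correlation-id': 'abc-123' });
const generated = outgoingHeaders({});
```

Logging the same ID in every service lets a single request be followed across log streams even where no trace was sampled.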

### Monitoring Dashboard

**Grafana Panels**:
- Service Communication Overview (request rate, latency, errors)
- Gateway Performance (routing time, backend health)
- Event Bus Health (Kafka lag, throughput)
- Service Dependencies (service map from traces)

## Related Documentation

- [System Design](./system-design.md) - Overall architecture
- [Event-Driven Architecture](./event-driven-architecture.md) - Event patterns
- [API Gateway Advanced](../skills/api-gateway-advanced.md) - Gateway patterns
- [Inter-Service Communication](../skills/inter-service-communication.md) - Communication patterns
- [Resilience Patterns](../skills/resilience-patterns.md) - Circuit breaker, retry

---

## Quick Tips

### Color Pattern Quick Reference
| Element | Color | Hex | Stroke | Usage |
|---------|-------|-----|--------|-------|
| **Core** | Blue | `#253041` | `#4b6584` | Primary components |
| **Logic** | Purple | `#2e1e3a` | `#5f3c7a` | Processing steps |
| **Data** | Green | `#1e3a29` | `#3c7a52` | Database, Cache |
| **External** | Orange | `#3a2e1e` | `#7a5f3c` | External APIs |
| **Error** | Red | `#3a1e1e` | `#7a3c3c` | Failures, Alerts |

### Visual Indicators
- 🔵 **Blue**: Core Infrastructure
- 🟢 **Green**: Data Operations
- 🟠 **Orange**: Event/External
- 🔴 **Red**: Critical/Error
- ⚪ **Grey**: Neutral/Boundary

---

- **Last Updated**: 2026-01-07
- **Authors**: GoodGo Architecture Team
- **Reviewers**: To be assigned