11 KiB
11 KiB
Event-Driven Architecture
Event-driven architecture for asynchronous communication using Apache Kafka
Overview Diagram
graph TD
subgraph "Event Producers"
IAM[IAM Service]
Service1[Service A]
end
subgraph "Event Broker"
Kafka[Apache Kafka]
Topics[Topics: user.events, auth.events]
end
subgraph "Event Consumers"
Consumer1[Notification Service]
Consumer2[Audit Service]
end
IAM -->|Publish| Kafka
Service1 -->|Publish| Kafka
Kafka --> Topics
Topics -->|Subscribe| Consumer1
Topics -->|Subscribe| Consumer2
style Kafka fill:#e1f5ff
style Topics fill:#fff4e1
Architecture Description
The GoodGo platform implements Event-Driven Architecture (EDA) for asynchronous communication between microservices.
Core Principles:
- Event-First Design: All state changes emit domain events
- Loose Coupling: Services communicate through events
- Eventual Consistency: Accept temporary inconsistency
- Event Sourcing: Store changes as event sequence
- CQRS Pattern: Separate read/write operations
Technology Stack:
- Apache Kafka - Event streaming platform
- Schema Registry - Avro schemas for validation
- KafkaJS - Node.js client library
- Event Sourcing - Custom implementation in IAM
Event Flow
sequenceDiagram
participant Producer as IAM Service
participant Kafka as Kafka Broker
participant Consumer as Notification Service
Producer->>Kafka: Publish Event (user.created)
Kafka->>Consumer: Deliver Event
Consumer->>Consumer: Process Event
Consumer-->>Kafka: Acknowledge
Steps: Publish → Distribute → Consume → Retry (if failed) → DLQ (after max retries) → Acknowledge
Event Structure
interface BaseEvent {
eventId: string; // UUID
eventType: string; // user.created.v1
eventVersion: string; // 1.0.0
timestamp: string; // ISO 8601
source: string; // iam-service
correlationId?: string; // Request correlation
data: unknown; // Event payload
}
Example:
{
"eventId": "550e8400-e29b-41d4-a716-446655440000",
"eventType": "user.created.v1",
"timestamp": "2024-01-15T10:30:00Z",
"source": "iam-service",
"data": {
"userId": "user_123",
"email": "user@example.com"
}
}
Kafka Topics
graph LR
UserCreated[user.created<br/>Partitions: 3]
AuthLogin[auth.login.success<br/>Partitions: 5]
AuditEvents[audit.events<br/>Partitions: 10]
style UserCreated fill:#e1f5ff
style AuthLogin fill:#fff4e1
style AuditEvents fill:#f8d7da
Naming Convention: {domain}.{action}.{version}
Examples:
user.created.v1auth.login.success.v1audit.event.logged.v1
Error Handling
graph TD
Event[Event] --> Process[Process]
Process -->|Success| Ack[Acknowledge]
Process -->|Failure| Retry[Retry 3x]
Retry -->|Max Retries| DLQ[Dead Letter Queue]
DLQ --> Alert[Alert Team]
Strategy:
- Retry with exponential backoff (100ms → 200ms → 400ms)
- Max 3 attempts
- Move to DLQ after max retries
- Manual review and reprocess
System Context
C4Context
title Event-Driven Architecture Context
System(iam, "IAM Service", "Event producer")
System(service_a, "Service A", "Event producer")
System(notification, "Notification Service", "Event consumer")
System(audit, "Audit Service", "Event consumer")
System_Ext(kafka, "Apache Kafka", "Event streaming platform")
System_Ext(registry, "Schema Registry", "Schema management")
System_Ext(monitoring, "Monitoring", "Kafka metrics & alerts")
Rel(iam, kafka, "Publishes events", "Kafka Protocol")
Rel(service_a, kafka, "Publishes events", "Kafka Protocol")
Rel(kafka, notification, "Delivers events", "Kafka Protocol")
Rel(kafka, audit, "Delivers events", "Kafka Protocol")
Rel(kafka, registry, "Validates schemas", "HTTP")
Rel(kafka, monitoring, "Sends metrics", "JMX")
Context Description:
- Producers: IAM Service and other services publish domain events
- Kafka: Central event broker, manages topics and partitions
- Consumers: Notification and Audit services consume events
- Schema Registry: Manages and validates Avro schemas
- Monitoring: Collects metrics from Kafka cluster
Performance Characteristics
| Metric | Target | Notes |
|---|---|---|
| Event Publish Latency (P95) | < 10ms | Fire-and-forget, async |
| Event Delivery Latency (P95) | < 100ms | End-to-end from publish to consume |
| Throughput | 10,000 events/s | Per topic, scalable with partitions |
| Consumer Lag | < 1000 messages | Per partition, monitored |
| Event Size | < 1MB | Recommended max size |
| Retention | 7 days | Default, configurable per topic |
| Replication Factor | 3 | For fault tolerance |
Performance Optimizations:
- Batch Publishing: Group multiple events to reduce network overhead
- Compression: Use Snappy or LZ4 compression
- Partitioning: Divide topics into multiple partitions for parallel processing
- Consumer Groups: Multiple consumers in same group for horizontal scaling
- Async Publishing: Fire-and-forget pattern, don't block request handlers
Security Considerations
Event Encryption:
- TLS in-transit for all Kafka connections
- Optional payload encryption for sensitive data
- End-to-end encryption with custom encryption layer
Access Control:
- Kafka ACLs (Access Control Lists) per topic
- SASL/SCRAM authentication for producers and consumers
- Separate credentials per service
- Principle of least privilege - grant only necessary permissions
Schema Validation:
- Avro schemas in Schema Registry
- Schema evolution with backward/forward compatibility
- Reject events that don't match schema
Audit:
- Log all event publishes and consumes
- Correlation IDs to trace event flow
- Retention policy for audit logs (7 years)
Data Retention:
- Default 7 days retention
- Configurable per topic
- Automatic deletion after retention period
- GDPR compliance (right to erasure)
Deployment
graph TD
subgraph "Kafka Cluster"
subgraph "Brokers"
Broker1[Kafka Broker 1<br/>Leader for partitions 0,3,6]
Broker2[Kafka Broker 2<br/>Leader for partitions 1,4,7]
Broker3[Kafka Broker 3<br/>Leader for partitions 2,5,8]
end
subgraph "Coordination"
ZK[Zookeeper Ensemble<br/>3 nodes]
end
Broker1 --> ZK
Broker2 --> ZK
Broker3 --> ZK
end
subgraph "Producers"
IAM[IAM Service]
ServiceA[Service A]
end
subgraph "Consumers"
Notification[Notification Service<br/>Consumer Group: notifications]
Audit[Audit Service<br/>Consumer Group: audit]
end
IAM --> Broker1
IAM --> Broker2
IAM --> Broker3
ServiceA --> Broker1
ServiceA --> Broker2
ServiceA --> Broker3
Broker1 --> Notification
Broker2 --> Notification
Broker3 --> Notification
Broker1 --> Audit
Broker2 --> Audit
Broker3 --> Audit
style Broker1 fill:#e1f5ff
style Broker2 fill:#fff4e1
style Broker3 fill:#d4edda
style ZK fill:#f0e1ff
Kafka Cluster Configuration:
- Brokers: 3 brokers minimum (5 for production)
- Replication Factor: 3 (for fault tolerance)
- Min In-Sync Replicas: 2 (ensure data durability)
- Partitions: 3-10 per topic (based on throughput needs)
- Zookeeper: 3-node ensemble (for coordination)
Resource Allocation:
| Component | CPU | Memory | Disk |
|---|---|---|---|
| Kafka Broker | 2 cores | 4GB RAM | 100GB SSD |
| Zookeeper | 1 core | 2GB RAM | 20GB SSD |
| Schema Registry | 500m | 1GB RAM | 10GB |
Topic Configuration:
user.created:
partitions: 3
replication-factor: 3
retention-ms: 604800000 # 7 days
compression-type: snappy
auth.login.success:
partitions: 5
replication-factor: 3
retention-ms: 604800000
compression-type: snappy
audit.events:
partitions: 10
replication-factor: 3
retention-ms: 220752000000 # 7 years
compression-type: lz4
High Availability:
- Multiple brokers with partition replication
- Automatic leader election when broker fails
- Consumer group rebalancing
- Monitoring and alerting for broker health
Monitoring & Observability
Key Metrics:
Kafka Broker Metrics:
kafka_server_brokertopicmetrics_messagesinpersec- Messages in/seckafka_server_brokertopicmetrics_bytesinpersec- Bytes in/seckafka_server_brokertopicmetrics_bytesoutpersec- Bytes out/seckafka_controller_kafkacontroller_activecontrollercount- Active controllerkafka_server_replicamanager_underreplicatedpartitions- Under-replicated partitions
Consumer Metrics:
kafka_consumer_fetch_manager_records_lag_max- Max consumer lagkafka_consumer_fetch_manager_records_consumed_rate- Records consumed/seckafka_consumer_coordinator_commit_latency_avg- Commit latency
Producer Metrics:
kafka_producer_record_send_total- Total records sentkafka_producer_record_error_total- Total send errorskafka_producer_request_latency_avg- Request latency
Application Metrics:
// Custom metrics for event processing
const eventPublished = new Counter({
name: 'events_published_total',
help: 'Total events published',
labelNames: ['event_type', 'topic']
});
const eventConsumed = new Counter({
name: 'events_consumed_total',
help: 'Total events consumed',
labelNames: ['event_type', 'topic', 'consumer_group']
});
const eventProcessingDuration = new Histogram({
name: 'event_processing_duration_seconds',
help: 'Event processing duration',
labelNames: ['event_type'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});
Dashboards:
- Kafka Cluster Overview (brokers, topics, partitions)
- Producer Performance (throughput, latency, errors)
- Consumer Performance (lag, throughput, errors)
- Topic Metrics (messages/sec, bytes/sec, retention)
Logging:
// Structured logging for events
logger.info('Event published', {
eventId: event.eventId,
eventType: event.eventType,
topic: 'user.created',
correlationId: event.correlationId
});
logger.info('Event consumed', {
eventId: event.eventId,
eventType: event.eventType,
topic: 'user.created',
consumerGroup: 'notifications',
processingTime: duration
});
Related Documentation
- System Design - Overall architecture
- IAM Architecture - Event sourcing implementation