# Caching Architecture
> Multi-layer caching strategy for optimal performance
## Overview Diagram
```mermaid
graph TD
Request[API Request] --> L1{L1 Cache<br/>Memory}
L1 -->|Hit| Return1[Return<br/>< 1ms]
L1 -->|Miss| L2{L2 Cache<br/>Redis}
L2 -->|Hit| WarmL1[Warm L1]
WarmL1 --> Return2[Return<br/>< 5ms]
L2 -->|Miss| DB[(Database)]
DB --> StoreL2[Store L2 + L1]
StoreL2 --> Return3[Return<br/>< 50ms]
style L1 fill:#d4edda
style L2 fill:#fff4e1
style DB fill:#f0e1ff
```
## System Context
```mermaid
C4Context
title Caching System Context
System(service, "Microservice", "Client service using cache")
System_Ext(db, "Neon PostgreSQL", "Primary database")
Boundary(caching, "Caching Layer") {
System(l1, "L1 Cache", "In-memory NodeCache")
System(l2, "L2 Cache", "Redis Cluster")
}
Rel(service, l1, "Reads/Writes", "In-process")
Rel(service, l2, "Reads/Writes", "Redis Protocol")
Rel(l1, l2, "Fills from", "On miss")
Rel(l2, db, "Cache aside", "On miss")
```
### Context Description
- **Service**: Communicates directly with L1 Cache (in-memory) for lowest latency.
- **L1 Cache**: Local cache, not shared, automatic expiration (short TTL).
- **L2 Cache**: Shared Redis cluster, holds data longer and syncs across instances.
- **Database**: Source of truth, accessed only on cache miss.
## Architecture Description
### Multi-Layer Caching
The GoodGo platform uses two-layer caching to improve performance:
**L1 Cache (Memory)**:
- In-memory cache per service instance
- Very fast access (< 1ms)
- Limited capacity (10k keys default)
- Short TTL (60 seconds default, max 5 minutes)
- Not shared across instances
**L2 Cache (Redis)**:
- Shared distributed cache
- Fast access (< 5ms)
- Large capacity
- Longer TTL (configurable, typically 5-15 minutes)
- Shared across all service instances
**Cache Flow**:
```
Request  →    L1     →     L2     →   Database
              ↓            ↓             ↓
            40-50%       80-90%        10-20%
           hit rate     hit rate   cache miss rate
```
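As a rough worked example, the hit rates in the flow above can be combined into an expected average latency. The sketch below assumes the 80-90% figure is the cumulative hit rate across L1 and L2 (so 10-20% of requests reach the database) and uses the latency ceilings from the overview diagram (1 ms, 5 ms, 50 ms). `expectedLatencyMs` and `LayerLatencies` are illustrative names, not part of the codebase.

```typescript
// Hypothetical helper: weighted-average latency for a two-layer cache.
// Assumes l2CumulativeHitRate is the share of requests served by L1 or L2.
interface LayerLatencies {
  l1Ms: number;
  l2Ms: number;
  dbMs: number;
}

function expectedLatencyMs(
  l1HitRate: number,
  l2CumulativeHitRate: number,
  latencies: LayerLatencies
): number {
  if (l1HitRate < 0 || l2CumulativeHitRate > 1 || l1HitRate > l2CumulativeHitRate) {
    throw new RangeError('expected 0 <= l1HitRate <= l2CumulativeHitRate <= 1');
  }
  const servedByL1 = l1HitRate;
  const servedByL2 = l2CumulativeHitRate - l1HitRate;
  const servedByDb = 1 - l2CumulativeHitRate;
  return (
    servedByL1 * latencies.l1Ms +
    servedByL2 * latencies.l2Ms +
    servedByDb * latencies.dbMs
  );
}

// Midpoints of the ranges above: 45% served by L1, 85% served by L1 or L2.
const latencyCeilings: LayerLatencies = { l1Ms: 1, l2Ms: 5, dbMs: 50 };
const averageMs = expectedLatencyMs(0.45, 0.85, latencyCeilings);
console.log(`${averageMs.toFixed(2)} ms`);
```

With these inputs the average is roughly 0.45 × 1 + 0.40 × 5 + 0.15 × 50 ≈ 9.95 ms, and about 7.5 ms of that comes from the requests that miss both layers. Under these assumptions, reducing the miss rate matters more than shaving L1 or L2 latency.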
## Cache Implementation
### Multi-Layer Cache Service
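```typescript
import NodeCache from 'node-cache';
import Redis from 'ioredis';
// `logger` is assumed to be the service's shared logger instance.

const L1_MAX_TTL_SECONDS = 300; // L1 entries never live longer than 5 minutes

export class MultiLayerCache {
  private l1Cache: NodeCache;
  private l2Cache: Redis;

  constructor() {
    // L1: in-process memory cache
    this.l1Cache = new NodeCache({
      stdTTL: 60,       // 60 seconds default
      maxKeys: 10000,   // Max 10k keys
      checkperiod: 120  // Check for expired keys every 2 min
    });
    // L2: Redis cache
    this.l2Cache = new Redis({
      host: process.env.REDIS_HOST,
      // REDIS_PORT may be unset; fall back to the default Redis port
      port: Number.parseInt(process.env.REDIS_PORT ?? '6379', 10),
      db: 0
    });
  }

  async get<T>(key: string): Promise<T | null> {
    // Try L1 first. Compare against undefined so that cached falsy
    // values (0, '', false) still count as hits.
    const l1Value = this.l1Cache.get<T>(key);
    if (l1Value !== undefined) {
      logger.debug('L1 cache hit', { key });
      return l1Value;
    }
    // Try L2
    const l2Value = await this.l2Cache.get(key);
    if (l2Value !== null) {
      logger.debug('L2 cache hit', { key });
      const parsed = JSON.parse(l2Value) as T;
      // Warm L1 cache (uses the default L1 TTL)
      this.setL1(key, parsed);
      return parsed;
    }
    logger.debug('Cache miss', { key });
    return null;
  }

  async set(key: string, value: unknown, ttl: number = 300): Promise<void> {
    // Store in both L1 and L2
    this.setL1(key, value, Math.min(ttl, L1_MAX_TTL_SECONDS)); // L1 max 5min
    await this.l2Cache.setex(key, ttl, JSON.stringify(value));
  }

  async del(key: string): Promise<void> {
    this.l1Cache.del(key);
    await this.l2Cache.del(key);
  }

  async invalidatePattern(pattern: string): Promise<void> {
    // L1: Clear all (simple approach)
    this.l1Cache.flushAll();
    // L2: Delete by pattern. SCAN iterates incrementally, whereas KEYS blocks
    // Redis while it walks the whole keyspace. On Redis Cluster, SCAN has to
    // be run against every master node.
    let cursor = '0';
    do {
      const [next, keys] = await this.l2Cache.scan(cursor, 'MATCH', pattern, 'COUNT', 100);
      if (keys.length > 0) {
        await this.l2Cache.del(...keys);
      }
      cursor = next;
    } while (cursor !== '0');
  }

  // node-cache throws once maxKeys is reached; a full L1 must not fail the request.
  private setL1(key: string, value: unknown, ttl?: number): void {
    try {
      if (ttl === undefined) this.l1Cache.set(key, value);
      else this.l1Cache.set(key, value, ttl);
    } catch (err) {
      logger.debug('L1 cache full, skipping L1 write', { key });
    }
  }
}
```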
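The overview diagram shows the database result being stored in L2 and L1 after a miss, but the class above leaves that step to the caller. A minimal read-through sketch of that step follows. `CacheLike`, `getOrLoad`, and `MapCache` are hypothetical names introduced for illustration; `MapCache` is only an in-memory stand-in so the example runs without Redis.

```typescript
// Minimal cache contract matching the get/set signatures of MultiLayerCache.
interface CacheLike {
  get<T>(key: string): Promise<T | null>;
  set(key: string, value: unknown, ttl?: number): Promise<void>;
}

// Hypothetical helper: return the cached value, or load it from the
// source of truth and cache it for subsequent requests.
async function getOrLoad<T>(
  cache: CacheLike,
  key: string,
  loader: () => Promise<T | null>,
  ttlSeconds = 300
): Promise<T | null> {
  const cached = await cache.get<T>(key);
  if (cached !== null) return cached;

  const fresh = await loader();
  if (fresh !== null) {
    // Only real results are cached; a missing row is not stored.
    await cache.set(key, fresh, ttlSeconds);
  }
  return fresh;
}

// In-memory stand-in for the cache (ignores TTL), used to keep the sketch runnable.
class MapCache implements CacheLike {
  private store = new Map<string, string>();

  async get<T>(key: string): Promise<T | null> {
    const raw = this.store.get(key);
    return raw === undefined ? null : (JSON.parse(raw) as T);
  }

  async set(key: string, value: unknown): Promise<void> {
    this.store.set(key, JSON.stringify(value));
  }
}
```

Because missing rows are not cached in this sketch, repeated lookups for a non-existent identifier reach the database every time. Whether to cache such negative results is a separate design choice.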
### Cache Key Naming
**Pattern**: `{service}:{entity}:{identifier}:{sub-resource}`
**Examples**:
```typescript
const cacheKeys = {
  user: (userId: string) => `iam:user:${userId}`,
  userPermissions: (userId: string) => `iam:user:${userId}:permissions`,
  userRoles: (userId: string) => `iam:user:${userId}:roles`,
  session: (sessionId: string) => `iam:session:${sessionId}`,
};

// Usage
const user = await cache.get(cacheKeys.user('user_123'));
const permissions = await cache.get(cacheKeys.userPermissions('user_123'));
```
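If keys are assembled in many places, the naming pattern can also be enforced by a small builder. The following is a sketch under assumptions: `buildCacheKey` is a hypothetical helper, and the set of rejected characters is a choice made here to keep keys unambiguous and safe to use with glob-based invalidation.

```typescript
// Hypothetical helper that builds keys in the
// {service}:{entity}:{identifier}:{sub-resource} shape.
function buildCacheKey(
  service: string,
  entity: string,
  identifier: string,
  subResource?: string
): string {
  const segments = [service, entity, identifier];
  if (subResource !== undefined) segments.push(subResource);

  for (const segment of segments) {
    // Reject empty segments, the ":" separator, whitespace, and glob
    // metacharacters that would change how pattern invalidation matches.
    if (segment.length === 0 || /[:*?\[\]\s]/.test(segment)) {
      throw new Error(`Invalid cache key segment: "${segment}"`);
    }
  }
  return segments.join(':');
}

console.log(buildCacheKey('iam', 'user', 'user_123'));
console.log(buildCacheKey('iam', 'user', 'user_123', 'permissions'));
```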
## TTL Strategies
```mermaid
graph LR
subgraph "TTL Tiers"
Short[Short TTL<br/>60-300s<br/>Frequently changing]
Medium[Medium TTL<br/>300-1800s<br/>Moderately changing]
Long[Long TTL<br/>1800-7200s<br/>Rarely changing]
end
Short --> Permissions[User Permissions]
Short --> Sessions[Session Data]
Medium --> UserProfiles[User Profiles]
Medium --> OrgData[Organization Data]
Long --> Config[Static Config]
Long --> RefData[Reference Data]
style Short fill:#f8d7da
style Medium fill:#fff3cd
style Long fill:#d4edda
```
**TTL Guidelines**:
| Data Type | TTL | Reason |
|-----------|-----|--------|
| User permissions | 5 min | Security-sensitive |
| Session data | Varies | Based on session length |
| User profiles | 10 min | Moderate update frequency |
| Organization data | 15 min | Infrequent updates |
| Static config | 30-60 min | Very stable |
| Reference data | 1-2 hours | Almost never changes |
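The guidelines above can be captured as constants so that call sites do not hard-code numbers. This is a sketch, not the project's configuration: where the table gives a range, the lower bound is assumed; session data is left out because its TTL varies; and `TTL_SECONDS` and `ttlFor` are hypothetical names. The L1 value is capped at 5 minutes, matching the L1 maximum described earlier.

```typescript
// TTLs in seconds, taken from the guideline table. Where the table gives a
// range, the lower bound is used as a conservative assumption.
const TTL_SECONDS = {
  userPermissions: 5 * 60,
  userProfile: 10 * 60,
  organization: 15 * 60,
  staticConfig: 30 * 60,
  referenceData: 60 * 60,
} as const;

type CachedDataType = keyof typeof TTL_SECONDS;

const L1_TTL_CAP_SECONDS = 5 * 60;

// Returns the TTL to use in each layer for a given data type.
function ttlFor(dataType: CachedDataType): { l1: number; l2: number } {
  const l2 = TTL_SECONDS[dataType];
  return { l1: Math.min(l2, L1_TTL_CAP_SECONDS), l2 };
}

console.log(ttlFor('userPermissions'));
console.log(ttlFor('referenceData'));
```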
## Cache Invalidation
```mermaid
sequenceDiagram
participant API
participant Service
participant Cache
participant DB
API->>Service: Update User
Service->>DB: UPDATE user
DB-->>Service: Success
Service->>Cache: Invalidate user:123
Service->>Cache: Invalidate user:123:permissions
Service->>Cache: Invalidate user:123:roles
Cache-->>Service: Cleared
Service-->>API: Success
Note over Service,Cache: Next request will fetch fresh data
```
**Invalidation Strategies**:
```typescript
// 1. Single key invalidation
async updateUser(userId: string, data: UpdateUserDto): Promise<User> {
  const user = await userRepository.update(userId, data);
  // Invalidate user cache
  await cache.del(cacheKeys.user(userId));
  return user;
}

// 2. Pattern-based invalidation
async updateUserRole(userId: string, roleId: string): Promise<void> {
  await userRoleRepository.assign(userId, roleId);
  // Invalidate all user-related cache entries. The glob below matches only
  // sub-resource keys (iam:user:<id>:...), so the base key is deleted explicitly.
  await cache.del(cacheKeys.user(userId));
  await cache.invalidatePattern(`iam:user:${userId}:*`);
}

// 3. Time-based invalidation (TTL expiry)
// Automatically handled by cache
```
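Pattern-based invalidation clears the entire L1 cache, which is simple but discards unrelated hot entries. One alternative is to delete only the L1 keys that match the pattern. The sketch below shows the matching step and supports only the `*` wildcard; `globToRegExp` and `matchingKeys` are hypothetical helpers.

```typescript
// Convert a Redis-style glob that uses only "*" into an anchored RegExp.
// Other glob features ("?", "[...]") are treated as literal characters here.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split('*')
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp(`^${escaped}$`);
}

// Return the subset of keys that match the pattern.
function matchingKeys(keys: string[], pattern: string): string[] {
  const matcher = globToRegExp(pattern);
  return keys.filter((key) => matcher.test(key));
}

const l1Keys = [
  'iam:user:user_123',
  'iam:user:user_123:roles',
  'iam:user:user_456:roles',
  'iam:session:abc',
];
const staleKeys = matchingKeys(l1Keys, 'iam:user:user_123:*');
console.log(staleKeys);
```

With NodeCache, the key list would come from `keys()` and the matches could be passed to `del()`, which accepts an array. Note that the base key `iam:user:user_123` does not match `iam:user:user_123:*` and has to be deleted separately.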
## Cache Warming
```typescript
// Preload frequently accessed data
async function warmCache(): Promise<void> {
  logger.info('Starting cache warming');
  // Warm user permissions for active users
  const activeUsers = await userRepository.findActive({ limit: 1000 });
  for (const user of activeUsers) {
    const permissions = await rbacService.getUserPermissions(user.id);
    await cache.set(
      cacheKeys.userPermissions(user.id),
      permissions,
      300 // 5 minutes
    );
  }
  logger.info('Cache warming completed', { count: activeUsers.length });
}

// Run on service startup
warmCache().catch(err => logger.error('Cache warming failed', { err }));
```
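The loop above warms one user at a time, so warm-up time grows linearly with the number of users and a single failure aborts the run. A possible variation processes users in bounded batches and counts failures instead of stopping. This is a sketch: `warmInBatches` is a hypothetical helper, and the batch size of 50 is an arbitrary assumption to be tuned against database and Redis capacity.

```typescript
// Hypothetical helper: warm entries in fixed-size batches so that startup
// does not wait on one sequential round trip per user.
async function warmInBatches<T>(
  ids: string[],
  load: (id: string) => Promise<T>,
  store: (id: string, value: T) => Promise<void>,
  batchSize = 50
): Promise<{ warmed: number; failed: number }> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  let warmed = 0;
  let failed = 0;

  for (let i = 0; i < ids.length; i += batchSize) {
    const batch = ids.slice(i, i + batchSize);
    // Each entry resolves to true on success and false on failure, so one
    // rejected load or store does not reject the whole batch.
    const outcomes = await Promise.all(
      batch.map((id) =>
        load(id)
          .then((value) => store(id, value))
          .then(() => true, () => false)
      )
    );
    for (const ok of outcomes) {
      if (ok) warmed += 1;
      else failed += 1;
    }
  }
  return { warmed, failed };
}
```

In the warming routine, `load` would wrap `rbacService.getUserPermissions` and `store` would wrap `cache.set` with the permissions key and a 300-second TTL.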
## Design Decisions
### Decision 1: Multi-layer Caching (L1 + L2)
**Context**: We need to reduce load on Redis and achieve sub-millisecond latency for hot data.
**Decision**: Use a combination of L1 (NodeCache) and L2 (Redis).
**Consequences**:
- ✅ Latency < 1ms for the 40-50% of requests served from L1.
- ✅ Reduced network traffic to Redis.
- ❌ Synchronization complexity: after one instance updates or invalidates a key, the L1 copy on other instances can stay stale until its TTL expires (at most 5 minutes).
## Performance Characteristics
### Performance Targets
| Metric | Target | Notes |
|--------|--------|-------|
| **L1 Hit Latency** | < 0.5ms | In-memory lookup |
| **L2 Hit Latency** | < 5ms | Network RTT + Redis processing |
| **Combined Hit Rate** | > 90% | L1 + L2 combined |
| **L1 Capacity** | 10k items | Per instance limit to protect heap |
| **Cache Warmup Time** | < 30s | At service startup |
## Security Considerations
### Cache Security
- **Encryption**: Sensitive data (PII) MUST be encrypted (AES-256) before it is stored in L2 Redis. L1 may hold plaintext because it lives in process memory, although a memory dump or core dump would still expose it.
- **Isolation**: Redis instance protected by password and Network Policy (allow internal K8s traffic only).
- **TLS**: Connect to Redis via TLS 1.2+.
- **Data Sanitization**: Do not cache entire user objects if they contain password hashes or secrets.
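The encryption requirement could be met along the following lines. The document specifies AES-256 only; the GCM mode, the payload layout, and the names `encryptForCache` and `decryptFromCache` are assumptions made for this sketch. It uses Node's built-in `crypto` module, and key management is out of scope here.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

const IV_LENGTH = 12;  // bytes, the usual IV size for GCM
const TAG_LENGTH = 16; // bytes, GCM authentication tag

// Encrypts a JSON-serializable value with AES-256-GCM (key must be 32 bytes).
// Output layout (base64): IV | auth tag | ciphertext.
function encryptForCache(value: unknown, key: Buffer): string {
  const iv = randomBytes(IV_LENGTH); // fresh IV for every entry
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(JSON.stringify(value), 'utf8'),
    cipher.final(),
  ]);
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]).toString('base64');
}

// Reverses encryptForCache; throws if the key is wrong or the payload was modified.
function decryptFromCache<T>(payload: string, key: Buffer): T {
  const raw = Buffer.from(payload, 'base64');
  const iv = raw.subarray(0, IV_LENGTH);
  const tag = raw.subarray(IV_LENGTH, IV_LENGTH + TAG_LENGTH);
  const ciphertext = raw.subarray(IV_LENGTH + TAG_LENGTH);

  const decipher = createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  const plaintext = Buffer.concat([decipher.update(ciphertext), decipher.final()]);
  return JSON.parse(plaintext.toString('utf8')) as T;
}

// Demo key only; a real key would come from a secret store.
const demoKey = randomBytes(32);
const storedValue = encryptForCache({ email: 'user@example.com' }, demoKey);
const restoredValue = decryptFromCache<{ email: string }>(storedValue, demoKey);
console.log(restoredValue.email);
```

The encrypted string is what would be written to Redis in place of the plain `JSON.stringify` output, while L1 can continue to hold the decrypted object.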
## Deployment
```mermaid
graph TD
subgraph "Kubernetes Pod"
Service[Microservice Container]
L1["L1 Cache (RAM)"]
Service --- L1
end
subgraph "Infrastructure"
RedisMaster[Redis Master]
RedisSlave1[Redis Slave 1]
RedisSlave2[Redis Slave 2]
end
Service -->|Write| RedisMaster
Service -->|Read| RedisSlave1
Service -->|Read| RedisSlave2
RedisMaster -.->|Replication| RedisSlave1
RedisMaster -.->|Replication| RedisSlave2
style Service fill:#e1f5ff
style L1 fill:#d4edda
style RedisMaster fill:#fff4e1
```
**Deployment Description**:
- **L1**: Embedded directly in Microservice process, scales with number of Pods.
- **L2**: Redis Cluster (or Sentinel) with at least 3 nodes for High Availability.
- **Connection Management**: ioredis multiplexes commands over a single connection per client, so create one shared client per process and reuse it rather than opening a connection per request.
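Connection settings for this deployment, including the password and TLS requirements from the security section, might be derived from the environment as sketched below. `REDIS_HOST` and `REDIS_PORT` appear in the implementation above; `REDIS_PASSWORD` and `REDIS_TLS` are hypothetical variable names. The returned fields (`host`, `port`, `password`, `tls`) are ioredis connection options, and `buildRedisOptions` itself is an illustrative helper.

```typescript
// Subset of ioredis connection options used in this sketch.
interface RedisConnectionOptions {
  host: string;
  port: number;
  password?: string;
  tls?: { minVersion: 'TLSv1.2' };
}

function buildRedisOptions(
  env: Record<string, string | undefined>
): RedisConnectionOptions {
  const port = Number.parseInt(env.REDIS_PORT ?? '6379', 10);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid REDIS_PORT: ${env.REDIS_PORT}`);
  }

  const options: RedisConnectionOptions = {
    host: env.REDIS_HOST ?? 'localhost',
    port,
  };
  // REDIS_PASSWORD and REDIS_TLS are assumed variable names.
  if (env.REDIS_PASSWORD) options.password = env.REDIS_PASSWORD;
  if (env.REDIS_TLS === 'true') options.tls = { minVersion: 'TLSv1.2' };
  return options;
}

console.log(buildRedisOptions({ REDIS_HOST: 'redis.internal', REDIS_TLS: 'true' }));
```

The result can be passed to `new Redis(options)`. Failing fast on a malformed port surfaces configuration mistakes at startup instead of at the first cache call.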
## Monitoring & Observability
### Monitoring Metrics
- **Metrics**: Prometheus metrics for hit rate, miss rate, latency, memory usage.
- **Logs**: Log cache miss/hit at debug level (sampled), log connection errors at error level.
- **Health Checks**: Readiness probe checks connection to Redis.
### Monitoring Code
**Cache Hit Rates**:
```typescript
// Track cache performance
export class CacheMetrics {
// ... Prometheus Implementation ...
}
```
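The Prometheus implementation is elided above and is not reproduced here. As a dependency-free illustration of what the hit-rate figures mean, the sketch below counts lookups by outcome and derives the L1 and combined hit rates. `HitRateTracker` and `CacheOutcome` are hypothetical names, and a real implementation would export these counts through the metrics client rather than hold them in memory.

```typescript
type CacheOutcome = 'l1_hit' | 'l2_hit' | 'miss';

// In-memory counter for cache lookups, grouped by outcome.
class HitRateTracker {
  private counts: Record<CacheOutcome, number> = { l1_hit: 0, l2_hit: 0, miss: 0 };

  record(outcome: CacheOutcome): void {
    this.counts[outcome] += 1;
  }

  // l1HitRate: share of lookups served by L1.
  // combinedHitRate: share of lookups served by L1 or L2.
  snapshot(): { total: number; l1HitRate: number; combinedHitRate: number } {
    const { l1_hit, l2_hit, miss } = this.counts;
    const total = l1_hit + l2_hit + miss;
    if (total === 0) {
      return { total: 0, l1HitRate: 0, combinedHitRate: 0 };
    }
    return {
      total,
      l1HitRate: l1_hit / total,
      combinedHitRate: (l1_hit + l2_hit) / total,
    };
  }
}

const tracker = new HitRateTracker();
tracker.record('l1_hit');
tracker.record('l2_hit');
tracker.record('miss');
tracker.record('l1_hit');
console.log(tracker.snapshot());
```

The `record` calls would sit next to the existing `logger.debug` statements in `get`, one per outcome.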
**Expected Performance**:
| Metric | L1 Cache | L2 Cache | Database |
|--------|----------|----------|----------|
| Latency | < 1ms | < 5ms | < 50ms |
| Hit Rate | 40-50% | 80-90% | - |
| Capacity | 10k keys | Bounded by Redis memory | - |
## Best Practices
**DO**:
- ✅ Use cache for frequently accessed data
- ✅ Set appropriate TTLs based on data change frequency
- ✅ Invalidate cache on data updates
- ✅ Use cache key namespacing
- ✅ Monitor cache hit rates
- ✅ Warm cache on startup for critical data
**DON'T**:
- ❌ Cache data that changes very frequently
- ❌ Set TTL too long (stale data risk)
- ❌ Set TTL too short (negates cache benefit)
- ❌ Cache sensitive data without encryption
- ❌ Ignore cache invalidation on updates
- ❌ Use cache as primary data store