# Caching Architecture

> Multi-layer caching strategy for optimal performance

## Overview Diagram

```mermaid
graph TD
    Request[API Request] --> L1{L1 Cache<br/>Memory}

    L1 -->|Hit| Return1[Return<br/>< 1ms]
    L1 -->|Miss| L2{L2 Cache<br/>Redis}

    L2 -->|Hit| WarmL1[Warm L1]
    WarmL1 --> Return2[Return<br/>< 5ms]

    L2 -->|Miss| DB[(Database)]
    DB --> StoreL2[Store L2 + L1]
    StoreL2 --> Return3[Return<br/>< 50ms]

    style L1 fill:#d4edda
    style L2 fill:#fff4e1
    style DB fill:#f0e1ff
```

## System Context

```mermaid
C4Context
    title Caching System Context

    System(service, "Microservice", "Client service using cache")
    System_Ext(db, "Neon PostgreSQL", "Primary database")

    Boundary(caching, "Caching Layer") {
        System(l1, "L1 Cache", "In-memory NodeCache")
        System(l2, "L2 Cache", "Redis Cluster")
    }

    Rel(service, l1, "Reads/Writes", "In-process")
    Rel(service, l2, "Reads/Writes", "Redis Protocol")
    Rel(l1, l2, "Fills from", "On miss")
    Rel(l2, db, "Cache aside", "On miss")
```

### Context Description
- **Service**: Communicates directly with L1 Cache (in-memory) for lowest latency.
- **L1 Cache**: Local cache, not shared, automatic expiration (short TTL).
- **L2 Cache**: Shared Redis cluster, holds data longer and syncs across instances.
- **Database**: Source of truth, accessed only on cache miss.

## Architecture Description

### Multi-Layer Caching

The GoodGo platform uses two cache layers to keep read latency low:

**L1 Cache (Memory)**:
- In-memory cache per service instance
- Very fast access (< 1ms)
- Limited capacity (10k keys default)
- Short TTL (60 seconds default, max 5 minutes)
- Not shared across instances

**L2 Cache (Redis)**:
- Shared distributed cache
- Fast access (< 5ms)
- Large capacity
- Longer TTL (configurable, typically 5-15 minutes)
- Shared across all service instances

**Cache Flow**:
```
Request  →    L1     →    L2     →  Database
              ↓           ↓            ↓
            40-50%      80-90%      10-20%
           hit rate    hit rate    miss rate
```
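
As a rough worked example, the figures above can be combined into an expected per-request latency. This is only a sketch under stated assumptions: it reads 80-90% as the combined L1 + L2 hit rate (consistent with the 10-20% miss rate), picks midpoints for the shares, and uses the upper-bound latency quoted for each path. `Tier` and `expectedLatencyMs` are illustrative names, not part of the platform.

```typescript
// Assumed traffic mix: L1 answers 45% of requests, L2 another 45%,
// and the remaining 10% fall through to the database.
interface Tier {
  name: string;
  share: number;     // fraction of all requests answered by this tier
  latencyMs: number; // end-to-end latency when this tier answers
}

const tiers: Tier[] = [
  { name: 'L1', share: 0.45, latencyMs: 1 },
  { name: 'L2', share: 0.45, latencyMs: 5 },
  { name: 'Database', share: 0.10, latencyMs: 50 },
];

// Weighted average: sum of share * latency over all tiers.
function expectedLatencyMs(mix: Tier[]): number {
  return mix.reduce((sum, tier) => sum + tier.share * tier.latencyMs, 0);
}

// 0.45 * 1 + 0.45 * 5 + 0.10 * 50 = 7.70
console.log(expectedLatencyMs(tiers).toFixed(2));
```

Under these assumptions the database path contributes most of the average even at a 10% miss rate, so the miss rate is the figure to watch first.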

## Cache Implementation

### Multi-Layer Cache Service

```typescript
import NodeCache from 'node-cache';
import Redis from 'ioredis';
// `logger` is assumed to be the service's shared logger instance.

export class MultiLayerCache {
  private l1Cache: NodeCache;
  private l2Cache: Redis;

  constructor() {
    // L1: Memory cache
    this.l1Cache = new NodeCache({
      stdTTL: 60,       // 60 seconds default
      maxKeys: 10000,   // Max 10k keys; node-cache throws on set() when full
      checkperiod: 120  // Check for expired keys every 2min
    });

    // L2: Redis cache
    this.l2Cache = new Redis({
      host: process.env.REDIS_HOST,
      // Fall back to the default Redis port if REDIS_PORT is unset
      port: parseInt(process.env.REDIS_PORT ?? '6379', 10),
      db: 0
    });
  }

  async get<T>(key: string): Promise<T | null> {
    // Try L1 first. Compare against undefined so that falsy values
    // (0, '', false) still count as hits.
    const l1Value = this.l1Cache.get<T>(key);
    if (l1Value !== undefined) {
      logger.debug('L1 cache hit', { key });
      return l1Value;
    }

    // Try L2 (returns null when the key does not exist)
    const l2Value = await this.l2Cache.get(key);
    if (l2Value !== null) {
      logger.debug('L2 cache hit', { key });
      const parsed = JSON.parse(l2Value) as T;

      // Warm L1 cache (uses the default stdTTL)
      this.l1Cache.set(key, parsed);
      return parsed;
    }

    logger.debug('Cache miss', { key });
    return null;
  }

  async set(key: string, value: unknown, ttl: number = 300): Promise<void> {
    // Store in both L1 and L2
    this.l1Cache.set(key, value, Math.min(ttl, 300)); // L1 max 5min
    await this.l2Cache.setex(key, ttl, JSON.stringify(value));
  }

  async del(key: string): Promise<void> {
    this.l1Cache.del(key);
    await this.l2Cache.del(key);
  }

  async invalidatePattern(pattern: string): Promise<void> {
    // L1: Clear all (simple approach); only affects this instance
    this.l1Cache.flushAll();

    // L2: Delete by pattern. KEYS scans the whole keyspace and blocks
    // Redis while it runs, so it is only suitable for small datasets.
    const keys = await this.l2Cache.keys(pattern);
    if (keys.length > 0) {
      await this.l2Cache.del(...keys);
    }
  }
}
```
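
The overview diagram shows a miss falling through to the database and then populating both layers, while the class above only exposes `get` and `set`. One way to tie these together is a small cache-aside helper. The following is a minimal sketch, not part of the original service: `ReadWriteCache`, `getOrLoad`, and `MapCache` are hypothetical names, and `MapCache` is an in-memory stand-in so the example runs without Redis.

```typescript
// Minimal shape of the cache used by the helper.
interface ReadWriteCache {
  get<T>(key: string): Promise<T | null>;
  set(key: string, value: unknown, ttl?: number): Promise<void>;
}

// Cache-aside read: return the cached value, or load it from the source
// of truth and store it for the next caller.
async function getOrLoad<T>(
  cache: ReadWriteCache,
  key: string,
  loader: () => Promise<T | null>,
  ttlSeconds = 300,
): Promise<T | null> {
  const cached = await cache.get<T>(key);
  if (cached !== null) return cached;

  const fresh = await loader(); // e.g. a repository call
  if (fresh !== null) {
    await cache.set(key, fresh, ttlSeconds);
  }
  return fresh;
}

// In-memory stand-in for demonstration; it ignores TTLs.
class MapCache implements ReadWriteCache {
  private store = new Map<string, unknown>();

  async get<T>(key: string): Promise<T | null> {
    return this.store.has(key) ? (this.store.get(key) as T) : null;
  }

  async set(key: string, value: unknown): Promise<void> {
    this.store.set(key, value);
  }
}
```

In this sketch a `null` result from the loader is not cached, so lookups for missing records reach the database every time. Whether that is acceptable depends on the workload.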

### Cache Key Naming

**Pattern**: `{service}:{entity}:{identifier}[:{sub-resource}]` (the sub-resource segment is optional)

**Examples**:
```typescript
const cacheKeys = {
  user: (userId: string) => `iam:user:${userId}`,
  userPermissions: (userId: string) => `iam:user:${userId}:permissions`,
  userRoles: (userId: string) => `iam:user:${userId}:roles`,
  session: (sessionId: string) => `iam:session:${sessionId}`,
};

// Usage
const user = await cache.get(cacheKeys.user('user_123'));
const permissions = await cache.get(cacheKeys.userPermissions('user_123'));
```
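
Because keys are later matched with glob patterns during invalidation, it can help to reject segments that contain separators or wildcard characters. The helper below is a hypothetical sketch of the naming pattern above; `buildKey` is not an existing function in the codebase.

```typescript
// Joins segments with ':' and rejects values that would break the naming
// pattern or behave as wildcards in a pattern-based delete.
function buildKey(
  service: string,
  entity: string,
  id: string,
  subResource?: string,
): string {
  const parts =
    subResource === undefined
      ? [service, entity, id]
      : [service, entity, id, subResource];

  for (const part of parts) {
    if (part.length === 0 || /[:*?\[\]\s]/.test(part)) {
      throw new Error(`Invalid cache key segment: "${part}"`);
    }
  }
  return parts.join(':');
}

console.log(buildKey('iam', 'user', 'user_123'));                // iam:user:user_123
console.log(buildKey('iam', 'user', 'user_123', 'permissions')); // iam:user:user_123:permissions
```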

## TTL Strategies

```mermaid
graph LR
    subgraph "TTL Tiers"
        Short[Short TTL<br/>60-300s<br/>Frequently changing]
        Medium[Medium TTL<br/>300-1800s<br/>Moderately changing]
        Long[Long TTL<br/>1800-3600s<br/>Rarely changing]
    end

    Short --> Permissions[User Permissions]
    Short --> Sessions[Session Data]

    Medium --> UserProfiles[User Profiles]
    Medium --> OrgData[Organization Data]

    Long --> Config[Static Config]
    Long --> RefData[Reference Data]

    style Short fill:#f8d7da
    style Medium fill:#fff3cd
    style Long fill:#d4edda
```

**TTL Guidelines**:

| Data Type | TTL | Reason |
|-----------|-----|--------|
| User permissions | 5 min | Security-sensitive |
| Session data | Varies | Based on session length |
| User profiles | 10 min | Moderate update frequency |
| Organization data | 15 min | Infrequent updates |
| Static config | 30-60 min | Very stable |
| Reference data | 1-2 hours | Almost never changes |
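
The guidelines can be kept in one place as constants so call sites do not hard-code numbers. This is a sketch only: `TTL_SECONDS`, `L1_MAX_TTL_SECONDS`, and `layerTtls` are illustrative names. Where the table gives a range, the lower bound is used here as an assumption, and session data is left out because its TTL varies.

```typescript
// TTLs in seconds, taken from the guidelines table.
const TTL_SECONDS = {
  userPermissions: 5 * 60,
  userProfile: 10 * 60,
  organization: 15 * 60,
  staticConfig: 30 * 60,  // lower bound of 30-60 min (assumption)
  referenceData: 60 * 60, // lower bound of 1-2 hours (assumption)
} as const;

// L1 entries are capped at 5 minutes, matching MultiLayerCache.set.
const L1_MAX_TTL_SECONDS = 300;

// TTL that each layer ends up with for a requested TTL.
function layerTtls(ttlSeconds: number): { l1: number; l2: number } {
  return {
    l1: Math.min(ttlSeconds, L1_MAX_TTL_SECONDS),
    l2: ttlSeconds,
  };
}

console.log(layerTtls(TTL_SECONDS.organization)); // { l1: 300, l2: 900 }
```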

## Cache Invalidation

```mermaid
sequenceDiagram
    participant API
    participant Service
    participant Cache
    participant DB

    API->>Service: Update User
    Service->>DB: UPDATE user
    DB-->>Service: Success

    Service->>Cache: Invalidate user:123
    Service->>Cache: Invalidate user:123:permissions
    Service->>Cache: Invalidate user:123:roles
    Cache-->>Service: Cleared

    Service-->>API: Success

    Note over Service,Cache: Next request will fetch fresh data
```

**Invalidation Strategies**:

```typescript
// Methods of a service class, shown without the surrounding class body.

// 1. Single key invalidation
async updateUser(userId: string, data: UpdateUserDto): Promise<User> {
  const user = await userRepository.update(userId, data);

  // Invalidate user cache
  await cache.del(cacheKeys.user(userId));

  return user;
}

// 2. Pattern-based invalidation
async updateUserRole(userId: string, roleId: string): Promise<void> {
  await userRoleRepository.assign(userId, roleId);

  // Invalidate every sub-resource key for the user (permissions, roles).
  // The trailing ':*' does not match the base key iam:user:{id}; delete
  // that key separately if it must be cleared too.
  await cache.invalidatePattern(`iam:user:${userId}:*`);
}

// 3. Time-based invalidation (TTL expiry)
// Automatically handled by cache
```
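
The `invalidatePattern` method relies on the Redis `KEYS` command, which walks the entire keyspace in one blocking call. A common alternative is to iterate with `SCAN` in small batches. The sketch below assumes a client whose `scan` and `del` methods follow the shape of the ioredis calls of the same name; `ScanClient` and `deleteByPattern` are hypothetical names, and in a Redis Cluster the loop would need to run against each node.

```typescript
// Minimal client shape assumed by this sketch (modeled on ioredis).
interface ScanClient {
  scan(
    cursor: string,
    matchToken: 'MATCH',
    pattern: string,
    countToken: 'COUNT',
    count: number,
  ): Promise<[string, string[]]>;
  del(...keys: string[]): Promise<number>;
}

// Deletes all keys matching `pattern` without blocking Redis for the
// duration of a full keyspace walk. Returns the number of keys removed.
async function deleteByPattern(
  client: ScanClient,
  pattern: string,
  batchSize = 100,
): Promise<number> {
  let cursor = '0';
  let deleted = 0;

  do {
    // Each call returns the next cursor and a (possibly empty) page of keys.
    const [nextCursor, keys] = await client.scan(
      cursor, 'MATCH', pattern, 'COUNT', batchSize,
    );
    if (keys.length > 0) {
      deleted += await client.del(...keys);
    }
    cursor = nextCursor;
  } while (cursor !== '0'); // a cursor of '0' marks the end of the iteration

  return deleted;
}
```

`COUNT` is only a hint to Redis about how much work to do per call, so pages may be smaller or larger than `batchSize`.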

## Cache Warming

```typescript
// Preload frequently accessed data
async function warmCache(): Promise<void> {
  logger.info('Starting cache warming');

  // Warm user permissions for active users
  const activeUsers = await userRepository.findActive({ limit: 1000 });

  for (const user of activeUsers) {
    const permissions = await rbacService.getUserPermissions(user.id);

    await cache.set(
      cacheKeys.userPermissions(user.id),
      permissions,
      300 // 5 minutes
    );
  }

  logger.info('Cache warming completed', { count: activeUsers.length });
}

// Run on service startup
warmCache().catch(err => logger.error('Cache warming failed', { err }));
```
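
The loop above warms one user at a time, which may be slow for 1000 users if each permission lookup involves a database round trip. One option is to process users in bounded batches. The helper below is a hypothetical sketch (`inBatches` is not an existing utility), and the batch size of 50 in the usage comment is an arbitrary assumption.

```typescript
// Runs `task` over `items`, with at most `batchSize` tasks in flight.
// Each batch must finish before the next one starts.
async function inBatches<T>(
  items: T[],
  batchSize: number,
  task: (item: T) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    await Promise.all(batch.map(task));
  }
}

// Usage with the warming loop above (names taken from that example):
// await inBatches(activeUsers, 50, async (user) => {
//   const permissions = await rbacService.getUserPermissions(user.id);
//   await cache.set(cacheKeys.userPermissions(user.id), permissions, 300);
// });
```

`Promise.all` rejects as soon as one task fails, so a single failing user aborts the remaining batches. Catch errors inside the task if warming should continue regardless.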

## Design Decisions

### Decision 1: Multi-layer Caching (L1 + L2)

**Context**: Need to reduce load on Redis and achieve ultra-low latency for hot data.

**Decision**: Use a combination of L1 (NodeCache) and L2 (Redis).

**Consequences**:
- ✅ Latency < 1ms for 40-50% of requests.
- ✅ Reduced network traffic to Redis.
- ❌ Synchronization complexity: L1 is per instance, so an instance can serve stale data until its L1 entry expires (at most 5 minutes).

## Performance Characteristics

### Performance Targets

| Metric | Target | Notes |
|--------|--------|-------|
| **L1 Hit Latency** | < 0.5ms | In-memory lookup |
| **L2 Hit Latency** | < 5ms | Network RTT + Redis processing |
| **Combined Hit Rate** | > 90% | L1 + L2 combined |
| **L1 Capacity** | 10k items | Per instance limit to protect heap |
| **Cache Warmup Time** | < 30s | At service startup |

## Security Considerations

### Cache Security
- **Encryption**: Sensitive data (PII) MUST be encrypted (AES-256) before it is stored in L2 Redis. L1 may hold plaintext because it lives in process memory, although that plaintext would still be exposed by a memory dump.
- **Isolation**: The Redis instance is protected by a password and a Network Policy that allows internal K8s traffic only.
- **TLS**: Connect to Redis over TLS 1.2+.
- **Data Sanitization**: Do not cache entire user objects if they contain password hashes or secrets.
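
The encryption requirement above names AES-256 but not a mode. As one possible shape, the sketch below uses AES-256-GCM from Node's built-in `crypto` module so that tampering is detected on read. The mode, the `iv | tag | ciphertext` layout, and the function names `encryptForCache` and `decryptFromCache` are all assumptions for illustration. Key management (loading a 32-byte key from a secret store) is out of scope here.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

// Encrypts a JSON-serializable value into one base64 string laid out as
// iv (12 bytes) | auth tag (16 bytes) | ciphertext.
function encryptForCache(value: unknown, key: Buffer): string {
  const iv = randomBytes(12); // fresh nonce for every encryption
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(JSON.stringify(value), 'utf8'),
    cipher.final(),
  ]);
  const tag = cipher.getAuthTag();
  return Buffer.concat([iv, tag, ciphertext]).toString('base64');
}

// Reverses encryptForCache. Throws if the key is wrong or the payload
// was modified, because GCM authentication fails.
function decryptFromCache<T>(payload: string, key: Buffer): T {
  const raw = Buffer.from(payload, 'base64');
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const ciphertext = raw.subarray(28);

  const decipher = createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  const plaintext = Buffer.concat([
    decipher.update(ciphertext),
    decipher.final(),
  ]);
  return JSON.parse(plaintext.toString('utf8')) as T;
}
```

With helpers like these, the string returned by `encryptForCache` would be what gets written to Redis, while L1 could continue to hold the decrypted object.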

## Deployment

```mermaid
graph TD
    subgraph "Kubernetes Pod"
        Service[Microservice Container]
        L1["L1 Cache (RAM)"]
        Service --- L1
    end

    subgraph "Infrastructure"
        RedisMaster[Redis Master]
        RedisSlave1[Redis Slave 1]
        RedisSlave2[Redis Slave 2]
    end

    Service -->|Write| RedisMaster
    Service -->|Read| RedisSlave1
    Service -->|Read| RedisSlave2

    RedisMaster -.->|Replication| RedisSlave1
    RedisMaster -.->|Replication| RedisSlave2

    style Service fill:#e1f5ff
    style L1 fill:#d4edda
    style RedisMaster fill:#fff4e1
```

**Deployment Description**:
- **L1**: Embedded directly in the Microservice process; scales with the number of Pods.
- **L2**: Redis Cluster (or Sentinel) with at least 3 nodes for High Availability.
- **Connection Management**: Share one ioredis client per process. ioredis does not pool connections; each client sends its commands over a single connection.

## Monitoring & Observability

### Monitoring Metrics
- **Metrics**: Prometheus metrics for hit rate, miss rate, latency, memory usage.
- **Logs**: Log cache miss/hit at debug level (sampled), log connection errors at error level.
- **Health Checks**: Readiness probe checks connection to Redis.

### Monitoring Code

**Cache Hit Rates**:
```typescript
// Track cache performance
export class CacheMetrics {
  // ... Prometheus Implementation ...
}
```
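
Independent of the Prometheus wiring, which is elided above, the hit-rate arithmetic itself is simple. The following is a dependency-free sketch; `HitRateTracker` and its method names are hypothetical and do not describe the actual `CacheMetrics` implementation. Rates here are expressed as a share of all lookups.

```typescript
type LookupOutcome = 'l1' | 'l2' | 'miss';

class HitRateTracker {
  private counts: Record<LookupOutcome, number> = { l1: 0, l2: 0, miss: 0 };

  // Call once per cache lookup with the layer that answered it.
  record(outcome: LookupOutcome): void {
    this.counts[outcome] += 1;
  }

  // Rates are fractions of all recorded lookups (0 when nothing is recorded).
  snapshot(): { total: number; l1HitRate: number; l2HitRate: number; combinedHitRate: number } {
    const total = this.counts.l1 + this.counts.l2 + this.counts.miss;
    const ratio = (n: number) => (total === 0 ? 0 : n / total);
    return {
      total,
      l1HitRate: ratio(this.counts.l1),
      l2HitRate: ratio(this.counts.l2),
      combinedHitRate: ratio(this.counts.l1 + this.counts.l2),
    };
  }
}
```

A tracker like this could be updated at the three log points in `MultiLayerCache.get` and its snapshot exported through whatever metrics library the service uses.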

**Expected Performance**:

| Metric | L1 Cache | L2 Cache | Database |
|--------|----------|----------|----------|
| Latency | < 1ms | < 5ms | < 50ms |
| Hit Rate | 40-50% | 80-90% | - |
| Capacity | 10k keys | Large (bounded by Redis memory) | - |

## Best Practices

**DO**:
- ✅ Use cache for frequently accessed data
- ✅ Set appropriate TTLs based on data change frequency
- ✅ Invalidate cache on data updates
- ✅ Use cache key namespacing
- ✅ Monitor cache hit rates
- ✅ Warm cache on startup for critical data

**DON'T**:
- ❌ Cache data that changes very frequently
- ❌ Set TTL too long (stale data risk)
- ❌ Set TTL too short (negates cache benefit)
- ❌ Cache sensitive data without encryption
- ❌ Ignore cache invalidation on updates
- ❌ Use cache as primary data store