# Caching Architecture

> Multi-layer caching strategy for optimal performance

## Overview Diagram

```mermaid
graph TD
    Request[API Request] --> L1{L1 Cache<br/>Memory}

    L1 -->|Hit| Return1[Return<br/>< 1ms]
    L1 -->|Miss| L2{L2 Cache<br/>Redis}

    L2 -->|Hit| WarmL1[Warm L1]
    WarmL1 --> Return2[Return<br/>< 5ms]

    L2 -->|Miss| DB[(Database)]
    DB --> StoreL2[Store L2 + L1]
    StoreL2 --> Return3[Return<br/>< 50ms]

    style L1 fill:#d4edda
    style L2 fill:#fff4e1
    style DB fill:#f0e1ff
```

## System Context

```mermaid
C4Context
    title Caching System Context

    System(service, "Microservice", "Client service using cache")
    System_Ext(db, "Neon PostgreSQL", "Primary database")

    Boundary(caching, "Caching Layer") {
        System(l1, "L1 Cache", "In-memory NodeCache")
        System(l2, "L2 Cache", "Redis Cluster")
    }

    Rel(service, l1, "Reads/Writes", "In-process")
    Rel(service, l2, "Reads/Writes", "Redis Protocol")
    Rel(l1, l2, "Fills from", "On miss")
    Rel(l2, db, "Cache aside", "On miss")
```

### Context Description
- **Service**: Communicates directly with L1 Cache (in-memory) for lowest latency.
- **L1 Cache**: Local cache, not shared, automatic expiration (short TTL).
- **L2 Cache**: Shared Redis cluster, holds data longer and syncs across instances.
- **Database**: Source of truth, read only when both cache layers miss.

## Architecture Description

### Multi-Layer Caching

The GoodGo platform uses two cache layers to reduce latency and database load:

**L1 Cache (Memory)**:
- In-memory cache per service instance
- Very fast access (< 1ms)
- Limited capacity (10k keys default)
- Short TTL (60 seconds default, max 5 minutes)
- Not shared across instances

**L2 Cache (Redis)**:
- Shared distributed cache
- Fast access (< 5ms)
- Large capacity
- Longer TTL (configurable, typically 5-15 minutes; up to 1-2 hours for reference data)
- Shared across all service instances

**Cache Flow**:
```
Request → L1 (Memory) → L2 (Redis) → Database
            ↓               ↓            ↓
          40-50%          80-90%       10-20%
          hit rate        hit rate     of requests
                          (cumulative) (miss in both layers)
```
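
To make the hit rates concrete, the sketch below weights each layer's latency bound by the share of requests it serves. It assumes the 80-90% L2 figure is cumulative, picks mid-range values (45% served by L1, a further 40% by L2, 15% by the database), and uses the upper latency bounds from the overview diagram. `expectedLatencyMs` is a hypothetical helper, not part of the platform.

```typescript
interface LayerShare {
  share: number;     // fraction of all requests served by this layer
  latencyMs: number; // upper latency bound for this layer
}

// Weighted average of per-layer latency bounds.
function expectedLatencyMs(layers: LayerShare[]): number {
  const total = layers.reduce((sum, l) => sum + l.share, 0);
  if (Math.abs(total - 1) > 1e-9) {
    throw new Error('Layer shares must sum to 1');
  }
  return layers.reduce((sum, l) => sum + l.share * l.latencyMs, 0);
}

const estimate = expectedLatencyMs([
  { share: 0.45, latencyMs: 1 },  // L1 hits
  { share: 0.40, latencyMs: 5 },  // L2 hits (L1 missed)
  { share: 0.15, latencyMs: 50 }, // database reads (both layers missed)
]);

console.log(estimate.toFixed(2)); // 9.95
```

Under these assumptions the database share dominates the average (7.5 of roughly 10 ms), so raising the combined hit rate matters more than shaving L1 latency.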

## Cache Implementation

### Multi-Layer Cache Service

```typescript
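import NodeCache from 'node-cache';
import Redis from 'ioredis';
import { logger } from './logger'; // Assumed path to the application logger

export class MultiLayerCache {
  private l1Cache: NodeCache;
  private l2Cache: Redis;

  constructor() {
    // L1: Memory cache
    this.l1Cache = new NodeCache({
      stdTTL: 60,        // 60 seconds default
      maxKeys: 10000,    // Max 10k keys; set() throws once the limit is reached
      checkperiod: 120   // Sweep expired keys every 2min (get() also checks expiry)
    });

    // L2: Redis cache
    this.l2Cache = new Redis({
      host: process.env.REDIS_HOST,
      port: parseInt(process.env.REDIS_PORT ?? '6379', 10),
      db: 0
    });
  }

  async get<T>(key: string): Promise<T | null> {
    // Try L1 first. Compare with undefined so cached falsy values (0, '', false) count as hits.
    const l1Value = this.l1Cache.get<T>(key);
    if (l1Value !== undefined) {
      logger.debug('L1 cache hit', { key });
      return l1Value;
    }

    // Try L2
    const l2Value = await this.l2Cache.get(key);
    if (l2Value !== null) {
      logger.debug('L2 cache hit', { key });
      const parsed = JSON.parse(l2Value) as T;

      // Warm L1 cache (uses the L1 default TTL)
      this.setL1(key, parsed);
      return parsed;
    }

    logger.debug('Cache miss', { key });
    return null;
  }

  async set(key: string, value: unknown, ttl: number = 300): Promise<void> {
    // Store in both L1 and L2
    this.setL1(key, value, Math.min(ttl, 300)); // L1 max 5min
    await this.l2Cache.setex(key, ttl, JSON.stringify(value));
  }

  async del(key: string): Promise<void> {
    // Clears only this instance's L1; other instances keep their copy until its TTL expires
    this.l1Cache.del(key);
    await this.l2Cache.del(key);
  }

  async invalidatePattern(pattern: string): Promise<void> {
    // L1: Clear all (simple approach)
    this.l1Cache.flushAll();

    // L2: Delete by pattern. SCAN iterates incrementally; KEYS would block Redis
    // while it walks the whole keyspace. On Redis Cluster, scan each master node.
    const stream = this.l2Cache.scanStream({ match: pattern, count: 100 });
    for await (const batch of stream) {
      const keys = batch as string[];
      if (keys.length > 0) {
        await this.l2Cache.del(...keys);
      }
    }
  }

  private setL1(key: string, value: unknown, ttl: number = 60): void {
    try {
      this.l1Cache.set(key, value, ttl);
    } catch (err) {
      // L1 is full (maxKeys reached); L2 still serves the key
      logger.debug('L1 cache full, skipping', { key, err });
    }
  }
}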
```
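
The class above returns `null` on a miss and leaves the database read to the caller. A minimal cache-aside sketch of the full path from the overview diagram could look like the following. `getOrLoad`, `TwoLevelCache`, and the `inFlight` map are hypothetical names introduced here, not part of the original service. The in-flight map is an added technique: it coalesces concurrent misses for the same key so the loader runs once per instance.

```typescript
// Minimal shape this helper needs; a class with matching get/set methods satisfies it.
interface TwoLevelCache {
  get<T>(key: string): Promise<T | null>;
  set(key: string, value: unknown, ttl?: number): Promise<void>;
}

// Loads currently in progress, keyed by cache key.
const inFlight = new Map<string, Promise<unknown>>();

async function getOrLoad<T>(
  cache: TwoLevelCache,
  key: string,
  loader: () => Promise<T>, // reads from the source of truth
  ttl = 300,
): Promise<T> {
  const cached = await cache.get<T>(key);
  if (cached !== null) return cached;

  // Another caller is already loading this key: share its result.
  const pending = inFlight.get(key) as Promise<T> | undefined;
  if (pending) return pending;

  // Defer the loader by one microtask so the map entry exists before it runs.
  const load = Promise.resolve()
    .then(loader)
    .then(async (value) => {
      await cache.set(key, value, ttl); // store in L2 + L1
      return value;
    })
    .finally(() => inFlight.delete(key));

  inFlight.set(key, load);
  return load;
}
```

A caller would pass the repository read as the loader, for example `getOrLoad(cache, 'iam:user:user_123', () => userRepository.findById('user_123'))`, where `findById` is an assumed repository method. Because `null` signals a miss, a loader that legitimately returns `null` is reloaded on every call.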

### Cache Key Naming

**Pattern**: `{service}:{entity}:{identifier}[:{sub-resource}]` (the sub-resource segment is optional)

**Examples**:
```typescript
const cacheKeys = {
  user: (userId: string) => `iam:user:${userId}`,
  userPermissions: (userId: string) => `iam:user:${userId}:permissions`,
  userRoles: (userId: string) => `iam:user:${userId}:roles`,
  session: (sessionId: string) => `iam:session:${sessionId}`,
};

// Usage (cache is a MultiLayerCache instance)
const user = await cache.get(cacheKeys.user('user_123'));
const permissions = await cache.get(cacheKeys.userPermissions('user_123'));
```
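
Because keys are also matched by glob patterns during invalidation, an identifier containing `:` or `*` could break the hierarchy or match unintended keys. The sketch below shows one way to build keys from validated segments. `buildKey` and the reserved-character set are assumptions made for illustration.

```typescript
// Characters that would split a segment or act as glob wildcards in pattern invalidation.
const RESERVED = /[:*?\[\]\s]/;

function buildKey(
  service: string,
  entity: string,
  identifier: string,
  subResource?: string,
): string {
  const segments = [service, entity, identifier];
  if (subResource !== undefined) segments.push(subResource);

  for (const segment of segments) {
    if (segment.length === 0 || RESERVED.test(segment)) {
      throw new Error(`Invalid cache key segment: "${segment}"`);
    }
  }
  return segments.join(':');
}

console.log(buildKey('iam', 'user', 'user_123'));                // iam:user:user_123
console.log(buildKey('iam', 'user', 'user_123', 'permissions')); // iam:user:user_123:permissions
```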

## TTL Strategies

```mermaid
graph LR
    subgraph "TTL Tiers"
        Short[Short TTL<br/>60-300s<br/>Frequently changing]
        Medium[Medium TTL<br/>300-1800s<br/>Moderately changing]
        Long[Long TTL<br/>1800-7200s<br/>Rarely changing]
    end

    Short --> Permissions[User Permissions]
    Short --> Sessions[Session Data]

    Medium --> UserProfiles[User Profiles]
    Medium --> OrgData[Organization Data]

    Long --> Config[Static Config]
    Long --> RefData[Reference Data]

    style Short fill:#f8d7da
    style Medium fill:#fff3cd
    style Long fill:#d4edda
```

**TTL Guidelines**:

| Data Type | TTL | Reason |
|-----------|-----|--------|
| User permissions | 5 min | Security-sensitive |
| Session data | Varies | Based on session length |
| User profiles | 10 min | Moderate update frequency |
| Organization data | 15 min | Infrequent updates |
| Static config | 30-60 min | Very stable |
| Reference data | 1-2 hours | Almost never changes |
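
One way to keep these values in a single place is a typed lookup that also applies the 5-minute L1 cap. This is a sketch: the names are hypothetical, session data is left out because its TTL varies, and where the table gives a range the lower bound is used as an assumption.

```typescript
const MINUTE = 60;
const L1_MAX_TTL = 5 * MINUTE; // L1 entries never live longer than 5 minutes

// TTLs in seconds, taken from the guideline table (lower bound of each range).
const TTL_SECONDS = {
  userPermissions: 5 * MINUTE,
  userProfile: 10 * MINUTE,
  organization: 15 * MINUTE,
  staticConfig: 30 * MINUTE,
  referenceData: 60 * MINUTE,
} as const;

type CachedDataType = keyof typeof TTL_SECONDS;

// Returns the TTL to use in each layer for a given data type.
function ttlFor(type: CachedDataType): { l1: number; l2: number } {
  const l2 = TTL_SECONDS[type];
  return { l1: Math.min(l2, L1_MAX_TTL), l2 };
}

console.log(ttlFor('userPermissions')); // { l1: 300, l2: 300 }
console.log(ttlFor('referenceData'));   // { l1: 300, l2: 3600 }
```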

## Cache Invalidation

```mermaid
sequenceDiagram
    participant API
    participant Service
    participant Cache
    participant DB

    API->>Service: Update User
    Service->>DB: UPDATE user
    DB-->>Service: Success

    Service->>Cache: Invalidate iam:user:123
    Service->>Cache: Invalidate iam:user:123:permissions
    Service->>Cache: Invalidate iam:user:123:roles
    Cache-->>Service: Cleared

    Service-->>API: Success

    Note over Service,Cache: Next request will fetch fresh data
```

**Invalidation Strategies**:

```typescript
// 1. Single key invalidation
async function updateUser(userId: string, data: UpdateUserDto): Promise<User> {
  const user = await userRepository.update(userId, data);

  // Invalidate user cache
  await cache.del(cacheKeys.user(userId));

  return user;
}

// 2. Pattern-based invalidation
async function updateUserRole(userId: string, roleId: string): Promise<void> {
  await userRoleRepository.assign(userId, roleId);

  // Invalidate all user-related cache.
  // The glob `iam:user:${userId}:*` matches sub-resources only, not the base key,
  // so the base key is deleted explicitly.
  await cache.del(cacheKeys.user(userId));
  await cache.invalidatePattern(`iam:user:${userId}:*`);
}

// 3. Time-based invalidation (TTL expiry)
// Automatically handled by cache
```
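
Pattern-based invalidation flushes the whole L1 cache, which also evicts unrelated hot keys. A more selective variant could match L1 keys against the same glob. The sketch below handles only the `*` and `?` wildcards; `globToRegExp` and `matchingKeys` are hypothetical helpers.

```typescript
// Convert a Redis-style glob (only `*` and `?` supported here) into an anchored RegExp.
function globToRegExp(pattern: string): RegExp {
  // Escape every regex metacharacter except the two wildcards.
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  const source = escaped.replace(/\*/g, '.*').replace(/\?/g, '.');
  return new RegExp('^' + source + '$');
}

// Return the keys that the glob pattern matches.
function matchingKeys(keys: string[], pattern: string): string[] {
  const re = globToRegExp(pattern);
  return keys.filter((key) => re.test(key));
}

const l1Keys = [
  'iam:user:123',
  'iam:user:123:roles',
  'iam:user:1234:roles',
  'iam:session:abc',
];

console.log(matchingKeys(l1Keys, 'iam:user:123:*')); // [ 'iam:user:123:roles' ]
```

With NodeCache, the matched keys could then be removed with `l1Cache.del(matchingKeys(l1Cache.keys(), pattern))` instead of `flushAll()`. This still affects only the local instance.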

## Cache Warming

```typescript
// Preload frequently accessed data
async function warmCache(): Promise<void> {
  logger.info('Starting cache warming');

  // Warm user permissions for active users
  const activeUsers = await userRepository.findActive({ limit: 1000 });
  let warmed = 0;

  for (const user of activeUsers) {
    try {
      const permissions = await rbacService.getUserPermissions(user.id);

      await cache.set(
        cacheKeys.userPermissions(user.id),
        permissions,
        300 // 5 minutes
      );
      warmed++;
    } catch (err) {
      // One failing user should not abort the whole warm-up
      logger.error('Cache warming failed for user', { userId: user.id, err });
    }
  }

  logger.info('Cache warming completed', { count: warmed });
}

// Run on service startup
warmCache().catch(err => logger.error('Cache warming failed', { err }));
```
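
Warming 1000 users one at a time may be slow if each permission lookup takes tens of milliseconds. A possible refinement is to process users in small concurrent batches, as sketched below. `warmInBatches` is a hypothetical helper, and a suitable batch size depends on what the database and Redis can absorb.

```typescript
// Process items in fixed-size batches: items within a batch run concurrently,
// batches run one after another. Failures are counted, not rethrown.
async function warmInBatches<T>(
  items: T[],
  batchSize: number,
  warmOne: (item: T) => Promise<void>,
): Promise<{ warmed: number; failed: number }> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error('batchSize must be a positive integer');
  }

  let warmed = 0;
  let failed = 0;

  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    const outcomes = await Promise.all(
      batch.map(async (item) => {
        try {
          await warmOne(item);
          return true;
        } catch {
          return false;
        }
      }),
    );
    for (const ok of outcomes) {
      if (ok) warmed++;
      else failed++;
    }
  }

  return { warmed, failed };
}
```

In the warm-up routine, `warmOne` would load one user's permissions and write them to the cache, for example `warmInBatches(activeUsers, 20, warmUserPermissions)` with an assumed `warmUserPermissions` function.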

## Design Decisions

### Decision 1: Multi-layer Caching (L1 + L2)

**Context**: We need to reduce load on Redis and achieve very low latency for hot data.
**Decision**: Use a combination of L1 (NodeCache) and L2 (Redis).
**Consequences**:
- ✅ Latency < 1ms for 40-50% of requests.
- ✅ Reduced network traffic to Redis.
- ❌ Synchronization complexity: invalidation clears only the local L1, so other instances can serve stale data until their L1 TTL expires.

## Performance Characteristics

### Performance Targets
| Metric | Target | Notes |
|--------|--------|-------|
| **L1 Hit Latency** | < 1ms | In-memory lookup |
| **L2 Hit Latency** | < 5ms | Network RTT + Redis processing |
| **Combined Hit Rate** | > 90% | L1 + L2 combined |
| **L1 Capacity** | 10k items | Per instance limit to protect heap |
| **Cache Warmup Time** | < 30s | At service startup |

## Security Considerations

### Cache Security
- **Encryption**: Sensitive data (PII) MUST be encrypted with AES-256 before it is stored in L2 (Redis). L1 may hold plaintext because it lives in process memory, where it is exposed only if the process memory is dumped.
- **Isolation**: The Redis instance is protected by a password and a Kubernetes Network Policy that allows internal cluster traffic only.
- **TLS**: Connect to Redis via TLS 1.2+.
- **Data Sanitization**: Do not cache entire user objects if they contain password hashes or secrets.
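
The encryption rule above could be applied with Node's built-in `crypto` module. The sketch below assumes AES-256 in GCM mode (the mode is not specified in this document) and a 32-byte key supplied by the caller; `encryptForCache` and `decryptFromCache` are hypothetical names, and key management is out of scope.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'crypto';

const IV_LENGTH = 12;  // recommended nonce size for GCM
const TAG_LENGTH = 16; // GCM authentication tag size

// Encrypts a JSON-serializable value.
// Output layout (base64): iv | auth tag | ciphertext.
function encryptForCache(value: unknown, key: Buffer): string {
  const iv = randomBytes(IV_LENGTH);
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(JSON.stringify(value), 'utf8'),
    cipher.final(),
  ]);
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]).toString('base64');
}

// Decrypts a payload produced by encryptForCache.
// Throws if the key is wrong or the payload was modified.
function decryptFromCache<T>(payload: string, key: Buffer): T {
  const raw = Buffer.from(payload, 'base64');
  const iv = raw.subarray(0, IV_LENGTH);
  const tag = raw.subarray(IV_LENGTH, IV_LENGTH + TAG_LENGTH);
  const ciphertext = raw.subarray(IV_LENGTH + TAG_LENGTH);

  const decipher = createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  const plaintext = Buffer.concat([decipher.update(ciphertext), decipher.final()]);
  return JSON.parse(plaintext.toString('utf8')) as T;
}
```

The encrypted string would be what is written to Redis, while the plaintext object can still be placed in L1.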

## Deployment

```mermaid
graph TD
    subgraph "Kubernetes Pod"
        Service[Microservice Container]
        L1["L1 Cache (RAM)"]
        Service --- L1
    end

    subgraph "Infrastructure"
        RedisMaster[Redis Master]
        RedisSlave1[Redis Slave 1]
        RedisSlave2[Redis Slave 2]
    end

    Service -->|Write| RedisMaster
    Service -->|Read| RedisSlave1
    Service -->|Read| RedisSlave2

    RedisMaster -.->|Replication| RedisSlave1
    RedisMaster -.->|Replication| RedisSlave2

    style Service fill:#e1f5ff
    style L1 fill:#d4edda
    style RedisMaster fill:#fff4e1
```

**Deployment Description**:
- **L1**: Embedded directly in Microservice process, scales with number of Pods.
- **L2**: Redis Cluster (or Sentinel) with at least 3 nodes for High Availability.
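- **Connection Management**: Create one ioredis client per process and reuse it. ioredis does not pool connections; it multiplexes commands over a single connection, so avoid creating a client per request.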

## Monitoring & Observability

### Monitoring Metrics
- **Metrics**: Prometheus metrics for hit rate, miss rate, latency, memory usage.
- **Logs**: Log cache miss/hit at debug level (sampled), log connection errors at error level.
- **Health Checks**: Readiness probe checks connection to Redis.

### Monitoring Code

**Cache Hit Rates**:
```typescript
// Track cache performance
export class CacheMetrics {
  // ... Prometheus Implementation ...
}
```
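
The Prometheus wiring is elided above. Independently of the metrics library, the underlying hit-rate arithmetic can be sketched with plain counters. `HitRateTracker` is a hypothetical class, not the original `CacheMetrics` implementation.

```typescript
type CacheLayer = 'l1' | 'l2';

class HitRateTracker {
  private hits: Record<CacheLayer, number> = { l1: 0, l2: 0 };
  private misses = 0;

  recordHit(layer: CacheLayer): void {
    this.hits[layer]++;
  }

  // A miss means both layers missed and the database was read.
  recordMiss(): void {
    this.misses++;
  }

  // Rates are fractions of all lookups; all zero when nothing was recorded.
  snapshot(): { total: number; l1HitRate: number; combinedHitRate: number; missRate: number } {
    const total = this.hits.l1 + this.hits.l2 + this.misses;
    const rate = (n: number) => (total === 0 ? 0 : n / total);
    return {
      total,
      l1HitRate: rate(this.hits.l1),
      combinedHitRate: rate(this.hits.l1 + this.hits.l2),
      missRate: rate(this.misses),
    };
  }
}

const tracker = new HitRateTracker();
for (let i = 0; i < 45; i++) tracker.recordHit('l1');
for (let i = 0; i < 40; i++) tracker.recordHit('l2');
for (let i = 0; i < 15; i++) tracker.recordMiss();

console.log(tracker.snapshot());
// { total: 100, l1HitRate: 0.45, combinedHitRate: 0.85, missRate: 0.15 }
```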

**Expected Performance**:

| Metric | L1 Cache | L2 Cache | Database |
|--------|----------|----------|----------|
| Latency | < 1ms | < 5ms | < 50ms |
| Hit Rate | 40-50% | 80-90% (cumulative) | - |
| Capacity | 10k keys | Bounded by Redis memory | - |

## Best Practices

**DO**:
- ✅ Use cache for frequently accessed data
- ✅ Set appropriate TTLs based on data change frequency
- ✅ Invalidate cache on data updates
- ✅ Use cache key namespacing
- ✅ Monitor cache hit rates
- ✅ Warm cache on startup for critical data

**DON'T**:
- ❌ Cache data that changes very frequently
- ❌ Set TTL too long (stale data risk)
- ❌ Set TTL too short (negates cache benefit)
- ❌ Cache sensitive data without encryption
- ❌ Ignore cache invalidation on updates
- ❌ Use cache as primary data store