| trigger |
|---|
| always_on |
Resilience Patterns
When to Use This Skill
Use this skill when:
- Implementing circuit breaker patterns for external services
- Adding retry logic for transient failures
- Setting timeout handling for long-running operations
- Implementing graceful degradation strategies
- Handling external service failures
- Improving system fault tolerance
Core Concepts
Resilience Patterns
- Circuit Breaker: Prevents cascading failures by stopping calls to failing services
- Retry: Automatically retries failed operations with backoff
- Timeout: Sets maximum time limits for operations
- Bulkhead: Gives each dependency its own resource pool so one slow or failing dependency cannot exhaust capacity needed by the others
- Graceful Degradation: Provides fallback behavior when services fail
Patterns
Circuit Breaker Pattern
Protects against cascading failures:
import CircuitBreaker from 'opossum';
import { logger } from '@goodgo/logger';
export const createCircuitBreaker = <TArgs extends any[], TResult>(
action: (...args: TArgs) => Promise<TResult>,
name: string,
options: Partial<CircuitBreaker.Options> = {}
): CircuitBreaker<TArgs, TResult> => {
const breaker = new CircuitBreaker(action, {
timeout: 3000,
errorThresholdPercentage: 50,
resetTimeout: 30000,
...options,
name,
});
breaker.on('open', () => {
logger.warn(`Circuit Breaker OPEN: ${name}`);
});
breaker.on('halfOpen', () => {
logger.info(`Circuit Breaker HALF-OPEN: ${name}`);
});
breaker.on('close', () => {
logger.info(`Circuit Breaker CLOSED: ${name}`);
});
return breaker;
};
// Usage
const externalApiBreaker = createCircuitBreaker(
async (data) => await externalApi.call(data),
'external-api'
);
try {
const result = await externalApiBreaker.fire(requestData);
} catch (error) {
// Handle circuit breaker error or fallback
}
Retry Pattern
Retry transient failures with exponential backoff:
async function retryWithBackoff<T>(
fn: () => Promise<T>,
maxRetries: number = 3,
baseDelay: number = 1000
): Promise<T> {
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error) {
if (attempt === maxRetries) throw error;
const delay = baseDelay * Math.pow(2, attempt);
await new Promise(resolve => setTimeout(resolve, delay));
}
}
throw new Error('Retry exhausted');
}
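The loop above retries every error, including ones that cannot succeed on a second attempt. A minimal variant, assuming a caller-supplied `shouldRetry` predicate (both `retryIf` and `shouldRetry` are illustrative names, not part of the original function or of any library), stops immediately on non-retryable errors:

```typescript
// Sketch only: `retryIf` and `shouldRetry` are hypothetical names for illustration.
async function retryIf<T>(
  fn: () => Promise<T>,
  shouldRetry: (error: unknown) => boolean,
  maxRetries: number = 3,
  baseDelay: number = 1000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Give up when retries are exhausted or the error is not worth retrying
      if (attempt >= maxRetries || !shouldRetry(error)) throw error;
      const delay = baseDelay * Math.pow(2, attempt);
      await new Promise<void>(resolve => setTimeout(resolve, delay));
    }
  }
}
```

The predicate keeps the retry policy out of the loop itself, so the same helper can serve HTTP calls, database queries, or queue consumers with different rules.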
Timeout Pattern
Set maximum time limits:
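async function withTimeout<T>(
promise: Promise<T>,
timeoutMs: number
): Promise<T> {
let timer: ReturnType<typeof setTimeout> | undefined;
const timeout = new Promise<never>((_, reject) => {
timer = setTimeout(() => reject(new Error('Operation timeout')), timeoutMs);
});
try {
return await Promise.race([promise, timeout]);
} finally {
// Clear the timer so it does not keep running after the race settles
clearTimeout(timer);
}
}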
// Usage
try {
const result = await withTimeout(
externalService.call(),
5000 // 5 second timeout
);
} catch (error) {
if (error instanceof Error && error.message === 'Operation timeout') {
// Handle timeout
}
}
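Matching on the error message is brittle, because any rewording of the message silently breaks the check. One option, sketched here with a hypothetical `TimeoutError` class and `withTimeoutTyped` helper (neither appears in the original), is to reject with a dedicated error type and test it with `instanceof`:

```typescript
// Hypothetical error type so callers can detect timeouts without string matching.
class TimeoutError extends Error {
  constructor(public readonly timeoutMs: number) {
    super(`Operation timed out after ${timeoutMs}ms`);
    this.name = 'TimeoutError';
    // Keeps instanceof working when compiling to older targets
    Object.setPrototypeOf(this, TimeoutError.prototype);
  }
}

async function withTimeoutTyped<T>(promise: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(timeoutMs)), timeoutMs);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    // Stop the timer once either side of the race settles
    clearTimeout(timer);
  }
}
```

Note that racing against a timer only stops waiting; the underlying operation keeps running unless it supports cancellation.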
Graceful Degradation
Provide fallback behavior:
async function getDataWithFallback() {
try {
return await primaryDataSource.get();
} catch (error) {
logger.warn('Primary source failed, using fallback', { error });
return await fallbackDataSource.get();
}
}
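When no second data source exists, a common fallback is the last value that was fetched successfully. A minimal sketch, assuming a hypothetical `createLastKnownGood` helper (not from any library) and an in-memory value that is lost on restart:

```typescript
// Sketch: remembers the last successful value and serves it when the source fails.
// `createLastKnownGood` is a hypothetical helper name.
function createLastKnownGood<T>(fetchFresh: () => Promise<T>, defaultValue: T) {
  let lastGood: T = defaultValue;
  return async function get(): Promise<{ value: T; stale: boolean }> {
    try {
      lastGood = await fetchFresh();
      return { value: lastGood, stale: false };
    } catch {
      // Source failed: degrade to the most recent successful value (or the default)
      return { value: lastGood, stale: true };
    }
  };
}
```

Returning a `stale` flag lets callers decide whether degraded data is acceptable for their use case.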
Bulkhead Pattern
Give each dependency its own concurrency limit so a slow dependency cannot starve the others:
import PQueue from 'p-queue';
// Create separate queues for different operations
const externalApiQueue = new PQueue({
concurrency: 10, // Max 10 concurrent calls
timeout: 30000 // 30 second timeout per operation
});
const databaseQueue = new PQueue({
concurrency: 20
});
// Usage - operations are isolated
async function fetchExternalData(id: string) {
return externalApiQueue.add(async () => {
return await externalApi.getData(id);
});
}
async function queryDatabase(query: string) {
return databaseQueue.add(async () => {
return await database.execute(query);
});
}
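To show what the concurrency limit does under the hood, here is a dependency-free sketch of the same idea. The `Bulkhead` class is illustrative only and is not the p-queue API; it omits timeouts and priorities.

```typescript
// Minimal bulkhead sketch: at most `limit` tasks run at once; the rest wait in FIFO order.
class Bulkhead {
  private active = 0;
  private readonly waiting: Array<() => void> = [];

  constructor(private readonly limit: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      // Park until a finishing task hands over its slot
      await new Promise<void>(resolve => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next(); // pass the slot directly to the next waiter; `active` stays unchanged
      } else {
        this.active--;
      }
    }
  }
}
```

Because each dependency gets its own instance, a backlog in one bulkhead never consumes slots that belong to another.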
Combined Resilience Service
// src/core/resilience/resilience.service.ts
import CircuitBreaker from 'opossum';
import { logger } from '@goodgo/logger';
interface ResilienceOptions {
timeout?: number;
maxRetries?: number;
circuitBreaker?: boolean;
fallback?: () => Promise<any>;
}
export class ResilienceService {
async execute<T>(
operation: () => Promise<T>,
name: string,
options: ResilienceOptions = {}
): Promise<T> {
const {
timeout = 5000,
maxRetries = 3,
circuitBreaker = true,
fallback
} = options;
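// Wrap each attempt with a timeout
const attempt = () => this.withTimeout(operation(), timeout);
// Wrap with retry. A separate binding is required: reassigning a single `fn`
// variable would make the retry wrapper call itself instead of the timed operation.
const fn = () => this.retryWithBackoff(attempt, maxRetries);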
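// Wrap with circuit breaker. Note: a new breaker is created on every call, so failure
// statistics are not shared between calls; cache one breaker per `name` if the circuit should open.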
if (circuitBreaker) {
const breaker = this.createCircuitBreaker(fn, name);
try {
return await breaker.fire();
} catch (error) {
if (fallback) {
logger.warn(`${name}: Using fallback`, { error: error instanceof Error ? error.message : String(error) });
return await fallback();
}
throw error;
}
}
try {
return await fn();
} catch (error) {
if (fallback) {
return await fallback();
}
throw error;
}
}
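private async withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
let timer: ReturnType<typeof setTimeout> | undefined;
const timeout = new Promise<never>((_, reject) => {
timer = setTimeout(() => reject(new Error('Operation timeout')), ms);
});
try {
return await Promise.race([promise, timeout]);
} finally {
// Clear the timer so it does not keep running after the race settles
clearTimeout(timer);
}
}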
private async retryWithBackoff<T>(
fn: () => Promise<T>,
maxRetries: number
): Promise<T> {
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error) {
if (attempt === maxRetries) throw error;
const delay = 1000 * Math.pow(2, attempt);
await new Promise(resolve => setTimeout(resolve, delay));
}
}
throw new Error('Retry exhausted');
}
private createCircuitBreaker<T>(
fn: () => Promise<T>,
name: string
): CircuitBreaker<[], T> {
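return new CircuitBreaker(fn, {
// Each attempt is already bounded by withTimeout; a fixed 3s breaker timeout
// would cut off the retry sequence, so the breaker-level timeout is disabled.
timeout: false,
errorThresholdPercentage: 50,
resetTimeout: 30000,
name
});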
}
}
// Usage
const resilience = new ResilienceService();
const result = await resilience.execute(
() => externalApi.fetchUser(userId),
'fetch-user',
{
timeout: 3000,
maxRetries: 2,
fallback: () => Promise.resolve({ id: userId, name: 'Unknown' })
}
);
Health Check with Resilience
// src/health/health.controller.ts
export class HealthController {
async checkDependencies(): Promise<HealthStatus> {
const checks = await Promise.allSettled([
this.checkDatabase(),
this.checkRedis(),
this.checkExternalApi()
]);
const results = {
database: checks[0].status === 'fulfilled' ? 'healthy' : 'unhealthy',
redis: checks[1].status === 'fulfilled' ? 'healthy' : 'unhealthy',
externalApi: checks[2].status === 'fulfilled' ? 'healthy' : 'degraded'
};
// Service is healthy even if external API is down (graceful degradation)
const isHealthy = results.database === 'healthy' && results.redis === 'healthy';
return {
status: isHealthy ? 'healthy' : 'unhealthy',
dependencies: results
};
}
}
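The rule in the controller above (critical dependencies decide overall health, optional ones only degrade) can be factored into a pure function that is easy to unit test. This is a sketch under assumptions: `DependencyCheck`, its `critical` flag, and `summarizeHealth` are hypothetical names, not part of the original controller.

```typescript
type DependencyState = 'healthy' | 'unhealthy' | 'degraded';

interface DependencyCheck {
  name: string;
  critical: boolean; // assumption: only critical dependencies affect overall status
  ok: boolean;
}

function summarizeHealth(checks: DependencyCheck[]) {
  const dependencies: Record<string, DependencyState> = {};
  let healthy = true;
  for (const check of checks) {
    if (check.ok) {
      dependencies[check.name] = 'healthy';
    } else if (check.critical) {
      dependencies[check.name] = 'unhealthy';
      healthy = false;
    } else {
      // Optional dependency is down: report it, but keep the service healthy
      dependencies[check.name] = 'degraded';
    }
  }
  return { status: healthy ? 'healthy' : 'unhealthy', dependencies };
}
```

Each `ok` value would come from the `status === 'fulfilled'` test on the corresponding `Promise.allSettled` result.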
Best Practices
- Circuit Breaker: Use for external service calls
- Retry: Retry only transient failures (network, timeout)
- Timeout: Set appropriate timeouts for all external calls
- Fallback: Always provide fallback behavior
- Monitoring: Monitor circuit breaker states and retry rates
- Logging: Log all resilience actions for debugging
Common Mistakes
- Retrying Non-Retryable Errors: Retrying 4xx errors (client errors)
// ❌ BAD: Retry all errors
catch (error) {
await retry(operation);
}
// ✅ GOOD: Only retry transient errors
catch (error) {
if (isTransientError(error)) {
await retry(operation);
} else {
throw error;
}
}
- No Timeout: Missing timeouts on external calls
// ❌ BAD: No timeout
const data = await externalApi.fetch();
// ✅ GOOD: With timeout
const data = await withTimeout(externalApi.fetch(), 5000);
- No Fallback: No graceful degradation strategy
// ❌ BAD: Service crashes if dependency fails
const user = await userService.get(id);
// ✅ GOOD: Fallback to cached/default data
const user = await userService.get(id).catch(() => cachedUser);
- Too Many Retries: Excessive retries causing performance issues
// ❌ BAD: Too many retries with short delay
retry(fn, { maxRetries: 10, delay: 100 });
// ✅ GOOD: Limited retries with exponential backoff
retry(fn, { maxRetries: 3, baseDelay: 1000, exponential: true });
- Circuit Breaker Misconfiguration: Wrong thresholds
// ❌ BAD: Circuit opens too easily or never
{ errorThresholdPercentage: 5 } // Opens after 5% errors
{ errorThresholdPercentage: 99 } // Almost never opens
// ✅ GOOD: Balanced threshold
{ errorThresholdPercentage: 50, resetTimeout: 30000 }
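The first item calls `isTransientError` without defining it. One possible definition is sketched below; the error shape (`status` for HTTP responses, `code` for Node network errors) and the list of codes are assumptions that depend on the HTTP client in use.

```typescript
// Assumed error shape: many HTTP clients expose a numeric status,
// and Node network errors carry a string code.
interface MaybeHttpError {
  status?: number;
  code?: string;
}

const TRANSIENT_CODES = new Set(['ECONNRESET', 'ETIMEDOUT', 'ECONNREFUSED', 'EAI_AGAIN']);

function isTransientError(error: unknown): boolean {
  if (typeof error !== 'object' || error === null) return false;
  const { status, code } = error as MaybeHttpError;
  if (typeof code === 'string' && TRANSIENT_CODES.has(code)) return true;
  if (typeof status === 'number') {
    // 408 (request timeout) and 429 (rate limited) are the usual retryable 4xx exceptions
    return status >= 500 || status === 408 || status === 429;
  }
  return false;
}
```

Unknown errors are treated as non-transient here, which errs on the side of failing fast instead of retrying blindly.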
Quick Reference
| Pattern | Use Case | Key Config |
|---|---|---|
| Circuit Breaker | External API calls | threshold: 50%, reset: 30s |
| Retry | Transient failures | max: 3, exponential backoff |
| Timeout | All external calls | 3-5s for API, 30s for batch |
| Bulkhead | Resource isolation | 10-20 concurrent ops |
| Fallback | Critical operations | Cache, default, or degraded |
Opossum Circuit Breaker States:
CLOSED → (errors exceed threshold) → OPEN
OPEN → (reset timeout expires) → HALF-OPEN
HALF-OPEN → (success) → CLOSED
HALF-OPEN → (failure) → OPEN
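The four transitions above can be traced in a small state machine. This is an illustrative sketch, not opossum's implementation: `TinyBreaker` counts consecutive failures instead of a rolling error percentage, and the injectable `now` clock is an assumption added for testability.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF-OPEN';

// Illustrative only: a simplified model of the state transitions.
class TinyBreaker {
  private state: BreakerState = 'CLOSED';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number,
    private readonly resetTimeoutMs: number,
    private readonly now: () => number = Date.now
  ) {}

  getState(): BreakerState {
    // OPEN -> HALF-OPEN once the reset timeout has elapsed
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'HALF-OPEN';
    }
    return this.state;
  }

  canRequest(): boolean {
    return this.getState() !== 'OPEN';
  }

  recordSuccess(): void {
    // HALF-OPEN -> CLOSED; a success in CLOSED just resets the counter
    this.failures = 0;
    this.state = 'CLOSED';
  }

  recordFailure(): void {
    this.failures++;
    // HALF-OPEN -> OPEN on any failure; CLOSED -> OPEN at the threshold
    if (this.getState() === 'HALF-OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
    }
  }
}
```

A caller would check `canRequest()` before each call and report the outcome with `recordSuccess()` or `recordFailure()`.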
Retry Delays (Exponential Backoff, baseDelay = 1s):
Before retry 1: 1s
Before retry 2: 2s
Before retry 3: 4s
Before retry 4: 8s (only reached when maxRetries is 4 or more)
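The schedule above doubles the base delay on each retry. A small sketch that computes it follows; `backoffDelayMs`, `backoffSchedule`, and the `maxDelayMs` cap are assumptions added for illustration, since the original retry function has no upper bound on the delay.

```typescript
// Delay before retry number `retry` (1-based): baseDelayMs * 2^(retry - 1).
// `maxDelayMs` is an added assumption: a cap that keeps late retries from waiting too long.
function backoffDelayMs(retry: number, baseDelayMs = 1000, maxDelayMs = 30000): number {
  return Math.min(baseDelayMs * Math.pow(2, retry - 1), maxDelayMs);
}

// Full list of delays for a given number of retries.
function backoffSchedule(retries: number, baseDelayMs = 1000, maxDelayMs = 30000): number[] {
  return Array.from({ length: retries }, (_, i) => backoffDelayMs(i + 1, baseDelayMs, maxDelayMs));
}

// backoffSchedule(4) returns [1000, 2000, 4000, 8000]
```

In production it is common to add random jitter to each delay so that many clients do not retry at the same instant; that is left out here to keep the output deterministic.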
Essential Imports:
import CircuitBreaker from 'opossum';
import PQueue from 'p-queue';
import { logger } from '@goodgo/logger';
Resources
- Opossum Documentation - Circuit breaker library
- Microsoft Resilience Patterns
- API Gateway Advanced - Gateway circuit breaker
- Observability & Monitoring - Health checks, metrics
- Event-Driven Architecture - Event retry patterns
- Error Handling Patterns - Error handling
- Project Rules - GoodGo standards