docs: Add architecture documentation for security, event-driven design, data consistency, observability, and caching in Vietnamese, and update the existing guides and architecture documents.

Ho Ngoc Hai
2026-01-07 10:22:42 +07:00
parent d8faffd41d
commit 495618ded7
17 changed files with 7357 additions and 779 deletions


@@ -1,9 +1,8 @@
# Caching Architecture / Kiến trúc Caching
# Caching Architecture
> **EN**: Multi-layer caching strategy for optimal performance
> **VI**: Multi-layer caching strategy to optimize performance
> Multi-layer caching strategy for optimal performance
## Overview Diagram / Sơ đồ Tổng quan
## Overview Diagram
```mermaid
graph TD
@@ -24,9 +23,35 @@ graph TD
style DB fill:#f0e1ff
```
## Architecture Description / Mô tả Kiến trúc
## System Context
### EN: Multi-Layer Caching
```mermaid
C4Context
title Caching System Context
System(service, "Microservice", "Client service using cache")
System_Ext(db, "Neon PostgreSQL", "Primary database")
Boundary(caching, "Caching Layer") {
System(l1, "L1 Cache", "In-memory NodeCache")
System(l2, "L2 Cache", "Redis Cluster")
}
Rel(service, l1, "Reads/Writes", "In-process")
Rel(service, l2, "Reads/Writes", "Redis Protocol")
Rel(l1, l2, "Fills from", "On miss")
Rel(l2, db, "Cache aside", "On miss")
```
### Context Description
- **Service**: Communicates directly with L1 Cache (in-memory) for lowest latency.
- **L1 Cache**: Local cache, not shared, automatic expiration (short TTL).
- **L2 Cache**: Shared Redis cluster, holds data longer and syncs across instances.
- **Database**: Source of truth, accessed only on cache miss.
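The read path described above (L1 first, then L2, then the database) can be sketched as follows. This is a minimal illustration, not the platform's `MultiLayerCache`: plain objects stand in for NodeCache and Redis, the lookup is synchronous for brevity (a real Redis client is asynchronous), and `TwoLayerCache`, `loadFromDb`, and the injected clock are hypothetical names.

```typescript
// Sketch of the L1 -> L2 -> database read path. Plain objects stand in for
// NodeCache (L1) and Redis (L2); all names here are illustrative assumptions.
type Entry<V> = { value: V; expiresAt: number };
type Source = "l1" | "l2" | "db";

class TwoLayerCache<V> {
  private l1: Record<string, Entry<V>> = {}; // per-instance, short TTL
  private l2: Record<string, Entry<V>> = {}; // shared stand-in, longer TTL
  private l1TtlMs: number;
  private l2TtlMs: number;
  private now: () => number;

  constructor(l1TtlMs: number, l2TtlMs: number, now: () => number = () => Date.now()) {
    this.l1TtlMs = l1TtlMs;
    this.l2TtlMs = l2TtlMs;
    this.now = now;
  }

  // Returns the value together with the layer that served it.
  get(key: string, loadFromDb: (key: string) => V): { value: V; source: Source } {
    const t = this.now();

    const hit1 = this.l1[key];
    if (hit1 && hit1.expiresAt > t) return { value: hit1.value, source: "l1" };

    const hit2 = this.l2[key];
    if (hit2 && hit2.expiresAt > t) {
      // L1 miss, L2 hit: fill L1 from L2.
      this.l1[key] = { value: hit2.value, expiresAt: t + this.l1TtlMs };
      return { value: hit2.value, source: "l2" };
    }

    // Both layers missed: read the source of truth, then populate both layers.
    const value = loadFromDb(key);
    this.l2[key] = { value, expiresAt: t + this.l2TtlMs };
    this.l1[key] = { value, expiresAt: t + this.l1TtlMs };
    return { value, source: "db" };
  }
}
```

Injecting the clock keeps expiry testable without waiting for real TTLs to elapse.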
## Architecture Description
### Multi-Layer Caching
The GoodGo platform uses two-layer caching for performance:
@@ -52,30 +77,11 @@ Request → L1 → L2 → Database
hit rate hit rate rate
```
### VI: Multi-Layer Caching
The GoodGo platform uses two-layer caching to optimize performance:
**L1 Cache (Memory)**:
- In-memory cache on each service instance
- Very fast access (< 1ms)
- Limited capacity (10k keys by default)
- Short TTL (60 seconds by default, 5 minutes maximum)
- Not shared between instances
**L2 Cache (Redis)**:
- Shared distributed cache
- Fast access (< 5ms)
- Large capacity
- Longer TTL (configurable, typically 5-15 minutes)
- Shared across all service instances
## Cache Implementation / Triển khai Cache
## Cache Implementation
### Multi-Layer Cache Service
```typescript
// Multi-layer cache implementation
export class MultiLayerCache {
private l1Cache: NodeCache;
private l2Cache: Redis;
@@ -143,13 +149,12 @@ export class MultiLayerCache {
}
```
### Cache Key Naming / Quy ước Đặt tên Key
### Cache Key Naming
**Pattern**: `{service}:{entity}:{identifier}:{sub-resource}`
**Examples**:
```typescript
// User cache keys
const keys = {
user: (userId: string) => `iam:user:${userId}`,
userPermissions: (userId: string) => `iam:user:${userId}:permissions`,
@@ -162,7 +167,7 @@ const user = await cache.get(keys.user('user_123'));
const permissions = await cache.get(keys.userPermissions('user_123'));
```
## TTL Strategies / Chiến lược TTL
## TTL Strategies
```mermaid
graph LR
@@ -196,7 +201,7 @@ graph LR
| Static config | 30-60 min | Very stable |
| Reference data | 1-2 hours | Almost never changes |
## Cache Invalidation / Vô hiệu hóa Cache
## Cache Invalidation
```mermaid
sequenceDiagram
@@ -244,7 +249,7 @@ async updateUserRole(userId: string, roleId: string): Promise<void> {
// Automatically handled by cache
```
## Cache Warming / Làm ấm Cache
## Cache Warming
```typescript
// Preload frequently accessed data
@@ -271,33 +276,83 @@ async warmCache(): Promise<void> {
warmCache().catch(err => logger.error('Cache warming failed', { err }));
```
## Performance Metrics / Chỉ số Hiệu suất
## Design Decisions
### Decision 1: Multi-layer Caching (L1 + L2)
**Context**: Redis load must be reduced, and hot data needs ultra-low latency.
**Decision**: Use a combination of L1 (NodeCache) and L2 (Redis).
**Consequences**:
- ✅ Latency < 1ms for 40-50% of requests.
- ✅ Reduced network traffic to Redis.
- ❌ Synchronization complexity (L1 may be stale for a short duration).
## Performance Characteristics
### Performance Targets
| Metric | Target | Notes |
|--------|--------|-------|
| **L1 Hit Latency** | < 0.5ms | In-memory lookup |
| **L2 Hit Latency** | < 5ms | Network RTT + Redis processing |
| **Combined Hit Rate** | > 90% | L1 + L2 combined |
| **L1 Capacity** | 10k items | Per instance limit to protect heap |
| **Cache Warmup Time** | < 30s | At service startup |
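The combined hit-rate target can be related to the per-layer rates with a small calculation. The sketch below assumes that the L2 hit rate is measured only over requests that already missed L1; the document does not state this, so treat it as one possible reading. `combinedHitRate` is a hypothetical helper.

```typescript
// Combined hit rate of two cache layers, assuming the L2 rate is conditional
// on an L1 miss (an assumption, not stated in the targets table).
function combinedHitRate(l1HitRate: number, l2HitRate: number): number {
  return l1HitRate + (1 - l1HitRate) * l2HitRate;
}

// Using the per-layer ranges quoted in this document (L1 40-50%, L2 80-90%):
const lowEnd = combinedHitRate(0.4, 0.8);  // approximately 0.88
const highEnd = combinedHitRate(0.5, 0.9); // approximately 0.95
console.log(lowEnd.toFixed(2), highEnd.toFixed(2));
```

Under this reading, the low end of the quoted ranges lands slightly below the 90% target, so the target holds only when at least one layer performs above its lower bound.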
## Security Considerations
### Cache Security
- **Encryption**: Sensitive data (PII) MUST be encrypted (AES-256) before it is stored in L2 Redis. L1 may hold plaintext because it lives in process memory, although a memory dump would still expose it.
- **Isolation**: Redis instance protected by password and Network Policy (allow internal K8s traffic only).
- **TLS**: Connect to Redis via TLS 1.2+.
- **Data Sanitization**: Do not cache entire user objects if they contain password hashes or secrets.
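The data sanitization rule above can be sketched with an allow-list projection, so that new secret fields added to the user record later are not cached by accident. The `UserRecord` shape and `toCacheableUser` are illustrative assumptions; only `email` and `passwordHash` are taken from the schema elsewhere in this documentation.

```typescript
// Hypothetical user record; field names other than email/passwordHash are assumptions.
interface UserRecord {
  id: string;
  email: string;
  passwordHash: string;
  mfaSecret?: string;
}

// Only the fields that are safe to cache.
interface CacheableUser {
  id: string;
  email: string;
}

// Allow-list projection: copy known-safe fields instead of deleting known-secret ones.
function toCacheableUser(user: UserRecord): CacheableUser {
  return { id: user.id, email: user.email };
}
```

An allow-list is chosen here over a deny-list because it fails closed: a field that nobody remembered to classify is simply not cached.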
## Deployment
```mermaid
graph TD
subgraph "Kubernetes Pod"
Service[Microservice Container]
L1["L1 Cache (RAM)"]
Service --- L1
end
subgraph "Infrastructure"
RedisMaster[Redis Master]
RedisSlave1[Redis Slave 1]
RedisSlave2[Redis Slave 2]
end
Service -->|Write| RedisMaster
Service -->|Read| RedisSlave1
Service -->|Read| RedisSlave2
RedisMaster -.->|Replication| RedisSlave1
RedisMaster -.->|Replication| RedisSlave2
style Service fill:#e1f5ff
style L1 fill:#d4edda
style RedisMaster fill:#fff4e1
```
**Deployment Description**:
- **L1**: Embedded directly in Microservice process, scales with number of Pods.
- **L2**: Redis Cluster (or Sentinel) with at least 3 nodes for High Availability.
- **Connection Pooling**: Use ioredis with connection pooling for efficient connection management.
## Monitoring & Observability
### Monitoring Metrics
- **Metrics**: Prometheus metrics for hit rate, miss rate, latency, memory usage.
- **Logs**: Log cache miss/hit at debug level (sampled), log connection errors at error level.
- **Health Checks**: Readiness probe checks connection to Redis.
### Monitoring Code
**Cache Hit Rates**:
```typescript
import { Counter } from 'prom-client';

// Track cache performance
export class CacheMetrics {
  private hits = new Counter({
    name: 'cache_hits_total',
    help: 'Total cache hits',
    labelNames: ['layer', 'key_prefix']
  });
  private misses = new Counter({
    name: 'cache_misses_total',
    help: 'Total cache misses',
    labelNames: ['layer', 'key_prefix']
  });

  // Label by the first key segment (the service name) to keep label cardinality low
  recordHit(layer: 'l1' | 'l2', key: string): void {
    const prefix = key.split(':')[0];
    this.hits.inc({ layer, key_prefix: prefix });
  }

  // Records a full miss; the layer label is 'db' because the read falls through to the database
  recordMiss(key: string): void {
    const prefix = key.split(':')[0];
    this.misses.inc({ layer: 'db', key_prefix: prefix });
  }
  // ... Prometheus Implementation ...
}
```
@@ -308,7 +363,7 @@ export class CacheMetrics {
| Hit Rate | 40-50% | 80-90% | - |
| Capacity | 10k keys | Unlimited | - |
## Best Practices / Best Practices
## Best Practices
**DO**:
- ✅ Use cache for frequently accessed data
@@ -325,13 +380,3 @@ export class CacheMetrics {
- ❌ Cache sensitive data without encryption
- ❌ Ignore cache invalidation on updates
- ❌ Use cache as primary data store
## Related Documentation / Tài liệu Liên quan
- [System Design](./system-design.md) - Overall architecture with caching
- [Data Consistency Patterns](./data-consistency-patterns.md) - Cache invalidation patterns
---
**Last Updated**: 2024-01-15
**Authors**: GoodGo Architecture Team


@@ -1,9 +1,8 @@
# Event-Driven Architecture / Kiến trúc Hướng Sự kiện
# Event-Driven Architecture
> **EN**: Event-driven architecture for asynchronous communication using Apache Kafka
> **VI**: Event-driven architecture for asynchronous communication using Apache Kafka
> Event-driven architecture for asynchronous communication using Apache Kafka
## Overview Diagram / Sơ đồ Tổng quan
## Overview Diagram
```mermaid
graph TD
@@ -32,9 +31,7 @@ graph TD
style Topics fill:#fff4e1
```
## Architecture Description / Mô tả Kiến trúc
### EN: English Section
## Architecture Description
The GoodGo platform implements Event-Driven Architecture (EDA) for asynchronous communication between microservices.
@@ -47,28 +44,11 @@ The GoodGo platform implements Event-Driven Architecture (EDA) for asynchronous
**Technology Stack**:
- Apache Kafka - Event streaming platform
- Schema Registry - Avro schemas for validation
- Schema Registry - Avro schemas for validation
- KafkaJS - Node.js client library
- Event Sourcing - Custom implementation in IAM
### VI: Vietnamese Section
The GoodGo platform implements Event-Driven Architecture (EDA) for asynchronous communication between microservices.
**Core Principles**:
1. **Event-First Design**: Every state change emits domain events
2. **Loose Coupling**: Services communicate through events
3. **Eventual Consistency**: Temporary inconsistency is accepted
4. **Event Sourcing**: Changes are stored as a sequence of events
5. **CQRS Pattern**: Read and write operations are separated
**Technology Stack**:
- Apache Kafka - Event streaming platform
- Schema Registry - Avro schemas for validation
- KafkaJS - Node.js client library
- Event Sourcing - Custom implementation in IAM
## Event Flow / Luồng Sự kiện
## Event Flow
```mermaid
sequenceDiagram
@@ -82,11 +62,9 @@ sequenceDiagram
Consumer-->>Kafka: Acknowledge
```
**EN Steps**: Publish → Distribute → Consume → Retry (if failed) → DLQ (after max retries) → Acknowledge
**Steps**: Publish → Distribute → Consume → Retry (if failed) → DLQ (after max retries) → Acknowledge
**VI Steps**: Publish → Distribute → Consume → Retry (if failed) → DLQ (after max retries) → Acknowledge
## Event Structure / Cấu trúc Sự kiện
## Event Structure
```typescript
interface BaseEvent {
@@ -114,7 +92,7 @@ interface BaseEvent {
}
```
## Kafka Topics / Kafka Topics
## Kafka Topics
```mermaid
graph LR
@@ -134,7 +112,7 @@ graph LR
- `auth.login.success.v1`
- `audit.event.logged.v1`
## Error Handling / Xử lý Lỗi
## Error Handling
```mermaid
graph TD
@@ -151,12 +129,247 @@ graph TD
3. Move to DLQ after max retries
4. Manual review and reprocess
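The retry-then-DLQ flow can be sketched as below. This is a simplified, synchronous illustration under stated assumptions: an in-memory array stands in for the dead-letter topic, waiting between attempts is omitted, and `consumeWithRetry`, `DomainEvent`, and `DeadLetter` are hypothetical names rather than the platform's consumer API.

```typescript
// Hypothetical event and dead-letter shapes for illustration only.
interface DomainEvent {
  eventId: string;
  eventType: string;
}

interface DeadLetter {
  event: DomainEvent;
  attempts: number;
  lastError: string;
}

// Tries the handler once plus up to maxRetries retries; on exhaustion the event
// is appended to deadLetters (a stand-in for a DLQ topic) for manual review.
function consumeWithRetry(
  event: DomainEvent,
  handler: (e: DomainEvent) => void,
  maxRetries: number,
  deadLetters: DeadLetter[],
): "acknowledged" | "dead-lettered" {
  let lastError = "";
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    try {
      handler(event);
      return "acknowledged";
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
      // A real consumer would wait here before retrying (for example with backoff).
    }
  }
  deadLetters.push({ event, attempts: maxRetries + 1, lastError });
  return "dead-lettered";
}
```

Keeping the last error alongside the dead-lettered event gives the manual review step something concrete to act on.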
## Related Documentation / Tài liệu Liên quan
## System Context
```mermaid
C4Context
title Event-Driven Architecture Context
System(iam, "IAM Service", "Event producer")
System(service_a, "Service A", "Event producer")
System(notification, "Notification Service", "Event consumer")
System(audit, "Audit Service", "Event consumer")
System_Ext(kafka, "Apache Kafka", "Event streaming platform")
System_Ext(registry, "Schema Registry", "Schema management")
System_Ext(monitoring, "Monitoring", "Kafka metrics & alerts")
Rel(iam, kafka, "Publishes events", "Kafka Protocol")
Rel(service_a, kafka, "Publishes events", "Kafka Protocol")
Rel(kafka, notification, "Delivers events", "Kafka Protocol")
Rel(kafka, audit, "Delivers events", "Kafka Protocol")
Rel(kafka, registry, "Validates schemas", "HTTP")
Rel(kafka, monitoring, "Sends metrics", "JMX")
```
**Context Description**:
- **Producers**: IAM Service and other services publish domain events
- **Kafka**: Central event broker, manages topics and partitions
- **Consumers**: Notification and Audit services consume events
- **Schema Registry**: Manages and validates Avro schemas
- **Monitoring**: Collects metrics from Kafka cluster
## Performance Characteristics
| Metric | Target | Notes |
|--------|--------|-------|
| **Event Publish Latency (P95)** | < 10ms | Fire-and-forget, async |
| **Event Delivery Latency (P95)** | < 100ms | End-to-end from publish to consume |
| **Throughput** | 10,000 events/s | Per topic, scalable with partitions |
| **Consumer Lag** | < 1000 messages | Per partition, monitored |
| **Event Size** | < 1MB | Recommended max size |
| **Retention** | 7 days | Default, configurable per topic |
| **Replication Factor** | 3 | For fault tolerance |
**Performance Optimizations**:
- **Batch Publishing**: Group multiple events to reduce network overhead
- **Compression**: Use Snappy or LZ4 compression
- **Partitioning**: Divide topics into multiple partitions for parallel processing
- **Consumer Groups**: Multiple consumers in same group for horizontal scaling
- **Async Publishing**: Fire-and-forget pattern, don't block request handlers
## Security Considerations
**Event Encryption**:
- TLS in-transit for all Kafka connections
- Optional payload encryption for sensitive data
- End-to-end encryption with custom encryption layer
**Access Control**:
- Kafka ACLs (Access Control Lists) per topic
- SASL/SCRAM authentication for producers and consumers
- Separate credentials per service
- Principle of least privilege - grant only necessary permissions
**Schema Validation**:
- Avro schemas in Schema Registry
- Schema evolution with backward/forward compatibility
- Reject events that don't match schema
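The "reject events that don't match schema" rule can be illustrated with a plain TypeScript type guard. This is only a stand-in for Avro validation against the Schema Registry, not a replacement for it; the three required fields are the ones this document's logging examples read from an event, and `isEventEnvelope` and `acceptOrReject` are hypothetical names.

```typescript
// Minimal envelope, limited to fields that appear in this document's logging examples.
interface EventEnvelope {
  eventId: string;
  eventType: string;
  correlationId: string;
}

const REQUIRED_FIELDS = ["eventId", "eventType", "correlationId"];

// Type guard: true only when every required field is a non-empty string.
function isEventEnvelope(value: unknown): value is EventEnvelope {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return REQUIRED_FIELDS.every((field) => {
    const v = record[field];
    return typeof v === "string" && v.length > 0;
  });
}

// Rejects (throws) instead of passing a malformed event to downstream handlers.
function acceptOrReject(raw: unknown): EventEnvelope {
  if (!isEventEnvelope(raw)) {
    throw new Error("Event rejected: payload does not match the expected schema");
  }
  return raw;
}
```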
**Audit**:
- Log all event publishes and consumes
- Correlation IDs to trace event flow
- Retention policy for audit logs (7 years)
**Data Retention**:
- Default 7 days retention
- Configurable per topic
- Automatic deletion after retention period
- GDPR compliance (right to erasure)
## Deployment
```mermaid
graph TD
subgraph "Kafka Cluster"
subgraph "Brokers"
Broker1[Kafka Broker 1<br/>Leader for partitions 0,3,6]
Broker2[Kafka Broker 2<br/>Leader for partitions 1,4,7]
Broker3[Kafka Broker 3<br/>Leader for partitions 2,5,8]
end
subgraph "Coordination"
ZK[Zookeeper Ensemble<br/>3 nodes]
end
Broker1 --> ZK
Broker2 --> ZK
Broker3 --> ZK
end
subgraph "Producers"
IAM[IAM Service]
ServiceA[Service A]
end
subgraph "Consumers"
Notification[Notification Service<br/>Consumer Group: notifications]
Audit[Audit Service<br/>Consumer Group: audit]
end
IAM --> Broker1
IAM --> Broker2
IAM --> Broker3
ServiceA --> Broker1
ServiceA --> Broker2
ServiceA --> Broker3
Broker1 --> Notification
Broker2 --> Notification
Broker3 --> Notification
Broker1 --> Audit
Broker2 --> Audit
Broker3 --> Audit
style Broker1 fill:#e1f5ff
style Broker2 fill:#fff4e1
style Broker3 fill:#d4edda
style ZK fill:#f0e1ff
```
**Kafka Cluster Configuration**:
- **Brokers**: 3 brokers minimum (5 for production)
- **Replication Factor**: 3 (for fault tolerance)
- **Min In-Sync Replicas**: 2 (ensure data durability)
- **Partitions**: 3-10 per topic (based on throughput needs)
- **Zookeeper**: 3-node ensemble (for coordination)
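The interaction between the replication factor and min in-sync replicas can be made concrete with a small calculation. The sketch assumes producers use `acks=all` (not stated in this document), in which case a partition stops accepting writes once fewer than `min.insync.replicas` replicas are in sync. The function names are hypothetical.

```typescript
// Replica failures a partition can absorb while still accepting writes,
// assuming producers use acks=all (an assumption for this sketch).
function writeFailureTolerance(replicationFactor: number, minInSyncReplicas: number): number {
  return replicationFactor - minInSyncReplicas;
}

// Replica failures a partition can absorb before no copy of the data remains.
function dataLossTolerance(replicationFactor: number): number {
  return replicationFactor - 1;
}

// With the configuration above (replication factor 3, min in-sync replicas 2):
console.log(writeFailureTolerance(3, 2)); // 1
console.log(dataLossTolerance(3));        // 2
```

So with these settings a single broker outage does not block producers, while a second concurrent outage makes affected partitions reject writes until a replica rejoins.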
**Resource Allocation**:
| Component | CPU | Memory | Disk |
|-----------|-----|--------|------|
| **Kafka Broker** | 2 cores | 4GB RAM | 100GB SSD |
| **Zookeeper** | 1 core | 2GB RAM | 20GB SSD |
| **Schema Registry** | 500m | 1GB RAM | 10GB |
**Topic Configuration**:
```yaml
user.created:
  partitions: 3
  replication-factor: 3
  retention-ms: 604800000 # 7 days
  compression-type: snappy
auth.login.success:
  partitions: 5
  replication-factor: 3
  retention-ms: 604800000
  compression-type: snappy
audit.events:
  partitions: 10
  replication-factor: 3
  retention-ms: 220752000000 # 7 years
  compression-type: lz4
```
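The `retention-ms` values above follow directly from the retention periods. A small helper (hypothetical, for checking the arithmetic) makes the conversion explicit; the 7-year figure uses 365-day years with no leap days.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000; // 86,400,000 ms

// Converts a retention period in days to the millisecond value used in topic config.
function retentionMs(days: number): number {
  return days * MS_PER_DAY;
}

console.log(retentionMs(7));       // 604800000 (7 days)
console.log(retentionMs(7 * 365)); // 220752000000 (7 years of 365 days)
```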
**High Availability**:
- Multiple brokers with partition replication
- Automatic leader election when broker fails
- Consumer group rebalancing
- Monitoring and alerting for broker health
## Monitoring & Observability
**Key Metrics**:
**Kafka Broker Metrics**:
- `kafka_server_brokertopicmetrics_messagesinpersec` - Messages in/sec
- `kafka_server_brokertopicmetrics_bytesinpersec` - Bytes in/sec
- `kafka_server_brokertopicmetrics_bytesoutpersec` - Bytes out/sec
- `kafka_controller_kafkacontroller_activecontrollercount` - Active controller
- `kafka_server_replicamanager_underreplicatedpartitions` - Under-replicated partitions
**Consumer Metrics**:
- `kafka_consumer_fetch_manager_records_lag_max` - Max consumer lag
- `kafka_consumer_fetch_manager_records_consumed_rate` - Records consumed/sec
- `kafka_consumer_coordinator_commit_latency_avg` - Commit latency
**Producer Metrics**:
- `kafka_producer_record_send_total` - Total records sent
- `kafka_producer_record_error_total` - Total send errors
- `kafka_producer_request_latency_avg` - Request latency
**Application Metrics**:
```typescript
// Custom metrics for event processing
const eventPublished = new Counter({
name: 'events_published_total',
help: 'Total events published',
labelNames: ['event_type', 'topic']
});
const eventConsumed = new Counter({
name: 'events_consumed_total',
help: 'Total events consumed',
labelNames: ['event_type', 'topic', 'consumer_group']
});
const eventProcessingDuration = new Histogram({
name: 'event_processing_duration_seconds',
help: 'Event processing duration',
labelNames: ['event_type'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});
```
**Dashboards**:
- Kafka Cluster Overview (brokers, topics, partitions)
- Producer Performance (throughput, latency, errors)
- Consumer Performance (lag, throughput, errors)
- Topic Metrics (messages/sec, bytes/sec, retention)
**Logging**:
```typescript
// Structured logging for events
logger.info('Event published', {
eventId: event.eventId,
eventType: event.eventType,
topic: 'user.created',
correlationId: event.correlationId
});
logger.info('Event consumed', {
eventId: event.eventId,
eventType: event.eventType,
topic: 'user.created',
consumerGroup: 'notifications',
processingTime: duration
});
```
## Related Documentation
- [System Design](./system-design.md) - Overall architecture
- [IAM Architecture](./iam-proposal.md) - Event sourcing implementation
---
**Last Updated**: 2024-01-15
**Authors**: GoodGo Architecture Team


@@ -1,9 +1,8 @@
# Security Architecture / Kiến trúc Bảo mật
# Security Architecture
> **EN**: Comprehensive security architecture for GoodGo platform with zero-trust model, RBAC, and compliance
> **VI**: Comprehensive security architecture for the GoodGo platform with a zero-trust model, RBAC, and compliance
> Comprehensive security architecture for GoodGo platform with zero-trust model, RBAC, and compliance
## Overview Diagram / Sơ đồ Tổng quan
## Overview Diagram
```mermaid
graph TD
@@ -26,9 +25,7 @@ graph TD
style Audit fill:#fff4e1
```
## Architecture Description / Mô tả Kiến trúc
### EN: English Section
## Architecture Description
The GoodGo Security Architecture implements defense-in-depth with multiple security layers:
@@ -47,26 +44,7 @@ The GoodGo Security Architecture implements defense-in-depth with multiple secur
- Event Sourcing for Audit Trail
- Compliance (GDPR, SOC2, ISO27001, HIPAA)
### VI: Vietnamese Section
The GoodGo Security Architecture implements defense-in-depth with multiple security layers:
**Security Principles**:
1. **Zero Trust**: Never trust, always verify
2. **Least Privilege**: Minimum necessary permissions
3. **Defense in Depth**: Multiple security layers
4. **Audit Everything**: Complete audit trail
5. **Encryption**: Encrypt data at rest and in transit
**Key Components**:
- JWT Authentication (15min access, 7-day refresh)
- RBAC + ABAC Authorization
- Zero-Trust Device Validation
- AES-256-GCM Encryption
- Event Sourcing for Audit Trail
- Compliance (GDPR, SOC2, ISO27001, HIPAA)
## Authentication Flow / Luồng Xác thực
## Authentication Flow
```mermaid
sequenceDiagram
@@ -93,7 +71,7 @@ sequenceDiagram
end
```
**EN: Authentication Details**:
**Authentication Details**:
**1. Password Hashing**:
- Algorithm: bcrypt with cost factor 12
@@ -116,30 +94,7 @@ sequenceDiagram
- Backup codes (10 single-use)
- Recovery email verification
**VI: Authentication Details**:
**1. Password Hashing**:
- Algorithm: bcrypt with cost factor 12
- Never store plaintext passwords
- Minimum password length: 8 characters with complexity rules
**2. JWT Tokens**:
- Access Token: 15-minute expiry
- Refresh Token: 7-day expiry
- Algorithm: RS256 (asymmetric signing)
- Payload: userId, roles, permissions
**3. Token Storage**:
- Access: httpOnly cookie (secure, sameSite)
- Refresh: SHA-256 hash stored in the database
- Rotation: New refresh token on every use
**4. MFA Support**:
- TOTP (Time-based One-Time Password)
- Backup codes (10 single-use)
- Recovery email verification
## Authorization Model / Mô hình Phân quyền
## Authorization Model
```mermaid
graph TD
@@ -162,7 +117,7 @@ graph TD
style Perm fill:#fff4e1
```
**EN: RBAC (Role-Based Access Control)**
**RBAC (Role-Based Access Control)**:
**1. Role Hierarchy**:
```
@@ -186,24 +141,7 @@ SuperAdmin > OrgAdmin > Manager > User > Guest
// Invalidate on: role change, permission change
```
**VI: RBAC (Role-Based Access Control)**
**1. Role Hierarchy**:
```
SuperAdmin > OrgAdmin > Manager > User > Guest
```
**2. Permission Format**: `resource:action:scope`
- Resource: `users`, `roles`, `permissions`
- Action: `create`, `read`, `update`, `delete`
- Scope: `own`, `org`, `global`
**Examples**:
- `users:read:own` - Read one's own profile
- `users:update:org` - Update users within the organization
- `roles:create:global` - Create roles globally
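A permission check over the `resource:action:scope` format can be sketched as follows. The sketch assumes that a wider scope covers a narrower one (`global` ⊇ `org` ⊇ `own`); the document lists the scopes but does not state this ordering, so it is an assumption, as are the names `parsePermission` and `isAllowed`.

```typescript
// Scopes ordered from narrowest to widest (ordering is an assumption of this sketch).
const SCOPES = ["own", "org", "global"] as const;
type Scope = (typeof SCOPES)[number];

interface Permission {
  resource: string;
  action: string;
  scope: Scope;
}

// Parses "resource:action:scope" and rejects malformed strings.
function parsePermission(text: string): Permission {
  const parts = text.split(":");
  const [resource, action, scope] = parts;
  if (parts.length !== 3 || !resource || !action || SCOPES.indexOf(scope as Scope) === -1) {
    throw new Error(`Invalid permission: ${text}`);
  }
  return { resource, action, scope: scope as Scope };
}

// True when some granted permission matches resource and action
// and has a scope at least as wide as the required one.
function isAllowed(granted: string[], required: string): boolean {
  const need = parsePermission(required);
  return granted.map(parsePermission).some(
    (have) =>
      have.resource === need.resource &&
      have.action === need.action &&
      SCOPES.indexOf(have.scope) >= SCOPES.indexOf(need.scope),
  );
}
```

For example, a user holding `users:update:org` would pass a check for `users:update:own`, but not one for `users:update:global`.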
## Zero-Trust Architecture / Kiến trúc Zero-Trust
## Zero-Trust Architecture
```mermaid
graph TD
@@ -221,7 +159,7 @@ graph TD
style Allow fill:#d4edda
```
**EN: Zero-Trust Components**:
**Zero-Trust Components**:
**1. Device Fingerprinting**:
- Browser: User-Agent, Canvas, WebGL
@@ -245,22 +183,9 @@ graph TD
- Bind session to IP address
- Invalidate on mismatch
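The zero-trust checks in this section can be combined into a single decision function. This is a simplified sketch, not the IAM service's logic: it covers only the device fingerprint and IP rules, treats "unknown device or unknown IP" as a reason to step up to MFA, and uses hypothetical names (`evaluateRequest`, `UserTrustProfile`, `suspiciousIps`).

```typescript
type Decision = "allow" | "require_mfa" | "block";

interface RequestContext {
  fingerprintHash: string; // hashed device fingerprint sent with the request
  ip: string;
}

interface UserTrustProfile {
  trustedFingerprints: string[]; // fingerprints previously stored for this user
  knownIps: string[];            // IPs already seen for this user
}

// Order matters: suspicious IPs are blocked before any trust is considered.
function evaluateRequest(
  ctx: RequestContext,
  profile: UserTrustProfile,
  suspiciousIps: string[],
): Decision {
  if (suspiciousIps.indexOf(ctx.ip) !== -1) return "block";
  const knownDevice = profile.trustedFingerprints.indexOf(ctx.fingerprintHash) !== -1;
  const knownIp = profile.knownIps.indexOf(ctx.ip) !== -1;
  return knownDevice && knownIp ? "allow" : "require_mfa";
}
```

Returning a three-way decision rather than a boolean leaves room for the step-up path, so a legitimate user on a new network is challenged instead of locked out.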
**VI: Zero-Trust Components**:
## Data Protection
**1. Device Fingerprinting**:
- Browser: User-Agent, Canvas, WebGL
- Screen resolution, timezone, language
- Plugin detection, available fonts
- Hash fingerprint → Store with session
**2. IP Address Validation**:
- Whitelist known IPs for the user
- Alert on new IP + require MFA
- Block suspicious IPs (VPN, Tor)
## Data Protection / Bảo vệ Dữ liệu
**EN: Encryption Strategy**:
**Encryption Strategy**:
**1. Data at Rest**:
- PII: AES-256-GCM encryption
@@ -279,22 +204,9 @@ graph TD
- Rotate keys quarterly
- Never hardcode secrets
**VI: Encryption Strategy**:
## Compliance & Audit
**1. Data at Rest**:
- PII: AES-256-GCM encryption
- Passwords: bcrypt (cost 12)
- Tokens: SHA-256 hash
- Keys: Environment variables + K8s secrets
**2. Data in Transit**:
- TLS 1.2+ for all communication
- HTTPS enforcement
- Certificate pinning (mobile clients)
## Compliance & Audit / Tuân thủ & Kiểm toán
**EN: Compliance Requirements**:
**Compliance Requirements**:
**1. GDPR**:
- Right to erasure (soft delete + hard delete after 90 days)
@@ -308,7 +220,6 @@ graph TD
- Audit logging (7-year retention)
- Incident response plan
**3. Audit Trail**:
```typescript
// Event sourcing for all auth events
{
@@ -321,27 +232,338 @@ graph TD
}
```
**VI: Compliance Requirements**:
## System Context
**1. GDPR**:
- Right to erasure (soft delete + hard delete after 90 days)
- Data portability (export user data)
- Consent management
- Breach notification (72 hours)
```mermaid
C4Context
title Security Architecture Context
Person(user, "User", "End user accessing platform")
Person(admin, "Admin", "System administrator")
Person(attacker, "Attacker", "Potential threat actor")
System(iam, "IAM Service", "Authentication & Authorization")
System_Ext(db, "Neon PostgreSQL", "Encrypted user credentials & sessions")
System_Ext(cache, "Redis", "Permission & session cache")
System_Ext(audit, "Audit Service", "Security event logging")
System_Ext(mfa, "MFA Provider", "TOTP verification")
System_Ext(monitoring, "Security Monitoring", "SIEM & alerting")
Rel(user, iam, "Authenticates", "HTTPS + TLS 1.2+")
Rel(admin, iam, "Manages permissions", "HTTPS + TLS 1.2+")
Rel(attacker, iam, "Blocked by security layers", "")
Rel(iam, db, "Stores credentials", "PostgreSQL + TLS")
Rel(iam, cache, "Caches permissions", "Redis + TLS")
Rel(iam, audit, "Logs security events", "Kafka")
Rel(iam, mfa, "Verifies MFA", "HTTPS")
Rel(iam, monitoring, "Sends security metrics", "Prometheus + Loki")
```
**2. SOC2**:
- Access controls (RBAC)
- Encryption at rest and in transit
- Audit logging (7-year retention)
- Incident response plan
**Context Description**:
- **IAM Service**: Central authentication and authorization
- **Database**: Stores encrypted credentials, sessions, permissions
- **Cache**: Caches permissions and sessions to reduce database load
- **Audit Service**: Receives and stores all security events
- **MFA Provider**: External TOTP verification service (Google Authenticator compatible)
- **Security Monitoring**: SIEM (Security Information and Event Management) and alerting
## Related Documentation / Tài liệu Liên quan
## Database Architecture
- [System Design](./system-design.md) - Overall architecture
- [IAM Architecture](./iam-proposal.md) - IAM service implementation
- [Event-Driven Architecture](./event-driven-architecture.md) - Audit event streaming
```mermaid
erDiagram
User ||--o{ Session : has
User ||--o{ UserRole : has
User ||--o{ UserPermission : has
User ||--o{ MFADevice : has
User ||--o{ LoginHistory : has
User ||--o{ DeviceFingerprint : has
Role ||--o{ UserRole : assigned_to
Role ||--o{ RolePermission : has
Permission ||--o{ RolePermission : granted_to
Permission ||--o{ UserPermission : granted_to
Organization ||--o{ User : contains
Organization ||--o{ Role : defines
User {
string id PK "CUID"
string email UK "Unique, indexed"
string passwordHash "bcrypt cost 12"
string organizationId FK
boolean mfaEnabled "MFA required?"
datetime lastLoginAt "Tracking"
datetime createdAt "Timestamp"
datetime updatedAt "Timestamp"
datetime deletedAt "Soft delete"
}
Session {
string id PK "CUID"
string userId FK
string refreshTokenHash "SHA-256"
string deviceFingerprint "Hashed"
string ipAddress "IPv4/IPv6"
string userAgent "Browser info"
datetime expiresAt "7 days TTL"
datetime lastActivityAt "Tracking"
datetime createdAt "Timestamp"
}
Role {
string id PK "CUID"
string name "role-name"
string organizationId FK
int hierarchy "Priority level"
boolean isSystem "Built-in?"
datetime createdAt "Timestamp"
}
Permission {
string id PK "CUID"
string resource "users, roles, etc"
string action "create, read, update, delete"
string scope "own, org, global"
datetime createdAt "Timestamp"
}
MFADevice {
string id PK "CUID"
string userId FK
string type "totp, backup"
string secret "Encrypted TOTP secret"
boolean verified "Verified?"
datetime lastUsedAt "Tracking"
datetime createdAt "Timestamp"
}
LoginHistory {
string id PK "CUID"
string userId FK
boolean success "Success/Failure"
string ipAddress "IPv4/IPv6"
string deviceFingerprint "Hashed"
string failureReason "If failed"
datetime timestamp "Event time"
}
DeviceFingerprint {
string id PK "CUID"
string userId FK
string fingerprint "Hashed"
boolean trusted "Auto-approved?"
datetime firstSeenAt "First use"
datetime lastSeenAt "Last use"
}
```
---
**Description**:
- **User**: Stores hashed credentials, MFA settings, organization membership
- **Session**: Stores hashed refresh tokens, device fingerprint, IP tracking
- **Role & Permission**: RBAC hierarchy with system roles and custom roles
- **MFADevice**: TOTP secrets (encrypted), backup codes
- **LoginHistory**: Audit trail for all login attempts (success/failure)
- **DeviceFingerprint**: Trusted device tracking for zero-trust model
**Last Updated**: 2024-01-15
**Authors**: GoodGo Security Team
**Database Security**:
- Password hashes: bcrypt with cost factor 12
- Token hashes: SHA-256
- MFA secrets: AES-256-GCM encryption
- Soft deletes: `deletedAt` field, hard delete after 90 days (GDPR)
- Indexes: email (unique), userId (foreign keys), timestamps
## Design Decisions
### Decision 1: JWT with RS256 (Asymmetric)
**Context**: Need stateless authentication with ability to verify tokens in multiple services
**Decision**: Use JWT with RS256 (RSA asymmetric signing) instead of HS256 (HMAC symmetric)
**Consequences**:
- **Positive**:
  - Services can verify tokens with the public key and never need the signing secret
  - Easier key rotation (only the new public key is distributed)
  - Higher security (the private key stays in the IAM service)
  - Compliance: Clear audit trail of who signs tokens
- **Negative**:
  - Slower than HS256: RSA signing costs considerably more CPU than HMAC, although verification with the public key is comparatively cheap
  - More complex key management
  - The private key must be carefully protected
**Alternatives**: HS256 (symmetric), EdDSA, OAuth 2.0 with Opaque Tokens
### Decision 2: Zero-Trust Model with Device Fingerprinting
**Context**: Need to protect against credential theft, session hijacking, and unauthorized access
**Decision**: Implement zero-trust model with device fingerprinting, IP validation, behavioral analysis
**Consequences**:
- **Positive**:
  - Detect anomalies (new device, new IP, unusual behavior)
  - Increased security by detecting and blocking suspicious activities
  - Compliance: SOC2, ISO27001 requirements
  - User experience: Auto-approve trusted devices
- **Negative**:
  - Higher complexity
  - Potential false positives (legitimate users blocked)
  - Performance overhead (fingerprint hash, IP check)
  - Privacy concerns (tracking devices, IPs)
**Alternatives**: Basic authentication only, IP whitelist only, MFA required for all
### Decision 3: Event Sourcing for Audit Trail
**Context**: Need immutable audit trail for compliance (GDPR, SOC2, HIPAA) and security forensics
**Decision**: Use event sourcing pattern to store all auth/security events
**Consequences**:
- **Positive**:
  - Immutable audit trail (cannot modify/delete)
  - Complete history of all security events
  - Compliance: Supports GDPR, SOC2, and HIPAA audit requirements (7-year audit log retention)
  - Security forensics: Trace back attacks, breaches
  - Replay events to reconstruct state
- **Negative**:
  - High storage cost (retain 7 years)
  - Complexity in event schema versioning
  - Performance: Event publishing overhead
  - Data privacy: Must anonymize PII after retention period
**Alternatives**: Database audit logs only, External SIEM only, No audit trail
## Performance Characteristics
| Metric | Target | Notes |
|--------|--------|-------|
| **Login Time (P95)** | < 500ms | Including bcrypt verification |
| **Login Time (P99)** | < 1s | Peak load |
| **Token Generation (P95)** | < 50ms | JWT sign with RS256 |
| **Token Verification (P95)** | < 10ms | JWT verify with public key |
| **Permission Check (P95)** | < 5ms | From cache (L1 or L2) |
| **Permission Check (Cache Miss)** | < 50ms | Database query |
| **MFA Verification (P95)** | < 100ms | TOTP validation |
| **Session Lookup (P95)** | < 10ms | Redis cache |
| **Password Hash (P95)** | < 200ms | bcrypt cost 12 |
| **Device Fingerprint Hash** | < 5ms | SHA-256 |
| **Failed Login Rate Limit** | 5 attempts / 15min | Per user |
| **Auth Throughput** | 500 req/s | Per IAM instance |
**Performance Optimizations**:
- **Permission Caching**: L1 (memory) + L2 (Redis), TTL 5 minutes
- **Token Caching**: Cache public key in memory for JWT verification
- **Connection Pooling**: Reuse database connections
- **Async Operations**: Event publishing, audit logging (fire-and-forget)
- **Rate Limiting**: Prevent brute force attacks, reduce load
- **Horizontal Scaling**: Multiple IAM service instances
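The failed-login limit from the table above (5 attempts per 15 minutes, per user) can be sketched as a sliding-window counter. This in-memory version is illustrative only: with multiple IAM instances the counters would need a shared store such as Redis, and `FailedLoginLimiter` with its methods is a hypothetical name, not an existing class.

```typescript
// Sliding-window limiter for failed logins; state is per-process in this sketch.
class FailedLoginLimiter {
  private failures: Record<string, number[]> = {}; // userId -> failure timestamps (ms)
  private maxAttempts: number;
  private windowMs: number;

  constructor(maxAttempts: number = 5, windowMs: number = 15 * 60 * 1000) {
    this.maxAttempts = maxAttempts;
    this.windowMs = windowMs;
  }

  // Stores a failure and drops timestamps that have left the window.
  recordFailure(userId: string, nowMs: number): void {
    const recent = this.recentFailures(userId, nowMs);
    recent.push(nowMs);
    this.failures[userId] = recent;
  }

  // A user is locked once the window contains maxAttempts failures.
  isLocked(userId: string, nowMs: number): boolean {
    return this.recentFailures(userId, nowMs).length >= this.maxAttempts;
  }

  private recentFailures(userId: string, nowMs: number): number[] {
    const all = this.failures[userId] || [];
    return all.filter((t) => nowMs - t < this.windowMs);
  }
}
```

Passing the current time in as an argument keeps the window logic deterministic and easy to test.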
## Deployment
```mermaid
graph TD
subgraph "Security Layer"
LB[Load Balancer<br/>TLS Termination]
WAF[WAF / Firewall<br/>Rate Limiting<br/>DDoS Protection]
end
subgraph "IAM Service Layer"
IAM1[IAM Service Pod 1<br/>Stateless]
IAM2[IAM Service Pod 2<br/>Stateless]
IAM3[IAM Service Pod 3<br/>Stateless]
end
subgraph "Data Layer"
DB[(Neon PostgreSQL<br/>Encrypted at Rest)]
Cache[(Redis Cluster<br/>TLS Enabled)]
Vault[Secrets Manager<br/>K8s Secrets]
end
subgraph "Security Monitoring"
SIEM[SIEM / Security Monitoring]
Alerts[Alerting System]
end
Client[Clients] --> LB
LB --> WAF
WAF --> IAM1
WAF --> IAM2
WAF --> IAM3
IAM1 --> DB
IAM1 --> Cache
IAM1 --> Vault
IAM2 --> DB
IAM2 --> Cache
IAM2 --> Vault
IAM3 --> DB
IAM3 --> Cache
IAM3 --> Vault
IAM1 -.->|Security Events| SIEM
IAM2 -.->|Security Events| SIEM
IAM3 -.->|Security Events| SIEM
SIEM -.->|Alerts| Alerts
style LB fill:#d4edda
style WAF fill:#fff3cd
style DB fill:#f0e1ff
style Cache fill:#fff4e1
style Vault fill:#f8d7da
style SIEM fill:#e1f5ff
```
**Deployment Strategy**:
**Security Deployment**:
- **TLS 1.2+ Enforcement**: All connections require TLS
- **Network Policies (K8s)**: Deny all by default, whitelist specific services
- **Pod Security Standards**: Non-root user, read-only filesystem, no privilege escalation (enforced through Pod Security Admission, since PodSecurityPolicy was removed in Kubernetes 1.25)
- **Secrets Management**: Kubernetes secrets with encryption at rest
- **Image Scanning**: Trivy/Clair scan before deployment
- **RBAC (K8s)**: Least privilege for service accounts
**Resource Allocation**:
| Component | CPU | Memory | Replicas |
|-----------|-----|--------|----------|
| **IAM Service** | 500m | 1GB | 3-10 (HPA) |
| **Redis** | 1 core | 2GB | 3 masters + 3 slaves |
**Security Configuration**:
```yaml
# K8s Network Policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: iam-service-policy
spec:
  podSelector:
    matchLabels:
      app: iam-service
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 5000
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgresql
      ports:
        - protocol: TCP
          port: 5432
```


@@ -1,106 +1,234 @@
# Deployment Guide
> **Note**: This guide covers deployment strategies for GoodGo Microservices Platform across Local, Staging, and Production environments using Kubernetes and Neon PostgreSQL.
## Table of Contents
1. [Deployment Architecture](#deployment-architecture)
2. [Prerequisites](#prerequisites)
3. [Database Setup (Neon)](#database-setup-neon)
4. [Local Deployment](#local-deployment)
5. [CI/CD Pipeline](#cicd-pipeline)
6. [Staging Deployment](#staging-deployment)
7. [Production Deployment](#production-deployment)
8. [Scaling & Resilience](#scaling--resilience)
9. [Rollback Procedures](#rollback-procedures)
---
## Deployment Architecture
```mermaid
graph TD
subgraph "CI/CD Pipeline (GitHub Actions)"
Code[Code Push] --> Test[Run Tests]
Test --> Build[Build Docker Image]
Build --> Registry[Push to Registry]
Registry --> Deploy[Deploy to K8s]
end
subgraph "Infrastructure (Kubernetes)"
Ingress[Traefik Ingress] --> Service[K8s Service]
Service --> Pods[Application Pods]
Pods --> Secrets[K8s Secrets]
end
subgraph "External Services"
Pods --> Neon[(Neon PostgreSQL)]
Pods --> Redis[(Redis Cloud)]
end
Deploy --> Ingress
```
---
## Prerequisites
Before deploying, ensure you have:
* **Tools**: `kubectl`, `helm`, `docker` installed.
* **Access**:
    * Kubernetes Cluster (EKS/GKE/DigitalOcean).
    * Container Registry (GHCR/DockerHub).
    * Neon Console Account.
* **Configuration**:
    * `KUBECONFIG` file set up.
    * GitHub Secrets configured for CI/CD.
---
## Database Setup (Neon)
All environments use **Neon PostgreSQL**. Setup once before deployment:
We use **Neon Serverless PostgreSQL** for all environments to leverage branching and auto-scaling.
1. Create Neon project at https://neon.tech
2. Create branches: `main` (dev), `staging`, `production`
3. Get connection strings for each branch
4. Configure in environment variables (see below)
1. **Create Project**: Log in to [neon.tech](https://neon.tech) and create a project `goodgo-platform`.
2. **Create Branches**:
    * `main` -> For Development/Local.
    * `staging` -> For Staging environment.
    * `production` -> For Production environment (Protected).
3. **Get Connection Strings**:
    * Note the connection string for each branch (Pooler mode recommended).
See [Neon Setup Guide](../../infra/databases/neon/README.md) for details.
---
## Local Deployment
```bash
# Setup Neon database URL
cp deployments/local/env.local.example deployments/local/.env.local
# Edit .env.local and add your Neon DATABASE_URL
For local development, we use Docker Compose.
# Start services (no PostgreSQL container needed)
```bash
# 1. Setup Environment
cp deployments/local/env.local.example deployments/local/.env.local
# Edit .env.local with Neon `main` branch connection string
# 2. Start Infrastructure (Redis, Traefik, etc.)
cd deployments/local
docker-compose up -d
# 3. Start Services (Hot-reload)
pnpm dev
```
---
## CI/CD Pipeline
We use GitHub Actions for automated deployments.
| Workflow | Trigger | Description |
| :--- | :--- | :--- |
| `ci-check.yml` | Pull Request | Runs unit tests, linting, and build check. |
| `deploy-staging.yml` | Push to `develop` | Build image -> Deploy to Staging Namespace. |
| `deploy-prod.yml` | Release / Tag | Build image -> Deploy to Production Namespace. |
### Secrets Configuration (GitHub)
Set these secrets in your repository settings:
* `NEON_DATABASE_URL_STAGING`: Connection string for staging branch.
* `NEON_DATABASE_URL_PRODUCTION`: Connection string for production branch.
* `KUBECONFIG_STAGING`: Base64 encoded kubeconfig for staging.
* `KUBECONFIG_PRODUCTION`: Base64 encoded kubeconfig for production.
* `DOCKER_REGISTRY_TOKEN`: For pushing images.
---
## Staging Deployment
### Prerequisites
- Kubernetes cluster access
- kubectl configured
- KUBECONFIG set
- Neon staging branch created
- GitHub Secrets configured:
- `NEON_DATABASE_URL_STAGING`
- `KUBECONFIG_STAGING`
Staging mirrors production but uses cost-effective resources.
### Setup Secrets
### Manual Deployment
```bash
# Create Kubernetes secret
# 1. Create Secrets
kubectl create secret generic iam-service-secrets \
--from-literal=database-url='postgresql://user:pass@ep-xxx.region.neon.tech/dbname?sslmode=require&pgbouncer=true' \
--from-literal=jwt-secret='your-staging-jwt-secret' \
--from-literal=jwt-refresh-secret='your-staging-refresh-secret' \
--from-literal=database-url='<STAGING_NEON_URL>' \
--from-literal=jwt-secret='<RANDOM_SECRET>' \
-n staging
# 2. Apply Manifests
kubectl apply -f deployments/staging/kubernetes/ -n staging
# 3. Verify
kubectl get pods -n staging
```
### Deploy
### Via CI/CD
```bash
./scripts/deploy/deploy-staging.sh
```
Push code to `develop` branch. The action will:
1. Run tests.
2. Run `prisma migrate deploy` against Staging DB.
3. Update Kubernetes deployment image.
Or manually:
```bash
kubectl apply -f deployments/staging/kubernetes/
```
**Note**: Migrations run automatically in CI/CD before deployment.
---
## Production Deployment
### Prerequisites
- Production Kubernetes cluster
- kubectl configured with production context
- Neon production branch created
- GitHub Secrets configured:
- `NEON_DATABASE_URL_PRODUCTION`
- `KUBECONFIG_PRODUCTION`
Production uses high-availability configurations.
### Setup Secrets
### 1. Database Preparation
* Ensure Production branch in Neon is **protected**.
* Configure **Point-in-Time Recovery (PITR)** window (e.g., 7 days).
### 2. Manual Deployment Steps
```bash
# Create Kubernetes secret
# 1. Create Namespace
kubectl create namespace production
# 2. Create Sealed Secrets (Recommended) or Standard Secrets
kubectl create secret generic iam-service-secrets \
--from-literal=database-url='postgresql://user:pass@ep-xxx.region.neon.tech/dbname?sslmode=require&pgbouncer=true' \
--from-literal=jwt-secret='your-production-jwt-secret' \
--from-literal=jwt-refresh-secret='your-production-refresh-secret' \
--from-literal=database-url='<PROD_NEON_URL>' \
--from-literal=jwt-secret='<SECURE_RANDOM_SECRET>' \
--from-literal=jwt-refresh-secret='<SECURE_RANDOM_SECRET>' \
-n production
# 3. Deploy
kubectl apply -f deployments/production/kubernetes/ -n production
```
### Deploy
### 3. Verification
```bash
./scripts/deploy/deploy-prod.sh
# Check Rollout Status
kubectl rollout status deployment/iam-service -n production
# Check Logs
kubectl logs -l app=iam-service -n production
```
**Note**: Migrations run automatically in CI/CD before deployment (with approval).
---
### Rollback
## Scaling & Resilience
### Horizontal Pod Autoscaler (HPA)
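We use HPA to automatically scale pods based on resource utilization. The example config targets CPU; a memory metric can be added to `metrics` in the same way.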
```yaml
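# Example HPA Config
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iam-service-hpa
spec:
  # Assumed target: the iam-service Deployment used elsewhere in this guide
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: iam-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70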
```
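To get a feel for what a 70% CPU target means in replica counts, here is a rough sketch of the core formula from the Kubernetes HPA documentation: `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, skipped when the ratio is within the default 10% tolerance. The function name and signature are hypothetical, and the real controller also handles stabilization windows, missing metrics, and pod readiness, which are left out here.

```typescript
// Hypothetical helper: a simplified version of the HPA replica calculation.
function desiredReplicas(
  currentReplicas: number,
  currentUtilization: number, // average CPU utilization across pods, in percent
  targetUtilization: number, // averageUtilization from the HPA spec, in percent
  minReplicas: number,
  maxReplicas: number,
  tolerance: number = 0.1, // Kubernetes default: ignore deviations within 10%
): number {
  const ratio = currentUtilization / targetUtilization;

  // Close enough to the target: keep the current replica count.
  if (Math.abs(ratio - 1) <= tolerance) return currentReplicas;

  const raw = Math.ceil(currentReplicas * ratio);

  // Clamp to the bounds declared in the HPA spec.
  return Math.min(maxReplicas, Math.max(minReplicas, raw));
}
```

For the config above (`minReplicas: 2`, `maxReplicas: 10`, target 70), four pods averaging 90% CPU would scale to six, since `ceil(4 * 90 / 70)` is 6.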
### Zero-Downtime Deployment
Kubernetes handles this via Rolling Updates.
* **MaxSurge**: 25% (Add new pods before removing old ones).
* **MaxUnavailable**: 0 (Ensure no downtime during update).
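The two settings above bound how many pods exist during a rollout. As a small sketch, assuming percentage values only: Kubernetes rounds a percentage `maxSurge` up and a percentage `maxUnavailable` down when converting them to pod counts. The function name `rollingUpdateBounds` is hypothetical.

```typescript
// Hypothetical helper: pod-count bounds during a rolling update.
function rollingUpdateBounds(
  replicas: number,
  maxSurgePercent: number,
  maxUnavailablePercent: number,
): { maxTotalPods: number; minAvailablePods: number } {
  // maxSurge percentages round up; maxUnavailable percentages round down.
  const surge = Math.ceil((replicas * maxSurgePercent) / 100);
  const unavailable = Math.floor((replicas * maxUnavailablePercent) / 100);

  return {
    maxTotalPods: replicas + surge, // old + new pods that may exist at once
    minAvailablePods: replicas - unavailable, // pods that must stay available
  };
}
```

With 3 replicas, 25% surge and 0% unavailable, at most 4 pods run at once and all 3 must stay available, so the cluster needs headroom for one extra pod during each rollout.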
---
## Rollback Procedures
If a deployment fails or introduces a critical bug:
### Kubernetes Rollback
```bash
# Undo last deployment
kubectl rollout undo deployment/iam-service -n production
# Undo to specific revision
kubectl rollout undo deployment/iam-service -n production --to-revision=2
```
## Health Checks
### Database Rollback
- Liveness: `GET /health/live`
- Readiness: `GET /health/ready`
- Health: `GET /health`
## Monitoring
- Prometheus: http://prometheus:9090
- Grafana: http://grafana:3000
- Traefik Dashboard: http://traefik:8080
Since Neon supports branching and PITR:
1. Go to Neon Console.
2. Restore the `production` branch to a timestamp before the bad migration.
3. **Warning**: This may result in data loss for new transactions. Use with caution.


@@ -1,111 +1,211 @@
# Development Guide
> **Note**: This guide provides comprehensive standards and workflows for contributing to the GoodGo Microservices Platform.
## Table of Contents
1. [Project Structure](#project-structure)
2. [Code Standards](#code-standards)
3. [Git Workflow](#git-workflow)
4. [Backend Development](#backend-development)
5. [Testing Strategy](#testing-strategy)
6. [Database Workflow](#database-workflow)
7. [Kubernetes Deployment](#kubernetes-deployment)
---
## Project Structure
We follow a strict monorepo structure managed by PNPM Workspaces.
```
├── apps/ # Frontend applications
├── services/ # Backend microservices
├── packages/ # Shared libraries
├── infra/ # Infrastructure configs
├── deployments/ # Deployment configs
├── scripts/ # Automation scripts
└── docs/ # Documentation
Base/
├── apps/ # Frontend applications
│ ├── web-client/ # Next.js 14+ (App Router)
│ └── mobile-client/ # Flutter
├── services/ # Backend microservices
│ ├── _template/ # Template for new services
│ ├── iam-service/ # Identity & Access Management
│ └── ...
├── packages/ # Shared libraries
│ ├── logger/ # Structured logging (Winston)
│ ├── types/ # Shared DTOs & Interfaces
│ ├── http-client/ # Internal Service Client
│ └── tracing/ # OpenTelemetry configuration
├── infra/ # Infrastructure-as-Code
│ ├── traefik/ # API Gateway
│ └── databases/ # Database setup scripts
└── docs/ # Documentation (EN & VI)
```
## Development Workflow
---
### 1. Create a Feature Branch
## Code Standards
### Naming Conventions
* **Files**: `kebab-case` with a dot-separated role suffix (e.g., `user.controller.ts`, `app.config.ts`)
* **Classes**: `PascalCase` (e.g., `UserController`, `AuthService`)
* **Functions/Variables**: `camelCase` (e.g., `getUserById`, `isValid`)
* **Constants**: `UPPER_SNAKE_CASE` (e.g., `MAX_RETRIES`, `DEFAULT_TIMEOUT`)
* **Interfaces**: `PascalCase` (e.g., `User`, `CreateUserDto`) - *No 'I' prefix*
### Bilingual Comments
For core logic and public APIs, assume that both international and Vietnamese developers will read the code, and write the comment in both languages.
```typescript
/**
* EN: Validates user credentials and returns a token
* VI: Xác thực thông tin người dùng và trả về token
*/
async login(dto: LoginDto): Promise<TokenResponse> { ... }
```
### TypeScript Usage
* **Strict Mode**: Enabled in `tsconfig.json`. No `any` allowed (use `unknown` if needed).
* **DTOs**: Use Zod for runtime validation and type inference.
* **Return Types**: Explicitly declare return types for all public methods.
---
## Git Workflow
### Branching Strategy
* `main`: Production-ready code.
* `develop`: Integration branch for next release.
* `feature/xyz`: New features (branch off `develop`).
* `fix/xyz`: Bug fixes (branch off `develop`).
* `hotfix/xyz`: Critical fixes (branch off `main`).
### Commit Messages
We follow [Conventional Commits](https://www.conventionalcommits.org/):
```
feat(iam): add multi-factor authentication
fix(db): correct unique constraint on email
docs(guide): update development setup
style: format code with prettier
refactor: simplify auth middleware
test: add unit tests for user service
chore: update dependencies
```
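A commit subject in this format can be checked mechanically, for example in a commit hook. The sketch below is an assumption-laden illustration: `COMMIT_TYPES` only lists the types shown in the examples above, the scope pattern (lowercase letters, digits, hyphens) is a guess at the house style, and `isConventionalSubject` is a hypothetical name.

```typescript
// Types taken from the examples above; the Conventional Commits spec allows others.
const COMMIT_TYPES = ["feat", "fix", "docs", "style", "refactor", "test", "chore"] as const;

// <type>(<optional scope>)<optional !>: <description>
const SUBJECT_PATTERN = new RegExp(
  `^(${COMMIT_TYPES.join("|")})(\\([a-z0-9-]+\\))?!?: \\S.*$`,
);

/** Returns true when the first line of a commit message follows the convention. */
function isConventionalSubject(subject: string): boolean {
  return SUBJECT_PATTERN.test(subject);
}
```

In practice a dedicated tool such as commitlint is the more common choice; this only shows what the rule amounts to.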
---
## Backend Development
### Creating a New API Endpoint
1. **Define DTO** (`modules/user/user.dto.ts`):
    ```typescript
    import { z } from 'zod';

    export const CreateUserDto = z.object({
      email: z.string().email(),
      name: z.string().min(2),
    });
    export type CreateUserDto = z.infer<typeof CreateUserDto>;
    ```
2. **Create Service Method** (`modules/user/user.service.ts`):
    * Implement business logic.
    * Use `BaseRepository`.
    * Throw `HttpError` (e.g., `NotFound`, `BadRequest`).
3. **Create Controller** (`modules/user/user.controller.ts`):
    * Parse body with DTO: `const dto = CreateUserDto.parse(req.body)`.
    * Call service.
    * Return success response: `res.json({ success: true, data: result })`.
4. **Register Route** (`modules/user/index.ts`):
    * Add to Express router with middlewares.
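The steps above can be sketched end to end without any framework. This is a dependency-free illustration, not the project's code: `parseCreateUser` is a hand-rolled stand-in for `CreateUserDto.parse` (Zod), the `UserService` here keeps users in a `Map` instead of using `BaseRepository`, and the controller is a plain function rather than an Express handler.

```typescript
// Hypothetical, dependency-free sketch of the DTO -> service -> controller flow.

interface CreateUserInput {
  email: string;
  name: string;
}

interface User extends CreateUserInput {
  id: string;
}

// Step 1 (DTO): validate unknown input and narrow it to a typed value.
function parseCreateUser(body: unknown): CreateUserInput {
  if (typeof body !== "object" || body === null) {
    throw new Error("Body must be an object");
  }
  const { email, name } = body as Record<string, unknown>;
  if (typeof email !== "string" || !/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(email)) {
    throw new Error("Invalid email");
  }
  if (typeof name !== "string" || name.length < 2) {
    throw new Error("Name must have at least 2 characters");
  }
  return { email, name };
}

// Step 2 (service): business logic only, no HTTP concerns.
class UserService {
  private readonly usersByEmail = new Map<string, User>();

  create(input: CreateUserInput): User {
    if (this.usersByEmail.has(input.email)) {
      throw new Error("Email already registered");
    }
    const user: User = { id: String(this.usersByEmail.size + 1), ...input };
    this.usersByEmail.set(user.email, user);
    return user;
  }
}

// Step 3 (controller): parse, call the service, wrap the result.
function createUserController(
  service: UserService,
  body: unknown,
): { success: true; data: User } {
  const dto = parseCreateUser(body);
  const result = service.create(dto);
  return { success: true, data: result };
}
```

Keeping validation at the controller boundary means the service only ever sees typed input, which is the same division of labor the Zod-based steps describe.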
### Error Handling
Always use the custom error classes from `core/errors`:
```typescript
import { NotFoundError, ConflictError } from '../../core/errors';
if (!user) {
  throw new NotFoundError('User not found');
}
```
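For readers who have not opened `core/errors`, here is one plausible shape for such classes. It is a guess for illustration only: the actual base class, status codes, and the `toResponse` helper are assumptions, not the contents of the real module.

```typescript
// Hypothetical sketch of what core/errors might contain.

class HttpError extends Error {
  readonly statusCode: number;

  constructor(statusCode: number, message: string) {
    super(message);
    // Keeps instanceof working even when compiled to older JS targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.statusCode = statusCode;
    this.name = new.target.name;
  }
}

class NotFoundError extends HttpError {
  constructor(message: string = "Not found") {
    super(404, message);
  }
}

class ConflictError extends HttpError {
  constructor(message: string = "Conflict") {
    super(409, message);
  }
}

// A central handler can then map any thrown value to a response.
function toResponse(err: unknown): { status: number; body: { success: false; error: string } } {
  if (err instanceof HttpError) {
    return { status: err.statusCode, body: { success: false, error: err.message } };
  }
  // Unknown errors: hide the details from the client.
  return { status: 500, body: { success: false, error: "Internal server error" } };
}
```

The benefit of typed errors is that services can throw without knowing about HTTP, while one handler decides the status code.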
---
## Testing Strategy
### Unit Tests (`*.test.ts`)
* **Scope**: Individual classes/functions.
* **Mocking**: Mock all external dependencies (DB, other services) using `jest-mock-extended`.
* **Location**: Co-located with source files.
* **Run**: `pnpm test`
### E2E Tests (`tests/**/*.e2e.ts`)
* **Scope**: Full API flows (Controller -> Service -> DB).
* **Database**: Use a separate test database (Dockerized).
* **Run**: `pnpm test:e2e`
### Linting & Formatting
* **Lint**: `pnpm lint` (ESLint)
* **Format**: `pnpm format` (Prettier)
* **Typecheck**: `pnpm typecheck` (TSC)
---
## Database Workflow
We use **Prisma** with **Neon PostgreSQL**.
### Migrations
1. Modify `prisma/schema.prisma`.
2. Create migration (Dev):
```bash
./scripts/db/migrate.sh iam-service dev --name add_user_profile
```
3. Apply to Production (CI/CD):
```bash
./scripts/db/migrate.sh iam-service deploy
```
### Seed Data
Populate database with initial data:
```bash
git checkout -b feature/my-feature
./scripts/db/seed.sh iam-service
```
### 2. Make Changes
- Write code following TypeScript strict mode
- Add tests for new functionality
- Update documentation if needed
### 3. Run Tests Locally
### Visualizing Data
Use Prisma Studio:
```bash
# All tests
pnpm test
# Specific service
pnpm --filter @goodgo/iam-service test
pnpm --filter @goodgo/iam-service prisma studio
```
### 4. Lint and Format
```bash
pnpm lint
pnpm format
```
### 5. Create Pull Request
- Push your branch
- Create PR targeting `develop`
- CI/CD will run automatically
## Adding a New Service
1. Use the template:
```bash
./scripts/utils/create-service.sh my-new-service
```
2. Update service configuration
3. Implement business logic
4. Add tests
5. Update documentation
## Adding a New Package
1. Create package in `packages/new-package`
2. Add to workspace in `pnpm-workspace.yaml`
3. Export from `index.ts`
4. Add tests
5. Document usage
## Database Migrations
```bash
# Create migration (dev)
./scripts/db/migrate.sh iam-service dev
# Apply migrations (production)
./scripts/db/migrate.sh iam-service deploy
```
---
## Kubernetes Deployment
### Local Kubernetes (Docker Desktop)
For local Kubernetes testing (Docker Desktop / Minikube):
```bash
# Enable Kubernetes in Docker Desktop
# Settings → Kubernetes → Enable Kubernetes
# 1. Build images
docker build -t goodgo/iam-service:latest -f services/iam-service/Dockerfile .
# Deploy service
# 2. Deploy
cd deployments/local/kubernetes
./deploy.sh
# Verify deployment
# 3. Verify
kubectl get pods -n iam-local
kubectl logs -f -n iam-local -l app=iam-service
# Port forward for testing
kubectl port-forward svc/iam-service 5002:80 -n iam-local
curl http://localhost:5002/health/live
kubectl logs -f -l app=iam-service -n iam-local
```
**See detailed guide**: [Kubernetes Local Deployment Guide](./kubernetes-local.md)
## Debugging
- Use logger from `@goodgo/logger`
- Check Traefik logs: `docker logs traefik-local`
- Check service logs: `./scripts/dev/logs.sh iam-service`
See [Kubernetes Guide](./kubernetes-local.md) for detailed setup.


@@ -1,81 +1,214 @@
# Getting Started
> **Note**: This guide assumes you are setting up the project on macOS or Linux. Windows users should use WSL2.
## Table of Contents
1. [Prerequisites](#prerequisites)
2. [Architecture Overview](#architecture-overview)
3. [Project Structure](#project-structure)
4. [Installation & Setup](#installation--setup)
5. [Development Workflow](#development-workflow)
6. [Common Commands](#common-commands)
7. [Troubleshooting](#troubleshooting)
## Prerequisites
- Node.js >= 20.0.0
- PNPM >= 8.0.0
- Docker & Docker Compose
- Git
- Neon account (https://neon.tech) - for database
Before starting, ensure you have the following installed:
## Initial Setup
* **Node.js**: v20.0.0 or higher
```bash
node -v
# v20.10.0
```
* **PNPM**: v8.0.0 or higher (we use pnpm workspaces)
```bash
pnpm -v
# 8.12.0
```
* **Docker & Docker Compose**: For local infrastructure
```bash
docker -v
# Docker version 24.0.0
```
* **Git**: For version control
* **Neon Account**: Serverless PostgreSQL (https://neon.tech)
1. **Clone the repository**
```bash
git clone <repository-url>
cd Base
```
## Architecture Overview
2. **Setup Neon Database**
```bash
# Run setup script
./scripts/db/setup-neon.sh
# Or manually:
# 1. Create Neon project at https://neon.tech
# 2. Create branches: main (dev), staging, production
# 3. Get connection strings
# 4. Update deployments/local/.env.local
```
See [Neon Setup Guide](../../infra/databases/neon/README.md) for details.
GoodGo Platform uses a microservices architecture with a shared infrastructure layer.
3. **Initialize the project**
```bash
./scripts/setup/init-project.sh
```
```mermaid
graph TD
Client[Client Apps] --> Traefik[Traefik Gateway]
Traefik --> IAM[IAM Service]
Traefik --> Template[Template Service]
IAM --> DB[(Neon PostgreSQL)]
IAM --> Redis[(Redis Cache)]
IAM --> Kafka[Kafka Events]
style Traefik fill:#e1f5ff
style DB fill:#f0e1ff
style Redis fill:#fff4e1
```
4. **Start infrastructure** (Redis, Traefik - no PostgreSQL needed)
```bash
cd deployments/local
docker-compose up -d
cd ../..
```
## Project Structure
5. **Run database migrations**
```bash
./scripts/db/migrate.sh iam-service dev
```
The repository follows a monorepo structure:
6. **Seed the database**
```bash
./scripts/db/seed.sh iam-service
```
```
Base/
├── apps/ # Frontend applications
│ ├── web-client/ # Next.js web application
│ └── mobile-client/ # Flutter mobile application
├── services/ # Backend microservices
│ ├── iam-service/ # Authentication & Authorization
│ └── _template/ # Template for new services
├── packages/ # Shared libraries
│ ├── logger/ # Structured logging
│ ├── types/ # Shared TypeScript types
│ └── http-client/ # Internal HTTP client
├── infra/ # Infrastructure configuration
│ ├── traefik/ # API Gateway config
│ └── databases/ # Database setup scripts
├── deployments/ # Deployment configurations
│ ├── local/ # Docker Compose for dev
│ └── k8s/ # Kubernetes manifests
└── docs/ # Documentation
```
7. **Start all services**
```bash
./scripts/dev/start-all.sh
```
## Installation & Setup
## Access Points
### 1. Clone the Repository
- **API Gateway**: http://localhost/api/v1
- **Auth Service**: http://localhost:5001
- **Web Admin**: http://admin.localhost or http://localhost:3000
- **Web Client**: http://localhost or http://localhost:3001
- **Traefik Dashboard**: http://localhost:8080
```bash
git clone <repository-url>
cd Base
```
## Database
### 2. Configure Environment
This project uses **Neon PostgreSQL** for all environments:
- **Development**: Neon main branch
- **Staging**: Neon staging branch
- **Production**: Neon production branch
Each service and the local infrastructure needs environment variables. We provide templates for these.
No local PostgreSQL needed! See [Neon Setup](../../infra/databases/neon/README.md) for details.
```bash
# Initialize project setup (copies .env.example to .env)
./scripts/setup/init-project.sh
```
### 3. Setup Neon Database
We use Neon (Serverless PostgreSQL) for all environments (Dev, Staging, Prod).
1. Create a project at [neon.tech](https://neon.tech).
2. Create a branch named `dev` (or use `main`).
3. Get the Connection String from the Neon dashboard.
4. Update `deployments/local/.env.local`:
```env
DATABASE_URL="postgres://user:pass@ep-xyz.region.neon.tech/neondb?sslmode=require"
```
### 4. Start Infrastructure
Start the supporting infrastructure (Redis, Traefik, Observability) using Docker Compose.
```bash
cd deployments/local
docker-compose up -d
# Expected output: Containers for traefik, redis, kafka created
```
### 5. Install Dependencies
```bash
pnpm install
```
### 6. Setup Database Schema
Apply the Prisma migrations to your Neon database.
```bash
# Run migrations for IAM service
pnpm --filter @goodgo/iam-service prisma migrate dev
```
### 7. Start Services
Start all backend services in development mode.
```bash
pnpm dev
# or start specific service
pnpm --filter @goodgo/iam-service dev
```
## Development Workflow
### Creating a New Service
1. Copy the template:
```bash
cp -r services/_template services/my-new-service
```
2. Update `package.json` name.
3. Add logic in `src/modules/`.
4. Register in `deployments/local/docker-compose.yml`.
### Making Changes
1. Create a new branch: `feature/my-feature`.
2. Implement changes.
3. Run tests: `pnpm test`.
4. Commit with conventional commits: `feat(iam): add login endpoint`.
## Common Commands
| Command | Description |
| :--- | :--- |
| `pnpm install` | Install all dependencies |
| `pnpm dev` | Start all services in dev mode |
| `pnpm build` | Build all packages and services |
| `pnpm test` | Run unit tests |
| `pnpm lint` | Lint code |
| `docker-compose up -d` | Start local infra |
| `docker-compose down` | Stop local infra |
## Troubleshooting
### Port Conflicts
**Error**: `Bind for 0.0.0.0:80 failed: port is already allocated`
**Solution**: Check what's using port 80 (likely another web server) and stop it, or change Traefik ports in `docker-compose.yml`.
```bash
lsof -i :80
kill -9 <PID>
```
### Database Connection Failed
**Error**: `P1001: Can't reach database server`
**Solution**:
1. Check your internet connection (Neon is cloud-based).
2. Verify `DATABASE_URL` in `deployments/local/.env.local`.
3. Ensure your IP is allowed in Neon dashboard settings.
### Service Not Found in Gateway
**Error**: `404 Not Found` from api.localhost
**Solution**:
1. Check if service is running.
2. Check Traefik dashboard at http://localhost:8080.
3. Verify `PathPrefix` labels in `docker-compose.yml`.
## Next Steps
- Read [Development Guide](development.md)
- Check [API Documentation](../api/openapi/)
- Review [Architecture Overview](../architecture/system-design.md)
* [Development Guide](development.md) - Deep dive into coding standards
* [API Documentation](../api/openapi/) - Explore the APIs
* [Architecture](../architecture/system-design.md) - Understand the system design


@@ -1,57 +1,218 @@
# Troubleshooting Guide
## Common Issues
> **Note**: This guide focuses on debugging the GoodGo Microservices Platform in a local development environment (Docker Compose).
### Database Connection Failed
## Table of Contents
**Symptoms**: Service can't connect to database
1. [General Diagnosis](#general-diagnosis)
2. [Infrastructure Issues](#infrastructure-issues)
- [Database (Neon/PostgreSQL)](#database-neonpostgresql)
- [Redis](#redis)
- [Traefik Gateway](#traefik-gateway)
3. [Service Issues](#service-issues)
- [Service Fails to Start](#service-fails-to-start)
- [Prisma/Database Errors](#prismadatabase-errors)
- [Authentication Errors](#authentication-errors)
4. [Debugging Tools](#debugging-tools)
5. [FAQ](#faq)
**Solutions**:
1. Check if PostgreSQL is running: `docker ps`
2. Verify DATABASE_URL in .env
3. Check network connectivity: `docker network ls`
4. Review logs: `docker logs postgres-auth-local`
---
### Port Already in Use
## General Diagnosis
**Symptoms**: Service fails to start with port error
When something goes wrong, follow this checklist:
**Solutions**:
1. Find process using port: `lsof -i :5001`
2. Kill process or change PORT in .env
3. Check docker-compose for port conflicts
1. **Check Service Status**:
```bash
cd deployments/local
docker-compose ps
```
*All services should be `Up` or `Running`.*
### Prisma Client Not Generated
2. **Check Logs**:
```bash
# View logs for a specific service
docker-compose logs -f <service-name>
# View last 100 lines for all
docker-compose logs --tail=100
```
**Symptoms**: Import errors for Prisma Client
3. **Check Connectivity**:
* Can you reach the Gateway? `curl http://localhost/health`
* Can you reach the Dashboard? http://localhost:8080
---
## Infrastructure Issues
### Database (Neon/PostgreSQL)
**Problem**: `P1001: Can't reach database server` or `Connection timed out`
* **Cause 1**: Internet connectivity issues (Neon is cloud-based).
* **Cause 2**: Incorrect `DATABASE_URL` in `.env`.
* **Cause 3**: IP address blocked by Neon.
**Solution**:
1. Verify internet connection: `ping neon.tech`.
2. Check `deployments/local/.env.local`. The URL should look like:
`postgres://user:pass@ep-xyz.aws.neon.tech/neondb?sslmode=require`
3. Go to Neon Dashboard -> Settings, ensure "Allow all IPs" or add your current IP.
**Problem**: `P1003: Database does not exist`
* **Reason**: You are connecting to the wrong database name.
* **Fix**: Check the database name at the end of your connection string (usually `/neondb`). If you are using a custom DB name, ensure it exists in Neon.
### Redis
**Problem**: `Redis connection refused` or `ECONNREFUSED`
* **Cause**: Redis container is not running or port mapping is wrong.
**Solution**:
1. Check Redis status: `docker-compose ps redis`.
2. Restart Redis: `docker-compose restart redis`.
3. Check logs: `docker-compose logs redis`.
4. Connection string from services:
* **Inside Docker**: `redis:6379`
* **From Host**: `localhost:6379`
### Traefik Gateway
**Problem**: `404 Not Found` when accessing APIs (e.g., `http://localhost/api/v1/auth`)
* **Cause**: Service is down or Labels are misconfigured.
**Solution**:
1. Check Traefik Dashboard at http://localhost:8080.
* Look for "HTTP Routers" and "Services".
* If your service is missing, check `docker-compose.yml` labels.
2. Verify `PathPrefix` in labels matches your request.
```yaml
- "traefik.http.routers.iam.rule=PathPrefix(`/api/v1/auth`)"
```
3. Check if the service passed health checks (Health status in dashboard).
**Problem**: `Bad Gateway` or `Gateway Timeout`
* **Cause**: Service is crashing or taking too long to respond.
* **Fix**: Check the specific service logs (`docker-compose logs iam-service`).
---
## Service Issues
### Service Fails to Start
**Symptom**: Container status is `Exited (1)` or `Restarting`.
**Debugging**:
1. Check logs immediately:
```bash
docker-compose logs iam-service
```
2. **Common Error**: `Config validation error`
* **Fix**: Check environment variables. Using `./scripts/setup/init-project.sh` ensures `.env` exists.
3. **Common Error**: `PrismaClientInitializationError`
* **Fix**: Database connectivity issue (see Infrastructure section).
### Prisma/Database Errors
**Error**: `P2025: Record to update not found`
* **Fix**: Logic error. Ensure the ID exists before updating.
**Error**: `P2002: Unique constraint failed`
* **Fix**: You are trying to insert duplicate data (e.g., same email).
**Error**: `Migration failed`
* **Fix**:
1. Delete `prisma/migrations` folder (only in dev!).
2. Reset database: `pnpm prisma migrate reset`.
3. Regenerate client: `pnpm prisma generate`.
### Authentication Errors
**Problem**: `401 Unauthorized` despite valid token
* **Cause 1**: Token expired.
* **Cause 2**: Public key mismatch (Service can't verify token signed by IAM).
* **Cause 3**: Clock skew (Docker time vs Host time).
**Solution**:
1. Check server logs for JWT verification errors.
2. Restart services to refresh keys.
3. Sync Docker time: restart Docker Desktop.
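Causes 1 and 3 are both about time claims, and a small leeway on the verifying side usually absorbs clock skew. The sketch below assumes the token has already been decoded and its signature verified; `checkTokenTimes` and the `leewaySeconds` parameter are hypothetical names, while `exp` and `nbf` are the standard JWT claims (seconds since the Unix epoch).

```typescript
// Hypothetical helper: time-claim check with clock-skew leeway.

type TimeCheck = "ok" | "expired" | "not-yet-valid";

function checkTokenTimes(
  claims: { exp: number; nbf?: number },
  nowMs: number,
  leewaySeconds: number = 0,
): TimeCheck {
  const now = Math.floor(nowMs / 1000);

  // Verifier clock is behind the issuer: the token looks like it starts in the future.
  if (claims.nbf !== undefined && claims.nbf > now + leewaySeconds) {
    return "not-yet-valid";
  }

  // Token is expired once the current time reaches exp, allowing for leeway.
  if (claims.exp <= now - leewaySeconds) {
    return "expired";
  }

  return "ok";
}
```

If logging this result during debugging shows `not-yet-valid` or a token that expired only seconds ago, clock skew between containers is the likely culprit rather than a bad key.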
---
## Debugging Tools
### 1. Accessing Container Shell
To inspect files or run commands inside a running container:
**Solutions**:
```bash
cd services/iam-service
pnpm prisma generate
docker-compose exec iam-service sh
# or use /bin/bash instead of sh if the image includes it
```
### Build Failures
### 2. Inspecting Database (via Prisma Studio)
**Symptoms**: TypeScript or build errors
Use Prisma Studio to view/edit data visually:
**Solutions**:
1. Clean build artifacts: `./scripts/utils/cleanup.sh`
2. Reinstall dependencies: `pnpm install`
3. Check TypeScript errors: `pnpm typecheck`
```bash
pnpm --filter @goodgo/iam-service prisma studio
# Opens http://localhost:5555
```
### Traefik Not Routing
### 3. Inspecting Redis
**Symptoms**: 404 errors from Traefik
```bash
docker-compose exec redis redis-cli
> PING
PONG
> KEYS *
1) "user:123:session"
```
**Solutions**:
1. Check Traefik dashboard: http://localhost:8080
2. Verify service labels in docker-compose
3. Check routes.yml configuration
4. Review Traefik logs: `docker logs traefik-local`
### 4. Direct API Testing
## Getting Help
Use `curl` or Postman.
1. Check service logs: `./scripts/dev/logs.sh <service>`
2. Review GitHub Issues
3. Contact team lead
```bash
# Health Check
curl -v http://localhost/api/v1/auth/health/live
# Login (example)
curl -X POST http://localhost/api/v1/auth/login \
-H "Content-Type: application/json" \
-d '{"email":"admin@example.com", "password":"password"}'
```
---
## FAQ
**Q: Why is my change not reflecting?**
A: If you changed `.env` or `docker-compose.yml`, you must restart:
```bash
docker-compose down && docker-compose up -d
```
If you changed code, hot-reloading (nodemon) should pick it up. If it does not, restart the container.
**Q: How do I reset everything?**
A: Be careful, this deletes all data!
```bash
docker-compose down -v
# -v removes volumes (Redis data, etc.)
```
**Q: My computer is slow when running everything.**
A: Running every container at once consumes a lot of RAM.
1. Stop unused services (e.g., `future-service`).
2. Increase Docker resource limits in Docker Desktop settings.