Troubleshooting Guide - GoodGo POS System
Last Updated: 2026-04-11
Quick Reference
| Symptom | Likely Cause | Fix |
|---|---|---|
| Pod Pending | Cluster out of CPU/memory | Reduce requests or add nodes |
| Pod CrashLoopBackOff | Missing DB or config | Check logs + secrets |
| Service 504 Gateway Timeout | Network Policy blocks traffic | Add ingress/egress rule |
| Service 503 | Pod not ready or scaled to 0 | Scale up + check health |
| 401 Unauthorized on API | Expected - JWT required | Service is working correctly |
| ImagePullBackOff | Harbor auth issue | Check harbor-pull-secret |
| DNS not resolving | Cloudflare cache or wrong IP | Flush DNS, check A records |
1. Network Policy Issues
Problem: Services cannot communicate with each other
Symptom: promotion-service health check fails (WalletServiceHealthCheck timeout)
Root Cause: default-deny-all blocks all traffic. Need explicit allow rules.
Required Network Policies:
- allow-traefik-ingress: ingress-nginx → services (port 8080)
- allow-inter-service-ingress: services → services (port 8080) ⚠️ MISSING
- allow-inter-service-egress: services → services (port 8080) ✅ EXISTS
- allow-dns-egress: all pods → kube-dns (port 53)
- allow-app-to-redis-egress: services → redis (port 6379)
- allow-app-to-rabbitmq-egress: services → rabbitmq (port 5672)
Fix:
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-inter-service-ingress
  namespace: staging
spec:
  podSelector:
    matchExpressions:
      - key: app
        operator: In
        values: [iam-service, merchant-service, order-service, fnb-engine,
                 catalog-service, inventory-service, wallet-service, storage-service,
                 booking-service, chat-service, social-service, promotion-service,
                 membership-service, mining-service, mission-service,
                 ads-manager-service, ads-serving-service, ads-billing-service,
                 ads-tracking-service, ads-analytics-service,
                 mkt-facebook-service, mkt-whatsapp-service, mkt-x-service, mkt-zalo-service,
                 pos-web]
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchExpressions:
              - key: app
                operator: In
                values: [iam-service, merchant-service, order-service, fnb-engine,
                         catalog-service, inventory-service, wallet-service, storage-service,
                         booking-service, chat-service, social-service, promotion-service,
                         membership-service, mining-service, mission-service,
                         ads-manager-service, ads-serving-service, ads-billing-service,
                         ads-tracking-service, ads-analytics-service,
                         mkt-facebook-service, mkt-whatsapp-service, mkt-x-service, mkt-zalo-service,
                         pos-web]
      ports:
        - port: 8080
          protocol: TCP
EOF
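The manifest above repeats the same service list in two places, so the two copies can drift apart when a service is added. As a minimal sketch, the manifest could instead be rendered from a single list. The helper name `render_inter_service_policy` is hypothetical and not part of the repository; the policy name, label key, and port are taken from the manifest above.

```shell
# Sketch only: render the allow-inter-service-ingress manifest from one list.
# render_inter_service_policy is a hypothetical helper, not part of the repo.
# Usage: render_inter_service_policy NAMESPACE SERVICE [SERVICE...]
render_inter_service_policy() {
  namespace="$1"
  shift
  # Join the remaining arguments as "a, b, c" for a YAML flow sequence.
  services=$(printf '%s, ' "$@")
  services="[${services%, }]"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-inter-service-ingress
  namespace: $namespace
spec:
  podSelector:
    matchExpressions:
      - key: app
        operator: In
        values: $services
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchExpressions:
              - key: app
                operator: In
                values: $services
      ports:
        - port: 8080
          protocol: TCP
EOF
}

# Example (needs cluster access, so it is left commented out):
# render_inter_service_policy staging iam-service wallet-service promotion-service \
#   | kubectl apply -f -
```

Printing the manifest first and piping it to `kubectl apply -f -` only after review keeps the change easy to inspect.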
2. Resource Exhaustion
Problem: Pods stuck in Pending state
Symptom: 0/3 nodes are available: Insufficient cpu/memory
Check:
kubectl top nodes
kubectl describe nodes | grep -A5 "Allocated resources"
Fix options:
- Reduce CPU requests:
  kubectl patch deployment X -n staging -p '{"spec":{"template":{"spec":{"containers":[{"name":"X","resources":{"requests":{"cpu":"100m","memory":"256Mi"}}}]}}}}'
- Scale down unnecessary services
- Add worker nodes
Current resource usage (2026-04-11):
- All 3 nodes at ~99% CPU requests (6 cores each)
- Memory: 45-52% used
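To see why pods stay Pending at these numbers, the arithmetic can be sketched as follows. It assumes the full 6 cores count as schedulable capacity and takes the ~99% figure at face value. Real node allocatable CPU is somewhat lower than capacity, so treat the result as an upper bound. The function names are illustrative only.

```shell
# cpu_headroom_m CORES PERCENT_REQUESTED
# Prints the CPU on one node that is not yet claimed by requests, in millicores.
cpu_headroom_m() {
  cores="$1"
  pct="$2"
  echo $(( cores * 1000 * (100 - pct) / 100 ))
}

# fits_on_node HEADROOM_M REQUEST_M
# Exit status 0 when a pod with REQUEST_M millicores fits in the headroom.
fits_on_node() {
  [ "$2" -le "$1" ]
}

headroom=$(cpu_headroom_m 6 99)
echo "headroom per node: ${headroom}m"        # headroom per node: 60m
if fits_on_node "$headroom" 100; then
  echo "100m request fits"
else
  echo "100m request stays Pending"           # this branch is taken
fi
```

With roughly 60m free per node, even a pod requesting 100m cannot be scheduled until requests on existing deployments are lowered, services are scaled down, or nodes are added.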
3. Database Connection Issues
Problem: Service CrashLoopBackOff with DB error
Symptom: Npgsql.NpgsqlException: Failed to connect
Database Architecture:
- Neon PostgreSQL runs in the neon namespace
- Services connect via NodePort: Host=212.28.186.239;Port=30992
- Each service has its own database: {service_name} (e.g., iam_service)
Check:
# Verify Neon compute is running
kubectl get pods -n neon | grep compute
# Check NodePort service
kubectl get svc -n neon | grep 30992
# Show the connection string the service pod is configured with
kubectl exec deployment/catalog-service -n staging -- env | grep DATABASE_URL
Common causes:
- Neon compute pod restarted → wait for it to be ready
- Network policy blocks egress to port 30992 → add allow-external-egress
- Wrong credentials → check goodgo-secrets
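To tell a blocked port from wrong credentials, it can help to test the TCP endpoint directly. The sketch below assumes the connection string uses the `Host=...;Port=...` form shown under Database Architecture and that `nc` (netcat) is available where it runs. `conn_field` and `port_open` are hypothetical helper names.

```shell
# conn_field CONNECTION_STRING KEY
# Prints the value of KEY from semicolon-separated "Key=Value" pairs.
conn_field() {
  printf '%s\n' "$1" | tr ';' '\n' | sed -n "s/^$2=//p" | head -n 1
}

# port_open HOST PORT
# Exit status 0 when a TCP connection succeeds within 3 seconds.
# Assumption: nc (netcat) is installed on the machine running this.
port_open() {
  nc -z -w 3 "$1" "$2"
}

conn="Host=212.28.186.239;Port=30992"
host=$(conn_field "$conn" Host)
port=$(conn_field "$conn" Port)
echo "target: $host:$port"

# Left commented out because it opens a real network connection:
# port_open "$host" "$port" && echo "reachable" || echo "unreachable"
```

If the port is reachable from a workstation but the service still fails to connect, an egress NetworkPolicy or the credentials are the more likely causes.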
4. Ingress / DNS Issues
Problem: 504 Gateway Timeout on platform.techbi.org
Root Cause: Ingress-nginx on control plane (212.28.186.239) has port conflicts
Current Setup:
- DNS: *.techbi.org → 212.28.186.239 (control plane)
- Ingress-nginx on control plane works correctly (resolves cluster DNS, routes to ClusterIPs)
- Ingress-nginx on worker nodes has hostNetwork issue (cannot route to ClusterIPs)
- TLS: Let's Encrypt certificates valid until Jul 2026
Fix (if DNS needs to change):
# Cloudflare API
CF_TOKEN="<cloudflare-api-key>"  # redacted: load from a secret store, never commit the real key
CF_EMAIL="hongochai10@icloud.com"
ZONE_ID="ac7415c1822dbd1f1ba9474073ebced5"
# Record to update: platform.techbi.org (ID from the DNS Records table)
RECORD_ID="42b0f325d2afe89c0190cd91e27cc0c2"
# Update A record
curl -X PUT "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
  -H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"type":"A","name":"platform.techbi.org","content":"185.225.233.97","ttl":1,"proxied":false}'
DNS Records (Cloudflare zone: ac7415c1822dbd1f1ba9474073ebced5):
| Record | ID | Value |
|---|---|---|
| platform.techbi.org | 42b0f325d2afe89c0190cd91e27cc0c2 | 212.28.186.239 |
| api.techbi.org | 07c3803f5c9ac3647659df22b93bea8f | 212.28.186.239 |
5. CI/CD Pipeline (Gitea Actions)
Problem: Builds fail or timeout
Workflow: .gitea/workflows/deploy.yaml
Architecture:
- GitHub → Gitea mirror (CronJob github-gitea-sync-pos)
- Gitea detects changes → triggers workflow
- Workflow builds images in parallel batches of 5 via Kaniko Jobs
- Images pushed to Harbor (harbor.techbi.org/goodgo/)
- Deploys to K8s staging namespace
Common issues:
- Sync not triggered: kubectl create job --from=cronjob/github-gitea-sync-pos github-gitea-sync-pos-manual -n gitea
- Kaniko clone fails: Check allow-build-egress NetworkPolicy
- Harbor push timeout: Check Harbor ingress timeout annotations (need 600s)
- Workflow timeout: Gitea runner has 60min limit; 26 services in 6 batches ~50min
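The workflow timeout figures follow from simple arithmetic, sketched below with the numbers quoted above (26 services, batches of 5, about 50 minutes observed, 60 minute limit). `batch_count` is an illustrative helper, and the per-batch averages are integer estimates, not measurements.

```shell
# batch_count SERVICES BATCH_SIZE
# Prints the number of sequential batches (ceiling division).
batch_count() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

batches=$(batch_count 26 5)
echo "batches: $batches"                                           # batches: 6
echo "observed average per batch: $(( 50 / batches )) min"         # 8 min
echo "max average per batch under the limit: $(( 60 / batches )) min"  # 10 min
```

A few more services, or one slow batch, would push the run past the 60 minute limit, so a larger batch size or a longer runner timeout may be worth considering as the service count grows.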
Manual rebuild:
# Touch Dockerfiles to trigger rebuild
for dir in services/*/; do echo "# trigger" >> "$dir/Dockerfile"; done
git add -A && git commit -m "build: trigger rebuild" && git push
# Sync to Gitea
kubectl create job --from=cronjob/github-gitea-sync-pos sync-manual -n gitea
6. Harbor Registry
Problem: ImagePullBackOff
Check:
kubectl get secret harbor-pull-secret -n staging -o yaml
kubectl describe pod <failing-pod> -n staging | grep -A5 Events
Fix:
# HARBOR_PASSWORD must be set in the environment (load it from a secret store; do not commit it)
kubectl create secret docker-registry harbor-pull-secret -n staging \
  --docker-server=harbor.techbi.org \
  --docker-username=admin \
  --docker-password="$HARBOR_PASSWORD" \
  --docker-email=admin@techbi.org \
  --dry-run=client -o yaml | kubectl apply -f -
7. Service Health Checks
Check all services health
# From ingress-nginx pod (bypasses network policy issues)
NGINX_POD=$(kubectl get pods -n ingress-nginx -o name | head -1)
for svc in iam-service merchant-service order-service catalog-service; do
  echo -n "$svc: "
  kubectl exec "$NGINX_POD" -n ingress-nginx -- wget -qO- --timeout=5 "http://$svc.staging.svc.cluster.local:8080/health/live" 2>&1
  echo ""
done
Expected responses:
- /health/live → Healthy (app started)
- /health/ready → Healthy (DB + dependencies OK)
- If ready fails but live OK → DB connection or dependency issue
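The interpretation rules above can be captured in a small helper when checking many services in a loop. This is a sketch: `diagnose_health` is a hypothetical name, and it assumes both endpoints return the literal body `Healthy` on success, as listed above.

```shell
# diagnose_health LIVE_BODY READY_BODY
# Maps the response bodies of /health/live and /health/ready to a diagnosis.
diagnose_health() {
  live="$1"
  ready="$2"
  if [ "$live" != "Healthy" ]; then
    echo "app not started: check pod status and logs"
  elif [ "$ready" != "Healthy" ]; then
    echo "DB connection or dependency issue"
  else
    echo "OK"
  fi
}

diagnose_health "Healthy" "Unhealthy"   # DB connection or dependency issue
diagnose_health "Healthy" "Healthy"     # OK
```

In practice the two arguments would be the `wget` outputs for `/health/live` and `/health/ready` captured from the ingress-nginx pod.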
8. Common kubectl Commands
# SSH to cluster
ssh root@212.28.186.239
# View all pods
kubectl get pods -n staging --sort-by=.metadata.name
# View logs
kubectl logs deployment/<service-name> -n staging --tail=50
# Restart a service
kubectl rollout restart deployment/<service-name> -n staging
# Scale
kubectl scale deployment/<service-name> --replicas=1 -n staging
# Check resources
kubectl top nodes
kubectl top pods -n staging --sort-by=cpu
# Network policy debug
kubectl get networkpolicy -n staging
kubectl describe networkpolicy <policy-name> -n staging