pos-system/.claude/TROUBLESHOOTING.md
Ho Ngoc Hai 43a61874d3 docs: add deployment state docs and troubleshooting guide
- Update POS_DEPLOYMENT_STATE.md with live staging status
- Create TROUBLESHOOTING.md with common issues & fixes
- Add architecture visual, quick reference, and analysis docs
- Document Network Policy gap (inter-service ingress)
- Document DNS/ingress routing setup
- Document CI/CD pipeline (Gitea Actions + Kaniko)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 20:14:01 +07:00

Troubleshooting Guide - GoodGo POS System

Last Updated: 2026-04-11


Quick Reference

| Symptom | Likely Cause | Fix |
|---|---|---|
| Pod Pending | Cluster out of CPU/memory | Reduce requests or add nodes |
| Pod CrashLoopBackOff | Missing DB or config | Check logs + secrets |
| Service 504 Gateway Timeout | Network Policy blocks traffic | Add ingress/egress rule |
| Service 503 | Pod not ready or scaled to 0 | Scale up + check health |
| 401 Unauthorized on API | Expected - JWT required | Service is working correctly |
| ImagePullBackOff | Harbor auth issue | Check harbor-pull-secret |
| DNS not resolving | Cloudflare cache or wrong IP | Flush DNS, check A records |

1. Network Policy Issues

Problem: Services cannot communicate with each other

Symptom: promotion-service health check fails (WalletServiceHealthCheck timeout)

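Root Cause: The default-deny-all policy blocks all traffic, so every allowed path needs an explicit allow rule. Egress between services is already allowed, but the matching ingress rule is missing, so the receiving pod still drops the connection.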

Required Network Policies:

  • allow-traefik-ingress — ingress-nginx → services (port 8080)
  • allow-inter-service-ingress — services → services (port 8080) ⚠️ MISSING
  • allow-inter-service-egress — services → services (port 8080) EXISTS
  • allow-dns-egress — all pods → kube-dns (port 53)
  • allow-app-to-redis-egress — services → redis (port 6379)
  • allow-app-to-rabbitmq-egress — services → rabbitmq (port 5672)
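
To confirm which of the policies listed above are actually absent, one option is to compare the required names against what the cluster reports. The following is a minimal sketch, not part of the existing tooling: the helper name `missing_policies` is hypothetical, and it reads existing policy names from stdin so the logic can be checked without a cluster.

```shell
# Policy names copied from the "Required Network Policies" list above.
REQUIRED_POLICIES="allow-traefik-ingress
allow-inter-service-ingress
allow-inter-service-egress
allow-dns-egress
allow-app-to-redis-egress
allow-app-to-rabbitmq-egress"

# missing_policies (hypothetical helper): reads existing policy names from
# stdin, one per line, and prints every required name that is not present.
missing_policies() (
  existing="$(cat)"
  for name in $REQUIRED_POLICIES; do
    # -x matches the whole line, -F treats the name as a fixed string
    printf '%s\n' "$existing" | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
)

# Usage against the cluster:
#   kubectl get networkpolicy -n staging \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | missing_policies
```

An empty result means every required policy exists; any printed name still needs to be applied.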

Fix:

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-inter-service-ingress
  namespace: staging
spec:
  podSelector:
    matchExpressions:
    - key: app
      operator: In
      values: [iam-service, merchant-service, order-service, fnb-engine,
               catalog-service, inventory-service, wallet-service, storage-service,
               booking-service, chat-service, social-service, promotion-service,
               membership-service, mining-service, mission-service,
               ads-manager-service, ads-serving-service, ads-billing-service,
               ads-tracking-service, ads-analytics-service,
               mkt-facebook-service, mkt-whatsapp-service, mkt-x-service, mkt-zalo-service,
               pos-web]
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchExpressions:
        - key: app
          operator: In
          values: [iam-service, merchant-service, order-service, fnb-engine,
                   catalog-service, inventory-service, wallet-service, storage-service,
                   booking-service, chat-service, social-service, promotion-service,
                   membership-service, mining-service, mission-service,
                   ads-manager-service, ads-serving-service, ads-billing-service,
                   ads-tracking-service, ads-analytics-service,
                   mkt-facebook-service, mkt-whatsapp-service, mkt-x-service, mkt-zalo-service,
                   pos-web]
    ports:
    - port: 8080
      protocol: TCP
EOF

2. Resource Exhaustion

Problem: Pods stuck in Pending state

Symptom: 0/3 nodes are available: Insufficient cpu/memory

Check:

kubectl top nodes
kubectl describe nodes | grep -A5 "Allocated resources"

Fix options:

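  1. Reduce resource requests: kubectl patch deployment <service-name> -n staging -p '{"spec":{"template":{"spec":{"containers":[{"name":"<service-name>","resources":{"requests":{"cpu":"100m","memory":"256Mi"}}}]}}}}' (the container name must match the one in the deployment spec)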
  2. Scale down unnecessary services
  3. Add worker nodes

Current resource usage (2026-04-11):

  • All 3 nodes at ~99% CPU requests (6 cores each)
  • Memory: 45-52% used
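
The ~99% figure refers to CPU requests rather than live usage, so it helps to total the requests directly. The sketch below converts Kubernetes CPU quantities to millicores and expresses a total as a percentage of node capacity. The helper names are hypothetical, and the 6-core capacity comes from the figures above.

```shell
# cpu_to_millicores (hypothetical helper): converts a Kubernetes CPU quantity
# such as "100m", "2" or "0.5" into an integer number of millicores.
cpu_to_millicores() {
  case "$1" in
    *m) printf '%s\n' "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' </dev/null ;;
  esac
}

# sum_millicores: reads CPU quantities from stdin (one per line, blank lines
# skipped) and prints the total in millicores.
sum_millicores() (
  total=0
  while IFS= read -r q; do
    [ -n "$q" ] || continue
    total=$(( total + $(cpu_to_millicores "$q") ))
  done
  printf '%s\n' "$total"
)

# request_percent: $1 = requested millicores, $2 = number of cores available.
request_percent() {
  printf '%s\n' $(( $1 * 100 / ($2 * 1000) ))
}

# Usage (cluster-wide total, to compare against 3 nodes x 6 cores = 18 cores):
#   kubectl get pods -A \
#     -o jsonpath='{range .items[*].spec.containers[*]}{.resources.requests.cpu}{"\n"}{end}' \
#     | sum_millicores
```

For example, 5940m requested on a single 6-core node works out to 99%, which matches the state recorded above.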

3. Database Connection Issues

Problem: Service CrashLoopBackOff with DB error

Symptom: Npgsql.NpgsqlException: Failed to connect

Database Architecture:

  • Neon PostgreSQL runs in neon namespace
  • Services connect via NodePort: Host=212.28.186.239;Port=30992
  • Each service has its own database: {service_name} (e.g., iam_service)
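
The naming convention above can be sketched as a small helper when checking that a service points at the right database. This assumes the database name is the service name with hyphens replaced by underscores, which is inferred from the iam_service example rather than confirmed elsewhere; the function names are hypothetical, and credentials are deliberately left out.

```shell
# db_name_for (hypothetical helper): maps a service name to its database name,
# assuming hyphens become underscores (iam-service -> iam_service).
db_name_for() {
  printf '%s\n' "$1" | tr '-' '_'
}

# conn_prefix_for: prints the host, port and database part of an Npgsql
# connection string for a service, using the NodePort endpoint listed above.
conn_prefix_for() {
  printf 'Host=%s;Port=%s;Database=%s\n' "212.28.186.239" "30992" "$(db_name_for "$1")"
}

# Usage:
#   conn_prefix_for catalog-service
```

Comparing this prefix with the value configured in the pod is a quick way to spot a wrong host, port, or database name.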

Check:

# Verify Neon compute is running
kubectl get pods -n neon | grep compute

# Check NodePort service
kubectl get svc -n neon | grep 30992

# Inspect the connection string configured in the service pod (output includes credentials)
kubectl exec deployment/catalog-service -n staging -- env | grep DATABASE_URL

Common causes:

  1. Neon compute pod restarted → wait for it to be ready
  2. Network policy blocks egress to port 30992 → add allow-external-egress
  3. Wrong credentials → check goodgo-secrets

4. Ingress / DNS Issues

Problem: 504 Gateway Timeout on platform.techbi.org

Root Cause: Ingress-nginx on control plane (212.28.186.239) has port conflicts

Current Setup:

  • DNS: *.techbi.org → 212.28.186.239 (control plane)
  • Ingress-nginx on control plane works correctly (resolves cluster DNS, routes to ClusterIPs)
  • Ingress-nginx on worker nodes has hostNetwork issue (cannot route to ClusterIPs)
  • TLS: Let's Encrypt certificates valid until Jul 2026

Fix (if DNS needs to change):

# Cloudflare API credentials: export these from a secret store, never hardcode them
: "${CF_EMAIL:?set CF_EMAIL to the Cloudflare account email}"
: "${CF_TOKEN:?set CF_TOKEN to the Cloudflare API key}"
ZONE_ID="ac7415c1822dbd1f1ba9474073ebced5"
RECORD_ID="42b0f325d2afe89c0190cd91e27cc0c2"  # platform.techbi.org, from the DNS Records table

# Update A record
curl -X PUT "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
  -H "X-Auth-Email: $CF_EMAIL" -H "X-Auth-Key: $CF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"type":"A","name":"platform.techbi.org","content":"185.225.233.97","ttl":1,"proxied":false}'

DNS Records (Cloudflare zone: ac7415c1822dbd1f1ba9474073ebced5):

| Record | ID | Value |
|---|---|---|
| platform.techbi.org | 42b0f325d2afe89c0190cd91e27cc0c2 | 212.28.186.239 |
| api.techbi.org | 07c3803f5c9ac3647659df22b93bea8f | 212.28.186.239 |
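
To check whether a record currently resolves to the value in the table, the resolver output can be compared with the expected IP. This is a rough sketch: `check_a_record` is a hypothetical helper that reads resolved addresses from stdin, and `dig` is assumed to be installed on the machine running the check.

```shell
# check_a_record (hypothetical helper): $1 = expected IP; resolved addresses
# are read from stdin, one per line (the format printed by `dig +short`).
check_a_record() (
  expected="$1"
  resolved="$(cat)"
  if printf '%s\n' "$resolved" | grep -qxF -- "$expected"; then
    echo "ok"
  else
    echo "mismatch: expected $expected, got ${resolved:-nothing}"
  fi
)

# Usage:
#   dig +short platform.techbi.org | check_a_record 212.28.186.239
#   dig +short api.techbi.org      | check_a_record 212.28.186.239
```

A mismatch can also come from a stale resolver cache, so it is worth repeating the query against a second resolver before changing the record.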

5. CI/CD Pipeline (Gitea Actions)

Problem: Builds fail or timeout

Workflow: .gitea/workflows/deploy.yaml

Architecture:

  1. GitHub → Gitea mirror (CronJob github-gitea-sync-pos)
  2. Gitea detects changes → triggers workflow
  3. Workflow builds images in parallel batches of 5 via Kaniko Jobs
  4. Images pushed to Harbor (harbor.techbi.org/goodgo/)
  5. Deploys to K8s staging namespace

Common issues:

  • Sync not triggered: kubectl create job --from=cronjob/github-gitea-sync-pos github-gitea-sync-pos-manual -n gitea
  • Kaniko clone fails: Check allow-build-egress NetworkPolicy
  • Harbor push timeout: Check Harbor ingress timeout annotations (need 600s)
  • Workflow timeout: Gitea runner has 60min limit; 26 services in 6 batches ~50min
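
The timeout arithmetic can be worked through explicitly to see how much headroom remains as services are added. The sketch below assumes each batch takes a constant ~500 seconds, a figure derived only from the ~50 minutes for 6 batches quoted above; real batch times will vary, and the function names are hypothetical.

```shell
# batch_count: $1 = number of services, $2 = batch size (ceiling division).
batch_count() {
  printf '%s\n' $(( ($1 + $2 - 1) / $2 ))
}

# estimated_minutes: $1 = number of services. Assumes batches of 5 and
# ~500 s per batch (50 min / 6 batches); result is rounded down.
estimated_minutes() {
  printf '%s\n' $(( $(batch_count "$1" 5) * 500 / 60 ))
}

# Usage:
#   batch_count 26 5       # batches needed today
#   estimated_minutes 36   # projected runtime with 36 services
```

Under that assumption, 26 services need 6 batches (about 50 minutes), 31 to 35 services need 7 batches (about 58 minutes), and a 36th service would push the run to 8 batches, past the 60-minute runner limit.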

Manual rebuild:

# Touch Dockerfiles to trigger rebuild
for dir in services/*/; do echo "# trigger" >> "$dir/Dockerfile"; done
git add -A && git commit -m "build: trigger rebuild" && git push
# Sync to Gitea
kubectl create job --from=cronjob/github-gitea-sync-pos sync-manual -n gitea

6. Harbor Registry

Problem: ImagePullBackOff

Check:

kubectl get secret harbor-pull-secret -n staging -o yaml
kubectl describe pod <failing-pod> -n staging | grep -A5 Events

Fix:

# HARBOR_PASSWORD is a placeholder variable: export it from a secret store instead of hardcoding the password
kubectl create secret docker-registry harbor-pull-secret -n staging \
  --docker-server=harbor.techbi.org \
  --docker-username=admin \
  --docker-password="$HARBOR_PASSWORD" \
  --docker-email=admin@techbi.org \
  --dry-run=client -o yaml | kubectl apply -f -

7. Service Health Checks

Check service health

# From ingress-nginx pod (bypasses network policy issues)
# Select a running pod only; completed admission job pods cannot be exec'd into
NGINX_POD=$(kubectl get pods -n ingress-nginx --field-selector=status.phase=Running -o name | head -1)
for svc in iam-service merchant-service order-service catalog-service; do
  echo -n "$svc: "
  kubectl exec "$NGINX_POD" -n ingress-nginx -- wget -qO- --timeout=5 "http://$svc.staging.svc.cluster.local:8080/health/live" 2>&1
  echo ""
done

Expected responses:

  • /health/live → Healthy (app started)
  • /health/ready → Healthy (DB + dependencies OK)
  • If ready fails but live OK → DB connection or dependency issue
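
The interpretation rules above can be captured in a small helper that turns the two endpoint responses into a next step. This is a sketch only: `diagnose_health` is a hypothetical name, and it assumes both endpoints return the literal body "Healthy" on success, as listed above.

```shell
# diagnose_health (hypothetical helper):
#   $1 = response body from /health/live
#   $2 = response body from /health/ready
diagnose_health() {
  if [ "$1" != "Healthy" ]; then
    echo "app not started: check pod status and logs"
  elif [ "$2" != "Healthy" ]; then
    echo "dependency issue: check DB connection and downstream services"
  else
    echo "ok"
  fi
}

# Usage:
#   diagnose_health "$live_body" "$ready_body"
# where the two variables hold the bodies fetched from the endpoints.
```

Checking liveness first keeps the output focused: a pod that never started will also fail readiness, so reporting the dependency message in that case would be misleading.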

8. Common kubectl Commands

# SSH to cluster
ssh root@212.28.186.239

# View all pods
kubectl get pods -n staging --sort-by=.metadata.name

# View logs
kubectl logs deployment/<service-name> -n staging --tail=50

# Restart a service
kubectl rollout restart deployment/<service-name> -n staging

# Scale
kubectl scale deployment/<service-name> --replicas=1 -n staging

# Check resources
kubectl top nodes
kubectl top pods -n staging --sort-by=cpu

# Network policy debug
kubectl get networkpolicy -n staging
kubectl describe networkpolicy <policy-name> -n staging