goodgo-platform

Author	SHA1	Message	Date
Ho Ngoc Hai	9409706c58	feat(monitoring): add comprehensive alerting rules, Alertmanager, and DR validation	2026-04-11 20:15:36 +07:00
		Expand production monitoring with full alert coverage for database connections, Redis memory/connections, container resources, disk usage, service health, and backup integrity. Add Alertmanager service with Slack routing for critical and warning alerts, and add automated backup verification to the pg-backup cron schedule. Update runbook with DR validation procedures and quarterly checklist.
		- Expand Prometheus alert rules from 4 to 24 alerts across 7 groups
		- Add Alertmanager container (prom/alertmanager:v0.27.0) with Slack routing
		- Configure inhibition rules (critical suppresses warning for same service)
		- Schedule automated backup verification at 04:00 UTC daily
		- Add Alertmanager datasource to Grafana provisioning
		- Update runbook with Section 9: DR Validation (automated + manual procedures)
		- Add SLACK_WEBHOOK_URL and Grafana vars to .env.example
		Co-Authored-By: Paperclip <noreply@paperclip.ing>
Ho Ngoc Hai	a59bf8eda2	feat(infra): add web vitals Grafana dashboard and admin audit log migration	2026-04-11 01:39:37 +07:00
		- Add Grafana dashboard for web vitals metrics visualization
		- Add Prisma migration for admin audit log table
		Co-Authored-By: Paperclip <noreply@paperclip.ing>
Ho Ngoc Hai	90839cf542	feat(monitoring): add API latency Grafana dashboard and alerting rules	2026-04-10 23:18:09 +07:00
		Create comprehensive Grafana dashboard for API latency monitoring with:
		- p50/p95/p99 stat panels and time series for all endpoints
		- Per-endpoint latency breakdown with route/method template variables
		- Top 10 slowest endpoints table and bar chart (by p99)
		- Request rate (by method) and error rate (4xx/5xx) panels
		- Error rate percentage (5xx/total) with SLO threshold
		- Latency heatmap and histogram distribution panels
		Add Prometheus alerting rules:
		- ApiLatencyP99High: p99 > 1s for 5m (warning)
		- ApiEndpointLatencyP99High: per-endpoint p99 > 2s (warning)
		- ApiLatencyP99Critical: p99 > 3s for 3m (critical/SLO breach)
		- ApiErrorRate5xxHigh: 5xx rate > 1% for 5m (warning)
		Fix api-overview.json using wrong metric name (http_request_duration_seconds → goodgo_api_request_duration_seconds).
		Co-Authored-By: Paperclip <noreply@paperclip.ing>
Ho Ngoc Hai	5114f5b87e	chore: update monitoring configs, CI workflow, and web build info	2026-04-08 23:07:21 +07:00
		Update Grafana datasource and Prometheus configs for monitoring integration. Improve E2E CI workflow with Prisma generate, browser caching, and trace artifact collection.
		Co-Authored-By: Paperclip <noreply@paperclip.ing>
Ho Ngoc Hai	775eb7b374	feat(ops): add database backup strategy and log aggregation stack	2026-04-08 04:04:32 +07:00
		- Add pg-backup container with daily automated pg_dump (02:00 UTC) and 7-day retention
		- Add backup/restore scripts with documented recovery procedure
		- Add Loki + Promtail for centralized log aggregation from all Docker containers
		- Add Loki as Grafana datasource with correlation ID derived fields
		- Add Grafana logs dashboard with volume, error rate, HTTP request, and log viewer panels
		- Configure Promtail to parse Pino structured JSON logs with level/context labels
		- Enhance LoggerService with string-level formatter and service base field
		- Configure 15-day log retention in Loki
		Co-Authored-By: Paperclip <noreply@paperclip.ing>
Ho Ngoc Hai	d99dfbafbc	feat(monitoring): add Prometheus metrics endpoint and Grafana dashboards	2026-04-08 03:08:54 +07:00
		Add observability stack with @willsoto/nestjs-prometheus for /metrics endpoint, Prometheus scraping config, and 4 auto-provisioned Grafana dashboards (API overview, database, search, business metrics).
		Co-Authored-By: Paperclip <noreply@paperclip.ing>

6 Commits