Commit Graph

6 Commits

Author SHA1 Message Date
Ho Ngoc Hai
9409706c58 feat(monitoring): add comprehensive alerting rules, Alertmanager, and DR validation
Expand production monitoring with full alert coverage for database connections,
Redis memory/connections, container resources, disk usage, service health, and
backup integrity. Add Alertmanager service with Slack routing for critical and
warning alerts, and add automated backup verification to the pg-backup cron
schedule. Update runbook with DR validation procedures and quarterly checklist.

- Expand Prometheus alert rules from 4 to 24 alerts across 7 groups
- Add Alertmanager container (prom/alertmanager:v0.27.0) with Slack routing
- Configure inhibition rules (critical suppresses warning for same service)
- Schedule automated backup verification at 04:00 UTC daily
- Add Alertmanager datasource to Grafana provisioning
- Update runbook with Section 9: DR Validation (automated + manual procedures)
- Add SLACK_WEBHOOK_URL and Grafana vars to .env.example
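
The inhibition behaviour listed above (critical suppresses warning for the same service) can be sketched as a simplified model. This is not the Alertmanager configuration from this commit: the `service` and `severity` label names, and the function name `apply_inhibition`, are assumptions made for illustration.

```python
from typing import Dict, List

Alert = Dict[str, str]  # label name -> label value


def apply_inhibition(alerts: List[Alert]) -> List[Alert]:
    """Return the alerts left after critical alerts mute warnings for the same service.

    Simplified model: assumes every alert carries 'severity' and (usually)
    'service' labels; these label names are hypothetical.
    """
    # Services that currently have at least one critical alert.
    critical_services = {
        a["service"]
        for a in alerts
        if a.get("severity") == "critical" and "service" in a
    }
    # Drop warnings whose service already has a critical alert; keep the rest.
    return [
        a
        for a in alerts
        if not (
            a.get("severity") == "warning"
            and a.get("service") in critical_services
        )
    ]
```

With a critical and a warning alert both labelled `service="api"`, only the critical one would be routed; a warning for a different service passes through unchanged.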

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-11 20:15:36 +07:00
Ho Ngoc Hai
a59bf8eda2 feat(infra): add web vitals Grafana dashboard and admin audit log migration
- Add Grafana dashboard for web vitals metrics visualization
- Add Prisma migration for admin audit log table

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-11 01:39:37 +07:00
Ho Ngoc Hai
90839cf542 feat(monitoring): add API latency Grafana dashboard and alerting rules
Create comprehensive Grafana dashboard for API latency monitoring with:
- p50/p95/p99 stat panels and time series for all endpoints
- Per-endpoint latency breakdown with route/method template variables
- Top 10 slowest endpoints table and bar chart (by p99)
- Request rate (by method) and error rate (4xx/5xx) panels
- Error rate percentage (5xx/total) with SLO threshold
- Latency heatmap and histogram distribution panels
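
As a rough illustration of the error rate percentage panel (5xx/total), the arithmetic might look like the sketch below. The function names are hypothetical, and the 1% default threshold simply mirrors the ApiErrorRate5xxHigh rule; the dashboard's actual SLO threshold is not stated in this message.

```python
def error_rate_percent(requests_5xx: float, requests_total: float) -> float:
    """Share of requests that ended in a 5xx status, as a percentage."""
    if requests_total <= 0:
        # No traffic: report 0% rather than dividing by zero.
        return 0.0
    return 100.0 * requests_5xx / requests_total


def breaches_slo(
    requests_5xx: float,
    requests_total: float,
    threshold_percent: float = 1.0,  # assumed threshold, see lead-in
) -> bool:
    """True when the 5xx share is strictly above the threshold."""
    return error_rate_percent(requests_5xx, requests_total) > threshold_percent
```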

Add Prometheus alerting rules:
- ApiLatencyP99High: p99 > 1s for 5m (warning)
- ApiEndpointLatencyP99High: per-endpoint p99 > 2s (warning)
- ApiLatencyP99Critical: p99 > 3s for 3m (critical/SLO breach)
- ApiErrorRate5xxHigh: 5xx rate > 1% for 5m (warning)
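
The "for 5m" / "for 3m" qualifiers above mean a rule fires only once its condition has held continuously for that long. A minimal sketch of that idea follows, assuming one sample per rule evaluation; `is_firing` is a hypothetical helper, not Prometheus code, and it simplifies the real pending/firing state machine.

```python
from typing import List, Tuple


def is_firing(
    samples: List[Tuple[float, float]],
    threshold: float,
    for_seconds: float,
) -> bool:
    """Decide whether an alert would be firing at the time of the last sample.

    samples: (unix_timestamp, value) pairs in ascending time order,
    one per rule evaluation (an assumption of this sketch).
    """
    breach_start = None  # timestamp at which the current breach streak began
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
        else:
            # Condition cleared: the streak resets.
            breach_start = None
    if breach_start is None:
        return False
    # Fire only when the streak has lasted at least the 'for' duration.
    return samples[-1][0] - breach_start >= for_seconds
```

For ApiLatencyP99High this would correspond to `threshold=1.0` (seconds) and `for_seconds=300`; a single evaluation below the threshold restarts the clock.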

Fix api-overview.json, which queried the wrong metric name
(http_request_duration_seconds → goodgo_api_request_duration_seconds).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-10 23:18:09 +07:00
Ho Ngoc Hai
5114f5b87e chore: update monitoring configs, CI workflow, and web build info
Update Grafana datasource and Prometheus configs for monitoring
integration. Improve E2E CI workflow with Prisma generate, browser
caching, and trace artifact collection.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-08 23:07:21 +07:00
Ho Ngoc Hai
775eb7b374 feat(ops): add database backup strategy and log aggregation stack
- Add pg-backup container with daily automated pg_dump (02:00 UTC) and 7-day retention
- Add backup/restore scripts with documented recovery procedure
- Add Loki + Promtail for centralized log aggregation from all Docker containers
- Add Loki as Grafana datasource with correlation ID derived fields
- Add Grafana logs dashboard with volume, error rate, HTTP request, and log viewer panels
- Configure Promtail to parse Pino structured JSON logs with level/context labels
- Enhance LoggerService with string-level formatter and service base field
- Configure 15-day log retention in Loki
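
To illustrate the level/context label extraction described above, here is a hedged sketch of parsing one Pino JSON log line. It is not the Promtail pipeline from this commit: `extract_labels` is a hypothetical helper, and the fallback from numeric to string levels uses Pino's default level numbers.

```python
import json
from typing import Dict, Optional

# Pino's default numeric levels and their names.
PINO_LEVELS = {10: "trace", 20: "debug", 30: "info", 40: "warn", 50: "error", 60: "fatal"}


def extract_labels(line: str) -> Optional[Dict[str, str]]:
    """Return level/context labels for one log line, or None if it is not a JSON object."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None

    level = record.get("level")
    if isinstance(level, int):
        # Numeric level: map to its name, falling back to the raw number.
        level = PINO_LEVELS.get(level, str(level))
    labels = {"level": str(level) if level is not None else "unknown"}

    # 'context' is optional; only emit the label when the field is present.
    if "context" in record:
        labels["context"] = str(record["context"])
    return labels
```

Emitting the level as a string at the source, as the string-level formatter in this commit does, means the log pipeline would not need the numeric mapping at all.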

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-08 04:04:32 +07:00
Ho Ngoc Hai
d99dfbafbc feat(monitoring): add Prometheus metrics endpoint and Grafana dashboards
Add observability stack with @willsoto/nestjs-prometheus for /metrics endpoint,
Prometheus scraping config, and 4 auto-provisioned Grafana dashboards
(API overview, database, search, business metrics).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-08 03:08:54 +07:00