Create comprehensive Grafana dashboard for API latency monitoring with:
- p50/p95/p99 stat panels and time series for all endpoints
- Per-endpoint latency breakdown with route/method template variables
- Top 10 slowest endpoints table and bar chart (by p99)
- Request rate (by method) and error rate (4xx/5xx) panels
- Error rate percentage (5xx/total) with SLO threshold
- Latency heatmap and histogram distribution panels

Add Prometheus alerting rules:
- ApiLatencyP99High: p99 > 1s for 5m (warning)
- ApiEndpointLatencyP99High: per-endpoint p99 > 2s (warning)
- ApiLatencyP99Critical: p99 > 3s for 3m (critical/SLO breach)
- ApiErrorRate5xxHigh: 5xx rate > 1% for 5m (warning)

Fix api-overview.json using wrong metric name (http_request_duration_seconds → goodgo_api_request_duration_seconds).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
25 lines
605 B
YAML
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - 'alert-rules.yml'

scrape_configs:
  - job_name: 'goodgo-api'
    metrics_path: '/metrics'
    static_configs:
      # host.docker.internal for dev (API on host), api:3001 for prod (API in container)
      - targets: ['host.docker.internal:3001']
        labels:
          service: 'goodgo-api'
          environment: 'development'
      - targets: ['api:3001']
        labels:
          service: 'goodgo-api'
          environment: 'production'

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
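
The rule_files entry above loads alert-rules.yml, which is not shown here. As a rough sketch only, the two global p99 latency alerts named in the commit message could look something like the following. It assumes that goodgo_api_request_duration_seconds is exported as a Prometheus histogram (so a _bucket series with an le label exists); the group name, the 5m rate window, the severity label values, and the annotation text are illustrative assumptions, not taken from the actual file.

```yaml
groups:
  - name: api-latency  # hypothetical group name
    rules:
      # p99 > 1s for 5m (warning), per the commit message
      - alert: ApiLatencyP99High
        # Assumes a histogram metric; the 5m rate window is an assumption.
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(goodgo_api_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'API p99 latency above 1s for 5 minutes'

      # p99 > 3s for 3m (critical/SLO breach), per the commit message
      - alert: ApiLatencyP99Critical
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(goodgo_api_request_duration_seconds_bucket[5m]))
          ) > 3
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: 'API p99 latency above 3s for 3 minutes (SLO breach)'
```

Aggregating with sum by (le) before histogram_quantile merges the development and production targets into one quantile; adding environment to the by clause would keep them separate, if that is what the real rules intend.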