# Service Level Objectives: Top 5 GoodGo API Endpoints

- Status: Baseline v1 (GOO-119)
- Owner: SRE / Platform
- Last reviewed: 2026-04-23

This document defines the first round of formal SLOs for the five most
user-critical API surfaces of the GoodGo platform, the Service Level
Indicators (SLIs) that back them, the recording and alerting rules that
implement them in Prometheus, and the error-budget policy that governs how
the team responds to budget burn.

The numbers below are **baseline targets** chosen against historical p95/p99
latency and 5xx ratios from the existing
`goodgo_api_request_duration_seconds` and `http_requests_total` metrics. They
are deliberately aggressive enough to drive investment, conservative enough
to be meetable today, and they will be tightened quarterly as the platform
matures.

---

## 1. Critical Endpoints

| # | Endpoint                                 | NestJS route (with `api/v1` prefix)        | Why it matters                                |
|---|------------------------------------------|--------------------------------------------|-----------------------------------------------|
| 1 | `POST /auth/login`                       | `POST /api/v1/auth/login`                  | Auth gateway; failure blocks ALL user actions |
| 2 | `GET /search` (full-text listing search) | `GET /api/v1/search`                       | Primary discovery surface; main funnel entry  |
| 3 | `GET /listings/:id`                      | `GET /api/v1/listings/:id`                 | Property detail page; conversion driver       |
| 4 | Payment callback (VNPay/MoMo/ZaloPay)    | `POST /api/v1/payments/callback/:provider` | Settles paid plans / featured listings        |
| 5 | `POST /inquiries`                        | `POST /api/v1/inquiries`                   | Lead capture; revenue-bearing event           |

> Routes are matched in Prometheus on the `route` label exposed by
> `apps/api/src/modules/metrics/presentation/interceptors/http-metrics.interceptor.ts`,
> which uses `request.route.path` from Express (set by Nest from the controller
> decorator). The recorded label is the **parameterised** path **with** the
> `/api/v1` global prefix preserved (Express's `req.route.path` is the full
> matched path), so the labels stored in Prometheus are:
>
> - `route="/api/v1/auth/login"`
> - `route="/api/v1/search"`
> - `route="/api/v1/listings/:id"`
> - `route="/api/v1/payments/callback/:provider"`. Here `:provider` is
>   **parameterised**, not literal-per-provider, because the controller is
>   `@Post('callback/:provider')` (single handler dispatching on the path
>   param). All providers (VNPay, MoMo, ZaloPay, bank_transfer) collapse onto
>   the same `route` label.
> - `route="/api/v1/inquiries"`
>
> ### Verification (run before merging dashboard / alerting changes)
>
> ```promql
> # Confirm the payment callback route is parameterised, not literal-per-provider
> count by (route) (http_requests_total{route=~".*payments/callback.*"})
> ```
>
> Expect a single series with `route="/api/v1/payments/callback/:provider"`. If
> you see per-provider literals (`/payments/callback/vnpay`, `…/momo`, etc.),
> the interceptor is recording the live path instead of the route template;
> in that case the rules in `monitoring/prometheus/rules/slo.yaml` need their
> `route="..."` matchers loosened to `route=~"/api/v1/payments/callback/.*"`.
>
> ```promql
> # Confirm /search SLI is scoped to the main full-text endpoint, not /search/geo or saved-search
> count by (route) (http_requests_total{route=~"/api/v1/search.*"})
> ```
>
> The `route="/api/v1/search"` series is the SLO target. `/api/v1/search/geo`
> and the `/api/v1/saved-searches` family have **different latency profiles**
> (PostGIS radius vs. Typesense full-text) and are intentionally **out of
> scope** for this SLO baseline. They will get their own SLOs in a follow-up
> ticket once their traffic volume justifies it.
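
The decision described in the verification step can be sketched as a small Python helper. It is an illustration only: the function names are hypothetical, and the input is assumed to be the list of `route` label values returned by the payment-callback query above, however they are fetched.

```python
# Expected label when the interceptor records the route template.
EXPECTED_CALLBACK_ROUTE = "/api/v1/payments/callback/:provider"


def classify_callback_routes(routes):
    """Classify the payment-callback route labels seen in Prometheus.

    Returns "missing" if no callback series exist, "parameterised" if the
    only callback label is the route template, and "literal" if at least
    one other callback label (for example a per-provider path) is present.
    """
    callback = sorted({r for r in routes if "/payments/callback" in r})
    if not callback:
        return "missing"
    if callback == [EXPECTED_CALLBACK_ROUTE]:
        return "parameterised"
    return "literal"


def suggested_matcher(state):
    """Return the label matcher this document recommends for each state."""
    if state == "parameterised":
        return 'route="/api/v1/payments/callback/:provider"'
    if state == "literal":
        # Loosened regex matcher, as described in the verification note.
        return 'route=~"/api/v1/payments/callback/.*"'
    raise ValueError("no callback series found; check the interceptor first")
```

A "missing" result is treated as an error here on purpose: with no callback traffic recorded, neither matcher can be validated.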

The rule file in `monitoring/prometheus/rules/slo.yaml` uses these exact
parameterised route values in the `route="..."` matchers; if the deploy ever
changes the global prefix or the interceptor strips it, both this doc and the
matchers must be updated together.

---

## 2. SLI Definitions

For every endpoint we track two SLIs, both computed from the existing
instrumentation. In the formulas below, `<route>` is one of the route labels
from § 1, `<T>` is the endpoint's latency threshold, and `w` is the
evaluation window.

### 2.1 Availability SLI (success ratio)

```
SLI_availability =
  sum(rate(http_requests_total{job="goodgo-api", route="<route>", status_code!~"5.."}[w]))
  /
  sum(rate(http_requests_total{job="goodgo-api", route="<route>"}[w]))
```

A request is "successful" when its HTTP status code is not in the `5xx`
family. 4xx is treated as a successful response from the platform's point of
view (the client asked for something it cannot have); 5xx is always a
platform fault. For payment callbacks we additionally count 4xx responses
with a status code ≥ 422 as failures, because those responses indicate
provider signature / replay validation problems that are ours to debug.

### 2.2 Latency SLI (proportion of fast requests)

```
SLI_latency =
  sum(rate(goodgo_api_request_duration_seconds_bucket{
        job="goodgo-api", route="<route>", le="<T>"}[w]))
  /
  sum(rate(goodgo_api_request_duration_seconds_count{
        job="goodgo-api", route="<route>"}[w]))
```

The threshold `T` is endpoint specific (see SLO table below). We measure the
fraction of requests that completed inside the threshold; the SLO target is
the minimum acceptable value of that fraction over the rolling 30-day window.

We deliberately use the success-ratio formulation rather than alerting on raw
percentiles. Percentile alerts are noisy at low traffic and do not produce a
budget number; the success-ratio formulation gives us a single percentage we
can burn down and reason about.

---

## 3. SLO Targets (30-day rolling window)

| Endpoint                            | Availability SLO | Latency threshold | Latency SLO |
|-------------------------------------|------------------|-------------------|-------------|
| `POST /auth/login`                  | 99.9 %           | p95 < 400 ms      | 99 %        |
| `GET /search`                       | 99.5 %           | p95 < 800 ms      | 95 %        |
| `GET /listings/:id`                 | 99.9 %           | p95 < 500 ms      | 99 %        |
| `POST /payments/callback/:provider` | 99.95 %          | p99 < 2 s         | 99 %        |
| `POST /inquiries`                   | 99.9 %           | p95 < 600 ms      | 99 %        |

The "Latency threshold" column is the bucket used as the `le` value in the
SLI; the "Latency SLO" column is the fraction of traffic that must fall
inside that bucket over the 30-day window.

### 3.1 Error budgets

Error budget = `1 − SLO`, expressed as a percentage of the rolling 30-day
request volume. For example, the `POST /auth/login` availability SLO of
99.9 % yields a budget of 0.1 % of all login attempts in the window; if the
service serves 1 M logins per month, the budget is 1 000 failed logins.

| Endpoint                            | Availability budget | Latency budget |
|-------------------------------------|---------------------|----------------|
| `POST /auth/login`                  | 0.1 %               | 1 %            |
| `GET /search`                       | 0.5 %               | 5 %            |
| `GET /listings/:id`                 | 0.1 %               | 1 %            |
| `POST /payments/callback/:provider` | 0.05 %              | 1 %            |
| `POST /inquiries`                   | 0.1 %               | 1 %            |

---

## 4. Burn-Rate Alert Strategy

We use the standard Google SRE multi-window, multi-burn-rate alert pattern.
A burn rate of 1.0 means we are on track to consume exactly 100 % of the
budget over the SLO window. Alerts fire only when both a short and a long
evaluation window are simultaneously above the threshold; this suppresses
false positives from short blips without delaying detection of real outages.

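
The arithmetic behind the error budgets and burn-rate thresholds can be sketched in Python. This is a minimal illustration, not the logic in the rule file: the function names are hypothetical, and the only values taken from this document are the 30-day window and the SLO targets.

```python
SLO_WINDOW_HOURS = 30 * 24  # rolling 30-day SLO window, in hours


def error_budget(slo):
    """Allowed failure fraction, e.g. about 0.001 for a 99.9 % SLO."""
    return 1.0 - slo


def burn_rate(error_ratio, slo):
    """How fast the budget is being spent; 1.0 spends it in exactly 30 days."""
    return error_ratio / error_budget(slo)


def budget_consumed(rate, window_hours):
    """Fraction of the 30-day budget spent if `rate` holds for `window_hours`."""
    return rate * window_hours / SLO_WINDOW_HOURS


def should_alert(long_error_ratio, short_error_ratio, slo, threshold):
    """Multi-window check: both windows must exceed the burn-rate threshold."""
    return (
        burn_rate(long_error_ratio, slo) > threshold
        and burn_rate(short_error_ratio, slo) > threshold
    )


# Login availability SLO of 99.9 % over 1 M requests allows about 1 000 failures.
login_budget_requests = round(error_budget(0.999) * 1_000_000)
```

Requiring the short window as well as the long one means an alert clears soon after the error ratio recovers, instead of staying red until the long window drains.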

| Severity          | Burn rate | Long window | Short window | Budget consumed if sustained |
|-------------------|-----------|-------------|--------------|------------------------------|
| **fast / page**   | 14.4      | 1 h         | 5 m          | 2 % of 30-day budget in 1 h  |
| **slow / ticket** | 6         | 6 h         | 30 m         | 5 % in 6 h                   |
| **slow / ticket** | 3         | 24 h        | 2 h          | 10 % in 24 h                 |
| **slow / ticket** | 1         | 3 d         | 6 h          | 10 % in 3 d                  |

The first two rows are the mandatory pair from the GOO-119 deliverable
("burn-rate alerts: fast 1h, slow 6h"). The 24 h and 3 d rows are added
because they catch slow-burn regressions that the 1 h / 6 h pair will miss;
they page nobody, they only ticket the on-call rotation.

Each burn-rate threshold is implemented twice per endpoint: once for
availability and once for latency.

---

## 5. Error Budget Policy

The error budget is the team's licence to ship. The policy is intentionally
simple so it can be applied without debate:

1. **Budget healthy (> 25 % remaining)**: Default. Ship freely.
2. **Budget at risk (10 – 25 % remaining)**: Feature work continues, but
   every PR touching the affected endpoint requires SRE sign-off, and a
   reliability task must be opened with priority `high`.
3. **Budget exhausted (≤ 10 % remaining or projected to exhaust within
   7 days)**: Feature freeze on the affected endpoint. Only reliability
   fixes, rollbacks and config changes ship until the budget recovers above
   25 %.
4. **Budget overspent (negative)**: Incident is declared; the on-call
   commander owns the freeze and the recovery plan.

The policy is enforced manually today; automation (PR labels, deploy gates)
is out of scope for this baseline ticket and tracked separately.

---

## 6. Implementation

The SLIs and burn-rate alerts above are implemented in
[`monitoring/prometheus/rules/slo.yaml`](../../monitoring/prometheus/rules/slo.yaml).
The file defines:

- One **recording-rule** group per endpoint
  (`slo:request:ratio_rate_<window>`, `slo:latency:ratio_rate_<window>`) for
  the windows used by the burn-rate alerts (5 m, 30 m, 1 h, 2 h, 6 h, 1 d,
  3 d). Recording the ratios up front keeps the alerting expressions readable
  and cheap.
- One **alerting-rule** group per endpoint with the four burn-rate alerts for
  availability and latency.

`monitoring/prometheus/prometheus.yml` already loads `*.yml` from the rules
directory via the `rule_files` block; the new `rules/` subdirectory is
included when the Prometheus container starts (see § 6.1 below).

### 6.1 Prometheus configuration

`prometheus.yml` is updated to load both the legacy `alert-rules.yml` and the
contents of the new `rules/` directory:

```yaml
rule_files:
  - 'alert-rules.yml'
  - 'rules/*.yaml'
```

Reload Prometheus in dev with:

```bash
docker compose -f docker-compose.monitoring.yml kill -s SIGHUP prometheus
```

In production, the same SIGHUP is delivered by the deploy pipeline.

---

## 7. Dashboard

A Grafana dashboard is being built in [GOO-120](/GOO/issues/GOO-120). It will
expose:

- Per-endpoint SLO compliance (current 30-day window vs. target).
- Remaining error budget (absolute requests + percentage).
- Burn-rate over the last 1 h / 6 h / 24 h / 3 d.
- Drill-down: latency histogram + status-code breakdown.

This document will be updated with the dashboard URL once the panel is
provisioned.

---

## 8. Review cadence

- **Monthly**: SRE reviews actual burn vs. target and the alert noise budget;
  thresholds are tightened or relaxed in this document via PR.
- **Quarterly**: Product + SRE jointly re-prioritise the endpoint list (top 5
  may change as new revenue surfaces ship).

Changes to SLO numbers, the endpoint list, or the burn-rate matrix MUST go
through PR review and reference this file.