goodgo-platform/docs/observability/slo.md

# Service Level Objectives — Top 5 GoodGo API Endpoints

Status: Baseline v1 (GOO-119)
Owner: SRE / Platform
Last reviewed: 2026-04-23

This document defines the first round of formal SLOs for the five most user-critical
API surfaces of the GoodGo platform, the Service Level Indicators (SLIs) that back
them, the recording and alerting rules that implement them in Prometheus, and the
error-budget policy that governs how the team responds to budget burn.

The numbers below are **baseline targets** chosen against historical p95/p99 latency
and 5xx ratios from the existing `goodgo_api_request_duration_seconds` and
`http_requests_total` metrics. They are aggressive enough to drive investment yet
conservative enough to be met today, and they will be tightened as the platform
matures (see § 8 for the review cadence).

---

## 1. Critical Endpoints

| # | Endpoint                                  | NestJS route (with `api/v1` prefix)        | Why it matters                                |
|---|-------------------------------------------|--------------------------------------------|-----------------------------------------------|
| 1 | `POST /auth/login`                        | `POST /api/v1/auth/login`                  | Auth gateway; failure blocks ALL user actions |
| 2 | `GET  /search` (full-text listing search) | `GET  /api/v1/search`                      | Primary discovery surface; main funnel entry  |
| 3 | `GET  /listings/:id`                      | `GET  /api/v1/listings/:id`                | Property detail page; conversion driver       |
| 4 | Payment callback (VNPay/MoMo/ZaloPay)     | `POST /api/v1/payments/callback/:provider` | Settles paid plans / featured listings        |
| 5 | `POST /inquiries`                         | `POST /api/v1/inquiries`                   | Lead capture; revenue-bearing event           |

> Routes are matched in Prometheus on the `route` label exposed by
> `apps/api/src/modules/metrics/presentation/interceptors/http-metrics.interceptor.ts`,
> which reads `request.route.path` from Express (set by Nest from the controller
> decorator). The recorded label is the **parameterised** route template **with**
> the `/api/v1` global prefix preserved (`request.route.path` is the pattern the
> route was registered under, not the concrete URL of the request), so the
> labels stored in Prometheus are:
>
> - `route="/api/v1/auth/login"`
> - `route="/api/v1/search"`
> - `route="/api/v1/listings/:id"`
> - `route="/api/v1/payments/callback/:provider"` — `:provider` is **parameterised**, not literal-per-provider, because the controller is `@Post('callback/:provider')` (single handler dispatching on the path param). All providers (VNPay, MoMo, ZaloPay, bank_transfer) collapse onto the same `route` label.
> - `route="/api/v1/inquiries"`
>
> ### Verification (run before merging dashboard / alerting changes)
>
> ```promql
> # Confirm the payment callback route is parameterised, not literal-per-provider
> count by (route) (http_requests_total{route=~".*payments/callback.*"})
> ```
>
> Expect a single series with `route="/api/v1/payments/callback/:provider"`. If
> you see per-provider literals (`/payments/callback/vnpay`, `…/momo`, etc.),
> the interceptor is recording the live path instead of the route template;
> in that case the rules in `monitoring/prometheus/rules/slo.yaml` need their
> `route="..."` matchers loosened to `route=~"/api/v1/payments/callback/.*"`.
>
> ```promql
> # List every route under /api/v1/search; only the exact /api/v1/search series backs the SLI
> count by (route) (http_requests_total{route=~"/api/v1/search.*"})
> ```
>
> The `route="/api/v1/search"` series is the SLO target. `/api/v1/search/geo`
> and the `/api/v1/saved-searches` family have **different latency profiles**
> (PostGIS radius vs. Typesense full-text) and are intentionally **out of
> scope** for this SLO baseline. They will get their own SLOs in a follow-up
> ticket once their traffic volume justifies it.

The rule file in `monitoring/prometheus/rules/slo.yaml` uses these exact
parameterised route values in the `route="..."` matchers; if the deploy ever
changes the global prefix or the interceptor strips it, both this doc and the
matchers must be updated together.

---

## 2. SLI Definitions

For every endpoint we track two SLIs, both computed from the existing instrumentation:

### 2.1 Availability SLI (success ratio)

```
SLI_availability =
    sum(rate(http_requests_total{job="goodgo-api", route="<R>", status_code!~"5.."}[w]))
  / sum(rate(http_requests_total{job="goodgo-api", route="<R>"}[w]))
```

A request is "successful" when its HTTP status code is not in the `5xx` family.
4xx is treated as a successful response from the platform's point of view (the
client asked for something it cannot have); 5xx is always a platform fault.

For payment callbacks we additionally count 4xx responses with a status code of
422 or higher as failures, because those responses indicate provider signature or
replay validation problems that are ours to debug.
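
As a sketch of how that stricter rule could be expressed, the availability SLI for the callback route can exclude both 5xx codes and 4xx codes from 422 upward. The 1 h window is only an example, and the regex assumes `status_code` is always a three-digit string; the actual expression in `slo.yaml` may differ.

```promql
# Payment-callback availability: 5xx and 4xx >= 422 both count as failures.
# Regex alternatives: 5.. (500-599), 42[2-9] (422-429), 4[3-9]. (430-499).
  sum(rate(http_requests_total{
      job="goodgo-api",
      route="/api/v1/payments/callback/:provider",
      status_code!~"5..|42[2-9]|4[3-9]."}[1h]))
/ sum(rate(http_requests_total{
      job="goodgo-api",
      route="/api/v1/payments/callback/:provider"}[1h]))
```

Prometheus anchors label regexes at both ends, so codes such as 400, 401 and 404 still count as successful responses here.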

### 2.2 Latency SLI (proportion of fast requests)

```
SLI_latency =
    sum(rate(goodgo_api_request_duration_seconds_bucket{
        job="goodgo-api", route="<R>", le="<T>"}[w]))
  / sum(rate(goodgo_api_request_duration_seconds_count{
        job="goodgo-api", route="<R>"}[w]))
```

The threshold `T` is endpoint-specific (see the SLO table in § 3). We measure the
fraction of requests that completed inside the threshold; the SLO target is the
minimum acceptable value of that fraction over the rolling 30-day window.
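
For instance, with the 400 ms login threshold from the SLO table, the template above might be instantiated as shown here. This is a sketch: it assumes the histogram exposes a bucket boundary at exactly `le="0.4"` (the metric is in seconds) and uses a 1 h window for illustration. The first query lists the bucket boundaries that actually exist, which is worth checking before picking `T`.

```promql
# Query 1: list the bucket boundaries available for the login route.
count by (le) (goodgo_api_request_duration_seconds_bucket{job="goodgo-api", route="/api/v1/auth/login"})

# Query 2: fraction of login requests that completed within 400 ms over the last hour.
  sum(rate(goodgo_api_request_duration_seconds_bucket{
      job="goodgo-api", route="/api/v1/auth/login", le="0.4"}[1h]))
/ sum(rate(goodgo_api_request_duration_seconds_count{
      job="goodgo-api", route="/api/v1/auth/login"}[1h]))
```

If no bucket matches the chosen `le` value, the numerator selects nothing and the ratio returns no data, so a threshold has to coincide with a configured bucket boundary.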

We deliberately use the success-ratio formulation rather than alerting on raw
percentiles. Percentile alerts are noisy at low traffic and do not produce a
budget number — the success-ratio formulation gives us a single percentage we can
burn down and reason about.

---

## 3. SLO Targets (30-day rolling window)

| Endpoint                              | Availability SLO | Latency threshold | Latency SLO |
|---------------------------------------|------------------|-------------------|-------------|
| `POST /auth/login`                    | 99.9 %           | p95 < 400 ms      | 99 %        |
| `GET  /search`                        | 99.5 %           | p95 < 800 ms      | 95 %        |
| `GET  /listings/:id`                  | 99.9 %           | p95 < 500 ms      | 99 %        |
| `POST /payments/callback/:provider`   | 99.95 %          | p99 < 2 s         | 99 %        |
| `POST /inquiries`                     | 99.9 %           | p95 < 600 ms      | 99 %        |

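The duration in the "Latency threshold" column is the bucket bound used as the
`le` value in the SLI (for example `le="0.4"` for 400 ms); the SLI itself does not
evaluate the percentile named next to it. The "Latency SLO" column is the fraction
of traffic that must fall inside that bucket over the 30-day window.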

### 3.1 Error budgets

Error budget = `1 − SLO`, expressed as a percentage of the rolling 30-day request
volume. For example, the `POST /auth/login` availability SLO of 99.9 % yields a
budget of 0.1 % of all login attempts in the window; if the service serves 1 M
logins per month, the budget is 1 000 failed logins.

| Endpoint                              | Availability budget | Latency budget |
|---------------------------------------|---------------------|----------------|
| `POST /auth/login`                    | 0.1 %               | 1 %            |
| `GET  /search`                        | 0.5 %               | 5 %            |
| `GET  /listings/:id`                  | 0.1 %               | 1 %            |
| `POST /payments/callback/:provider`   | 0.05 %              | 1 %            |
| `POST /inquiries`                     | 0.1 %               | 1 %            |
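
One way to turn these budgets into a live number is to divide the observed error ratio by the budget. The sketch below does this for login availability (SLO 99.9 %, budget 0.001) over the 30-day window. It returns 1 when no budget has been spent, 0 when the budget is exactly used up, and a negative value when it is overspent. It assumes Prometheus retains at least 30 days of `http_requests_total` data.

```promql
# Remaining availability budget for POST /auth/login, as a fraction of the total budget.
1 - (
    (
      1 -
        sum(rate(http_requests_total{job="goodgo-api", route="/api/v1/auth/login", status_code!~"5.."}[30d]))
      / sum(rate(http_requests_total{job="goodgo-api", route="/api/v1/auth/login"}[30d]))
    )
  / 0.001
)
```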

---

## 4. Burn-Rate Alert Strategy

We use the standard Google SRE multi-window, multi-burn-rate alert pattern. A
burn rate of 1.0 means we are on track to consume exactly 100 % of the budget
over the SLO window. An alert fires only when both a short and a long evaluation
window are above the threshold at the same time; this suppresses false positives
from short blips without delaying detection of real outages.

| Severity          | Burn rate | Long window | Short window | Share of 30-day budget consumed if sustained |
|-------------------|-----------|-------------|--------------|----------------------------------------------|
| **fast / page**   | 14.4      | 1 h         | 5 m          | 2 % in 1 h                                   |
| **slow / ticket** | 6         | 6 h         | 30 m         | 5 % in 6 h                                   |
| **slow / ticket** | 3         | 24 h        | 2 h          | 10 % in 24 h                                 |
| **slow / ticket** | 1         | 3 d         | 6 h          | 10 % in 3 d                                  |

The first two rows are the mandatory pair from the GOO-119 deliverable
("burn-rate alerts: fast 1h, slow 6h"). The 24 h and 3 d rows are added because
they catch slow-burn regressions that the 1 h / 6 h pair would miss. They page
nobody and only open a ticket for the on-call rotation.

Each burn-rate threshold is implemented twice — once for availability, once for
latency — per endpoint.
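
To make the table concrete: the budget share consumed is burn rate × long window ÷ 720 h (30 days), so 14.4 × 1 h ÷ 720 h = 2 %. A fast-burn availability condition for login could then look like the sketch below, where the threshold is burn rate × budget = 14.4 × 0.001. It is written against the raw counter for clarity and is not necessarily the expression used in `slo.yaml`.

```promql
# Returns a sample only when BOTH the 1 h and the 5 m error ratios exceed 14.4x the budget.
(
  1 -
    sum(rate(http_requests_total{job="goodgo-api", route="/api/v1/auth/login", status_code!~"5.."}[1h]))
  / sum(rate(http_requests_total{job="goodgo-api", route="/api/v1/auth/login"}[1h]))
) > (14.4 * 0.001)
and
(
  1 -
    sum(rate(http_requests_total{job="goodgo-api", route="/api/v1/auth/login", status_code!~"5.."}[5m]))
  / sum(rate(http_requests_total{job="goodgo-api", route="/api/v1/auth/login"}[5m]))
) > (14.4 * 0.001)
```

The long window establishes that a meaningful share of budget is really gone; the short window makes the alert clear quickly once the error rate has recovered.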

---

## 5. Error Budget Policy

The error budget is the team's licence to ship. The policy is intentionally
simple so it can be applied without debate:

1. **Budget healthy (> 25 % remaining)** — Default. Ship freely.
2. **Budget at risk (10 – 25 % remaining)** — Feature work continues, but every
   PR touching the affected endpoint requires SRE sign-off, and a reliability
   task must be opened with priority `high`.
3. **Budget exhausted (≤ 10 % remaining or projected to exhaust within 7 days)**
   — Feature freeze on the affected endpoint. Only reliability fixes, rollbacks
   and config changes ship until the budget recovers above 25 %.
4. **Budget overspent (negative)** — Incident is declared; the on-call commander
   owns the freeze and the recovery plan.

The policy is enforced manually today; automation (PR labels, deploy gates) is
out of scope for this baseline ticket and tracked separately.

---

## 6. Implementation

The SLIs and burn-rate alerts above are implemented in
[`monitoring/prometheus/rules/slo.yaml`](../../monitoring/prometheus/rules/slo.yaml).

The file defines:

- One **recording-rule** group per endpoint (`slo:request:ratio_rate_<window>`,
  `slo:latency:ratio_rate_<window>`) for the windows used by the burn-rate
  alerts (5 m, 30 m, 1 h, 2 h, 6 h, 24 h, 3 d). Recording the ratios up front
  keeps the alerting expressions readable and cheap.
- One **alerting-rule** group per endpoint with the four burn-rate alerts for
  each of availability and latency (eight alerts per endpoint).
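
To illustrate why recording the ratios keeps alert expressions short, here is a sketch of the 6 h / 30 m slow-burn latency condition for login (latency budget 1 %, so the threshold is 6 × 0.01). The series names follow the `slo:latency:ratio_rate_<window>` pattern above, but the `_6h` / `_30m` suffix spelling, the presence of a `route` label, and the assumption that the recorded value is the success ratio are all hypothetical; check them against `slo.yaml`.

```promql
# ASSUMPTION: slo:latency:ratio_rate_<window> records the fraction of fast
# requests and keeps a `route` label. If slo.yaml records the slow fraction
# instead, drop the "1 -".
(1 - slo:latency:ratio_rate_6h{route="/api/v1/auth/login"}) > (6 * 0.01)
and
(1 - slo:latency:ratio_rate_30m{route="/api/v1/auth/login"}) > (6 * 0.01)
```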

`monitoring/prometheus/prometheus.yml` loads rule files through its `rule_files`
block. The `rules/*.yaml` glob in that block picks up `slo.yaml` when the
Prometheus container starts or reloads (see § 6.1 below). The extension matters:
a `*.yml` glob would not match `slo.yaml`.

### 6.1 Prometheus configuration

`prometheus.yml` is updated to load both the legacy `alert-rules.yml` and every
`.yaml` file in the new `rules/` directory:

```yaml
rule_files:
  - 'alert-rules.yml'
  - 'rules/*.yaml'
```

Reload Prometheus in dev with:

```bash
docker compose -f docker-compose.monitoring.yml kill -s SIGHUP prometheus
```

In production, the same SIGHUP is delivered by the deploy pipeline.

---

## 7. Dashboard

A Grafana dashboard is being built in [GOO-120](/GOO/issues/GOO-120). It will
expose:

- Per-endpoint SLO compliance (current 30-day window vs. target).
- Remaining error budget (absolute requests + percentage).
- Burn-rate over the last 1 h / 6 h / 24 h / 3 d.
- Drill-down: latency histogram + status-code breakdown.

This document will be updated with the dashboard URL once the panel is provisioned.

---

## 8. Review cadence

- **Monthly**: SRE reviews actual burn vs. target and the alert noise budget;
  thresholds are tightened or relaxed in this document via PR.
- **Quarterly**: Product + SRE jointly re-prioritise the endpoint list (top 5
  may change as new revenue surfaces ship).

Changes to SLO numbers, the endpoint list, or the burn-rate matrix MUST go
through PR review and reference this file.