# Alerts

> Threshold-based alerting with cooldowns and SLA tracking across eight LLM metrics.

Alerts close the loop between observability and action. When a metric breaches the threshold you set, Observatory raises an event, optionally notifies an external system, and tracks SLA against acknowledgement and resolution times.

***

## Anatomy of an alert

```mermaid theme={"system"}
flowchart LR
    Metric[Live metric] --> Rule{"Above<br/>threshold?"}
    Rule -->|Yes| Cooldown{"Inside<br/>cooldown?"}
    Cooldown -->|No| Fire[Create AlertEvent]
    Cooldown -->|Yes| Sleep[Skip]
    Fire --> Notify[Webhook / email]
    Fire --> SLA[Start SLA timer]
```

Two records back this:

* **AlertRule**: the user-defined rule. Metric, operator, threshold, cooldown, notification channel.
* **AlertEvent**: one occurrence of a rule firing. Carries acknowledged-at and resolved-at timestamps.

***

## Supported metrics

| Metric                 | Description                                                  |
| ---------------------- | ------------------------------------------------------------ |
| **error\_rate**        | Share of errored runs in the window.                         |
| **p50\_latency**       | Median latency in seconds.                                   |
| **p95\_latency**       | 95th-percentile latency in seconds.                          |
| **cost\_per\_hour**    | Aggregated cost across runs in the last hour.                |
| **token\_volume**      | Total tokens in the window.                                  |
| **drift\_composite**   | Composite drift score from [Drift Monitor](./drift-monitor). |
| **policy\_violations** | Count of policy evaluations marked as violated.              |
| **feedback\_negative** | Count of negative feedback events.                           |

***

## Operators

Pick the comparison that matches the metric:

| Operator | Use for                                                               |
| -------- | --------------------------------------------------------------------- |
| `>`      | "Above threshold": most common, used with latency, error rate, cost.  |
| `<`      | "Below threshold": used with feedback scores, success rate.           |
| `>=`     | Inclusive variant of `>`.                                             |
| `<=`     | Inclusive variant of `<`.                                             |

***

## Creating a rule

1. Click **Add rule**.
2. Set the metric, operator, and threshold. For example, `p95_latency > 8` seconds.
3. Set the cooldown. The default is 15 minutes. The same rule won't fire again inside the cooldown window even if the metric stays breached. This is what prevents flapping.
4. Choose the notification channel: email, webhook, or both. The webhook payload mirrors the `AlertEvent` shape.

Use the **Evaluate now** button on the rule row to fire a one-shot evaluation against current data, without touching the cooldown.

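
The threshold-then-cooldown check from the diagram can be sketched in a few lines of Python. This is an illustration only, not Observatory's implementation: the `AlertRule` fields, the `last_fired_at` bookkeeping, and the `should_fire` helper are hypothetical names chosen for the example, and the behaviour at the exact cooldown boundary is an assumption.

```python
import operator
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# The four comparison operators listed in the Operators table.
OPERATORS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


@dataclass
class AlertRule:
    """Hypothetical in-memory stand-in for a rule; field names are assumptions."""
    metric: str
    op: str
    threshold: float
    cooldown: timedelta = timedelta(minutes=15)  # default taken from the docs
    last_fired_at: Optional[datetime] = None     # assumed bookkeeping field


def should_fire(rule: AlertRule, value: float, now: datetime) -> bool:
    """Return True when an AlertEvent would be created for this reading."""
    # Step 1: is the metric on the wrong side of the threshold?
    if not OPERATORS[rule.op](value, rule.threshold):
        return False
    # Step 2: skip if the rule already fired inside the cooldown window.
    if rule.last_fired_at is not None and now - rule.last_fired_at < rule.cooldown:
        return False
    # Step 3: fire, and remember when, so the cooldown applies next time.
    rule.last_fired_at = now
    return True


rule = AlertRule(metric="p95_latency", op=">", threshold=8.0)
t0 = datetime(2024, 1, 1, 12, 0)
print(should_fire(rule, 9.2, t0))                          # True: first breach
print(should_fire(rule, 9.5, t0 + timedelta(minutes=5)))   # False: inside cooldown
print(should_fire(rule, 9.5, t0 + timedelta(minutes=16)))  # True: cooldown elapsed
print(should_fire(rule, 7.0, t0 + timedelta(minutes=40)))  # False: below threshold
```

The second call shows the anti-flapping effect: the metric is still breached, but no new event is created until the cooldown has elapsed.
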
***

## API

| Endpoint                               | Use                                                         |
| -------------------------------------- | ----------------------------------------------------------- |
| `GET /api/alerts/rules`                | List rules.                                                 |
| `POST /api/alerts/rules`               | Create a rule.                                              |
| `PUT /api/alerts/rules/{id}`           | Update a rule.                                              |
| `DELETE /api/alerts/rules/{id}`        | Delete a rule.                                              |
| `POST /api/alerts/rules/{id}/evaluate` | One-shot evaluation.                                        |
| `GET /api/alerts/events`               | List historical events.                                     |
| `POST /api/alerts/events/{id}/ack`     | Acknowledge an event (stops the time-to-acknowledge timer). |
| `POST /api/alerts/events/{id}/resolve` | Resolve an event (stops the time-to-resolve timer).         |

***

## SLA tracking

When an event fires, two timers start: time-to-acknowledge and time-to-resolve. The Alerts page shows current values and historical compliance against the SLA targets you set per rule.

Use this to:

* Prove operational readiness to auditors
* Spot rules that fire too often (noise) or never get acknowledged (ignored)

***

## Related resources

* [Drift Monitor](./drift-monitor): the source of the `drift_composite` metric.
* Every ack and resolve is captured in the audit log.
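
As a rough illustration of the two timers described under SLA tracking, the sketch below computes time-to-acknowledge and time-to-resolve for a single event and compares them with per-rule targets. It assumes both timers start when the event fires. The `fired_at` field, the `sla_status` helper, and the target parameters are hypothetical names for this example; only the acknowledged-at and resolved-at timestamps come from the `AlertEvent` description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AlertEvent:
    """Hypothetical stand-in for an event record; fired_at is an assumed field."""
    fired_at: datetime
    acknowledged_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None


def elapsed(start: datetime, end: Optional[datetime], now: datetime) -> timedelta:
    """Timer value: frozen at `end` once set, otherwise still running up to `now`."""
    return (end if end is not None else now) - start


def sla_status(event: AlertEvent, ack_target: timedelta,
               resolve_target: timedelta, now: datetime) -> dict:
    """Current timer values and whether each is within its SLA target."""
    tta = elapsed(event.fired_at, event.acknowledged_at, now)
    ttr = elapsed(event.fired_at, event.resolved_at, now)
    return {
        "time_to_acknowledge": tta,
        "time_to_resolve": ttr,
        "ack_within_sla": tta <= ack_target,
        "resolve_within_sla": ttr <= resolve_target,
    }


event = AlertEvent(
    fired_at=datetime(2024, 1, 1, 12, 0),
    acknowledged_at=datetime(2024, 1, 1, 12, 4),  # acknowledged after 4 minutes
)
status = sla_status(
    event,
    ack_target=timedelta(minutes=5),
    resolve_target=timedelta(hours=1),
    now=datetime(2024, 1, 1, 12, 50),             # still unresolved at 12:50
)
print(status["time_to_acknowledge"])  # 0:04:00
print(status["time_to_resolve"])      # 0:50:00
print(status["ack_within_sla"], status["resolve_within_sla"])  # True True
```

An unresolved event keeps its time-to-resolve timer running, so it can move from compliant to breached simply by staying open.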