# Alerts

> Threshold-based alerting with cooldowns and SLA tracking across eight LLM metrics.

Alerts close the loop between observability and action. When a metric breaches the threshold you set, Observatory raises an event, optionally notifies an external system, and measures acknowledgement and resolution times against your SLA targets.

***

## Anatomy of an alert

```mermaid theme={"system"}
flowchart LR
    Metric[Live metric] --> Rule{"Threshold<br/>breached?"}
    Rule -->|Yes| Cooldown{"Inside<br/>cooldown?"}
    Cooldown -->|No| Fire[Create AlertEvent]
    Cooldown -->|Yes| Sleep[Skip]
    Fire --> Notify[Webhook / email]
    Fire --> SLA[Start SLA timer]
```

Two records back this:

* **AlertRule** — the user-defined rule. Metric, operator, threshold, cooldown, notification channel.
* **AlertEvent** — one occurrence of a rule firing. Carries acknowledged-at and resolved-at timestamps.
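
The flow and the two records above can be sketched in a few lines of Python. This is a minimal illustration, not Observatory's implementation: the field names, the `evaluate` helper, and the exact cooldown boundary (skip while less than one full cooldown has elapsed since the last firing) are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AlertRule:
    # Field names are illustrative, not the Observatory schema.
    metric: str
    operator: str              # one of ">", "<", ">=", "<="
    threshold: float
    cooldown: timedelta
    channel: str               # e.g. "email" or "webhook"
    last_fired_at: Optional[datetime] = None


@dataclass
class AlertEvent:
    rule: AlertRule
    value: float
    fired_at: datetime
    acknowledged_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None


def evaluate(rule: AlertRule, value: float, now: datetime) -> Optional[AlertEvent]:
    """Return a new AlertEvent if the rule fires, else None."""
    breached = {
        ">": value > rule.threshold,
        "<": value < rule.threshold,
        ">=": value >= rule.threshold,
        "<=": value <= rule.threshold,
    }[rule.operator]
    if not breached:
        return None
    # Assumed semantics: still inside the cooldown window, so skip.
    if rule.last_fired_at is not None and now - rule.last_fired_at < rule.cooldown:
        return None
    rule.last_fired_at = now
    return AlertEvent(rule=rule, value=value, fired_at=now)


rule = AlertRule("p95_latency", ">", 8.0, timedelta(minutes=15), "webhook")
t0 = datetime(2024, 1, 1, 12, 0)
first = evaluate(rule, 9.2, t0)                          # fires
second = evaluate(rule, 9.5, t0 + timedelta(minutes=5))  # skipped: inside cooldown
```

Keeping `last_fired_at` on the rule is one simple way to make the cooldown check cheap; a real system might derive it from the most recent event instead.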

***

## Supported metrics

| Metric                 | Description                                                  |
| ---------------------- | ------------------------------------------------------------ |
| **error\_rate**        | Share of errored runs in the window.                         |
| **p50\_latency**       | Median latency in seconds.                                   |
| **p95\_latency**       | 95th-percentile latency in seconds.                          |
| **cost\_per\_hour**    | Aggregated cost across runs in the last hour.                |
| **token\_volume**      | Total tokens in the window.                                  |
| **drift\_composite**   | Composite drift score from [Drift Monitor](./drift-monitor). |
| **policy\_violations** | Count of policy evaluations marked as violated.              |
| **feedback\_negative** | Count of negative feedback events.                           |

***

## Operators

Pick the comparison that matches the metric:

| Operator | Use for                                                               |
| -------- | --------------------------------------------------------------------- |
| `>`      | "Above threshold" — most common, used with latency, error rate, cost. |
| `<`      | "Below threshold" — used with feedback scores, success rate.          |
| `>=`     | Inclusive variant of `>`: also fires at exactly the threshold.        |
| `<=`     | Inclusive variant of `<`: also fires at exactly the threshold.        |
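
The difference between the strict and inclusive operators only shows up when the metric equals the threshold exactly. A small sketch using Python's standard `operator` module makes this concrete; the `COMPARATORS` table and `is_breached` helper are hypothetical names used for illustration.

```python
import operator

# Map the four documented operator symbols onto Python comparison functions.
COMPARATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def is_breached(value: float, op: str, threshold: float) -> bool:
    """Return True when `value <op> threshold` holds."""
    return COMPARATORS[op](value, threshold)


# At the boundary (value == threshold) only the inclusive variants breach.
boundary = {op: is_breached(8.0, op, 8.0) for op in COMPARATORS}
```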

***

## Creating a rule

<Steps>
  <Step title="Open Alerts → Rules">
    Click **Add rule**.
  </Step>

  <Step title="Pick a metric and threshold">
    For example, `p95_latency > 8` seconds.
  </Step>

  <Step title="Set the cooldown">
    The default is 15 minutes. The same rule won't fire again inside the cooldown window, even if the metric stays breached. This is what prevents flapping.
  </Step>

  <Step title="Choose the destination">
    Email, webhook, or both. The webhook payload mirrors the `AlertEvent` shape.
  </Step>

  <Step title="Save and test">
    Use the **Evaluate now** button on the rule row to fire a one-shot evaluation against current data, without touching the cooldown.
  </Step>
</Steps>
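
To see why the cooldown step matters, the sketch below replays one hour of a metric that oscillates around an 8-second threshold and counts how often a `>` rule would fire with and without a 15-minute cooldown. The `fire_minutes` helper, the once-a-minute sampling, and the boundary rule (a new firing is allowed once a full cooldown has elapsed) are assumptions made for the example.

```python
def fire_minutes(samples, threshold, cooldown_minutes):
    """Return the minutes at which a `>` rule would fire.

    `samples` is a list of (minute, value) pairs in time order.
    """
    fired = []
    for minute, value in samples:
        if value <= threshold:
            continue  # not breached
        if fired and minute - fired[-1] < cooldown_minutes:
            continue  # inside cooldown: skip
        fired.append(minute)
    return fired


# One hour of a latency metric flapping around 8 s, sampled once a minute.
samples = [(m, 9.0 if m % 2 == 0 else 7.0) for m in range(60)]

without_cooldown = fire_minutes(samples, threshold=8.0, cooldown_minutes=0)
with_cooldown = fire_minutes(samples, threshold=8.0, cooldown_minutes=15)
```

Under these assumptions the rule fires on every breached sample (30 times) without a cooldown, and only four times with it.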

***

## API

| Endpoint                               | Use                                      |
| -------------------------------------- | ---------------------------------------- |
| `GET /api/alerts/rules`                | List rules.                              |
| `POST /api/alerts/rules`               | Create a rule.                           |
| `PUT /api/alerts/rules/{id}`           | Update a rule.                           |
| `DELETE /api/alerts/rules/{id}`        | Delete a rule.                           |
| `POST /api/alerts/rules/{id}/evaluate` | One-shot evaluation.                     |
| `GET /api/alerts/events`               | List historical events.                  |
| `POST /api/alerts/events/{id}/ack`     | Acknowledge an event (stops ack timer).  |
| `POST /api/alerts/events/{id}/resolve` | Resolve an event (stops resolve timer).  |
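
As a rough sketch of how a client might call these endpoints, the snippet below builds (but does not send) two requests with Python's standard `urllib.request`. The host name, the event id, and every key in the request body are placeholders: the table above documents only the paths and methods, so the payload schema and any authentication headers are assumptions to check against your deployment.

```python
import json
import urllib.request

BASE_URL = "https://observatory.example.com"  # placeholder host


def create_rule_request(rule: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /api/alerts/rules request."""
    return urllib.request.Request(
        f"{BASE_URL}/api/alerts/rules",
        data=json.dumps(rule).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def event_action_request(event_id: str, action: str) -> urllib.request.Request:
    """Build a POST request for the `ack` or `resolve` endpoint of an event."""
    if action not in ("ack", "resolve"):
        raise ValueError(f"unsupported action: {action}")
    return urllib.request.Request(
        f"{BASE_URL}/api/alerts/events/{event_id}/{action}",
        method="POST",
    )


# Hypothetical payload: these key names are guesses, not the documented schema.
rule_body = {
    "metric": "p95_latency",
    "operator": ">",
    "threshold": 8,
    "cooldown_minutes": 15,
}
create_req = create_rule_request(rule_body)
ack_req = event_action_request("evt_123", "ack")
# Send with urllib.request.urlopen(create_req) once the host and auth are known.
```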

***

## SLA tracking

When an event fires, two timers start: time-to-acknowledge and time-to-resolve. The Alerts page shows current values and historical compliance against the SLA targets you set per rule. Use this to:

* Prove operational readiness to auditors
* Spot rules that fire too often (noise) or never get acknowledged (ignored)
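
The two timers can be expressed as simple timestamp differences. The sketch below assumes both are measured from the moment the event fires and that an open timer keeps running until the current time; the `sla_status` helper and the target parameters are hypothetical names, not part of the product API.

```python
from datetime import datetime, timedelta
from typing import Optional


def sla_status(fired_at: datetime,
               acknowledged_at: Optional[datetime],
               resolved_at: Optional[datetime],
               ack_target: timedelta,
               resolve_target: timedelta,
               now: datetime) -> dict:
    """Measure both timers from `fired_at`; open timers run until `now`."""
    time_to_ack = (acknowledged_at or now) - fired_at
    time_to_resolve = (resolved_at or now) - fired_at
    return {
        "time_to_ack": time_to_ack,
        "time_to_resolve": time_to_resolve,
        "ack_within_sla": time_to_ack <= ack_target,
        "resolve_within_sla": time_to_resolve <= resolve_target,
    }


fired = datetime(2024, 1, 1, 12, 0)
status = sla_status(
    fired_at=fired,
    acknowledged_at=fired + timedelta(minutes=4),
    resolved_at=fired + timedelta(minutes=50),
    ack_target=timedelta(minutes=5),
    resolve_target=timedelta(minutes=30),
    now=fired + timedelta(hours=1),
)
```

In this example the event was acknowledged inside its 5-minute target but resolved after its 30-minute target, so it would count against resolution compliance only.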

***

## Related resources

<CardGroup cols={2}>
  <Card title="Drift Monitor" icon="wave-pulse" href="./drift-monitor">
    The source of the `drift_composite` metric.
  </Card>

  <Card title="Audit Trail" icon="clipboard-list" href="../governance/overview">
    Every ack and resolve is captured in the audit log.
  </Card>
</CardGroup>
