# LLMWise — Full Platform Documentation

> Multi-model LLM API orchestration platform. One API key to access 30+ models from OpenAI, Anthropic, Google, DeepSeek, Meta, Mistral, xAI, and more. Orchestration modes: Chat, Compare, Blend, Judge, plus Failover routing. OpenAI-style messages, credit-based pay-per-use, no subscription.

- Base URL: https://llmwise.ai
- API base: https://llmwise.ai/api/v1
- Auth: Bearer token (mm_sk_ prefix) or Clerk JWT
- Streaming: Server-Sent Events (SSE)

## Supported Models

| ID | Name | Provider | Vision |
|----|------|----------|--------|
| auto | Auto (smart routing) | LLMWise | Yes |
| gpt-5.2 | GPT-5.2 | OpenAI | Yes |
| claude-sonnet-4.5 | Claude Sonnet 4.5 | Anthropic | Yes |
| gemini-3-flash | Gemini 3 Flash | Google | Yes |
| claude-haiku-4.5 | Claude Haiku 4.5 | Anthropic | No |
| deepseek-v3 | DeepSeek V3 | DeepSeek | No |
| llama-4-maverick | Llama 4 Maverick | Meta | No |
| mistral-large | Mistral Large | Mistral | No |
| grok-3 | Grok 3 | xAI | Yes |
| zai-glm-5 | GLM 5 | Z.ai | No |
| liquid-lfm-2.2-6b | LFM2 2.6B | LiquidAI | No |
| liquid-lfm-2.5-1.2b-thinking-free | LFM2.5 1.2B Thinking (Free) | LiquidAI | No |
| liquid-lfm2-8b-a1b | LFM2 8B A1B | LiquidAI | No |
| minimax-m2.5 | MiniMax M2.5 | MiniMax | No |
| llama-3.3-70b-instruct | Llama 3.3 70B Instruct | Meta | No |
| gpt-oss-20b | GPT OSS 20B | OpenAI | No |
| gpt-oss-120b | GPT OSS 120B | OpenAI | No |
| gpt-oss-safeguard-20b | GPT OSS Safeguard 20B | OpenAI | No |
| kimi-k2.5 | Kimi K2.5 | MoonshotAI | Yes |
| nemotron-3-nano-30b-a3b | Nemotron 3 Nano 30B | NVIDIA | No |
| nemotron-nano-12b-v2-vl | Nemotron Nano 12B VL | NVIDIA | Yes |
| claude-opus-4.6 | Claude Opus 4.6 | Anthropic | Yes |
| claude-opus-4.5 | Claude Opus 4.5 | Anthropic | Yes |
| arcee-coder-large | Arcee Coder Large | Arcee AI | No |
| arcee-trinity-large-preview-free | Arcee Trinity Large (Free) | Arcee AI | No |
| qwen3-coder-next | Qwen3 Coder Next | Qwen | No |
| olmo-3.1-32b-think | OLMo 3.1 32B Think | AllenAI | No |
| llama-guard-3-8b | Llama Guard 3 8B | Meta | No |
| gpt-4o-2024-08-06 | GPT-4o (2024-08-06) | OpenAI | Yes |
| gpt-audio | GPT Audio | OpenAI | No |
| openrouter-free | OpenRouter Free | OpenRouter | Yes |
| openrouter-auto | OpenRouter Auto | OpenRouter | Yes |

> Note: OpenRouter free-model entries are synced dynamically into the backend catalog.
> For the live free list, call `GET /api/v1/models` and filter `is_free=true`.

## Orchestration Modes

### Chat (1 credit)

Endpoint: POST /api/v1/chat

Single-model chat with OpenAI-style messages (role + content) and streaming SSE.

### Compare (2 credits)

Endpoint: POST /api/v1/compare

Same prompt hits 2-9 models simultaneously. Responses stream back with per-model latency, tokens, and cost.

### Blend (4 credits)

Endpoint: POST /api/v1/blend

Multiple models respond, then a synthesizer combines the strongest parts. Strategies: consensus, council, best_of, chain, moa, self_moa.

### Judge (5 credits)

Endpoint: POST /api/v1/judge

Contestant models compete on your prompt. A judge model scores, ranks, and explains why one wins.

### Failover Routing (1 credit)

Endpoint: POST /api/v1/chat (with routing parameter)

Primary model hits 429 or goes down? Auto-failover to backup chain. Circuit breakers, health checks, zero downtime.
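To make the endpoint list above concrete, here is an illustrative Compare call in Python. The payload fields (`models`, `messages`, `stream`) mirror the Blend request examples later in this document; treat the exact shape as a sketch and confirm it against the API Explorer before relying on it.

```python
import os

import requests

# Illustrative Compare call: fan one prompt out to several models in a single request.
resp = requests.post(
    "https://llmwise.ai/api/v1/compare",
    headers={"Authorization": f"Bearer {os.environ['LLMWISE_API_KEY']}"},
    json={
        "models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
        "messages": [{"role": "user", "content": "Explain idempotency keys in two paragraphs."}],
        "stream": True,
    },
    stream=True,
    timeout=300,
)
resp.raise_for_status()

for line in resp.iter_lines():
    if line:
        print(line.decode())  # raw SSE lines: per-model chunks, completions, and the summary
```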
## Pricing

- Free Trial: 20 credits, never expire, no credit card required
- Pay-per-use: Add credits anytime, paid credits never expire
- Auto top-up: Optional automatic refill with monthly safety cap
- Enterprise: Custom limits, team billing, SLAs — contact sales@llmwise.ai

---

# Documentation

## Getting Started

### Quick Start Guide

## What you get immediately

Every new account receives **20 free credits**. One credit = one Chat request. No credit card required to start.

- OpenAI-style messages format (role + content)
- Chat, Compare, Blend, Judge, and Mesh modes
- Unified usage + charged credits visibility
- Optimization and replay workflows for policy tuning

## 10-minute setup

1. Create an account at `/sign-up` — you receive 20 free credits instantly.
2. Generate an API key at `/keys`.
3. Open `/api-explorer` and run your first request.
4. Open `/chat` and test `Auto` mode.
5. Open `/usage` to confirm charged credits and response latency.

## First request

```bash
curl -X POST https://llmwise.ai/api/v1/chat \
  -H "Authorization: Bearer mm_sk_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "optimization_goal": "balanced",
    "messages": [
      {"role": "user", "content": "Give me a launch checklist for an AI API product."}
    ],
    "stream": true
  }'
```

## What success looks like

In streaming mode, watch for a final `done` payload including:

- `finish_reason`
- `resolved_model`
- `credits_charged`
- `credits_remaining`

### Dashboard User Guide

## Dashboard map

## Mode behavior

## Suggested daily workflow

1. Start in Chat with `Auto` and the `Balanced` goal.
2. For critical prompts, run Compare before standardizing.
3. Use Blend/Judge only for high-value outputs.
4. Add a Mesh chain for reliability-sensitive flows.
5. Check Usage daily and Replay weekly.

## How to read the Usage page correctly

- **Charged credits**: what your wallet is billed.
- **Latency**: request performance for user experience.
- **Tokens**: workload profile for model selection decisions.

## API Core

### Authentication and API Keys

## Authentication model

LLMWise supports two authentication methods:

- **API keys** — `mm_sk_`-prefixed tokens generated at `/keys`
- **Clerk JWTs** — session tokens from the web dashboard

Both methods use the same `Authorization: Bearer <token>` header. The backend detects which method you are using by the token prefix.

## API key details

- **Prefix:** `mm_sk_` followed by 64 hex characters
- **Storage:** Keys are SHA-256 hashed before storage — the raw key is only shown once at generation time
- **One key per account** at a time. Generating a new key invalidates the previous one.

### Key lifecycle

### Chat API Reference

## Endpoint

## Request fields

## Streaming events

In single-model chat mode, SSE messages are plain JSON chunks that include a `delta` field (no explicit `event` field). In Mesh/failover mode (when `routing` is set, or when Auto uses an implicit fallback chain), chunks are wrapped in explicit events (`event: "route" | "chunk" | "trace"`), followed by a final `done` payload with billing metadata.
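If you consume the stream without an SDK, a minimal parsing sketch for the single-model case looks like the following. It assumes standard `data:`-prefixed SSE lines whose JSON bodies carry the `delta`/`done` fields described above; treat it as a sketch rather than a canonical client.

```python
import json
import os

import requests

resp = requests.post(
    "https://llmwise.ai/api/v1/chat",
    headers={"Authorization": f"Bearer {os.environ['LLMWISE_API_KEY']}"},
    json={
        "model": "auto",
        "messages": [{"role": "user", "content": "Say hello."}],
        "stream": True,
    },
    stream=True,
    timeout=120,
)
resp.raise_for_status()

for line in resp.iter_lines():
    if not line or not line.startswith(b"data:"):
        continue  # skip blank lines and SSE comments between events
    try:
        chunk = json.loads(line[len(b"data:"):].strip())
    except ValueError:
        continue  # ignore any non-JSON sentinel lines
    if chunk.get("delta"):
        print(chunk["delta"], end="", flush=True)
    if chunk.get("event") == "done" or chunk.get("done"):
        print("\ncredits_remaining:", chunk.get("credits_remaining"))
        break
```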
## Request example

```json
{
  "model": "auto",
  "cost_saver": true,
  "optimization_goal": "cost",
  "messages": [
    {"role": "user", "content": "Design retry logic for API failures."}
  ],
  "semantic_memory": true,
  "semantic_top_k": 4,
  "stream": true
}
```

## Done event example

```json
{
  "event": "done",
  "id": "request_uuid",
  "resolved_model": "deepseek-v3",
  "finish_reason": "stop",
  "credits_charged": 1,
  "credits_remaining": 2038
}
```

## Non-stream response example

```json
{
  "id": "request_uuid",
  "model": "gpt-5.2",
  "content": "...",
  "prompt_tokens": 42,
  "completion_tokens": 312,
  "latency_ms": 1180,
  "cost": 0.0039,
  "credits_charged": 1,
  "credits_remaining": 2038,
  "finish_reason": "stop",
  "mode": "chat"
}
```

### Auto Routing and Optimization (Load Balancer Mode)

## What Auto does (in one sentence)

`model="auto"` turns LLMWise into a **load balancer for LLMs**: it picks the best primary model for each request and (optionally) applies an implicit fallback chain so transient failures do not break your flow.

## Auto decision flow

When you send a Chat request with `model="auto"`, the backend:

1. Builds a candidate model set (vision-safe if your messages contain images).
2. Loads your **optimization policy** (defaults + guardrails).
3. Resolves a goal: `balanced | cost | latency | reliability`.
4. Chooses a primary model using one of two strategies:
   - `historical_optimization`: uses your recent production traces when there is enough data.
   - `heuristic_routing`: uses a fast heuristic classifier when history is insufficient or policy disables history.

The final model is returned to you in `resolved_model` on the `done` event (streaming) or in the JSON response (non-stream).

## Auto as a load balancer (implicit failover)

Auto can also add a fallback chain even if you do not provide `routing`. This is controlled by your optimization policy:

- If `max_fallbacks > 0`, Auto will attach a fallback chain to the request.
- If `max_fallbacks = 0`, Auto will run as **single-model routing only** (no implicit failover).

When an implicit chain is active, LLMWise retries on retryable failures (429/5xx/timeouts), emits routing events (`route`, `trace`), and settles billing once a final model succeeds.

## Cost saver mode (shortcut)

If you send `cost_saver: true`, the server normalizes your request to:

- `model = "auto"`
- `optimization_goal = "cost"`

This is supported for `POST /api/v1/chat` only (not with explicit `routing`).

## What you see in streaming

In streaming mode (`stream: true`), you will see:

- **delta chunks**: JSON objects with a `delta` field (text) and a `done` boolean.
- **Mesh/Auto failover events** (only when a fallback chain is active):
  - `event: "route"`: model attempts (trying/failed/skipped)
  - `event: "chunk"`: streamed deltas (event-wrapped)
  - `event: "trace"`: final routing summary
- **final billing event**:
  - `event: "done"` with `credits_charged`, `credits_remaining`, and (when Auto is used) `resolved_model`, `auto_strategy`, `optimization_goal`.
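When a fallback chain is active, the same parsing loop needs to branch on the `event` field. Below is a minimal dispatcher sketch: the event names come from the list above, while the per-event payload fields for `route` (`model`, `status`) are illustrative assumptions.

```python
def handle_stream_event(payload: dict) -> bool:
    """Handle one parsed SSE JSON object; return True when the stream is finished."""
    event = payload.get("event")
    if event == "route":
        # Model attempt updates: trying / failed / skipped (field names assumed).
        print(f"[route] {payload.get('model')}: {payload.get('status')}")
    elif event == "chunk":
        print(payload.get("delta", ""), end="", flush=True)
    elif event == "trace":
        print("\n[trace]", payload)  # final routing summary
    elif event == "done":
        print("\nresolved_model:", payload.get("resolved_model"))
        print("credits_charged:", payload.get("credits_charged"))
        print("credits_remaining:", payload.get("credits_remaining"))
        return True
    else:
        # Plain delta chunks (no event field) — the single-model portion of the stream.
        if payload.get("delta"):
            print(payload["delta"], end="", flush=True)
    return False
```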
## API examples

### cURL (Auto + cost saver)

```bash
curl -X POST https://llmwise.ai/api/v1/chat \
  -H "Authorization: Bearer mm_sk_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "cost_saver": true,
    "messages": [{"role":"user","content":"Summarize this support thread."}],
    "stream": true
  }'
```

### Python (SDK)

```python
import os
from llmwise import LLMWise

client = LLMWise(os.environ["LLMWISE_API_KEY"])

for ev in client.chat_stream(
    model="auto",
    optimization_goal="balanced",
    messages=[{"role": "user", "content": "Write a launch plan for a SaaS product."}],
):
    if ev.get("delta"):
        print(ev["delta"], end="", flush=True)
    if ev.get("event") == "done":
        print("\n\nresolved_model:", ev.get("resolved_model"))
        break
```

### TypeScript (SDK)

```ts
import { LLMWise } from "llmwise";

const client = new LLMWise(process.env.LLMWISE_API_KEY!);

for await (const ev of client.chatStream({
  model: "auto",
  optimization_goal: "cost",
  messages: [{ role: "user", content: "Draft a short outbound email to a CTO." }],
})) {
  if (ev.delta) process.stdout.write(ev.delta);
  if (ev.event === "done") {
    console.log("\nresolved_model:", (ev as any).resolved_model);
    break;
  }
}
```

### Compare / Blend / Judge API Reference

## Endpoint matrix

## Compare behavior

- Runs all selected models concurrently.
- Emits per-model completion events.
- Emits summary metadata (`fastest`, `longest`).
- Refunds when all models fail.

## Blend behavior

Blend supports strategies:

- `consensus`
- `council`
- `best_of`
- `chain`
- `moa` (Mixture-of-Agents refinement layers)
- `self_moa` (Self-MoA: multiple candidates from one base model)

Notes:

- Most strategies require **2+ models**. Passing 1 model returns a 400 error.
- For `self_moa`, pass exactly **1 model** in `models[]` and set `samples` (2–8).
- For `moa`, set `layers` (1–3). Each layer refines answers using the previous layer as references.

## Judge behavior

Judge mode collects contestant outputs, then prompts the judge model to return ranked JSON. The judge model cannot be one of the contestants.

```json
{
  "event": "verdict",
  "winner": "claude-sonnet-4.5",
  "scores": [
    {"model": "claude-sonnet-4.5", "rank": 1, "score": 9.2, "reasoning": "..."},
    {"model": "gpt-5.2", "rank": 2, "score": 8.8, "reasoning": "..."}
  ],
  "overall": "Claude response was more complete and better structured."
}
```

## Failure semantics

### API Explorer Guide

## Why API Explorer exists

API Explorer is the fastest way to validate payload structure and endpoint behavior before coding SDK integration.

- Mode-specific payload templates
- Live request execution with your API key
- Stream event inspector (delta chunks, `route`/`chunk`/`trace`, `done`, terminal errors)
- Raw and parsed output panes
- Product-scoped assistant for endpoint-specific snippet generation

## Typical debugging sequence

## Good assistant prompts

- "Generate Node.js fetch example with retries for this payload."
- "Show Python SSE parser for done events and finish_reason handling."
- "Explain why this request returned 402 and what user action fixes it."

## Tutorials

### Mesh Mode Tutorial (Failover Routing)

## When to use Mesh

Use Mesh mode for reliability-sensitive traffic where a single provider failure is not acceptable.

- Frequent 429 bursts
- Provider latency spikes
- High-value requests that must complete

## Mesh failover model

### Replay Lab Tutorial

## What Replay Lab does

Replay Lab simulates historical request traffic against your current policy to estimate impact before you change production behavior.
- Cost deltas
- Latency deltas
- Reliability and success-rate deltas

## Replay flow

### Prompt Regression Testing Tutorial

## What this feature covers

- Prebuilt prompt templates
- Custom suite creation
- Manual and scheduled test runs
- CSV export for historical tracking

## Workflow

### Blend Strategies & Orchestration Algorithms

LLMWise orchestrates multiple models through several algorithmic layers. This guide explains every strategy and algorithm in depth, with special focus on **Blend mode** — the most configurable.

## Blend mode overview

Blend sends your prompt to multiple models simultaneously, then feeds all responses into a **synthesizer** model that produces one final answer. The synthesis behavior changes depending on which **strategy** you choose.

All strategies follow the same two-phase execution: first every source model answers the prompt, then the synthesizer combines those responses into the final answer.

## Strategy: Consensus

The default strategy. The synthesizer receives all source responses and is instructed to combine the strongest points while resolving any contradictions.

- Single-pass synthesis — no refinement layers
- Synthesizer decides which parts of each response to keep
- Contradictions are resolved by weighing the majority view

```json
{
  "models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
  "synthesizer": "claude-sonnet-4.5",
  "strategy": "consensus",
  "messages": [{"role": "user", "content": "Explain quantum entanglement"}]
}
```

## Strategy: Council

Structures the synthesis as a deliberation. The synthesizer produces:

1. **Final answer** — the synthesized conclusion
2. **Agreement points** — where all models aligned
3. **Disagreement points** — where models diverged, with analysis
4. **Follow-up questions** — areas that need further exploration

Best when you want transparency about model consensus vs. divergence.

## Strategy: Best-Of

The synthesizer picks the single best response, then enhances it with useful additions from the others. The quickest synthesis approach — minimal rewriting, focused on augmentation.

## Strategy: Chain

Iterative integration. The synthesizer works through each response sequentially, building a comprehensive answer by incrementally incorporating each model's contribution. Produces the most thorough output but may be longer.

## Strategy: MoA (Mixture of Agents)

The most sophisticated strategy. Inspired by the [Mixture-of-Agents](https://arxiv.org/abs/2406.04692) paper, MoA adds **refinement layers** where models can see and improve upon previous answers.

### How MoA layers work

1. **Layer 0**: Each model answers the prompt independently (same as other strategies).
2. **Layer 1+**: Each model receives the previous layer's answers as reference material, injected via system message. Models are instructed to improve upon, correct, and expand the references.
3. **Final synthesis**: The synthesizer combines all responses from the last completed layer.

### Reference injection

Previous-layer answers are injected into each model's context:

- **Total reference budget**: 12,000 characters across all references
- **Per-answer cap**: 3,200 characters (truncated if longer)
- **Injection method**: System message + follow-up user message containing formatted references

### Early stopping

If a layer produces zero successful responses, MoA keeps the previous layer's successes and skips to synthesis. This prevents total failure when models hit rate limits or errors.
```json
{
  "models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
  "synthesizer": "claude-sonnet-4.5",
  "strategy": "moa",
  "layers": 2,
  "messages": [{"role": "user", "content": "Design a rate limiter for a distributed system"}]
}
```

## Strategy: Self-MoA

Self-MoA generates diverse candidates from a **single model** by varying temperature and system prompts. This is useful when you trust one model but want to hedge against its variance.

### How it works

1. You provide exactly **1 model** in `models[]`
2. Set `samples` (2–8, default 4) for how many candidates to generate
3. Each candidate runs with a different **temperature offset** and **agent prompt**
4. The synthesizer combines all candidates into one final answer

### Temperature variation

Each candidate gets a different temperature to encourage diversity:

```
Base offsets: [-0.25, 0.0, +0.25, +0.45, +0.15, +0.35, -0.1, +0.3]
Final temp = clamp(base_temp + offset, 0.2, 1.4)
```

For example, with `temperature: 0.7` and 4 samples:

- Candidate 1: temp 0.45 (conservative)
- Candidate 2: temp 0.70 (baseline)
- Candidate 3: temp 0.95 (creative)
- Candidate 4: temp 1.15 (exploratory)

### Agent prompt rotation

Six distinct system prompts rotate across candidates, each emphasizing a different quality — including **Clarity** (plain-language explanations) and **Skepticism** (challenge assumptions, flag weaknesses).

```json
{
  "models": ["claude-sonnet-4.5"],
  "synthesizer": "claude-sonnet-4.5",
  "strategy": "self_moa",
  "samples": 4,
  "temperature": 0.7,
  "messages": [{"role": "user", "content": "Write a Python async rate limiter"}]
}
```

## Blend credit cost

All blend strategies cost **4 credits** regardless of strategy or model count. Credits are reserved upfront and refunded if all source models fail. After completion, billing is settled to actual execution usage — you may receive a partial refund if usage is lower than the reservation.

## Compare mode algorithm

Compare runs 2–9 models concurrently and streams their responses side-by-side.

- All models stream via an `asyncio.Queue` — chunks are yielded in arrival order (not round-robin)
- Queue timeout: 120 seconds per chunk
- After all models finish, a **summary event** reports the fastest model and longest response
- Total latency = max(individual latencies) — the bottleneck is the slowest model
- Cost: 2 credits. Refunded if all models fail; partial status logged if some succeed.

## Judge mode algorithm

Judge runs a three-phase competitive evaluation.

### Scoring system

The judge produces structured JSON with rankings sorted by score descending:

```json
{
  "rankings": [
    {"model": "claude-sonnet-4.5", "rank": 1, "score": 9.2, "reasoning": "Most complete and well-structured"},
    {"model": "gpt-5.2", "rank": 2, "score": 8.8, "reasoning": "Accurate but less organized"}
  ],
  "overall_analysis": "Claude response covered more edge cases..."
}
```

**Default evaluation criteria**: accuracy, completeness, clarity, helpfulness, code quality. You can override these with the `criteria` parameter.

**Fallback scoring**: If the judge returns malformed JSON, default scores are assigned: `8.0 - (i * 0.5)` for each contestant in order, with a note that scores were auto-assigned.

Cost: 5 credits.

## Mesh mode: circuit breaker failover

When you use mesh mode (chat with the `routing` parameter), LLMWise tries models in sequence with automatic failover powered by a circuit breaker.

### Circuit breaker state machine

Each model tracks health in-memory.
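The sketch below illustrates that state machine using the thresholds documented under "Circuit breaker (Mesh mode)" in the Rate Limits section — 3 consecutive failures open the circuit for 30 seconds, then a half-open probe decides whether it closes or reopens. It is a simplification of the real breaker (for one thing, it does not limit the half-open window to a single in-flight probe).

```python
import time
from typing import Optional


class ModelCircuitBreaker:
    """Per-model breaker: closed -> open (after 3 consecutive failures) -> half-open probe."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True  # half-open: let a test request through
        return False     # open: skip this model and move to the next fallback

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None  # probe (or normal request) succeeded: close the circuit

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # open (or reopen) for another cooldown
```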
### Failover sequence

1. Try the **primary model** first
2. If it fails (or its circuit is open), try **fallback 1**, then **fallback 2**, etc.
3. For each attempt: emit a `route` event (`trying`, `failed`, or `skipped`)
4. First success stops the chain — no further fallbacks are tried
5. After all attempts, emit a `trace` event summarizing the route

### Latency tracking

Model latency is tracked with exponential smoothing:

```
avg_latency = (avg_latency * 0.8) + (new_latency * 0.2)
```

This favors recent measurements, so a model that recovers from a slow period will quickly show improved latency.

## Auto-router: heuristic classification

When you set `model: "auto"`, LLMWise classifies your query using **zero-latency regex matching** (no LLM call overhead) and routes to the best model.

### Policy-based routing

If you have an **optimization policy** enabled with sufficient historical data, the auto-router upgrades from regex heuristics to **historical optimization** — routing based on actual performance data from your past requests. See the next section.

## Optimization scoring algorithm

The optimization engine analyzes your historical request logs and recommends the best model + fallback chain for each goal.

### Goals and weight vectors

Each goal uses different weights across the three scoring dimensions: `Ws` for success rate, `Wl` for latency, and `Wc` for cost.

### Scoring formula

For each eligible model (minimum 3 calls in the lookback window):

```
inv_latency   = (max_latency - model_latency) / (max_latency - min_latency)
inv_cost      = (max_cost - model_cost) / (max_cost - min_cost)
raw_score     = (Ws * success_rate) + (Wl * inv_latency) + (Wc * inv_cost)
sample_factor = min(1.0, calls / 20)
score         = raw_score * (0.7 + 0.3 * sample_factor)
```

The **sample factor** gives a small boost to models with more data — a model with 20+ calls gets the full score, while a model with only a handful of calls loses up to 30% of its raw score. Preferred models get an additional `+0.04 * sample_factor` bonus.

### Confidence score

```
confidence = min(1.0, total_calls / 60)
```

At 60+ total calls across all models, confidence reaches 1.0 (full certainty). Below that, the recommendation carries a lower confidence signal.

### Guardrails

After scoring, models are filtered through policy guardrails:

- **Max latency**: Reject models above the threshold (e.g., 5000ms)
- **Max cost**: Reject models above the per-request cost cap (e.g., $0.05)
- **Min success rate**: Reject models below the reliability threshold (e.g., 0.95)

The top model that passes all guardrails becomes the **recommended primary**. The next N models become the **fallback chain** (configurable, 0–6 fallbacks).

## Credit settlement algorithm

LLMWise uses a three-phase credit system: reserve upfront, execute, then settle against actual usage.

### Settlement formula

Reserved credits are debited at request start. After execution, LLMWise reconciles that reserve against actual usage.

- If usage is lower than the reserved credits, unused credits are refunded.
- If usage is higher, we charge only the difference.

BYOK requests keep provider-facing billing and remain on **0 credits**.

## Billing & Limits

### Billing and Credits

## Billing principle

Users are billed in **credits**, not raw provider token costs. One dollar buys 100 credits.

- Mode-level default charge is fixed per request (reserved upfront)
- After the request completes, a settlement step reconciles actual execution usage
- Wallet balance is shown in `/credits`
- **Paid credits never expire**

## Free trial

Every new account receives **20 free credits** on signup. Free credits never expire — use them at your own pace. Purchase additional credit packs anytime to add more credits to your wallet.
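For a sense of scale: one dollar buys 100 credits and mode charges are fixed (Chat 1, Compare 2, Blend 4, Judge 5), so the 20 free credits cover roughly 20 Chat requests, 10 Compares, 5 Blends, or 4 Judge runs, and the $3 minimum top-up adds another 300 credits — all before any settlement adjustments.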
## Default charges

## How settlement works

Credits are **reserved** before the request starts, then **settled** after execution: if actual usage exceeds the reserved credits, the difference is charged. If usage is lower, unused credits are refunded. All adjustments appear as separate transactions in your history.

## Top-up flow

Minimum top-up is $3. Maximum single top-up is $10,000.

## Auto top-up

Enable automatic refills so requests never fail due to low balance:

1. Complete one Stripe checkout to save a payment method
2. Enable auto top-up in `/settings` and set your preferred amount
3. Set a balance threshold — when credits drop below it, a top-up is triggered
4. Set a monthly spending cap to control costs

Auto top-ups are processed as off-session Stripe PaymentIntents using your saved payment method. Monthly spending is tracked and capped to prevent runaway charges.

## BYOK (Bring Your Own Key)

When a BYOK provider key is configured, requests route directly to the provider using your key. **BYOK requests skip credit charges entirely** — you pay the provider directly. This is useful when customer contracts require provider-direct billing.

## Purpose of open catalog models

Provider-free models are best used for:

1. **Prompt and UX prototyping** before spending paid credits
2. **Fallback paths** for non-critical traffic during provider spikes
3. **A/B checks** against paid models so you only pay where quality difference matters

Catalog updates are synced from OpenRouter, so available `is_free=true` models can change over time. You can always fetch the current live list from:

```bash
GET /api/v1/models
```

Filter rows where `is_free=true`.

### Rate Limits and Reliability

## Reliability stack

## Per-endpoint limits

All limits are per 60-second window. Paid users (any purchase history) get a 1.5x multiplier; free-tier users get a 0.6x multiplier.

## Dual-layer enforcement

Every request is checked against two independent counters:

1. **Per-user** — keyed by your user ID
2. **Per-IP** — keyed by your client IP address (via `X-Forwarded-For`)

IP-level limits are separate from user limits. Default IP limits: free = 120 req/min, paid = 360 req/min.

## Burst protection

A second short-window layer prevents request spikes. Within any 10-second window:

- **Free users:** 30 requests max
- **Paid users:** 90 requests max

If you exceed the burst limit, you receive a `429` with the message "Request burst detected."

## Response headers

Every API response includes rate-limit headers:

## Fail-open mode

By default, rate limiting runs in **fail-open** mode. If Redis is unavailable, requests are allowed through rather than blocked. This prevents a Redis outage from taking down your API access. Critical routes can be configured for fail-closed if needed.

## Circuit breaker (Mesh mode)

When using Mesh/failover routing, a per-model circuit breaker protects against cascading failures:

- **3 consecutive failures** → circuit opens for 30 seconds
- During open state, the model is skipped and the next fallback is tried
- After 30 seconds, **half-open**: one test request is allowed through
- A successful test closes the circuit; a failure reopens it

## Client retry baseline

```javascript
for (let attempt = 0; attempt <= 3; attempt += 1) {
  const res = await fetch(url, init);
  if (res.ok) return res;
  if (res.status === 429 || res.status >= 500) {
    const retryAfter = res.headers.get("Retry-After");
    const delay = retryAfter
      ? parseInt(retryAfter, 10) * 1000
      : 300 * (2 ** attempt);
    await new Promise((r) => setTimeout(r, delay));
    continue;
  }
  throw new Error("HTTP " + res.status);
}
throw new Error("Retries exhausted");
```

## Security & Data

### Privacy, Security, and Data Controls

## Control matrix

## Retention impact

## Managing privacy settings

Toggle controls via `PUT /api/v1/settings/privacy`:

```json
{
  "zero_retention_mode": true,
  "data_training_opt_in": false,
  "purge_existing_data": true
}
```

- `zero_retention_mode` — when enabled, all new requests skip prompt/response storage and semantic memory
- `data_training_opt_in` — explicit consent for training data collection (auto-disabled when zero-retention is on)
- `purge_existing_data` — when enabling zero-retention, purge previously stored data

Check current settings with `GET /api/v1/settings/privacy`.

## Data purge

When you enable zero-retention mode with `purge_existing_data: true`, the following data is permanently removed:

- **Semantic memories** — all vector embeddings deleted
- **Training samples** — all opted-in training data deleted
- **Request logs** — prompt and response text redacted (metadata preserved for billing)
- **Conversations** — titles scrubbed

The API returns a count of affected records so you can verify the purge was complete.

## Enterprise baseline checklist

1. Enable zero-retention for regulated workloads.
2. Keep training opt-in disabled by default.
3. Rotate API and webhook secrets on a schedule.
4. Use BYOK when a customer contract requires provider-direct billing.
5. Verify purge counts after enabling zero-retention.

### Semantic Memory API Reference

## Endpoints

## Retrieval flow

## Search call example

```bash
curl -G https://llmwise.ai/api/v1/memory/search \
  -H "Authorization: Bearer mm_sk_YOUR_KEY" \
  --data-urlencode "q=What decision did we make about retries?" \
  --data-urlencode "top_k=4"
```

## Zero-retention behavior

When zero-retention mode is enabled, memory APIs report that memory is disabled and return no persisted entries.

## Operations

### Webhooks and System Sync

## Endpoints

## Clerk events handled

- `user.created` — create local user with signup bonus (20 free credits)
- `user.updated` — sync email and name changes
- `user.deleted` — deactivate user account

Clerk webhooks are verified using Svix signatures. If the auth middleware already auto-created the user before the webhook arrives, the webhook gracefully updates instead of duplicating.

## Stripe events handled

- `checkout.session.completed` — wallet top-up fulfillment
- `checkout.session.async_payment_succeeded` — delayed payment confirmation

Both events trigger the same fulfillment flow: validate metadata, check idempotency, and credit the user wallet. Events are deduplicated by `stripe_payment_id` to prevent double-crediting (a minimal sketch of this check follows the setup checklist below).

## Sync hardening

## Setup checklist

1. Configure webhook endpoints in the Clerk and Stripe dashboards.
2. Set webhook secrets in environment variables.
3. Send test events and verify logs.
4. Validate duplicate event handling.
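A minimal sketch of the idempotent fulfillment check referenced under "Stripe events handled": credit the wallet once per `stripe_payment_id`, even when Stripe retries a webhook. The storage interface (`db.payments`, `db.wallets`) and metadata keys are hypothetical; the real flow also validates metadata and amounts before crediting.

```python
def fulfill_checkout(event: dict, db) -> str:
    """Credit a wallet exactly once for a checkout.session.* event (illustrative only)."""
    session = event["data"]["object"]
    payment_id = session["payment_intent"]          # used as the idempotency key
    credits = int(session["metadata"]["credits"])   # metadata keys assumed for this sketch
    user_id = session["metadata"]["user_id"]

    if db.payments.exists(stripe_payment_id=payment_id):
        return "duplicate_ignored"                  # already credited — do nothing

    db.payments.insert(stripe_payment_id=payment_id, user_id=user_id, credits=credits)
    db.wallets.add_credits(user_id=user_id, amount=credits)
    return "credited"
```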