# LLMWise — Full Platform Documentation

> Multi-model LLM API orchestration platform. One API key to access 30+ models from OpenAI, Anthropic, Google, DeepSeek, Meta, Mistral, xAI, and more. Orchestration modes: Chat, Compare, Blend, Judge, plus Failover routing. OpenAI-style messages, credit-based pay-per-use, no subscription.

- Base URL: https://llmwise.ai
- API base: https://llmwise.ai/api/v1
- Auth: Bearer token (mm_sk_ prefix) or Clerk JWT
- Streaming: Server-Sent Events (SSE)

## Supported Models

| ID | Name | Provider | Vision |
|----|------|----------|--------|
| auto | Auto (smart routing) | LLMWise | Yes |
| gpt-5.2 | GPT-5.2 | OpenAI | Yes |
| claude-sonnet-4.5 | Claude Sonnet 4.5 | Anthropic | Yes |
| gemini-3-flash | Gemini 3 Flash | Google | Yes |
| claude-haiku-4.5 | Claude Haiku 4.5 | Anthropic | No |
| deepseek-v3 | DeepSeek V3 | DeepSeek | No |
| llama-4-maverick | Llama 4 Maverick | Meta | No |
| mistral-large | Mistral Large | Mistral | No |
| grok-3 | Grok 3 | xAI | Yes |
| zai-glm-5 | GLM 5 | Z.ai | No |
| liquid-lfm-2.2-6b | LFM2 2.6B | LiquidAI | No |
| liquid-lfm-2.5-1.2b-thinking-free | LFM2.5 1.2B Thinking (Free) | LiquidAI | No |
| liquid-lfm2-8b-a1b | LFM2 8B A1B | LiquidAI | No |
| minimax-m2.5 | MiniMax M2.5 | MiniMax | No |
| llama-3.3-70b-instruct | Llama 3.3 70B Instruct | Meta | No |
| gpt-oss-20b | GPT OSS 20B | OpenAI | No |
| gpt-oss-120b | GPT OSS 120B | OpenAI | No |
| gpt-oss-safeguard-20b | GPT OSS Safeguard 20B | OpenAI | No |
| kimi-k2.5 | Kimi K2.5 | MoonshotAI | Yes |
| nemotron-3-nano-30b-a3b | Nemotron 3 Nano 30B | NVIDIA | No |
| nemotron-nano-12b-v2-vl | Nemotron Nano 12B VL | NVIDIA | Yes |
| claude-opus-4.6 | Claude Opus 4.6 | Anthropic | Yes |
| claude-opus-4.5 | Claude Opus 4.5 | Anthropic | Yes |
| arcee-coder-large | Arcee Coder Large | Arcee AI | No |
| arcee-trinity-large-preview-free | Arcee Trinity Large (Free) | Arcee AI | No |
| qwen3-coder-next | Qwen3 Coder Next | Qwen | No |
| olmo-3.1-32b-think | OLMo 3.1 32B Think | AllenAI | No |
| llama-guard-3-8b | Llama Guard 3 8B | Meta | No |
| gpt-4o-2024-08-06 | GPT-4o (2024-08-06) | OpenAI | Yes |
| gpt-audio | GPT Audio | OpenAI | No |
| openrouter-free | OpenRouter Free | OpenRouter | Yes |
| openrouter-auto | OpenRouter Auto | OpenRouter | Yes |

> Note: OpenRouter free-model entries are synced dynamically into the backend catalog.
> For the live free list, call `GET /api/v1/models` and filter `is_free=true`.

## Orchestration Modes

### Chat (1 credit)

Endpoint: POST /api/v1/chat

Single-model chat with OpenAI-style messages (role + content) and streaming SSE.

### Compare (2 credits)

Endpoint: POST /api/v1/compare

Same prompt hits 2-9 models simultaneously. Responses stream back with per-model latency, tokens, and cost.

### Blend (4 credits)

Endpoint: POST /api/v1/blend

Multiple models respond, then a synthesizer combines the strongest parts. Strategies: consensus, council, best_of, chain, moa, self_moa.

### Judge (5 credits)

Endpoint: POST /api/v1/judge

Contestant models compete on your prompt. A judge model scores, ranks, and explains why one wins.

### Failover Routing (1 credit)

Endpoint: POST /api/v1/chat (with routing parameter)

Primary model hits 429 or goes down? Auto-failover to backup chain. Circuit breakers, health checks, zero downtime.
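To make the endpoint list above concrete, here is an illustrative Compare call in Python. The payload fields (`models`, `messages`, `stream`) mirror the Blend request examples later in this document; treat the exact shape as a sketch and confirm it against the API Explorer before relying on it.

```python
import os

import requests

# Illustrative Compare call: fan one prompt out to several models in a single request.
resp = requests.post(
    "https://llmwise.ai/api/v1/compare",
    headers={"Authorization": f"Bearer {os.environ['LLMWISE_API_KEY']}"},
    json={
        "models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
        "messages": [{"role": "user", "content": "Explain idempotency keys in two paragraphs."}],
        "stream": True,
    },
    stream=True,
    timeout=300,
)
resp.raise_for_status()

for line in resp.iter_lines():
    if line:
        print(line.decode())  # raw SSE lines: per-model chunks, completions, and the summary
```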
## Pricing

- Free Trial: 20 credits, never expire, no credit card required
- Pay-per-use: Add credits anytime, paid credits never expire
- Auto top-up: Optional automatic refill with monthly safety cap
- Enterprise: Custom limits, team billing, SLAs — contact sales@llmwise.ai

---

# Documentation

## Getting Started

### Quick Start Guide

## What you get immediately

Every new account receives **20 free credits**. One credit = one Chat request. No credit card required to start.

- OpenAI-style messages format (role + content)
- Chat, Compare, Blend, Judge, and Mesh modes
- Unified usage + charged credits visibility
- Optimization and replay workflows for policy tuning

## 10-minute setup

1. Create an account at `/sign-up` — you receive 20 free credits instantly.
2. Generate an API key at `/keys`.
3. Open `/api-explorer` and run your first request.
4. Open `/chat` and test `Auto` mode.
5. Open `/usage` to confirm charged credits and response latency.

## First request

```bash
curl -X POST https://llmwise.ai/api/v1/chat \
  -H "Authorization: Bearer mm_sk_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "optimization_goal": "balanced",
    "messages": [
      {"role": "user", "content": "Give me a launch checklist for an AI API product."}
    ],
    "stream": true
  }'
```

## What success looks like

In streaming mode, watch for a final `done` payload including:

- `finish_reason`
- `resolved_model`
- `credits_charged`
- `credits_remaining`

### Dashboard User Guide

## Dashboard map

## Mode behavior

## Suggested daily workflow

1. Start in Chat with `Auto` and the `Balanced` goal.
2. For critical prompts, run Compare before standardizing.
3. Use Blend/Judge only for high-value outputs.
4. Add a Mesh chain for reliability-sensitive flows.
5. Check Usage daily and Replay weekly.

## How to read the Usage page correctly

- **Charged credits**: what your wallet is billed.
- **Latency**: request performance for user experience.
- **Tokens**: workload profile for model selection decisions.

## API Core

### Authentication and API Keys

## Authentication model

LLMWise supports two authentication methods:

- **API keys** — `mm_sk_`-prefixed tokens generated at `/keys`
- **Clerk JWTs** — session tokens from the web dashboard

Both methods use the same `Authorization: Bearer <token>` header. The backend detects which method you are using by the token prefix.

## API key details

- **Prefix:** `mm_sk_` followed by 64 hex characters
- **Storage:** Keys are SHA-256 hashed before storage — the raw key is only shown once at generation time
- **One key per account** at a time. Generating a new key invalidates the previous one.

### Key lifecycle

### Chat API Reference

## Endpoint

## Request fields

## Streaming events

In single-model chat mode, SSE messages are plain JSON chunks that include a `delta` field (no explicit `event` field). In Mesh/failover mode (when `routing` is set, or when Auto uses an implicit fallback chain), chunks are wrapped in explicit events (`event: "route" | "chunk" | "trace"`), followed by a final `done` payload with billing metadata.
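If you consume the stream without an SDK, a minimal parsing sketch for the single-model case looks like the following. It assumes standard `data:`-prefixed SSE lines whose JSON bodies carry the `delta`/`done` fields described above; treat it as a sketch rather than a canonical client.

```python
import json
import os

import requests

resp = requests.post(
    "https://llmwise.ai/api/v1/chat",
    headers={"Authorization": f"Bearer {os.environ['LLMWISE_API_KEY']}"},
    json={
        "model": "auto",
        "messages": [{"role": "user", "content": "Say hello."}],
        "stream": True,
    },
    stream=True,
    timeout=120,
)
resp.raise_for_status()

for line in resp.iter_lines():
    if not line or not line.startswith(b"data:"):
        continue  # skip blank lines and SSE comments between events
    try:
        chunk = json.loads(line[len(b"data:"):].strip())
    except ValueError:
        continue  # ignore any non-JSON sentinel lines
    if chunk.get("delta"):
        print(chunk["delta"], end="", flush=True)
    if chunk.get("event") == "done" or chunk.get("done"):
        print("\ncredits_remaining:", chunk.get("credits_remaining"))
        break
```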
## Request example

```json
{
  "model": "auto",
  "cost_saver": true,
  "optimization_goal": "cost",
  "messages": [
    {"role": "user", "content": "Design retry logic for API failures."}
  ],
  "semantic_memory": true,
  "semantic_top_k": 4,
  "stream": true
}
```

## Done event example

```json
{
  "event": "done",
  "id": "request_uuid",
  "resolved_model": "deepseek-v3",
  "finish_reason": "stop",
  "credits_charged": 1,
  "credits_remaining": 2038
}
```

## Non-stream response example

```json
{
  "id": "request_uuid",
  "model": "gpt-5.2",
  "content": "...",
  "prompt_tokens": 42,
  "completion_tokens": 312,
  "latency_ms": 1180,
  "cost": 0.0039,
  "credits_charged": 1,
  "credits_remaining": 2038,
  "finish_reason": "stop",
  "mode": "chat"
}
```

### Auto Routing and Optimization (Load Balancer Mode)

## What Auto does (in one sentence)

`model="auto"` turns LLMWise into a **load balancer for LLMs**: it picks the best primary model for each request and (optionally) applies an implicit fallback chain so transient failures do not break your flow.

## Auto decision flow

When you send a Chat request with `model="auto"`, the backend:

1. Builds a candidate model set (vision-safe if your messages contain images).
2. Loads your **optimization policy** (defaults + guardrails).
3. Resolves a goal: `balanced | cost | latency | reliability`.
4. Chooses a primary model using one of two strategies:
   - `historical_optimization`: uses your recent production traces when there is enough data.
   - `heuristic_routing`: uses a fast heuristic classifier when history is insufficient or policy disables history.

The final model is returned to you in `resolved_model` on the `done` event (streaming) or in the JSON response (non-stream).

## Auto as a load balancer (implicit failover)

Auto can also add a fallback chain even if you do not provide `routing`. This is controlled by your optimization policy:

- If `max_fallbacks > 0`, Auto will attach a fallback chain to the request.
- If `max_fallbacks = 0`, Auto will run as **single-model routing only** (no implicit failover).

When an implicit chain is active, LLMWise retries on retryable failures (429/5xx/timeouts), emits routing events (`route`, `trace`), and settles billing once a final model succeeds.

## Cost saver mode (shortcut)

If you send `cost_saver: true`, the server normalizes your request to:

- `model = "auto"`
- `optimization_goal = "cost"`

This is supported for `POST /api/v1/chat` only (not with explicit `routing`).

## What you see in streaming

In streaming mode (`stream: true`), you will see:

- **delta chunks**: JSON objects with a `delta` field (text) and a `done` boolean.
- **Mesh/Auto failover events** (only when a fallback chain is active):
  - `event: "route"`: model attempts (trying/failed/skipped)
  - `event: "chunk"`: streamed deltas (event-wrapped)
  - `event: "trace"`: final routing summary
- **final billing event**:
  - `event: "done"` with `credits_charged`, `credits_remaining`, and (when Auto is used) `resolved_model`, `auto_strategy`, `optimization_goal`.
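When a fallback chain is active, the same parsing loop needs to branch on the `event` field. Below is a minimal dispatcher sketch: the event names come from the list above, while the per-event payload fields for `route` (`model`, `status`) are illustrative assumptions.

```python
def handle_stream_event(payload: dict) -> bool:
    """Handle one parsed SSE JSON object; return True when the stream is finished."""
    event = payload.get("event")
    if event == "route":
        # Model attempt updates: trying / failed / skipped (field names assumed).
        print(f"[route] {payload.get('model')}: {payload.get('status')}")
    elif event == "chunk":
        print(payload.get("delta", ""), end="", flush=True)
    elif event == "trace":
        print("\n[trace]", payload)  # final routing summary
    elif event == "done":
        print("\nresolved_model:", payload.get("resolved_model"))
        print("credits_charged:", payload.get("credits_charged"))
        print("credits_remaining:", payload.get("credits_remaining"))
        return True
    else:
        # Plain delta chunks (no event field) — the single-model portion of the stream.
        if payload.get("delta"):
            print(payload["delta"], end="", flush=True)
    return False
```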
## API examples

### cURL (Auto + cost saver)

```bash
curl -X POST https://llmwise.ai/api/v1/chat \
  -H "Authorization: Bearer mm_sk_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "cost_saver": true,
    "messages": [{"role":"user","content":"Summarize this support thread."}],
    "stream": true
  }'
```

### Python (SDK)

```python
import os
from llmwise import LLMWise

client = LLMWise(os.environ["LLMWISE_API_KEY"])

for ev in client.chat_stream(
    model="auto",
    optimization_goal="balanced",
    messages=[{"role": "user", "content": "Write a launch plan for a SaaS product."}],
):
    if ev.get("delta"):
        print(ev["delta"], end="", flush=True)
    if ev.get("event") == "done":
        print("\n\nresolved_model:", ev.get("resolved_model"))
        break
```

### TypeScript (SDK)

```ts
import { LLMWise } from "llmwise";

const client = new LLMWise(process.env.LLMWISE_API_KEY!);

for await (const ev of client.chatStream({
  model: "auto",
  optimization_goal: "cost",
  messages: [{ role: "user", content: "Draft a short outbound email to a CTO." }],
})) {
  if (ev.delta) process.stdout.write(ev.delta);
  if (ev.event === "done") {
    console.log("\nresolved_model:", (ev as any).resolved_model);
    break;
  }
}
```

### Compare / Blend / Judge API Reference

## Endpoint matrix

## Compare behavior

- Runs all selected models concurrently.
- Emits per-model completion events.
- Emits summary metadata (`fastest`, `longest`).
- Refunds when all models fail.

## Blend behavior

Blend supports strategies:

- `consensus`
- `council`
- `best_of`
- `chain`
- `moa` (Mixture-of-Agents refinement layers)
- `self_moa` (Self-MoA: multiple candidates from one base model)

Notes:

- Most strategies require **2+ models**. Passing 1 model returns a 400 error.
- For `self_moa`, pass exactly **1 model** in `models[]` and set `samples` (2–8).
- For `moa`, set `layers` (1–3). Each layer refines answers using the previous layer as references.

## Judge behavior

Judge mode collects contestant outputs, then prompts the judge model to return ranked JSON. The judge model cannot be one of the contestants.

```json
{
  "event": "verdict",
  "winner": "claude-sonnet-4.5",
  "scores": [
    {"model": "claude-sonnet-4.5", "rank": 1, "score": 9.2, "reasoning": "..."},
    {"model": "gpt-5.2", "rank": 2, "score": 8.8, "reasoning": "..."}
  ],
  "overall": "Claude response was more complete and better structured."
}
```

## Failure semantics

### API Explorer Guide

## Why API Explorer exists

API Explorer is the fastest way to validate payload structure and endpoint behavior before coding SDK integration.

- Mode-specific payload templates
- Live request execution with your API key
- Stream event inspector (delta chunks, `route`/`chunk`/`trace`, `done`, terminal errors)
- Raw and parsed output panes
- Product-scoped assistant for endpoint-specific snippet generation

## Typical debugging sequence

## Good assistant prompts

- "Generate Node.js fetch example with retries for this payload."
- "Show Python SSE parser for done events and finish_reason handling."
- "Explain why this request returned 402 and what user action fixes it."

## Tutorials

### Mesh Mode Tutorial (Failover Routing)

## When to use Mesh

Use Mesh mode for reliability-sensitive traffic where a single provider failure is not acceptable.

- Frequent 429 bursts
- Provider latency spikes
- High-value requests that must complete

## Mesh failover model

### Replay Lab Tutorial

## What Replay Lab does

Replay Lab simulates historical request traffic against your current policy to estimate impact before you change production behavior.
- Cost deltas
- Latency deltas
- Reliability and success-rate deltas

## Replay flow

### Prompt Regression Testing Tutorial

## What this feature covers

- Prebuilt prompt templates
- Custom suite creation
- Manual and scheduled test runs
- CSV export for historical tracking

## Workflow

### Blend Strategies & Orchestration Algorithms

LLMWise orchestrates multiple models through several algorithmic layers. This guide explains every strategy and algorithm in depth, with special focus on **Blend mode** — the most configurable.

## Blend mode overview

Blend sends your prompt to multiple models simultaneously, then feeds all responses into a **synthesizer** model that produces one final answer. The synthesis behavior changes depending on which **strategy** you choose.

All strategies follow the same two-phase execution: first every source model answers the prompt, then the synthesizer combines those responses into the final answer.

## Strategy: Consensus

The default strategy. The synthesizer receives all source responses and is instructed to combine the strongest points while resolving any contradictions.

- Single-pass synthesis — no refinement layers
- Synthesizer decides which parts of each response to keep
- Contradictions are resolved by weighing the majority view

```json
{
  "models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
  "synthesizer": "claude-sonnet-4.5",
  "strategy": "consensus",
  "messages": [{"role": "user", "content": "Explain quantum entanglement"}]
}
```

## Strategy: Council

Structures the synthesis as a deliberation. The synthesizer produces:

1. **Final answer** — the synthesized conclusion
2. **Agreement points** — where all models aligned
3. **Disagreement points** — where models diverged, with analysis
4. **Follow-up questions** — areas that need further exploration

Best when you want transparency about model consensus vs. divergence.

## Strategy: Best-Of

The synthesizer picks the single best response, then enhances it with useful additions from the others. The quickest synthesis approach — minimal rewriting, focused on augmentation.

## Strategy: Chain

Iterative integration. The synthesizer works through each response sequentially, building a comprehensive answer by incrementally incorporating each model's contribution. Produces the most thorough output but may be longer.

## Strategy: MoA (Mixture of Agents)

The most sophisticated strategy. Inspired by the [Mixture-of-Agents](https://arxiv.org/abs/2406.04692) paper, MoA adds **refinement layers** where models can see and improve upon previous answers.

### How MoA layers work

1. **Layer 0**: Each model answers the prompt independently (same as other strategies).
2. **Layer 1+**: Each model receives the previous layer's answers as reference material, injected via system message. Models are instructed to improve upon, correct, and expand the references.
3. **Final synthesis**: The synthesizer combines all responses from the last completed layer.

### Reference injection

Previous-layer answers are injected into each model's context:

- **Total reference budget**: 12,000 characters across all references
- **Per-answer cap**: 3,200 characters (truncated if longer)
- **Injection method**: System message + follow-up user message containing formatted references

### Early stopping

If a layer produces zero successful responses, MoA keeps the previous layer's successes and skips to synthesis. This prevents total failure when models hit rate limits or errors.
```json
{
  "models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
  "synthesizer": "claude-sonnet-4.5",
  "strategy": "moa",
  "layers": 2,
  "messages": [{"role": "user", "content": "Design a rate limiter for a distributed system"}]
}
```

## Strategy: Self-MoA

Self-MoA generates diverse candidates from a **single model** by varying temperature and system prompts. This is useful when you trust one model but want to hedge against its variance.

### How it works

1. You provide exactly **1 model** in `models[]`
2. Set `samples` (2–8, default 4) for how many candidates to generate
3. Each candidate runs with a different **temperature offset** and **agent prompt**
4. The synthesizer combines all candidates into one final answer

### Temperature variation

Each candidate gets a different temperature to encourage diversity:

```
Base offsets: [-0.25, 0.0, +0.25, +0.45, +0.15, +0.35, -0.1, +0.3]
Final temp = clamp(base_temp + offset, 0.2, 1.4)
```

For example, with `temperature: 0.7` and 4 samples:

- Candidate 1: temp 0.45 (conservative)
- Candidate 2: temp 0.70 (baseline)
- Candidate 3: temp 0.95 (creative)
- Candidate 4: temp 1.15 (exploratory)

### Agent prompt rotation

Six distinct system prompts rotate across candidates, each emphasizing a different quality — including **Clarity** (plain-language explanations) and **Skepticism** (challenge assumptions, flag weaknesses).

```json
{
  "models": ["claude-sonnet-4.5"],
  "synthesizer": "claude-sonnet-4.5",
  "strategy": "self_moa",
  "samples": 4,
  "temperature": 0.7,
  "messages": [{"role": "user", "content": "Write a Python async rate limiter"}]
}
```

## Blend credit cost

All blend strategies cost **4 credits** regardless of strategy or model count. Credits are reserved upfront and refunded if all source models fail. After completion, billing is settled to actual execution usage — you may receive a partial refund if usage is lower than the reservation.

## Compare mode algorithm

Compare runs 2–9 models concurrently and streams their responses side-by-side.

- All models stream via an `asyncio.Queue` — chunks are yielded in arrival order (not round-robin)
- Queue timeout: 120 seconds per chunk
- After all models finish, a **summary event** reports the fastest model and longest response
- Total latency = max(individual latencies) — the bottleneck is the slowest model
- Cost: 2 credits. Refunded if all models fail; partial status logged if some succeed.

## Judge mode algorithm

Judge runs a three-phase competitive evaluation.

### Scoring system

The judge produces structured JSON with rankings sorted by score descending:

```json
{
  "rankings": [
    {"model": "claude-sonnet-4.5", "rank": 1, "score": 9.2, "reasoning": "Most complete and well-structured"},
    {"model": "gpt-5.2", "rank": 2, "score": 8.8, "reasoning": "Accurate but less organized"}
  ],
  "overall_analysis": "Claude response covered more edge cases..."
}
```

**Default evaluation criteria**: accuracy, completeness, clarity, helpfulness, code quality. You can override these with the `criteria` parameter.

**Fallback scoring**: If the judge returns malformed JSON, default scores are assigned: `8.0 - (i * 0.5)` for each contestant in order, with a note that scores were auto-assigned.

Cost: 5 credits.

## Mesh mode: circuit breaker failover

When you use mesh mode (chat with the `routing` parameter), LLMWise tries models in sequence with automatic failover powered by a circuit breaker.

### Circuit breaker state machine

Each model tracks health in-memory.
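The sketch below illustrates that state machine using the thresholds documented under "Circuit breaker (Mesh mode)" in the Rate Limits section — 3 consecutive failures open the circuit for 30 seconds, then a half-open probe decides whether it closes or reopens. It is a simplification of the real breaker (for one thing, it does not limit the half-open window to a single in-flight probe).

```python
import time
from typing import Optional


class ModelCircuitBreaker:
    """Per-model breaker: closed -> open (after 3 consecutive failures) -> half-open probe."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True  # half-open: let a test request through
        return False     # open: skip this model and move to the next fallback

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None  # probe (or normal request) succeeded: close the circuit

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # open (or reopen) for another cooldown
```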
### Failover sequence

1. Try the **primary model** first
2. If it fails (or its circuit is open), try **fallback 1**, then **fallback 2**, etc.
3. For each attempt: emit a `route` event (`trying`, `failed`, or `skipped`)
4. First success stops the chain — no further fallbacks are tried
5. After all attempts, emit a `trace` event summarizing the route

### Latency tracking

Model latency is tracked with exponential smoothing:

```
avg_latency = (avg_latency * 0.8) + (new_latency * 0.2)
```

This favors recent measurements, so a model that recovers from a slow period will quickly show improved latency.

## Auto-router: heuristic classification

When you set `model: "auto"`, LLMWise classifies your query using **zero-latency regex matching** (no LLM call overhead) and routes to the best model.

### Policy-based routing

If you have an **optimization policy** enabled with sufficient historical data, the auto-router upgrades from regex heuristics to **historical optimization** — routing based on actual performance data from your past requests. See the next section.

## Optimization scoring algorithm

The optimization engine analyzes your historical request logs and recommends the best model + fallback chain for each goal.

### Goals and weight vectors

Each goal uses different weights across the three scoring dimensions: `Ws` for success rate, `Wl` for latency, and `Wc` for cost.

### Scoring formula

For each eligible model (minimum 3 calls in the lookback window):

```
inv_latency   = (max_latency - model_latency) / (max_latency - min_latency)
inv_cost      = (max_cost - model_cost) / (max_cost - min_cost)
raw_score     = (Ws * success_rate) + (Wl * inv_latency) + (Wc * inv_cost)
sample_factor = min(1.0, calls / 20)
score         = raw_score * (0.7 + 0.3 * sample_factor)
```

The **sample factor** gives a small boost to models with more data — a model with 20+ calls gets the full score, while a model with only a handful of calls loses up to 30% of its raw score. Preferred models get an additional `+0.04 * sample_factor` bonus.

### Confidence score

```
confidence = min(1.0, total_calls / 60)
```

At 60+ total calls across all models, confidence reaches 1.0 (full certainty). Below that, the recommendation carries a lower confidence signal.

### Guardrails

After scoring, models are filtered through policy guardrails:

- **Max latency**: Reject models above the threshold (e.g., 5000ms)
- **Max cost**: Reject models above the per-request cost cap (e.g., $0.05)
- **Min success rate**: Reject models below the reliability threshold (e.g., 0.95)

The top model that passes all guardrails becomes the **recommended primary**. The next N models become the **fallback chain** (configurable, 0–6 fallbacks).

## Credit settlement algorithm

LLMWise uses a three-phase credit system: reserve upfront, execute, then settle against actual usage.

### Settlement formula

Reserved credits are debited at request start. After execution, LLMWise reconciles that reserve against actual usage.

- If usage is lower than the reserved credits, unused credits are refunded.
- If usage is higher, we charge only the difference.

BYOK requests keep provider-facing billing and remain on **0 credits**.

## Billing & Limits

### Billing and Credits

## Billing principle

Users are billed in **credits**, not raw provider token costs. One dollar buys 100 credits.

- Mode-level default charge is fixed per request (reserved upfront)
- After the request completes, a settlement step reconciles actual execution usage
- Wallet balance is shown in `/credits`
- **Paid credits never expire**

## Free trial

Every new account receives **20 free credits** on signup. Free credits never expire — use them at your own pace. Purchase additional credit packs anytime to add more credits to your wallet.
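For a sense of scale: one dollar buys 100 credits and mode charges are fixed (Chat 1, Compare 2, Blend 4, Judge 5), so the 20 free credits cover roughly 20 Chat requests, 10 Compares, 5 Blends, or 4 Judge runs, and the $3 minimum top-up adds another 300 credits — all before any settlement adjustments.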
## Default charges

## How settlement works

Credits are **reserved** before the request starts, then **settled** after execution: if actual usage exceeds the reserved credits, the difference is charged. If usage is lower, unused credits are refunded. All adjustments appear as separate transactions in your history.

## Top-up flow

Minimum top-up is $3. Maximum single top-up is $10,000.

## Auto top-up

Enable automatic refills so requests never fail due to low balance:

1. Complete one Stripe checkout to save a payment method
2. Enable auto top-up in `/settings` and set your preferred amount
3. Set a balance threshold — when credits drop below it, a top-up is triggered
4. Set a monthly spending cap to control costs

Auto top-ups are processed as off-session Stripe PaymentIntents using your saved payment method. Monthly spending is tracked and capped to prevent runaway charges.

## BYOK (Bring Your Own Key)

When a BYOK provider key is configured, requests route directly to the provider using your key. **BYOK requests skip credit charges entirely** — you pay the provider directly. This is useful when customer contracts require provider-direct billing.

## Purpose of open catalog models

Provider-free models are best used for:

1. **Prompt and UX prototyping** before spending paid credits
2. **Fallback paths** for non-critical traffic during provider spikes
3. **A/B checks** against paid models so you only pay where quality difference matters

Catalog updates are synced from OpenRouter, so available `is_free=true` models can change over time. You can always fetch the current live list from:

```bash
GET /api/v1/models
```

Filter rows where `is_free=true`.

### Rate Limits and Reliability

## Reliability stack

## Per-endpoint limits

All limits are per 60-second window. Paid users (any purchase history) get a 1.5x multiplier; free-tier users get a 0.6x multiplier.

## Dual-layer enforcement

Every request is checked against two independent counters:

1. **Per-user** — keyed by your user ID
2. **Per-IP** — keyed by your client IP address (via `X-Forwarded-For`)

IP-level limits are separate from user limits. Default IP limits: free = 120 req/min, paid = 360 req/min.

## Burst protection

A second short-window layer prevents request spikes. Within any 10-second window:

- **Free users:** 30 requests max
- **Paid users:** 90 requests max

If you exceed the burst limit, you receive a `429` with the message "Request burst detected."

## Response headers

Every API response includes rate-limit headers:

## Fail-open mode

By default, rate limiting runs in **fail-open** mode. If Redis is unavailable, requests are allowed through rather than blocked. This prevents a Redis outage from taking down your API access. Critical routes can be configured for fail-closed if needed.

## Circuit breaker (Mesh mode)

When using Mesh/failover routing, a per-model circuit breaker protects against cascading failures:

- **3 consecutive failures** → circuit opens for 30 seconds
- During open state, the model is skipped and the next fallback is tried
- After 30 seconds, **half-open**: one test request is allowed through
- A successful test closes the circuit; a failure reopens it

## Client retry baseline

```javascript
for (let attempt = 0; attempt <= 3; attempt += 1) {
  const res = await fetch(url, init);
  if (res.ok) return res;
  if (res.status === 429 || res.status >= 500) {
    const retryAfter = res.headers.get("Retry-After");
    const delay = retryAfter
      ? parseInt(retryAfter, 10) * 1000
      : 300 * (2 ** attempt);
    await new Promise((r) => setTimeout(r, delay));
    continue;
  }
  throw new Error("HTTP " + res.status);
}
throw new Error("Retries exhausted");
```

## Security & Data

### Privacy, Security, and Data Controls

## Control matrix

## Retention impact

## Managing privacy settings

Toggle controls via `PUT /api/v1/settings/privacy`:

```json
{
  "zero_retention_mode": true,
  "data_training_opt_in": false,
  "purge_existing_data": true
}
```

- `zero_retention_mode` — when enabled, all new requests skip prompt/response storage and semantic memory
- `data_training_opt_in` — explicit consent for training data collection (auto-disabled when zero-retention is on)
- `purge_existing_data` — when enabling zero-retention, purge previously stored data

Check current settings with `GET /api/v1/settings/privacy`.

## Data purge

When you enable zero-retention mode with `purge_existing_data: true`, the following data is permanently removed:

- **Semantic memories** — all vector embeddings deleted
- **Training samples** — all opted-in training data deleted
- **Request logs** — prompt and response text redacted (metadata preserved for billing)
- **Conversations** — titles scrubbed

The API returns a count of affected records so you can verify the purge was complete.

## Enterprise baseline checklist

1. Enable zero-retention for regulated workloads.
2. Keep training opt-in disabled by default.
3. Rotate API and webhook secrets on a schedule.
4. Use BYOK when a customer contract requires provider-direct billing.
5. Verify purge counts after enabling zero-retention.

### Semantic Memory API Reference

## Endpoints

## Retrieval flow

## Search call example

```bash
curl -G https://llmwise.ai/api/v1/memory/search \
  -H "Authorization: Bearer mm_sk_YOUR_KEY" \
  --data-urlencode "q=What decision did we make about retries?" \
  --data-urlencode "top_k=4"
```

## Zero-retention behavior

When zero-retention mode is enabled, memory APIs report that memory is disabled and return no persisted entries.

## Operations

### Webhooks and System Sync

## Endpoints

## Clerk events handled

- `user.created` — create local user with signup bonus (20 free credits)
- `user.updated` — sync email and name changes
- `user.deleted` — deactivate user account

Clerk webhooks are verified using Svix signatures. If the auth middleware already auto-created the user before the webhook arrives, the webhook gracefully updates instead of duplicating.

## Stripe events handled

- `checkout.session.completed` — wallet top-up fulfillment
- `checkout.session.async_payment_succeeded` — delayed payment confirmation

Both events trigger the same fulfillment flow: validate metadata, check idempotency, and credit the user wallet. Events are deduplicated by `stripe_payment_id` to prevent double-crediting (a minimal sketch of this check follows the setup checklist below).

## Sync hardening

## Setup checklist

1. Configure webhook endpoints in the Clerk and Stripe dashboards.
2. Set webhook secrets in environment variables.
3. Send test events and verify logs.
4. Validate duplicate event handling.
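A minimal sketch of the idempotent fulfillment check referenced under "Stripe events handled": credit the wallet once per `stripe_payment_id`, even when Stripe retries a webhook. The storage interface (`db.payments`, `db.wallets`) and metadata keys are hypothetical; the real flow also validates metadata and amounts before crediting.

```python
def fulfill_checkout(event: dict, db) -> str:
    """Credit a wallet exactly once for a checkout.session.* event (illustrative only)."""
    session = event["data"]["object"]
    payment_id = session["payment_intent"]          # used as the idempotency key
    credits = int(session["metadata"]["credits"])   # metadata keys assumed for this sketch
    user_id = session["metadata"]["user_id"]

    if db.payments.exists(stripe_payment_id=payment_id):
        return "duplicate_ignored"                  # already credited — do nothing

    db.payments.insert(stripe_payment_id=payment_id, user_id=user_id, credits=credits)
    db.wallets.add_credits(user_id=user_id, amount=credits)
    return "credited"
```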