Routing & Reliability
Configure load balancing, failover, retries, and circuit breaking across LLM providers.
About
Prism’s routing layer distributes requests across multiple providers and models for reliability and performance. If one provider is down or slow, traffic automatically shifts to healthy alternatives. This ensures your application stays responsive even when individual providers experience outages or rate limiting.
When to use
- High availability: Automatic failover to backup providers when primary is down or rate-limited
- Cost optimization: Route to the cheapest provider that supports the requested model
- Latency reduction: Route to the fastest provider based on recent response times
- Traffic distribution: Split traffic across providers by weight for capacity management
Key concepts
| Term | Definition |
|---|---|
| Failover | Automatic rerouting of requests to a backup provider when the primary provider fails or returns errors (429, 5xx) |
| Retries | Repeated attempts to send a request after a failure, using exponential backoff to avoid overwhelming the provider |
| Circuit breaking | A protection mechanism that stops sending requests to a failing provider entirely, then gradually tests recovery before resuming full traffic |
| Timeouts | Maximum duration Prism waits for a provider response before treating the request as failed |
| Routing strategy | The algorithm Prism uses to select which provider handles each request (e.g., round robin, weighted, latency-based) |
Configuration parameters
These parameters appear in the JSON configuration blocks throughout this page.
Failover:
| Parameter | Type | Description |
|---|---|---|
| enabled | boolean | Turn failover on or off |
| providers | string[] | Ordered list of providers to try when one fails |
| failover_on | number[] | HTTP status codes that trigger failover (e.g., 429, 500, 502, 503, 504) |
Retries:
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_retries | number | 3 | Maximum number of retry attempts before giving up |
| initial_backoff_ms | number | 100 | Wait time (ms) before the first retry |
| max_backoff_ms | number | 10000 | Upper limit on wait time between retries |
| backoff_multiplier | number | 2 | Multiplier applied to backoff after each retry (e.g., 100ms → 200ms → 400ms) |
Circuit breaker:
| Parameter | Type | Description |
|---|---|---|
| enabled | boolean | Turn circuit breaking on or off |
| error_threshold_percent | number | Error rate (%) that trips the circuit open |
| min_requests | number | Minimum request count before the error threshold is evaluated |
| open_duration_seconds | number | How long (seconds) the circuit stays open before testing recovery |
| half_open_max_requests | number | Number of trial requests allowed during the half-open recovery test |
Timeouts:
| Parameter | Type | Description |
|---|---|---|
| request_timeout_seconds | number | Maximum total time for the entire request (including retries and failovers) |
| provider_timeout_seconds | number | Maximum time to wait for a single provider response |
Routing strategies
| Strategy | Config value | How it works |
|---|---|---|
| Round Robin | round-robin | Distributes requests evenly across providers in rotation (default) |
| Weighted | weighted | Splits traffic based on assigned weights (e.g., 70% OpenAI, 30% Anthropic) |
| Least Latency | least-latency | Routes to the fastest provider based on recent response times |
| Cost Optimized | cost-optimized | Routes to the cheapest provider that supports the requested model |
| Adaptive | adaptive | Dynamically adjusts weights based on real-time performance |
| Race | fastest | Sends to all providers simultaneously, returns the first response. You are billed for every call made, including those whose responses are discarded |
Configuring a routing policy
- Go to Prism > Routing in the Future AGI dashboard
- Click Create Policy
- Select a strategy and configure provider weights, failover, retries, etc.
- Click Save
from prism import Prism
client = Prism(
api_key="sk-prism-your-key",
base_url="https://gateway.futureagi.com",
control_plane_url="https://api.futureagi.com",
)
# Create a weighted routing policy
policy = client.routing.create(
name="Production routing",
strategy="weighted",
config={"weights": {"openai": 70, "anthropic": 30}},
description="70/30 split between OpenAI and Anthropic",
)
# List all routing policies
policies = client.routing.list()
# Update an existing policy
client.routing.update(
policy["id"],
strategy="least-latency",
config={"providers": ["openai", "anthropic", "gemini"], "failover_on": [429, 500, 502, 503, 504]},
)

import { Prism } from "@futureagi/prism";
const client = new Prism({
apiKey: "sk-prism-your-key",
baseUrl: "https://gateway.futureagi.com",
controlPlaneUrl: "https://api.futureagi.com",
});
const policy = await client.routing.create({
name: "Production routing",
strategy: "weighted",
config: { weights: { openai: 70, anthropic: 30 } },
description: "70/30 split between OpenAI and Anthropic",
});
const policies = await client.routing.list();
await client.routing.update(policy.id, {
strategy: "least-latency",
config: { providers: ["openai", "anthropic", "gemini"], failoverOn: [429, 500, 502, 503, 504] },
});

Failover
Failover triggers on specific HTTP status codes and error conditions: 429 (rate limit), 5xx (server errors), timeouts, and connection errors. The providers array defines the failover order. When the primary provider fails, Prism automatically routes to the next provider in the list.
{
"failover": {
"enabled": true,
"providers": ["openai", "anthropic", "gemini"],
"failover_on": [429, 500, 502, 503, 504]
}
}
Note
The providers array defines the failover order. Prism will attempt each provider in sequence until one succeeds.
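The sequential failover behavior can be sketched in a few lines of Python. This is an illustration of the mechanism, not Prism's internal code; `call_provider` is a hypothetical stand-in for the actual provider call, returning an HTTP status and a response body.

```python
FAILOVER_ON = {429, 500, 502, 503, 504}

def send_with_failover(request, providers, call_provider):
    """Try each provider in order; fail over on the configured status codes."""
    last_status = None
    for provider in providers:
        status, body = call_provider(provider, request)
        if status not in FAILOVER_ON:
            return provider, body  # success (or a non-retriable error)
        last_status = status       # retriable failure: move to the next provider
    raise RuntimeError(f"all providers failed (last status: {last_status})")
```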
Retries
Prism uses exponential backoff for retries. This means it waits progressively longer between each retry attempt. For example, 100ms, then 200ms, then 400ms. This gives struggling providers time to recover instead of flooding them with rapid retry requests.
| Setting | Description | Default |
|---|---|---|
| max_retries | Maximum number of retry attempts | 3 |
| initial_backoff_ms | Initial backoff duration in milliseconds | 100 |
| max_backoff_ms | Maximum backoff duration in milliseconds | 10000 |
| backoff_multiplier | Multiplier for exponential backoff | 2 |
{
"retries": {
"max_retries": 3,
"initial_backoff_ms": 100,
"max_backoff_ms": 10000,
"backoff_multiplier": 2
}
}
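The backoff schedule implied by these defaults is a capped geometric series. The sketch below just computes the documented delays, so you can sanity-check a configuration:

```python
def backoff_schedule(max_retries=3, initial_ms=100, max_ms=10_000, multiplier=2):
    """Return the wait (ms) before each retry: 100, 200, 400, ... capped at max_ms."""
    delays = []
    delay = initial_ms
    for _ in range(max_retries):
        delays.append(min(delay, max_ms))
        delay *= multiplier
    return delays

backoff_schedule()  # [100, 200, 400] with the defaults above
```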
Circuit breaking
Circuit breaking stops sending requests to a provider that is failing repeatedly. After a cooldown, Prism tests the provider with a few trial requests. If those succeed, normal routing resumes. This prevents a single failing provider from degrading your entire application.
The circuit breaker has three states:
| State | Behavior |
|---|---|
| Closed | Normal operation, requests pass through |
| Open | Requests rejected immediately, no calls to provider |
| Half-Open | Limited requests allowed to test if provider recovered |
{
"circuit_breaker": {
"enabled": true,
"error_threshold_percent": 50,
"min_requests": 10,
"open_duration_seconds": 60,
"half_open_max_requests": 3
}
}
Tip
Circuit breaking works seamlessly with failover. When a circuit opens, Prism automatically routes to the next available provider.
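The three-state cycle can be sketched as a small state machine. This is a minimal illustration of the parameters above, not Prism's implementation; the injectable `clock` exists only so the sketch is testable.

```python
import time

class CircuitBreaker:
    """Sketch of the closed -> open -> half-open cycle described above."""

    def __init__(self, error_threshold_percent=50, min_requests=10,
                 open_duration_seconds=60, half_open_max_requests=3, clock=None):
        self.threshold = error_threshold_percent
        self.min_requests = min_requests
        self.open_duration = open_duration_seconds
        self.half_open_max = half_open_max_requests
        self.clock = clock or time.monotonic
        self.state = "closed"
        self.requests = self.errors = self.trials = 0
        self.opened_at = 0.0

    def allow(self):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.open_duration:
                self.state, self.trials = "half-open", 0  # begin recovery test
            else:
                return False                              # reject immediately
        if self.state == "half-open" and self.trials >= self.half_open_max:
            return False
        return True

    def record(self, success):
        if self.state == "half-open":
            self.trials += 1
            if not success:
                self._trip()                              # still failing: reopen
            elif self.trials >= self.half_open_max:
                self.state = "closed"                     # provider recovered
                self.requests = self.errors = 0
            return
        self.requests += 1
        self.errors += 0 if success else 1
        if (self.requests >= self.min_requests and
                100 * self.errors / self.requests >= self.threshold):
            self._trip()

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.requests = self.errors = 0
```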
Timeouts
Configure per-request and per-provider timeouts to prevent hanging requests.
{
"timeouts": {
"request_timeout_seconds": 30,
"provider_timeout_seconds": 25
}
}
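The relationship between the two timeouts can be sketched as a budget: each provider attempt gets at most `provider_timeout_seconds`, but never more than what remains of the overall request budget. This is an illustration only; `call_provider` is a hypothetical function that takes a timeout and returns `None` on failure.

```python
import time

def call_with_timeouts(providers, call_provider,
                       request_timeout=30.0, provider_timeout=25.0):
    """Enforce a per-provider cap and an overall budget across failover attempts."""
    deadline = time.monotonic() + request_timeout
    for provider in providers:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        budget = min(provider_timeout, remaining)  # never exceed the overall budget
        result = call_provider(provider, timeout=budget)
        if result is not None:
            return result
    raise TimeoutError("request budget exhausted before any provider responded")
```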
Example: High-availability setup
This configuration combines weighted routing, failover, retries, and circuit breaking for a production setup:
{
"name": "Production HA",
"strategy": "weighted",
"config": {
"weights": {
"openai": 60,
"anthropic": 30,
"gemini": 10
},
"failover": {
"enabled": true,
"providers": ["openai", "anthropic", "gemini"],
"failover_on": [429, 500, 502, 503, 504]
},
"retries": {
"max_retries": 3,
"initial_backoff_ms": 100,
"max_backoff_ms": 10000,
"backoff_multiplier": 2
},
"circuit_breaker": {
"enabled": true,
"error_threshold_percent": 50,
"min_requests": 10,
"open_duration_seconds": 60,
"half_open_max_requests": 3
},
"timeouts": {
"request_timeout_seconds": 30,
"provider_timeout_seconds": 25
}
}
}
Conditional routing
Route requests to specific providers based on request attributes. Rules are evaluated in priority order (lower number = higher priority). First match wins.
Supported fields: model, user, stream, provider, session_id, request_id, metadata.<key>
Supported operators: $eq, $ne, $in, $nin, $regex, $gt, $lt, $gte, $lte, $exists
routing:
conditional_routes:
- name: "enterprise-to-dedicated"
priority: 10
condition:
field: "metadata.tier"
op: "$eq"
value: "enterprise"
action:
provider: "openai-dedicated"
- name: "gpt-models-to-openai"
priority: 50
condition:
field: "model"
op: "$regex"
value: "^gpt-"
action:
provider: "openai"
- name: "streaming-to-groq"
priority: 60
condition:
field: "stream"
op: "$eq"
value: true
action:
provider: "groq"
You can also combine conditions with $and, $or, and $not:
- name: "premium-non-streaming"
priority: 20
condition:
$and:
- field: "metadata.tier"
op: "$eq"
value: "premium"
- field: "stream"
op: "$eq"
value: false
action:
provider: "openai-premium"
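The evaluation model above (priority order, first match wins, combinable conditions) can be sketched as follows. This is an illustration of the semantics, not Prism's matcher; it covers only a few of the operators, and it assumes dotted fields like `metadata.tier` arrive as pre-flattened request keys.

```python
import re

def match(condition, request):
    """Evaluate one condition (a subset of the operators above) against a request."""
    if "$and" in condition:
        return all(match(c, request) for c in condition["$and"])
    if "$or" in condition:
        return any(match(c, request) for c in condition["$or"])
    if "$not" in condition:
        return not match(condition["$not"], request)
    value = request.get(condition["field"])
    op, target = condition["op"], condition.get("value")
    if op == "$eq":
        return value == target
    if op == "$in":
        return value in target
    if op == "$regex":
        return isinstance(value, str) and re.search(target, value) is not None
    raise ValueError(f"operator {op} not implemented in this sketch")

def route(rules, request):
    """First match wins, evaluated in priority order (lower number first)."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if match(rule["condition"], request):
            return rule["action"]["provider"]
    return None  # no rule matched: fall through to the default strategy
```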
Real-world patterns
Gradual provider migration
Migrate from one provider to another without a big-bang switch. Start with 10% traffic to the new provider and increase over time:
{
"name": "Gradual migration to Anthropic",
"strategy": "weighted",
"config": {
"weights": {
"openai": 90,
"anthropic": 10
}
}
}
Increase the Anthropic weight over days or weeks. If issues arise, dial it back immediately.
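Under the hood, weighted routing amounts to a weighted random pick per request. A minimal sketch of that selection (illustrative only, not Prism's implementation):

```python
import random

def pick_provider(weights, rng=random):
    """Pick one provider at random, proportionally to its configured weight."""
    providers = list(weights)
    return rng.choices(providers, weights=[weights[p] for p in providers], k=1)[0]
```

Over many requests, a 90/10 policy sends roughly 10% of traffic to the new provider, which is what makes the gradual dial-up (and instant dial-back) safe.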
Cost optimization across tiers
Use conditional routing to direct different request types to the most cost-effective provider:
routing:
conditional_routes:
- name: "long-context-to-gemini"
priority: 10
condition:
field: "model"
op: "$in"
value: ["gpt-4o", "claude-opus-4-6"]
action:
provider: "gemini" # Lower cost for long-context tasks
- name: "fast-tasks-to-groq"
priority: 20
condition:
field: "metadata.task_type"
op: "$eq"
value: "classification"
action:
provider: "groq" # High speed, low cost for simple tasks
Rate limit absorption
Spread load across providers so a single rate limit doesn’t block your application:
{
"name": "Rate limit absorption",
"strategy": "round-robin",
"config": {
"providers": ["openai", "anthropic", "gemini"],
"failover": {
"enabled": true,
"providers": ["openai", "anthropic", "gemini"],
"failover_on": [429, 500, 502, 503, 504]
}
}
}
When OpenAI rate-limits you, traffic automatically shifts to Anthropic and Gemini.
Model fallbacks
Configure per-model fallback chains for automatic failover when a specific model is unavailable:
routing:
model_fallbacks:
gpt-4o:
- claude-sonnet-4-6
- gemini-2.0-pro
claude-sonnet-4-6:
- gpt-4o
- gemini-2.0-pro
When gpt-4o fails, Prism automatically tries claude-sonnet-4-6, then gemini-2.0-pro.
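The fallback-chain behavior can be sketched as a simple loop over the requested model and its configured fallbacks. This is an illustration; `call_model` is a hypothetical stand-in that raises on failure.

```python
MODEL_FALLBACKS = {
    "gpt-4o": ["claude-sonnet-4-6", "gemini-2.0-pro"],
    "claude-sonnet-4-6": ["gpt-4o", "gemini-2.0-pro"],
}

def complete_with_fallbacks(requested, call_model):
    """Try the requested model, then each fallback, returning the first success."""
    errors = {}
    for model in [requested, *MODEL_FALLBACKS.get(requested, [])]:
        try:
            return model, call_model(model)
        except RuntimeError as exc:  # stand-in for provider/model errors
            errors[model] = exc
    raise RuntimeError(f"all models failed: {list(errors)}")
```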
Complexity-based routing
Route requests to different models based on prompt complexity. Prism scores each request on 8 signals and maps it to a tier.
Scoring signals:
| Signal | Default weight | What it measures |
|---|---|---|
| token_count | 0.15 | Total input tokens |
| message_count | 0.10 | Number of messages in the conversation |
| system_prompt_length | 0.10 | Length of the system prompt |
| tool_count | 0.15 | Number of tools/functions provided |
| multimodal | 0.15 | Whether the request contains images or audio |
| keyword_heuristics | 0.15 | Presence of reasoning keywords (“analyze”, “step by step”, “compare”, etc.) |
| structured_output | 0.10 | Whether response_format is set |
| max_tokens | 0.10 | Requested output length |
Each signal produces a 0-100 score. The weighted sum maps to a tier:
routing:
complexity:
enabled: true
default_tier: "moderate"
tiers:
simple:
max_score: 30
model: "gpt-4o-mini"
provider: "openai"
moderate:
max_score: 70
model: "gpt-4o"
provider: "openai"
complex:
max_score: 100
model: "claude-sonnet-4-6"
provider: "anthropic"
A simple classification request scores low and routes to gpt-4o-mini. A multi-tool reasoning task scores high and routes to claude-sonnet-4-6.
You can override the tier per request with the x-prism-complexity-override header. Pass the tier name (e.g., simple, moderate, or complex, matching your configured tier names).
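The score-to-tier mapping can be sketched directly from the weight table and tier config above. This is an illustration of the arithmetic, not Prism's scorer; it assumes each signal has already been normalized to a 0-100 score.

```python
# Default signal weights from the table above (they sum to 1.0).
SIGNAL_WEIGHTS = {
    "token_count": 0.15, "message_count": 0.10, "system_prompt_length": 0.10,
    "tool_count": 0.15, "multimodal": 0.15, "keyword_heuristics": 0.15,
    "structured_output": 0.10, "max_tokens": 0.10,
}

TIERS = [("simple", 30), ("moderate", 70), ("complex", 100)]  # (name, max_score)

def complexity_tier(signal_scores):
    """Map per-signal 0-100 scores to a tier via the weighted sum."""
    total = sum(SIGNAL_WEIGHTS[name] * score for name, score in signal_scores.items())
    for name, max_score in TIERS:
        if total <= max_score:
            return name
    return TIERS[-1][0]
```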
Provider lock (sticky routing)
Force a request to a specific provider, bypassing the routing strategy. Useful for stateful workflows where you need consistency across multiple calls.
Set it via the x-prism-provider-lock header or provider_lock in request metadata:
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
extra_headers={"x-prism-provider-lock": "openai"},
)
Configure which providers can be locked to:
routing:
provider_lock:
enabled: true
allowed_providers: ["openai", "anthropic"]
deny_providers: ["groq"] # never lock to Groq
If allowed_providers is empty, all providers are allowed (except those in deny_providers).
Adaptive strategy details
The adaptive strategy learns from real traffic and adjusts weights over time:
- Learning phase: For the first N requests (default: 100), uses round-robin to gather baseline latency and error data from all providers.
- Active phase: Computes per-provider weights every 30 seconds using latency (lower is better) and error rate (fewer errors is better).
- Weight smoothing: New weights are blended with old weights using a smoothing factor (default: 0.3) to prevent wild swings.
- Minimum weight: No provider drops below 5% weight, ensuring all providers stay in rotation.
routing:
default_strategy: "adaptive"
adaptive:
enabled: true
learning_requests: 100
update_interval: 30s
smoothing_factor: 0.3
min_weight: 0.05
signal_weights:
latency: 0.5
error_rate: 0.4
# cost: 0.1 (parsed but not yet used in weight calculation)
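One update cycle of the active phase can be sketched as: score each provider (lower latency and error rate are better), blend the scores into the old weights with the smoothing factor, then enforce the 5% floor. The exact scoring function is an assumption here; only the smoothing, floor, and signal weights come from the config above, and cost is ignored as the config notes.

```python
def adapt_weights(old, latency_ms, error_rate, smoothing=0.3, min_weight=0.05):
    """One adaptive update: score providers, blend with old weights, clamp."""
    # Lower latency and fewer errors are better, so invert both signals
    # (0.5 / 0.4 are the latency / error_rate signal weights from the config).
    scores = {p: 0.5 * (1.0 / latency_ms[p]) + 0.4 * (1.0 - error_rate[p])
              for p in old}
    total = sum(scores.values())
    target = {p: s / total for p, s in scores.items()}
    # Blend toward the target, then enforce the weight floor and renormalize.
    blended = {p: (1 - smoothing) * old[p] + smoothing * target[p] for p in old}
    floored = {p: max(w, min_weight) for p, w in blended.items()}
    norm = sum(floored.values())
    return {p: w / norm for p, w in floored.items()}
```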
Race (fastest response) details
The fastest strategy sends the same request to all eligible providers simultaneously and returns whichever responds first. The rest are cancelled.
routing:
default_strategy: "fastest"
fastest:
max_concurrent: 3 # limit parallel calls
cancel_delay: 50ms # wait before cancelling losers
excluded_providers: # skip these in the race
- "groq"
Warning
You are billed by every provider that receives the request, not just the winner. Use this for latency-critical requests where cost is secondary.
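The race behavior can be sketched with asyncio: fire the same request at up to `max_concurrent` providers, take the first result, and cancel the rest. This is an illustration of the strategy, not Prism's code; `call_provider` is a hypothetical async provider call.

```python
import asyncio

async def race(request, providers, call_provider, max_concurrent=3):
    """Send the same request to several providers; the first response wins."""
    tasks = [asyncio.create_task(call_provider(p, request))
             for p in providers[:max_concurrent]]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # losers are cancelled (but every call is still billed)
    return next(iter(done)).result()
```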