Routing & Reliability

Configure load balancing, failover, retries, and circuit breaking across LLM providers.

About

Prism’s routing layer distributes requests across multiple providers and models for reliability and performance. If one provider is down or slow, traffic automatically shifts to healthy alternatives. This ensures your application stays responsive even when individual providers experience outages or rate limiting.


When to use

  • High availability: Automatic failover to backup providers when primary is down or rate-limited
  • Cost optimization: Route to the cheapest provider that supports the requested model
  • Latency reduction: Route to the fastest provider based on recent response times
  • Traffic distribution: Split traffic across providers by weight for capacity management

Key concepts

| Term | Definition |
|---|---|
| Failover | Automatic rerouting of requests to a backup provider when the primary provider fails or returns errors (429, 5xx) |
| Retries | Repeated attempts to send a request after a failure, using exponential backoff to avoid overwhelming the provider |
| Circuit breaking | A protection mechanism that stops sending requests to a failing provider entirely, then gradually tests recovery before resuming full traffic |
| Timeouts | Maximum duration Prism waits for a provider response before treating the request as failed |
| Routing strategy | The algorithm Prism uses to select which provider handles each request (e.g., round robin, weighted, latency-based) |

Configuration parameters

These parameters appear in the JSON configuration blocks throughout this page.

Failover:

| Parameter | Type | Description |
|---|---|---|
| enabled | boolean | Turn failover on or off |
| providers | string[] | Ordered list of providers to try when one fails |
| failover_on | number[] | HTTP status codes that trigger failover (e.g., 429, 500, 502, 503, 504) |

Retries:

| Parameter | Type | Default | Description |
|---|---|---|---|
| max_retries | number | 3 | Maximum number of retry attempts before giving up |
| initial_backoff_ms | number | 100 | Wait time (ms) before the first retry |
| max_backoff_ms | number | 10000 | Upper limit on wait time between retries |
| backoff_multiplier | number | 2 | Multiplier applied to the backoff after each retry (e.g., 100ms → 200ms → 400ms) |

Circuit breaker:

| Parameter | Type | Description |
|---|---|---|
| enabled | boolean | Turn circuit breaking on or off |
| error_threshold_percent | number | Error rate (%) that trips the circuit open |
| min_requests | number | Minimum request count before the error threshold is evaluated |
| open_duration_seconds | number | How long (seconds) the circuit stays open before testing recovery |
| half_open_max_requests | number | Number of trial requests allowed during the half-open recovery test |

Timeouts:

| Parameter | Type | Description |
|---|---|---|
| request_timeout_seconds | number | Maximum total time for the entire request (including retries and failovers) |
| provider_timeout_seconds | number | Maximum time to wait for a single provider response |

Routing strategies

| Strategy | Config value | How it works |
|---|---|---|
| Round Robin | round-robin | Distributes requests evenly across providers in rotation (default) |
| Weighted | weighted | Splits traffic based on assigned weights (e.g., 70% OpenAI, 30% Anthropic) |
| Least Latency | least-latency | Routes to the fastest provider based on recent response times |
| Cost Optimized | cost-optimized | Routes to the cheapest provider that supports the requested model |
| Adaptive | adaptive | Dynamically adjusts weights based on real-time performance |
| Race | fastest | Sends to all providers simultaneously and returns the first response. You are billed for every call made, including those whose responses are discarded |

Configuring a routing policy

  1. Go to Prism > Routing in the Future AGI dashboard
  2. Click Create Policy
  3. Select a strategy and configure provider weights, failover, retries, etc.
  4. Click Save

You can also manage policies programmatically through the SDK. Python:

from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
    control_plane_url="https://api.futureagi.com",
)

# Create a weighted routing policy
policy = client.routing.create(
    name="Production routing",
    strategy="weighted",
    config={"weights": {"openai": 70, "anthropic": 30}},
    description="70/30 split between OpenAI and Anthropic",
)

# List all routing policies
policies = client.routing.list()

# Update an existing policy
client.routing.update(
    policy["id"],
    strategy="least-latency",
    config={"providers": ["openai", "anthropic", "gemini"], "failover_on": [429, 500, 502, 503, 504]},
)
TypeScript:

import { Prism } from "@futureagi/prism";

const client = new Prism({
  apiKey: "sk-prism-your-key",
  baseUrl: "https://gateway.futureagi.com",
  controlPlaneUrl: "https://api.futureagi.com",
});

const policy = await client.routing.create({
  name: "Production routing",
  strategy: "weighted",
  config: { weights: { openai: 70, anthropic: 30 } },
  description: "70/30 split between OpenAI and Anthropic",
});

const policies = await client.routing.list();

await client.routing.update(policy.id, {
  strategy: "least-latency",
  config: { providers: ["openai", "anthropic", "gemini"], failoverOn: [429, 500, 502, 503, 504] },
});

Failover

Failover triggers on specific HTTP status codes and error conditions: 429 (rate limit), 5xx (server errors), timeouts, and connection errors. When the primary provider fails, Prism automatically routes to the next provider in the providers list.

{
  "failover": {
    "enabled": true,
    "providers": ["openai", "anthropic", "gemini"],
    "failover_on": [429, 500, 502, 503, 504]
  }
}
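
The try-in-order behaviour described above can be sketched in a few lines. This is an illustrative Python sketch, not Prism's implementation; `ProviderError` and the `call(provider)` signature are hypothetical stand-ins for a real provider client:

```python
FAILOVER_STATUSES = {429, 500, 502, 503, 504}

class ProviderError(Exception):
    """Hypothetical error carrying the provider's HTTP status code."""
    def __init__(self, status):
        super().__init__(f"provider returned {status}")
        self.status = status

def with_failover(providers, call):
    """Try each provider in order; only the configured statuses trigger failover."""
    last_error = None
    for provider in providers:
        try:
            return call(provider)
        except ProviderError as err:
            if err.status not in FAILOVER_STATUSES:
                raise  # e.g. a 401 is a config problem, not a failover case
            last_error = err
    raise last_error  # every provider in the chain failed
```

Errors outside `failover_on` (such as authentication failures) are re-raised immediately, since retrying them against another provider would not help.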

Note

The providers array defines the failover order. Prism will attempt each provider in sequence until one succeeds.


Retries

Prism uses exponential backoff for retries. This means it waits progressively longer between each retry attempt. For example, 100ms, then 200ms, then 400ms. This gives struggling providers time to recover instead of flooding them with rapid retry requests.

| Setting | Description | Default |
|---|---|---|
| max_retries | Maximum number of retry attempts | 3 |
| initial_backoff_ms | Initial backoff duration in milliseconds | 100 |
| max_backoff_ms | Maximum backoff duration in milliseconds | 10000 |
| backoff_multiplier | Multiplier for exponential backoff | 2 |

{
  "retries": {
    "max_retries": 3,
    "initial_backoff_ms": 100,
    "max_backoff_ms": 10000,
    "backoff_multiplier": 2
  }
}
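
The wait schedule these four parameters produce can be computed directly. A small illustrative helper (not part of the SDK) that yields the delay before each retry:

```python
def backoff_delays(max_retries=3, initial_backoff_ms=100,
                   max_backoff_ms=10_000, backoff_multiplier=2):
    """Yield the wait (in ms) before each retry, capped at max_backoff_ms."""
    delay = initial_backoff_ms
    for _ in range(max_retries):
        yield min(delay, max_backoff_ms)
        delay *= backoff_multiplier

print(list(backoff_delays()))  # with the defaults above: [100, 200, 400]
```

With a larger `max_retries`, the cap kicks in: after 100, 200, 400, 800, 1600, 3200, 6400, every further delay stays at 10000 ms.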

Circuit breaking

Circuit breaking stops sending requests to a provider that is failing repeatedly. After a cooldown, Prism tests the provider with a few trial requests. If those succeed, normal routing resumes. This prevents a single failing provider from degrading your entire application.

The circuit breaker has three states:

| State | Behavior |
|---|---|
| Closed | Normal operation, requests pass through |
| Open | Requests rejected immediately, no calls to the provider |
| Half-Open | Limited requests allowed to test if the provider recovered |

{
  "circuit_breaker": {
    "enabled": true,
    "error_threshold_percent": 50,
    "min_requests": 10,
    "open_duration_seconds": 60,
    "half_open_max_requests": 3
  }
}
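
The three-state machine can be sketched as follows. This is a minimal illustration using the parameter names from the config above, not Prism's actual implementation; the injectable `clock` exists only to make the sketch testable:

```python
import time

class CircuitBreaker:
    """Minimal closed -> open -> half-open state machine (illustrative)."""

    def __init__(self, error_threshold_percent=50, min_requests=10,
                 open_duration_seconds=60, half_open_max_requests=3,
                 clock=time.monotonic):
        self.threshold = error_threshold_percent
        self.min_requests = min_requests
        self.open_duration = open_duration_seconds
        self.half_open_max = half_open_max_requests
        self.clock = clock
        self.state = "closed"
        self.requests = self.errors = self.trials = 0
        self.opened_at = 0.0

    def allow(self):
        """Should the next request be sent to this provider?"""
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_duration:
                return False
            self.state, self.trials = "half-open", 0  # cooldown over: test recovery
        if self.state == "half-open" and self.trials >= self.half_open_max:
            return False
        return True

    def record(self, success):
        """Feed back the outcome of a completed request."""
        if self.state == "half-open":
            self.trials += 1
            if not success:
                self._trip()                      # recovery test failed: reopen
            elif self.trials >= self.half_open_max:
                self.state = "closed"             # trial requests succeeded: resume
                self.requests = self.errors = 0
            return
        self.requests += 1
        self.errors += 0 if success else 1
        if (self.requests >= self.min_requests
                and 100 * self.errors / self.requests >= self.threshold):
            self._trip()

    def _trip(self):
        self.state, self.opened_at = "open", self.clock()
```

Note the error threshold is only evaluated once `min_requests` outcomes have been recorded, so a single early failure cannot trip the circuit.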

Tip

Circuit breaking works seamlessly with failover. When a circuit opens, Prism automatically routes to the next available provider.


Timeouts

Configure per-request and per-provider timeouts to prevent hanging requests.

{
  "timeouts": {
    "request_timeout_seconds": 30,
    "provider_timeout_seconds": 25
  }
}
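
The two timeouts interact: each provider attempt gets at most `provider_timeout_seconds`, while the whole request, failovers included, must finish within `request_timeout_seconds`. A sketch of that budgeting (illustrative; `call` and its `timeout` parameter are hypothetical):

```python
import time

def call_with_deadline(providers, call, request_timeout_seconds=30,
                       provider_timeout_seconds=25):
    """Budget a total deadline across providers; each attempt gets the smaller
    of the remaining budget and the per-provider timeout."""
    deadline = time.monotonic() + request_timeout_seconds
    for provider in providers:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("request deadline exceeded")
        budget = min(remaining, provider_timeout_seconds)
        try:
            return call(provider, timeout=budget)  # hypothetical provider call
        except TimeoutError:
            continue  # this provider timed out: try the next one
    raise TimeoutError("all providers timed out")
```

Keeping `provider_timeout_seconds` below `request_timeout_seconds` leaves budget for at least one failover attempt after a slow primary.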

Example: High-availability setup

This configuration combines weighted routing, failover, retries, and circuit breaking for a production setup:

{
  "name": "Production HA",
  "strategy": "weighted",
  "config": {
    "weights": {
      "openai": 60,
      "anthropic": 30,
      "gemini": 10
    },
    "failover": {
      "enabled": true,
      "providers": ["openai", "anthropic", "gemini"],
      "failover_on": [429, 500, 502, 503, 504]
    },
    "retries": {
      "max_retries": 3,
      "initial_backoff_ms": 100,
      "max_backoff_ms": 10000,
      "backoff_multiplier": 2
    },
    "circuit_breaker": {
      "enabled": true,
      "error_threshold_percent": 50,
      "min_requests": 10,
      "open_duration_seconds": 60,
      "half_open_max_requests": 3
    },
    "timeouts": {
      "request_timeout_seconds": 30,
      "provider_timeout_seconds": 25
    }
  }
}

Conditional routing

Route requests to specific providers based on request attributes. Rules are evaluated in priority order (lower number = higher priority). First match wins.

Supported fields: model, user, stream, provider, session_id, request_id, metadata.<key>

Supported operators: $eq, $ne, $in, $nin, $regex, $gt, $lt, $gte, $lte, $exists

routing:
  conditional_routes:
    - name: "enterprise-to-dedicated"
      priority: 10
      condition:
        field: "metadata.tier"
        op: "$eq"
        value: "enterprise"
      action:
        provider: "openai-dedicated"

    - name: "gpt-models-to-openai"
      priority: 50
      condition:
        field: "model"
        op: "$regex"
        value: "^gpt-"
      action:
        provider: "openai"

    - name: "streaming-to-groq"
      priority: 60
      condition:
        field: "stream"
        op: "$eq"
        value: true
      action:
        provider: "groq"

You can also combine conditions with $and, $or, and $not:

    - name: "premium-non-streaming"
      priority: 20
      condition:
        $and:
          - field: "metadata.tier"
            op: "$eq"
            value: "premium"
          - field: "stream"
            op: "$eq"
            value: false
      action:
        provider: "openai-premium"
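
Rule evaluation behaves like a small predicate language over request attributes. A sketch of how a subset of the operators and the combinators might evaluate (illustrative only; `matches`, `OPS`, and the flattened `"metadata.tier"` key convention are assumptions, not Prism's internals):

```python
import re

# A subset of the supported operators, as predicate functions.
OPS = {
    "$eq":     lambda a, b: a == b,
    "$ne":     lambda a, b: a != b,
    "$in":     lambda a, b: a in b,
    "$nin":    lambda a, b: a not in b,
    "$regex":  lambda a, b: bool(re.search(b, str(a))),
    "$exists": lambda a, b: (a is not None) == b,
}

def matches(condition, request):
    """Evaluate one condition, or an $and/$or/$not combinator, against a
    request dict with flattened keys such as "metadata.tier"."""
    if "$and" in condition:
        return all(matches(c, request) for c in condition["$and"])
    if "$or" in condition:
        return any(matches(c, request) for c in condition["$or"])
    if "$not" in condition:
        return not matches(condition["$not"], request)
    value = request.get(condition["field"])
    return OPS[condition["op"]](value, condition["value"])
```

The `gpt-models-to-openai` rule above, for instance, reduces to `matches({"field": "model", "op": "$regex", "value": "^gpt-"}, request)`.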

Real-world patterns

Gradual provider migration

Migrate from one provider to another without a big-bang switch. Start with 10% traffic to the new provider and increase over time:

{
  "name": "Gradual migration to Anthropic",
  "strategy": "weighted",
  "config": {
    "weights": {
      "openai": 90,
      "anthropic": 10
    }
  }
}

Increase the Anthropic weight over days or weeks. If issues arise, dial it back immediately.
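
Weighted selection itself is a simple draw proportional to the weights. An illustrative sketch of how a 90/10 split might be resolved per request (not the SDK's API; relies on Python's insertion-ordered dicts):

```python
import random

def pick_provider(weights, rng=random.random):
    """Draw one provider with probability proportional to its weight."""
    total = sum(weights.values())
    r = rng() * total
    for provider, weight in weights.items():
        r -= weight
        if r < 0:
            return provider
    return provider  # guard against float rounding at the top edge
```

With `{"openai": 90, "anthropic": 10}`, a draw of 0.5 lands in OpenAI's 0-90 band and a draw of 0.95 lands in Anthropic's 90-100 band, so roughly 9 in 10 requests go to OpenAI.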

Cost optimization across tiers

Use conditional routing to direct different request types to the most cost-effective provider:

routing:
  conditional_routes:
    - name: "long-context-to-gemini"
      priority: 10
      condition:
        field: "model"
        op: "$in"
        value: ["gpt-4o", "claude-opus-4-6"]
      action:
        provider: "gemini"        # Lower cost for long-context tasks

    - name: "fast-tasks-to-groq"
      priority: 20
      condition:
        field: "metadata.task_type"
        op: "$eq"
        value: "classification"
      action:
        provider: "groq"          # High speed, low cost for simple tasks

Rate limit absorption

Spread load across providers so a single rate limit doesn’t block your application:

{
  "name": "Rate limit absorption",
  "strategy": "round-robin",
  "config": {
    "providers": ["openai", "anthropic", "gemini"],
    "failover": {
      "enabled": true,
      "providers": ["openai", "anthropic", "gemini"],
      "failover_on": [429, 500, 502, 503, 504]
    }
  }
}

When OpenAI rate-limits you, traffic automatically shifts to Anthropic and Gemini.


Model fallbacks

Configure per-model fallback chains for automatic failover when a specific model is unavailable:

routing:
  model_fallbacks:
    gpt-4o:
      - claude-sonnet-4-6
      - gemini-2.0-pro
    claude-sonnet-4-6:
      - gpt-4o
      - gemini-2.0-pro

When gpt-4o fails, Prism automatically tries claude-sonnet-4-6, then gemini-2.0-pro.
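
Conceptually, the fallback config turns each requested model into an ordered candidate list. A tiny illustrative helper (`candidate_models` is a hypothetical name, not an SDK function):

```python
# Mirrors the model_fallbacks config above.
MODEL_FALLBACKS = {
    "gpt-4o": ["claude-sonnet-4-6", "gemini-2.0-pro"],
    "claude-sonnet-4-6": ["gpt-4o", "gemini-2.0-pro"],
}

def candidate_models(model, fallbacks=MODEL_FALLBACKS):
    """The requested model first, then its configured fallback chain in order."""
    return [model] + fallbacks.get(model, [])
```

Models with no configured chain simply have no fallbacks: the request fails if the single candidate fails.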


Complexity-based routing

Route requests to different models based on prompt complexity. Prism scores each request on 8 signals and maps it to a tier.

Scoring signals:

| Signal | Default weight | What it measures |
|---|---|---|
| token_count | 0.15 | Total input tokens |
| message_count | 0.10 | Number of messages in the conversation |
| system_prompt_length | 0.10 | Length of the system prompt |
| tool_count | 0.15 | Number of tools/functions provided |
| multimodal | 0.15 | Whether the request contains images or audio |
| keyword_heuristics | 0.15 | Presence of reasoning keywords (“analyze”, “step by step”, “compare”, etc.) |
| structured_output | 0.10 | Whether response_format is set |
| max_tokens | 0.10 | Requested output length |

Each signal produces a 0-100 score. The weighted sum maps to a tier:

routing:
  complexity:
    enabled: true
    default_tier: "moderate"
    tiers:
      simple:
        max_score: 30
        model: "gpt-4o-mini"
        provider: "openai"
      moderate:
        max_score: 70
        model: "gpt-4o"
        provider: "openai"
      complex:
        max_score: 100
        model: "claude-sonnet-4-6"
        provider: "anthropic"

A simple classification request scores low and routes to gpt-4o-mini. A multi-tool reasoning task scores high and routes to claude-sonnet-4-6.
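
The score-to-tier mapping amounts to picking the lowest tier whose `max_score` covers the weighted sum. A sketch under that assumption (`pick_tier` is an illustrative name, not an SDK function):

```python
def pick_tier(score, tiers):
    """Map a 0-100 complexity score to the lowest tier whose max_score covers it."""
    for name, cfg in sorted(tiers.items(), key=lambda kv: kv[1]["max_score"]):
        if score <= cfg["max_score"]:
            return name, cfg["model"]
    return name, cfg["model"]  # scores above the top bound use the highest tier
```

For example, with the tiers configured above, a score of 20 selects the `simple` tier and a score of 85 selects `complex`.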

You can override the tier per request with the x-prism-complexity-override header. Pass the tier name (e.g., simple, moderate, or complex, matching your configured tier names).


Provider lock (sticky routing)

Force a request to a specific provider, bypassing the routing strategy. Useful for stateful workflows where you need consistency across multiple calls.

Set it via the x-prism-provider-lock header or provider_lock in request metadata:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"x-prism-provider-lock": "openai"},
)

Configure which providers can be locked to:

routing:
  provider_lock:
    enabled: true
    allowed_providers: ["openai", "anthropic"]
    deny_providers: ["groq"]  # never lock to Groq

If allowed_providers is empty, all providers are allowed (except those in deny_providers).
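
That allow/deny logic can be stated in two lines. An illustrative check (hypothetical helper, not an SDK function): the deny list always wins, and an empty allow list means every non-denied provider may be locked to.

```python
def lock_allowed(provider, allowed_providers, deny_providers):
    """Deny list wins; an empty allow list permits any non-denied provider."""
    if provider in deny_providers:
        return False
    return not allowed_providers or provider in allowed_providers
```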


Adaptive strategy details

The adaptive strategy learns from real traffic and adjusts weights over time:

  1. Learning phase: For the first N requests (default: 100), uses round-robin to gather baseline latency and error data from all providers.
  2. Active phase: Computes per-provider weights every 30 seconds using latency (lower is better) and error rate (fewer errors is better).
  3. Weight smoothing: New weights are blended with old weights using a smoothing factor (default: 0.3) to prevent wild swings.
  4. Minimum weight: No provider drops below 5% weight, ensuring all providers stay in rotation.

routing:
  default_strategy: "adaptive"
  adaptive:
    enabled: true
    learning_requests: 100
    update_interval: 30s
    smoothing_factor: 0.3
    min_weight: 0.05
    signal_weights:
      latency: 0.5
      error_rate: 0.4
      # cost: 0.1 (parsed but not yet used in weight calculation)
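
Steps 3 and 4 above (smoothing and the weight floor) can be sketched as a single update function. This is an illustration of the described behaviour, assuming both weight dicts share the same provider keys; it is not Prism's implementation:

```python
def smooth_weights(old, new, smoothing_factor=0.3, min_weight=0.05):
    """Blend freshly computed weights into the current ones, apply the
    minimum-weight floor, then renormalise so the weights sum to 1."""
    blended = {p: (1 - smoothing_factor) * old[p] + smoothing_factor * new[p]
               for p in old}
    floored = {p: max(w, min_weight) for p, w in blended.items()}
    total = sum(floored.values())
    return {p: w / total for p, w in floored.items()}
```

Even if a provider's computed weight collapses to zero, the floor keeps it receiving a trickle of traffic, so the router keeps gathering fresh latency and error data from it.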

Race (fastest response) details

The fastest strategy sends the same request to all eligible providers simultaneously and returns whichever responds first. The rest are cancelled.

routing:
  default_strategy: "fastest"
  fastest:
    max_concurrent: 3        # limit parallel calls
    cancel_delay: 50ms       # wait before cancelling losers
    excluded_providers:       # skip these in the race
      - "groq"
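
The race pattern maps naturally onto concurrent futures. A sketch of the idea (illustrative, not Prism's implementation; note that in Python a `cancel()` on an already-running call is best-effort, which is one reason a `cancel_delay` grace period exists):

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def race(calls, max_concurrent=3):
    """Run the provider calls concurrently and return the first result."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = [pool.submit(fn) for fn in calls]
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for future in pending:
            future.cancel()  # best-effort cancellation of the losers
        return next(iter(done)).result()  # an error from the winner propagates
```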

Warning

You are billed by every provider that receives the request, not just the winner. Use this for latency-critical requests where cost is secondary.

