Self-hosted models

Connect Prism to locally running models via Ollama, vLLM, LM Studio, and other OpenAI-compatible servers.

About

Prism can route requests to models running on your own hardware alongside cloud providers. Self-hosted models are configured as providers with a base_url pointing to your local inference server. All gateway features (routing, caching, failover, guardrails) work the same way.


Supported inference servers

Server                        type value    Notes
Ollama                        ollama        Auto-discovers models; no model list needed
vLLM                          vllm          OpenAI-compatible server for production inference
LM Studio                     lm_studio     Desktop app with local server mode
Any OpenAI-compatible server  (omit type)   Set api_format: "openai" and base_url

Configuration

Ollama

providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"
    # Models are auto-discovered from Ollama's /v1/models endpoint

Ollama auto-discovers all pulled models. After pulling a model (ollama pull llama3.1), it’s immediately available through Prism.

vLLM

providers:
  vllm:
    base_url: "http://gpu-server:8000"
    type: "vllm"
    api_format: "openai"
    models:
      - "meta-llama/Llama-3.1-70B-Instruct"

LM Studio

providers:
  lm-studio:
    base_url: "http://localhost:1234"
    type: "lm_studio"
    api_format: "openai"

Generic OpenAI-compatible server

Any server that implements the /v1/chat/completions endpoint:

providers:
  my-server:
    base_url: "http://inference.internal:8080"
    api_format: "openai"
    models:
      - "my-custom-model"
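"OpenAI-compatible" here means the server accepts and returns the standard chat-completions JSON shapes. The sketch below shows the minimal request body such a server must accept and how the response is parsed; the field names come from the OpenAI chat-completions wire format, and the sample response is illustrative, not captured from a real server.

```python
import json

def chat_request(model: str, user_message: str) -> dict:
    """Build the minimal JSON body a /v1/chat/completions server must accept."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

def extract_reply(response_body: str) -> str:
    """Pull the assistant text out of a standard chat-completions response."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]

# A representative response shape from an OpenAI-compatible server:
sample = json.dumps({
    "id": "chatcmpl-1",
    "object": "chat.completion",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "Hi!"},
        "finish_reason": "stop",
    }],
})
print(extract_reply(sample))  # Hi!
```

If your server round-trips these two shapes, Prism can sit in front of it with just base_url and api_format: "openai".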

Hybrid routing

The main benefit of running self-hosted models behind Prism is hybrid routing: serve simple requests with cheap local models and fall back to cloud providers for complex ones.

Cost-based routing

Route to the cheapest option first:

routing:
  default_strategy: "cost-optimized"

providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"

  openai:
    api_key: "${OPENAI_API_KEY}"
    api_format: "openai"
    models: ["gpt-4o", "gpt-4o-mini"]
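With cost-optimized routing, the gateway effectively picks the cheapest candidate that can serve the request, and local models typically price at $0. The selection rule can be sketched as below; the price table and the specific figures are illustrative assumptions, not Prism's built-in pricing.

```python
# Hypothetical per-1M-input-token prices; local models default to $0.
PRICES = {
    ("ollama", "llama3.1"): 0.0,
    ("openai", "gpt-4o-mini"): 0.15,
    ("openai", "gpt-4o"): 2.50,
}

def cheapest(candidates):
    """Pick the (provider, model) pair with the lowest configured price."""
    return min(candidates, key=lambda pm: PRICES.get(pm, float("inf")))

print(cheapest(list(PRICES)))  # ('ollama', 'llama3.1')
```

Because the local model prices at zero, it wins whenever it is eligible; cloud models are only chosen when the local option is excluded.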

Failover from local to cloud

Use local models as the primary, with cloud as a backup:

routing:
  failover:
    enabled: true
    providers: ["ollama", "openai"]
    failover_on: [429, 500, 502, 503, 504]

providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"

  openai:
    api_key: "${OPENAI_API_KEY}"
    api_format: "openai"
    models: ["gpt-4o"]

If Ollama is down or overloaded, requests automatically route to OpenAI.
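The failover semantics amount to trying providers in configured order and falling through only on the listed HTTP statuses. A minimal sketch of that loop, with a stubbed send function standing in for real provider calls:

```python
FAILOVER_ON = {429, 500, 502, 503, 504}

class ProviderError(Exception):
    """Carries the HTTP status returned by a provider."""
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status

def with_failover(providers, send):
    """Try providers in order; fall through only on failover statuses."""
    last = None
    for name in providers:
        try:
            return send(name)
        except ProviderError as err:
            if err.status not in FAILOVER_ON:
                raise  # non-retryable errors surface immediately
            last = err
    raise last  # every provider failed with a retryable status

# Simulate Ollama being down (503) while OpenAI succeeds:
def fake_send(name):
    if name == "ollama":
        raise ProviderError(503)
    return f"response from {name}"

print(with_failover(["ollama", "openai"], fake_send))  # response from openai
```

Note that a non-retryable status (e.g. a 400 from a malformed request) is raised immediately rather than retried against the backup provider.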

Complexity-based routing

Route simple queries to a local model and complex queries to a cloud model:

routing:
  complexity:
    enabled: true
    tiers:
      simple:
        max_score: 30
        model: "llama3.1"
        provider: "ollama"
      complex:
        max_score: 100
        model: "gpt-4o"
        provider: "openai"

See Routing > Complexity-based routing for the full scoring system.
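How the complexity score itself is computed is covered in the Routing docs; the tier selection on top of it is simple. The sketch below mirrors the config above, picking the first tier whose max_score covers the request's score. The threshold values are copied from the example config, not defaults.

```python
# Tiers in ascending max_score order, mirroring the YAML config above.
TIERS = [
    {"name": "simple",  "max_score": 30,  "provider": "ollama", "model": "llama3.1"},
    {"name": "complex", "max_score": 100, "provider": "openai", "model": "gpt-4o"},
]

def pick_tier(score: int) -> dict:
    """Return the first tier whose max_score covers the request's score."""
    for tier in TIERS:
        if score <= tier["max_score"]:
            return tier
    return TIERS[-1]  # scores above every threshold use the top tier

print(pick_tier(12)["model"])  # llama3.1
print(pick_tier(75)["model"])  # gpt-4o
```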


Using self-hosted models from code

Once configured, self-hosted models are used the same way as cloud models:

from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="http://localhost:8080",  # your self-hosted Prism gateway
)

# Route to Ollama
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello"}],
)

# Or pin to a specific provider
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"x-prism-provider-lock": "ollama"},
)

Limitations

  • Self-hosted models don’t support the Assistants API (threads are stored on OpenAI’s servers)
  • Embedding endpoints require the inference server to implement /v1/embeddings
  • Cost tracking uses configured pricing. Set custom pricing for self-hosted models in the provider config, or costs will show as $0.
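On that last point, custom pricing lets cost tracking reflect your amortized hardware cost instead of $0. The key names below (pricing, input_per_1m, output_per_1m) are assumptions for illustration; check the provider configuration reference for the exact schema.

```yaml
providers:
  vllm:
    base_url: "http://gpu-server:8000"
    type: "vllm"
    api_format: "openai"
    models:
      - "meta-llama/Llama-3.1-70B-Instruct"
    # Hypothetical pricing keys -- confirm against the provider config schema.
    pricing:
      "meta-llama/Llama-3.1-70B-Instruct":
        input_per_1m: 0.40   # amortized GPU cost per 1M input tokens, USD
        output_per_1m: 0.80
```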
