# Self-hosted models

Connect Prism to locally running models via Ollama, vLLM, LM Studio, and other OpenAI-compatible servers.
## About

Prism can route requests to models running on your own hardware alongside cloud providers. Self-hosted models are configured as providers with a `base_url` pointing to your local inference server. All gateway features (routing, caching, failover, guardrails) work the same way.
## Supported inference servers

| Server | `type` value | Notes |
|---|---|---|
| Ollama | `ollama` | Auto-discovers models; no model list needed |
| vLLM | `vllm` | OpenAI-compatible server for production inference |
| LM Studio | `lm_studio` | Desktop app with local server mode |
| Any OpenAI-compatible server | (omit `type`) | Set `api_format: "openai"` and `base_url` |
## Configuration

### Ollama

```yaml
providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"
    # Models are auto-discovered from Ollama's /v1/models endpoint
```
Ollama auto-discovers all pulled models. After pulling a model (`ollama pull llama3.1`), it's immediately available through Prism.
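Auto-discovery reads the same OpenAI-style model list you can query yourself. A minimal sketch for checking what Prism will see, assuming Ollama is on its default port and no auth is required:

```python
import json
import urllib.request


def list_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]


def discover_ollama_models(base_url: str = "http://localhost:11434") -> list[str]:
    # Hit Ollama's OpenAI-compatible model list -- the endpoint Prism
    # reads during auto-discovery.
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return list_model_ids(json.load(resp))
```

If a model you just pulled doesn't appear here, Prism won't see it either.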
### vLLM

```yaml
providers:
  vllm:
    base_url: "http://gpu-server:8000"
    type: "vllm"
    api_format: "openai"
    models:
      - "meta-llama/Llama-3.1-70B-Instruct"
```
### LM Studio

```yaml
providers:
  lm-studio:
    base_url: "http://localhost:1234"
    type: "lm_studio"
    api_format: "openai"
```
### Generic OpenAI-compatible server

Any server that implements the `/v1/chat/completions` endpoint:

```yaml
providers:
  my-server:
    base_url: "http://inference.internal:8080"
    api_format: "openai"
    models:
      - "my-custom-model"
```
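Before wiring a server into Prism, it can help to hit the endpoint directly. A minimal smoke test, assuming the server needs no auth (add an `Authorization` header if yours does):

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str) -> dict:
    """OpenAI-style chat body -- the minimum the endpoint accepts."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def extract_reply(response: dict) -> str:
    """Pull the assistant message out of a chat completion response."""
    return response["choices"][0]["message"]["content"]


def smoke_test(base_url: str, model: str) -> str:
    # One round trip against the server, independent of Prism.
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, "Say hello")).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```

If this round trip works, the same `base_url` and model name should work in the provider config above.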
## Hybrid routing
The main value of self-hosted models through Prism is hybrid routing: use cheap local models for simple requests and fall back to cloud providers for complex ones.
### Cost-based routing

Route to the cheapest option first:

```yaml
routing:
  default_strategy: "cost-optimized"

providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"
  openai:
    api_key: "${OPENAI_API_KEY}"
    api_format: "openai"
    models: ["gpt-4o", "gpt-4o-mini"]
```
### Failover from local to cloud

Use local models as the primary, with cloud as a backup:

```yaml
routing:
  failover:
    enabled: true
    providers: ["ollama", "openai"]
    failover_on: [429, 500, 502, 503, 504]

providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"
  openai:
    api_key: "${OPENAI_API_KEY}"
    api_format: "openai"
    models: ["gpt-4o"]
```
If Ollama is down or overloaded, requests automatically route to OpenAI.
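An illustrative sketch of that behavior, not Prism's actual implementation: try each provider in order and move on when the response status is in `failover_on`.

```python
# Status codes that trigger failover, mirroring the config above.
FAILOVER_ON = {429, 500, 502, 503, 504}


def route_with_failover(providers, send):
    """Try providers in order. `send(provider)` returns (status, body).

    Returns the first (provider, body) whose status is not in
    FAILOVER_ON; if every provider fails, returns the last attempt
    so the caller can surface the error.
    """
    last = None
    for provider in providers:
        status, body = send(provider)
        if status not in FAILOVER_ON:
            return provider, body
        last = (provider, body)
    return last
```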
### Complexity-based routing

Route simple queries to a local model and complex queries to a cloud model:

```yaml
routing:
  complexity:
    enabled: true
    tiers:
      simple:
        max_score: 30
        model: "llama3.1"
        provider: "ollama"
      complex:
        max_score: 100
        model: "gpt-4o"
        provider: "openai"
```
See Routing > Complexity-based routing for the full scoring system.
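Conceptually, tier selection is a threshold lookup. A hedged sketch of how the config above maps a score to a tier — the score itself comes from Prism's scoring system, which this sketch does not reproduce:

```python
# Tiers from the config above; a complexity score in [0, 100] is
# assumed to have been computed upstream by the gateway.
TIERS = [
    {"name": "simple", "max_score": 30, "provider": "ollama", "model": "llama3.1"},
    {"name": "complex", "max_score": 100, "provider": "openai", "model": "gpt-4o"},
]


def pick_tier(score: float) -> dict:
    """Return the first tier whose max_score covers the score."""
    for tier in sorted(TIERS, key=lambda t: t["max_score"]):
        if score <= tier["max_score"]:
            return tier
    return TIERS[-1]  # scores above every threshold use the top tier
```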
## Using self-hosted models from code

Once configured, self-hosted models are used the same way as cloud models:

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="http://localhost:8080",  # your self-hosted Prism gateway
)

# Route to Ollama
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello"}],
)

# Or pin to a specific provider
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"x-prism-provider-lock": "ollama"},
)
```
## Limitations

- Self-hosted models don't support the Assistants API (threads are stored on OpenAI's servers).
- Embedding endpoints require the inference server to implement `/v1/embeddings`.
- Cost tracking uses configured pricing. Set custom pricing for self-hosted models in the provider config, or costs will show as $0.
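A sketch of custom pricing for a self-hosted provider. The field names (`pricing`, `input_per_1m`, `output_per_1m`) are assumptions for illustration — check your Prism version's provider schema for the exact keys:

```yaml
providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"
    # Assumed field names -- verify against your Prism provider schema.
    pricing:
      llama3.1:
        input_per_1m: 0.10    # USD per 1M input tokens (e.g. amortized GPU cost)
        output_per_1m: 0.20   # USD per 1M output tokens
```

With pricing set, cost-optimized routing and spend reports can compare local and cloud models on the same scale instead of treating local inference as free.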