# Self-Hosted Model Integration in Agent Command Center
Route requests to Ollama, vLLM, LM Studio, or any OpenAI-compatible inference server alongside cloud providers. All gateway features apply to local models.
## About

Agent Command Center can route requests to models running on your own hardware alongside cloud providers. Self-hosted models are configured as providers with a `base_url` pointing to your local inference server. All gateway features (routing, caching, failover, guardrails) work the same way.
## Supported inference servers

| Server | `type` value | Notes |
|---|---|---|
| Ollama | `ollama` | Auto-discovers models; no model list needed |
| vLLM | `vllm` | OpenAI-compatible server for production inference |
| LM Studio | `lm_studio` | Desktop app with local server mode |
| Any OpenAI-compatible server | (omit `type`) | Set `api_format: "openai"` and `base_url` |
## Configuration

### Ollama
```yaml
providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"
    # Models are auto-discovered from Ollama's /v1/models endpoint
```
Ollama auto-discovers all pulled models. After pulling a model (`ollama pull llama3.1`), it is immediately available through Agent Command Center.
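To confirm a model has been discovered, you can query Ollama's OpenAI-compatible model list directly (the same `/v1/models` endpoint the gateway reads). A minimal check, assuming Ollama is running on its default port:

```python
# Minimal sketch: list the models Ollama exposes on its OpenAI-compatible
# /v1/models endpoint, which is what Agent Command Center auto-discovers from.
import requests

resp = requests.get("http://localhost:11434/v1/models", timeout=5)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])
# Should include llama3.1 (possibly tagged, e.g. "llama3.1:latest") once the pull completes
```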
### vLLM
```yaml
providers:
  vllm:
    base_url: "http://gpu-server:8000"
    type: "vllm"
    api_format: "openai"
    models:
      - "meta-llama/Llama-3.1-70B-Instruct"
```
### LM Studio
```yaml
providers:
  lm-studio:
    base_url: "http://localhost:1234"
    type: "lm_studio"
    api_format: "openai"
```
### Generic OpenAI-compatible server

Any server that implements the `/v1/chat/completions` endpoint:
```yaml
providers:
  my-server:
    base_url: "http://inference.internal:8080"
    api_format: "openai"
    models:
      - "my-custom-model"
```
## Hybrid routing

The main benefit of putting self-hosted models behind Agent Command Center is hybrid routing: use cheap local models for simple requests and fall back to cloud providers for complex ones.
### Cost-based routing
Route to the cheapest option first:
```yaml
routing:
  default_strategy: "cost-optimized"

providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"
  openai:
    api_key: "${OPENAI_API_KEY}"
    api_format: "openai"
    models: ["gpt-4o", "gpt-4o-mini"]
```
### Failover from local to cloud

Use local models as the primary target, with a cloud provider as backup:
```yaml
routing:
  failover:
    enabled: true
    providers: ["ollama", "openai"]
    failover_on: [429, 500, 502, 503, 504]

providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"
  openai:
    api_key: "${OPENAI_API_KEY}"
    api_format: "openai"
    models: ["gpt-4o"]
```
If Ollama is down or overloaded, requests automatically route to OpenAI.
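Failover is transparent to callers. The sketch below uses the Python client introduced under "Using self-hosted models from code" and assumes the gateway remaps the retried request onto the fallback provider's configured model:

```python
# Minimal sketch: no client-side retry logic; the gateway applies the
# failover chain ("ollama" first, then "openai") server-side.
from agentcc import AgentCC

client = AgentCC(api_key="sk-agentcc-your-key", base_url="http://localhost:8080")

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence."}],
)
# If Ollama returns 429/5xx, the gateway retries against OpenAI
# (assumption: the request is remapped to OpenAI's configured model, gpt-4o).
```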
### Complexity-based routing
Route simple queries to a local model and complex queries to a cloud model:
```yaml
routing:
  complexity:
    enabled: true
    tiers:
      simple:
        max_score: 30
        model: "llama3.1"
        provider: "ollama"
      complex:
        max_score: 100
        model: "gpt-4o"
        provider: "openai"
```
See Routing > Complexity-based routing for the full scoring system.
## Using self-hosted models from code
Once configured, self-hosted models are used the same way as cloud models:
```python
from agentcc import AgentCC

client = AgentCC(
    api_key="sk-agentcc-your-key",
    base_url="http://localhost:8080",  # your self-hosted Agent Command Center
)

# Route to Ollama
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello"}],
)

# Or pin to a specific provider
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"x-agentcc-provider-lock": "ollama"},
)
```
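Streaming goes through the same interface when the inference server supports it (Ollama, vLLM, and LM Studio all expose OpenAI-style streaming). A minimal sketch, continuing from the client above and assuming the gateway forwards `stream=True` unchanged:

```python
# Minimal streaming sketch; assumes OpenAI-style delta chunks are passed through.
stream = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a haiku about GPUs"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```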
## Limitations
- Self-hosted models don’t support the Assistants API (threads are stored on OpenAI’s servers)
- Embedding endpoints require the inference server to implement `/v1/embeddings`
- Cost tracking uses configured pricing. Set custom pricing for self-hosted models in the provider config (see the sketch below), or costs will show as $0.
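For the last point, a hedged sketch of what per-model pricing for a self-hosted provider might look like. The `pricing`, `input_per_1k`, and `output_per_1k` keys are illustrative assumptions, not the confirmed schema; check the provider configuration reference for the exact key names:

```yaml
providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"
    # Illustrative pricing block (key names are assumptions): amortized
    # hardware/electricity cost per 1K tokens, so cost-optimized routing
    # and spend reports account for local traffic instead of showing $0.
    pricing:
      llama3.1:
        input_per_1k: 0.0002
        output_per_1k: 0.0004
```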