# Self-Hosted Model Integration in Agent Command Center
Route requests to Ollama, vLLM, LM Studio, or any OpenAI-compatible inference server alongside cloud providers. All gateway features apply to local models.
## About

Agent Command Center can route requests to models running on your own hardware alongside cloud providers. Self-hosted models are configured as providers with a `base_url` pointing to your local inference server. All gateway features (routing, caching, failover, guardrails) work the same way.
## Supported inference servers

| Server | `type` value | Notes |
|---|---|---|
| Ollama | `ollama` | Auto-discovers models; no model list needed |
| vLLM | `vllm` | OpenAI-compatible server for production inference |
| LM Studio | `lm_studio` | Desktop app with local server mode |
| Any OpenAI-compatible server | (omit `type`) | Set `api_format: "openai"` and `base_url` |
## Configuration

### Ollama
```yaml
providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"
    # Models are auto-discovered from Ollama's /v1/models endpoint
```
Ollama auto-discovers all pulled models. After pulling a model (`ollama pull llama3.1`), it is immediately available through Agent Command Center.
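To confirm a model has been discovered, you can query Ollama's OpenAI-compatible model list directly (the same `/v1/models` endpoint the gateway reads). A minimal check, assuming Ollama is running on its default port:

```python
# Minimal sketch: list the models Ollama exposes on its OpenAI-compatible
# /v1/models endpoint, which is what Agent Command Center auto-discovers from.
import requests

resp = requests.get("http://localhost:11434/v1/models", timeout=5)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])
# Should include llama3.1 (possibly tagged, e.g. "llama3.1:latest") once the pull completes
```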
### vLLM
```yaml
providers:
  vllm:
    base_url: "http://gpu-server:8000"
    type: "vllm"
    api_format: "openai"
    models:
      - "meta-llama/Llama-3.1-70B-Instruct"
```
### LM Studio
```yaml
providers:
  lm-studio:
    base_url: "http://localhost:1234"
    type: "lm_studio"
    api_format: "openai"
```
### Generic OpenAI-compatible server

Any server that implements the `/v1/chat/completions` endpoint:
```yaml
providers:
  my-server:
    base_url: "http://inference.internal:8080"
    api_format: "openai"
    models:
      - "my-custom-model"
```
## Hybrid routing

The main benefit of putting self-hosted models behind Agent Command Center is hybrid routing: use cheap local models for simple requests and fall back to cloud providers for complex ones.
### Cost-based routing
Route to the cheapest option first:
```yaml
routing:
  default_strategy: "cost-optimized"

providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"
  openai:
    api_key: "${OPENAI_API_KEY}"
    api_format: "openai"
    models: ["gpt-4o", "gpt-4o-mini"]
```
### Failover from local to cloud

Use local models as the primary target, with a cloud provider as backup:
```yaml
routing:
  failover:
    enabled: true
    providers: ["ollama", "openai"]
    failover_on: [429, 500, 502, 503, 504]

providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"
  openai:
    api_key: "${OPENAI_API_KEY}"
    api_format: "openai"
    models: ["gpt-4o"]
```
If Ollama is down or overloaded, requests automatically route to OpenAI.
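Failover is transparent to callers. The sketch below uses the Python client introduced under "Using self-hosted models from code" and assumes the gateway remaps the retried request onto the fallback provider's configured model:

```python
# Minimal sketch: no client-side retry logic; the gateway applies the
# failover chain ("ollama" first, then "openai") server-side.
from agentcc import AgentCC

client = AgentCC(api_key="sk-agentcc-your-key", base_url="http://localhost:8080")

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence."}],
)
# If Ollama returns 429/5xx, the gateway retries against OpenAI
# (assumption: the request is remapped to OpenAI's configured model, gpt-4o).
```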
### Complexity-based routing
Route simple queries to a local model and complex queries to a cloud model:
```yaml
routing:
  complexity:
    enabled: true
    tiers:
      simple:
        max_score: 30
        model: "llama3.1"
        provider: "ollama"
      complex:
        max_score: 100
        model: "gpt-4o"
        provider: "openai"
```
See Routing > Complexity-based routing for the full scoring system.
## Using self-hosted models from code
Once configured, self-hosted models are used the same way as cloud models:
```python
from agentcc import AgentCC

client = AgentCC(
    api_key="sk-agentcc-your-key",
    base_url="http://localhost:8080",  # your self-hosted Agent Command Center
)

# Route to Ollama
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello"}],
)

# Or pin to a specific provider
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"x-agentcc-provider-lock": "ollama"},
)
```
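Streaming goes through the same interface when the inference server supports it (Ollama, vLLM, and LM Studio all expose OpenAI-style streaming). A minimal sketch, continuing from the client above and assuming the gateway forwards `stream=True` unchanged:

```python
# Minimal streaming sketch; assumes OpenAI-style delta chunks are passed through.
stream = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a haiku about GPUs"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```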
## Limitations
- Self-hosted models don’t support the Assistants API (threads are stored on OpenAI’s servers)
- Embedding endpoints require the inference server to implement `/v1/embeddings`
- Cost tracking uses configured pricing. Set custom pricing for self-hosted models in the provider config (see the sketch below), or costs will show as $0.
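For the last point, a hedged sketch of what per-model pricing for a self-hosted provider might look like. The `pricing`, `input_per_1k`, and `output_per_1k` keys are illustrative assumptions, not the confirmed schema; check the provider configuration reference for the exact key names:

```yaml
providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"
    # Illustrative pricing block (key names are assumptions): amortized
    # hardware/electricity cost per 1K tokens, so cost-optimized routing
    # and spend reports account for local traffic instead of showing $0.
    pricing:
      llama3.1:
        input_per_1k: 0.0002
        output_per_1k: 0.0004
```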