Self-Hosted Model Integration in Agent Command Center

Route requests to Ollama, vLLM, LM Studio, or any OpenAI-compatible inference server alongside cloud providers. All gateway features apply to local models.

About

Agent Command Center can route requests to models running on your own hardware alongside cloud providers. Self-hosted models are configured as providers with a base_url pointing to your local inference server. All gateway features (routing, caching, failover, guardrails) work the same way.


Supported inference servers

Server                          type value     Notes
Ollama                          ollama         Auto-discovers models; no model list needed
vLLM                            vllm           OpenAI-compatible server for production inference
LM Studio                       lm_studio      Desktop app with a local server mode
Any OpenAI-compatible server    (omit type)    Set api_format: "openai" and base_url

Configuration

Ollama

providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"
    # Models are auto-discovered from Ollama's /v1/models endpoint

Ollama auto-discovers all pulled models. After pulling a model (ollama pull llama3.1), it’s immediately available through Agent Command Center.
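To see which model names the gateway will pick up, you can query Ollama's OpenAI-compatible /v1/models endpoint directly (for example with curl http://localhost:11434/v1/models). A minimal parsing sketch; the sample response below is trimmed, and the exact id strings (including tag suffixes) depend on how each model was pulled:

```python
import json

def parse_model_ids(models_json: str) -> list:
    """Extract model ids from an OpenAI-style /v1/models response."""
    return [m["id"] for m in json.loads(models_json)["data"]]

# Trimmed example of the list response after `ollama pull llama3.1`.
sample = '{"object": "list", "data": [{"id": "llama3.1", "object": "model"}]}'
print(parse_model_ids(sample))  # -> ['llama3.1']
```

The ids returned here are the values you pass as model in requests through the gateway.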

vLLM

providers:
  vllm:
    base_url: "http://gpu-server:8000"
    type: "vllm"
    api_format: "openai"
    models:
      - "meta-llama/Llama-3.1-70B-Instruct"

LM Studio

providers:
  lm-studio:
    base_url: "http://localhost:1234"
    type: "lm_studio"
    api_format: "openai"

Generic OpenAI-compatible server

Any server that implements the /v1/chat/completions endpoint:

providers:
  my-server:
    base_url: "http://inference.internal:8080"
    api_format: "openai"
    models:
      - "my-custom-model"
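For a quick smoke test before wiring a server into the gateway, you can send it the same minimal request shape the gateway will issue. A sketch of the standard OpenAI chat-completions request body (my-custom-model is just the example name from the config above):

```python
import json

# Minimal OpenAI-style chat completion body. Any server that accepts
# this at POST {base_url}/v1/chat/completions can sit behind an
# `api_format: "openai"` provider.
payload = {
    "model": "my-custom-model",
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 16,
}
body = json.dumps(payload)
print(body)
```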

Hybrid routing

The main benefit of routing self-hosted models through Agent Command Center is hybrid routing: serve simple requests with cheap local models and fall back to cloud providers for complex ones.

Cost-based routing

Route to the cheapest option first:

routing:
  default_strategy: "cost-optimized"

providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"

  openai:
    api_key: "${OPENAI_API_KEY}"
    api_format: "openai"
    models: ["gpt-4o", "gpt-4o-mini"]

Failover from local to cloud

Use local models as the primary, with cloud as a backup:

routing:
  failover:
    enabled: true
    providers: ["ollama", "openai"]
    failover_on: [429, 500, 502, 503, 504]

providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"

  openai:
    api_key: "${OPENAI_API_KEY}"
    api_format: "openai"
    models: ["gpt-4o"]

If Ollama is down or overloaded, requests automatically route to OpenAI.
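Conceptually, the failover above is an ordered loop over providers. This is an illustrative sketch, not the gateway's actual implementation; request_fn stands in for the real HTTP call:

```python
def send_with_failover(request_fn, providers,
                       failover_on=(429, 500, 502, 503, 504)):
    """Try providers in order; move to the next when one is unreachable
    or returns a status listed in failover_on."""
    errors = []
    for provider in providers:
        try:
            status, body = request_fn(provider)
        except ConnectionError as exc:
            errors.append((provider, str(exc)))
            continue
        if status not in failover_on:
            return provider, body
        errors.append((provider, status))
    raise RuntimeError(f"all providers failed: {errors}")

# Simulate Ollama being down so the request falls through to OpenAI.
def fake_request(provider):
    if provider == "ollama":
        raise ConnectionError("connection refused")
    return 200, "ok"

print(send_with_failover(fake_request, ["ollama", "openai"]))
# -> ('openai', 'ok')
```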

Complexity-based routing

Route simple queries to a local model and complex queries to a cloud model:

routing:
  complexity:
    enabled: true
    tiers:
      simple:
        max_score: 30
        model: "llama3.1"
        provider: "ollama"
      complex:
        max_score: 100
        model: "gpt-4o"
        provider: "openai"

See Routing > Complexity-based routing for the full scoring system.
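Tier selection amounts to picking the lowest tier whose max_score covers the request's complexity score. An illustrative sketch of that lookup (the real scoring is part of the gateway, not shown here):

```python
def pick_tier(score, tiers):
    """Return (provider, model) for the lowest tier whose max_score
    covers the given complexity score."""
    for name, cfg in sorted(tiers.items(), key=lambda kv: kv[1]["max_score"]):
        if score <= cfg["max_score"]:
            return cfg["provider"], cfg["model"]
    raise ValueError(f"no tier covers score {score}")

# Mirrors the routing.complexity.tiers config above.
tiers = {
    "simple": {"max_score": 30, "provider": "ollama", "model": "llama3.1"},
    "complex": {"max_score": 100, "provider": "openai", "model": "gpt-4o"},
}

print(pick_tier(12, tiers))  # -> ('ollama', 'llama3.1')
print(pick_tier(75, tiers))  # -> ('openai', 'gpt-4o')
```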


Using self-hosted models from code

Once configured, self-hosted models are used the same way as cloud models:

from agentcc import AgentCC

client = AgentCC(
    api_key="sk-agentcc-your-key",
    base_url="http://localhost:8080",  # your self-hosted Agent Command Center
)

# Route to Ollama
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello"}],
)

# Or pin to a specific provider
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"x-agentcc-provider-lock": "ollama"},
)

Limitations

  • Self-hosted models don’t support the Assistants API (threads are stored on OpenAI’s servers)
  • Embedding endpoints require the inference server to implement /v1/embeddings
  • Cost tracking uses configured pricing. Set custom pricing for self-hosted models in the provider config, or costs will show as $0.
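Custom pricing for a self-hosted model might look like the fragment below. The pricing field names here are illustrative assumptions, not confirmed keys; check your Agent Command Center version's provider-config reference for the exact schema:

```yaml
providers:
  ollama:
    base_url: "http://localhost:11434"
    type: "ollama"
    # Hypothetical pricing keys -- used so cost tracking reflects your
    # hardware cost instead of reporting $0.
    pricing:
      llama3.1:
        input_per_1m_tokens: 0.10
        output_per_1m_tokens: 0.10
```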
