LLM Response Caching in Agent Command Center
Cache LLM responses at the gateway with exact match and semantic caching. Cut provider costs and latency for repeated queries across all providers.
About
Agent Command Center caches LLM responses server-side. A cache hit returns an instant response without calling the provider. The X-AgentCC-Cache response header shows cache status (hit or miss), and X-AgentCC-Cost returns 0 on exact cache hits since no provider tokens were consumed.
No client-side cache logic needed. Caching works for all providers through the same configuration.
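To check cache behavior yourself, inspect these headers on any non-streaming response. A minimal sketch using Python's requests library against the gateway's OpenAI-compatible endpoint (the endpoint, header names, and example payload are taken from elsewhere on this page):

```python
import requests

# One non-streaming completion through the gateway
resp = requests.post(
    "https://gateway.futureagi.com/v1/chat/completions",
    headers={
        "Authorization": "Bearer sk-agentcc-your-key",
        "Content-Type": "application/json",
    },
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "What is AI?"}],
    },
)

# "hit" or "miss"; on an exact-match hit, X-AgentCC-Cost is 0
print(resp.headers.get("X-AgentCC-Cache"))
print(resp.headers.get("X-AgentCC-Cost"))
```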
When to use
- Repeated queries: FAQ bots, common customer questions, template-based prompts
- Development and testing: Avoid burning API credits on the same test prompts
- High-traffic endpoints: Reduce provider costs for popular queries
Exact match vs semantic cache
| | Exact match | Semantic cache |
|---|---|---|
| How it matches | Identical request parameters (same messages, model, temperature) | Similar queries via vector embeddings |
| Example | Same prompt, character for character | “What’s the weather today?” matches “Tell me today’s weather” |
| Latency | Fastest - hash lookup | Slightly higher - embedding computation |
| Use case | Deterministic queries, templates | Paraphrased questions, conversational variations |
| Cost on hit | Zero (skips cost/credits plugins) | Cost plugins still run (embedding lookup has overhead) |
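Under exact matching, any change to the cache-key parameters (messages, model, temperature) produces a different key. A sketch of what that means in practice, reusing the requests-based call from above:

```python
import requests

URL = "https://gateway.futureagi.com/v1/chat/completions"
HEADERS = {
    "Authorization": "Bearer sk-agentcc-your-key",
    "Content-Type": "application/json",
}

def cache_status(payload):
    # Send one completion and return the gateway's cache status header
    resp = requests.post(URL, headers=HEADERS, json=payload)
    return resp.headers.get("X-AgentCC-Cache")

base = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is AI?"}],
    "temperature": 0,
}

print(cache_status(base))                        # first call: miss (writes the cache)
print(cache_status(base))                        # identical parameters: hit
print(cache_status({**base, "temperature": 1}))  # temperature changed: miss
```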
Note
Streaming requests bypass cache entirely - both on read and write. Cache only applies to non-streaming completions.
Configuration
| Setting | Description | Default |
|---|---|---|
| enabled | Enable or disable caching | false |
| strategy | "exact" or "semantic" | "exact" |
| default_ttl | Time-to-live for cached entries (e.g. 5m, 1h) | 5m |
| max_entries | Maximum number of cached entries (LRU eviction) | 10000 |
Go to Agent Command Center > Caching in the Future AGI dashboard to enable caching, choose a strategy, and set TTL.
Python:

```python
from agentcc import AgentCC, GatewayConfig, CacheConfig

# Set cache config at the client level
client = AgentCC(
    api_key="sk-agentcc-your-key",
    base_url="https://gateway.futureagi.com",
    config=GatewayConfig(
        cache=CacheConfig(enabled=True, strategy="exact", ttl=300, namespace="prod"),
    ),
)
```

TypeScript:

```typescript
import { AgentCC } from '@futureagi/agentcc';

const client = new AgentCC({
  apiKey: 'sk-agentcc-your-key',
  baseUrl: 'https://gateway.futureagi.com',
  config: {
    cache: { enabled: true, strategy: 'exact', ttl: 300, namespace: 'prod' },
  },
});
```

Self-hosted config.yaml:

```yaml
cache:
  enabled: true
  default_ttl: 5m
  max_entries: 10000
```
Cache namespaces
Partition the cache into isolated buckets. Each namespace maintains its own entries, so entries from one environment don’t leak into another.
Use the x-agentcc-cache-namespace request header or set it in the SDK config:
```python
# Per-request namespace
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"x-agentcc-cache-namespace": "staging"},
)
```
Common namespace patterns:
- Environment isolation: prod, staging, dev (see the sketch below)
- Multi-tenant isolation: one namespace per customer
- A/B testing: different namespaces per experiment variant
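For environment isolation, one common approach is to derive the namespace from the deployment environment at client construction time. A minimal sketch using the AgentCC SDK configuration from above (the APP_ENV variable name is an arbitrary choice for this sketch, not a gateway convention):

```python
import os

from agentcc import AgentCC, CacheConfig, GatewayConfig

# APP_ENV is an arbitrary variable name; use whatever your
# deployment platform sets ("prod", "staging", or "dev").
env = os.environ.get("APP_ENV", "dev")

client = AgentCC(
    api_key="sk-agentcc-your-key",
    base_url="https://gateway.futureagi.com",
    config=GatewayConfig(
        cache=CacheConfig(enabled=True, strategy="exact", ttl=300, namespace=env),
    ),
)
```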
Per-request cache control
Override cache behavior on individual requests using headers:
| Header | Value | Effect |
|---|---|---|
| x-agentcc-cache-force-refresh | true | Bypass cache, fetch fresh response, update cache |
| Cache-Control | no-store | Disable caching for this request entirely |
| x-agentcc-cache-ttl | seconds | Override TTL for this specific response |
| x-agentcc-cache-namespace | string | Route to a specific cache namespace |
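The examples below demonstrate the force-refresh header; the other headers follow the same pattern. For instance, a sketch of a per-request TTL override and a request that skips caching entirely, using the AgentCC client from the configuration section (the prompts are placeholders):

```python
# Cache this response for one hour instead of the default TTL
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the refund policy."}],
    extra_headers={"x-agentcc-cache-ttl": "3600"},
)

# Disable caching for this request entirely
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Generate a unique greeting."}],
    extra_headers={"Cache-Control": "no-store"},
)
```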
Python (AgentCC SDK):

```python
# Force a fresh response (bypass cache)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is AI?"}],
    extra_headers={"x-agentcc-cache-force-refresh": "true"},
)
```

Python (OpenAI SDK):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-agentcc-your-key",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is AI?"}],
    extra_headers={"x-agentcc-cache-force-refresh": "true"},
)
```

curl:

```bash
curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-agentcc-your-key" \
  -H "x-agentcc-cache-force-refresh: true" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is AI?"}]
  }'
```

Cache backends
Exact match backends:
| Backend | Use case |
|---|---|
| In-memory (default) | Single-instance deployments, development |
| Redis | Multi-instance deployments, shared cache across replicas |
Semantic cache backends (vector stores):
| Backend | Notes |
|---|---|
| In-memory | Development and small-scale deployments |
| Qdrant | Production-grade self-hosted vector search |
| Pinecone | Managed vector database |
Note
Backend configuration is set at the gateway level in config.yaml. If you’re using the cloud gateway at gateway.futureagi.com, the backend is managed for you.
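For self-hosted deployments, backend selection lives alongside the cache settings shown earlier. As an illustrative sketch only, a Redis-backed exact cache might look something like this (the backend and redis keys below are hypothetical, not confirmed setting names; check the reference for your gateway version):

```yaml
cache:
  enabled: true
  default_ttl: 5m
  max_entries: 10000
  # Hypothetical keys for illustration only
  backend: redis
  redis:
    url: redis://localhost:6379
```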