LLM Response Caching in Agent Command Center
Cache LLM responses at the gateway with exact match and semantic caching. Cut provider costs and latency for repeated queries across all providers.
About
Agent Command Center caches LLM responses server-side. A cache hit returns an instant response without calling the provider. The X-AgentCC-Cache response header shows cache status (hit or miss), and X-AgentCC-Cost returns 0 on exact cache hits since no provider tokens were consumed.
No client-side cache logic needed. Caching works for all providers through the same configuration.
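To check cache behavior yourself, inspect these headers on any non-streaming response. A minimal sketch using Python's requests library against the gateway's OpenAI-compatible endpoint (the endpoint, header names, and example payload are taken from elsewhere on this page):

```python
import requests

# One non-streaming completion through the gateway
resp = requests.post(
    "https://gateway.futureagi.com/v1/chat/completions",
    headers={
        "Authorization": "Bearer sk-agentcc-your-key",
        "Content-Type": "application/json",
    },
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "What is AI?"}],
    },
)

# "hit" or "miss"; on an exact-match hit, X-AgentCC-Cost is 0
print(resp.headers.get("X-AgentCC-Cache"))
print(resp.headers.get("X-AgentCC-Cost"))
```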
When to use
- Repeated queries: FAQ bots, common customer questions, template-based prompts
- Development and testing: Avoid burning API credits on the same test prompts
- High-traffic endpoints: Reduce provider costs for popular queries
Exact match vs semantic cache
| | Exact match | Semantic cache |
|---|---|---|
| How it matches | Identical request parameters (same messages, model, temperature) | Similar queries via vector embeddings |
| Example | Same prompt, character for character | “What’s the weather today?” matches “Tell me today’s weather” |
| Latency | Fastest - hash lookup | Slightly higher - embedding computation |
| Use case | Deterministic queries, templates | Paraphrased questions, conversational variations |
| Cost on hit | Zero (skips cost/credits plugins) | Cost plugins still run (embedding lookup has overhead) |
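Under exact matching, any change to the cache-key parameters (messages, model, temperature) produces a different key. A sketch of what that means in practice, reusing the requests-based call from above:

```python
import requests

URL = "https://gateway.futureagi.com/v1/chat/completions"
HEADERS = {
    "Authorization": "Bearer sk-agentcc-your-key",
    "Content-Type": "application/json",
}

def cache_status(payload):
    # Send one completion and return the gateway's cache status header
    resp = requests.post(URL, headers=HEADERS, json=payload)
    return resp.headers.get("X-AgentCC-Cache")

base = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is AI?"}],
    "temperature": 0,
}

print(cache_status(base))                        # first call: miss (writes the cache)
print(cache_status(base))                        # identical parameters: hit
print(cache_status({**base, "temperature": 1}))  # temperature changed: miss
```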
Note
Streaming requests bypass cache entirely - both on read and write. Cache only applies to non-streaming completions.
Configuration
| Setting | Description | Default |
|---|---|---|
| enabled | Enable or disable caching | false |
| strategy | "exact" or "semantic" | "exact" |
| default_ttl | Time-to-live for cached entries (e.g. 5m, 1h) | 5m |
| max_entries | Maximum number of cached entries (LRU eviction) | 10000 |
Go to Agent Command Center > Caching in the Future AGI dashboard to enable caching, choose a strategy, and set TTL.
Python:

```python
from agentcc import AgentCC, GatewayConfig, CacheConfig

# Set cache config at the client level
client = AgentCC(
    api_key="sk-agentcc-your-key",
    base_url="https://gateway.futureagi.com",
    config=GatewayConfig(
        cache=CacheConfig(enabled=True, strategy="exact", ttl=300, namespace="prod"),
    ),
)
```

TypeScript:

```typescript
import { AgentCC } from '@futureagi/agentcc';

const client = new AgentCC({
  apiKey: 'sk-agentcc-your-key',
  baseUrl: 'https://gateway.futureagi.com',
  config: {
    cache: { enabled: true, strategy: 'exact', ttl: 300, namespace: 'prod' },
  },
});
```

Self-hosted config.yaml:

```yaml
cache:
  enabled: true
  default_ttl: 5m
  max_entries: 10000
```
Cache namespaces
Partition the cache into isolated buckets. Each namespace maintains its own entries, so entries from one environment don’t leak into another.
Use the x-agentcc-cache-namespace request header or set it in the SDK config:
```python
# Per-request namespace
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"x-agentcc-cache-namespace": "staging"},
)
```
Common namespace patterns:
- Environment isolation: prod, staging, dev (see the sketch below)
- Multi-tenant isolation: one namespace per customer
- A/B testing: different namespaces per experiment variant
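For environment isolation, one common approach is to derive the namespace from the deployment environment at client construction time. A minimal sketch using the AgentCC SDK configuration from above (the APP_ENV variable name is an arbitrary choice for this sketch, not a gateway convention):

```python
import os

from agentcc import AgentCC, CacheConfig, GatewayConfig

# APP_ENV is an arbitrary variable name; use whatever your
# deployment platform sets ("prod", "staging", or "dev").
env = os.environ.get("APP_ENV", "dev")

client = AgentCC(
    api_key="sk-agentcc-your-key",
    base_url="https://gateway.futureagi.com",
    config=GatewayConfig(
        cache=CacheConfig(enabled=True, strategy="exact", ttl=300, namespace=env),
    ),
)
```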
Per-request cache control
Override cache behavior on individual requests using headers:
| Header | Value | Effect |
|---|---|---|
| x-agentcc-cache-force-refresh | true | Bypass cache, fetch fresh response, update cache |
| Cache-Control | no-store | Disable caching for this request entirely |
| x-agentcc-cache-ttl | seconds | Override TTL for this specific response |
| x-agentcc-cache-namespace | string | Route to a specific cache namespace |
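The examples below demonstrate the force-refresh header; the other headers follow the same pattern. For instance, a sketch of a per-request TTL override and a request that skips caching entirely, using the AgentCC client from the configuration section (the prompts are placeholders):

```python
# Cache this response for one hour instead of the default TTL
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the refund policy."}],
    extra_headers={"x-agentcc-cache-ttl": "3600"},
)

# Disable caching for this request entirely
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Generate a unique greeting."}],
    extra_headers={"Cache-Control": "no-store"},
)
```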
Python (AgentCC SDK):

```python
# Force a fresh response (bypass cache)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is AI?"}],
    extra_headers={"x-agentcc-cache-force-refresh": "true"},
)
```

Python (OpenAI SDK):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-agentcc-your-key",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is AI?"}],
    extra_headers={"x-agentcc-cache-force-refresh": "true"},
)
```

curl:

```bash
curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-agentcc-your-key" \
  -H "x-agentcc-cache-force-refresh: true" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is AI?"}]
  }'
```

Cache backends
Exact match backends:
| Backend | Use case |
|---|---|
| In-memory (default) | Single-instance deployments, development |
| Redis | Multi-instance deployments, shared cache across replicas |
Semantic cache backends (vector stores):
| Backend | Notes |
|---|---|
| In-memory | Development and small-scale deployments |
| Qdrant | Production-grade self-hosted vector search |
| Pinecone | Managed vector database |
Note
Backend configuration is set at the gateway level in config.yaml. If you’re using the cloud gateway at gateway.futureagi.com, the backend is managed for you.
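For self-hosted deployments, backend selection lives alongside the cache settings shown earlier. As an illustrative sketch only, a Redis-backed exact cache might look something like this (the backend and redis keys below are hypothetical, not confirmed setting names; check the reference for your gateway version):

```yaml
cache:
  enabled: true
  default_ttl: 5m
  max_entries: 10000
  # Hypothetical keys for illustration only
  backend: redis
  redis:
    url: redis://localhost:6379
```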