Caching
Reduce costs and latency with exact match and semantic caching at the gateway level.
About
Prism caches LLM responses server-side. A cache hit returns an instant response without calling the provider. The X-Prism-Cache response header shows cache status (hit or miss), and X-Prism-Cost returns 0 on exact cache hits since no provider tokens were consumed.
No client-side cache logic needed. Caching works for all providers through the same configuration.
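A client can inspect these response headers to confirm whether a request was served from cache. The helper below is a minimal sketch (the header names are as documented above; the request itself is omitted):

```python
def was_cache_hit(headers: dict) -> bool:
    """Interpret Prism's cache headers: an exact hit is reported as
    X-Prism-Cache: hit, and X-Prism-Cost is 0 since no provider call was made."""
    return headers.get("X-Prism-Cache", "").lower() == "hit"

# Illustrative header values as they would appear on an exact-match hit / miss
print(was_cache_hit({"X-Prism-Cache": "hit", "X-Prism-Cost": "0"}))  # True
print(was_cache_hit({"X-Prism-Cache": "miss"}))                      # False
```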
When to use
- Repeated queries: FAQ bots, common customer questions, template-based prompts
- Development and testing: Avoid burning API credits on the same test prompts
- High-traffic endpoints: Reduce provider costs for popular queries
Exact match vs semantic cache
| | Exact match | Semantic cache |
|---|---|---|
| How it matches | Identical request parameters (same messages, model, temperature) | Similar queries via vector embeddings |
| Example | Same prompt, character for character | “What’s the weather today?” matches “Tell me today’s weather” |
| Latency | Fastest - hash lookup | Slightly higher - embedding computation |
| Use case | Deterministic queries, templates | Paraphrased questions, conversational variations |
| Cost on hit | Zero (skips cost/credits plugins) | Cost plugins still run (embedding lookup has overhead) |
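Exact matching can be pictured as hashing the canonicalized request parameters into a lookup key. The sketch below is illustrative only, not Prism's actual implementation: any difference in model, messages, or temperature produces a different key, so only character-for-character identical requests hit the same entry.

```python
import hashlib
import json

def exact_cache_key(model: str, messages: list, temperature: float = 1.0) -> str:
    """Illustrative exact-match key: a SHA-256 hash over canonicalized
    request parameters, so identical requests map to the same entry."""
    canonical = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

msgs = [{"role": "user", "content": "What's the weather today?"}]
k1 = exact_cache_key("gpt-4o-mini", msgs, temperature=0.0)
k2 = exact_cache_key("gpt-4o-mini", msgs, temperature=0.0)
k3 = exact_cache_key("gpt-4o-mini", msgs, temperature=0.7)
print(k1 == k2)  # True  - identical parameters hit the same entry
print(k1 == k3)  # False - a different temperature misses
```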
Note
Streaming requests bypass cache entirely - both on read and write. Cache only applies to non-streaming completions.
Configuration
| Setting | Description | Default |
|---|---|---|
| enabled | Enable or disable caching | false |
| strategy | "exact" or "semantic" | "exact" |
| default_ttl | Time-to-live for cached entries (e.g. 5m, 1h) | 5m |
| max_entries | Maximum number of cached entries (LRU eviction) | 10000 |
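The max_entries eviction behavior can be modeled as follows (a toy sketch, not Prism's internals): once the cache is full, inserting a new entry evicts the least recently used one.

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache modelling the max_entries eviction policy."""
    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as recently used
        return self._store[key]

    def put(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(max_entries=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # touch "a", so "b" becomes least recently used
cache.put("c", 3)      # full: evicts "b"
print(cache.get("b"))  # None
print(cache.get("a"))  # 1
```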
Go to Prism > Caching in the Future AGI dashboard to enable caching, choose a strategy, and set TTL.
Python:

```python
from prism import Prism, GatewayConfig, CacheConfig

# Set cache config at the client level
client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
    config=GatewayConfig(
        cache=CacheConfig(enabled=True, strategy="exact", ttl=300, namespace="prod"),
    ),
)
```

TypeScript:

```typescript
import { Prism } from '@futureagi/prism';

const client = new Prism({
  apiKey: 'sk-prism-your-key',
  baseUrl: 'https://gateway.futureagi.com',
  config: {
    cache: { enabled: true, strategy: 'exact', ttl: 300, namespace: 'prod' },
  },
});
```

Self-hosted config.yaml:

```yaml
cache:
  enabled: true
  default_ttl: 5m
  max_entries: 10000
```
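TTL values such as 5m and 1h are duration strings. A small parser for this notation might look like the sketch below (illustrative only; the units the gateway actually accepts may differ):

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_ttl(value: str) -> int:
    """Convert a duration string like '5m' or '1h' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if not match:
        raise ValueError(f"unrecognized TTL: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNITS[unit]

print(parse_ttl("5m"))  # 300
print(parse_ttl("1h"))  # 3600
```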
Cache namespaces
Partition the cache into isolated buckets. Each namespace maintains its own entries, so entries from one environment don’t leak into another.
Use the x-prism-cache-namespace request header or set it in the SDK config:
```python
# Per-request namespace
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"x-prism-cache-namespace": "staging"},
)
```
Common namespace patterns:
- Environment isolation: prod, staging, dev
- Multi-tenant isolation: one namespace per customer
- A/B testing: different namespaces per experiment variant
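Conceptually, the namespace acts as part of the cache key, so each bucket is fully isolated. A toy model of this behavior (not Prism's storage layout):

```python
class NamespacedCache:
    """Toy model: each namespace is an isolated bucket of entries."""
    def __init__(self):
        self._buckets = {}  # namespace -> {key: value}

    def put(self, namespace: str, key: str, value):
        self._buckets.setdefault(namespace, {})[key] = value

    def get(self, namespace: str, key: str):
        return self._buckets.get(namespace, {}).get(key)

cache = NamespacedCache()
cache.put("prod", "greeting", "Hello from prod")
print(cache.get("prod", "greeting"))     # Hello from prod
print(cache.get("staging", "greeting"))  # None - staging never sees prod's entry
```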
Per-request cache control
Override cache behavior on individual requests using headers:
| Header | Value | Effect |
|---|---|---|
| x-prism-cache-force-refresh | true | Bypass cache, fetch fresh response, update cache |
| Cache-Control | no-store | Disable caching for this request entirely |
| x-prism-cache-ttl | seconds | Override TTL for this specific response |
| x-prism-cache-namespace | string | Route to a specific cache namespace |
Prism SDK:

```python
# Force a fresh response (bypass cache)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is AI?"}],
    extra_headers={"x-prism-cache-force-refresh": "true"},
)
```

OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is AI?"}],
    extra_headers={"x-prism-cache-force-refresh": "true"},
)
```

cURL:

```shell
curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "x-prism-cache-force-refresh: true" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is AI?"}]
  }'
```

Cache backends
Exact match backends:
| Backend | Use case |
|---|---|
| In-memory (default) | Single-instance deployments, development |
| Redis | Multi-instance deployments, shared cache across replicas |
Semantic cache backends (vector stores):
| Backend | Notes |
|---|---|
| In-memory | Development and small-scale deployments |
| Qdrant | Production-grade self-hosted vector search |
| Pinecone | Managed vector database |
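Whatever the backend, a semantic lookup boils down to embedding the query and returning a stored response whose query embedding is similar enough. The sketch below uses a toy keyword-count embedding and cosine similarity purely for illustration; real deployments use a proper embedding model and one of the vector stores above.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Toy semantic cache: return a stored response when a query's
    embedding is similar enough to a cached query's embedding."""
    def __init__(self, embed, threshold: float = 0.6):
        self.embed = embed          # callable: text -> vector
        self.threshold = threshold
        self._entries = []          # list of (embedding, response)

    def put(self, query: str, response: str):
        self._entries.append((self.embed(query), response))

    def get(self, query: str):
        vec = self.embed(query)
        for stored_vec, response in self._entries:
            if cosine_similarity(vec, stored_vec) >= self.threshold:
                return response
        return None  # cache miss: no stored query is similar enough

# Toy embedding: count occurrences of a few keywords (illustration only)
VOCAB = ["weather", "today", "tell", "what"]
def toy_embed(text):
    words = text.lower().replace("?", "").replace("'s", " is").split()
    return [float(words.count(w)) for w in VOCAB]

cache = SemanticCache(toy_embed)
cache.put("What's the weather today?", "Sunny, 22C")
print(cache.get("Tell me today's weather"))        # Sunny, 22C (similar enough)
print(cache.get("how do I deploy to kubernetes"))  # None (dissimilar)
```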
Note
Backend configuration is set at the gateway level in config.yaml. If you’re using the cloud gateway at gateway.futureagi.com, the backend is managed for you.