Caching
Reduce costs and latency with Prism's exact match and semantic caching.
What it is
Prism caches LLM responses at the gateway level. Caching is entirely server-side, so no client-side cache logic is needed. A cache hit returns an instant response without calling the provider. The X-Prism-Cache header reports cache status (miss/hit/skip), and X-Prism-Cost returns 0 on cache hits, eliminating provider charges for repeated queries.
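To make the exact-match behavior concrete, here is a minimal sketch (an illustration, not Prism's actual implementation) of how a gateway might derive a cache key from request parameters. Identical requests hash to the same key, so a repeat call can be served from cache without reaching the provider:

```python
import hashlib
import json

def cache_key(model: str, messages: list, **params) -> str:
    """Derive a deterministic cache key from the request payload.

    json.dumps with sort_keys=True canonicalizes the payload so that
    identical requests always serialize, and therefore hash, the same way.
    """
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

a = cache_key("gpt-4o-mini", [{"role": "user", "content": "What is AI?"}])
b = cache_key("gpt-4o-mini", [{"role": "user", "content": "What is AI?"}])
c = cache_key("gpt-4o-mini", [{"role": "user", "content": "Define AI."}])
assert a == b  # identical request -> same key -> cache hit
assert a != c  # different wording -> different key -> miss under exact match
```

Note that under this scheme even a one-character difference produces a different key, which is why semantic caching (below) exists for paraphrased queries.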
Use cases
- Repeated queries — FAQ bots, common customer questions, template-based prompts
- Development and testing — Avoid burning API credits on the same test prompts
- High-traffic endpoints — Reduce provider costs for popular queries
Cache modes
Prism supports two caching strategies:
Exact Match — Caches responses for identical request parameters. Fastest and most predictable. Use for deterministic queries where exact repetition is common.
Semantic Cache — Uses vector embeddings (numerical representations of text meaning generated by a language model) to match semantically equivalent queries, even when worded differently. For example, “What’s the weather like today?” and “Tell me today’s weather” would match the same cached response — even though the words differ. Catches paraphrased questions and variations that exact match would miss. Slightly higher latency than exact match due to the embedding computation.
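The matching idea behind a semantic cache can be sketched with toy embeddings. The example below uses bag-of-words vectors and cosine similarity purely for illustration; a real semantic cache (Prism's included) would use dense vectors from an embedding model, and the 0.5 threshold here is an arbitrary assumption:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts. A real semantic cache would
    # use dense vectors produced by an embedding model instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_hit(query: str, cached_query: str, threshold: float = 0.5) -> bool:
    # A semantic cache serves the stored response when similarity to a
    # previously cached query clears a configured threshold.
    return cosine(embed(query), embed(cached_query)) >= threshold

print(semantic_hit("what is the capital of france", "capital of france"))  # True
print(semantic_hit("what is ai", "capital of france"))                     # False
```

The embedding computation on every lookup is also why this mode carries slightly higher latency than exact match.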
Note
Streaming requests bypass the cache entirely; caching applies only to non-streaming completions.
Configuration
Configure caching with an enabled flag, a cache mode, a TTL, and a maximum number of entries:
| Setting | Description | Default |
|---|---|---|
| enabled | Enable or disable caching | false |
| mode | Cache mode: "exact" or "semantic" | "exact" |
| ttl_seconds | Time-to-live for cached entries | 3600 |
| max_entries | Maximum number of cached entries | 10000 |
```json
{
  "cache": {
    "enabled": true,
    "mode": "exact",
    "ttl_seconds": 3600,
    "max_entries": 10000
  }
}
```
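The ttl_seconds and max_entries settings combine in a straightforward way: entries expire after the TTL, and once the entry cap is reached something must be evicted. This in-memory sketch illustrates those two rules (the oldest-first eviction here is an assumption for illustration, not Prism's documented policy):

```python
import time
from collections import OrderedDict

class TTLCache:
    """Illustrative in-memory cache honoring ttl_seconds and max_entries."""

    def __init__(self, ttl_seconds: float = 3600, max_entries: int = 10000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._store: "OrderedDict[str, tuple[float, str]]" = OrderedDict()

    def set(self, key: str, value: str) -> None:
        if len(self._store) >= self.max_entries:
            self._store.popitem(last=False)  # at capacity: evict the oldest entry
        self._store[key] = (time.monotonic(), value)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None  # cache miss
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # past ttl_seconds: treat as a miss
            return None
        return value

cache = TTLCache(ttl_seconds=0.05, max_entries=2)
cache.set("a", "1")
cache.set("b", "2")
cache.set("c", "3")            # capacity 2 reached, so "a" is evicted
print(cache.get("a"))          # None
time.sleep(0.1)
print(cache.get("c"))          # None (expired)
```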
Cache namespaces
Partition cache into isolated buckets using namespaces. Each namespace maintains its own cache entries, preventing cross-contamination.
Use cases for namespaces:
- Environment isolation — Separate prod, staging, dev caches
- Multi-tenant isolation — Each tenant gets its own cache namespace
- A/B testing — Different cache namespaces for different experiment variants
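One simple mental model for namespace isolation (an illustrative sketch, not Prism's internals) is a namespace prefix on every cache key, so identical requests in different namespaces never collide:

```python
def namespaced_key(namespace: str, request_hash: str) -> str:
    # Prefixing the key partitions the cache: "prod" and "staging"
    # entries for the same request hash occupy distinct slots.
    return f"{namespace}:{request_hash}"

cache: dict = {}
cache[namespaced_key("prod", "abc123")] = "cached response A"
cache[namespaced_key("staging", "abc123")] = "cached response B"

# Same request hash, different namespaces -> independent entries.
assert cache[namespaced_key("prod", "abc123")] == "cached response A"
assert cache[namespaced_key("staging", "abc123")] == "cached response B"
```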
Configuring cache via SDK
```python
from prism import Prism, GatewayConfig, CacheConfig

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
    config=GatewayConfig(
        cache=CacheConfig(enabled=True, mode="exact", ttl="5m", namespace="prod"),
    ),
)
```

```typescript
import { Prism } from '@futureagi/prism';

const client = new Prism({
  apiKey: 'sk-prism-your-key',
  baseUrl: 'https://gateway.futureagi.com',
  config: { cache: { enabled: true, mode: 'exact', ttl: '5m', namespace: 'prod' } },
});
```

Per-request cache control
Override cache behavior on a per-request basis using headers:
| Header | Value | Effect |
|---|---|---|
| x-prism-cache-force-refresh | true | Bypass cache, fetch fresh response |
| Cache-Control | no-store | Disable caching for this request |
| x-prism-cache-ttl | seconds | Override TTL for this response |
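The three headers can be read as a small decision rule. This sketch shows one plausible way a gateway might resolve them; the precedence order (no-store wins over force-refresh, which still stores the fresh response) is an assumption for illustration, not documented Prism behavior:

```python
def resolve_cache_policy(headers: dict) -> dict:
    """Interpret per-request cache override headers (illustrative only)."""
    h = {k.lower(): v for k, v in headers.items()}
    policy = {"lookup": True, "store": True, "ttl_override": None}
    if "no-store" in h.get("cache-control", ""):
        # Cache-Control: no-store disables caching for this request outright.
        policy["lookup"] = False
        policy["store"] = False
    elif h.get("x-prism-cache-force-refresh") == "true":
        # Force-refresh skips the cache read but still stores the fresh response.
        policy["lookup"] = False
    if "x-prism-cache-ttl" in h:
        # Per-request TTL overrides the configured ttl_seconds for this entry.
        policy["ttl_override"] = int(h["x-prism-cache-ttl"])
    return policy

print(resolve_cache_policy({"x-prism-cache-force-refresh": "true"}))
# {'lookup': False, 'store': True, 'ttl_override': None}
```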
cURL example:
```shell
curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "x-prism-cache-force-refresh: true" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is AI?"}]
  }'
```
Python SDK example:
```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is AI?"}],
    extra_headers={"x-prism-cache-force-refresh": "true"},
)
```
TypeScript SDK example:
```typescript
const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "What is AI?" }],
  headers: { "x-prism-cache-force-refresh": "true" },
});
```