Caching

Reduce costs and latency with Prism's exact match and semantic caching.

What it is

Prism caches LLM responses at the gateway level. Caching is entirely server-side, so clients need no cache logic. A cache hit returns the response instantly without calling the provider. The X-Prism-Cache response header reports cache status (miss, hit, or skip), and X-Prism-Cost returns 0 on cache hits, so repeated queries incur no provider charges.
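The header names below come from this page; the helper functions themselves are a hypothetical client-side convenience, assuming response headers are available as a plain dict:

```python
# Hypothetical helpers for reading Prism's cache headers.
# The header names (X-Prism-Cache, X-Prism-Cost) are documented above;
# the dict-based interface is an assumption for illustration.

def cache_status(headers: dict) -> str:
    """Return 'hit', 'miss', or 'skip' from the X-Prism-Cache header."""
    return headers.get("X-Prism-Cache", "miss").lower()

def was_free(headers: dict) -> bool:
    """On a cache hit, X-Prism-Cost is 0 -- no provider charge."""
    return float(headers.get("X-Prism-Cost", "0")) == 0.0

headers = {"X-Prism-Cache": "hit", "X-Prism-Cost": "0"}
print(cache_status(headers))  # hit
print(was_free(headers))      # True
```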


Use cases

  • Repeated queries — FAQ bots, common customer questions, template-based prompts
  • Development and testing — Avoid burning API credits on the same test prompts
  • High-traffic endpoints — Reduce provider costs for popular queries

Cache modes

Prism supports two caching strategies:

Exact Match — Caches responses for identical request parameters. Fastest and most predictable. Use for deterministic queries where exact repetition is common.
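One way to picture exact matching is a hash over the canonicalized request parameters. Prism's actual key derivation isn't documented here, so the sketch below is only illustrative of the semantics: identical parameters share a key, and any change at all produces a miss.

```python
import hashlib
import json

def exact_cache_key(request: dict) -> str:
    # Canonicalize (sorted keys, no whitespace) so identical requests
    # always serialize identically, then hash the result.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

a = exact_cache_key({"model": "gpt-4o-mini",
                     "messages": [{"role": "user", "content": "What is AI?"}]})
b = exact_cache_key({"model": "gpt-4o-mini",
                     "messages": [{"role": "user", "content": "What is AI?"}]})
c = exact_cache_key({"model": "gpt-4o-mini",
                     "messages": [{"role": "user", "content": "what is ai?"}]})
print(a == b)  # True  -- identical requests share a key
print(a == c)  # False -- even a casing change is a miss
```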

Semantic Cache — Uses vector embeddings (numerical representations of text meaning generated by a language model) to match semantically equivalent queries, even when worded differently. For example, "What's the weather like today?" and "Tell me today's weather" would return the same cached response even though the wording differs. This catches paraphrased questions and variations that exact match would miss, at slightly higher latency than exact match due to the embedding computation.
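The matching idea can be sketched with cosine similarity over embeddings. The vectors below are toy 3-dimensional stand-ins (real embedding models produce much larger vectors), and the threshold value is an assumption for illustration:

```python
import math

# Toy stand-ins for model-generated embeddings (illustrative values only).
CACHE = {
    "What's the weather like today?": [0.90, 0.10, 0.20],
    "How do I reset my password?":    [0.10, 0.90, 0.30],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def semantic_lookup(query_embedding, cache=CACHE, threshold=0.95):
    # Return the cached prompt most similar to the query,
    # but only if similarity clears the match threshold.
    best = max(cache, key=lambda k: cosine(query_embedding, cache[k]))
    return best if cosine(query_embedding, cache[best]) >= threshold else None

# "Tell me today's weather" embeds near the cached weather question:
print(semantic_lookup([0.88, 0.12, 0.18]))  # What's the weather like today?
print(semantic_lookup([0.50, 0.50, 0.50]))  # None (no close match)
```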

Note

Streaming requests bypass cache entirely. Cache only applies to non-streaming completions.


Configuration

Configure caching with an enabled flag, a cache mode, a TTL, and a maximum entry count:

Setting      Description                          Default
enabled      Enable or disable caching            false
mode         Cache mode: "exact" or "semantic"    "exact"
ttl_seconds  Time-to-live for cached entries      3600
max_entries  Maximum number of cached entries     10000
{
  "cache": {
    "enabled": true,
    "mode": "exact",
    "ttl_seconds": 3600,
    "max_entries": 10000
  }
}
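To make the ttl_seconds and max_entries semantics concrete, here is a minimal in-memory sketch. This is not Prism's implementation, only an illustration of what the two settings control: entries expire after the TTL, and the oldest entries are evicted once the cache is full.

```python
import time
from collections import OrderedDict

class MiniCache:
    """Illustrative sketch of ttl_seconds / max_entries semantics."""

    def __init__(self, ttl_seconds=3600, max_entries=10000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._store = OrderedDict()  # key -> (expires_at, value)

    def set(self, key, value, now=None):
        now = time.time() if now is None else now
        self._store[key] = (now + self.ttl, value)
        self._store.move_to_end(key)
        while len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the oldest entry

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(key)
        if entry is None or entry[0] <= now:
            return None  # miss, or entry past its TTL
        return entry[1]
```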

Cache namespaces

Partition the cache into isolated buckets using namespaces. Each namespace maintains its own cache entries, preventing cross-contamination.

Use cases for namespaces:

  • Environment isolation — Separate prod, staging, dev caches
  • Multi-tenant isolation — Each tenant gets its own cache namespace
  • A/B testing — Different cache namespaces for different experiment variants
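The isolation guarantee can be pictured as a key prefix: the same request cached under two namespaces lives under two distinct keys. Prism's internal scheme is not documented here, so this is purely illustrative.

```python
def namespaced_key(namespace: str, cache_key: str) -> str:
    # Entries in different namespaces can never collide,
    # even for byte-identical requests.
    return f"{namespace}:{cache_key}"

prod = namespaced_key("prod", "abc123")
staging = namespaced_key("staging", "abc123")
print(prod != staging)  # True -- same request, isolated buckets
```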

Configuring cache via SDK

Python:

from prism import Prism, GatewayConfig, CacheConfig

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
    config=GatewayConfig(
        cache=CacheConfig(enabled=True, mode="exact", ttl="5m", namespace="prod"),
    ),
)

TypeScript:

import { Prism } from '@futureagi/prism';

const client = new Prism({
  apiKey: 'sk-prism-your-key',
  baseUrl: 'https://gateway.futureagi.com',
  config: { cache: { enabled: true, mode: 'exact', ttl: '5m', namespace: 'prod' } },
});

Per-request cache control

Override cache behavior on a per-request basis using headers:

Header                       Value     Effect
x-prism-cache-force-refresh  true      Bypass cache, fetch a fresh response
Cache-Control                no-store  Disable caching for this request
x-prism-cache-ttl            seconds   Override TTL for this response

cURL example:

curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "x-prism-cache-force-refresh: true" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is AI?"}]
  }'

Python SDK example:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is AI?"}],
    extra_headers={"x-prism-cache-force-refresh": "true"}
)

TypeScript SDK example:

const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "What is AI?" }],
  headers: { "x-prism-cache-force-refresh": "true" }
});
