Caching

Reduce costs and latency with exact match and semantic caching at the gateway level.

About

Prism caches LLM responses server-side. A cache hit returns an instant response without calling the provider. The X-Prism-Cache response header shows cache status (hit or miss), and X-Prism-Cost returns 0 on exact cache hits since no provider tokens were consumed.

No client-side cache logic needed. Caching works for all providers through the same configuration.
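Because cache status is surfaced purely through response headers, any HTTP client can observe it. A minimal sketch of interpreting those headers (header names are from this page; the helper function itself is hypothetical):

```python
def summarize_cache(headers: dict) -> dict:
    """Interpret Prism's cache-related response headers.

    X-Prism-Cache is "hit" or "miss"; X-Prism-Cost is "0" on an
    exact-match hit because no provider tokens were consumed.
    """
    hit = headers.get("X-Prism-Cache", "miss").lower() == "hit"
    cost = float(headers.get("X-Prism-Cost", "0"))
    return {"cache_hit": hit, "provider_cost": cost}

# Headers from an exact-match cache hit
print(summarize_cache({"X-Prism-Cache": "hit", "X-Prism-Cost": "0"}))
# {'cache_hit': True, 'provider_cost': 0.0}
```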


When to use

  • Repeated queries: FAQ bots, common customer questions, template-based prompts
  • Development and testing: Avoid burning API credits on the same test prompts
  • High-traffic endpoints: Reduce provider costs for popular queries

Exact match vs semantic cache

Exact match

  • How it matches: identical request parameters (same messages, model, temperature)
  • Example: same prompt, character for character
  • Latency: fastest - hash lookup
  • Use case: deterministic queries, templates
  • Cost on hit: zero (skips cost/credits plugins)

Semantic cache

  • How it matches: similar queries via vector embeddings
  • Example: "What's the weather today?" matches "Tell me today's weather"
  • Latency: slightly higher - embedding computation
  • Use case: paraphrased questions, conversational variations
  • Cost on hit: cost plugins still run (embedding lookup has overhead)
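Why must an exact-match request be identical character for character? Conceptually, the gateway hashes the request parameters that affect the completion and uses the digest as the cache key, so any difference produces a different key. Prism's real key derivation is internal; this is an illustrative sketch:

```python
import hashlib
import json

def exact_cache_key(model: str, messages: list, temperature: float = 1.0,
                    namespace: str = "default") -> str:
    # Canonical JSON (sorted keys, fixed separators) so that the same
    # logical request always serializes, and therefore hashes, identically.
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True, separators=(",", ":"),
    )
    # Prefixing the namespace keeps environments isolated.
    return f"{namespace}:{hashlib.sha256(payload.encode()).hexdigest()}"

k1 = exact_cache_key("gpt-4o", [{"role": "user", "content": "Hello"}])
k2 = exact_cache_key("gpt-4o", [{"role": "user", "content": "Hello"}])
k3 = exact_cache_key("gpt-4o", [{"role": "user", "content": "hello"}])
assert k1 == k2  # identical requests share a key
assert k1 != k3  # a one-character change misses the cache
```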

Note

Streaming requests bypass cache entirely - both on read and write. Cache only applies to non-streaming completions.


Configuration

  • enabled - enable or disable caching (default: false)
  • strategy - "exact" or "semantic" (default: "exact")
  • default_ttl - time-to-live for cached entries, e.g. 5m, 1h (default: 5m)
  • max_entries - maximum number of cached entries, with LRU eviction (default: 10000)
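TTL values like 5m and 1h are duration strings. A small parser sketch, assuming the units shown on this page plus seconds and days (the gateway's full accepted grammar isn't specified here):

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_ttl(value: str) -> int:
    """Convert a duration string like "5m" or "1h" to seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if not match:
        raise ValueError(f"unrecognized TTL: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNITS[unit]

assert parse_ttl("5m") == 300   # the default_ttl on this page
assert parse_ttl("1h") == 3600
```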

Go to Prism > Caching in the Future AGI dashboard to enable caching, choose a strategy, and set TTL.

Python:

from prism import Prism, GatewayConfig, CacheConfig

# Set cache config at the client level
client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
    config=GatewayConfig(
        cache=CacheConfig(enabled=True, strategy="exact", ttl=300, namespace="prod"),
    ),
)
TypeScript:

import { Prism } from '@futureagi/prism';

const client = new Prism({
    apiKey: 'sk-prism-your-key',
    baseUrl: 'https://gateway.futureagi.com',
    config: {
        cache: { enabled: true, strategy: 'exact', ttl: 300, namespace: 'prod' },
    },
});

Self-hosted config.yaml:

cache:
  enabled: true
  default_ttl: 5m
  max_entries: 10000
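Together, max_entries and default_ttl bound the cache's memory footprint: when the cache is full, the least recently used entry is evicted, and expired entries are treated as misses. A toy in-memory model of that policy (not Prism's implementation):

```python
import time
from collections import OrderedDict

class LruTtlCache:
    """Bounded cache with TTL expiry and LRU eviction."""

    def __init__(self, max_entries: int = 10000, default_ttl: float = 300.0):
        self.max_entries = max_entries
        self.default_ttl = default_ttl
        self._store = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        expires_at, value = item
        if time.monotonic() >= expires_at:   # expired entries are misses
            del self._store[key]
            return None
        self._store.move_to_end(key)         # mark as recently used
        return value

    def put(self, key, value, ttl=None):
        ttl = self.default_ttl if ttl is None else ttl
        self._store[key] = (time.monotonic() + ttl, value)
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used

cache = LruTtlCache(max_entries=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # touch "a", so "b" is now least recently used
cache.put("c", 3)   # full: evicts "b"
assert cache.get("b") is None and cache.get("a") == 1
```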

Cache namespaces

Partition the cache into isolated buckets. Each namespace maintains its own entries, so entries from one environment don't leak into another.

Use the x-prism-cache-namespace request header or set it in the SDK config:

# Per-request namespace
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"x-prism-cache-namespace": "staging"},
)

Common namespace patterns:

  • Environment isolation: prod, staging, dev
  • Multi-tenant isolation: one namespace per customer
  • A/B testing: different namespaces per experiment variant
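A small helper composing the patterns above into one namespace value for the x-prism-cache-namespace header (the naming scheme here is an illustrative assumption, not a Prism convention):

```python
from typing import Optional

def cache_namespace(environment: str, tenant: Optional[str] = None,
                    experiment: Optional[str] = None) -> str:
    """Build a namespace string: environment, then optional tenant/variant."""
    parts = [environment]
    if tenant:
        parts.append(f"tenant-{tenant}")
    if experiment:
        parts.append(f"exp-{experiment}")
    return ":".join(parts)

assert cache_namespace("prod") == "prod"
assert cache_namespace("prod", tenant="acme") == "prod:tenant-acme"
assert cache_namespace("staging", experiment="b") == "staging:exp-b"

# Pass it per request:
# extra_headers={"x-prism-cache-namespace": cache_namespace("prod", tenant="acme")}
```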

Per-request cache control

Override cache behavior on individual requests using headers:

  • x-prism-cache-force-refresh: true - bypass the cache, fetch a fresh response, and update the cache
  • Cache-Control: no-store - disable caching for this request entirely
  • x-prism-cache-ttl: <seconds> - override the TTL for this specific response
  • x-prism-cache-namespace: <string> - route to a specific cache namespace
Prism SDK (Python):

# Force a fresh response (bypass cache)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is AI?"}],
    extra_headers={"x-prism-cache-force-refresh": "true"},
)
OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is AI?"}],
    extra_headers={"x-prism-cache-force-refresh": "true"},
)
curl:

curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "x-prism-cache-force-refresh: true" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is AI?"}]
  }'
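These control headers compose, so a request can, for example, force a refresh and set a custom TTL at once. A hypothetical helper that assembles them (header names are from the table above):

```python
from typing import Optional

def cache_control_headers(force_refresh: bool = False, no_store: bool = False,
                          ttl_seconds: Optional[int] = None,
                          namespace: Optional[str] = None) -> dict:
    """Assemble Prism's per-request cache-control headers into a dict."""
    headers = {}
    if force_refresh:
        headers["x-prism-cache-force-refresh"] = "true"
    if no_store:
        headers["Cache-Control"] = "no-store"
    if ttl_seconds is not None:
        headers["x-prism-cache-ttl"] = str(ttl_seconds)
    if namespace:
        headers["x-prism-cache-namespace"] = namespace
    return headers

# e.g. extra_headers=cache_control_headers(force_refresh=True, ttl_seconds=60)
assert cache_control_headers(force_refresh=True, ttl_seconds=60) == {
    "x-prism-cache-force-refresh": "true",
    "x-prism-cache-ttl": "60",
}
```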

Cache backends

Exact match backends:

  • In-memory (default) - single-instance deployments, development
  • Redis - multi-instance deployments, shared cache across replicas

Semantic cache backends (vector stores):

  • In-memory - development and small-scale deployments
  • Qdrant - production-grade self-hosted vector search
  • Pinecone - managed vector database

Note

Backend configuration is set at the gateway level in config.yaml. If you’re using the cloud gateway at gateway.futureagi.com, the backend is managed for you.

