Cut LLM Costs 80% With Semantic Caching in Agent Command Center

Turn on exact and semantic response caching at the gateway so paraphrased duplicate prompts return cached answers instead of paying for the same call twice.

📝 TL;DR

Enable caching once in the Agent Command Center dashboard, switch the strategy to semantic, and your existing OpenAI SDK code starts returning cached answers for paraphrased prompts. The x-agentcc-cache: hit_semantic response header confirms it. You walk away with sub-100ms latency and near-zero cost on duplicate traffic, no application-code rewrites.

Time: 10 min | Difficulty: Beginner | Package: agentcc
Prerequisites

Install

Install the OpenAI and Agent Command Center SDKs and set your API key.

pip install openai agentcc
export AGENTCC_API_KEY="sk-agentcc-your-key"

Tutorial

Send a baseline request and note the cost

Point the OpenAI SDK at the gateway and send a request. The response headers tell you exactly what it cost and whether it came from cache.

import os
from openai import OpenAI

API_KEY = os.environ["AGENTCC_API_KEY"]

client = OpenAI(
    api_key=API_KEY,
    base_url="https://gateway.futureagi.com/v1",
)

r = client.chat.completions.with_raw_response.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is your return policy?"}],
)
print(f"cache:   {r.headers.get('x-agentcc-cache')}")
print(f"cost:    ${r.headers.get('x-agentcc-cost')}")
print(f"latency: {r.headers.get('x-agentcc-latency-ms')}ms")

x-agentcc-cache is empty or miss on a fresh call. The cost and latency are what you’d pay every time without caching.

Turn on exact caching in the dashboard

In the dashboard, go to Gateway → Providers → Cache and click Configure Cache. Toggle:

  • Enable Response Cache: on
  • Default TTL: 1h (or whatever fits your data freshness needs)
[Screenshot: Cache Configuration card showing Enabled: Yes, L1 Backend: memory, and Semantic Cache: Disabled; exact-match caching is active]

Save. Caching is now active for every request through the gateway, with no client change required. The Cache Configuration card shows Enabled: Yes with L1 Backend: memory. The L1 layer is always exact-match; memory just means entries are stored in-process, and you can switch to Redis or disk for a multi-instance gateway. Semantic Cache: Disabled confirms only exact matches are served right now. Run the same prompt twice:

prompt = [{"role": "user", "content": "What is your return policy?"}]

r1 = client.chat.completions.with_raw_response.create(model="gpt-4o-mini", messages=prompt)
print(f"call 1: {r1.headers.get('x-agentcc-cache')} | ${r1.headers.get('x-agentcc-cost')}")

r2 = client.chat.completions.with_raw_response.create(model="gpt-4o-mini", messages=prompt)
print(f"call 2: {r2.headers.get('x-agentcc-cache')} | ${r2.headers.get('x-agentcc-cost')}")

Call 1 is miss. Call 2 is hit_exact, instant, with $0 provider cost. Exact caching is fast and free, but only helps when prompts are byte-identical.
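
A quick way to see that limitation, continuing with the same client: send the same question with a trivially different wording while only exact caching is enabled, and it registers as a fresh miss.

r3 = client.chat.completions.with_raw_response.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's your return policy?"}],  # contraction instead of "What is"
)
print(f"call 3: {r3.headers.get('x-agentcc-cache')} | ${r3.headers.get('x-agentcc-cost')}")  # miss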

Tip

Use cache namespaces to isolate environments or experiments. Set x-agentcc-cache-namespace: staging on a request to keep its cache separate from production. Each namespace is independent. A prod hit won’t leak into staging.
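
With the OpenAI SDK, that's just an extra header on the request. A minimal sketch, using the namespace header named above:

r = client.chat.completions.with_raw_response.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is your return policy?"}],
    extra_headers={"x-agentcc-cache-namespace": "staging"},  # this request reads and writes only the staging cache
)
print(r.headers.get("x-agentcc-cache"))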

Switch to semantic caching for paraphrased prompts

Real users don’t ask the same question the same way twice. Semantic caching matches prompts by meaning rather than exact text. It runs as an L2 fallback after the L1 exact-match check.
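
Conceptually, the lookup order works something like the sketch below. The embed callable and the in-memory stores are hypothetical stand-ins to show the flow, not the gateway's actual implementation.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def lookup(prompt, embed, exact_store, semantic_store, threshold=0.92):
    # L1: exact match on the raw prompt text (byte-identical only)
    if prompt in exact_store:
        return exact_store[prompt], "hit_exact"
    # L2: fall back to meaning, best cosine similarity over cached embeddings
    v = embed(prompt)
    if semantic_store:
        score, response = max(
            ((cosine(v, vec), resp) for vec, resp in semantic_store),
            key=lambda pair: pair[0],
        )
        if score >= threshold:
            return response, "hit_semantic"
    return None, "miss"  # caller queries the provider, then writes both caches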

In the same Configure Cache dialog, enable:

  • L2 Semantic Cache: on
  • Threshold: 0.92 (similarity, 0 to 1, higher is stricter)

The same client code now matches paraphrases:

prompts = [
    "What is your return policy?",
    "Can I return a product I bought?",
    "How do refunds work at your store?",
    "Tell me about returning items.",
]
for p in prompts:
    r = client.chat.completions.with_raw_response.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": p}],
    )
    print(f"{(r.headers.get('x-agentcc-cache') or 'miss'):14} | ${r.headers.get('x-agentcc-cost')} | {p}")

The first prompt is miss; the rest are paraphrases above the 0.92 similarity threshold and come back as hit_semantic with near-zero cost.

Tip

Tune the threshold carefully. Too low (e.g., 0.7) and unrelated questions collide; too high (e.g., 0.99) and you only catch near-exact matches. Start at 0.92 and adjust based on your hit rate vs false-positive rate.
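
One way to sanity-check a threshold is to embed a few of your own paraphrases and look at their similarity to the original question. The sketch below assumes the gateway also proxies the embeddings endpoint and uses text-embedding-3-small; the gateway's own embedding model isn't exposed here, so treat the exact numbers as indicative only. It reuses the cosine helper from the lookup sketch above.

texts = [
    "What is your return policy?",
    "Can I return a product I bought?",
    "What's the shipping cost?",
]
emb = client.embeddings.create(model="text-embedding-3-small", input=texts)
vecs = [d.embedding for d in emb.data]

base = vecs[0]
for text, v in zip(texts[1:], vecs[1:]):
    print(f"{cosine(base, v):.3f} | {text}")  # compare each score against your 0.92 threshold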

Measure the savings

Loop over a realistic mixed batch and tally cache hits, total cost, and latency.

import time
from collections import Counter

batch = [
    "What is your return policy?",
    "Can I return a product?",
    "How do I get a refund?",
    "What's the shipping cost?",
    "How long does shipping take?",
    "Do you ship internationally?",
] * 5  # 30 calls total

tally = Counter()
total_cost = 0.0
start = time.time()

for p in batch:
    r = client.chat.completions.with_raw_response.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": p}],
    )
    tally[r.headers.get("x-agentcc-cache") or "miss"] += 1
    total_cost += float(r.headers.get("x-agentcc-cost") or 0)

print(f"cache results: {tally}")
print(f"total cost:    ${total_cost:.5f}")
print(f"wall time:     {time.time() - start:.1f}s")

Expect ~80% hits after the first pass over each unique question (hit_exact for byte-identical, hit_semantic for paraphrases). Compare the total cost against the same batch with caching disabled. That’s your savings.
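
For a rough percentage, extend the loop above. This back-of-envelope estimate assumes every call would have cost roughly as much as an average uncached miss, which is a simplification.

miss_count = tally["miss"]
if miss_count:
    avg_miss_cost = total_cost / miss_count          # cache hits report ~$0, so misses carry the cost
    uncached_estimate = avg_miss_cost * len(batch)   # what the batch would cost with no cache at all
    print(f"estimated uncached cost: ${uncached_estimate:.5f}")
    print(f"estimated savings:       {1 - total_cost / uncached_estimate:.0%}")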

Bypass or invalidate the cache when you need fresh answers

When you change a system prompt or want a fresh response for a specific call, send x-agentcc-cache-force-refresh: true on that request. The gateway skips the cache read but still writes the new response back, so subsequent calls hit the refreshed entry.

r = client.chat.completions.with_raw_response.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is your return policy?"}],
    extra_headers={"x-agentcc-cache-force-refresh": "true"},
)
print(f"forced refresh: {r.headers.get('x-agentcc-cache')}")  # miss, then re-cached

For a global wipe after a prompt-template update, route your traffic to a fresh namespace by setting x-agentcc-cache-namespace: support-v2 instead of support. The old cache stays available to anything still pointed at support.
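
To apply the new namespace to every call instead of per request, you can set it as a default header when constructing the client; a minimal sketch using the standard OpenAI SDK option:

client_v2 = OpenAI(
    api_key=API_KEY,
    base_url="https://gateway.futureagi.com/v1",
    default_headers={"x-agentcc-cache-namespace": "support-v2"},  # every request now uses the fresh namespace
)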

You enabled exact then semantic caching in the dashboard, watched paraphrased prompts return cached responses with x-agentcc-cache: hit_semantic, and measured the cost drop on a realistic batch, without changing application code beyond pointing at the gateway.

