A semantic cache stores past prompt-response pairs and serves new requests from the cache when the incoming request's embedding is sufficiently close to a stored one. On a hit, the model call is skipped entirely. Done well, a semantic cache cuts model cost and p95 latency by 40–70% on workloads where users ask variations of the same thing.
When it pays back.
Workloads with high query duplication: customer support FAQ, documentation Q&A, product search, internal knowledge assistants. If your top 100 queries account for over a third of traffic, semantic caching is almost always a net win.
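One rough way to check the "top 100 queries" rule of thumb against a query log is sketched below. The function name `top_n_share` and the light normalisation (trim and lowercase) are assumptions made for illustration. Exact-match counting ignores paraphrases, so the result is best read as a lower bound on how much duplication a semantic cache could exploit.

```python
from collections import Counter


def top_n_share(queries, n=100):
    """Fraction of traffic accounted for by the n most frequent queries."""
    if not queries:
        return 0.0
    # Light normalisation so trivially different spellings count as one query.
    counts = Counter(q.strip().lower() for q in queries)
    top = sum(count for _, count in counts.most_common(n))
    return top / len(queries)


# Tiny illustrative log; a real check would read production query logs.
log = [
    "How do I reset my password?",
    "how do i reset my password?",
    "Where is my invoice?",
    "Export data to CSV",
]
share = top_n_share(log, n=1)  # 2 of the 4 requests are the same query
```

If the share for `n=100` on real traffic comes out above roughly one third, the workload matches the profile described above.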
Anatomy.
- Embed the incoming query. The same embedding model used for retrieval works.
- Look up the nearest neighbor in the cache's vector index, with a strict cosine-similarity threshold (typically 0.95+).
- Serve the cached response on a hit; otherwise call the model and store the new (embedding, query, response) tuple.
- Invalidate on knowledge changes. Any cached response tied to documents that have been updated must be evicted.
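The four steps above can be sketched as a small in-memory cache. This is a minimal illustration under stated assumptions, not a production design: `SemanticCache`, `get_or_call`, `doc_ids`, and `toy_embed` are hypothetical names, the linear scan stands in for a real vector index, and the word-count embedding stands in for a real embedding model.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Entry:
    embedding: List[float]
    query: str
    response: str
    doc_ids: frozenset  # source documents this response depends on (assumption)


@dataclass
class SemanticCache:
    embed: Callable[[str], List[float]]  # supplied by the caller
    threshold: float = 0.95
    entries: List[Entry] = field(default_factory=list)

    def get_or_call(self, query, call_model, doc_ids=()):
        """Return (response, was_hit). Calls the model only on a miss."""
        vec = self.embed(query)                      # step 1: embed
        best, best_sim = None, -1.0
        for entry in self.entries:                   # step 2: nearest neighbor
            sim = cosine(vec, entry.embedding)
            if sim > best_sim:
                best, best_sim = entry, sim
        if best is not None and best_sim >= self.threshold:
            return best.response, True               # step 3: serve on hit
        response = call_model(query)                 # miss: call and store
        self.entries.append(Entry(vec, query, response, frozenset(doc_ids)))
        return response, False

    def invalidate(self, doc_id):
        """Step 4: evict every entry tied to an updated document."""
        before = len(self.entries)
        self.entries = [e for e in self.entries if doc_id not in e.doc_ids]
        return before - len(self.entries)


VOCAB = ["reset", "password", "invoice", "refund"]


def toy_embed(text):
    """Stand-in embedding: word counts over a tiny fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]
```

Tagging each entry with the documents it was generated from is one way to make eviction targeted. A cache that cannot trace responses back to their sources would have to fall back on coarser tools such as time-based expiry or a full flush.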
Common gotchas.
Threshold too low → false positives (wrong cached answer served). Threshold too high → cache rarely hits. The eval harness measures both error rates, and the threshold is tuned against the workload. Cache invalidation when knowledge changes is the second-hardest part of semantic caching, after threshold tuning.
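A threshold sweep over a labelled evaluation set makes the trade-off concrete. The sketch below assumes each eval query has already been reduced to a pair `(similarity, cached_answer_correct)`: the cosine similarity of its nearest cached entry, plus a human or automated judgment of whether that cached answer would have been acceptable. The function name, field names, and the exact rate definitions are illustrative assumptions, not a standard.

```python
def sweep_thresholds(pairs, thresholds):
    """Report hit rate and both error rates for each candidate threshold.

    pairs: list of (similarity, cached_answer_correct) tuples.
    """
    total = len(pairs)
    answerable = sum(1 for _, ok in pairs if ok)  # a correct cached answer exists
    rows = []
    for t in thresholds:
        served = [(s, ok) for s, ok in pairs if s >= t]
        wrong_served = sum(1 for _, ok in served if not ok)
        missed = sum(1 for s, ok in pairs if ok and s < t)
        rows.append({
            "threshold": t,
            # Share of all queries answered from the cache.
            "hit_rate": len(served) / total if total else 0.0,
            # Share of served hits whose cached answer was wrong.
            "false_positive_rate": wrong_served / len(served) if served else 0.0,
            # Share of answerable queries the threshold turned away.
            "miss_rate": missed / answerable if answerable else 0.0,
        })
    return rows


# Illustrative labelled data, not measurements from a real workload.
eval_pairs = [(0.99, True), (0.96, True), (0.96, False), (0.93, True), (0.80, False)]
report = sweep_thresholds(eval_pairs, [0.95, 0.97])
```

On this toy data, raising the threshold from 0.95 to 0.97 removes the one wrong hit but also turns away a correct one, which is the trade-off the tuning has to settle for a given workload.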
Frequently asked.
- What is a semantic cache?
- A semantic cache stores past prompt-response pairs and serves new requests from the cache when the new request's embedding is sufficiently close to a stored one. On a hit, the model call is skipped. Done well, it cuts cost and p95 latency by 40–70% on workloads with high query duplication.
- Is semantic caching the same as prompt caching?
- No. Prompt caching (Anthropic, OpenAI) reuses the model's computed key-value state for an identical prompt prefix to skip work on the model side. Semantic caching skips the model call entirely by serving a stored response for an embedding-similar query. The two are complementary, so use both.
- When should I not use a semantic cache?
- Workloads where every query is genuinely unique (creative writing, ad-hoc analysis), workloads where the same query should produce different responses based on per-user context the embedding doesn't capture, or workloads where the knowledge base updates faster than cache invalidation can keep up.
- What threshold should I set?
- Start at cosine similarity 0.95–0.97 and use the eval harness to measure both the false-positive rate (wrong cached answer served) and the hit rate. Lower the threshold if the hit rate is too low to meet the budget target; raise it if QA review turns up wrong cached answers.