Semantic cache · Morvion Glossary

A semantic cache stores past prompt-response pairs and serves new requests from the cache when the incoming request's embedding is sufficiently close to a stored one. The model call is skipped entirely. Done well, a semantic cache cuts AI cost and p95 latency by 40–70% on workloads where users ask variations of the same thing.

When it pays back.

Workloads with high query duplication: customer support FAQ, documentation Q&A, product search, internal knowledge assistants. If your top 100 queries account for over a third of traffic, semantic caching is almost always a net win.
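One rough way to check the "top 100 queries" rule of thumb is to count duplicates in a sample of your query log. The sketch below assumes the log is available as a plain list of strings; `normalize` and `top_n_share` are hypothetical helper names, and the normalization (lowercasing, collapsing whitespace) is only an illustration. Exact-match counting understates semantic duplication, so treat the result as a lower bound.

```python
from collections import Counter


def normalize(query: str) -> str:
    # Hypothetical normalization: lowercase and collapse runs of whitespace.
    return " ".join(query.lower().split())


def top_n_share(queries: list[str], n: int = 100) -> float:
    """Fraction of traffic covered by the n most frequent normalized queries."""
    if not queries:
        return 0.0
    counts = Counter(normalize(q) for q in queries)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(queries)


# Tiny made-up log, for illustration only.
log = [
    "How do I reset my password?",
    "how do i reset my password?",
    "Where is my invoice?",
    "How do I reset my  password?",
    "Can I export my data?",
    "What is the refund policy?",
]
print(f"top-1 share: {top_n_share(log, n=1):.2f}")  # prints "top-1 share: 0.50"
```

On a real log you would call `top_n_share(queries, n=100)` and compare the result against the one-third mark mentioned above.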

Anatomy.

1. Embed the incoming query. The same embedding model you use for retrieval works here.
2. Look up the nearest neighbor in the cache's vector index, with a strict similarity threshold (typically 0.95+).
3. Serve the cached response on a hit; otherwise call the model and store the new (embedding, query, response) tuple.
4. Invalidate on knowledge changes. Any cached response tied to documents that have been updated must be evicted.
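The four steps above can be sketched in a few dozen lines. This is a minimal in-memory illustration under stated assumptions, not a production design: `SemanticCache`, `Entry`, `answer`, and `toy_embed` are hypothetical names, the brute-force scan stands in for a real vector index, and `toy_embed` is a hashed bag-of-words placeholder for an actual embedding model.

```python
import math
import re
import zlib
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional

Vector = list[float]


def toy_embed(text: str, dim: int = 256) -> Vector:
    # Placeholder embedding (assumption): hashed bag of words, NOT a real model.
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    return vec


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Entry:
    embedding: Vector
    query: str
    response: str
    doc_ids: frozenset  # source documents this response depends on


@dataclass
class SemanticCache:
    embed: Callable[[str], Vector]
    threshold: float = 0.95
    entries: list = field(default_factory=list)

    def lookup(self, query: str) -> Optional[str]:
        # Steps 1-2: embed the query, then find the nearest stored entry.
        vec = self.embed(query)
        best_score, best_entry = -1.0, None
        for entry in self.entries:  # brute force; a vector index would go here
            score = cosine(vec, entry.embedding)
            if score > best_score:
                best_score, best_entry = score, entry
        if best_entry is not None and best_score >= self.threshold:
            return best_entry.response
        return None

    def store(self, query: str, response: str, doc_ids: Iterable[str] = ()) -> None:
        self.entries.append(
            Entry(self.embed(query), query, response, frozenset(doc_ids))
        )

    def invalidate(self, doc_id: str) -> int:
        # Step 4: evict every entry tied to an updated document.
        before = len(self.entries)
        self.entries = [e for e in self.entries if doc_id not in e.doc_ids]
        return before - len(self.entries)


def answer(cache: SemanticCache, query: str,
           call_model: Callable[[str], str], doc_ids: Iterable[str] = ()) -> str:
    # Step 3: serve on a hit, otherwise call the model and store the result.
    cached = cache.lookup(query)
    if cached is not None:
        return cached
    response = call_model(query)
    cache.store(query, response, doc_ids)
    return response
```

Tracking `doc_ids` per entry is one possible way to make invalidation cheap to reason about: when a document changes, `cache.invalidate("kb-12")` evicts everything that was generated from it.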

Common gotchas.

Threshold too low → false positives (wrong cached answer served). Threshold too high → missed hits (the cache rarely fires). Measure both failure modes with the eval harness and tune the threshold against the actual workload. After threshold tuning, cache invalidation when knowledge changes is the second-hardest part of semantic caching.
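To make the trade-off concrete, a threshold sweep over a labeled sample can report both numbers side by side. The sketch below assumes each eval case records the similarity to its nearest cached entry plus a reviewer label saying whether that cached answer would have been correct. `LabeledPair` and `sweep` are hypothetical names, the data is made up, and "false-positive rate" is defined here as the share of served hits that were wrong, which is one of several reasonable definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledPair:
    similarity: float  # cosine similarity to the nearest cached entry
    answer_ok: bool    # reviewer label: would the cached answer be correct?


def sweep(pairs: list[LabeledPair], thresholds: list[float]) -> list[tuple]:
    """Return (threshold, hit_rate, false_positive_rate) for each threshold."""
    rows = []
    for t in thresholds:
        hits = [p for p in pairs if p.similarity >= t]
        wrong = sum(1 for p in hits if not p.answer_ok)
        hit_rate = len(hits) / len(pairs) if pairs else 0.0
        fp_rate = wrong / len(hits) if hits else 0.0
        rows.append((t, hit_rate, fp_rate))
    return rows


# Made-up eval sample, for illustration only.
sample = [
    LabeledPair(0.99, True),
    LabeledPair(0.97, True),
    LabeledPair(0.96, False),
    LabeledPair(0.94, True),
    LabeledPair(0.93, False),
    LabeledPair(0.80, False),
]

for t, hit_rate, fp_rate in sweep(sample, [0.92, 0.95, 0.97]):
    print(f"threshold={t:.2f}  hit_rate={hit_rate:.2f}  fp_rate={fp_rate:.2f}")
```

On this toy sample, raising the threshold from 0.92 to 0.97 drops the hit rate from 5 of 6 queries to 2 of 6 while removing the wrong answers from the served set, which is the tension described above.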

Frequently asked.

What is a semantic cache?

A semantic cache stores past prompt-response pairs and serves new requests from the cache when the new request's embedding is sufficiently close to a stored one. The model call is skipped. Done well, it cuts cost and p95 latency by 40–70% on workloads with high query duplication.

Is semantic caching the same as prompt caching?

No. Prompt caching (Anthropic, OpenAI) reuses the model's computed key-value state for an identical prompt prefix, which saves work on the model side but still runs the model. Semantic caching skips the model call entirely by serving a stored response for an embedding-similar query. The two are complementary, so use both.

When should I not use a semantic cache?

Avoid it for workloads where every query is genuinely unique (creative writing, ad-hoc analysis), workloads where the same query should produce different responses based on per-user context the embedding doesn't capture, and workloads where the knowledge base updates faster than cache invalidation can keep up.

What threshold should I set?

Start at cosine similarity 0.95–0.97 and measure both the false-positive rate (wrong cached answer served) and the hit rate against the eval harness. Lower the threshold if the hit rate is too low to meet the budget target; raise it if QA review turns up wrong cached answers being served.
