
What Is Semantic Caching for LLMs? A Complete Guide

Learn how semantic caching works for LLM APIs, why it's different from traditional caching, and how it can eliminate 20-35% of your API costs instantly.


Traditional caching matches exact strings. "What's the capital of France?" and "What is the capital of France?" are treated as different requests, so you pay twice for the same answer.

Semantic caching solves this by comparing the meaning of requests, not the text. It uses embedding vectors and cosine similarity to detect when two requests are asking the same thing, even if the wording is different.

How Semantic Caching Works

1. When a request comes in, the prompt is converted into an embedding vector - a numerical representation of its meaning.

2. This vector is compared against cached embeddings using cosine similarity (at scale, via a nearest-neighbor index rather than a brute-force scan).

3. If the similarity score exceeds a threshold (typically 0.95), the cached response is returned instantly.

4. If it's a miss, the request goes to the LLM, and the response is cached for future matches.
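The four steps above can be sketched in a few dozen lines of Python. This is a minimal in-memory illustration: the `toy_embed` bag-of-words function stands in for a real embedding model, so paraphrase scores come out lower than a real model would produce, and the demo threshold is correspondingly lower than the 0.95 you'd use in production.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Toy bag-of-words 'embedding', for illustration only.
    A real system would call an embedding model here."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}

def cosine(a, b):
    # Vectors are already unit-normalized, so cosine is just a dot product.
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())

class SemanticCache:
    def __init__(self, embed_fn, threshold):
        self.embed_fn = embed_fn
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def lookup(self, prompt):
        """Return (response, score) on a hit, or (None, best_score) on a miss."""
        vec = self.embed_fn(prompt)
        best_score, best_response = 0.0, None
        for cached_vec, response in self.entries:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.threshold:
            return best_response, best_score
        return None, best_score

    def store(self, prompt, response):
        self.entries.append((self.embed_fn(prompt), response))

# Lower demo threshold because toy embeddings under-score paraphrases;
# with a real embedding model, start at 0.95.
cache = SemanticCache(toy_embed, threshold=0.7)
cache.store("What's the capital of France?", "Paris")
hit, score = cache.lookup("capital of france?")          # hit: overlapping meaning
miss, _ = cache.lookup("How do I reset my password?")    # miss: unrelated prompt
```

On a miss, a production wrapper would call the LLM, `store()` the response, and return it, so step 4 completes the loop.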

Semantic matching in action
Request 1: "What's the capital of France?"        → Miss → LLM call → cached
Request 2: "capital of france?"                    → Hit (0.97) → instant, $0
Request 3: "What is France's capital city?"        → Hit (0.96) → instant, $0
Request 4: "What's the capital of Germany?"        → Miss (0.72) → LLM call

3 out of 4 requests answered from cache = 75% cost reduction for this pattern

Semantic vs. Traditional Caching

The difference is significant:

Feature            Traditional Cache     Semantic Cache
Match type         Exact string match    Meaning-based similarity
Hit rate           5-10% (typical)       20-35% (typical)
"Hello" vs "Hi"    Miss                  Hit (0.93 similarity)
Typo tolerance     None                  Full
Latency on hit     ~1ms                  ~2ms
Storage            Key-value             Vector + key-value

Tuning the Similarity Threshold

The threshold controls the trade-off between hit rate and precision:

0.99: Very strict - only near-identical prompts match. Low hit rate, zero risk of wrong answers.

0.95 (recommended): Good balance. Catches rephrased questions while avoiding false positives.

0.90: Aggressive - higher hit rate but may match semantically different requests. Good for FAQ-style apps.

0.85: Too loose for most use cases. High false positive risk.

Start with 0.95 and adjust based on your use case. Monitor cache hit quality in your analytics dashboard.
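The trade-off is easy to see with a handful of score pairs. The similarity numbers below are illustrative assumptions, not output from a real embedding model - but the refund/return pair shows the classic failure mode of a loose threshold: high lexical overlap, different answer.

```python
# Illustrative similarity scores (assumed values, not from a real model)
pairs = [
    ("What's the capital of France?", "What is France's capital city?", 0.96),  # true paraphrase
    ("Hello", "Hi", 0.93),                                                      # greeting variants
    ("What's your refund policy?", "What's your return policy?", 0.91),         # different questions!
    ("What's the capital of France?", "What's the capital of Germany?", 0.72),  # different answer
]

def hits_at(threshold):
    """Which pairs would a cache at this threshold treat as 'the same'?"""
    return [(a, b) for a, b, score in pairs if score >= threshold]

print(len(hits_at(0.99)))  # 0 - very strict, nothing matches
print(len(hits_at(0.95)))  # 1 - only the true paraphrase
print(len(hits_at(0.90)))  # 3 - now includes the refund/return false positive
```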

When Semantic Caching Works Best

Customer support bots: Users ask the same questions in different ways. Cache hit rates of 30-50% are common.

FAQ / knowledge base apps: Highly repetitive queries. 40-60% hit rates.

Internal tools: Same team asking similar analytical questions. 20-30% hit rates.

Creative / generative apps: Low repetition, so caching adds less value. 5-10% hit rates.

The key insight is that most production LLM apps have more repetition than developers expect. Even a 20% cache hit rate means 20% of your API costs eliminated, with no quality impact as long as the threshold is tuned conservatively.

Implementing Semantic Caching

You can build semantic caching yourself with a vector database (Pinecone, Redis VSS, or pgvector) and an embedding model. But managing the infrastructure, TTL policies, cache invalidation, and monitoring adds significant complexity.
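To give a flavor of that hidden complexity, here is the TTL bookkeeping alone, sketched in memory with an injectable clock (the class and field names are ours, not any library's; a real build would layer this on a vector database and handle invalidation, eviction, and monitoring too):

```python
import time

class TTLStore:
    """Minimal sketch of TTL handling for a DIY semantic cache."""
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so tests can fake the passage of time
        self.entries = []    # (embedding, response, expires_at)

    def store(self, embedding, response):
        self.entries.append((embedding, response, self.clock() + self.ttl))

    def live_entries(self):
        """Drop expired entries; similarity search runs over what remains."""
        now = self.clock()
        self.entries = [e for e in self.entries if e[2] > now]
        return self.entries

# Fake clock to demonstrate expiry without sleeping
t = [0.0]
store = TTLStore(ttl_seconds=3600, clock=lambda: t[0])
store.store([0.1, 0.2], "Paris")
t[0] = 3601.0
print(len(store.live_entries()))  # 0 - entry expired after the one-hour TTL
```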

Promptly includes semantic caching out of the box. Enable it in settings, set your similarity threshold and TTL, and every request is automatically cached. Cache hits show up in your analytics with $0 cost.

Enable in one API call
PUT /api/optimization/settings
{
  "cache_enabled": true,
  "cache_similarity_threshold": 0.95,
  "cache_ttl_seconds": 3600
}

Ready to cut your LLM costs?

Promptly optimizes every API call automatically - smart routing, caching, prompt compression, and context pruning in one proxy.