How we cut LLM costs by 80% with Semantic Cache
Your users ask the same question in a hundred different ways. Without semantic caching, every rephrasing is a paid LLM call. Here’s the fix — and why it takes 3 lines of code.
The problem nobody talks about
When you build an AI-powered product — a chatbot, a support agent, a Q&A interface — you quickly notice something expensive: your users ask the same things, differently.
Traditional key-value caching doesn’t help here. You can’t predict every phrasing variant. So you either cache nothing (expensive) or cache everything by exact key (useless for natural language).
At cachly.dev, we saw this pattern in every AI product we built or talked to customers about. So we built a fix.
How semantic caching actually works
The core idea is simple: instead of matching prompts by exact text, match them by meaning. We use vector embeddings and pgvector similarity search.
Here’s the flow:
1. Embed the query. The incoming prompt is converted to a dense vector using your embedding model (OpenAI, Cohere, Mistral, or your own).
2. Similarity search. pgvector searches your cache index for vectors within cosine distance of your configured threshold (e.g. 0.92).
3. Cache HIT or MISS. If a match is found with similarity ≥ threshold, the cached answer is returned instantly (< 1 ms). If not, the LLM is called and the response is cached for next time.
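The three steps above can be sketched in a few lines. This is an illustrative in-memory version, not the actual SDK internals; `getOrSet` and `CacheEntry` are names we made up for the sketch, and a real deployment would use an embedding model and pgvector instead of a plain array:

```typescript
type CacheEntry = { vector: number[]; answer: string };

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function getOrSet(
  store: CacheEntry[],
  queryVector: number[],     // step 1: the already-embedded query
  threshold: number,
  callLLM: () => Promise<string>
): Promise<{ value: string; hit: boolean }> {
  // Step 2: find the nearest cached vector by cosine similarity.
  let best: CacheEntry | null = null;
  let bestSim = -1;
  for (const entry of store) {
    const sim = cosineSimilarity(queryVector, entry.vector);
    if (sim > bestSim) { bestSim = sim; best = entry; }
  }
  // Step 3: HIT if similar enough; otherwise call the LLM and cache.
  if (best && bestSim >= threshold) return { value: best.answer, hit: true };
  const answer = await callLLM();
  store.push({ vector: queryVector, answer });
  return { value: answer, hit: false };
}
```

A production setup replaces the linear scan with an indexed pgvector query, but the control flow is the same.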
The numbers
We measured across production AI applications with 10,000–500,000 LLM calls per day. A representative example of what a typical hit rate means in practice:
A support chatbot with 50,000 GPT-4o calls/day at $0.012/call spends ~$18,000/month on LLM costs. At a 75% hit rate with semantic caching: $4,500/month. That’s $13,500 back in your pocket — for a service that costs €79/month.
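The arithmetic behind that example, assuming a 30-day month:

```typescript
// Back-of-the-envelope check of the numbers above (30-day month assumed).
const callsPerDay = 50_000;
const costPerCall = 0.012; // GPT-4o, USD per call
const hitRate = 0.75;      // fraction of calls served from cache

const monthlyCost = callsPerDay * costPerCall * 30; // ≈ $18,000
const withCache = monthlyCost * (1 - hitRate);      // ≈ $4,500
const savings = monthlyCost - withCache;            // ≈ $13,500
```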
The most important setting: threshold
The similarity threshold controls how aggressively you cache. It’s the single knob that determines the accuracy/cost tradeoff.
- 0.97 – 0.99: strict. Only near-identical phrasings hit the cache.
- 0.90 – 0.96: balanced. Catches most rephrasings with little risk of mismatched answers.
- 0.80 – 0.89: aggressive. Maximizes the hit rate, but semantically different queries may collide.

Our recommendation: start at 0.92 and monitor your hit rate in the dashboard. Enable Adaptive Threshold to let the system auto-tune based on user feedback signals.
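To make the tradeoff concrete, here is a single threshold comparison on toy two-dimensional vectors (real embeddings have hundreds or thousands of dimensions; the vectors here are made up for illustration):

```typescript
// Cosine similarity on toy vectors, and how the threshold decides HIT vs MISS.
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const na = Math.sqrt(a.reduce((s, x) => s + x * x, 0));
  const nb = Math.sqrt(b.reduce((s, x) => s + x * x, 0));
  return dot / (na * nb);
}

const sim = cosine([0.8, 0.6], [0.9, 0.3]); // ≈ 0.949

const strictHit = sim >= 0.97;   // false: too strict, this pair misses
const balancedHit = sim >= 0.92; // true: the recommended setting catches it
```

The same pair of queries is a MISS at 0.97 and a HIT at 0.92, which is exactly the knob the dashboard lets you tune.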
The 3-line integration
Here’s the before/after. No architecture changes, no new infrastructure to manage.
Before: every call hits the LLM.

```ts
const answer = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: userQuestion }],
});

// "What does cachly cost?"        → $0.012
// "How much is cachly per month?" → $0.012 (same answer!)
// "cachly pricing?"               → $0.012 (same answer again!)
```

After: with the Cachly SDK.

```ts
import { createClient } from "@cachly-dev/sdk";

const cache = createClient({
  url: process.env.CACHLY_URL,
  vectorUrl: process.env.CACHLY_VECTOR_URL, // ← the only new line
});

const { value, hit } = await cache.semantic!.getOrSet(
  userQuestion,
  () =>
    openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: userQuestion }],
    }),
  { ttl: 86400, threshold: 0.92 }
);

// "What does cachly cost?"        → $0.012 (cold — LLM called once)
// "How much is cachly per month?" → $0.000 ⚡ cache HIT
// "cachly pricing?"               → $0.000 ⚡ cache HIT
```

The SDK handles embedding, vector search, and fallback to your LLM automatically. You get back `value` (the answer) and `hit` (whether it was cached) — use `hit` for your analytics.
Under the hood: pgvector on Postgres
We chose pgvector on PostgreSQL rather than a dedicated vector database for a few reasons:
- ✓Transactional consistency: Cache writes and reads are ACID. No stale cache entries from failed writes.
- ✓No extra service: Postgres already runs in every production stack. pgvector is an extension, not a new system to operate.
- ✓HNSW index: Hierarchical Navigable Small World index gives approximate nearest-neighbor search in milliseconds, even at millions of entries.
- ✓Hybrid search: We combine cosine similarity with BM25 full-text search for better precision on technical queries.
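The hybrid idea can be sketched as a weighted blend of the two scores. The weight and the normalization below are illustrative assumptions on our part; the exact production formula isn't published here:

```typescript
// Illustrative hybrid score: blend cosine similarity with a normalized BM25 score.
// `alpha` (the vector weight) is an assumption for this sketch, not a published value.
function hybridScore(
  cosineSim: number, // in [0, 1] for non-negative embeddings
  bm25: number,      // raw BM25 score for the candidate
  maxBm25: number,   // best BM25 score in the candidate set, used to normalize
  alpha = 0.7
): number {
  const keyword = maxBm25 > 0 ? bm25 / maxBm25 : 0; // normalize BM25 into [0, 1]
  return alpha * cosineSim + (1 - alpha) * keyword;
}
```

Keyword evidence then boosts candidates that share exact technical terms (error codes, product names) even when embeddings alone are ambiguous.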
The HNSW index is built per instance and rebuilds automatically as entries expire (TTL-based pruning runs every 6 hours). Each cache entry stores the original prompt, embedding vector, cached answer, namespace, and metadata.
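The entry shape and the TTL check that a pruning pass applies can be sketched like this. The field names are illustrative, not Cachly's actual schema:

```typescript
// Sketch of a cache entry (fields from the description above) and the expiry
// predicate a TTL-based pruning pass would evaluate.
interface SemanticCacheEntry {
  prompt: string;                     // original prompt text
  embedding: number[];                // vector used for similarity search
  answer: string;                     // cached LLM response
  namespace: string;
  metadata: Record<string, unknown>;
  createdAt: number;                  // epoch milliseconds
  ttl: number;                        // time-to-live in seconds
}

function isExpired(entry: SemanticCacheEntry, nowMs: number): boolean {
  return nowMs >= entry.createdAt + entry.ttl * 1000;
}
```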
What about image and multimodal queries?
Text is the easy case. Cachly also supports image embeddings (CLIP-style) and multimodal (text + image together). You pass an image as base64 or URL alongside the text, and the combined embedding is used for similarity search.
This enables semantic caching for vision-language models: two users who submit the same product photo with slightly different questions can still get a cache hit if the combined intent is close enough.
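One common way to build such a combined embedding is to normalize each modality's vector and average them; this is a generic illustration of the idea, not necessarily the product's exact fusion method:

```typescript
// Fuse a text vector and an image vector into one embedding by
// L2-normalizing each, averaging, and renormalizing the result.
function l2Normalize(v: number[]): number[] {
  const n = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return v.map((x) => x / n);
}

function fuseEmbeddings(text: number[], image: number[]): number[] {
  const t = l2Normalize(text);
  const i = l2Normalize(image);
  return l2Normalize(t.map((x, k) => (x + i[k]) / 2));
}
```

The fused vector then goes through the same pgvector similarity search as a plain text query.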
Try it free — no credit card
Free instance on German servers, GDPR-compliant, CACHLY_VECTOR_URL included. 14-day Dev trial when you sign up — 8× more memory, zero cost.