
How we cut LLM costs by 80% with Semantic Cache

Your users ask the same question in a hundred different ways. Without semantic caching, every rephrasing is a paid LLM call. Here’s the fix — and how it takes just 3 lines of code.

The problem nobody talks about

When you build an AI-powered product — a chatbot, a support agent, a Q&A interface — you quickly notice something expensive: your users ask the same things, differently.

User A: “What does cachly cost?”
User B: “How much is cachly per month?”
User C: “cachly pricing?”
User D: “Is cachly expensive?”
4 LLM calls. Same answer. 3 of them ($0.036) wasted.

Traditional key-value caching doesn’t help here. You can’t predict every phrasing variant. So you either cache nothing (expensive) or cache everything by exact key (useless for natural language).

At cachly.dev, we saw this pattern in every AI product we built or talked to customers about. So we built a fix.

How semantic caching actually works

The core idea is simple: instead of matching prompts by exact text, match them by meaning. We use vector embeddings and pgvector similarity search.

Here’s the flow:

  1. Embed the query
     The incoming prompt is converted to a dense vector using your embedding model (OpenAI, Cohere, Mistral — or your own).
  2. Similarity search
     pgvector searches your cache index for vectors whose cosine similarity to the query meets your configured threshold (e.g. 0.92).
  3. Cache HIT or MISS
     If a match is found with similarity ≥ threshold, the cached answer is returned instantly (< 1 ms). If not, the LLM is called and the response cached for next time.
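The HIT/MISS decision in steps 2 and 3 can be sketched against a plain in-memory store. This is an illustrative TypeScript sketch, not the SDK's implementation: `CacheEntry`, `cosineSimilarity`, and `lookup` are hypothetical names, and a real deployment uses pgvector's index rather than a linear scan.

```typescript
// Illustrative sketch of the HIT/MISS decision against an in-memory store.
// Names (CacheEntry, lookup) are hypothetical, not the SDK's API.

type CacheEntry = { embedding: number[]; answer: string };

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Steps 2 + 3: find the nearest cached entry; HIT only above the threshold.
function lookup(
  queryEmbedding: number[],
  entries: CacheEntry[],
  threshold = 0.92
): string | null {
  let best: CacheEntry | null = null;
  let bestScore = -Infinity;
  for (const entry of entries) {
    const score = cosineSimilarity(queryEmbedding, entry.embedding);
    if (score > bestScore) {
      bestScore = score;
      best = entry;
    }
  }
  return best !== null && bestScore >= threshold ? best.answer : null;
}
```

On a MISS the caller invokes the LLM and appends a new entry — exactly the flow that `getOrSet` wraps for you.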

The numbers

We measured across production AI applications with 10,000–500,000 LLM calls per day. Typical semantic cache hit rates:

  Hit rate   Use case                    Why
  70–90%     Customer support chatbots   High query repetition
  65–85%     Documentation Q&A           Finite topic space
  50–70%     Product recommendation      Category-level patterns
  40–65%     General assistants          Broader topic range

A support chatbot with 50,000 GPT-4o calls/day at $0.012/call spends ~$18,000/month on LLM costs. At a 75% hit rate with semantic caching: $4,500/month. That’s $13,500 back in your pocket — for a service that costs €79/month.
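The arithmetic above generalizes to a one-liner. A minimal sketch — `monthlyLLMSpend` is a hypothetical helper, not part of the SDK, and it assumes a 30-day month and a flat per-call price:

```typescript
// Hypothetical helper (not SDK code) reproducing the savings arithmetic.
// Assumes a 30-day month and a flat per-call price.
function monthlyLLMSpend(
  callsPerDay: number,
  costPerCall: number,
  hitRate: number
): { gross: number; net: number; saved: number } {
  const gross = callsPerDay * costPerCall * 30; // spend without caching
  const net = gross * (1 - hitRate);            // only misses reach the LLM
  return { gross, net, saved: gross - net };
}

const s = monthlyLLMSpend(50_000, 0.012, 0.75);
// s.gross ≈ 18000, s.net ≈ 4500, s.saved ≈ 13500 (dollars/month)
```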

The most important setting: threshold

The similarity threshold controls how aggressively you cache. It’s the single knob that determines the accuracy/cost tradeoff.

  Threshold     Behavior           Notes
  0.97 – 0.99   Near-exact match   Only catches typos and very minor paraphrasing. Low risk, low savings (~10–20%).
  0.90 – 0.96   Same intent        Catches different phrasings of the same question. Sweet spot for most apps (50–80% savings).
  0.80 – 0.89   Related topic      Broader matching. Higher savings but risk of returning semantically adjacent (not identical) answers.

Our recommendation: start at 0.92 and monitor your hit rate in the dashboard. Enable Adaptive Threshold to let the system auto-tune based on user feedback signals.
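To see how sensitive the knob is, here is a toy illustration (not SDK code): a query/cache pair with cosine similarity of exactly 0.91 flips from MISS to HIT as the threshold loosens from 0.92 to 0.90. Both vectors are unit length, so their dot product is their cosine similarity.

```typescript
// Toy illustration: the same pair flips from MISS to HIT as the threshold
// loosens. Both vectors are unit length, so dot product = cosine similarity.
const dot = (a: number[], b: number[]) => a.reduce((s, x, i) => s + x * b[i], 0);

const queryVec = [1, 0];
const cachedVec = [0.91, Math.sqrt(1 - 0.91 ** 2)]; // unit vector, similarity 0.91
const sim = dot(queryVec, cachedVec);

const strict = sim >= 0.92; // false → MISS at the recommended default
const loose = sim >= 0.9;   // true  → HIT at a looser threshold
```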

The 3-line integration

Here’s the before/after. No architecture changes, no new infrastructure to manage.

❌ Before — every rephrasing = new LLM call
const answer = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: userQuestion }],
});
// "What does cachly cost?"        → $0.012
// "How much is cachly per month?" → $0.012 (same answer!)
// "cachly pricing?"               → $0.012 (same answer again!)
✅ After — pay once, serve forever
import { createClient } from "@cachly-dev/sdk";

const cache = createClient({
  url: process.env.CACHLY_URL,
  vectorUrl: process.env.CACHLY_VECTOR_URL,  // ← the only new line
});

const { value, hit } = await cache.semantic!.getOrSet(
  userQuestion,
  () => openai.chat.completions.create({ model: "gpt-4o", messages: [...] }),
  { ttl: 86400, threshold: 0.92 }
);
// "What does cachly cost?"        → $0.012 (cold — LLM called once)
// "How much is cachly per month?" → $0.000 ⚡ cache HIT
// "cachly pricing?"               → $0.000 ⚡ cache HIT

The SDK handles embedding, vector search, and fallback to your LLM automatically. You get back value (the answer) and hit (whether it was cached) — use hit for your analytics.

Under the hood: pgvector on Postgres

We chose pgvector on PostgreSQL rather than a dedicated vector database for a few reasons:

  • Transactional consistency: Cache writes and reads are ACID. No stale cache entries from failed writes.
  • No extra service: Postgres already runs in every production stack. pgvector is an extension, not a new system to operate.
  • HNSW index: Hierarchical Navigable Small World index gives approximate nearest-neighbor search in milliseconds, even at millions of entries.
  • Hybrid search: We combine cosine similarity with BM25 full-text search for better precision on technical queries.

The HNSW index is built per instance and rebuilds automatically as entries expire (TTL-based pruning runs every 6 hours). Each cache entry stores the original prompt, embedding vector, cached answer, namespace, and metadata.
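For intuition, here is a sketch of what a pgvector-backed cache table and lookup can look like. The table and column names are illustrative assumptions, not cachly's actual schema; `<=>` is pgvector's cosine-distance operator and `vector_cosine_ops` its HNSW operator class, so similarity is `1 - distance`.

```typescript
// Sketch of a pgvector-backed semantic cache (illustrative schema only).
// <=> is pgvector's cosine *distance* operator: similarity = 1 - distance.

const createTableSQL = `
  CREATE EXTENSION IF NOT EXISTS vector;
  CREATE TABLE IF NOT EXISTS semantic_cache (
    id         bigserial PRIMARY KEY,
    prompt     text NOT NULL,
    embedding  vector(1536) NOT NULL,      -- dimension depends on your model
    answer     text NOT NULL,
    namespace  text NOT NULL DEFAULT 'default',
    expires_at timestamptz NOT NULL
  );
  -- HNSW index for millisecond approximate nearest-neighbor search.
  CREATE INDEX IF NOT EXISTS semantic_cache_embedding_hnsw
    ON semantic_cache USING hnsw (embedding vector_cosine_ops);
`;

const lookupSQL = `
  SELECT answer, 1 - (embedding <=> $1::vector) AS similarity
  FROM semantic_cache
  WHERE expires_at > now()
  ORDER BY embedding <=> $1::vector
  LIMIT 1;
`;

// pgvector accepts vectors as bracketed string literals: '[0.1,0.2,...]'.
function toVectorLiteral(v: number[]): string {
  return `[${v.join(",")}]`;
}
```

The HIT/MISS decision then happens in application code by comparing the returned similarity against the configured threshold.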

What about image and multimodal queries?

Text is the easy case. Cachly also supports image embeddings (CLIP-style) and multimodal (text + image together). You pass an image as base64 or URL alongside the text, and the combined embedding is used for similarity search.

This enables semantic caching for vision-language models: two users who submit the same product photo with slightly different questions can still get a cache hit if the combined intent is close enough.

Try it free — no credit card

Free instance on German servers, GDPR-compliant, CACHLY_VECTOR_URL included. 14-day Dev trial when you sign up — 8× more memory, zero cost.