How we cut LLM costs by 80% with Semantic Cache
Your users ask the same question in a hundred different ways. Without semantic caching, every rephrasing is a paid LLM call. Here’s the fix — and why it takes 3 lines of code.
The problem nobody talks about
When you build an AI-powered product — a chatbot, a support agent, a Q&A interface — you quickly notice something expensive: your users ask the same things, differently.
Traditional key-value caching doesn’t help here. You can’t predict every phrasing variant. So you either cache nothing (expensive) or cache everything by exact key (useless for natural language).
At cachly.dev, we saw this pattern in every AI product we built or talked to customers about. So we built a fix.
How semantic caching actually works
The core idea is simple: instead of matching prompts by exact text, match them by meaning. We use vector embeddings and pgvector similarity search.
Here’s the flow:
1. Embed the query. The incoming prompt is converted to a dense vector using your embedding model (OpenAI, Cohere, Mistral, or your own).
2. Similarity search. pgvector searches your cache index for vectors within cosine distance of your configured threshold (e.g. 0.92).
3. Cache HIT or MISS. If a match is found with similarity ≥ threshold, the cached answer is returned instantly (< 1 ms). If not, the LLM is called and the response is cached for next time.
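The three steps above can be sketched in a few lines. This is an illustrative in-memory version, not the actual SDK internals; `getOrSet` and `CacheEntry` are names we made up for the sketch, and a real deployment would use an embedding model and pgvector instead of a plain array:

```typescript
type CacheEntry = { vector: number[]; answer: string };

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function getOrSet(
  store: CacheEntry[],
  queryVector: number[],     // step 1: the already-embedded query
  threshold: number,
  callLLM: () => Promise<string>
): Promise<{ value: string; hit: boolean }> {
  // Step 2: find the nearest cached vector by cosine similarity.
  let best: CacheEntry | null = null;
  let bestSim = -1;
  for (const entry of store) {
    const sim = cosineSimilarity(queryVector, entry.vector);
    if (sim > bestSim) { bestSim = sim; best = entry; }
  }
  // Step 3: HIT if similar enough; otherwise call the LLM and cache.
  if (best && bestSim >= threshold) return { value: best.answer, hit: true };
  const answer = await callLLM();
  store.push({ vector: queryVector, answer });
  return { value: answer, hit: false };
}
```

A production setup replaces the linear scan with an indexed pgvector query, but the control flow is the same.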
The numbers
We measured across production AI applications with 10,000–500,000 LLM calls per day. A representative example of what a typical hit rate means in practice:
A support chatbot with 50,000 GPT-4o calls/day at $0.012/call spends ~$18,000/month on LLM costs. At a 75% hit rate with semantic caching: $4,500/month. That’s $13,500 back in your pocket — for a service that costs €79/month.
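The arithmetic behind that example, assuming a 30-day month:

```typescript
// Back-of-the-envelope check of the numbers above (30-day month assumed).
const callsPerDay = 50_000;
const costPerCall = 0.012; // GPT-4o, USD per call
const hitRate = 0.75;      // fraction of calls served from cache

const monthlyCost = callsPerDay * costPerCall * 30; // ≈ $18,000
const withCache = monthlyCost * (1 - hitRate);      // ≈ $4,500
const savings = monthlyCost - withCache;            // ≈ $13,500
```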
The most important setting: threshold
The similarity threshold controls how aggressively you cache. It’s the single knob that determines the accuracy/cost tradeoff.
- 0.97 – 0.99: strict. Only near-identical phrasings hit the cache.
- 0.90 – 0.96: balanced. Catches most rephrasings with little risk of mismatched answers.
- 0.80 – 0.89: aggressive. Maximizes the hit rate, but semantically different queries may collide.

Our recommendation: start at 0.92 and monitor your hit rate in the dashboard. Enable Adaptive Threshold to let the system auto-tune based on user feedback signals.
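To make the tradeoff concrete, here is a single threshold comparison on toy two-dimensional vectors (real embeddings have hundreds or thousands of dimensions; the vectors here are made up for illustration):

```typescript
// Cosine similarity on toy vectors, and how the threshold decides HIT vs MISS.
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const na = Math.sqrt(a.reduce((s, x) => s + x * x, 0));
  const nb = Math.sqrt(b.reduce((s, x) => s + x * x, 0));
  return dot / (na * nb);
}

const sim = cosine([0.8, 0.6], [0.9, 0.3]); // ≈ 0.949

const strictHit = sim >= 0.97;   // false: too strict, this pair misses
const balancedHit = sim >= 0.92; // true: the recommended setting catches it
```

The same pair of queries is a MISS at 0.97 and a HIT at 0.92, which is exactly the knob the dashboard lets you tune.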
The 3-line integration
Here’s the before/after. No architecture changes, no new infrastructure to manage.
Before: every call hits the LLM.

```ts
const answer = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: userQuestion }],
});

// "What does cachly cost?"        → $0.012
// "How much is cachly per month?" → $0.012 (same answer!)
// "cachly pricing?"               → $0.012 (same answer again!)
```

After: with the Cachly SDK.

```ts
import { createClient } from "@cachly-dev/sdk";

const cache = createClient({
  url: process.env.CACHLY_URL,
  vectorUrl: process.env.CACHLY_VECTOR_URL, // ← the only new line
});

const { value, hit } = await cache.semantic!.getOrSet(
  userQuestion,
  () =>
    openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: userQuestion }],
    }),
  { ttl: 86400, threshold: 0.92 }
);

// "What does cachly cost?"        → $0.012 (cold — LLM called once)
// "How much is cachly per month?" → $0.000 ⚡ cache HIT
// "cachly pricing?"               → $0.000 ⚡ cache HIT
```

The SDK handles embedding, vector search, and fallback to your LLM automatically. You get back `value` (the answer) and `hit` (whether it was cached) — use `hit` for your analytics.
Under the hood: pgvector on Postgres
We chose pgvector on PostgreSQL rather than a dedicated vector database for a few reasons:
- ✓Transactional consistency: Cache writes and reads are ACID. No stale cache entries from failed writes.
- ✓No extra service: Postgres already runs in every production stack. pgvector is an extension, not a new system to operate.
- ✓HNSW index: Hierarchical Navigable Small World index gives approximate nearest-neighbor search in milliseconds, even at millions of entries.
- ✓Hybrid search: We combine cosine similarity with BM25 full-text search for better precision on technical queries.
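The hybrid idea can be sketched as a weighted blend of the two scores. The weight and the normalization below are illustrative assumptions on our part; the exact production formula isn't published here:

```typescript
// Illustrative hybrid score: blend cosine similarity with a normalized BM25 score.
// `alpha` (the vector weight) is an assumption for this sketch, not a published value.
function hybridScore(
  cosineSim: number, // in [0, 1] for non-negative embeddings
  bm25: number,      // raw BM25 score for the candidate
  maxBm25: number,   // best BM25 score in the candidate set, used to normalize
  alpha = 0.7
): number {
  const keyword = maxBm25 > 0 ? bm25 / maxBm25 : 0; // normalize BM25 into [0, 1]
  return alpha * cosineSim + (1 - alpha) * keyword;
}
```

Keyword evidence then boosts candidates that share exact technical terms (error codes, product names) even when embeddings alone are ambiguous.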
The HNSW index is built per instance and rebuilds automatically as entries expire (TTL-based pruning runs every 6 hours). Each cache entry stores the original prompt, embedding vector, cached answer, namespace, and metadata.
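The entry shape and the TTL check that a pruning pass applies can be sketched like this. The field names are illustrative, not Cachly's actual schema:

```typescript
// Sketch of a cache entry (fields from the description above) and the expiry
// predicate a TTL-based pruning pass would evaluate.
interface SemanticCacheEntry {
  prompt: string;                     // original prompt text
  embedding: number[];                // vector used for similarity search
  answer: string;                     // cached LLM response
  namespace: string;
  metadata: Record<string, unknown>;
  createdAt: number;                  // epoch milliseconds
  ttl: number;                        // time-to-live in seconds
}

function isExpired(entry: SemanticCacheEntry, nowMs: number): boolean {
  return nowMs >= entry.createdAt + entry.ttl * 1000;
}
```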
What about image and multimodal queries?
Text is the easy case. Cachly also supports image embeddings (CLIP-style) and multimodal (text + image together). You pass an image as base64 or URL alongside the text, and the combined embedding is used for similarity search.
This enables semantic caching for vision-language models: two users who submit the same product photo with slightly different questions can still get a cache hit if the combined intent is close enough.
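One common way to build such a combined embedding is to normalize each modality's vector and average them; this is a generic illustration of the idea, not necessarily the product's exact fusion method:

```typescript
// Fuse a text vector and an image vector into one embedding by
// L2-normalizing each, averaging, and renormalizing the result.
function l2Normalize(v: number[]): number[] {
  const n = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return v.map((x) => x / n);
}

function fuseEmbeddings(text: number[], image: number[]): number[] {
  const t = l2Normalize(text);
  const i = l2Normalize(image);
  return l2Normalize(t.map((x, k) => (x + i[k]) / 2));
}
```

The fused vector then goes through the same pgvector similarity search as a plain text query.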
Try it free — no credit card
Free instance on German servers, GDPR-compliant, CACHLY_VECTOR_URL included. 14-day Dev trial when you sign up — 8× more memory, zero cost.