Infrastructure · 5 min read

Free, Private Embeddings for Your AI Dev Brain — Powered by Ollama

We now run a local embedding model on our infrastructure. Zero API keys. No prompt data leaves Germany.

The problem we kept hearing

When we launched the Cachly AI Dev Brain — persistent cross-session memory for AI coding assistants — the most common question was:

"Do I need an OpenAI API key just for memory? I'm already using Claude/Copilot/Cursor…"

Short answer: you shouldn't have to. Memory is infrastructure. You don't pay Stripe every time you write to a database. So we fixed it.

What we built

We now run nomic-embed-text via Ollama directly on our cachly infrastructure (Hetzner, Germany). Every embedding operation for the AI Brain — index_project, smart_recall, learn_from_attempts, remember_context — goes through this local model.

✅ No OPENAI_API_KEY needed — just your CACHLY_JWT
✅ No prompts, code snippets, or filenames leave Germany
✅ 768-dimension embeddings, cosine similarity search via pgvector
✅ Sub-10ms embedding latency (model stays hot in memory)
✅ Works even in air-gapped or restricted corporate environments
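
For readers unfamiliar with the similarity measure in the list above: pgvector's `<=>` operator returns cosine distance, which is 1 minus cosine similarity. The following is a minimal Python sketch of that calculation, not cachly's implementation. The function names are illustrative, and the toy vectors in the usage note stand in for real 768-dimension embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance as pgvector's <=> operator defines it: 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)
```

Vectors pointing in the same direction, such as `[1, 2, 3]` and `[2, 4, 6]`, have a distance close to 0, while orthogonal vectors such as `[1, 0]` and `[0, 1]` have a distance of 1. Ordering by ascending distance therefore returns the most similar entries first.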

nomic-embed-text: Why we chose it

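| Model | Dims | Size | MTEB Score | License |
|---|---|---|---|---|
| nomic-embed-text ★ | 768 | 274 MB | 62.4 | Apache 2.0 |
| text-embedding-3-small | 1536 | API | ~62.0 | OpenAI ToS |
| text-embedding-ada-002 | 1536 | API | 60.9 | OpenAI ToS |
| all-MiniLM-L6-v2 | 384 | 91 MB | 56.3 | Apache 2.0 |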

On MTEB, nomic-embed-text scores on par with OpenAI's text-embedding-3-small and ahead of text-embedding-ada-002. It is fully open-source, and at 274 MB it is small enough to run on existing server infrastructure without a GPU.

How it works under the hood

MCP Tool Call: smart_recall("fix deploy")
       ↓
cachly MCP Server (npm)
       ↓ HTTP
cachly API (Go) — POST /api/v1/semantic/search
       ↓
EmbedHandler.Embed("fix deploy")
       ↓ HTTP (internal Docker network)
Ollama: POST http://ollama:11434/api/embeddings
       ↓
nomic-embed-text → 768-dim float32 vector
       ↓
pgvector: SELECT ... ORDER BY embedding <=> $1 LIMIT 10
       ↓
Top-k lessons returned to your AI assistant

Everything runs on a single Hetzner CPX32 node in Germany. The Ollama container uses at most 700 MB of RAM and idles at about 62 MB between requests.
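
The embedding and search steps in the flow above could look roughly like this in Python. This is a sketch under stated assumptions, not the Go code that runs in production: the `lessons` table, its `id` and `content` columns, the `%s` placeholder style, and all function names are hypothetical. The URL, the model name, and the 768-dimension check come from the diagram, and the request and response shapes follow Ollama's `/api/embeddings` endpoint.

```python
import json
import urllib.request

OLLAMA_URL = "http://ollama:11434/api/embeddings"  # internal address from the diagram
MODEL = "nomic-embed-text"
EXPECTED_DIMS = 768

# Hypothetical table and column names; <=> is pgvector's cosine distance operator.
SEARCH_SQL = (
    "SELECT id, content FROM lessons "
    "ORDER BY embedding <=> %s::vector LIMIT 10"
)


def build_embed_request(text):
    """Build the POST request Ollama expects: a JSON body with model and prompt."""
    payload = json.dumps({"model": MODEL, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embedding(body):
    """Extract the vector from Ollama's JSON response and check its size."""
    vector = json.loads(body)["embedding"]
    if len(vector) != EXPECTED_DIMS:
        raise ValueError(f"expected {EXPECTED_DIMS} dims, got {len(vector)}")
    return vector


def to_pgvector_literal(vector):
    """Format a list of numbers in pgvector's text form, e.g. [0.5,1.0,-2.0]."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


def embed(text, timeout=5.0):
    """Send the request; only works where the Ollama host is reachable."""
    with urllib.request.urlopen(build_embed_request(text), timeout=timeout) as resp:
        return parse_embedding(resp.read())
```

With a database driver that uses `%s` placeholders, the result of `to_pgvector_literal(embed("fix deploy"))` would be passed as the single parameter to `SEARCH_SQL`.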

Bring your own model (optional)

If you'd rather use your own embedding provider, set these in your MCP config:

{
  "env": {
    "CACHLY_JWT": "your-jwt",
    "CACHLY_BRAIN_INSTANCE_ID": "your-instance-id",
    "CACHLY_EMBED_PROVIDER": "openai",
    "OPENAI_API_KEY": "sk-..."
  }
}

The Brain switches to your provider whenever CACHLY_EMBED_PROVIDER is set. Otherwise it uses our hosted nomic-embed-text instance.
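
The selection rule can be sketched in a few lines of Python. This is only an illustration of the behavior described here, not the MCP server's actual code: the function name, the `"hosted-nomic-embed-text"` label, and the check that `openai` requires `OPENAI_API_KEY` are assumptions.

```python
def resolve_embed_provider(env):
    """Pick an embedding provider from a mapping of environment variables."""
    provider = env.get("CACHLY_EMBED_PROVIDER", "").strip().lower()
    if not provider:
        # Nothing configured: use the hosted model (label is hypothetical).
        return "hosted-nomic-embed-text"
    if provider == "openai" and not env.get("OPENAI_API_KEY"):
        # Assumed validation: an OpenAI provider is unusable without a key.
        raise ValueError("CACHLY_EMBED_PROVIDER=openai requires OPENAI_API_KEY")
    return provider
```

Calling `resolve_embed_provider(os.environ)` after `import os` applies the same rule to the current process environment.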

Get started

npx @cachly-dev/mcp-server@latest autopilot

No OPENAI_API_KEY needed. Just your CACHLY_JWT.

cachly is a persistent AI Brain for developers — memory shared across Claude Code, Cursor, GitHub Copilot & Windsurf simultaneously. Auto-detects every editor. Bootstraps from your git history. 115 MCP tools. Free tier, EU servers, no credit card.
