Self-Host a Semantic LLM Cache in 5 Minutes
Your data never leaves your server. One docker compose command. No Kubernetes, no cloud account, no vendor lock-in.
Why self-host?
Three types of teams ask us for self-hosting every week:
- Healthcare / finance — data cannot leave a specific VPC or country
- Air-gapped environments — no outbound internet from the inference cluster
- Cost control — running on existing hardware that's already paid for
Cachly ships as a standard Docker image. The managed cloud version and the self-hosted version run identical code — the only difference is who operates it.
What you need
- Any Linux server with 1 GB RAM (or more)
- Docker + Docker Compose
- An embedding API key (OpenAI, Cohere, or Mistral) — or use a local model
No Kubernetes. No Helm charts. No cert-manager. If you can run docker compose up, you can run Cachly.
Quick start
Create a compose.yml:
```yaml
services:
  cachly-api:
    image: ghcr.io/cachly-dev/cachly-api:latest
    environment:
      SELF_HOSTED: "true"
      DATABASE_URL: postgres://cachly:secret@postgres:5432/cachly
      VALKEY_URL: redis://valkey:6379
      EMBEDDING_PROVIDER: openai
      EMBEDDING_API_KEY: sk-...
      ENCRYPTION_KEY: ${CACHLY_ENCRYPTION_KEY}
    ports:
      - "3001:3001"
    depends_on:
      - postgres
      - valkey
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: cachly
      POSTGRES_USER: cachly
      POSTGRES_PASSWORD: secret
    volumes:
      - pg_data:/var/lib/postgresql/data
  valkey:
    image: valkey/valkey:8-alpine
    volumes:
      - valkey_data:/data
volumes:
  pg_data:
  valkey_data:
```

Then:

```shell
export CACHLY_ENCRYPTION_KEY=$(openssl rand -hex 32)
docker compose up -d

# Create your first instance
curl -X POST http://localhost:3001/api/v1/instances \
  -H "Content-Type: application/json" \
  -d '{"name":"my-cache","tier":"free"}'
```

That's it. Cachly is running locally. Port 3001 serves the REST API — point your SDK at it.
Connect your app
Use any Cachly SDK; just override the base URL:
```python
# Python
from cachly import CachlyClient

client = CachlyClient(
    base_url="http://localhost:3001",
    api_key="your-self-hosted-key",
)
```

```typescript
// TypeScript
import { CachlyClient } from "@cachly-dev/sdk";

const client = new CachlyClient({
  baseUrl: "http://localhost:3001",
  apiKey: "your-self-hosted-key",
});
```

```python
# LangChain
import langchain
from langchain_community.cache import CachlySemanticCache

langchain.llm_cache = CachlySemanticCache(
    vector_url="http://localhost:3001/api/v1/sem/YOUR_TOKEN",
    threshold=0.92,
)
```

What SELF_HOSTED=true disables
The SELF_HOSTED=true flag disables everything cloud-specific:
| Feature | Managed | Self-Hosted |
|---|---|---|
| Stripe billing | ✅ | ❌ disabled |
| Kubernetes provisioning | ✅ | ❌ disabled |
| Multi-tenant isolation | ✅ | Single-tenant |
| Semantic cache (pgvector) | ✅ | ✅ |
| Exact cache (Valkey) | ✅ | ✅ |
| AI Brain (MCP tools) | ✅ | ✅ |
| REST API | ✅ | ✅ |
| All 15+ SDKs | ✅ | ✅ |
| Data sovereignty | EU/US/APAC | Your server |
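Under the hood, a semantic hit on either deployment is a nearest-neighbor lookup over stored prompt embeddings with a similarity threshold. A rough sketch of that decision, assuming cosine similarity (which pgvector supports; the real lookup happens inside the pgvector query, not in application code):

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def semantic_hit(query_emb, cached, threshold=0.92):
    """Return the cached answer whose embedding best matches the query,
    but only if it clears the similarity threshold; otherwise miss."""
    best = max(cached, key=lambda e: cosine(query_emb, e["embedding"]), default=None)
    if best and cosine(query_emb, best["embedding"]) >= threshold:
        return best["answer"]
    return None  # miss: fall through to the LLM
```

The threshold plays the same role as the 0.92 in the LangChain snippet earlier: lower it for more hits at the cost of looser matches.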
Air-gapped setup (no internet)
If your inference server has no outbound internet, use a local embedding model instead of OpenAI:
```yaml
# In compose.yml, add Ollama as the embedding provider:
  cachly-api:
    environment:
      EMBEDDING_PROVIDER: ollama
      EMBEDDING_BASE_URL: http://ollama:11434
      EMBEDDING_MODEL: nomic-embed-text

  ollama:
    image: ollama/ollama
    volumes:
      - ollama_data:/root/.ollama  # also declare ollama_data under the top-level volumes: key
```

nomic-embed-text produces 768-dim embeddings. Pull it once with docker exec ollama ollama pull nomic-embed-text and you're offline-ready.
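One gotcha when switching providers: the embedding dimension has to match the pgvector column, so a cache built with a 1536-dim OpenAI model can't be queried with 768-dim nomic-embed-text vectors. A hypothetical illustration (the table and column names are made up; Cachly manages its own schema, and the OpenAI model name here is an assumption):

```python
# Dimensions of two embedding models relevant to this guide.
EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,  # a common OpenAI embedding model
    "nomic-embed-text": 768,         # the Ollama model used for air-gapped setups
}


def vector_column_ddl(model: str, table: str = "semantic_cache") -> str:
    """DDL for a pgvector column sized to the chosen model (illustrative only)."""
    return f"ALTER TABLE {table} ADD COLUMN embedding vector({EMBEDDING_DIMS[model]});"
```

In practice this means re-embedding existing entries (or starting a fresh cache) whenever you change EMBEDDING_MODEL.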
Production checklist
- Set ENCRYPTION_KEY to a random 32-byte hex string (never commit it)
- Mount a persistent volume for Postgres — the pgvector index lives there
- Add nginx or Caddy in front for TLS termination
- Enable Valkey persistence (appendonly yes) for the L1 exact cache
- Set CORS_ORIGINS to your frontend domain
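For the TLS item, a minimal Caddy config is often the least effort, since Caddy provisions Let's Encrypt certificates automatically. The domain below is a placeholder:

```caddyfile
cache.example.com {
    # Terminate TLS here; forward plaintext to the Cachly API container
    reverse_proxy localhost:3001
}
```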
What's next
Once you have the cache running locally, check out the MCP server setup to give your AI coding assistant persistent memory backed by your self-hosted instance. All 30 Brain tools work identically against a local Cachly instance.
Need help with the air-gapped setup or a custom SLA? [email protected]
Cachly is a managed AI Brain for developers — persistent memory, team knowledge sharing, and semantic cache for Claude Code, Cursor, GitHub Copilot & Windsurf. One MCP server. 51 tools. Free tier, EU servers, no credit card.
Your AI is forgetting everything right now.
Every session starts blank. Every bug is re-discovered. Every deploy procedure is re-explained. Cachly fixes that in 30 seconds — your AI remembers every lesson, every fix, every teammate's hard-won knowledge. Forever.