Semantic Caching

cachly's semantic cache layer stores query embeddings in pgvector and returns cached responses for semantically similar queries, even when the exact wording differs.

How It Works

  1. Embed: your query is embedded into a high-dimensional vector using the configured model.

  2. Search: pgvector performs an approximate nearest-neighbor search across the cached embeddings.

  3. Match: if the cosine similarity exceeds your threshold (default 0.92), the cached response is returned in under 1 ms.

  4. Miss: the request passes through to your origin, and the response is cached so later similar queries can hit it.
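The four steps above can be sketched in plain Python. This is a toy, in-memory illustration, not cachly's implementation: it replaces pgvector's approximate nearest-neighbor search with an exact linear scan, takes a hypothetical `embed` function instead of a real embedding model, and ignores TTLs. `ToySemanticCache` and `get_or_compute` are illustrative names.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToySemanticCache:
    """Hypothetical in-memory model of Embed -> Search -> Match/Miss.

    Uses an exact linear scan; pgvector would use an approximate
    nearest-neighbor index instead. Entries never expire here.
    """

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.92):
        self.embed = embed          # hypothetical embedding function
        self.threshold = threshold  # same default as the docs
        self.entries: List[Tuple[Vector, str]] = []

    def _search(self, vector: Vector) -> Optional[str]:
        best_score, best_value = -1.0, None
        for cached_vector, value in self.entries:        # 2. Search (exact scan)
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_value = score, value
        if best_value is not None and best_score > self.threshold:
            return best_value                            # 3. Match
        return None

    def get_or_compute(self, query: str, origin: Callable[[str], str]) -> str:
        vector = self.embed(query)                       # 1. Embed (once)
        cached = self._search(vector)
        if cached is not None:
            return cached
        value = origin(query)                            # 4. Miss: call origin
        self.entries.append((vector, value))             # ...and cache it
        return value
```

A linear scan is O(n) per lookup, which is exactly the cost an ANN index such as pgvector's is meant to avoid at scale; the scan only keeps the sketch short.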

API Example

POST /api/v1/cache/semantic
Authorization: Bearer <API_KEY>
Content-Type: application/json

{
  "query": "What is the capital of France?",
  "threshold": 0.92,
  "ttl": 3600
}

A semantically equivalent query such as "France's capital city?" will then hit the cache, provided its similarity to the cached query exceeds the threshold.
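The request above can be assembled with Python's standard library. This sketch only builds the request and does not send it; the base URL is a hypothetical placeholder because the docs give only the path, and the response format is not described here.

```python
import json
import urllib.request

# Hypothetical base URL: the docs only document the path.
BASE_URL = "https://cachly.example"


def build_semantic_request(api_key: str, query: str,
                           threshold: float = 0.92,
                           ttl: int = 3600) -> urllib.request.Request:
    """Build (but do not send) the POST /api/v1/cache/semantic request."""
    body = json.dumps({"query": query, "threshold": threshold, "ttl": ttl})
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/cache/semantic",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


req = build_semantic_request("ck_...", "France's capital city?")
# Sending it would be urllib.request.urlopen(req); the response schema is not documented here.
```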

🎨 Multimodal Support

Vector Caching v2 extends semantic caching to images, audio, and mixed media. Each modality gets its own embedding model and vector space.

📝 Text: natural language queries and completions

🖼️ Images: CLIP-based embeddings for visual similarity

🎵 Audio: Whisper-based transcription followed by embedding
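Because each modality has its own embedding model and vector space, a lookup should only ever compare vectors within one modality. The sketch below models that with one list per modality; `MultimodalIndex` is a hypothetical name, and the exact scan again stands in for pgvector's ANN search.

```python
import math
from collections import defaultdict

MODALITIES = {"text", "image", "audio"}


def cosine(a, b):
    """Cosine similarity; refuses vectors of different dimensionality."""
    if len(a) != len(b):
        raise ValueError("vectors from different spaces cannot be compared")
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class MultimodalIndex:
    """Hypothetical index keeping one vector space per modality."""

    def __init__(self):
        self.spaces = defaultdict(list)  # modality -> [(vector, value)]

    def add(self, modality, vector, value):
        if modality not in MODALITIES:
            raise ValueError(f"unknown modality: {modality}")
        self.spaces[modality].append((vector, value))

    def best_match(self, modality, vector, threshold):
        """Return the best value whose similarity exceeds threshold, else None."""
        if modality not in MODALITIES:
            raise ValueError(f"unknown modality: {modality}")
        best, best_score = None, threshold
        for cached, value in self.spaces[modality]:  # never crosses modalities
            score = cosine(vector, cached)
            if score > best_score:
                best, best_score = value, score
        return best
```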

POST /api/v1/cache/semantic
Authorization: Bearer <API_KEY>
Content-Type: application/json

{
  "modality": "image",
  "data": "<base64-encoded-image>",
  "threshold": 0.88,
  "ttl": 7200
}
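The image request carries the raw image as a base64 string in `data`. A minimal way to produce that body in Python is shown below, assuming plain base64 without a `data:` URI prefix (the placeholder suggests this, but it is not stated). `image_payload` is a hypothetical helper, and its `threshold`/`ttl` defaults simply mirror the example values rather than documented defaults.

```python
import base64
import json


def image_payload(image_bytes: bytes, threshold: float = 0.88, ttl: int = 7200) -> str:
    """Serialize an image lookup body in the shape of the example request."""
    return json.dumps({
        "modality": "image",
        # Plain base64 text; whether a data-URI prefix is accepted is not documented.
        "data": base64.b64encode(image_bytes).decode("ascii"),
        "threshold": threshold,
        "ttl": ttl,
    })


# Usage: body = image_payload(open("photo.png", "rb").read())
```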

SDK Snippets

Python

from cachly import Cachly

client = Cachly(api_key="ck_...")
question = "What is the capital of France?"

# Semantic cache lookup
hit = client.semantic.get(question)
if hit:
    print("Cache hit:", hit.value)
else:
    # Miss: compute the answer and store it for one hour.
    # call_llm is a placeholder for your own model call.
    answer = call_llm(question)
    client.semantic.set(question, answer, ttl=3600)
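The lookup-then-store pattern in the snippet above is plain cache-aside and can be wrapped in a reusable helper. The sketch assumes only the shape the snippet shows: `get(query)` returning an object with `.value` (or nothing on a miss), and `set(query, value, ttl=...)`. `cached_answer` and `InMemorySemantic` are hypothetical; the latter is an exact-match stand-in for testing, not a semantic cache.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Hit:
    value: Any


class InMemorySemantic:
    """Hypothetical exact-match stand-in with the same get/set shape."""

    def __init__(self):
        self.store: Dict[str, Any] = {}
        self.ttls: Dict[str, int] = {}

    def get(self, query: str) -> Optional[Hit]:
        return Hit(self.store[query]) if query in self.store else None

    def set(self, query: str, value: Any, ttl: int = 3600) -> None:
        self.store[query] = value
        self.ttls[query] = ttl  # recorded only; this stand-in never expires entries


def cached_answer(cache, query: str, compute: Callable[[str], Any], ttl: int = 3600) -> Any:
    """Cache-aside: return a hit if present, otherwise compute and store."""
    hit = cache.get(query)
    if hit:
        return hit.value
    answer = compute(query)
    cache.set(query, answer, ttl=ttl)
    return answer


# With the SDK (assumed): cached_answer(client.semantic, question, call_llm)
```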

TypeScript

import { Cachly } from "@cachly/sdk";

const client = new Cachly({ apiKey: "ck_..." });

const hit = await client.semantic.get("What is the capital of France?");
if (hit) {
  console.log("Cache hit:", hit.value);
}

Configuration

Parameter   Default   Description
threshold   0.92      Cosine similarity threshold for a cache hit
ttl         3600      Time-to-live in seconds
modality    text      Embedding modality: text, image, or audio
model       auto      Embedding model override (each modality has its own default)
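A client can mirror these parameters and defaults in a small config object so that bad values fail before a request is sent. The range checks below are assumptions (cosine similarity lies in [-1, 1]; a TTL should be positive), not documented server limits, and sending `model` only when it is overridden is likewise an assumption, since the request examples never include it. `SemanticCacheConfig` is a hypothetical name.

```python
from dataclasses import dataclass

VALID_MODALITIES = ("text", "image", "audio")


@dataclass(frozen=True)
class SemanticCacheConfig:
    """Hypothetical client-side mirror of the documented parameters."""
    threshold: float = 0.92  # cosine similarity required for a hit
    ttl: int = 3600          # time-to-live in seconds
    modality: str = "text"
    model: str = "auto"      # "auto" = per-modality default model

    def __post_init__(self):
        # Assumed sanity checks, not documented server-side limits.
        if not -1.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be a cosine similarity in [-1, 1]")
        if self.ttl <= 0:
            raise ValueError("ttl must be a positive number of seconds")
        if self.modality not in VALID_MODALITIES:
            raise ValueError(f"modality must be one of {VALID_MODALITIES}")

    def to_body(self) -> dict:
        """Request-body fields; model is sent only when overridden (assumption)."""
        body = {"threshold": self.threshold, "ttl": self.ttl, "modality": self.modality}
        if self.model != "auto":
            body["model"] = self.model
        return body
```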