Semantic Caching

cachly's semantic cache layer uses vector similarity search to return cached responses for semantically similar queries – even when the exact wording differs.

How It Works

  1. Embed

    Your query is embedded into a high-dimensional vector using the configured model.

  2. Search

    A similarity search finds the most relevant cached embeddings using approximate nearest-neighbor lookup.

  3. Match

    If similarity exceeds your threshold (default 0.92), the cached response is returned in <1 ms.

  4. Miss

    On a miss, the request passes through to your origin. The response is cached for next time.
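
The four steps can be sketched in a few lines of Python. This is a minimal in-memory illustration, not cachly's implementation: `SemanticCacheSketch`, `get_or_compute`, and the injected `embed` and `origin` callables are hypothetical names, and a brute-force scan stands in for the approximate nearest-neighbor index. Whether a score exactly equal to the threshold counts as a hit is also an assumption.

```python
import math
import time
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCacheSketch:
    """Hypothetical in-memory illustration of embed / search / match / miss."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.92) -> None:
        self.embed = embed          # assumed: any function mapping text to a vector
        self.threshold = threshold
        # Each entry is (embedding, cached response, expiry timestamp).
        self.entries: List[Tuple[Vector, str, float]] = []

    def get_or_compute(self, query: str, origin: Callable[[str], str], ttl: float = 3600) -> str:
        now = time.time()

        # 1. Embed the incoming query.
        query_vec = self.embed(query)

        # 2. Search: linear scan for the most similar unexpired entry.
        best_score = float("-inf")
        best_response: Optional[str] = None
        for vec, response, expires_at in self.entries:
            if expires_at <= now:
                continue  # expired entries never match
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response

        # 3. Match: return the cached response if it is similar enough.
        if best_response is not None and best_score >= self.threshold:
            return best_response

        # 4. Miss: call the origin and cache its response for next time.
        response = origin(query)
        self.entries.append((query_vec, response, now + ttl))
        return response
```

With any embedding function that places paraphrases close together, a second call with a reworded query returns the stored response without invoking `origin` again.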

API Example

POST /api/v1/cache/semantic
Authorization: Bearer <API_KEY>
Content-Type: application/json

{
  "query": "What is the capital of France?",
  "threshold": 0.92,
  "ttl": 3600
}

A semantically equivalent query such as "France's capital city?" hits the cache, provided its similarity to the stored query exceeds the threshold.
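
For callers not using an SDK, the request above could be assembled with Python's standard library roughly as follows. The helper name `build_semantic_request` and the `base_url` argument are assumptions for illustration; this page does not state the API host or the shape of the response body.

```python
import json
import urllib.request


def build_semantic_request(base_url, api_key, query, threshold=0.92, ttl=3600):
    """Build (but do not send) the POST request shown in the API example.

    base_url is a placeholder: substitute the host of your cachly deployment.
    """
    body = json.dumps({"query": query, "threshold": threshold, "ttl": ttl}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/cache/semantic",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it. Keeping construction separate from sending makes the payload easy to inspect or unit-test without network access.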

Multimodal Support

Vector Caching v2 extends semantic caching to images, audio, and mixed media. Each modality gets its own embedding model and vector space.

Text

Natural language queries and completions

Images

CLIP-based embeddings for visual similarity

Audio

Whisper-based transcription + embedding

POST /api/v1/cache/semantic
Authorization: Bearer <API_KEY>
Content-Type: application/json

{
  "modality": "image",
  "data": "<base64-encoded-image>",
  "threshold": 0.88,
  "ttl": 7200
}
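
The `data` field in the request above carries the image as base64 text. A minimal sketch of building that payload from raw bytes is shown here. `build_image_payload` is a hypothetical helper, and the use of the standard (not URL-safe) base64 alphabet is an assumption, since the page does not say which variant is expected.

```python
import base64


def build_image_payload(image_bytes: bytes, threshold: float = 0.88, ttl: int = 7200) -> dict:
    """Return the JSON-serializable body for an image lookup."""
    return {
        "modality": "image",
        # Standard base64 alphabet, decoded to str so json.dumps accepts it.
        "data": base64.b64encode(image_bytes).decode("ascii"),
        "threshold": threshold,
        "ttl": ttl,
    }
```

In practice `image_bytes` would come from reading a file in binary mode, for example `open(path, "rb").read()`.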

SDK Snippets

Python

from cachly import Cachly

client = Cachly(api_key="ck_...")

question = "What is the capital of France?"

# Semantic cache lookup
hit = client.semantic.get(question)
if hit:
    print("Cache hit:", hit.value)
else:
    # Cache miss: compute the answer and store it for next time.
    # call_llm is a placeholder for your own model call, not part of the SDK.
    answer = call_llm(question)
    client.semantic.set(question, answer, ttl=3600)

TypeScript

import { Cachly } from "@cachly/sdk";

const client = new Cachly({ apiKey: "ck_..." });

const hit = await client.semantic.get("What is the capital of France?");
if (hit) {
  console.log("Cache hit:", hit.value);
}

Configuration

Parameter   Default   Description
threshold   0.92      Cosine similarity threshold for a cache hit
ttl         3600      Time-to-live in seconds
modality    text      Embedding modality: text, image, audio
model       auto      Embedding model override (per-modality defaults)
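
One way to keep these defaults in a single place on the client side is a small validated settings object. The sketch below is illustrative only: `SemanticCacheConfig` is a hypothetical name, not an SDK type, and the validation rules (threshold within the cosine range of -1 to 1, positive TTL) are assumptions rather than documented server behavior.

```python
from dataclasses import dataclass

MODALITIES = ("text", "image", "audio")


@dataclass(frozen=True)
class SemanticCacheConfig:
    """Client-side mirror of the parameters in the table, with their defaults."""

    threshold: float = 0.92   # cosine similarity required for a hit
    ttl: int = 3600           # time-to-live in seconds
    modality: str = "text"    # one of MODALITIES
    model: str = "auto"       # embedding model override

    def __post_init__(self) -> None:
        # Cosine similarity is bounded by -1 and 1, so other thresholds cannot match.
        if not -1.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between -1 and 1")
        # Assumed rule: a non-positive TTL is treated as a configuration error.
        if self.ttl <= 0:
            raise ValueError("ttl must be a positive number of seconds")
        if self.modality not in MODALITIES:
            raise ValueError(f"modality must be one of {MODALITIES}")
```

For instance, `SemanticCacheConfig(modality="image", threshold=0.88, ttl=7200)` reproduces the settings used in the image request earlier on this page.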