Recall Benchmark — quality, measured

cachly's claim is structural: a memory that ranks by lesson quality (outcome, confidence, proven-ness, human review) surfaces the lesson that actually helps before a text-similar failed attempt. A flat-file memory — an LLM reading its own /memories directory — has none of those signals.

This benchmark proves the lift on fixed, labeled corpora, is fully reproducible from the open-source MCP package, and is defended by a CI regression gate on every commit.

Methodology

Three rankers are compared head-to-head over the same candidate set, so the comparison isolates ranking quality, not retrieval. Gold answers are the lessons that actually solved a problem; text-similar failed attempts act as distractors. Metrics are standard IR: Precision@1, Precision@3, Recall@3, MRR, nDCG@5.

| Ranker | What it models |
|---|---|
| flatfile | Naive term-overlap, no IDF, no quality signal; models an LLM reading its own /memories directory |
| baseline | Raw BM25+ keyword ranking; a solid classical retrieval baseline |
| cachly | BM25+ over a wider top-25 candidate pool, then quality-aware rerank (outcome, confidence, proven-ness, human review) |

Two corpora are used: a home fixture corpus (17 lessons · 13 queries) and a third-party-labeled external corpus. Both are fixed and versioned in the repo, and runs are fully in-memory and deterministic.
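For readers who want to check the numbers by hand, the metrics named above can be computed per query roughly as follows. This is a minimal sketch assuming binary relevance (a lesson either is a gold answer or is not) and no duplicate ids in a ranking. The function and type names are illustrative and are not taken from the package's src/bench/ code.

```typescript
// Illustrative types: a ranking is a list of lesson ids, best first.
type Ranking = string[];
type Gold = Set<string>; // ids of lessons that actually solved the problem

// Fraction of the top-k slots filled by gold lessons.
function precisionAtK(ranked: Ranking, gold: Gold, k: number): number {
  const hits = ranked.slice(0, k).filter((id) => gold.has(id)).length;
  return hits / k;
}

// Fraction of all gold lessons that appear in the top k.
function recallAtK(ranked: Ranking, gold: Gold, k: number): number {
  if (gold.size === 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => gold.has(id)).length;
  return hits / gold.size;
}

// 1 / rank of the first gold lesson, or 0 if none was returned.
function reciprocalRank(ranked: Ranking, gold: Gold): number {
  const idx = ranked.findIndex((id) => gold.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Binary-gain nDCG: discounted gain divided by the best achievable gain.
function ndcgAtK(ranked: Ranking, gold: Gold, k: number): number {
  let dcg = 0;
  ranked.slice(0, k).forEach((id, i) => {
    if (gold.has(id)) dcg += 1 / Math.log2(i + 2);
  });
  let idcg = 0;
  for (let i = 0; i < Math.min(gold.size, k); i++) {
    idcg += 1 / Math.log2(i + 2);
  }
  return idcg === 0 ? 0 : dcg / idcg;
}

// Corpus-level figures (including MRR) are the mean of the per-query values.
const mean = (xs: number[]): number =>
  xs.reduce((a, b) => a + b, 0) / xs.length;

// Hypothetical query: a failed attempt outranks the lesson that solved it.
const exampleRanking: Ranking = ["failed-attempt", "proven-fix", "unrelated"];
const exampleGold: Gold = new Set(["proven-fix"]);
```

In this made-up example, Precision@1 is 0 and the reciprocal rank is 0.5, while Recall@3 is still 1. That is the pattern the benchmark targets: the right lesson is retrieved, but a distractor sits above it.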

Results

Measured 2026-06-07 · reproducible via npm run bench

Home fixture corpus (17 lessons · 13 queries)

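| Metric | flatfile | baseline (BM25+) | cachly | vs flatfile |
|---|---|---|---|---|
| Precision@1 | 76.9% | 69.2% | 92.3% | +20.0% |
| Recall@3 | 100.0% | 96.2% | 100.0% | +0.0% |
| MRR | 87.2% | 84.6% | 96.2% | +10.3% |
| nDCG@5 | 89.9% | 88.8% | 97.6% | +8.7% |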

vs. raw BM25 baseline: +33.3% Precision@1 · +13.6% MRR · +9.9% nDCG@5
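The "vs" figures read as relative lifts (cachly divided by the reference, minus one) rather than percentage-point differences. The sketch below checks that reading for Precision@1. With 13 queries, the three Precision@1 values correspond to 10, 9, and 12 correct top results. Those hit counts are an inference from the rounded percentages, not numbers stated on this page.

```typescript
// Relative lift of a candidate score over a reference score, in percent.
const relativeLift = (candidate: number, reference: number): number =>
  (candidate / reference - 1) * 100;

// Inferred Precision@1 hit counts out of 13 queries (assumption, see above).
const QUERIES = 13;
const p1 = {
  flatfile: 10 / QUERIES, // 76.9%
  baseline: 9 / QUERIES,  // 69.2%
  cachly: 12 / QUERIES,   // 92.3%
};

console.log(relativeLift(p1.cachly, p1.flatfile).toFixed(1)); // "20.0"
console.log(relativeLift(p1.cachly, p1.baseline).toFixed(1)); // "33.3"
```

Recomputing from the rounded table values instead of hit counts can land a tenth of a point off, which is ordinary rounding noise.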

External labeled corpus

| Metric | cachly | CI gate floor |
|---|---|---|
| Precision@1 | 80.4% | 78.0% |
| Recall@3 | 98.2% | 96.0% |
| MRR | 89.0% | 87.0% |
| nDCG@5 | 91.8% | 90.0% |

Where the lift comes from

Three mechanisms, each load-bearing — remove any one and the numbers drop:

  1. Document-side cross-lingual expansion is disabled

    Indexing a document with all ~28 multilingual synonyms of every common word (error → エラー, 错误, خطأ, …) made BM25 count them as exact matches and inflated any document containing a common word ~28×, burying topic-specific lessons. Queries still expand — a Japanese query still retrieves English lessons.

  2. Wider candidate pool (top-25)

    The reranker can only rescue a relevant lesson it can see; BM25 vocabulary mismatch sometimes ranks the right lesson 11–25. A 25-deep pool lets quality pull it back into the top 3–5.

  3. Score compression before quality

    A distractor whose what_failed field contains the exact query phrase can score 5–6× higher than the correct success lesson in raw BM25, too much for a ±40% quality multiplier to overcome. Compressing with score^0.3 shrinks that ratio to roughly 1.6–1.7× (5^0.3 ≈ 1.62, 6^0.3 ≈ 1.71), and compressed × (0.4 + 0.6 × qualityBoost) then lets a proven success win.
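The second and third mechanisms can be sketched together. The block below is not the package's implementation. It assumes candidates already carry a non-negative raw BM25+ score and a qualityBoost normalised to [0, 1]. How qualityBoost is derived from outcome, confidence, proven-ness, and human review is not described here, so it is treated as a given input. All field names and the sample scores are hypothetical.

```typescript
interface Candidate {
  id: string;
  bm25: number;         // raw BM25+ score, assumed non-negative
  qualityBoost: number; // assumed to be normalised to [0, 1]
}

const POOL_SIZE = 25;     // the "top-25" candidate pool from the text
const COMPRESS_EXP = 0.3; // the score^0.3 compression from the text

// Compress the lexical score, then scale it by the quality multiplier.
function finalScore(c: Candidate): number {
  return Math.pow(c.bm25, COMPRESS_EXP) * (0.4 + 0.6 * c.qualityBoost);
}

function rerank(candidates: Candidate[], topK = 3): Candidate[] {
  // Step 1: keep the 25 best lexical matches so a relevant lesson that
  // BM25 ranked 11th to 25th is still visible to the reranker.
  const pool = [...candidates]
    .sort((a, b) => b.bm25 - a.bm25)
    .slice(0, POOL_SIZE);
  // Step 2: reorder the pool by the compressed, quality-weighted score.
  return pool.sort((a, b) => finalScore(b) - finalScore(a)).slice(0, topK);
}

// Hypothetical pair: the distractor scores 5x higher in raw BM25.
const distractor: Candidate = { id: "failed-attempt", bm25: 12, qualityBoost: 0.1 };
const success: Candidate = { id: "proven-fix", bm25: 2.4, qualityBoost: 1.0 };

console.log(rerank([distractor, success]).map((c) => c.id));
```

With these sample values the raw ratio of 5 compresses to about 1.6, and the quality multiplier (0.46 for the distractor, 1.0 for the success lesson) is then enough to put "proven-fix" first. Without compression, the same multipliers would leave the distractor ahead: 12 × 0.46 is still larger than 2.4 × 1.0.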

Token cost — the other half

Recall quality is one half of the value; token cost is the other. On the same corpus we count the context input tokens a targeted recall sends per call versus the "paste everything" approach (a flat-file / CLAUDE.md dump re-sent into the prompt each call). Tokens are counted with gpt-tokenizer (cl100k_base).

| Approach | Context tokens / call |
|---|---|
| Paste everything (flat-file / dump) | 939 |
| cachly targeted recall (top-3) | ~57 |
| Context-token reduction | ~94% |

The reduction grows with the size of your knowledge base — re-sending everything scales linearly, targeted recall does not. This is retrieval efficiency only: an honest input-token number, not a blanket "X% lower bill". Output tokens, multi-turn dynamics, and semantic-cache hit rate are out of scope here.
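The reduction figure is plain arithmetic on the two measured token counts. The sketch below reproduces it and adds one hypothetical scenario to illustrate the scaling claim: a knowledge base ten times larger, with the targeted top-3 payload assumed to stay at about 57 tokens. That second scenario is an illustration, not a measurement.

```typescript
// Percentage of context tokens saved by targeted recall versus a full dump.
function reductionPercent(fullDumpTokens: number, targetedTokens: number): number {
  return (1 - targetedTokens / fullDumpTokens) * 100;
}

// Measured values from the table above.
console.log(reductionPercent(939, 57).toFixed(1)); // "93.9"

// Hypothetical: a 10x larger dump, targeted payload assumed unchanged.
console.log(reductionPercent(9390, 57).toFixed(1)); // "99.4"
```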

Reproduce it yourself

The harness, both corpora, and the gate ship inside the open-source MCP package (@cachly-dev/mcp-server, src/bench/) — no network, no real Redis, deterministic:

npm run bench            # head-to-head on the built-in fixture corpus
npm run bench:external   # third-party-labeled external corpus
npm run bench:gate       # CI gate: home + external, fails on regression
npm run bench:cost       # context-token cost vs paste-everything

CI regression gate: every commit runs both corpora and asserts that cachly's metrics stay at or above committed floors. Floors move only deliberately, with a bench run in the PR, so a ranking change cannot silently degrade recall.
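Conceptually, a floor-based gate of this kind is a small comparison loop. The sketch below is an assumption about its shape, not the code behind npm run bench:gate. The function name and the record layout are hypothetical, and only the external-corpus floors are shown because those are the ones published above.

```typescript
type Metrics = Record<string, number>;

// Returns one message per metric that is missing or below its floor.
function checkGate(measured: Metrics, floors: Metrics): string[] {
  const failures: string[] = [];
  for (const name of Object.keys(floors)) {
    const value: number | undefined = measured[name];
    const floor = floors[name];
    if (value === undefined) {
      failures.push(`${name}: missing (floor ${floor})`);
    } else if (value < floor) {
      failures.push(`${name}: ${value} is below floor ${floor}`);
    }
  }
  return failures;
}

// External-corpus values and floors from the results table (percent).
const externalFloors: Metrics = { "Precision@1": 78.0, "Recall@3": 96.0, MRR: 87.0, "nDCG@5": 90.0 };
const externalMeasured: Metrics = { "Precision@1": 80.4, "Recall@3": 98.2, MRR: 89.0, "nDCG@5": 91.8 };

const failures = checkGate(externalMeasured, externalFloors);
if (failures.length > 0) {
  // An uncaught error makes the process exit non-zero, which fails the CI job.
  throw new Error("Recall regression:\n" + failures.join("\n"));
}
```

Keeping the floors in a committed file means any change to them shows up in the diff, which fits the rule that floors move only deliberately.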

Questions about the methodology? Read how cachly's memory works or open an issue on the repo.