Recall Benchmark — quality, measured

cachly's claim is structural: a memory that ranks by lesson quality (outcome, confidence, proven-ness, human review) surfaces the lesson that actually helps before a text-similar failed attempt. A flat-file memory — an LLM reading its own /memories directory — has none of those signals.

This benchmark proves the lift on fixed, labeled corpora, is fully reproducible from the open-source MCP package, and is defended by a CI regression gate on every commit.

Methodology

Three rankers are compared head-to-head over the same candidate set, so the comparison isolates ranking quality, not retrieval. Gold answers are the lessons that actually solved a problem; text-similar failed attempts act as distractors. Metrics are standard IR: Precision@1, Precision@3, Recall@3, MRR, nDCG@5.
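For readers who want the metric definitions spelled out, here is a minimal sketch using binary relevance (a lesson is either gold or not). The type and function names are illustrative assumptions, not the package's actual API, and the harness may differ in details such as tie handling.

```typescript
// Illustrative metric definitions with binary relevance; not the package's code.
type RankedQuery = {
  ranked: string[];  // lesson IDs returned by a ranker, best first, no duplicates
  gold: Set<string>; // IDs of the lessons that actually solved the problem
};

function hitsInTopK(q: RankedQuery, k: number): number {
  return q.ranked.slice(0, k).filter((id) => q.gold.has(id)).length;
}

// Fraction of the top k results that are gold.
function precisionAtK(q: RankedQuery, k: number): number {
  return hitsInTopK(q, k) / k;
}

// Fraction of all gold lessons that appear in the top k.
function recallAtK(q: RankedQuery, k: number): number {
  return q.gold.size === 0 ? 0 : hitsInTopK(q, k) / q.gold.size;
}

// 1 / rank of the first gold lesson, or 0 if none was returned.
function reciprocalRank(q: RankedQuery): number {
  const idx = q.ranked.findIndex((id) => q.gold.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Discounted cumulative gain over the top k, normalized by the ideal ordering.
function ndcgAtK(q: RankedQuery, k: number): number {
  let dcg = 0;
  q.ranked.slice(0, k).forEach((id, i) => {
    if (q.gold.has(id)) dcg += 1 / Math.log2(i + 2);
  });
  let idcg = 0;
  for (let i = 0; i < Math.min(k, q.gold.size); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}

// Averages a per-query metric; MRR is the mean of reciprocalRank.
function meanOverQueries(queries: RankedQuery[], metric: (q: RankedQuery) => number): number {
  return queries.reduce((sum, q) => sum + metric(q), 0) / queries.length;
}

// A failed attempt ranked first, followed by the two lessons that worked.
const exampleQuery: RankedQuery = {
  ranked: ["failed-attempt", "fix-a", "fix-b"],
  gold: new Set(["fix-a", "fix-b"]),
};
console.log(precisionAtK(exampleQuery, 1), recallAtK(exampleQuery, 3), reciprocalRank(exampleQuery));
```

In this example the distractor in first place costs the query its Precision@1 and halves its reciprocal rank, even though Recall@3 is perfect. That is the gap the ranking comparison is designed to expose.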

Ranker	What it models
flatfile	Naive term-overlap, no IDF, no quality signal — models an LLM reading its own /memories directory
baseline	Raw BM25+ keyword ranking — a solid classical retrieval baseline
cachly	BM25+ over a wider top-25 candidate pool, then quality-aware rerank (outcome, confidence, proven-ness, human review)
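To make the flatfile row concrete, the sketch below ranks by counting distinct query terms that appear in each lesson, with no IDF and no quality signal. It is an assumption about what "naive term-overlap" means here, not the benchmark's actual flatfile ranker, and the lesson texts are invented for illustration.

```typescript
// Hypothetical lesson shape and texts, for illustration only.
type FlatLesson = { id: string; text: string };

function tokenize(s: string): string[] {
  return s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Score = number of distinct query terms present in the lesson text.
function termOverlapRank(query: string, lessons: FlatLesson[]): FlatLesson[] {
  const queryTerms = Array.from(new Set(tokenize(query)));
  const score = (lesson: FlatLesson): number => {
    const terms = new Set(tokenize(lesson.text));
    return queryTerms.filter((t) => terms.has(t)).length;
  };
  return lessons.slice().sort((a, b) => score(b) - score(a));
}

const flatLessons: FlatLesson[] = [
  { id: "fix-tls", text: "Fixed by enabling TLS on the redis connection pool" },
  { id: "failed-attempt", text: "Redis connection timeout: raised the timeout, did not help" },
];

const flatRanking = termOverlapRank("redis connection timeout", flatLessons);
console.log(flatRanking.map((l) => l.id));
```

The failed attempt repeats the query wording, so it matches all three query terms while the working fix matches only two. A ranker with no outcome signal has no way to tell the two apart.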

Two corpora: a home fixture corpus (17 lessons · 13 queries) and a third-party-labeled external corpus. Both are fixed and versioned in the repo; runs are fully in-memory and deterministic.

Results

Measured 2026-06-07 · reproducible via npm run bench

Home fixture corpus (17 lessons · 13 queries)

Metric	flatfile	baseline (BM25+)	cachly	cachly vs flatfile (relative)
Precision@1	76.9%	69.2%	92.3%	+20.0%
Recall@3	100.0%	96.2%	100.0%	+0.0%
MRR	87.2%	84.6%	96.2%	+10.3%
nDCG@5	89.9%	88.8%	97.6%	+8.7%

cachly vs. raw BM25+ baseline (relative change): +33.3% Precision@1 · +13.6% MRR · +9.9% nDCG@5
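The lift figures are relative changes, not percentage-point differences. The sketch below shows the arithmetic for Precision@1, under the assumption that the reported values correspond to 12, 10, and 9 correct top results out of 13 queries. That assumption is consistent with the rounded table values but is inferred, not stated by the harness.

```typescript
// Relative change of a candidate metric over a reference metric.
function relativeLift(candidate: number, reference: number): number {
  return (candidate - reference) / reference;
}

const asPercent = (x: number): string => (x * 100).toFixed(1) + "%";

// Assumed hit counts out of 13 queries (inferred from 92.3%, 76.9%, 69.2%).
const cachlyP1 = 12 / 13;
const flatfileP1 = 10 / 13;
const bm25P1 = 9 / 13;

console.log("vs flatfile:", asPercent(relativeLift(cachlyP1, flatfileP1)));
console.log("vs BM25+:   ", asPercent(relativeLift(cachlyP1, bm25P1)));
console.log("vs flatfile, in points:", asPercent(cachlyP1 - flatfileP1));
```

Measured in percentage points, the same Precision@1 gap over flatfile is about 15.4 points, which is why it matters which of the two is being quoted.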

External labeled corpus

Metric	cachly	CI gate floor
Precision@1	80.4%	78.0%
Recall@3	98.2%	96.0%
MRR	89.0%	87.0%
nDCG@5	91.8%	90.0%

Where the lift comes from

Three mechanisms, each load-bearing — remove any one and the numbers drop:

1. Document-side cross-lingual expansion is disabled
Indexing a document with all ~28 multilingual synonyms of every common word (error → エラー, 错误, خطأ, …) made BM25 count them as exact matches and inflated the score of any document containing a common word ~28×, burying topic-specific lessons. Queries still expand, so a Japanese query still retrieves English lessons.
2. Wider candidate pool (top-25)
The reranker can only rescue a relevant lesson it can see, and BM25 vocabulary mismatch sometimes leaves the right lesson at rank 11–25. A 25-deep pool lets quality pull it back into the top 3–5.
3. Score compression before quality
A distractor whose what_failed contains the exact query phrase can score 5–6× higher than the correct success lesson in raw BM25, too much for a ±40% quality multiplier to overcome. Compressing with score^0.3 shrinks the ratio to roughly 1.6–1.7×, and the final score compressed × (0.4 + 0.6 × qualityBoost) then lets a proven success win.
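The pool and compression steps can be sketched as follows. This is a minimal illustration, not the package's reranker: the Candidate shape, the rerank function, and the example scores are assumptions, and qualityBoost is assumed to be a value in [0, 1] already derived from outcome, confidence, proven-ness, and human review.

```typescript
// Hypothetical candidate shape; qualityBoost is assumed to lie in [0, 1].
type Candidate = { id: string; bm25: number; qualityBoost: number };

// Keep the top `poolSize` candidates by raw BM25, then reorder them by
// compressed score times the quality multiplier from the text above.
function rerank(candidates: Candidate[], poolSize = 25, exponent = 0.3): Candidate[] {
  const pool = candidates
    .slice()
    .sort((a, b) => b.bm25 - a.bm25)
    .slice(0, poolSize);
  const finalScore = (c: Candidate): number =>
    Math.pow(c.bm25, exponent) * (0.4 + 0.6 * c.qualityBoost);
  return pool.sort((a, b) => finalScore(b) - finalScore(a));
}

// Invented scores: the distractor's raw BM25 is about 5.5x the success lesson's.
const rerankInput: Candidate[] = [
  { id: "distractor", bm25: 12.0, qualityBoost: 0.1 },
  { id: "success", bm25: 2.2, qualityBoost: 0.95 },
];

const withCompression = rerank(rerankInput);           // exponent 0.3
const withoutCompression = rerank(rerankInput, 25, 1); // raw scores, no compression
console.log(withCompression[0].id, withoutCompression[0].id);
```

With an exponent of 1 the raw score gap dominates and the distractor stays on top. With 0.3 the gap is small enough for the quality multiplier to decide the order.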

Token cost — the other half

Recall quality is one half of the value; token cost is the other. On the same corpus we count the context input tokens a targeted recall sends per call versus the "paste everything" approach (a flat-file / CLAUDE.md dump re-sent into the prompt each call). Counted with gpt-tokenizer (cl100k_base).

Approach	Context tokens / call
Paste everything (flat-file / dump)	939
cachly targeted recall (top-3)	~57
Context-token reduction	~94%

The reduction grows with the size of your knowledge base — re-sending everything scales linearly, targeted recall does not. This is retrieval efficiency only: an honest input-token number, not a blanket "X% lower bill". Output tokens, multi-turn dynamics, and semantic-cache hit rate are out of scope here.
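The reduction figure is plain arithmetic on the two token counts in the table. The sketch below reproduces it and adds one hypothetical scenario, a knowledge base ten times larger with the recall size unchanged, which is an illustrative assumption and not a measured result.

```typescript
// Fraction of context tokens saved by targeted recall relative to pasting everything.
function contextTokenReduction(pasteEverythingTokens: number, targetedRecallTokens: number): number {
  if (pasteEverythingTokens <= 0) throw new Error("pasteEverythingTokens must be positive");
  return 1 - targetedRecallTokens / pasteEverythingTokens;
}

// Measured values from the table above.
const measuredReduction = contextTokenReduction(939, 57);

// Hypothetical: the dump grows 10x while a top-3 recall stays about the same size.
const hypotheticalReduction = contextTokenReduction(939 * 10, 57);

console.log((measuredReduction * 100).toFixed(1) + "%");
console.log((hypotheticalReduction * 100).toFixed(1) + "%");
```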

Reproduce it yourself

The harness, both corpora, and the gate ship inside the open-source MCP package (@cachly-dev/mcp-server, src/bench/) — no network, no real Redis, deterministic:

npm run bench            # head-to-head on the built-in fixture corpus
npm run bench:external   # third-party-labeled external corpus
npm run bench:gate       # CI gate: home + external, fails on regression
npm run bench:cost       # context-token cost vs paste-everything

CI regression gate: every commit runs both corpora and asserts that cachly's metrics stay at or above committed floors. Floors move only deliberately, with a bench run in the PR, so a ranking change can never silently degrade recall.
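A floor check of this kind can be sketched in a few lines. The numbers are the external-corpus results and floors from the table above; the Metrics type and gateFailures function are hypothetical names, and the real gate in src/bench/ may be structured differently.

```typescript
// Hypothetical metric record; values are fractions in [0, 1].
type Metrics = { precisionAt1: number; recallAt3: number; mrr: number; ndcgAt5: number };

// External-corpus floors and measured values, taken from the results table.
const externalFloors: Metrics = { precisionAt1: 0.78, recallAt3: 0.96, mrr: 0.87, ndcgAt5: 0.9 };
const externalMeasured: Metrics = { precisionAt1: 0.804, recallAt3: 0.982, mrr: 0.89, ndcgAt5: 0.918 };

// Returns one message per metric that fell below its floor; empty means the gate passes.
function gateFailures(measured: Metrics, floors: Metrics): string[] {
  const keys = Object.keys(floors) as (keyof Metrics)[];
  return keys
    .filter((k) => measured[k] < floors[k])
    .map((k) => `${k}: ${measured[k]} is below floor ${floors[k]}`);
}

const gateResult = gateFailures(externalMeasured, externalFloors);
console.log(gateResult.length === 0 ? "gate passed" : gateResult.join("\n"));
```

In a CI job, a non-empty failure list would translate into a non-zero exit code so the commit is blocked.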

Questions about the methodology? Read how cachly's memory works or open an issue on the repo.