Recall Benchmark — quality, measured
cachly's claim is structural: a memory that ranks by lesson quality (outcome, confidence, proven-ness, human review) surfaces the lesson that actually helps before a text-similar failed attempt. A flat-file memory — an LLM reading its own /memories directory — has none of those signals.
This benchmark proves the lift on fixed, labeled corpora, is fully reproducible from the open-source MCP package, and is defended by a CI regression gate on every commit.
Methodology
Three rankers are compared head-to-head over the same candidate set, so the comparison isolates ranking quality, not retrieval. Gold answers are the lessons that actually solved a problem; text-similar failed attempts act as distractors. Metrics are standard IR: Precision@1, Precision@3, Recall@3, MRR, nDCG@5.
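For readers who want the metric definitions spelled out, the following is a minimal per-query sketch, assuming binary relevance (a lesson is either gold or not). The function names are illustrative and are not taken from the cachly harness, which may define edge cases differently.

```typescript
// `ranked` is a ranker's output order (lesson ids); `gold` is the set of
// lessons labeled as actually solving the problem. Binary relevance assumed.

function precisionAtK(ranked: string[], gold: Set<string>, k: number): number {
  const hits = ranked.slice(0, k).filter((id) => gold.has(id)).length;
  return hits / k;
}

function recallAtK(ranked: string[], gold: Set<string>, k: number): number {
  if (gold.size === 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => gold.has(id)).length;
  return hits / gold.size;
}

// Reciprocal rank of the first gold lesson; MRR is the mean over all queries.
function reciprocalRank(ranked: string[], gold: Set<string>): number {
  const idx = ranked.findIndex((id) => gold.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// nDCG@k with binary gains: DCG divided by the DCG of an ideal ordering.
function ndcgAtK(ranked: string[], gold: Set<string>, k: number): number {
  let dcg = 0;
  ranked.slice(0, k).forEach((id, i) => {
    if (gold.has(id)) dcg += 1 / Math.log2(i + 2);
  });
  let ideal = 0;
  for (let i = 0; i < Math.min(k, gold.size); i++) ideal += 1 / Math.log2(i + 2);
  return ideal === 0 ? 0 : dcg / ideal;
}

// Hypothetical query: the gold lesson sits at rank 2, behind a distractor.
const ranked = ["failed-attempt", "proven-fix", "unrelated"];
const gold = new Set(["proven-fix"]);

console.log(precisionAtK(ranked, gold, 1)); // 0
console.log(recallAtK(ranked, gold, 3)); // 1
console.log(reciprocalRank(ranked, gold)); // 0.5
```

In this example Recall@3 is perfect while Precision@1 is zero, which is why the benchmark reports rank-sensitive metrics alongside recall.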
| Ranker | What it models |
|---|---|
| flatfile | Naive term-overlap, no IDF, no quality signal — models an LLM reading its own /memories directory |
| baseline | Raw BM25+ keyword ranking — a solid classical retrieval baseline |
| cachly | BM25+ over a wider top-25 candidate pool, then quality-aware rerank (outcome, confidence, proven-ness, human review) |
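To make the flatfile row concrete, here is a rough sketch of what a naive term-overlap ranker could look like. The `Lesson` shape, the tokenizer, and the sample lessons are all assumptions made for illustration; the actual harness may differ.

```typescript
// Hypothetical lesson shape; field names are illustrative only.
interface Lesson {
  id: string;
  text: string;
}

function tokenize(s: string): string[] {
  return s.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Score = number of distinct query terms present in the lesson text.
// No IDF weighting and no quality signal, as in the flatfile row above.
function overlapScore(queryText: string, lesson: Lesson): number {
  const lessonTerms = new Set(tokenize(lesson.text));
  const queryTerms = Array.from(new Set(tokenize(queryText)));
  return queryTerms.filter((t) => lessonTerms.has(t)).length;
}

function rankFlatfile(queryText: string, lessons: Lesson[]): Lesson[] {
  return lessons
    .slice()
    .sort((a, b) => overlapScore(queryText, b) - overlapScore(queryText, a));
}

// Hypothetical data: the failed attempt repeats the query wording verbatim.
const demoLessons: Lesson[] = [
  { id: "proven-fix", text: "Fixed redis pool exhaustion by closing idle connections" },
  { id: "failed-attempt", text: "Tried raising the redis connection timeout; it did not help" },
];
const demoQuery = "redis connection timeout";

console.log(rankFlatfile(demoQuery, demoLessons).map((l) => l.id));
// [ 'failed-attempt', 'proven-fix' ]
```

The text-similar failed attempt wins on overlap alone, which is the failure mode the distractors in the corpus are designed to expose.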
Two corpora: a home fixture corpus (17 lessons · 13 queries) and a third-party-labeled external corpus. Both are fixed and versioned in the repo, and runs are fully in-memory and deterministic.
Results
Measured 2026-06-07 · reproducible via npm run bench
Home fixture corpus (17 lessons · 13 queries)
| Metric | flatfile | baseline (BM25+) | cachly | cachly vs flatfile (relative) |
|---|---|---|---|---|
| Precision@1 | 76.9% | 69.2% | 92.3% | +20.0% |
| Recall@3 | 100.0% | 96.2% | 100.0% | +0.0% |
| MRR | 87.2% | 84.6% | 96.2% | +10.3% |
| nDCG@5 | 89.9% | 88.8% | 97.6% | +8.7% |
cachly vs. raw BM25 baseline (relative lift): +33.3% Precision@1 · +13.6% MRR · +9.9% nDCG@5
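The lift figures are relative, not percentage-point differences. The sketch below reproduces the Precision@1 lifts under one assumption that is inferred from the table rather than stated by the source: that with 13 queries the reported values correspond to 12, 10, and 9 correct top hits for cachly, flatfile, and baseline respectively.

```typescript
// Relative lift of `ours` over `other`, in percent.
function relativeLiftPct(ours: number, other: number): number {
  return (ours / other - 1) * 100;
}

// Assumption: hit counts inferred from the rounded percentages in the table.
const queryCount = 13;
const p1 = {
  cachly: 12 / queryCount,
  flatfile: 10 / queryCount,
  baseline: 9 / queryCount,
};

console.log((p1.cachly * 100).toFixed(1)); // "92.3"
console.log((p1.flatfile * 100).toFixed(1)); // "76.9"
console.log((p1.baseline * 100).toFixed(1)); // "69.2"
console.log(relativeLiftPct(p1.cachly, p1.flatfile).toFixed(1)); // "20.0"
console.log(relativeLiftPct(p1.cachly, p1.baseline).toFixed(1)); // "33.3"
```

On a 13-query corpus each query moves Precision@1 by about 7.7 points, so the home-corpus lifts are best read together with the larger external corpus.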
External labeled corpus
| Metric | cachly | CI gate floor |
|---|---|---|
| Precision@1 | 80.4% | 78.0% |
| Recall@3 | 98.2% | 96.0% |
| MRR | 89.0% | 87.0% |
| nDCG@5 | 91.8% | 90.0% |
Where the lift comes from
Three mechanisms, each load-bearing — remove any one and the numbers drop:
1. Document-side cross-lingual expansion is disabled
   Indexing a document with all ~28 multilingual synonyms of every common word (error → エラー, 错误, خطأ, …) made BM25 count them as exact matches and inflated the score of any document containing a common word by ~28×, burying topic-specific lessons. Queries are still expanded, so a Japanese query still retrieves English lessons.
2. Wider candidate pool (top-25)
   The reranker can only rescue a relevant lesson it can see, and BM25 vocabulary mismatch sometimes places the right lesson at rank 11–25. A 25-deep pool lets the quality signals pull it back into the top 3–5.
3. Score compression before quality
   A distractor whose what_failed contains the exact query phrase can score 5–6× higher than the correct success lesson in raw BM25, which is too much for a ±40% quality multiplier to overcome. Compressing with score^0.3 shrinks the ratio to roughly 1.6–1.7× (5^0.3 ≈ 1.62, 6^0.3 ≈ 1.71); compressed × (0.4 + 0.6 × qualityBoost) then lets a proven success win.
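The pool and compression steps above can be sketched as follows. The exponent 0.3, the pool size of 25, and the formula compressed × (0.4 + 0.6 × qualityBoost) come from the text; the `Candidate` shape, the assumption that `qualityBoost` lies in [0, 1], and the sample scores are hypothetical.

```typescript
// Hypothetical candidate shape; qualityBoost is assumed to lie in [0, 1].
interface Candidate {
  id: string;
  bm25: number; // raw BM25+ score, assumed positive
  qualityBoost: number;
}

const POOL_SIZE = 25; // from the text: rerank the top-25 BM25 candidates
const EXPONENT = 0.3; // from the text: score^0.3 compression

function qualityMultiplier(c: Candidate): number {
  return 0.4 + 0.6 * c.qualityBoost;
}

function finalScore(c: Candidate): number {
  return Math.pow(c.bm25, EXPONENT) * qualityMultiplier(c);
}

function rerank(candidates: Candidate[]): Candidate[] {
  // Take the pool by raw BM25 first, then reorder it by the compressed score.
  const pool = candidates
    .slice()
    .sort((a, b) => b.bm25 - a.bm25)
    .slice(0, POOL_SIZE);
  return pool.sort((a, b) => finalScore(b) - finalScore(a));
}

// Hypothetical scores: the distractor leads by about 5.5x in raw BM25.
const demoCandidates: Candidate[] = [
  { id: "failed-attempt", bm25: 12, qualityBoost: 0.2 },
  { id: "proven-fix", bm25: 2.2, qualityBoost: 1.0 },
];

// Without compression the quality multiplier cannot close the gap.
const uncompressed = demoCandidates
  .slice()
  .sort((a, b) => b.bm25 * qualityMultiplier(b) - a.bm25 * qualityMultiplier(a));

console.log(uncompressed.map((c) => c.id)); // [ 'failed-attempt', 'proven-fix' ]
console.log(rerank(demoCandidates).map((c) => c.id)); // [ 'proven-fix', 'failed-attempt' ]
```

With these sample values the same quality multiplier loses on raw scores and wins after compression, which is the effect the third mechanism describes.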
Token cost — the other half
Recall quality is one half of the value; token cost is the other. On the same corpus we count the context input tokens a targeted recall sends per call versus the "paste everything" approach (a flat-file / CLAUDE.md dump re-sent into the prompt each call). Counted with gpt-tokenizer (cl100k_base).
| Approach | Context tokens / call |
|---|---|
| Paste everything (flat-file / dump) | 939 |
| cachly targeted recall (top-3) | ~57 |
| Context-token reduction | ~94% |
The reduction grows with the size of your knowledge base — re-sending everything scales linearly, targeted recall does not. This is retrieval efficiency only: an honest input-token number, not a blanket "X% lower bill". Output tokens, multi-turn dynamics, and semantic-cache hit rate are out of scope here.
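The reduction figure follows directly from the two token counts in the table. The sketch below recomputes it and then extrapolates to a knowledge base ten times larger; that second number is a hypothetical illustration of the scaling argument, not a measurement, and it assumes the top-3 recall payload stays at about 57 tokens.

```typescript
// Percentage of context tokens saved by targeted recall versus pasting everything.
function reductionPct(pasteAllTokens: number, targetedTokens: number): number {
  return (1 - targetedTokens / pasteAllTokens) * 100;
}

// Measured values from the table above.
console.log(reductionPct(939, 57).toFixed(1)); // "93.9"

// Hypothetical: a 10x larger knowledge base, with the paste-everything cost
// scaling linearly and the targeted payload assumed constant.
console.log(reductionPct(939 * 10, 57).toFixed(1)); // "99.4"
```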
Reproduce it yourself
The harness, both corpora, and the gate ship inside the open-source MCP package (@cachly-dev/mcp-server, src/bench/) — no network, no real Redis, deterministic:
    npm run bench            # head-to-head on the built-in fixture corpus
    npm run bench:external   # third-party-labeled external corpus
    npm run bench:gate       # CI gate: home + external, fails on regression
    npm run bench:cost       # context-token cost vs paste-everything
CI regression gate: every commit runs both corpora and asserts that cachly's metrics stay at or above committed floors. Floors move only deliberately, with a bench run in the PR, so a ranking change cannot silently degrade recall.
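A floor check of this kind can be sketched in a few lines. The floors and measured values below are copied from the external-corpus table; the `gateFailures` helper and the data layout are assumptions for illustration, not the code behind `npm run bench:gate`.

```typescript
type Metrics = Record<string, number>;

// Values in percent, copied from the external-corpus table.
const floors: Metrics = { "Precision@1": 78.0, "Recall@3": 96.0, MRR: 87.0, "nDCG@5": 90.0 };
const measured: Metrics = { "Precision@1": 80.4, "Recall@3": 98.2, MRR: 89.0, "nDCG@5": 91.8 };

// Returns the metrics that are missing or below their floor.
// An empty list means the gate passes.
function gateFailures(current: Metrics, committedFloors: Metrics): string[] {
  return Object.keys(committedFloors).filter((metric) => {
    const value = current[metric];
    const floor = committedFloors[metric];
    return value === undefined || floor === undefined || value < floor;
  });
}

const failures = gateFailures(measured, floors);
if (failures.length > 0) {
  // A real gate would exit non-zero here so the CI job fails.
  console.error(`regression: ${failures.join(", ")}`);
} else {
  console.log("gate passed");
}
```

Comparing against committed floors rather than the previous run keeps the check deterministic: a floor changes only when someone edits it in a PR.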
Questions about the methodology? Read how cachly's memory works or open an issue on the repo.