Hybrid Retrieval for Persistent Developer Memory

How combining HyDE, session-aware embeddings, and multi-category extraction beats single-technique baselines

Published February 21, 2026

Results at a glance

382 domain questions · 6 knowledge categories · two metrics: Token F1 (deterministic lexical overlap) and LLM Judge (Gemini 2.5 Flash, YES/NO semantic correctness × 100). Ceiling = full-context upper bound with no retrieval step.

  • Token F1: 61.7
  • LLM Judge: 50.8
  • 91% of F1 ceiling
  • 363% of Judge ceiling
  • Not-found rate: 8.4%

The problem: retrieval is harder than it looks

Developers accumulate knowledge in structured notes — decisions, solutions, patterns, bugs, insights, procedures. Retrieving the right entry when asked a natural-language question is a deceptively hard retrieval problem:

  • Vocabulary gap. A question like “how do we handle async?” shares few tokens with the stored answer “We decided to use tokio for all async operations in CLI tools.”
  • Exact-term failure. Dense embeddings are good at semantic similarity but silently miss queries that hinge on a specific function name, version number, or error code.
  • Category pollution. A single conversation contains decisions, workarounds, and code patterns. Routing everything to one bucket causes retrieval to compete across unrelated topics.

This report shows how three techniques address these failure modes — and documents the measured contribution of each.


Benchmark setup

  • Dataset — 386 QA pairs hand-authored across 6 categories
  • Retriever — engram: embeds knowledge entries offline, retrieves at query time
  • Answering model — Claude claude-haiku-4-5 (same for all versions)
  • Metrics — Token F1 (precision/recall over tokens) · LLM Judge (0 or 1 per question, × 100)
  • Ceiling — full knowledge base injected as context; the upper bound for a perfect retriever

Why two metrics? Token F1 is deterministic and rewards exact overlap — good for factual recall but blind to paraphrases. The LLM judge catches semantically correct answers that use different wording. Together they bracket retrieval quality from two sides.

Why the ceiling matters. The ceiling is not achievable in practice (it requires unlimited context), but it tells us how much room retrieval leaves on the table.
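The Token F1 metric is the standard SQuAD-style token-overlap score. A minimal sketch, assuming simple whitespace tokenisation (the report does not specify its exact tokeniser):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall over the two answers."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection: each shared token counts at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

This makes the "blind to paraphrases" point concrete: a prediction of "HTTP 404 error" against gold "404" scores only 0.5, even though the judge would accept it.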


Landscape comparison

The persistent memory space has several active systems. Benchmarks differ across projects, so direct score comparison is not always possible; the table below covers architecture, retrieval design, and the best publicly reported numbers for each system.

| System | Storage model | Retrieval technique | Knowledge structure | Best public result |
|---|---|---|---|---|
| engram (v19) | Flat markdown files per category | HyDE → cosine; BM25 exact-term fallback (fires only on 0 cosine results) | 6 typed categories (decisions, solutions, patterns, bugs, insights, procedures) | Token F1 61.7 · Judge 50.8% — this report |
| Mem0 | Three-tier: vector store + entity graph + key-value | Multi-factor ranking across all three tiers; personalized reranking | Flat fact extraction, entity-linked in graph | LoCoMo: 66.9% LLM-judge (+26% vs OpenAI memory) |
| MemGPT / Letta | Two-tier: in-context working memory + out-of-context external storage | Self-editing via tool calls; model explicitly loads/evicts memory pages | Conversation segments + summarised hierarchical context | LoCoMo: 74.0% (Letta Filesystem) |
| Zep | Temporal knowledge graph (Graphiti engine): episode → semantic → community subgraphs | Bi-directional graph traversal + vector search over subgraphs | Entity/relationship extraction with temporal edges | DMR: 94.8% · LongMemEval: +18.5% accuracy |
| Naive RAG | Vector store (flat) | Dense cosine only | No structure | varies |

Key architectural differences:

  • Category-aware extraction — engram routes each insight to a typed bucket at write time. At retrieval time, the search space is smaller and topic-coherent; searching across an unstructured flat store is a known source of retrieval noise. Mem0, Letta, and Zep all mix structured and unstructured retrieval but do not apply upfront category routing.
  • HyDE — most deployed systems embed the raw query. Generating a plausible hypothetical answer first and embedding that bridges the vocabulary gap between short questions and long stored answers (Gao et al., 2023). Mem0 and Zep embed raw queries against their graph/vector indices.
  • BM25 last-resort — dense retrieval silently drops exact-term queries (function names, version strings, error codes). engram fires a keyword-match fallback only when cosine returns zero results, recovering these cases without adding noise to queries where semantic search already found results.
  • Graph-based systems (Mem0, Zep) trade write-time overhead (entity extraction, graph construction) for richer relationship queries. This suits long-horizon personal-assistant workloads; engram targets developer knowledge bases where write speed and human-editable plaintext matter more.
Note on reproducibility: the 382-question evaluation dataset and all result files are available in eval/. Other systems can run the same questions against their retrieval stack and report Token F1 and LLM Judge for a direct comparison.


How each technique contributes

F1 and judge scores across all versions. Dashed lines = ceiling upper bounds.

Technique 1 — HyDE: bridging the vocabulary gap (+14.9 F1)

v2 → v3 is the largest single jump in the chart.

Dense embedding models embed short questions and long knowledge entries in the same vector space — but they land far apart because they look nothing alike at the token level. Hypothetical Document Embedding (HyDE) sidesteps this by generating a plausible hypothetical answer before searching:

Question: "what did we decide about async runtimes?"

Hypothetical answer (generated, not stored):
  "We decided to use tokio as the async runtime for CLI tools
   that require concurrent LLM calls, because it provides
   a mature ecosystem and integrates well with reqwest."

embed(hypothetical) → cosine search → top-k real entries → answer

The hypothetical answer lives in the same region of embedding space as real answers. The vocabulary gap disappears because we are now searching answer space → answer space instead of question space → answer space.

Impact: +14.9 F1 (v2=37.6 → v3=52.5). The gain is largest for insights and decisions — categories where phrasing between question and answer diverges the most.
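The pipeline above can be sketched in a few lines. This is a minimal illustration, not engram's implementation: `generate` (an LLM call) and `embed` (an embedding model) are assumed callables, and vectors are assumed non-zero.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def hyde_search(question, generate, embed, entries, top_k=3):
    """HyDE: embed a generated hypothetical answer instead of the raw question,
    so the query vector lands in answer space rather than question space."""
    hypothetical = generate(f"Write a plausible answer to: {question}")
    qvec = embed(hypothetical)
    scored = [(cosine(qvec, embed(entry)), entry) for entry in entries]
    return [e for _, e in sorted(scored, key=lambda t: t[0], reverse=True)[:top_k]]
```

In practice the entry embeddings would be precomputed offline; they are recomputed per query here only to keep the sketch self-contained.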

Technique 2 — Multi-category extraction: eliminating category pollution

The same commit that introduced HyDE also changed how knowledge is extracted. Previously each conversation was routed to a single category bucket. Now the extraction LLM assigns each insight to whichever of the six categories it belongs to — decisions, solutions, patterns, bugs, insights, procedures — simultaneously.

Why it matters for retrieval: with per-conversation buckets, a conversation about a bug fix and a design decision would compete in the same pool. With category-level routing, the retriever can search the correct pool first, reducing irrelevant candidates.
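The write-time routing and category-first search can be sketched as follows. Everything here is hypothetical scaffolding: `classify` stands in for the extraction LLM, `search` for the retriever, and the fallback-to-all-pools behaviour is an assumption, not documented engram behaviour.

```python
CATEGORIES = ["decisions", "solutions", "patterns", "bugs", "insights", "procedures"]

def route(text, classify):
    """Assign text to one typed bucket at write time via an LLM classifier."""
    category = classify(text, CATEGORIES)
    return category if category in CATEGORIES else "insights"  # assumed default

def retrieve(question, classify, search, buckets, top_k=3):
    """Search the predicted category's pool first, so candidates from
    unrelated topics never compete; widen only if that pool comes up empty."""
    hits = search(question, buckets.get(route(question, classify), []), top_k)
    if not hits:
        everything = [e for pool in buckets.values() for e in pool]
        hits = search(question, everything, top_k)
    return hits
```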

Technique 3 — BM25 hybrid with RRF: rescuing exact-term queries (+3.3 F1)

Dense retrieval is good at semantic similarity. It struggles when the query is exact: version numbers (v0.3.1), error codes (ECONNREFUSED), function names (parse_session_block). BM25 is purely lexical — it excels at exactly the queries dense search misses.

Reciprocal Rank Fusion merges both ranked lists without requiring score calibration:

\[\text{score}(d) = \frac{1}{60 + r_{\text{dense}}(d)} + \frac{1}{60 + r_{\text{BM25}}(d)}\]

The constant 60 dampens rank sensitivity, making the fusion robust to small positional differences at the top. Standard BM25 parameters: \(k_1 = 1.5\), \(b = 0.75\).
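The fusion formula translates directly into code. A minimal sketch over two ranked document lists (1-based ranks; a document absent from one list simply contributes nothing from it):

```python
def rrf_fuse(dense_ranked, bm25_ranked, k=60):
    """Reciprocal Rank Fusion: merge two ranked lists without score calibration.
    score(d) = 1/(k + rank_dense(d)) + 1/(k + rank_bm25(d))."""
    scores = {}
    for ranked in (dense_ranked, bm25_ranked):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note that only ranks enter the formula, never raw scores, which is why cosine similarities and BM25 scores can be fused without normalisation.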

F1 change at each version. Most gains come from v2→v3 (HyDE) and v5→v6 (BM25 hybrid).

v5 → v6 also bundled two dataset fixes that are worth separating from the retrieval improvement:

  • Procedures QA regenerated. The original 51 questions referenced procedures by ordinal position (“the first procedure”) — retrieval is a lottery when questions carry no distinctive terms. Regenerated with named references, making the dataset tractable.
  • ADD artifact fixed. The extraction LLM occasionally prefixed entries with resolver keywords (ADD, NOOP), polluting knowledge files. Root cause fixed; the prefix no longer appears.

Category-level analysis

Token F1 by category for four reference points: baseline, v3-full (HyDE only), v6b (full stack), ceiling.

Key observations:

  • Procedures gained the most (+28 F1, baseline → v6b). Almost entirely from the QA dataset fix — once questions were answerable, retrieval performed well.
  • Decisions and solutions are closest to ceiling. Well-structured factual entries are largely solved at the current corpus size.
  • Bugs has high judge (45%) despite moderate F1 (52%) — the retriever finds the right session but the exact wording of the answer differs (e.g. model says “HTTP 404 error” vs gold “404”). Token F1 penalises this; the judge accepts it.
  • Insights is the hardest category: both F1 (51%) and judge (48%) are moderate. Conceptual/interpretive entries use diffuse vocabulary that embeds poorly relative to specific factual queries.

Not-found rate: three fixes that mattered

Fraction of questions where the retriever returned no answer across versions.

Three distinct improvements drove the not-found rate from 31.2% → 8.4%:

  1. v1→v6b (HyDE + BM25 hybrid + threshold tuning): 31.2% → 13.2%. BM25 recovers exact-term queries (function names, error codes, version strings) that dense embeddings miss.

  2. v6b→v11 (cosine threshold fix + session-aware chunking + inbox drain): 13.2% → 9.1%.

    • Cosine threshold bug: smart_search was calling hybrid_search() (RRF scores, max ~0.033) but filtering with threshold=0.15 (calibrated for cosine 0–1 scale) → semantic search silently returned 0 results on every query. Fixed by calling store.search() (raw cosine) directly.
    • Session chunking: character-based chunking bundled 3–5 sessions per embedding (session_id=None). Fixed to parse session blocks individually, each with its own embedding.
    • Inbox drain: 132 sessions were stuck in inbox.md (never promoted to searchable category files). Running engram drain promoted them.
  3. v11→v19 (improved HyDE + keyword lexical fallback): 9.1% → 8.4%. Marginal but stable improvement. The remaining 32 not-found questions (8.4%) appear to be genuine knowledge gaps — questions about events never recorded in the knowledge base.
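The cosine-threshold bug in fix 2 is easy to reproduce arithmetically: with the standard RRF constant, even a document ranked first in both lists scores below the cosine-calibrated cutoff, so the filter discarded every fused result.

```python
K = 60            # standard RRF constant
THRESHOLD = 0.15  # calibrated for cosine similarity on a 0-1 scale

# Best possible RRF score: ranked first in both the dense and BM25 lists.
max_rrf = 1 / (K + 1) + 1 / (K + 1)  # ~0.033

# Every fused score falls below the cosine-scale threshold,
# so semantic search silently returned 0 results on every query.
assert max_rrf < THRESHOLD
```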


Full results table

| Version | Token F1 | LLM Judge | % F1 ceil | % J ceil | Not-found | N |
|---|---|---|---|---|---|---|
| Baseline | 34.7 | 9.6 | 51% | 69% | — | 52 |
| v1 | 38.9 | 8.8 | 57% | 63% | — | 385 |
| v2 | 38.8 | 9.4 | 57% | 67% | — | 385 |
| v3 | 53.7 | 12.2 | 79% | 87% | — | 90 |
| v3-full | 56.6 | 15.3 | 83% | 109% | — | 385 |
| v4 | 57.2 | 14.5 | 84% | 104% | 18.4% | 385 |
| v5 | 56.6 | 13.0 | 83% | 93% | 18.2% | 385 |
| v6 | 59.9 | 16.1 | 88% | 115% | 13.2% | 386 |
| v6b | 60.5 | 16.3 | 89% | 116% | 13.2% | 386 |
| v11 | 62.7 | 18.1 | 92% | 129% | 9.1% | 386 |
| v19 | 61.7 | 50.8 | 91% | 363% | 8.4% | 382 |
| Ceiling | 67.9 | 14.0 | — | — | — | 385 |

Judge calibration: fixing a silent measurement bug

LLM-as-judge score jumped from ~18% to 50.8% between v11 and v19 — without any change to the retrieval system. The cause was a broken judge prompt.

The original prompt asked Gemini to reply “YES or NO only” but embedded 6 rules of explanation first. Gemini would reason through the rules, then conclude — but start the response with “After considering…” rather than the verdict. Since the parser checked response.startswith("YES"), most correct answers were scored as NO.

# Old prompt (broken): Gemini would respond "After considering the rules: YES"
JUDGE_PROMPT = """Is this answer correct or semantically equivalent...
Rules:
- YES if prediction is semantically equivalent...
- YES if prediction uses synonyms...
Reply YES or NO only."""

# Fixed prompt: forces verdict first
JUDGE_PROMPT = """...
Start your response with YES or NO (nothing before it).
YES = semantically equivalent, paraphrase, synonym, or gold is contained in prediction.
NO = contradicts, or prediction is vague while gold is specific (e.g. gold=404, pred="an error").

YES or NO:"""

What this means for historical scores: All judge scores before v19 are underestimates. The true semantic accuracy at v6b was likely ~40–45%, not 16.3%. Token F1 was never affected (deterministic). The ranking order of versions is unchanged — the relative improvements are real, just the absolute judge numbers were deflated.


Remaining gaps and next steps

Gap between v19 and ceiling per category. Negative = still below ceiling.

Bugs and patterns have the largest gaps (~11 pts each). Both categories have high surface-form variation — the semantic content is often correct but exact token overlap is low.


Future directions

Near-term: closing the bugs/patterns gap

Cross-encoder re-ranking. After retrieving top-k candidates with the current HyDE + BM25 hybrid pipeline, a lightweight cross-encoder re-scores each candidate with full pairwise attention (query + entry seen together). Unlike bi-encoder cosine similarity, cross-encoders can catch semantic relevance that surface form misses — directly addressing the bugs/patterns gap.

Query expansion. Generate 2–3 query variants from the original question, retrieve independently, merge with RRF. Helps short ambiguous bug queries where a single embedding lands in the wrong region of the vector space.
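A sketch of the expansion-plus-RRF idea, with `expand` (an LLM paraphraser) and `search` (any ranked retriever) as assumed callables; the variant count and fusion constant are illustrative, not tuned values:

```python
def expanded_search(question, expand, search, top_k=3, k=60):
    """Query expansion: retrieve with the original question plus generated
    variants, then fuse the ranked lists with Reciprocal Rank Fusion."""
    variants = [question] + expand(question, n=2)
    scores = {}
    for variant in variants:
        for rank, doc in enumerate(search(variant), start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```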

Longer-term: recursive retrieval

Current engram retrieval is one-shot: embed query → fetch top-k → answer. MIT’s Recursive Language Model (RLM) work (2024) demonstrates a fundamentally different approach where the model navigates the knowledge base iteratively:

1. Fetch category index (what does engram know about X?)
2. Model identifies which specific entries to fetch next
3. Fetch only those entries
4. Answer with minimal, relevant context

RLM(GPT-4o-mini) outperforms plain GPT-4o by 114% on long-document QA and maintains a 49% advantage at 263k-token documents — using a cheaper model. The gains come entirely from task decomposition and targeted reading, not from a larger context window.

Engram’s MCP layer already exposes the primitives for this: index returns a compact map (~100 tokens), recall(session_ids=[...]) fetches specific blocks. A recursive retrieval mode would wire these together into a multi-step answering loop rather than a single injection pass — likely the highest-leverage architectural improvement available.
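The loop could look something like the sketch below. The `index` and `recall` primitives are named in the text, but their Python signatures here, the `model` decision format, and the step budget are all assumptions for illustration:

```python
def recursive_answer(question, index, recall, model, max_steps=3):
    """RLM-style retrieval loop: start from the compact category index,
    let the model request specific session blocks, fetch only those,
    and stop as soon as the model can answer from the context gathered."""
    context = index()  # compact map of what the knowledge base contains
    for _ in range(max_steps):
        decision = model(question, context)  # {"fetch": [...]} or {"answer": "..."}
        if "answer" in decision:
            return decision["answer"]
        context += "\n" + recall(decision["fetch"])  # targeted read
    # Step budget exhausted: force a final answer from whatever was gathered.
    return model(question, context).get("answer", "not found")
```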

| Technique | Addresses | Effort | Expected gain |
|---|---|---|---|
| Cross-encoder re-ranking | bugs, patterns gap | Medium | +5–8 F1 |
| Query expansion | not-found rate | Low | +2–4 F1 |
| Recursive retrieval (RLM-style) | all categories at scale | High | unknown, high ceiling |

Metric notes

Why judge can exceed ceiling. The ceiling injects the entire knowledge base as context — thousands of tokens of mixed content. The model has to locate and extract the relevant fragment from noise. RAG retrieves a small, focused set; the model sees less noise and gives cleaner answers that the judge scores higher. This is a known property of RAG vs long-context injection and is not a measurement artefact.

Judge variance. Binary scoring (0/1 per question × 100) means each question contributes 100/N ≈ 0.26 pts. A swing of ±10 questions changes the score by ±2.6 pts. Two identical runs can differ by 2–3 pts. The CI regression gate is set at 2.5 pts to account for this; v6 and v6b were run twice to confirm reproducibility.

Token F1 limitations. F1 rewards token overlap but penalises paraphrases. A semantically correct answer with different wording scores lower than a partially wrong answer that reproduces the reference verbatim. This is why insights shows F1=50.3 but judge=25.8 — the answers are semantically right but worded differently from the reference.