Experiments

We evaluate the Memory Palace system across three dimensions: (1) retrieval accuracy compared to standard RAG systems, (2) hallucination prevention effectiveness, and (3) context efficiency at scale.

Datasets

BEIR Benchmark

For zero-shot retrieval evaluation, we use the BEIR benchmark [7], which includes:

  • Natural Questions: Google search queries with Wikipedia answers
  • HotpotQA: Multi-hop reasoning questions requiring evidence from multiple documents
  • MS MARCO: Real Bing search queries with human-annotated passages
  • PubMed (TREC-COVID): Biomedical literature retrieval with COVID-19 research queries

Dataset Statistics:

Dataset               Queries   Corpus Size   Task Type
Natural Questions     3,452     2.68M         QA
HotpotQA              7,405     5.23M         Multi-hop
MS MARCO              6,980     8.84M         Passage Ranking
TREC-COVID (PubMed)   50        171K          Bio-Medical

The PubMed/TREC-COVID dataset provides a challenging large-scale evaluation with scientific terminology and domain-specific retrieval requirements.
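For reference, the BEIR datasets above can be loaded with the beir Python package; the sketch below follows the package's documented GenericDataLoader usage and uses TREC-COVID only as an example.

```python
# Minimal sketch: loading a BEIR dataset (e.g., TREC-COVID) with the beir package.
# Follows the standard BEIR download URL and GenericDataLoader interface.
from beir import util
from beir.datasets.data_loader import GenericDataLoader

dataset = "trec-covid"  # also: "nq", "hotpotqa", "msmarco"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")

# corpus: doc_id -> {"title", "text"}; queries: query_id -> text;
# qrels: query_id -> {doc_id: relevance}
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
print(len(corpus), "documents,", len(queries), "queries")
```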

RAGBench

For end-to-end retrieval-augmented generation evaluation, we use RAGBench [2], a comprehensive benchmark of 100,000 examples spanning five industry domains. RAGBench provides the TRACe evaluation framework measuring:

  • Utilization: How much of the retrieved context is used
  • Relevance: Whether retrieved documents match the query
  • Adherence: Whether the response stays faithful to context
  • Completeness: Whether all relevant information is included
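RAGBench ships its own annotations and scoring for these quantities; the sketch below is only a loose token-overlap approximation of utilization and adherence, included to make the definitions concrete rather than to reproduce the TRACe implementation.

```python
# Rough, illustrative approximations of two TRACe-style quantities.
# These are NOT RAGBench's official metrics; they only make the definitions concrete.

def _tokens(text: str) -> set[str]:
    return set(text.lower().split())

def utilization(context: str, response: str) -> float:
    """Fraction of retrieved-context tokens that reappear in the response."""
    ctx, resp = _tokens(context), _tokens(response)
    return len(ctx & resp) / max(len(ctx), 1)

def adherence(context: str, response: str) -> float:
    """Fraction of response tokens that are supported by the context."""
    ctx, resp = _tokens(context), _tokens(response)
    return len(resp & ctx) / max(len(resp), 1)

print(adherence("caching reduces latency", "caching reduces latency and cost"))
```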

Custom System Design Corpus

We constructed a domain-specific corpus of 93 memories covering system design concepts:

Memory Palace Corpus Distribution
Domain                Memories   Avg SMASHIN Score   Percentage
Fundamentals          8          9.2                 8.6%
Scalability           10         8.7                 10.8%
Data Storage          8          10.1                8.6%
Distributed Systems   12         9.5                 12.9%
Patterns              6          8.3                 6.5%
Reliability           13         9.8                 14.0%
Cloud                 19         8.9                 20.4%
Security              17         9.1                 18.3%

Baselines

We compare against the following state-of-the-art retrieval systems:

Dense Retrieval Systems

  1. Flat RAG: Standard vector similarity search over embedded chunks using cosine similarity.

  2. Contriever [4]: Self-supervised dense retriever trained on unlabeled data with contrastive learning.

  3. ColBERT [5]: Late interaction model computing fine-grained relevance between query and document tokens.
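The Flat RAG baseline can be stated in a few lines; the following is a minimal sketch assuming pre-computed chunk embeddings held in NumPy arrays.

```python
# Minimal sketch of the Flat RAG baseline: cosine similarity over chunk embeddings.
# Assumes embeddings are precomputed (e.g., with an embedding model) as NumPy arrays.
import numpy as np

def top_k_cosine(query_emb: np.ndarray, chunk_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k chunks most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per chunk
    return np.argsort(-scores)[:k]      # highest scores first
```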

Hierarchical and Graph-Based RAG

  1. RAPTOR [6]: Hierarchical RAG using recursive abstractive processing for tree-organized retrieval.

  2. GraphRAG [1]: Knowledge graph-augmented retrieval for multi-hop reasoning.

  3. HyDE [3]: Hypothetical document embeddings for improved query-document matching.

Evaluation Metrics

Retrieval Metrics

  • Recall@k: Proportion of queries where the correct memory appears in top-k results
  • MRR: Mean Reciprocal Rank of the first correct result
  • Context Size: Total characters loaded into LLM context
  • Retrieval Latency: Time from query to memory retrieval
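For concreteness, the two ranking metrics can be computed as follows; the input format (a ranked list of memory ids per query plus a set of relevant ids) is an assumption for illustration.

```python
# Sketch of Recall@k and MRR over ranked retrieval results.
# Input format is assumed: ranked memory ids per query and the set of relevant ids.

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant memory appears in the top-k results, else 0.0."""
    return float(any(doc_id in relevant for doc_id in ranked[:k]))

def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result (0.0 if none retrieved)."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Averaging recall_at_k over all queries gives Recall@k;
# averaging reciprocal_rank gives MRR.
```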

Hallucination Metrics

  • Faithfulness: Proportion of responses grounded in retrieved context
  • Token Verification Rate: Success rate of verification token checks
  • False Positive Rate: Rate of rejecting valid, grounded responses
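A minimal sketch of the bookkeeping behind these metrics is shown below; the per-query expected verification token follows the protocol in the next subsection, while the exact token format is left abstract and the result schema is an assumption.

```python
# Sketch of the scoring behind the verification-token metrics.
# The results schema is assumed: each entry records the generated response,
# the expected verification token, and whether the response is actually grounded.

def passes_verification(response: str, expected_token: str) -> bool:
    """A response passes if it reproduces the verification token of a retrieved memory."""
    return expected_token in response

def hallucination_rates(results: list[dict]) -> dict:
    passed = [passes_verification(r["response"], r["expected_token"]) for r in results]
    grounded = [r["is_grounded"] for r in results]
    verif_rate = sum(passed) / len(results)
    # False positives: grounded responses that the check nevertheless rejected.
    fp = sum(1 for p, g in zip(passed, grounded) if g and not p)
    fpr = fp / max(sum(grounded), 1)
    return {"token_verification_rate": verif_rate, "false_positive_rate": fpr}
```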

Experimental Setup

Hardware and Models

All experiments were conducted with the following setup:

  • Apple M2 Max with 32GB RAM (local inference)
  • Ollama with ministral-3:8b and nomic-embed-text
  • Google Gemini API (gemini-pro) for cloud comparison

Retrieval Experiment Protocol

For each query:

  1. Extract keywords and compute query embedding
  2. Retrieve top-k candidates using each method
  3. Generate response using retrieved context
  4. Verify response contains expected information and verification token
  5. Measure latency and context size
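This protocol maps onto a small driver loop. In the sketch below, embed, retrieve, and generate are hypothetical callables wrapping the local models; only the measurement points (latency, context size, verification) are meant to be faithful to the protocol.

```python
# Sketch of the per-query retrieval protocol; embed/retrieve/generate are
# hypothetical callables passed in by the caller.
import time

def run_query(query: str, expected: str, token: str,
              embed, retrieve, generate, k: int = 5) -> dict:
    t0 = time.perf_counter()
    q_emb = embed(query)                      # step 1: keywords + query embedding
    memories = retrieve(q_emb, k=k)           # step 2: top-k candidate memories
    latency = time.perf_counter() - t0        # step 5: retrieval latency

    context = "\n\n".join(m["text"] for m in memories)
    response = generate(query, context)       # step 3: response from retrieved context

    return {                                  # steps 4-5: verification and measurement
        "latency_s": latency,
        "context_chars": len(context),
        "correct": expected.lower() in response.lower(),
        "token_verified": token in response,
    }
```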

Scaling Protocol

We evaluate context efficiency across corpus sizes:

  1. Initialize memory corpus at sizes: 10, 50, 100, 200, 500, 1000 memories
  2. Execute 100 random queries per corpus size
  3. Measure context bytes loaded per query
  4. Compare flat retrieval vs hierarchical 2-hop retrieval
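A minimal sketch of this measurement loop follows; build_corpus, sample_queries, flat_context, and hier_context are hypothetical stand-ins for the actual system calls, and only the byte accounting is the point.

```python
# Sketch of the scaling protocol: context bytes per query as the corpus grows.
# The four callables are hypothetical stand-ins for the real system.
import statistics

def scaling_run(build_corpus, sample_queries, flat_context, hier_context,
                sizes=(10, 50, 100, 200, 500, 1000), n_queries=100):
    rows = []
    for size in sizes:
        corpus = build_corpus(size)                       # step 1: corpus of given size
        flat_b, hier_b = [], []
        for query in sample_queries(corpus, n_queries):   # step 2: 100 random queries
            flat_b.append(len(flat_context(corpus, query).encode("utf-8")))
            hier_b.append(len(hier_context(corpus, query, hops=2).encode("utf-8")))
        rows.append({                                     # steps 3-4: bytes per method
            "corpus_size": size,
            "flat_avg_bytes": statistics.mean(flat_b),
            "hierarchical_avg_bytes": statistics.mean(hier_b),
        })
    return rows
```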

Red Queen Pre-Learning Protocol

We evaluate the impact of adversarial pre-learning rounds on retrieval performance:

  1. Initialize 100 memories with varying SMASHIN scores (0, 6, 12)
  2. Run 0, 3, or 5 Red Queen adversarial rounds before learning
  3. Simulate 30 days of retrieval with spaced intervals
  4. Measure total retrievals needed, final retention, and Red Queen boosts applied

Each Red Queen round tests all memories against a harder threshold (base probability 0.5 vs 0.7 for normal retrieval), boosting weak memories that fail the adversarial test.
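A minimal sketch of one such round is shown below; the two base probabilities come from the protocol above, while the recall model and the boost magnitude are illustrative placeholders rather than the system's exact rules.

```python
# Sketch of one Red Queen adversarial round. The base success probabilities
# (0.5 adversarial vs 0.7 normal) come from the protocol; the recall model and
# boost amount are illustrative placeholders.
import random

ADVERSARIAL_BASE = 0.5   # harder threshold used in Red Queen rounds
NORMAL_BASE = 0.7        # threshold used for normal retrieval
RQ_BOOST = 1.0           # placeholder strength boost for memories that fail

def recall_probability(smashin_score: float, base: float) -> float:
    """Toy model: stronger (higher-SMASHIN) memories are easier to recall."""
    return min(1.0, base + 0.03 * smashin_score)

def red_queen_round(memories: list[dict]) -> int:
    """Test every memory against the adversarial threshold; boost the ones that fail."""
    boosts = 0
    for mem in memories:
        if random.random() > recall_probability(mem["smashin"], ADVERSARIAL_BASE):
            mem["smashin"] += RQ_BOOST   # weak memory: apply a Red Queen boost
            boosts += 1
    return boosts
```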

[1] Edge, D. et al. 2024. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130.
[2] Friel, R. et al. 2024. RAGBench: Explainable benchmark for retrieval-augmented generation systems. arXiv preprint arXiv:2407.11005.
[3] Gao, L. et al. 2022. Precise zero-shot dense retrieval without relevance labels. arXiv preprint arXiv:2212.10496.
[4] Izacard, G. et al. 2022. Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research.
[5] Khattab, O. and Zaharia, M. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 39–48.
[6] Sarthi, P. et al. 2024. RAPTOR: Recursive abstractive processing for tree-organized retrieval. arXiv preprint arXiv:2401.18059.
[7] Thakur, N. et al. 2021. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663.