Experiments
We evaluate the Memory Palace system across three dimensions: (1) retrieval accuracy compared to standard RAG systems, (2) hallucination prevention effectiveness, and (3) context efficiency at scale.
Datasets
BEIR Benchmark
For zero-shot retrieval evaluation, we use the BEIR benchmark [7], which includes:
- Natural Questions: Google search queries with Wikipedia answers
- HotpotQA: Multi-hop reasoning questions requiring evidence from multiple documents
- MS MARCO: Real Bing search queries with human-annotated passages
- PubMed (TREC-COVID): Biomedical literature retrieval with COVID-19 research queries
Dataset Statistics:
| Dataset | Queries | Corpus Size (documents) | Task Type |
|---|---|---|---|
| Natural Questions | 3,452 | 2.68M | QA |
| HotpotQA | 7,405 | 5.23M | Multi-hop |
| MS MARCO | 6,980 | 8.84M | Passage Ranking |
| TREC-COVID (PubMed) | 50 | 171K | Bio-Medical |
The PubMed-based TREC-COVID dataset provides a challenging domain-specific evaluation, with dense scientific terminology and specialized retrieval requirements.
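For reference, the BEIR datasets can be loaded with the official `beir` Python package; the following is a minimal sketch in which the dataset name and output directory are illustrative.

```python
# Minimal sketch: loading a BEIR dataset for zero-shot retrieval evaluation.
# Dataset names follow the BEIR repository; the output directory is illustrative.
from beir import util
from beir.datasets.data_loader import GenericDataLoader

def load_beir(dataset: str, out_dir: str = "datasets"):
    url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
    data_path = util.download_and_unzip(url, out_dir)
    # corpus: doc_id -> {"title", "text"}; queries: qid -> text; qrels: qid -> {doc_id: relevance}
    corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
    return corpus, queries, qrels

corpus, queries, qrels = load_beir("trec-covid")
```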
RAGBench
For end-to-end RAG evaluation, we use RAGBench [2], a comprehensive benchmark of 100,000 examples spanning five industry domains. RAGBench provides the TRACe evaluation framework, which measures:
- Utilization: How much of the retrieved context is used
- Relevance: Whether retrieved documents match the query
- Adherence: Whether the response stays faithful to context
- Completeness: Whether all relevant information is included
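The official TRACe scorers rely on RAGBench's span-level annotations; the sketch below is only an illustrative token-overlap proxy for the four dimensions, and all function names are our own.

```python
# Illustrative proxies for the TRACe dimensions (NOT the official RAGBench
# scorers, which use span-level annotations); token overlap is a stand-in.
def _tokens(text: str) -> set[str]:
    return set(text.lower().split())

def trace_proxies(query: str, retrieved: list[str], response: str) -> dict[str, float]:
    ctx = _tokens(" ".join(retrieved))
    resp, q = _tokens(response), _tokens(query)
    return {
        "utilization":  len(ctx & resp) / max(len(ctx), 1),   # context tokens that appear in the answer
        "relevance":    len(ctx & q) / max(len(q), 1),         # query terms covered by retrieved docs
        "adherence":    len(resp & ctx) / max(len(resp), 1),   # answer terms grounded in the context
        "completeness": len(q & resp) / max(len(q), 1),        # query terms addressed by the answer
    }
```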
Custom System Design Corpus
We constructed a domain-specific corpus of 93 memories covering system design concepts:
| Domain | Memories | Avg SMASHIN Score | Percentage |
|---|---|---|---|
| Fundamentals | 8 | 9.2 | 8.6% |
| Scalability | 10 | 8.7 | 10.8% |
| Data Storage | 8 | 10.1 | 8.6% |
| Distributed Systems | 12 | 9.5 | 12.9% |
| Patterns | 6 | 8.3 | 6.5% |
| Reliability | 13 | 9.8 | 14.0% |
| Cloud | 19 | 8.9 | 20.4% |
| Security | 17 | 9.1 | 18.3% |
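For concreteness, a minimal sketch of a memory record in this corpus follows; the field names are assumptions chosen to match the attributes reported in the table.

```python
# Illustrative memory record for the system design corpus; field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class Memory:
    memory_id: str
    domain: str                    # e.g. "Distributed Systems"
    content: str                   # the system design concept itself
    smashin_score: float           # richness-of-encoding score reported above
    verification_token: str = ""   # token checked during hallucination verification
    embedding: list[float] = field(default_factory=list)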
Baselines
We compare against state-of-the-art retrieval systems in two categories:
Dense Retrieval Systems
Hierarchical and Graph-Based RAG
Evaluation Metrics
Retrieval Metrics
- Recall@k: Proportion of queries where the correct memory appears in top-k results
- MRR: Mean Reciprocal Rank of the first correct result
- Context Size: Total characters loaded into LLM context
- Retrieval Latency: Time from query to memory retrieval
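Recall@k and MRR follow their standard definitions; a minimal sketch, where `ranked` maps each query to its ordered retrieval results and `gold` maps it to the correct memory:

```python
# Recall@k and MRR over a set of queries.
def recall_at_k(ranked: dict[str, list[str]], gold: dict[str, str], k: int) -> float:
    hits = sum(1 for qid, docs in ranked.items() if gold[qid] in docs[:k])
    return hits / max(len(ranked), 1)

def mrr(ranked: dict[str, list[str]], gold: dict[str, str]) -> float:
    total = 0.0
    for qid, docs in ranked.items():
        if gold[qid] in docs:
            total += 1.0 / (docs.index(gold[qid]) + 1)   # reciprocal rank of first correct result
    return total / max(len(ranked), 1)
```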
Hallucination Metrics
- Faithfulness: Proportion of responses grounded in retrieved context
- Token Verification Rate: Success rate of verification token checks
- False Positive Rate: Rate of rejecting valid, grounded responses
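A minimal sketch of how these three rates can be computed from per-response evaluation records; the record fields are illustrative assumptions.

```python
# Hallucination metrics over evaluated responses. Each record notes whether the
# response was grounded, whether its verification-token check passed, and
# whether the verifier rejected it.
def hallucination_metrics(records: list[dict]) -> dict[str, float]:
    n = max(len(records), 1)
    grounded = [r for r in records if r["grounded"]]
    return {
        "faithfulness": len(grounded) / n,
        "token_verification_rate": sum(r["token_check_passed"] for r in records) / n,
        # valid, grounded responses that the verifier nevertheless rejected
        "false_positive_rate": sum(r["rejected"] for r in grounded) / max(len(grounded), 1),
    }
```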
Experimental Setup
Hardware
All experiments were conducted on:
- Apple M2 Max with 32GB RAM (local inference)
- Ollama serving ministral-3:8b (generation) and nomic-embed-text (embeddings)
- Google Gemini API (gemini-pro) for cloud comparison
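Local inference is driven through Ollama; the sketch below shows the assumed embedding and generation calls via the `ollama` Python client. The wrapper names are ours, and the client calls are assumptions about that package; model tags follow the setup above.

```python
# Local inference via Ollama, mirroring the hardware/software setup above.
import ollama

def embed(text: str) -> list[float]:
    # nomic-embed-text produces the query/memory embeddings
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

def generate(prompt: str) -> str:
    # ministral-3:8b generates responses from the retrieved context
    resp = ollama.chat(model="ministral-3:8b", messages=[{"role": "user", "content": prompt}])
    return resp["message"]["content"]
```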
Retrieval Experiment Protocol
For each query:
- Extract keywords and compute query embedding
- Retrieve top-k candidates using each method
- Generate response using retrieved context
- Verify response contains expected information and verification token
- Measure latency and context size
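A sketch of the per-query loop under this protocol; `extract_keywords`, the `retriever` object, and the `embed`/`generate` helpers from the previous sketch are placeholders for the system under test.

```python
# Per-query evaluation loop following the protocol above.
import time

def run_query(query: str, expected: str, token: str, retriever, k: int = 5) -> dict:
    start = time.perf_counter()
    keywords = extract_keywords(query)                        # step 1: keywords + embedding
    q_emb = embed(query)
    memories = retriever.search(q_emb, keywords, top_k=k)     # step 2: top-k candidates
    latency = time.perf_counter() - start                     # latency up to memory retrieval
    context = "\n\n".join(m.content for m in memories)
    response = generate(f"Context:\n{context}\n\nQuestion: {query}")   # step 3
    return {
        "correct": expected.lower() in response.lower(),      # step 4: expected info present
        "token_check_passed": token in response,              #         verification token present
        "latency_s": latency,                                  # step 5: latency and context size
        "context_chars": len(context),
    }
```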
Scaling Protocol
We evaluate context efficiency across corpus sizes:
- Initialize memory corpus at sizes: 10, 50, 100, 200, 500, 1000 memories
- Execute 100 random queries per corpus size
- Measure context bytes loaded per query
- Compare flat retrieval vs hierarchical 2-hop retrieval
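A sketch of the scaling sweep, assuming hypothetical `build_corpus`, `flat_retrieve`, and `two_hop_retrieve` helpers:

```python
# Context-efficiency sweep: for each corpus size, run random queries and record
# bytes of context loaded by flat retrieval vs. hierarchical 2-hop retrieval.
import random

def scaling_sweep(sizes=(10, 50, 100, 200, 500, 1000), queries_per_size=100):
    results = []
    for n in sizes:
        corpus = build_corpus(n)
        for _ in range(queries_per_size):
            query = random.choice(corpus).content             # random query per protocol
            flat_ctx = "".join(m.content for m in flat_retrieve(corpus, query))
            hop_ctx = "".join(m.content for m in two_hop_retrieve(corpus, query))
            results.append({"size": n,
                            "flat_bytes": len(flat_ctx.encode("utf-8")),
                            "two_hop_bytes": len(hop_ctx.encode("utf-8"))})
    return results
```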
Red Queen Pre-Learning Protocol
We evaluate the impact of adversarial pre-learning rounds on retrieval performance:
- Initialize 100 memories with varying SMASHIN scores (0, 6, 12)
- Run 0, 3, or 5 Red Queen adversarial rounds before learning
- Simulate 30 days of retrieval with spaced intervals
- Measure: total retrievals needed, final retention, and Red Queen (RQ) boosts applied
Each Red Queen round tests every memory against a stricter threshold (a base recall probability of 0.5, versus 0.7 for normal retrieval), boosting weak memories that fail the adversarial test; a sketch of a single round follows.
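A minimal sketch of one Red Queen round under these assumptions; the recall model and the boosted attribute are illustrative, not the system's exact update rule.

```python
# One Red Queen adversarial round: every memory is tested against the stricter
# threshold and boosted if it fails. Memories are plain dicts with an assumed
# "strength" field standing in for the system's internal memory state.
import random

ADVERSARIAL_BASE_P = 0.5   # harder than the 0.7 base probability of normal retrieval

def recall_probability(memory: dict, base_p: float) -> float:
    # assumed recall model: stronger memories are more likely to pass the test
    return min(1.0, base_p + 0.05 * memory["strength"])

def red_queen_round(memories: list[dict], boost: float = 1.0) -> int:
    """Test every memory against the adversarial threshold; boost those that fail."""
    boosts_applied = 0
    for m in memories:
        if random.random() > recall_probability(m, ADVERSARIAL_BASE_P):
            m["strength"] += boost        # RQ boost for a memory that failed the harder test
            boosts_applied += 1
    return boosts_applied
```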