Experiments
We evaluate the Memory Palace system across three dimensions: (1) retrieval accuracy compared to standard RAG systems, (2) hallucination prevention effectiveness, and (3) context efficiency at scale.
Datasets
BEIR Benchmark
For zero-shot retrieval evaluation, we use the BEIR benchmark [7], which includes:
- Natural Questions: Google search queries with Wikipedia answers
- HotpotQA: Multi-hop reasoning questions requiring evidence from multiple documents
- MS MARCO: Real Bing search queries with human-annotated passages
- PubMed (TREC-COVID): Biomedical literature retrieval with COVID-19 research queries
Dataset Statistics:
| Dataset | Queries | Corpus Size (documents) | Task Type |
|---|---|---|---|
| Natural Questions | 3,452 | 2.68M | QA |
| HotpotQA | 7,405 | 5.23M | Multi-hop |
| MS MARCO | 6,980 | 8.84M | Passage Ranking |
| TREC-COVID (PubMed) | 50 | 171K | Bio-Medical |
The PubMed-based TREC-COVID dataset provides a challenging domain-specific evaluation, with dense scientific terminology and specialized retrieval requirements.
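For reference, the BEIR datasets can be loaded with the official `beir` Python package; the following is a minimal sketch in which the dataset name and output directory are illustrative.

```python
# Minimal sketch: loading a BEIR dataset for zero-shot retrieval evaluation.
# Dataset names follow the BEIR repository; the output directory is illustrative.
from beir import util
from beir.datasets.data_loader import GenericDataLoader

def load_beir(dataset: str, out_dir: str = "datasets"):
    url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
    data_path = util.download_and_unzip(url, out_dir)
    # corpus: doc_id -> {"title", "text"}; queries: qid -> text; qrels: qid -> {doc_id: relevance}
    corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
    return corpus, queries, qrels

corpus, queries, qrels = load_beir("trec-covid")
```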
RAGBench
For end-to-end RAG evaluation, we use RAGBench [2], a comprehensive benchmark of 100,000 examples spanning five industry domains. RAGBench provides the TRACe evaluation framework, which measures:
- Utilization: How much of the retrieved context is used
- Relevance: Whether retrieved documents match the query
- Adherence: Whether the response stays faithful to context
- Completeness: Whether all relevant information is included
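The official TRACe scorers rely on RAGBench's span-level annotations; the sketch below is only an illustrative token-overlap proxy for the four dimensions, and all function names are our own.

```python
# Illustrative proxies for the TRACe dimensions (NOT the official RAGBench
# scorers, which use span-level annotations); token overlap is a stand-in.
def _tokens(text: str) -> set[str]:
    return set(text.lower().split())

def trace_proxies(query: str, retrieved: list[str], response: str) -> dict[str, float]:
    ctx = _tokens(" ".join(retrieved))
    resp, q = _tokens(response), _tokens(query)
    return {
        "utilization":  len(ctx & resp) / max(len(ctx), 1),   # context tokens that appear in the answer
        "relevance":    len(ctx & q) / max(len(q), 1),         # query terms covered by retrieved docs
        "adherence":    len(resp & ctx) / max(len(resp), 1),   # answer terms grounded in the context
        "completeness": len(q & resp) / max(len(q), 1),        # query terms addressed by the answer
    }
```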
Custom System Design Corpus
We constructed a domain-specific corpus of 93 memories covering system design concepts:
| Domain | Memories | Avg SMASHIN Score | Percentage |
|---|---|---|---|
| Fundamentals | 8 | 9.2 | 8.6% |
| Scalability | 10 | 8.7 | 10.8% |
| Data Storage | 8 | 10.1 | 8.6% |
| Distributed Systems | 12 | 9.5 | 12.9% |
| Patterns | 6 | 8.3 | 6.5% |
| Reliability | 13 | 9.8 | 14.0% |
| Cloud | 19 | 8.9 | 20.4% |
| Security | 17 | 9.1 | 18.3% |
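For concreteness, a minimal sketch of a memory record in this corpus follows; the field names are assumptions chosen to match the attributes reported in the table.

```python
# Illustrative memory record for the system design corpus; field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class Memory:
    memory_id: str
    domain: str                    # e.g. "Distributed Systems"
    content: str                   # the system design concept itself
    smashin_score: float           # richness-of-encoding score reported above
    verification_token: str = ""   # token checked during hallucination verification
    embedding: list[float] = field(default_factory=list)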
Baselines
We compare against state-of-the-art retrieval systems in two categories:
Dense Retrieval Systems
Hierarchical and Graph-Based RAG
Evaluation Metrics
Retrieval Metrics
- Recall@k: Proportion of queries where the correct memory appears in top-k results
- MRR: Mean Reciprocal Rank of the first correct result
- Context Size: Total characters loaded into LLM context
- Retrieval Latency: Time from query to memory retrieval
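Recall@k and MRR follow their standard definitions; a minimal sketch, where `ranked` maps each query to its ordered retrieval results and `gold` maps it to the correct memory:

```python
# Recall@k and MRR over a set of queries.
def recall_at_k(ranked: dict[str, list[str]], gold: dict[str, str], k: int) -> float:
    hits = sum(1 for qid, docs in ranked.items() if gold[qid] in docs[:k])
    return hits / max(len(ranked), 1)

def mrr(ranked: dict[str, list[str]], gold: dict[str, str]) -> float:
    total = 0.0
    for qid, docs in ranked.items():
        if gold[qid] in docs:
            total += 1.0 / (docs.index(gold[qid]) + 1)   # reciprocal rank of first correct result
    return total / max(len(ranked), 1)
```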
Hallucination Metrics
- Faithfulness: Proportion of responses grounded in retrieved context
- Token Verification Rate: Success rate of verification token checks
- False Positive Rate: Rate of rejecting valid, grounded responses
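A minimal sketch of how these three rates can be computed from per-response evaluation records; the record fields are illustrative assumptions.

```python
# Hallucination metrics over evaluated responses. Each record notes whether the
# response was grounded, whether its verification-token check passed, and
# whether the verifier rejected it.
def hallucination_metrics(records: list[dict]) -> dict[str, float]:
    n = max(len(records), 1)
    grounded = [r for r in records if r["grounded"]]
    return {
        "faithfulness": len(grounded) / n,
        "token_verification_rate": sum(r["token_check_passed"] for r in records) / n,
        # valid, grounded responses that the verifier nevertheless rejected
        "false_positive_rate": sum(r["rejected"] for r in grounded) / max(len(grounded), 1),
    }
```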
Experimental Setup
Hardware
All experiments were conducted on:
- Apple M2 Max with 32GB RAM (local inference)
- Ollama serving ministral-3:8b (generation) and nomic-embed-text (embeddings)
- Google Gemini API (gemini-pro) for cloud comparison
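Local inference is driven through Ollama; the sketch below shows the assumed embedding and generation calls via the `ollama` Python client. The wrapper names are ours, and the client calls are assumptions about that package; model tags follow the setup above.

```python
# Local inference via Ollama, mirroring the hardware/software setup above.
import ollama

def embed(text: str) -> list[float]:
    # nomic-embed-text produces the query/memory embeddings
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

def generate(prompt: str) -> str:
    # ministral-3:8b generates responses from the retrieved context
    resp = ollama.chat(model="ministral-3:8b", messages=[{"role": "user", "content": prompt}])
    return resp["message"]["content"]
```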
Retrieval Experiment Protocol
For each query:
- Extract keywords and compute query embedding
- Retrieve top-k candidates using each method
- Generate response using retrieved context
- Verify response contains expected information and verification token
- Measure latency and context size
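A sketch of the per-query loop under this protocol; `extract_keywords`, the `retriever` object, and the `embed`/`generate` helpers from the previous sketch are placeholders for the system under test.

```python
# Per-query evaluation loop following the protocol above.
import time

def run_query(query: str, expected: str, token: str, retriever, k: int = 5) -> dict:
    start = time.perf_counter()
    keywords = extract_keywords(query)                        # step 1: keywords + embedding
    q_emb = embed(query)
    memories = retriever.search(q_emb, keywords, top_k=k)     # step 2: top-k candidates
    latency = time.perf_counter() - start                     # latency up to memory retrieval
    context = "\n\n".join(m.content for m in memories)
    response = generate(f"Context:\n{context}\n\nQuestion: {query}")   # step 3
    return {
        "correct": expected.lower() in response.lower(),      # step 4: expected info present
        "token_check_passed": token in response,              #         verification token present
        "latency_s": latency,                                  # step 5: latency and context size
        "context_chars": len(context),
    }
```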
Scaling Protocol
We evaluate context efficiency across corpus sizes:
- Initialize memory corpus at sizes: 10, 50, 100, 200, 500, 1000 memories
- Execute 100 random queries per corpus size
- Measure context bytes loaded per query
- Compare flat retrieval vs hierarchical 2-hop retrieval
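A sketch of the scaling sweep, assuming hypothetical `build_corpus`, `flat_retrieve`, and `two_hop_retrieve` helpers:

```python
# Context-efficiency sweep: for each corpus size, run random queries and record
# bytes of context loaded by flat retrieval vs. hierarchical 2-hop retrieval.
import random

def scaling_sweep(sizes=(10, 50, 100, 200, 500, 1000), queries_per_size=100):
    results = []
    for n in sizes:
        corpus = build_corpus(n)
        for _ in range(queries_per_size):
            query = random.choice(corpus).content             # random query per protocol
            flat_ctx = "".join(m.content for m in flat_retrieve(corpus, query))
            hop_ctx = "".join(m.content for m in two_hop_retrieve(corpus, query))
            results.append({"size": n,
                            "flat_bytes": len(flat_ctx.encode("utf-8")),
                            "two_hop_bytes": len(hop_ctx.encode("utf-8"))})
    return results
```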
Red Queen Pre-Learning Protocol
We evaluate the impact of adversarial pre-learning rounds on retrieval performance:
- Initialize 100 memories with varying SMASHIN scores (0, 6, 12)
- Run 0, 3, or 5 Red Queen adversarial rounds before learning
- Simulate 30 days of retrieval with spaced intervals
- Measure: total retrievals needed, final retention, and Red Queen (RQ) boosts applied
Each Red Queen round tests every memory against a stricter threshold (a base recall probability of 0.5, versus 0.7 for normal retrieval), boosting weak memories that fail the adversarial test; a sketch of a single round follows.
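A minimal sketch of one Red Queen round under these assumptions; the recall model and the boosted attribute are illustrative, not the system's exact update rule.

```python
# One Red Queen adversarial round: every memory is tested against the stricter
# threshold and boosted if it fails. Memories are plain dicts with an assumed
# "strength" field standing in for the system's internal memory state.
import random

ADVERSARIAL_BASE_P = 0.5   # harder than the 0.7 base probability of normal retrieval

def recall_probability(memory: dict, base_p: float) -> float:
    # assumed recall model: stronger memories are more likely to pass the test
    return min(1.0, base_p + 0.05 * memory["strength"])

def red_queen_round(memories: list[dict], boost: float = 1.0) -> int:
    """Test every memory against the adversarial threshold; boost those that fail."""
    boosts_applied = 0
    for m in memories:
        if random.random() > recall_probability(m, ADVERSARIAL_BASE_P):
            m["strength"] += boost        # RQ boost for a memory that failed the harder test
            boosts_applied += 1
    return boosts_applied
```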