ADR 09: In-Memory TF-IDF Vector Store & Indexer

Context

Standard agent search relies on exact keyword matching (via grep or glob), which struggles with synonyms or conceptual queries. We need a semantic lookup feature. However, introducing an external vector database (like Chromadb or Pinecone) adds complex dependency requirements, installation overhead, and slows down local execution.

Decision

We decided to implement a pure TypeScript in-memory vector store based on TF-IDF (Term Frequency-Inverse Document Frequency) calculations.

Implementation Details

WorkspaceIndexer:
- Recursively reads files matching extensions .ts, .tsx, .js, .json, .md.
- Chunks files into small text blocks.
- Cleans tokens (lowercase, alphanumeric filtering, basic stop-word removal).
VectorStore:
- Computes TF-IDF matrices on indexing.
- Computes cosine similarity of the query vector against all document chunk vectors.
- Returns top-K results sorted by score, filtered by a minimum score threshold.

Consequences

Pros:
- Zero external dependencies. Starts instantly and consumes minimal memory.
- Very fast searches for typical workspace directory sizes (under 10,000 chunks).
- Easy to unit test and mock.
Cons:
- Not suitable for huge codebases (e.g. monorepos with hundreds of thousands of files), where term frequency matrix allocations can exceed Node.js heap limits.
- TF-IDF uses lexical overlap rather than dense embeddings (neural semantic search), so conceptual matching is limited to shared vocabularies.

Context​

Decision​

Implementation Details​

Consequences​

Context

Decision

Implementation Details

Consequences