Eval Harness

ak-coder ships with an LLM-as-judge evaluation suite in packages/evals/. It tests agent behavior end-to-end against a real LLM (Ollama by default).

Quick start

# Requires Ollama running with a judge model
bun run packages/evals/run.ts

See Running Evals for filters, CI usage, and troubleshooting.
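
The quick-start command assumes an Ollama server is reachable and already has a judge model pulled. A minimal preflight sketch follows, assuming Ollama's default address (`http://localhost:11434`) and its `GET /api/tags` endpoint, which lists local models. The helper names and the model name in the usage comment are placeholders for illustration, not part of ak-coder.

```typescript
// Only the field of Ollama's GET /api/tags response that this sketch reads.
interface OllamaTags {
  models: { name: string }[];
}

// Pure helper: true if `wanted` matches a local model exactly,
// or matches its base name with any tag (e.g. "foo" matches "foo:latest").
function hasModel(tags: OllamaTags, wanted: string): boolean {
  return tags.models.some(
    (m) => m.name === wanted || m.name.startsWith(`${wanted}:`),
  );
}

// Asks a local Ollama server whether the model is available.
// Assumption: baseUrl is Ollama's default; adjust if yours differs.
async function judgeModelAvailable(
  wanted: string,
  baseUrl = "http://localhost:11434",
): Promise<boolean> {
  try {
    const res = await fetch(`${baseUrl}/api/tags`);
    if (!res.ok) return false;
    return hasModel((await res.json()) as OllamaTags, wanted);
  } catch {
    // Network error, typically because Ollama is not running.
    return false;
  }
}

// Usage (model name is a placeholder, not the suite's actual default):
// judgeModelAvailable("llama3.1").then((ok) => {
//   if (!ok) console.error("Start Ollama and pull a judge model first.");
// });
```

Keeping `hasModel` separate from the network call makes the matching rule easy to check without a running server.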

What evals test

The suite includes 19 built-in eval cases covering:

Area	Cases
File tools	`read_file`, `write_file`, `str_replace`, `patch_file`
Shell	`bash` (echo, read-only gate)
Search	`glob`, `grep_search`, `semantic_search`
Agent	`delegate_task`, plan mode, skills (load + create/invoke)
Session	Multi-turn context, compaction retention
Network	`web_fetch` real URL
Snapshots	Golden file-state comparisons

The skills evals include a multi-step case where the agent creates a SKILL.md, reloads, and invokes it via the same Apply Skill message the REPL uses.

Criterion types

Static (check.*): deterministic assertions. Did the tool get called? Does the file contain X? Did the response match a regex?

Judge (judge(...)): LLM-graded. A local Ollama model evaluates the agent's response against a natural-language criterion.
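
To make the distinction concrete, here is a rough sketch of how the two criterion kinds could be modeled. Everything in it (`AgentRun`, `sketchCheck`, `sketchJudge`, `parseVerdict`, the PASS/FAIL prompt) is hypothetical and chosen for illustration; it is not the API in packages/evals/.

```typescript
// Hypothetical shapes for illustration; not the real packages/evals API.
interface AgentRun {
  toolCalls: string[];        // names of tools the agent invoked, in order
  files: Map<string, string>; // path -> final file contents
  response: string;           // the agent's final message
}

type StaticCheck = (run: AgentRun) => boolean;

// Static criteria: pure functions of the recorded run, so results are repeatable.
const sketchCheck = {
  toolCalled: (name: string): StaticCheck => (run) =>
    run.toolCalls.includes(name),
  fileContains: (path: string, needle: string): StaticCheck => (run) =>
    run.files.get(path)?.includes(needle) ?? false,
  // String.prototype.search ignores the regex's lastIndex, so reuse is safe.
  responseMatches: (pattern: RegExp): StaticCheck => (run) =>
    run.response.search(pattern) !== -1,
};

// Judge criteria: the grader is injected, so a stub can stand in for the model.
type Grader = (prompt: string) => Promise<string>;

// Treats any reply that begins with "PASS" (case-insensitive) as a pass.
function parseVerdict(reply: string): boolean {
  return reply.trim().toUpperCase().startsWith("PASS");
}

async function sketchJudge(
  criterion: string,
  run: AgentRun,
  grade: Grader,
): Promise<boolean> {
  const prompt =
    `Criterion: ${criterion}\n` +
    `Agent response: ${run.response}\n` +
    `Reply with PASS or FAIL.`;
  return parseVerdict(await grade(prompt));
}
```

In this sketch a static check always gives the same answer for the same run, while the judge result depends on whatever the grader returns, which is why the two kinds are worth keeping separate.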

See Writing Evals for the full API — check.toolCalled, check.fileContains, check.skillInvoked, custom run() flows, and snapshot tests.

Related docs

Tool reference — what each built-in tool does and its annotations
Confirmation policy — how plan mode and write gates affect eval behavior
