Eval Harness

ak-coder ships with an LLM-as-judge evaluation suite in packages/evals/. It tests agent behavior end-to-end against a real LLM (Ollama by default).

Quick start

# Requires Ollama running with a judge model
bun run packages/evals/run.ts

See Running Evals for filters, CI usage, and troubleshooting.

What evals test

19 built-in eval cases covering:

| Area | Cases |
| --- | --- |
| File tools | read_file, write_file, str_replace, patch_file |
| Shell | bash (echo, read-only gate) |
| Search | glob, grep_search, semantic_search |
| Agent | delegate_task, plan mode, skills (load + create/invoke) |
| Session | Multi-turn context, compaction retention |
| Network | web_fetch real URL |
| Snapshots | Golden file-state comparisons |

The skills evals include a multi-step case where the agent creates a SKILL.md, reloads, and invokes it via the same Apply Skill message the REPL uses.

Criterion types

Static (check.*): deterministic assertions. Did the tool get called? Does the file contain X? Did the response match a regex?
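To make the idea of a deterministic assertion concrete, here is a minimal sketch that models each static criterion as a pure predicate over a recorded run. Everything in it is an assumption for illustration: the RunResult and ToolCall shapes, the helper names, and the sample data are hypothetical stand-ins, not ak-coder's actual check.* signatures (see Writing Evals for those).

```typescript
// Hypothetical shapes: ak-coder's real types and check.* signatures may differ.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface RunResult {
  toolCalls: ToolCall[];       // tools the agent invoked, in order
  files: Map<string, string>;  // path -> final file contents
  response: string;            // final assistant message
}

// A static criterion is a pure predicate over the recorded run:
// same input, same verdict, no model involved.
type StaticCheck = (run: RunResult) => boolean;

// Did the tool get called?
const toolCalled = (name: string): StaticCheck =>
  (run) => run.toolCalls.some((call) => call.name === name);

// Does the file contain X? A missing file counts as "does not contain".
const fileContains = (path: string, needle: string): StaticCheck =>
  (run) => (run.files.get(path) ?? "").includes(needle);

// Did the response match a regex?
const responseMatches = (pattern: RegExp): StaticCheck =>
  (run) => pattern.test(run.response);

// Example run, invented for illustration.
const sampleRun: RunResult = {
  toolCalls: [{ name: "write_file", args: { path: "notes.txt" } }],
  files: new Map([["notes.txt", "hello world\n"]]),
  response: "Wrote notes.txt with the greeting.",
};

const checks: StaticCheck[] = [
  toolCalled("write_file"),
  fileContains("notes.txt", "hello"),
  responseMatches(/notes\.txt/),
];

// The case passes only if every static criterion holds.
const allPassed = checks.every((check) => check(sampleRun));
```

Because these predicates never call a model, they are cheap to run and give the same result on every run, which is what separates them from judge criteria.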

Judge (judge(...)): LLM-graded. A local Ollama model evaluates the agent's response against a natural-language criterion.
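One way an LLM-graded criterion could work is sketched below: build a grading prompt from the criterion and the agent's response, send it to a model, and parse a constrained PASS/FAIL reply. This is not ak-coder's judge(...) implementation. The function names, the prompt wording, the Verdict shape, and the injected `complete` callback (standing in for a call to the local Ollama model) are all assumptions.

```typescript
// Hypothetical sketch: not ak-coder's judge(...) implementation.
// `complete` stands in for a request to the local Ollama judge model.
type Complete = (prompt: string) => Promise<string>;

interface Verdict {
  pass: boolean;
  reason: string;
}

// Ask for a constrained reply format so the verdict can be parsed reliably.
function buildJudgePrompt(criterion: string, response: string): string {
  return [
    "You are grading an AI agent's response.",
    `Criterion: ${criterion}`,
    `Response: ${response}`,
    "Reply with PASS or FAIL on the first line, then one sentence of reasoning.",
  ].join("\n");
}

// Treat anything other than an explicit PASS on the first line as a failure.
function parseVerdict(raw: string): Verdict {
  const [first = "", ...rest] = raw.trim().split("\n");
  return {
    pass: first.trim().toUpperCase() === "PASS",
    reason: rest.join(" ").trim(),
  };
}

async function judgeResponse(
  criterion: string,
  response: string,
  complete: Complete,
): Promise<Verdict> {
  const raw = await complete(buildJudgePrompt(criterion, response));
  return parseVerdict(raw);
}

// Stub model for a dry run; a real judge would query Ollama here.
const stubModel: Complete = async () =>
  "PASS\nThe response names the file it created.";
```

Injecting the model call keeps the prompt building and verdict parsing testable without a running Ollama instance. Defaulting to failure on an unparseable reply is a conservative choice, so a rambling judge cannot pass a case by accident.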

See Writing Evals for the full API: check.toolCalled, check.fileContains, check.skillInvoked, custom run() flows, and snapshot tests.