Skip to main content

Writing Evals

Create packages/evals/evals/<feature>.eval.ts:

import { evalCase, check, judge } from '../src';

evalCase('feature: what the agent should do', {
prompts: ['Ask the agent to do something specific'],
setup: (env) => {
env.files({ '/ws/src/app.ts': 'export const x = 1;' });
env.confirmAll(); // auto-approve writes/commands
},
criteria: [
check.toolCalled('read_file'), // was the tool used?
check.fileContains('/ws/out.ts', 'foo'), // is text in the file?
check.responseContains('success'), // is text in the response?
judge('Response confirms the task completed successfully'),
],
});

run.ts auto-discovers all *.eval.ts files — no registration needed.

Available checks

CheckDescription
check.toolCalled(name)Tool was invoked at least once
check.toolCalledWith(name, args)Tool was invoked with specific args
check.fileContains(path, substring)File exists and contains text
check.fileModified(path)File was written during the run
check.responseContains(substring)Final response contains text
check.responseMatches(regex)Final response matches pattern
check.skillInvoked(name)User message contains Apply Skill "<name>"
check.golden(name, opts)File state matches a saved snapshot

Multi-turn evals

Pass multiple prompts to simulate a conversation:

evalCase('session: context retained', {
prompts: [
'My favorite color is blue.',
'What is my favorite color?',
],
criteria: [
check.responseContains('blue'),
],
});

Custom run flows

For multi-step scenarios (create a file, reload skills, invoke), use run instead of prompts:

evalCase('skills: create SKILL.md, reload, and invoke', {
setup: (env) => env.confirmAll(),
run: async ({ prompt, invokeSkill, agent }) => {
await prompt('Use write_file to create .ak-coder/skills/howdy/SKILL.md with …');
await agent.reloadSkills();
await invokeSkill('howdy-skill', 'say hello in one word');
},
criteria: [
check.toolCalled('write_file'),
check.fileContains('/ws/.ak-coder/skills/howdy/SKILL.md', 'howdy-skill'),
check.skillInvoked('howdy-skill'),
judge('Final response starts with or prominently contains "HOWDY".'),
],
});

invokeSkill(name, args) sends the same Apply Skill "<name>"… message the REPL uses for /skills:<name>.

Golden snapshots

Capture the expected filesystem state after a run:

evalCase('golden: write then read', {
setup: (env) => {
env.confirmAll();
env.files({ '/ws/input.txt': 'hello' });
},
prompts: ['Copy /ws/input.txt to /ws/output.txt'],
criteria: [
check.golden('copy_input_to_output', { checkToolCalls: false, checkFiles: true }),
],
});

On first run the snapshot is created. Subsequent runs compare against it. Regenerate with --update-goldens.