Writing Evals

Create packages/evals/evals/<feature>.eval.ts:

import { evalCase, check, judge } from '../src';

evalCase('feature: what the agent should do', {
  prompts: ['Ask the agent to do something specific'],
  setup: (env) => {
    env.files({ '/ws/src/app.ts': 'export const x = 1;' });
    env.confirmAll();   // auto-approve writes/commands
  },
  criteria: [
    check.toolCalled('read_file'),           // was the tool used?
    check.fileContains('/ws/out.ts', 'foo'), // is text in the file?
    check.responseContains('success'),       // is text in the response?
    judge('Response confirms the task completed successfully'),
  ],
});

run.ts auto-discovers all *.eval.ts files — no registration needed.

Available checks

Check	Description
`check.toolCalled(name)`	Tool was invoked at least once
`check.toolCalledWith(name, args)`	Tool was invoked with specific args
`check.fileContains(path, substring)`	File exists and contains text
`check.fileModified(path)`	File was written during the run
`check.responseContains(substring)`	Final response contains text
`check.responseMatches(regex)`	Final response matches pattern
`check.skillInvoked(name)`	User message contains `Apply Skill "<name>"`
`check.golden(name, opts)`	File state matches a saved snapshot

Multi-turn evals

Pass multiple prompts to simulate a conversation:

evalCase('session: context retained', {
  prompts: [
    'My favorite color is blue.',
    'What is my favorite color?',
  ],
  criteria: [
    check.responseContains('blue'),
  ],
});

Custom run flows

For multi-step scenarios (create a file, reload skills, invoke), use run instead of prompts:

evalCase('skills: create SKILL.md, reload, and invoke', {
  setup: (env) => env.confirmAll(),
  run: async ({ prompt, invokeSkill, agent }) => {
    await prompt('Use write_file to create .ak-coder/skills/howdy/SKILL.md with …');
    await agent.reloadSkills();
    await invokeSkill('howdy-skill', 'say hello in one word');
  },
  criteria: [
    check.toolCalled('write_file'),
    check.fileContains('/ws/.ak-coder/skills/howdy/SKILL.md', 'howdy-skill'),
    check.skillInvoked('howdy-skill'),
    judge('Final response starts with or prominently contains "HOWDY".'),
  ],
});

invokeSkill(name, args) sends the same Apply Skill "<name>"… message the REPL uses for /skills:<name>.

Golden snapshots

Capture the expected filesystem state after a run:

evalCase('golden: write then read', {
  setup: (env) => {
    env.confirmAll();
    env.files({ '/ws/input.txt': 'hello' });
  },
  prompts: ['Copy /ws/input.txt to /ws/output.txt'],
  criteria: [
    check.golden('copy_input_to_output', { checkToolCalls: false, checkFiles: true }),
  ],
});

On first run the snapshot is created. Subsequent runs compare against it. Regenerate with --update-goldens.

Available checks​

Multi-turn evals​

Custom run flows​

Golden snapshots​

Available checks

Multi-turn evals

Custom run flows

Golden snapshots