EmbedEval

v2.0.5 - SDK for Agent Self-Evaluation

Binary evals. Trace-centric. Error-analysis-first. The evaluation CLI that actually works.

npm install -g embedeval


Built on Hamel Husain's Principles

After studying how the best AI teams evaluate their systems, we rebuilt EmbedEval from scratch.

  • โœ“ Binary Only - Pass or fail. No 1-5 scales, no debating whether it's a 3 or a 4. Clear, fast decisions.
  • ๐Ÿ” Error Analysis First - 60-80% of your time should go to manually reviewing traces, not building automated evals.
  • ๐Ÿ“Š Trace-Centric - Complete session records with query, response, context, and metadata. Everything you need to understand failures.
  • โšก Cheap Evals First - Start with assertions and regex. Use LLM-as-judge sparingly, for complex subjective criteria.
  • ๐Ÿ‘ค Benevolent Dictator - One domain expert owns quality decisions. No committees, no voting, no disagreements on scoring.
  • ๐Ÿ“ JSONL Storage - Simple, grep-friendly, version-control friendly. No databases, no queues, no infrastructure.
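As a sketch of what JSONL storage buys you, here is a minimal stdlib-only example of tallying binary annotations from JSONL lines. The field names (id, query, response, pass) are illustrative assumptions, not EmbedEval's actual schema:

```javascript
// Each line of a traces.jsonl file is one self-contained JSON record,
// so it stays grep-friendly and diffs cleanly under version control.
// Field names below are illustrative, not EmbedEval's exact schema.
const lines = [
  '{"id":"t1","query":"What is the refund policy?","response":"Full refunds within 30 days.","pass":true}',
  '{"id":"t2","query":"Do you ship to Canada?","response":"We only sell yachts.","pass":false}',
];

const traces = lines.map((line) => JSON.parse(line));
const passed = traces.filter((t) => t.pass).length;
console.log(`Pass rate: ${(100 * passed) / traces.length}%`);
// Pass rate: 50%
```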

4 Commands to Start

From zero to error analysis in under 5 minutes

1. Setup (30 seconds)

Verify environment and scaffold your first project

embedeval doctor # Check environment
embedeval init my-project --yes # Scaffold project
cd my-project
2. Collect Traces

Import your LLM interaction logs or use sample data

embedeval collect ./production-logs.jsonl --output traces.jsonl
3. Annotate Manually

Spend 30 minutes reviewing 50-100 traces. This is where the magic happens.

embedeval annotate traces.jsonl --user "pm@company.com"

# Interactive mode:
# p = pass, f = fail, c = category, n = notes, s = save
4. Build Taxonomy

Automatically categorize failures to understand what breaks

embedeval taxonomy build --annotations annotations.jsonl

โœ“ Taxonomy built!
Hallucination: 12 traces (36%)
Incomplete: 8 traces (24%)
Wrong Format: 5 traces (15%)
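The percentages in output like this are simply each category's share of failed traces. A minimal sketch of that tally (generic grouping logic, not EmbedEval's internals; the records are made up):

```javascript
// Group failed annotations by category and print each category's share.
// The annotation records below are illustrative, not real output.
const annotations = [
  { pass: false, category: 'Hallucination' },
  { pass: false, category: 'Hallucination' },
  { pass: false, category: 'Incomplete' },
  { pass: true },
];

const failed = annotations.filter((a) => !a.pass);
const counts = {};
for (const a of failed) counts[a.category] = (counts[a.category] || 0) + 1;

for (const [category, n] of Object.entries(counts)) {
  console.log(`${category}: ${n} (${Math.round((100 * n) / failed.length)}%)`);
}
// Hallucination: 2 (67%)
// Incomplete: 1 (33%)
```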

Complete Workflow

From raw logs to evaluation report

$ embedeval collect ./logs.jsonl --limit 100
โœ… Collected 100 traces โ†’ traces.jsonl
 
$ embedeval annotate traces.jsonl --user "pm@company.com"
Trace 1/100 | Not annotated
Query: What's your refund policy?
Response: We offer full refunds within 30 days...
 
Commands: [p]ass [f]ail [c]ategory [n]otes [j]next [s]ave
 
Annotating... Progress: 73/100
 
$ embedeval taxonomy show
Pass Rate: 73.0% (73 passed, 27 failed)
 
Top Failure Categories:
1. Hallucination (44.4%) - LLM invented information
2. Incomplete (29.6%) - Missing required info
3. Wrong Format (18.5%) - Format doesn't match spec
 
$ embedeval eval add
? Eval name: no_hallucination
? Type: llm-judge
? Model: gemini-2.5-flash
? Binary output: yes
โœ“ Eval added to evals.yaml
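For reference, the resulting entry in evals.yaml might look roughly like the sketch below. The field names are an assumption inferred from the prompts above, not the documented schema:

```yaml
# Hypothetical shape of evals.yaml after `embedeval eval add` - illustrative only.
evals:
  - name: no_hallucination
    type: llm-judge
    model: gemini-2.5-flash
    binary: true
```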

๐Ÿ” Multi-Provider Authentication

Securely manage and verify API keys for all supported LLM providers.

$ embedeval auth login openai
๐Ÿ” Authenticating with OpenAI...
API Key: sk-...here
โœ… Successfully authenticated with OpenAI
 
$ embedeval auth status
Google Gemini โœ“ Authenticated (env var)
OpenAI โœ“ Authenticated (stored)
OpenRouter โ—‹ Not configured
Anthropic โ—‹ Not configured
Ollama (Local) โœ“ Authenticated (env var)
 
$ embedeval auth check
Google Gemini โœ“ Healthy (289ms) ยท 46 models
OpenAI โœ“ Healthy (312ms) ยท 12 models
OpenRouter โ—‹ Not configured
Anthropic โ—‹ Not configured
Ollama (Local) โœ“ Healthy (14ms) ยท 1 model
โœ… All 3 configured provider(s) healthy

Features

  • ๐Ÿ”‘ Secure storage - system keychain or encrypted file
  • ๐Ÿงช Health check - verifies key validity only (not network reliability or quota)
  • ๐Ÿงฉ All providers - Gemini, OpenAI, OpenRouter, Anthropic, Ollama
  • โšก CI/CD ready - --json output and exit codes
  • ๐Ÿ”„ Auto token refresh - OAuth tokens are refreshed automatically before expiration

๐Ÿš€ Easy Setup Commands

Get started in seconds with built-in setup and diagnostic tools.

embedeval doctor
$ embedeval doctor
๐Ÿ”ง EmbedEval Doctor
Checking your environment...
 
โœ“ Node.js Version Node.js v20.11.0
โœ“ Installation embedeval v2.0.5
โœ“ LLM Providers 4 provider(s) configured
โœ“ Sample Data Sample traces available
โœ“ Operating System darwin arm64
 
โœ… All checks passed! Ready to go.
embedeval init
$ embedeval init my-project
๐Ÿš€ Creating embedeval project: my-project
 
โœ“ Created directory structure
โœ“ Copied sample traces
โœ“ Created rag.eval template
โœ“ Created .env file
โœ“ Created README.md
 
โœ… Project created successfully!
 
Next: cd my-project && embedeval doctor

New Setup Features

  • ๐Ÿ”ง Doctor command - Diagnose environment issues before starting
  • ๐Ÿ—๏ธ Init command - Scaffold new projects with templates in seconds
  • ๐Ÿ“‹ 6 built-in templates - rag, chatbot, code-assistant, docs, agent, minimal
  • โš™๏ธ Auto .env creation - Interactive API key setup

Everything You Need

No complexity. No infrastructure. Just evaluation that works.

  • ๐Ÿ” Interactive Annotation - Terminal UI for fast binary annotation. 50 traces in 30 minutes.
  • ๐Ÿ“Š Failure Taxonomy - Auto-categorize failures. Understand what breaks and why.
  • โšก Cheap + Expensive Evals - Assertions, regex, code checks, and LLM-as-judge. The priority system runs cheap ones first.
  • ๐ŸŽฒ Synthetic Data Generation - Generate test data along defined dimensions. Test edge cases systematically.
  • ๐Ÿ““ Jupyter Export - Export to notebooks for deep statistical analysis and visualization.
  • ๐Ÿ“„ HTML Reports - Generate shareable reports with metrics, charts, and trace browsers.
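The cheap-first priority can be sketched as a short-circuit: run deterministic checks before paying for a judge call. Everything below (eval names, the judge stub) is illustrative, not EmbedEval's API:

```javascript
// Run cheap deterministic evals first; only call the expensive LLM judge
// if they all pass. The evals and the judge stub here are illustrative.
const cheapEvals = [
  { name: 'Has Content', check: (r) => r.length > 50 },
  { name: 'Mentions Refund', check: (r) => r.toLowerCase().includes('refund') },
];

async function expensiveJudge(response) {
  // Stand-in for an LLM-as-judge call returning a binary verdict.
  return { name: 'Helpful', pass: true };
}

async function evaluateResponse(response) {
  for (const e of cheapEvals) {
    if (!e.check(response)) return { pass: false, failedAt: e.name }; // short-circuit
  }
  const verdict = await expensiveJudge(response);
  return { pass: verdict.pass, failedAt: verdict.pass ? null : verdict.name };
}

evaluateResponse('Too short').then((r) => console.log(r));
// { pass: false, failedAt: 'Has Content' }
```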

๐ŸŽจ DSL-Generated Annotation UI

Define evals in natural language, auto-generate annotation interfaces

Define evals in DSL

# support.eval
name: Customer Support Evals
domain: support

# Cheap evals (run first)
must "Has Content": response length > 50
must "Addresses Query": must contain query keyword
should "Cites Sources": cites sources

# Expensive evals (LLM judge)
[expensive] must "Helpful": is helpful
[expensive] must "No Hallucination": no hallucination

$ embedeval dsl ui support.eval

  • ๐Ÿ“ Natural language patterns
  • โŒจ๏ธ Keyboard shortcuts (P/F/J/K)
  • ๐Ÿ’พ Auto-save to localStorage
  • ๐Ÿ“ค Export to JSONL

Deploy Anywhere

One-click deployment to Vercel

Deploy to Vercel

Host your evaluation dashboard with automatic deployments from GitHub. Share results with your team via URL.


CI/CD Integration

Automate evaluation in your pipeline

GitHub Actions

Run evals on every PR. Block merges on quality regression.

uses: embedeval/action@v2
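A minimal workflow around that action might look like the sketch below. Only the `uses: embedeval/action@v2` line comes from above; the trigger, job layout, and checkout step are assumptions:

```yaml
# Illustrative workflow - not the action's documented configuration.
name: evals
on: pull_request
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: embedeval/action@v2
```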

GitLab CI

Integrate with GitLab pipelines. Generate reports as artifacts.

image: embedeval/cli:latest

Local Testing

Run locally with Docker. Same results, zero setup.

docker run embedeval/cli

SDK for Real-Time Self-Evaluation

New in v2.0.5: Agents can self-evaluate responses in real time using the SDK.

Programmatic Evaluation

import { evaluate, preflight, getConfidence } from 'embedeval/sdk';

// Quick preflight check (fast)
const check = await preflight(response, query);
if (!check.passed) {
  console.log('Issues:', check.failedChecks);
}

// Confidence scoring with action routing
const conf = await getConfidence(response, { query });
// conf.action: 'send' | 'revise' | 'escalate' | 'clarify'

// Full evaluation
const result = await evaluate(response, {
  query,
  evals: ['coherent', 'helpful', 'no-hallucination']
});
console.log(`Pass rate: ${result.passRate}%`);

Multi-Provider LLM Support

Use any LLM provider as a judge: Gemini, OpenAI, OpenRouter, Anthropic, Ollama, or your own local models.

Provider Configuration

# Gemini (recommended - fast, cheap)
export GEMINI_API_KEY="your-api-key"

# OpenAI
export OPENAI_API_KEY="your-api-key"

# OpenRouter (access to many models)
export OPENAI_API_KEY="your-openrouter-key"
export OPENAI_BASE_URL="https://openrouter.ai/api/v1"

# Ollama (local, private)
export OPENAI_BASE_URL="http://localhost:11434/v1"

# Check providers: embedeval providers list
# Benchmark:       embedeval providers benchmark

Self-assessment: embedeval assess run reports price, speed, and quality metrics.

MCP Server for AI Agents

{
  "mcpServers": {
    "embedeval": {
      "command": "npx",
      "args": ["embedeval", "mcp-server"],
      "env": { "GEMINI_API_KEY": "your-api-key" }
    }
  }
}

๐ŸŽฏ Skills Index

Navigate EmbedEval capabilities as actionable skills. Tell agents: "Use the [skill] skill"

๐Ÿ”ง Setup Skills

  • embedeval doctor - Environment check
  • embedeval init - Project scaffolding
  • embedeval auth login - Configure providers

๐Ÿ“Š Core Workflow

  • embedeval collect - Import traces
  • embedeval annotate - Binary annotation
  • embedeval taxonomy build - Failure categories

๐ŸŽจ DSL Templates

  • dsl init rag - RAG evaluation
  • dsl init chatbot - Support bots
  • dsl init agent - Autonomous agents

โšก Automation

  • embedeval dsl run - Run evaluations
  • embedeval watch - Real-time eval
  • embedeval eval add - Add evaluators

๐Ÿ“š Full Documentation: AGENTS.md โ€ข Skills Index โ€ข Skill Manifest (JSON)