Binary evals. Trace-centric. Error-analysis-first. The evaluation CLI that actually works.
npm install -g embedeval
Watch how to go from zero to production-ready evaluation in under 5 minutes
EmbedEval v3.0 - Complete workflow demonstration
After studying how the best AI teams evaluate their systems, we rebuilt EmbedEval from scratch
Pass or fail. No 1-5 scales, no debating whether it's a 3 or 4. Clear, fast decisions.
60-80% of your time should go to manually reviewing traces, not to building automated evals.
Complete session records with query, response, context, and metadata. Everything you need to understand failures.
Start with assertions and regex. Use LLM-as-judge sparingly for complex subjective criteria.
One domain expert owns quality decisions. No committees, no voting, no disagreements on scoring.
Simple, grep-friendly, version-control friendly. No databases, no queues, no infrastructure.
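Plain-text results mean standard Unix tools work on them directly. For instance, a pass rate is one grep away (the results-file format below is invented for illustration, not EmbedEval's actual output):

```shell
# Hypothetical results file: one trace per line, binary verdict first
cat > results.txt <<'EOF'
trace-001 PASS
trace-002 FAIL hallucination
trace-003 PASS
EOF

pass=$(grep -c ' PASS$' results.txt)   # passing traces
total=$(grep -c '' results.txt)        # all traces
echo "pass rate: $pass/$total"         # prints: pass rate: 2/3
```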
From zero to error analysis in under 5 minutes
Verify environment and scaffold your first project
Import your LLM interaction logs or use sample data
Spend 30 minutes reviewing 50-100 traces. This is where the magic happens.
Automatically categorize failures to understand what breaks
From raw logs to evaluation report
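The quickstart above maps onto the commands documented later on this page (flags are omitted; exact arguments may vary between versions):

```shell
embedeval doctor          # verify your environment
embedeval init            # scaffold a project
embedeval collect         # import LLM interaction logs (or sample data)
embedeval annotate        # binary pass/fail annotation in the terminal UI
embedeval taxonomy build  # auto-categorize the failures you found
```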
Securely manage and verify API keys for all supported LLM providers.
--json output and exit codes
Get started in seconds with built-in setup and diagnostic tools.
No complexity. No infrastructure. Just evaluation that works.
Terminal UI for fast binary annotation. 50 traces in 30 minutes.
Auto-categorize failures. Understand what breaks and why.
Assertions, regex, code checks, and LLM-as-judge. Priority system runs cheap ones first.
Generate test data using dimensions. Test edge cases systematically.
Export to notebooks for deep statistical analysis and visualization.
Generate shareable reports with metrics, charts, and trace browsers.
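The evaluator priority system can be sketched with shell short-circuiting: cheap deterministic checks run first, and the expensive LLM judge is only reached when they pass (every name below is illustrative, not EmbedEval's API):

```shell
response="Per our docs at https://example.com, refunds take 5 business days."

has_content()  { [ "${#response}" -gt 50 ]; }                       # cheap assertion
cites_source() { printf '%s' "$response" | grep -Eq 'https?://'; }  # cheap regex

if has_content && cites_source; then
  stage="llm-judge"        # only now would the costly judge be invoked
else
  stage="rejected-early"   # expensive call skipped entirely
fi
echo "$stage"
```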
Define evals in natural language, auto-generate annotation interfaces
# support.eval
name: Customer Support Evals
domain: support
# Cheap evals (run first)
must "Has Content": response length > 50
must "Addresses Query": must contain query keyword
should "Cites Sources": cites sources
# Expensive evals (LLM judge)
[expensive] must "Helpful": is helpful
[expensive] must "No Hallucination": no hallucination
$ embedeval dsl ui support.eval
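To make the rule semantics concrete, the two cheap rules above behave roughly like these shell checks (a sketch only; EmbedEval's actual evaluator is not shown):

```shell
query="refund"
response="Our refund policy allows returns within 30 days of purchase."

check_has_content()     { [ "${#response}" -gt 50 ]; }                   # "Has Content"
check_addresses_query() { printf '%s' "$response" | grep -qi "$query"; } # "Addresses Query"

check_has_content && check_addresses_query && verdict=PASS || verdict=FAIL
echo "$verdict"   # prints: PASS
```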
One-click deployment to Vercel
Host your evaluation dashboard with automatic deployments from GitHub. Share results with your team via URL.
Deploy to Vercel
Automate evaluation in your pipeline
Run evals on every PR. Block merges on quality regression.
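Exit-code gating is what makes the PR block work: CI fails the job when evals fail. A runnable sketch, with a stub standing in for the real embedeval invocation (the stub and names are invented for illustration):

```shell
# Stub for the eval run, assumed to exit non-zero when a must-level eval fails.
run_evals() { return 1; }   # pretend a must-eval failed

if run_evals; then
  verdict="merge-allowed"
else
  verdict="merge-blocked"   # CI marks the PR red, blocking the merge
fi
echo "$verdict"             # prints: merge-blocked
```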
Integrate with GitLab pipelines. Generate reports as artifacts.
Run locally with Docker. Same results, zero setup.
New in v2.0.5: Agents can self-evaluate responses in real-time using the SDK.
Use any LLM provider as a judge: Gemini, OpenAI, OpenRouter, Ollama, or your own local models.
Self-assessment: run embedeval assess run for price, speed, and quality metrics.
Navigate EmbedEval capabilities as actionable skills. Tell agents: "Use the [skill] skill"
embedeval doctor - Environment check
embedeval init - Project scaffolding
embedeval auth login - Configure providers
embedeval collect - Import traces
embedeval annotate - Binary annotation
embedeval taxonomy build - Failure categories
dsl init rag - RAG evaluation
dsl init chatbot - Support bots
dsl init agent - Autonomous agents
embedeval dsl run - Run evaluations
embedeval watch - Real-time eval
embedeval eval add - Add evaluators
Full Documentation: AGENTS.md • Skills Index • Skill Manifest (JSON)