Binary evals. Trace-centric. Error-analysis-first. The evaluation CLI that actually works.
npm install -g embedeval
Watch how to go from zero to production-ready evaluation in under 5 minutes
EmbedEval v3.0 - Complete workflow demonstration
After studying how the best AI teams evaluate their systems, we rebuilt EmbedEval from scratch
Pass or fail. No 1-5 scales, no debating whether it's a 3 or 4. Clear, fast decisions.
60-80% of your time should go to manually reviewing traces, not to building automated evals.
Complete session records with query, response, context, and metadata. Everything you need to understand failures.
Start with assertions and regex. Use LLM-as-judge sparingly for complex subjective criteria.
One domain expert owns quality decisions. No committees, no voting, no disagreements on scoring.
Simple, grep-friendly, version-control friendly. No databases, no queues, no infrastructure.
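Plain-text results mean standard Unix tools work on them directly. For instance, a pass rate is one grep away (the results-file format below is invented for illustration, not EmbedEval's actual output):

```shell
# Hypothetical results file: one trace per line, binary verdict first
cat > results.txt <<'EOF'
trace-001 PASS
trace-002 FAIL hallucination
trace-003 PASS
EOF

pass=$(grep -c ' PASS$' results.txt)   # passing traces
total=$(grep -c '' results.txt)        # all traces
echo "pass rate: $pass/$total"         # prints: pass rate: 2/3
```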
From zero to error analysis in under 5 minutes
Verify environment and scaffold your first project
Import your LLM interaction logs or use sample data
Spend 30 minutes reviewing 50-100 traces. This is where the magic happens.
Automatically categorize failures to understand what breaks
From raw logs to evaluation report
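The quickstart above maps onto the commands documented later on this page (flags are omitted; exact arguments may vary between versions):

```shell
embedeval doctor          # verify your environment
embedeval init            # scaffold a project
embedeval collect         # import LLM interaction logs (or sample data)
embedeval annotate        # binary pass/fail annotation in the terminal UI
embedeval taxonomy build  # auto-categorize the failures you found
```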
Securely manage and verify API keys for all supported LLM providers.
--json output and exit codes
Get started in seconds with built-in setup and diagnostic tools.
No complexity. No infrastructure. Just evaluation that works.
Terminal UI for fast binary annotation. 50 traces in 30 minutes.
Auto-categorize failures. Understand what breaks and why.
Assertions, regex, code checks, and LLM-as-judge. Priority system runs cheap ones first.
Generate test data using dimensions. Test edge cases systematically.
Export to notebooks for deep statistical analysis and visualization.
Generate shareable reports with metrics, charts, and trace browsers.
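The evaluator priority system can be sketched with shell short-circuiting: cheap deterministic checks run first, and the expensive LLM judge is only reached when they pass (every name below is illustrative, not EmbedEval's API):

```shell
response="Per our docs at https://example.com, refunds take 5 business days."

has_content()  { [ "${#response}" -gt 50 ]; }                       # cheap assertion
cites_source() { printf '%s' "$response" | grep -Eq 'https?://'; }  # cheap regex

if has_content && cites_source; then
  stage="llm-judge"        # only now would the costly judge be invoked
else
  stage="rejected-early"   # expensive call skipped entirely
fi
echo "$stage"
```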
Define evals in natural language, auto-generate annotation interfaces
# support.eval
name: Customer Support Evals
domain: support
# Cheap evals (run first)
must "Has Content": response length > 50
must "Addresses Query": must contain query keyword
should "Cites Sources": cites sources
# Expensive evals (LLM judge)
[expensive] must "Helpful": is helpful
[expensive] must "No Hallucination": no hallucination
$ embedeval dsl ui support.eval
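To make the rule semantics concrete, the two cheap rules above behave roughly like these shell checks (a sketch only; EmbedEval's actual evaluator is not shown):

```shell
query="refund"
response="Our refund policy allows returns within 30 days of purchase."

check_has_content()     { [ "${#response}" -gt 50 ]; }                   # "Has Content"
check_addresses_query() { printf '%s' "$response" | grep -qi "$query"; } # "Addresses Query"

check_has_content && check_addresses_query && verdict=PASS || verdict=FAIL
echo "$verdict"   # prints: PASS
```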
One-click deployment to Vercel
Host your evaluation dashboard with automatic deployments from GitHub. Share results with your team via URL.
Deploy to Vercel
Automate evaluation in your pipeline
Run evals on every PR. Block merges on quality regression.
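Exit-code gating is what makes the PR block work: CI fails the job when evals fail. A runnable sketch, with a stub standing in for the real embedeval invocation (the stub and names are invented for illustration):

```shell
# Stub for the eval run, assumed to exit non-zero when a must-level eval fails.
run_evals() { return 1; }   # pretend a must-eval failed

if run_evals; then
  verdict="merge-allowed"
else
  verdict="merge-blocked"   # CI marks the PR red, blocking the merge
fi
echo "$verdict"             # prints: merge-blocked
```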
Integrate with GitLab pipelines. Generate reports as artifacts.
Run locally with Docker. Same results, zero setup.
New in v2.0.5: Agents can self-evaluate responses in real-time using the SDK.
Use any LLM provider as a judge: Gemini, OpenAI, OpenRouter, Ollama, or your own local models.
Self-assessment: run embedeval assess run for price, speed, and quality metrics.
Navigate EmbedEval capabilities as actionable skills. Tell agents: "Use the [skill] skill"
embedeval doctor - Environment check
embedeval init - Project scaffolding
embedeval auth login - Configure providers
embedeval collect - Import traces
embedeval annotate - Binary annotation
embedeval taxonomy build - Failure categories
dsl init rag - RAG evaluation
dsl init chatbot - Support bots
dsl init agent - Autonomous agents
embedeval dsl run - Run evaluations
embedeval watch - Real-time eval
embedeval eval add - Add evaluators
Full Documentation: AGENTS.md • Skills Index • Skill Manifest (JSON)