EvalOps Workbench

A local-first evaluation harness for prompts, tools, and agents with regression tracking and experiment history.

Prototype · Developer Tool · LLM

Problem. LLM teams lack a lightweight way to compare prompt and tool changes before shipping.

Why now. Evaluation is moving from optional best practice to baseline engineering hygiene.

Last pass rate (candidate, latest run)
Pass rate · 7d (rolling mean)
Regressions · 30d (distinct cases caught)
Eval runs (recorded runs)

System status: unknown
Mode: live (Tier A · live workload)
Last eval run: never
Last deploy: never
Schema: public contract

Built for: Agent builders, prompt engineers, applied AI teams

Stack: Python, Argparse CLI, DuckDB, OpenTelemetry

What ships now

The harness capabilities, live in this repo.

Load datasets from JSON or CSV
Run prompt or agent variants
Score outputs with rubric functions
Compare runs and export regressions
Inspect runs case-by-case and persist shareable reports
Gate releases on regression thresholds
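
The capabilities above could fit together roughly as follows. This is a minimal sketch using only the Python standard library, not the Workbench's actual API: every name here (`load_cases`, `contains_expected`, `run_variant`, `regressions`, `gate`) and the `id`/`input`/`expected` dataset columns are assumptions made for illustration.

```python
import csv
import io
import json
from typing import Callable

# Hypothetical types and names; none are taken from the EvalOps Workbench codebase.
Case = dict[str, str]
Rubric = Callable[[str, Case], bool]


def load_cases(text: str, fmt: str) -> list[Case]:
    """Parse a dataset from JSON (a list of objects) or CSV (header row plus records)."""
    if fmt == "json":
        return list(json.loads(text))
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    raise ValueError(f"unsupported format: {fmt}")


def contains_expected(output: str, case: Case) -> bool:
    """Example rubric function: pass when the expected string appears in the output."""
    return case["expected"].lower() in output.lower()


def run_variant(variant: Callable[[str], str], cases: list[Case], rubric: Rubric) -> dict[str, bool]:
    """Run one prompt or agent variant over every case; record pass/fail per case id."""
    return {case["id"]: rubric(variant(case["input"]), case) for case in cases}


def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case ids that passed on the baseline run but fail on the candidate run."""
    return sorted(cid for cid, ok in baseline.items() if ok and not candidate.get(cid, False))


def gate(baseline: dict[str, bool], candidate: dict[str, bool], max_regressions: int = 0) -> bool:
    """Return True when the candidate stays within the allowed number of regressions."""
    return len(regressions(baseline, candidate)) <= max_regressions


# Toy dataset and two stand-in variants (plain functions instead of real model calls).
CSV_DATA = "id,input,expected\nc1,capital of France,Paris\nc2,2+2,4\n"


def baseline_variant(prompt: str) -> str:
    return {"capital of France": "Paris", "2+2": "4"}.get(prompt, "")


def candidate_variant(prompt: str) -> str:
    return {"capital of France": "Paris", "2+2": "five"}.get(prompt, "")
```

With this toy data, running both variants through `run_variant` and passing the results to `regressions` flags case `c2`, so `gate` with the default threshold of zero would block the release. A real harness would presumably persist each run (for example in DuckDB, per the stack above) so that comparisons can span experiment history rather than two in-memory dictionaries.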