EvalOps Workbench

Prototype

Developer Tool · LLM

Built by eleventh.

A local-first evaluation harness for prompts, tools, and agents with regression tracking and experiment history.

Problem. LLM teams lack a lightweight way to compare prompt and tool changes before shipping.

Why now. Evaluation is moving from optional best practice to baseline engineering hygiene.

  • Last pass rate: candidate, latest run
  • Pass rate · 7d: rolling mean
  • Regressions · 30d: distinct cases caught
  • Eval runs: recorded runs

System status: unknown
Mode: live (Tier A · live workload)
Last eval run: never
Last deploy: never
Schema: v1 (public contract)

Built for: agent builders, prompt engineers, applied AI teams

Stack: Python, argparse CLI, DuckDB, OpenTelemetry

What ships now
The harness capabilities, live in this repo.
  • Load datasets from JSON or CSV
  • Run prompt or agent variants
  • Score outputs with rubric functions
  • Compare runs and export regressions
  • Inspect runs case-by-case and persist shareable reports
  • Gate releases on regression thresholds
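
The capabilities above can be sketched end to end in a few lines of Python. This is a minimal illustration, not the harness's actual API: every name here (`load_cases`, `exact_match`, `run_variant`, `regressions`, `gate`), the `id`/`input`/`expected` column layout, and the lookup-table "variants" standing in for real prompt or agent calls are assumptions made for the example.

```python
import csv
import io
import json
from typing import Callable

Case = dict[str, str]
Variant = Callable[[str], str]
Rubric = Callable[[Case, str], bool]


def load_cases(text: str, fmt: str) -> list[Case]:
    """Parse a dataset from JSON (a list of objects) or CSV (with a header row)."""
    if fmt == "json":
        return json.loads(text)
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    raise ValueError(f"unsupported format: {fmt}")


def exact_match(case: Case, output: str) -> bool:
    """Hypothetical rubric: pass when the output equals the expected answer."""
    return output.strip() == case["expected"].strip()


def run_variant(cases: list[Case], variant: Variant, rubric: Rubric) -> dict[str, bool]:
    """Run one variant over every case and record pass/fail keyed by case id."""
    return {case["id"]: rubric(case, variant(case["input"])) for case in cases}


def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case ids that passed on the baseline but do not pass on the candidate."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


def gate(baseline: dict[str, bool], candidate: dict[str, bool], max_regressions: int = 0) -> bool:
    """Return True when the candidate stays within the allowed regression count."""
    return len(regressions(baseline, candidate)) <= max_regressions


# Illustrative dataset; the column names are an assumption for this sketch.
CSV_DATA = "id,input,expected\nc1,2+2,4\nc2,3+3,6\nc3,10-1,9\n"
cases = load_cases(CSV_DATA, "csv")

# Stand-ins for real prompt or agent variants: fixed lookup tables.
baseline_answers = {"2+2": "4", "3+3": "6", "10-1": "9"}
candidate_answers = {"2+2": "4", "3+3": "7", "10-1": "9"}

baseline = run_variant(cases, lambda prompt: baseline_answers[prompt], exact_match)
candidate = run_variant(cases, lambda prompt: candidate_answers[prompt], exact_match)

print("regressed cases:", regressions(baseline, candidate))
print("release allowed:", gate(baseline, candidate))
```

In this sketch the candidate changes its answer on case `c2`, so that case is reported as a regression and the default gate (zero regressions allowed) blocks the release. Raising `max_regressions` loosens the threshold.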