EvalOps Workbench

Developer Tool · LLM

Built by eleventh.

Prototype Run

A concrete baseline vs candidate evaluation story using the bundled support QA dataset.

Prototype · Gate pass · Support QA dataset

This repo now demonstrates a real eval loop, not just a product thesis.

The example path starts with a weak baseline variant, upgrades to a better candidate, then proves the change with a score delta, a pass-rate improvement, and case-level traceability.

Release gate verdict

Average score: 1.000 (+0.667)
Pass rate: 4 / 4 (+4 cases)
Regressions: 0 (clean)
Gate verdict: pass (ship)

Case-by-case deltas
Each case records exactly what the candidate restored or clarified.

refund_policy (billing): +0.333
Baseline: 0.667 · Candidate: 1.000 · Recovered detail: complete

Candidate adds the support escalation detail that makes the answer operational.
Baseline missed: support@acme.test

sla_enterprise (support): +0.667
Baseline: 0.333 · Candidate: 1.000 · Recovered detail: complete

Baseline is vague. Candidate restores the exact SLA and dedicated escalation channel.
Baseline missed: four-hour, Slack

seat_upgrade (self-serve): +0.667
Baseline: 0.333 · Candidate: 1.000 · Recovered detail: complete

Candidate turns a directional answer into a workflow answer with timing and billing specifics.
Baseline missed: prorated, billing settings

data_residency (security): +1.000
Baseline: 0.000 · Candidate: 1.000 · Recovered detail: complete

Candidate is compliant because it states the current limit, not just the desired future state.
Baseline missed: roadmap, not generally available
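
The headline numbers in the release gate verdict can be recomputed from the four case scores listed above. The sketch below is illustrative only and is not the tool's implementation: the `CaseResult` type, the `summarize` helper, and the pass threshold of 1.0 are assumptions (the page does not say what score counts as a pass, though a threshold above 0.667 is needed to explain a baseline pass count of zero).

```python
from dataclasses import dataclass

# Assumption: a case passes only with a perfect score. The page does not
# state the real threshold used by the tool.
PASS_THRESHOLD = 1.0


@dataclass(frozen=True)
class CaseResult:
    """Hypothetical container for one case's baseline and candidate scores."""
    case_id: str
    baseline: float
    candidate: float


# Scores copied from the case-by-case deltas shown on the page.
CASES = [
    CaseResult("refund_policy", 0.667, 1.000),
    CaseResult("sla_enterprise", 0.333, 1.000),
    CaseResult("seat_upgrade", 0.333, 1.000),
    CaseResult("data_residency", 0.000, 1.000),
]


def summarize(cases):
    """Aggregate per-case scores into the comparison metrics."""
    n = len(cases)
    base_avg = sum(c.baseline for c in cases) / n
    cand_avg = sum(c.candidate for c in cases) / n
    base_pass = sum(c.baseline >= PASS_THRESHOLD for c in cases)
    cand_pass = sum(c.candidate >= PASS_THRESHOLD for c in cases)
    return {
        "candidate_avg": cand_avg,
        "score_delta": cand_avg - base_avg,
        "candidate_passes": cand_pass,
        "pass_count_delta": cand_pass - base_pass,
        "pass_rate_delta": (cand_pass - base_pass) / n,
        # A regression is any case where the candidate scores below baseline.
        "regressions": sum(c.candidate < c.baseline for c in cases),
        "improvements": sum(c.candidate > c.baseline for c in cases),
    }


summary = summarize(CASES)
print(f"score delta: {summary['score_delta']:+.3f}")
print(f"regressions: {summary['regressions']}")
```

Under these assumptions the sketch reproduces the figures shown on the page: a candidate average of 1.000, a score delta of +0.667, four passing cases, and zero regressions.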
CLI flow
The bundled dataset is enough to run the full loop locally.
```bash
uv run evalops-workbench run --dataset examples/support_qa.json --variant prompt_v1 --format json

uv run evalops-workbench run --dataset examples/support_qa.json --variant prompt_v2 --format json

uv run evalops-workbench compare --base run_001 --candidate run_002 --format markdown --report reports/comparison.md

uv run evalops-workbench gate --base run_001 --candidate run_002 --max-regressions 0 --max-score-drop 0 --max-pass-rate-drop 0

uv run evalops-workbench show --run run_002 --limit 4
```
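
The three `gate` thresholds (`--max-regressions`, `--max-score-drop`, `--max-pass-rate-drop`) suggest a simple rule: fail if any limit is exceeded. The following is a minimal sketch of how such a check might work, not the tool's actual logic. `GateLimits` and `evaluate_gate` are hypothetical names, and treating a "drop" as the negative part of a delta is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateLimits:
    """Hypothetical mirror of the three gate flags; defaults match the example."""
    max_regressions: int = 0
    max_score_drop: float = 0.0
    max_pass_rate_drop: float = 0.0


def evaluate_gate(score_delta, pass_rate_delta, regressions, limits=GateLimits()):
    """Return ("pass" | "fail", reasons) for a candidate-minus-base comparison."""
    # Assumption: a drop is the negative part of a delta, so improvements
    # count as a drop of zero.
    score_drop = max(0.0, -score_delta)
    pass_rate_drop = max(0.0, -pass_rate_delta)

    reasons = []
    if regressions > limits.max_regressions:
        reasons.append(f"regressions {regressions} > {limits.max_regressions}")
    if score_drop > limits.max_score_drop:
        reasons.append(f"score drop {score_drop:.3f} > {limits.max_score_drop:.3f}")
    if pass_rate_drop > limits.max_pass_rate_drop:
        reasons.append(
            f"pass-rate drop {pass_rate_drop:.3f} > {limits.max_pass_rate_drop:.3f}"
        )
    return ("fail" if reasons else "pass", reasons)


# Values from the example comparison on this page.
verdict, reasons = evaluate_gate(score_delta=0.667, pass_rate_delta=1.0, regressions=0)
print(verdict)
```

With the page's values (score delta +0.667, pass-rate delta +1.000, zero regressions) and all limits at 0, nothing is exceeded, which is consistent with the `pass` verdict shown.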
Saved report
`compare --report` and `gate --report` can emit a shareable artifact for review threads or CI logs.
```md
# EvalOps Gate PASS

- Max regressions: `0`
- Max score drop: `0.000`
- Max pass-rate drop: `0.000`

## Comparison

- Base run: `prompt_v1`
- Candidate run: `prompt_v2`
- Average score delta: `+0.667`
- Pass-rate delta: `+1.000`
- Regressions: `0`
- Improvements: `4`
```
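
For readers who want to produce a similar artifact from their own numbers, the layout of the report above can be approximated with a small formatter. This is a sketch that copies the visible layout only. `render_gate_report` and its dictionary keys are hypothetical and are not part of the `evalops-workbench` CLI.

```python
def render_gate_report(verdict, limits, comparison):
    """Render a Markdown gate report in the layout shown above.

    `limits` and `comparison` are plain dicts with hypothetical keys chosen
    for this sketch.
    """
    lines = [
        f"# EvalOps Gate {verdict.upper()}",
        "",
        f"- Max regressions: `{limits['max_regressions']}`",
        f"- Max score drop: `{limits['max_score_drop']:.3f}`",
        f"- Max pass-rate drop: `{limits['max_pass_rate_drop']:.3f}`",
        "",
        "## Comparison",
        "",
        f"- Base run: `{comparison['base']}`",
        f"- Candidate run: `{comparison['candidate']}`",
        # The "+" format flag always prints a sign, matching `+0.667` above.
        f"- Average score delta: `{comparison['score_delta']:+.3f}`",
        f"- Pass-rate delta: `{comparison['pass_rate_delta']:+.3f}`",
        f"- Regressions: `{comparison['regressions']}`",
        f"- Improvements: `{comparison['improvements']}`",
    ]
    return "\n".join(lines)


report = render_gate_report(
    "pass",
    {"max_regressions": 0, "max_score_drop": 0.0, "max_pass_rate_drop": 0.0},
    {
        "base": "prompt_v1",
        "candidate": "prompt_v2",
        "score_delta": 0.667,
        "pass_rate_delta": 1.0,
        "regressions": 0,
        "improvements": 4,
    },
)
print(report)
```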