A concrete baseline vs candidate evaluation story using the bundled support QA dataset.
The example path starts with a weak baseline variant, upgrades to a better candidate, then proves the change with a score delta, a pass-rate improvement, and case-level traceability.
Average score
1.000
+0.667
Pass rate
4 / 4
+4 cases
Regressions
0
clean
Gate verdict
pass
ship
refund_policy
Baseline
0.667
Candidate
1.000
Recovered detail
complete
Candidate adds the support escalation detail that makes the answer operational.
sla_enterprise
Baseline
0.333
Candidate
1.000
Recovered detail
complete
Baseline is vague. Candidate restores the exact SLA and dedicated escalation channel.
seat_upgrade
Baseline
0.333
Candidate
1.000
Recovered detail
complete
Candidate turns a directional answer into a workflow answer with timing and billing specifics.
data_residency
Baseline
0.000
Candidate
1.000
Recovered detail
complete
Candidate is compliant because it states the current limit, not just the desired future state.
uv run evalops-workbench run --dataset examples/support_qa.json --variant prompt_v1 --format jsonuv run evalops-workbench run --dataset examples/support_qa.json --variant prompt_v2 --format jsonuv run evalops-workbench compare --base run_001 --candidate run_002 --format markdown --report reports/comparison.mduv run evalops-workbench gate --base run_001 --candidate run_002 --max-regressions 0 --max-score-drop 0 --max-pass-rate-drop 0uv run evalops-workbench show --run run_002 --limit 4# EvalOps Gate PASS
- Max regressions: `0`
- Max score drop: `0.000`
- Max pass-rate drop: `0.000`
## Comparison
- Base run: `prompt_v1`
- Candidate run: `prompt_v2`
- Average score delta: `+0.667`
- Pass-rate delta: `+1.000`
- Regressions: `0`
- Improvements: `4`