Prototype Run

A concrete baseline vs candidate evaluation story using the bundled support QA dataset.

Prototype

Gate pass

Support QA dataset

This repo now demonstrates a real eval loop, not just a product thesis.

The example path starts with a weak baseline variant, upgrades to a better candidate, then proves the change with a score delta, a pass-rate improvement, and case-level traceability.

Release gate verdict

- Average score: 1.000 (+0.667)
- Pass rate: 4 / 4 (+4 cases)
- Regressions: clean
- Gate verdict: pass (ship)
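As a rough check, the headline numbers can be recomputed from the per-case scores listed on this page. The sketch below is illustrative rather than the tool's own code: the `PASS_THRESHOLD` value is an assumption (the page does not state the pass criterion), chosen because it is consistent with the baseline passing none of the four cases.

```python
from statistics import mean

# Per-case scores copied from the case-by-case section of this page.
baseline = {
    "refund_policy": 0.667,
    "sla_enterprise": 0.333,
    "seat_upgrade": 0.333,
    "data_residency": 0.000,
}
candidate = {name: 1.000 for name in baseline}

# Assumption: a case passes only at a perfect score. The real threshold is not stated.
PASS_THRESHOLD = 1.0


def pass_count(scores):
    """Count the cases whose score meets the assumed pass threshold."""
    return sum(1 for score in scores.values() if score >= PASS_THRESHOLD)


avg_delta = mean(candidate.values()) - mean(baseline.values())
pass_delta = pass_count(candidate) - pass_count(baseline)
# A regression here means the candidate scored lower than the baseline on a case.
regressions = [name for name in baseline if candidate[name] < baseline[name]]

print(f"average score delta: {avg_delta:+.3f}")   # +0.667
print(f"pass count delta: {pass_delta:+d} cases")  # +4 cases
print(f"regressions: {len(regressions)}")          # 0
```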

Case-by-case deltas

Each case records exactly what the candidate restored or clarified.

refund_policy (billing): +0.333

- Baseline: 0.667
- Candidate: 1.000
- Recovered detail: complete
- Baseline missed: support@acme.test

Candidate adds the support escalation detail that makes the answer operational.

sla_enterprise (support): +0.667

- Baseline: 0.333
- Candidate: 1.000
- Recovered detail: complete
- Baseline missed: four-hour, Slack

Baseline is vague. Candidate restores the exact SLA and dedicated escalation channel.

seat_upgrade (self-serve): +0.667

- Baseline: 0.333
- Candidate: 1.000
- Recovered detail: complete
- Baseline missed: prorated, billing settings

Candidate turns a directional answer into a workflow answer with timing and billing specifics.

data_residency (security): +1.000

- Baseline: 0.000
- Candidate: 1.000
- Recovered detail: complete
- Baseline missed: roadmap, not generally available

Candidate is compliant because it states the current limit, not just the desired future state.
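The per-case scores and "baseline missed" phrases above are consistent with a phrase-coverage scorer, where a score is the fraction of required phrases found in the answer. The page does not confirm that this is how the tool scores, so treat the following as a sketch under that assumption. The function name `coverage_score` and both answer strings are hypothetical; only the two required phrases come from the data_residency case.

```python
def coverage_score(answer: str, required_phrases: list[str]) -> tuple[float, list[str]]:
    """Return (fraction of required phrases found, list of phrases that were missed)."""
    if not required_phrases:
        # Nothing to check, so treat the answer as fully covered.
        return 1.0, []
    text = answer.lower()
    missed = [p for p in required_phrases if p.lower() not in text]
    found = len(required_phrases) - len(missed)
    return round(found / len(required_phrases), 3), missed


# The required phrases are taken from the data_residency case above.
required = ["roadmap", "not generally available"]

# Hypothetical answers, written only to exercise the scorer.
baseline_answer = "EU data residency is available on all plans."
candidate_answer = "EU data residency is on the roadmap and is not generally available today."

print(coverage_score(baseline_answer, required))
print(coverage_score(candidate_answer, required))
```

Under this assumption, a baseline score of 0.000 with two missed phrases and a candidate score of 1.000 with none missed would match the data_residency case.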

CLI flow

The bundled dataset is enough to run the full loop locally.

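```bash
# Run the baseline variant against the bundled dataset.
uv run evalops-workbench run --dataset examples/support_qa.json --variant prompt_v1 --format json

# Run the candidate variant against the same dataset.
uv run evalops-workbench run --dataset examples/support_qa.json --variant prompt_v2 --format json

# Compare the two runs and save a markdown report.
uv run evalops-workbench compare --base run_001 --candidate run_002 --format markdown --report reports/comparison.md

# Gate the candidate: allow no regressions, no score drop, and no pass-rate drop.
uv run evalops-workbench gate --base run_001 --candidate run_002 --max-regressions 0 --max-score-drop 0 --max-pass-rate-drop 0

# Inspect the candidate run, limited to four entries.
uv run evalops-workbench show --run run_002 --limit 4
```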
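The three `gate` thresholds can be read as independent limits, any one of which fails the gate when exceeded. The sketch below illustrates that reading and is not the tool's implementation: the `Comparison` dataclass, the `gate_failures` function, and the failure messages are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    """Hypothetical summary of a base-versus-candidate comparison."""
    regressions: int        # cases where the candidate scored lower than the base
    score_delta: float      # candidate average score minus base average score
    pass_rate_delta: float  # candidate pass rate minus base pass rate


def gate_failures(cmp, max_regressions=0, max_score_drop=0.0, max_pass_rate_drop=0.0):
    """Return the reasons the gate fails; an empty list means the gate passes."""
    failures = []
    if cmp.regressions > max_regressions:
        failures.append(f"regressions {cmp.regressions} > {max_regressions}")
    # A drop is a negative delta, so negate the delta before comparing to the limit.
    if -cmp.score_delta > max_score_drop:
        failures.append(f"score drop {-cmp.score_delta:.3f} > {max_score_drop:.3f}")
    if -cmp.pass_rate_delta > max_pass_rate_drop:
        failures.append(f"pass-rate drop {-cmp.pass_rate_delta:.3f} > {max_pass_rate_drop:.3f}")
    return failures


# Values from the example on this page: no regressions, +0.667 score, +1.000 pass rate.
example = Comparison(regressions=0, score_delta=0.667, pass_rate_delta=1.0)
print("PASS" if not gate_failures(example) else "FAIL")
```

With all three limits set to zero, as in the command above, any regression or any drop in average score or pass rate would fail the gate under this reading.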

Saved report

`compare --report` and `gate --report` can emit a shareable artifact for review threads or CI logs.

# EvalOps Gate PASS

- Max regressions: `0`
- Max score drop: `0.000`
- Max pass-rate drop: `0.000`

## Comparison

- Base run: `prompt_v1`
- Candidate run: `prompt_v2`
- Average score delta: `+0.667`
- Pass-rate delta: `+1.000`
- Regressions: `0`
- Improvements: `4`
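For readers who want to produce a similar artifact in their own tooling, a report with this layout could be rendered with a small helper like the one below. This is a sketch only: `render_gate_report` is a hypothetical function, and the `FAIL` label is an assumption, since only a passing report appears on this page.

```python
def render_gate_report(*, passed, max_regressions, max_score_drop, max_pass_rate_drop,
                       base, candidate, score_delta, pass_rate_delta,
                       regressions, improvements):
    """Render a markdown gate report that mirrors the layout of the sample above."""
    verdict = "PASS" if passed else "FAIL"  # "FAIL" is an assumed label
    lines = [
        f"# EvalOps Gate {verdict}",
        "",
        f"- Max regressions: `{max_regressions}`",
        f"- Max score drop: `{max_score_drop:.3f}`",
        f"- Max pass-rate drop: `{max_pass_rate_drop:.3f}`",
        "",
        "## Comparison",
        "",
        f"- Base run: `{base}`",
        f"- Candidate run: `{candidate}`",
        f"- Average score delta: `{score_delta:+.3f}`",
        f"- Pass-rate delta: `{pass_rate_delta:+.3f}`",
        f"- Regressions: `{regressions}`",
        f"- Improvements: `{improvements}`",
    ]
    return "\n".join(lines) + "\n"


report = render_gate_report(
    passed=True, max_regressions=0, max_score_drop=0.0, max_pass_rate_drop=0.0,
    base="prompt_v1", candidate="prompt_v2", score_delta=0.667, pass_rate_delta=1.0,
    regressions=0, improvements=4,
)
print(report)
```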
