A local-first evaluation harness for prompts, tools, and agents with regression tracking and experiment history.
Problem. LLM teams lack a lightweight way to compare prompt and tool changes before shipping.
Why now. Evaluation is moving from optional best practice to baseline engineering hygiene.
Last pass rate
Candidate, latest run
Pass rate · 7d
Rolling mean
Regressions · 30d
Distinct cases caught
Eval runs
Recorded runs
Mode
live
Tier A · live workload
Last eval run
never
Last deploy
never
Schema
v1
public contract
Stack