I A/B Test My Prompts Like a Scientist

Most teams evolve prompts by feel. Change something, eyeball the output, ship it if nobody screams. This is alchemy, not engineering. When your system is non-deterministic, a single successful run proves nothing. I built an eval harness that runs prompt versions head-to-head — 10 runs each, scored on behavior, measured on cost and latency. No vibes. No guessing. Just data that tells you exactly what improved, what regressed, and what it costs.


[Image: A scientist's workstation with two glowing experiment chambers side by side, each containing a different luminous prompt version being tested]

The Vibes Problem

You changed the system prompt. The agent seems smarter now. It answered that one tricky question correctly. Ship it?

No. You ran it once. LLMs are non-deterministic. That “improvement” might be a coin flip you happened to win. Tomorrow it fails. Next week it regresses on a behavior you never thought to check. And you’ll never know, because you tested like someone tasting soup — one spoonful, gut feel, call it done.

Single-run validation is the unit-test equivalent of console.log("works"). It tells you nothing about reliability.

The problem gets worse at scale. An agent with three critical behaviors means three dimensions of quality to track across every prompt change. Multiply by the variance of non-deterministic outputs, and you’re navigating a fog bank with no instruments.


The Eval Stack

Here’s what I actually built. Hand-rolled. No framework. No SaaS eval platform. Just a harness that runs from my terminal with bun eval.

The architecture:

  1. Mocked externals — Every MCP server and external tool the agent calls gets a mock. The evals run fast and cheap, isolated from real infrastructure.
  2. Versioned prompts — Each prompt version is a named artifact. v0.1, v0.2, whatever. The harness runs both against the same eval suite.
  3. N-run execution — 10 runs per prompt per eval. Non-deterministic systems need statistical sampling. One run is an anecdote. Ten runs is a distribution.
  4. Behavioral scoring — Each eval defines specific criteria. Did the agent find the candidate? Did it return the correct name? Did it paginate the chart? Binary pass/fail per criterion, aggregated across runs.
  5. Cost tracking — Total tokens consumed, agent steps performed, wall clock duration. Every run logged, every metric compared.

The output is a side-by-side comparison: Prompt A vs. Prompt B across every eval, every metric, every run.
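The steps above can be sketched in TypeScript. Everything here is hypothetical scaffolding, since the post doesn't show the real harness: `runEval`, `summarize`, `compare`, and the `Agent` type are stand-ins, and the agent call is where the real harness would invoke the LLM against mocked MCP servers and tools.

```typescript
// Hypothetical shapes -- the real harness's types aren't shown in the post.
type Criterion = { name: string; check: (output: string) => boolean };
type EvalCase = { name: string; input: string; criteria: Criterion[] };
type RunResult = { passed: boolean; tokens: number; ms: number };

// Stand-in for the agent call; in the real harness this hits the LLM
// with every external tool mocked out.
type Agent = (prompt: string, input: string) => { output: string; tokens: number };

// N-run execution: same prompt, same eval, n independent runs.
function runEval(agent: Agent, prompt: string, evalCase: EvalCase, n = 10): RunResult[] {
  const results: RunResult[] = [];
  for (let i = 0; i < n; i++) {
    const start = Date.now();
    const { output, tokens } = agent(prompt, evalCase.input);
    // Binary pass/fail: every criterion must hold for the run to pass.
    const passed = evalCase.criteria.every((c) => c.check(output));
    results.push({ passed, tokens, ms: Date.now() - start });
  }
  return results;
}

// Aggregate a distribution of runs into comparable numbers.
function summarize(results: RunResult[]) {
  const n = results.length;
  return {
    passRate: results.filter((r) => r.passed).length / n,
    avgTokens: results.reduce((s, r) => s + r.tokens, 0) / n,
    avgMs: results.reduce((s, r) => s + r.ms, 0) / n,
  };
}

// Head-to-head: same eval suite, both prompt versions, side by side.
function compare(agent: Agent, suite: EvalCase[], promptA: string, promptB: string) {
  return suite.map((e) => ({
    eval: e.name,
    a: summarize(runEval(agent, promptA, e)),
    b: summarize(runEval(agent, promptB, e)),
  }));
}
```

Binary pass/fail per criterion is the design choice doing the work here: it keeps scores comparable across runs and versions, so the comparison reduces to pass rates rather than subjective grades.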


What the Data Actually Showed

I was evolving a prompt across two versions to improve three specific behaviors: thorough research, correct chart pagination, and graceful handling of typos in names.

Here’s what the A/B comparison revealed:

Name correction (typo handling):

  • v0.1: 0/10 passed. Failed every single time.
  • v0.2: 10/10 passed. Perfect.
  • A 100-percentage-point improvement.

Chart pagination:

  • v0.1: Passed consistently.
  • v0.2: Passed consistently.
  • Zero change. No improvement, no regression.

Retry on empty results:

  • v0.1: 2/10 passed. Failed 80% of the time.
  • v0.2: 10/10 passed. Perfect.
  • An 80-percentage-point improvement.
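The aggregate numbers fall straight out of those pass counts. A quick sanity check of the arithmetic, treating "passed consistently" as 10/10:

```typescript
// Pass counts out of 10 runs per behavior, per prompt version
// (taken from the results above).
const behaviors = [
  { name: "name correction", v01: 0, v02: 10 },
  { name: "chart pagination", v01: 10, v02: 10 },
  { name: "retry on empty results", v01: 2, v02: 10 },
];

// Average score: total passes over total runs across all behaviors.
const avgScore = (key: "v01" | "v02") =>
  behaviors.reduce((s, b) => s + b[key], 0) / (behaviors.length * 10);

// Per-behavior improvement in percentage points of pass rate.
const deltaPoints = behaviors.map((b) => ({
  name: b.name,
  points: (b.v02 - b.v01) * 10,
}));

console.log(avgScore("v01")); // 0.4
console.log(avgScore("v02")); // 1.0
```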

The Trade-Off Table

[Infographic: A/B eval results showing correctness up, tokens up, latency up — with the thesis that trade-offs should be visible, not accidental]

Average score jumped from 0.4 to 1.0. The new prompt is categorically more correct. But it’s not free.

v0.2 consumes an average of 680 more tokens per run. It takes 2.8 seconds longer wall clock time. The agent uses more search calls, more tool invocations, more steps.
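Computing that cost delta is the boring half of the harness, and that's the point: it's just averages over logged runs. A sketch with a hypothetical `RunMetrics` shape, using illustrative numbers chosen to mirror the +680-token, +2.8 s delta above:

```typescript
// Per-run cost metrics logged by the harness (hypothetical shape).
type RunMetrics = { tokens: number; steps: number; ms: number };

const avg = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;

// Average delta between two sets of runs: positive means the new
// prompt version costs more on that axis.
function costDelta(oldRuns: RunMetrics[], newRuns: RunMetrics[]) {
  return {
    tokens: avg(newRuns.map((r) => r.tokens)) - avg(oldRuns.map((r) => r.tokens)),
    steps: avg(newRuns.map((r) => r.steps)) - avg(oldRuns.map((r) => r.steps)),
    ms: avg(newRuns.map((r) => r.ms)) - avg(oldRuns.map((r) => r.ms)),
  };
}

// Illustrative runs: v0.2 searches more, retries more, takes longer.
const delta = costDelta(
  [{ tokens: 1000, steps: 5, ms: 2000 }, { tokens: 1200, steps: 5, ms: 2200 }],
  [{ tokens: 1700, steps: 8, ms: 4900 }, { tokens: 1860, steps: 8, ms: 4900 }],
);
// delta.tokens === 680, delta.steps === 3, delta.ms === 2800
```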

The new prompt works more deeply. It researches more thoroughly, retries when it hits dead ends, and corrects mistakes on the way out. That costs compute. That costs time.

And that’s exactly the kind of trade-off you should be making intentionally.

Without the eval harness, you’d see “it works better now” and ship it. You’d never quantify the cost delta. You’d never know you traded 2.8 seconds of latency for 60 percentage points of correctness. You’d be making the trade-off accidentally — which means you’re not really making it at all.


Why 10 Runs

Classic software tests are deterministic. Run once, pass or fail, done. AI agents don’t work that way. The same prompt, same input, same temperature can produce different tool call sequences, different reasoning paths, different answers.

One run tells you what can happen. Ten runs tell you what does happen.

Ten is my baseline for standard behaviors. For critical paths — anything touching money, user data, or irreversible actions — go higher. Twenty, fifty, whatever it takes to trust the distribution.

The eval harness runs all ten in sequence, aggregates the scores, and shows you the pass rate. Not “did it work?” but “how often does it work?” That’s the question that matters for production systems.
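The arithmetic behind that baseline is simple binomial sampling (my framing, not the post's): a behavior that fails 10% of the time slips past a single run 90% of the time, but past ten runs only about 35% of the time.

```typescript
// Fraction of runs that passed: "how often does it work?"
function passRate(runs: boolean[]): number {
  return runs.filter(Boolean).length / runs.length;
}

// Probability that n runs ALL pass despite a true failure rate --
// i.e. the chance your eval misses the flaw entirely.
function missProbability(failRate: number, n: number): number {
  return (1 - failRate) ** n;
}

missProbability(0.1, 1);  // 0.9 -- one run usually hides a 10% flaw
missProbability(0.1, 10); // ~0.35 -- ten runs usually expose it
```

This is also why critical paths deserve more runs: pushing n to 20 or 50 shrinks the miss probability toward zero for the failure rates you care about.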


Build the Harness, Not the Habit

The alternative to an eval stack is the prompt tweak treadmill. You change the prompt, run it a few times, convince yourself it’s better, push it to production, get surprised by failures, tweak again. This cycle never ends because you never had a baseline to begin with.

An eval harness converts prompt evolution from an art into a science. You state your hypothesis (v0.2 will handle typos better), you run the experiment (10 runs against the typo eval), and you read the results (100% pass rate, +680 tokens, +2.8s).
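That hypothesis-experiment-readout loop can even be encoded, so the verdict is computed rather than eyeballed. A minimal sketch, with hypothetical shapes:

```typescript
// Hypothetical record of one experiment: hypothesis in, verdict out.
type Experiment = {
  hypothesis: string;
  passRate: { before: number; after: number };
};

// Confirmed only if the pass rate actually moved up; a drop is a
// regression worth investigating, not just a failed hypothesis.
function verdict(e: Experiment): "confirmed" | "regressed" | "no-change" {
  if (e.passRate.after > e.passRate.before) return "confirmed";
  if (e.passRate.after < e.passRate.before) return "regressed";
  return "no-change";
}
```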

The results might confirm your hypothesis. They might reject it. They might reveal a regression you didn’t anticipate. All of those outcomes are valuable — but only if you measure.


I A/B Test My Prompts

Prompt engineering without evals is just prompt guessing. You can iterate fast, but you can’t iterate forward without measurement.

Build the harness. Mock the tools. Run the versions head-to-head. Read the data.

Then ship with evidence, not vibes.