Engineering · May 21, 2026 · 9 min read

How we grade a multi-agent system.

The hardest engineering problem in multi-agent AI isn't building the agents — it's knowing whether they're getting better. Agent output is non-deterministic, so you can't diff it. Here's the eval harness we run at GOGOGO: four grader classes, the rule that every agent step is scored, and why a failing eval is a feature.

Atakan Özalan

Co-founder & engineering lead, GOGOGO LLC

If you ask most teams building multi-agent AI what their hardest problem is, they'll say orchestration, or retrieval, or cost. Those are hard. But the genuinely hardest problem — the one that decides whether the company survives — is evaluation: knowing whether a change made the system better or worse.

With ordinary software you know. The test suite is green or red. The diff is reviewable. With a multi-agent system the output is non-deterministic — ask the same agent the same question twice and you get two different valid answers. You can't diff non-deterministic output. So 'did this change help?' becomes a real measurement problem, and most teams answer it by vibe — they ship, they eyeball a few runs, they hope. That doesn't scale past about ten customers. This is the eval harness we run at GOGOGO LLC instead.

The core rule: every agent step is graded

The harness isn't a separate test file you run before release. It's part of the runtime. Every single agent invocation — in development, in CI, and a sampled fraction in production — gets graded automatically by a panel of evaluators. The grade is attached to the run's trace_id and written to the event log alongside the output.
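As a rough sketch of the "everything in development and CI, a sampled fraction in production" rule: one way to make the production sample stable is to hash the trace_id, so the same run always gets the same decision. The function name, the environment labels, and the 10% default rate are assumptions for illustration, not the values we run.

```python
import hashlib


def should_grade(trace_id: str, env: str, prod_sample_rate: float = 0.1) -> bool:
    """Grade every run outside production; grade a stable sample inside it.

    `env` labels and the default sample rate are illustrative assumptions.
    """
    if env != "production":
        return True
    # Hash the trace_id so the decision is reproducible for a given run.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < prod_sample_rate


print(should_grade("trace-123", "ci"))  # every CI run is graded
```

Hashing rather than calling a random number generator means a replayed run lands in the same sample as the original.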

This matters because it changes what an eval is. An eval isn't a gate you pass once before shipping. It's a continuous property of every run the system has ever done. When a customer reports a bad result, we don't reproduce-and-guess — we pull the trace_id, read the grades on every step of that exact run, and see which agent failed which grader. The eval harness is also the debugger.
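To make the "pull the trace_id, read the grades" step concrete, here is a minimal sketch that assumes the event log is a stream of JSON lines with one grade record per grader per step. The record fields and the `failing_steps` helper are hypothetical; only trace_id comes from the description above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GradeEvent:
    """One grader verdict on one agent step (field names are assumptions)."""
    trace_id: str
    step: int
    agent: str
    grader: str
    passed: bool


def failing_steps(event_log_lines, trace_id):
    """Return sorted (step, agent, grader) tuples for failed grades on one run."""
    failures = []
    for line in event_log_lines:
        event = json.loads(line)
        if event["trace_id"] == trace_id and not event["passed"]:
            failures.append((event["step"], event["agent"], event["grader"]))
    return sorted(failures)


# A tiny in-memory stand-in for the event log.
log = [
    json.dumps(asdict(GradeEvent("t-1", 0, "retrieval", "schema", True))),
    json.dumps(asdict(GradeEvent("t-1", 1, "generator", "hallucination", False))),
    json.dumps(asdict(GradeEvent("t-2", 0, "retrieval", "grounding", False))),
]
print(failing_steps(log, "t-1"))  # [(1, 'generator', 'hallucination')]
```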

The four grader classes

We run four kinds of grader. Every agent output passes through whichever of the four apply to its type.

1 · Schema validity

The cheapest, strictest, and most-skipped grader. Every agent in the runtime declares a typed output contract — a pydantic schema. The schema-validity grader checks the literal output against it: right fields, right types, no nulls where nulls aren't allowed, enums in range. It's a boolean. It runs on 100% of calls because it's nearly free. A shocking fraction of 'the AI is wrong' incidents are actually 'the AI returned valid-looking text that doesn't parse' — schema validity catches those before they're ever a customer problem.
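A minimal sketch of a boolean schema-validity grader follows. The runtime described above uses pydantic schemas; this version swaps in a hand-rolled standard-library check so it runs without dependencies, and the contract fields (`answer`, `verdict`, `note`) are invented for illustration.

```python
import json
from enum import Enum


class Verdict(Enum):
    APPROVE = "approve"
    REJECT = "reject"


# Hypothetical output contract: field name -> (expected type, nullable?)
CONTRACT = {
    "answer": (str, False),
    "verdict": (Verdict, False),
    "note": (str, True),
}


def schema_valid(raw: str) -> bool:
    """Return True only if `raw` parses and matches CONTRACT exactly."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False  # valid-looking text that doesn't parse
    if not isinstance(data, dict) or set(data) != set(CONTRACT):
        return False  # missing or unexpected fields
    for name, (expected, nullable) in CONTRACT.items():
        value = data[name]
        if value is None:
            if not nullable:
                return False
            continue
        if issubclass(expected, Enum):
            # Enum fields must carry one of the declared values.
            if value not in [member.value for member in expected]:
                return False
        elif not isinstance(value, expected):
            return False
    return True


print(schema_valid('{"answer": "ok", "verdict": "approve", "note": null}'))  # True
print(schema_valid('{"answer": "ok", "verdict": "maybe", "note": null}'))    # False
```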

2 · Grounding

For any agent that retrieves — the retrieval and reranker agents — the grounding grader asks: is every factual claim in the output actually supported by a retrieved document? It's run by a separate, smaller model whose only job is the entailment check: claim X, source Y, does Y support X — yes or no. Grounding failures are the early-warning signal for retrieval drift. When the corpus changes and grounding scores drop, we know before the customer does.
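The shape of the grounding check can be sketched as below, assuming the output has already been split into individual claims and that the entailment model is exposed as a yes/no callable. `toy_entails` is a deliberately crude substring stand-in for that model, and all names here are hypothetical.

```python
from typing import Callable, Sequence


def grounding_score(
    claims: Sequence[str],
    sources: Sequence[str],
    entails: Callable[[str, str], bool],
) -> float:
    """Fraction of claims supported by at least one retrieved source."""
    if not claims:
        return 1.0  # nothing asserted, nothing ungrounded
    supported = sum(
        1 for claim in claims if any(entails(claim, src) for src in sources)
    )
    return supported / len(claims)


def toy_entails(claim: str, source: str) -> bool:
    """Stand-in for the small entailment model: case-insensitive substring."""
    return claim.lower() in source.lower()


sources = ["Observations show the sky is blue."]
claims = ["the sky is blue", "cats can fly"]
print(grounding_score(claims, sources, toy_entails))  # 0.5
```

Keeping the entailment model behind a plain callable means the scoring logic can be tested without loading a model.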

3 · Hallucination check

Distinct from grounding. Grounding asks 'is the claim supported?'; the hallucination grader asks 'did the agent invent an entity, a number, a capability, or a citation that doesn't exist anywhere?' This is run on generator agents — the ones producing free text. It is the single most important grader for customer trust, and it is the one teams most often skip because it makes the system look like it fails more often. It doesn't fail more often. The failures were always there; the grader makes them visible.
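One narrow slice of this grader can be sketched with plain string checks: flag any number or citation tag in the output that appears nowhere in the material the agent was given. This is an illustration under strong assumptions (citations look like `[doc1]`, numbers are matched literally), not the check we run; invented entities and capabilities need a model-based judge.

```python
import re
from typing import List, Set

NUMBER = re.compile(r"\d+(?:\.\d+)?")
CITATION = re.compile(r"\[(\w+)\]")  # assumed citation format, e.g. [doc1]


def invented_items(output: str, context: str, known_citations: Set[str]) -> List[str]:
    """List numbers and citation tags in `output` that have no known origin."""
    context_numbers = set(NUMBER.findall(context))
    # Strip citation tags first so the digit in "[doc1]" is not read as a number.
    prose = CITATION.sub(" ", output)
    invented = [n for n in NUMBER.findall(prose) if n not in context_numbers]
    invented += [
        f"[{tag}]" for tag in CITATION.findall(output) if tag not in known_citations
    ]
    return invented


context = "Revenue grew 12% year over year."
output = "Revenue grew 12% across 40 stores [doc1], see [doc9]."
print(invented_items(output, context, {"doc1"}))  # ['40', '[doc9]']
```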

4 · Replayability

The meta-grader. It asks: given this run's trace_id, can the entire run be re-executed deterministically from the logged inputs and produce the same trajectory (not necessarily the same text — the same sequence of agent calls and tool calls)? A run that can't be replayed can't be debugged, can't be audited, and can't be used as a regression fixture. Replayability failures are infrastructure bugs, and we treat them as release-blocking even when the output looked fine.
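The comparison at the heart of this grader can be sketched as follows, assuming each logged event records a `kind` ("agent" or "tool") and a `name`. Those field names are hypothetical. Generated text is deliberately left out of the comparison, since only the sequence of calls has to match.

```python
from typing import Dict, List, NamedTuple


class Call(NamedTuple):
    kind: str  # "agent" or "tool"
    name: str


def trajectory(events: List[Dict]) -> List[Call]:
    """Reduce a run's events to its ordered sequence of agent and tool calls."""
    return [
        Call(e["kind"], e["name"]) for e in events if e["kind"] in ("agent", "tool")
    ]


def replayable(logged: List[Dict], replayed: List[Dict]) -> bool:
    """True when the replay made the same calls in the same order."""
    return trajectory(logged) == trajectory(replayed)


logged = [
    {"kind": "agent", "name": "retrieval", "text": "found 3 docs"},
    {"kind": "tool", "name": "search", "text": ""},
    {"kind": "agent", "name": "generator", "text": "Answer A"},
]
# Same calls, different wording: still the same trajectory.
replayed = [dict(e, text="different wording") for e in logged]
print(replayable(logged, replayed))       # True
print(replayable(logged, replayed[:-1]))  # False
```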

How a change gets judged

When an engineer proposes a change — a new prompt, a swapped model, a re-tuned reranker — it doesn't ship because it looks better. It ships because it runs against a frozen eval set: a few hundred recorded real runs with known-good grades. The change is applied, the eval set is re-run, and the four grader scores are compared before vs after.

All four grader scores hold or improve → ship.
One score improves, another regresses → the change is a trade-off, and it goes to a human decision with the exact numbers on the table — never an eyeball.
Any score regresses with nothing improving → the change is rejected automatically. It does not matter how good the demo looked.
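The three rules above can be written as a small decision function. This is a sketch, assuming each grader reduces to one aggregate score per eval-set run where higher is better; the grader keys, the `tolerance` parameter, and the return labels are illustrative.

```python
from typing import Dict


def judge(before: Dict[str, float], after: Dict[str, float], tolerance: float = 0.0) -> str:
    """Apply the ship / human decision / reject rule to before-vs-after scores."""
    improved = [g for g in before if after[g] - before[g] > tolerance]
    regressed = [g for g in before if before[g] - after[g] > tolerance]
    if not regressed:
        return "ship"          # all scores hold or improve
    if improved:
        return "human_review"  # a trade-off: decide with the numbers on the table
    return "reject"            # something regressed and nothing improved


before = {"schema": 1.0, "grounding": 0.9, "hallucination": 0.95, "replay": 1.0}
print(judge(before, dict(before, grounding=0.93)))                      # ship
print(judge(before, dict(before, grounding=0.93, hallucination=0.9)))   # human_review
print(judge(before, dict(before, hallucination=0.9)))                   # reject
```

A nonzero `tolerance` would be one way to keep run-to-run noise from registering as a regression; the rule as stated corresponds to a tolerance of zero.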

This is the discipline that lets a small team ship four products on one runtime without the quality quietly eroding. Every change has a number attached. The numbers are comparable across Goddo, GoPeople, GoVista, and GoTrack because they all run the same harness.

Why a failing eval is a feature

The most common reason teams don't build a real eval harness is emotional, not technical. A good harness makes your system look worse — the dashboard lights up with failures. Engineers and founders don't want that dashboard, because it feels like the system is broken.

But the failures were always there. The harness didn't create them; it surfaced them. A multi-agent system without an eval harness isn't a system that fails less — it's a system that fails invisibly, which means it fails at the customer. A failing eval is the single cheapest place a failure can happen. It's a feature. The dashboard lighting up red in CI is the system working exactly as designed.

“You cannot improve what you cannot measure, and you cannot measure non-deterministic output by looking at it. The eval harness is not test infrastructure bolted onto a multi-agent system. It is the part of the system that lets the rest of it have a direction.”

Where to start if you have nothing

If you're building multi-agent AI and you have no eval harness, don't try to build all four graders at once. Start with schema validity — it's nearly free, it's boolean, and it will immediately catch a class of bug you're currently shipping to customers. Then add replayability, because without it you can't build a regression set. Grounding and hallucination come third and fourth, once you have a model budget for them. One grader running on 100% of calls beats four graders you're still designing.

If you want to compare eval-harness notes (grader design, sampling rates, how to build the frozen eval set), I'm easy to reach at atakanozalan.com or by email.