Engineering · May 6, 2026 · 11 min read

Shipping a multi-agent system isn’t shipping a chatbot.

We rewrote the orchestrator three times before we stopped treating it like a model. Here’s what changed when we started treating it like a runtime — typed hand-offs, replayable traces, and a strict contract between specialists and tools.

Atakan Özalan

Founder, GOGOGO LLC

When we shipped the first version of the GOGOGO orchestrator, we treated it like a smarter chatbot: one model, a long system prompt, a few tool calls. It worked in demos. It collapsed in production. Three rewrites later, the thing in front of you isn’t a chatbot at all; it’s a runtime. That distinction is the difference between a feature and an operating layer.

What broke in v1

The first orchestrator was a single LLM with tool access. We wrote prompts that asked it to ‘decide which specialist to call’ and ‘hand off the right context.’ Every demo passed. Every production run was a coin flip. The model would forget which specialist it had already called, repeat tool calls, or — worse — re-execute side-effecting calls because it lost track of the conversation graph.

The fix wasn’t a bigger context window. It was admitting that the orchestrator’s job is structural, not generative.

Treat hand-offs as a typed contract

In v2 we stopped letting the orchestrator write its own hand-off payloads. Every hand-off is now a typed schema, validated at the edge of the agent. Specialists don’t consume natural language; they consume structured input and emit structured output. The orchestrator’s only job is to pick the next specialist and route the typed payload.

// Hand-offs are values, not text.
type HandOff =
  | { to: "goddo.image.generate"; input: ImageRequest }
  | { to: "govista.schedule"; input: ScheduleRequest }
  | { to: "gopeople.classify"; input: ClassifyRequest }
  | { to: "gotrack.score"; input: ScoreRequest };

const next = orchestrator.decide(state);
assertHandOff(next); // throws on shape mismatch
await runtime.run(next);
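
A minimal sketch of what a check like assertHandOff might look like, assuming one validator per route keyed by the `to` field. The payload fields (`prompt`, `entityId`) and the `validators` table are hypothetical, since the real request types aren't shown here, and only two of the four routes are included to keep it short.

```typescript
// Hypothetical payload shapes; the real request types are not shown in the post.
type ImageRequest = { prompt: string };
type ScoreRequest = { entityId: string };

type HandOff =
  | { to: "goddo.image.generate"; input: ImageRequest }
  | { to: "gotrack.score"; input: ScoreRequest };

type Route = HandOff["to"];

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null;
}

// One payload check per route, keyed by the discriminant (assumed design).
const validators: Record<Route, (input: Record<string, unknown>) => boolean> = {
  "goddo.image.generate": (i) => typeof i.prompt === "string",
  "gotrack.score": (i) => typeof i.entityId === "string",
};

// Throws on an unknown route or on a payload that fails its route's check.
function assertHandOff(value: unknown): asserts value is HandOff {
  if (!isRecord(value)) {
    throw new TypeError("hand-off must be an object");
  }
  const { to, input } = value;
  if (typeof to !== "string" || !Object.prototype.hasOwnProperty.call(validators, to)) {
    throw new TypeError(`unknown specialist: ${String(to)}`);
  }
  const check = validators[to as Route];
  if (!isRecord(input) || !check(input)) {
    throw new TypeError(`invalid input for ${to}`);
  }
}
```

Because the function is declared with `asserts value is HandOff`, the compiler treats the value as a `HandOff` after the call, so the routing code that follows gets the narrowed type without a cast.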

Replay or it didn’t happen

The single biggest jump in reliability came from making every run replayable. Each step persists three things: the input it received, the decision it made, and the output it produced. With that, debugging a failed run isn’t a vibes exercise — it’s a diff between the working trace and the broken one.

What ‘replay’ actually means

Re-run the same agent with the same input and verify the output is structurally equivalent.
Substitute one specialist with a new version and replay the trace to surface regressions before deploy.
Roll back a tenant to the last known-good orchestrator graph by replaying their last 24h of runs against it.
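
One way to read ‘structurally equivalent’ is: same keys and same value types, with the concrete values ignored, since model output rarely matches byte for byte. That definition is an assumption, as are the names `TraceStep`, `shapeOf`, `sameShape` and `replay` in the sketch below. Agents are modelled as synchronous functions to keep it short; a real runtime would presumably await them.

```typescript
// One persisted step: the input it received, the decision it made, the output it produced.
interface TraceStep {
  agent: string;
  input: unknown;
  decision: string;
  output: unknown;
}

// Reduce a value to its shape: sorted keys and primitive type names, values dropped.
function shapeOf(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map(shapeOf);
  }
  if (typeof value === "object" && value !== null) {
    const record = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(record).sort()) {
      out[key] = shapeOf(record[key]);
    }
    return out;
  }
  return value === null ? "null" : typeof value;
}

// Two values are structurally equivalent if their shapes serialize identically.
function sameShape(a: unknown, b: unknown): boolean {
  return JSON.stringify(shapeOf(a)) === JSON.stringify(shapeOf(b));
}

// Synchronous stand-in for a specialist (assumption made for brevity).
type Agent = (input: unknown) => unknown;

// Re-run every recorded input and return the indices of steps whose output shape changed.
function replay(trace: TraceStep[], agents: Record<string, Agent>): number[] {
  const regressions: number[] = [];
  let index = 0;
  for (const step of trace) {
    const agent = agents[step.agent];
    const output = agent ? agent(step.input) : undefined;
    if (!sameShape(output, step.output)) {
      regressions.push(index);
    }
    index += 1;
  }
  return regressions;
}
```

Passing a new version of one specialist in `agents` and replaying a stored trace gives the list of steps to inspect before deploy. Arrays are compared element by element here, so a change in length counts as a shape change; a looser rule may suit outputs with variable-length lists.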

Tools are the runtime, not the model

We used to think of tools as ‘side quests’ the model could optionally take. Wrong frame. Tools are the runtime. The model decides; the tools do the work. Once we leaned into that, the orchestrator stopped trying to be clever about state and started trusting the tools as the source of truth.

Three rules we ended up with

Tools are idempotent or they’re labelled. The orchestrator never retries a non-idempotent tool without explicit policy.
Tools own their own retries. The orchestrator hands off and waits — it does not loop on flaky transports.
Tools emit structured events. Every successful or failed call is a row in the trace, not a log line.
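
The three rules can be sketched as a small contract. Everything named here (`Tool`, `ToolEvent`, `RetryPolicy`, `mayRetry`, `callTool`) is hypothetical and only illustrates the shape of the idea; tools are synchronous in this sketch for brevity.

```typescript
// Rule 1: the idempotency label lives on the tool itself, not in a prompt.
interface Tool<I, O> {
  name: string;
  idempotent: boolean;
  // Rule 2: any transport-level retrying happens inside run, not in the orchestrator.
  run: (input: I) => O;
}

// Rule 3: every call, successful or failed, becomes a row in the trace.
type ToolEvent =
  | { tool: string; attempt: number; ok: true; output: unknown }
  | { tool: string; attempt: number; ok: false; error: string };

// Explicit opt-in required before a non-idempotent tool is dispatched again.
interface RetryPolicy {
  allowNonIdempotentRetry: boolean;
}

// Decide whether the orchestrator may dispatch the same call a second time.
function mayRetry(tool: { idempotent: boolean }, policy?: RetryPolicy): boolean {
  return tool.idempotent || policy?.allowNonIdempotentRetry === true;
}

// Call the tool exactly once and record the outcome as a structured event.
function callTool<I, O>(
  tool: Tool<I, O>,
  input: I,
  attempt: number,
  trace: ToolEvent[],
): ToolEvent {
  let event: ToolEvent;
  try {
    event = { tool: tool.name, attempt, ok: true, output: tool.run(input) };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    event = { tool: tool.name, attempt, ok: false, error };
  }
  trace.push(event);
  return event;
}
```

Note that `callTool` contains no loop: after a failed event the orchestrator consults `mayRetry` and either dispatches again or stops, which keeps the retry decision visible in the trace as a separate attempt.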

What we’d do differently next time

If we were starting today, we’d start with the trace. Build the timeline first, the agents second, the orchestrator third. The reason: every team that ships a multi-agent system eventually rebuilds observability, but by then they’ve baked five different formats into five different specialists. Start with the rail, meaning one shared trace format; the agents will follow.

“An orchestrator isn’t a smarter chatbot. It’s the runtime your agents live inside. If you can’t replay it, you don’t own it.”