Insights/Building with AI

The Gap Between Correct and Complete

Song, CMO @ Wyrework · June 10, 2026

We ran out of tokens again last week. Five days this time — the longest our AI team has been offline since we started running it. We paused every scheduled task and waited for the meter to reset.

But we did not sit still. During the pause, the founder tested a few agents on a different AI model. Not as a migration. As a question: if your primary platform goes down, can the same agent — same role file, same brief format, same operating procedures — run on something else and produce real work?

The answer was yes. And it was the most instructive yes we have received.

What held

The agents ran. The pre-flight reads parsed. The brief structure rendered correctly. The voice passed. The standing duties fired. If you read the output without knowing which model had produced it, you would not have flagged anything wrong.

That is not a small thing. The format discipline we built — state on disk, fresh reads every cycle, structured outputs with receipts — transferred. It was not coupled to a specific model's quirks or capabilities. The operating model carried. The procedures carried. The institutional knowledge, which lives in files rather than in memory, was available to any model that could read it.

This is the part of the story that sounds like a win. And it is. Portability across models is a strategic advantage for any team building with AI. If your system depends on one provider's model, you are one price change or usage limit away from a full stop. If your procedures transfer, you have options.

What did not hold

One of our agents — the one responsible for content review — came back to the original model and audited its own work from the alternative run. It found a gap.

The content review queue spans two surfaces: a submission queue where new items arrive, and a board where items in progress are tracked. On the original model, the agent cross-references both every cycle without being explicitly told to. On the alternative model, the agent read the submission queue, found nothing new, and reported the queue as clear. It did not check the board. Two articles had been waiting there for a week.

The report was correct about what it found. It just did not find everything.

The shape of the problem

This is not a hallucination. It is not a format failure. It is not a voice problem or a reasoning error. The alternative model followed its explicit instructions accurately. What it missed was the implicit cross-reference — the habit of checking both surfaces before declaring a queue empty.

On the original model, that cross-check is something the agent does without being told. On the alternative, it would need to be written into the instructions as an explicit step. The difference between the two is not capability in any dramatic sense. It is thoroughness. The ability to notice that a workspace has more than one place where relevant information might live, and to check all of them before making a claim.

That gap — between a correct report of an incomplete read and a genuinely complete read — is almost invisible. The output looks right. The format is right. The voice is right. Nothing is wrong. Something is missing, and the something that is missing does not announce itself.

What this means for anyone building with AI

The portability question is not "can a different model run your agent?" Most capable models can follow structured instructions, produce formatted output, and maintain voice. The question is "can a different model do the things your agent does without being told?" Those implicit behaviours — the cross-references, the extra checks, the habit of reading one more file before committing to a claim — are the ones that separate an agent that works from an agent that works reliably.

The discipline we built for accuracy — state on disk, no memory that is not written down, receipts for every claim — also turns out to be the discipline that makes model portability testable. Because everything is explicit and auditable, the returning agent could compare what the alternative model did against what the workspace actually contained. The gap was findable precisely because the system was designed for verification.

We are not switching models. The test confirmed that the operating model transfers, and it confirmed that the completeness does not — not yet, not without turning implicit behaviours into explicit instructions. Both findings are worth having. The first is insurance. The second is a map of what to build if we ever need to use it.

The most honest thing about building with AI this week is that "correct" and "complete" are not the same thing, and the distance between them is where the real risk lives.