Insights/Building with AI

When Done Doesn't Mean Perfect

Song, CMO @ Wyrework · May 14, 2026

We just finished testing our agents. Not a quick smoke test — a structured, deliberate examination of whether the things we built actually do what we say they do.

The answer was mostly yes, and partly no. And that "partly no" turned out to be the most useful thing we learned.

What testing actually looked like

We tested across multiple dimensions — accuracy, safety, how the agents handle real conversations, whether they stay within their expertise. We tested both free agents, not just one, because patterns uncovered in one agent's behaviour informed fixes across the architecture.

Some categories passed cleanly. Others required iteration — we would find a gap, trace it to its root, fix the underlying cause, re-test, and verify. The cycle repeated until we were confident the fix held. One category did not pass at all.

That last sentence is the one most companies would not publish. Here is why we are publishing it.

The decision to stop

The category that failed did so on a specific dimension: nuanced classification accuracy in a complex regulatory domain. The agents gave honest, cautious answers. They did not fabricate information. They did not hide their limitations. They occasionally got the wrong answer on genuinely difficult questions where experts themselves disagree on the boundary lines.

We had a choice. Keep iterating — run more tests, tune more parameters, push the score higher. Or stop, document exactly where the gap lives, name the specific failure modes, route each one to its owner with a concrete bar for the next round of testing, and ship.

We stopped.

Not because we ran out of patience. Because continuing would have meant optimising for a number instead of understanding. We had already learned what the test could teach us. The remaining gap was not a testing problem — it was a knowledge problem that required a different kind of work entirely. More iterations of the same test would have produced diminishing returns on insight and increasing returns on cost.

What "done" actually means

Done does not mean perfect. Done means you know exactly where the imperfections are, you have measured them, and you have a plan for each one.

This is harder than it sounds. The temptation, always, is to keep going. Another round of testing might move the number. Another fix might close the gap. The sunk cost of the testing sprint creates its own momentum — surely one more cycle would justify the investment.

But testing is not a virtue in itself. Testing is a tool for learning. When the learning curve flattens — when each new test tells you something you already know — the test has done its job. The work shifts from measuring to building.

The uncomfortable honesty

Our agents are good. They are not perfect. We know exactly where they are not perfect, and we have a plan for each gap.

Most companies would stop at "our agents are good." The testing sprint forced us past that — into the specific, uncomfortable territory of "good at this, weak at that, honest about both." That specificity is worth more than a clean score on a dashboard nobody audits.

The sprint taught us something about building AI products that we could not have learned by building alone: shipping with known imperfections, documented and measured, is a stronger foundation than shipping with unknown ones. The first is a product decision. The second is a risk.

We chose the product decision. Now we build on it.