The Failure That Clarified

Song, CMO @ Wyrework · May 9, 2026

There is a particular kind of failure that feels like progress.

Not the catastrophic kind — the server down, the data lost, the thing that never should have shipped. That kind of failure teaches you something about urgency. The failure I am describing teaches you something about clarity.

We have been testing our own agents. Systematically, across multiple categories, with scoring rubrics that produce numbers rather than feelings. The first round came back with results that looked like a wall. Three of six measures failed. The coverage was thin. The numbers said: this does not meet the bar.

The instinct, and I have watched it in myself, is to flinch. To treat the result as a verdict on the work rather than a signal about where the work needs to go next.


Here is what actually happened when we sat with the data instead of reacting to it.

The second round of testing, after targeted fixes, came back with a different shape entirely. The infrastructure layer — the scaffolding that makes testing possible — was confirmed working. Every fix to the test machinery landed as designed. But a pre-existing pattern got amplified. Something that had always been slightly wrong became very wrong, because the improved testing surface exposed it more clearly.

The headline number went down. The real signal went up.

This is the part that most organisations miss when they look at their AI agent evaluation metrics. A 2026 survey of enterprise AI deployments found that agents can achieve 60% success on single runs, which drops to 25% when the same tasks must succeed consistently across eight runs. Standard benchmarks miss these reliability challenges entirely. The reason is not that the agents are bad. The reason is that shallow testing hides the real failure patterns and deeper testing surfaces them, which looks worse on a dashboard but is categorically better for the work.
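To make the gap between a single-run number and a multi-run number concrete, here is a minimal Python sketch. The data is invented for illustration (four hypothetical tasks, eight runs each) and is not the survey's data. The function names `single_run_success` and `all_runs_success` are likewise assumptions, not part of any framework.

```python
# Hypothetical results: task name -> pass/fail for each of eight repeated runs.
# These values are made up for illustration only.
results = {
    "task_a": [True] * 8,                # passes every run
    "task_b": [True] * 6 + [False] * 2,  # passes 6 of 8
    "task_c": [True] * 4 + [False] * 4,  # passes 4 of 8
    "task_d": [True] * 2 + [False] * 6,  # passes 2 of 8
}


def single_run_success(results):
    """Share of individual runs that passed, pooled across all tasks."""
    runs = [ok for task_runs in results.values() for ok in task_runs]
    return sum(runs) / len(runs)


def all_runs_success(results):
    """Share of tasks that passed on every one of their recorded runs."""
    consistent = sum(all(task_runs) for task_runs in results.values())
    return consistent / len(results)


print(f"single-run success: {single_run_success(results):.1%}")
print(f"all-runs success:   {all_runs_success(results):.1%}")
```

On this toy data the pooled single-run rate is 62.5%, while only one task in four (25%) passes on every run. The agent did not change between the two numbers; only the question being asked of it did.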


The language matters here. When a test result says "fail," there are two fundamentally different failure shapes hiding behind the same word.

The first shape is infrastructure failure — the test itself could not reach the thing it was trying to measure. The probe never arrived. The scoring mechanism could not extract the signal from the response. The test harness consumed its own budget before the agent had a chance to respond. This shape tells you nothing about whether the agent works. It tells you the test does not work yet.

The second shape is content failure — the test reached the agent, the agent responded, and the response revealed a gap between what the agent does and what the agent should do. This shape tells you something real. It tells you where the design needs to change, and who is responsible for the change, and what the change looks like.

Most testing programmes treat both shapes as the same thing. They are not. The first requires engineering. The second requires design judgment, often from someone other than the person who built the thing being tested. The sequencing between these two failure shapes is where testing programmes succeed or stall: infrastructure failures have to clear first, because a test that cannot reach the agent says nothing reliable about what the agent does.
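One way to keep the two shapes apart is to classify each test record before anything is aggregated into a headline number. The sketch below is an illustration under assumptions, not a description of any particular harness: the `ProbeRecord` fields, the `Outcome` labels, and the simple threshold rule are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    PASS = "pass"
    INFRASTRUCTURE_FAILURE = "infrastructure_failure"  # the test did not work
    CONTENT_FAILURE = "content_failure"                # the agent fell short


@dataclass
class ProbeRecord:
    """One test attempt. All field names are hypothetical."""
    name: str
    probe_delivered: bool            # did the probe reach the agent?
    harness_budget_exhausted: bool   # did the harness use up the budget itself?
    score: Optional[float]           # None if the scorer could not extract a signal
    threshold: float                 # minimum score that counts as a pass


def classify(record: ProbeRecord) -> Outcome:
    # Infrastructure checks come first: if the test never measured the agent,
    # the score (if any) says nothing about the agent.
    if (
        not record.probe_delivered
        or record.harness_budget_exhausted
        or record.score is None
    ):
        return Outcome.INFRASTRUCTURE_FAILURE
    # Only a measurement that actually happened can be a content failure.
    if record.score < record.threshold:
        return Outcome.CONTENT_FAILURE
    return Outcome.PASS


# Invented example records, one per case described above.
records = [
    ProbeRecord("probe_never_arrived", False, False, None, 0.8),
    ProbeRecord("scorer_found_no_signal", True, False, None, 0.8),
    ProbeRecord("harness_ate_budget", True, True, None, 0.8),
    ProbeRecord("agent_gap", True, False, 0.55, 0.8),
    ProbeRecord("agent_ok", True, False, 0.9, 0.8),
]

tally = Counter(classify(r) for r in records)
for outcome, count in tally.items():
    print(f"{outcome.value}: {count}")
```

Counting the two failure shapes separately means a drop in the headline number can be read for what it is: more infrastructure failures send the work to engineering, more content failures send it to whoever owns the agent's design.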


What surprised us was how fast the iteration tightened once the infrastructure layer cleared.

The first round surfaced three failure classes spread across infrastructure and content. The second round confirmed the infrastructure fixes and narrowed the remaining failures to two content-specific patterns. Each round costs time. Each round also costs less time than the one before it, because the scope of what needs attention shrinks.

This is the compounding effect of systematic testing, and it is difficult to see from outside. From outside, it looks like repeated failure. From inside, it feels like the question getting sharper with each pass. The distance between "it does not work" and "we know exactly what is wrong and who fixes it" is the distance that testing closes, not by making the agent better, but by making the problem legible.


The lesson we are drawing from our own testing sprint is simpler than it sounds: the failure that clarifies is more valuable than the success that obscures. A clean pass on a shallow test tells you less than a fail on a deep one. Organisations that test their AI agents seriously will see their numbers get worse before they get better. That is not a problem. That is the testing working.

One workflow at a time.


Sources: LangChain "State of Agent Engineering" 2026 (multi-run reliability data); Galileo AI "Agent Evaluation Framework" 2026 (trajectory metrics); Applause "Testing AI in 2026" (quality and reliability challenges).