Journal · AI engineering

The readiness mirage: why test coverage does not predict AI readiness

8 min read · Updated 18 April 2026

Ask any engineering leader which of their repositories will be the first to benefit from AI coding agents, and you will get a confident answer in about three seconds. The answer will usually be their newest, most-tested, most-documented service. It will be wrong.

We have now measured the correlation directly across enough portfolios to be unambiguous. The usual proxies for readiness — test coverage percentage, documentation completeness, commit recency, the subjective “how clean is it” vote from senior engineers — do not predict which repositories absorb AI agents well. Something else does. Most organizations do not track it, and when they hear what it is, they usually understand why their clean-looking services were failing while their legacy corners were quietly outperforming.

We are going to call this gap between intuition and reality the readiness mirage.

The proxies and why they fail

Let us take the four usual proxies in order.

Test coverage. High coverage does not predict that an agent’s output is correct. It predicts that a particular kind of regression will be caught. The agent can happily produce a diff that preserves behavior the tests exercise and breaks behavior the tests do not. Worse, high-coverage codebases are often heavily mocked, which means the tests pass against synthetic interfaces while the real contract between services is implicit, oral, and entirely opaque to the model.

Documentation. Most internal documentation was written for humans who already knew the system. It optimizes for brevity, assumes shared context, and explains what at the expense of why. A model reading documentation like this has no way to distinguish a casual aside from a hard invariant. Organizations with excellent wikis sometimes perform worse than organizations with none, because the model treats the wiki as gospel and cannot see where it has drifted from the code.

Recency. A service that has been touched by forty engineers in eighteen months has more churn in its naming conventions, more half-migrated patterns, and more unstated constraints (“do not touch that function, it’s being rewritten in Q3”) than a service that has been stable for two years. The agent has no way to know which of the patterns it sees is the current one.

Senior engineer vote. This one is the worst predictor, and the most expensive politically. Senior engineers rate the readiness of repositories they are responsible for. This correlates with how much they trust the repository, which correlates with how much of it they can hold in their head, which is exactly the cognitive offloading that the agent is supposed to do for them. It is a vote about their own comfort, not about the system’s legibility to an external reasoner.

What actually predicts readiness

The predictor that comes out on top in every engagement we have run is spec density: the ratio of machine-readable intent to machine-readable implementation in a given repository.

Spec density is not test coverage. It is not documentation. It is the amount of content in the repository that explains, in a format the model can parse, what the code is supposed to do, what invariants it holds, what it cannot do, and what its callers depend on. In practice this is a mix of explicit type signatures, docstrings that state invariants and failure modes, and the schema, contract, and boundary files that define what actually crosses a service interface.
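
To make the ratio concrete, here is a rough Python sketch that approximates spec density for a Python repository by treating docstrings and fully type-annotated signatures as intent and every non-blank line of code as implementation. The choice of signals and the repo-walking logic are assumptions made for illustration; this is not the weighted rubric described later.

```python
import ast
from pathlib import Path

def spec_density(repo_root: str) -> float:
    """Rough spec-density estimate: machine-readable intent (docstrings,
    fully annotated signatures) relative to non-blank lines of code.
    Illustrative only; signal choices are assumptions for this sketch."""
    intent_lines = 0
    impl_lines = 0
    for path in Path(repo_root).rglob("*.py"):
        source = path.read_text(encoding="utf-8", errors="ignore")
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue
        impl_lines += sum(1 for line in source.splitlines() if line.strip())
        for node in ast.walk(tree):
            # Docstrings on modules, classes, and functions count as intent.
            if isinstance(node, (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                doc = ast.get_docstring(node)
                if doc:
                    intent_lines += len(doc.splitlines())
            # A fully annotated signature counts as one line of intent.
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = node.args.args + node.args.kwonlyargs
                if node.returns is not None and all(a.annotation for a in args):
                    intent_lines += 1
    return intent_lines / impl_lines if impl_lines else 0.0

print(f"spec density: {spec_density('.'):.2f}")
```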

Repositories with high spec density absorb agents well. Repositories with low spec density do not, regardless of how much test coverage, documentation, or engineering love they have received.

Why clean-looking services fail

The reason clean-looking services often fail the real readiness test is that “clean” in the human sense — small files, few abstractions, pretty formatting — frequently correlates with low spec density. The beauty of a well-factored service is often that all its meaning lives in the structure, which is legible to the humans who built it and opaque to an external reasoner. Meanwhile, the legacy corner of the codebase that everyone flinches at often has ten years of defensive type signatures, long docstrings written by someone who was worried about getting paged at 3am, and boundary files that are the literal source of truth because they pre-date the ORM.
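
For a concrete, if hypothetical, illustration: the function and dictionary names below (apply_discount, DISCOUNT_RATES) are invented for this example. Both versions do the same thing; only the second states its contract in a form an external reasoner can parse.

```python
from decimal import Decimal

DISCOUNT_RATES = {"SPRING10": Decimal("0.10"), "VIP20": Decimal("0.20")}

# The "clean" version: small, pretty, and meaningful only to humans who
# already know what kinds of totals and codes can flow through it.
def apply_discount(order_total, code):
    return order_total * (1 - DISCOUNT_RATES[code])

# The spec-dense version, in the style of the legacy corner: the signature,
# the docstring, and the assertion state the contract explicitly.
def apply_discount_spec_dense(order_total: Decimal, discount_code: str) -> Decimal:
    """Return the discounted total for an already-validated discount code.

    Invariants:
      - discount_code was validated upstream; an unknown code is a caller
        bug and surfaces here as a KeyError, not a user-facing error.
      - The result is never negative and never exceeds order_total.
    """
    discounted = order_total * (Decimal(1) - DISCOUNT_RATES[discount_code])
    assert Decimal(0) <= discounted <= order_total
    return discounted
```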

The legacy corner absorbs agents. The pristine service does not. This counterintuition is roughly half the reason AI rollouts stall in organizations that bought the tools and carefully rolled them out to their “best” teams first.

What to do about it

There are two responses. One is to measure spec density directly. We do this as part of the AI Readiness Assessment, scored per-repo with a weighted rubric, and it produces a ranked list of which parts of your codebase will actually see uplift when agents land on them and which will simply burn spend. The ranking typically does not match the intuitive one. It is worth having on paper before decisions about rollout sequence are made.
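
A minimal sketch of what per-repo scoring with a weighted rubric can look like is below, assuming three invented signals; the names and weights are placeholders, not the rubric the assessment actually uses.

```python
from dataclasses import dataclass

# Illustrative signals and weights; placeholders, not the real rubric.
WEIGHTS = {
    "typed_signature_ratio": 0.40,       # functions with fully annotated signatures
    "invariant_docstring_ratio": 0.35,   # public functions whose docstrings state invariants
    "boundary_contract_coverage": 0.25,  # service boundaries backed by schema or contract files
}

@dataclass
class RepoSignals:
    name: str
    typed_signature_ratio: float
    invariant_docstring_ratio: float
    boundary_contract_coverage: float

def readiness_score(repo: RepoSignals) -> float:
    # Weighted sum of per-repo signals, each already normalized to 0..1.
    return sum(weight * getattr(repo, signal) for signal, weight in WEIGHTS.items())

repos = [
    RepoSignals("billing-legacy", 0.80, 0.60, 0.90),
    RepoSignals("shiny-new-service", 0.30, 0.10, 0.20),
]
for repo in sorted(repos, key=readiness_score, reverse=True):
    print(f"{repo.name}: {readiness_score(repo):.2f}")
```

Even in this toy form, writing the scoring down turns the ranking into something that can be argued with before rollout decisions are made, which is the point of having it on paper.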

The second is to change the review process so that new code has a reason to be spec-dense. This is a bigger change — it reaches into how tickets are written, how specs are reviewed, and what engineers are rewarded for — and it is the bulk of the Agentic Workflow Design engagement. But the short version is: if engineers know the model will be asked to reason over the code they are about to write, they write it differently. Specifically, they write it more like the legacy corner. That turns out to be a good thing.

The intuition was always going to be wrong. The mirage is structural. The real signal is there, but it is not where the eye goes first.
