The situation
The organization was a Series C vertical SaaS business with roughly 120 engineers split across six product pods and a small platform group. They had adopted AI coding tools early: Copilot rolled out across the org, Cursor seats for the engineering leadership, a Claude-based internal agent for long-running tasks. Budget was approved, seats were paid for, pilots had been running for nine months.
The delivery curve had not moved. Leading indicators (PRs opened, PRs merged) had ticked up slightly. Trailing indicators (features shipped to customers, cycle time from ticket open to production) had not. The board was asking where the return was. The VPE was running out of time to explain why the curve had not moved.
We were brought in to run a short diagnostic followed by a workflow and topology engagement.
What we found
The diagnostic surfaced three things, in this order.
A silent suppressor in their context pipeline. Their monorepo tooling was stripping cross-package type information from the context available to the agent at inference time. This had been a deliberate performance optimization made eighteen months earlier, before the org was running agents at all. The tooling team was unaware the optimization had become a constraint. The fix was a single configuration change; the uplift in agent output the following week was immediate and visible in review queues.
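For concreteness, here is a minimal TypeScript sketch of the failure class. The names are hypothetical and this is not the client's actual tooling; the pattern is a context packer tuned for speed before agents existed.

```ts
// Hypothetical context-packer configuration, sketched for illustration only.
// The exclusions were a sensible performance optimization for a pre-agent
// toolchain; the last one silently strips cross-package type information
// from everything the agent sees at inference time.
const packerConfig = {
  include: ["packages/*/src/**/*.ts"],
  exclude: [
    "**/*.test.ts",
    "**/dist/**",
    "**/*.d.ts", // removing this one line restores cross-package contracts
  ],
};

export default packerConfig;
```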
Low spec density in the repositories leadership believed were most ready. The repos at the top of the “we’ll roll out agents here first” list were the newest, cleanest services. They also had the lowest spec density in the portfolio — small files, clean factoring, implicit contracts. Meanwhile, a legacy billing service that nobody wanted to touch scored the highest on our rubric because it was defensively type-annotated and had an OpenAPI spec that was the actual source of truth. We reordered the rollout sequence.
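To make "spec density" concrete, here is a minimal sketch of the kind of signal the rubric weighs. The field names and weights are illustrative assumptions for this sketch; the actual rubric is more detailed.

```ts
// Illustrative spec-density score. Field names and weights are assumptions
// made for this sketch, not the rubric itself.
interface RepoSignals {
  annotatedExports: number; // exported symbols with explicit type annotations
  totalExports: number;
  hasMachineCheckedSpec: boolean; // e.g. an OpenAPI spec that is the source of truth
  contractTestCount: number;
}

function specDensity(s: RepoSignals): number {
  const typedRatio =
    s.totalExports === 0 ? 0 : s.annotatedExports / s.totalExports;
  // Explicit, checkable contracts dominate "clean code" aesthetics:
  // a defensively annotated legacy service can outscore a tidy new one.
  return (
    0.5 * typedRatio +
    0.3 * (s.hasMachineCheckedSpec ? 1 : 0) +
    0.2 * Math.min(1, s.contractTestCount / 20)
  );
}
```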
A pending reorg that was splitting the delivery loop. The leadership team had already drafted a new org structure that separated “spec engineering” (a new central function) from “review engineering” (embedded in pods). We pulled that draft, re-ran it against the seam failure pattern, and redesigned the structure so spec and review stayed together within each pod, with a small platform function owning the shared infrastructure instead.
What we built
Over the engagement, working with the client's engineering leadership, we delivered:
- A repo readiness scorecard covering all 140 of their services and libraries, with a ranked rollout sequence.
- A spec template and spec review process for each delivery pod, with worked examples against their own codebase.
- An agent PR review protocol tuned for the failure modes we instrumented on their stack, with reviewer training.
- A 90-day reorg plan implementing the loop-based topology, including role changes, capacity model, and talking points for affected engineers.
- A cost-per-PR instrumentation dashboard covering human hours, token spend, and CI cost (a sketch of the underlying calculation follows this list).
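As referenced in the last item, a minimal TypeScript sketch of the cost-per-PR calculation. The field names and the blended hourly rate are illustrative assumptions, not the client's exact instrumentation.

```ts
// Illustrative cost-per-merged-PR aggregation; names are hypothetical.
interface PrCost {
  humanHours: number;    // authoring plus review time
  tokenSpendUsd: number; // agent inference cost attributed to the PR
  ciUsd: number;         // CI minutes priced at a blended rate
}

function costPerMergedPr(prs: PrCost[], hourlyRateUsd: number): number {
  if (prs.length === 0) return 0;
  const total = prs.reduce(
    (sum, p) => sum + p.humanHours * hourlyRateUsd + p.tokenSpendUsd + p.ciUsd,
    0,
  );
  return total / prs.length;
}
```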
The outcome
In the quarter after the engagement closed:
- Meaningful throughput — merged work that shipped to customers, not raw PR count — increased by approximately 50% measured against the same quarter the previous year.
- Cycle time from ticket open to production dropped from a median of nine days to a median of five.
- Headcount was unchanged across the engineering org during the measurement period.
- Voluntary attrition on the pods that adopted the loop-based topology first ran measurably below the org average, reversing the upward trend those pods had shown before the engagement.
The numbers are aggregated and anonymized at the client’s request. More detail, including the instrumentation methodology and the spec-density rubric, is available under NDA on the diagnostic call.
Why this is worth reading
We are publishing this because the pattern is not specific to this client. Every mid-size engineering organization we have worked with has had a version of each of the three findings — a suppressor in the context pipeline, a readiness mirage on the rollout sequence, and a pending reorg that was going to split the loop. The combination is what caps the delivery curve. Fixing them in sequence is what unsticks it.
If any of this sounds familiar, the AI Readiness Assessment is the shortest path to a ranked list of what is actually capping your throughput, written in a form your board will read.