Just Build It in AI

It is the most confident sentence in software right now. A feature comes up, and before anyone has written down what it actually does, the estimate is already in the room: quick and easy, just build it in AI. The code really is quick now. Almost nothing else about the claim holds, because the fast part was never the expensive part, and "just build it in AI" quietly optimizes the ten percent of the job that was already cheap.

If you only have a minute, here's what you need to know.

Agents made writing code nearly free. That did not make delivering software faster, because the expensive work was never the typing. It moved to two places a demo never shows: deciding precisely what to build, and proving what came back is correct.
"Quick and easy" is true for a throwaway script. It is a pricing error for production software in front of a client, where the same sentence sets a timeline and a budget around the cheap ten percent and ignores the costly rest.
The invisible work is specific, not vague: writing a testable spec, forcing the silent product decisions, bounding the agent, and a separate reviewer proving the result. None of it shows up when you watch the agent type.
The most dangerous answer is "automate the checking too." You cannot trust the thing that wrote the code to vouch for it. Every mature agentic system in production converged on the same rule: the agent proposes, a gate and a human dispose.
The shortcut that looks cheapest is the one that ships the incident. The worst bugs in agent-built software arrive inside the changes someone decided were too small to check.

The sentence that sounds like savings

You have heard the sentence, probably this week. A client wants a new feature, and before anyone has written down what the feature actually does, the verdict is already in: "That's quick and easy, just build it in AI." It gets said in planning meetings, in estimates, in the Slack thread where the timeline gets set.

Here is the uncomfortable part. On the narrow point, the sentence is right. The code really is quick now. An agent will produce a working first pass of most features in an afternoon, and it will look convincing. If the job were writing the code, the job would in fact be nearly done.

The job was never writing the code.

"Just build it in AI" optimizes the part of software delivery that was already cheap, and prices the engagement as if that part were the whole thing. Code generation collapsed to near zero. Delivery did not move with it, because the cost lived somewhere else the entire time. It lived in saying exactly what you wanted, and in proving you got it. Agents made the typing cheap. They cannot do either of those two things for you. So the bill didn't shrink. It moved.

Where the work actually went

Watch someone drive an agent and you see one thing: typing, fast, lots of it. That visible activity is the cheap part. Here is the work sitting on either side of it, none of which a client sees on the invoice.

Saying what you want, precisely enough to check. Not a prompt. A spec: the goal, what is in scope, what is explicitly out, and acceptance criteria you could actually test. The bar is a question I now ask of every line: could two people read this and build different things? "Handle errors gracefully" fails that test. "A duplicate submission returns a 409 with a duplicate-claim error" passes it. Writing the second kind of line is slow, it is the highest-leverage hour anyone spends on the feature, and it is completely invisible.

Answering the decisions nobody wrote down. Fail open or fail closed? What does a blocked user actually see? Every real product decision left unstated does not disappear when you hand the work to an agent. The agent makes the call for you, instantly, with no judgment about your business, and you find out what it picked when something breaks in front of a customer.

Drawing the box the agent works inside. Which files it may touch, which existing pattern it must reuse instead of inventing a second one, what it is allowed to do without asking. And approving its plan before it writes a line. Correcting a plan costs a sentence. Correcting a finished build costs a redo.

Proving it, with someone who didn't build it. Build, tests, lint, coverage as hard gates. A separate review of the change against the spec. A human sign-off from someone other than the author, and a security pass on anything touching auth, identity, or customer data.

That list is the feature. The agent's afternoon of typing sits in the middle of it, and it is the only piece "just build it in AI" can see.

The trap: "so automate the checking too"

The sharp reader has an answer ready. Fine, the spec and the tests and the review are work, so have the AI do those too. Let it write its own spec, generate its own tests, review its own pull request. Now it really is quick and easy.

This is the most expensive wrong turn in the whole conversation, and the reason is not a limitation of today's models. It is structural.

You cannot trust the thing that wrote the code to certify that the code is right. A self-review is the same actor grading its own exam. It approves instantly, nothing ever bounces back, and you have built checking theater that catches nothing. The fix is not a better prompt. It is separation: the reviewer is a fresh agent that did not write the change, and the human who signs off is never the change's author.

This is not a preference. It is the rule every serious agentic system in production landed on independently. A leading vendor's coding agent can only push to branches it creates and cannot approve or merge its own work. Self-healing pipeline agents stop at a proposed fix rather than applying it. Infrastructure agents that detect drift propose a remediation and wait for a human to apply it. Three different companies, three different problems, one rule: the agent proposes, a gate and a human dispose. It is the only rule that holds when the agent is wrong, and a probabilistic system is sometimes wrong.

There is also a Stop condition worth stealing. An agent's claim that it finished is an opinion. The build passing is a fact. Mature setups refuse to let an agent call its work done while a test is red, because "done" should never be the agent's judgment to make alone.

The numbers say the same thing from the outside. The bottleneck didn't vanish when writing got cheap. It relocated to review, and most teams are still managing as if writing were the constraint.

Experienced developers using AI tools took 19% longer to finish real tasks, while believing the tools had made them about 20% faster.

— METR, Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity, July 2025

That gap between feeling fast and being fast is the entire illusion in one statistic. "Just build it in AI" runs on the feeling.

Same tool, very different results

There is a second thing the sentence gets wrong, and it is the one that makes "quick and easy" sound reasonable to anyone who has only seen it work once. It assumes the tool produces the same result regardless of who is holding it.

It does not, and the gap is enormous.

Same tool, very different results: the deliberate loop of decide, delegate, and discern versus the shortcut of type a request, glance, and ship

The person who knows what they are doing runs every change through a short, deliberate loop. Decide and write down what you want, clearly enough to check. Delegate it to the agent inside bounds you set, from a plan you approved. Then prove it with checks and a second pair of eyes before anyone trusts it. The person who doesn't know what they're doing types a request, reads the output, decides it looks right, and ships it.

Same model. Same afternoon. One produces a feature that survives contact with production. The other produces a convincing draft with no idea whether it's correct, and a codebase that gets quietly worse with every "quick" change. GitClear analyzed 211 million changed lines and found the pattern that follows volume without discipline: copy-pasted code rising and reuse falling, with 2024 the first year cloned lines outnumbered the lines developers actually refactored.

Duplicated code blocks rose 8x in 2024, while code that was refactored fell from 25% of changes to under 10%.

— GitClear, AI Copilot Code Quality, 2025

Cheap code with no craft is not progress. It is churn that looks like progress until you try to maintain it.

The craft is not in the keystrokes. The agent has those covered. The craft is in the bounds and the checking, and that is exactly the part that takes a person who has done it before.

The shortcut that ships the incident

Here is what it actually costs when "quick and easy" wins the argument with a client in the room.

The sponsor hears "quick and easy." A timeline and a fixed price get set around that sentence. The team, now committed to a date that assumed the cheap version of the work, starts looking for time to give back. The first thing to go is always the same: the loop gets skipped on the changes that feel too small to bother checking.

That instinct is exactly backwards. The worst bugs in agent-built software ship inside the changes someone decided were too small to check. Small and risky is the cheap-to-type, expensive-to-get-wrong case, and it is precisely where unchecked work slips through.

Imagine a regional insurer. A one-line change to a claims-intake rule, the kind nobody felt was worth a spec or a second reviewer, goes straight to merge because it was "obviously fine." It was not fine. It changed which submissions get flagged, and now claims are being routed wrong at scale, against customer data, in a regulated business. What started as a saved afternoon is now an audit finding and a remediation project.

The "quick and easy" estimate did not just miss the work. It created the incident, by pricing out the one step that would have caught it.

8x more code shipped per engineer, with over 80% of merged code authored by the model.

— Anthropic, When AI builds itself, June 2026

When that much of the code is machine-written, the human job is no longer writing it. It is checking it. Skip the checking to hit a "quick and easy" date and you are not saving the expensive part. You are removing the only part that was protecting the client.

What to do this week

Stop pricing by code volume. If your estimate gets cheaper because "the AI writes it," you are estimating the wrong half of the job. Price the spec work and the review work, because that is where the time and the risk live now.

Make the spec pass the two-people test. Before any agent touches a feature, write down what it does in lines that two people could not interpret differently. If a line is a wish, not a check, it is not ready to build. This single habit removes more downstream rework than any model upgrade.

Separate the author from the approver, always. The agent or person who wrote a change is never its only sign-off. A fresh reviewer, human or a separate agent, checks it against the spec. On anything touching auth, identity, or customer data, add a named security sign-off. No exceptions for "small."

Install the checking before you cut the old process. The DORA research is blunt about teams that loosened their delivery discipline as AI adoption rose: stability dropped. The old rituals were also a safety net. Stand up the automated gates and the separate review first, then retire whatever they replace. Never the other way around.

A change is done when it has been checked, not when the agent stopped typing. Price the engagement around that, or don't bid it. "Just build it in AI" is not a plan. It is a sentence that sounds like savings right up until the part nobody budgeted for arrives.

Matthew Kruczek is Managing Director at EY, leading Microsoft domain initiatives within Digital Engineering. Connect with Matthew on LinkedIn to discuss building software with AI agents that holds up in production.

References

METR, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity." July 2025. metr.org
GitClear, "AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones." gitclear.com
DORA / Google Cloud, "2024 Accelerate State of DevOps Report." dora.dev
Anthropic, "When AI builds itself: Our progress toward recursive self-improvement and its implications." June 2026. anthropic.com

Just Build It in AI

The sentence that sounds like savings

Where the work actually went

The trap: "so automate the checking too"

Same tool, very different results

The shortcut that ships the incident

What to do this week

References

Continue Reading

You're Not Writing Prompts Anymore. You're Writing Loops.

Your Agent Isn't Learning. It's Taking Notes.

Why Your Agent Forgets the Rules: The Neurosymbolic Case for the Harness