Everyone has written the case for agentic development. Almost no one has written the operating manual. This is the manual: five artifacts that turn "we decided to go agentic" into a team that actually runs.
If you only have a minute, here's what you need to know.
- The decision to adopt agents is a meeting. Making it work is logistics, and logistics live in artifacts, not slides. This piece shows the five artifacts and how each one is run.
- Thirty engineers each tuning their own agent is not one capability. It is thirty inconsistent ones. The fix is to put agent configuration in the repository and gate it in CI, the same way you treat code.
- A product manager's intent has to become a specification the agent can execute. That translation is a real job with a real owner and a real definition of done, not a ticket tossed over a wall.
- Review becomes the constraint, not coding. You decide in advance, in a written matrix, what an agent may merge alone and what always needs a human.
- The unit of work is no longer the ticket. It is the specification, and you run the team off a pipeline of specifications at different stages of maturity.
- The whole operating model passes or fails one test: a new engineer clones the repo and within the hour their agent behaves exactly like a veteran's. If that is not true, you do not have an operating model yet.
The setup: thirty Claudes, thirty answers
Put thirty engineers on the same team, give them all the same coding agent, and ask each of them to scaffold a new service the same morning. You will get thirty different answers. Different folder structures, different test conventions, different error handling, different opinions about logging, every one generated confidently and every one wrong in a different direction.
Nothing failed. Each agent did exactly what it was told. The problem is that each one was told something different, because each engineer configured their agent by hand, on their own machine, out of their own accumulated instructions.
84% of developers now use or plan to use AI coding tools. Trust in the accuracy of their output has fallen from 40% to 29%.
— Stack Overflow Developer Survey, 2025
That second number is the tell. Adoption is near universal and confidence is going down. The teams losing trust are not running worse models than everyone else. They are running the same agents with no shared operating discipline, and the inconsistency is quietly eroding their own faith in the output.
The fix is not a better model or a better prompt. It is plumbing. Below are the five pieces of plumbing I would put in, what each one actually looks like, and how each one is run day to day. None of them require a platform purchase.
Five artifacts. Each one concrete. None of them slides.
Artifact 1: the repository layout
The first logistics problem is consistency, and most teams find it too late. When an engineer tunes their agent, they are writing instructions: project conventions, the skills the agent can call, the sub-agents it can spawn, the rules it follows. If those instructions live on laptops, you do not have a shared capability. You have configuration drift that compounds every week.
The fix is to treat agent configuration as source code. It lives in the repo, next to the code it governs. With a tool like Claude Code that means a committed .claude/ directory and a root instruction file:
your-service/
.claude/
agents/ # sub-agent definitions (reviewer, test-writer, etc.)
code-reviewer.md
test-writer.md
skills/ # reusable, versioned org knowledge the agent loads
house-style/SKILL.md
api-conventions/SKILL.md
commands/ # shared slash-commands the whole team runs
ship.md
verify.md
settings.json # permissions, hooks, allowed tools
CLAUDE.md # project rules: conventions, anti-patterns, gotchas
src/
tests/
Everything that shapes how the agent behaves is in that tree, and the tree is in git. When you improve how the agent writes tests, you edit test-writer.md, open a pull request, and on the next pull every engineer has the improvement. When a convention changes, it changes in CLAUDE.md once. The knowledge that governs agent behavior is a shared, reviewed, versioned artifact, or it is not really shared at all.
The mechanism scales without changing shape. Whether you are distributing one team's house style or an enterprise's compliance rules, the move is the same: the knowledge becomes a file in version control, reviewed and pulled like everything else, not a habit in someone's head or a setting on someone's laptop.
How it is run. Agent config changes go through pull request review like any other code, because a bad instruction in CLAUDE.md can degrade thirty engineers at once. You pin it: the .claude/ directory is part of the repo, so a given commit has exactly one configuration. And you make improvement cheap, because the whole point is that a fix written once propagates to everyone on the next pull.
Artifact 2: a specification the agent can actually run
The second problem is the handoff, and it is where most product-to-engineering pipelines quietly break. In traditional development a PM writes a ticket and a developer interprets it. The interpretation gap is absorbed by a human who asks questions in standup and fills in what the ticket left out. In agentic development the specification is the input to the machine, and the machine does not read between the lines. Vague intent produces confident, wrong output, and the single largest frustration developers report with AI tools is exactly that: code that is almost right but not quite.
So the requirement has to change shape. "Build user authentication" is a conversation starter for a human. For an agent it is an outcome specification:
## Spec: Email/password authentication
GOAL
Let a user register, sign in, and reset a password.
CONTRACT (must all hold)
- Passwords hashed with bcrypt, work factor >= 12. Never logged, never returned.
- Session tokens expire in 24h; refresh rotates the token.
- Password reset is email-verified, single-use, expires in 30 min.
- All inputs validated at the boundary; reject on the first failure.
- Passes the OWASP Top 10 checks in our security skill.
OUT OF SCOPE
- SSO / social login (separate spec).
DEFINITION OF DONE
- >= 95% test coverage including the reset-token edge cases.
- Security skill runs clean in CI.
- No new high/critical findings in the dependency scan.
The difference from a ticket is that every line is checkable. The agent does not need to guess what "secure" means, because the contract says bcrypt, says 24 hours, says single-use. And the definition of done is something a machine can verify, not a feeling a reviewer has.
The uncomfortable logistics question is who writes this. The PM rarely writes machine-precise contracts, and the agent cannot be trusted to invent the missing detail. So a role appears whether or not you name it: someone who sits between product intent and agent execution and turns one into the other. Call it a specification owner. On a small team it is a senior engineer wearing a second hat. At scale it is a defined role with its own backlog. Pretend the role does not exist and it exists anyway, performed badly by whoever is closest to the failure.
I made the structural version of this argument in If I Built My Engineering Organization From Scratch Today: once code is cheap, the scarce, paid work becomes deciding what to build, directing the agents, and judging whether what comes back is right. The specification owner is where that shift first touches the ground. The org piece is the shape of the team. This is the artifact that team hands the machine.
How it is run. The spec is a tracked work product with its own definition of done, reviewed before any agent touches it. Good specs get saved and reused, because a contract that produced clean output once will produce it again. The spec, not the code, becomes the thing the team argues about, which is exactly where the argument belongs.
Artifact 3: the CI gate that kills drift
Artifact 1 puts the configuration in the repo. Artifact 3 makes sure nobody quietly works around it. A committed .claude/ directory does no good if an engineer runs a stale local copy or strips out a required skill. So you check it, mechanically, on every pull request:
# .github/workflows/agent-config-gate.yml
name: agent-config-gate
on: [pull_request]
jobs:
verify-config:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Required skills present
run: |
for s in house-style api-conventions security; do
test -f ".claude/skills/$s/SKILL.md" \
|| { echo "::error::missing required skill: $s"; exit 1; }
done
- name: Config is the committed one (no local drift)
run: |
git diff --exit-code -- .claude CLAUDE.md \
|| { echo "::error::uncommitted agent-config changes"; exit 1; }
- name: Skills are pinned, not floating
run: ./scripts/check-skill-versions.sh
It is not sophisticated, and it does not need to be. It does three things: confirms the required skills exist, confirms the configuration in the PR is the committed one rather than a local mutation, and confirms versions are pinned rather than floating. A required check that fails the build is the only instruction an agent and an engineer both cannot ignore. This is the difference between a rule you wrote down and a rule that is actually enforced, the symbolic backstop under the agent's soft, probabilistic behavior.
How it is run. The gate is a required status check, so a PR cannot merge until config is consistent. When you add a skill the whole org must have, you add one line to the loop and the next person missing it gets a red build instead of a silent divergence.
Artifact 4: the review-gate matrix
When agents generate code faster, the constraint moves downstream. You are no longer rate-limited by how fast people type. You are rate-limited by how fast people can verify, and verification got harder.
AI-coauthored pull requests average 1.7x more issues than human-only ones. Logic errors are 75% more common and security vulnerabilities 2.74x higher.
— CodeRabbit code analysis, December 2025
Double your output without changing your review process and you are not shipping twice as fast. You are shipping twice as many defects into a queue that cannot keep up. The teams that win decide, in writing and in advance, where human attention goes. That decision is a matrix:
The matrix is decided once, calmly, in advance — not relitigated on every PR when the queue is on fire.
The point of the matrix is not the exact rows. It is that the decision is made once, calmly, and written down, instead of re-litigated on every pull request when the queue is on fire. Low-risk changes flow through on automated checks alone, which is what frees human reviewers to spend their attention where the 1.7x actually bites: the logic, the security boundaries, the migrations.
How it is run. The matrix is enforced by branch protection and CODEOWNERS, not by goodwill. Paths touching auth or payments require the two reviewers automatically. And you tune it from data: if a category that auto-merges starts producing incidents, it moves up a row. Review policy is a living artifact, not a one-time decree.
Artifact 5: the board you run standup from
The last problem is cadence. Agents are good at bounded, well-specified work and bad at ambiguous, sprawling work. So the unit of work becomes the specification from Artifact 2, and the team runs off a pipeline of specifications at different stages:
SPECIFYING READY IN PROGRESS IN REVIEW MERGED
(intent -> (contract (agents (gated by (done,
contract) approved) building) the matrix) spec saved)
auth-reset billing- search-rank export-csv login-v2
org-invites webhook audit-log
The board does not track tickets. It tracks specifications maturing from a fuzzy intent into a merged result. Throughput depends on keeping the left side full and the right side flowing, and the most common failure is an empty SPECIFYING column: agents idle because no one is turning intent into runnable contracts fast enough.
This changes standup. The question is no longer only "what are you working on." It is "what is specified and ready, what is stuck in review, and where is the queue backing up." The bottleneck is usually visible on the board before anyone feels it: a pile in IN REVIEW means the gate matrix is too strict or you are short reviewers; an empty READY column means the specification owner is underwater. Manage the pipeline and the code mostly takes care of itself. Manage the code and you will always be starved for well-formed work.
How it is run. One person owns the flow of the board, usually the specification owner or a tech lead. Standup walks the board right to left, clearing the closest-to-done first. The one metric worth watching is not lines of code or PR count. It is how many specifications go from READY to MERGED per week, because that is the number that actually moved the business.
The test
You will know whether these five artifacts add up to an operating model with a single check. A new engineer clones the repository, and within the hour their agent behaves identically to a five-year veteran's. Same conventions, same skills, same gates, same definition of done. If that is true, you have a capability. If it is not, you have configuration drift, an unowned handoff, and a review queue waiting to catch fire, no matter how good your model is.
Where this breaks honestly: the specification owner is a real cost and a real bottleneck, and if you understaff it the whole pipeline starves. The gate matrix will annoy people the first time it blocks a Friday merge. And none of this is free to stand up. But the alternative is the thirty-Claudes problem at the scale of your whole engineering org, which is the more expensive thing by a wide margin.
The decision to go agentic is one meeting. Everything that determines whether that decision produces a capability or a mess is in these five files. Most teams scale first and discover the logistics afterward, as a crisis. Build the plumbing first.
Matthew Kruczek is Managing Director at EY, leading Microsoft domain initiatives within Digital Engineering. This article is part of "The Agent-First Enterprise" series. Connect with Matthew on LinkedIn to discuss the operating model for agentic engineering in your organization.
References
- Stack Overflow. "Developers remain willing but reluctant to use AI: The 2025 Developer Survey results are here." 2025. stackoverflow.blog
- Stack Overflow. "AI: 2025 Stack Overflow Developer Survey." 2025. survey.stackoverflow.co
- CodeRabbit. "State of AI code quality and pull request issue rates." December 2025. (AI-coauthored PRs average 1.7x more issues; logic errors 75% more common; security vulnerabilities 2.74x higher.) coderabbit.ai
- DORA. "2025 DORA Report: AI-assisted software engineering and the role of platform engineering." 2025. infoq.com
- Anthropic. "Claude Code: settings, subagents, and skills." Documentation, 2025. docs.anthropic.com