Harness Engineering Playbook: Match AI Methodology to Project Rigor

If you only have a minute, here's what you need to know.

Every developer using AI coding agents has a harness, whether they know it or not. Your CLAUDE.md, your settings, your skills, your hooks, your MCP connections: that collection of configuration IS your agentic harness. Most developers have no idea they're building one, and the result is inconsistent, unstructured, and unrepeatable.
Specification-Driven Development (SDD) has become the dominant methodology for building software with AI agents. But the SDD ecosystem now has 15+ competing frameworks, from lightweight "just use Plan Mode" to full constitutional governance with traceability matrices.
These frameworks span three rigor levels: L1 Spec-First (80% of projects), L2 Spec-Anchored (regulated industries), and L3 Spec-as-Source (aspirational). Higher rigor is not better. It is more expensive. Applying L2 ceremony to a solo greenfield app wastes cycles. Applying L1 flexibility to a regulated 50-person build fails audits.
I built an open-source Claude Code plugin called mck-scaffold that makes the methodology decision deterministic. Eight questions, a routing table, and you get a fully scaffolded development harness matched to your project context.
The plugin scaffolds your complete harness in one pass: CLAUDE.md with methodology rules, AGENTS.md for cross-tool compatibility, settings.json with permissions, decision documentation, and auto-installed upstream frameworks. It also includes a research-refresh pipeline that keeps the ecosystem catalog current because static scaffolds rot.
The latest version adds neuro-symbolic enforcement: the LLM handles the interview (neural), a deterministic Python installer handles file operations (symbolic), and six real Claude Code hooks enforce methodology rules at the tool-use level. "Write tests first" in a rules file is asking nicely. A PreToolUse hook that blocks edits without a sister test is making the wrong thing impossible.

The harness you didn't know you were building

Every developer using Claude Code, Copilot, or Cursor has built an agentic development harness. Most just don't realize it.

Your CLAUDE.md file. Your .claude/settings.json. The skills you've installed. The hooks you've configured. The MCP servers you've connected. The rules in your head about when to use Plan Mode, when to delegate to subagents, whether to write specs first or just start coding. That collection of configuration and methodology IS your agentic harness, and it determines whether your AI coding agent writes production-quality software or generates plausible-looking code that falls apart under real constraints.

I wrote previously about how the harness is the multiplier, where the same prompt produces wildly different results depending on the harness wrapping the agent. And about how simpler harnesses outperform complex ones when the layers are well-chosen.

This article is about the decision that comes before both of those: how do you choose what goes into the harness in the first place?

A 5-person team has 5 divergent harness configurations. A 50-person team has 50. Nobody is managing this.

One developer has a meticulously tuned CLAUDE.md with coding standards and security guardrails. The developer next to them has a bare default. A third copied a configuration from a blog post three months ago. Your newest hire has nothing at all.

The methodology landscape has an answer to this problem. It just has too many answers.

The SDD explosion

Specification-Driven Development has become the dominant paradigm for building software with AI coding agents. The core idea is straightforward: write a specification first, then let the agent implement against it. Plan before you code. This is not controversial. What IS controversial is how much specification you need.

I maintain a research catalog tracking 15+ SDD frameworks, and the range is staggering.

On one end, you have Lean: just use Plan Mode, write enough of a plan to orient the agent, and start building. Minimal overhead. Good for experienced developers who can course-correct in real time. No framework to install.

On the other end, you have MUSUBI: a full constitutional governance system with EARS-format requirements, C4 architecture diagrams, Architecture Decision Records, and a traceability matrix that maps every requirement through design to test. Nine governance articles. Explicit change-control addenda for every modification.

Between those extremes sit Superpowers (198K+ GitHub stars, hard-gated TDD, mandatory brainstorming before implementation), Spec-Kit (103K+ stars, constitution-based five-command pipeline), and OpenSpec (brownfield-specialized with delta-spec workflows).

Each of these is correct for somebody. The problem is that most developers either pick the most popular one regardless of fit, or they skip methodology entirely and let the AI agent freestyle.

A 2026 research paper by Piskala proposed a taxonomy that organized this chaos into something navigable.

Figure: Three-column diagram of the SDD rigor levels. L1 Spec-First covers 80% of projects (Superpowers, Spec-Kit); L2 Spec-Anchored covers regulated industries (MUSUBI, OpenSpec); L3 Spec-as-Source is aspirational (Tessl, CSDD).

L1: Spec-First. The spec lives in the feature branch. Once you merge, the spec becomes historical. Maximum flexibility, minimum ceremony. This is where roughly 80% of projects should be. Frameworks: Superpowers, Spec-Kit, Lean.

L2: Spec-Anchored. The spec lives through the entire software lifecycle. Every change requires a formal addendum. Full traceability from requirements through design to tests. This is for regulated industries, audit-sensitive systems, and teams above ten people where "we all just know what we're building" stops working. Frameworks: MUSUBI, OpenSpec, with CSDD as a security overlay.

L3: Spec-as-Source. Code is generated directly from the specification. The spec IS the source code. This remains aspirational in 2026. Tessl raised $125 million for it and is still in closed beta. The academic evidence from CSDD (one study, 73% improvement) needs considerably more validation before enterprise adoption.

62.6% of AI coding bugs are in tool invocation and command execution, not in the model's reasoning. The harness, not the model, is where quality is won or lost.

Source: arXiv 2603.20847

The critical insight: rigor is a cost, not a virtue

Here is where most teams get the methodology decision wrong. Higher rigor is not better. It is more expensive. And the cost is not abstract. It is concrete overhead that either pays for itself or drags your team down.

Spec-Kit adds roughly 10x overhead on your first development cycle before the team internalizes the workflow. MUSUBI's nine governance articles create a traceability paper trail that is worthless if nobody audits it. Superpowers' hard-gated TDD workflow has received feedback about being "stripped to 30%" when the gates are too aggressive for the project's complexity.

Applying L2 ceremony to a solo developer's greenfield app is like requiring three sign-offs to push a README change. The process consumes more time than the code it's protecting.

Applying L1 flexibility to a 50-person team building a compliance-sensitive system is how you get audit failures, inconsistent quality, and agents making decisions that should have gone through formal review.

The research maps this to four decision contexts (Wasowski 2026):

Context A: Solo or small team, greenfield. Route to L1.
Context B: 10+ team, regulated, compliance requirements. Route to L2 with optional CSDD security overlay.
Context C: Brownfield or legacy codebase. Route to L2-OpenSpec with Shotgun CLI for codebase analysis.
Context D: Security-critical (fintech, healthcare, embedded). Route to L2 + CSDD + L3 monitoring document.

Match the rigor to the project. Not the other way around.

What I built

This mismatch problem is what led me to build mck-scaffold, an open-source Claude Code plugin that routes you to the right development methodology and scaffolds your entire harness in one pass.

The plugin runs an 8-question interview. Not a quiz. A structured decision tree that captures the dimensions that actually determine methodology fit:

What is your audit and compliance posture?
Is this greenfield or brownfield?
How large is the team?
Is TDD discipline mandatory?
What is the tech stack?
Do you have an MCP gateway available?
What is your multi-agent appetite?
(Security-critical only) Do you need L3 monitoring?

Questions get pruned based on earlier answers. Say "brownfield" and the plugin skips the team size and TDD questions because the routing is already determined: you need OpenSpec with Shotgun CLI for codebase analysis. Answer "no compliance requirements" with a solo team and you land on L1-lean or L1-superpowers, depending on your TDD preference.

The routing is deterministic. Same answers always resolve to the same methodology preset. This is a lookup table, not a judgment call. If the table has no match, the plugin aborts rather than guessing. I specifically did not want an AI making methodology decisions based on vibes.
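To make "lookup table, not a judgment call" concrete, here is a minimal Python sketch of that kind of routing. It is an illustration under stated assumptions, not the plugin's actual table: the Answers fields, the team-size buckets, the table contents, and the preset name L2-openspec are all hypothetical.

```python
from typing import NamedTuple


class Answers(NamedTuple):
    regulated: bool      # any audit or compliance requirements?
    brownfield: bool     # existing or legacy codebase?
    team_size: str       # "solo", "small", or "large" (hypothetical buckets)
    tdd_mandatory: bool  # is TDD discipline mandatory?


class NoRouteError(LookupError):
    """Raised when the table has no entry; the caller should abort, not guess."""


# Hypothetical table for greenfield projects. Each key is a complete answer tuple.
GREENFIELD_ROUTES = {
    Answers(False, False, "solo", False): "L1-lean",
    Answers(False, False, "solo", True): "L1-superpowers",
    Answers(False, False, "small", False): "L1-spec-kit",
    Answers(False, False, "small", True): "L1-superpowers",
    Answers(True, False, "large", False): "L2-musubi",
    Answers(True, False, "large", True): "L2-musubi",
}


def route(answers: Answers) -> str:
    """Resolve answers to a preset name by lookup only."""
    if answers.brownfield:
        # Brownfield short-circuits: team size and TDD answers are never consulted.
        return "L2-openspec"
    try:
        return GREENFIELD_ROUTES[answers]
    except KeyError:
        raise NoRouteError(f"no preset matches {answers!r}") from None


# prints L1-superpowers
print(route(Answers(regulated=False, brownfield=False, team_size="solo", tdd_mandatory=True)))
```

The property that matters is the failure mode: an unknown combination raises NoRouteError instead of falling back to a default preset.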

What the scaffolded harness includes

Each methodology preset composes a complete development harness through template layering: shared base, then methodology fragment, then optional overlays, then stack-specific adapters.
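The layering order can be sketched in a few lines of Python. This is an assumption-laden illustration of the idea, not the plugin's template engine: the function name compose_layers and the sample fragment text are hypothetical.

```python
from typing import Iterable, List


def compose_layers(base: str, methodology: str,
                   overlays: Iterable[str] = (),
                   stack_adapters: Iterable[str] = ()) -> str:
    """Concatenate fragments in a fixed order: base, methodology, overlays, adapters."""
    layers: List[str] = [base, methodology, *overlays, *stack_adapters]
    # Drop empty fragments so an unused overlay leaves no blank section behind.
    sections = [layer.strip() for layer in layers if layer.strip()]
    return "\n\n".join(sections) + "\n"


claude_md = compose_layers(
    base="# Project rules\nKeep commits small.",
    methodology="## Methodology: L1-lean\nUse Plan Mode before non-trivial changes.",
    overlays=[],
    stack_adapters=["## Stack: Python\nRun the test suite before finishing a task."],
)
print(claude_md)
```

Because the order is fixed, two projects that pick the same preset, overlays, and stack get the same file.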

Figure: Three-panel flow diagram. Left: the 8-question interview. Center: a deterministic lookup that routes to one of seven methodology presets. Right: the scaffolded harness output, including CLAUDE.md, AGENTS.md, settings.json, decision documentation, and the auto-installed framework.

CLAUDE.md with methodology-specific rules. An L1-superpowers preset injects hard-gated TDD rules and mandatory brainstorming. An L2-musubi preset injects EARS requirement formatting, C4 architecture expectations, and traceability mandates. An L1-lean preset adds a Plan Mode reminder and nothing else.

AGENTS.md for cross-tool compatibility. Copilot, Cursor, Gemini CLI, and OpenCode all recognize AGENTS.md. The scaffold writes both CLAUDE.md (authoritative for Claude Code) and AGENTS.md (pointing back to CLAUDE.md as source of truth) so the methodology travels with the repository regardless of which agent your collaborators use.

settings.json with appropriate permissions. Different methodologies need different tool access. An L2 preset with CSDD security overlay needs stricter permissions than an L1-lean setup.

Decision documentation. Every scaffold writes docs/decisions/INTERVIEW.md (the exact answers given) and docs/decisions/METHODOLOGY.md (what was chosen, why, what the alternatives were, and honest caveats about the chosen framework). Six months from now, when someone asks "why are we using MUSUBI?", the answer is in the repo.

Auto-installed upstream framework. The scaffold runs the actual install commands for the chosen methodology: npx musubi-sdd@latest init, or wires up Superpowers' skill bundle, or runs Spec-Kit's specify init. Platform-aware, version-current, with Windows-specific environment variable workarounds where needed.

One detail I am particularly deliberate about: every preset surfaces honest caveats. Superpowers received feedback about being "stripped to 30%" when its TDD gates are too aggressive. Spec-Kit adds roughly 10x overhead on your first development cycle. MUSUBI has 48 GitHub stars, which means a very small community if you need help. These facts appear in the scaffolded METHODOLOGY.md. No framework is sold as perfect because none of them are.

The neuro-symbolic layer most harnesses are missing

The first version of mck-scaffold had a fundamental weakness: every rule it scaffolded was Level 1 enforcement. Markdown instructions in CLAUDE.md that the LLM tries to follow, with no signal at all when it fails to.

"Write tests before implementation" in a rules file is asking nicely. A PreToolUse hook that blocks Edit on source files unless a sister test exists is making the wrong thing impossible. That distinction matters.

Neural creativity + symbolic enforcement. The LLM proposes; code disposes.

So I rebuilt the enforcement model around a neuro-symbolic architecture. The concept is straightforward: let the LLM handle what it's good at (conversational interviews, contextual judgment) and let deterministic code handle what it's good at (validation, file operations, gate enforcement).
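As an illustration of the symbolic half, here is a minimal sketch of a sister-test gate written as a Claude Code PreToolUse hook. It is not the plugin's code. It assumes the hook receives the tool call as JSON on stdin with tool_name and tool_input.file_path fields, that exit code 2 blocks the call, and that the test for src/pkg/mod.py lives at tests/pkg/test_mod.py. That layout is a hypothetical convention chosen for the example.

```python
import json
import sys
from pathlib import Path
from typing import Optional, Tuple

BLOCK = 2  # assumed: exit code 2 from a PreToolUse hook blocks the tool call


def sister_test_path(source: Path) -> Optional[Path]:
    """Map <root>/src/pkg/mod.py to <root>/tests/pkg/test_mod.py (assumed layout)."""
    parts = source.parts
    if source.suffix != ".py" or "src" not in parts:
        return None  # not a gated source file
    src_index = len(parts) - 1 - parts[::-1].index("src")  # last "src" component
    root = Path(*parts[:src_index])
    relative = Path(*parts[src_index + 1:])
    return root / "tests" / relative.with_name(f"test_{relative.name}")


def decide(payload: dict) -> Tuple[int, str]:
    """Return (exit_code, message) for one tool call; pure except for an exists() check."""
    if payload.get("tool_name") not in {"Edit", "Write"}:
        return 0, ""
    file_path = payload.get("tool_input", {}).get("file_path", "")
    test_path = sister_test_path(Path(file_path))
    if test_path is None or test_path.exists():
        return 0, ""
    return BLOCK, f"Blocked: create {test_path} before editing {file_path}"


def main() -> int:
    """Entry point for a hook script: read the payload, report, return the exit code."""
    code, message = decide(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)  # stderr is what the agent gets to see
    return code
```

A real hook script would end with sys.exit(main()) and be registered under PreToolUse in settings.json. Keeping decide() separate from the stdin handling makes the rule itself unit-testable.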

The scaffold now operates on a five-level enforcement scale:

Level 1: Prompt rules in markdown. The LLM tries to follow them, and nothing flags it when it doesn't.
Level 2: Structured output schemas. Catches shape errors, not content errors.
Level 3: Tool-forced workflows. Must call X before Y. Forces ordering.
Level 4: External validators. Hooks, post-edit checks, CI gates. Catches substance after the fact.
Level 5: Formal verification. Strict types, property tests, contracts. Provable correctness.

The previous version operated entirely at Level 1. The current version ships Level 2-4 enforcement out of the box.

Concretely, this means six real Claude Code hooks that fire on every tool use:

spec_gate (L1-spec-kit): Blocks edits on gated paths unless an approved spec covers the target file.
spec_shape (L1-spec-kit): Validates spec files against a JSON Schema after they're written.
task_proof (L1-spec-kit): Runs the spec's registered tests after an edit to a covered file.
hardgate_check (L1-superpowers): Refuses edits unless an approved design marker exists.
tdd_check (L1-superpowers): Refuses new source files without a sister test file.
secret_gate (L2-csdd-security): Blocks edits introducing credential patterns (12 high-precision regex patterns for AWS keys, GitHub PATs, Stripe secrets, JWTs, and more).
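To show what a credential-pattern gate amounts to, here is a small Python sketch. The four patterns below are widely used public shapes for these token types, chosen for illustration; they are not the plugin's twelve patterns, and the names SECRET_PATTERNS and find_secrets are hypothetical.

```python
import re
from typing import List

# Illustrative patterns only; a production gate would tune these for precision.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "stripe_live_secret": re.compile(r"\bsk_live_[0-9a-zA-Z]{24,}\b"),
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
}


def find_secrets(text: str) -> List[str]:
    """Return the sorted names of every pattern that matches the proposed edit text."""
    return sorted(name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text))


# Build the sample from pieces so this file never contains a key-shaped literal.
sample = 'AWS_KEY = "' + "AKIA" + "A" * 16 + '"'
print(find_secrets(sample))          # prints ['aws_access_key_id']
print(find_secrets("retries = 3"))   # prints []
```

A hook built on this would run find_secrets over the text an edit is about to introduce and reject the tool call when the list is non-empty.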

Each hook respects a three-position strictness dial: warn (log to stderr, allow), block (reject the tool use), or strict (reject and audit). L1 methodologies default to warn. L2 defaults to block. The installer writes the profile, and every hook reads it at runtime.
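The dial reduces to a small piece of shared logic that every hook can call. The sketch below is an assumption about how that might look, not the plugin's implementation: the profile is taken to be a JSON file with a "strictness" key, the audit log is a plain list, and 2 is assumed to be the blocking exit code.

```python
import json
import sys
from enum import Enum
from pathlib import Path
from typing import List, Optional


class Strictness(Enum):
    WARN = "warn"      # log to stderr, allow
    BLOCK = "block"    # reject the tool use
    STRICT = "strict"  # reject and audit


def load_strictness(profile_path: Path) -> Strictness:
    """Read the dial from a profile file (hypothetical format: {"strictness": "warn"})."""
    profile = json.loads(profile_path.read_text(encoding="utf-8"))
    return Strictness(profile["strictness"])  # unknown values raise ValueError


def resolve(violation: Optional[str], mode: Strictness, audit_log: List[str]) -> int:
    """Turn a hook's finding into an exit code according to the dial."""
    if violation is None:
        return 0
    print(f"[{mode.value}] {violation}", file=sys.stderr)
    if mode is Strictness.WARN:
        return 0
    if mode is Strictness.STRICT:
        audit_log.append(violation)
    return 2  # assumed blocking exit code


audit: List[str] = []
print(resolve("new source file has no sister test", Strictness.WARN, audit))    # prints 0
print(resolve("new source file has no sister test", Strictness.STRICT, audit))  # prints 2
```

Reading the profile at runtime rather than baking it into each hook means a team can tighten the dial from warn to block without reinstalling anything.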

The install process itself is now neuro-symbolic. The LLM handles the 8-question interview (it's good at conversational context), then writes a JSON decision document validated against a schema. A deterministic Python installer reads that document and handles all file operations: copying hooks, merging settings, writing CLAUDE.md, seeding the audit log. The LLM never touches settings.json directly. If the installer fails, the LLM does not retry. It reports the error and stops.
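The hand-off between the two halves can be sketched as follows. This is a simplified illustration, not the plugin's installer: the decision-document fields (preset, strictness, settings) are hypothetical, and validation here is a hand-rolled type check standing in for real JSON Schema validation.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Hypothetical required fields of the decision document and their types.
REQUIRED_FIELDS = {"preset": str, "strictness": str, "settings": dict}


def validate_decision(doc: Dict[str, Any]) -> None:
    """Fail fast on a malformed document; the installer never guesses at intent."""
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(doc.get(field), expected):
            raise ValueError(f"decision document: '{field}' must be {expected.__name__}")


def merge_settings(existing: Dict[str, Any], additions: Dict[str, Any]) -> Dict[str, Any]:
    """Merge dicts recursively, union lists in order, and let additions win on scalars."""
    merged = dict(existing)
    for key, value in additions.items():
        current = merged.get(key)
        if isinstance(current, dict) and isinstance(value, dict):
            merged[key] = merge_settings(current, value)
        elif isinstance(current, list) and isinstance(value, list):
            merged[key] = current + [item for item in value if item not in current]
        else:
            merged[key] = value
    return merged


def install(decision_path: Path, project_dir: Path) -> Path:
    """Validate the decision document, then merge its settings into .claude/settings.json."""
    doc = json.loads(decision_path.read_text(encoding="utf-8"))
    validate_decision(doc)
    settings_path = project_dir / ".claude" / "settings.json"
    existing: Dict[str, Any] = {}
    if settings_path.exists():
        existing = json.loads(settings_path.read_text(encoding="utf-8"))
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    merged = merge_settings(existing, doc["settings"])
    settings_path.write_text(json.dumps(merged, indent=2) + "\n", encoding="utf-8")
    return settings_path
```

Any exception from install() is the "report the error and stop" case: the caller surfaces it and does nothing else.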

Rules.md is "asking nicely." A PreToolUse hook that blocks uncovered edits is "making the wrong thing impossible."

The research problem I didn't expect

The SDD ecosystem moves fast. Frameworks change their CLI commands, rename features, deprecate flags. A scaffold that was correct three months ago might generate broken install commands today.

The first time I dogfooded mck-scaffold on a real project, it immediately caught drift in Spec-Kit: the --ai flag had been deprecated, slash commands were renamed from dot notation to hyphen notation, and core commands had moved from CLI to skills. A Windows-specific encoding bug would have crashed the install. If I had scaffolded from a static template, every install command would have failed silently.

So I built a research-refresh pipeline into the plugin. It launches four parallel research agents across different signal sources: GitHub for framework releases and star trajectory, Reddit for practitioner sentiment and pain points, YouTube for tutorial tactics, and engineering blogs for architectural deep dives. Each agent re-surveys its source and diffs findings against the current catalog. Changes require section-by-section confirmation before updating.
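The "diff findings against the current catalog" step can be pictured with a short Python sketch. The catalog shape (framework name mapped to a dict of fields) and the example entries are hypothetical; the real catalog format is not described here.

```python
from typing import Any, Dict

Catalog = Dict[str, Dict[str, Any]]


def diff_catalog(current: Catalog, fresh: Catalog) -> Dict[str, Dict[str, Any]]:
    """Report added, removed, and changed entries so each can be confirmed separately."""
    changes: Dict[str, Dict[str, Any]] = {}
    for name in sorted(set(current) | set(fresh)):
        old, new = current.get(name), fresh.get(name)
        if old is None:
            changes[name] = {"status": "added", "fields": new}
        elif new is None:
            changes[name] = {"status": "removed", "fields": old}
        elif old != new:
            # Keep only the fields that differ, as (old_value, new_value) pairs.
            fields = {
                key: (old.get(key), new.get(key))
                for key in sorted(set(old) | set(new))
                if old.get(key) != new.get(key)
            }
            changes[name] = {"status": "changed", "fields": fields}
    return changes


# Placeholder entries; the framework name and commands are made up for the example.
current = {"example-framework": {"install": "example init --old-flag", "level": "L1"}}
fresh = {"example-framework": {"install": "example init", "level": "L1"}}
for name, change in diff_catalog(current, fresh).items():
    print(name, change["status"], change["fields"])
```

Each entry in the result is one unit of confirmation: a reviewer accepts or rejects it before the catalog is updated.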

The research findings live in the repository. A refresh is effectively a contribution to the plugin. Everyone benefits on their next pull.

A scaffold is only as good as its last update. Static templates rot. Built-in ecosystem monitoring doesn't.

This is iterative, not finished

I want to be direct about where mck-scaffold stands today: it is experimental. The routing table covers the major decision contexts, but the SDD ecosystem is moving fast enough that any static claim of completeness would be dishonest.

That said, this is exactly how agentic tooling should evolve. In a world where AI coding agents ship features in hours instead of weeks, everything is iterative. The plugin that routes you to a methodology today will route you to a better methodology next month, because the research-refresh pipeline catches what changed and the scaffold adapts. The neuro-symbolic hooks that enforce six rules today will enforce twelve tomorrow, because the enforcement model is composable.

The alternative, waiting until a tool is "done" before releasing it, does not work in agentic development. The ecosystem will have moved by the time you ship. Release early, enforce what you can, iterate in public. That is the only honest path when the ground beneath you is shifting at this pace.

What to do this week

Audit your team's harness consistency. Ask every developer using AI coding agents one question: "Show me your CLAUDE.md." If the answers range from "my what?" to a 500-line custom configuration, you have a harness gap. The methodology and rules your agents follow should be a team decision, not individual preference.

Classify your projects by rigor level. Not every project needs the same harness. A quick internal tool is L1. A patient-facing healthcare system is L2 at minimum. Map your portfolio to the taxonomy and stop applying one-size-fits-all ceremony to projects that don't need it, or one-size-fits-all flexibility to projects that need governance.

Try the scaffold. If you use Claude Code, install mck-scaffold and run the interview on your next project. Even if you don't use the output directly, the eight questions force the architectural conversations that most teams skip: What is our compliance posture? Is TDD mandatory here? How much ceremony does this project actually justify?

Treat your harness as infrastructure. Your CLAUDE.md, your methodology choice, your skill and hook configuration: these are infrastructure decisions, not personal preferences. Version them. Review them. Keep them consistent across your team. The organizations that treat harness configuration as a first-class architectural concern will compound their advantages the same way teams with disciplined dependency management outperformed those without.

The teams that will pull ahead are not the ones with the best models. They are the ones who figured out that the harness surrounding the model determines whether it writes production-quality software or generates plausible-looking code that falls apart under real constraints. Your harness is your multiplier. Build it deliberately, or accept that you are leaving that multiplier to chance.

Matthew Kruczek

Managing Director at EY

Matthew leads EY's Microsoft domain within Digital Engineering, overseeing enterprise-scale AI and cloud-native software initiatives. He is a member of Microsoft's Inner Circle and a Pluralsight author with 18 courses reaching 17M+ learners. Connect on LinkedIn to discuss agentic development harness architecture for your organization.

References

Piskala, "Three-Level SDD Rigor Taxonomy," arXiv:2602.00180, 2026
Wasowski, "Four Decision Contexts for SDD Selection," Medium, 2026
"AI Coding Bug Distribution in Tool Invocation," arXiv:2603.20847, 2026
mck-scaffold, github.com/MCKRUZ/mck-scaffold
