Same Prompt, Different Results. Your Agent Harness Is the Multiplier.

If you only have a minute, here's what you need to know.

Most engineers run AI coding agents with default settings. They're leaving 60-70% of the tool's capability on the table.
An agent harness (the rules, hooks, skills, memory systems, and plugins wrapping your AI agent) determines output quality more than the model itself. Same model, same prompt, dramatically different results.
My Claude Code harness uses 16 lifecycle hooks, 28 reusable skills, 5 plugins, 8 MCP server endpoints (5 aggregated through a single gateway), and a cross-project memory graph. Every layer is designed around one principle: optimize what enters the context window.
In my own token comparisons, I see a 3-4x efficiency gain on research-heavy tasks and a 2x reduction in correction cycles on implementation tasks, compared to the default configuration.
The context window is the bottleneck for all agentic coding. You don't need my exact setup. But if you're running an AI coding agent without governance rules, memory persistence, and context optimization, you're paying for a sports car and driving it in first gear.

I wrote two weeks ago that agent harnesses need fewer layers, not more. That argument still stands. But "fewer layers" doesn't mean "no layers." It means the right layers.

Today I want to show you what that looks like in practice. Not a framework diagram. Not a theoretical architecture. My actual Claude Code configuration, the one I use every day to ship production code, monitor my social media, and manage a portfolio of 30+ projects.

This isn't a tutorial. It's an argument: your agent harness is the single highest-leverage investment you can make in AI-augmented engineering, and most teams haven't even started.

Why the context window is everything

Before I walk through my setup, I need to make the case for why all of this matters. The answer is three words: the context window.

Every AI coding agent operates within a fixed context window. That's the token budget holding system instructions, your conversation history, tool outputs, and the model's own reasoning. When that window fills up, the agent compacts: it summarizes and discards older context to make room. Every compaction loses conversational state. Every lost state means re-explanation, re-exploration, and re-orientation.
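To make the cost of compaction concrete, here is a minimal sketch of a fixed token budget. The 200,000-token window, the 50% keep ratio, and the per-item sizes are illustrative assumptions, not a description of how Claude Code actually compacts.

```python
from dataclasses import dataclass


@dataclass
class ContextWindow:
    """Toy model of a fixed token budget. All numbers are illustrative."""

    capacity: int = 200_000    # assumed window size, in tokens
    keep_ratio: float = 0.5    # assumed share of context kept per compaction
    used: int = 0
    compactions: int = 0

    def add(self, tokens: int) -> None:
        """Add tokens, compacting first whenever they would overflow."""
        if tokens > self.capacity:
            raise ValueError("a single item cannot exceed the window")
        while self.used + tokens > self.capacity:
            self.used = int(self.used * self.keep_ratio)
            self.compactions += 1
        self.used += tokens


def count_compactions(item_sizes: list[int]) -> int:
    """Feed items into a fresh window and report how often it compacted."""
    window = ContextWindow()
    for size in item_sizes:
        window.add(size)
    return window.compactions


# Sixty tool outputs, dumped raw (8,000 tokens each) vs. summarized (1,000 each).
raw = count_compactions([8_000] * 60)
summarized = count_compactions([1_000] * 60)
print(f"raw: {raw} compactions, summarized: {summarized} compactions")
```

The point of the toy model is only the shape of the curve: the same sixty steps either trigger repeated compactions or none, depending on how many tokens each step is allowed to add.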

The default Claude Code experience treats the context window as a dumb pipe. Raw build logs? Dump them in. Five hundred lines of git log? Dump them in. The model's own correction attempts after generating code that doesn't match your standards? All of it goes in. In most sessions, 60-75% of the context window is consumed by information the model doesn't need in raw form.

This is the fundamental problem a harness solves. Every layer of a production harness exists to optimize what enters the context window. Front-load the right information, sandbox the noise, and eliminate the correction cycles that waste tokens on work the model should have gotten right the first time.

The model is powerful. The context window is finite. The harness bridges that gap.

What "default" actually looks like

When you install Claude Code and start a conversation, here's what happens:

The model receives your prompt. It has access to built-in tools: file read, file write, bash execution, web search. It knows nothing about your project conventions, your coding style, your team's architectural decisions, or what you worked on yesterday. Every session starts from zero.

This is like hiring a senior engineer and giving them no onboarding, no documentation, no code review standards, and no access to your team's Slack history. They're talented. They'll produce something. But they'll spend half their time asking questions you've already answered and writing code that doesn't match your patterns.

That's the default experience. And it's what 90% of Claude Code users are running right now.

Here's where their context window goes in a typical 2-hour session:

Exploration: 25%
Raw tool output: 20%
Correction cycles: 15%
State re-explanation: 10%
Actual productive work: 30%

Less than half the context window is doing real work. The rest is overhead. That's the cost of having no harness.

The seven layers of a production harness

My harness has seven layers. Each one solves a specific problem that default configuration doesn't address, and each one is fundamentally about putting the right tokens into the context window while keeping the wrong ones out.

[Figure: "Same Prompt" terminal animation. A Claude Code session receives the prompt "Refactor auth middleware to use Result pattern and add FluentValidation", loads the harness from ~/.claude/ (rules/ with 7 files, skills/ with 28 skills, hooks/ with 16 hooks, CLAUDE.md), then reports Nexus synced, memory restored, and context-mode active.]
[Figure panel, "7 Layers": Rules (7 files): standards enforced from first keystroke. Hooks (16): autonomous state management. Skills (28): pre-encoded domain expertise. Plugins (5): context-mode sandboxes tool output. MCP / Bifrost (8 endpoints): focused results, not raw HTML. Memory (Nexus + CMEM): ends cold-start sessions. Observability (Langfuse): makes context waste visible.]
[Figure summary: productive work 45% → 78%; compactions 4-6 → 1-2; correction cycles 3-5 → 0-1.]

Layer 1: Governance rules (7 enforcement files)

The ~/.claude/rules/ directory contains seven markdown files that load into every conversation as system instructions. They encode:

Coding style. Immutability patterns for TypeScript and C#. Spread operators, never mutate. Records and with expressions. Init-only properties. This isn't a preference; it's a requirement enforced before the first line of code is written.
Security checklist. No hardcoded secrets. FluentValidation on all DTOs. EF Core only for SQL (auto-parameterized). JWT validation with ClockSkew=Zero. CSRF protection. Rate limiting on public endpoints. This runs as a pre-commit mental model, not an afterthought.
Agent orchestration. When to auto-spawn specialized subagents. Security-sensitive code triggers a security-reviewer. Build failures trigger a build-error-resolver. These aren't optional. They're wired into the rules.
Testing standards. 80% coverage minimum. TDD when requested. xUnit with AAA pattern. WebApplicationFactory<Program> for integration tests.
Performance guardrails. Context window management, research time limits, build troubleshooting protocols.
Cross-project memory. When to query Nexus before asking me to re-explain something. When to check for existing decisions instead of making new ones.
Common patterns. Privacy tags, skeleton project discovery, reusable architectural templates.

The cost: approximately 3,000 tokens of system context. The payoff: the model writes code that matches my standards on the first attempt. No correction cycles. No "actually, we use FluentValidation, not DataAnnotations." No "please don't use console.log."
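As an illustration of what the immutability rule asks for, here is a small sketch. Python stands in for TypeScript and C# here, and the `UserProfile` type and `with_email` helper are hypothetical; a frozen dataclass with `dataclasses.replace` is the closest analogue to a C# record with a `with` expression.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class UserProfile:
    """Hypothetical record type; frozen=True forbids in-place mutation."""

    name: str
    email: str


def with_email(profile: UserProfile, email: str) -> UserProfile:
    """Return a modified copy and leave the original untouched."""
    return replace(profile, email=email)


original = UserProfile(name="Ada", email="ada@example.com")
updated = with_email(original, "ada@new.example.com")

try:
    original.email = "mutated@example.com"  # rejected by the frozen dataclass
except FrozenInstanceError:
    print("in-place mutation rejected")
```

A rule file states this expectation once, so the model reaches for the copy-and-replace form without being corrected into it.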

Layer 2: Lifecycle hooks (16 hooks across 6 phases)

This is where a harness becomes autonomous. Claude Code supports hooks at six lifecycle phases: SessionStart, PreToolUse, PostToolUse, PreCompact, PostCompact, and Stop. I use all six.

SessionStart (3 hooks):

Memory persistence restores my previous session's context: what I was working on, what's next, what decisions were made.
Nexus syncs at session start. Nexus is a tool I built, a local-first knowledge graph that spans all my projects (more on this in Layer 6). It loads architectural decisions and dependency maps from all 30+ projects so the model doesn't start cold.
Nexus session initialization sets up cross-project tracking for the current session.

PreToolUse (3 hooks):

A skill switchboard routes file edits to the appropriate skill based on file type and project context.
A strategic compact advisor monitors context usage and suggests compaction before I hit limits, preventing emergency compactions that lose more state.
A push safety gate lists commits before any git push, requiring review.
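A push safety gate along these lines could be sketched as follows. This is not the author's hook. It assumes the hook receives the pending tool call as JSON on stdin with `tool_name` and `tool_input.command` fields, and that exiting with code 2 blocks the call and returns stderr to the model; check both assumptions against the current Claude Code hooks documentation. The function names are hypothetical, and the sketch only blocks and asks for a review rather than listing commits itself.

```python
import json
import shlex
import sys


def is_git_push(command: str) -> bool:
    """Return True if the shell command contains the tokens `git push`."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes: fall back to a naive split
        tokens = command.split()
    return any(
        token == "git" and tokens[i + 1] == "push"
        for i, token in enumerate(tokens[:-1])
    )


def decide(event: dict) -> tuple[int, str]:
    """Map a tool-call event to (exit_code, message). Field names are assumed."""
    if event.get("tool_name") != "Bash":
        return 0, ""
    command = event.get("tool_input", {}).get("command", "")
    if is_git_push(command):
        return 2, "Push blocked: list the outgoing commits and get a review first."
    return 0, ""


def main(stdin=sys.stdin) -> int:
    """Read one event from stdin and return the exit code for the hook."""
    code, message = decide(json.load(stdin))
    if message:
        print(message, file=sys.stderr)
    return code
```

A real hook script would end with `sys.exit(main())`. Note the deliberate limitation: forms such as `git -C repo push` slip past this token check, so a production gate would need stricter parsing.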

PostToolUse (3 hooks):

Auto-formatting runs after every edit, eliminating style inconsistency.
Observation logging records file paths and patterns to session memory.
Nexus records architectural patterns and decisions for cross-project learning.

PreCompact & PostCompact (2 hooks):

Before compaction: update MEMORY.md, save reusable patterns to lesson storage.
After compaction: re-orient from MEMORY.md and task list so the conversation continues without interruption.
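The file-writing half of the pre-compaction step might look like the sketch below. The snapshot layout and the `append_snapshot` helper are assumptions for illustration; the source does not describe the actual MEMORY.md format or how the hook gathers the state it saves.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_snapshot(memory_file: Path, working_on: str, next_steps: list[str]) -> None:
    """Append a timestamped state snapshot so it survives compaction."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    lines = [f"## Snapshot {stamp}", f"Working on: {working_on}", "Next:"]
    lines += [f"- {step}" for step in next_steps]
    # Append rather than overwrite, so earlier snapshots stay available.
    with memory_file.open("a", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n\n")


# Demo against a temporary directory rather than a real project.
with tempfile.TemporaryDirectory() as tmp:
    memory = Path(tmp) / "MEMORY.md"
    append_snapshot(memory, "auth middleware refactor", ["add validators", "run tests"])
    saved = memory.read_text(encoding="utf-8")
    print(saved)
```

Appending keeps the file as a running log; a post-compaction step can then read only the most recent snapshot to re-orient.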

Stop (5 hooks):

Clean MCP subprocess termination.
Autonomy throttle (tracks how often Claude stops to check in; if it stops 3+ times in 5 minutes, it pauses and asks what I need instead of continuing on autopilot).
Session end persistence.
Nexus post-session analysis.
Telemetry export for cost and token tracking.
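The autonomy throttle's "3+ stops in 5 minutes" rule is a sliding-window counter. A minimal sketch, with a hypothetical class name and with timestamps passed in explicitly so the logic is easy to test:

```python
from collections import deque


class AutonomyThrottle:
    """Sliding-window counter: flag when stops cluster too tightly."""

    def __init__(self, max_stops: int = 3, window_seconds: float = 300.0) -> None:
        self.max_stops = max_stops
        self.window_seconds = window_seconds
        self._stops: deque[float] = deque()

    def record_stop(self, now: float) -> bool:
        """Record a stop at time `now` (seconds); True means ask the user."""
        self._stops.append(now)
        # Drop stops that have fallen out of the window.
        while self._stops and now - self._stops[0] > self.window_seconds:
            self._stops.popleft()
        return len(self._stops) >= self.max_stops


throttle = AutonomyThrottle()
decisions = [throttle.record_stop(t) for t in (0.0, 60.0, 120.0)]
print(decisions)
```

In a real Stop hook the timestamps would come from `time.time()` and the deque would need to be persisted between hook invocations, since each hook run is a separate process.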

The critical insight: these hooks run without my intervention. I don't think about memory persistence or auto-formatting or session telemetry. The harness handles it. Every cognitive cycle I'm not spending on housekeeping is a cycle I'm spending on the actual problem.

Layer 3: Skills (28 domain expertise modules)

This is the layer nobody talks about, and it might be the most underrated.

Claude Code skills are reusable prompt templates, slash commands that load pre-encoded domain expertise into the conversation on demand. Instead of writing a 500-token ad-hoc prompt explaining what you want, you invoke a skill that loads optimized, battle-tested instructions.

I maintain 28 skills across two tiers: 18 global skills available in every project, and 10 project-specific skills tailored to individual workflows.

Engineering skills:

/tdd-workflow enforces test-driven development: scaffold interfaces, write tests first, implement to pass, verify coverage. No ambiguity about the TDD process.
/security-review triggers a comprehensive security audit for auth, payment, or identity code. Covers OWASP Top 10, secrets management, and attack surface analysis.
/shannon is an autonomous AI pentester. White-box security assessments with real exploit execution. Not theoretical. It finds actual vulnerabilities.
/llm-cost-optimizer audits model selections across call sites and recommends cheaper alternatives by complexity tier. Pays for itself immediately.
/code-review, /build-fix, /refactor-clean are post-implementation quality gates that catch issues before they compound.

Design and visualization skills:

/design covers full brand identity, design tokens, UI styling, and logo generation with 55+ styles.
/slides creates strategic HTML presentations with Chart.js, design tokens, and contextual slide strategies.
/visual-explainer generates self-contained HTML pages that visually explain systems, code changes, and data.
/dashboard-creator builds KPI metric cards, charts, and data visualizations as standalone HTML.

Planning and analysis skills:

/deep-plan creates sectionized implementation plans with multi-LLM review and stakeholder interviews.
/skeptic is one I built myself (open source on GitHub). It runs critical analysis that pokes holes in plans, surfaces hidden complexity, and challenges assumptions before I commit to an approach. This one exists because AI agents have an agreeableness problem. You describe an idea and the model tells you it's brilliant, then helps you build it. Three days later you realize the approach had an obvious flaw that a skeptical colleague would have caught in five minutes. /skeptic is that colleague. It forces the model to argue against the plan before I greenlight it, and it's saved me from more bad architectural decisions than any other single tool in this harness.
/functional-design handles end-to-end UI/UX creation from functional spec to working code.

The context window angle on skills is easy to miss. Without a skill, I type out detailed instructions every time, and the model still pulls its punches or misses nuance. That means follow-up corrections. With a skill, I type three words and the model loads 2,000 tokens of pre-optimized instructions. Comprehensive and correct from the first pass. Zero follow-up cycles. Skills convert ad-hoc prompting (variable quality, frequent corrections) into encoded expertise (consistent quality, zero corrections).

Layer 4: Plugins (5 active)

Claude Code's plugin ecosystem is young, but the right plugins dramatically extend capability:

context-mode is the biggest context window optimizer in my stack. Instead of dumping raw command output (build logs, git history, file contents) directly into the context window, context-mode sandboxes execution and indexes the output in a local SQLite FTS5 database. Only summaries enter my conversation. A 500-line git log that would normally consume ~8,000 tokens? Indexed and searchable for a few hundred. A build failure with 200 lines of stack trace? Summarized to the 3 relevant error lines. Over a session, this compounds fast.
deep-plan provides sectionized implementation planning with multi-LLM review. Turns vague feature requests into structured, reviewable plans before a line of code is written.
deep-implement runs TDD-oriented implementation from deep-plan sections. Writes tests first, implements to pass, then reviews.
deep-project decomposes high-level project requirements into scoped planning units.
code-simplifier handles post-implementation cleanup. Reviews changed code for reuse opportunities and unnecessary complexity.
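The sandbox-and-index idea behind context-mode can be illustrated with a few lines of SQLite. This is a sketch of the general technique, not the plugin's implementation: the table layout and function names are made up, and it assumes the SQLite build bundled with your Python has the FTS5 extension enabled.

```python
import sqlite3


def index_output(conn: sqlite3.Connection, source: str, text: str) -> str:
    """Store raw output line by line in FTS5 and return a one-line summary."""
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS output USING fts5(source, line)")
    lines = text.splitlines()
    conn.executemany(
        "INSERT INTO output (source, line) VALUES (?, ?)",
        [(source, line) for line in lines],
    )
    errors = sum(1 for line in lines if "error" in line.lower())
    return f"{source}: {len(lines)} lines indexed, {errors} mentioning 'error'"


def search(conn: sqlite3.Connection, query: str, limit: int = 3) -> list[str]:
    """Return the best-matching lines for a full-text query."""
    rows = conn.execute(
        "SELECT line FROM output WHERE output MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in rows]


# A fake 202-line build log: mostly noise, two lines that matter.
log = "\n".join(
    [f"Compiling module {i}" for i in range(200)]
    + ["Program.cs(12,5): error CS1002: ; expected", "Build FAILED with 1 error"]
)
conn = sqlite3.connect(":memory:")
summary = index_output(conn, "build", log)
hits = search(conn, "error")
print(summary)
print(hits)
```

Only `summary` and, on request, `hits` would enter the conversation; the other 200 lines stay in the database where they cost nothing until queried.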

Layer 5: MCP server aggregation (Bifrost)

MCP (Model Context Protocol) servers give Claude Code access to external tools, from APIs to documentation indexes to video analytics, through a standardized interface. The problem: each MCP server exposes its own set of tool definitions, and every tool definition consumes context window tokens. Run five MCP servers with 3-5 tools each, and you're spending thousands of tokens just on tool schemas before you've asked a single question.

Bifrost solves this by acting as an HTTP gateway that aggregates multiple MCP servers behind a single endpoint. It runs on a dedicated machine in my network and proxies requests to backend servers:

context7 provides library documentation and code examples on demand. When Claude needs to reference a framework's API, it queries context7 instead of burning tokens on web searches or hallucinating method signatures. This is the one I lean on hardest. Accurate docs in the context window, not guesswork.
sequential-thinking enables structured multi-step reasoning for complex architectural decisions. Forces the model to decompose a problem before committing to an approach, rather than jumping straight to code.
github provides direct GitHub API access for PR management, issue tracking, and repository operations without leaving the conversation.
firecrawl handles web scraping and content extraction. When I need to pull in a competitor's documentation, an API reference, or a technical article, firecrawl fetches and cleans it without me alt-tabbing to a browser.
youtube provides video analytics and transcript extraction for competitive research and content analysis.

Two additional MCP servers run standalone: Nexus (cross-project memory graph) and CMEM (session memory). These stay direct because they're lightweight, local-only operations that don't benefit from gateway aggregation.

That's 8 MCP server endpoints total: 5 through Bifrost, 2 standalone, plus Bifrost itself. From the model's perspective, it sees 3 MCP connections. From my perspective, it's 3 connections to manage instead of 7 (the five backends plus the two standalone servers).

The token savings come from two places. First, Bifrost consolidates all five servers behind one HTTP endpoint with one schema negotiation instead of five, reducing tool definition overhead in the context window. Second, and more importantly, these MCP servers return focused results instead of raw output. A context7 documentation query returns the exact function signature, parameters, and a usage example in 200-400 tokens. The alternative? A web search that dumps 3,000-6,000 tokens of HTML, ads, and irrelevant content into the context window. Firecrawl returns clean markdown instead of raw DOM. These MCP servers return what the model needs, not everything the source contains.

The bigger point: MCP servers are how you give an agent real-world reach without bloating its context. The harness controls what enters the context window, and Bifrost is the gatekeeper.
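The aggregation pattern itself is simple to sketch. The code below is not Bifrost's API; the `Gateway` class, the `server.tool` naming scheme, and the fake backends are all hypothetical, chosen only to show one entry point fanning out to several named servers.

```python
from typing import Callable

# A backend takes (tool_name, arguments) and returns a text result.
Backend = Callable[[str, dict], str]


class Gateway:
    """Hypothetical single entry point that fans out to named backends."""

    def __init__(self) -> None:
        self._backends: dict[str, Backend] = {}

    def register(self, server: str, backend: Backend) -> None:
        self._backends[server] = backend

    def call(self, qualified_tool: str, arguments: dict) -> str:
        """Route 'server.tool' to the matching backend."""
        server, sep, tool = qualified_tool.partition(".")
        if not sep or server not in self._backends:
            raise KeyError(f"unknown tool: {qualified_tool}")
        return self._backends[server](tool, arguments)


def fake_docs(tool: str, arguments: dict) -> str:
    """Stand-in for a documentation server."""
    return f"docs:{tool}:{arguments['library']}"


def fake_scraper(tool: str, arguments: dict) -> str:
    """Stand-in for a scraping server that returns cleaned text."""
    return f"markdown from {arguments['url']}"


gateway = Gateway()
gateway.register("context7", fake_docs)
gateway.register("firecrawl", fake_scraper)
result = gateway.call("context7.lookup", {"library": "fluentvalidation"})
print(result)
```

The client configures one connection; adding or removing a backend is a change on the gateway side only.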

Layer 6: Cross-project memory (Nexus + CMEM)

This is the layer most people don't realize they're missing.

Nexus is a tool I built: a local-first, encrypted knowledge graph that spans all my projects. When I make an architectural decision in Project A, Nexus records it. When I'm working on Project B and face a similar decision, Nexus surfaces the prior art. It tracks:

Architectural decisions and their rationale
Code patterns and where they're used
Cross-project dependencies and potential conflicts
Infrastructure notes (SSH configs, deployment targets, service accounts)

CMEM (session memory) provides semantic search across past conversations. When I worked on a similar problem three weeks ago, CMEM surfaces the relevant context without me re-explaining it.

Without cross-project memory, every session starts cold. The model reads files it read yesterday, asks questions I've answered before, and explores architecture it's already mapped. With Nexus and CMEM, the model loads synced context at session start and picks up where we left off. The first task of every session starts productive instead of exploratory.
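The "prior art" lookup can be sketched with a plain in-memory log. Nexus's real schema, storage, and encryption are not described here, so the `Decision` fields and `DecisionLog` class below are hypothetical stand-ins that show only the query pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """Hypothetical record of one architectural decision."""

    project: str
    topic: str
    choice: str
    rationale: str


class DecisionLog:
    """In-memory stand-in for a cross-project decision store."""

    def __init__(self) -> None:
        self._decisions: list[Decision] = []

    def record(self, decision: Decision) -> None:
        self._decisions.append(decision)

    def prior_art(self, topic: str, current_project: str) -> list[Decision]:
        """Find decisions on the same topic made in other projects."""
        return [
            d for d in self._decisions
            if d.topic == topic and d.project != current_project
        ]


log = DecisionLog()
log.record(Decision("project-a", "validation", "FluentValidation", "keeps DTO rules testable"))
log.record(Decision("project-a", "errors", "Result<T>", "no exceptions for control flow"))
log.record(Decision("project-b", "validation", "FluentValidation", "match project-a"))

matches = log.prior_art("validation", current_project="project-c")
print([(d.project, d.choice) for d in matches])
```

Loading the result of a query like this at session start costs a few hundred tokens; rediscovering the same decisions by reading files costs far more.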

Layer 7: Observability (Langfuse + status line)

You can't optimize what you don't measure. My harness exports every session to Langfuse for cost and token tracking. A custom status line shows project name, model, context usage percentage, and remaining capacity in real time.

This might seem like a nice-to-have. It's not. When you can see that a research task consumed 45% of your context window on raw command output, you know exactly where to optimize. When you can compare token-per-task costs across sessions, you can make data-driven decisions about which plugins and rules are pulling their weight.

The status line is also a context window guardian. Seeing "Context: 72% | ~56K remaining" in real time changes how you work. You don't issue a massive git log --all when you can see you're at 70% capacity. You reach for context-mode's sandboxed execution instead. Observability turns unconscious context waste into conscious context management.
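The arithmetic behind a status line like this is small enough to show in full. The sketch below assumes a 200,000-token window, which is the size that makes "72%" and "~56K remaining" agree; the function name and the project and model labels are placeholders.

```python
def status_line(project: str, model: str, used_tokens: int, window_tokens: int) -> str:
    """Format context usage as a compact one-line status."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    used_pct = round(100 * used_tokens / window_tokens)
    remaining_k = round((window_tokens - used_tokens) / 1000)
    return f"{project} | {model} | Context: {used_pct}% | ~{remaining_k}K remaining"


line = status_line("my-api", "example-model", used_tokens=144_000, window_tokens=200_000)
print(line)
```

With 144,000 of 200,000 tokens used, the line reads "Context: 72% | ~56K remaining", matching the example above.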

The token math

Here's what this looks like in practice. These are representative estimates from my Langfuse telemetry across typical coding sessions. Individual sessions vary, but the pattern is consistent.

Token investment versus savings by harness layer:

Rules: 3K tokens invested, 15-20K saved
Hooks: 2K invested, 25-35K saved
Skills: 2K invested, 10-15K saved per invocation
context-mode: 500 invested, 30-50K saved
MCP servers: 1.5K invested, 15-25K saved
Memory: 2K invested, 20-30K saved
Observability: 500 invested, 5-10K saved

The combined effect: my sessions run at roughly 70-80% productive work ratio compared to the ~45% I see in default configurations. Compactions drop from 4-6 per session to 1-2. The model gets things right on the first attempt instead of the third. And I spend my time thinking about the actual problem instead of re-explaining my project.
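Adding up the per-layer figures gives a rough sense of scale. The sketch below simply sums the estimates quoted above; it assumes the savings are independent and counts a single skill invocation, both of which are simplifications.

```python
# (tokens invested, low savings estimate, high savings estimate) per layer
LAYERS = {
    "rules": (3_000, 15_000, 20_000),
    "hooks": (2_000, 25_000, 35_000),
    "skills": (2_000, 10_000, 15_000),  # per invocation; counted once here
    "context-mode": (500, 30_000, 50_000),
    "mcp servers": (1_500, 15_000, 25_000),
    "memory": (2_000, 20_000, 30_000),
    "observability": (500, 5_000, 10_000),
}

invested = sum(cost for cost, _, _ in LAYERS.values())
saved_low = sum(low for _, low, _ in LAYERS.values())
saved_high = sum(high for _, _, high in LAYERS.values())

print(f"invested: {invested:,} tokens")
print(f"saved: {saved_low:,} to {saved_high:,} tokens")
```

Under those assumptions the harness spends 11,500 tokens up front against an estimated 120,000 to 185,000 saved over a session.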

Same prompt, different worlds

Let me make this concrete. Here's a real prompt I might give Claude Code:

"Refactor the authentication middleware to use the Result pattern and add FluentValidation"

Without a harness (default Claude Code):

Claude asks clarifying questions (500 tokens): "What framework? What's the project structure? What's the Result pattern you're using?"
Exploration phase (20,000 tokens): Reads 10-15 files to understand the codebase, architecture, existing patterns
First implementation attempt (1,500 tokens): Writes code with try/catch and DataAnnotations. Reasonable, but wrong for this codebase.
User correction (800 tokens): "We use FluentValidation, not DataAnnotations. And we use Result<T> with error discriminated unions."
Second attempt (1,500 tokens): Closer, but uses mutable patterns and console.log for debugging
User correction (600 tokens): "Immutable only. No console.log. Use structured logging."
Third attempt (1,500 tokens): Finally correct
No security review triggered
No tests written

Total: ~26,000 tokens. Three iterations. No tests. No security review.

With my harness:

Rules pre-loaded: Claude already knows FluentValidation, Result<T>, immutability requirements, structured logging, security standards
Nexus syncs (2,000 tokens): Project architecture, existing middleware patterns, and dependency graph loaded
First implementation (1,500 tokens): Correct patterns from the start. Immutable, FluentValidation, Result<T>, structured logging.
Security reviewer auto-spawns (rules detect auth code): Reviews for JWT validation, CSRF, token storage
Auto-format runs: Code style consistent without manual review
Build output sandboxed by context-mode: a few hundred tokens instead of thousands

Total: ~10,000 tokens. One iteration. Security reviewed. Standards enforced.

That's not a 10% improvement. That's a fundamentally different relationship with your context window.
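The size of the gap follows directly from the itemized numbers. This sketch sums the without-harness steps listed above and compares them with the rounded ~10,000-token total for the harnessed run, which is not fully itemized in the walkthrough.

```python
# Token costs of the default (no harness) walkthrough, step by step.
without_harness = {
    "clarifying questions": 500,
    "exploration": 20_000,
    "first attempt": 1_500,
    "first correction": 800,
    "second attempt": 1_500,
    "second correction": 600,
    "third attempt": 1_500,
}
with_harness_total = 10_000  # rounded total quoted above, taken as given

baseline = sum(without_harness.values())
reduction = 1 - with_harness_total / baseline
ratio = baseline / with_harness_total

print(f"baseline: {baseline:,} tokens")
print(f"reduction: {reduction:.0%}, ratio: {ratio:.2f}x")
```

The itemized baseline comes to 26,400 tokens, so the harnessed run uses roughly 62% fewer tokens, about a 2.6x difference, before counting the tests and security review the default run never produced.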

Context window optimization: the throughline

If you've been counting, every layer of this harness optimizes the same scarce resource. That's not a coincidence. It's the design principle.

Rules front-load 3,000 tokens to prevent thousands more in correction cycles. Hooks automate state management that would otherwise consume context with manual "remember what we were doing" prompts. Skills convert variable-quality ad-hoc instructions into consistent, pre-optimized prompts. context-mode keeps raw tool output out of the window entirely. Bifrost routes queries to MCP servers that return focused results instead of raw HTML. Memory eliminates the cold-start exploration that consumes 10-20% of every default session. Observability makes all of this measurable so you know what's working.

The context window is the fundamental constraint of agentic coding. Every token spent on overhead is a token not spent on reasoning about your actual problem. Every unnecessary compaction is lost state and degraded continuity. Every correction cycle is the model doing the same work twice.

A harness doesn't make the model faster or smarter. It makes the context window deeper. And depth, the ability to sustain complex, multi-step reasoning without losing state, is what separates a coding assistant from a coding agent.

This is the new skill gap

Here's why this matters beyond my personal setup.

In every previous technology wave, the differentiator was knowledge of the technology itself. Know Java better than the next person, ship faster. Know Kubernetes better, deploy more reliably.

With AI coding agents, the technology is the same for everyone. We all have access to the same Claude, the same GPT, the same Gemini. The model isn't the differentiator.

The harness is.

The engineer who spends a week configuring governance rules, building skills for their recurring workflows, and optimizing context consumption will outperform the engineer with default settings for the next two years. Every session. Every task. Compounding daily.

The gap compounds in both speed and quality. The harnessed engineer gets security reviews triggered automatically. They get consistent code patterns without thinking about it. Their sessions don't break every 30 minutes from compaction. Their model has access to documentation through MCP tools instead of hallucinating API signatures from training data.

This is also why I keep arguing that the agent harness is the architecture. It's not infrastructure you bolt on after the fact. It's the primary lever for engineering productivity in an AI-native workflow.

What to do this week

You don't need 16 hooks, 28 skills, and 5 plugins on day one. Start with the highest-leverage layers, the ones that save the most context for the least effort:

Start with rules. Create ~/.claude/rules/coding-style.md and encode your team's three most-violated coding standards. This single file will eliminate the most common correction cycles immediately.

Add a CLAUDE.md to your project. Document your stack, your patterns, and your conventions. This is the project onboarding document you wish every new hire got on day one, except now your AI agent reads it every session. It's the highest-leverage single file in your entire repository.

Install context-mode. If you use Claude Code for any research or debugging, this plugin will be the single biggest context window optimizer. Raw command output flooding your context window is the #1 source of unnecessary compactions.

Build your first skill. Take the prompt you type most often, the one with specific formatting requirements, voice guidelines, or domain conventions, and turn it into a skill. It takes 30 minutes. It saves correction cycles forever.

Set up memory persistence. Even a simple PreCompact hook that prompts the model to update a MEMORY.md file will dramatically improve cross-session continuity. You shouldn't have to re-explain your project every conversation.

Measure everything. You can't optimize a context window you don't observe. Start tracking token consumption per task. The patterns will be obvious, and they'll tell you exactly which layer to build next.

The model is the engine. The context window is the fuel tank. The harness determines how far you go on every drop.

References

Kruczek, M. "Agent Harnesses Don't Need More Layers. They Need Fewer." matthewkruczek.ai, March 17, 2026. matthewkruczek.ai
Anthropic. "Claude Code: Hooks." Anthropic Documentation, 2026. docs.anthropic.com
Anthropic. "Claude Code: CLAUDE.md." Anthropic Documentation, 2026. docs.anthropic.com
mksglu. "Context-Mode Plugin for Claude Code." GitHub, 2026. github.com
Kruczek, M. "Progressive Disclosure for MCP Servers." matthewkruczek.ai, 2026. matthewkruczek.ai
Kruczek, M. "Context Engineering for Enterprise AI." matthewkruczek.ai, 2026. matthewkruczek.ai
Anthropic. "Claude Code: Model Context Protocol." Anthropic Documentation, 2026. docs.anthropic.com
Bifrost. "MCP Gateway for Claude Code." GitHub, 2026. github.com
Anthropic. "Claude Code: Skills." Anthropic Documentation, 2026. docs.anthropic.com

This article is part of "The Agent-First Enterprise" series exploring how organizations can transform their operations around AI agent capabilities. Connect with me on LinkedIn or Substack to discuss agent harness architecture and context optimization for your engineering organization.

Matthew Kruczek

Managing Director at EY

Matthew leads EY's Microsoft domain within Digital Engineering, overseeing enterprise-scale AI and cloud-native software initiatives. A member of Microsoft's Inner Circle and Pluralsight author with 18 courses reaching 17M+ learners.