Agentic AI

Same Prompt, Different Results. Your Agent Harness Is the Multiplier.

March 28, 2026
Split-screen comparison showing a default Claude Code terminal versus a fully harnessed configuration with seven optimization layers


I wrote two weeks ago that agent harnesses need fewer layers, not more. That argument still stands. But "fewer layers" doesn't mean "no layers." It means the right layers.

Today I want to show you what that looks like in practice. Not a framework diagram. Not a theoretical architecture. My actual Claude Code configuration, the one I use every day to ship production code, monitor my social media, and manage a portfolio of 30+ projects.

This isn't a tutorial. It's an argument: your agent harness is the single highest-leverage investment you can make in AI-augmented engineering, and most teams haven't even started.

Why the context window is everything

Before I walk through my setup, I need to make the case for why all of this matters. The answer is three words: the context window.

Every AI coding agent operates within a fixed context window. That's the token budget holding system instructions, your conversation history, tool outputs, and the model's own reasoning. When that window fills up, the agent compacts: it summarizes and discards older context to make room. Every compaction loses conversational state. Every lost state means re-explanation, re-exploration, and re-orientation.

The default Claude Code experience treats the context window as a dumb pipe. Raw build logs? Dump them in. Five hundred lines of git log? Dump them in. The model's own correction attempts after generating code that doesn't match your standards? All of it goes in. In a typical session, 60-75% of the context window is consumed by information the model doesn't need in raw form.

This is the fundamental problem a harness solves. Every layer of a production harness exists to optimize what enters the context window. Front-load the right information, sandbox the noise, and eliminate the correction cycles that waste tokens on work the model should have gotten right the first time.

The model is powerful. The context window is finite. The harness bridges that gap.

What "default" actually looks like

When you install Claude Code and start a conversation, here's what happens:

The model receives your prompt. It has access to built-in tools: file read, file write, bash execution, web search. It knows nothing about your project conventions, your coding style, your team's architectural decisions, or what you worked on yesterday. Every session starts from zero.

This is like hiring a senior engineer and giving them no onboarding, no documentation, no code review standards, and no access to your team's Slack history. They're talented. They'll produce something. But they'll spend half their time asking questions you've already answered and writing code that doesn't match your patterns.

That's the default experience. And it's what 90% of Claude Code users are running right now.

Here's where their context window goes in a typical 2-hour session:

  Exploration              25%
  Raw tool output          20%
  Correction cycles        15%
  State re-explanation     10%
  Actual productive work   30%

Less than half the context window is doing real work. The rest is overhead. That's the cost of having no harness.

The seven layers of a production harness

My harness has seven layers. Each one solves a specific problem that default configuration doesn't address, and each one is fundamentally about putting the right tokens into the context window while keeping the wrong ones out.

Interactive comparison graphic: the same "Refactor auth middleware to use Result pattern and add FluentValidation" prompt in a default session versus a harnessed one. At session start the harness loads ~/.claude/ rules (7 files), skills (28), hooks (16), and CLAUDE.md, then syncs Nexus memory and activates context-mode. The seven layers: rules (standards enforced from the first keystroke), hooks (autonomous state management), skills (pre-encoded domain expertise), plugins (context-mode sandboxes tool output), MCP via Bifrost (focused results, not raw HTML, across 8 endpoints), memory (Nexus + CMEM end cold-start sessions), and observability (Langfuse makes context waste visible). Bottom-line metrics: productive work 45% → 78%, compactions 4-6 → 1-2, correction cycles 3-5 → 0-1.

Layer 1: Governance rules (7 enforcement files)

The ~/.claude/rules/ directory contains seven markdown files that load into every conversation as system instructions. They encode the standards I refuse to re-litigate: which libraries to use, which patterns to follow, and which kinds of code trigger an automatic review.

The cost: approximately 3,000 tokens of system context. The payoff: the model writes code that matches my standards on the first attempt. No correction cycles. No "actually, we use FluentValidation, not DataAnnotations." No "please don't use console.log."

Layer 2: Lifecycle hooks (16 hooks across 6 phases)

This is where a harness becomes autonomous. Claude Code supports hooks at six lifecycle phases: SessionStart, PreToolUse, PostToolUse, PreCompact, PostCompact, and Stop. I use all six.

The sixteen hooks break down by phase: SessionStart (3), PreToolUse (3), PostToolUse (3), PreCompact and PostCompact (2), and Stop (5).

The critical insight: these hooks run without my intervention. I don't think about memory persistence or auto-formatting or session telemetry. The harness handles it. Every cognitive cycle I'm not spending on housekeeping is a cycle I'm spending on the actual problem.
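
To make the mechanics concrete, here is a minimal sketch of how two such hooks might be registered. The JSON shape follows Claude Code's hooks configuration in its settings file, but the script paths (restore-memory.sh, auto-format.sh) are illustrative placeholders, not my actual hooks; the file would be merged into ~/.claude/settings.json.

```shell
# Sketch: register a SessionStart hook and a PostToolUse hook that fires
# after Edit/Write tool calls. Script paths are illustrative placeholders.
cat > settings.json <<'EOF'
{
  "hooks": {
    "SessionStart": [
      {
        "hooks": [
          { "type": "command", "command": "~/.claude/hooks/restore-memory.sh" }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "~/.claude/hooks/auto-format.sh" }
        ]
      }
    ]
  }
}
EOF
echo "Wrote $(wc -c < settings.json) bytes"
```

The matcher is the key design lever: scoping the auto-format hook to Edit|Write means it never fires on reads, so the hook adds zero overhead to exploration.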

Layer 3: Skills (28 domain expertise modules)

This is the layer nobody talks about, and it might be the most underrated.

Claude Code skills are reusable prompt templates, slash commands that load pre-encoded domain expertise into the conversation on demand. Instead of writing a 500-token ad-hoc prompt explaining what you want, you invoke a skill that loads optimized, battle-tested instructions.

I maintain 28 skills across two tiers: 18 global skills available in every project, and 10 project-specific skills tailored to individual workflows.

They fall into three broad groups: engineering skills, design and visualization skills, and planning and analysis skills.

The context window angle on skills is easy to miss. Without a skill, I type out detailed instructions every time, and the model still pulls its punches or misses nuance. That means follow-up corrections. With a skill, I type three words and the model loads 2,000 tokens of pre-optimized instructions. Comprehensive and correct from the first pass. Zero follow-up cycles. Skills convert ad-hoc prompting (variable quality, frequent corrections) into encoded expertise (consistent quality, zero corrections).
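
As a sketch of what "encoded expertise" looks like on disk: a skill is just a markdown file with a short frontmatter header and the battle-tested instructions as its body. The directory layout and the security-review skill below are illustrative, not one of my 28.

```shell
# Sketch: a minimal skill definition. The skill name, description, and
# checklist content are illustrative examples, not a canonical template.
mkdir -p skills/security-review
cat > skills/security-review/SKILL.md <<'EOF'
---
name: security-review
description: Review auth-related code for common web vulnerabilities
---
Review the code under discussion for:
- JWT validation and expiry handling
- CSRF protections on state-changing endpoints
- Token storage (no secrets in localStorage or in logs)
Report findings as a severity-ordered list with file:line references.
EOF
```

Invoking the skill loads that checklist verbatim, which is why the output is consistent: the instructions never drift between sessions.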

Layer 4: Plugins (5 active)

Claude Code's plugin ecosystem is young, but the right plugins dramatically extend capability. The workhorse in my setup is context-mode, which sandboxes raw command output so build logs and research results never flood the main context window.

Layer 5: MCP server aggregation (Bifrost)

MCP (Model Context Protocol) servers give Claude Code access to external tools, from APIs to documentation indexes to video analytics, through a standardized interface. The problem: each MCP server exposes its own set of tool definitions, and every tool definition consumes context window tokens. Run five MCP servers with 3-5 tools each, and you're spending thousands of tokens just on tool schemas before you've asked a single question.

Bifrost solves this by acting as an HTTP gateway that aggregates multiple MCP servers behind a single endpoint. It runs on a dedicated machine in my network and proxies requests to backend servers:

Two additional MCP servers run standalone: Nexus (cross-project memory graph) and CMEM (session memory). These stay direct because they're lightweight, local-only operations that don't benefit from gateway aggregation.

That's 8 MCP server endpoints total: 5 through Bifrost, 2 standalone, plus Bifrost itself. From the model's perspective, it sees 3 MCP connections. From my perspective, it's 3 processes to manage instead of 8.

The token savings come from two places. First, Bifrost consolidates all five servers behind one HTTP endpoint with one schema negotiation instead of five, reducing tool definition overhead in the context window. Second, and more importantly, these MCP servers return focused results instead of raw output. A context7 documentation query returns the exact function signature, parameters, and a usage example in 200-400 tokens. The alternative? A web search that dumps 3,000-6,000 tokens of HTML, ads, and irrelevant content into the context window. Firecrawl returns clean markdown instead of raw DOM. These MCP servers return what the model needs, not everything the source contains.

The bigger point: MCP servers are how you give an agent real-world reach without bloating its context. The harness controls what enters the context window, and Bifrost is the gatekeeper.
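
For illustration, the three connections the model sees could be declared in a project-scope .mcp.json along these lines. This assumes the mcpServers config format; the Bifrost URL and the nexus-mcp/cmem-mcp commands are placeholders for whatever your gateway and local servers actually expose.

```shell
# Sketch: one HTTP connection to the Bifrost gateway, two local stdio
# servers. All names, URLs, and commands here are placeholders.
cat > .mcp.json <<'EOF'
{
  "mcpServers": {
    "bifrost": {
      "type": "http",
      "url": "http://bifrost.local:8080/mcp"
    },
    "nexus": {
      "command": "nexus-mcp",
      "args": ["--local-only"]
    },
    "cmem": {
      "command": "cmem-mcp",
      "args": []
    }
  }
}
EOF
```

Three entries instead of eight: the gateway absorbs the other five servers' schemas on its side of the HTTP boundary.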

Layer 6: Cross-project memory (Nexus + CMEM)

This is the layer most people don't realize they're missing.

Nexus is a tool I built: a local-first, encrypted knowledge graph that spans all my projects. When I make an architectural decision in Project A, Nexus records it. When I'm working on Project B and face a similar decision, Nexus surfaces the prior art.

CMEM (session memory) provides semantic search across past conversations. When I worked on a similar problem three weeks ago, CMEM surfaces the relevant context without me re-explaining it.

Without cross-project memory, every session starts cold. The model reads files it read yesterday, asks questions I've answered before, and explores architecture it's already mapped. With Nexus and CMEM, the model loads synced context at session start and picks up where we left off. The first task of every session starts productive instead of exploratory.

Layer 7: Observability (Langfuse + status line)

You can't optimize what you don't measure. My harness exports every session to Langfuse for cost and token tracking. A custom status line shows project name, model, context usage percentage, and remaining capacity in real time.

This might seem like a nice-to-have. It's not. When you can see that a research task consumed 45% of your context window on raw command output, you know exactly where to optimize. When you can compare token-per-task costs across sessions, you can make data-driven decisions about which plugins and rules are pulling their weight.

The status line is also a context window guardian. Seeing "Context: 72% | ~56K remaining" in real time changes how you work. You don't issue a massive git log --all when you can see you're at 70% capacity. You reach for context-mode's sandboxed execution instead. Observability turns unconscious context waste into conscious context management.
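
A status line is driven by a command that Claude Code invokes with session JSON on stdin. The sketch below assumes two field names, context_pct and tokens_remaining, purely for illustration; check the actual stdin schema in the documentation before relying on any field.

```shell
# Sketch: a status-line command that formats model name and context
# headroom. The JSON field names below are assumptions, not the
# documented schema.
cat > statusline.sh <<'EOF'
#!/bin/sh
python3 -c '
import json, sys
d = json.load(sys.stdin)
model = d.get("model", {}).get("display_name", "?")
pct = d.get("context_pct", 0)          # assumed field name
left = d.get("tokens_remaining", 0)    # assumed field name
print(f"{model} | Context: {pct}% | ~{left // 1000}K remaining")
'
EOF
chmod +x statusline.sh
```

Piping a sample payload through it shows the shape of the output: `echo '{"model":{"display_name":"claude"},"context_pct":72,"tokens_remaining":56000}' | ./statusline.sh` prints the "Context: 72% | ~56K remaining" readout described above.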

The token math

Here's what this looks like in practice. These are representative estimates from my Langfuse telemetry across typical coding sessions. Individual sessions vary, but the pattern is consistent.

  Layer           Invested   Typical savings
  Rules           ~3K        15-20K
  Hooks           ~2K        25-35K
  Skills          ~2K        10-15K per invocation
  context-mode    ~500       30-50K
  MCP servers     ~1.5K      15-25K
  Memory          ~2K        20-30K
  Observability   ~500       5-10K

The combined effect: my sessions run at roughly 70-80% productive work ratio compared to the ~45% I see in default configurations. Compactions drop from 4-6 per session to 1-2. The model gets things right on the first attempt instead of the third. And I spend my time thinking about the actual problem instead of re-explaining my project.
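
The arithmetic behind that claim is easy to check. Summing the per-layer investments against the low end of the savings estimates:

```shell
# Total token investment vs. low-end savings across the seven layers.
invest=$((3000 + 2000 + 2000 + 500 + 1500 + 2000 + 500))
saved=$((15000 + 25000 + 10000 + 30000 + 15000 + 20000 + 5000))
echo "invested: $invest tokens"        # 11500
echo "saved (low end): $saved tokens"  # 120000
echo "return: $((saved / invest))x"    # 10x
```

Roughly 11.5K tokens of fixed overhead buying back 120K or more per session: even the pessimistic end of the estimates is a tenfold return.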

Same prompt, different worlds

Let me make this concrete. Here's a real prompt I might give Claude Code:

"Refactor the authentication middleware to use the Result pattern and add FluentValidation"

Without a harness (default Claude Code):

  1. Claude asks clarifying questions (500 tokens): "What framework? What's the project structure? What's the Result pattern you're using?"
  2. Exploration phase (20,000 tokens): Reads 10-15 files to understand the codebase, architecture, existing patterns
  3. First implementation attempt (1,500 tokens): Writes code with try/catch and DataAnnotations. Reasonable, but wrong for this codebase.
  4. User correction (800 tokens): "We use FluentValidation, not DataAnnotations. And we use Result<T> with error discriminated unions."
  5. Second attempt (1,500 tokens): Closer, but uses mutable patterns and console.log for debugging
  6. User correction (600 tokens): "Immutable only. No console.log. Use structured logging."
  7. Third attempt (1,500 tokens): Finally correct
  8. No security review triggered
  9. No tests written

Total: ~26,000 tokens. Three iterations. No tests. No security review.

With my harness:

  1. Rules pre-loaded: Claude already knows FluentValidation, Result<T>, immutability requirements, structured logging, security standards
  2. Nexus syncs (2,000 tokens): Project architecture, existing middleware patterns, and dependency graph loaded
  3. First implementation (1,500 tokens): Correct patterns from the start. Immutable, FluentValidation, Result<T>, structured logging.
  4. Security reviewer auto-spawns (rules detect auth code): Reviews for JWT validation, CSRF, token storage
  5. Auto-format runs: Code style consistent without manual review
  6. Build output sandboxed by context-mode: a few hundred tokens instead of thousands

Total: ~10,000 tokens. One iteration. Security reviewed. Standards enforced.

That's not a 10% improvement. That's a fundamentally different relationship with your context window.

Context window optimization: the throughline

If you've been counting, every layer of this harness optimizes the same scarce resource. That's not a coincidence. It's the design principle.

Rules front-load 3,000 tokens to prevent thousands more in correction cycles. Hooks automate state management that would otherwise consume context with manual "remember what we were doing" prompts. Skills convert variable-quality ad-hoc instructions into consistent, pre-optimized prompts. context-mode keeps raw tool output out of the window entirely. Bifrost routes queries to MCP servers that return focused results instead of raw HTML. Memory eliminates the cold-start exploration that consumes 10-20% of every default session. Observability makes all of this measurable so you know what's working.

The context window is the fundamental constraint of agentic coding. Every token spent on overhead is a token not spent on reasoning about your actual problem. Every unnecessary compaction is lost state and degraded continuity. Every correction cycle is the model doing the same work twice.

A harness doesn't make the model faster or smarter. It makes the context window deeper. And depth, the ability to sustain complex, multi-step reasoning without losing state, is what separates a coding assistant from a coding agent.

This is the new skill gap

Here's why this matters beyond my personal setup.

In every previous technology wave, the differentiator was knowledge of the technology itself. Know Java better than the next person, ship faster. Know Kubernetes better, deploy more reliably.

With AI coding agents, the technology is the same for everyone. We all have access to the same Claude, the same GPT, the same Gemini. The model isn't the differentiator.

The harness is.

The engineer who spends a week configuring governance rules, building skills for their recurring workflows, and optimizing context consumption will outperform the engineer with default settings for the next two years. Every session. Every task. Compounding daily.

The gap compounds in both speed and quality. The harnessed engineer gets security reviews triggered automatically. They get consistent code patterns without thinking about it. Their sessions don't break every 30 minutes from compaction. Their model has access to documentation through MCP tools instead of hallucinating API signatures from training data.

This is also why I keep arguing that the agent harness is the architecture. It's not infrastructure you bolt on after the fact. It's the primary lever for engineering productivity in an AI-native workflow.

What to do this week

You don't need 16 hooks, 28 skills, and 5 plugins on day one. Start with the highest-leverage layers, the ones that save the most context for the least effort:

Start with rules. Create ~/.claude/rules/coding-style.md and encode your team's three most-violated coding standards. This single file will eliminate the most common correction cycles immediately.
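
A rules file is nothing exotic: plain markdown stating the standard and the preferred alternative. The three rules below are examples drawn from the correction cycles described earlier in this article, not a canonical template; the file would live at ~/.claude/rules/coding-style.md.

```shell
# Sketch of a first rules file; the three rules are illustrative examples.
mkdir -p rules
cat > rules/coding-style.md <<'EOF'
# Coding Style (always enforced)

- Validation: use FluentValidation, never DataAnnotations.
- Error handling: return Result<T>; do not throw for expected failures.
- Logging: structured logging only; console.log is forbidden outside tests.
EOF
```

Each rule names both the banned pattern and its replacement, so the model never has to guess what "don't do X" implies.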

Add a CLAUDE.md to your project. Document your stack, your patterns, and your conventions. This is the project onboarding document you wish every new hire got on day one, except now your AI agent reads it every session. It's the highest-leverage single file in your entire repository.

Install context-mode. If you use Claude Code for any research or debugging, this plugin will be the single biggest context window optimizer. Raw command output flooding your context window is the #1 source of unnecessary compactions.

Build your first skill. Take the prompt you type most often, the one with specific formatting requirements, voice guidelines, or domain conventions, and turn it into a skill. It takes 30 minutes. It saves correction cycles forever.

Set up memory persistence. Even a simple PreCompact hook that prompts the model to update a MEMORY.md file will dramatically improve cross-session continuity. You shouldn't have to re-explain your project every conversation.
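
One minimal version of that starter hook is a script that stamps a checkpoint into MEMORY.md whenever compaction is about to happen, giving the next session (and you) an anchor for where continuity broke. This is a sketch: the MEMORY.md convention is whatever your project adopts, and the script would be registered under PreCompact in your settings file.

```shell
# Sketch: a PreCompact hook that appends a timestamped checkpoint to
# MEMORY.md before older context is summarized away.
cat > pre-compact.sh <<'EOF'
#!/bin/sh
{
  echo ""
  echo "## Compaction checkpoint: $(date -u +%Y-%m-%dT%H:%MZ)"
  echo "Branch: $(git branch --show-current 2>/dev/null || echo 'n/a')"
} >> MEMORY.md
EOF
chmod +x pre-compact.sh
./pre-compact.sh
```

From there, the natural upgrade is having the hook prompt the model to write its own working-state summary above the checkpoint, rather than just stamping metadata.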

Measure everything. You can't optimize a context window you don't observe. Start tracking token consumption per task. The patterns will be obvious, and they'll tell you exactly which layer to build next.

The model is the engine. The context window is the fuel tank. The harness determines how far you go on every drop.

References

  1. Kruczek, M. "Agent Harnesses Don't Need More Layers. They Need Fewer." matthewkruczek.ai, March 17, 2026.
  2. Anthropic. "Claude Code: Hooks." Anthropic Documentation, 2026. docs.anthropic.com
  3. Anthropic. "Claude Code: CLAUDE.md." Anthropic Documentation, 2026. docs.anthropic.com
  4. mksglu. "Context-Mode Plugin for Claude Code." GitHub, 2026.
  5. Kruczek, M. "Progressive Disclosure for MCP Servers." matthewkruczek.ai, 2026.
  6. Kruczek, M. "Context Engineering for Enterprise AI." matthewkruczek.ai, 2026.
  7. Anthropic. "Claude Code: Model Context Protocol." Anthropic Documentation, 2026. docs.anthropic.com
  8. Bifrost. "MCP Gateway for Claude Code." GitHub, 2026.
  9. Anthropic. "Claude Code: Skills." Anthropic Documentation, 2026. docs.anthropic.com

This article is part of "The Agent-First Enterprise" series exploring how organizations can transform their operations around AI agent capabilities. Connect with me on LinkedIn or Substack to discuss agent harness architecture and context optimization for your engineering organization.

Matthew Kruczek

Managing Director at EY

Matthew leads EY's Microsoft domain within Digital Engineering, overseeing enterprise-scale AI and cloud-native software initiatives. A member of Microsoft's Inner Circle and Pluralsight author with 18 courses reaching 17M+ learners.
