Harness Engineering — Building Agents Around LLMs

A large language model, on its own, is a stateless function: text in, text out. It can't open a file, run a test, remember yesterday, or take a second attempt after seeing an error. Everything that makes a model feel like an agent — reading your repo, editing files, running commands, recovering from mistakes — lives in the software wrapped around the model. That software is the harness, and building it well is its own discipline: harness engineering. As frontier models converge in raw capability, the harness is increasingly where coding tools win or lose. This piece is the companion to how AI coding agents work — that one explains the agent; this one is about engineering the scaffolding that runs it.

⚡ Quick Takeaways

The model is the engine; the harness is the car. The model reasons; the harness gives it senses (context), hands (tools), a loop, memory, guardrails, and a way to check its work.
Context assembly is the highest-leverage job — deciding what goes into a finite context window determines output quality more than a marginally bigger model. "Context engineering" has overtaken prompt engineering.
The agent loop is the heart — build prompt → call model → run its tool calls → feed results back → repeat until done or out of budget.
Tools are an API you design — clear schemas, good descriptions, right granularity, and error messages the model can recover from.
Verification closes the loop — running tests/linters/builds and feeding failures back is what turns a one-shot generator into an agent that iterates to correct.
Harnesses are tuned empirically — you need evals, trajectory analysis, and step/cost budgets, because small harness changes swing real-world success rates a lot.
Capability ≈ model × harness. A great harness on a smaller model often beats a great model on a poor harness.

tldr

The harness is all the engineering around the model: the agent loop, context assembly, tool definitions, memory/compaction, permissions and sandboxing, and environment-grounded verification. The model supplies reasoning; the harness supplies everything that makes that reasoning actionable and reliable. Because models are commoditizing, harness quality — especially how well it assembles context and verifies work — is now the main differentiator between AI coding tools.

What Is a Harness?

It helps to draw a hard line between the two parts of any AI coding tool. The model is a frontier LLM (Claude, GPT, Gemini): it takes a blob of text and produces a blob of text, with no memory and no ability to act. The harness is the program around it that makes it useful — it assembles the input, interprets the output, executes any actions the model requested, and loops. Claude Code, Cursor, Devin, OpenHands, and Aider are all harnesses over the same handful of models; their differences are almost entirely harness differences.

The model does	The harness does
Reasoning & language	Assembles the context that's reasoned over
Decides which tool to call	Defines the tools and actually executes them
Produces text/diffs	Applies edits, runs commands, parses output
Is stateless	Holds memory, history, and the loop
Has no notion of "safe"	Enforces permissions, sandboxing, guardrails

The practical consequence: most of the engineering effort — and most of the product's quality — is in the harness, not the model you call.

The Agent Loop

At its core, a harness is a loop. The model is called, and if it asks to use a tool, the harness runs that tool, appends the result to the conversation, and calls the model again — repeating until the model produces a final answer or the harness hits a budget. This single loop is what converts a one-shot text predictor into something that explores, acts, and iterates.

the agent loop — the heart of a harness

state = [system_prompt, tools, user_goal]
steps = 0
while not done and steps < budget:
    reply = model(state)              # the ONLY call to the LLM
    if reply.tool_calls:
        for call in reply.tool_calls:
            result = execute(call)     # harness acts: read file, run cmd, grep…
            state.append(result)      # observation goes back into context
    else:
        done = True                  # model gave a final answer
    state = compact(state)           # keep it within the context window
    steps += 1

Notice how much is the harness's responsibility: choosing the budget, executing tools safely, deciding what counts as "done," and compacting state. The model only ever sees state and emits the next step — the loop, and everything in it, is engineering.

Context Assembly: the Highest-Leverage Job

The model can only reason about what's in its context window, which is finite. The harness's single most important job is deciding what goes in that window on each turn — the system prompt, tool definitions, the right files pulled from the repo, prior tool results, and conversation history. Get the right code in front of the model and a mediocre model shines; get it wrong and the best model hallucinates. This is why the field's center of gravity moved from prompt engineering to context engineering.

budgeting a finite context window

context window (say 200K tokens)
├─ system prompt + tool defs ......  ~5K    fixed overhead
├─ retrieved files / RAG ..........  ~40K   ◀ the high-leverage slice
├─ conversation + tool results ....  grows every turn
└─ headroom reserved for output ...  ~8K
        when it fills → summarize / drop / compact older turns

Prompt engineering	Context engineering
Wording the instruction well	Choosing what information the model sees
One prompt, mostly static	Dynamic per turn — retrieval, pruning, ranking
"How do I ask?"	"What does it need in front of it to answer?"

Concretely, the harness does retrieval (find the relevant files/symbols, by index or by letting the model grep on demand), ranking (put the most relevant first), and pruning (drop or summarize what no longer matters). When an agent "ignores" an obvious file, it's almost always a context-assembly bug, not a reasoning failure.

Tool Design

Tools are the agent's hands, and the harness defines them — their names, schemas, descriptions, and behavior (see tool use, function calling & MCP). Good tool design is genuine API design, with a twist: the consumer is a model, so the description is part of the interface — it's how the model decides when and how to call the tool.

a tool definition the model can use well

{
  "name": "run_tests",
  "description": "Run the project's test suite and return pass/fail"
                 " plus failing-test output. Use after editing code to"
                 " verify changes.",
  "input_schema": {
    "type": "object",
    "properties": { "path": {"type": "string", "description": "dir to test"} }
  }
}

Principles that make tools work: right granularity (a few composable tools beat dozens of hyper-specific ones), clear descriptions (the model only knows what you tell it), safe defaults (a destructive tool should require explicit confirmation), and — crucially — recoverable errors: a tool that fails should return a message the model can read and act on ("file not found: did you mean X?"), not an opaque stack trace. The error text is part of the loop's feedback.

Memory and State

The model is stateless, so the harness owns memory. There are two tiers. Short-term memory is the conversation/context window itself — the running transcript of goals, actions, and observations. Long-term memory lives outside the window: a scratchpad or notes file, a project-instructions file the harness injects each session, or an external store the agent can query. The hard problem is the short-term tier filling up: a long task accumulates tool results until it threatens the window. The harness handles this with compaction — summarizing or dropping older turns while preserving the goal and key facts — so the agent can keep working on tasks longer than any single context window.

Guardrails, Permissions, and Safety

The model has no concept of "dangerous." The harness is where safety is enforced, because the harness is what actually executes actions. This layer includes permission prompts before risky operations, allowlists/denylists for commands, sandboxing (running tool calls in a constrained environment so an agent can't touch what it shouldn't), and hooks that intercept actions to apply policy. A well-engineered harness fails safe: when an action is irreversible or outward-facing, it pauses for confirmation rather than trusting the model's judgment. This is also where you stop prompt-injection from a malicious file turning into a destructive command.

key point

Safety can't live in the model — the model only emits a request to act. The harness decides whether to honor it. That's why permissions, sandboxing, and confirmation gates are harness features, and why "the model wouldn't do that" is never a security argument.

Verification and Self-Correction

The feature that most separates a real agent from a fancy autocomplete is verification: the harness runs the code, the tests, the linter, or the build, and feeds the result back into the loop. This grounds the agent in the environment rather than its own (possibly wrong) belief that the code is correct. A failing test becomes an observation the model reasons about and fixes — closing a loop the model cannot close alone.

verification turns one-shot output into iteration

model edits code
      │
      ▼
harness runs tests  ──▶ fail?  ──yes──▶ feed error back ──▶ model fixes ──┐
      │                                                                    │
      └────────────────────────◀───────────────────────────────────── loop ┘
                              no
      ▼
   done — verified by the environment, not by the model's say-so

The quality of this loop depends on harness choices: what to run, how to surface failures concisely (a 5,000-line test log must be trimmed to the part that matters), and when to stop. A harness with good verification can let a weaker model grind to a correct answer; without it, even a strong model confidently ships broken code.

Error Handling and Recovery

Agents fail in ways one-shot calls don't, and the harness must anticipate them: a tool throws, the model emits malformed JSON for a tool call, a command hangs, or the agent gets stuck repeating the same failing action. Robust harnesses add retries (with the error fed back so the model can adjust), timeouts on tool execution, loop detection (notice the same action repeating and break out), and hard step/cost budgets so a confused agent can't burn unlimited tokens. Designing these recovery paths is a large part of why production harnesses are hard.

Evaluation and Iteration

You cannot improve a harness by intuition alone, because small changes — a tweaked tool description, a different compaction strategy, reordering context — swing real-world success rates surprisingly far. Serious harness work is empirical: run an eval harness against benchmark tasks (e.g. SWE-bench-style "fix this real bug" suites), measure success rate, latency, and cost, and inspect trajectories (the full sequence of steps) to see where runs go wrong. Then change one thing and re-measure. The harness is a system you tune against data, not a prompt you write once.

Sub-Agents and Orchestration

As tasks grow, a single agent's context gets cluttered. A common harness pattern is sub-agents: the main agent delegates a scoped job (search the codebase, review a diff) to a fresh agent with its own clean context, which returns just a conclusion. This keeps the main context focused, enables parallel work, and bounds how much any one context window must hold. Orchestration — deciding how to split work, run agents in parallel, and combine results — is an increasingly important harness capability for large tasks.

Why the Harness Matters as Much as the Model

The thesis of harness engineering: effective capability ≈ model × harness. The same model, dropped into a better harness, completes more tasks correctly — because the harness controls what the model sees (context), what it can do (tools), whether it catches its own mistakes (verification), and whether it stays safe and on-budget (guardrails). As frontier models converge, these factors, not raw model IQ, increasingly decide which coding tool actually finishes the job. That's why "context assembly beats a bigger model" is the recurring lesson, and why harness engineering has become a discipline in its own right.

takeaway

An LLM is a brilliant but blind, amnesiac, hands-tied reasoner. The harness gives it eyes (context), hands (tools), memory, reflexes (verification and recovery), and a conscience (guardrails) — and wires them into a loop. Build that scaffolding well and an ordinary model becomes a capable agent; build it poorly and the best model in the world flails. Increasingly, the harness is the product.

🎯 interview hot-takes

What is a "harness"? The software around an LLM that makes it an agent: the loop, context assembly, tool execution, memory, guardrails, and verification. Claude Code and Cursor are harnesses over the same models.
Why does context engineering beat prompt engineering? The window is finite; getting the right code in front of the model determines output quality more than wording or a marginally bigger model.
What turns one-shot generation into an agent? The loop plus environment-grounded verification — running tests/builds and feeding failures back so the model iterates to a correct, checked answer.
Where does safety live? In the harness, not the model — the model only requests actions; permissions, sandboxing, and confirmation gates decide whether to execute them.
Model vs harness — which matters more? Both; capability ≈ model × harness. As models converge, harness quality is the main differentiator.