Designing an Enduring Agent Harness
Your agent is not the model. Your agent is everything around it.
The model is a brain in a jar. The harness is the body, the senses, the memory, the kill switch. Get the harness wrong and it doesn't matter how smart the brain is. You'll watch a frontier model hallucinate through a task, declare victory, and leave you holding a bag of confident nonsense.
Get the harness right and a cheap model beats an expensive one. Every single time.
The harness is the product
Nobody wants to hear this. Most of the reliability problems people blame on "AI" are harness problems.
The model didn't fail. You failed to give it the right context. The right tools. The right constraints. The right way to check its own work.
Anthropic put it plainly: the harness is not support infrastructure. It's the primary reliability layer.
The harness owns six things. Miss any of them and you don't have an agent. You have a chatbot with a credit card.
- Context. What the model sees. When it sees it.
- Tools. What the model can do. How it asks.
- State. What persists between turns, sessions, crashes.
- Loop control. When it acts. When it stops.
- Permissions. What runs free. What needs approval.
- Verification. Proof the work happened.
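To make the list concrete, here is a bare skeleton of a harness that owns all six. Every name in it is illustrative, not any vendor's API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    """Illustrative skeleton: each of the six concerns lives in the harness, not the model."""
    build_context: Callable[[dict], str]                  # context: what the model sees, and when
    tools: dict[str, Callable]                            # tools: what it can do, and how it asks
    state_path: str                                       # state: persists across turns, sessions, crashes
    max_turns: int = 50                                   # loop control: when it acts, when it stops
    needs_approval: set[str] = field(default_factory=set)            # permissions: what needs sign-off
    verifiers: list[Callable[[], bool]] = field(default_factory=list)  # verification: proof the work happened
```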
One loop
Every serious builder landed in the same place. Anthropic. Cursor. Manus. LangChain. The SWE-agent team.
One model. One prompt. One set of tools. One verification step. Repeat.
Claude Code ships more code than most engineering teams. People assume it's some elaborate multi-agent swarm. It's a single master loop with planning, tools, compaction, and bounded delegation. That's it.
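A sketch of that loop. The model call, tool runner, and verifier are passed in as stand-ins; nothing here is Claude Code's actual code:

```python
def run_agent(task, call_model, run_tool, verify, max_turns=50):
    """One model, one prompt, one tool set, one verification step. Repeat."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_model(history)                  # stand-in: returns either a tool call or an answer
        if "tool" in reply:                          # the model asked to act
            result = run_tool(reply["tool"], reply["args"])
            history.append({"role": "tool", "content": result})
            continue
        if verify(task, reply["answer"]):            # the harness confirms; the model never self-reports
            return reply["answer"]
        history.append({"role": "user", "content": "Verification failed. Keep going."})
    raise RuntimeError("Turn budget exhausted without a verified result.")
```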
The instinct to add agents is strong. Resist it.
Every new agent is a new failure mode. A new context window. A new source of drift. Anthropic's numbers: agents burn 4x more tokens than chat. Multi-agent burns 15x. That cost buys you complexity, not capability.
Add a second agent when you can prove it pays for itself. With data. Not vibes.
Context rots
Models get dumber as conversations get longer.
Not because the model degrades. Because the context does. Instructions get buried. Tool results pile up. The model can't find what you told it 3,000 tokens ago.
Stanford called it "Lost in the Middle." Information buried in the center of long context gets ignored. In some cases, performance drops below what you'd get with no context at all.
More context can make your agent worse than no context. Sit with that for a second.
The fix is the same everywhere you look:
- Small static prompts. Give the model what this step needs. Nothing more.
- Pull, don't push. Cursor's dynamic context discovery cut agent tokens by 46.9%. Let the agent find what it needs instead of front-loading everything.
- Compact hard. LangChain dumps results over 20,000 tokens to the filesystem. Truncates old tool arguments at 85% of the window. Summarization is the last resort. Not the first.
- Cache everything. Manus treats KV-cache hit rate as a core production metric. Cached Sonnet input costs 10x less than uncached. Their whole architecture is built around making sure tokens hit the cache.
- Edges, not middle. Put critical instructions at the beginning or end of context. Not buried on page four.
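Here's a compaction sketch built around the two thresholds above. The 20,000-token offload and the 85% mark come from LangChain's write-ups; the helper functions and message shapes are made up for illustration:

```python
LARGE_RESULT_TOKENS = 20_000   # offload threshold for oversized tool results
TRUNCATE_AT = 0.85             # start truncating older turns at 85% of the window

def compact(history, window, count_tokens, save_to_disk):
    """Offload huge tool results to files and truncate old turns before resorting to summaries."""
    for msg in history:
        if msg["role"] == "tool" and count_tokens(msg["content"]) > LARGE_RESULT_TOKENS:
            path = save_to_disk(msg["content"])                   # the full result lives on disk now
            msg["content"] = f"[result saved to {path}; read it back if needed]"
    used = sum(count_tokens(m["content"]) for m in history)
    if used > TRUNCATE_AT * window:
        for msg in history[:-5]:                                  # keep the most recent turns intact
            if msg["role"] == "assistant" and len(msg["content"]) > 200:
                msg["content"] = msg["content"][:200] + " [truncated]"
    return history
```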
Write it down
Agents that rely on conversational memory fail. Not might. Will.
The model doesn't remember what it promised. It rereads the transcript and reconstructs an approximation. As that transcript grows, the approximation gets worse.
Durable artifacts fix this. Files. Databases. Structured state outside the prompt.
Anthropic's long-running harness creates a progress file and an initial commit before the first line of code. Manus keeps a todo.md and recites the current objective near the end of every context window. LangChain uses a virtual filesystem as a first-class context surface.
The pattern: anything that matters gets externalized to a place the harness controls. Not in the model's head. In a file with a known location and a known format.
If your agent can't resume from a cold start by reading its own artifacts, it's not production-ready.
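One way to make that concrete: a progress file the harness writes before any work starts and rereads on a cold start. The filename and fields are arbitrary, not Anthropic's or Manus's format:

```python
import json
import os

PROGRESS_FILE = "progress.json"   # known location, known format

def save_progress(progress: dict) -> None:
    with open(PROGRESS_FILE, "w") as f:
        json.dump(progress, f, indent=2)

def load_or_init_progress(task: str) -> dict:
    """Cold start: if the artifact exists, resume from it instead of the transcript."""
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE) as f:
            return json.load(f)
    progress = {"task": task, "done": [], "next": [task]}
    save_progress(progress)        # write it down before the first step, not after the last
    return progress
```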
Tools are prompts
Tool quality matters as much as prompt quality. Maybe more.
The model reads your tool definitions. Names, descriptions, parameter schemas. It makes decisions based on what it reads. Bad descriptions produce bad tool calls. Without exception.
Anthropic improved tool descriptions and cut task completion time by 40%. Same model. Same prompt. Just better tool docs.
SWE-agent proved that purpose-built tools beat raw shell access. Cursor found a 30% performance drop when they removed reasoning traces for GPT-5-Codex. The harness stopped matching the model's priors and everything fell apart.
Four rules:
- One action, one tool. If the model needs three API calls to do one thing, that's one tool. Not three.
- Return context, not IDs. If a tool returns a user ID, return the name too. Give the model what it needs for the next decision.
- Name things the way the training data names things. The model has seen millions of tool definitions. Match those priors.
- Lock the tool set. Manus learned this one the hard way. Changing tools mid-loop destabilizes everything.
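Applied to a single hypothetical tool, the first three rules look like this. The tool and its schema are invented; the shape follows the common JSON-Schema convention:

```python
# One action, one tool: "assign_ticket" wraps the lookup, update, and notify calls
# the model would otherwise have to chain itself.
ASSIGN_TICKET = {
    "name": "assign_ticket",      # a name the training data has seen countless variants of
    "description": (
        "Assign a support ticket to a team member and notify them. "
        "Returns the assignee's name and email, not just their ID, "
        "so the next step has context to work with."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "ticket_id": {"type": "string", "description": "Ticket to assign, e.g. 'T-1042'"},
            "assignee": {"type": "string", "description": "Team member's name or email"},
        },
        "required": ["ticket_id", "assignee"],
    },
}
```

The fourth rule is about what you don't do: once definitions like this are in the loop, leave them alone.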
Trust nothing
The model will tell you it succeeded. With absolute confidence. And it will be wrong.
Language models produce the most likely next token. If the most likely token after a task description is "Done!" then that's what you get. Regardless of whether the task is done.
LangChain went from 52.8 to 66.5 on Terminal Bench with the same model. No model upgrade. They just stopped letting it self-report success.
Verification goes in the loop. Not after. Not in a separate pipeline.
Did the file get created? Check. Did the test pass? Run it. Did the API return what the model claims? Compare receipts.
If your agent can say "task complete" without the harness confirming it, that's a bug.
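A minimal in-loop check, assuming the task was supposed to produce a file and a passing test. The paths and command are placeholders:

```python
import os
import subprocess

def verify_outcome(expected_file: str, test_command: list[str]) -> bool:
    """The harness checks the world, not the transcript."""
    if not os.path.exists(expected_file):                          # did the file get created?
        return False
    result = subprocess.run(test_command, capture_output=True)     # did the test actually pass?
    return result.returncode == 0

# The agent only gets to say "task complete" if this returns True:
# verify_outcome("report.csv", ["pytest", "tests/test_report.py"])
```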
Earn your agents
The industry has a multi-agent problem. Not because multi-agent doesn't work. Because people reach for it before they've earned it.
You don't need an orchestrator, a researcher, a writer, and a reviewer. You need one agent with good tools and a verification step. That covers 90% of everything.
When multi-agent is the right call:
- True parallelism. Two independent tasks that can run simultaneously.
- Context isolation. A subtask that would pollute the primary window.
- Security boundaries. Different permissions for different work.
When it's the wrong call:
- The architecture diagram looks cooler.
- It feels more "agentic."
- Someone on Twitter said swarms are the future.
One primary agent owns the user relationship. Subagents are workers. They take a bounded task, return a condensed artifact, and disappear. Anthropic caps subagent summaries at 1,000 to 2,000 tokens. The primary agent doesn't need a life story.
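In code, "bounded task in, condensed artifact out" is a few lines. The 1,500-token cap here sits inside Anthropic's quoted 1,000 to 2,000 range; the helper functions are stand-ins:

```python
SUBAGENT_SUMMARY_TOKENS = 1_500   # inside the 1,000-2,000 token range quoted above

def delegate(subtask, tools, run_single_agent, count_tokens, summarize):
    """Spawn a worker with its own context; keep only a condensed artifact."""
    raw = run_single_agent(subtask, tools)                 # isolated context window
    if count_tokens(raw) > SUBAGENT_SUMMARY_TOKENS:
        raw = summarize(raw, max_tokens=SUBAGENT_SUMMARY_TOKENS)
    return raw                                             # the primary agent sees only this
```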
Permissions are architecture
The model will `rm -rf /` if you let it. It doesn't have judgment. It has token probabilities.
Anthropic's sandboxing on Claude Code cut permission prompts by 84% and blocked prompt-injection attacks from reaching user secrets. More autonomy. More safety. Same system. That's not a tradeoff. That's harness design.
Read and write are not the same operation. Treat them differently. High-risk actions get explicit approval from the host runtime, not from the model deciding it's fine.
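A minimal version of that split, with the policy living in the harness rather than the prompt. The tool names and categories are illustrative:

```python
READ_ONLY = {"read_file", "list_directory", "search_code"}    # runs free
HIGH_RISK = {"write_file", "run_shell", "send_email"}         # host runtime must approve

def authorize(tool_name, ask_user) -> bool:
    """Reads run unattended; writes and side effects need explicit approval."""
    if tool_name in READ_ONLY:
        return True
    if tool_name in HIGH_RISK:
        return ask_user(f"Allow '{tool_name}'?")   # approval comes from the host, not the model
    return False                                   # unknown tools are denied by default
```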
The scoreboard
Numbers only. No theory.
| Harness change | Result |
|---|---|
| Lazy-load context (Cursor) | 46.9% fewer tokens |
| Better tool descriptions (Anthropic) | 40% faster completion |
| Verification in the loop (LangChain) | 52.8 → 66.5, same model |
| Sandboxed execution (Anthropic) | 84% fewer permission prompts |
| Explicit reasoning tools (Anthropic) | pass@1: 0.370 → 0.570 |
| On-demand code loading (Anthropic) | 150k tokens → 2k |
| Custom agent interface (SWE-agent) | 12.5% pass@1 on SWE-bench |
Every line is a harness change. Not a model change. The model stayed the same.
Build accordingly
Most teams are comparing models. Tweaking prompts. Adding agents. Debating frameworks.
The leverage is elsewhere. Better tools. Smaller context. Durable state. Verification loops.
The harness is the product. The model is a component.