30-second primer: an LLM (Claude / GPT / ...) only predicts "what text comes next."
To make it really read files, run commands, or browse the web, you wrap it with
tools + a loop: it asks to call a tool → the framework runs it → result is fed back → it continues.
That wrapper is an "agent." This page compares five of them. New here? Start with "Concepts" below.
1. 基本概念1. Concepts
1. LLM 本质是"下一个 token 预测器"1. LLM is a next-token predictor
你给它一段文本, 它给你接下来最可能出现的一段文本。没了。
Feed it text, it returns the most-likely next chunk of text. That's all.
An LLM only does one thing: look at the tokens so far, predict the next token;
then append that prediction and predict the next one — step by step, a whole reply appears.
First, text is split into tokens (≈ subwords):
"The cat sat on the"
──tokenize──▶ ["The", " cat", " sat", " on", " the"]
Then one step at a time:
step 1 sees: ["The", " cat", " sat", " on", " the"] → predicts " mat"
step 2 sees: ["The", " cat", " sat", " on", " the", " mat"] → predicts "."
step 3 sees: [..., "."] → predicts <end> (stop)
Each step makes the input one token longer. The total length of "history + new tokens" is
capped by the context window — overflow either gets compacted (§5.2) or truncated.
它不会执行代码、不会读文件、不会上网。这些全是 agent 在外面套的一层。
It can't execute code, read files, or access the web. All of that is what the agent wraps around it.
2. Function calling: 让 LLM "指点"框架去干活2. Function calling: let the LLM direct the framework
Tell the LLM in the system prompt "you can call read_file(path)". A user asks "read /tmp/foo.py" — the LLM won't invent file contents, it returns structured JSON:
Key insight: LLMs have great brains but fake hands. The agent framework installs real hands — and makes sure it doesn't burn the kitchen down. The five differ in how they install the hands and how they prevent fires.
Installing the hands is step one; measuring whether the hands actually work is step two — ClawBench is designed exactly for this: a live benchmark that tests any harness on cookie popups, dynamic JS, multi-step interactions, and real everyday online tasks.
术语表 — 忘记某个词时展开查 Glossary — expand when you forget a word
The exact limit depends on model and runtime config; for harness design, the key issue is compaction, truncation, and reloading persistent memory near the limit
API
How programs call programs; here: LLM REST APIs
Send HTTP, get JSON back
Streaming
Return tokens as they are generated
Lower latency, pipeline next step earlier
Function calling / Tool use
LLM returns structured "please call this tool" JSON
Prerequisite for an agent to "do things"
Prompt cache
Server-side cache of long system prompt
Up to 10× cheaper, lower latency
Sandbox
Confined process env (FS/network limited)
Keeps agent from wrecking your machine
Provider
LLM vendor (Anthropic / OpenAI / Google)
Pick one, or write adapters
Turn
One user message + full agent response cycle
"One turn" = the main loop runs a full pass
ReAct
Reasoning + Acting loop: think → act → think
All five are ReAct variants
MCP
Model Context Protocol for external tools
Lets agents plug in any 3rd-party tool
CLAUDE.md / AGENTS.md
Root-level project convention file
Read at startup; a "README for bots"
Plan-and-execute
Ask the model to plan first, then execute step by step
opencode's plan mode, claw-code's EnterPlanMode
Reflection
Agent self-reviews after acting; retries on error
§7 Takeaway · common auxiliary loop
Tools
Typed functions the agent can call (read file, run bash, browse…)
§2 Tools vs Skills table
Skills
Markdown files (SKILL.md) teaching when/how to use tools
codex's default mode stacks these to block 99% of accidents
Vercel AI SDK
Provider-agnostic TypeScript SDK from Vercel that abstracts streaming / tool-calling / reasoning across vendors
opencode's provider layer uses it; adding a new vendor is a config one-liner
Self-evolving
Agent that writes new SKILL.md files, updates prompts, or augments memory at runtime — so the next run starts from a higher baseline; a step beyond reflection, where what was learned persists
The Skills layer is the persistence entry point · hermes's skill_manage tool · a natural next step after §7 Reflection
2. 五层概念地图: Prompt → Harness2. The five-layer stack: prompt to harness
In 2023 the craft was prompt engineering. In 2024 it moved to context engineering — retrieval, memory, compaction. In 2025 the frontier climbed two more layers: Skills and Harnesses. The five projects on this page are all different takes on harness engineering.
One-line summary: Prompt Engineering is wording; Context Engineering is what fits in the window; Tools are what the agent can do; Skills are when and how to do it; Harness Engineering is the exoskeleton — without it, the LLM brain has nowhere to attach its hands.
Claude Code is best understood as an agentic harness, not "Claude plus a few shell commands." Its value is the outer state machine: how user input, project context, tool schemas, permission policy, hooks, tool results, and compaction summaries are organized into a durable turn loop. The official docs describe the loop as gather context → take action → verify results; this post expands it into implementation-level state transitions.
Claude Code 状态
它在做什么
为什么重要
Context Assembly
读 system prompt、CLAUDE.md / skills / conversation history / tool schemas, 组装本轮请求。
决定模型"看见什么"; 这比单句 prompt 更接近真实能力上限。
Model Step
流式调用模型, 输出自然语言或结构化 tool_use。
模型不直接执行动作, 只声明"我想调用什么工具"。
PreToolUse
工具执行前先跑 hook, 可以改参、拒绝、要求确认、推迟、补充上下文。
这是 Claude Code 的治理入口: 用户能写程序影响 agent, 但强制权限规则仍会评估。
Load system prompt, CLAUDE.md / skills / conversation history / tool schemas, then assemble the request.
Determines what the model can see; this matters more than any single prompt.
Model Step
Stream the model; receive either natural language or structured tool_use.
The model does not act directly; it declares which tool it wants.
PreToolUse
Run hooks before execution; rewrite input, deny, ask, defer, or add context.
This is the governance surface: users can program the agent, while enforced permission rules still apply.
Permission
Allow / ask / deny based on tool type, path, command risk, and user policy.
Separates "the model wants" from "the system permits."
Execute + Observe
The harness runs shell / file / MCP tools and appends tool_result back into history.
This is where action happens; the model learns the result by observation.
Loop / Terminate
If tool calls remain, go back to the model; if none remain, end the turn.
This is why a coding agent can fix bugs over multiple steps.
Compaction
Summarize old history when context is too long, preserving important state.
Long tasks can continue instead of losing the session.
最关键的 transition: assistant_message has ToolUse → 进工具管线; no ToolUse → 进入 stop/结束检查; hook 或 permission denied → 生成 error tool_result 让模型读到; context too long → compact 后继续。这四个分支就是 Claude Code 状态机的骨架。
The key transitions: assistant_message has ToolUse → enter the tool pipeline; no ToolUse → enter stop/finalization checks; hook or permission denied → append an error tool_result for the model to read; context too long → compact then continue. Those four branches are the backbone of the Claude Code state machine.
Tools 和 Skills 的分工是 Anthropic 2025 年在 Agent Skills 博客 + anthropics/skills 仓库里推的核心抽象。openclaw 把它复述为一句话: "Tools are what the agent calls; Skills teach the agent when and how."
The Tools / Skills split is the core abstraction Anthropic pushed in 2025 (see their Agent Skills blog and the anthropics/skills repo). openclaw restates it as: "Tools are what the agent calls; Skills teach the agent when and how."
Having all five layers in place only means the system is theoretically capable; whether it actually works requires real-task success rates. That's exactly what ClawBench measures: live web tasks that grade each layer end-to-end, not offline DOM snapshots you can game.
github.com/anthropics/skills IS the reference repository for this standard — Anthropic's official collection of Skill examples and the origin of the SKILL.md format. Each skill is a folder containing one required file SKILL.md: YAML frontmatter (name + description) followed by a Markdown body. Claude auto-mounts the skill when the description matches the task, reads the body, and follows it.
# 最小模板(来自 anthropics/skills/template/SKILL.md):
---
name: my-skill-name
description: A clear description of what this skill does and when to use it
---
# My Skill Name
[Add your instructions here that Claude will follow when this skill is active]
## Examples · Guidelines · Reference files · etc.
# Minimum template (from anthropics/skills/template/SKILL.md):
---
name: my-skill-name
description: A clear description of what this skill does and when to use it
---
# My Skill Name
[Add your instructions here that Claude will follow when this skill is active]
## Examples · Guidelines · Reference files · etc.
Real example: anthropics/skills/skills/pdf/SKILL.md has a very precise description — "use this skill whenever the user wants to read PDFs / merge / split / rotate / watermark / OCR" — Claude auto-invokes on those keywords. The body contains Python snippets, CLI guidance, links to REFERENCE.md, etc. The repo ships 17 official skills today (algorithmic-art, pdf, docx, pptx, xlsx, mcp-builder, skill-creator, webapp-testing, brand-guidelines, …), covering creative / office / development / enterprise categories.
Tools vs Skills 对照表Tools vs Skills side-by-side
维度
Tools
Skills
是什么
带类型签名的函数
带 YAML frontmatter 的 Markdown 文件夹
谁执行
harness 执行 (调真实 API / shell / FS)
LLM 自己读完照做 (instructions + 参考资料)
回答的问题
"agent 能调什么?"
"什么时候 / 怎么调?"
进入上下文
schema 列在 tools[] 里
description 常驻, 正文按需挂载
跨 harness 复用
每家 harness 都要自己实现
同一 SKILL.md 任何支持的 agent 都能装
例子
bash、read、write、browser、MCP 工具
pdf、mcp-builder、frontend-design
Dimension
Tools
Skills
What
Typed function with a signature
Folder of Markdown with YAML frontmatter
Executor
The harness runs it (hits real APIs / shell / FS)
The LLM reads it and follows (instructions + refs)
How the five relate: Claude Code (≈ claw-code) ships a first-class Skill tool that mounts SKILL.md; openclaw devotes major docs space to Skills (53 community-published); hermes-agent provides skill_view / skills_list / skill_manage tools that load SKILL.md per the agentskills.io spec; opencode's Markdown-frontmatter agents sit close to this idea; codex has no first-class Skill concept — it uses AGENTS.md like CLAUDE.md for per-project instructions.
Flue Framework recasts the "five-layer stack" of this page into a four-layer model: Model · Harness · Sandbox · Filesystem. Its slogan "Not another SDK" is a stance — it doesn't add another chat abstraction; it offers a programmable TypeScript control plane at the harness layer. It earns a place in this comparison because it independently confirms §2's claim: the harness is the real engineering axis.
Flue 的四层 ↔ 本页五层Flue's four layers ↔ the five-layer stack
同一段 agent 代码跑五个部署形态。Node.js · Cloudflare Workers · GitHub Actions · GitLab CI · HTTP 服务 — 同一个 harness 实现可以是常驻 server, 也可以是单次 CLI run, 也可以是 CI 任务里的一段。这把 §6 takeaway #4 "agent as a service"再推一步: service vs CLI 不是架构选型, 是部署目标的旋钮。
Secrets never enter the LLM context. Flue keeps tokens like GITHUB_TOKEN at the harness boundary and only injects them into the child-process env at the moment of shell exec — the agent never sees them, and the sandbox only touches them for one syscall. A natural complement to the §3 "PreToolUse / Permission" governance line: governance gates what to do; secret isolation gates what can be read.
One agent codebase, five deployment shapes. Node.js · Cloudflare Workers · GitHub Actions · GitLab CI · HTTP server — the same harness can run as a long-lived service, a one-shot CLI, or a CI step. This pushes §8 takeaway #4 "agent as a service" one step further: service vs CLI is not an architecture choice, it's a deployment knob.
在本页 5+1 张地图里坐标Where Flue sits on the 5+1 map
维度
Flue
最像谁
不一样在哪
语言
TypeScript
opencode (TS+Go) · openclaw (TS)
纯 TS, 不需要 Go runtime
形态
SDK / library, 用户用 TS 写 agent 入口
opencode 的 core library
更彻底——没有自带 TUI, 部署形态完全交给用户
Sandbox
三档可插
hermes (Modal) · openclaw (Docker)
把"挑哪个 sandbox"做成配置而不是源码 fork
定位
"自主代理可编程控制面"
偏 opencode 的服务化思路
更强调"全栈自控": agent 逻辑 + harness + sandbox 都在你这边
Dimension
Flue
Closest sibling
How it differs
Language
TypeScript
opencode (TS+Go) · openclaw (TS)
Pure TS — no Go runtime needed
Shape
SDK / library; users write the agent entry in TS
opencode's core library
More radical — no bundled TUI, deployment shape is fully user-decided
Sandbox
Three pluggable backends
hermes (Modal) · openclaw (Docker)
Backend choice is a config knob, not a source fork
Stance
"Programmable control plane for autonomous agents"
opencode's service-shaped approach
Pushes harder on full-stack ownership: agent logic + harness + sandbox all yours
One-line placement: if §2 names harness engineering as the layer of 2025, Flue is the most literal attempt to ship that layer as a TS package — it doesn't treat "agent" as a product, but as your TS code on top of a standard harness runtime.
4. 六个流程图4. Six diagrams
下面六张图讲的是这些 harness 怎么运作;想量化它们到底 work 得多好, 用我们的 ClawBench 在真实网页任务上跑一跑就知道。The six diagrams below show how these harnesses work; to quantify how well they actually work, run them against our ClawBench on live web tasks.
ReAct 循环 + 共享迭代预算 + 子代理委派。ReAct loop with shared iteration budget and sub-agent delegation.
flowchart TD
U([User message]):::io
A[Apply prompt cache + memory · every 10 turns]:::ctx
M{{Adapter.stream · Anthropic · Bedrock · Gemini}}:::model
P[Parse tool_calls · preserve reasoning_content]:::model
R[ToolRegistry.dispatch · 47 built-in tools]:::tool
S{delegate_task?}:::decision
SA[[Spawn sub-agent · shared IterationBudget]]:::sub
RES[Append tool results]:::tool
C[ContextCompressor · if near context limit]:::ctx
B{budget > 0?}:::decision
Y([Return final message]):::io
U --> A --> M --> P --> R --> S
S -- yes --> SA --> RES
S -- no --> RES
RES --> C --> B
B -- yes --> A
B -- no --> Y
class U step1
class A step2
class M step3
class P step4
class R step5
class SA step6
class RES step7
class C step8
class B step9
click U call jumpTo("hermes", 1)
click A call jumpTo("hermes", 2)
click M call jumpTo("hermes", 3)
click P call jumpTo("hermes", 4)
click R call jumpTo("hermes", 5)
click SA call jumpTo("hermes", 6)
click RES call jumpTo("hermes", 7)
click C call jumpTo("hermes", 8)
click B call jumpTo("hermes", 9)
classDef io fill:#233042,stroke:#7aa2f7,color:#e6e8ef;
classDef model fill:#2b1f3a,stroke:#bb9af7,color:#e6e8ef;
classDef tool fill:#1f3a2b,stroke:#9ece6a,color:#e6e8ef;
classDef sub fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef;
classDef ctx fill:#3a1f2b,stroke:#f7768e,color:#e6e8ef;
classDef decision fill:#2d2d3a,stroke:#8a93a6,color:#e6e8ef;
Hooks before permissions — a PreToolUse hook can deny, ask, defer, or rewrite a call before the policy engine; enforced deny/ask rules remain the safety boundary.
No in-loop sub-agents — task registry is for async background only; multi-agent coord pushed outside.
Auto-compaction with provenance — summaries logged as SessionCompaction events + health probe.
结构化 output[] 流 + 原生推理项 + 沙箱 bash。Structured output[] stream with first-class reasoning items and sandboxed bash.
flowchart TD
U([User message]):::io
K[_build_api_kwargs · instructions · tools · reasoning.effort]:::ctx
ST{{responses.stream · with reasoning.encrypted_content}}:::model
FB[[Fallback — responses.create stream · synthesize from deltas]]:::model
N[_normalize_codex_response · parse output array]:::model
RS[codex_reasoning_items · dedup by ID across turns]:::ctx
PP[PermissionPolicy · ReadOnly · WorkspaceWrite · DangerFull]:::gate
SB[Exec in sandbox · seatbelt · landlock]:::tool
AP[Append tool result]:::tool
CK{incomplete or commentary}:::decision
Y([Return message]):::io
U --> K --> ST
ST -- transport err --> FB --> N
ST --> N --> RS
RS --> CK
CK -- function_call --> PP --> SB --> AP --> K
CK -- commentary --> K
CK -- completed --> Y
class U step1
class K step2
class ST step3
class FB step4
class N step5
class RS step6
class PP step7
class SB step8
class AP step9
class CK step10
click U call jumpTo("codex", 1)
click K call jumpTo("codex", 2)
click ST call jumpTo("codex", 3)
click FB call jumpTo("codex", 4)
click N call jumpTo("codex", 5)
click RS call jumpTo("codex", 6)
click PP call jumpTo("codex", 7)
click SB call jumpTo("codex", 8)
click AP call jumpTo("codex", 9)
click CK call jumpTo("codex", 10)
classDef io fill:#233042,stroke:#7aa2f7,color:#e6e8ef;
classDef model fill:#2b1f3a,stroke:#bb9af7,color:#e6e8ef;
classDef tool fill:#1f3a2b,stroke:#9ece6a,color:#e6e8ef;
classDef gate fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef;
classDef ctx fill:#3a1f2b,stroke:#f7768e,color:#e6e8ef;
classDef decision fill:#2d2d3a,stroke:#8a93a6,color:#e6e8ef;
HTTP as the boundary — TUI / web / IDE all speak to /session/*; OpenAPI 3.1 spec at /doc (server.ts) and mDNS broadcast (server/mdns.ts) let any client discover and generate an SDK.
Build vs plan modes — plan defaults edits/bash to ask, same loop two personas.
Provider-agnostic — Vercel AI SDK delegates streaming / tool / reasoning to each adapter.
First-class LSP + MCP — code intelligence and external tools sit beside native ones.
Multi-channel routing — IM traffic from Discord / Slack / Telegram / WhatsApp / iMessage / Signal / Matrix / Teams / Google Chat / Zalo all feeds one Gateway (long-lived daemon); CLI / iOS / IDE act as additional entry points, all landing in shared sessions.
Four default tools — read / write / edit / bash are the only ones the model can call directly; find/grep/ls exist as files but aren't mounted, so the model uses native shell via bash.
Two-key steering — while the agent is running tools, Enter interrupts the remaining tools and lands a new message in reasoning; Alt+Enter queues a follow-up until the current run ends.
"What we didn't build" — sub-agents, plan mode, MCP, permission popups, todos, background bash are all deliberately pushed into extensions / packages; build one or install one.
Session as a tree — /tree jumps back to any old message and forks from there; every branch lives in one file; /share uploads to a GitHub gist and returns a shareable URL.
SDK embedding — the same AgentSession runs in four modes: TUI / print(JSON) / RPC / SDK; openclaw uses the SDK to embed pi as its runner.
packages/coding-agent/src/core/agent-session.ts — AgentSession (3099 lines)packages/coding-agent/src/core/tools/{read,write,edit,bash}.ts — 4 default toolspackages/coding-agent/src/core/extensions/ — TS extension runtimepackages/coding-agent/src/core/compaction/ — replaceable compactionpackages/coding-agent/src/modes/{interactive,print-mode.ts,rpc} — 4 run modespackages/coding-agent/src/core/sdk.ts — embed API (used by openclaw)
5. 一眼看懂5. At a glance
下面表里的术语若有陌生, 后面"深度拆解"会讲透——先扫一眼整体。Unfamiliar terms below will be explained in the deep dives — just skim for now.
Why you should care: it shows how to support multiple LLM providers in one loop, and how to delegate to sub-agents without losing control — the first two engineering problems you'll hit when building your own agent.
Adapter pattern (agent/anthropic_adapter.py, gemini_native_adapter.py, ...) — abstract "stream + receive tool calls" interface; each provider implements it. The main loop never knows who it's talking to.
Shared IterationBudget (run_agent.py:730, parent default 90, child default 50) — the main agent decrements once per API call; sub-agents share the same counter AND each is individually capped at 50. Without this, "a dumb agent spawns 10 sub-agents, each spawns 10 more" explodes exponentially.
delegate_task sub-agents (run_agent.py:8076 + tools/delegate_tool.py:13) — shipped as a tool. Calling it spins up an isolated-context child AIAgent with a restricted toolset. The parent only sees the summary; the 50 child tool-calls don't pollute parent context.
Great for RL training: drop into Modal cloud sandbox, run 100 rollouts in parallel, each with its own FS but a shared reward function. MCP server (mcp_serve.py) exposes internal conversations outward, letting Claude Code / Cursor consume hermes as a tool. ClawBench is a natural RL evaluation target for this setup — its per-evidence scores plug straight in as reward signal.
Complex — single file is 12,000 + lines. Multi-provider means you can't 1:1 map each vendor's latest features (e.g., Claude's extended thinking has no Gemini equivalent).
Why you should care: shows how to make an agent safe enough for production — every dangerous action can be intercepted, rewritten, or audited by scripts. Essential for anyone shipping an agent.
注: claw-code 是 Claude Code 的 Rust 开源复刻。Anthropic 官方文档把 Claude Code 定位为围绕 Claude 的 agentic harness(见 How Claude Code Works)。官方可确认的是 agentic loop、工具、权限、hooks、CLAUDE.md / memory、context compaction 这些机制; 本节的具体源码行、100k 压缩阈值、12 阶段 Bootstrap、health probe 是 claw-code 的实现选择, 不等于官方 Claude Code 内部实现。
Note: claw-code is an open-source Rust reimplementation of Claude Code. Anthropic's docs describe Claude Code as an agentic harness around Claude (see How Claude Code Works). The official surface confirms the agentic loop, tools, permissions, hooks, CLAUDE.md / memory, and context compaction; the source lines, 100k compaction threshold, 12-phase bootstrap, and health probe in this section are claw-code implementation choices, not official Claude Code internals.
目的Purpose
做一个可审计、可干预的 Claude Code 开源实现。每一次工具调用都可以被脚本拦截、改参、查权限、记日志、事后清理。
An auditable, interceptable Claude Code reimplementation. Every tool call can be intercepted, rewritten, permission-checked, logged, or cleaned up afterward.
Pre/Post hooks — claw-code's hooks.rs:23HookEvent enum implements three: PreToolUse, PostToolUse, PostToolUseFailure. Anthropic's official Claude Code docs list more lifecycle hooks, including SessionStart, UserPromptSubmit, PreCompact, Stop, ConfigChange, and others. Pre-hooks can allow / ask / deny / defer / modify input / add context; post-hooks handle cleanup / notify / log.
PermissionPolicy (permissions.rs:175authorize_with_context) — post-hook authorization with static rules + interactive prompter; bash commands also run through bash_validation.rs for syntax + danger checks.
Strict ordering (conversation.rs:414): Pre-hook → Permission → Execute → Post-hook. Note: hook runs before permission — a hook can request allow / ask / deny / defer or rewrite a dangerous command before permission evaluates the final call.
Auto-compact + health probe — threshold is DEFAULT_AUTO_COMPACTION_INPUT_TOKENS_THRESHOLD = 100_000 (conversation.rs:18); compaction runs via compact.rs:96 compact_session; right after, a health probe fires (conversation.rs:297 run_session_health_probe) that probes the session to confirm it still works. The action is logged as an AutoCompactionEvent for audit.
Policy/runtime decoupling — how to run bash is runtime's job; whether to allow it is policy's. One JSON config turns the same loop into "fully autonomous" or "ask for every step."
Hook-before-permission = more expressive — traditional permission is allow/deny. Hook is a programmable middle layer. You can do "in prod, rewrite a destructive command into a dry-run or safer target before permission sees it."
Compaction has provenance — not a black-box history wipe; summary + metadata preserved, issues traceable.
Same runtime, different personas: developers get wide permissions with lint hooks; prod gets strict permissions with forced dry-run hooks; teaching mode asks on every bash. Rust impl means fast startup, low memory, embeddable in other programs.
Main loop has no sub-agents (task registry is async background only). Multi-agent coordination is pushed outside the runtime — a deliberate philosophy: "keep agent context focused on work, not meetings."
为什么你该看懂这家: Responses API 是未来几年其他服务商大概率会跟进的方向。提前看懂 = 别家跟进时你能立刻上手。
Why you should care: Responses API is likely the direction other vendors will follow over the next few years. Learn it now, be ready when others catch up.
目的Purpose
展示 OpenAI 把 agent 能力直接内置到 API 会是什么样——不是让客户端组装工具调用, 而是 API 直接返回"我在想什么 / 要调什么工具 / 要说什么"的结构化流。
What it looks like when OpenAI bakes agent capability into the API itself — not client-side tool-call assembly, but the API streaming structured items: "what I'm thinking / which tool to call / what to say."
Responses API, not Chat Completions (run_agent.py:5183responses.stream) — Chat Completions returns a content string + optional tool_calls array (client assembles). Responses API returns an output[] array of typed items: {type: "message"} / {type: "function_call"} / {type: "function_call_output"} / {type: "reasoning"} — clients route by type.
Encrypted reasoning across turns (run_agent.py:7266 dedup logic) — request with include: ["reasoning.encrypted_content"]; Codex returns encrypted reasoning blobs. Those blobs can be fed back as part of the next turn's input — the model "remembers" how it was thinking, multi-turn reasoning stays coherent.
3-step fallback (run_agent.py:5168 → :5297): responses.stream() → retry → responses.create(stream=True) synthesized from deltas. Even if streaming drops, the turn isn't lost.
Codex has hooks too — config.toml accepts [hooks.pre_tool_use] / [hooks.post_tool_use] scripts for pre/post-tool interception (marked stable in April 2026's codex_hooks release).
OS-level sandbox — macOS seatbelt (sandbox-exec) with .sb configs limits FS / network; Linux uses landlock + bubblewrap + seccomp via the codex-linux-sandbox helper. The documented default is read-only (codex --sandbox read-only); the full four modes are read-only / workspace-write / danger-full-access / external-sandbox. Not a container — but the default blocks 99% of accidents.
为什么 workWhy it works
Typed output——不用正则从 content 里抠 tool_use JSON。
推理保留——长任务不失忆, 省 token 一致性好。
OS 沙箱——比 docker 轻 100×, 比纯权限硬一个量级。
Typed output — no more regex-extracting tool_use JSON from content.
Reasoning preserved — long tasks don't go amnesic; saves tokens, stays consistent.
OS sandbox — 100× lighter than docker, an order of magnitude stronger than pure permissions.
First-party optimization — Responses API treats agents as first-class citizens. The client only handles fallback and dedup; the server owns inference, caching, streaming. The whole pipeline is much cleaner than "Chat Completions + hand-rolled agent loop." For OpenAI-committed teams, codex is the highest-ceiling option.
代价Cost
锁定 OpenAI——Responses API 目前只有 OpenAI。推理加密——你拿不到纯文本推理内容, 只能原样传回。
Locked to OpenAI — Responses API is OpenAI-only today. Reasoning is encrypted — you can't inspect it, only pass it back.
3.4 opencode
github.com/sst/opencode
为什么你该看懂这家: 如果你要做 IDE 插件、团队共享 agent、或多端同步, 这是蓝图。
Why you should care: if you want to build an IDE plugin, a team-shared agent, or multi-client sync — this is the blueprint.
One engineering problem: agent logic should not be bound to a TUI. TUI today, VS Code plugin tomorrow, iPhone app the day after — write the agent once.
隐藏系统 agent——compaction(对话过长自动摘要)、summary(生成摘要)、title(自动命名 session)。用户看不到, server 后台在跑。
Client-server split — Server is TypeScript on Bun, holds all sessions + agent loop. Client is Go TUI, but the protocol is open HTTP: POST /session/:id/message, GET /global/event (SSE), POST /session/:id/permissions/:id. OpenAPI 3.1 spec at /doc auto-generates any-language SDKs; mDNS broadcast on startup lets mobile apps discover.
Build vs Plan modes — same tool surface, different permission maps: build (the default) allows edit/write/bash; plan restricts write-class tools (edit / write / patch / bash) to ask or deny (resolved from agent.ts defaults merged with user config), while read-only tools flow through. Two personas, one loop.
Per-tool permissions — each tool independently set to allow | ask | deny, with wildcard support (mymcp_* whitelists a whole MCP bundle). Permission requests flow over HTTP back to the client UI.
Hidden system agents — compaction, summary, title are all marked hidden: true and run server-side on schedule. Users never see them.
Stable protocol → flourishing frontends — HTTP + OpenAPI enables community neovim plugins, mobile apps, web UIs.
Mode switching is free — plan mode is just different permission config, not a separate agent impl.
Vercel AI SDK underneath — swap provider with one config line.
为什么好Why it's good
对团队协作友好——server 跑在共享机器上, 多人接客户端连进来看同一 session。对 IDE 集成友好——任何 IDE 插件都能对接, 不用各自重造 agent。
Team-friendly — run server on a shared machine, multiple clients connect to the same session. IDE-friendly — any IDE plugin can wire up, no need to reinvent the agent.
Why you should care: shows how one agent can handle IM messages, CLI commands, iOS pushes, and IDE sessions in parallel — essential reading if you want an "all-in-one" personal copilot.
A local-first multi-channel agent. The docs position it as "one long-lived Gateway, many channels, one agent" — not another chat box but a control plane on your device. 10 + IM channels (Discord / Slack / Telegram / WhatsApp / iMessage / Signal / Matrix / Teams / Zalo and more) plus CLI / iOS / IDE all feed the same Gateway and share sessions.
Docs terminology: Tools vs Skills.Tools are the typed functions the agent can call (bash / read / write / browser / canvas / ~19 core); Skills are Markdown docs (SKILL.md) injected into the system prompt, teaching when and how to use them. This split is called out in the official docs as the core abstraction.
Gateway + Embedded Runner split (pi-embedded-runner/run.ts) — Gateway is a local WebSocket orchestrator owning channels / cron / auth / sessions; Embedded Runner is the portable agent core (runs in CLI, browser, remote SSH). They meet via session keys.
ToolPolicy pipeline (tool-policy-pipeline.ts) — tools filter by sandbox mode × channel: main session is permissive; sandbox lane only exposes exec · read · write · edit · sessions_*; different messageProvider values (e.g. voice, node) get their own allow/deny mappings. All config-driven, no hardcoding.
Dockerised browser sandbox (sandbox/browser.ts + Dockerfile.sandbox-browser) — every session spins its own container with Chromium + xvfb + noVNC + CDP. Automation, plus you can open port 6080 and literally watch the agent click.
Async compaction (compact.ts) — when context approaches the limit, an async compaction task is queued; the current turn finishes first, then history is summarised. New turns are blocked during compaction so state stays consistent.
ACP bridge (src/acp/session.ts) — openclaw acp exposes the Agent Client Protocol over stdio. Zed, Cursor and other IDEs drive openclaw as a backend without needing native plugins.
Ideal for "personal copilot / on-call bot": the agent watches Slack during a meeting, answers iMessage on the commute, resumes via CLI at home — all the same session. Add ACP and the IDE joins in too. The other four ask you to context-switch tools yourself. The browser sandbox doubles as a ClawBench runner, so you can use openclaw for both web-agent dev and evaluation in one place.
Higher ops cost: Docker + WebSocket + multi-channel webhooks must all be up. run.ts is 2100 + lines of dense logic. Not a plug-and-play mini-tool.
3.6 pi
github.com/badlogic/pi-mono · packages/coding-agentpi.dev
为什么你该看懂这家: 当前面五家比的是"我加了多少特性", pi 反过来比"我能砍掉多少特性还活得下去"。openclaw 的 embedded runner 就是基于 pi 的 SDK——这是 pi 在生产里最好的存在证明。如果你想把 agent 做成一个能放进自己 app 里的库, 而不是一个吞掉用户工作流的 CLI, 这就是范例。
Why you should care: while the other five compete on "how many features I add," pi competes on "how many features I can strip out and still survive." openclaw's embedded runner is built on pi's SDK — the best existence proof in production. If you want an agent shaped like a library you embed in your app, not a CLI that eats your workflow, this is the template.
Author Mario Zechner (badlogicgames; pi.dev donated by exe.dev) calls pi "a minimal terminal coding harness." The model gets four atomic tools (read / write / edit / bash); everything else is grown by users via TypeScript Extensions / Skills / Prompt Templates / Themes, which can be shipped as npm or git packages. The pi.dev homepage literally has a "What we didn't build" section: no MCP, no sub-agents, no plan mode, no permission popups, no built-in to-dos, no background bash — each entry tells you the recommended workaround instead.
核心机制Key mechanisms
4 工具默认(packages/coding-agent/src/core/tools/: read.ts · write.ts · edit.ts · bash.ts)——pi README 第一句:"By default, pi gives the model four tools." 也有 find / grep / ls 文件但默认未挂载, 让模型用 bash 走原生工具链。这跟 claw-code 的 40 + 工具是另一极。
Four-tool default (packages/coding-agent/src/core/tools/: read.ts · write.ts · edit.ts · bash.ts) — the pi README opens with "By default, pi gives the model four tools." find / grep / ls exist as files but aren't mounted by default — the model reaches for native shell via bash. Polar opposite of claw-code's 40 + tools.
Four run modes (src/modes/: interactive/ · print-mode.ts · rpc/) — the same AgentSession (3099 lines, src/core/agent-session.ts) runs as: interactive TUI (default), pi -p "query" for scripts (or --mode json for an event stream), JSON-RPC over stdin/stdout (for non-Node integrators), or embedded via the SDK. openclaw takes the SDK path (see §3.5).
Session as a Git-like tree (src/core/session-manager.ts + compaction/) — sessions persist as trees; /tree jumps to any old message, forks a new branch from there, all branches live in the same file; /share uploads to a GitHub gist and returns a shareable URL. Same primitive as opencode's parentID, but promoted to a first-class UX feature.
Two-key steering vs follow-up (shown on pi.dev) — while the agent is running tools, Enter sends a steering message: the current tool finishes, remaining tools are interrupted, and the new message lands in the model's next reasoning step. Alt+Enter sends a follow-up: queued, applied only after the agent finishes the current run. Two-key formalisation of "I can't wait, let me cut in."
Extensions = TypeScript modules (src/core/extensions/) — register new tools, slash commands, keybindings, TUI overlays. Features don't live in core, they live in extensions: want sub-agents? Write an extension that spawns another pi instance. Want plan mode? Flip edit/bash to ask in an extension. Want MCP? Write an extension that bridges MCP calls into bash.
Skills + Prompt Templates + AGENTS.md/SYSTEM.md (src/core/skills.ts · prompt-templates.ts · system-prompt.ts) — Skills load on demand per the SKILL.md spec without busting the prompt cache (progressive disclosure); Prompt Templates are Markdown, expanded via /name; AGENTS.md is loaded at startup from ~/.pi/agent/, parent directories, and cwd — pi's CLAUDE.md equivalent.
Compaction is replaceable (src/core/compaction/) — default behaviour on threshold is summary-rewrite, but extensions can swap in topic-grouping, code-aware compaction, or a different summarisation model. Where claw-code makes compaction a runtime first-class concept (with health probe), pi exposes it as a hook for you to wire.
15 + providers, subscription or API key (src/core/auth-storage.ts · model-registry.ts) — Anthropic Claude Pro/Max, OpenAI ChatGPT Plus/Pro (Codex), GitHub Copilot, Gemini CLI all flow through OAuth subscription. Fourteen API-key providers listed (Anthropic / OpenAI / Azure / DeepSeek / Bedrock / Mistral / Groq / Cerebras / Cloudflare / xAI / OpenRouter / Vercel AI Gateway etc.). /model or Ctrl+L switches mid-session, Ctrl+P cycles your favourites.
为什么 workWhy it works
Token 效率压榨到极致——4 工具 + 极简 system prompt 意味着每 turn 的 prompt 前缀小, prompt cache 命中率高, 上下文窗口更经得住消耗。pi 主页声称"very token efficient due to its minimal system prompt"。
"Ask pi to build it" 闭环——pi 鼓励你让 pi 自己写一个 extension, /reload 立刻生效。这把"自定义"做成 agent 的自指能力, 不是开发流程外面的事。
Token efficiency squeezed — four tools + a minimal system prompt means small prompt prefixes per turn, higher prompt-cache hit rate, more context budget for actual work. The pi homepage claims it is "very token efficient due to its minimal system prompt."
Small core surface = small bug surface — all variability lives in extensions; core doesn't need to change for new use cases. The opposite bet of claw-code's "every guardrail in the runtime."
"Ask pi to build it" closes the loop — pi encourages you to ask pi itself to write an extension, then /reload makes it live. Customisation is a self-referential capability of the agent, not something outside the dev flow.
For teams who want to embed an agent into their own product, pi is essentially the only option — clean SDK, stable protocol (RPC mode is documented), no imposed UX concepts. openclaw embedding pi as a runtime in its Gateway, while only owning channel routing, is the best advertisement for pi's philosophy. One more thing worth copying: the author publishes his own pi-mono work sessions to Hugging Face via pi-share-hf, donating real OSS workflow data to the RL / agent-training community.
"Deliberately not built" comes with a tax — you grow it yourself. Teams that need permission popups, sub-agents, plan mode, or MCP in production must first write a stack of extensions; reaching for claw-code or opencode is the cheaper path. The two-key steering protocol is elegant but has a learning curve — everyone on a team has to know the Enter vs Alt+Enter distinction or the wrong key will break a long task.
自己造 agent 时可以直接借鉴的设计。顺带一提:把它们造出来后, 用 ClawBench 在真实网页任务上打个分, 就知道到底哪几条 idea 真的 work。
Design patterns you can lift directly for your own agent. And once you've built it, run it against ClawBench on live web tasks to see which of these ideas actually pay off in practice.
No matter how deeply agents nest, total tool calls can't explode. DIY: add a single shared counter decremented by every call — far more robust than "each agent gets its own limit."
Traditional permission is binary allow/deny. Hooks are a programmable middle layer. DIY: expose a "user-injectable function" at every critical decision point, not hard-coded rules.
In multi-turn tasks, the prior turn's reasoning should carry to the next — don't re-think from scratch. DIY: turn on reasoning persistence if the model supports it; if not, inject "last turn's conclusion" via the system prompt manually.
4. 把 agent 做成服务 — opencode4. Agent as a service — opencode
Prompt cache breaks when the prefix changes. Treat dynamic content (memory, hook output) as "only effective for this API call" — don't pollute the history. DIY: history stores only user messages + tool results; all agent-internal metadata lives elsewhere.
6. 预算耗尽时留一次 "grace call" — codex6. Give the model one "grace call" on budget exhaustion — codex
When the tool budget is exhausted, don't hard-error. The codex adapter in hermes grants one more model call (run_agent.py:916 _budget_grace_call) so the model can exit gracefully: summarise what got done, what's left, save partial results. DIY: reserve a single graceful-exit slot in your budget watchdog — the UX upgrade is immediate.
7. Session 分叉作为一等公民 — opencode (源码级)7. Session forking as a first-class primitive — opencode (code-level)
opencode's session.sql.ts uses a parentID field to track session lineage, letting you fork a parallel session from any message. Note: public docs only describe the share feature; forking is present in source but not yet promoted to official docs — this one is from the code. Most agent frameworks treat "undo/retry" as destruction; opencode treats it as a tree. DIY: add a parent_id to persisted messages and you've unlocked "try both approaches at once."
一句话画像One-line mental model
hermes-agent
"ReAct + 预算 + 子代理, 一进程多 provider。""ReAct + budgets + sub-agents, one process, many providers."
The other four optimize for running an agent well; hermes-agent is the only one that simultaneously optimizes for training one. As evaluation, RL, offline analysis, and cross-model comparison become the new battleground, "agent-as-training-target" is the axis that matters — and hermes is architected for it from the process model up to the data flow.
Modal cloud sandbox, one VM per rollout (environments/hermes_swe_env/hermes_swe_env.py:62) — RL demands hundreds of parallel rollouts, each with isolated FS state that survives until the reward function scores it. Local Docker can't hit that scale; Modal's serverless VMs as "one-shot containers" is what makes hermes feasible as an RL target.
Shared IterationBudget + per-child cap (run_agent.py:730, parent 90, child 50) — exploratory training fears fork-bombs. hermes decrements one shared integer from parent through every descendant; any chain that overflows is cut immediately. Without this, one RL epoch can nuke your cloud bill.
Trajectory compression that preserves head + tail (trajectory_compressor.py:86) — on context overflow, hermes summarises only the middle, keeping the opening (system / human / first tool feedback) and the final four turns. Training signal lives at head ("what to do") and tail ("what happened") — few harnesses treat compaction as a training-data concern; most just truncate.
Deterministic cache IDs (run_agent.py:4209, SHA256(fn:args:index)) — every rollout shares the same prompt prefix instead of random UUIDs, so same-batch rollouts hit the prompt cache. At scale, the savings are not small.
Multi-provider adapters in one process (agent/anthropic_adapter.py / bedrock / gemini_native / auxiliary_client) — same rollout can swap between Claude / Gemini / OpenAI with a config change. Cross-model eval, distillation, ablation all become one-line diffs.
MCP server mode as a signal tap (mcp_serve.py:431) — hermes can expose its internal conversations outward via MCP to Claude Code / Cursor and act as a training-signal collector. An agent's output becomes the next agent's input — research-grade bootstrapping.
Error classifier drives recovery (agent/error_classifier.py:24, FailoverReason enum) — auth / rate-limit / context-overflow each take distinct recovery paths; a single bad rollout can't sink a training run.
Why does this beat "pretty" for elegance? Training is the harshest load an agent harness can face: parallel, idempotent, cost-bounded, failure-recoverable, data-traceable. A harness that withstands all five is, by definition, a harness that can also run in production — but not the reverse. Hermes bakes "trainable" into every layer; that system-wide coherence is what real elegance looks like.
荣誉提名 (各自最优雅的一处)Honorable mentions (each has one truly elegant choice)
claw-code's hooks before permission — one tiny ordering choice unlocks an entire programmable policy surface. Binary "allow/deny" → programmable "intercept then ask" is an order-of-magnitude expressivity jump.
codex's typed output[] protocol — the first API to distinguish "the model is talking / calling a tool / thinking" at the protocol level, instead of making clients regex their way through content.
opencode's client-server split — agent-as-HTTP-service, inheriting decades of Unix-pipe and REST wisdom; frontends cost nothing marginally.
openclaw's Gateway + Channel abstraction — collapses "where did this message come from" into one layer; the agent doesn't distinguish Slack from CLI. The most unified worldview among the five.
This is my taste, not the only right answer. Production audit → claw-code. OpenAI stack → codex. Multi-client work → opencode. Personal multi-channel copilot → openclaw. And if you want to actually stress-test any of the five on real web tasks, that's what ClawBench is for. I vote hermes because "being trainable" is the axis that, long-term, will pull the whole agent ecosystem into a new paradigm — if you aren't training now, you probably will be next year.
If this post or ClawBench is useful to you, please cite. Click the button for one-click BibTeX copy.
ClawBench
@article{zhang2026clawbench,
title={ClawBench: Can AI Agents Complete Everyday Online Tasks?},
author={Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},
year={2026},
eprint={2604.08523},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.08523},
}
This post
@misc{zhang2026harnessblog,
author = {Yuxuan Zhang},
title = {Agent Harness Engineering: A Source-Level Comparison of Coding Agents},
year = {2026},
url = {https://reacher-z.github.io/blog/harness/}
}
# Agent Harness Engineering: A Source-Level Comparison of Coding Agents
Author: Yuxuan Zhang (2026)
URL: https://reacher-z.github.io/blog/harness/
## Scope
A source-level comparison of how five open-source coding agents actually run:
1. hermes-agent (Python, multi-provider) — ReAct loop with shared IterationBudget (default 90, child 50); delegate_task sub-agents share the parent's budget; Modal cloud VM per RL rollout; trajectory compression preserves head + last 4 turns; deterministic cache IDs; 47 built-in tools.
2. claw-code (Rust reimplementation of Claude Code) — strict order PreToolUse hook > Permission > Execute > PostToolUse; hooks fire BEFORE permission so users can allow / ask / deny / defer / rewrite a call; DEFAULT_AUTO_COMPACTION_INPUT_TOKENS_THRESHOLD = 100_000; run_session_health_probe at conversation.rs:297.
3. codex (OpenAI Responses API) — typed output[] array with message / function_call / function_call_output / reasoning items; encrypted reasoning carried across turns via include=reasoning.encrypted_content; streaming fallback cascade (stream -> retry -> create(stream=True) -> synthesize); seatbelt (macOS) / landlock + bubblewrap + seccomp (Linux) OS sandbox; default mode read-only.
4. opencode (TS server on Bun + Go TUI) — client-server split over HTTP; OpenAPI 3.1 at /doc; mDNS broadcast; build vs plan agents with different permission maps; hidden system agents (compaction / summary / title); LSP as first-class tool; per-tool allow | ask | deny with wildcards.
5. openclaw (TypeScript, multi-channel) — local Gateway daemon routes 10+ IM channels (Slack / Discord / iMessage / Telegram / WhatsApp / Signal / Matrix / Teams / Google Chat / Zalo) plus CLI / iOS / IDE (ACP) into one session; Docker browser sandbox with Chromium + xvfb + noVNC + CDP; ~19 core Tools + ~53 Skills; Tools vs Skills split = "Tools are what the agent calls; Skills teach when and how".
## Conceptual framework (§2 in the post)
Five-layer stack, bottom to top:
- Prompt Engineering — how to phrase input (system prompt, few-shot, CoT)
- Context Engineering — what fits in the window (retrieval, memory, compaction, prompt cache)
- Tools — typed functions the agent can call
- Skills — Markdown (SKILL.md) teaching when/how to use tools — https://github.com/anthropics/skills is the reference repo
- Harness Engineering — loop + sandbox + budget + hook + session + channel (the five agents above are each a harness)
Anthropic officially calls Claude Code an "agentic harness" — which validates this layering.
## Seven design patterns worth adopting (from §7)
1. Shared budget to prevent runaway — hermes's IterationBudget
2. Hook-before-permission for ultimate expressiveness — claw-code
3. Reasoning persisted across turns — codex
4. Agent as a service (client-server split) — opencode
5. Ephemeral injection to preserve prompt cache — hermes
6. Grace call on budget exhaustion — codex-adapter pattern
7. Session forking as first-class primitive — opencode
## hermes-agent and training (§8)
hermes-agent. Rationale: it is the only harness in the five that is architected for training (RL rollouts in parallel, bounded exploration via shared budget, compaction that preserves training signal, cross-provider adapters for ablations, deterministic cache IDs, MCP as signal collector, semantic error classifier for recovery).
## Related benchmark
ClawBench — live browser-task benchmark that grades whether a harness actually works on real everyday online tasks (cookie popups, dynamic JS, multi-step interactions, traceable per-evidence scoring). arXiv:2604.08523, https://claw-bench.com/
## Where to go next
Open https://reacher-z.github.io/blog/harness/ for the full interactive article with five Mermaid flowcharts you can step through node-by-node, tied to exact file:line citations.