Coding Agent 架构对比 Coding Agents — Architecture Comparison

5 家开源 coding agent (hermes-agent · claw-code · codex · opencode · openclaw) 到底怎么跑, 为什么这么设计。从零讲起, 带交互动画, 每一步对得上真实源码行。 How five open-source coding agents (hermes-agent · claw-code · codex · opencode · openclaw) actually run, and why. Zero-to-deep, with interactive animations tied to real source lines.

一键复制 Quick cite → 跳到引用区→ Jump to cite section

30 秒看懂 agent: 大语言模型 (LLM, 像 Claude / GPT) 本身只做一件事——猜下一个字该是什么。要让它真正去读文件、执行命令、上网搜, 得给它工具, 外面再套一层循环: 它说要调工具 → 框架去执行 → 结果塞回对话 → 它接着想。这一整套就叫 agent。本页拆解了 5 种 agent 的设计差异。第一次看, 先从下面的"基本概念"读起。

30-second primer: an LLM (Claude / GPT / ...) only predicts "what text comes next." To make it really read files, run commands, or browse the web, you wrap it with tools + a loop: it asks to call a tool → the framework runs it → result is fed back → it continues. That wrapper is an "agent." This page compares five of them. New here? Start with "Concepts" below.

1. 基本概念1. Concepts

1. LLM 本质是"下一个 token 预测器"1. LLM is a next-token predictor

你给它一段文本, 它给你接下来最可能出现的一段文本。没了。

Feed it text, it returns the most-likely next chunk of text. That's all.

LLM 只做一件事: 看见前面的 token, 预测下一个 token;
然后把预测的那个加到末尾, 再预测下一个——一步步生成出整段话。

先把文字切成 token(≈ 子词):

   "The cat sat on the"
   ──tokenize──▶ ["The", " cat", " sat", " on", " the"]

然后一步步预测: 

   步 1  看到: ["The", " cat", " sat", " on", " the"]   → 预测 " mat"
   步 2  看到: ["The", " cat", " sat", " on", " the", " mat"] → 预测 "."
   步 3  看到: [..., "."]                                 → 预测 <end> (结束)

每预测一次, 输入就长一点。"历史 + 新词"的总长度就是 context window 的上限所在——
超过了, 要么压缩 (§5.2), 要么丢弃尾部。

An LLM only does one thing: look at the tokens so far, predict the next token;
then append that prediction and predict the next one — step by step, a whole reply appears.

First, text is split into tokens (≈ subwords):

   "The cat sat on the"
   ──tokenize──▶ ["The", " cat", " sat", " on", " the"]

Then one step at a time:

   step 1  sees: ["The", " cat", " sat", " on", " the"]   → predicts " mat"
   step 2  sees: ["The", " cat", " sat", " on", " the", " mat"] → predicts "."
   step 3  sees: [..., "."]                                → predicts <end> (stop)

Each step makes the input one token longer. The total length of "history + new tokens" is
capped by the context window — overflow either gets compacted (§5.2) or truncated.

它不会执行代码、不会读文件、不会上网。这些全是 agent 在外面套的一层。

It can't execute code, read files, or access the web. All of that is what the agent wraps around it.

2. Function calling: 让 LLM "指点"框架去干活2. Function calling: let the LLM direct the framework

给 LLM 一段 system prompt 告诉它"你可以调 read_file(path)"。用户问"看看 /tmp/foo.py"——LLM 不会猜文件内容, 它会返回结构化 JSON:

Tell the LLM in the system prompt "you can call read_file(path)". A user asks "read /tmp/foo.py" — the LLM won't invent file contents, it returns structured JSON:

{
  "stop_reason": "tool_use",
  "content": [
    { "type": "text", "text": "Let me read it." },
    { "type": "tool_use",
      "id": "toolu_01abc",
      "name": "read_file",
      "input": { "path": "/tmp/foo.py" } }
  ]
}

真正去读文件的是外面的 agent 框架——拿到这个 JSON, 调 open(), 把结果塞回对话:

The agent framework does the actual reading — it takes the JSON, calls open(), and feeds the result back:

{
  "role": "tool",
  "tool_use_id": "toolu_01abc",
  "content": "import os\nprint(os.getcwd())\n"
}

这一来一回就是 agent 的一次工具调用。

That round-trip is one tool call inside an agent turn.

3. Agent = LLM + 工具 + 循环3. Agent = LLM + tools + loop

Agent Loop (一次 turn)

Agent loop (one turn)

用户消息
把全部历史给 LLM
LLM 输出文本或工具调用
工具调用 → 框架执行 → 拿到结果
结果塞回历史
回到 2, 直到 LLM 不再要调用

User message
Send full history to the LLM
LLM outputs text or a tool_use
Tool_use → framework runs it → result
Result appended to history
Go to 2 until no more tool_use

就这 6 步。五个 agent 都遵循这个骨架, 差异在每一步做多少事、加了多少保护、怎么扩展。

Six steps. All five agents follow this skeleton — they differ in how much each step does, what safety layers are added, and how it extends.

4. 为什么 agent 能做到 LLM 做不到的事4. Why agents can do what LLMs can't

单次推理无法感知真实世界——"2025 GDP 多少?" LLM 靠记忆, 过期就错。
加工具就能实时访问——web_search → 新鲜结果 → LLM 基于结果答。
加循环就能多步规划——修 bug:read_file → grep → read_file → write_file → pytest, 一次 turn 跑 5–20 次工具调用。

Single inference can't perceive the real world — "2025 GDP?" — relies on memory, becomes stale.
Add tools → real-time access — web_search → fresh results → LLM answers based on them.
Add a loop → multi-step planning — bug fix: read_file → grep → read_file → write_file → pytest, 5–20 tool calls per turn.

核心洞察: LLM 的"脑"很强但"手"是假的。Agent 框架负责给它装上真手, 并保证它不把厨房烧了。五家 agent 的差异, 本质是"装手的方式"和"保证不烧厨房的方式"不同。

把手装好只是第一步;"手装得好不好"要靠真实任务测——ClawBench 就是为这个场景设计的, 专测这五家(及任何 harness)在真实网页任务上能不能完成 cookie 弹窗、动态 JS、多步交互等操作。

Key insight: LLMs have great brains but fake hands. The agent framework installs real hands — and makes sure it doesn't burn the kitchen down. The five differ in how they install the hands and how they prevent fires.

Installing the hands is step one; measuring whether the hands actually work is step two — ClawBench is designed exactly for this: a live benchmark that tests any harness on cookie popups, dynamic JS, multi-step interactions, and real everyday online tasks.

术语表 — 忘记某个词时展开查 Glossary — expand when you forget a word

术语表Glossary

词	一句话解释	在本文里怎么用
LLM	大语言模型, 如 Claude、GPT-4	五个 agent 都是套在 LLM 外面的框架
Token	LLM 看到的最小单位, ≈ 一个子词	`"hello world"` ≈ 2 tokens
Context window	LLM 单次能看到的 token 总量上限	具体上限取决于模型和运行配置; 对 harness 来说关键是快满时如何压缩、裁剪和重载持久记忆
API	程序调程序的接口, 本文指 LLM 服务商 REST API	发 HTTP, 模型返回 JSON
Streaming	边生成边返回 (打字机效果)	减少等待, 能提前开始下一步
Function calling / Tool use	LLM 输出"我要调这个函数"的结构化 JSON	是 agent 能"动手"的前提
Prompt cache	服务端缓存长 system prompt	省钱 (最多 10x) 省延迟
Sandbox	把进程关进小盒子, 限制它能访问什么	防 agent 把电脑搞坏
Provider	LLM 服务商 (Anthropic / OpenAI / Google)	绑一家或做 adapter 兼容多家
Turn	用户说一次话 + agent 干完活返回 = 一个 turn	"一次 turn" = 主循环跑一整圈
ReAct	Reasoning + Acting 循环: 想一下 → 做一下	五个 agent 都是 ReAct 变体
MCP	Model Context Protocol, 外部工具协议	让 agent 接入任意第三方工具
CLAUDE.md / AGENTS.md	项目根目录的约定配置文件	启动时读, 相当于"给 bot 的 README"
Plan-and-execute	先让模型出计划, 再一步步执行的编排模式	opencode 的 plan 模式、claw-code 的 EnterPlanMode
Reflection	agent 完成动作后再自我检查一轮, 发现错误就重试	§7 Takeaway · Reflection 是很多 harness 的辅助轮回
Tools	agent 可以调用的带类型函数 (读文件、执行 bash、浏览网页…)	§2 Tools vs Skills 对照表
Skills	教 agent "什么时候、怎么用工具" 的 Markdown 文件 (`SKILL.md`)	§2 · anthropics/skills
Subagent	父 agent 派出的子 agent, context 隔离, 只回传总结	hermes `delegate_task`、opencode @general/@explore
Orchestration	决定"谁来做、按什么顺序、出错怎么接"——harness 的外层编排	§2 五层地图的最顶层
Hook	用户配置的脚本, 在 agent 生命周期关键时刻 (工具前/后) 自动跑, 能拦截 / 改写 / 否决 / 记日志	claw-code 的 PreToolUse / PostToolUse
Adapter	通用 agent loop 和具体 provider API 之间的翻译层; 换 adapter 就换 provider, loop 一行不用动	hermes 的 anthropic_adapter / gemini_native
Compaction	对话历史超过 context 上限时, 自动摘要旧 turn、保留头尾的行为	claw-code 的 auto-compact · hermes 的 trajectory 压缩
Rollout	agent 一次从头跑到尾的完整 turn 序列; RL 训练里通常并行跑几百个	§8 hermes 的 Modal 云沙箱每 rollout 一个 VM
SSE	Server-Sent Events, 单向 HTTP 流式推送协议	opencode 用它把 agent 事件从 server 流回 TUI
ACP	Agent Client Protocol, IDE 与 agent 之间的 stdio 协议 (Zed / Cursor 推动)	openclaw 的 acp 桥、opencode 的 acp 支持
LSP	Language Server Protocol, IDE 与语言服务器 (跳转定义、查引用、诊断…) 之间的协议	opencode 把 LSP 做成一等工具
CDP	Chrome DevTools Protocol, 程序化控制 Chromium 的协议 (无头浏览器自动化的基础)	openclaw 浏览器沙箱用 CDP 给 agent 下操作指令
noVNC	浏览器里的 VNC 客户端, 允许你通过 HTTP 端口远程看到沙箱里的图形桌面	openclaw 6080 端口可"看 agent 在浏览器里点什么"
RL (强化学习)	让 agent 反复试错、按"奖励函数"给出的分数学习的训练范式; 通常需要并行跑几百个 rollout	§8 讨论 hermes 与训练的契合——它被设计成 RL-friendly harness
Modal	serverless 云 VM 服务商, 按秒计费、秒级启停; agent 可以把每个 rollout 扔进一个独立 VM	hermes 的 RL 沙箱就是基于 Modal
AWS Bedrock	AWS 托管的 LLM API 网关, 里面可以调 Claude、Llama、Mistral 等多家模型	hermes 的 bedrock_adapter 就是对接它
OpenRouter	第三方 LLM 路由服务, 一个 API key 调所有主流模型, 自动做限流 / 回退	hermes 支持它作为 provider 之一
OS 沙箱 (seatbelt / landlock / bubblewrap / seccomp)	操作系统层的进程隔离原语: seatbelt (macOS `sandbox-exec`) · landlock (Linux 内核自愿放弃能力) · bubblewrap (用户态容器) · seccomp (系统调用白名单)	codex 默认模式就用这一套挡 99% 误操作
Vercel AI SDK	Vercel 出的 provider-agnostic TypeScript SDK, 抽象了 streaming / tool-calling / reasoning 的跨家差异	opencode 的 provider 层直接用它, 新加一家只改一行配置
Self-evolving(自进化)	agent 在运行中自己写新 `SKILL.md`、改 prompt 或更新 memory, 下一次起跑点比上一次更高; 比 reflection 更进一步——学到的东西能持久化, 不只是当轮改错	Skills 层就是自进化的产物入口 · hermes 的 `skill_manage` 工具 · §7 Reflection 之后的下一层 takeaway

Term	One-liner	Usage in this doc
LLM	Large Language Model (Claude, GPT-4, etc.)	All five agents wrap an LLM
Token	Smallest unit the LLM sees, ≈ a subword	`"hello world"` ≈ 2 tokens
Context window	Max tokens the LLM can see at once	The exact limit depends on model and runtime config; for harness design, the key issue is compaction, truncation, and reloading persistent memory near the limit
API	How programs call programs; here: LLM REST APIs	Send HTTP, get JSON back
Streaming	Return tokens as they are generated	Lower latency, pipeline next step earlier
Function calling / Tool use	LLM returns structured "please call this tool" JSON	Prerequisite for an agent to "do things"
Prompt cache	Server-side cache of long system prompt	Up to 10× cheaper, lower latency
Sandbox	Confined process env (FS/network limited)	Keeps agent from wrecking your machine
Provider	LLM vendor (Anthropic / OpenAI / Google)	Pick one, or write adapters
Turn	One user message + full agent response cycle	"One turn" = the main loop runs a full pass
ReAct	Reasoning + Acting loop: think → act → think	All five are ReAct variants
MCP	Model Context Protocol for external tools	Lets agents plug in any 3rd-party tool
CLAUDE.md / AGENTS.md	Root-level project convention file	Read at startup; a "README for bots"
Plan-and-execute	Ask the model to plan first, then execute step by step	opencode's plan mode, claw-code's EnterPlanMode
Reflection	Agent self-reviews after acting; retries on error	§7 Takeaway · common auxiliary loop
Tools	Typed functions the agent can call (read file, run bash, browse…)	§2 Tools vs Skills table
Skills	Markdown files (`SKILL.md`) teaching when/how to use tools	§2 · anthropics/skills
Subagent	Child agent spawned by a parent; isolated context; returns summary only	hermes `delegate_task`, opencode @general/@explore
Orchestration	"Who does what, in what order, with what fallback" — the harness's outer layer	Top row of §2's five-layer map
Hook	User-configured script run at lifecycle moments (before / after a tool call); can intercept, modify, veto, or log	claw-code's PreToolUse / PostToolUse
Adapter	Translation layer between a generic agent loop and a specific provider's API; swap adapter → swap provider, loop unchanged	hermes's anthropic_adapter / gemini_native
Compaction	Auto-summarise old turns when history exceeds the context window, preserving head and tail	claw-code's auto-compact · hermes's trajectory compression
Rollout	One full start-to-end turn sequence of an agent; in RL you run hundreds in parallel	§8 hermes's Modal cloud VM per rollout
SSE	Server-Sent Events, a one-way HTTP streaming protocol	opencode pushes agent events from server to TUI over SSE
ACP	Agent Client Protocol; a stdio protocol between an IDE and an agent (pushed by Zed / Cursor)	openclaw's acp bridge, opencode's acp support
LSP	Language Server Protocol; the standard protocol between an IDE and a language server (goto-definition, find-references, diagnostics, ...)	opencode ships LSP as a first-class tool
CDP	Chrome DevTools Protocol; a wire protocol for programmatically controlling Chromium (the foundation of headless browser automation)	openclaw's browser sandbox drives the agent via CDP
noVNC	A VNC client that runs in the browser, letting you view a sandbox's GUI desktop over HTTP	openclaw's port 6080 lets you "watch the agent click around in Chromium"
RL (Reinforcement Learning)	A training paradigm where the agent learns by trial and error, scored by a "reward function"; usually runs hundreds of rollouts in parallel	§8 discusses why hermes suits training — it's designed as an RL-friendly harness
Modal	A serverless cloud-VM provider with per-second billing and sub-second cold start; an agent can launch one isolated VM per rollout	hermes's RL sandbox runs on Modal
AWS Bedrock	AWS-managed LLM API gateway that serves Claude, Llama, Mistral and others behind one interface	hermes's bedrock_adapter targets it
OpenRouter	Third-party LLM-routing service: one API key calls every major provider, with automatic rate-limit / fallback handling	supported as a hermes provider
OS sandboxes (seatbelt / landlock / bubblewrap / seccomp)	OS-level process-isolation primitives: seatbelt (macOS `sandbox-exec`) · landlock (Linux capability-dropping kernel feature) · bubblewrap (userland container) · seccomp (syscall allowlist)	codex's default mode stacks these to block 99% of accidents
Vercel AI SDK	Provider-agnostic TypeScript SDK from Vercel that abstracts streaming / tool-calling / reasoning across vendors	opencode's provider layer uses it; adding a new vendor is a config one-liner
Self-evolving	Agent that writes new `SKILL.md` files, updates prompts, or augments memory at runtime — so the next run starts from a higher baseline; a step beyond reflection, where what was learned persists	The Skills layer is the persistence entry point · hermes's `skill_manage` tool · a natural next step after §7 Reflection

2. 五层概念地图: Prompt → Harness2. The five-layer stack: prompt to harness

2023 年大家对着 prompt 雕花; 2024 年重心转到 context engineering (检索、memory、压缩); 2025 年前沿又往上走了两层——Skills 和 Harness。本页比的五个开源项目, 本质上都是 harness 工程的不同答卷。

In 2023 the craft was prompt engineering. In 2024 it moved to context engineering — retrieval, memory, compaction. In 2025 the frontier climbed two more layers: Skills and Harnesses. The five projects on this page are all different takes on harness engineering.

层	管什么	产物 / 例子	在本页哪里体现
Harness Engineering	主循环 · 沙箱 · 预算 · hook · session · channel	hermes-agent / claw-code / codex / opencode / openclaw 都是 harness	§4 流程图 + §5 深度拆解
Skills	"什么时候 / 怎么用工具"——可复用的 procedural knowledge	Anthropic `SKILL.md`(markdown + YAML frontmatter), anthropics/skills, agentskills.io spec	openclaw 文档、hermes `skill_manage`、claw-code /Claude Code Skill 工具
Tools	"agent 能调什么"——typed 函数	`bash` / `read` / `write` / `browser` / MCP	§4 流程图里绿色节点 + §1 术语表
Context Engineering	"窗口里装什么"——检索、memory、compaction、prompt cache	RAG、`MEMORY.md`、auto-compaction、cache_control	§1.2 function calling · §5 各家的压缩策略
Prompt Engineering	"输入文本怎么写"	system prompt · few-shot · chain-of-thought	所有层都建立在它之上

Layer	Concern	Artifacts / examples	Where it shows on this page
Harness Engineering	Main loop · sandbox · budget · hook · session · channel	hermes-agent / claw-code / codex / opencode / openclaw are all harnesses	§4 diagrams + §5 deep dives
Skills	"When and how to use tools" — reusable procedural knowledge	Anthropic `SKILL.md` (markdown + YAML frontmatter), anthropics/skills, agentskills.io spec	openclaw docs, hermes `skill_manage`, Claude Code Skill tool
Tools	"What the agent can call" — typed functions	`bash` / `read` / `write` / `browser` / MCP	Green nodes in §4 diagrams + glossary in §1
Context Engineering	"What goes in the window" — retrieval, memory, compaction, cache	RAG, `MEMORY.md`, auto-compaction, cache_control	§1.2 function calling · §5 each project's compaction story
Prompt Engineering	"How to phrase the input"	System prompt · few-shot · chain-of-thought	Every layer above stands on it

一句话记法: Prompt Engineering 是措辞; Context Engineering 是窗口里装什么; Tools 是能干什么; Skills 是什么时候怎么干; Harness Engineering 是整个外骨骼——没有它 LLM 的脑袋没地方安手。

One-line summary: Prompt Engineering is wording; Context Engineering is what fits in the window; Tools are what the agent can do; Skills are when and how to do it; Harness Engineering is the exoskeleton — without it, the LLM brain has nowhere to attach its hands.

3. Claude Code 重点理解3. Understanding Claude Code

Claude Code 最该被理解成一个 agentic harness, 而不是"Claude 加了几个 shell 命令"。它的核心价值不在模型本身, 而在外层状态机: 怎么把用户输入、项目上下文、工具 schema、权限策略、hook、工具结果、压缩摘要组织成一个可持续运行的 turn loop。官方文档把这个循环概括成 gather context → take action → verify results; 本文把它拆成更工程化的状态转移。

Claude Code is best understood as an agentic harness, not "Claude plus a few shell commands." Its value is the outer state machine: how user input, project context, tool schemas, permission policy, hooks, tool results, and compaction summaries are organized into a durable turn loop. The official docs describe the loop as gather context → take action → verify results; this post expands it into implementation-level state transitions.

Claude Code 状态	它在做什么	为什么重要
`Context Assembly`	读 system prompt、`CLAUDE.md` / skills / conversation history / tool schemas, 组装本轮请求。	决定模型"看见什么"; 这比单句 prompt 更接近真实能力上限。
`Model Step`	流式调用模型, 输出自然语言或结构化 `tool_use`。	模型不直接执行动作, 只声明"我想调用什么工具"。
`PreToolUse`	工具执行前先跑 hook, 可以改参、拒绝、要求确认、推迟、补充上下文。	这是 Claude Code 的治理入口: 用户能写程序影响 agent, 但强制权限规则仍会评估。
`Permission`	根据工具类型、路径、命令危险度、用户策略做 allow / ask / deny。	把"模型想做"和"系统允许做"分开, 防止工具失控。
`Execute + Observe`	harness 执行真实 shell / 文件 / MCP 工具, 把结果作为 `tool_result` 放回消息历史。	LLM 的行动能力来自这里; 它通过观察结果进入下一步推理。
`Loop / Terminate`	如果还有 tool call 就回到下一次 model step; 如果没有 tool call, 本 turn 结束。	这就是 coding agent 能多步修 bug 的原因。
`Compaction`	上下文过长时摘要旧历史, 保留关键状态。	长任务能继续跑, 不会因为 context 爆掉直接失忆。

Claude Code state	What it does	Why it matters
`Context Assembly`	Load system prompt, `CLAUDE.md` / skills / conversation history / tool schemas, then assemble the request.	Determines what the model can see; this matters more than any single prompt.
`Model Step`	Stream the model; receive either natural language or structured `tool_use`.	The model does not act directly; it declares which tool it wants.
`PreToolUse`	Run hooks before execution; rewrite input, deny, ask, defer, or add context.	This is the governance surface: users can program the agent, while enforced permission rules still apply.
`Permission`	Allow / ask / deny based on tool type, path, command risk, and user policy.	Separates "the model wants" from "the system permits."
`Execute + Observe`	The harness runs shell / file / MCP tools and appends `tool_result` back into history.	This is where action happens; the model learns the result by observation.
`Loop / Terminate`	If tool calls remain, go back to the model; if none remain, end the turn.	This is why a coding agent can fix bugs over multiple steps.
`Compaction`	Summarize old history when context is too long, preserving important state.	Long tasks can continue instead of losing the session.

最关键的 transition: assistant_message has ToolUse → 进工具管线; no ToolUse → 进入 stop/结束检查; hook 或 permission denied → 生成 error tool_result 让模型读到; context too long → compact 后继续。这四个分支就是 Claude Code 状态机的骨架。

The key transitions: assistant_message has ToolUse → enter the tool pipeline; no ToolUse → enter stop/finalization checks; hook or permission denied → append an error tool_result for the model to read; context too long → compact then continue. Those four branches are the backbone of the Claude Code state machine.

Tools 和 Skills 的分工是 Anthropic 2025 年在 Agent Skills 博客 + anthropics/skills 仓库里推的核心抽象。openclaw 把它复述为一句话: "Tools are what the agent calls; Skills teach the agent when and how."

The Tools / Skills split is the core abstraction Anthropic pushed in 2025 (see their Agent Skills blog and the anthropics/skills repo). openclaw restates it as: "Tools are what the agent calls; Skills teach the agent when and how."

Having all five layers in place only means the system is theoretically capable; whether it actually works requires real-task success rates. That's exactly what ClawBench measures: live web tasks that grade each layer end-to-end, not offline DOM snapshots you can game.

Skill 的标准: SKILL.mdThe SKILL.md standard

github.com/anthropics/skills 就是这个标准的官方参考实现——Anthropic 发布的 Skill 示例合集, 也是 SKILL.md 格式的源头。每个 skill 是一个文件夹, 里面有一个必写文件 SKILL.md: YAML frontmatter 头 (name + description) + Markdown 正文。Claude 在 session 里看到相关任务时, 按 description 自动挂载、读正文、照做。

github.com/anthropics/skills IS the reference repository for this standard — Anthropic's official collection of Skill examples and the origin of the SKILL.md format. Each skill is a folder containing one required file SKILL.md: YAML frontmatter (name + description) followed by a Markdown body. Claude auto-mounts the skill when the description matches the task, reads the body, and follows it.

# 最小模板(来自 anthropics/skills/template/SKILL.md):
---
name: my-skill-name
description: A clear description of what this skill does and when to use it
---

# My Skill Name

[Add your instructions here that Claude will follow when this skill is active]

## Examples · Guidelines · Reference files · etc.

# Minimum template (from anthropics/skills/template/SKILL.md):
---
name: my-skill-name
description: A clear description of what this skill does and when to use it
---

# My Skill Name

[Add your instructions here that Claude will follow when this skill is active]

## Examples · Guidelines · Reference files · etc.

真实例子: anthropics/skills/skills/pdf/SKILL.md 的 description 写得很细——"用户要读 PDF / 合并 / 分页 / 旋转 / 水印 / OCR 时用这个 skill"——Claude 看到这些关键字就自动挂上。正文里放 Python 代码片段、命令行工具指引、REFERENCE.md 链接等可复用知识。仓库目前有 17 个官方 skill (algorithmic-art / pdf / docx / pptx / xlsx / mcp-builder / skill-creator / webapp-testing / brand-guidelines 等), 覆盖创作 / 办公文档 / 开发 / 企业协作四大类。

Real example: anthropics/skills/skills/pdf/SKILL.md has a very precise description — "use this skill whenever the user wants to read PDFs / merge / split / rotate / watermark / OCR" — Claude auto-invokes on those keywords. The body contains Python snippets, CLI guidance, links to REFERENCE.md, etc. The repo ships 17 official skills today (algorithmic-art, pdf, docx, pptx, xlsx, mcp-builder, skill-creator, webapp-testing, brand-guidelines, …), covering creative / office / development / enterprise categories.

Tools vs Skills 对照表Tools vs Skills side-by-side

维度	Tools	Skills
是什么	带类型签名的函数	带 YAML frontmatter 的 Markdown 文件夹
谁执行	harness 执行 (调真实 API / shell / FS)	LLM 自己读完照做 (instructions + 参考资料)
回答的问题	"agent 能调什么?"	"什么时候 / 怎么调?"
进入上下文	schema 列在 `tools[]` 里	description 常驻, 正文按需挂载
跨 harness 复用	每家 harness 都要自己实现	同一 `SKILL.md` 任何支持的 agent 都能装
例子	`bash`、`read`、`write`、`browser`、MCP 工具	`pdf`、`mcp-builder`、`frontend-design`

Dimension	Tools	Skills
What	Typed function with a signature	Folder of Markdown with YAML frontmatter
Executor	The harness runs it (hits real APIs / shell / FS)	The LLM reads it and follows (instructions + refs)
Question	"What can the agent call?"	"When and how should it call things?"
Context cost	Schema sits in `tools[]`	Description always loaded; body mounted on demand
Portability	Each harness re-implements	Same `SKILL.md` works on any compatible agent
Examples	`bash`, `read`, `write`, `browser`, MCP tools	`pdf`, `mcp-builder`, `frontend-design`

在本页五家里对号入座: Claude Code (≈ claw-code) 提供一级 Skill 工具直接挂载 SKILL.md; openclaw 文档大篇幅讲 Skills, 社区 53 个已公开; hermes-agent 提供 skill_view / skills_list / skill_manage 三件工具, 按 agentskills.io spec 加载 SKILL.md; opencode 以 Markdown frontmatter 定义 agent 接近此思路; codex 没有一等 Skill 概念, 用 AGENTS.md 做类 CLAUDE.md 的项目注入。

How the five relate: Claude Code (≈ claw-code) ships a first-class Skill tool that mounts SKILL.md; openclaw devotes major docs space to Skills (53 community-published); hermes-agent provides skill_view / skills_list / skill_manage tools that load SKILL.md per the agentskills.io spec; opencode's Markdown-frontmatter agents sit close to this idea; codex has no first-class Skill concept — it uses AGENTS.md like CLAUDE.md for per-project instructions.

3.5 Flue Framework: 把 harness 写成可编程 TS3.5 Flue Framework: harness as programmable TypeScript

Flue Framework 把本页讨论的"五层栈"重新画成四层模型: Model · Harness · Sandbox · Filesystem。它的口号 "Not another SDK" 表明态度——不是再造一套 chat 抽象, 而是给 harness 这一层提供可编程的 TypeScript 控制面。把它放进本页的对比, 价值在于: 它用一个外部视角验证了 §2 五层地图里 harness 是真正的工程主轴。

Flue Framework recasts the "five-layer stack" of this page into a four-layer model: Model · Harness · Sandbox · Filesystem. Its slogan "Not another SDK" is a stance — it doesn't add another chat abstraction; it offers a programmable TypeScript control plane at the harness layer. It earns a place in this comparison because it independently confirms §2's claim: the harness is the real engineering axis.

Flue 的四层 ↔ 本页五层Flue's four layers ↔ the five-layer stack

Flue 层	它管什么	对应本页 §2 哪一层
Model	tokens · tools · prompts	Prompt + Context + Tools
Harness	skills · memory · sessions	Skills + Harness Engineering
Sandbox	bash 执行 · 隔离 · 网络管理	Harness Engineering 里的 sandbox 子模块
Filesystem	read / write / grep / glob	Tools 层的核心成员

Flue layer	What it owns	Maps to §2 layer
Model	tokens · tools · prompts	Prompt + Context + Tools
Harness	skills · memory · sessions	Skills + Harness Engineering
Sandbox	bash exec · isolation · network policy	Sandbox sub-module of Harness Engineering
Filesystem	read / write / grep / glob	Core members of the Tools layer

三个一等概念Three first-class primitives

概念	Flue 怎么定义	对照其他 harness
Session	持续的工作状态容器, 可挂 skill / 跑 prompt / 执 shell	≈ Claude Code 的 turn loop · opencode 的 session.sql · openclaw 的 session 路由
Skill	带结构化输入输出的可复用 workflow (例: `triage(issueNumber)` → typed result)	比 Anthropic `SKILL.md` 更像"带 schema 的子程序"——更接近 hermes 的 `delegate_task` + skill
Sandbox	三档可换: 内置零配置 virtual sandbox / 远程容器 (Daytona) / 云后端 (Cloudflare Durable Object + SQLite + R2)	codex 走 OS 原语 (seatbelt/landlock); hermes 走 Modal 云 VM; Flue 把这条选项做成plug-in

Concept	Flue's definition	Counterpart in the five
Session	Durable work-state container; you can mount skills, run prompts, exec shell	≈ Claude Code's turn loop · opencode's session.sql · openclaw's session routing
Skill	Reusable workflow with structured I/O (e.g. `triage(issueNumber)` → typed result)	More "schema'd subroutine" than Anthropic's `SKILL.md` — closer to hermes's `delegate_task` + skill
Sandbox	Three pluggable backends: built-in virtual sandbox / remote container (Daytona) / cloud (Cloudflare Durable Object + SQLite + R2)	codex uses OS primitives (seatbelt/landlock); hermes uses Modal VMs; Flue makes the choice swappable

两个值得抄的设计Two ideas worth lifting

敏感凭证从不进 LLM 上下文。Flue 把 GITHUB_TOKEN 这类 secret 存在 harness 边界, 仅在执行 shell 命令的瞬间注入到子进程环境——agent 看不到, sandbox 也只在执行那一拍接触。这是对 §3 "PreToolUse / Permission" 那条治理线的天然补强: 治理负责挡住"做什么", secret 隔离负责挡住"读什么"。
同一段 agent 代码跑五个部署形态。Node.js · Cloudflare Workers · GitHub Actions · GitLab CI · HTTP 服务 — 同一个 harness 实现可以是常驻 server, 也可以是单次 CLI run, 也可以是 CI 任务里的一段。这把 §6 takeaway #4 "agent as a service"再推一步: service vs CLI 不是架构选型, 是部署目标的旋钮。

Secrets never enter the LLM context. Flue keeps tokens like GITHUB_TOKEN at the harness boundary and only injects them into the child-process env at the moment of shell exec — the agent never sees them, and the sandbox only touches them for one syscall. A natural complement to the §3 "PreToolUse / Permission" governance line: governance gates what to do; secret isolation gates what can be read.
One agent codebase, five deployment shapes. Node.js · Cloudflare Workers · GitHub Actions · GitLab CI · HTTP server — the same harness can run as a long-lived service, a one-shot CLI, or a CI step. This pushes §8 takeaway #4 "agent as a service" one step further: service vs CLI is not an architecture choice, it's a deployment knob.

在本页 5+1 张地图里坐标Where Flue sits on the 5+1 map

维度	Flue	最像谁	不一样在哪
语言	TypeScript	opencode (TS+Go) · openclaw (TS)	纯 TS, 不需要 Go runtime
形态	SDK / library, 用户用 TS 写 agent 入口	opencode 的 core library	更彻底——没有自带 TUI, 部署形态完全交给用户
Sandbox	三档可插	hermes (Modal) · openclaw (Docker)	把"挑哪个 sandbox"做成配置而不是源码 fork
定位	"自主代理可编程控制面"	偏 opencode 的服务化思路	更强调"全栈自控": agent 逻辑 + harness + sandbox 都在你这边

Dimension	Flue	Closest sibling	How it differs
Language	TypeScript	opencode (TS+Go) · openclaw (TS)	Pure TS — no Go runtime needed
Shape	SDK / library; users write the agent entry in TS	opencode's core library	More radical — no bundled TUI, deployment shape is fully user-decided
Sandbox	Three pluggable backends	hermes (Modal) · openclaw (Docker)	Backend choice is a config knob, not a source fork
Stance	"Programmable control plane for autonomous agents"	opencode's service-shaped approach	Pushes harder on full-stack ownership: agent logic + harness + sandbox all yours

一句话定位: 如果 §2 把 harness 列为 2025 年最值得做的工程层, Flue 就是把这层做成一个 TS package的最直接尝试——它没有把"agent"当成产品, 而是当成由你写的 TS 代码 + 一个标准 harness runtime。

One-line placement: if §2 names harness engineering as the layer of 2025, Flue is the most literal attempt to ship that layer as a TS package — it doesn't treat "agent" as a product, but as your TS code on top of a standard harness runtime.

4. 六个流程图4. Six diagrams

下面六张图讲的是这些 harness 怎么运作;想量化它们到底 work 得多好, 用我们的 ClawBench 在真实网页任务上跑一跑就知道。The six diagrams below show how these harnesses work; to quantify how well they actually work, run them against our ClawBench on live web tasks.

hermes-agent

Python · multi-provider

ReAct 循环 + 共享迭代预算 + 子代理委派。 ReAct loop with shared iteration budget and sub-agent delegation.

flowchart TD U([User message]):::io A[Apply prompt cache + memory · every 10 turns]:::ctx M{{Adapter.stream · Anthropic · Bedrock · Gemini}}:::model P[Parse tool_calls · preserve reasoning_content]:::model R[ToolRegistry.dispatch · 47 built-in tools]:::tool S{delegate_task?}:::decision SA[[Spawn sub-agent · shared IterationBudget]]:::sub RES[Append tool results]:::tool C[ContextCompressor · if near context limit]:::ctx B{budget > 0?}:::decision Y([Return final message]):::io U --> A --> M --> P --> R --> S S -- yes --> SA --> RES S -- no --> RES RES --> C --> B B -- yes --> A B -- no --> Y class U step1 class A step2 class M step3 class P step4 class R step5 class SA step6 class RES step7 class C step8 class B step9 click U call jumpTo("hermes", 1) click A call jumpTo("hermes", 2) click M call jumpTo("hermes", 3) click P call jumpTo("hermes", 4) click R call jumpTo("hermes", 5) click SA call jumpTo("hermes", 6) click RES call jumpTo("hermes", 7) click C call jumpTo("hermes", 8) click B call jumpTo("hermes", 9) classDef io fill:#233042,stroke:#7aa2f7,color:#e6e8ef; classDef model fill:#2b1f3a,stroke:#bb9af7,color:#e6e8ef; classDef tool fill:#1f3a2b,stroke:#9ece6a,color:#e6e8ef; classDef sub fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef ctx fill:#3a1f2b,stroke:#f7768e,color:#e6e8ef; classDef decision fill:#2d2d3a,stroke:#8a93a6,color:#e6e8ef;

Step 0 / 9

共享预算——父 + 所有子代理共用一个 IterationBudget, 不会 fork-bomb。
临时注入——memory nudge (读取 MEMORY.md 与 USER.md) 只在 API 调用时加, 不污染 prompt cache 前缀。
Adapter 分发——一套 loop 对接 N 家 provider; 错误分类器自动切换。
Modal 沙箱——每个 rollout 独立云 VM, RL 奖励函数看到一致的 FS 状态。

Shared budget — parent + all sub-agents draw from one IterationBudget; can't fork-bomb.
Ephemeral injections — memory nudges (reading MEMORY.md and USER.md) added at API time only, keeping cache prefix stable.
Adapter fan-out — one loop, N providers; error classifier routes failures.
Modal sandbox — each rollout in its own cloud VM; RL reward funcs see identical FS.

run_agent.py:634 — max_iterations=90 default run_agent.py:730 — IterationBudget init run_agent.py:8076 — delegate_task dispatch run_agent.py:100 — apply_anthropic_cache_control

claw-code

Rust · hooks-first

Claude Code 风格状态机: 模型流式产出, 有 ToolUse 就 hook → permission → execute → hook, 没有 ToolUse 就进入结束检查。 Claude Code-style state machine: stream model output; ToolUse triggers hook → permission → execute → hook; no ToolUse enters finalization checks.

flowchart TD U([User message]):::io B[BootstrapPlan — 12 phases, once per session]:::ctx L[Assemble ApiRequest · system_prompt + messages]:::ctx API{{ApiClient.stream · AssistantEvent · PromptCacheEvent}}:::model TU[Parse ToolUses]:::model H1[PreToolUse hook · allow · ask · deny · defer · modify]:::gate PG[PermissionPolicy · authorize_with_context]:::gate EX[Execute tool · bash · file · mcp · web]:::tool H2[PostToolUse hook · success or failure]:::gate CMP[Auto-compact + health probe]:::ctx TS([TurnSummary · persist Session]):::io U --> B --> L --> API --> TU --> H1 --> PG --> EX --> H2 --> CMP CMP -- more tools --> L CMP -- done --> TS class U step1 class B step2 class L step3 class API step4 class TU step5 class H1 step6 class PG step7 class EX step8 class H2 step9 class CMP step10 click U call jumpTo("claw", 1) click B call jumpTo("claw", 2) click L call jumpTo("claw", 3) click API call jumpTo("claw", 4) click TU call jumpTo("claw", 5) click H1 call jumpTo("claw", 6) click PG call jumpTo("claw", 7) click EX call jumpTo("claw", 8) click H2 call jumpTo("claw", 9) click CMP call jumpTo("claw", 10) classDef io fill:#233042,stroke:#7aa2f7,color:#e6e8ef; classDef model fill:#2b1f3a,stroke:#bb9af7,color:#e6e8ef; classDef tool fill:#1f3a2b,stroke:#9ece6a,color:#e6e8ef; classDef gate fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef ctx fill:#3a1f2b,stroke:#f7768e,color:#e6e8ef;

Step 0 / 10

Hook 先于权限——PreToolUse hook 可以在权限引擎之前否决、要求确认、推迟或改写调用; 强制 deny/ask 规则仍是最终安全边界。
主循环无子代理——task registry 只做异步后台; 多 agent 协作被推到 context 外。
有 provenance 的压缩——摘要记录为 SessionCompaction 事件 + 健康探针。
Workspace 绑定——workspace_root 防并行 lane 写错 CWD。

Hooks before permissions — a PreToolUse hook can deny, ask, defer, or rewrite a call before the policy engine; enforced deny/ask rules remain the safety boundary.
No in-loop sub-agents — task registry is for async background only; multi-agent coord pushed outside.
Auto-compaction with provenance — summaries logged as SessionCompaction events + health probe.
Workspace binding — workspace_root prevents parallel lanes writing to wrong CWD.

状态转移速读State transitions

当前状态State	触发条件Trigger	下一状态Next
`UserInput`	用户输入被追加到 session messagesUser message appended to session messages	`BuildRequest`
`BuildRequest`	system prompt + 历史 messages 组装完成System prompt + history assembled	`ModelStream`
`ModelStream`	assistant message 没有 `ToolUse` blockAssistant message has no `ToolUse` block	`TurnDone`
`ModelStream`	解析到一个或多个 `ToolUse` blockOne or more `ToolUse` blocks parsed	`PreToolUse`
`PreToolUse`	hook 允许、改写、要求询问或直接拒绝Hook allows, rewrites, asks, or denies	`Permission` / `ToolResult(error)`
`Permission`	policy allow / ask / denyPolicy allows / asks / denies	`ExecuteTool` / `ToolResult(error)`
`ExecuteTool`	工具 stdout/stderr 或结构化结果返回Tool stdout/stderr or structured result returned	`PostToolUse`
`PostToolUse`	tool_result 被追加回 messages; 本轮工具全部处理完Tool result appended to messages; all tool calls processed	`BuildRequest`

conversation.rs:314 — run_turn() conversation.rs:414 — PreToolUse hook gate conversation.rs:432 — authorize_with_context compact.rs:96 — compact_session hooks.rs:23 — HookEvent enum

codex

OpenAI Responses API

结构化 output[] 流 + 原生推理项 + 沙箱 bash。 Structured output[] stream with first-class reasoning items and sandboxed bash.

flowchart TD U([User message]):::io K[_build_api_kwargs · instructions · tools · reasoning.effort]:::ctx ST{{responses.stream · with reasoning.encrypted_content}}:::model FB[[Fallback — responses.create stream · synthesize from deltas]]:::model N[_normalize_codex_response · parse output array]:::model RS[codex_reasoning_items · dedup by ID across turns]:::ctx PP[PermissionPolicy · ReadOnly · WorkspaceWrite · DangerFull]:::gate SB[Exec in sandbox · seatbelt · landlock]:::tool AP[Append tool result]:::tool CK{incomplete or commentary}:::decision Y([Return message]):::io U --> K --> ST ST -- transport err --> FB --> N ST --> N --> RS RS --> CK CK -- function_call --> PP --> SB --> AP --> K CK -- commentary --> K CK -- completed --> Y class U step1 class K step2 class ST step3 class FB step4 class N step5 class RS step6 class PP step7 class SB step8 class AP step9 class CK step10 click U call jumpTo("codex", 1) click K call jumpTo("codex", 2) click ST call jumpTo("codex", 3) click FB call jumpTo("codex", 4) click N call jumpTo("codex", 5) click RS call jumpTo("codex", 6) click PP call jumpTo("codex", 7) click SB call jumpTo("codex", 8) click AP call jumpTo("codex", 9) click CK call jumpTo("codex", 10) classDef io fill:#233042,stroke:#7aa2f7,color:#e6e8ef; classDef model fill:#2b1f3a,stroke:#bb9af7,color:#e6e8ef; classDef tool fill:#1f3a2b,stroke:#9ece6a,color:#e6e8ef; classDef gate fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef ctx fill:#3a1f2b,stroke:#f7768e,color:#e6e8ef; classDef decision fill:#2d2d3a,stroke:#8a93a6,color:#e6e8ef;

Step 0 / 10

Responses API, 不是 Chat Completions——输出是 typed output[](message / function_call / reasoning)。
推理跨 turn 保留——include: ["reasoning.encrypted_content"], 按 ID 去重。
三级流式回退——stream() → 重试 → create(stream=True) → 从 deltas 合成, 永不静默掉 turn。
OS 级沙箱——seatbelt (macOS) / landlock (Linux) 在 bash 执行前 gate FS/网络。

Responses API, not Chat Completions — typed output[] of message · function_call · reasoning.
Reasoning across turns — include: ["reasoning.encrypted_content"], deduplicated by ID.
Streaming fallback cascade — stream() → retry → create(stream=True) → synthesize.
OS-level sandbox — seatbelt / landlock gate FS/network before bash.

run_agent.py:5168 — _run_codex_stream run_agent.py:5183 — responses.stream(**api_kwargs) run_agent.py:5297 — fallback responses.create run_agent.py:4640 — _normalize_codex_response run_agent.py:7266 — reasoning ID dedup

opencode

TS server + Go TUI

客户端-服务端分离 · HTTP · 任何前端都能驱动同一个 core。 Client-server split over HTTP — any frontend drives the same agent core.

flowchart TD TUI([Go TUI / Web / IDE]):::io SRV[POST /session/:id/message · → Bun server loop]:::ctx MODE{Agent mode}:::decision AI{{Vercel AI SDK stream · Anthropic · OAI · Google · Copilot · local}}:::model TC[Tool-call parts]:::model PG[Permission gate · allow · ask · deny + wildcards]:::gate EX[Tool executor · bash · edit · read · grep · lsp · mcp]:::tool SUB[[Subagent · general · explore]]:::sub APP[Append result]:::tool CC[Compaction / summary / title · hidden system agents]:::ctx SSE([SSE /global/event · → TUI renders parts]):::io TUI --> SRV --> MODE MODE -->|build: full tools| AI MODE -->|plan: read-only, ask first| AI AI --> TC --> PG --> EX EX --> SUB --> APP EX --> APP APP --> CC CC --> AI CC -.stream events.-> SSE class TUI step1 class SRV step2 class MODE step3 class AI step4 class TC step5 class PG step6 class EX step7 class SUB step8 class APP step9 class CC step10 click TUI call jumpTo("opencode", 1) click SRV call jumpTo("opencode", 2) click MODE call jumpTo("opencode", 3) click AI call jumpTo("opencode", 4) click TC call jumpTo("opencode", 5) click PG call jumpTo("opencode", 6) click EX call jumpTo("opencode", 7) click SUB call jumpTo("opencode", 8) click APP call jumpTo("opencode", 9) click CC call jumpTo("opencode", 10) classDef io fill:#233042,stroke:#7aa2f7,color:#e6e8ef; classDef model fill:#2b1f3a,stroke:#bb9af7,color:#e6e8ef; classDef tool fill:#1f3a2b,stroke:#9ece6a,color:#e6e8ef; classDef sub fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef gate fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef ctx fill:#3a1f2b,stroke:#f7768e,color:#e6e8ef; classDef decision fill:#2d2d3a,stroke:#8a93a6,color:#e6e8ef;

Step 0 / 10

HTTP 作为边界——TUI / Web / IDE 都走 /session/*; OpenAPI 3.1 spec 在 /doc(server.ts), 配合 mDNS broadcast (server/mdns.ts), 任何客户端都能自动发现并生成 SDK。
build 与 plan 模式——同一套工具, 不同权限映射: build(默认模式) 放行 edit/write/bash;plan 把写类工具 (edit/write/patch/bash) 降到 ask 或 deny(默认行为因 agent.ts 内置 + 用户 config 合并而定), 只读工具自动放行。一个 loop, 两种人格。
Provider 无关——Vercel AI SDK 把 streaming / tool-calling / reasoning 下放到各 adapter。
一等 LSP + MCP——代码智能和外部工具与原生工具并列。

HTTP as the boundary — TUI / web / IDE all speak to /session/*; OpenAPI 3.1 spec at /doc (server.ts) and mDNS broadcast (server/mdns.ts) let any client discover and generate an SDK.
Build vs plan modes — plan defaults edits/bash to ask, same loop two personas.
Provider-agnostic — Vercel AI SDK delegates streaming / tool / reasoning to each adapter.
First-class LSP + MCP — code intelligence and external tools sit beside native ones.

packages/opencode/src/tool — native tools packages/opencode/src/mcp — MCP client packages/opencode/src/lsp — LSP bridge packages/tui — Go TUI client (SSE consumer)

openclaw

TS · multi-channel gateway

本地 Gateway + 多通道 (IM/CLI/iOS/IDE) + Docker 浏览器沙箱。 Local Gateway + many channels (IM/CLI/iOS/IDE) + Dockerised browser sandbox.

flowchart TD CH([Channel input · IM · CLI · iOS · IDE · 10+ providers]):::io GW[Gateway · local-first orchestrator]:::ctx SR[Resolve session · history · DM pairing]:::ctx BP[Build payloads · system + tools + schemas]:::ctx PR{{Provider plugin · Anthropic · OpenAI · Google · ...}}:::model MC[Parse text + tool_calls · streaming deltas]:::model TP[ToolPolicy pipeline · per sandbox + channel]:::gate EX[Execute tool · bash · file · canvas]:::tool BR[[Browser sandbox · Docker + Chromium + CDP + noVNC]]:::sub CMP[Async compaction · if near context limit]:::ctx OUT(["Emit events → all channels"]):::io CH --> GW --> SR --> BP --> PR --> MC --> TP --> EX EX -- browser tool --> BR --> EX EX --> CMP CMP -- more tools --> PR CMP -- done --> OUT class CH step1 class GW step2 class SR step3 class BP step4 class PR step5 class MC step6 class TP step7 class EX step8 class BR step9 class CMP step10 click CH call jumpTo("openclaw", 1) click GW call jumpTo("openclaw", 2) click SR call jumpTo("openclaw", 3) click BP call jumpTo("openclaw", 4) click PR call jumpTo("openclaw", 5) click MC call jumpTo("openclaw", 6) click TP call jumpTo("openclaw", 7) click EX call jumpTo("openclaw", 8) click BR call jumpTo("openclaw", 9) click CMP call jumpTo("openclaw", 10) classDef io fill:#233042,stroke:#7aa2f7,color:#e6e8ef; classDef model fill:#2b1f3a,stroke:#bb9af7,color:#e6e8ef; classDef tool fill:#1f3a2b,stroke:#9ece6a,color:#e6e8ef; classDef sub fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef gate fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef ctx fill:#3a1f2b,stroke:#f7768e,color:#e6e8ef;

Step 0 / 10

多通道路由——Discord / Slack / Telegram / WhatsApp / iMessage / Signal / Matrix / Microsoft Teams / Google Chat / Zalo 等 IM 消息都进同一个 Gateway(长驻守护进程), CLI / iOS / IDE 则作为额外入口, 全部路由到同一组 session。
Embedded Runner + Gateway 拆分——agent 核心可嵌入 CLI / 浏览器 / 远端, Gateway 管 channels / cron / auth。
Docker 浏览器沙箱——Chromium + CDP + noVNC, 操作可视化调试, 自动化与"我来看它点什么"并存。
ACP 桥接 IDE——openclaw acp 暴露 stdio 协议, Zed / Cursor 可直接驱动同一 agent。

Multi-channel routing — IM traffic from Discord / Slack / Telegram / WhatsApp / iMessage / Signal / Matrix / Teams / Google Chat / Zalo all feeds one Gateway (long-lived daemon); CLI / iOS / IDE act as additional entry points, all landing in shared sessions.
Embedded runner ↔ Gateway split — core agent is portable (CLI / browser / remote); Gateway owns channels / cron / auth.
Dockerised browser sandbox — Chromium + CDP + noVNC; automation while you can watch it click.
ACP bridge to IDEs — openclaw acp exposes stdio protocol; Zed / Cursor drive the same agent.

src/agents/pi-embedded-runner/run.ts — main turn loop src/agents/pi-tools.ts — tool registry + lazy loading src/agents/sandbox/browser.ts — Docker CDP browser src/agents/pi-embedded-runner/compact.ts — async compaction src/acp/session.ts — ACP ↔ Gateway bridge

pi

TypeScript · minimal harness

4 个工具默认 (read/write/edit/bash); 一切其它特性住在 TS Extensions / Skills / Packages 里。 Four default tools (read/write/edit/bash); every other feature lives in TS Extensions / Skills / Packages.

flowchart TD U([User input · Enter = steer · Alt+Enter = queue]):::io AS[AgentSession · assemble system + AGENTS.md + skills + history]:::ctx PR{{Provider · 15+ via OAuth or API key · /model switches mid-session}}:::model ST[Stream typed events · text · tool_use · usage]:::model EX[ExtensionRunner · before-tool hook · may rewrite or block]:::gate T4[[Default tools · read / write / edit / bash]]:::tool EXT[[Extension tools · sub-agent / plan / MCP / sandbox · user-installed]]:::sub TR[Session tree · append message · parentID for branching]:::ctx CMP[Compaction · replaceable strategy · default summary-rewrite]:::ctx OUT(["Render to TUI · or emit JSON · or return via SDK"]):::io U --> AS --> PR --> ST --> EX EX -- default --> T4 EX -- extension --> EXT T4 --> TR EXT --> TR TR --> CMP CMP -- more tool_use --> PR CMP -- done --> OUT OUT -- user steers --> AS class U step1 class AS step2 class PR step3 class ST step4 class EX step5 class T4 step6 class EXT step7 class TR step8 class CMP step9 class OUT step10 click U call jumpTo("pi", 1) click AS call jumpTo("pi", 2) click PR call jumpTo("pi", 3) click ST call jumpTo("pi", 4) click EX call jumpTo("pi", 5) click T4 call jumpTo("pi", 6) click EXT call jumpTo("pi", 7) click TR call jumpTo("pi", 8) click CMP call jumpTo("pi", 9) click OUT call jumpTo("pi", 10) classDef io fill:#233042,stroke:#7aa2f7,color:#e6e8ef; classDef model fill:#2b1f3a,stroke:#bb9af7,color:#e6e8ef; classDef tool fill:#1f3a2b,stroke:#9ece6a,color:#e6e8ef; classDef sub fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef gate fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef ctx fill:#3a1f2b,stroke:#f7768e,color:#e6e8ef;

Step 0 / 10

4 工具默认——read / write / edit / bash 是模型唯一能直接调的工具; find/grep/ls 存在但默认未挂载, 让 bash 走原生工具链。
Steering 双键——agent 跑工具时 Enter 立刻打断后续工具并塞新消息进推理; Alt+Enter 排队等本轮跑完。
"什么不在 core 里"——sub-agent、plan mode、MCP、权限弹窗、todo、后台 bash 全故意放到 extension/package 里; 想要就写一个或装一个。
会话作为树——/tree 跳回任意旧消息从那分叉; 全部分支住在同一文件; /share 上传到 GitHub gist 直出可分享 URL。
SDK 嵌入——同一个 AgentSession 跑 4 模式: TUI / print(JSON) / RPC / SDK; openclaw 用 SDK 把 pi 嵌进自家 runner。

Four default tools — read / write / edit / bash are the only ones the model can call directly; find/grep/ls exist as files but aren't mounted, so the model uses native shell via bash.
Two-key steering — while the agent is running tools, Enter interrupts the remaining tools and lands a new message in reasoning; Alt+Enter queues a follow-up until the current run ends.
"What we didn't build" — sub-agents, plan mode, MCP, permission popups, todos, background bash are all deliberately pushed into extensions / packages; build one or install one.
Session as a tree — /tree jumps back to any old message and forks from there; every branch lives in one file; /share uploads to a GitHub gist and returns a shareable URL.
SDK embedding — the same AgentSession runs in four modes: TUI / print(JSON) / RPC / SDK; openclaw uses the SDK to embed pi as its runner.

packages/coding-agent/src/core/agent-session.ts — AgentSession (3099 lines) packages/coding-agent/src/core/tools/{read,write,edit,bash}.ts — 4 default tools packages/coding-agent/src/core/extensions/ — TS extension runtime packages/coding-agent/src/core/compaction/ — replaceable compaction packages/coding-agent/src/modes/{interactive,print-mode.ts,rpc} — 4 run modes packages/coding-agent/src/core/sdk.ts — embed API (used by openclaw)

5. 一眼看懂5. At a glance

下面表里的术语若有陌生, 后面"深度拆解"会讲透——先扫一眼整体。Unfamiliar terms below will be explained in the deep dives — just skim for now.

AgentAgent	技术栈Stack	主循环Loop driver	沙箱 / 权限Sandbox / perms	招牌特性Signature feature
hermes-agent	Python, 多 provider adapterPython, multi-provider adapters	ReAct + 共享 `IterationBudget`(默认 90)ReAct w/ shared `IterationBudget` (default 90)	Modal 云 VM; bash 权限策略Modal cloud VM; bash policy	子代理与父共享预算Sub-agents share parent budget
claw-code	Rust runtime + Python 参考Rust runtime + Python reference	`run_turn()` 每工具级 gating`run_turn()` per-tool gating	Pre/Post hook → `PermissionPolicy`Pre/Post hooks → `PermissionPolicy`	Hook 触发早于权限检查Hooks fire before permission
codex	OpenAI Responses API (非 Chat Completions)OpenAI Responses API	`responses.stream()` + 回退级联`responses.stream()` w/ fallback cascade	seatbelt (macOS) / landlock (Linux)seatbelt / landlock	加密推理跨 turn 保留Encrypted reasoning across turns
opencode	TS 服务端 (Bun) + Go TUI, HTTP 分离TS server (Bun) + Go TUI, HTTP split	客户端 POST → 服务端 loop → SSE 回推Client POST → server loop → SSE back	每工具 `allow \| ask \| deny`Per-tool `allow \| ask \| deny`	任何前端都能驱动同一个 agent coreAny frontend drives the same core
openclaw	TypeScript, 本地 Gateway + 多通道TypeScript, local Gateway + multi-channel	Channel → Gateway → embedded runner → streamChannel → Gateway → embedded runner → stream	Docker 沙箱 + per-channel ToolPolicyDocker sandbox + per-channel ToolPolicy	IM / CLI / iOS / IDE 都路由到同一 agentIM / CLI / iOS / IDE all route to one agent

6. 六家深度拆解6. Six deep dives

每家按同一模板: 目的 → 核心机制 (带源码行号)→ 为什么 work → 为什么好 → 代价。

Same template for each: purpose → key mechanisms (with line numbers) → why it works → why it's good → cost.

3.1 hermes-agent hermes-agent/run_agent.py

为什么你该看懂这家: 它展示了"一个 loop 兼容多家 LLM provider"和"让 agent 派小弟并防止失控"——自己造 agent 时第一个会撞到的两个工程问题。

Why you should care: it shows how to support multiple LLM providers in one loop, and how to delegate to sub-agents without losing control — the first two engineering problems you'll hit when building your own agent.

目的Purpose

做一个 provider-agnostic 的通用 agent: 今天 Claude, 明天 Gemini, loop 不用动。同时支持把大任务拆给子 agent 并行处理。

A provider-agnostic agent: swap Claude for Gemini without touching the loop. Plus parallel sub-agent delegation for large tasks.

核心机制Key mechanisms

Adapter 模式(agent/anthropic_adapter.py, gemini_native_adapter.py, ...)——抽象一个"stream + 收工具调用"接口, 每家 provider 实现一份, 主 loop 根本不知道自己在跟谁说话。
IterationBudget 共享预算(run_agent.py:730, 父默认 90, 子 agent 默认额度 50)——主 agent 每调一次 API 扣 1; 子 agent 从同一个预算里扣, 并且自己上限 50, 双重防失控。不然"一个蠢 agent 派 10 个子 agent, 每个再派 10 个"会指数爆炸。
delegate_task 子代理(run_agent.py:8076 + tools/delegate_tool.py:13)——做成一个工具。调用它 = 启动隔离 context 的子 AIAgent, 只带主 agent 指定的工具子集; 父 agent 只看到子 agent 的总结, 中间的 50 轮 tool call 不污染父 context。

Adapter pattern (agent/anthropic_adapter.py, gemini_native_adapter.py, ...) — abstract "stream + receive tool calls" interface; each provider implements it. The main loop never knows who it's talking to.
Shared IterationBudget (run_agent.py:730, parent default 90, child default 50) — the main agent decrements once per API call; sub-agents share the same counter AND each is individually capped at 50. Without this, "a dumb agent spawns 10 sub-agents, each spawns 10 more" explodes exponentially.
delegate_task sub-agents (run_agent.py:8076 + tools/delegate_tool.py:13) — shipped as a tool. Calling it spins up an isolated-context child AIAgent with a restricted toolset. The parent only sees the summary; the 50 child tool-calls don't pollute parent context.

为什么 workWhy it works

Adapter 解耦——Claude 限流了改两行换 Bedrock, 不用改 loop。
子代理隔离——"搜资料"这种产生大量 tool-call 噪声的子任务, 丢给子 agent 做, 父 context 只留结果。
共享预算——永远不会失控。

Decoupling via adapters — Claude throttled? Two-line change to Bedrock; loop untouched.
Sub-agent isolation — noisy subtasks ("research this") go to a child; parent only keeps the summary.
Shared budget — never runs away.

为什么好Why it's good

对 RL 训练场景极友好: agent 扔进 Modal 云沙箱, 同时跑 100 个 rollout, 每个独立 FS 但共享奖励函数。MCP server (mcp_serve.py) 把内部对话反向暴露, Claude Code / Cursor 能把 hermes 当工具。

Great for RL training: drop into Modal cloud sandbox, run 100 rollouts in parallel, each with its own FS but a shared reward function. MCP server (mcp_serve.py) exposes internal conversations outward, letting Claude Code / Cursor consume hermes as a tool. ClawBench is a natural RL evaluation target for this setup — its per-evidence scores plug straight in as reward signal.

代价Cost

复杂度高, 单文件 12,000 + 行。Multi-provider 意味着无法 1:1 映射各家最新特性 (比如 Claude 的 extended thinking 在 Gemini 上没有等价物)。

Complex — single file is 12,000 + lines. Multi-provider means you can't 1:1 map each vendor's latest features (e.g., Claude's extended thinking has no Gemini equivalent).

3.2 claw-code claw-code/rust/crates/runtime/src/

为什么你该看懂这家: 它展示了"如何让 agent 在生产环境也敢用"——每个危险动作都能被脚本拦下来审查、改写、记录。想把 agent 上线的人必看。

Why you should care: shows how to make an agent safe enough for production — every dangerous action can be intercepted, rewritten, or audited by scripts. Essential for anyone shipping an agent.

注: claw-code 是 Claude Code 的 Rust 开源复刻。Anthropic 官方文档把 Claude Code 定位为围绕 Claude 的 agentic harness(见 How Claude Code Works)。官方可确认的是 agentic loop、工具、权限、hooks、CLAUDE.md / memory、context compaction 这些机制; 本节的具体源码行、100k 压缩阈值、12 阶段 Bootstrap、health probe 是 claw-code 的实现选择, 不等于官方 Claude Code 内部实现。

Note: claw-code is an open-source Rust reimplementation of Claude Code. Anthropic's docs describe Claude Code as an agentic harness around Claude (see How Claude Code Works). The official surface confirms the agentic loop, tools, permissions, hooks, CLAUDE.md / memory, and context compaction; the source lines, 100k compaction threshold, 12-phase bootstrap, and health probe in this section are claw-code implementation choices, not official Claude Code internals.

目的Purpose

做一个可审计、可干预的 Claude Code 开源实现。每一次工具调用都可以被脚本拦截、改参、查权限、记日志、事后清理。

An auditable, interceptable Claude Code reimplementation. Every tool call can be intercepted, rewritten, permission-checked, logged, or cleaned up afterward.

核心机制Key mechanisms

Pre/Post Hooks(claw-code 的 hooks.rs:23 只实现了 PreToolUse, PostToolUse, PostToolUseFailure 三种; 官方 Claude Code 文档列出了更多生命周期 hook, 包括 SessionStart、UserPromptSubmit、PreCompact、Stop、ConfigChange 等)——工具调用前先跑用户配置的 hook 脚本, 可以放行、要求确认、拒绝、推迟、改写参数或补充上下文; 调用后再跑一次做清理/通知/日志。
PermissionPolicy(permissions.rs:175 authorize_with_context)——hook 放行后过权限检查, 支持静态规则 + 交互式 prompter; bash 命令还会经 bash_validation.rs 做语法检查和危险操作检测。
顺序严格(conversation.rs:414): Pre-hook → Permission → Execute → Post-hook。注意 hook 先于权限——这意味着一个 hook 能主动提升/降级权限, 甚至把一个危险命令改成安全命令再交给权限系统。
自动压缩 + 健康探针——阈值 DEFAULT_AUTO_COMPACTION_INPUT_TOKENS_THRESHOLD = 100_000(conversation.rs:18); 压缩逻辑在 compact.rs:96 compact_session, 压缩完成后触发 health probe (conversation.rs:297 run_session_health_probe, 压缩后对 session 跑一次探测确认可用), 动作记入 AutoCompactionEvent 可审计。

Pre/Post hooks — claw-code's hooks.rs:23 HookEvent enum implements three: PreToolUse, PostToolUse, PostToolUseFailure. Anthropic's official Claude Code docs list more lifecycle hooks, including SessionStart, UserPromptSubmit, PreCompact, Stop, ConfigChange, and others. Pre-hooks can allow / ask / deny / defer / modify input / add context; post-hooks handle cleanup / notify / log.
PermissionPolicy (permissions.rs:175 authorize_with_context) — post-hook authorization with static rules + interactive prompter; bash commands also run through bash_validation.rs for syntax + danger checks.
Strict ordering (conversation.rs:414): Pre-hook → Permission → Execute → Post-hook. Note: hook runs before permission — a hook can request allow / ask / deny / defer or rewrite a dangerous command before permission evaluates the final call.
Auto-compact + health probe — threshold is DEFAULT_AUTO_COMPACTION_INPUT_TOKENS_THRESHOLD = 100_000 (conversation.rs:18); compaction runs via compact.rs:96 compact_session; right after, a health probe fires (conversation.rs:297 run_session_health_probe) that probes the session to confirm it still works. The action is logged as an AutoCompactionEvent for audit.

为什么 workWhy it works

策略与实现解耦——怎么执行 bash 是 runtime 的事, 允不允许执行是 policy 的事。改一份 JSON 就能变出"全自动"或"人工审批每一步"的 agent。
Hook 先于权限带来更强表达力——传统权限是 allow/deny 二元。Hook 是可编程中间层, 能做"生产环境把 rm -rf /foo 重写成 rm -rf /foo.bak"这种事。
压缩有 provenance——不是黑盒丢弃历史, 摘要 + 元数据保留, 问题可追溯。

Policy/runtime decoupling — how to run bash is runtime's job; whether to allow it is policy's. One JSON config turns the same loop into "fully autonomous" or "ask for every step."
Hook-before-permission = more expressive — traditional permission is allow/deny. Hook is a programmable middle layer. You can do "in prod, rewrite a destructive command into a dry-run or safer target before permission sees it."
Compaction has provenance — not a black-box history wipe; summary + metadata preserved, issues traceable.

为什么好Why it's good

同一个 runtime 可以跑出完全不同风格的 agent: 开发者用=宽权限, hook 做 lint; 生产跑=收紧权限, hook 强制 dry-run; 教学用=所有 bash 都 ask。Rust 实现启动快内存低, 可内嵌到其他程序。

Same runtime, different personas: developers get wide permissions with lint hooks; prod gets strict permissions with forced dry-run hooks; teaching mode asks on every bash. Rust impl means fast startup, low memory, embeddable in other programs.

代价Cost

主 loop 里没有子代理 (task registry 只做异步后台)。多 agent 协作被推到 runtime 外部——这是 claw-code 的哲学选择: "让 agent context 专注做事, 不要用来开会"。

Main loop has no sub-agents (task registry is async background only). Multi-agent coordination is pushed outside the runtime — a deliberate philosophy: "keep agent context focused on work, not meetings."

3.3 codex codex/ + hermes adapter run_agent.py:5168+

为什么你该看懂这家: Responses API 是未来几年其他服务商大概率会跟进的方向。提前看懂 = 别家跟进时你能立刻上手。

Why you should care: Responses API is likely the direction other vendors will follow over the next few years. Learn it now, be ready when others catch up.

目的Purpose

展示 OpenAI 把 agent 能力直接内置到 API 会是什么样——不是让客户端组装工具调用, 而是 API 直接返回"我在想什么 / 要调什么工具 / 要说什么"的结构化流。

What it looks like when OpenAI bakes agent capability into the API itself — not client-side tool-call assembly, but the API streaming structured items: "what I'm thinking / which tool to call / what to say."

核心机制Key mechanisms

Responses API 而非 Chat Completions(run_agent.py:5183 responses.stream)——Chat Completions 返回一段 content + 可能的 tool_calls, 客户端拼起来。Responses API 返回 output[] 数组, 每元素是 typed item:{type: "message"} / {type: "function_call"} / {type: "reasoning"}, 客户端按类型分别处理。
加密 reasoning 跨 turn 保留(run_agent.py:7266 dedup 逻辑)——请求加 include: ["reasoning.encrypted_content"], Codex 返回加密推理串; 这串可以作为下 turn 的 input 一部分重传回去。效果: 模型"记得"上轮自己怎么想的, 多 turn 推理连贯。
三级回退(run_agent.py:5168 → :5297):responses.stream() → 重试 → responses.create(stream=True) 自己从 deltas 合成。保证流式断了也不会丢 turn。
Codex 也有 hooks——config.toml 里的 [hooks.pre_tool_use] / [hooks.post_tool_use] 配置脚本, 在工具调用前后注入自定义逻辑 (2026-04 codex_hooks 标记 stable)。
OS 级沙箱——macOS seatbelt(sandbox-exec) 按 .sb 配置限 FS / 网络; Linux landlock + bubblewrap + seccomp (见 codex-linux-sandbox helper) 进程自愿放弃能力。官方默认模式是 read-only(codex --sandbox read-only), 完整四档: read-only / workspace-write / danger-full-access / external-sandbox。不是容器, 但默认就能挡 99% 误操作。

Responses API, not Chat Completions (run_agent.py:5183 responses.stream) — Chat Completions returns a content string + optional tool_calls array (client assembles). Responses API returns an output[] array of typed items: {type: "message"} / {type: "function_call"} / {type: "function_call_output"} / {type: "reasoning"} — clients route by type.
Encrypted reasoning across turns (run_agent.py:7266 dedup logic) — request with include: ["reasoning.encrypted_content"]; Codex returns encrypted reasoning blobs. Those blobs can be fed back as part of the next turn's input — the model "remembers" how it was thinking, multi-turn reasoning stays coherent.
3-step fallback (run_agent.py:5168 → :5297): responses.stream() → retry → responses.create(stream=True) synthesized from deltas. Even if streaming drops, the turn isn't lost.
Codex has hooks too — config.toml accepts [hooks.pre_tool_use] / [hooks.post_tool_use] scripts for pre/post-tool interception (marked stable in April 2026's codex_hooks release).
OS-level sandbox — macOS seatbelt (sandbox-exec) with .sb configs limits FS / network; Linux uses landlock + bubblewrap + seccomp via the codex-linux-sandbox helper. The documented default is read-only (codex --sandbox read-only); the full four modes are read-only / workspace-write / danger-full-access / external-sandbox. Not a container — but the default blocks 99% of accidents.

为什么 workWhy it works

Typed output——不用正则从 content 里抠 tool_use JSON。
推理保留——长任务不失忆, 省 token 一致性好。
OS 沙箱——比 docker 轻 100×, 比纯权限硬一个量级。

Typed output — no more regex-extracting tool_use JSON from content.
Reasoning preserved — long tasks don't go amnesic; saves tokens, stays consistent.
OS sandbox — 100× lighter than docker, an order of magnitude stronger than pure permissions.

为什么好Why it's good

第一方优化——Responses API 是 agent 一等公民。客户端只处理回退与去重, 服务端负责推理、缓存、流式, 整条链路比 "Chat Completions + 手搓 agent loop" 干净得多。对专门用 OpenAI 的团队, codex 是上限最高的方案。

First-party optimization — Responses API treats agents as first-class citizens. The client only handles fallback and dedup; the server owns inference, caching, streaming. The whole pipeline is much cleaner than "Chat Completions + hand-rolled agent loop." For OpenAI-committed teams, codex is the highest-ceiling option.

代价Cost

锁定 OpenAI——Responses API 目前只有 OpenAI。推理加密——你拿不到纯文本推理内容, 只能原样传回。

Locked to OpenAI — Responses API is OpenAI-only today. Reasoning is encrypted — you can't inspect it, only pass it back.

3.4 opencode github.com/sst/opencode

为什么你该看懂这家: 如果你要做 IDE 插件、团队共享 agent、或多端同步, 这是蓝图。

Why you should care: if you want to build an IDE plugin, a team-shared agent, or multi-client sync — this is the blueprint.

目的Purpose

解决一个工程问题: agent 不应该和终端 UI 绑死。今天 TUI, 明天 VS Code 插件, 后天 iPhone app——agent 逻辑应该只写一份。

One engineering problem: agent logic should not be bound to a TUI. TUI today, VS Code plugin tomorrow, iPhone app the day after — write the agent once.

核心机制Key mechanisms

客户端-服务端拆分——Server 是 TypeScript 跑在 Bun 上, 维护所有 session 和 agent loop; Client 是 Go TUI, 但协议是开放的 HTTP:POST /session/:id/message、GET /global/event (SSE)、POST /session/:id/permissions/:id。Server 在 /doc 暴露 OpenAPI 3.1 可自动生成任何语言 SDK, 启动用 mDNS broadcast, 手机 app 都能发现。
Build vs Plan 模式——同一套工具, 可切换人格: build 完整工具自动执行; plan edit 和 bash 默认 ask, 只读工具自动放行。
Per-tool 权限——每个工具独立设 allow | ask | deny, 支持通配符 (mymcp_* 批量放行一组 MCP 工具)。权限请求经 HTTP 回到客户端, 客户端 UI 弹确认。
隐藏系统 agent——compaction(对话过长自动摘要)、summary(生成摘要)、title(自动命名 session)。用户看不到, server 后台在跑。

Client-server split — Server is TypeScript on Bun, holds all sessions + agent loop. Client is Go TUI, but the protocol is open HTTP: POST /session/:id/message, GET /global/event (SSE), POST /session/:id/permissions/:id. OpenAPI 3.1 spec at /doc auto-generates any-language SDKs; mDNS broadcast on startup lets mobile apps discover.
Build vs Plan modes — same tool surface, different permission maps: build (the default) allows edit/write/bash; plan restricts write-class tools (edit / write / patch / bash) to ask or deny (resolved from agent.ts defaults merged with user config), while read-only tools flow through. Two personas, one loop.
Per-tool permissions — each tool independently set to allow | ask | deny, with wildcard support (mymcp_* whitelists a whole MCP bundle). Permission requests flow over HTTP back to the client UI.
Hidden system agents — compaction, summary, title are all marked hidden: true and run server-side on schedule. Users never see them.

为什么 workWhy it works

协议稳定 = 前端繁荣——HTTP + OpenAPI 让社区造出 neovim 插件、手机 app、Web UI。
模式切换零成本——plan 模式只是权限配置不同, 不是两个 agent 实现。
Vercel AI SDK 做底层——换 provider 改一行配置。

Stable protocol → flourishing frontends — HTTP + OpenAPI enables community neovim plugins, mobile apps, web UIs.
Mode switching is free — plan mode is just different permission config, not a separate agent impl.
Vercel AI SDK underneath — swap provider with one config line.

为什么好Why it's good

对团队协作友好——server 跑在共享机器上, 多人接客户端连进来看同一 session。对 IDE 集成友好——任何 IDE 插件都能对接, 不用各自重造 agent。

Team-friendly — run server on a shared machine, multiple clients connect to the same session. IDE-friendly — any IDE plugin can wire up, no need to reinvent the agent.

代价Cost

HTTP 带来的延迟 (毫秒级, 交互上可忽略)。Server 要长期维护, 不像单进程 CLI 那样"用完即退"。

HTTP adds latency (milliseconds, negligible interactively). Server needs long-term maintenance, unlike a fire-and-forget CLI.

3.5 openclaw openclaw/src/agents/pi-embedded-runner/

为什么你该看懂这家: 它展示了"一个 agent 同时吃得下 IM 消息、CLI 命令、iOS 推送、IDE 会话"——想做 all-in-one 个人 copilot 的人必看。

Why you should care: shows how one agent can handle IM messages, CLI commands, iOS pushes, and IDE sessions in parallel — essential reading if you want an "all-in-one" personal copilot.

目的Purpose

做一个本地优先的多通道 agent。官方把它定位成"单个长驻 Gateway + 所有通道共用一个 agent": 不是又一个聊天框, 而是挂在你设备上的控制面板——Discord / Slack / Telegram / WhatsApp / iMessage / Signal / Matrix / Teams / Zalo 等 10 + IM 通道 + CLI / iOS / IDE 都路由进同一个 Gateway, agent 在共享 session 里工作。

A local-first multi-channel agent. The docs position it as "one long-lived Gateway, many channels, one agent" — not another chat box but a control plane on your device. 10 + IM channels (Discord / Slack / Telegram / WhatsApp / iMessage / Signal / Matrix / Teams / Zalo and more) plus CLI / iOS / IDE all feed the same Gateway and share sessions.

核心机制Key mechanisms

官方术语: Tools vs Skills —— Tools 是 agent 可以调用的带类型函数 (bash / read / write / browser / canvas 等, 共 ~19 个核心), Skills 是注入 system prompt 的 Markdown 教材 (SKILL.md, 讲"什么时候、怎么用"工具)。这套分层是 openclaw 文档自己强调的核心抽象。
Docs terminology: Tools vs Skills. Tools are the typed functions the agent can call (bash / read / write / browser / canvas / ~19 core); Skills are Markdown docs (SKILL.md) injected into the system prompt, teaching when and how to use them. This split is called out in the official docs as the core abstraction.

Gateway + Embedded Runner 拆分(pi-embedded-runner/run.ts)——Gateway 是本地 WebSocket 编排器, 管 channels / cron / auth / sessions; Embedded Runner 是可移植的 agent 核心, 能跑在 CLI、浏览器、远端 SSH 里。两者用 session key 对接。
ToolPolicy pipeline(tool-policy-pipeline.ts)——工具按 sandbox mode × channel 过滤: main session 默认全开, sandbox 只放 exec · read · write · edit · sessions_*; 不同 messageProvider (如 voice、node) 有独立的 allow/deny 映射。完全配置驱动, 不是 hardcoded。
Docker 浏览器沙箱(sandbox/browser.ts + Dockerfile.sandbox-browser)——每个 session 起一个隔离容器, Chromium + xvfb + noVNC + CDP, 既能自动化又能通过 6080 端口开浏览器"看它在点什么"。
Async 压缩(compact.ts)——context 逼近上限时启动异步压缩任务, 当前 turn 完成后把历史摘要化; 压缩执行中新 turn 会被阻塞, 避免状态不一致。
ACP 桥接(src/acp/session.ts)——openclaw acp 暴露 Agent Client Protocol (stdio), Zed / Cursor 等 IDE 把 openclaw 当成后端驱动, 不用写原生插件。

Gateway + Embedded Runner split (pi-embedded-runner/run.ts) — Gateway is a local WebSocket orchestrator owning channels / cron / auth / sessions; Embedded Runner is the portable agent core (runs in CLI, browser, remote SSH). They meet via session keys.
ToolPolicy pipeline (tool-policy-pipeline.ts) — tools filter by sandbox mode × channel: main session is permissive; sandbox lane only exposes exec · read · write · edit · sessions_*; different messageProvider values (e.g. voice, node) get their own allow/deny mappings. All config-driven, no hardcoding.
Dockerised browser sandbox (sandbox/browser.ts + Dockerfile.sandbox-browser) — every session spins its own container with Chromium + xvfb + noVNC + CDP. Automation, plus you can open port 6080 and literally watch the agent click.
Async compaction (compact.ts) — when context approaches the limit, an async compaction task is queued; the current turn finishes first, then history is summarised. New turns are blocked during compaction so state stays consistent.
ACP bridge (src/acp/session.ts) — openclaw acp exposes the Agent Client Protocol over stdio. Zed, Cursor and other IDEs drive openclaw as a backend without needing native plugins.

为什么 workWhy it works

本地 Gateway——没有 SaaS 锁定, 所有 credentials / conversation 都在你机器上; SSH 可选暴露。
Channel 抽象统一——聊天消息和 CLI 命令对 agent 是同一种事件, 写一次就支持所有入口。
Plugin-first——channels / providers / skills 都是插件, core 保持瘦。

Local Gateway — no SaaS lock-in; credentials and conversations stay on your machine. SSH exposure is opt-in.
Unified channel abstraction — chat messages and CLI commands are the same event to the agent; write once, every inbox works.
Plugin-first — channels / providers / skills are plugins; core stays lean.

为什么好Why it's good

对"个人 copilot / 值班机器人"场景最到位——开会时 agent 监听 Slack, 下班路上用 iMessage 追问结果, 到家接 CLI 继续改代码, 同一条 session 贯穿。加上 ACP, IDE 会话也一起上。其他 4 家需要你手动切换工具。浏览器沙箱刚好也能跑 ClawBench 任务——用 openclaw 做网页 agent 的研发+评测一条龙。

Ideal for "personal copilot / on-call bot": the agent watches Slack during a meeting, answers iMessage on the commute, resumes via CLI at home — all the same session. Add ACP and the IDE joins in too. The other four ask you to context-switch tools yourself. The browser sandbox doubles as a ClawBench runner, so you can use openclaw for both web-agent dev and evaluation in one place.

代价Cost

运维成本高: Docker、WebSocket、多通道 webhook 要一次跑起来。核心文件 run.ts 2100 + 行, 逻辑密集。不是开箱即用的小工具。

Higher ops cost: Docker + WebSocket + multi-channel webhooks must all be up. run.ts is 2100 + lines of dense logic. Not a plug-and-play mini-tool.

3.6 pi github.com/badlogic/pi-mono · packages/coding-agent pi.dev

为什么你该看懂这家: 当前面五家比的是"我加了多少特性", pi 反过来比"我能砍掉多少特性还活得下去"。openclaw 的 embedded runner 就是基于 pi 的 SDK——这是 pi 在生产里最好的存在证明。如果你想把 agent 做成一个能放进自己 app 里的库, 而不是一个吞掉用户工作流的 CLI, 这就是范例。

Why you should care: while the other five compete on "how many features I add," pi competes on "how many features I can strip out and still survive." openclaw's embedded runner is built on pi's SDK — the best existence proof in production. If you want an agent shaped like a library you embed in your app, not a CLI that eats your workflow, this is the template.

目的Purpose

作者 Mario Zechner (badlogicgames; pi.dev 由 exe.dev 捐赠) 把 pi 称作 "minimal terminal coding harness"——只给模型 4 个原子工具 (read / write / edit / bash), 其他全部由用户用 TypeScript Extensions / Skills / Prompt Templates / Themes 自己长出来, 还能打成 npm/git 包分享。pi.dev 主页直接列了一串 "What we didn't build": 没有 MCP、没有 sub-agent、没有 plan mode、没有权限弹窗、没有内建 todo、没有后台 bash——每一项都给了"你可以这样替代"的提示。

Author Mario Zechner (badlogicgames; pi.dev donated by exe.dev) calls pi "a minimal terminal coding harness." The model gets four atomic tools (read / write / edit / bash); everything else is grown by users via TypeScript Extensions / Skills / Prompt Templates / Themes, which can be shipped as npm or git packages. The pi.dev homepage literally has a "What we didn't build" section: no MCP, no sub-agents, no plan mode, no permission popups, no built-in to-dos, no background bash — each entry tells you the recommended workaround instead.

核心机制Key mechanisms

4 工具默认(packages/coding-agent/src/core/tools/: read.ts · write.ts · edit.ts · bash.ts)——pi README 第一句:"By default, pi gives the model four tools." 也有 find / grep / ls 文件但默认未挂载, 让模型用 bash 走原生工具链。这跟 claw-code 的 40 + 工具是另一极。
4 种运行模式(src/modes/: interactive/ · print-mode.ts · rpc/)——同一个 AgentSession (3099 行, 见 src/core/agent-session.ts) 可跑成: 交互 TUI (默认) · pi -p "query" 打印或 --mode json 事件流 · stdin/stdout JSON-RPC (供非 Node 程序集成) · 或作为 SDK 嵌入自家 app。openclaw 走的就是 SDK 这条路 (见 §3.5)。
会话作为 Git-like 树(src/core/session-manager.ts + compaction/)——sessions 存为树, /tree 跳到任意旧消息从那分叉, 全部分支住在同一文件; /share 上传到 GitHub gist 拿到分享链接。这个原语和 opencode 的 parentID 思路一致, 但 pi 把它做到了 UI 一级。
Steering 与 follow-up 双键(pi.dev 主页直接示范)——agent 跑工具时, Enter 发送 steering message: 当前工具执行完立刻打断后续工具, 新消息塞进推理; Alt+Enter 发送 follow-up: 排队, 等本轮结束再处理。把"用户耐不住先打断"做成显式协议。
Extensions = TypeScript modules(src/core/extensions/)——可注册新工具、新 slash command、新键盘绑定、新 TUI overlay。features 不在 core, 在 extension: 想要 sub-agent? 写一个 extension 起新 pi 实例。想要 plan mode? 写一个 extension 把 edit/bash 翻成 ask。想要 MCP? 写一个 extension 把 MCP 调用桥到 bash。
Skills + Prompt Templates + AGENTS.md/SYSTEM.md(src/core/skills.ts · prompt-templates.ts · system-prompt.ts)——Skills 按 SKILL.md 规范按需加载, 不破坏 prompt cache (progressive disclosure); Prompt Templates 是 Markdown, 输入 /name 展开; AGENTS.md 启动时从 ~/.pi/agent/、父目录、当前目录依次加载——pi 的 CLAUDE.md。
Compaction 是可替换的(src/core/compaction/)——超阈值触发的默认是 summary-rewrite, 但用户可以用 extension 替换为话题分组、code-aware、或换 model 做 summary。这跟 claw-code 把 compaction 做成 runtime 一等概念 (带 health probe) 是另一种思路: 给一个 hook 让你自己写。
15 + provider, 订阅或 API key 二选一(src/core/auth-storage.ts · model-registry.ts)——Anthropic Claude Pro/Max、OpenAI ChatGPT Plus/Pro (Codex)、GitHub Copilot、Gemini CLI 都能通过 OAuth 走订阅; API key 列出 14 家 (Anthropic / OpenAI / Azure / DeepSeek / Bedrock / Mistral / Groq / Cerebras / Cloudflare / xAI / OpenRouter / Vercel AI Gateway 等)。/model 或 Ctrl+L 中途切换, Ctrl+P 在收藏里循环。

Four-tool default (packages/coding-agent/src/core/tools/: read.ts · write.ts · edit.ts · bash.ts) — the pi README opens with "By default, pi gives the model four tools." find / grep / ls exist as files but aren't mounted by default — the model reaches for native shell via bash. Polar opposite of claw-code's 40 + tools.
Four run modes (src/modes/: interactive/ · print-mode.ts · rpc/) — the same AgentSession (3099 lines, src/core/agent-session.ts) runs as: interactive TUI (default), pi -p "query" for scripts (or --mode json for an event stream), JSON-RPC over stdin/stdout (for non-Node integrators), or embedded via the SDK. openclaw takes the SDK path (see §3.5).
Session as a Git-like tree (src/core/session-manager.ts + compaction/) — sessions persist as trees; /tree jumps to any old message, forks a new branch from there, all branches live in the same file; /share uploads to a GitHub gist and returns a shareable URL. Same primitive as opencode's parentID, but promoted to a first-class UX feature.
Two-key steering vs follow-up (shown on pi.dev) — while the agent is running tools, Enter sends a steering message: the current tool finishes, remaining tools are interrupted, and the new message lands in the model's next reasoning step. Alt+Enter sends a follow-up: queued, applied only after the agent finishes the current run. Two-key formalisation of "I can't wait, let me cut in."
Extensions = TypeScript modules (src/core/extensions/) — register new tools, slash commands, keybindings, TUI overlays. Features don't live in core, they live in extensions: want sub-agents? Write an extension that spawns another pi instance. Want plan mode? Flip edit/bash to ask in an extension. Want MCP? Write an extension that bridges MCP calls into bash.
Skills + Prompt Templates + AGENTS.md/SYSTEM.md (src/core/skills.ts · prompt-templates.ts · system-prompt.ts) — Skills load on demand per the SKILL.md spec without busting the prompt cache (progressive disclosure); Prompt Templates are Markdown, expanded via /name; AGENTS.md is loaded at startup from ~/.pi/agent/, parent directories, and cwd — pi's CLAUDE.md equivalent.
Compaction is replaceable (src/core/compaction/) — default behaviour on threshold is summary-rewrite, but extensions can swap in topic-grouping, code-aware compaction, or a different summarisation model. Where claw-code makes compaction a runtime first-class concept (with health probe), pi exposes it as a hook for you to wire.
15 + providers, subscription or API key (src/core/auth-storage.ts · model-registry.ts) — Anthropic Claude Pro/Max, OpenAI ChatGPT Plus/Pro (Codex), GitHub Copilot, Gemini CLI all flow through OAuth subscription. Fourteen API-key providers listed (Anthropic / OpenAI / Azure / DeepSeek / Bedrock / Mistral / Groq / Cerebras / Cloudflare / xAI / OpenRouter / Vercel AI Gateway etc.). /model or Ctrl+L switches mid-session, Ctrl+P cycles your favourites.

为什么 workWhy it works

Token 效率压榨到极致——4 工具 + 极简 system prompt 意味着每 turn 的 prompt 前缀小, prompt cache 命中率高, 上下文窗口更经得住消耗。pi 主页声称"very token efficient due to its minimal system prompt"。
核心面积小, bug 面也小——所有可变性都在 extension 里, core 一年不大改也不会影响生态。这跟 claw-code 把每条防线都内建是相反的赌注。
"Ask pi to build it" 闭环——pi 鼓励你让 pi 自己写一个 extension, /reload 立刻生效。这把"自定义"做成 agent 的自指能力, 不是开发流程外面的事。

Token efficiency squeezed — four tools + a minimal system prompt means small prompt prefixes per turn, higher prompt-cache hit rate, more context budget for actual work. The pi homepage claims it is "very token efficient due to its minimal system prompt."
Small core surface = small bug surface — all variability lives in extensions; core doesn't need to change for new use cases. The opposite bet of claw-code's "every guardrail in the runtime."
"Ask pi to build it" closes the loop — pi encourages you to ask pi itself to write an extension, then /reload makes it live. Customisation is a self-referential capability of the agent, not something outside the dev flow.

为什么好Why it's good

对"我要把 agent 嵌进自己产品里"的团队几乎是唯一选项——SDK 干净、协议稳定 (RPC 模式有 doc), 没有强加给你的 UI 概念。openclaw 把 pi 当 runtime 嵌进 Gateway, 自己只关心通道路由——这就是 pi 设计哲学的最佳广告。pi 还做了一个值得借鉴的事: 作者把自己的 pi-mono 工作 session 持续发到 Hugging Face, 用 pi-share-hf 工具一键分享 OSS session, 给 RL/agent 训练社区提供真实工作流数据。

For teams who want to embed an agent into their own product, pi is essentially the only option — clean SDK, stable protocol (RPC mode is documented), no imposed UX concepts. openclaw embedding pi as a runtime in its Gateway, while only owning channel routing, is the best advertisement for pi's philosophy. One more thing worth copying: the author publishes his own pi-mono work sessions to Hugging Face via pi-share-hf, donating real OSS workflow data to the RL / agent-training community.

代价Cost

"故意没有 X" 的代价就是用户得自己长 X。生产场景需要权限弹窗、subagent、plan mode、MCP 接入的团队, 在 pi 上得先写一组 extension; 直接用 claw-code / opencode 是更省事的选择。Steering 双键虽然优雅, 学习曲线对新用户也不友好——团队人多时谁都得知道 Enter 和 Alt+Enter 的差别, 否则会误打断。

"Deliberately not built" comes with a tax — you grow it yourself. Teams that need permission popups, sub-agents, plan mode, or MCP in production must first write a stack of extensions; reaching for claw-code or opencode is the cheaper path. The two-key steering protocol is elegant but has a learning curve — everyone on a team has to know the Enter vs Alt+Enter distinction or the wrong key will break a long task.

7. 对比与选型7. Comparison & selection

什么场景选哪个Which to pick when

场景	推荐	原因
RL 训练 / 批量 rollout	hermes-agent	Modal 沙箱 + 共享预算 + 子代理做并行
生产跑 agent, 要审计和治理	claw-code	Hook 系统 + 权限策略 + Rust 稳健性
只用 OpenAI, 要最强 reasoning	codex	Responses API 原生支持 + 加密推理保留
多端 (TUI / Web / IDE) 共用	opencode	HTTP 协议 + OpenAPI + 自动 SDK
个人多通道 copilot / 值班机器人	openclaw	本地 Gateway + IM/CLI/iOS/IDE 路由到同一 session
真实浏览器任务评测 / 验证	ClawBench	live web 任务, 不是 offline DOM 快照;动态 JS、cookie 弹窗、多步交互、可追溯 per-evidence 评分

Scenario	Pick	Why
RL training / batch rollouts	hermes-agent	Modal sandbox + shared budget + parallel sub-agents
Production with audit & governance	claw-code	Hooks + policy + Rust robustness
OpenAI-only, max reasoning	codex	Responses API native support + encrypted reasoning
Multi-client (TUI / Web / IDE)	opencode	HTTP + OpenAPI + auto-generated SDKs
Personal multi-channel copilot / on-call bot	openclaw	Local Gateway + IM/CLI/iOS/IDE route into one session
Evaluating real browser tasks	ClawBench	Live web tasks, not offline DOM snapshots; dynamic JS, cookie popups, multi-step interactions, traceable per-evidence scoring

设计维度对比Design dimensions

维度Dimension	hermes	claw	codex	opencode	openclaw
进程模型Process model	单进程Single process	单进程Single process	单进程Single process	客户端/服务端分离C/S split	Gateway + Runner 拆分Gateway + Runner split
子代理Sub-agents	主循环内in-loop	无 (外置)None (external)	无None	@mention	session 路由session routing
权限粒度Permission grain	粗 (bash 分类)Coarse (bash class)	细 (每工具 + hook)Fine (per-tool + hook)	粗 (bash 分类)Coarse (bash class)	细 (每工具)Fine (per-tool)	细 (sandbox × channel)Fine (sandbox × channel)
Provider	多家 adapterMulti via adapter	多家 (Claude 优先)Multi (Claude-first)	仅 OpenAIOpenAI only	多家 (AI SDK)Multi (AI SDK)	多家 pluginMulti via plugins
语言Language	Python	Rust	Python	TS + Go	TypeScript
沙箱Sandbox	Modal 云 VMcloud VM	OS-level	seatbelt / landlock	无 (容器可选)None (container optional)	Docker (含浏览器)Docker (incl. browser)
入口通道Entry channels	CLI	CLI / IDE	CLI / IDE	CLI / TUI / IDE	IM / CLI / iOS / IDEIM / CLI / iOS / IDE

8. Takeaway: 7 条值得借鉴的设计8. Takeaway: seven design patterns worth adopting

自己造 agent 时可以直接借鉴的设计。顺带一提:把它们造出来后, 用 ClawBench 在真实网页任务上打个分, 就知道到底哪几条 idea 真的 work。

Design patterns you can lift directly for your own agent. And once you've built it, run it against ClawBench on live web tasks to see which of these ideas actually pay off in practice.

1. 共享预算防失控 — hermes 的 IterationBudget1. Shared budget to prevent runaway — hermes's IterationBudget

不管 agent 怎么嵌套, 总工具调用次数不会爆炸。自己造时: 给所有工具调用加一个共同递减的计数器, 比"每个 agent 独立限制"稳得多。

No matter how deeply agents nest, total tool calls can't explode. DIY: add a single shared counter decremented by every call — far more robust than "each agent gets its own limit."

2. Hook 先于权限给用户终极表达力 — claw-code2. Hook-before-permission = ultimate expressiveness — claw-code

传统权限是 allow/deny 二元。Hook 是可编程的中间层。自己造时: 给每个关键决策点暴露一个"用户可注入的函数", 而不是做死的规则。

Traditional permission is binary allow/deny. Hooks are a programmable middle layer. DIY: expose a "user-injectable function" at every critical decision point, not hard-coded rules.

3. Reasoning 跨 turn 保留 — codex3. Reasoning persisted across turns — codex

多 turn 任务里, 上轮思考应可延续到这轮, 而不是每轮重新想。自己造时: 如果模型支持, 开启 reasoning persistence; 不支持, 在 system prompt 里人工把"上轮结论"塞回去。

In multi-turn tasks, the prior turn's reasoning should carry to the next — don't re-think from scratch. DIY: turn on reasoning persistence if the model supports it; if not, inject "last turn's conclusion" via the system prompt manually.

4. 把 agent 做成服务 — opencode4. Agent as a service — opencode

Agent 逻辑和 UI 彻底分开。自己造时: 哪怕只做 CLI, 也把 core 拆成独立 library + server mode, 将来扩展成本近似零。

Separate agent logic from UI. DIY: even for a CLI, split core into library + server mode — expanding to new clients later costs near-zero.

5. Ephemeral injection 保 cache — hermes5. Ephemeral injection to preserve cache — hermes

Prompt cache 最怕 prompt 前缀变。把动态内容 (memory, hook output) 作为"仅本次 API 调用生效"的补充, 别污染历史。自己造时: 历史只存用户消息 + 工具结果, 所有 agent 内部的元数据另算。

Prompt cache breaks when the prefix changes. Treat dynamic content (memory, hook output) as "only effective for this API call" — don't pollute the history. DIY: history stores only user messages + tool results; all agent-internal metadata lives elsewhere.

6. 预算耗尽时留一次 "grace call" — codex6. Give the model one "grace call" on budget exhaustion — codex

工具预算打满时不要直接 hard-error。hermes 里的 codex 适配器会再放模型一次 API 调用(run_agent.py:916 _budget_grace_call), 让它有机会给用户一个体面的收尾: 总结已完成的事、列出没跑完的、保存部分结果。自己造时: budget 监督器保留一次 graceful-exit 槽位, 用户体验立竿见影。

When the tool budget is exhausted, don't hard-error. The codex adapter in hermes grants one more model call (run_agent.py:916 _budget_grace_call) so the model can exit gracefully: summarise what got done, what's left, save partial results. DIY: reserve a single graceful-exit slot in your budget watchdog — the UX upgrade is immediate.

7. Session 分叉作为一等公民 — opencode (源码级)7. Session forking as a first-class primitive — opencode (code-level)

opencode 的 session.sql.ts 用 parentID 字段追踪 session 血统, 支持从任意消息点分叉出一条平行路径。注: 公开文档目前只介绍 share 功能, fork 还没进官方 docs——这个 pattern 来自源码。多数 agent 框架把"回溯/重试"做成销毁状态, opencode 把它做成树。自己造时: 消息持久化加一个 parent_id 字段, 就能解锁"多方案并跑"这类体验。

opencode's session.sql.ts uses a parentID field to track session lineage, letting you fork a parallel session from any message. Note: public docs only describe the share feature; forking is present in source but not yet promoted to official docs — this one is from the code. Most agent frameworks treat "undo/retry" as destruction; opencode treats it as a tree. DIY: add a parent_id to persisted messages and you've unlocked "try both approaches at once."

一句话画像One-line mental model

hermes-agent	"ReAct + 预算 + 子代理, 一进程多 provider。" "ReAct + budgets + sub-agents, one process, many providers."
claw-code	"每个工具调用都是 hook → permission → execute → hook。" "Every tool call is hook → permission → execute → hook."
codex	"Responses API + 原生推理项 + OS 级沙箱。" "Responses API with first-class reasoning items and OS-level sandbox."
opencode	"Agent core 藏在 HTTP 服务后, TUI 只是一个客户端。" "Agent core behind an HTTP API; the TUI is just one client."
openclaw	"所有通道 (IM / CLI / iOS / IDE) 都是同一个 agent 的入口。" "Every channel (IM / CLI / iOS / IDE) is a door into the same agent."

9. hermes-agent 与训练9. hermes-agent and training

其他四家都在解决"怎么让 agent 好用", 只有 hermes-agent 同时解决"怎么让 agent 被训练"。这一点在 LLM 领域正在变得越来越重要——评估、RL、离线分析、跨模型对比, 都是"agent 作为训练 target"才能产出的结果。hermes 从进程模型到数据流, 每一层都为这个场景设计。

The other four optimize for running an agent well; hermes-agent is the only one that simultaneously optimizes for training one. As evaluation, RL, offline analysis, and cross-model comparison become the new battleground, "agent-as-training-target" is the axis that matters — and hermes is architected for it from the process model up to the data flow.

为什么训练友好就是优雅Why training-friendly is elegance

Modal 云沙箱 · 一次 rollout 一个 VM(environments/hermes_swe_env/hermes_swe_env.py:62)——RL 训练要求并行百量级 rollout, 每个都有独立 FS 状态且能活到奖励函数跑完。本地 Docker 做不到这个量级, Modal 是把 serverless VM 当"一次性容器"用, 这是 hermes 能做 RL 的底座。
共享 IterationBudget + 子 agent 独立上限(run_agent.py:730, 父 90, 子 50)——探索式训练最怕 fork-bomb, hermes 用一个整数从父到子全局递减, 任何一条链路超限都立刻截断。没有这个, RL 实验一 epoch 能把你的云账单跑爆。
轨迹压缩保留信号头尾(trajectory_compressor.py:86)——上下文溢出时, hermes 只压中间, 保留开头的 system / human / 首工具反馈 + 末尾 4 轮。训练里"任务描述"和"最终结果"都在头尾——这是把压缩算法做成"训练数据友好"的少见案例, 绝大多数 harness 只会截尾。
确定性 cache ID(run_agent.py:4209, SHA256(fn:args:index))——每次 rollout 用同一 prefix, 不是随机 UUID, 这样同批 rollout 能命中 prompt cache。大规模实验下省的钱不是个小数。
多 provider adapter 同进程(agent/anthropic_adapter.py / bedrock / gemini_native / auxiliary_client)——同一条 rollout 在 Claude / Gemini / OpenAI 间切换只改配置。跨模型评测 / 蒸馏 / ablation 都变成一行 diff。
MCP server 反向暴露(mcp_serve.py:431)——把内部对话按 MCP 协议反向暴露给 Claude Code / Cursor, 当训练信号收集器用。agent 的产物变成下一层 agent 的输入, 这是"研究级自举"能做的。
错误分类器指挥恢复(agent/error_classifier.py:24, FailoverReason 枚举)——auth / rate-limit / context-overflow 各走不同恢复路径, 训练 run 不会被单条坏 rollout 拖垮。

Modal cloud sandbox, one VM per rollout (environments/hermes_swe_env/hermes_swe_env.py:62) — RL demands hundreds of parallel rollouts, each with isolated FS state that survives until the reward function scores it. Local Docker can't hit that scale; Modal's serverless VMs as "one-shot containers" is what makes hermes feasible as an RL target.
Shared IterationBudget + per-child cap (run_agent.py:730, parent 90, child 50) — exploratory training fears fork-bombs. hermes decrements one shared integer from parent through every descendant; any chain that overflows is cut immediately. Without this, one RL epoch can nuke your cloud bill.
Trajectory compression that preserves head + tail (trajectory_compressor.py:86) — on context overflow, hermes summarises only the middle, keeping the opening (system / human / first tool feedback) and the final four turns. Training signal lives at head ("what to do") and tail ("what happened") — few harnesses treat compaction as a training-data concern; most just truncate.
Deterministic cache IDs (run_agent.py:4209, SHA256(fn:args:index)) — every rollout shares the same prompt prefix instead of random UUIDs, so same-batch rollouts hit the prompt cache. At scale, the savings are not small.
Multi-provider adapters in one process (agent/anthropic_adapter.py / bedrock / gemini_native / auxiliary_client) — same rollout can swap between Claude / Gemini / OpenAI with a config change. Cross-model eval, distillation, ablation all become one-line diffs.
MCP server mode as a signal tap (mcp_serve.py:431) — hermes can expose its internal conversations outward via MCP to Claude Code / Cursor and act as a training-signal collector. An agent's output becomes the next agent's input — research-grade bootstrapping.
Error classifier drives recovery (agent/error_classifier.py:24, FailoverReason enum) — auth / rate-limit / context-overflow each take distinct recovery paths; a single bad rollout can't sink a training run.

为什么这比"好看"更优雅?训练是 agent 领域最苛刻的负载——要求并行、幂等、成本可控、失败可恢复、数据可追溯。一个能同时扛住这五件事的 harness, 本质上也是一个能在生产跑的 harness, 只是反过来不成立。hermes 把"能被训练"设计进了每一层, 这种系统级一致性才是真正的优雅。

Why does this beat "pretty" for elegance? Training is the harshest load an agent harness can face: parallel, idempotent, cost-bounded, failure-recoverable, data-traceable. A harness that withstands all five is, by definition, a harness that can also run in production — but not the reverse. Hermes bakes "trainable" into every layer; that system-wide coherence is what real elegance looks like.

荣誉提名 (各自最优雅的一处)Honorable mentions (each has one truly elegant choice)

claw-code 的 Pre/Post-hook 先于 Permission——调换一个顺序就撬动整套可编程策略。"允不允许"是二元问题, "先拦截再询问"是编程问题, 后者的表达力强一个数量级。
codex 的 typed output[] 协议——第一个把"说话 / 调工具 / 思考"三件事在 API 协议层区分清楚, 而不是让客户端靠 regex 去猜。
opencode 的 client-server 拆分——把 agent 做成 HTTP 服务, 继承 Unix 管道 + REST 几十年的工程红利, 前端零边际成本。
openclaw 的 Gateway + Channel 抽象——把"消息从哪儿来"统一成一层, agent 不必区分 Slack 与 CLI。这几家里最"统一世界观"的一家。

claw-code's hooks before permission — one tiny ordering choice unlocks an entire programmable policy surface. Binary "allow/deny" → programmable "intercept then ask" is an order-of-magnitude expressivity jump.
codex's typed output[] protocol — the first API to distinguish "the model is talking / calling a tool / thinking" at the protocol level, instead of making clients regex their way through content.
opencode's client-server split — agent-as-HTTP-service, inheriting decades of Unix-pipe and REST wisdom; frontends cost nothing marginally.
openclaw's Gateway + Channel abstraction — collapses "where did this message come from" into one layer; the agent doesn't distinguish Slack from CLI. The most unified worldview among the five.

这是我的口味, 不是唯一正确答案。生产审计选 claw-code; OpenAI 全家桶选 codex; 多端协作选 opencode; 个人多通道 copilot 选 openclaw;想真 stress-test 任何一家在真实网页任务上的表现——ClawBench 就是干这个的。我把票投给 hermes, 是因为"能被训练"这条路, 长期看会把整个 agent 生态拉进一个新范式——你今天不训练, 明年大概率也会训练。

This is my taste, not the only right answer. Production audit → claw-code. OpenAI stack → codex. Multi-client work → opencode. Personal multi-channel copilot → openclaw. And if you want to actually stress-test any of the five on real web tasks, that's what ClawBench is for. I vote hermes because "being trainable" is the axis that, long-term, will pull the whole agent ecosystem into a new paradigm — if you aren't training now, you probably will be next year.

引用Cite

如果这篇文章或 ClawBench 对你的工作有用, 欢迎引用。点右上角按钮一键复制 BibTeX。

If this post or ClawBench is useful to you, please cite. Click the button for one-click BibTeX copy.

ClawBench

@article{zhang2026clawbench,
  title={ClawBench: Can AI Agents Complete Everyday Online Tasks?},
  author={Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},
  year={2026},
  eprint={2604.08523},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.08523},
}

This post

@misc{zhang2026harnessblog,
  author = {Yuxuan Zhang},
  title  = {Agent Harness Engineering: A Source-Level Comparison of Coding Agents},
  year   = {2026},
  url    = {https://reacher-z.github.io/blog/harness/}
}

# Agent Harness Engineering: A Source-Level Comparison of Coding Agents

Author: Yuxuan Zhang (2026)
URL: https://reacher-z.github.io/blog/harness/

## Scope
A source-level comparison of how five open-source coding agents actually run:

1. hermes-agent (Python, multi-provider) — ReAct loop with shared IterationBudget (default 90, child 50); delegate_task sub-agents share the parent's budget; Modal cloud VM per RL rollout; trajectory compression preserves head + last 4 turns; deterministic cache IDs; 47 built-in tools.
2. claw-code (Rust reimplementation of Claude Code) — strict order PreToolUse hook > Permission > Execute > PostToolUse; hooks fire BEFORE permission so users can allow / ask / deny / defer / rewrite a call; DEFAULT_AUTO_COMPACTION_INPUT_TOKENS_THRESHOLD = 100_000; run_session_health_probe at conversation.rs:297.
3. codex (OpenAI Responses API) — typed output[] array with message / function_call / function_call_output / reasoning items; encrypted reasoning carried across turns via include=reasoning.encrypted_content; streaming fallback cascade (stream -> retry -> create(stream=True) -> synthesize); seatbelt (macOS) / landlock + bubblewrap + seccomp (Linux) OS sandbox; default mode read-only.
4. opencode (TS server on Bun + Go TUI) — client-server split over HTTP; OpenAPI 3.1 at /doc; mDNS broadcast; build vs plan agents with different permission maps; hidden system agents (compaction / summary / title); LSP as first-class tool; per-tool allow | ask | deny with wildcards.
5. openclaw (TypeScript, multi-channel) — local Gateway daemon routes 10+ IM channels (Slack / Discord / iMessage / Telegram / WhatsApp / Signal / Matrix / Teams / Google Chat / Zalo) plus CLI / iOS / IDE (ACP) into one session; Docker browser sandbox with Chromium + xvfb + noVNC + CDP; ~19 core Tools + ~53 Skills; Tools vs Skills split = "Tools are what the agent calls; Skills teach when and how".

## Conceptual framework (§2 in the post)
Five-layer stack, bottom to top:
- Prompt Engineering — how to phrase input (system prompt, few-shot, CoT)
- Context Engineering — what fits in the window (retrieval, memory, compaction, prompt cache)
- Tools — typed functions the agent can call
- Skills — Markdown (SKILL.md) teaching when/how to use tools — https://github.com/anthropics/skills is the reference repo
- Harness Engineering — loop + sandbox + budget + hook + session + channel (the five agents above are each a harness)

Anthropic officially calls Claude Code an "agentic harness" — which validates this layering.

## Seven design patterns worth adopting (from §7)
1. Shared budget to prevent runaway — hermes's IterationBudget
2. Hook-before-permission for ultimate expressiveness — claw-code
3. Reasoning persisted across turns — codex
4. Agent as a service (client-server split) — opencode
5. Ephemeral injection to preserve prompt cache — hermes
6. Grace call on budget exhaustion — codex-adapter pattern
7. Session forking as first-class primitive — opencode

## hermes-agent and training (§8)
hermes-agent. Rationale: it is the only harness in the five that is architected for training (RL rollouts in parallel, bounded exploration via shared budget, compaction that preserves training signal, cross-provider adapters for ablations, deterministic cache IDs, MCP as signal collector, semantic error classifier for recovery).

## Related benchmark
ClawBench — live browser-task benchmark that grades whether a harness actually works on real everyday online tasks (cookie popups, dynamic JS, multi-step interactions, traceable per-evidence scoring). arXiv:2604.08523, https://claw-bench.com/

## Where to go next
Open https://reacher-z.github.io/blog/harness/ for the full interactive article with five Mermaid flowcharts you can step through node-by-node, tied to exact file:line citations.