Coding Agent 架构对比 Coding Agents — Architecture Comparison

5 家开源 coding agent (hermes-agent · claw-code · codex · opencode · openclaw) 到底怎么跑, 为什么这么设计。从零讲起, 带交互动画, 每一步对得上真实源码行。 How five open-source coding agents (hermes-agent · claw-code · codex · opencode · openclaw) actually run, and why. Zero-to-deep, with interactive animations tied to real source lines.

一键复制 Quick cite → 跳到引用区→ Jump to cite section
30 秒看懂 agent: 大语言模型 (LLM, 像 Claude / GPT) 本身只做一件事——猜下一个字该是什么。 要让它真正去读文件、执行命令、上网搜, 得给它工具, 外面再套一层循环: 它说要调工具 → 框架去执行 → 结果塞回对话 → 它接着想。 这一整套就叫 agent。本页拆解了 5 种 agent 的设计差异。第一次看, 先从下面的"基本概念"读起。
30-second primer: an LLM (Claude / GPT / ...) only predicts "what text comes next." To make it really read files, run commands, or browse the web, you wrap it with tools + a loop: it asks to call a tool → the framework runs it → result is fed back → it continues. That wrapper is an "agent." This page compares five of them. New here? Start with "Concepts" below.

1. 基本概念1. Concepts

1. LLM 本质是"下一个 token 预测器"1. LLM is a next-token predictor

你给它一段文本, 它给你接下来最可能出现的一段文本。没了。

Feed it text, it returns the most-likely next chunk of text. That's all.

LLM 只做一件事: 看见前面的 token, 预测下一个 token;
然后把预测的那个加到末尾, 再预测下一个——一步步生成出整段话。

先把文字切成 token(≈ 子词):

   "The cat sat on the"
   ──tokenize──▶ ["The", " cat", " sat", " on", " the"]

然后一步步预测: 

   步 1  看到: ["The", " cat", " sat", " on", " the"]   → 预测 " mat"
   步 2  看到: ["The", " cat", " sat", " on", " the", " mat"] → 预测 "."
   步 3  看到: [..., "."]                                 → 预测 <end> (结束)

每预测一次, 输入就长一点。"历史 + 新词"的总长度就是 context window 的上限所在——
超过了, 要么压缩 (§5.2), 要么丢弃尾部。
An LLM only does one thing: look at the tokens so far, predict the next token;
then append that prediction and predict the next one — step by step, a whole reply appears.

First, text is split into tokens (≈ subwords):

   "The cat sat on the"
   ──tokenize──▶ ["The", " cat", " sat", " on", " the"]

Then one step at a time:

   step 1  sees: ["The", " cat", " sat", " on", " the"]   → predicts " mat"
   step 2  sees: ["The", " cat", " sat", " on", " the", " mat"] → predicts "."
   step 3  sees: [..., "."]                                → predicts <end> (stop)

Each step makes the input one token longer. The total length of "history + new tokens" is
capped by the context window — overflow either gets compacted (§5.2) or truncated.

不会执行代码、不会读文件、不会上网。这些全是 agent 在外面套的一层。

It can't execute code, read files, or access the web. All of that is what the agent wraps around it.

2. Function calling: 让 LLM "指点"框架去干活2. Function calling: let the LLM direct the framework

给 LLM 一段 system prompt 告诉它"你可以调 read_file(path)"。用户问"看看 /tmp/foo.py"——LLM 不会猜文件内容, 它会返回结构化 JSON:

Tell the LLM in the system prompt "you can call read_file(path)". A user asks "read /tmp/foo.py" — the LLM won't invent file contents, it returns structured JSON:

{
  "stop_reason": "tool_use",
  "content": [
    { "type": "text", "text": "Let me read it." },
    { "type": "tool_use",
      "id": "toolu_01abc",
      "name": "read_file",
      "input": { "path": "/tmp/foo.py" } }
  ]
}

真正去读文件的是外面的 agent 框架——拿到这个 JSON, 调 open(), 把结果塞回对话:

The agent framework does the actual reading — it takes the JSON, calls open(), and feeds the result back:

{
  "role": "tool",
  "tool_use_id": "toolu_01abc",
  "content": "import os\nprint(os.getcwd())\n"
}

这一来一回就是 agent 的一次工具调用

That round-trip is one tool call inside an agent turn.

3. Agent = LLM + 工具 + 循环3. Agent = LLM + tools + loop

Agent Loop (一次 turn)
Agent loop (one turn)
  1. 用户消息
  2. 把全部历史给 LLM
  3. LLM 输出文本 或 工具调用
  4. 工具调用 → 框架执行 → 拿到结果
  5. 结果塞回历史
  6. 回到 2, 直到 LLM 不再要调用
  1. User message
  2. Send full history to the LLM
  3. LLM outputs text or a tool_use
  4. Tool_use → framework runs it → result
  5. Result appended to history
  6. Go to 2 until no more tool_use

就这 6 步。五个 agent 都遵循这个骨架, 差异在每一步做多少事、加了多少保护、怎么扩展

Six steps. All five agents follow this skeleton — they differ in how much each step does, what safety layers are added, and how it extends.

4. 为什么 agent 能做到 LLM 做不到的事4. Why agents can do what LLMs can't

核心洞察: LLM 的"脑"很强但"手"是假的。Agent 框架负责给它装上真手, 并保证它不把厨房烧了。五家 agent 的差异, 本质是"装手的方式"和"保证不烧厨房的方式"不同

把手装好只是第一步;"手装得好不好"要靠真实任务测——ClawBench 就是为这个场景设计的, 专测这五家(及任何 harness)在真实网页任务上能不能完成 cookie 弹窗、动态 JS、多步交互等操作。

Key insight: LLMs have great brains but fake hands. The agent framework installs real hands — and makes sure it doesn't burn the kitchen down. The five differ in how they install the hands and how they prevent fires.

Installing the hands is step one; measuring whether the hands actually work is step two — ClawBench is designed exactly for this: a live benchmark that tests any harness on cookie popups, dynamic JS, multi-step interactions, and real everyday online tasks.

术语表 — 忘记某个词时展开查 Glossary — expand when you forget a word

术语表Glossary

一句话解释在本文里怎么用
LLM大语言模型, 如 Claude、GPT-4五个 agent 都是套在 LLM 外面的框架
TokenLLM 看到的最小单位, ≈ 一个子词"hello world" ≈ 2 tokens
Context windowLLM 单次能看到的 token 总量上限具体上限取决于模型和运行配置; 对 harness 来说关键是快满时如何压缩、裁剪和重载持久记忆
API程序调程序的接口, 本文指 LLM 服务商 REST API发 HTTP, 模型返回 JSON
Streaming边生成边返回 (打字机效果)减少等待, 能提前开始下一步
Function calling / Tool useLLM 输出"我要调这个函数"的结构化 JSON是 agent 能"动手"的前提
Prompt cache服务端缓存长 system prompt省钱 (最多 10x) 省延迟
Sandbox把进程关进小盒子, 限制它能访问什么防 agent 把电脑搞坏
ProviderLLM 服务商 (Anthropic / OpenAI / Google)绑一家 或 做 adapter 兼容多家
Turn用户说一次话 + agent 干完活返回 = 一个 turn"一次 turn" = 主循环跑一整圈
ReActReasoning + Acting 循环: 想一下 → 做一下五个 agent 都是 ReAct 变体
MCPModel Context Protocol, 外部工具协议让 agent 接入任意第三方工具
CLAUDE.md / AGENTS.md项目根目录的约定配置文件启动时读, 相当于"给 bot 的 README"
Plan-and-execute先让模型出计划, 再一步步执行的编排模式opencode 的 plan 模式、claw-code 的 EnterPlanMode
Reflectionagent 完成动作后再自我检查一轮, 发现错误就重试§7 Takeaway · Reflection 是很多 harness 的辅助轮回
Toolsagent 可以调用的带类型函数 (读文件、执行 bash、浏览网页…)§2 Tools vs Skills 对照表
Skills教 agent "什么时候、怎么用工具" 的 Markdown 文件 (SKILL.md)§2 · anthropics/skills
Subagent父 agent 派出的子 agent, context 隔离, 只回传总结hermes delegate_task、opencode @general/@explore
Orchestration决定"谁来做、按什么顺序、出错怎么接"——harness 的外层编排§2 五层地图的最顶层
Hook用户配置的脚本, 在 agent 生命周期关键时刻 (工具前/后) 自动跑, 能拦截 / 改写 / 否决 / 记日志claw-code 的 PreToolUse / PostToolUse
Adapter通用 agent loop 和具体 provider API 之间的翻译层; 换 adapter 就换 provider, loop 一行不用动hermes 的 anthropic_adapter / gemini_native
Compaction对话历史超过 context 上限时, 自动摘要旧 turn、保留头尾的行为claw-code 的 auto-compact · hermes 的 trajectory 压缩
Rolloutagent 一次从头跑到尾的完整 turn 序列; RL 训练里通常并行跑几百个§8 hermes 的 Modal 云沙箱每 rollout 一个 VM
SSEServer-Sent Events, 单向 HTTP 流式推送协议opencode 用它把 agent 事件从 server 流回 TUI
ACPAgent Client Protocol, IDE 与 agent 之间的 stdio 协议 (Zed / Cursor 推动)openclaw 的 acp 桥、opencode 的 acp 支持
LSPLanguage Server Protocol, IDE 与语言服务器 (跳转定义、查引用、诊断…) 之间的协议opencode 把 LSP 做成一等工具
CDPChrome DevTools Protocol, 程序化控制 Chromium 的协议 (无头浏览器自动化的基础)openclaw 浏览器沙箱用 CDP 给 agent 下操作指令
noVNC浏览器里的 VNC 客户端, 允许你通过 HTTP 端口远程看到沙箱里的图形桌面openclaw 6080 端口可"看 agent 在浏览器里点什么"
RL (强化学习)让 agent 反复试错、按"奖励函数"给出的分数学习的训练范式; 通常需要并行跑几百个 rollout§8 讨论 hermes 与训练的契合——它被设计成 RL-friendly harness
Modalserverless 云 VM 服务商, 按秒计费、秒级启停; agent 可以把每个 rollout 扔进一个独立 VMhermes 的 RL 沙箱就是基于 Modal
AWS BedrockAWS 托管的 LLM API 网关, 里面可以调 Claude、Llama、Mistral 等多家模型hermes 的 bedrock_adapter 就是对接它
OpenRouter第三方 LLM 路由服务, 一个 API key 调所有主流模型, 自动做限流 / 回退hermes 支持它作为 provider 之一
OS 沙箱 (seatbelt / landlock / bubblewrap / seccomp)操作系统层的进程隔离原语: seatbelt (macOS sandbox-exec) · landlock (Linux 内核自愿放弃能力) · bubblewrap (用户态容器) · seccomp (系统调用白名单)codex 默认模式就用这一套挡 99% 误操作
Vercel AI SDKVercel 出的 provider-agnostic TypeScript SDK, 抽象了 streaming / tool-calling / reasoning 的跨家差异opencode 的 provider 层直接用它, 新加一家只改一行配置
Self-evolving(自进化)agent 在运行中自己写新 SKILL.md、改 prompt 或更新 memory, 下一次起跑点比上一次更高; 比 reflection 更进一步——学到的东西能持久化, 不只是当轮改错Skills 层就是自进化的产物入口 · hermes 的 skill_manage 工具 · §7 Reflection 之后的下一层 takeaway
TermOne-linerUsage in this doc
LLMLarge Language Model (Claude, GPT-4, etc.)All five agents wrap an LLM
TokenSmallest unit the LLM sees, ≈ a subword"hello world" ≈ 2 tokens
Context windowMax tokens the LLM can see at onceThe exact limit depends on model and runtime config; for harness design, the key issue is compaction, truncation, and reloading persistent memory near the limit
APIHow programs call programs; here: LLM REST APIsSend HTTP, get JSON back
StreamingReturn tokens as they are generatedLower latency, pipeline next step earlier
Function calling / Tool useLLM returns structured "please call this tool" JSONPrerequisite for an agent to "do things"
Prompt cacheServer-side cache of long system promptUp to 10× cheaper, lower latency
SandboxConfined process env (FS/network limited)Keeps agent from wrecking your machine
ProviderLLM vendor (Anthropic / OpenAI / Google)Pick one, or write adapters
TurnOne user message + full agent response cycle"One turn" = the main loop runs a full pass
ReActReasoning + Acting loop: think → act → thinkAll five are ReAct variants
MCPModel Context Protocol for external toolsLets agents plug in any 3rd-party tool
CLAUDE.md / AGENTS.mdRoot-level project convention fileRead at startup; a "README for bots"
Plan-and-executeAsk the model to plan first, then execute step by stepopencode's plan mode, claw-code's EnterPlanMode
ReflectionAgent self-reviews after acting; retries on error§7 Takeaway · common auxiliary loop
ToolsTyped functions the agent can call (read file, run bash, browse…)§2 Tools vs Skills table
SkillsMarkdown files (SKILL.md) teaching when/how to use tools§2 · anthropics/skills
SubagentChild agent spawned by a parent; isolated context; returns summary onlyhermes delegate_task, opencode @general/@explore
Orchestration"Who does what, in what order, with what fallback" — the harness's outer layerTop row of §2's five-layer map
HookUser-configured script run at lifecycle moments (before / after a tool call); can intercept, modify, veto, or logclaw-code's PreToolUse / PostToolUse
AdapterTranslation layer between a generic agent loop and a specific provider's API; swap adapter → swap provider, loop unchangedhermes's anthropic_adapter / gemini_native
CompactionAuto-summarise old turns when history exceeds the context window, preserving head and tailclaw-code's auto-compact · hermes's trajectory compression
RolloutOne full start-to-end turn sequence of an agent; in RL you run hundreds in parallel§8 hermes's Modal cloud VM per rollout
SSEServer-Sent Events, a one-way HTTP streaming protocolopencode pushes agent events from server to TUI over SSE
ACPAgent Client Protocol; a stdio protocol between an IDE and an agent (pushed by Zed / Cursor)openclaw's acp bridge, opencode's acp support
LSPLanguage Server Protocol; the standard protocol between an IDE and a language server (goto-definition, find-references, diagnostics, ...)opencode ships LSP as a first-class tool
CDPChrome DevTools Protocol; a wire protocol for programmatically controlling Chromium (the foundation of headless browser automation)openclaw's browser sandbox drives the agent via CDP
noVNCA VNC client that runs in the browser, letting you view a sandbox's GUI desktop over HTTPopenclaw's port 6080 lets you "watch the agent click around in Chromium"
RL (Reinforcement Learning)A training paradigm where the agent learns by trial and error, scored by a "reward function"; usually runs hundreds of rollouts in parallel§8 discusses why hermes suits training — it's designed as an RL-friendly harness
ModalA serverless cloud-VM provider with per-second billing and sub-second cold start; an agent can launch one isolated VM per rollouthermes's RL sandbox runs on Modal
AWS BedrockAWS-managed LLM API gateway that serves Claude, Llama, Mistral and others behind one interfacehermes's bedrock_adapter targets it
OpenRouterThird-party LLM-routing service: one API key calls every major provider, with automatic rate-limit / fallback handlingsupported as a hermes provider
OS sandboxes (seatbelt / landlock / bubblewrap / seccomp)OS-level process-isolation primitives: seatbelt (macOS sandbox-exec) · landlock (Linux capability-dropping kernel feature) · bubblewrap (userland container) · seccomp (syscall allowlist)codex's default mode stacks these to block 99% of accidents
Vercel AI SDKProvider-agnostic TypeScript SDK from Vercel that abstracts streaming / tool-calling / reasoning across vendorsopencode's provider layer uses it; adding a new vendor is a config one-liner
Self-evolvingAgent that writes new SKILL.md files, updates prompts, or augments memory at runtime — so the next run starts from a higher baseline; a step beyond reflection, where what was learned persistsThe Skills layer is the persistence entry point · hermes's skill_manage tool · a natural next step after §7 Reflection

2. 五层概念地图: Prompt → Harness2. The five-layer stack: prompt to harness

2023 年大家对着 prompt 雕花; 2024 年重心转到 context engineering (检索、memory、压缩); 2025 年前沿又往上走了两层——SkillsHarness。本页比的五个开源项目, 本质上都是 harness 工程的不同答卷。

In 2023 the craft was prompt engineering. In 2024 it moved to context engineering — retrieval, memory, compaction. In 2025 the frontier climbed two more layers: Skills and Harnesses. The five projects on this page are all different takes on harness engineering.

管什么产物 / 例子在本页哪里体现
Harness Engineering 主循环 · 沙箱 · 预算 · hook · session · channel hermes-agent / claw-code / codex / opencode / openclaw 都是 harness §4 流程图 + §5 深度拆解
Skills "什么时候 / 怎么用工具"——可复用的 procedural knowledge Anthropic SKILL.md(markdown + YAML frontmatter), anthropics/skills, agentskills.io spec openclaw 文档、hermes skill_manage、claw-code /Claude Code Skill 工具
Tools "agent 能调什么"——typed 函数 bash / read / write / browser / MCP §4 流程图里绿色节点 + §1 术语表
Context Engineering "窗口里装什么"——检索、memory、compaction、prompt cache RAG、MEMORY.md、auto-compaction、cache_control §1.2 function calling · §5 各家的压缩策略
Prompt Engineering "输入文本怎么写" system prompt · few-shot · chain-of-thought 所有层都建立在它之上
LayerConcernArtifacts / examplesWhere it shows on this page
Harness Engineering Main loop · sandbox · budget · hook · session · channel hermes-agent / claw-code / codex / opencode / openclaw are all harnesses §4 diagrams + §5 deep dives
Skills "When and how to use tools" — reusable procedural knowledge Anthropic SKILL.md (markdown + YAML frontmatter), anthropics/skills, agentskills.io spec openclaw docs, hermes skill_manage, Claude Code Skill tool
Tools "What the agent can call" — typed functions bash / read / write / browser / MCP Green nodes in §4 diagrams + glossary in §1
Context Engineering "What goes in the window" — retrieval, memory, compaction, cache RAG, MEMORY.md, auto-compaction, cache_control §1.2 function calling · §5 each project's compaction story
Prompt Engineering "How to phrase the input" System prompt · few-shot · chain-of-thought Every layer above stands on it

一句话记法: Prompt Engineering 是措辞; Context Engineering 是窗口里装什么; Tools 是能干什么; Skills 是什么时候怎么干; Harness Engineering 是整个外骨骼——没有它 LLM 的脑袋没地方安手。

One-line summary: Prompt Engineering is wording; Context Engineering is what fits in the window; Tools are what the agent can do; Skills are when and how to do it; Harness Engineering is the exoskeleton — without it, the LLM brain has nowhere to attach its hands.

3. Claude Code 重点理解3. Understanding Claude Code

Claude Code 最该被理解成一个 agentic harness, 而不是"Claude 加了几个 shell 命令"。它的核心价值不在模型本身, 而在外层状态机: 怎么把用户输入、项目上下文、工具 schema、权限策略、hook、工具结果、压缩摘要组织成一个可持续运行的 turn loop。官方文档把这个循环概括成 gather context → take action → verify results; 本文把它拆成更工程化的状态转移。

Claude Code is best understood as an agentic harness, not "Claude plus a few shell commands." Its value is the outer state machine: how user input, project context, tool schemas, permission policy, hooks, tool results, and compaction summaries are organized into a durable turn loop. The official docs describe the loop as gather context → take action → verify results; this post expands it into implementation-level state transitions.

Claude Code 状态它在做什么为什么重要
Context Assembly读 system prompt、CLAUDE.md / skills / conversation history / tool schemas, 组装本轮请求。决定模型"看见什么"; 这比单句 prompt 更接近真实能力上限。
Model Step流式调用模型, 输出自然语言或结构化 tool_use模型不直接执行动作, 只声明"我想调用什么工具"。
PreToolUse工具执行前先跑 hook, 可以改参、拒绝、要求确认、推迟、补充上下文。这是 Claude Code 的治理入口: 用户能写程序影响 agent, 但强制权限规则仍会评估。
Permission根据工具类型、路径、命令危险度、用户策略做 allow / ask / deny。把"模型想做"和"系统允许做"分开, 防止工具失控。
Execute + Observeharness 执行真实 shell / 文件 / MCP 工具, 把结果作为 tool_result 放回消息历史。LLM 的行动能力来自这里; 它通过观察结果进入下一步推理。
Loop / Terminate如果还有 tool call 就回到下一次 model step; 如果没有 tool call, 本 turn 结束。这就是 coding agent 能多步修 bug 的原因。
Compaction上下文过长时摘要旧历史, 保留关键状态。长任务能继续跑, 不会因为 context 爆掉直接失忆。
Claude Code stateWhat it doesWhy it matters
Context AssemblyLoad system prompt, CLAUDE.md / skills / conversation history / tool schemas, then assemble the request.Determines what the model can see; this matters more than any single prompt.
Model StepStream the model; receive either natural language or structured tool_use.The model does not act directly; it declares which tool it wants.
PreToolUseRun hooks before execution; rewrite input, deny, ask, defer, or add context.This is the governance surface: users can program the agent, while enforced permission rules still apply.
PermissionAllow / ask / deny based on tool type, path, command risk, and user policy.Separates "the model wants" from "the system permits."
Execute + ObserveThe harness runs shell / file / MCP tools and appends tool_result back into history.This is where action happens; the model learns the result by observation.
Loop / TerminateIf tool calls remain, go back to the model; if none remain, end the turn.This is why a coding agent can fix bugs over multiple steps.
CompactionSummarize old history when context is too long, preserving important state.Long tasks can continue instead of losing the session.

最关键的 transition: assistant_message has ToolUse → 进工具管线; no ToolUse → 进入 stop/结束检查; hook 或 permission denied → 生成 error tool_result 让模型读到; context too long → compact 后继续。这四个分支就是 Claude Code 状态机的骨架。

The key transitions: assistant_message has ToolUse → enter the tool pipeline; no ToolUse → enter stop/finalization checks; hook or permission denied → append an error tool_result for the model to read; context too long → compact then continue. Those four branches are the backbone of the Claude Code state machine.

Tools 和 Skills 的分工是 Anthropic 2025 年在 Agent Skills 博客 + anthropics/skills 仓库里推的核心抽象。openclaw 把它复述为一句话: "Tools are what the agent calls; Skills teach the agent when and how."

The Tools / Skills split is the core abstraction Anthropic pushed in 2025 (see their Agent Skills blog and the anthropics/skills repo). openclaw restates it as: "Tools are what the agent calls; Skills teach the agent when and how."

Having all five layers in place only means the system is theoretically capable; whether it actually works requires real-task success rates. That's exactly what ClawBench measures: live web tasks that grade each layer end-to-end, not offline DOM snapshots you can game.

Skill 的标准: SKILL.mdThe SKILL.md standard

github.com/anthropics/skills 就是这个标准的官方参考实现——Anthropic 发布的 Skill 示例合集, 也是 SKILL.md 格式的源头。每个 skill 是一个文件夹, 里面有一个必写文件 SKILL.md: YAML frontmatter 头 (name + description) + Markdown 正文。Claude 在 session 里看到相关任务时, 按 description 自动挂载、读正文、照做。

github.com/anthropics/skills IS the reference repository for this standard — Anthropic's official collection of Skill examples and the origin of the SKILL.md format. Each skill is a folder containing one required file SKILL.md: YAML frontmatter (name + description) followed by a Markdown body. Claude auto-mounts the skill when the description matches the task, reads the body, and follows it.

# 最小模板(来自 anthropics/skills/template/SKILL.md):
---
name: my-skill-name
description: A clear description of what this skill does and when to use it
---

# My Skill Name

[Add your instructions here that Claude will follow when this skill is active]

## Examples · Guidelines · Reference files · etc.
# Minimum template (from anthropics/skills/template/SKILL.md):
---
name: my-skill-name
description: A clear description of what this skill does and when to use it
---

# My Skill Name

[Add your instructions here that Claude will follow when this skill is active]

## Examples · Guidelines · Reference files · etc.

真实例子: anthropics/skills/skills/pdf/SKILL.md 的 description 写得很细——"用户要读 PDF / 合并 / 分页 / 旋转 / 水印 / OCR 时用这个 skill"——Claude 看到这些关键字就自动挂上。正文里放 Python 代码片段、命令行工具指引、REFERENCE.md 链接等可复用知识。仓库目前有 17 个官方 skill (algorithmic-art / pdf / docx / pptx / xlsx / mcp-builder / skill-creator / webapp-testing / brand-guidelines 等), 覆盖创作 / 办公文档 / 开发 / 企业协作四大类。

Real example: anthropics/skills/skills/pdf/SKILL.md has a very precise description — "use this skill whenever the user wants to read PDFs / merge / split / rotate / watermark / OCR" — Claude auto-invokes on those keywords. The body contains Python snippets, CLI guidance, links to REFERENCE.md, etc. The repo ships 17 official skills today (algorithmic-art, pdf, docx, pptx, xlsx, mcp-builder, skill-creator, webapp-testing, brand-guidelines, …), covering creative / office / development / enterprise categories.

Tools vs Skills 对照表Tools vs Skills side-by-side

维度ToolsSkills
是什么带类型签名的函数带 YAML frontmatter 的 Markdown 文件夹
谁执行harness 执行 (调真实 API / shell / FS)LLM 自己读完照做 (instructions + 参考资料)
回答的问题"agent 调什么?""什么时候 / 怎么调?"
进入上下文schema 列在 tools[]description 常驻, 正文按需挂载
跨 harness 复用每家 harness 都要自己实现同一 SKILL.md 任何支持的 agent 都能装
例子bashreadwritebrowser、MCP 工具pdfmcp-builderfrontend-design
DimensionToolsSkills
WhatTyped function with a signatureFolder of Markdown with YAML frontmatter
ExecutorThe harness runs it (hits real APIs / shell / FS)The LLM reads it and follows (instructions + refs)
Question"What can the agent call?""When and how should it call things?"
Context costSchema sits in tools[]Description always loaded; body mounted on demand
PortabilityEach harness re-implementsSame SKILL.md works on any compatible agent
Examplesbash, read, write, browser, MCP toolspdf, mcp-builder, frontend-design

在本页五家里对号入座: Claude Code (≈ claw-code) 提供一级 Skill 工具直接挂载 SKILL.md; openclaw 文档大篇幅讲 Skills, 社区 53 个已公开; hermes-agent 提供 skill_view / skills_list / skill_manage 三件工具, 按 agentskills.io spec 加载 SKILL.md; opencode 以 Markdown frontmatter 定义 agent 接近此思路; codex 没有一等 Skill 概念, 用 AGENTS.md 做类 CLAUDE.md 的项目注入。

How the five relate: Claude Code (≈ claw-code) ships a first-class Skill tool that mounts SKILL.md; openclaw devotes major docs space to Skills (53 community-published); hermes-agent provides skill_view / skills_list / skill_manage tools that load SKILL.md per the agentskills.io spec; opencode's Markdown-frontmatter agents sit close to this idea; codex has no first-class Skill concept — it uses AGENTS.md like CLAUDE.md for per-project instructions.

3.5 Flue Framework: 把 harness 写成可编程 TS3.5 Flue Framework: harness as programmable TypeScript

Flue Framework 把本页讨论的"五层栈"重新画成四层模型: Model · Harness · Sandbox · Filesystem。它的口号 "Not another SDK" 表明态度——不是再造一套 chat 抽象, 而是给 harness 这一层提供可编程的 TypeScript 控制面。把它放进本页的对比, 价值在于: 它用一个外部视角验证了 §2 五层地图里 harness 是真正的工程主轴。

Flue Framework recasts the "five-layer stack" of this page into a four-layer model: Model · Harness · Sandbox · Filesystem. Its slogan "Not another SDK" is a stance — it doesn't add another chat abstraction; it offers a programmable TypeScript control plane at the harness layer. It earns a place in this comparison because it independently confirms §2's claim: the harness is the real engineering axis.

Flue 的四层 ↔ 本页五层Flue's four layers ↔ the five-layer stack

Flue 层它管什么对应本页 §2 哪一层
Modeltokens · tools · promptsPrompt + Context + Tools
Harnessskills · memory · sessionsSkills + Harness Engineering
Sandboxbash 执行 · 隔离 · 网络管理Harness Engineering 里的 sandbox 子模块
Filesystemread / write / grep / globTools 层的核心成员
Flue layerWhat it ownsMaps to §2 layer
Modeltokens · tools · promptsPrompt + Context + Tools
Harnessskills · memory · sessionsSkills + Harness Engineering
Sandboxbash exec · isolation · network policySandbox sub-module of Harness Engineering
Filesystemread / write / grep / globCore members of the Tools layer

三个一等概念Three first-class primitives

概念Flue 怎么定义对照其他 harness
Session持续的工作状态容器, 可挂 skill / 跑 prompt / 执 shell≈ Claude Code 的 turn loop · opencode 的 session.sql · openclaw 的 session 路由
Skill结构化输入输出的可复用 workflow (例: triage(issueNumber) → typed result)比 Anthropic SKILL.md 更像"带 schema 的子程序"——更接近 hermes 的 delegate_task + skill
Sandbox三档可换: 内置零配置 virtual sandbox / 远程容器 (Daytona) / 云后端 (Cloudflare Durable Object + SQLite + R2)codex 走 OS 原语 (seatbelt/landlock); hermes 走 Modal 云 VM; Flue 把这条选项做成plug-in
ConceptFlue's definitionCounterpart in the five
SessionDurable work-state container; you can mount skills, run prompts, exec shell≈ Claude Code's turn loop · opencode's session.sql · openclaw's session routing
SkillReusable workflow with structured I/O (e.g. triage(issueNumber) → typed result)More "schema'd subroutine" than Anthropic's SKILL.md — closer to hermes's delegate_task + skill
SandboxThree pluggable backends: built-in virtual sandbox / remote container (Daytona) / cloud (Cloudflare Durable Object + SQLite + R2)codex uses OS primitives (seatbelt/landlock); hermes uses Modal VMs; Flue makes the choice swappable

两个值得抄的设计Two ideas worth lifting

在本页 5+1 张地图里坐标Where Flue sits on the 5+1 map

维度Flue最像谁不一样在哪
语言TypeScriptopencode (TS+Go) · openclaw (TS)纯 TS, 不需要 Go runtime
形态SDK / library, 用户用 TS 写 agent 入口opencode 的 core library更彻底——没有自带 TUI, 部署形态完全交给用户
Sandbox三档可插hermes (Modal) · openclaw (Docker)把"挑哪个 sandbox"做成配置而不是源码 fork
定位"自主代理可编程控制面"偏 opencode 的服务化思路更强调"全栈自控": agent 逻辑 + harness + sandbox 都在你这边
DimensionFlueClosest siblingHow it differs
LanguageTypeScriptopencode (TS+Go) · openclaw (TS)Pure TS — no Go runtime needed
ShapeSDK / library; users write the agent entry in TSopencode's core libraryMore radical — no bundled TUI, deployment shape is fully user-decided
SandboxThree pluggable backendshermes (Modal) · openclaw (Docker)Backend choice is a config knob, not a source fork
Stance"Programmable control plane for autonomous agents"opencode's service-shaped approachPushes harder on full-stack ownership: agent logic + harness + sandbox all yours

一句话定位: 如果 §2 把 harness 列为 2025 年最值得做的工程层, Flue 就是把这层做成一个 TS package的最直接尝试——它没有把"agent"当成产品, 而是当成由你写的 TS 代码 + 一个标准 harness runtime

One-line placement: if §2 names harness engineering as the layer of 2025, Flue is the most literal attempt to ship that layer as a TS package — it doesn't treat "agent" as a product, but as your TS code on top of a standard harness runtime.

4. 六个流程图4. Six diagrams

下面六张图讲的是这些 harness 怎么运作;想量化它们到底 work 得多好, 用我们的 ClawBench 在真实网页任务上跑一跑就知道。The six diagrams below show how these harnesses work; to quantify how well they actually work, run them against our ClawBench on live web tasks.

hermes-agent

Python · multi-provider
ReAct 循环 + 共享迭代预算 + 子代理委派。 ReAct loop with shared iteration budget and sub-agent delegation.
flowchart TD U([User message]):::io A[Apply prompt cache + memory · every 10 turns]:::ctx M{{Adapter.stream · Anthropic · Bedrock · Gemini}}:::model P[Parse tool_calls · preserve reasoning_content]:::model R[ToolRegistry.dispatch · 47 built-in tools]:::tool S{delegate_task?}:::decision SA[[Spawn sub-agent · shared IterationBudget]]:::sub RES[Append tool results]:::tool C[ContextCompressor · if near context limit]:::ctx B{budget > 0?}:::decision Y([Return final message]):::io U --> A --> M --> P --> R --> S S -- yes --> SA --> RES S -- no --> RES RES --> C --> B B -- yes --> A B -- no --> Y class U step1 class A step2 class M step3 class P step4 class R step5 class SA step6 class RES step7 class C step8 class B step9 click U call jumpTo("hermes", 1) click A call jumpTo("hermes", 2) click M call jumpTo("hermes", 3) click P call jumpTo("hermes", 4) click R call jumpTo("hermes", 5) click SA call jumpTo("hermes", 6) click RES call jumpTo("hermes", 7) click C call jumpTo("hermes", 8) click B call jumpTo("hermes", 9) classDef io fill:#233042,stroke:#7aa2f7,color:#e6e8ef; classDef model fill:#2b1f3a,stroke:#bb9af7,color:#e6e8ef; classDef tool fill:#1f3a2b,stroke:#9ece6a,color:#e6e8ef; classDef sub fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef ctx fill:#3a1f2b,stroke:#f7768e,color:#e6e8ef; classDef decision fill:#2d2d3a,stroke:#8a93a6,color:#e6e8ef;
Step 0 / 9
  • 共享预算——父 + 所有子代理共用一个 IterationBudget, 不会 fork-bomb。
  • 临时注入——memory nudge (读取 MEMORY.mdUSER.md) 只在 API 调用时加, 不污染 prompt cache 前缀。
  • Adapter 分发——一套 loop 对接 N 家 provider; 错误分类器自动切换。
  • Modal 沙箱——每个 rollout 独立云 VM, RL 奖励函数看到一致的 FS 状态。
  • Shared budget — parent + all sub-agents draw from one IterationBudget; can't fork-bomb.
  • Ephemeral injections — memory nudges (reading MEMORY.md and USER.md) added at API time only, keeping cache prefix stable.
  • Adapter fan-out — one loop, N providers; error classifier routes failures.
  • Modal sandbox — each rollout in its own cloud VM; RL reward funcs see identical FS.
run_agent.py:634 — max_iterations=90 default run_agent.py:730 — IterationBudget init run_agent.py:8076 — delegate_task dispatch run_agent.py:100 — apply_anthropic_cache_control

claw-code

Rust · hooks-first
Claude Code 风格状态机: 模型流式产出, 有 ToolUse 就 hook → permission → execute → hook, 没有 ToolUse 就进入结束检查。 Claude Code-style state machine: stream model output; ToolUse triggers hook → permission → execute → hook; no ToolUse enters finalization checks.
flowchart TD U([User message]):::io B[BootstrapPlan — 12 phases, once per session]:::ctx L[Assemble ApiRequest · system_prompt + messages]:::ctx API{{ApiClient.stream · AssistantEvent · PromptCacheEvent}}:::model TU[Parse ToolUses]:::model H1[PreToolUse hook · allow · ask · deny · defer · modify]:::gate PG[PermissionPolicy · authorize_with_context]:::gate EX[Execute tool · bash · file · mcp · web]:::tool H2[PostToolUse hook · success or failure]:::gate CMP[Auto-compact + health probe]:::ctx TS([TurnSummary · persist Session]):::io U --> B --> L --> API --> TU --> H1 --> PG --> EX --> H2 --> CMP CMP -- more tools --> L CMP -- done --> TS class U step1 class B step2 class L step3 class API step4 class TU step5 class H1 step6 class PG step7 class EX step8 class H2 step9 class CMP step10 click U call jumpTo("claw", 1) click B call jumpTo("claw", 2) click L call jumpTo("claw", 3) click API call jumpTo("claw", 4) click TU call jumpTo("claw", 5) click H1 call jumpTo("claw", 6) click PG call jumpTo("claw", 7) click EX call jumpTo("claw", 8) click H2 call jumpTo("claw", 9) click CMP call jumpTo("claw", 10) classDef io fill:#233042,stroke:#7aa2f7,color:#e6e8ef; classDef model fill:#2b1f3a,stroke:#bb9af7,color:#e6e8ef; classDef tool fill:#1f3a2b,stroke:#9ece6a,color:#e6e8ef; classDef gate fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef ctx fill:#3a1f2b,stroke:#f7768e,color:#e6e8ef;
Step 0 / 10
  • Hook 先于权限——PreToolUse hook 可以在权限引擎之前否决、要求确认、推迟或改写调用; 强制 deny/ask 规则仍是最终安全边界。
  • 主循环无子代理——task registry 只做异步后台; 多 agent 协作被推到 context 外。
  • 有 provenance 的压缩——摘要记录为 SessionCompaction 事件 + 健康探针。
  • Workspace 绑定——workspace_root 防并行 lane 写错 CWD。
  • Hooks before permissions — a PreToolUse hook can deny, ask, defer, or rewrite a call before the policy engine; enforced deny/ask rules remain the safety boundary.
  • No in-loop sub-agents — task registry is for async background only; multi-agent coord pushed outside.
  • Auto-compaction with provenance — summaries logged as SessionCompaction events + health probe.
  • Workspace bindingworkspace_root prevents parallel lanes writing to wrong CWD.

状态转移速读State transitions

当前状态State 触发条件Trigger 下一状态Next
UserInput 用户输入被追加到 session messagesUser message appended to session messages BuildRequest
BuildRequest system prompt + 历史 messages 组装完成System prompt + history assembled ModelStream
ModelStream assistant message 没有 ToolUse blockAssistant message has no ToolUse block TurnDone
ModelStream 解析到一个或多个 ToolUse blockOne or more ToolUse blocks parsed PreToolUse
PreToolUse hook 允许、改写、要求询问或直接拒绝Hook allows, rewrites, asks, or denies Permission / ToolResult(error)
Permission policy allow / ask / denyPolicy allows / asks / denies ExecuteTool / ToolResult(error)
ExecuteTool 工具 stdout/stderr 或结构化结果返回Tool stdout/stderr or structured result returned PostToolUse
PostToolUse tool_result 被追加回 messages; 本轮工具全部处理完Tool result appended to messages; all tool calls processed BuildRequest
conversation.rs:314 — run_turn() conversation.rs:414 — PreToolUse hook gate conversation.rs:432 — authorize_with_context compact.rs:96 — compact_session hooks.rs:23 — HookEvent enum

codex

OpenAI Responses API
结构化 output[] 流 + 原生推理项 + 沙箱 bash。 Structured output[] stream with first-class reasoning items and sandboxed bash.
flowchart TD U([User message]):::io K[_build_api_kwargs · instructions · tools · reasoning.effort]:::ctx ST{{responses.stream · with reasoning.encrypted_content}}:::model FB[[Fallback — responses.create stream · synthesize from deltas]]:::model N[_normalize_codex_response · parse output array]:::model RS[codex_reasoning_items · dedup by ID across turns]:::ctx PP[PermissionPolicy · ReadOnly · WorkspaceWrite · DangerFull]:::gate SB[Exec in sandbox · seatbelt · landlock]:::tool AP[Append tool result]:::tool CK{incomplete or commentary}:::decision Y([Return message]):::io U --> K --> ST ST -- transport err --> FB --> N ST --> N --> RS RS --> CK CK -- function_call --> PP --> SB --> AP --> K CK -- commentary --> K CK -- completed --> Y class U step1 class K step2 class ST step3 class FB step4 class N step5 class RS step6 class PP step7 class SB step8 class AP step9 class CK step10 click U call jumpTo("codex", 1) click K call jumpTo("codex", 2) click ST call jumpTo("codex", 3) click FB call jumpTo("codex", 4) click N call jumpTo("codex", 5) click RS call jumpTo("codex", 6) click PP call jumpTo("codex", 7) click SB call jumpTo("codex", 8) click AP call jumpTo("codex", 9) click CK call jumpTo("codex", 10) classDef io fill:#233042,stroke:#7aa2f7,color:#e6e8ef; classDef model fill:#2b1f3a,stroke:#bb9af7,color:#e6e8ef; classDef tool fill:#1f3a2b,stroke:#9ece6a,color:#e6e8ef; classDef gate fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef ctx fill:#3a1f2b,stroke:#f7768e,color:#e6e8ef; classDef decision fill:#2d2d3a,stroke:#8a93a6,color:#e6e8ef;
Step 0 / 10
  • Responses API, 不是 Chat Completions——输出是 typed output[](message / function_call / reasoning)。
  • 推理跨 turn 保留——include: ["reasoning.encrypted_content"], 按 ID 去重。
  • 三级流式回退——stream() → 重试 → create(stream=True) → 从 deltas 合成, 永不静默掉 turn。
  • OS 级沙箱——seatbelt (macOS) / landlock (Linux) 在 bash 执行前 gate FS/网络。
  • Responses API, not Chat Completions — typed output[] of message · function_call · reasoning.
  • Reasoning across turnsinclude: ["reasoning.encrypted_content"], deduplicated by ID.
  • Streaming fallback cascadestream() → retry → create(stream=True) → synthesize.
  • OS-level sandbox — seatbelt / landlock gate FS/network before bash.
run_agent.py:5168 — _run_codex_stream run_agent.py:5183 — responses.stream(**api_kwargs) run_agent.py:5297 — fallback responses.create run_agent.py:4640 — _normalize_codex_response run_agent.py:7266 — reasoning ID dedup

opencode

TS server + Go TUI
客户端-服务端分离 · HTTP · 任何前端都能驱动同一个 core。 Client-server split over HTTP — any frontend drives the same agent core.
flowchart TD TUI([Go TUI / Web / IDE]):::io SRV[POST /session/:id/message · → Bun server loop]:::ctx MODE{Agent mode}:::decision AI{{Vercel AI SDK stream · Anthropic · OAI · Google · Copilot · local}}:::model TC[Tool-call parts]:::model PG[Permission gate · allow · ask · deny + wildcards]:::gate EX[Tool executor · bash · edit · read · grep · lsp · mcp]:::tool SUB[[Subagent · general · explore]]:::sub APP[Append result]:::tool CC[Compaction / summary / title · hidden system agents]:::ctx SSE([SSE /global/event · → TUI renders parts]):::io TUI --> SRV --> MODE MODE -->|build: full tools| AI MODE -->|plan: read-only, ask first| AI AI --> TC --> PG --> EX EX --> SUB --> APP EX --> APP APP --> CC CC --> AI CC -.stream events.-> SSE class TUI step1 class SRV step2 class MODE step3 class AI step4 class TC step5 class PG step6 class EX step7 class SUB step8 class APP step9 class CC step10 click TUI call jumpTo("opencode", 1) click SRV call jumpTo("opencode", 2) click MODE call jumpTo("opencode", 3) click AI call jumpTo("opencode", 4) click TC call jumpTo("opencode", 5) click PG call jumpTo("opencode", 6) click EX call jumpTo("opencode", 7) click SUB call jumpTo("opencode", 8) click APP call jumpTo("opencode", 9) click CC call jumpTo("opencode", 10) classDef io fill:#233042,stroke:#7aa2f7,color:#e6e8ef; classDef model fill:#2b1f3a,stroke:#bb9af7,color:#e6e8ef; classDef tool fill:#1f3a2b,stroke:#9ece6a,color:#e6e8ef; classDef sub fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef gate fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef ctx fill:#3a1f2b,stroke:#f7768e,color:#e6e8ef; classDef decision fill:#2d2d3a,stroke:#8a93a6,color:#e6e8ef;
Step 0 / 10
  • HTTP 作为边界——TUI / Web / IDE 都走 /session/*; OpenAPI 3.1 spec 在 /doc(server.ts), 配合 mDNS broadcast (server/mdns.ts), 任何客户端都能自动发现并生成 SDK。
  • build 与 plan 模式——同一套工具, 不同权限映射: build(默认模式) 放行 edit/write/bash;plan 把写类工具 (edit/write/patch/bash) 降到 askdeny(默认行为因 agent.ts 内置 + 用户 config 合并而定), 只读工具自动放行。一个 loop, 两种人格。
  • Provider 无关——Vercel AI SDK 把 streaming / tool-calling / reasoning 下放到各 adapter。
  • 一等 LSP + MCP——代码智能和外部工具与原生工具并列。
  • HTTP as the boundary — TUI / web / IDE all speak to /session/*; OpenAPI 3.1 spec at /doc (server.ts) and mDNS broadcast (server/mdns.ts) let any client discover and generate an SDK.
  • Build vs plan modes — plan defaults edits/bash to ask, same loop two personas.
  • Provider-agnostic — Vercel AI SDK delegates streaming / tool / reasoning to each adapter.
  • First-class LSP + MCP — code intelligence and external tools sit beside native ones.
packages/opencode/src/tool — native tools packages/opencode/src/mcp — MCP client packages/opencode/src/lsp — LSP bridge packages/tui — Go TUI client (SSE consumer)

openclaw

TS · multi-channel gateway
本地 Gateway + 多通道 (IM/CLI/iOS/IDE) + Docker 浏览器沙箱。 Local Gateway + many channels (IM/CLI/iOS/IDE) + Dockerised browser sandbox.
flowchart TD CH([Channel input · IM · CLI · iOS · IDE · 10+ providers]):::io GW[Gateway · local-first orchestrator]:::ctx SR[Resolve session · history · DM pairing]:::ctx BP[Build payloads · system + tools + schemas]:::ctx PR{{Provider plugin · Anthropic · OpenAI · Google · ...}}:::model MC[Parse text + tool_calls · streaming deltas]:::model TP[ToolPolicy pipeline · per sandbox + channel]:::gate EX[Execute tool · bash · file · canvas]:::tool BR[[Browser sandbox · Docker + Chromium + CDP + noVNC]]:::sub CMP[Async compaction · if near context limit]:::ctx OUT(["Emit events → all channels"]):::io CH --> GW --> SR --> BP --> PR --> MC --> TP --> EX EX -- browser tool --> BR --> EX EX --> CMP CMP -- more tools --> PR CMP -- done --> OUT class CH step1 class GW step2 class SR step3 class BP step4 class PR step5 class MC step6 class TP step7 class EX step8 class BR step9 class CMP step10 click CH call jumpTo("openclaw", 1) click GW call jumpTo("openclaw", 2) click SR call jumpTo("openclaw", 3) click BP call jumpTo("openclaw", 4) click PR call jumpTo("openclaw", 5) click MC call jumpTo("openclaw", 6) click TP call jumpTo("openclaw", 7) click EX call jumpTo("openclaw", 8) click BR call jumpTo("openclaw", 9) click CMP call jumpTo("openclaw", 10) classDef io fill:#233042,stroke:#7aa2f7,color:#e6e8ef; classDef model fill:#2b1f3a,stroke:#bb9af7,color:#e6e8ef; classDef tool fill:#1f3a2b,stroke:#9ece6a,color:#e6e8ef; classDef sub fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef gate fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef ctx fill:#3a1f2b,stroke:#f7768e,color:#e6e8ef;
Step 0 / 10
  • 多通道路由——Discord / Slack / Telegram / WhatsApp / iMessage / Signal / Matrix / Microsoft Teams / Google Chat / Zalo 等 IM 消息都进同一个 Gateway(长驻守护进程), CLI / iOS / IDE 则作为额外入口, 全部路由到同一组 session。
  • Embedded Runner + Gateway 拆分——agent 核心可嵌入 CLI / 浏览器 / 远端, Gateway 管 channels / cron / auth。
  • Docker 浏览器沙箱——Chromium + CDP + noVNC, 操作可视化调试, 自动化与"我来看它点什么"并存。
  • ACP 桥接 IDE——openclaw acp 暴露 stdio 协议, Zed / Cursor 可直接驱动同一 agent。
  • Multi-channel routing — IM traffic from Discord / Slack / Telegram / WhatsApp / iMessage / Signal / Matrix / Teams / Google Chat / Zalo all feeds one Gateway (long-lived daemon); CLI / iOS / IDE act as additional entry points, all landing in shared sessions.
  • Embedded runner ↔ Gateway split — core agent is portable (CLI / browser / remote); Gateway owns channels / cron / auth.
  • Dockerised browser sandbox — Chromium + CDP + noVNC; automation while you can watch it click.
  • ACP bridge to IDEsopenclaw acp exposes stdio protocol; Zed / Cursor drive the same agent.
src/agents/pi-embedded-runner/run.ts — main turn loop src/agents/pi-tools.ts — tool registry + lazy loading src/agents/sandbox/browser.ts — Docker CDP browser src/agents/pi-embedded-runner/compact.ts — async compaction src/acp/session.ts — ACP ↔ Gateway bridge

pi

TypeScript · minimal harness
4 个工具默认 (read/write/edit/bash); 一切其它特性住在 TS Extensions / Skills / Packages 里。 Four default tools (read/write/edit/bash); every other feature lives in TS Extensions / Skills / Packages.
flowchart TD U([User input · Enter = steer · Alt+Enter = queue]):::io AS[AgentSession · assemble system + AGENTS.md + skills + history]:::ctx PR{{Provider · 15+ via OAuth or API key · /model switches mid-session}}:::model ST[Stream typed events · text · tool_use · usage]:::model EX[ExtensionRunner · before-tool hook · may rewrite or block]:::gate T4[[Default tools · read / write / edit / bash]]:::tool EXT[[Extension tools · sub-agent / plan / MCP / sandbox · user-installed]]:::sub TR[Session tree · append message · parentID for branching]:::ctx CMP[Compaction · replaceable strategy · default summary-rewrite]:::ctx OUT(["Render to TUI · or emit JSON · or return via SDK"]):::io U --> AS --> PR --> ST --> EX EX -- default --> T4 EX -- extension --> EXT T4 --> TR EXT --> TR TR --> CMP CMP -- more tool_use --> PR CMP -- done --> OUT OUT -- user steers --> AS class U step1 class AS step2 class PR step3 class ST step4 class EX step5 class T4 step6 class EXT step7 class TR step8 class CMP step9 class OUT step10 click U call jumpTo("pi", 1) click AS call jumpTo("pi", 2) click PR call jumpTo("pi", 3) click ST call jumpTo("pi", 4) click EX call jumpTo("pi", 5) click T4 call jumpTo("pi", 6) click EXT call jumpTo("pi", 7) click TR call jumpTo("pi", 8) click CMP call jumpTo("pi", 9) click OUT call jumpTo("pi", 10) classDef io fill:#233042,stroke:#7aa2f7,color:#e6e8ef; classDef model fill:#2b1f3a,stroke:#bb9af7,color:#e6e8ef; classDef tool fill:#1f3a2b,stroke:#9ece6a,color:#e6e8ef; classDef sub fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef gate fill:#3a2b1f,stroke:#e0af68,color:#e6e8ef; classDef ctx fill:#3a1f2b,stroke:#f7768e,color:#e6e8ef;
Step 0 / 10
  • 4 工具默认——read / write / edit / bash 是模型唯一能直接调的工具; find/grep/ls 存在但默认未挂载, 让 bash 走原生工具链。
  • Steering 双键——agent 跑工具时 Enter 立刻打断后续工具并塞新消息进推理; Alt+Enter 排队等本轮跑完。
  • "什么不在 core 里"——sub-agent、plan mode、MCP、权限弹窗、todo、后台 bash 全故意放到 extension/package 里; 想要就写一个或装一个。
  • 会话作为树——/tree 跳回任意旧消息从那分叉; 全部分支住在同一文件; /share 上传到 GitHub gist 直出可分享 URL。
  • SDK 嵌入——同一个 AgentSession 跑 4 模式: TUI / print(JSON) / RPC / SDK; openclaw 用 SDK 把 pi 嵌进自家 runner。
  • Four default tools — read / write / edit / bash are the only ones the model can call directly; find/grep/ls exist as files but aren't mounted, so the model uses native shell via bash.
  • Two-key steering — while the agent is running tools, Enter interrupts the remaining tools and lands a new message in reasoning; Alt+Enter queues a follow-up until the current run ends.
  • "What we didn't build" — sub-agents, plan mode, MCP, permission popups, todos, background bash are all deliberately pushed into extensions / packages; build one or install one.
  • Session as a tree/tree jumps back to any old message and forks from there; every branch lives in one file; /share uploads to a GitHub gist and returns a shareable URL.
  • SDK embedding — the same AgentSession runs in four modes: TUI / print(JSON) / RPC / SDK; openclaw uses the SDK to embed pi as its runner.
packages/coding-agent/src/core/agent-session.ts — AgentSession (3099 lines) packages/coding-agent/src/core/tools/{read,write,edit,bash}.ts — 4 default tools packages/coding-agent/src/core/extensions/ — TS extension runtime packages/coding-agent/src/core/compaction/ — replaceable compaction packages/coding-agent/src/modes/{interactive,print-mode.ts,rpc} — 4 run modes packages/coding-agent/src/core/sdk.ts — embed API (used by openclaw)

5. 一眼看懂5. At a glance

下面表里的术语若有陌生, 后面"深度拆解"会讲透——先扫一眼整体。Unfamiliar terms below will be explained in the deep dives — just skim for now.

AgentAgent 技术栈Stack 主循环Loop driver 沙箱 / 权限Sandbox / perms 招牌特性Signature feature
hermes-agent Python, 多 provider adapterPython, multi-provider adapters ReAct + 共享 IterationBudget(默认 90)ReAct w/ shared IterationBudget (default 90) Modal 云 VM; bash 权限策略Modal cloud VM; bash policy 子代理与父共享预算Sub-agents share parent budget
claw-code Rust runtime + Python 参考Rust runtime + Python reference run_turn() 每工具级 gatingrun_turn() per-tool gating Pre/Post hook → PermissionPolicyPre/Post hooks → PermissionPolicy Hook 触发早于权限检查Hooks fire before permission
codex OpenAI Responses API (非 Chat Completions)OpenAI Responses API responses.stream() + 回退级联responses.stream() w/ fallback cascade seatbelt (macOS) / landlock (Linux)seatbelt / landlock 加密推理跨 turn 保留Encrypted reasoning across turns
opencode TS 服务端 (Bun) + Go TUI, HTTP 分离TS server (Bun) + Go TUI, HTTP split 客户端 POST → 服务端 loop → SSE 回推Client POST → server loop → SSE back 每工具 allow | ask | denyPer-tool allow | ask | deny 任何前端都能驱动同一个 agent coreAny frontend drives the same core
openclaw TypeScript, 本地 Gateway + 多通道TypeScript, local Gateway + multi-channel Channel → Gateway → embedded runner → streamChannel → Gateway → embedded runner → stream Docker 沙箱 + per-channel ToolPolicyDocker sandbox + per-channel ToolPolicy IM / CLI / iOS / IDE 都路由到同一 agentIM / CLI / iOS / IDE all route to one agent

6. 六家深度拆解6. Six deep dives

每家按同一模板: 目的 → 核心机制 (带源码行号)→ 为什么 work → 为什么好 → 代价

Same template for each: purpose → key mechanisms (with line numbers) → why it works → why it's good → cost.

3.1 hermes-agent hermes-agent/run_agent.py

为什么你该看懂这家: 它展示了"一个 loop 兼容多家 LLM provider"和"让 agent 派小弟并防止失控"——自己造 agent 时第一个会撞到的两个工程问题。

Why you should care: it shows how to support multiple LLM providers in one loop, and how to delegate to sub-agents without losing control — the first two engineering problems you'll hit when building your own agent.

目的Purpose

做一个 provider-agnostic 的通用 agent: 今天 Claude, 明天 Gemini, loop 不用动。同时支持把大任务拆给子 agent 并行处理。

A provider-agnostic agent: swap Claude for Gemini without touching the loop. Plus parallel sub-agent delegation for large tasks.

核心机制Key mechanisms

为什么 workWhy it works

为什么好Why it's good

对 RL 训练场景极友好: agent 扔进 Modal 云沙箱, 同时跑 100 个 rollout, 每个独立 FS 但共享奖励函数。MCP server (mcp_serve.py) 把内部对话反向暴露, Claude Code / Cursor 能把 hermes 当工具。

Great for RL training: drop into Modal cloud sandbox, run 100 rollouts in parallel, each with its own FS but a shared reward function. MCP server (mcp_serve.py) exposes internal conversations outward, letting Claude Code / Cursor consume hermes as a tool. ClawBench is a natural RL evaluation target for this setup — its per-evidence scores plug straight in as reward signal.

代价Cost

复杂度高, 单文件 12,000 + 行。Multi-provider 意味着无法 1:1 映射各家最新特性 (比如 Claude 的 extended thinking 在 Gemini 上没有等价物)。

Complex — single file is 12,000 + lines. Multi-provider means you can't 1:1 map each vendor's latest features (e.g., Claude's extended thinking has no Gemini equivalent).

3.2 claw-code claw-code/rust/crates/runtime/src/

为什么你该看懂这家: 它展示了"如何让 agent 在生产环境也敢用"——每个危险动作都能被脚本拦下来审查、改写、记录。想把 agent 上线的人必看。

Why you should care: shows how to make an agent safe enough for production — every dangerous action can be intercepted, rewritten, or audited by scripts. Essential for anyone shipping an agent.

注: claw-code 是 Claude Code 的 Rust 开源复刻。Anthropic 官方文档把 Claude Code 定位为围绕 Claude 的 agentic harness(见 How Claude Code Works)。官方可确认的是 agentic loop、工具、权限、hooks、CLAUDE.md / memory、context compaction 这些机制; 本节的具体源码行、100k 压缩阈值、12 阶段 Bootstrap、health probe 是 claw-code 的实现选择, 不等于官方 Claude Code 内部实现。

Note: claw-code is an open-source Rust reimplementation of Claude Code. Anthropic's docs describe Claude Code as an agentic harness around Claude (see How Claude Code Works). The official surface confirms the agentic loop, tools, permissions, hooks, CLAUDE.md / memory, and context compaction; the source lines, 100k compaction threshold, 12-phase bootstrap, and health probe in this section are claw-code implementation choices, not official Claude Code internals.

目的Purpose

做一个可审计、可干预的 Claude Code 开源实现。每一次工具调用都可以被脚本拦截、改参、查权限、记日志、事后清理。

An auditable, interceptable Claude Code reimplementation. Every tool call can be intercepted, rewritten, permission-checked, logged, or cleaned up afterward.

核心机制Key mechanisms

为什么 workWhy it works

为什么好Why it's good

同一个 runtime 可以跑出完全不同风格的 agent: 开发者用=宽权限, hook 做 lint; 生产跑=收紧权限, hook 强制 dry-run; 教学用=所有 bash 都 ask。Rust 实现启动快内存低, 可内嵌到其他程序。

Same runtime, different personas: developers get wide permissions with lint hooks; prod gets strict permissions with forced dry-run hooks; teaching mode asks on every bash. Rust impl means fast startup, low memory, embeddable in other programs.

代价Cost

主 loop 里没有子代理 (task registry 只做异步后台)。多 agent 协作被推到 runtime 外部——这是 claw-code 的哲学选择: "让 agent context 专注做事, 不要用来开会"。

Main loop has no sub-agents (task registry is async background only). Multi-agent coordination is pushed outside the runtime — a deliberate philosophy: "keep agent context focused on work, not meetings."

3.3 codex codex/ + hermes adapter run_agent.py:5168+

为什么你该看懂这家: Responses API 是未来几年其他服务商大概率会跟进的方向。提前看懂 = 别家跟进时你能立刻上手。

Why you should care: Responses API is likely the direction other vendors will follow over the next few years. Learn it now, be ready when others catch up.

目的Purpose

展示 OpenAI 把 agent 能力直接内置到 API 会是什么样——不是让客户端组装工具调用, 而是 API 直接返回"我在想什么 / 要调什么工具 / 要说什么"的结构化流。

What it looks like when OpenAI bakes agent capability into the API itself — not client-side tool-call assembly, but the API streaming structured items: "what I'm thinking / which tool to call / what to say."

核心机制Key mechanisms

为什么 workWhy it works

为什么好Why it's good

第一方优化——Responses API 是 agent 一等公民。客户端只处理回退与去重, 服务端负责推理、缓存、流式, 整条链路比 "Chat Completions + 手搓 agent loop" 干净得多。对专门用 OpenAI 的团队, codex 是上限最高的方案。

First-party optimization — Responses API treats agents as first-class citizens. The client only handles fallback and dedup; the server owns inference, caching, streaming. The whole pipeline is much cleaner than "Chat Completions + hand-rolled agent loop." For OpenAI-committed teams, codex is the highest-ceiling option.

代价Cost

锁定 OpenAI——Responses API 目前只有 OpenAI。推理加密——你拿不到纯文本推理内容, 只能原样传回。

Locked to OpenAI — Responses API is OpenAI-only today. Reasoning is encrypted — you can't inspect it, only pass it back.

3.4 opencode github.com/sst/opencode

为什么你该看懂这家: 如果你要做 IDE 插件、团队共享 agent、或多端同步, 这是蓝图。

Why you should care: if you want to build an IDE plugin, a team-shared agent, or multi-client sync — this is the blueprint.

目的Purpose

解决一个工程问题: agent 不应该和终端 UI 绑死。今天 TUI, 明天 VS Code 插件, 后天 iPhone app——agent 逻辑应该只写一份。

One engineering problem: agent logic should not be bound to a TUI. TUI today, VS Code plugin tomorrow, iPhone app the day after — write the agent once.

核心机制Key mechanisms

为什么 workWhy it works

为什么好Why it's good

对团队协作友好——server 跑在共享机器上, 多人接客户端连进来看同一 session。对 IDE 集成友好——任何 IDE 插件都能对接, 不用各自重造 agent。

Team-friendly — run server on a shared machine, multiple clients connect to the same session. IDE-friendly — any IDE plugin can wire up, no need to reinvent the agent.

代价Cost

HTTP 带来的延迟 (毫秒级, 交互上可忽略)。Server 要长期维护, 不像单进程 CLI 那样"用完即退"。

HTTP adds latency (milliseconds, negligible interactively). Server needs long-term maintenance, unlike a fire-and-forget CLI.

3.5 openclaw openclaw/src/agents/pi-embedded-runner/

为什么你该看懂这家: 它展示了"一个 agent 同时吃得下 IM 消息、CLI 命令、iOS 推送、IDE 会话"——想做 all-in-one 个人 copilot 的人必看。

Why you should care: shows how one agent can handle IM messages, CLI commands, iOS pushes, and IDE sessions in parallel — essential reading if you want an "all-in-one" personal copilot.

目的Purpose

做一个本地优先的多通道 agent。官方把它定位成"单个长驻 Gateway + 所有通道共用一个 agent": 不是又一个聊天框, 而是挂在你设备上的控制面板——Discord / Slack / Telegram / WhatsApp / iMessage / Signal / Matrix / Teams / Zalo 等 10 + IM 通道 + CLI / iOS / IDE 都路由进同一个 Gateway, agent 在共享 session 里工作。

A local-first multi-channel agent. The docs position it as "one long-lived Gateway, many channels, one agent" — not another chat box but a control plane on your device. 10 + IM channels (Discord / Slack / Telegram / WhatsApp / iMessage / Signal / Matrix / Teams / Zalo and more) plus CLI / iOS / IDE all feed the same Gateway and share sessions.

核心机制Key mechanisms

官方术语: Tools vs Skills —— Tools 是 agent 可以调用的带类型函数 (bash / read / write / browser / canvas 等, 共 ~19 个核心), Skills 是注入 system prompt 的 Markdown 教材 (SKILL.md, 讲"什么时候、怎么用"工具)。这套分层是 openclaw 文档自己强调的核心抽象。

Docs terminology: Tools vs Skills. Tools are the typed functions the agent can call (bash / read / write / browser / canvas / ~19 core); Skills are Markdown docs (SKILL.md) injected into the system prompt, teaching when and how to use them. This split is called out in the official docs as the core abstraction.

为什么 workWhy it works

为什么好Why it's good

对"个人 copilot / 值班机器人"场景最到位——开会时 agent 监听 Slack, 下班路上用 iMessage 追问结果, 到家接 CLI 继续改代码, 同一条 session 贯穿。加上 ACP, IDE 会话也一起上。其他 4 家需要你手动切换工具。 浏览器沙箱刚好也能跑 ClawBench 任务——用 openclaw 做网页 agent 的研发+评测一条龙。

Ideal for "personal copilot / on-call bot": the agent watches Slack during a meeting, answers iMessage on the commute, resumes via CLI at home — all the same session. Add ACP and the IDE joins in too. The other four ask you to context-switch tools yourself. The browser sandbox doubles as a ClawBench runner, so you can use openclaw for both web-agent dev and evaluation in one place.

代价Cost

运维成本高: Docker、WebSocket、多通道 webhook 要一次跑起来。核心文件 run.ts 2100 + 行, 逻辑密集。不是开箱即用的小工具。

Higher ops cost: Docker + WebSocket + multi-channel webhooks must all be up. run.ts is 2100 + lines of dense logic. Not a plug-and-play mini-tool.

3.6 pi github.com/badlogic/pi-mono · packages/coding-agent pi.dev

为什么你该看懂这家: 当前面五家比的是"我加了多少特性", pi 反过来比"我能砍掉多少特性还活得下去"。openclaw 的 embedded runner 就是基于 pi 的 SDK——这是 pi 在生产里最好的存在证明。如果你想把 agent 做成一个能放进自己 app 里的库, 而不是一个吞掉用户工作流的 CLI, 这就是范例。

Why you should care: while the other five compete on "how many features I add," pi competes on "how many features I can strip out and still survive." openclaw's embedded runner is built on pi's SDK — the best existence proof in production. If you want an agent shaped like a library you embed in your app, not a CLI that eats your workflow, this is the template.

目的Purpose

作者 Mario Zechner (badlogicgames; pi.dev 由 exe.dev 捐赠) 把 pi 称作 "minimal terminal coding harness"——只给模型 4 个原子工具 (read / write / edit / bash), 其他全部由用户用 TypeScript Extensions / Skills / Prompt Templates / Themes 自己长出来, 还能打成 npm/git 包分享。pi.dev 主页直接列了一串 "What we didn't build": 没有 MCP、没有 sub-agent、没有 plan mode、没有权限弹窗、没有内建 todo、没有后台 bash——每一项都给了"你可以这样替代"的提示。

Author Mario Zechner (badlogicgames; pi.dev donated by exe.dev) calls pi "a minimal terminal coding harness." The model gets four atomic tools (read / write / edit / bash); everything else is grown by users via TypeScript Extensions / Skills / Prompt Templates / Themes, which can be shipped as npm or git packages. The pi.dev homepage literally has a "What we didn't build" section: no MCP, no sub-agents, no plan mode, no permission popups, no built-in to-dos, no background bash — each entry tells you the recommended workaround instead.

核心机制Key mechanisms

为什么 workWhy it works

为什么好Why it's good

对"我要把 agent 嵌进自己产品里"的团队几乎是唯一选项——SDK 干净、协议稳定 (RPC 模式有 doc), 没有强加给你的 UI 概念。openclaw 把 pi 当 runtime 嵌进 Gateway, 自己只关心通道路由——这就是 pi 设计哲学的最佳广告。pi 还做了一个值得借鉴的事: 作者把自己的 pi-mono 工作 session 持续发到 Hugging Face, 用 pi-share-hf 工具一键分享 OSS session, 给 RL/agent 训练社区提供真实工作流数据。

For teams who want to embed an agent into their own product, pi is essentially the only option — clean SDK, stable protocol (RPC mode is documented), no imposed UX concepts. openclaw embedding pi as a runtime in its Gateway, while only owning channel routing, is the best advertisement for pi's philosophy. One more thing worth copying: the author publishes his own pi-mono work sessions to Hugging Face via pi-share-hf, donating real OSS workflow data to the RL / agent-training community.

代价Cost

"故意没有 X" 的代价就是用户得自己长 X。生产场景需要权限弹窗、subagent、plan mode、MCP 接入的团队, 在 pi 上得先写一组 extension; 直接用 claw-code / opencode 是更省事的选择。Steering 双键虽然优雅, 学习曲线对新用户也不友好——团队人多时谁都得知道 Enter 和 Alt+Enter 的差别, 否则会误打断。

"Deliberately not built" comes with a tax — you grow it yourself. Teams that need permission popups, sub-agents, plan mode, or MCP in production must first write a stack of extensions; reaching for claw-code or opencode is the cheaper path. The two-key steering protocol is elegant but has a learning curve — everyone on a team has to know the Enter vs Alt+Enter distinction or the wrong key will break a long task.

7. 对比与选型7. Comparison & selection

什么场景选哪个Which to pick when

场景推荐原因
RL 训练 / 批量 rollouthermes-agentModal 沙箱 + 共享预算 + 子代理做并行
生产跑 agent, 要审计和治理claw-codeHook 系统 + 权限策略 + Rust 稳健性
只用 OpenAI, 要最强 reasoningcodexResponses API 原生支持 + 加密推理保留
多端 (TUI / Web / IDE) 共用opencodeHTTP 协议 + OpenAPI + 自动 SDK
个人多通道 copilot / 值班机器人openclaw本地 Gateway + IM/CLI/iOS/IDE 路由到同一 session
真实浏览器任务评测 / 验证ClawBenchlive web 任务, 不是 offline DOM 快照;动态 JS、cookie 弹窗、多步交互、可追溯 per-evidence 评分
ScenarioPickWhy
RL training / batch rolloutshermes-agentModal sandbox + shared budget + parallel sub-agents
Production with audit & governanceclaw-codeHooks + policy + Rust robustness
OpenAI-only, max reasoningcodexResponses API native support + encrypted reasoning
Multi-client (TUI / Web / IDE)opencodeHTTP + OpenAPI + auto-generated SDKs
Personal multi-channel copilot / on-call botopenclawLocal Gateway + IM/CLI/iOS/IDE route into one session
Evaluating real browser tasksClawBenchLive web tasks, not offline DOM snapshots; dynamic JS, cookie popups, multi-step interactions, traceable per-evidence scoring

设计维度对比Design dimensions

维度Dimension hermesclawcodexopencodeopenclaw
进程模型Process model 单进程Single process 单进程Single process 单进程Single process 客户端/服务端分离C/S split Gateway + Runner 拆分Gateway + Runner split
子代理Sub-agents 主循环内in-loop 无 (外置)None (external) None @mention session 路由session routing
权限粒度Permission grain 粗 (bash 分类)Coarse (bash class) 细 (每工具 + hook)Fine (per-tool + hook) 粗 (bash 分类)Coarse (bash class) 细 (每工具)Fine (per-tool) 细 (sandbox × channel)Fine (sandbox × channel)
Provider 多家 adapterMulti via adapter 多家 (Claude 优先)Multi (Claude-first) 仅 OpenAIOpenAI only 多家 (AI SDK)Multi (AI SDK) 多家 pluginMulti via plugins
语言Language PythonRustPythonTS + GoTypeScript
沙箱Sandbox Modal 云 VMcloud VM OS-level seatbelt / landlock 无 (容器可选)None (container optional) Docker (含浏览器)Docker (incl. browser)
入口通道Entry channels CLICLI / IDECLI / IDECLI / TUI / IDE IM / CLI / iOS / IDEIM / CLI / iOS / IDE

8. Takeaway: 7 条值得借鉴的设计8. Takeaway: seven design patterns worth adopting

自己造 agent 时可以直接借鉴的设计。顺带一提:把它们造出来后, 用 ClawBench 在真实网页任务上打个分, 就知道到底哪几条 idea 真的 work。

Design patterns you can lift directly for your own agent. And once you've built it, run it against ClawBench on live web tasks to see which of these ideas actually pay off in practice.

1. 共享预算防失控 — hermes 的 IterationBudget1. Shared budget to prevent runaway — hermes's IterationBudget

不管 agent 怎么嵌套, 总工具调用次数不会爆炸。自己造时: 给所有工具调用加一个共同递减的计数器, 比"每个 agent 独立限制"稳得多。

No matter how deeply agents nest, total tool calls can't explode. DIY: add a single shared counter decremented by every call — far more robust than "each agent gets its own limit."

2. Hook 先于权限给用户终极表达力 — claw-code2. Hook-before-permission = ultimate expressiveness — claw-code

传统权限是 allow/deny 二元。Hook 是可编程的中间层。自己造时: 给每个关键决策点暴露一个"用户可注入的函数", 而不是做死的规则。

Traditional permission is binary allow/deny. Hooks are a programmable middle layer. DIY: expose a "user-injectable function" at every critical decision point, not hard-coded rules.

3. Reasoning 跨 turn 保留 — codex3. Reasoning persisted across turns — codex

多 turn 任务里, 上轮思考应可延续到这轮, 而不是每轮重新想。自己造时: 如果模型支持, 开启 reasoning persistence; 不支持, 在 system prompt 里人工把"上轮结论"塞回去。

In multi-turn tasks, the prior turn's reasoning should carry to the next — don't re-think from scratch. DIY: turn on reasoning persistence if the model supports it; if not, inject "last turn's conclusion" via the system prompt manually.

4. 把 agent 做成服务 — opencode4. Agent as a service — opencode

Agent 逻辑和 UI 彻底分开。自己造时: 哪怕只做 CLI, 也把 core 拆成独立 library + server mode, 将来扩展成本近似零。

Separate agent logic from UI. DIY: even for a CLI, split core into library + server mode — expanding to new clients later costs near-zero.

5. Ephemeral injection 保 cache — hermes5. Ephemeral injection to preserve cache — hermes

Prompt cache 最怕 prompt 前缀变。把动态内容 (memory, hook output) 作为"仅本次 API 调用生效"的补充, 别污染历史。自己造时: 历史只存用户消息 + 工具结果, 所有 agent 内部的元数据另算。

Prompt cache breaks when the prefix changes. Treat dynamic content (memory, hook output) as "only effective for this API call" — don't pollute the history. DIY: history stores only user messages + tool results; all agent-internal metadata lives elsewhere.

6. 预算耗尽时留一次 "grace call" — codex6. Give the model one "grace call" on budget exhaustion — codex

工具预算打满时不要直接 hard-error。hermes 里的 codex 适配器会再放模型一次 API 调用(run_agent.py:916 _budget_grace_call), 让它有机会给用户一个体面的收尾: 总结已完成的事、列出没跑完的、保存部分结果。自己造时: budget 监督器保留一次 graceful-exit 槽位, 用户体验立竿见影。

When the tool budget is exhausted, don't hard-error. The codex adapter in hermes grants one more model call (run_agent.py:916 _budget_grace_call) so the model can exit gracefully: summarise what got done, what's left, save partial results. DIY: reserve a single graceful-exit slot in your budget watchdog — the UX upgrade is immediate.

7. Session 分叉作为一等公民 — opencode (源码级)7. Session forking as a first-class primitive — opencode (code-level)

opencode 的 session.sql.tsparentID 字段追踪 session 血统, 支持从任意消息点分叉出一条平行路径注: 公开文档目前只介绍 share 功能, fork 还没进官方 docs——这个 pattern 来自源码。多数 agent 框架把"回溯/重试"做成销毁状态, opencode 把它做成树。自己造时: 消息持久化加一个 parent_id 字段, 就能解锁"多方案并跑"这类体验。

opencode's session.sql.ts uses a parentID field to track session lineage, letting you fork a parallel session from any message. Note: public docs only describe the share feature; forking is present in source but not yet promoted to official docs — this one is from the code. Most agent frameworks treat "undo/retry" as destruction; opencode treats it as a tree. DIY: add a parent_id to persisted messages and you've unlocked "try both approaches at once."

一句话画像One-line mental model

hermes-agent "ReAct + 预算 + 子代理, 一进程多 provider。" "ReAct + budgets + sub-agents, one process, many providers."
claw-code "每个工具调用都是 hook → permission → execute → hook。" "Every tool call is hook → permission → execute → hook."
codex "Responses API + 原生推理项 + OS 级沙箱。" "Responses API with first-class reasoning items and OS-level sandbox."
opencode "Agent core 藏在 HTTP 服务后, TUI 只是一个客户端。" "Agent core behind an HTTP API; the TUI is just one client."
openclaw "所有通道 (IM / CLI / iOS / IDE) 都是同一个 agent 的入口。" "Every channel (IM / CLI / iOS / IDE) is a door into the same agent."

9. hermes-agent 与训练9. hermes-agent and training

其他四家都在解决"怎么让 agent 好用", 只有 hermes-agent 同时解决"怎么让 agent 被训练"。这一点在 LLM 领域正在变得越来越重要——评估、RL、离线分析、跨模型对比, 都是"agent 作为训练 target"才能产出的结果。hermes 从进程模型到数据流, 每一层都为这个场景设计。

The other four optimize for running an agent well; hermes-agent is the only one that simultaneously optimizes for training one. As evaluation, RL, offline analysis, and cross-model comparison become the new battleground, "agent-as-training-target" is the axis that matters — and hermes is architected for it from the process model up to the data flow.

为什么训练友好就是优雅Why training-friendly is elegance

为什么这比"好看"更优雅?训练是 agent 领域最苛刻的负载——要求并行、幂等、成本可控、失败可恢复、数据可追溯。一个能同时扛住这五件事的 harness, 本质上也是一个能在生产跑的 harness, 只是反过来不成立。hermes 把"能被训练"设计进了每一层, 这种系统级一致性才是真正的优雅。

Why does this beat "pretty" for elegance? Training is the harshest load an agent harness can face: parallel, idempotent, cost-bounded, failure-recoverable, data-traceable. A harness that withstands all five is, by definition, a harness that can also run in production — but not the reverse. Hermes bakes "trainable" into every layer; that system-wide coherence is what real elegance looks like.

荣誉提名 (各自最优雅的一处)Honorable mentions (each has one truly elegant choice)

这是我的口味, 不是唯一正确答案。生产审计选 claw-code; OpenAI 全家桶选 codex; 多端协作选 opencode; 个人多通道 copilot 选 openclaw;想真 stress-test 任何一家在真实网页任务上的表现——ClawBench 就是干这个的。我把票投给 hermes, 是因为"能被训练"这条路, 长期看会把整个 agent 生态拉进一个新范式——你今天不训练, 明年大概率也会训练。

This is my taste, not the only right answer. Production audit → claw-code. OpenAI stack → codex. Multi-client work → opencode. Personal multi-channel copilot → openclaw. And if you want to actually stress-test any of the five on real web tasks, that's what ClawBench is for. I vote hermes because "being trainable" is the axis that, long-term, will pull the whole agent ecosystem into a new paradigm — if you aren't training now, you probably will be next year.

引用Cite

如果这篇文章或 ClawBench 对你的工作有用, 欢迎引用。点右上角按钮一键复制 BibTeX。

If this post or ClawBench is useful to you, please cite. Click the button for one-click BibTeX copy.

ClawBench
@article{zhang2026clawbench,
  title={ClawBench: Can AI Agents Complete Everyday Online Tasks?},
  author={Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},
  year={2026},
  eprint={2604.08523},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.08523},
}
This post
@misc{zhang2026harnessblog,
  author = {Yuxuan Zhang},
  title  = {Agent Harness Engineering: A Source-Level Comparison of Coding Agents},
  year   = {2026},
  url    = {https://reacher-z.github.io/blog/harness/}
}