Building effective agents
December 20, 2024
Over the past year, we’ve worked with dozens of teams building large language model (LLM) agents across industries. Consistently, the most successful implementations weren’t using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns.
In this post, we share what we’ve learned from working with our customers and building agents ourselves, and give practical advice for developers on building effective agents.
What are agents?
“Agent” can be defined in several ways. Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks. Others use the term to describe more prescriptive implementations that follow predefined workflows. At Anthropic, we categorize all these variations as agentic systems, but draw an important architectural distinction between workflows and agents:
- Workflows are systems where LLMs and tools are orchestrated through predefined code paths.
- Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.
Below, we will explore both types of agentic systems in detail. In Appendix 1 (“Agents in Practice”), we describe two domains where customers have found particular value in using these kinds of systems.
When (and when not) to use agents
When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all. Agentic systems often trade latency and cost for better task performance, and you should consider when this tradeoff makes sense.
When more complexity is warranted, workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale. For many applications, however, optimizing single LLM calls with retrieval and in-context examples is usually enough.
When and how to use frameworks
There are many frameworks that make agentic systems easier to implement, including:
- LangGraph from LangChain;
- Amazon Bedrock’s AI Agent framework;
- Rivet, a drag and drop GUI LLM workflow builder;
- and Vellum, another GUI tool for building and testing complex workflows.
These frameworks make it easy to get started by simplifying standard low-level tasks like calling LLMs, defining and parsing tools, and chaining calls together. However, they often create extra layers of abstraction that can obscure the underlying prompts and responses, making them harder to debug. They can also make it tempting to add complexity when a simpler setup would suffice.
We suggest that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code. If you do use a framework, ensure you understand the underlying code. Incorrect assumptions about what’s under the hood are a common source of customer error.
See our cookbook for some sample implementations.
Building blocks, workflows, and agents
In this section, we’ll explore the common patterns for agentic systems we’ve seen in production. We’ll start with our foundational building block—the augmented LLM—and progressively increase complexity, from simple compositional workflows to autonomous agents.
Building block: The augmented LLM
The basic building block of agentic systems is an LLM enhanced with augmentations such as retrieval, tools, and memory. Our current models can actively use these capabilities—generating their own search queries, selecting appropriate tools, and determining what information to retain.
We recommend focusing on two key aspects of the implementation: tailoring these capabilities to your specific use case and ensuring they provide an easy, well-documented interface for your LLM. While there are many ways to implement these augmentations, one approach is through our recently released Model Context Protocol, which allows developers to integrate with a growing ecosystem of third-party tools with a simple client implementation.
For the remainder of this post, we’ll assume each LLM call has access to these augmented capabilities.
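As a concrete illustration, the shape of an augmented LLM call can be sketched in a few lines. Everything below is a stand-in: `retrieve` fakes a retrieval tool and the "model" is a deterministic stub, since the point is the structure (tool access plus memory), not any particular vendor API.

```python
def retrieve(query: str) -> list[str]:
    # Hypothetical retrieval augmentation: look up passages matching a query.
    corpus = {"refund": ["Refunds are processed within 5 business days."]}
    return [p for key, ps in corpus.items() if key in query.lower() for p in ps]

def augmented_llm_call(prompt: str, memory: list[str]) -> str:
    # A real implementation would send the prompt, tool definitions, and
    # memory to a model API; here a stub "model" decides to use its
    # retrieval tool when the prompt warrants it.
    memory.append(prompt)            # memory: retain what was asked
    passages = retrieve(prompt)      # tool use: the model's own search
    if passages:
        return f"Based on our docs: {passages[0]}"
    return "I don't have documentation on that."
```

The interface matters more than the internals: the model sees a documented tool, decides when to call it, and keeps state across turns.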
Workflow: Prompt chaining
Prompt chaining decomposes a task into a sequence of steps, where each LLM call processes the output of the previous one. You can add programmatic checks (see “gate” in the diagram below) on any intermediate steps to ensure that the process is still on track.
When to use this workflow: This workflow is ideal for situations where the task can be easily and cleanly decomposed into fixed subtasks. The main goal is to trade off latency for higher accuracy, by making each LLM call an easier task.
Examples where prompt chaining is useful:
- Generating marketing copy, then translating it into a different language.
- Writing an outline of a document, checking that the outline meets certain criteria, then writing the document based on the outline.
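The marketing-copy example can be sketched as a two-step chain with a programmatic gate between the calls. The `llm` function here is a deterministic stub standing in for a real model call.

```python
def llm(prompt: str) -> str:
    # Hypothetical model call, faked deterministically for illustration.
    if prompt.startswith("Write marketing copy"):
        return "Try our product today!"
    if prompt.startswith("Translate to French"):
        return "Essayez notre produit aujourd'hui !"
    return ""

def gate(copy: str) -> bool:
    # Programmatic check between steps: reject empty or over-long copy.
    return 0 < len(copy) <= 280

def chain(product: str) -> str:
    copy = llm(f"Write marketing copy for {product}")
    if not gate(copy):
        raise ValueError("copy failed the gate check")
    return llm(f"Translate to French: {copy}")
```

Each call gets one easy job, and the gate catches a bad intermediate result before it propagates.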
Workflow: Routing
Routing classifies an input and directs it to a specialized followup task. This workflow allows for separation of concerns, and building more specialized prompts. Without this workflow, optimizing for one kind of input can hurt performance on other inputs.
When to use this workflow: Routing works well for complex tasks where there are distinct categories that are better handled separately, and where classification can be handled accurately, either by an LLM or a more traditional classification model/algorithm.
Examples where routing is useful:
- Directing different types of customer service queries (general questions, refund requests, technical support) into different downstream processes, prompts, and tools.
- Routing easy/common questions to smaller models like Claude 3.5 Haiku and hard/unusual questions to more capable models like Claude 3.5 Sonnet to optimize cost and speed.
Workflow: Parallelization
LLMs can sometimes work simultaneously on a task and have their outputs aggregated programmatically. This workflow, parallelization, manifests in two key variations:
- Sectioning: Breaking a task into independent subtasks run in parallel.
- Voting: Running the same task multiple times to get diverse outputs.
When to use this workflow: Parallelization is effective when the divided subtasks can be parallelized for speed, or when multiple perspectives or attempts are needed for higher confidence results. For complex tasks with multiple considerations, LLMs generally perform better when each consideration is handled by a separate LLM call, allowing focused attention on each specific aspect.
Examples where parallelization is useful:
- Sectioning:
- Implementing guardrails where one model instance processes user queries while another screens them for inappropriate content or requests. This tends to perform better than having the same LLM call handle both guardrails and the core response.
- Automating evals for evaluating LLM performance, where each LLM call evaluates a different aspect of the model’s performance on a given prompt.
- Voting:
- Reviewing a piece of code for vulnerabilities, where several different prompts review and flag the code if they find a problem.
- Evaluating whether a given piece of content is inappropriate, with multiple prompts evaluating different aspects or requiring different vote thresholds to balance false positives and negatives.
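Both variants can be sketched with a thread pool, which is also how parallel calls to a real model API are commonly issued. The `flags_secret` checker below is a stand-in for an LLM-based review call.

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(tasks):
    # Sectioning: independent subtasks fan out; results are gathered
    # programmatically once all of them complete.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda f: f(), tasks))

def vote(check, sample: str, n: int = 3, threshold: int = 2) -> bool:
    # Voting: run the same check n times and flag only when at least
    # `threshold` runs agree, trading cost for confidence.
    votes = run_parallel([lambda: check(sample)] * n)
    return sum(votes) >= threshold

def flags_secret(code: str) -> bool:
    # Stand-in for an LLM vulnerability-review call.
    return "password" in code
```

With real (non-deterministic) model calls the voting variant becomes meaningful: the threshold lets you tune the balance between false positives and false negatives.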
Workflow: Orchestrator-workers
In the orchestrator-workers workflow, a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results.
When to use this workflow: This workflow is well-suited for complex tasks where you can’t predict the subtasks needed (in coding, for example, the number of files that need to be changed and the nature of the change in each file likely depend on the task). Whereas it’s topographically similar, the key difference from parallelization is its flexibility—subtasks aren’t pre-defined, but determined by the orchestrator based on the specific input.
Examples where orchestrator-workers is useful:
- Coding products that make complex changes to multiple files each time.
- Search tasks that involve gathering and analyzing information from multiple sources for possible relevant information.
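The orchestrator-workers shape can be sketched as follows. The defining feature is that the subtask list is produced at runtime from the input rather than fixed in advance; both the planner and the workers are deterministic stand-ins for LLM calls.

```python
def orchestrate(task: str) -> list[str]:
    # An orchestrator LLM would plan here; we fake "which files need
    # editing", which varies with the task rather than being predefined.
    if "rename" in task:
        return ["edit models.py", "edit views.py", "edit tests.py"]
    return ["edit README.md"]

def worker(subtask: str) -> str:
    # Each worker LLM handles one delegated subtask.
    return f"done: {subtask}"

def run(task: str) -> str:
    subtasks = orchestrate(task)            # dynamic decomposition
    results = [worker(s) for s in subtasks]
    return "; ".join(results)               # synthesis step
```

Compare with parallelization above: the fan-out looks the same, but here the orchestrator, not the developer, decides what the branches are.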
Workflow: Evaluator-optimizer
In the evaluator-optimizer workflow, one LLM call generates a response while another provides evaluation and feedback in a loop.
When to use this workflow: This workflow is particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value. The two signs of good fit are, first, that LLM responses can be demonstrably improved when a human articulates their feedback; and second, that the LLM can provide such feedback. This is analogous to the iterative writing process a human writer might go through when producing a polished document.
Examples where evaluator-optimizer is useful:
- Literary translation where there are nuances that the translator LLM might not capture initially, but where an evaluator LLM can provide useful critiques.
- Complex search tasks that require multiple rounds of searching and analysis to gather comprehensive information, where the evaluator decides whether further searches are warranted.
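The generate-evaluate loop can be sketched as follows. Both roles are deterministic stubs standing in for LLM calls, and the iteration cap is the kind of stopping condition you would want with real models too.

```python
def generate(task, feedback):
    # Generator LLM stand-in: revises the draft when given feedback.
    draft = f"draft for {task}"
    return draft + " (revised)" if feedback else draft

def evaluate(draft):
    # Evaluator LLM stand-in: return a critique, or None when it passes.
    return None if "revised" in draft else "needs another pass"

def refine(task, max_iters: int = 3):
    feedback = None
    for _ in range(max_iters):
        draft = generate(task, feedback)
        feedback = evaluate(draft)
        if feedback is None:        # evaluator accepted the draft
            return draft
    return draft                    # cap reached; return best effort
```

The pattern is only worth its extra calls when the evaluator's feedback measurably improves the next draft, per the two signs of fit above.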
Agents
Agents are emerging in production as LLMs mature in key capabilities—understanding complex inputs, engaging in reasoning and planning, using tools reliably, and recovering from errors. Agents begin their work with either a command from, or interactive discussion with, the human user. Once the task is clear, agents plan and operate independently, potentially returning to the human for further information or judgement. During execution, it’s crucial for the agents to gain “ground truth” from the environment at each step (such as tool call results or code execution) to assess their progress. Agents can then pause for human feedback at checkpoints or when encountering blockers. The task often terminates upon completion, but it’s also common to include stopping conditions (such as a maximum number of iterations) to maintain control.
Agents can handle sophisticated tasks, but their implementation is often straightforward. They are typically just LLMs using tools based on environmental feedback in a loop. It is therefore crucial to design toolsets and their documentation clearly and thoughtfully. We expand on best practices for tool development in Appendix 2 (“Prompt Engineering your Tools”).
When to use agents: Agents can be used for open-ended problems where it’s difficult or impossible to predict the required number of steps, and where you can’t hardcode a fixed path. The LLM will potentially operate for many turns, and you must have some level of trust in its decision-making. Agents’ autonomy makes them ideal for scaling tasks in trusted environments.
The autonomous nature of agents means higher costs, and the potential for compounding errors. We recommend extensive testing in sandboxed environments, along with the appropriate guardrails.
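The loop described above can be sketched directly: the model chooses tools, each tool result becomes ground truth for the next step, and a step cap provides the stopping condition. `model_step` is a scripted stand-in for a real model call, and the single `run_tests` tool is illustrative.

```python
def model_step(observations: list[str]) -> dict:
    # A real model would decide the next action from its observations;
    # this stub runs the tests once, then declares the task done.
    if not observations:
        return {"tool": "run_tests", "args": {}}
    return {"tool": "finish", "result": "all tests pass"}

TOOLS = {"run_tests": lambda: "2 passed, 0 failed"}

def agent(max_steps: int = 10) -> str:
    observations: list[str] = []
    for _ in range(max_steps):           # stopping condition for control
        action = model_step(observations)
        if action["tool"] == "finish":
            return action["result"]
        # Ground truth from the environment: the tool's actual output,
        # fed back so the model can assess its progress.
        observations.append(TOOLS[action["tool"]]())
    return "stopped: step limit reached"
```

Note how little machinery there is beyond the loop itself; the real engineering effort goes into the tools and their documentation.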
Examples where agents are useful:
The following examples are from our own implementations:
- A coding Agent to resolve SWE-bench tasks, which involve edits to many files based on a task description;
- Our “computer use” reference implementation, where Claude uses a computer to accomplish tasks.
Combining and customizing these patterns
These building blocks aren’t prescriptive. They’re common patterns that developers can shape and combine to fit different use cases. The key to success, as with any LLM features, is measuring performance and iterating on implementations. To repeat: you should consider adding complexity only when it demonstrably improves outcomes.
Summary
Success in the LLM space isn’t about building the most sophisticated system. It’s about building the right system for your needs. Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.
When implementing agents, we try to follow three core principles:
- Maintain simplicity in your agent’s design.
- Prioritize transparency by explicitly showing the agent’s planning steps.
- Carefully craft your agent-computer interface (ACI) through thorough tool documentation and testing.
Frameworks can help you get started quickly, but don’t hesitate to reduce abstraction layers and build with basic components as you move to production. By following these principles, you can create agents that are not only powerful but also reliable, maintainable, and trusted by their users.
Acknowledgements
Written by Erik Schluntz and Barry Zhang. This work draws upon our experiences building agents at Anthropic and the valuable insights shared by our customers, for which we’re deeply grateful.
Appendix 1: Agents in practice
Our work with customers has revealed two particularly promising applications for AI agents that demonstrate the practical value of the patterns discussed above. Both applications illustrate how agents add the most value for tasks that require both conversation and action, have clear success criteria, enable feedback loops, and integrate meaningful human oversight.
A. Customer support
Customer support combines familiar chatbot interfaces with enhanced capabilities through tool integration. This is a natural fit for more open-ended agents because:
- Support interactions naturally follow a conversation flow while requiring access to external information and actions;
- Tools can be integrated to pull customer data, order history, and knowledge base articles;
- Actions such as issuing refunds or updating tickets can be handled programmatically; and
- Success can be clearly measured through user-defined resolutions.
Several companies have demonstrated the viability of this approach through usage-based pricing models that charge only for successful resolutions, showing confidence in their agents’ effectiveness.
B. Coding agents
The software development space has shown remarkable potential for LLM features, with capabilities evolving from code completion to autonomous problem-solving. Agents are particularly effective because:
- Code solutions are verifiable through automated tests;
- Agents can iterate on solutions using test results as feedback;
- The problem space is well-defined and structured; and
- Output quality can be measured objectively.
In our own implementation, agents can now solve real GitHub issues in the SWE-bench Verified benchmark based on the pull request description alone. However, whereas automated testing helps verify functionality, human review remains crucial for ensuring solutions align with broader system requirements.
Appendix 2: Prompt engineering your tools
No matter which agentic system you’re building, tools will likely be an important part of your agent. Tools enable Claude to interact with external services and APIs by specifying their exact structure and definition in our API. When Claude responds, it will include a tool use block in the API response if it plans to invoke a tool. Tool definitions and specifications should be given just as much prompt engineering attention as your overall prompts. In this brief appendix, we describe how to prompt engineer your tools.
There are often several ways to specify the same action. For instance, you can specify a file edit by writing a diff, or by rewriting the entire file. For structured output, you can return code inside markdown or inside JSON. In software engineering, differences like these are cosmetic and can be converted losslessly from one to the other. However, some formats are much more difficult for an LLM to write than others. Writing a diff requires knowing how many lines are changing in the chunk header before the new code is written. Writing code inside JSON (compared to markdown) requires extra escaping of newlines and quotes.
Our suggestions for deciding on tool formats are the following:
- Give the model enough tokens to “think” before it writes itself into a corner.
- Keep the format close to what the model has seen naturally occurring in text on the internet.
- Make sure there’s no formatting “overhead” such as having to keep an accurate count of thousands of lines of code, or string-escaping any code it writes.
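The escaping overhead mentioned above is easy to see concretely. This snippet just contrasts the two serializations of the same two-line function: inside a markdown fence versus inside a JSON string field.

```python
import json

code = 'def add(a, b):\n    return a + b\n'

# Markdown form: real newlines, close to what the model saw in pretraining.
markdown_form = f"```python\n{code}```"

# JSON form: every newline becomes a literal backslash-n the model must
# emit exactly right, on top of writing the code itself.
json_form = json.dumps({"code": code})
```

The JSON form round-trips correctly here, but a model producing it token by token has to carry the escaping bookkeeping the whole way; the markdown form has no such overhead.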
One rule of thumb is to think about how much effort goes into human-computer interfaces (HCI), and plan to invest just as much effort in creating good agent-computer interfaces (ACI). Here are some thoughts on how to do so:
- Put yourself in the model’s shoes. Is it obvious how to use this tool, based on the description and parameters, or would you need to think carefully about it? If so, then it’s probably also true for the model. A good tool definition often includes example usage, edge cases, input format requirements, and clear boundaries from other tools.
- How can you change parameter names or descriptions to make things more obvious? Think of this as writing a great docstring for a junior developer on your team. This is especially important when using many similar tools.
- Test how the model uses your tools: Run many example inputs in our workbench to see what mistakes the model makes, and iterate.
- Poka-yoke your tools. Change the arguments so that it is harder to make mistakes.
While building our agent for SWE-bench, we actually spent more time optimizing our tools than the overall prompt. For example, we found that the model would make mistakes with tools using relative filepaths after the agent had moved out of the root directory. To fix this, we changed the tool to always require absolute filepaths—and we found that the model used this method flawlessly.
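The absolute-path fix can be sketched as a poka-yoke at the tool boundary: the description documents the requirement, and the implementation rejects violations with an actionable error instead of silently misbehaving. The tool name and schema shape here are illustrative, not a specific API.

```python
import os

# Illustrative tool definition: the docstring-like description states
# the absolute-path requirement up front, with an example.
EDIT_FILE_TOOL = {
    "name": "edit_file",
    "description": ("Replace the contents of a file. `path` MUST be an "
                    "absolute path, e.g. /repo/src/main.py."),
    "input_schema": {"type": "object",
                     "properties": {"path": {"type": "string"},
                                    "content": {"type": "string"}},
                     "required": ["path", "content"]},
}

def edit_file(path: str, content: str) -> str:
    if not os.path.isabs(path):
        # Poka-yoke: make the relative-path mistake impossible to commit,
        # and return feedback the model can act on next turn.
        return f"error: expected an absolute path, got {path!r}"
    return f"ok: would write {len(content)} bytes to {path}"
```

An error string returned to the model is itself part of the ACI: it is the "ground truth" observation the agent uses to correct course on the next step.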