Using the DeepSeek R1 Reasoning Model in DeepSearch

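As we see it, DeepSearch is essentially one big while loop. Given a maximum token budget, it cycles through searching, reading, and thinking until it finds the best answer. One of the LLM's key tasks is deciding which action to take next based on the current memory state. Should it keep searching? Read a web page? Or answer the question directly?

In our node-deepresearch implementation, we use a standard LLM (gemini-2.0-flash) to select actions. We call it "standard" because the reasoning procedure is laid out explicitly through prompt engineering and then run as a sequence of input-output steps, with the chosen action returned as a JSON object. A natural question follows: would replacing this decision step with a dedicated reasoning model improve DeepSearch's performance?

In our DeepSearch implementation, the action-decision step of the big while loop determines which action to take next based on the current context and memory.

In this post, we explore using DeepSeek R1 671b, a reasoning LLM, in place of gemini-2.0-flash for this decision step. We believe R1's reasoning ability will help with the complex problems that arise when searching the web and analyzing the results.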

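To test the idea, we gave it a practical example: planning a complete three-day holiday itinerary, and then evaluated how it did. An agent performing a deep search task can run into the same problems a human would, so in our holiday task the model may encounter the following:

* Knowledge gaps (information that depends on other information): for example, you want to visit the Eiffel Tower but don't know whether it is open on public holidays. You need to know both the tower's holiday schedule and the dates of French public holidays.
* Wrong or outdated information: a travel blog from 2020 says a certain restaurant in Rome is open on Sundays, but when you arrive you find it has changed its hours and is now closed on Sundays.
* Contradictory information: one travel site claims that a hotel in New York offers free breakfast, while another says breakfast is not included in the room rate.
* Ambiguous information: a post on a travel forum mentions "a great beach near Barcelona" without saying which beach it is or giving clear directions, which makes the exact location hard to pin down.

R1 can break complex tasks into actionable steps, identify gaps and inconsistencies, and work around obstacles such as blocked websites and paywalls. It has the reasoning ability to gather the required knowledge and synthesize an answer. However, it cannot plan our holiday on its own: that requires searching the web and understanding the results that come back. We have to scaffold it with a framework and extend its capabilities before it is up to the job.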

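## Implementation

To run this notebook, you need free Jina AI and OpenRouter API keys.

While R1 is the engine of our agent, we also added some tools, a state object, and, of course, a (rather large) prompt. Here is a simplified picture:

* Tools can be called by the model to search and scrape web pages, with the results stored in the state.
* State tracks tool results, task status, and knowledge. It is stored in the prompt itself.
* A single prompt provides the instructions, specifies the task and how to accomplish it, and holds the state.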

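We'll go into each of these in more detail later, the prompt in particular. In short, the system works as follows:

We start with the prompt and an unpopulated state object. While the task is in progress (that is, still working toward an answer), the agent loops through the following steps:

1. The model examines the task and its state from the prompt and reasons about how best to use its tools to reach an answer.
2. The model outputs a JSON object specifying its status (IN_PROGRESS or DONE), memory updates, tool calls, and the answer (initially null).
3. The agent calls the tools asynchronously, and the results are embedded back into the prompt along with the JSON object from the previous step.
4. The prompt, now containing this new information, is fed back into the model for another round.

As soon as the model provides an answer in its output, the task ends and the answer is delivered.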
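The loop described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `fake_model`, `fake_tool`, `build_prompt`, `run_agent`, and the `MAX_ROUNDS` cap are all hypothetical names introduced here, and both the model and the tools are stubbed so the example runs offline. Only the JSON field names are taken from the prompt's response format.

```python
import asyncio
import json
from typing import Optional

MAX_ROUNDS = 5  # assumed safety cap for this sketch, not from the original system


async def fake_tool(tool: str, tool_input: str) -> str:
    """Hypothetical stand-in for the real search/scrape tools."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"stub result of {tool} for {tool_input!r}"


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for R1: answers once tool results appear in the prompt."""
    if "stub result" in prompt:
        return json.dumps({"status_update": "DONE", "memory_updates": [],
                           "tool_calls": [], "answer": "3-day plan"})
    return json.dumps({
        "status_update": "IN_PROGRESS",
        "memory_updates": [{"operation": "add", "content": "need flight prices"}],
        "tool_calls": [{"tool": "search", "input": "flights to Crete"}],
        "answer": None,
    })


def build_prompt(task, memory, tool_records):
    """Embed the task, memory, and latest tool results into one prompt string."""
    results = "\n".join(
        f"{r['tool']}: {r['input']} -> {r['output']}" for r in tool_records
    )
    return f"Task: {task}\nMemory: {memory}\nTool results:\n{results}"


async def run_agent(task: str) -> Optional[str]:
    memory = []        # simplified: a plain list of memory entries
    tool_records = []  # results of the previous round's tool calls
    for _ in range(MAX_ROUNDS):
        reply = json.loads(fake_model(build_prompt(task, memory, tool_records)))
        # This sketch only handles "add"; deletions are omitted for brevity.
        memory += [u["content"] for u in reply["memory_updates"]
                   if u["operation"] == "add"]
        if reply["status_update"] == "DONE":
            return reply["answer"]
        calls = reply["tool_calls"][:3]  # the prompt allows at most 3 calls per round
        outputs = await asyncio.gather(
            *(fake_tool(c["tool"], c["input"]) for c in calls)
        )
        tool_records = [dict(c, output=o) for c, o in zip(calls, outputs)]
    return None  # gave up without an answer


if __name__ == "__main__":
    print(asyncio.run(run_agent("plan a three-day holiday")))
```

Running the tool calls with `asyncio.gather` mirrors the asynchronous tool step in the list above; swapping the two stubs for real API calls would leave the shape of the loop unchanged.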

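💡 To get a better feel for how the agent works, we recommend trying the notebook yourself and inspecting the output of each iteration.

Now that we have a high-level overview, let's look at the tools, state, and prompt in turn: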

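### Prompt

We create the prompt using the Jinja template format. It consists of several parts:

* Context (in this case, the current date).
* Instructions, which cover how everything works and tell the model which tools are available.
* State, as described above.
* Tool outputs from the search and scrape tools.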

{% macro format_tool_results(tool_records) %}
{% for to in tool_records %}
Source {{ loop.index }}: {{ to.tool }}: {{ to.input }}
Result:
```
{{ to.output }}
```
{% endfor %}
{% endmacro %}

The date: `{{ current_date }}`.
You are an information analysis and exploration agent that builds solutions through systematic investigation.

## Investigation Cycle
You operate in a continuous investigation cycle:

1. Review current workspace (your memory blocks)
2. Analyze new tool results (or initial task if first round)
3. Update memory with new insights and track investigation progress
4. Decide on next tools to call based on identified leads and information gaps
5. Repeat until task completion

## Memory Structure
Your memory persists between investigation cycles and consists of:
- **Status**: Always the first line, indicates if the task is IN_PROGRESS or DONE
- **Memory**: A collection of discrete information blocks, each with a unique ID

## Memory Block Usage
- Each memory block has a unique ID in format <abc-123>content</abc-123>
- Create separate blocks for distinct pieces of information:
  * Discovered URLs (both explored and pending)
  * Information gaps that need investigation
  * Actions already taken (to avoid repetition)
  * Promising leads for future exploration
  * Key facts and findings
  * Contradictions or inconsistencies found
- Keep each block focused on a single idea or piece of information
- Always cite sources when recording information from tool results
- Use IDs to track and manage your knowledge (e.g., deleting outdated information)
- Make sure to store sources (URLs) for the facts and findings you store

## Lead Management
- Since you can only make 3 tool calls per round, store promising leads for later
- Create dedicated memory blocks for URLs to scrape later
- Maintain blocks for potential search queries to explore in future rounds
- Prioritize leads based on relevance to the task

## Available Tools
- **search**: Use for broad information gathering on new topics or concepts
  * Example: {"tool": "search", "input": "renewable energy statistics 2023"}
- **scrape**: Use for extracting specific details from discovered URLs
  * Example: {"tool": "scrape", "input": "https://example.com/energy-report"}

## Tool Usage Guidelines
- **When to use search**: For new concepts, filling knowledge gaps, or exploring new directions
- **When to use scrape**: For URLs discovered that likely contain detailed information
- **Maximum 3 tool calls per round**
- **Never repeat the exact same tool call**
- **Always record valuable information from tool results in memory blocks**

## Response Format
You must respond with a valid JSON object containing:

```json
{
  "status_update": "IN_PROGRESS or DONE",
  "memory_updates": [
    {"operation": "add", "content": "New insight or lead to investigate"},
    {"operation": "delete", "id": "abc-123"}
  ],
  "tool_calls": [
    {"tool": "search", "input": "specific search query"},
    {"tool": "scrape", "input": "https://discovered-url.com"}
  ],
  "answer": "Your final, comprehensive answer when status is DONE"
}
```

## Important Rules
- The "add" operation creates a new memory block
	You do not need to specify an ID, it will be added automatically by the system.
- The "delete" operation requires the specific ID of the block to remove
- Never invent or fabricate information - only use facts from your memory or tool results
- Never make up URLs - only use URLs discovered through tool results
- CRITICAL: Any information not recorded in your memory blocks will be lost in the next round
  For example, if you find a potential webpage to scrap, you must store the URL and your intention
  Example: `{"operation": "add", "content": "Found relevant URL: https://... to scrape ..."}`
- Set status to "DONE" only when you have fully addressed the task
- Only include the "answer" field when status is "DONE"

Task:
```
{{ task }}
```

Current workspace:
```
{{ workspace }}
```

Tool Results:
{{ format_tool_results(tool_records) if tool_records else '... no previous tool results ...'}}

IMPORTANT: Generate a valid JSON response following the format above.

Think carefully about:
- what information do you need to preserve
- which tools to call next
- how to build your answer systematically with focused memory blocks

The full prompt used in our deep search system

💡 The Jinja macro at the top of the prompt simply renders the tool results in a tidier format.
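To make the memory rules in the prompt more concrete, here is a small sketch of how the `memory_updates` returned by the model might be applied and then rendered back into the workspace. The function names (`new_block_id`, `apply_memory_updates`, `render_workspace`) and the exact workspace layout are assumptions made for illustration; the prompt only specifies that IDs look like `abc-123`, that the system assigns IDs for "add" operations, and that the status comes first.

```python
import random
import string


def new_block_id(existing):
    """Generate an unused ID in the abc-123 shape shown in the prompt."""
    while True:
        letters = "".join(random.choices(string.ascii_lowercase, k=3))
        digits = "".join(random.choices(string.digits, k=3))
        block_id = f"{letters}-{digits}"
        if block_id not in existing:
            return block_id


def apply_memory_updates(memory, updates):
    """Return a new memory dict after applying add/delete operations."""
    memory = dict(memory)  # copy so the caller's dict is left untouched
    for update in updates:
        if update["operation"] == "add":
            # The system, not the model, assigns the ID for a new block.
            memory[new_block_id(memory)] = update["content"]
        elif update["operation"] == "delete":
            # Silently ignore IDs that do not exist (e.g. invented by the model).
            memory.pop(update["id"], None)
    return memory


def render_workspace(status, memory):
    """Render status plus memory blocks as text (assumed layout)."""
    lines = [f"Status: {status}", "Memory:"]
    lines += [f"<{bid}>{content}</{bid}>" for bid, content in memory.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    mem = apply_memory_updates(
        {"abc-123": "outdated fact"},
        [
            {"operation": "delete", "id": "abc-123"},
            {"operation": "add", "content": "Found relevant URL: https://example.com to scrape"},
        ],
    )
    print(render_workspace("IN_PROGRESS", mem))
```

Because anything not written to a memory block is lost in the next round, a helper like this is the only channel through which the model's findings survive between iterations.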

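Second, we found another problem when using the Reranker with a preset top_n of 5: the scraped page actually contained more than five relevant results. We could have solved this by checking each result's relevance score instead of simply taking the top five (or any fixed number of) results. However, the ideal reranking configuration varies from task to task. A better fix would be to use the entire scraped page directly, but unfortunately that is not possible given R1's limited context length.

### Overall Performance and Potential Improvements

The model performed well at first, but we noticed that, unless specifically prompted, it rarely tried to switch strategies or make elaborate plans. While R1 naturally adopts these approaches for math and coding problems (the domains it was specifically trained on), it did not apply the same reasoning to search tasks. We could refine the prompt further (or even use multiple prompts) to address this limitation, but that was not our main goal.

We also observed that R1 does not handle time-sensitive information adequately. In short, if a search result does not explicitly mention a wrong date, the model assumes the information is valid without further verification. For example, when planning a flight for May 1:

* Germany to Crete, $80, on May 1: **Correct** - the model can trust this information.
* Germany to Crete, $80, on January 1: **Incorrect** - the model correctly identifies and discards this information.
* Germany to Crete, $80: **False positive** - when no date is given, the model fails to verify the information and wrongly assumes it is valid.

If we were to continue this project, we would consider several potential improvements:

* Track the number of memory blocks and prompt the model to **summarize entries** when the state grows too large.
* Instruct the model to **exhaust all leads** before it finishes exploring and responds to the query.
* Emphasize **verifying time-sensitive information**.
* Make sure the model **double-checks results** by scraping the URLs returned by the search tool.
* Test our system with **future reasoning models that support larger context windows**, although this would require substantial refactoring and testing to adapt the prompt to a different model.

## Conclusion

Even in the short time since R1's release, the field has changed significantly. Some projects can already train reasoning models at extremely low cost (some for as little as $5). This democratization means that training specialized models is easier than ever. Our experiments with R1 give us a useful baseline as we continue exploring how to augment reasoning LLMs to handle complex search tasks.

Although our holiday-planning example showed promising results (especially for a quick demo project), it also exposed R1's limitations in search and memory tasks, in contrast to its strengths in math and coding. The system did produce a travel plan within budget, but it fell short in areas such as verifying time-sensitive information and thoroughly exploring all available options, which highlights the gap between what the model was trained to do and how it is applied in other domains.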
