DeepResearch 中用於產生多樣化查詢的次模優化

在實作 DeepResearch 時，至少有兩個地方需要產生多樣化的查詢。首先，你必須根據使用者輸入產生網路搜尋查詢（直接將使用者輸入丟進搜尋引擎並不是個好主意）。其次，許多 DeepResearch 系統包含一個「研究規劃器」，將原始問題分解為子問題，同時呼叫代理程式獨立解決這些子問題，然後合併它們的結果。無論是處理查詢還是子問題，我們的期望都保持不變：它們必須與原始輸入相關，並且具有足夠的多樣性，才能提供獨特的視角。通常，我們需要限制查詢的數量，以避免在不必要地請求搜尋引擎或使用代理程式詞元上浪費金錢。

雖然理解查詢生成的重要性，但大多數開源 DeepResearch 實作並沒有認真對待這種最佳化。他們只是直接提示這些約束。有些人可能會要求大模型額外一輪來評估查詢並使其多樣化。以下是大多數實作基本方法的一個範例：

使用大模型產生多樣化查詢的兩種不同提示詞。頂部的提示詞使用簡單的指令。底部的提示詞更複雜且結構化。給定原始查詢和要產生的查詢數量，我們期望產生的查詢具有足夠的多樣性。在本範例中，我們使用`gemini-2.5-flash`作為大模型，原始查詢是`"embeddings and rerankers"`。

在本文中，我想演示一種更嚴謹的方法，使用句子向量模型和次模最佳化來解決最佳查詢生成問題。早在我的博士研究期間，次模最佳化就是我最喜歡的技術之一，與 L-BFGS 一起。我將展示如何應用它來生成一組在基數約束下的多樣化查詢，這可以顯著提高 DeepResearch 系統的整體品質。

tag透過提示詞生成查詢

首先，我們想檢查提示詞是否是產生多樣化查詢的有效方法。我們也想了解複雜的提示詞是否比簡單的提示詞更有效。讓我們運行一個實驗，比較以下兩個提示詞，以找出答案：

You are an expert at generating diverse search queries. Given any input topic, generate {num_queries} different search queries that explore various angles and aspects of the topic.

簡單提示詞

You are an expert research strategist. Generate an optimal set of diverse search queries that maximizes information coverage while minimizing redundancy.

Task: Create exactly {num_queries} search queries from any given input that satisfy:
- Relevance: Each query must be semantically related to the original input
- Diversity: Each query should explore a unique facet with minimal overlap
- Coverage: Together, the queries should comprehensively address the topic

Process:
1. Decomposition: Break down the input into core concepts and dimensions
2. Perspective Mapping: Identify distinct angles (theoretical, practical, historical, comparative, etc.)
3. Query Formulation: Craft specific, searchable queries for each perspective
4. Diversity Check: Ensure minimal semantic overlap between queries

結構化提示詞

我們使用 gemini-2.5-flash 作為大模型，原始查詢為 "embeddings and rerankers"，並測試簡單和結構化提示詞，以迭代方式生成從 1 到 20 個查詢。然後，我們使用 jina-embeddings-v3 和 text-matching 任務來測量原始查詢和生成的查詢之間的句子相似度，以及生成的查詢本身內的相似度。以下是視覺化圖表。

看看右側的兩個圖，可以看到簡單和結構化提示詞都表現出餘弦相似度分數的較大變異數，許多達到 0.7-0.8 的相似度，表明某些生成的查詢幾乎相同。此外，隨著生成更多查詢，兩種方法都難以保持多樣性。我們沒有看到相似度隨著查詢計數的增加而呈現明顯的下降趨勢，而是觀察到相對穩定（且高）的相似度水平，表明額外的查詢通常會複製現有的觀點。

一種解釋是 Wang 等人 (2025) 發現，即使有提示詞引導，大模型通常不成比例地反映了主要群體的意見，表明對常見觀點的偏見。這是因為大模型的訓練資料可能過度代表某些觀點，導致模型生成與這些普遍觀點一致的變體。Abe 等人 (2025) 也發現，基於大模型的查詢擴展傾向於流行的解釋，而忽略了其他的解釋。例如，「人工智慧的好處是什麼？」可能會產生常見的好處，如自動化、效率、道德，但會錯過不太明顯的好處，如藥物發現。

Multilingual Prompting for Improving LLM Generation Diversity

Large Language Models (LLMs) are known to lack cultural representation and overall diversity in their generations, from expressing opinions to answering factual questions. To mitigate this problem, we propose multilingual prompting: a prompting method which generates several variations of a base prompt with added cultural and linguistic cues from several cultures, generates responses, and then combines the results. Building on evidence that LLMs have language-specific knowledge, multilingual prompting seeks to increase diversity by activating a broader range of cultural knowledge embedded in model training data. Through experiments across multiple models (GPT-4o, GPT-4o-mini, LLaMA 70B, and LLaMA 8B), we show that multilingual prompting consistently outperforms existing diversity-enhancing techniques such as high-temperature sampling, step-by-step recall, and personas prompting. Further analyses show that the benefits of multilingual prompting vary with language resource level and model size, and that aligning the prompting language with the cultural cues reduces hallucination about culturally-specific information.

arXiv.orgQihan Wang

Wisdom from Diversity: Bias Mitigation Through Hybrid Human-LLM Crowds

Despite their performance, large language models (LLMs) can inadvertently perpetuate biases found in the data they are trained on. By analyzing LLM responses to bias-eliciting headlines, we find that these models often mirror human biases. To address this, we explore crowd-based strategies for mitigating bias through response aggregation. We first demonstrate that simply averaging responses from multiple LLMs, intended to leverage the “wisdom of the crowd”, can exacerbate existing biases due to the limited diversity within LLM crowds. In contrast, we show that locally weighted aggregation methods more effectively leverage the wisdom of the LLM crowd, achieving both bias mitigation and improved accuracy. Finally, recognizing the complementary strengths of LLMs (accuracy) and humans (diversity), we demonstrate that hybrid crowds containing both significantly enhance performance and further reduce biases across ethnic and gender-related contexts.

arXiv.orgAxel Abels

tag問題形式化

有人可能會認為我們之前的實驗沒有定論，我們應該改進提示詞並再次嘗試。雖然提示詞肯定可以在一定程度上改變結果，但更重要的是我們已經學到了一些東西：僅僅增加生成的查詢數量使我們更有可能獲得多樣化的查詢。壞消息是，我們也得到了很多重複的查詢作為副產品。

但是，既然生成大量查詢很便宜，最終會產生一些好的查詢，為什麼我們不將其視為子集選擇問題呢？

在數學上，我們可以這樣表述這個問題：給定一個原始輸入 $q_0$ ，一組由大模型使用提示詞工程生成的候選查詢 $V=\{q_1, q_2, \cdots, q_n\}$ 。選擇一個子集 $X\subseteq V$ ，其中包含 $k$ 個查詢，目標是在最小化冗餘的同時最大化覆蓋範圍。

不幸的是，從 $n$ 個候選查詢中找到 $k$ 個查詢的最佳子集需要檢查 $\binom{n}{k}$ 種組合，這是一個指數級複雜度的問題。僅僅 20 個候選查詢和 $k=5$ 的情況下，就有 15,504 種組合。

tag次模函數

在嘗試暴力解決子集選擇問題之前，讓我們先向讀者介紹次模性 (submodularity) 和次模函數 (submodular function) 的概念。它們對許多人來說可能聽起來很陌生，但你可能聽說過「邊際效益遞減」的概念——次模性就是這個概念的數學表示。

考慮在一個大型建築物中放置 Wi-Fi 路由器以提供網路覆蓋。你安裝的第一個路由器提供了巨大的價值——它覆蓋了之前沒有覆蓋的很大區域。第二個路由器也增加了相當大的價值，但它的一些覆蓋區域與第一個路由器重疊，因此邊際效益低於第一個。隨著你繼續添加路由器，每個額外的路由器覆蓋的新區域越來越少，因為大多數空間已經被現有的路由器覆蓋。最終，第 10 個路由器可能提供的額外覆蓋範圍非常小，因為建築物已經被很好地覆蓋了。

這種直覺抓住了次模性的本質。在數學上，一個集合函數 $f: 2^V \rightarrow \mathbb{R}$ 是次模的，如果對於所有 $A \subseteq B \subseteq V$ 和任何元素 $v \notin B$ ：

$f(A \cup {v}) - f(A) \geq f(B \cup {v}) - f(B)$

用通俗的語言來說：將一個元素添加到一個較小的集合所帶來的好處，至少與將同一個元素添加到一個包含該較小集合的較大集合所帶來的好處一樣多。

現在讓我們將這個概念應用於我們的查詢生成問題。人們可能立即意識到，查詢選擇表現出自然的邊際效益遞減：

我們選擇的第一個查詢覆蓋了全新的語義空間
第二個查詢應該覆蓋不同的方面，但一些重疊是不可避免的
隨著我們添加更多查詢，每個額外的查詢覆蓋的新領域越來越少

來自我在 AAAI 2013 上的一個舊投影片，我在其中用一袋球來解釋次模性。向袋子中添加更多的球可以改善「設施」，但相對改善變得越來越小，如右側 y 軸上遞減的 delta 值所示。

tag基於向量模型的次模函數設計

令 $\mathbf{e}_i \in \mathbb{R}^d$ 為查詢 $q_i$ 的向量模型，使用句子向量模型（例如，jina-embeddings-v3）獲得。設計我們的目標函數主要有兩種方法：

tag方法 1：設施選址（基於覆蓋範圍）

$f_{\text{coverage}}(X) = \sum_{j=1}^{n} \max\left(\alpha \cdot \text{sim}(\mathbf{e}_0, \mathbf{e}_j), \max_{q_i \in X} \text{sim}(\mathbf{e}_j, \mathbf{e}_i)\right)$

這個函數衡量了選定的集合 $X$ 「覆蓋」所有候選查詢的程度，其中：

$\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{|\mathbf{u}| |\mathbf{v}|}$ 是餘弦相似度
$\alpha \cdot \text{sim}(\mathbf{e}_0, \mathbf{e}_j)$ 確保與原始查詢的相關性
$\max_{q_i \in X} \text{sim}(\mathbf{e}_j, \mathbf{e}_i)$ 衡量選定集合 $X$ 對候選查詢 $j$ 的覆蓋程度

一個需要注意的是，這個函數只是隱式地鼓勵多樣性。它沒有明確地懲罰選定集合 $X$ 內的相似性。多樣性的出現是因為選擇相似的查詢會提供遞減的覆蓋範圍回報。

tag方法 2：顯式覆蓋範圍 + 多樣性

為了更直接地控制多樣性，我們可以結合覆蓋範圍和一個顯式的多樣性項：

$f(X) = \lambda \cdot f_{\text{coverage}}(X) + (1-\lambda) \cdot f_{\text{diversity}}(X)$

其中多樣性成分可以表示為：

$f_{\text{diversity}}(X) = \sum_{q_i \in X} \sum_{q_j \in V \setminus X} \text{sim}(\mathbf{e}_i, \mathbf{e}_j)$

這個多樣性項衡量了選定的查詢和未選定的查詢之間的總相似性——當我們選擇與剩餘候選查詢不同的查詢時，它會最大化（一種圖割函數的形式）。

tag兩種方法之間的差異

兩種公式都保持了次模性。

設施選址函數是一個著名的次模函數。由於 max 運算，它表現出次模性：當我們向選定的集合中添加一個新的查詢 $q$ 時，每個候選查詢 $j$ 都會被我們集合中「最好」的查詢（相似度最高的查詢）覆蓋。與將 $q$ 添加到一個較大的集合 $B \supseteq A$ 相比，將 $q$ 添加到一個較小的集合 $A$ 更有可能改善各種候選查詢的覆蓋範圍，因為在較大的集合 $B$ 中，許多候選查詢已經被很好地覆蓋。

在圖割多樣性函數中，多樣性項 $\sum_{q_i \in X} \sum_{q_j \in V \setminus X} \text{sim}(\mathbf{e}_i, \mathbf{e}_j)$ 是次模的，因為它衡量了選定集合和未選定集合之間的「割」。與將一個新的查詢添加到一個較大的選定集合相比，將一個新的查詢添加到一個較小的選定集合會創建更多與未選定查詢的新連接。

設施選址方法依賴於通過覆蓋範圍競爭來實現隱式多樣性，而顯式方法直接衡量和優化多樣性。因此，兩者都是有效的，但顯式方法使您可以更直接地控制相關性-多樣性的權衡。

tag實作

由於我們的函數是次模的，因此我們可以使用貪婪演算法，該演算法提供 $(1-1/e) \approx 0.63$ 的近似保證：

$\max_{X \subseteq V} f(X) \quad \text{subject to} \quad |X| \leq k$

這是優化設施選址（基於覆蓋範圍）的程式碼——即具有隱式多樣性的程式碼。

def greedy_query_selection(candidates, embeddings, original_embedding, k, alpha=0.3):
    """
    Greedy algorithm for submodular query selection
    
    Args:
        candidates: List of candidate query strings
        embeddings: Matrix of query embeddings (n x d)
        original_embedding: Embedding of original query (d,)
        k: Number of queries to select
        alpha: Relevance weight parameter
    """
    n = len(candidates)
    selected = []
    remaining = set(range(n))
    
    # Precompute relevance scores
    relevance_scores = cosine_similarity(original_embedding, embeddings)
    
    for _ in range(k):
        best_gain = -float('inf')
        best_query = None
        
        for i in remaining:
            # Calculate marginal gain of adding query i
            gain = compute_marginal_gain(i, selected, embeddings, 
                                       relevance_scores, alpha)
            if gain > best_gain:
                best_gain = gain
                best_query = i
        
        if best_query is not None:
            selected.append(best_query)
            remaining.remove(best_query)
    
    return [candidates[i] for i in selected]

def compute_marginal_gain(new_idx, selected, embeddings, relevance_scores, alpha):
    """Compute marginal gain of adding new_idx to selected set"""
    if not selected:
        # First query: gain is sum of all relevance scores
        return sum(max(alpha * relevance_scores[j], 
                      cosine_similarity(embeddings[new_idx], embeddings[j]))
                  for j in range(len(embeddings)))
    
    # Compute current coverage
    current_coverage = [
        max([alpha * relevance_scores[j]] + 
            [cosine_similarity(embeddings[s], embeddings[j]) for s in selected])
        for j in range(len(embeddings))
    ]
    
    # Compute new coverage with additional query
    new_coverage = [
        max(current_coverage[j], 
            cosine_similarity(embeddings[new_idx], embeddings[j]))
        for j in range(len(embeddings))
    ]
    
    return sum(new_coverage) - sum(current_coverage)

平衡參數 $\alpha$ 控制相關性和多樣性之間的權衡：

高 $\alpha$ （例如，0.8）：優先考慮與原始查詢的相關性，可能會犧牲多樣性
低 $\alpha$ （例如，0.2）：優先考慮選定查詢之間的多樣性，可能會偏離原始意圖
中等 $\alpha$ （例如，0.4-0.6）：平衡方法，通常在實踐中效果良好

tag惰性貪婪演算法

人們可能會注意到上面的程式碼：

for i in remaining:
    # Calculate marginal gain of adding query i
    gain = compute_marginal_gain(i, selected, embeddings, 
                               relevance_scores, alpha)

我們在每次迭代中計算所有剩餘候選查詢的邊際效益。我們可以做得更好。

惰性貪婪演算法是一種巧妙的優化，它利用次模性來避免不必要的計算。關鍵的見解是：如果元素 A 在第 $t$ 次迭代中比元素 B 具有更高的邊際效益，那麼由於次模性，A 在第 $t+1$ 次迭代中仍然比 B 具有更高的邊際效益。

import heapq

def lazy_greedy_query_selection(candidates, embeddings, original_embedding, k, alpha=0.3):
    """
    Lazy greedy algorithm for submodular query selection
    More efficient than standard greedy by avoiding unnecessary marginal gain computations
    """
    n = len(candidates)
    selected = []
    
    # Precompute relevance scores
    relevance_scores = cosine_similarity(original_embedding, embeddings)
    
    # Initialize priority queue: (-marginal_gain, last_updated, query_index)
    # Use negative gain because heapq is a min-heap
    pq = []
    for i in range(n):
        gain = compute_marginal_gain(i, [], embeddings, relevance_scores, alpha)
        heapq.heappush(pq, (-gain, 0, i))
    
    for iteration in range(k):
        while True:
            neg_gain, last_updated, best_idx = heapq.heappop(pq)
            
            # If this gain was computed in current iteration, it's definitely the best
            if last_updated == iteration:
                selected.append(best_idx)
                break
            
            # Otherwise, recompute the marginal gain
            current_gain = compute_marginal_gain(best_idx, selected, embeddings, 
                                               relevance_scores, alpha)
            heapq.heappush(pq, (-current_gain, iteration, best_idx))
    
    return [candidates[i] for i in selected]

Lazy greedy 的運作方式如下：

維護一個優先佇列，其中的元素依照其邊際收益排序。
僅重新計算頂部元素的邊際收益。
如果在重新計算後，它仍然是最高的，則選取它。
否則，將其重新插入到正確的位置並檢查下一個頂部元素。

由於我們避免了重新計算明顯不會被選取的元素的邊際收益，因此可以顯著提高速度。

tag結果

讓我們再次運行實驗。我們使用相同的簡單提示詞來生成 1 到 20 個不同的查詢，並執行與之前相同的餘弦相似度測量。對於次模優化，我們使用不同的 k 值從 20 個生成的候選查詢中選取查詢，並如之前一樣測量相似度。結果表明，透過次模優化選取的查詢更加多樣化，且顯示較低的集合內相似度。

原始查詢 = `"google 2025 revenue breakdown"`

tag最終問題：為什麼次模公式很重要

您可能想知道：為什麼要費力地將其表述為次模優化問題？為什麼不直接使用啟發式方法或其他優化方法？

簡而言之，次模公式將一個特別的「選擇多樣化查詢」啟發式方法轉變為一個具有可證明保證、高效演算法和可衡量目標的嚴格優化問題。

tag保證效率

一旦我們證明我們的目標函數是次模的，我們就可以獲得強大的理論保證和一個高效的演算法。與檢查 $\binom{n}{k}$ 個組合相比，以 $O(nk)$ 時間運行的貪婪演算法實現了 $(1-1/e) \approx 0.63$ 的最佳解近似值。這意味著我們的貪婪解始終至少與最佳可能解一樣好 63%。沒有啟發式方法可以保證這一點。

此外，由於次模函數的數學結構，lazy greedy 演算法在實踐中速度更快。加速來自於遞減回報：在早期迭代中較差的選擇不太可能在以後成為好的選擇。因此，lazy greedy 通常只需要重新計算少數幾個頂部候選者的收益，而不是檢查所有 $n$ 個候選者。

tag無需手工製作的啟發式方法

如果沒有原則性的框架，您可能會求助於特別的規則，例如「確保查詢具有小於 0.7 的餘弦相似度」或「平衡不同的關鍵字類別」。這些規則很難調整，並且不能推廣。次模優化為您提供了一種有原則的、數學基礎的方法。您可以使用驗證集系統地調整超參數，並監控生產系統中的解決方案品質。當系統產生不良結果時，您可以獲得明確的指標來除錯問題所在。

最後，次模優化是一個經過充分研究的領域，擁有數十年的研究成果，讓您可以利用貪婪演算法之外的進階演算法（例如加速貪婪演算法或局部搜尋）、關於某些公式何時效果最佳的理論見解，以及處理其他約束（例如預算限制或公平性要求）的擴展。