Submodular Optimization for Diverse Query Generation in DeepResearch

Learn more from the part 2 of the submodularity series. More applications of submodular optimization on text selection, passage reranking and context engineering.

When implementing DeepResearch, there are at least two places where you need to generate diverse queries. First, you must generate web search queries based on user input (directly throwing user input into the search engine is not a good idea). Second, many DeepResearch systems include a "research planner" that breaks down the original problem into subproblems, calls agents to solve them independently, and then merges their results. Whether generating search queries or subproblems, our expectations remain the same: they must be relevant to the original input and diverse enough to provide unique perspectives on it. Often, we need to limit the number of queries to save money on unnecessary search engine requests or LLM tokens.

While understanding the importance of diverse query generation, most open-source DeepResearch implementations don't take this optimization seriously. They just hardcode these constraints directly into the prompt. Some might ask the LLM for an additional turn to evaluate and diversify the queries. Here's an example of how most implementations basically approach this:

Diagram illustrating methods for generating diverse search queries organized under headings like "PROMPT" and "OUTPUT," empha — Two different prompts for generating diverse queries using LLMs. The top prompt uses simple instructions. The bottom one is more sophisticated and structured. Given the original query and the number of queries to be generated, we expect the generated queries to be sufficiently diverse. In this example, we use `gemini-2.5-flash` as the LLM and the original query is `"embeddings and rerankers"`.

In this article, I want to demonstrate a more rigorous approach to diverse query generation using sentence embeddings and submodular optimization. Back in my Ph.D. days, submodular optimization was one of my favorites along with L-BFGS. I'll show how to apply it for generating an optimal set of diverse queries under a cardinality constraint, which can significantly improve the overall quality of DeepResearch systems.

tagQuery Generation via Prompting

First, we want to check if prompting is an effective approach for generating diverse queries. We also want to understand if a sophisticated prompt is more effective than a simple prompt. Let's run an experiment comparing the two prompts below to find out:

You are an expert at generating diverse search queries. Given any input topic, generate {num_queries} different search queries that explore various angles and aspects of the topic.

Simple prompt

You are an expert research strategist. Generate an optimal set of diverse search queries that maximizes information coverage while minimizing redundancy.

Task: Create exactly {num_queries} search queries from any given input that satisfy:
- Relevance: Each query must be semantically related to the original input
- Diversity: Each query should explore a unique facet with minimal overlap
- Coverage: Together, the queries should comprehensively address the topic

Process:
1. Decomposition: Break down the input into core concepts and dimensions
2. Perspective Mapping: Identify distinct angles (theoretical, practical, historical, comparative, etc.)
3. Query Formulation: Craft specific, searchable queries for each perspective
4. Diversity Check: Ensure minimal semantic overlap between queries

Structured prompt

We use gemini-2.5-flash as the LLM with the original query "embeddings and rerankers" and test both the simple and structured prompts to iteratively generate from one to 20 queries. We then use jina-embeddings-v3 with the text-matching task to measure sentence similarity between the original query and the generated queries, as well as similarity within the generated queries themselves. Here are the visualizations.

Look at the two plots on the right side, one can see that both simple and structured prompt exhibit large variance in cosine similarity scores, with many reaching 0.7-0.8 similarity, suggesting that some generated queries are nearly identical. Besides, both methods struggle to maintain diversity as more queries are generated. Instead of seeing a clear downward trend in similarity with increasing query count, we observe relatively stable (and high) similarity levels, indicating that additional queries often duplicate existing perspectives.

One explanation is what Wang et al. (2025) found that LLMs often reflect opinions of dominant groups disproportionately, even with prompt steering, indicating a bias towards common perspectives. This is because the LLM’s training data may over-represent certain viewpoints, causing the model to generate variations that align with these prevalent perspectives. Abe et al. (2025) also found that LLM-based query expansion favors popular interpretations while overlooking others. For instance, "What are the benefits of AI?" might yield common benefits like automation, efficiency, ethicalness but miss less obvious ones like drug discovery.

Multilingual Prompting for Improving LLM Generation Diversity

Large Language Models (LLMs) are known to lack cultural representation and overall diversity in their generations, from expressing opinions to answering factual questions. To mitigate this problem, we propose multilingual prompting: a prompting method which generates several variations of a base prompt with added cultural and linguistic cues from several cultures, generates responses, and then combines the results. Building on evidence that LLMs have language-specific knowledge, multilingual prompting seeks to increase diversity by activating a broader range of cultural knowledge embedded in model training data. Through experiments across multiple models (GPT-4o, GPT-4o-mini, LLaMA 70B, and LLaMA 8B), we show that multilingual prompting consistently outperforms existing diversity-enhancing techniques such as high-temperature sampling, step-by-step recall, and personas prompting. Further analyses show that the benefits of multilingual prompting vary with language resource level and model size, and that aligning the prompting language with the cultural cues reduces hallucination about culturally-specific information.

arXiv.orgQihan Wang

Wisdom from Diversity: Bias Mitigation Through Hybrid Human-LLM Crowds

Despite their performance, large language models (LLMs) can inadvertently perpetuate biases found in the data they are trained on. By analyzing LLM responses to bias-eliciting headlines, we find that these models often mirror human biases. To address this, we explore crowd-based strategies for mitigating bias through response aggregation. We first demonstrate that simply averaging responses from multiple LLMs, intended to leverage the “wisdom of the crowd”, can exacerbate existing biases due to the limited diversity within LLM crowds. In contrast, we show that locally weighted aggregation methods more effectively leverage the wisdom of the LLM crowd, achieving both bias mitigation and improved accuracy. Finally, recognizing the complementary strengths of LLMs (accuracy) and humans (diversity), we demonstrate that hybrid crowds containing both significantly enhance performance and further reduce biases across ethnic and gender-related contexts.

arXiv.orgAxel Abels

tagProblem Formulation

One might think our previous experiment is inconclusive and that we should improve the prompt and try again. While prompting can certainly change the results to some degree, what's more important is that we've learned something: simply increasing the number of generated queries makes us more likely to get diverse queries. The bad news is that we're also getting bunch of duplicated queries as side-product.

But since it's cheap to generate a large number of queries, which eventually yields some good ones, why don't we treat this as a subset selection problem?

In math, here is how we can formulate this problem: given an original input $q_0$ , a set of candidate queries $V=\{q_1, q_2, \cdots, q_n\}$ generated by an LLM using prompt engineering. Select a subset $X\subseteq V$ of $k$ queries that maximizes coverage while minimizing redundancy.

Unfortunately, finding the optimal subset of $k$ queries from $n$ candidates requires checking $\binom{n}{k}$ combinations - exponential complexity. For just 20 candidates and $k=5$ , that's 15,504 combinations.

tagSubmodular Function

Before trying to brutally solve the subset selection problem, let me introduce the term submodularity and submodular function to the readers. They may sound unfamiliar to many, but you may well have heard of the idea of "diminishing returns" - well, submodularity is the mathematical representation of that.

Consider placing Wi-Fi routers to provide internet coverage in a large building. The first router you install gives tremendous value - it covers a significant area that previously had no coverage. The second router adds substantial value too, but some of its coverage area overlaps with the first router, so the marginal benefit is less than the first. As you continue adding routers, each additional router covers less and less new area because most spaces are already covered by existing routers. Eventually, the 10th router might provide very little additional coverage since the building is already well-covered.

This intuition captures the essence of submodularity. Mathematically, a set function $f: 2^V \rightarrow \mathbb{R}$ is submodular if for all $A \subseteq B \subseteq V$ and any element $v \notin B$ :

$f(A \cup {v}) - f(A) \geq f(B \cup {v}) - f(B)$

In plain English: adding an element to a smaller set gives at least as much benefit as adding the same element to a larger set that contains the smaller set.

Now let's apply this concept to our query generation problem. One may immediately realize that query selection exhibits natural diminishing returns:

The first query we select covers entirely new semantic space
The second query should cover different aspects, but some overlap is inevitable
As we add more queries, each additional query covers less and less new ground

Graph titled "Diminishing Return" with x-axis labeled F(0) to F(4) and y-axis labeled A1 to A4, showing curves illustrating d — From one of my old slides back in AAAI 2013, where I explained submodularity using a bag of balls. Adding more balls to the bag improves the "facility," but the relative improvement becomes smaller and smaller, as seen in the decreasing delta values on the right y-axis.

tagEmbedding-Based Submodular Function Design

Let $\mathbf{e}_i \in \mathbb{R}^d$ be the embedding vector for query $q_i$ , obtained using a sentence embedding model (e.g., jina-embeddings-v3). There are two main approaches to design our objective function:

tagApproach 1: Facility Location (Coverage-Based)

$f_{\text{coverage}}(X) = \sum_{j=1}^{n} \max\left(\alpha \cdot \text{sim}(\mathbf{e}_0, \mathbf{e}_j), \max_{q_i \in X} \text{sim}(\mathbf{e}_j, \mathbf{e}_i)\right)$

This function measures how well the selected set $X$ "covers" all candidate queries, where:

$\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{|\mathbf{u}| |\mathbf{v}|}$ is the cosine similarity
$\alpha \cdot \text{sim}(\mathbf{e}_0, \mathbf{e}_j)$ ensures relevance to the original query
$\max_{q_i \in X} \text{sim}(\mathbf{e}_j, \mathbf{e}_i)$ measures coverage of candidate $j$ by selected set $X$

One caveat is this function only implicitly encourages diversity. It doesn't explicitly penalize similarity within the selected set $X$ . Diversity emerges because selecting similar queries provides diminishing coverage returns.

tagApproach 2: Explicit Coverage + Diversity

For more direct control over diversity, we can combine coverage and an explicit diversity term:

$f(X) = \lambda \cdot f_{\text{coverage}}(X) + (1-\lambda) \cdot f_{\text{diversity}}(X)$

where the diversity component can be formulated as:

$f_{\text{diversity}}(X) = \sum_{q_i \in X} \sum_{q_j \in V \setminus X} \text{sim}(\mathbf{e}_i, \mathbf{e}_j)$

This diversity term measures the total similarity between selected queries and unselected queries - it's maximized when we select queries that are different from the remaining candidates (a form of graph cut function).

tagDifference Between Two Approaches

Both formulations maintain submodularity.

Facility location function is a well-known submodular function. It exhibits submodularity because of the max operation: when we add a new query $q$ to our selected set, each candidate query $j$ gets covered by the "best" query in our set (the one with highest similarity). Adding $q$ to a smaller set $A$ is more likely to improve the coverage of various candidates than adding it to a larger set $B \supseteq A$ where many candidates are already well-covered.

In graph cut diversity function, the diversity term $\sum_{q_i \in X} \sum_{q_j \in V \setminus X} \text{sim}(\mathbf{e}_i, \mathbf{e}_j)$ is submodular because it measures the "cut" between selected and unselected sets. Adding a new query to a smaller selected set creates more new connections to unselected queries than adding it to a larger selected set.

The facility location approach relies on implicit diversity through coverage competition, while the explicit approach directly measures and optimizes for diversity. So both are valid, but the explicit approach gives you more direct control over the relevance-diversity tradeoff.

tagImplementations

The full implementation can be found here on Github.

Since our function is submodular, we can use the greedy algorithm which provides a $(1-1/e) \approx 0.63$ approximation guarantee:

$\max_{X \subseteq V} f(X) \quad \text{subject to} \quad |X| \leq k$

Here's the code for optimizing Facility Location (coverage-based) - the one with implicit diversity.

def greedy_query_selection(candidates, embeddings, original_embedding, k, alpha=0.3):
    """
    Greedy algorithm for submodular query selection
    
    Args:
        candidates: List of candidate query strings
        embeddings: Matrix of query embeddings (n x d)
        original_embedding: Embedding of original query (d,)
        k: Number of queries to select
        alpha: Relevance weight parameter
    """
    n = len(candidates)
    selected = []
    remaining = set(range(n))
    
    # Precompute relevance scores
    relevance_scores = cosine_similarity(original_embedding, embeddings)
    
    for _ in range(k):
        best_gain = -float('inf')
        best_query = None
        
        for i in remaining:
            # Calculate marginal gain of adding query i
            gain = compute_marginal_gain(i, selected, embeddings, 
                                       relevance_scores, alpha)
            if gain > best_gain:
                best_gain = gain
                best_query = i
        
        if best_query is not None:
            selected.append(best_query)
            remaining.remove(best_query)
    
    return [candidates[i] for i in selected]

def compute_marginal_gain(new_idx, selected, embeddings, relevance_scores, alpha):
    """Compute marginal gain of adding new_idx to selected set"""
    if not selected:
        # First query: gain is sum of all relevance scores
        return sum(max(alpha * relevance_scores[j], 
                      cosine_similarity(embeddings[new_idx], embeddings[j]))
                  for j in range(len(embeddings)))
    
    # Compute current coverage
    current_coverage = [
        max([alpha * relevance_scores[j]] + 
            [cosine_similarity(embeddings[s], embeddings[j]) for s in selected])
        for j in range(len(embeddings))
    ]
    
    # Compute new coverage with additional query
    new_coverage = [
        max(current_coverage[j], 
            cosine_similarity(embeddings[new_idx], embeddings[j]))
        for j in range(len(embeddings))
    ]
    
    return sum(new_coverage) - sum(current_coverage)

The balance parameter alpha controls the tradeoff between relevance and diversity:

High $\alpha$ (e.g., 0.8): Prioritizes relevance to original query, may sacrifice diversity
Low $\alpha$ (e.g., 0.2): Prioritizes diversity among selected queries, may drift from original intent
Moderate $\alpha$ (e.g., 0.4-0.6): Balanced approach, often works well in practice

tagLazy Greedy Algorithm

One may notice in the above code:

for i in remaining:
    # Calculate marginal gain of adding query i
    gain = compute_marginal_gain(i, selected, embeddings, 
                               relevance_scores, alpha)

We are computing the marginal gain for all remaining candidates at each iteration. We can do better than this.

The lazy greedy algorithm is a clever optimization that exploits submodularity to avoid unnecessary computations. The key insight is: if element A had higher marginal gain than element B in iteration $t$ , then A will still have higher marginal gain than B in iteration $t+1$ (due to submodularity property).

import heapq

def lazy_greedy_query_selection(candidates, embeddings, original_embedding, k, alpha=0.3):
    """
    Lazy greedy algorithm for submodular query selection
    More efficient than standard greedy by avoiding unnecessary marginal gain computations
    """
    n = len(candidates)
    selected = []
    
    # Precompute relevance scores
    relevance_scores = cosine_similarity(original_embedding, embeddings)
    
    # Initialize priority queue: (-marginal_gain, last_updated, query_index)
    # Use negative gain because heapq is a min-heap
    pq = []
    for i in range(n):
        gain = compute_marginal_gain(i, [], embeddings, relevance_scores, alpha)
        heapq.heappush(pq, (-gain, 0, i))
    
    for iteration in range(k):
        while True:
            neg_gain, last_updated, best_idx = heapq.heappop(pq)
            
            # If this gain was computed in current iteration, it's definitely the best
            if last_updated == iteration:
                selected.append(best_idx)
                break
            
            # Otherwise, recompute the marginal gain
            current_gain = compute_marginal_gain(best_idx, selected, embeddings, 
                                               relevance_scores, alpha)
            heapq.heappush(pq, (-current_gain, iteration, best_idx))
    
    return [candidates[i] for i in selected]

Lazy greedy works like this:

Maintain a priority queue of elements sorted by their marginal gains
Only recompute the marginal gain of the top element
If it's still the highest after recomputation, select it
Otherwise, reinsert it in correct position and check the next top element

This can provide significant speedups because we avoid recomputing marginal gains for elements that clearly won't be selected.

tagResults

Let's run the experiment again. We use the same simple prompt to generate one to 20 diverse queries and perform the same cosine similarity measurements as before. For submodular optimization, we select queries from the 20 generated candidates using different values of k and measure similarity as before. The results show that queries selected through submodular optimization are more diverse and show lower in-set similarity.

Two line graphs comparing query similarity metrics for generated and original datasets. X-axis is the number of queries; Y-ax — Original query = `"embeddings and rerankers"`

Two graphs compare gene similarity scores for original, generated, and selected queries under various algorithms, showing sim — Original query = `"generative ai"`

Two side-by-side line graphs comparing original, generated, and selected queries using gene similarity data over a timeline. — Original query = `"geopolitics USA and China"`

Two graphs compare genome similarity across queries. The left graph shows "Original vs Generated/Selected Queries," and the r — Original query = `"google 2025 revenue breakdown"`

tagFinal Question: Why Submodular Formulation Matters

You might wonder: why go through the trouble of formulating this as a submodular optimization problem? Why not just use heuristics or other optimization approaches?

In short, submodular formulation transforms an ad-hoc "select diverse queries" heuristic into a rigorous optimization problem with provable guarantees, efficient algorithms, and measurable objectives.

tagGuaranteed Efficiency

Once we prove our objective function is submodular, we get powerful theoretical guarantees and an efficient algorithm. The greedy algorithm that runs in $O(nk)$ time comparing to checking $\binom{n}{k}$ combinations achieves a $(1-1/e) \approx 0.63$ approximation to the optimal solution. This means our greedy solution is always at least 63% as good as the best possible solution. No heuristic can promise this.

Moreover, the lazy greedy algorithm is dramatically faster in practice due to the mathematical structure of submodular functions. The speedup comes from diminishing returns: elements that were poor choices in earlier iterations are unlikely to become good choices later. So instead of checking all $n$ candidates, lazy greedy typically only needs to recompute gains for the top few candidates.

tagNo Need for Hand-Crafted Heuristics

Without a principled framework, you might resort to ad-hoc rules like "ensure queries have cosine similarity < 0.7" or "balance different keyword categories." These rules are hard to tune and don't generalize. Submodular optimization gives you a principled, mathematically grounded approach. You can tune hyperparameters systematically using validation sets, and monitor solution quality in production systems. When the system produces poor results, you have clear metrics to debug what went wrong.

Finally, submodular optimization is a well-studied field with decades of research, allowing you to leverage advanced algorithms beyond greedy (like accelerated greedy or local search), theoretical insights about when certain formulations work best, and extensions to handle additional constraints such as budget limits or fairness requirements.