News
Models
Products
keyboard_arrow_down
Reader
Convert any URL to Markdown for better grounding LLMs.
Embeddings
World-class multimodal multilingual embeddings.
Reranker
World-class reranker for maximizing search relevancy.
DeepSearch
Search, read and reason until best answer found.
More
keyboard_arrow_down
Classifier
Zero-shot and few-shot classification for image and text.
Segmenter
Cut long text into chunks and do tokenization.

API Docs
Auto codegen for your copilot IDE or LLM
open_in_new


Company
keyboard_arrow_down
About us
Contact sales
Intern program
Join us
open_in_new
Download logo
open_in_new
Terms & Conditions


Log in
login
Query Generation via Prompting
Problem Formulation
Embedding-Based Submodular Function Design
Implementations
Final Question: Why Submodular Formulation Matters
Tech blog
July 04, 2025

Submodular Optimization for Diverse Query Generation in DeepResearch

Many know the importance of query diversity in DeepResearch, but few know how to solve it rigorously via submodular optimization.
Han Xiao
Han Xiao • 13 minutes read

When implementing DeepResearch, there are at least two places where you need to generate diverse queries. First, you must generate web search queries based on user input (directly throwing user input into the search engine is not a good idea). Second, many DeepResearch systems include a "research planner" that breaks down the original problem into subproblems, calls agents to solve them independently, and then merges their results. Whether generating search queries or subproblems, our expectations remain the same: they must be relevant to the original input and diverse enough to provide unique perspectives on it. Often, we need to limit the number of queries to save money on unnecessary search engine requests or LLM tokens.

While understanding the importance of diverse query generation, most open-source DeepResearch implementations don't take this optimization seriously. They just hardcode these constraints directly into the prompt. Some might ask the LLM for an additional turn to evaluate and diversify the queries. Here's an example of how most implementations basically approach this:

Two different prompts for generating diverse queries using LLMs. The top prompt uses simple instructions. The bottom one is more sophisticated and structured. Given the original query and the number of queries to be generated, we expect the generated queries to be sufficiently diverse. In this example, we use gemini-2.5-flash as the LLM and the original query is "embeddings and rerankers".

In this article, I want to demonstrate a more rigorous approach to diverse query generation using sentence embeddings and submodular optimization. Back in my Ph.D. days, submodular optimization was one of my favorites along with L-BFGS. I'll show how to apply it for generating an optimal set of diverse queries under a cardinality constraint, which can significantly improve the overall quality of DeepResearch systems.

tagQuery Generation via Prompting

First, we want to check if prompting is an effective approach for generating diverse queries. We also want to understand if a sophisticated prompt is more effective than a simple prompt. Let's run an experiment comparing the two prompts below to find out:

You are an expert at generating diverse search queries. Given any input topic, generate {num_queries} different search queries that explore various angles and aspects of the topic.

Simple prompt

You are an expert research strategist. Generate an optimal set of diverse search queries that maximizes information coverage while minimizing redundancy.

Task: Create exactly {num_queries} search queries from any given input that satisfy:
- Relevance: Each query must be semantically related to the original input
- Diversity: Each query should explore a unique facet with minimal overlap
- Coverage: Together, the queries should comprehensively address the topic

Process:
1. Decomposition: Break down the input into core concepts and dimensions
2. Perspective Mapping: Identify distinct angles (theoretical, practical, historical, comparative, etc.)
3. Query Formulation: Craft specific, searchable queries for each perspective
4. Diversity Check: Ensure minimal semantic overlap between queries

Structured prompt

We use gemini-2.5-flash as the LLM with the original query "embeddings and rerankers" and test both the simple and structured prompts to iteratively generate from one to 20 queries. We then use jina-embeddings-v3 with the text-matching task to measure sentence similarity between the original query and the generated queries, as well as similarity within the generated queries themselves. Here are the visualizations.

Both prompts show similar patterns in the "Within Generated Queries" analysis (right two plots), with median cosine similarities remaining high (0.4-0.6 range) across different query counts. The simple prompt seems to be even better at diversifying queries when the number of queries is large, whereas the structured prompt maintains slightly better relevance to the original query, keeping the relevance around 0.6.

Look at the two plots on the right side, one can see that both simple and structured prompt exhibit large variance in cosine similarity scores, with many reaching 0.7-0.8 similarity, suggesting that some generated queries are nearly identical. Besides, both methods struggle to maintain diversity as more queries are generated. Instead of seeing a clear downward trend in similarity with increasing query count, we observe relatively stable (and high) similarity levels, indicating that additional queries often duplicate existing perspectives.

One explanation is what Wang et al. (2025) found that LLMs often reflect opinions of dominant groups disproportionately, even with prompt steering, indicating a bias towards common perspectives. This is because the LLM’s training data may over-represent certain viewpoints, causing the model to generate variations that align with these prevalent perspectives. Abe et al. (2025) also found that LLM-based query expansion favors popular interpretations while overlooking others. For instance, "What are the benefits of AI?" might yield common benefits like automation, efficiency, ethicalness but miss less obvious ones like drug discovery.

Multilingual Prompting for Improving LLM Generation Diversity
Large Language Models (LLMs) are known to lack cultural representation and overall diversity in their generations, from expressing opinions to answering factual questions. To mitigate this problem, we propose multilingual prompting: a prompting method which generates several variations of a base prompt with added cultural and linguistic cues from several cultures, generates responses, and then combines the results. Building on evidence that LLMs have language-specific knowledge, multilingual prompting seeks to increase diversity by activating a broader range of cultural knowledge embedded in model training data. Through experiments across multiple models (GPT-4o, GPT-4o-mini, LLaMA 70B, and LLaMA 8B), we show that multilingual prompting consistently outperforms existing diversity-enhancing techniques such as high-temperature sampling, step-by-step recall, and personas prompting. Further analyses show that the benefits of multilingual prompting vary with language resource level and model size, and that aligning the prompting language with the cultural cues reduces hallucination about culturally-specific information.
arXiv.orgQihan Wang
Wisdom from Diversity: Bias Mitigation Through Hybrid Human-LLM Crowds
Despite their performance, large language models (LLMs) can inadvertently perpetuate biases found in the data they are trained on. By analyzing LLM responses to bias-eliciting headlines, we find that these models often mirror human biases. To address this, we explore crowd-based strategies for mitigating bias through response aggregation. We first demonstrate that simply averaging responses from multiple LLMs, intended to leverage the “wisdom of the crowd”, can exacerbate existing biases due to the limited diversity within LLM crowds. In contrast, we show that locally weighted aggregation methods more effectively leverage the wisdom of the LLM crowd, achieving both bias mitigation and improved accuracy. Finally, recognizing the complementary strengths of LLMs (accuracy) and humans (diversity), we demonstrate that hybrid crowds containing both significantly enhance performance and further reduce biases across ethnic and gender-related contexts.
arXiv.orgAxel Abels

tagProblem Formulation

One might think our previous experiment is inconclusive and that we should improve the prompt and try again. While prompting can certainly change the results to some degree, what's more important is that we've learned something: simply increasing the number of generated queries makes us more likely to get diverse queries. The bad news is that we're also getting bunch of duplicated queries as side-product.

But since it's cheap to generate a large number of queries, which eventually yields some good ones, why don't we treat this as a subset selection problem?

In math, here is how we can formulate this problem: given an original input q0q_0q0​, a set of candidate queries V={q1,q2,⋯ ,qn}V=\{q_1, q_2, \cdots, q_n\}V={q1​,q2​,⋯,qn​} generated by an LLM using prompt engineering. Select a subset X⊆VX\subseteq VX⊆V of kkk queries that maximizes coverage while minimizing redundancy.

Unfortunately, finding the optimal subset of kkk queries from nnn candidates requires checking (nk)\binom{n}{k}(kn​) combinations - exponential complexity. For just 20 candidates and k=5k=5k=5, that's 15,504 combinations.

tagSubmodular Function

Before trying to brutally solve the subset selection problem, let me introduce the term submodularity and submodular function to the readers. They may sound unfamiliar to many, but you may well have heard of the idea of "diminishing returns" - well, submodularity is the mathematical representation of that.

Consider placing Wi-Fi routers to provide internet coverage in a large building. The first router you install gives tremendous value - it covers a significant area that previously had no coverage. The second router adds substantial value too, but some of its coverage area overlaps with the first router, so the marginal benefit is less than the first. As you continue adding routers, each additional router covers less and less new area because most spaces are already covered by existing routers. Eventually, the 10th router might provide very little additional coverage since the building is already well-covered.

This intuition captures the essence of submodularity. Mathematically, a set function f:2V→Rf: 2^V \rightarrow \mathbb{R}f:2V→R is submodular if for all A⊆B⊆VA \subseteq B \subseteq VA⊆B⊆V and any element v∉Bv \notin Bv∈/B:

f(A∪v)−f(A)≥f(B∪v)−f(B)f(A \cup {v}) - f(A) \geq f(B \cup {v}) - f(B)f(A∪v)−f(A)≥f(B∪v)−f(B)

In plain English: adding an element to a smaller set gives at least as much benefit as adding the same element to a larger set that contains the smaller set.

Now let's apply this concept to our query generation problem. One may immediately realize that query selection exhibits natural diminishing returns:

  • The first query we select covers entirely new semantic space
  • The second query should cover different aspects, but some overlap is inevitable
  • As we add more queries, each additional query covers less and less new ground
From one of my old slides back in AAAI 2013, where I explained submodularity using a bag of balls. Adding more balls to the bag improves the "facility," but the relative improvement becomes smaller and smaller, as seen in the decreasing delta values on the right y-axis.

tagEmbedding-Based Submodular Function Design

Let ei∈Rd\mathbf{e}_i \in \mathbb{R}^dei​∈Rd be the embedding vector for query qiq_iqi​, obtained using a sentence embedding model (e.g., jina-embeddings-v3). There are two main approaches to design our objective function:

tagApproach 1: Facility Location (Coverage-Based)

fcoverage(X)=∑j=1nmax⁡(α⋅sim(e0,ej),max⁡qi∈Xsim(ej,ei))f_{\text{coverage}}(X) = \sum_{j=1}^{n} \max\left(\alpha \cdot \text{sim}(\mathbf{e}_0, \mathbf{e}_j), \max_{q_i \in X} \text{sim}(\mathbf{e}_j, \mathbf{e}_i)\right)fcoverage​(X)=j=1∑n​max(α⋅sim(e0​,ej​),qi​∈Xmax​sim(ej​,ei​))

This function measures how well the selected set XXX "covers" all candidate queries, where:

  • sim(u,v)=u⋅v∣u∣∣v∣\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{|\mathbf{u}| |\mathbf{v}|}sim(u,v)=∣u∣∣v∣u⋅v​ is the cosine similarity
  • α⋅sim(e0,ej)\alpha \cdot \text{sim}(\mathbf{e}_0, \mathbf{e}_j)α⋅sim(e0​,ej​) ensures relevance to the original query
  • max⁡qi∈Xsim(ej,ei)\max_{q_i \in X} \text{sim}(\mathbf{e}_j, \mathbf{e}_i)maxqi​∈X​sim(ej​,ei​) measures coverage of candidate jjj by selected set XXX

One caveat is this function only implicitly encourages diversity. It doesn't explicitly penalize similarity within the selected set XXX. Diversity emerges because selecting similar queries provides diminishing coverage returns.

tagApproach 2: Explicit Coverage + Diversity

For more direct control over diversity, we can combine coverage and an explicit diversity term:

f(X)=λ⋅fcoverage(X)+(1−λ)⋅fdiversity(X)f(X) = \lambda \cdot f_{\text{coverage}}(X) + (1-\lambda) \cdot f_{\text{diversity}}(X)f(X)=λ⋅fcoverage​(X)+(1−λ)⋅fdiversity​(X)

where the diversity component can be formulated as:

fdiversity(X)=∑qi∈X∑qj∈V∖Xsim(ei,ej)f_{\text{diversity}}(X) = \sum_{q_i \in X} \sum_{q_j \in V \setminus X} \text{sim}(\mathbf{e}_i, \mathbf{e}_j)fdiversity​(X)=qi​∈X∑​qj​∈V∖X∑​sim(ei​,ej​)

This diversity term measures the total similarity between selected queries and unselected queries - it's maximized when we select queries that are different from the remaining candidates (a form of graph cut function).

tagDifference Between Two Approaches

Both formulations maintain submodularity.

Facility location function is a well-known submodular function. It exhibits submodularity because of the max operation: when we add a new query qqq to our selected set, each candidate query jjj gets covered by the "best" query in our set (the one with highest similarity). Adding qqq to a smaller set AAA is more likely to improve the coverage of various candidates than adding it to a larger set B⊇AB \supseteq AB⊇A where many candidates are already well-covered.

In graph cut diversity function, the diversity term ∑qi∈X∑qj∈V∖Xsim(ei,ej)\sum_{q_i \in X} \sum_{q_j \in V \setminus X} \text{sim}(\mathbf{e}_i, \mathbf{e}_j)∑qi​∈X​∑qj​∈V∖X​sim(ei​,ej​) is submodular because it measures the "cut" between selected and unselected sets. Adding a new query to a smaller selected set creates more new connections to unselected queries than adding it to a larger selected set.

The facility location approach relies on implicit diversity through coverage competition, while the explicit approach directly measures and optimizes for diversity. So both are valid, but the explicit approach gives you more direct control over the relevance-diversity tradeoff.

tagImplementations

GitHub - jina-ai/submodular-optimization
Contribute to jina-ai/submodular-optimization development by creating an account on GitHub.
GitHubjina-ai

The full implementation can be found here on Github.

Since our function is submodular, we can use the greedy algorithm which provides a (1−1/e)≈0.63(1-1/e) \approx 0.63(1−1/e)≈0.63 approximation guarantee:

max⁡X⊆Vf(X)subject to∣X∣≤k\max_{X \subseteq V} f(X) \quad \text{subject to} \quad |X| \leq kX⊆Vmax​f(X)subject to∣X∣≤k

Here's the code for optimizing Facility Location (coverage-based) - the one with implicit diversity.

def greedy_query_selection(candidates, embeddings, original_embedding, k, alpha=0.3):
    """
    Greedy algorithm for submodular query selection
    
    Args:
        candidates: List of candidate query strings
        embeddings: Matrix of query embeddings (n x d)
        original_embedding: Embedding of original query (d,)
        k: Number of queries to select
        alpha: Relevance weight parameter
    """
    n = len(candidates)
    selected = []
    remaining = set(range(n))
    
    # Precompute relevance scores
    relevance_scores = cosine_similarity(original_embedding, embeddings)
    
    for _ in range(k):
        best_gain = -float('inf')
        best_query = None
        
        for i in remaining:
            # Calculate marginal gain of adding query i
            gain = compute_marginal_gain(i, selected, embeddings, 
                                       relevance_scores, alpha)
            if gain > best_gain:
                best_gain = gain
                best_query = i
        
        if best_query is not None:
            selected.append(best_query)
            remaining.remove(best_query)
    
    return [candidates[i] for i in selected]

def compute_marginal_gain(new_idx, selected, embeddings, relevance_scores, alpha):
    """Compute marginal gain of adding new_idx to selected set"""
    if not selected:
        # First query: gain is sum of all relevance scores
        return sum(max(alpha * relevance_scores[j], 
                      cosine_similarity(embeddings[new_idx], embeddings[j]))
                  for j in range(len(embeddings)))
    
    # Compute current coverage
    current_coverage = [
        max([alpha * relevance_scores[j]] + 
            [cosine_similarity(embeddings[s], embeddings[j]) for s in selected])
        for j in range(len(embeddings))
    ]
    
    # Compute new coverage with additional query
    new_coverage = [
        max(current_coverage[j], 
            cosine_similarity(embeddings[new_idx], embeddings[j]))
        for j in range(len(embeddings))
    ]
    
    return sum(new_coverage) - sum(current_coverage)

The balance parameter alpha controls the tradeoff between relevance and diversity:

  • High α\alphaα (e.g., 0.8): Prioritizes relevance to original query, may sacrifice diversity
  • Low α\alphaα (e.g., 0.2): Prioritizes diversity among selected queries, may drift from original intent
  • Moderate α\alphaα (e.g., 0.4-0.6): Balanced approach, often works well in practice

tagLazy Greedy Algorithm

One may notice in the above code:

for i in remaining:
    # Calculate marginal gain of adding query i
    gain = compute_marginal_gain(i, selected, embeddings, 
                               relevance_scores, alpha)

We are computing the marginal gain for all remaining candidates at each iteration. We can do better than this.

The lazy greedy algorithm is a clever optimization that exploits submodularity to avoid unnecessary computations. The key insight is: if element A had higher marginal gain than element B in iteration ttt, then A will still have higher marginal gain than B in iteration t+1t+1t+1 (due to submodularity property).

import heapq

def lazy_greedy_query_selection(candidates, embeddings, original_embedding, k, alpha=0.3):
    """
    Lazy greedy algorithm for submodular query selection
    More efficient than standard greedy by avoiding unnecessary marginal gain computations
    """
    n = len(candidates)
    selected = []
    
    # Precompute relevance scores
    relevance_scores = cosine_similarity(original_embedding, embeddings)
    
    # Initialize priority queue: (-marginal_gain, last_updated, query_index)
    # Use negative gain because heapq is a min-heap
    pq = []
    for i in range(n):
        gain = compute_marginal_gain(i, [], embeddings, relevance_scores, alpha)
        heapq.heappush(pq, (-gain, 0, i))
    
    for iteration in range(k):
        while True:
            neg_gain, last_updated, best_idx = heapq.heappop(pq)
            
            # If this gain was computed in current iteration, it's definitely the best
            if last_updated == iteration:
                selected.append(best_idx)
                break
            
            # Otherwise, recompute the marginal gain
            current_gain = compute_marginal_gain(best_idx, selected, embeddings, 
                                               relevance_scores, alpha)
            heapq.heappush(pq, (-current_gain, iteration, best_idx))
    
    return [candidates[i] for i in selected]

Lazy greedy works like this:

  1. Maintain a priority queue of elements sorted by their marginal gains
  2. Only recompute the marginal gain of the top element
  3. If it's still the highest after recomputation, select it
  4. Otherwise, reinsert it in correct position and check the next top element

This can provide significant speedups because we avoid recomputing marginal gains for elements that clearly won't be selected.

tagResults

Let's run the experiment again. We use the same simple prompt to generate one to 20 diverse queries and perform the same cosine similarity measurements as before. For submodular optimization, we select queries from the 20 generated candidates using different values of k and measure similarity as before. The results show that queries selected through submodular optimization are more diverse and show lower in-set similarity.

Original query = "embeddings and rerankers"
Original query = "generative ai"
Original query = "geopolitics USA and China"
Original query = "google 2025 revenue breakdown"

tagFinal Question: Why Submodular Formulation Matters

You might wonder: why go through the trouble of formulating this as a submodular optimization problem? Why not just use heuristics or other optimization approaches?

In short, submodular formulation transforms an ad-hoc "select diverse queries" heuristic into a rigorous optimization problem with provable guarantees, efficient algorithms, and measurable objectives.

tagGuaranteed Efficiency

Once we prove our objective function is submodular, we get powerful theoretical guarantees and an efficient algorithm. The greedy algorithm that runs in O(nk)O(nk)O(nk) time comparing to checking (nk)\binom{n}{k}(kn​) combinations achieves a (1−1/e)≈0.63(1-1/e) \approx 0.63(1−1/e)≈0.63 approximation to the optimal solution. This means our greedy solution is always at least 63% as good as the best possible solution. No heuristic can promise this.

Moreover, the lazy greedy algorithm is dramatically faster in practice due to the mathematical structure of submodular functions. The speedup comes from diminishing returns: elements that were poor choices in earlier iterations are unlikely to become good choices later. So instead of checking all nnn candidates, lazy greedy typically only needs to recompute gains for the top few candidates.

tagNo Need for Hand-Crafted Heuristics

Without a principled framework, you might resort to ad-hoc rules like "ensure queries have cosine similarity < 0.7" or "balance different keyword categories." These rules are hard to tune and don't generalize. Submodular optimization gives you a principled, mathematically grounded approach. You can tune hyperparameters systematically using validation sets, and monitor solution quality in production systems. When the system produces poor results, you have clear metrics to debug what went wrong.

Finally, submodular optimization is a well-studied field with decades of research, allowing you to leverage advanced algorithms beyond greedy (like accelerated greedy or local search), theoretical insights about when certain formulations work best, and extensions to handle additional constraints such as budget limits or fairness requirements.

submodularity.org: Tutorials, References, Activities and Tools for Submodular Optimization

For those who are interested in submodular optimization, I recommend this site to learn more.

Categories:
Tech blog
rss_feed

Read more
June 30, 2025 • 8 minutes read
Quantization-Aware Training of jina-embeddings-v4
Andrei Ungureanu
Scott Martens
Bo Wang
Retro-style digital screen displaying four pixelated images: a cat, a woman, an abstract figure, and a man's portrait, with l
May 28, 2025 • 4 minutes read
Correlations: Vibe-Testing Embeddings in GUI
Jina AI
Technical screen showing green and yellow visual data, including charts in the lower half and a heat-map-like visualization a
May 25, 2025 • 8 minutes read
Fair Scoring for Multimodal Documents with jina-reranker-m0
Nan Wang
Alex C-G
Stacked glowing green ovals on a background transitioning from black to green, with the top oval having an unusual, split sha
Offices
location_on
Sunnyvale, CA
710 Lakeway Dr, Ste 200, Sunnyvale, CA 94085, USA
location_on
Berlin, Germany (HQ)
Prinzessinnenstraße 19-20, 10969 Berlin, Germany
location_on
Beijing, China
Level 5, Building 6, No.48 Haidian West St. Beijing, China
location_on
Shenzhen, China
402 Floor 4, Fu'an Technology Building, Shenzhen, China
Search Foundation
Reader
Embeddings
Reranker
DeepSearch
Classifier
Segmenter
API Documentation
Get Jina API key
Rate Limit
API Status
Company
About us
Contact sales
Newsroom
Intern program
Join us
open_in_new
Download logo
open_in_new
Terms
Security
Terms & Conditions
Privacy
Manage Cookies
email
Jina AI © 2020-2025.