SIGIR, the conference of the ACM Special Interest Group on Information Retrieval, is a top-tier venue that brings together researchers, developers, industry experts, and educators from across the globe to share the latest groundbreaking research. Jina AI was at this year’s conference in Padua in July, presenting our work on late chunking at the Robust IR Workshop.
This year’s conference featured amazing research, especially in reranking methods, sparse retrieval models, and using LLMs in information retrieval. Highlights included keynotes from Stephen Robertson on the history and development of the BM25 ranking algorithm and from Iryna Gurevych on the future of AI in scientific research. The experts and enthusiastic PhD students in attendance sparked many lively discussions. The conference took place at the Padova Congress Center, located in the heart of the city. Padua itself is rich in history and culture, and we thoroughly enjoyed our time there.
tagLate Chunking at Robust IR
The Robust IR workshop is a new event at SIGIR, held for the first time this year. It focused on how well information retrieval systems perform under difficult and unusual conditions, and how we can improve their robustness. The workshop was a mix of invited talks, oral presentations of accepted papers, and a panel discussion.
We presented our work on late chunking at the workshop's poster session. There were many insightful questions and comments, quite a few from people who had already read our pre-print.

Download the poster from Google Drive
tagInteresting Research
We found a lot of interesting research presented at SIGIR, but the work below stood out to us.
tagCLIP-AdaM: Adapting Multi-view CLIP for Open-set 3D Object Retrieval
This paper focuses on open-set 3D object retrieval: retrieving 3D objects from categories the model was never trained on. The approach renders 3D models from multiple angles and uses CLIP models, which were trained on flat 2D images, to recognize the objects. One interesting finding is that CLIP models perform well when you simply average the embeddings generated from the different views of an object.

Beyond that, the paper proposes a novel training method for 3D object retrieval that learns to weight the different views, along with adaptive layers that tune the model for specific tasks while preventing overfitting on the training categories and improving zero-shot performance on new ones.
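To make the view-averaging finding concrete, here is a minimal sketch of embedding-level pooling over rendered views. It assumes the CLIP embedding for each view has already been computed; the array shapes, the L2 normalization, and the random toy data are our own choices, not the paper's code.

```python
import numpy as np

def average_view_embeddings(view_embeddings: np.ndarray) -> np.ndarray:
    """Pool per-view CLIP image embeddings into one object-level vector.

    view_embeddings: shape (num_views, dim), one CLIP embedding per
    rendered view of the same 3D object.
    """
    # L2-normalize each view so no single view dominates the average
    views = view_embeddings / np.linalg.norm(view_embeddings, axis=1, keepdims=True)
    pooled = views.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

# Toy usage: 6 rendered views of one object, 512-dimensional embeddings
views = np.random.randn(6, 512)
object_vector = average_view_embeddings(views)

# Retrieval then reduces to cosine similarity against a query embedding
query = np.random.randn(512)
query /= np.linalg.norm(query)
score = float(object_vector @ query)
```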
tagOptimizing Compound Retrieval Systems
Most existing systems that combine multiple ranking models are built as ranking cascades: the models run one after another, with each stage reranking only the best-scoring results passed on by the previous one.
This paper proposes a different approach that it calls compound retrieval systems: a framework for combining different rerankers to jointly optimize ranking quality and computational cost. The authors frame it as a generalization of the cascade approach, in which rerankers can be applied to different subsets of the results from earlier ranking stages.
The figure below, taken from the paper, shows how different rerankers can be combined.

In their example, a first-stage ranker produces an initial ranking. Then, the second stage uses two rerankers with different ranking approaches:
- A point-wise ranking model that produces a relevance score for documents from the first ranker, based on a query.
- A pair-wise ranking model that compares two documents and the query and outputs an estimated probability that one of the two is more relevant to the query than the other.
Each model has a selection policy that’s applied to the results of the previous ranking stage, for example, take only the top-n results. A final ordering function then produces the end result. Both the selection policies and the ordering function have parameters set by training, allowing a holistic optimization that yields better and more robust results.
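To illustrate the structure (not the paper's learned optimization), here is a toy sketch in which each reranker runs only on the subset picked by its selection policy and a hand-written ordering function merges the scores. The subset sizes, the score combination, and the dummy models are all our own assumptions, and the query is omitted from the model signatures for brevity.

```python
import itertools
import numpy as np

def compound_rerank(docs, first_stage_scores, pointwise_score, pairwise_prob,
                    n_pointwise=20, n_pairwise=5):
    """Toy compound reranking: each reranker sees only the subset chosen by
    its selection policy, and a final ordering function merges the signals.
    In the paper, both the selection policies and the ordering function are
    learned; here they are fixed by hand purely for illustration."""
    order = list(np.argsort(-np.asarray(first_stage_scores)))  # first-stage ranking
    combined = {i: float(first_stage_scores[i]) for i in order}

    # Selection policy A: point-wise relevance scores for the top documents
    for i in order[:n_pointwise]:
        combined[i] = pointwise_score(docs[i])

    # Selection policy B: pair-wise comparisons among the very top documents
    for a, b in itertools.permutations(order[:n_pairwise], 2):
        combined[a] += pairwise_prob(docs[a], docs[b])  # P(a beats b for the query)

    # Final ordering function: sort by the combined score
    return sorted(order, key=lambda i: -combined[i])

# Dummy usage with random stand-ins for the query-dependent models
docs = [f"doc {i}" for i in range(100)]
first_stage = np.random.rand(100)
reranked = compound_rerank(
    docs, first_stage,
    pointwise_score=lambda d: np.random.rand(),
    pairwise_prob=lambda a, b: np.random.rand(),
)
```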
tagRE-AdaptIR: Improving Information Retrieval through Reverse Engineered Adaptation
There has been a lot of research into improving models by doing simple arithmetic directly on their weights. For example, the model soup method improves accuracy and robustness by averaging the weights of several models fine-tuned from the same base model with different hyperparameters.
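As a reminder of how simple that weight arithmetic is, here is a minimal model-soup sketch over toy state dicts. Real soups average full transformer checkpoints; the dictionary-of-arrays representation here is purely illustrative.

```python
import numpy as np

def uniform_model_soup(state_dicts):
    """Uniform 'model soup': average the weights of several models that were
    fine-tuned from the same base, parameter tensor by parameter tensor."""
    return {
        name: np.mean([sd[name] for sd in state_dicts], axis=0)
        for name in state_dicts[0]
    }

# Toy usage: three fine-tuned copies of a one-layer "model"
finetunes = [{"layer.weight": np.random.randn(2, 2)} for _ in range(3)]
souped = uniform_model_soup(finetunes)
```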

The research presented in this paper offers a related idea: can we use the vector of differences between the weights of a fine-tuned embedding model and its non-fine-tuned base to transfer that learning to another model? If we fine-tune another copy of the base model on next-token prediction over domain-specific text, and then add in the weight differences from the trained embedding model, will we get a better embedding model for the target domain?
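Here is a minimal sketch of that weight-difference transfer, again over toy state dicts. This is our reading of the idea as described above, not the authors' implementation, which involves further details.

```python
import numpy as np

def transfer_retrieval_adaptation(base, retrieval_finetuned, domain_finetuned):
    """Add the 'retrieval direction' (retrieval fine-tune minus base) to a copy
    of the base model that was fine-tuned on domain text with next-token prediction."""
    return {
        name: domain_finetuned[name] + (retrieval_finetuned[name] - base[name])
        for name in base
    }

# Toy usage with a single 2x2 weight matrix per model
base = {"layer.weight": np.zeros((2, 2))}
retrieval_model = {"layer.weight": np.full((2, 2), 0.1)}  # embedding fine-tune
domain_lm = {"layer.weight": np.full((2, 2), 0.5)}        # domain next-token fine-tune
adapted_embedder = transfer_retrieval_adaptation(base, retrieval_model, domain_lm)
```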

This has important advantages for adapting models to new domains: plain text data is plentiful, so you can train on next-token prediction alone and get improved embeddings as a result.
tagBenchmarking LLM-based Relevance Judgment Methods
This paper evaluates prompting strategies for using LLMs as relevance judges: binary (yes/no) judgments, graded evaluations (e.g., 0-4 scales), pairwise comparisons of documents for relevance, and “nugget-based” methods, which check whether a document contains specific pieces of information.
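As a flavor of the simplest of these strategies, here is a generic binary-judgment prompt builder. It is our own sketch, not one of the prompts evaluated in the paper.

```python
def binary_relevance_prompt(query: str, document: str) -> str:
    """Build a minimal yes/no relevance-judgment prompt for an LLM judge."""
    return (
        "You are judging search results.\n"
        f"Query: {query}\n"
        f"Document: {document}\n"
        "Is the document relevant to the query? Answer with exactly one word: yes or no."
    )

print(binary_relevance_prompt(
    "history of the BM25 ranking algorithm",
    "BM25 grew out of the probabilistic retrieval framework developed over several decades...",
))
```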
The authors conclude, from testing with GPT-4o and Llama 3, that results align better with human judgments when the LLM has fewer choices. Binary judgments and pairwise comparisons perform best and, with very strong models, are good enough for large-scale automated use. Good prompt design is a critical factor.
Nugget-based methods are easier for humans to interpret, but less reliable.
tagRankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation
This paper explores issues in using LLMs in three distinct roles: ranking results, judging relevance and evaluating results, and providing support functions like result summarization and query expansion.
It considers the consequences of LLM usage over the entire information cycle, as shown in the figure below, taken from the paper.

The paper concludes that there are significant issues with using LLM-based judgments to evaluate information retrieval systems that themselves rely on LLMs. The interaction of different LLM-based components can lead to biased and inaccurate results.
tagLLM-Driven Usefulness Labeling for IR Evaluation
This paper makes a distinction between relevance and usefulness in search results. Relevance, in their definition, is about whether a retrieved document is topically related to the query; usefulness is about whether the document is responsive to the query, i.e., whether it fulfills the user’s intent.
The paper focuses on whether LLMs can recognize and rank usefulness, and whether their judgments align with human ones. The authors find significant alignment between LLMs and human judges of usefulness. However, available LLMs struggle with cases where relevance and usefulness diverge, i.e., documents that are relevant but not useful. Giving LLMs more context information, rather than just text queries, improves results significantly.
tagLLM-based Relevance Assessment Still Can’t Replace Human Relevance Assessment
The paper discusses using LLMs for automatic relevance assessment in information retrieval, which would make it much easier to evaluate and train retrieval models, since there is never enough human-labeled relevance data. Although some recent studies claim LLMs can fully replace human evaluators, this paper identifies key limitations that prevent LLMs from substituting for human judgment:
- Insufficient evidence and limited generalizability: Current research lacks strong evidence that LLMs can fully replace human relevance judgments, especially across diverse datasets and real-world scenarios. Where positive results exist, it’s debatable whether they really apply to broader domains.
- Vulnerability to manipulation: Automated metrics, including LLM-based ones, can be gamed: scores can rise without any real improvement in retrieval quality.
- Self-preference bias: LLMs tend to favor text that resembles their own output, introducing a bias that compromises the objectivity of relevance assessments.
- Risks of overfitting: Relying on LLM-based evaluations may cause retrieval systems to optimize for the idiosyncrasies of specific LLMs, reducing performance in real-world use.
tagConclusion
The rapid rise of large language models has significantly transformed information retrieval, challenging established methods like BM25 and opening up new possibilities. The research showcased at SIGIR highlights this transformation.
However, language models do not turn information retrieval into a solved problem. The conference featured a wide range of innovative ideas aimed at aligning IR systems more closely with users' evolving needs. We truly enjoyed connecting with PhD students and experts, exchanging ideas, and sharing our vision for the future of search at Jina AI. We're excited to continue pushing the boundaries of what's possible in this field.