SIGIR, the conference of the ACM Special Interest Group on Information Retrieval, is a top-tier venue that brings together researchers, developers, industry experts, and educators from across the globe to share the latest groundbreaking research. Jina AI was at this year’s conference in Padua in July, presenting our work on late chunking at the Robust IR Workshop.
This year’s conference featured amazing research, especially in reranking methods, sparse retrieval models, and using LLMs in information retrieval. Highlights included keynotes from Stephen Robertson on the history and development of the BM25 ranking algorithm and from Iryna Gurevych on the future of AI in scientific research. The experts and enthusiastic PhD students in attendance sparked many lively discussions. The conference took place at the Padova Congress Center, located in the heart of the city. Padua itself is rich in history and culture, and we thoroughly enjoyed our time there.
tagLate Chunking at Robust IR
The Robust IR workshop is a new event at SIGIR, held for the first time this year. It focused on how well information retrieval systems perform under difficult and unusual conditions, and how we can improve their robustness. The workshop was a mix of invited talks, oral presentations of accepted papers, and a panel discussion.
We presented our work on late chunking at the workshop's poster session. There were many insightful questions and comments, quite a few from people who had already read our pre-print.

Download the poster from Google Drive
tagInteresting Research
We found a lot of interesting research presented at SIGIR, but the work below stood out to us.
tagCLIP-AdaM: Adapting Multi-view CLIP for Open-set 3D Object Retrieval
This paper focuses on open-set 3D object retrieval: retrieving 3D objects from categories the model was never trained on. The approach renders 3D models from multiple angles and uses CLIP models, which were trained on flat 2D images, to recognize the objects. One interesting finding is that CLIP models perform well when you simply average the embeddings generated from the different views of an object.

Beyond that, the paper proposes a novel training method for 3D object retrieval that learns to weight the different views, along with adaptive layers that tune the model for specific tasks while preventing overfitting on the training categories and improving zero-shot performance on new ones.
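To make the view-averaging finding concrete, here is a minimal sketch of embedding-level pooling over rendered views. It assumes the CLIP embedding for each view has already been computed; the array shapes, the L2 normalization, and the random toy data are our own choices, not the paper's code.

```python
import numpy as np

def average_view_embeddings(view_embeddings: np.ndarray) -> np.ndarray:
    """Pool per-view CLIP image embeddings into one object-level vector.

    view_embeddings: shape (num_views, dim), one CLIP embedding per
    rendered view of the same 3D object.
    """
    # L2-normalize each view so no single view dominates the average
    views = view_embeddings / np.linalg.norm(view_embeddings, axis=1, keepdims=True)
    pooled = views.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

# Toy usage: 6 rendered views of one object, 512-dimensional embeddings
views = np.random.randn(6, 512)
object_vector = average_view_embeddings(views)

# Retrieval then reduces to cosine similarity against a query embedding
query = np.random.randn(512)
query /= np.linalg.norm(query)
score = float(object_vector @ query)
```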
tagOptimizing Compound Retrieval Systems
Most existing systems that combine multiple ranking models are built as ranking cascades: the models run one after another, with each stage reranking only the best-scoring results passed on by the previous one.
This paper proposes a different approach that it calls compound retrieval systems: a framework for combining different rerankers to jointly optimize ranking quality and computational cost. The authors frame it as a generalization of the cascade approach, in which rerankers can be applied to different subsets of the results from earlier ranking stages.
The figure below, taken from the paper, shows how different rerankers can be combined.

In their example, a first-stage ranker produces an initial ranking. Then, the second stage uses two rerankers with different ranking approaches:
- A point-wise ranking model that produces a relevance score for documents from the first ranker, based on a query.
- A pair-wise ranking model that compares two documents and the query and outputs an estimated probability that one of the two is more relevant to the query than the other.
Each model has a selection policy that’s applied to the results of the previous ranking stage, for example, take only the top-n results. A final ordering function then produces the end result. Both the selection policies and the ordering function have parameters set by training, allowing a holistic optimization that yields better and more robust results.
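To illustrate the structure (not the paper's learned optimization), here is a toy sketch in which each reranker runs only on the subset picked by its selection policy and a hand-written ordering function merges the scores. The subset sizes, the score combination, and the dummy models are all our own assumptions, and the query is omitted from the model signatures for brevity.

```python
import itertools
import numpy as np

def compound_rerank(docs, first_stage_scores, pointwise_score, pairwise_prob,
                    n_pointwise=20, n_pairwise=5):
    """Toy compound reranking: each reranker sees only the subset chosen by
    its selection policy, and a final ordering function merges the signals.
    In the paper, both the selection policies and the ordering function are
    learned; here they are fixed by hand purely for illustration."""
    order = list(np.argsort(-np.asarray(first_stage_scores)))  # first-stage ranking
    combined = {i: float(first_stage_scores[i]) for i in order}

    # Selection policy A: point-wise relevance scores for the top documents
    for i in order[:n_pointwise]:
        combined[i] = pointwise_score(docs[i])

    # Selection policy B: pair-wise comparisons among the very top documents
    for a, b in itertools.permutations(order[:n_pairwise], 2):
        combined[a] += pairwise_prob(docs[a], docs[b])  # P(a beats b for the query)

    # Final ordering function: sort by the combined score
    return sorted(order, key=lambda i: -combined[i])

# Dummy usage with random stand-ins for the query-dependent models
docs = [f"doc {i}" for i in range(100)]
first_stage = np.random.rand(100)
reranked = compound_rerank(
    docs, first_stage,
    pointwise_score=lambda d: np.random.rand(),
    pairwise_prob=lambda a, b: np.random.rand(),
)
```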
tagRE-AdaptIR: Improving Information Retrieval through Reverse Engineered Adaptation
There has been a lot of research into improving models by doing simple arithmetic directly on their weights. For example, the model soup method improves accuracy and robustness by averaging the weights of several models fine-tuned from the same base model with different hyperparameters.
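As a reminder of how simple that weight arithmetic is, here is a minimal model-soup sketch over toy state dicts. Real soups average full transformer checkpoints; the dictionary-of-arrays representation here is purely illustrative.

```python
import numpy as np

def uniform_model_soup(state_dicts):
    """Uniform 'model soup': average the weights of several models that were
    fine-tuned from the same base, parameter tensor by parameter tensor."""
    return {
        name: np.mean([sd[name] for sd in state_dicts], axis=0)
        for name in state_dicts[0]
    }

# Toy usage: three fine-tuned copies of a one-layer "model"
finetunes = [{"layer.weight": np.random.randn(2, 2)} for _ in range(3)]
souped = uniform_model_soup(finetunes)
```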

The research presented in this paper offers a related idea: can we use the vector of differences between the weights of a fine-tuned embedding model and its non-fine-tuned base to transfer that learning to another model? If we fine-tune another copy of the base model on next-token prediction over domain-specific text, and then add in the weight differences from the trained embedding model, will we get a better embedding model for the target domain?
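Here is a minimal sketch of that weight-difference transfer, again over toy state dicts. This is our reading of the idea as described above, not the authors' implementation, which involves further details.

```python
import numpy as np

def transfer_retrieval_adaptation(base, retrieval_finetuned, domain_finetuned):
    """Add the 'retrieval direction' (retrieval fine-tune minus base) to a copy
    of the base model that was fine-tuned on domain text with next-token prediction."""
    return {
        name: domain_finetuned[name] + (retrieval_finetuned[name] - base[name])
        for name in base
    }

# Toy usage with a single 2x2 weight matrix per model
base = {"layer.weight": np.zeros((2, 2))}
retrieval_model = {"layer.weight": np.full((2, 2), 0.1)}  # embedding fine-tune
domain_lm = {"layer.weight": np.full((2, 2), 0.5)}        # domain next-token fine-tune
adapted_embedder = transfer_retrieval_adaptation(base, retrieval_model, domain_lm)
```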

This has important advantages for adapting models to new domains: plain text data is plentiful, so you can train on next-token prediction alone and get improved embeddings as a result.
tagBenchmarking LLM-based Relevance Judgment Methods
This paper evaluates prompting strategies for using LLMs as relevance judges: binary (yes/no) judgments, graded evaluations (e.g., 0-4 scales), pairwise comparisons of documents for relevance, and “nugget-based” methods, which check whether a document contains specific pieces of information.
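As a flavor of the simplest of these strategies, here is a generic binary-judgment prompt builder. It is our own sketch, not one of the prompts evaluated in the paper.

```python
def binary_relevance_prompt(query: str, document: str) -> str:
    """Build a minimal yes/no relevance-judgment prompt for an LLM judge."""
    return (
        "You are judging search results.\n"
        f"Query: {query}\n"
        f"Document: {document}\n"
        "Is the document relevant to the query? Answer with exactly one word: yes or no."
    )

print(binary_relevance_prompt(
    "history of the BM25 ranking algorithm",
    "BM25 grew out of the probabilistic retrieval framework developed over several decades...",
))
```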
The authors conclude, from testing with GPT-4o and Llama 3, that results align better with human judgments when the LLM has fewer choices. Binary judgments and pairwise comparisons perform best and, with very strong models, are good enough for large-scale automated use. Good prompt design is a critical factor.
Nugget-based methods are easier for humans to interpret, but less reliable.
tagRankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation
This paper explores issues in using LLMs in three distinct roles: ranking results, judging relevance and evaluating results, and providing support functions like result summarization and query expansion.
It considers the consequences of LLM usage over the entire information cycle, as shown in the figure below, taken from the paper.

The paper concludes that there are significant issues with using LLM-based judgments to evaluate information retrieval systems that themselves rely on LLMs. The interaction of different LLM-based components can lead to biased and inaccurate results.
tagLLM-Driven Usefulness Labeling for IR Evaluation
This paper makes a distinction between relevance and usefulness in search results. Relevance, in their definition, is about whether a retrieved document is topically related to the query; usefulness is about whether the document is responsive to the query, i.e., whether it fulfills the user’s intent.
The paper focuses on whether LLMs can recognize and rank usefulness, and whether their judgments align with human ones. The authors find significant alignment between LLMs and human judges of usefulness. However, available LLMs struggle with cases where relevance and usefulness diverge, i.e., documents that are relevant but not useful. Giving LLMs more context information, rather than just text queries, improves results significantly.
tagLLM-based Relevance Assessment Still Can’t Replace Human Relevance Assessment
The paper discusses using LLMs for automatic relevance assessment in information retrieval, which would make it much easier to evaluate and train retrieval models, since there is never enough human-labeled relevance data. Although some recent studies claim LLMs can fully replace human evaluators, this paper identifies key limitations that prevent LLMs from substituting for human judgment:
- Insufficient evidence and limited generalizability: Current research lacks strong evidence that LLMs can fully replace human relevance judgments, especially across diverse datasets and real-world scenarios. Where positive results exist, it’s debatable whether they really apply to broader domains.
- Vulnerability to manipulation: Automated metrics, including LLM-based ones, can be gamed: scores can rise without any real improvement in retrieval quality.
- Self-preference bias: LLMs tend to favor text that resembles their own output, introducing a bias that compromises the objectivity of relevance assessments.
- Risks of overfitting: Relying on LLM-based evaluations may cause retrieval systems to optimize for the idiosyncrasies of specific LLMs, reducing performance in real-world use.
tagConclusion
The rapid rise of large language models has significantly transformed information retrieval, challenging established methods like BM25 and opening up new possibilities. The research showcased at SIGIR highlights this transformation.
However, language models do not turn information retrieval into a solved problem. The conference featured a wide range of innovative ideas aimed at aligning IR systems more closely with users' evolving needs. We truly enjoyed connecting with PhD students and experts, exchanging ideas, and sharing our vision for the future of search at Jina AI. We're excited to continue pushing the boundaries of what's possible in this field.