Event • August 11, 2025

What We Learned at SIGIR 2025

Sharing what we saw and learned at SIGIR 2025, feat. CLIP-AdaM, RE-AdaptIR and evaluations for LLM-based retrieval systems.
Michael Günther, Bo Wang, Scott Martens • 8 minutes read

SIGIR (Special Interest Group on Information Retrieval) is a top-tier information retrieval conference bringing together researchers, developers, industry experts, and educators from across the globe to share the latest ground-breaking research. Jina AI was at this year’s conference in Padua in July, presenting our work on late chunking at the Robust IR Workshop.

This year’s conference featured amazing research, especially in reranking methods, sparse retrieval models, and using LLMs in information retrieval. Highlights include keynotes from Stephen Robertson on the history and development of the BM25 ranking algorithm and from Iryna Gurevych on perspectives for the future of AI in scientific research. The experts and enthusiastic PhD students in attendance sparked many lively discussions. The conference took place at the Padova Congress Center, located in the heart of the city. Padua itself is a place rich in history and culture, and we thoroughly enjoyed our time there.

Late Chunking at Robust IR

The Robust IR workshop is a new event at SIGIR, held for the first time this year. It focused on how well information retrieval systems operate under difficult and exceptional situations, and how we can improve their robustness. The workshop was a mix of invited talks and oral presentations of accepted papers, as well as a panel discussion.

We presented our work on late chunking at the workshop's poster session. There were many insightful questions and comments, quite a few from people who had already read our pre-print.

Late Chunking in Long-Context Embedding Models
Chunking long documents while preserving contextual information is challenging. We introduce "Late Chunking," which leverages long-context embedding models to generate contextual chunk embeddings for better retrieval applications.
Jina AI • Michael Günther, Han Xiao
What Late Chunking Really Is & What It’s Not: Part II
Part 2 of our exploration of Late Chunking, a deep dive into why it is the best method for chunk embeddings and improving search/RAG performance.
Jina AI • Han Xiao
Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models
Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be over-compressed in the embeddings. Consequently, practitioners often split text documents into smaller chunks and encode them separately. However, chunk embeddings created in this way can lose contextual information from surrounding chunks, resulting in sub-optimal representations. In this paper, we introduce a novel method called late chunking, which leverages long context embedding models to first embed all tokens of the long text, with chunking applied after the transformer model and just before mean pooling - hence the term late in its naming. The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks. The method is generic enough to be applied to a wide range of long-context embedding models and works without additional training. To further increase the effectiveness of late chunking, we propose a dedicated fine-tuning approach for embedding models.
arXiv.org • Michael Günther
Late Chunking Poster at Robust-IR@SIGIR 2025: download Late_Chunking_Poster.pdf from Google Drive
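The core of late chunking, as described in the paper, is moving the chunking step to after the transformer and just before mean pooling. A minimal sketch of that pooling step, assuming the per-token embeddings from a long-context model are already available (function and variable names are illustrative):

```python
import numpy as np

def late_chunking(token_embeddings, chunk_spans):
    """Pool token embeddings into chunk embeddings *after* the
    transformer has processed the full document (the 'late' step).

    token_embeddings: (num_tokens, dim) array from a long-context
                      embedding model's final layer.
    chunk_spans:      list of (start, end) token-index pairs.
    """
    chunks = []
    for start, end in chunk_spans:
        # Mean pooling over each chunk's token span; every token vector
        # already encodes full-document context via attention.
        chunks.append(token_embeddings[start:end].mean(axis=0))
    return np.stack(chunks)

# Toy example: 10 "tokens" of dimension 4, split into two chunks.
tokens = np.arange(40, dtype=float).reshape(10, 4)
chunk_embs = late_chunking(tokens, [(0, 5), (5, 10)])
print(chunk_embs.shape)  # (2, 4)
```

Because pooling happens after attention, each chunk embedding reflects the surrounding document rather than the chunk text in isolation.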

Interesting Research

We found a lot of interesting research presented at SIGIR, but the work below stood out to us.

CLIP-AdaM: Adapting Multi-view CLIP for Open-set 3D Object Retrieval

This paper focuses on open-set 3D object retrieval: retrieving 3D objects from previously unseen categories, without the model having been trained on them. The approach renders 3D models from multiple angles and recognizes the objects with CLIP models trained on ordinary 2D images. One interesting finding is that CLIP models perform well when simply averaging the embeddings generated from different views of the object.

Figure 2: Model schematic for CLIP-AdaM

Beyond that, the paper proposes a novel training method for 3D object retrieval that learns to weight different views, as well as adaptive layers that tune the model for specific tasks while preventing overfitting on training data categories and improving zero-shot performance on new categories.
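The view-aggregation step can be sketched in a few lines. This is our own illustration, not the paper's code: uniform averaging corresponds to the simple baseline the paper reports already works well, while passing learned per-view weights corresponds to CLIP-AdaM's weighted variant.

```python
import numpy as np

def aggregate_views(view_embeddings, weights=None):
    """Combine CLIP embeddings of multiple rendered views of a 3D object.

    view_embeddings: (num_views, dim) array, one embedding per rendering.
    weights:         optional per-view weights (CLIP-AdaM learns these);
                     defaults to uniform averaging.
    """
    v = np.asarray(view_embeddings, dtype=float)
    if weights is None:
        weights = np.full(len(v), 1.0 / len(v))
    weights = np.asarray(weights, dtype=float)
    pooled = weights @ v  # weighted sum over views
    # L2-normalize so the result lives on the same unit sphere as
    # single-view CLIP embeddings, ready for cosine-similarity retrieval.
    return pooled / np.linalg.norm(pooled)
```

Normalizing the pooled vector keeps it directly comparable to text-side CLIP embeddings for retrieval.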

Optimizing Compound Retrieval Systems

Most existing ranking systems that combine multiple ranking models are based on ranking cascades: one ranking model is executed after another, each keeping only the best-scoring results from the previous stage.

This paper proposes a different approach that it calls compound retrieval systems: a framework for combining different rerankers to maximize ranking accuracy and computational efficiency. The authors propose to understand it as a generalization of the cascade approach, which executes multiple rerankers on different subsets of the results from previous ranking stages.

The figure below is given in the paper to show how different rerankers can be combined.

Figure 3: Schemata of the multi-stage reranking process, with the original caption.

In their example, a first-stage ranker produces an initial ranking. Then, the second stage uses two rerankers with different ranking approaches:

  • A point-wise ranking model that produces a relevance score for documents from the first ranker, based on a query.
  • A pair-wise ranking model that compares two documents and the query and outputs an estimated probability that one of the two is more relevant to the query than the other.

Each model has a selection policy that’s applied to the results from the previous ranking stage, for example, take only the top-n. There’s also a final ordering function that produces the end result. Both the selection policy and ordering function have parameters set by training, providing for a holistic optimization that yields better and more robust results.
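The interplay of selection policies and the ordering function can be sketched as a toy compound stage. All names and the score-fusion rule below are our own illustration, not the paper's learned policies: a point-wise scorer handles the top-n candidates, pair-wise comparisons of adjacent pairs adjust their scores, and a simple sort stands in for the learned ordering function.

```python
def compound_rerank(candidates, query, pointwise, pairwise, top_n=3):
    """Toy compound reranking stage (illustrative only).

    candidates: first-stage ranking, best first.
    pointwise(query, doc)   -> relevance score for one document.
    pairwise(query, a, b)   -> estimated probability that a is more
                               relevant than b.
    """
    # Selection policy: apply the expensive models only to the top_n.
    head = candidates[:top_n]
    scores = {d: pointwise(query, d) for d in head}
    # Nudge point-wise scores with pair-wise evidence between neighbours.
    for a, b in zip(head, head[1:]):
        p = pairwise(query, a, b)
        scores[a] += p - 0.5
        scores[b] -= p - 0.5
    # Ordering function: here a plain sort; in the paper it is learned
    # jointly with the selection policies.
    reranked = sorted(head, key=scores.get, reverse=True)
    return reranked + candidates[top_n:]  # tail passes through unchanged

docs = ["d1", "d2", "d3", "d4"]
pw = lambda q, d: {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.1}[d]
pr = lambda q, a, b: 1.0 if pw(q, a) > pw(q, b) else 0.0
print(compound_rerank(docs, "q", pw, pr))  # ['d2', 'd3', 'd1', 'd4']
```

The point of the framework is that which documents each reranker sees, and how their signals are fused, are trainable choices rather than a fixed cascade.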

RE-AdaptIR: Improving Information Retrieval through Reverse Engineered Adaptation

There has been a lot of research into using linear algebra techniques to optimize embedding model weights. For example, the model soup method improves model accuracy and robustness by averaging the weights of models that result from fine-tuning the same base model with different hyperparameters.

Figure 4: Poster presentation

The research presented in this paper offers a related idea: Can we use the vector of differences between the weights of a fine-tuned embedding model and its non-fine-tuned base to transfer learning from one model to another? If we fine-tune another copy of the base model on next-token prediction for domain-specific text, and then add in the weight differences from the trained embedding model, will we get a better embedding model for the target domain?

Figure 5: Explanatory figure giving a high-level picture of what RE-AdaptIR proposes.

This has important advantages for training models for new domains. It makes it possible to use plentiful plain text data to train for next-token prediction, and then enjoy improved embeddings as a result.
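The weight arithmetic behind this idea is simple to sketch. This is a schematic with flat per-parameter arrays, not the paper's implementation; the scaling factor alpha is our own illustrative knob:

```python
import numpy as np

def re_adapt(base, retrieval_tuned, domain_tuned, alpha=1.0):
    """Sketch of the reverse-engineered adaptation idea.

    base:            weights of the original base model.
    retrieval_tuned: weights after fine-tuning the base for retrieval.
    domain_tuned:    weights after continued next-token-prediction
                     training of the base on domain-specific text.

    The retrieval 'adapter' is recovered as (retrieval_tuned - base)
    and added onto the domain-adapted base, yielding a domain-adapted
    retriever without any labeled retrieval data for the new domain.
    """
    return {
        name: domain_tuned[name] + alpha * (retrieval_tuned[name] - base[name])
        for name in base
    }
```

Like model soups, this treats fine-tuning effects as vectors in weight space that can be added and transferred.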

Benchmarking LLM-based Relevance Judgment Methods

This paper evaluates prompting strategies for using LLMs as relevance judges, including binary (yes/no) judgments, graded evaluations (e.g., a 0–4 scale), pairwise comparisons of documents for relevance, and "nugget-based" methods, which decide whether a document contains specific pieces of information.

The authors conclude, from testing with GPT-4o and Llama 3, that results are more aligned with human judgments when the LLM has fewer choices. Binary judgments and pairwise comparisons perform the best, and with very strong AI models, are good enough for large-scale automated use. Good prompt design is a critical factor.

Nugget-based methods provide for human interpretability, but are less reliable.
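As a concrete picture of the binary strategy, here is a hypothetical minimal prompt template and answer parser. Both are our own illustration of the general approach, not the paper's prompts; the paper's point is that constraining the judge to few choices like this, with careful prompt design, aligns best with human judgments.

```python
def binary_judgment_prompt(query, document):
    """Build a minimal binary relevance-judgment prompt for an LLM judge."""
    return (
        f"Query: {query}\n"
        f"Document: {document}\n"
        "Is the document relevant to the query? Answer 'yes' or 'no'."
    )

def parse_binary_judgment(llm_output):
    """Map a free-form LLM reply onto a binary label, defaulting to
    non-relevant when the answer does not start with 'yes'."""
    return llm_output.strip().lower().startswith("yes")
```

Graded or pairwise strategies differ only in the template and in how the answer string is mapped back to a label.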

Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation

This paper explores issues in using LLMs in three distinct roles: ranking results, judging relevance and evaluating results, and support functions like result summarization and query expansion.

It considers the consequences of LLM usage over the entire information cycle, as shown in the figure below, taken from the paper.

Figure 6: An illustration of LLM usage in modern IR

The paper concludes that there are significant issues in using LLM-based judgments to evaluate information retrieval systems that themselves rely on LLMs. The interaction of different LLM-based components can lead to biased and inaccurate results.

LLM-Driven Usefulness Labeling for IR Evaluation

This paper makes a distinction between relevance and usefulness in search results. Relevance, in their definition, is about whether a retrieved document is topically related to the query; usefulness is about whether the document is responsive to the query, i.e., whether it fulfills the user's intent.

Its focus is on whether LLMs can recognize and rank usefulness and whether their judgments align with human ones. They conclude that there is significant alignment between human judges of usefulness and LLMs. However, available LLMs struggle with cases where relevance and usefulness are not aligned, i.e., relevant but not useful documents. The authors find that giving LLMs more context information, instead of just text queries, improves results significantly.

LLM-based Relevance Assessment Still Can’t Replace Human Relevance Assessment

The paper discusses using LLMs for automatic relevance assessment in information retrieval, which would make it much easier to train retrieval models, since there’s never enough human-ranked data. Although some recent studies claim LLMs can fully replace human evaluators, this paper identifies key limitations that prevent LLMs from substituting for human judgment.

  • Insufficient evidence and limited generalizability of current research: Current research lacks strong evidence that LLMs can fully replace human relevance judgments, especially across diverse datasets and real-world scenarios. Where positive results exist, it’s debatable if they really apply to broad domains.
  • Vulnerability to manipulation: Automated metrics, including those based on LLMs, can be easily manipulated. It’s very easy to improve scores without truly improving performance.
  • Self-preference bias: LLMs tend to favor outputs similar to their own training data, introducing bias that compromises the objectivity of relevance assessments.
  • Risks of overfitting: Relying on LLM-based evaluations may cause retrieval systems to optimize for the idiosyncrasies of specific LLMs, reducing performance in real-world use.

Conclusion

The rapid rise of large language models has significantly transformed information retrieval, replacing established methods like BM25 and opening up new possibilities. The research showcased at SIGIR highlights this transformation.

However, language models do not turn information retrieval into a solved problem. The conference featured a wide range of innovative ideas aimed at aligning IR systems more closely with users' evolving needs. We truly enjoyed connecting with PhD students and experts, exchanging ideas, and sharing our vision for the future of search at Jina AI. We're excited to continue pushing the boundaries of what's possible in this field.
