Tech blog
October 03, 2024

What Late Chunking Really Is & What It’s Not: Part II

Part 2 of our exploration of Late Chunking, a deep dive into why it is the best method for chunk embeddings and improving search/RAG performance.
Han Xiao • 9 minutes read
💡 We highly recommend reading Part I first, as this article offers a more in-depth view, focusing on common misunderstandings and comparisons. Recommended reading order: Part I, Part II, then the research paper.
Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models
Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be over-compressed in the embeddings. Consequently, practitioners often split text documents into smaller chunks and encode them separately. However, chunk embeddings created in this way can lose contextual information from surrounding chunks, resulting in sub-optimal representations. In this paper, we introduce a novel method called late chunking, which leverages long context embedding models to first embed all tokens of the long text, with chunking applied after the transformer model and just before mean pooling - hence the term late in its naming. The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks. The method is generic enough to be applied to a wide range of long-context embedding models and works without additional training. To further increase the effectiveness of late chunking, we propose a dedicated fine-tuning approach for embedding models.
arXiv.org · Michael Günther

Chunking a long document involves two issues. The first is determining the breakpoints, i.e., how to segment the document. You might consider fixed token lengths, a fixed number of sentences, or more advanced techniques such as regex or semantic segmentation models. Accurate chunk boundaries not only improve the readability of the search results but also ensure that the chunks fed to an LLM in a RAG system are precise and sufficient: no more, no less.

The second issue is the loss of context within each chunk. Once the document is segmented, the natural next step for most people is to embed each chunk separately in a batch process. However, this leads to a loss of global context from the original document. Much previous work has focused on the first issue, arguing that better boundary detection improves semantic representation. For example, "semantic chunking" groups sentences with high cosine similarity in the embedding space to minimize the disruption of semantic units.

In our view, these two issues are almost orthogonal and can be tackled separately. If we had to prioritize, we'd say the second issue is more critical.

Diagram of the issues affecting contextual information, divided into "Issue 1: Breakpoints" and "Issue 2: Preservation methods".

Late Chunking for Context Loss

Late chunking starts by addressing the second issue: loss of context. It's not about finding the ideal breakpoints or semantic boundaries. You still need regex, heuristics, or other techniques to divide a long document into small chunks. But instead of embedding each chunk as soon as it is segmented, late chunking first encodes the entire document in one context window (8192 tokens for jina-embeddings-v3). Then, it follows the boundary cues to apply mean pooling for each chunk, hence the term "late" in late chunking.

Diagram comparing "Naive Chunking" and "Late Chunking" methods for processing long documents with labeled steps.
Late chunking still requires boundary cues, but the key difference is when those cues are used. In late chunking, the cues are only applied after the entire document has been embedded, and they’re used to determine the pooling span.
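To make this concrete, here is a minimal sketch of late chunking in Python, assuming a long-context, mean-pooling embedding model that exposes token-level hidden states via Hugging Face transformers. The model name, the offset-mapping trick, and the assumption that chunks are contiguous spans of the document are illustrative choices, not the official implementation (see our GitHub repo for that):

```python
# Minimal late-chunking sketch: embed the whole document once, then mean-pool
# the contextualized token embeddings inside each chunk's span.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "jinaai/jina-embeddings-v2-small-en"  # assumed example; any 8K-context, mean-pooling model works
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True)

def late_chunking(document: str, chunks: list[str]) -> list[torch.Tensor]:
    # 1. Encode the entire document in a single context window.
    enc = tokenizer(document, return_tensors="pt", return_offsets_mapping=True,
                    truncation=True, max_length=8192)
    offsets = enc.pop("offset_mapping")[0].tolist()          # char span of every token
    with torch.no_grad():
        token_embeddings = model(**enc).last_hidden_state[0]  # (seq_len, dim)

    # 2. Only now apply the boundary cues: pool token embeddings per chunk.
    chunk_embeddings, cursor = [], 0
    for chunk in chunks:
        start = document.index(chunk, cursor)  # assumes chunks are contiguous spans of the document
        end = start + len(chunk)
        cursor = end
        # tokens whose character span lies inside this chunk (skip special tokens with (0, 0) offsets)
        idx = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end and e > s]
        chunk_embeddings.append(token_embeddings[idx].mean(dim=0))
    return chunk_embeddings
```

The only difference from naive chunking is the order of operations: the transformer runs once over the full document, and pooling happens per chunk afterwards.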

Late Chunking is Resilient to Poor Boundary Cues

What's really interesting is that experiments show late chunking eliminates the need for perfect semantic boundaries, which partially addresses the first issue mentioned above. In fact, late chunking applied to fixed-token boundaries outperforms naive chunking with semantic boundary cues. Simple segmentation models, like those using fixed-length boundaries, perform on par with advanced boundary detection algorithms when paired with late chunking. We tested three different sizes of embedding models, and results show that all of them consistently benefit from late chunking across all test datasets. That said, the embedding model itself remains the most significant factor in performance—there is not a single instance where a weaker model with late chunking outperforms a stronger model without it.

Scatter plot showing the percentage of relative improvement of various model and boundary-cue combinations over the baseline.
Relative retrieval improvement over the baseline (i.e., jina-embeddings-v2-small with fixed token length boundary cues and naive chunking). As part of an ablation study, we tested late chunking with different boundary cues (fixed token length, sentence boundaries, and semantic boundaries) and different models (jina-embeddings-v2-small, nomic-v1, and jina-embeddings-v3). Based on their performance on MTEB, the ranking of these three embedding models is: jina-embeddings-v2-small < nomic-v1 < jina-embeddings-v3. However, the focus of this experiment is not on evaluating the performance of the embedding models themselves, but on understanding how a better embedding model interacts with late chunking and boundary cues. For the details of the experiment, please check our research paper.
Combo                          SciFact   NFCorpus   FiQA   TRECCOVID
Baseline                       64.2      23.5       33.3   63.4
Late                           66.1      30.0       33.8   64.7
Nomic                          70.7      35.3       37.0   72.9
Jv3                            71.8      35.6       46.3   73.0
Late + Nomic                   70.6      35.3       38.3   75.0
Late + Jv3                     73.2      36.7       47.6   77.2
SentBound                      64.7      28.3       30.4   66.5
Late + SentBound               65.2      30.0       33.9   66.6
Nomic + SentBound              70.4      35.3       34.8   74.3
Jv3 + SentBound                71.4      35.8       43.7   72.4
Late + Nomic + SentBound       70.5      35.3       36.9   76.1
Late + Jv3 + SentBound         72.4      36.6       47.6   76.2
SemanticBound                  64.3      27.4       30.3   66.2
Late + SemanticBound           65.0      29.3       33.7   66.3
Nomic + SemanticBound          70.4      35.3       34.8   74.3
Jv3 + SemanticBound            71.2      36.1       44.0   74.7
Late + Nomic + SemanticBound   70.5      36.9       36.9   76.1
Late + Jv3 + SemanticBound     72.4      36.6       47.6   76.2

Note that being resilient to poor boundaries doesn't mean we can ignore them: they still matter for both human and LLM readability. Here's how we see it: when optimizing segmentation, i.e., the first issue above, we can focus fully on readability without worrying about semantic or context loss. Late chunking handles good or bad breakpoints alike, so readability is all you need to care about.

Late Chunking is Bidirectional

Another common misconception about late chunking is that its conditional chunk embeddings rely only on previous chunks without "looking ahead." This is incorrect. The conditional dependency in late chunking is actually bi-directional, not one-directional. This is because the attention matrix in the embedding model, an encoder-only transformer, is fully connected, unlike the masked triangular matrix used in auto-regressive models. Formally, the embedding of chunk $k$ is $v_k \sim Q(c_k \mid D)$ rather than $v_k \sim Q(c_k \mid c_1, c_2, \cdots, c_{k-1})$, where $Q$ denotes a factorization of the language model. This also explains why late chunking is not dependent on precise boundary placement.

Diagram of a transformer model, with a detailed encoder on the left and a decoder on the right, labeled with tokens and embeddings.
Unlike decoder-only models with masked self-attention, embedding models are typically encoder-only with a full attention matrix. This means that each token embedding is conditioned on all other tokens within the same context window, which, in the case of jina-embeddings-v3, includes up to 8191 other tokens. As a result, the chunk embedding carries global context information in both directions.
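For intuition, here is a toy, model-agnostic illustration of the difference between the two attention patterns. It is not any specific model's code; it simply contrasts the full mask of an encoder with the causal mask of a decoder:

```python
# Encoder-only vs. decoder-only attention patterns, illustrated as masks.
import torch

seq_len = 6
full_mask = torch.ones(seq_len, seq_len)                 # encoder-only: token i attends to every token
causal_mask = torch.tril(torch.ones(seq_len, seq_len))   # decoder-only: token i attends only to tokens <= i

print(full_mask[2])    # tensor([1., 1., 1., 1., 1., 1.]) -> token 2 also "sees" later tokens
print(causal_mask[2])  # tensor([1., 1., 1., 0., 0., 0.]) -> token 2 is blind to tokens 3..5
```

Because every token embedding is produced under the full mask, the mean-pooled chunk embedding inherits context from both directions.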

Late Chunking Can Be Trained

Late chunking does not require additional training for embedding models. It can be applied to any long-context embedding model that uses mean pooling, making it highly attractive for practitioners. That said, if you're working on tasks like question-answering or query-document retrieval, performance can still be further improved with some fine-tuning. Specifically, the training data consists of tuples containing:

  • A query (e.g., a question or search term).
  • A document that contains relevant information to answer the query.
  • A relevant span within the document, which is the specific chunk of text that directly addresses the query.

The model is trained by pairing queries with their relevant spans, using a contrastive loss function like InfoNCE. This ensures that relevant spans are closely aligned with the query in the embedding space, while unrelated spans are pushed further apart. As a result, the model learns to focus on the most relevant parts of the document when generating chunk embeddings. For more details, please refer to our research paper.
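Below is a minimal sketch of such a contrastive objective, an InfoNCE-style loss over batched query/span embedding pairs. The function and variable names are illustrative; this is not the paper's exact training code:

```python
# InfoNCE-style contrastive loss: pull each query towards its relevant span,
# push it away from the other spans in the batch (in-batch negatives).
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, span_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """query_emb, span_emb: (batch, dim); row i of span_emb is the positive span for query i."""
    q = F.normalize(query_emb, dim=-1)
    s = F.normalize(span_emb, dim=-1)
    logits = q @ s.T / temperature        # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0))      # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```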

Late Chunking vs. Contextual Retrieval

Introducing Contextual Retrieval

Soon after late chunking was introduced, Anthropic released a separate strategy called Contextual Retrieval. Anthropic's method is a brute-force approach to addressing lost context, and works as follows:

  1. Each chunk is sent to the LLM along with the full document.
  2. The LLM adds relevant context to each chunk.
  3. This results in richer and more informative embeddings.

In our view, this is essentially context enrichment, where global context is explicitly hardcoded into each chunk using an LLM, which is expensive in terms of cost, time, and storage. Additionally, it's unclear whether this approach is resilient to chunk boundaries, as the LLM relies on accurate and readable chunks to enrich the context effectively. In contrast, late chunking is highly resilient to boundary cues, as demonstrated above. It requires no additional storage since the embedding size remains the same. Despite leveraging the full context length of the embedding model, it is still significantly faster than using an LLM to generate enrichment. In the qualitative study of our research paper, we show that Anthropic's contextual retrieval performs similarly to late chunking. However, late chunking provides a more low-level, generic, and natural solution by leveraging the inherent mechanics of the encoder-only transformer.
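For comparison, here is a rough sketch of the recipe described above. Note that `call_llm` and `embed` are hypothetical stubs standing in for a real LLM API and embedding model; the per-chunk LLM call over the full document is exactly where the cost, latency, and storage overhead come from:

```python
# Sketch of LLM-based context enrichment ("contextual retrieval"): each chunk is
# sent to the LLM together with the full document, and the generated context is
# prepended to the chunk before embedding. `call_llm` and `embed` are placeholders.
def call_llm(prompt: str) -> str:           # stand-in for a real LLM API call
    return "LLM-generated context for the chunk"

def embed(text: str) -> list[float]:        # stand-in for a real embedding model
    return [0.0] * 768

def contextual_retrieval(document: str, chunks: list[str]) -> list[list[float]]:
    enriched = []
    for chunk in chunks:
        prompt = (f"<document>{document}</document>\n"
                  f"Situate the following chunk within the document:\n{chunk}")
        context = call_llm(prompt)           # one long-prompt LLM call per chunk
        enriched.append(embed(context + "\n" + chunk))
    return enriched
```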

Which Embedding Models Support Late Chunking?

Late chunking isn’t exclusive to jina-embeddings-v3 or v2. It’s a fairly generic approach that can be applied to any long-context embedding model that uses mean pooling. For example, in this post, we show that nomic-v1 supports it too. We warmly welcome all embedding providers to implement support for late chunking in their solutions.

As a model user, when evaluating a new embedding model or API, you can follow these steps to check if it might support late chunking:

  1. Single Output: Does the model/API give you just one final embedding per sentence instead of token-level embeddings? If yes, then it likely can't support late chunking (this is especially common with web APIs).
  2. Long-Context Support: Does the model/API handle contexts of at least 8192 tokens? If not, late chunking won’t be applicable—or more precisely, it doesn’t make sense to adapt late chunking for a short-context model. If yes, ensure it actually performs well with long contexts, not just claims to support them. You can usually find this info in the model’s technical report, such as evaluations on LongMTEB or other long-context benchmarks.
  3. Mean Pooling: For self-hosted models or APIs that provide token-level embeddings before pooling, check if the default pooling method is mean pooling. Models using CLS or max pooling aren’t compatible with late chunking.

In summary, if an embedding model supports long-context and uses mean pooling by default, it can easily support late chunking. Check out our GitHub repo for implementation details and further discussion.
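For a self-hosted model, a quick sanity check along these lines might look like the sketch below. The model name is an assumption; substitute the one you want to evaluate, and confirm the pooling strategy from its model card:

```python
# Quick checks for late-chunking compatibility of a self-hosted HF model:
# long-context support, token-level outputs, and (from the model card) mean pooling.
from transformers import AutoConfig, AutoModel, AutoTokenizer

MODEL = "jinaai/jina-embeddings-v2-small-en"   # assumed example model

# 1. Long-context support: look for a position limit of at least 8192 tokens.
config = AutoConfig.from_pretrained(MODEL, trust_remote_code=True)
max_len = getattr(config, "max_position_embeddings", None)
print("max positions:", max_len, "-> long-context OK" if (max_len or 0) >= 8192 else "-> too short")

# 2. Token-level embeddings: the model must expose per-token hidden states,
#    otherwise pooling cannot be deferred until after chunk boundaries are known.
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True)
out = model(**tokenizer("a quick check", return_tensors="pt"))
print("token-level output shape:", out.last_hidden_state.shape)   # (1, seq_len, dim)

# 3. Mean pooling: verify from the model card or pooling config; CLS or max
#    pooling is not compatible with late chunking.
```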

Conclusion

So, what is late chunking? Late chunking is a straightforward method for generating chunk embeddings using long-context embedding models. It’s fast, resilient to boundary cues, and highly effective. It’s not a heuristic or over-engineering—it's a thoughtful design rooted in a deep understanding of the transformer mechanism.

Today, the hype surrounding LLMs is undeniable. In many cases, problems that could be efficiently tackled by smaller models like BERT are instead handed off to LLMs, driven by the allure of larger, more complex solutions. It’s no surprise that big LLM providers push for greater adoption of their models, while embedding providers advocate for embeddings — both are playing to their commercial strengths. But in the end, it’s not about the hype, it's about action, about what truly works. Let the community, the industry, and, most importantly, time reveal which approach is truly leaner, more efficient, and built to last.

Be sure to read our research paper, and we encourage you to benchmark late chunking in various scenarios and share your feedback with us.
