In February 2025, a team of AI researchers published the NoLiMa paper, which introduces a novel benchmark for evaluating large language models' ability to handle long contexts.

This paper introduces a significant change to the traditional Needle-in-a-Haystack (NIAH) benchmark by removing literal matches between questions and the needle (relevant information) hidden in the haystack (irrelevant text).
It highlights a critical limitation in current LLMs: they heavily rely on surface-level pattern matching, and their ability to perform deep associative reasoning deteriorates rapidly with increasing context length.
Building on these insights, we aim to investigate whether similar performance patterns occur in embedding models, specifically focusing on jina-embeddings-v3. Since the effectiveness of RAG systems depends critically on the quality of retrieval models, we seek to extend NoLiMa's research through controlled experiments addressing two core questions:
- How do embedding models handle needle-in-a-haystack retrieval across different context lengths when forced to make semantic leaps beyond literal keyword matches?
- Can strategic query augmentation with semantically similar content mitigate this performance gap?
The stark contrast observed in LLMs—robust with lexical matching but vulnerable with semantic variations—suggests embedding-based retrieval systems might face similar challenges when moving beyond surface-level term matching, potentially revealing fundamental limitations in current semantic search technologies.
Needles and Haystacks Construction
Needles Construction
Traditional needle-in-haystack tests use needles that reflect the wording of the question being searched for. For example:
- Question: “Which character has been to Dresden?”
- Needle: “Yuki lives in Dresden.”
But like NoLiMa, we want to test for semantic understanding rather than mere keyword matching, so we create one-hop variations (worded to avoid literal overlap with the question) with two different word orderings:
- Question: “Which character has been to Dresden?”
- Needle (default): “Actually, Yuki lives next to the Semper Opera House.”
- Needle (inverted): “The Semper Opera House is next to where Yuki lives.”
Following the paper’s methodology, we generate these needle-question groups (consisting of one question, one one-hop needle, and one inverted one-hop needle) across several categories, like the examples below:
Category | Question | Original needle (for reference) | One-hop needle | Inverted one-hop needle |
---|---|---|---|---|
Dietary restrictions | Which character cannot eat fish-based meals? | Alice cannot eat fish-based meals. | Then, Alice mentioned being vegan for years. | Being vegan was important to Alice for years. |
Medical conditions | Which character cannot drink milk? | Bob can’t drink milk. | Bob explained he was lactose intolerant. | Being lactose intolerant affected Bob daily. |
Language proficiency | Which character speaks French? | Charlie speaks French. | Actually, Charlie studied at the Sorbonne. | At the Sorbonne, Charlie completed his degree. |
Professional background | Which character is a musician? | Diane is a musician. | In 2013, Diane conducted at the Sydney Opera House. | The Sydney Opera House performance was conducted by Diane. |
Note that the original needles (literal keyword matches) are provided for reference, and not used in our experiments.
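For reference, each needle-question group can be thought of as a small record like the one below. This is purely an illustrative sketch of how the test data could be organized; the field names are not the exact schema we used.

```python
# Hypothetical representation of one needle-question group (Dresden example).
needle_question_group = {
    "category": "Location",
    "question": "Which character has been to Dresden?",
    "needle_default": "Actually, Yuki lives next to the Semper Opera House.",
    "needle_inverted": "The Semper Opera House is next to where Yuki lives.",
    # Kept only for reference; never inserted into a haystack in these experiments.
    "needle_original": "Yuki lives in Dresden.",
}
```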
Haystacks Construction
We started with ten public domain books, each containing at least 50,000 tokens, and randomly concatenated short snippets (each under 250 tokens) from them into haystacks of varying lengths: 128, 256, 512, 1024, 2048, 4096, and 8192 tokens. We then placed one needle in each haystack:

For a more concrete example, we’ll take the needle “Actually, Yuki lives next to the Semper Opera House” and put it into a 128-token haystack at position 50:
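As a minimal sketch of this construction step, the code below concatenates random snippets and splices the needle in at a token offset. It makes simplifying assumptions: tokens are approximated by whitespace-separated words rather than a real tokenizer, and `book_snippets` is a toy stand-in for the pool of sub-250-token excerpts drawn from the ten books.

```python
import random

def build_haystack(snippets, needle, context_length, needle_position):
    """Concatenate random book snippets up to `context_length` tokens, then
    splice the needle in at token offset `needle_position`.
    Tokens are approximated by whitespace-separated words in this sketch."""
    words = []
    while len(words) < context_length:
        words.extend(random.choice(snippets).split())
    words = words[:context_length]
    return " ".join(words[:needle_position] + needle.split() + words[needle_position:])

# Toy stand-in for the pool of sub-250-token excerpts from the ten books.
book_snippets = [
    "Call me Ishmael. Some years ago, never mind how long precisely, having "
    "little or no money in my purse, I thought I would sail about a little.",
    "It was the best of times, it was the worst of times, it was the age of "
    "wisdom, it was the age of foolishness.",
]

needle = "Actually, Yuki lives next to the Semper Opera House."
haystack = build_haystack(book_snippets, needle, context_length=128, needle_position=50)
```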

Using jina-embeddings-v3 to embed the texts, the similarity score between the question and the haystack text is:
Question-Haystack similarity = 0.2391
We then normalize the score by dividing this number by the similarity score between the question and the default needle (compared directly, with no haystack involved):
Question-Needle similarity = 0.3598
Normalized Question-Haystack similarity = 0.2391 / 0.3598 = 0.6644
This normalization is necessary because not all models produce the same similarity scores between two texts, and jina-embeddings-v3 tends to underestimate the similarity between two texts.
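A minimal sketch of this scoring step is shown below. It assumes jina-embeddings-v3 is loaded through Hugging Face transformers with `trust_remote_code=True`, whose remote code provides an `encode()` helper; the `task="text-matching"` argument selects the text-matching LoRA adapter mentioned later, and the exact call signature may differ between model versions.

```python
import numpy as np
from transformers import AutoModel

# Load jina-embeddings-v3; trust_remote_code provides the model's encode() helper.
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

question = "Which character has been to Dresden?"
needle = "Actually, Yuki lives next to the Semper Opera House."
haystack = "..."  # a 128-token haystack with the needle spliced in at position 50

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# task="text-matching" selects the symmetric text-matching adapter.
q_emb, n_emb, h_emb = model.encode([question, needle, haystack], task="text-matching")

question_haystack_sim = cos_sim(q_emb, h_emb)  # 0.2391 in the worked example above
question_needle_sim = cos_sim(q_emb, n_emb)    # 0.3598 in the worked example above
normalized_sim = question_haystack_sim / question_needle_sim  # ≈ 0.66
```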
For each needle (both default and inverted variants), we generated ten haystacks per context length, placing the needle at a different location in each haystack. For a given needle and context length, the haystacks would look something like this:

As a control, we also generated one haystack for each test condition without any needle. In total, that's 3,234 haystacks. We encoded each haystack with jina-embeddings-v3 (using the default text-matching LoRA), truncating any haystack that exceeded 8,192 tokens (the model's limit), and then encoded each corresponding question with the same model.
Evaluation Metrics
Our evaluation framework uses several metrics to assess embedding model performance across different context lengths:
Primary Metrics
Normalized Similarity Score
The core metric is a normalized similarity score that accounts for both the semantic similarity between the question and the entire context (question-haystack similarity) and the baseline similarity between the question and its corresponding default needle (question-needle similarity). This normalization ensures that the model's performance is evaluated relative to a meaningful reference point rather than on absolute similarity scores alone. The normalization involves computing the direct cosine similarity between each question and its corresponding needle (our baseline), then dividing the question-haystack similarity by this baseline score:
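Written out, with sim(·, ·) denoting cosine similarity, Q the question, H the haystack, and N the corresponding default needle:

normalized score = sim(Q, H) / sim(Q, N)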
Comparative Ratio to Random Chance
For any embedding model, cosine similarity scores between different query-document pairs are only directly comparable when the query remains the same. Therefore, beyond using normalized similarity scores, we also measure how often the question is more similar to the entire haystack than to a random passage of the same length without a needle.
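In code, this boils down to counting how often the needle-bearing haystack wins the comparison against its length-matched, needle-free control. The sketch below assumes the two similarity arrays have already been computed; variable names are illustrative.

```python
import numpy as np

def comparative_ratio(sim_with_needle, sim_without_needle):
    """Fraction of cases where the question is closer to the haystack that
    actually contains the needle than to a needle-free haystack of the same
    length. 1.0 = always picks the correct haystack, 0.5 ≈ random chance."""
    sim_with_needle = np.asarray(sim_with_needle)
    sim_without_needle = np.asarray(sim_without_needle)
    return float(np.mean(sim_with_needle > sim_without_needle))
```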
Secondary Metrics
Separation Analysis
This metric evaluates how well the model distinguishes between relevant and irrelevant content. It includes the mean separation, which represents the difference between positive examples (passages containing the answer) and negative examples (passages not containing the answer), and the AUC (Area Under the Curve) score, which measures discrimination ability based on the area under the ROC (Receiver Operating Characteristic) curve.
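A minimal sketch of how these two quantities could be computed, assuming scikit-learn is available and the inputs are similarity scores for needle-bearing (positive) and needle-free (negative) haystacks:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def separation_metrics(pos_scores, neg_scores):
    """Mean separation and ROC AUC between needle-bearing (positive) and
    needle-free (negative) haystacks, given their similarity scores."""
    pos_scores, neg_scores = np.asarray(pos_scores, float), np.asarray(neg_scores, float)
    mean_separation = pos_scores.mean() - neg_scores.mean()
    labels = np.concatenate([np.ones_like(pos_scores), np.zeros_like(neg_scores)])
    scores = np.concatenate([pos_scores, neg_scores])
    auc = roc_auc_score(labels, scores)
    return mean_separation, auc
```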
Position Effects
We analyze how needle placement affects performance through the correlation coefficient between position and similarity score, regression slope showing performance change across positions, and position-binned performance analysis.
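A minimal sketch of that analysis, assuming SciPy and treating needle positions and similarity scores as parallel arrays (function and variable names are illustrative):

```python
import numpy as np
from scipy import stats

def position_effects(positions, scores, n_bins=4):
    """Correlation and regression slope of similarity score vs. needle
    position, plus the mean score per position bin."""
    positions, scores = np.asarray(positions, float), np.asarray(scores, float)
    correlation, _ = stats.pearsonr(positions, scores)
    slope = stats.linregress(positions, scores).slope
    # Bin positions into n_bins groups (sorted by position) and average each bin.
    bins = np.array_split(np.argsort(positions), n_bins)
    binned_means = [float(scores[idx].mean()) for idx in bins]
    return correlation, slope, binned_means
```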
Findings
Degradation of Similarity Score and Correctness
Our results clearly show performance degrades as context length increases, with the mean similarity score dropping from 0.37 at 128 tokens to 0.10 at 8K tokens, following a non-linear trend with a sharp decline between 128 and 1K tokens.

In the figure below, we show that inverting the needle makes little difference to the normalized similarity score. Both the default needle (e.g. “Actually, Yuki lives next to the Semper Opera House”) and the inverted needle (e.g. “The Semper Opera House is next to where Yuki lives”) show almost identical performance:

The dataset’s different semantic connections exhibit varying performance, with location-landmark pairs maintaining the strongest results, while dietary and medical condition connections degrade more quickly:

Comparing the results to random chance backs up our findings: the bigger the haystack, the closer the results get to randomness, i.e. for a given question we are almost as likely to select a random needle-free passage as the haystack that actually contains the answer:

Again, we see varying performance based on different semantic connections, with some (like dietary restrictions) dropping well below random chance even at relatively short contexts, while others (like locations and landmarks) display much better performance regardless of context length:

Inverting the needle has little effect on performance. In the below graph, we show the comparative ratio of preferring the correct haystack to random chance, split by whether the placed needle contained the answer in default order or inverted order:

Since the results for default- and inverted-order needles follow the same trend, we won't split the analysis along this criterion in what follows.
Can We Separate Positive from Negative Results?
One of our most important findings comes from analyzing how well embedding models can distinguish relevant from irrelevant content across different context lengths. This "separation analysis" reveals that retrieval correctness falls rapidly between context lengths of 128 and 1,000 tokens, and then continues to drop, albeit at a slower rate:

For short contexts (128 tokens), the model shows strong separation with a mean difference of 0.1 and clear discrimination, achieving an AUC of 0.81 (meaning that 81% of the time, the model ranks a relevant passage higher than an irrelevant one). This indicates that in shorter contexts, the model can reliably distinguish passages that contain the answer from those that don’t.
However, this rapidly deteriorates as context length increases. By 1,000 tokens, separation drops by 60% to 0.040, and AUC decreases to 0.66, signaling a notable drop in performance. At 8,000 tokens, there’s minimal separation (0.001) and near-random discrimination, with an AUC of just 0.50. This pattern reveals a crucial insight: even when models can compute reasonable similarity scores in longer contexts, they can barely use these scores to tell relevant from irrelevant information. By 8,000 tokens, the model’s ability to differentiate relevant content is essentially random chance.
The speed of this degradation as context grows is striking. Raw similarity scores drop by about 75% from 128 to 8,000 tokens, but separation metrics decline by nearly 99% over the same span. The effect size shows a comparably steep decline, falling by 98.6%. This suggests that embedding models' struggles with long contexts go beyond just reduced similarity scores: their fundamental ability to identify relevant information breaks down far more severely than previously understood.
How Does the Needle Position Affect the Core Metrics?
While core performance metrics are usually best when the needle is at the beginning of the haystack, performance doesn't degrade uniformly as the needle moves deeper into the context:

We also see that performance is best when the needle is at the start of a given context, and in short contexts we see a small bump in performance when the needle is placed towards the end. However, throughout all contexts we see a drop in performance when the needle is in middle positions:

What Effect Does Query Expansion Have on the Results?
We recently released a blog post on query expansion, a technique used in search systems to improve search performance by adding relevant terms to queries.

In the post, we used an LLM to generate expansion terms, which were then added to the queries before embedding to improve retrieval performance. The results showed significant improvements. Now, we want to examine whether (and how much) the technique improves results for needle-in-a-haystack search. For example, given a query:
Which character has been to Dresden?
We use an LLM (Gemini 2.0) to expand it and add 100 additional terms that look like this:
Which character has been to Dresden? Character: fictional character literary character protagonist antagonist figure persona role dramatis personae

Dresden: Dresden Germany; bombing of Dresden World War II historical fiction Kurt Vonnegut Slaughterhouse-Five city in Saxony Elbe River cultural landmark

Has been to: visited traveled to journeyed to presence in appears in features in set in takes place in location setting
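As a rough sketch, an expanded query like this could be produced and then embedded like any other query. The snippet below assumes the google-generativeai client; the prompt wording and model name are illustrative rather than the exact setup described in the query expansion post.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
llm = genai.GenerativeModel("gemini-2.0-flash")  # model name is illustrative

query = "Which character has been to Dresden?"
prompt = (
    "Generate about 100 search expansion terms (synonyms, related entities, "
    f"paraphrases) for the following query, as plain text:\n\n{query}"
)
expansion_terms = llm.generate_content(prompt).text

# The expanded query is simply the original query followed by the terms,
# and is then embedded with jina-embeddings-v3 like any other query.
expanded_query = f"{query} {expansion_terms}"
```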
How Much Does Query Expansion Help Match the Needle to the Haystack?
For our experiment, we generated three sets of expansion terms (as described in the original post): 100, 150, and 250 terms. We then re-ran the same set of experiments three times, once with each set of expansion terms.
Results with all expansion sets showed clear degradation as context length increased, with a similar effect to not using query expansion at all (Figures 4 & 7):

Compared to unexpanded queries, all query expansion conditions showed the same pattern of degraded performance as context grew. The degradation trend is also still non-linear with a sharp decline between 128 and 1K tokens:

However, examining the comparative ratio shows that query expansion has clear benefits: The model is much more likely to select the haystack with the needle over the one without. In contrast, without query expansion the probability of selecting the correct passage dropped so much that, at a haystack size of 8K tokens, it was nearly the same as picking a random passage.
How Do We Explain Needle Matching Results with Query Expansion?
These results align with findings from both the NoLiMa paper and the query expansion research, and can be explained as follows:
- Quality vs. quantity trade-off: The better performance of 100-term expansion, compared to 150 and 250 terms, suggests there's an optimal point where additional terms start adding more noise than signal. The 250-term expansion likely introduces terms with weaker semantic relationships to the original query, which become counterproductive at longer contexts.
- Context length remains the primary challenge: Despite the benefits of query expansion, performance still degrades significantly with increasing context length. This suggests that even with expansion, the fundamental architectural limitation of attention-based models in long contexts persists.
- Practical threshold identification: The comparative ratio staying above 0.5 indicates that expansion maintains above-random-chance performance even at 8K tokens, providing a practical way to extend the effective context window for embedding models. Comparison to random chance shows that, even when presented with long context documents, expanding the query makes it more likely to find the correct answer (i.e. the needle) than an incorrect one. This is an improvement compared to non-expanded queries, where the chance of finding the correct answer approaches random as the context length increases.
What Role Does Lexical Matching Play in Embeddings?
In the experiments above, we measured the effectiveness of embedding models in making semantic "one-hop" inferences in long-context passages by removing all possibility of literal matching. We found that, even with query expansion, the embedding model's ability to find relevant passages deteriorates as the context length grows. This effect is significant, and the finding is noteworthy because we would normally expect an embedding model to make such inferences without additional assistance: when replacing literal matches with one-hop variations (e.g. "Dresden" → "Semper Opera House"), all we're doing is replacing one concept with another closely related one.
Let's now take the bull by the horns and ask the question directly: does literal matching really play a significant role in semantic matching, or does the effect of context length overwhelm it? To answer this question, we re-ran our tests with needles containing literal matches, e.g.
- Question: “Which character has been to Dresden?”
- Needle (default): “Actually, Yuki lives in Dresden.”
- Needle (inverted): “Dresden is where Yuki lives.”
Notice that instead of requiring a one-hop inference (the Semper Opera House is in Dresden, so a character who lives next to it has been to Dresden), these needles directly name the character who lives in Dresden.
Having reformulated all 22 question-needle pairs in this way, we re-ran our experiments with all included context lengths and needle placements, using the same embedding model jina-embeddings-v3.



The results are striking. Even with literal matches in the context, the model's ability to distinguish the correct answer from a random one deteriorates rapidly as the context length grows, although it maintains a slight advantage over needles with no literal match at all.
This ultimately proves that the ability of an embedding model to find a needle in a haystack is affected far more by the size of the haystack (and placement of the needle in it) than the semantic formulation of the needle.
Conclusion
Our findings with embedding models align with the NoLiMa paper's findings on LLMs: context size is highly determinative of correct matching and retrieval. We show that this is true even when there is an exact letter-for-letter word match.
The problem is not the ability of an embedding to perform semantic matching. Embedding models like jina-embeddings-v3 handle short contexts quite well, but their effectiveness declines as context length increases. Query expansion can reduce this effect to some degree, but retrieval quality still degrades over longer contexts. Furthermore, query expansion poses additional problems, since it is crucially important to identify expansion terms that improve retrieval without adding semantic noise. We are investigating ways to directly address needle-in-a-haystack retrieval and improve performance in the future jina-embeddings-v4.