Semantic similarity is what embedding models are built to measure, but those measurements are influenced by a lot of biasing factors. In this article, we're going to look at one pervasive source of bias in text embedding models: the input size.
Embeddings of longer texts generally show higher similarity scores when compared to other text embeddings, regardless of how similar the actual content is. While truly similar texts will still have higher similarity scores than unrelated ones, longer texts introduce a bias—making their embeddings appear more similar on average simply due to their length.
This has real consequences. It means that embedding models, by themselves, aren't able to measure relevance very well. With embeddings-based search, there is always a best match, but size bias means that you can't use the similarity score to decide if the best match, or any lesser matches, are actually relevant. You can't say that, for example, any match with a cosine higher than 0.75 is relevant because there can easily be a long document that matches at that level despite being completely irrelevant.
We're going to demonstrate this with some simple examples and show you why cosine similarity between text embeddings cannot serve as a general way to assess relevance.
Visualizing Size Bias
To show how size bias manifests, we're going to use Jina AI's latest embedding model jina-embeddings-v3 with the text-matching task option. We will also use text documents from a widely used IR dataset: the CISI dataset, which you can download from Kaggle.

This dataset is used for training IR systems, so it contains both queries and documents to match them. We're only going to use the documents, which are all in the file CISI.ALL. You can download it from the command line from an alternate source on GitHub with the command:
wget https://raw.githubusercontent.com/GianRomani/CISI-project-MLOps/refs/heads/main/CISI.ALL
CISI contains 1,460 documents. The basic statistics about the sizes of the texts and their size distributions are summarized in the table and histograms below:
| | in Words | in Sentences |
| --- | --- | --- |
| Average document size | 119.2 | 4.34 |
| Std. Deviation | 63.3 | 2.7 |
| Max size | 550 | 38 |
| Min size | 8 | 1 |
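If you want to reproduce these statistics yourself, a minimal sketch is below. It assumes the docs list built by the parsing code later in this section and NLTK's sentence tokenizer, so the exact counts may differ slightly depending on tokenization:
import nltk
import numpy as np

word_counts = np.array([len(doc.split()) for doc in docs])
sent_counts = np.array([len(nltk.sent_tokenize(doc)) for doc in docs])
for name, counts in (("words", word_counts), ("sentences", sent_counts)):
    print(f"in {name}: mean={counts.mean():.1f}, std={counts.std():.1f}, "
          f"max={counts.max()}, min={counts.min()}")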


Let's read the documents in Python and get embeddings for them. The code below assumes that the file CISI.ALL is in the local directory:
with open("CISI.ALL", "r", encoding="utf-8") as inp:
    cisi_raw = inp.readlines()

docs = []
current_doc = ""
in_text = False
for line in cisi_raw:
    if line.startswith("."):
        # Any new field marker ends the current .W (text) section.
        in_text = False
        if current_doc:
            docs.append(current_doc.strip())
            current_doc = ""
        if line.startswith(".W"):
            in_text = True
    else:
        if in_text:
            current_doc += line
This will fill the list docs with 1,460 documents. You can inspect them:
print(docs[0])
The present study is a history of the DEWEY Decimal
Classification. The first edition of the DDC was published
in 1876, the eighteenth edition in 1971, and future editions
will continue to appear as needed. In spite of the DDC's
long and healthy life, however, its full story has never
been told. There have been biographies of Dewey
that briefly describe his system, but this is the first
attempt to provide a detailed history of the work that
more than any other has spurred the growth of
librarianship in this country and abroad.
Now, we’re going to construct embeddings for each text using jina-embeddings-v3. For this, you will need an API key from the Jina AI website. You can get a free key for up to 1 million tokens of embeddings, which is sufficient for this article.
Put your key in a variable:
api_key = "<Your Key>"
Now, generate embeddings using the text-matching task with jina-embeddings-v3. This code processes the texts in docs in batches of 10.
import requests
import json
from numpy import array

embeddings = []
url = "https://api.jina.ai/v1/embeddings"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + api_key
}

i = 0
while i < len(docs):
    print(f"Got {len(embeddings)}...")
    data = {
        "model": "jina-embeddings-v3",
        "task": "text-matching",
        "late_chunking": False,
        "dimensions": 1024,
        "embedding_type": "float",
        "input": docs[i:i+10]  # send a batch of 10 documents per request
    }
    response = requests.post(url, headers=headers, data=json.dumps(data))
    for emb in response.json()['data']:
        embeddings.append(array(emb['embedding']))
    i += 10
For each text, there will be a 1024-dimension embedding in the list embeddings. You can see what that looks like:
print(embeddings[0])
array([ 0.0352382 , -0.00594871,  0.03808545, ..., -0.01147173,
       -0.01710563,  0.01109511], shape=(1024,))
Now, we calculate the cosines between all pairs of embeddings. First, let's define the cosine function cos_sim using numpy:
from numpy.linalg import norm

def cos_sim(a, b):
    return float((a @ b.T) / (norm(a) * norm(b)))
Then, compute the cosines of each of the 1,460 embeddings compared to the other 1,459:
all_cosines = []
for i, emb1 in enumerate(embeddings):
    for j, emb2 in enumerate(embeddings):
        if i != j:
            all_cosines.append(cos_sim(emb1, emb2))
The result is a list of 2,130,140 values. Their distribution should approximate the cosines between “random” documents in the same language and register. The table and histogram below summarize the results.
| Number of texts | 1,460 |
| --- | --- |
| Number of cosines | 2,130,140 |
| Average | 0.343 |
| Std. Deviation | 0.116 |
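The summary figures can be computed directly from the list of cosines, for example:
import numpy as np

cos_array = np.array(all_cosines)
print(f"Number of cosines: {len(cos_array):,}")
print(f"Average: {cos_array.mean():.3f}")
print(f"Std. Deviation: {cos_array.std():.3f}")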

These documents, even though not related to each other, typically have cosines well above zero. We might be tempted to set a threshold of 0.459 (average + 1 standard deviation), or maybe round it up to 0.5, and say any pair of documents with a cosine less than that must be largely unrelated.
But let's do the same experiment on smaller texts. We'll use the nltk library to break each document into sentences:
import nltk

# The Punkt sentence tokenizer data may need to be downloaded first:
# nltk.download('punkt')

sentences = []
for doc in docs:
    sentences.extend(nltk.sent_tokenize(doc))
This yields 6,331 sentences with an average length of 27.5 words and a standard deviation of 16.6. In the histogram below, the size distribution of sentences is in red, and for full documents, it's in blue, so you can compare them.
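As a quick sanity check, these figures can be recomputed from the sentences list (a sketch; the exact tokenization may make the counts differ slightly):
import numpy as np

sent_lengths = np.array([len(s.split()) for s in sentences])
print(f"{len(sentences)} sentences, mean length {sent_lengths.mean():.1f} words, "
      f"std {sent_lengths.std():.1f}")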

We’ll use the same model and methods to get embeddings for each sentence:
# Same request loop as before, but over sentences instead of documents.
sentence_embeddings = []
i = 0
while i < len(sentences):
    print(f"Got {len(sentence_embeddings)}...")
    data = {
        "model": "jina-embeddings-v3",
        "task": "text-matching",
        "late_chunking": False,
        "dimensions": 1024,
        "embedding_type": "float",
        "input": sentences[i:i+10]
    }
    response = requests.post(url, headers=headers, data=json.dumps(data))
    for emb in response.json()['data']:
        sentence_embeddings.append(array(emb['embedding']))
    i += 10
And then take the cosine of each sentence’s embedding with each other sentence’s:
sent_cosines = []
for i, emb1 in enumerate(sentence_embeddings):
    for j, emb2 in enumerate(sentence_embeddings):
        if i != j:
            sent_cosines.append(cos_sim(emb1, emb2))
The result is a much larger set of cosine values: 40,075,230, summarized in the table below:
| Number of sentences | 6,331 |
| --- | --- |
| Number of cosines | 40,075,230 |
| Average | 0.254 |
| Std. Deviation | 0.116 |
Sentence-to-sentence cosines are considerably lower on average than full document-to-document ones. The histogram below compares their distributions, and you can readily see that the sentence pairs form a nearly identical distribution to the document pairs but shifted to the left.
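If you want to reproduce this kind of comparison plot, a minimal matplotlib sketch (the styling of the original figures isn't assumed) is:
import matplotlib.pyplot as plt

plt.hist(all_cosines, bins=100, density=True, histtype='step',
         color='blue', label='document-to-document')
plt.hist(sent_cosines, bins=100, density=True, histtype='step',
         color='red', label='sentence-to-sentence')
plt.xlabel('cosine similarity')
plt.ylabel('density')
plt.legend()
plt.show()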

To test that this size-dependency is robust, let’s get all cosines between sentences and documents and add them to the histogram. Their information is summarized in the table below:
| Number of texts | 6,331 sentences & 1,460 documents |
| --- | --- |
| Number of cosines | 9,243,260 |
| Average | 0.276 |
| Std. Deviation | 0.119 |
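Computing these cross cosines follows the same pattern as before, using the embeddings and sentence_embeddings lists we already have; a minimal sketch:
cross_cosines = []
for doc_emb in embeddings:
    for sent_emb in sentence_embeddings:
        cross_cosines.append(cos_sim(doc_emb, sent_emb))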
The green line below is the distribution of sentence-to-document cosines. We can see that this distribution fits neatly between the document-to-document cosines and the sentence-to-sentence cosines, showing that the size effect involves both the larger and smaller of the two texts being compared.

Let's do another test by concatenating the documents in groups of ten, creating 146 much larger documents, and measuring their cosines. The result is summarized below:
| Number of texts | 146 documents |
| --- | --- |
| Number of cosines | 21,170 |
| Average | 0.658 |
| Std. Deviation | 0.09 |
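The concatenation step itself is a one-liner; a sketch (the resulting big_docs list is then embedded and compared with the same request loop and cos_sim function as before):
# Join the documents in groups of ten into 146 much larger texts.
big_docs = [" ".join(docs[i:i+10]) for i in range(0, len(docs), 10)]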

This is far to the right of the other distributions. A cosine threshold of 0.5 would tell us that nearly all these documents are related to each other. To exclude irrelevant documents of this size, we would have to set the threshold much higher, maybe as high as 0.9, which would undoubtedly exclude good matches among the smaller documents.
This shows that we can't use minimum cosine thresholds at all to estimate how good a match is, at least not without taking document size into account somehow.
What Causes Size Bias?
Size bias in embeddings isn't like the positional biases found in long-context models: it isn't caused by the model architecture. It's not inherently about size, either. If, for example, we had created longer documents by just concatenating copies of the same document over and over, they wouldn't show this bias.
The problem is that long texts say more things. Even if they’re constrained by a topic and purpose, the whole point of writing more words is to say more stuff.
Longer texts, at least of the kind people normally create, will naturally produce embeddings that "spread" over more semantic space. If a text says more things, its embedding will, on average, form a smaller angle (and therefore have a higher cosine) with other vectors, independent of the subject of the text.
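You can probe this claim with a quick experiment, sketched below. It reuses the request setup from above (the embed_one helper is hypothetical), and we haven't tabulated the results here:
# "Long" texts made by repeating one document should behave like the
# original short texts, unlike genuinely varied long texts.
def embed_one(text):
    data = {
        "model": "jina-embeddings-v3",
        "task": "text-matching",
        "late_chunking": False,
        "dimensions": 1024,
        "embedding_type": "float",
        "input": [text]
    }
    response = requests.post(url, headers=headers, data=json.dumps(data))
    return array(response.json()['data'][0]['embedding'])

# Cosine between two unrelated short documents...
print(cos_sim(embeddings[0], embeddings[1]))
# ...versus the same two documents, each repeated five times.
print(cos_sim(embed_one(" ".join([docs[0]] * 5)),
              embed_one(" ".join([docs[1]] * 5))))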
Measuring Relevance
The lesson of this post is that you can’t use cosines between semantic vectors by themselves to tell if something is a good match, just that it’s the best match out of those available. You have to do something besides calculate cosines to check the utility and validity of the best matches.
You could try normalization. If you can measure size bias empirically, it may be possible to offset it. However, this approach might not be very robust. What works for one dataset probably won't work for another.
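For example, one rough approach (a sketch we haven't validated, with a hypothetical normalized_score helper) would be to express a raw cosine as a z-score against a background distribution of cosines measured on texts of comparable size:
import numpy as np

# Hypothetical helper: express a raw cosine as a z-score against a
# background distribution of cosines from texts of comparable size.
def normalized_score(cosine, background_cosines):
    bg = np.array(background_cosines)
    return (cosine - bg.mean()) / bg.std()

# Judge a document-to-document match against the document background...
print(normalized_score(0.6, all_cosines))
# ...and a sentence-to-sentence match against the sentence background.
print(normalized_score(0.6, sent_cosines))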
Asymmetric query-document encoding, provided in jina-embeddings-v3, reduces the size bias in embedding models but doesn’t eliminate it. The purpose of asymmetric encoding is to encode documents to be less “spread out” and encode queries to be more so.
The red line in the histogram below is the distribution of document-to-document cosines using asymmetric encoding with jina-embeddings-v3: each document is encoded with both the retrieval.query and retrieval.passage flags, and every document's query embedding is compared to every passage embedding that's not from the same document. The average cosine is 0.200, with a standard deviation of 0.124.
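In terms of the API calls above, asymmetric encoding just means embedding the same texts twice with different task values. A minimal sketch, using a hypothetical embed_batch helper that wraps the same request loop as before:
def embed_batch(texts, task):
    # Same batched request as before, but with a configurable task value.
    result = []
    i = 0
    while i < len(texts):
        data = {
            "model": "jina-embeddings-v3",
            "task": task,
            "late_chunking": False,
            "dimensions": 1024,
            "embedding_type": "float",
            "input": texts[i:i+10]
        }
        response = requests.post(url, headers=headers, data=json.dumps(data))
        for emb in response.json()['data']:
            result.append(array(emb['embedding']))
        i += 10
    return result

query_embeddings = embed_batch(docs, "retrieval.query")
passage_embeddings = embed_batch(docs, "retrieval.passage")

asym_cosines = []
for i, q in enumerate(query_embeddings):
    for j, p in enumerate(passage_embeddings):
        if i != j:  # skip comparing a document with itself
            asym_cosines.append(cos_sim(q, p))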
These cosines are considerably smaller than the ones we found above for the same documents using the text-matching flag, as shown in the histogram below.

However, asymmetric encoding hasn't eliminated size bias. The histogram below compares cosines for full documents and sentences using asymmetric encoding.

The average for sentence cosines is 0.124, so using asymmetric encoding, the difference between the average sentence cosine and the average document cosine is 0.076. The difference in averages for symmetric encoding is 0.089. The change in size bias is insignificant.
Although asymmetric encoding improves embeddings for information retrieval, it isn't any better for measuring the relevance of matches.
Future Possibilities
The reranker approach, e.g. jina-reranker-v2-base-multilingual and jina-reranker-m0, is an alternative way of scoring query-document matches that we already know improves query precision. Reranker scores are not normalized, so they don't work as objective similarity measures either. However, they are calculated differently, and it might be possible to normalize reranker scores in ways that make them good estimators of relevance.
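As a rough illustration, scoring candidates with a reranker could look something like the sketch below. It assumes the Jina AI rerank endpoint at https://api.jina.ai/v1/rerank and reuses the api_key header from above; the query string is just an example:
# Score some candidate documents against a query with a reranker.
rerank_data = {
    "model": "jina-reranker-v2-base-multilingual",
    "query": "the history of the Dewey Decimal Classification",
    "documents": docs[:20],
    "top_n": 5
}
rerank_response = requests.post("https://api.jina.ai/v1/rerank",
                                headers=headers,
                                data=json.dumps(rerank_data))
for result in rerank_response.json()['results']:
    # relevance_score is not normalized across queries, as noted above
    print(result['index'], result['relevance_score'])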
Another alternative is to use large language models, preferably with strong reasoning capabilities, to directly evaluate whether a candidate is a good match for a query. Simplistically, we could ask a task-specific large language model, "On a scale of 1 to 10, is this document a good match for this query?" Existing models might not be well-suited to the task, but focused training and more sophisticated prompting techniques are promising.
It's not impossible for models to measure relevance, but it requires a different paradigm from embedding models.
Use Your Models for What They're Good For
The size bias effect we've documented above shows one of the fundamental limitations of embedding models: They're excellent at comparing things but unreliable at measuring absolute relevance. This limitation isn't a flaw in the design—it's an inherent characteristic of how these models work.
So what does this mean for you?
First, be skeptical of cosine thresholds. They just don't work. Cosine similarity measures produce temptingly objective-looking floating-point numbers. But just because something outputs numbers doesn't mean it's measuring something objectively.
Second, consider hybrid solutions. Embeddings can efficiently narrow down a large set of items to promising candidates, after which you can apply more sophisticated (and computationally intensive) techniques like rerankers or LLMs, or even human evaluators to determine actual relevance.
Third, when designing systems, think in terms of tasks rather than capabilities. The objectively smartest, highest-scoring model on benchmarks is still a waste of money if it can't do the job you got it for.
Understanding the limitations of our models isn't pessimistic – it reflects a broader principle in applications: Understanding what your models are good at, and what they're not, is critical for building reliable and effective systems. Just like we wouldn't use a hammer to tighten a screw, we shouldn't use embedding models for tasks they aren't able to handle. Respect what your tools are good for.