Embeddings in Depth

This article is the second in our series on embedding technology. In the previous article, we discussed embeddings in very abstract terms, but this one will be much more concrete. It explains where embeddings come from and how they are used.

⚠️

Trigger Warning: This article contains some light algebra, and although it tries to avoid complicated math, there are some notational conventions from vector algebra.

An embedding is a mapping of objects to vectors. Those objects may be real things or digital objects like images and files.

For example, imagine a robot that harvests tomatoes:

This robot selects tomatoes on the vine that are ready to harvest, and then snips their stalks and takes them away. Recognizing which tomatoes are ripe is a good application of image embeddings. Each tomato on the vine is assigned a representative vector based on digital images taken by the robot’s cameras. It then uses these embeddings to decide whether or not to harvest a particular tomato.

Similarly, we might use embeddings to study and predict customer behavior. These embedding vectors stand in for a person — a real human being who presumably exists somewhere — but we construct their embedding vectors from digital information about that person. For a study of consumer buying behavior, this is likely to be some combination of demographic information and purchasing records.

In the previous article, we discussed how vectors are just ordered lists of numbers that we decide to interpret as points in a high-dimensional space. Data stored in a computer is also just an ordered list of numbers. Embeddings are, therefore, made by translating input vectors in one vector space — the one that contains the source data — into vectors in a different space called the embedding space.

But not just any mapping between vector spaces is an embedding. The translation into embeddings is purposeful. The placement of embedding vectors relative to each other should be informative in some way about the objects they represent.

tagCosines: How We Compare Embeddings

When we use embeddings in applications, the thing we most often do is compare two embeddings and calculate how close or far apart they are. This is indicative of the similarity or dissimilarity of the objects those embeddings stand in for with respect to some set of properties.

To do this, we use a metric called the cosine. This section explains cosines and how we calculate them in high-dimensional vector spaces.

An embedding is a vector, and as we showed in the previous section, we can interpret a vector as a point in a multi-dimensional space. However, it also has an interpretation as a direction and magnitude. Traditionally, we visualize this interpretation with an arrow:

Diagram of 3D coordinate axes with labeled x, y, and z axes on a black background

The direction of this three-dimensional vector is specified by the $(x,y,z)$ coordinates, and the magnitude is just the length of the vector, i.e., the distance from the coordinate $(x,y,z)$ to the origin at $(0,0,0)$ . The length is calculated using a multidimensional extension of Pythagoras' theorem:

Mathematical equations featuring variables and square root operations displayed on a black background

The double bars around a vector $\|\vec{v}\|$ indicate the vector's length (or magnitude).

💡

Although this example uses only three dimensions, all these formulas apply exactly the same to arbitrarily large vectors of any number of dimensions.

We can see that different vectors can have the same direction but different lengths. For example, if we have the vectors $(4,2)$ and $(2,1)$ , they have different lengths ( $\sqrt{20} \approx 4.47$ and $\sqrt{5} \approx 2.23$ respectively) but they point in the same direction.

Mathematical chart with two lines depicting different linear equations, with axes labeled 1 to 4 and points plotted at key intersections

How do we know they point in the same direction? We measure the angle between the two vectors. If the angle is zero, they point in the same direction.

To show how this works, let’s consider two vectors: $\vec{a}=(3,2,5)$ and $\vec{b}=(4,5,2)$ :

Graphic representation of a mathematical 3D space with labeled points and lines on a black background, highlighting data points

To calculate the angle, we have a formula that scales up to any number of dimensions: The cosine.

The cosine of the angle $\theta$ between vectors $\vec{a}$ and $\vec{b}$ is:

Mathematical equations displayed in white on a blackboard, featuring variables and operations

The numerator of this fraction, $a⋅b$ , is the dot product of the two vectors, and it’s easy to calculate:

Black background with a white mathematical equation involving multiplication and summation resulting in the number 32

As for ‖a‖ and ‖b‖, those are the lengths of the two vectors: $\|\vec{a}\| = \sqrt{38} \approx 6.164$ and $\vec{b} = \sqrt{45} \approx 6.708$ . Therefore, to calculate $cosθ$ :

Blackboard with complex white mathematical formulas involving cosine, roots, and equations

This cosine corresponds to an angle of approximately 39.3 degrees, but in machine learning, we typically stop once we’ve calculated the cosine because if all the numbers in both vectors are greater than zero, then the cosine of the angle will be between 0 and 1. A cosine of 1 means the angle between two vectors is 0 degrees, i.e., the vectors have the same direction (although they may have different lengths). A cosine of 0 means they have an angle of 90 degrees. i.e., they are perpendicular or orthogonal.

So let’s compare $\vec{a}=(4,2)$ and $\vec{b}=(2,1)$ :

Math equations on a dark background indicative of an educational setting, such as a classroom blackboard

When the cosine of two vectors is one, they point in the same direction, even though they may have different lengths.

Another way to do the same operation is to take each vector and normalize it. Normalizing a vector means creating a new vector that points in the same direction but has a length of exactly 1. If two vectors point in the same direction, they will normalize to identical vectors.

To normalize a vector, divide it in each dimension by the vector’s length. If $\vec{v} = (3,2,5)$ and $\hat{v} = norm(\vec{v})$ then:

Mathematical formulae showcasing vector norms on a black background with specific components and results in white text

If two vectors $\vec{a}$ and $\vec{b}$ have an angle between them $\theta$ , then their normalized forms $\hat{a}$ and $\hat{b}$ will have the same angle $\theta$ between them. Cosine is a measure that is indifferent to the magnitudes of vectors.

💡

When we compare embeddings to decide how near or far they are from each other, we almost always use cosines.

The reason for this is that the magnitudes of embedding vectors are not usually informative about anything. Most neural network architectures perform normalizations between some or all of their layers, rendering the magnitudes of embedding vectors typically useless.

tagNeural Networks As Vector Mappings

Embedding vectors work because they are placed so that things that have common properties have cosines that are closer to 1, and those that do not have cosines closer to 0. An embedding model takes input vectors and translates them into embedding vectors.

Although there are many techniques for performing such translations, today, embedding models are exclusively neural networks.

A neural network is a sequence of vector transformations. In formal mathematical terms, for each layer of the network, we multiply the vector output of the previous layer by a matrix, and then generally add a bias. A bias is a set of values that we add or subtract from the resulting vector. We take the result and make that the input to the next layer of the network.

More formally, the values of the $m$ th layer of the network are $\vec{m}$ and are transformed into the values of the $n$ th layer ( $\vec{n}$ ) by:

An equation in white text on a black backdrop reading "m · W + bias"

Where $\textbf{W}$ is a matrix and $\vec{bias}$ is the vector of bias values.

It’s customary to think of it as a network of “nodes” bearing numerical values, connected by “weights”. This is truer to the roots of neural network theory but is exactly equivalent to multiplying a vector by a matrix.

Graphical representation of a neural network with labeled nodes 'm' and 'n' connected by a web of lines on a black backdrop

If we have a vector $\vec{m} = (m_1,m_2m_3)$ , this one neural network layer transforms it into a vector $\vec{n} = (n_1,n_2,n_3,n_4,n_5)$ . It does that by multiplying the values in $\vec{m}$ by a set of weights (also called parameters), then adding them up, and adding the bias vector $\vec{bias} = (bias_1,bias_2,bias_3,bias_4,bias_5)$

Each line between nodes in the diagram above is one weight. If we designate the weight from $m_1$ to $n_1$ as $w_{11}$ , from $m_2$ to $n_3$ as $w_{23}$ , etc., then:

Five sequential mathematical equations in white text on a black background, representing a series of weighted sums with biases

In matrix algebra notation, this is the same as multiplying the 3-dimensional vector $\vec{n}$ by a $3x5$ matrix to get $\vec{m}$ :

Black background with a complex mathematical equation involving numbered variables, 'W' coefficients, and biases indicating precision

Neural networks have multiple layers, so the output of this matrix multiplication is multiplied by another matrix and then another, typically with some kind of thresholding or normalization between multiplications and sometimes other tricks. The most common way of doing this today is to apply the ReLU (rectified linear activation unit) activation function. If the value of any node is below zero, it’s replaced with a zero, and otherwise, left unchanged.

Mathematical illustration of the ReLU function, "relu(x) = max(0,x) = 0 otherwise," in colorful text on a black background

Otherwise, a neural network is just a series of transformations like the one above from $\vec{m}$ to $\vec{n}$ .

Diagram of a neural network showing signal flow from "Input vector" to "Output vector" across multiple layers

This structure has some important properties. The most important is that:

💡

In theory, any kind of vector transformation is possible with enough layers. Any problem that can be expressed as a mapping from a vector space

\vec{x} \in \textit{A}

to vectors in another vector space

\vec{y} \in \textit{B}

has some solution that looks just like a neural network.

It may be impractical to construct a large enough network for a given problem, and there may be no procedure for correctly calculating the weights. AI models are not guaranteed in advance to be able to learn or perform any task at all.

What happens in practice is that neural networks approximate a solution by learning from examples. How good a solution that approximation is depends on a lot of factors and can’t typically be known in advance of training a model.

There is an aphorism attributed to the British statistician George Box:

All models are wrong, but some are useful.

Neural networks can be very good approximations, but they are still approximations, and it’s not possible to know for certain how good they are at something until you put them into production. The real measure of an AI model is its usefulness, and that’s hard to know in advance.

tagTurning Problems Into Vector Mappings

Defining neural networks as mappings from one vector space to another sounds very limiting. We don’t naturally see the kinds of things we want AI models to do in those terms.

However, many of the most visible developments in recent AI are really just ways of using vector-to-vector mappings in clever ways. Large language models like ChatGPT translate input texts, rendered as vectors, into vectors that designate a single word. They write texts by adding the single-word output to the end of the input and then running again to get the next word.

Flowchart on black background depicting an iterative dialogue process with inputs, translations, and outputs involving a conversation with a language model

It may take some creativity and engineering to find a way to express some task as a vector mapping, but as recent developments show, it is surprisingly often possible.

tagWhere Do Embeddings Fit In?

When you train an AI model, you do it with some goal in mind. In the typical scenario, you have training data that consists of inputs and matching correct outputs. For example, let’s say you want to classify tweets by their emotional content. This is a common benchmarking task used to evaluate text AI models.

One widely used dataset (DAIR.AI Emotion) sorts tweets into eight emotional categories:

Table showing emotion labels like Joy, Sadness, Fear with corresponding example tweets to illustrate each emotion

Bar graph displaying emotional responses with 'joy' and 'sadness' as the most frequent, plotted against a 0-7000 scale — Number of tweets per emotion category in the DAIR.AI Emotion dataset

To train a model to do this task, you need to decide two things:

How to map the output category labels to vectors.
How to map the input texts to vectors

The easiest way to encode these emotion labels is to create six-dimensional vectors where each value corresponds to one category. For example, if we make a list of emotions, like the one above, “anger” is the third in the list. So if a tweet has angry content, we train the network to output a six-dimensional vector with zero values everywhere except the third item in it, which is set to one:

Digit sequence 0 to 5 displayed on a dark background with repetitive decimal points beneath

This approach is called zero-hot encoding among AI engineers.

We also have to encode the tweets as vectors. There is more than one way to do this, but most text-processing models follow broadly the same pattern:

First, input texts are split into tokens, which you can understand as something like splitting it into words. The exact mechanisms vary somewhat, but the procedure broadly follows the following steps:

Normalize the case, i.e., make everything lowercase.
Split a text on punctuation and spaces.
Look up each resulting subsegment of text in a dictionary (which is a part of the AI model bundle) and accept it if it is in the dictionary.
For segments not in the dictionary, split them into the smallest number of parts that are in the dictionary, and then accept those segments.

So, for example, let’s say we wanted an embedding for this famous line from Alice in Wonderland:

Why, sometimes I've believed as many as six impossible things before breakfast.

Jina Embeddings’ text tokenizer will split it up into 18 tokens like this:

[CLS] why , sometimes i ' ve believed as many as six impossible 
things before breakfast . [SEP]

The [CLS] and [SEP] are special tokens added to the text to indicate the beginning and end, but otherwise, you can see that the splits are between words on spaces and between letters and punctuation.

If the tokenizer encounters a sentence with words that are not in the dictionary, it tries to match the parts to entries in its dictionary. For example:

Klaatu barada nikto

This is rendered in the Jina tokenizer as:

[CLS] k ##la ##at ##u bar ##ada nik ##to [SEP]

The ## indicates a segment appearing elsewhere than the beginning of a word.

This approach guarantees that every possible string can be tokenized and can always be processed by the AI model, even if it’s a new word or a nonsense string of letters.

Once the input text is transformed into a list of tokens, we replace each token with a unique vector from a lookup table. Then, we concatenate those vectors into one long vector.

Many text-processing AI models (including Jina Embeddings 2 models) use vectors of 512 dimensions for each token. All text-processing models are limited in the number of tokens they can accept as input. Jina Embeddings 2 models accept 8,192 tokens, but most models only take 512 or fewer.

If we use 512 for this example, then to do emotion classification for these tweets, we need to build a model that takes an input vector of $512 \times 512 = 262144$ dimensions, and outputs a vector of 6 dimensions:

Diagram visualizing an AI model transforming a 262,144-dim input vector into a 6-dim output vector with a theme of fear analysis

Neural networks are often presented as a kind of black box in which the parts on the inside are not described in depth. This is not wrong, but it’s not very informative. In practice, large, modern AI models tend to have three components, although this is not always true, and the boundaries between them are not always clear:

Diagram of an AI model showing a flow from a 144-dim input through an MLP to a 6-dim output, with a "Classifier" and "Input Encoder.

The vocabulary for some of this is not entirely codified, and this schema does not apply to all AI models, but in general, a trained model has three parts:

An input processor or encoder

Especially for networks that take in very large input vectors, there is some part of the network devoted to correlating all parts of the input with each other. This section of the network can be very large, and it can be structurally complex. Many of the biggest breakthroughs in AI in the last two decades have involved innovative structures in this part of the model.
A multi-layer perceptron

There isn’t a consistent term for this “middle” part of the model, but “MLP” or “multi-layer perceptron” appear to be the most common terms in the technical literature. The term itself is a bit old-fashioned, recalling the “perceptron”, a kind of small single-layer neural network invented in the 1950s that is ancestral to modern AI models.

The MLP typically consists of a number of connected layers of the same size. For example, the smallest of the ViT image processing models has twelve layers with 192 dimensions in each, while the largest has 24 layers, each with 768 dimensions.

The size and structure of the MLP are key design variables in an AI model.
The “Classifier”

This is usually the smallest part of the model, typically much smaller than the MLP. It may be as small as one layer but typically has more in modern models. This part of the model translates the output of the MLP into the exact form specified for the output, like the thousand-dimension one-hot vectors used for Imagenet-1k. It may not actually be a classifier. This final section of the model may be trained to do some other task.

The size of the input encoder is mostly determined by your planned input sizes and the size of the first layer of the MLP. The size of the “classifier” is mostly determined by the training objective and the size of the last layer of the MLP. You have relatively few design choices in those areas. The MLP is the only part where you can make free decisions about your model.

The size of the MLP is a design choice balancing three factors:

Smaller MLPs learn faster, run faster, and take less memory than large MLPs.
Large MLPs typically learn more and perform better, all else being equal.
Large MLPs are more likely to “overfit” than small MLPs. There is a risk that instead of generalizing from their training examples, they will memorize them and not learn effectively.

The optimal balance depends on your use case and context, and it is very difficult to calculate the optimum in advance.

Most of the learning takes place in the MLP. The MLP learns to recognize those features of the inputs that are useful in calculating the correct output. The last layer of the MLP outputs vectors that are closer together the more they share those features. The “classifier” then has the relatively easy task of translating that into the correct output, like what class a particular input belongs to..

So, for example, if we train a text classifier with the six emotion labels from our dataset, we would expect the outputs of the MLP to be a high-dimensional analog of something like this:

Emotional scatter plot on a black background with colorful dots representing joy, sadness, anger, fear, love, and surprise

In short, the output of the last layer of the MLP has all the properties we want in an embedding: It puts similar things close together and dissimilar things further apart.

We typically construct embedding models by taking a model we’ve trained for something — often classification, but not necessarily — and throwing away the “classifier” part of the model. We just stop at the last layer of the MLP and use its outputs as embeddings.

Block diagram of an AI model with Input Encoder, Multi-Layer Perceptron (MLP), and Embedding layer

There are four key points to take from this:

Embeddings put things close to each other that are similar based on a set of features.
The set of features that embeddings encode is not usually specified in advance. It comes from training a neural network to do a task and consists of the features the model discovered were relevant to doing that task.
Embedding vectors are typically very high-dimensional but are usually much, much smaller than input vectors.
Most models have an embedding layer in them somewhere, even if they never expose it or use it as a source of embedding vectors.

tagText Embeddings

Text embeddings work in very much the same way we’ve described above, but while it’s easy to understand how apples are like other apples and unlike oranges, it’s not so easy to understand what makes texts similar to each other.

The first and most prototypical application of text embeddings is for search and information retrieval. In this kind of application, we start with a collection of texts:

Informative text describing factual details about Paris, Berlin, and Bubblegum Alley in San Luis Obispo

We then calculate and store their embeddings:

Black background with three lines of white cursive text: 'd1 = embed(d1)', 'd2 = embed(d2)', 'd3 = embed(d3)'

When users want to search this document collection, they write a query in normal language:

Black background with white text reading "q = 'German capital'"

We also convert this text into its corresponding embedding:

Black background with centered white text related to embedding in programming code, presented in a simple and clean layout

We perform search by taking the cosine between $\vec{q}$ and each of the document embeddings $\vec{d}_1,\vec{d}_2,\vec{d}_3$ , identifying the one with the highest cosine (and thus is closest to the query embedding). We then return to the user the corresponding document.

Black background with a white mathematical inequality of cosines suggesting a data or variable comparison

So, one training goal for text similarity is information retrieval: If we train a model to be a good search engine, it will give us embeddings that put two documents close to each other in proportion to how much the same queries will retrieve them both.

There are other possible goals we can use to train text embeddings, like classifying texts by genre, type, or author, which will lead to embeddings that preserve different features. There are a number of ways we can train text embedding models, but we will discuss that topic in the next article, which will go into detail about training and adapting AI models. What matters is what you’ve trained the AI model to see as a text's most relevant features.

tagJina Embeddings

As a concrete example, the Jina Embeddings 2 models represent each token with a 512-dimension vector and can accept a maximum of 8192 tokens as input. This AI model, therefore, has an input vector size of 512x8192 or 4,194,304 dimensions. If the input text is less than 8192 tokens long, the rest of the input vector is set to zeros. If longer, it’s truncated.

The model outputs a 512-dimension embedding vector to represent the input text.

Diagram of Jina Embeddings Model with text from "Alice in Wonderland" and labeled curves for input and embedding vectors

As a developer or user, you don’t have to deal directly with any of these complexities. They are hidden from you by software libraries and web APIs that make using embeddings very straightforward. To transform a text into an embedding can involve as little as one line of code:

curl <https://api.jina.ai/v1/embeddings> \\
  -H "Content-Type: application/json" \\
  -H "Authorization: Bearer jina_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \\
  -d '{
    "input": ["Your text string goes here", "You can send multiple texts"],
    "model": "jina-embeddings-v2-base-en"
  }'

Go to the Jina Embeddings page to get a token and see code snippets for how to get embeddings in your development environment and programming language.

tagEmbeddings: The Heart of AI

Practically all models have an embedding layer in them, even if it's only implicit, and most AI projects can be built on top of embedding models. That's why good embeddings are so important to the effective use of AI technologies.

This series of articles is designed to demystify embeddings so that you can better understand and oversee the introduction of this technology in your business. Read the first installment here. Next, we will dive further into this technology, and provide hands-on tutorials with code you can use in your own projects.

Jina AI is committed to providing you with tools and help for creating, optimizing, evaluating, and implementing embedding models for your enterprise. We’re here to help you navigate the new world of business AI. Contact us via our website or join our community on Discord to get started.