Quantization is a widely used technique for addressing scaling problems in AI. The name makes it sound complicated, but it’s just rounding numbers off so they take up less space. This means smaller embedding vectors that use less memory and storage, and faster information retrieval because comparing vectors takes less time. Quantization is a purely numerical technique that doesn’t care what kind of data your model processes or what use cases you have, so it can bring improvements without requiring lots of expensive domain knowledge.
You might expect quantization to come with the usual trade-off: nothing is free, and we have to sacrifice some precision. In this article, we’ll show you a way to make it lossless via quantization-aware training (QAT). This technique is used in jina-embeddings-v4 to provide the smaller embeddings required in space-critical applications.
Overview of Quantization Techniques
Model quantization usually means one of four things:
- Post-training quantization (PTQ)
- Training for quantized embedding outputs (Output QAT)
- Training for fully quantized models (Full QAT)
- Distilling a new quantized model from an existing unquantized one
Post-training quantization (PTQ) accepts the trained embedding model as is and doesn’t modify it in any way. It’s just a matter of throwing away the least significant digits of the floating point values produced by the model. We just round the numbers off, and sometimes scale them to a range.
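To make this concrete, here is a minimal sketch of what post-training quantization to 8-bit integers can look like. The function name ptq_int8 and the batch-wise min/max scaling are illustrative choices of ours, not the exact procedure used in the experiments; scaling is discussed in more detail further below.

```python
import torch

def ptq_int8(embeddings: torch.Tensor) -> torch.Tensor:
    """Round FP32 embeddings down to int8 without touching the model itself."""
    lo, hi = embeddings.min(), embeddings.max()           # scaling range for this batch
    scaled = (embeddings - lo) / (hi - lo) * 255 - 128    # map [lo, hi] -> [-128, 127]
    return scaled.round().clamp(-128, 127).to(torch.int8)

# Example: four 2048-dimensional embeddings shrink from 8192 to 2048 bytes each.
vectors = torch.randn(4, 2048)
print(ptq_int8(vectors).dtype)  # torch.int8
```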
Output QAT means fine-tuning the embedding model to produce optimal reduced-precision vectors. This means modifying the model, but it doesn’t change the precision of the model’s weights, and therefore doesn’t reduce its size. Just the output vector size is reduced.
Full QAT begins with a fully trained, full-precision model and lowers the precision of the model weights, then fine-tunes the performance of this modified model. This produces a significantly smaller model as well as smaller embeddings, at the price of doing some fine-tuning.
Distillation is the process of training a new model to match the performance of an existing one. This means creating a new model that’s designed from scratch as quantized, and then using the existing model to generate as much training data as needed to train it until it performs as closely as possible to the existing model.
The benefits of these four approaches are summarized in the table below:
Approach | More Compact Embeddings? | Requires Training? | Model Compression? | Faster Inference? |
---|---|---|---|---|
PTQ | ✓ | ❌ | ❌ | ❌ |
Output QAT | ✓ | ✓ | ❌ | ❌ |
Full QAT | ✓ | ✓ | ✓ | ✓ |
Distillation (to a smaller model) | ✓ | ✓ | ✓ | ✓ |
All four produce more compact embeddings, but other than PTQ, all require some additional training, while only Full QAT and Distillation produce new, faster models. Full QAT and Distillation are much more expensive to implement because they require a great deal more training than Output QAT.
In this article, we’re only going to look at PTQ and Output QAT, which don’t change the size or speed of the embedding model.
Experimental Setup
For these experiments, our baseline model is jina-embeddings-v4 with the retrieval adapter, which produces 32-bit floating-point (FP32) vectors in 2048 dimensions. Each embedding is therefore 2048 × 4 bytes = 8192 bytes, or 8 kB.
We studied several experimental conditions using query-document retrieval benchmark tasks from the NanoBEIR benchmark suite. The retrieval process uses cosine similarity between vectors to find and rank the documents that best match queries.
- Baseline — The performance of jina-embeddings-v4 embedding vectors without any quantization. These experiments all used a beta version of the model, and the release performance is somewhat better.
- PTQ — We quantized the output vectors to binary vectors without changing the model.
- Output QAT — We quantized the output vectors and applied fine-tuning to the retrieval adapter to improve its performance under quantized conditions.
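As a rough illustration of how these conditions are scored, the sketch below ranks documents by cosine similarity and computes a simplified recall@k. This is not the actual NanoBEIR harness: the recall_at_k helper and the assumption of a single relevant document per query are simplifications of ours.

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_vecs: torch.Tensor, doc_vecs: torch.Tensor,
                relevant_doc: torch.Tensor, k: int = 10) -> float:
    """Fraction of queries whose relevant document appears in the top-k results
    when documents are ranked by cosine similarity.

    relevant_doc holds the index of each query's single relevant document.
    """
    # Pairwise cosine similarities: (num_queries, num_docs)
    sims = F.cosine_similarity(query_vecs.unsqueeze(1), doc_vecs.unsqueeze(0), dim=-1)
    topk = sims.topk(k, dim=-1).indices                      # (num_queries, k)
    hits = (topk == relevant_doc.unsqueeze(-1)).any(dim=-1)  # did the right doc make it?
    return hits.float().mean().item()
```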
Quantization Levels

We experimented with four different levels of quantization.
- 8-bit integers — FP32 values are reduced to integers in the range -128 to 127, shrinking embeddings 4-fold to 2048 bytes.
- 4-bit integers — Same as for 8-bit integers, but we map to the range from -8 to 7, reducing vector sizes by a factor of 8, to 1024 bytes.
- Trinary Quantization — All values are mapped to one of three values: -1, 0, 1. Optimally stored, this takes about 1.6 bits per dimension, shrinking embedding vectors roughly 20-fold to approximately 410 bytes.
- Binary Quantization — We convert FP32 scalar values to a single bit each using the torch.sign function, which yields just two values, each taking one bit to store. This reduces 2048-dimensional embedding vectors from 8192 bytes to 256 bytes, a 32-fold reduction.
Scaling
For binary quantization, the rule is very simple: if a vector value is greater than zero, it maps to 1. Otherwise, it maps to -1.
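In PyTorch terms, this rule is a one-liner. The pack_bits helper below is a hypothetical illustration of how the resulting ±1 values can be packed into bytes for storage (2048 dimensions fit in 256 bytes); both function names are ours, not from any library.

```python
import torch

def quantize_binary(embeddings: torch.Tensor) -> torch.Tensor:
    """Map each component to +1 if it is greater than zero, otherwise -1."""
    return (embeddings > 0).float() * 2 - 1

def pack_bits(binary_embeddings: torch.Tensor) -> torch.Tensor:
    """Pack +1/-1 values into bytes for storage: 2048 dimensions -> 256 bytes.

    Assumes the last dimension is a multiple of 8.
    """
    bits = (binary_embeddings > 0).to(torch.uint8)
    bits = bits.reshape(*bits.shape[:-1], -1, 8)  # group 8 bits per byte
    weights = torch.tensor([128, 64, 32, 16, 8, 4, 2, 1], dtype=torch.uint8)
    return (bits * weights).sum(dim=-1).to(torch.uint8)
```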

For the other quantization scenarios, we normalized the values to a range and then rounded to the nearest value allowed by the level of quantization. Embedding vectors consist of scalar numbers between -∞ and +∞ (or, in practice, really big positive and negative numbers). We use two numbers, *max* and *min*, to scale the values for quantization.
For trinary quantization, we take each vector component *v* and translate it as follows:
- If *v* ≥ *max*, *v* becomes 1.
- If *v* ≤ *min*, *v* becomes -1.
- If *min* < *v* < *max*, *v* becomes 0.

For 4-bit integers:
- If *v* ≥ *max*, *v* becomes 7.
- If *v* ≤ *min*, *v* becomes -8.
- If *min* < *v* < *max*, *v* becomes (*v* - *min*) / (*max* - *min*) × 15 - 8, rounded to the nearest integer. This scales the value to the range [-8, 7].

For 8-bit integers:
- If *v* ≥ *max*, *v* becomes 127.
- If *v* ≤ *min*, *v* becomes -128.
- If *min* < *v* < *max*, *v* becomes (*v* - *min*) / (*max* - *min*) × 255 - 128, rounded to the nearest integer. This scales the value to the range [-128, 127].

To calculate *max* and *min*, we used two approaches:
- Min/Max — We processed our data in batches, and for each batch, we identified the highest and lowest vector components, setting *max* to the highest and *min* to the lowest.
- Rolling averaging over batches — For each batch, we calculated the average and standard deviation of the vector components, and maintained a moving average of both as we processed the batches. If *a* is the current moving average of the batch averages and *σ* is the current moving average of the standard deviations, then for each batch, *max* = *a* + *σ* and *min* = *a* - *σ*.
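A sketch of the rolling-average approach might look like the following; the RollingMinMax class name and the specific momentum value are illustrative assumptions, not the exact settings from our experiments.

```python
import torch

class RollingMinMax:
    """Track max and min for quantization from moving averages of batch statistics."""

    def __init__(self, momentum: float = 0.9):  # momentum value is an assumption
        self.momentum = momentum
        self.avg_mean = None  # moving average of batch means
        self.avg_std = None   # moving average of batch standard deviations

    def update(self, batch: torch.Tensor) -> tuple[float, float]:
        mean, std = batch.mean().item(), batch.std().item()
        if self.avg_mean is None:  # first batch initializes the averages
            self.avg_mean, self.avg_std = mean, std
        else:
            self.avg_mean = self.momentum * self.avg_mean + (1 - self.momentum) * mean
            self.avg_std = self.momentum * self.avg_std + (1 - self.momentum) * std
        # max and min for quantizing this batch
        return self.avg_mean + self.avg_std, self.avg_mean - self.avg_std
```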
QAT Fine-Tuning
For the PTQ experiments, we used the model as is and quantized the embeddings it produced using the methods described above.
For the Output QAT experiments, we fine-tuned the model using straight-through estimation. This means that we reverse the quantization process, restoring full precision to the values, before calculating the loss (i.e., the error), and then we use that loss metric to fine-tune the model.
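One common way to implement this idea in PyTorch is a "fake quantization" step whose backward pass is the identity, so gradients still reach the full-precision model. The sketch below is a generic illustration of straight-through estimation, not our exact training code; BinaryQuantizeSTE and retrieval_loss are placeholder names.

```python
import torch

class BinaryQuantizeSTE(torch.autograd.Function):
    """Binary 'fake quantization' with a straight-through estimator."""

    @staticmethod
    def forward(ctx, embeddings):
        # Quantize to +1/-1 in the forward pass.
        return (embeddings > 0).float() * 2 - 1

    @staticmethod
    def backward(ctx, grad_output):
        # Pass gradients through unchanged, as if quantization were the identity,
        # so a loss computed on quantized outputs still updates the FP32 weights.
        return grad_output

# Hypothetical use during fine-tuning:
#   quantized = BinaryQuantizeSTE.apply(model_output)
#   loss = retrieval_loss(quantized, labels)  # placeholder loss function
#   loss.backward()                           # gradients reach the FP32 model
```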
We fine-tuned in each case for 10,000 steps, saving a checkpoint every 500 steps. We then retained the checkpoint with the highest score on the NanoBEIR benchmark.
Asymmetric Quantization
PTQ and Output QAT reduce the size of the embedding vectors, but don’t reduce model size or inference speed; all the savings are in the size of the stored document embeddings and retrieval speed.
For this reason, we tested both quantizing the query vectors and leaving them unquantized at retrieval time; either way, the stored document embeddings are the same size.
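The sketch below shows the two retrieval variants side by side for the binary case: either the query is quantized like the documents, or it is left in full precision. The score_against_binary_docs function is our own illustrative helper, not part of any library.

```python
import torch
import torch.nn.functional as F

def score_against_binary_docs(query: torch.Tensor, docs: torch.Tensor,
                              quantize_query: bool) -> torch.Tensor:
    """Cosine similarity of one query against binary (+1/-1) document vectors.

    quantize_query=True corresponds to the "Both" conditions; False corresponds
    to the "Docs Only" conditions, where only stored documents are quantized.
    """
    q = (query > 0).float() * 2 - 1 if quantize_query else query
    return F.cosine_similarity(q.unsqueeze(0), docs, dim=-1)
```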
Results
We tested nine conditions in total, summarized in the tables below:
Condition Name | Fine-Tuning | Quantization Level | Scaling Strategy | Quantized Queries |
---|---|---|---|---|
Baseline | ❌ | n/a | n/a | n/a |
PTQ Binary | ❌ | Binary | n/a | ✓ |
PTQ Binary Docs Only | ❌ | Binary | n/a | ❌ |
QAT Binary | ✓ | Binary | n/a | ✓ |
QAT Binary Docs Only | ✓ | Binary | n/a | ❌ |
QAT Trinary | ✓ | Trinary | Rolling Average | ✓ |
QAT 4-bits | ✓ | 4-bits | Rolling Average | ✓ |
QAT 8-bits | ✓ | 8-bits | Rolling Average | ✓ |
QAT 8-bits Min/Max | ✓ | 8-bits | Min/Max | ✓ |
Table 2: Experimental Conditions
Condition Name | Average Score | Difference from baseline |
---|---|---|
Baseline | 60.10 | n/a |
PTQ Binary | 58.33 | -1.78 |
PTQ Binary Docs Only | 59.08 | -1.02 |
QAT Binary | 59.22 | -0.89 |
QAT Binary Docs Only | 60.81 | +0.70 |
QAT Trinary | 59.49 | -0.62 |
QAT 4-bits | 61.73 | +1.62 |
QAT 8-bits | 61.67 | +1.56 |
QAT 8-bits Min/Max | 61.29 | +1.19 |
Table 3: Average score (in % correct) for each condition over the twelve NanoBEIR benchmarks.
You can see from the table above that fine-tuning for quantization improves scores. The only difference between the PTQ Binary and QAT Binary conditions is fine-tuning, and the difference in score is significant. Similarly, we see an almost 2% improvement in scores between the PTQ Binary Docs Only and QAT Binary Docs Only conditions, which are only distinguished by the same fine-tuning.
Unsurprisingly, we also see that scores generally improve the less we quantize, with 4-bit quantization scoring better than trinary, and trinary better than binary. However, going further to 8-bits doesn’t appear to have improved anything.
We only tested leaving queries unquantized in binary cases, but this appears to improve performance.
Finally, our tests suggest that the rolling average scaling method outperforms the simplistic min/max approach.
Conclusion
Quantization offers important operational advantages for embedding models, significantly reducing the size of embedding vectors and accelerating information retrieval. While simple post-training quantization (PTQ) provides immediate memory and storage benefits, our experiments demonstrate that quantization-aware training (QAT) significantly mitigates the precision losses that would otherwise be inevitable. Fine-tuning consistently yielded better scores.
The degree of quantization directly impacts performance, which is what you would expect from a method based on reducing the precision of values. Less aggressive quantization (e.g., 4-bit) generally outperforms more aggressive methods (e.g., binary), but surprisingly, there was no significant difference in performance between 8-bit and 4-bit quantization. It would seem that until you reach some threshold of imprecision, there is little difference between greater and lesser quantization.
Scaling strategies are also significant, with the rolling average method showing superior results compared to a fixed min/max approach. Using scaling values that are relative to the data appears to work significantly better and merits further exploration.
Quantization can get you more out of your embedding models for less. Although this article doesn’t explore all the options for quantization, it covers two that are easily accessible, and they have real benefits to offer. We’re working to refine and improve quantization strategies so that we can further reduce users’ costs, and expect to release binary support for jina-embeddings-v4 in the near future.