
Scalar Quantization - Making Vector Search Lean and Fast

·4 mins
Clément Sauvage

If you’ve worked with semantic search or dense retrieval systems, you know the problem: embeddings are expensive. Not expensive to compute (well, that too), but especially expensive to store and search through.

A typical embedding might be 768 dimensions of 32-bit floats. That’s over 3KB per vector. Now multiply that by millions of documents, and suddenly you’re looking at gigabytes or even terabytes of storage. And when you need to compute similarity between vectors, you’re doing a lot of floating-point operations across all those dimensions.

This is where scalar quantization comes in.

What is Scalar Quantization?

Scalar quantization is essentially a compression technique for vectors. Instead of storing each dimension as a full 32-bit float, we compress it down to an 8-bit integer (int8), or sometimes even fewer bits, as in binary quantization. That’s a 4x reduction in memory right there.

The idea is simple: take the range of values in your vector and map them to a smaller range of integers. Think of it like going from a high-resolution image to a lower-resolution one - you lose some detail, but the overall picture is still recognizable, if not indistinguishable at first sight.
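To make that 4x figure concrete, here is a quick sanity check of the per-vector footprint, using the 768-dimension example from earlier (the vectors are just zero placeholders; only the dtype matters for the size):

import numpy as np

# A 768-dimensional embedding in full precision vs. quantized to 8 bits
embedding_f32 = np.zeros(768, dtype=np.float32)
embedding_u8 = np.zeros(768, dtype=np.uint8)

print(embedding_f32.nbytes)  # 3072 bytes (~3 KB per vector)
print(embedding_u8.nbytes)   # 768 bytes - 4x smaller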

Why Do We Need It?

In dense retrieval, we’re constantly dealing with two main bottlenecks:

  • Memory: Storing millions of high-dimensional vectors requires massive amounts of RAM. In production systems, this directly translates to infrastructure costs.
  • Speed: When searching, we need to compute similarity (usually cosine similarity or dot product) between the query vector and potentially millions of document vectors. Fewer bytes mean faster memory access and quicker computations.

By using scalar quantization, we can:

  • Fit more vectors in memory (4x more with 8-bit quantization)
  • Speed up similarity computations (integer operations are faster than float operations)
  • Reduce storage costs significantly
  • Still maintain decent search quality

How Does It Work?

The process is surprisingly straightforward. Let’s walk through 8-bit quantization:

  1. Find the range: Look at all values in your vector and find the minimum and maximum
  2. Create a mapping: Map this range to 0-255 (the range of an 8-bit integer)
  3. Quantize: Convert each float value to its corresponding integer with the following formula: quantized = round((value - min) / (max - min) * 255)

Here’s a simple example:

import numpy as np

# Original vector with float32 values
original_vector = np.array([0.5, 0.8, 0.2, 0.9, 0.1], dtype=np.float32)

min_val = original_vector.min()
max_val = original_vector.max()

# Quantize to 8-bit integers (0-255), rounding as in the formula above
quantized = np.round((original_vector - min_val) / (max_val - min_val) * 255).astype(np.uint8)

print(f"Original: {original_vector}")
print(f"Quantized: {quantized}")
# Quantized: [128 223  32 255   0]

We obviously lose some precision: de-quantizing the quantized vector will not recover the exact original values, only a close approximation.
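Continuing the example above, de-quantization simply inverts the mapping:

# De-quantize: map the 0-255 codes back into the original [min, max] range
dequantized = quantized.astype(np.float32) / 255 * (max_val - min_val) + min_val

print(f"Dequantized: {dequantized}")
# roughly [0.5016, 0.7996, 0.2004, 0.9, 0.1] - close to the original
# [0.5, 0.8, 0.2, 0.9, 0.1], but not identical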

The Trade-offs

Like everything in engineering, scalar quantization comes with its compromises:

The Upsides:

  • Memory savings: 4x reduction with 8-bit quantization (or 8x with 4-bit)
  • Faster similarity computations: Integer operations are cheaper than floating-point ones, and many libraries ship SIMD-optimized int8 kernels
  • Better cache utilization: More vectors fit in CPU cache, improving overall performance
  • Lower costs: Less memory means cheaper infrastructure

The Downsides:

  • Loss of precision: You’re throwing away information, which can hurt search quality
  • Quantization overhead: You need to compute min/max values and perform the quantization
  • Quality degradation: In some cases, the loss of precision can noticeably impact search results
  • One-size-doesn’t-fit-all: Some vectors quantize better than others

When Should You Use It?

Scalar quantization makes sense when:

  • You’re dealing with large-scale vector search (millions of vectors)
  • Your infrastructure costs are a concern
  • You can tolerate some loss in search quality (usually 2-5% drop in recall)
  • Your vectors have a relatively uniform distribution

It might not be worth it when:

  • You have a small number of vectors (just use full precision)
  • You need absolute best-in-class search quality
  • Your vectors have extreme outliers that would skew the quantization
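To see why outliers hurt, here is a small sketch with made-up numbers, where a single extreme value eats almost the entire quantization range:

import numpy as np

# One extreme outlier (100.0) among otherwise small values
vector = np.array([0.1, 0.2, 0.15, 0.12, 100.0], dtype=np.float32)

min_val, max_val = vector.min(), vector.max()
quantized = np.round((vector - min_val) / (max_val - min_val) * 255).astype(np.uint8)

print(quantized)
# [  0   0   0   0 255] - the four small values all collapse to 0,
# so the differences between them are completely lost

Clipping the range to a high percentile before quantizing is a common workaround, but it is worth measuring whether the resulting recall is acceptable.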

Real-World Impact

In production systems, the impact is tangible. A search system with 10 million embeddings at 768 dimensions needs about 30GB of RAM when stored as float32 (10M × 768 dimensions × 4 bytes ≈ 30.7GB). With 8-bit quantization, that drops to around 7.5GB. That’s the difference between needing a beefier server and running comfortably on a more modest instance.

And the speed gains? You’ll typically see 2-4x faster similarity computations, which directly translates to lower query latency and the ability to handle more queries per second.
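As a rough sketch of what the search side can look like: below, I quantize a toy database with a single symmetric scale (mapping [-max_abs, +max_abs] to [-127, 127]) so that integer dot products preserve the same ranking as float ones. This is not how production libraries implement it - they rely on dedicated int8 SIMD kernels and per-vector correction terms - and NumPy upcasts to int32 here, so the sketch illustrates the quality side rather than the raw speed gain.

import numpy as np

rng = np.random.default_rng(0)

# Toy database: 10k random vectors standing in for real embeddings
db = rng.normal(size=(10_000, 768)).astype(np.float32)
query = rng.normal(size=768).astype(np.float32)

# Symmetric scalar quantization: one global scale, int8 codes
scale = 127.0 / np.abs(db).max()
db_q = np.round(db * scale).astype(np.int8)
query_q = np.clip(np.round(query * scale), -127, 127).astype(np.int8)

# Integer similarity search (accumulate in int32 to avoid int8 overflow)
scores_q = db_q.astype(np.int32) @ query_q.astype(np.int32)
scores_f = db @ query

top10_q = set(np.argsort(-scores_q)[:10])
top10_f = set(np.argsort(-scores_f)[:10])
print(f"top-10 overlap: {len(top10_q & top10_f)}/10")  # usually 10/10 on this toy data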

Going further with Binary Quantization

To go beyond the 4x savings of float32-to-int8 scalar quantization, we can push further to binary quantization, which compresses float32 vectors by 32x.

With binary quantization, each dimension becomes a single bit based on its sign: bit = 1 if value > 0 else 0.

The precision loss is massive, but it still works surprisingly well for similarity ranking. Computing distances also gets much faster, since the inner product is replaced by a Hamming distance.
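Here is a minimal sketch of the idea in NumPy, where packbits/unpackbits stand in for the dedicated XOR + popcount instructions that real libraries use:

import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=768).astype(np.float32)
b = rng.normal(size=768).astype(np.float32)

# Binary quantization: one bit per dimension, based on sign
# (768 bits = 96 bytes, versus 3072 bytes in float32 - the 32x compression)
a_bits = np.packbits(a > 0)
b_bits = np.packbits(b > 0)

# Hamming distance: count the bits where the two vectors disagree
hamming = np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum()
print(f"{a_bits.nbytes} bytes per vector, Hamming distance = {hamming}")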