How Much RAM Could a Vector Database Use If a Vector Database Could Use RAM

Featured image generated by ChatGPT 4o model: “a low poly woodchuck by a serene lake, surrounded by mountains and a forest with tree leaves made from DDR memory modules. The woodchuck is munching on a memory DIMM. The only memory DIMM in the image should be the one being eaten.”

Although the title is a pun on the famous “woodchuck rhyme,” the question is a serious one for LLM applications that use vector databases. As large language models (LLMs) continue to evolve, leveraging vector databases to store and search embeddings is critical. Understanding the memory usage of these systems is essential for maintaining performance and response times, and for ensuring that the system can scale.

In this article, we’ll explore the RAM requirements for vector databases in LLM solutions, discuss what happens when memory runs out, and investigate scaling techniques like memory tiering with Compute Express Link (CXL). We’ll also uncover how optimizing retrieval mechanisms can supercharge the performance of LLMs by improving service-level agreements (SLAs).

The Napkin Math of Vector Database RAM Usage

Before diving into advanced techniques, let’s cover some foundational “napkin math” to estimate the RAM usage of a vector database based on varying vector lengths, data types, and the number of vectors.

Most vector databases use floating-point representations for embeddings, with FP32 (32-bit) being a common format. The choice of data type depends on the specific needs of the application. For instance:

  • FP32 (32-bit floating point) [4 bytes]: Often the default for storing vectors in applications where precision is key (e.g., embeddings for NLP, image search).
  • FP16/BF16 (16-bit floating point/bfloat) [2 bytes]: Useful in deep learning applications where memory and speed optimizations are needed, at the cost of some precision.
  • INT8 (8-bit integer) [1 byte]: Used in quantized models to reduce memory consumption and increase throughput, especially for inference on edge devices.
  • FP64 (64-bit floating point) [8 bytes]: Rarely used but important in scenarios requiring extremely high numerical precision.

The selected data type impacts memory usage, computational speed, and precision, so it’s crucial to balance these factors based on the specific vector database use case.
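As a small illustration of the size and precision trade-off, the sketch below uses Python’s standard `struct` module to report the per-element size of each type and to show what happens to a value when it is stored in fewer bits. The helper names (`bytes_per_element`, `round_trip`) are made up for this example, and BF16 is left out because `struct` has no bfloat16 format code.

```python
import struct

# struct format codes for the element types listed above.
# BF16 is omitted: the struct module has no bfloat16 code.
FORMATS = {"FP64": "d", "FP32": "f", "FP16": "e", "INT8": "b"}


def bytes_per_element(dtype: str) -> int:
    """Return the storage size in bytes of one element of the given type."""
    return struct.calcsize("<" + FORMATS[dtype])


def round_trip(value: float, dtype: str) -> float:
    """Store a float in a floating-point format and read it back.

    Only meaningful for FP64, FP32, and FP16 (INT8 needs a quantization step).
    """
    fmt = "<" + FORMATS[dtype]
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


if __name__ == "__main__":
    for name in FORMATS:
        print(f"{name}: {bytes_per_element(name)} byte(s) per element")
    x = 0.123456789
    for name in ("FP64", "FP32", "FP16"):
        print(f"{name} stores {x} as {round_trip(x, name)!r}")
```

Running it shows the stored value drifting further from the original as the element size shrinks, which is the precision cost the list above refers to.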

Here’s a simple formula to estimate the raw memory required to store the data (vectors):

Memory (bytes) = Vector Length × Number of Vectors × Size of Data Type

Example Calculation

Let’s say you have an LLM producing embeddings of length 768 (common for BERT-based models) and you store 10 million vectors in FP32 format (4 bytes per element):

  • Memory = 768 × 10,000,000 × 4 = 30.72 GB

This is the raw memory needed just for storing the vectors themselves. Smaller data types (e.g., FP16, INT8) reduce memory requirements, but they also reduce precision, which can impact retrieval accuracy.

Qdrant capacity sizing, for example, suggests that an extra 50% is needed on top of the raw vector size for metadata (indexes, point versions, etc.) as well as for temporary segments constructed during the optimization process.
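The formula and the 50% overhead rule of thumb are easy to turn into a few lines of Python. This is a minimal sketch: the function names and the `overhead` parameter are illustrative choices for this article, and GB here means 10^9 bytes, matching the figures above.

```python
# Bytes per element for the data types discussed above.
BYTES_PER_ELEMENT = {"FP64": 8, "FP32": 4, "FP16": 2, "BF16": 2, "INT8": 1}


def raw_memory_bytes(vector_length: int, num_vectors: int, dtype: str = "FP32") -> int:
    """Memory (bytes) = Vector Length x Number of Vectors x Size of Data Type."""
    return vector_length * num_vectors * BYTES_PER_ELEMENT[dtype]


def estimated_memory_bytes(vector_length: int, num_vectors: int,
                           dtype: str = "FP32", overhead: float = 0.5) -> float:
    """Raw memory plus a fractional overhead (0.5 = the extra 50% rule of thumb)."""
    return raw_memory_bytes(vector_length, num_vectors, dtype) * (1 + overhead)


def to_gb(num_bytes: float) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10^9 bytes)."""
    return num_bytes / 1e9


if __name__ == "__main__":
    raw = raw_memory_bytes(768, 10_000_000, "FP32")
    total = estimated_memory_bytes(768, 10_000_000, "FP32")
    print(f"raw: {to_gb(raw):.2f} GB, with overhead: {to_gb(total):.2f} GB")
```

For the 768-dimension, 10 million vector example, this gives 30.72 GB raw and 46.08 GB once the 50% overhead is included.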

Here’s a table summarizing how different data types and vector lengths affect memory usage:

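| Vector Length | Data Type | Size (bytes) | 1M Vectors | 10M Vectors | 25M Vectors | 50M Vectors | 100M Vectors |
|---|---|---|---|---|---|---|---|
| 384 | FP32 | 4 | 1.536 GB | 15.36 GB | 38.4 GB | 76.8 GB | 153.6 GB |
| 768 | FP32 | 4 | 3.072 GB | 30.72 GB | 76.8 GB | 153.6 GB | 307.2 GB |
| 1024 | FP16 | 2 | 2.048 GB | 20.48 GB | 51.2 GB | 102.4 GB | 204.8 GB |
| 2048 | BF16 | 2 | 4.096 GB | 40.96 GB | 102.4 GB | 204.8 GB | 409.6 GB |
| 4096 | INT8 | 1 | 4.096 GB | 40.96 GB | 102.4 GB | 204.8 GB | 409.6 GB |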

For a more complete estimate, we can calculate the memory requirements for each data type across a range of vector lengths and quantities:

Raw Memory Requirements for FP16/BF16 (2 bytes per element)

| Vector Length | 1M Vectors | 5M Vectors | 10M Vectors | 25M Vectors | 50M Vectors | 100M Vectors |
|---|---|---|---|---|---|---|
| 384 | 0.768 GB | 3.84 GB | 7.68 GB | 19.2 GB | 38.4 GB | 76.8 GB |
| 768 | 1.536 GB | 7.68 GB | 15.36 GB | 38.4 GB | 76.8 GB | 153.6 GB |
| 1024 | 2.048 GB | 10.24 GB | 20.48 GB | 51.2 GB | 102.4 GB | 204.8 GB |
| 2048 | 4.096 GB | 20.48 GB | 40.96 GB | 102.4 GB | 204.8 GB | 409.6 GB |
| 3072 | 6.144 GB | 30.72 GB | 61.44 GB | 153.6 GB | 307.2 GB | 614.4 GB |
| 4096 | 8.192 GB | 40.96 GB | 81.92 GB | 204.8 GB | 409.6 GB | 819.2 GB |

Raw Memory Requirements for FP32 (4 bytes per element)

| Vector Length | 1M Vectors | 5M Vectors | 10M Vectors | 25M Vectors | 50M Vectors | 100M Vectors |
|---|---|---|---|---|---|---|
| 384 | 1.536 GB | 7.68 GB | 15.36 GB | 38.4 GB | 76.8 GB | 153.6 GB |
| 768 | 3.072 GB | 15.36 GB | 30.72 GB | 76.8 GB | 153.6 GB | 307.2 GB |
| 1024 | 4.096 GB | 20.48 GB | 40.96 GB | 102.4 GB | 204.8 GB | 409.6 GB |
| 2048 | 8.192 GB | 40.96 GB | 81.92 GB | 204.8 GB | 409.6 GB | 819.2 GB |
| 3072 | 12.288 GB | 61.44 GB | 122.88 GB | 307.2 GB | 614.4 GB | 1,229 GB |
| 4096 | 16.384 GB | 81.92 GB | 163.84 GB | 409.6 GB | 819.2 GB | 1,638 GB |

Raw Memory Requirements for INT8 (1 byte per element)

| Vector Length | 1M Vectors | 5M Vectors | 10M Vectors | 25M Vectors | 50M Vectors | 100M Vectors |
|---|---|---|---|---|---|---|
| 384 | 0.384 GB | 1.92 GB | 3.84 GB | 9.6 GB | 19.2 GB | 38.4 GB |
| 768 | 0.768 GB | 3.84 GB | 7.68 GB | 19.2 GB | 38.4 GB | 76.8 GB |
| 1024 | 1.024 GB | 5.12 GB | 10.24 GB | 25.6 GB | 51.2 GB | 102.4 GB |
| 2048 | 2.048 GB | 10.24 GB | 20.48 GB | 51.2 GB | 102.4 GB | 204.8 GB |
| 3072 | 3.072 GB | 15.36 GB | 30.72 GB | 76.8 GB | 153.6 GB | 307.2 GB |
| 4096 | 4.096 GB | 20.48 GB | 40.96 GB | 102.4 GB | 204.8 GB | 409.6 GB |

These tables give an estimate of the memory required for each data type and vector length for various numbers of vectors.

Note: The memory requirements in the tables above assume no additional overhead from indexing or metadata. In practice, you would need to account for some additional memory for indexes, metadata, and the database itself.
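The same arithmetic can be turned around to ask how many vectors fit in a given amount of RAM. The sketch below is only a rough guide: `max_vectors` is a name invented for this example, the `overhead` fraction is an assumption you should tune for your own deployment, and it ignores memory used by the operating system and the database process.

```python
def max_vectors(ram_gb: float, vector_length: int, bytes_per_element: int,
                overhead: float = 0.5) -> int:
    """Estimate how many vectors fit in `ram_gb` decimal gigabytes of RAM.

    `overhead` is the assumed fractional allowance for indexes and metadata
    (0.0 reproduces the raw numbers in the tables above).
    """
    budget_bytes = ram_gb * 1e9
    bytes_per_vector = vector_length * bytes_per_element * (1 + overhead)
    return int(budget_bytes // bytes_per_vector)


if __name__ == "__main__":
    for overhead in (0.0, 0.5):
        n = max_vectors(256, 768, 4, overhead=overhead)
        print(f"256 GB, 768-dim FP32, overhead {overhead:.0%}: ~{n:,} vectors")
```

With 256 GB and 768-dimension FP32 vectors, that works out to roughly 83 million vectors with no overhead, or about 55 million once a 50% allowance is included.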

This napkin math gives you a rough estimate of the memory required. But what happens when the database runs out of RAM?

The Cost of Storing Your Data at Scale

When considering vector databases for your AI applications, understanding the storage costs is crucial. I compiled a cost comparison table using Qdrant’s cloud calculator, which provides insight into the monthly cost of reserved instances for various vector sizes and quantities:

| Vector Size | 1 Million | 5 Million | 10 Million | 25 Million | 50 Million | 100 Million |
|---|---|---|---|---|---|---|
| 384 | $219.00 | $219.00 | $219.00 | $438.00 | $876.00 | $1,752.00 |
| 1024 | $219.00 | $438.00 | $876.00 | $1,752.00 | $3,504.00 | $7,008.00 |
| 2048 | $438.00 | $876.00 | $1,752.00 | $3,504.00 | $7,008.00 | $14,016.00 |
| 4096 | $876.00 | $1,752.00 | $3,504.00 | $7,008.00 | $14,016.00 | $19,681.00 |

Qdrant’s cloud calculator is very useful and exposes many variables. I kept it simple and selected the AWS US-West-2 region, no storage optimization, 1 replica, and no quantization. Since the AWS instances cap out at 256 GB of RAM, larger datasets require additional EC2 instances, and the database has to be sharded across them.

As the table shows, costs increase with both vector size and quantity. When planning your vector database, consider your current needs and potential growth. While higher-dimensional vectors offer more detailed representations, they come at a higher cost. Balancing performance requirements with budget constraints is key to optimizing your vector storage strategy.
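To get a feel for when sharding becomes unavoidable, the sketch below estimates how many instances a dataset needs given the 256 GB per-instance cap mentioned above. It is a back-of-the-envelope calculation, not a model of Qdrant’s calculator or pricing: `instances_needed` is a hypothetical helper, the 50% overhead is the rule of thumb from earlier, and it assumes all of an instance’s RAM is available to the database.

```python
import math


def instances_needed(vector_length: int, num_vectors: int,
                     bytes_per_element: int = 4, overhead: float = 0.5,
                     ram_per_instance_gb: float = 256.0) -> int:
    """Estimate the number of instances (shards) needed to hold the data in RAM.

    Assumes the full RAM of each instance is usable, which is optimistic.
    """
    total_bytes = vector_length * num_vectors * bytes_per_element * (1 + overhead)
    return max(1, math.ceil(total_bytes / (ram_per_instance_gb * 1e9)))


if __name__ == "__main__":
    for length, count in [(768, 10_000_000), (2048, 50_000_000), (4096, 100_000_000)]:
        n = instances_needed(length, count)
        print(f"{length}-dim x {count:,} FP32 vectors: {n} instance(s)")
```

Under these assumptions, 100 million 4096-dimension FP32 vectors need about ten 256 GB instances, while 10 million 768-dimension vectors fit comfortably on one.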

What Happens When a Vector Database Runs Out of Memory

Vector databases, like Qdrant, primarily operate in memory to ensure low-latency retrieval. When the system runs out of memory, one or more of the following mechanisms comes into play:

1. Swapping to Disk

  • If RAM is exhausted, the system may start swapping data to disk, using it as virtual memory. This degrades performance significantly because disk access is orders of magnitude slower than RAM. For systems running LLMs, this is a worst-case scenario as response times may increase drastically, violating SLAs.

2. Memory-Mapped Files

  • Some vector databases use memory-mapped files to extend memory with disk storage. This technique maps files from disk into the virtual address space of the application, allowing the database to treat them as RAM. It’s more efficient than pure swapping but still comes with higher latency compared to RAM operations.

3. Sharding and Distributed Storage

  • Another solution is sharding—splitting the database into smaller parts and distributing them across multiple machines. Each shard is stored in RAM, allowing the system to scale horizontally. However, this adds complexity in terms of data management and network latency.

4. Storage Layers in Qdrant

  • Qdrant offers persistent storage to complement its in-memory operations. When memory is insufficient, Qdrant offloads data to SSDs or other storage devices. While this extends capacity, it’s crucial to ensure that the storage I/O is fast enough to minimize the impact on query latency.
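To make the memory-mapped file approach from item 2 concrete, here is a minimal sketch using Python’s standard `mmap` module. It writes a few tiny FP32 vectors to a temporary file, maps the file into the process’s address space, and decodes one vector directly from the mapping. The file layout, the `DIM` constant, and the helper names are invented for this example and do not reflect how Qdrant or any other database lays out its storage.

```python
import mmap
import os
import struct
import tempfile

DIM = 4                             # tiny vector length, for illustration only
RECORD = struct.Struct(f"<{DIM}f")  # one little-endian FP32 vector


def write_vectors(path, vectors):
    """Write vectors back to back as raw FP32 values."""
    with open(path, "wb") as f:
        for vec in vectors:
            f.write(RECORD.pack(*vec))


def read_vector(mapped, index):
    """Decode the vector at `index` straight from the mapped file."""
    return RECORD.unpack_from(mapped, index * RECORD.size)


def demo():
    vectors = [(1.0, 0.0, 0.0, 0.0), (0.0, 2.0, 0.0, 0.0), (0.5, 0.5, 0.5, 0.5)]
    fd, path = tempfile.mkstemp(suffix=".vec")
    os.close(fd)
    try:
        write_vectors(path, vectors)
        with open(path, "rb") as f:
            # Length 0 maps the whole file; pages are loaded on first access.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                return read_vector(mapped, 2)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

The operating system decides which pages of the file stay resident, so frequently accessed vectors are served from RAM while cold ones incur a disk read on first touch.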

Memory limitations are a bottleneck for vector databases, especially in LLM solutions. But Compute Express Link (CXL) offers a promising solution. CXL allows for memory expansion by seamlessly integrating additional memory devices directly into the memory hierarchy.

How CXL Helps:

  • Memory Expansion (Scale-Up): CXL allows you to attach large (multi-terabyte) pools of memory to your system, beyond the physical limits of traditional DRAM. This enables vector databases to store more and larger embeddings without running out of memory, all while maintaining low latency and high throughput.

  • Improved Bandwidth: CXL also provides high-bandwidth connections to the memory devices, which reduces the performance degradation usually associated with using slower tiers of memory like SSDs.

  • Latency Reduction: By using CXL-attached memory, the system can utilize the additional memory tier without incurring the latency penalties typically seen when swapping to disk. This ensures that retrieval operations remain performant, even at large scales.

Exciting New Possibilities

With CXL, vector databases like Qdrant can handle much larger datasets with the same hardware, allowing LLMs to store and query more embeddings without performance degradation. The magic here is that this scale-up can be achieved without rewriting software or radically changing the infrastructure. You simply plug in additional memory, and the system manages it as part of the regular memory hierarchy.

This opens up exciting possibilities for LLMs that rely on vector databases for retrieval:

  • Faster Response Times: With more memory available, LLMs can retrieve embeddings and execute similarity searches faster, improving overall query throughput.
  • Higher Accuracy: Storing more vectors in-memory allows for larger search spaces, increasing the accuracy of retrieval.
  • Larger Context Windows: More memory also enables storing embeddings for larger context windows, which is critical for tasks like document summarization or question-answering systems.

The Magic is in the Retrieval: Why Optimizing Vector Search Matters

When we think about the performance of large language models (LLMs), we often focus on the model itself, but the real magic lies in retrieval. As described in the InfoWorld article “The magic of RAG is in the retrieval,” retrieval-augmented generation (RAG) combines search and generation, and the quality and speed of retrieval directly influence the output of the model.

Vector databases are at the heart of retrieval in RAG systems. If the vector search is slow, even the most powerful LLM will struggle to perform efficiently. By optimizing memory usage with techniques like memory tiering and CXL, we can ensure that the retrieval phase is quick, accurate, and scalable.
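At its core, the retrieval step is a similarity search over stored embeddings. The sketch below shows the simplest possible version: an exhaustive cosine-similarity scan in pure Python. It is only meant to illustrate the operation; production vector databases typically rely on approximate nearest-neighbor indexes rather than a full scan, and the function names here are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the indexes of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


if __name__ == "__main__":
    stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(top_k([1.0, 0.1], stored, k=2))
```

Every query touches every stored vector in this version, which is why keeping the vectors (or at least the index) in fast memory matters so much for latency.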

Enhancing SLAs with Optimized Retrieval

Service-level agreements (SLAs) in LLM-based systems often require low-latency responses, especially in interactive applications like chatbots or real-time recommendation systems. By improving vector search performance through better memory management and retrieval optimization, the entire LLM pipeline becomes more efficient. This translates to faster responses, more satisfied users, and ultimately, better system performance.

Accelerating Data Retrieval in Retrieval Augmentation Generation (RAG) Pipelines using CXL

In the article a colleague and I wrote, “Accelerating Data Retrieval in Retrieval Augmentation Generation (RAG) Pipelines using CXL,” we explored the idea of accelerating the vector database and found significant improvements using CXL.

We only scratched the surface of this fascinating area and plan to do more research, including:

  • Comparing memory scalability using CXL memory devices to see how big we can go. Using a combination of DRAM + CXL, affordable multi-terabyte systems are entirely possible to deploy.
  • Improving data placement by making the database’s memory management multi-tier aware; that is, the database knows that it can use both DRAM and CXL memory and intelligently places data and metadata in the optimal memory tier.
  • and much more…
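As a thought experiment for the tier-aware placement idea above, the sketch below greedily assigns the “hottest” pieces of data to DRAM and spills the rest to CXL memory. Everything in it is hypothetical: the item names, sizes, hotness scores, and the `place` function are made up for illustration and do not describe any existing database or the approach our research will take.

```python
def place(items, dram_gb, cxl_gb):
    """Greedy tier placement sketch.

    items: list of (name, size_gb, hotness); higher hotness = accessed more often.
    Returns a dict mapping each name to "DRAM", "CXL", or "UNPLACED".
    """
    free = {"DRAM": dram_gb, "CXL": cxl_gb}
    placement = {}
    # Consider the hottest items first so they get first claim on DRAM.
    for name, size_gb, _hotness in sorted(items, key=lambda item: item[2], reverse=True):
        for tier in ("DRAM", "CXL"):
            if size_gb <= free[tier]:
                free[tier] -= size_gb
                placement[name] = tier
                break
        else:
            placement[name] = "UNPLACED"  # does not fit in either tier
    return placement


if __name__ == "__main__":
    # Hypothetical workload: sizes in GB, hotness on an arbitrary scale.
    workload = [("index", 50, 10), ("vectors", 300, 5), ("payload", 100, 1)]
    print(place(workload, dram_gb=128, cxl_gb=512))
```

In this toy workload the index lands in DRAM while the bulk vectors and payload go to CXL memory; a real policy would also need to react to changing access patterns.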

Conclusion

So, “how much RAM could a vector database use if a vector database could use RAM?” Quite a lot! And it’ll cost a lot too! The amount depends on the vector length, data type, and the number of vectors. However, running out of RAM doesn’t have to spell disaster. By using techniques like storage layers in Qdrant, memory-mapped files, and memory expansion via CXL, LLM engineers can build systems that scale effectively without sacrificing performance.

Optimizing retrieval in vector databases isn’t just about reducing memory consumption—it’s about boosting the overall end-to-end performance of LLM systems. With the right memory management and retrieval strategies, you can ensure that your LLM continues to provide fast, accurate results, even as your vector databases grow in size. The future of memory scaling with technologies like CXL is particularly exciting, enabling new possibilities for LLMs while maintaining the high performance that users expect.
