Calculate: How much GPU memory you need to serve any LLM?

Karan Singh
3 min read · Jul 11, 2024


Just tell me: how much GPU memory do I need to serve my LLM? If you are looking for that answer too, read on…

What is Model Serving?

Model serving is the process of deploying a trained machine learning model into production so that it can be used to make predictions on new data. In the context of large language models (LLMs), model serving refers to making the LLM available to answer questions, generate text, or perform other tasks based on user input.

So in one line,

Serving = Prompt IN, Answer Out
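
To make that concrete, here is a minimal "prompt in, answer out" sketch using the Hugging Face transformers pipeline API. The tiny gpt2 model is just a stand-in so the snippet runs on modest hardware; a real deployment would load a much larger LLM onto one or more GPUs.

```python
# Minimal "prompt in, answer out" example with the transformers pipeline.
# gpt2 is a small stand-in model, not a production LLM.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "In one sentence, what is model serving?"
output = generator(prompt, max_new_tokens=40, do_sample=False)

print(output[0]["generated_text"])  # prompt in, answer out
```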

What’s the Significance of GPU VRAM, and Why Not Use RAM for LLMs?

Large language models (LLMs) are computationally expensive to run. They require a lot of memory to store the model parameters and the intermediate results produced during inference. System memory (RAM) is not ideal for this because it feeds the processor far more slowly than GPU memory does. GPU memory, also known as VRAM (video RAM) and typically built from GDDR or HBM chips, is designed for high-performance workloads like deep learning: it provides the speed and bandwidth needed to run large language models efficiently, without the bottleneck of shuttling data between system memory and the processing units.

So, the more VRAM a GPU has, the bigger the LLM it can host and serve.
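
A quick way to see how much VRAM you actually have to work with is to query the GPUs directly; this sketch assumes PyTorch with CUDA support is installed.

```python
# Print the total VRAM of each visible GPU (assumes PyTorch with CUDA).
# This is the budget the sizing formula in the next section has to fit into.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA-capable GPU detected")
```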

What’s the Math for GPU Memory Requirements for Serving an LLM?

A common formula used to estimate the GPU memory required for serving an LLM is:

GPU Memory ≈ P × Q × Overhead

(formula adapted from https://www.substratus.ai/blog/calculating-gpu-memory-for-llm)

Where:
  • P (parameters): The number of parameters in the model. For example, GPT-3 has 175 billion parameters, Llama 70B has 70 billion parameters, and so on.
  • Q (precision, or size per parameter): The number of bytes used to store each parameter, determined by its data type. Common data types include:
      • FP32 (32-bit floating point): 4 bytes per parameter
      • FP16/BF16 (16-bit floating point): 2 bytes per parameter
      • INT8 (8-bit integer): 1 byte per parameter
      • INT4 (4-bit integer): 0.5 bytes per parameter

  • Overhead factor: Accounts for additional memory used during inference, such as the activations (intermediate results) of the model. A typical rule of thumb is 20%, i.e. a factor of 1.2.
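
The same arithmetic is easy to wrap in a small helper. This is only a rough sketch of the rule of thumb above, not a precise measurement of what a serving stack will actually allocate.

```python
# Rule-of-thumb sizing helper implementing the formula above:
# parameters x bytes-per-parameter x overhead factor (1.2 = 20% overhead).
def estimate_serving_memory_gb(params_billions: float,
                               bytes_per_param: float,
                               overhead: float = 1.2) -> float:
    """Approximate GPU memory (in GB) needed to serve a model."""
    return params_billions * 1e9 * bytes_per_param * overhead / 1e9

# GPT-3 (175B parameters) served in FP16: 175 x 2 bytes x 1.2 = 420 GB
print(estimate_serving_memory_gb(175, 2))  # 420.0
```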

For example, let’s take Llama 70B, a model with 70 billion parameters. If the model is stored in float32 format (4 bytes per parameter) and we assume a 20% overhead factor, the memory requirement works out to:

70 × 10⁹ parameters × 4 bytes × 1.2 ≈ 336 GB

To serve this model at full float32 precision you would therefore need several NVIDIA A100 80GB GPUs (roughly five of them).

How to Reduce GPU Memory Requirements for Serving an LLM?

One approach to reducing GPU memory requirements is quantization. Quantization is a technique that reduces the precision of the model’s parameters by converting them from a higher-precision format (e.g., float32) to a lower-precision format (e.g., float16 or lower). This can cut memory usage dramatically, often with little impact on accuracy.

In our Llama 70B example, using float16 precision instead of float32 cuts the memory requirement in half (from 4 bytes per parameter to 2 bytes per parameter): from roughly 336 GB down to roughly 168 GB.
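
Plugging both precisions into the helper sketched earlier shows the effect directly.

```python
# Effect of precision on Llama 70B, using the estimate_serving_memory_gb
# helper defined in the previous section.
fp32_gb = estimate_serving_memory_gb(70, 4)  # 70 x 4 bytes x 1.2 = 336 GB
fp16_gb = estimate_serving_memory_gb(70, 2)  # 70 x 2 bytes x 1.2 = 168 GB
print(f"FP32: {fp32_gb:.0f} GB, FP16: {fp16_gb:.0f} GB")  # memory halves
```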

What’s the deal with Quantization?

Quantization techniques can shrink the memory footprint further by using even lower-precision formats (such as INT8 or INT4). However, lowering precision can affect the quality of the outputs; INT8 quantization, for example, can sometimes lead to a more noticeable drop in accuracy than FP16. It’s crucial to evaluate the model’s performance both pre- and post-quantization.
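
As an illustration, this is roughly how a 4-bit quantized load looks with the Hugging Face transformers and bitsandbytes integration. The model id is only an example (it requires access and a large download), and the exact option names can vary between library versions.

```python
# Sketch: loading a model with 4-bit weight quantization via transformers
# + bitsandbytes. Model id is illustrative; options may differ by version.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-70b-hf"  # example only

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # ~0.5 bytes per weight
    bnb_4bit_compute_dtype=torch.float16,  # do the math in FP16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs
)
```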

Summary

Serving large language models requires significant GPU memory resources. The amount of memory needed depends on the size and complexity of the model, the data type used to store the parameters, and any optimizations applied like quantization. By understanding the factors that influence GPU memory requirements, developers can make informed decisions about how to deploy LLMs for optimal performance and efficiency.

Written by Karan Singh

Co-Founder & CTO @ Scogo ♦ I Love to solve problems using Tech
