Dead Cheap LLM Inferencing in Production

Karan Singh
7 min read · Jul 7, 2024


A guide for leveraging Spot GPU Instances for Cost-Effective LLM Inference Workloads

Preface

At Scogo, we successfully power our LLM inferencing workloads in production using a robust fleet of Azure Spot GPU instances complemented by select on-demand GPU instances. This guide draws from our extensive implementation experience at Scogo, where we leverage both open-source and custom fine-tuned models to achieve optimal performance and cost-efficiency.

Read On …

What are LLM Inference Workloads?

Large Language Models (LLMs) like GPT-4, BERT, and T5 have revolutionised NLP by enabling machines to process and generate human-like text. These models power applications such as chatbots, automated support, and content creation. LLM inference leverages these pre-trained models to generate predictions or text from new inputs.

LLM inference is computationally intensive due to the models’ billions of parameters — comparable to the number of stars in the Milky Way. Efficient execution demands substantial parallel processing power, which GPUs provide. Their ability to perform parallel computations is vital for the real-time or near-real-time performance required by LLM inference workloads.

Challenge of Cost

Running GPU-heavy workloads on the cloud can be expensive, especially at scale. On-demand instances are flexible but quickly become costly. This is a major hurdle for startups and SMEs with limited budgets.

High demand for GPUs drives up prices, and maintaining high availability and scalability adds to the expense. Therefore, finding cost-effective solutions is crucial for sustainable and scalable AI/ML operations.

Understanding Spot Instances

Spot instances are spare cloud compute resources offered at significantly lower prices compared to on-demand instances. Cloud providers sell these unused instances at a discount, with the caveat that they can reclaim the instances at any time when demand for on-demand instances increases. This potential for sudden termination presents a challenge, but the cost savings are substantial, making spot instances an attractive option for many workloads.

Challenges and Constraints of Spot Instances:

  1. Interruption: The primary challenge is the risk of interruption. Cloud providers can terminate spot instances with little or no notice, making them unsuitable for long-running, stateful applications without additional handling (Azure does publish a brief eviction notice, as sketched after this list).
  2. Limited Availability: Spot instances may not always be available. Their availability can fluctuate based on the overall demand for resources in the cloud provider’s data centers.
  3. Price Variability: Spot instance prices can vary, sometimes significantly, depending on supply and demand dynamics. This can introduce cost unpredictability.
  4. No SLA: Unlike on-demand instances, spot instances typically do not come with service-level agreements (SLAs), which means there are no guarantees on availability or uptime.
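
Azure signals an upcoming spot eviction through its Instance Metadata Service: a “Preempt” scheduled event appears shortly (on the order of 30 seconds) before the VM is reclaimed. As a minimal sketch, the notice can be polled from the node or from a DaemonSet pod as shown below; in this guide we rely on Kubernetes rescheduling (covered next) rather than acting on the notice directly, but it is useful to know it exists.

# Poll the Azure Instance Metadata Service for scheduled events;
# spot evictions show up with "EventType": "Preempt"
curl -s -H "Metadata: true" "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"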

Leveraging Kubernetes to Manage Spot Instance Limitations

Kubernetes excels in handling the dynamic nature of spot instances, making it an ideal solution to mitigate their limitations. Here’s how Kubernetes can help:

  1. Auto-scaling: Kubernetes’ Horizontal Pod Autoscaler (HPA) and Cluster Autoscaler (CA) can automatically adjust the number of running pods and nodes based on workload demands. When a cloud provider reclaims spot capacity, Kubernetes can quickly launch new spot instances or fall back to on-demand instances to maintain the required capacity.
  2. Node Pools: Kubernetes supports multiple node pools, which can be configured with different instance types and priorities. By creating separate pools for spot and on-demand instances, Kubernetes can efficiently balance cost and reliability. Workloads can be scheduled on spot instances by default and seamlessly transitioned to on-demand instances when spot instances are terminated.
  3. Tolerations and Affinities: These advanced scheduling features enable precise control over workload placement. Tolerations allow pods to be scheduled on nodes with specific taints, such as spot instances, while affinities ensure that critical workloads are placed on more reliable on-demand instances when necessary. A minimal manifest example follows this list.
  4. Fault Tolerance and Resilience: Kubernetes’ self-healing capabilities automatically detect and reschedule pods when a node (including a spot instance) fails. This ensures minimal disruption to running workloads. Stateful applications can leverage Kubernetes StatefulSets and Persistent Volumes to maintain data integrity across node terminations.
  5. Preemptive Handling: Kubernetes can integrate with cloud provider-specific mechanisms to handle spot instance terminations gracefully. For example, using termination notices provided by cloud providers, Kubernetes can proactively drain and reschedule pods from spot instances before they are reclaimed.
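
To make points 2 and 3 concrete, here is a minimal pod-spec sketch. It assumes AKS’s standard spot label/taint key kubernetes.azure.com/scalesetpriority (the same key used in the deployment example later in this guide): the pod tolerates the spot taint and prefers spot nodes, while the scheduler can still fall back to on-demand nodes when no spot capacity is available.

# Pod spec fragment: tolerate the AKS spot taint and *prefer* spot nodes,
# falling back to on-demand nodes when no spot capacity is available
spec:
  tolerations:
  - key: "kubernetes.azure.com/scalesetpriority"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: "kubernetes.azure.com/scalesetpriority"
            operator: In
            values: ["spot"]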

By utilising these Kubernetes features, organizations can effectively manage the risks associated with spot instances while maximizing cost savings. The combination of Kubernetes’ robust orchestration capabilities and the economic benefits of spot instances offers a powerful solution for running GPU-intensive LLM inference workloads efficiently and cost-effectively.

GPU Options on Azure

Azure’s India regions offer a decent selection of Nvidia GPUs. Below are the ones we have tested:

Nvidia T4 GPU

The Nvidia T4 is a versatile GPU designed for a wide range of workloads, including AI inference, machine learning, and data analytics. On Azure, T4 GPUs are available in several instance types, offering a balance between performance and cost.

Key features of the Nvidia T4:

  • 16GB GDDR6 memory
  • 2,560 CUDA cores
  • 320 Tensor cores
  • Up to 65 TFLOPS of mixed-precision performance

Nvidia A100 GPU

The Nvidia A100 is a high-performance GPU designed for the most demanding AI and HPC (High-Performance Computing) workloads. Azure offers A100-based instances for customers requiring maximum computational power.

Key features of the Nvidia A100:

  • 40GB or 80GB HBM2e memory
  • 6,912 CUDA cores
  • 432 Tensor cores
  • Up to 624 TFLOPS in mixed-precision performance

On-Demand vs. Spot Instances

On-Demand Instances:

  • Provide guaranteed availability
  • Ideal for production workloads and time-sensitive tasks
  • Higher pricing compared to spot instances

Spot Instances:

  • Utilize unused Azure capacity
  • Can be interrupted with short notice
  • Offer significant cost savings (up to 90% compared to on-demand pricing)
  • Suitable for fault-tolerant, flexible workloads

Cost Comparison for Nvidia T4 GPUs on Azure (Central-India) Region

Cost Comparison for Nvidia A100 GPUs on Azure (Central-India) Region

Setting Up GPU Instances on AKS

Prerequisites:

  • An Azure subscription with sufficient GPU quota in the target region
  • Azure CLI (az) installed and logged in
  • An existing AKS cluster
  • kubectl configured to talk to the cluster

Step-by-Step Guide: Follow these steps to set up Spot Instance GPU node pools on an AKS cluster.

  1. Add and Update Extensions:
# Install/refresh the aks-preview CLI extension and register the preview feature
# that provides the GPU-dedicated node image
az extension add --name aks-preview
az extension update --name aks-preview
az feature register --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"
# Registration is asynchronous; re-run this until the state shows "Registered"
az feature show --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"
az provider register --namespace Microsoft.ContainerService

2. Create Spot Instance Node Pools:

  • Standard_NC6s_v3:
# --priority Spot requests spot capacity; the sku=gpu taint keeps non-GPU workloads off these nodes
az aks nodepool add --resource-group Scogo_development_RG \
--cluster-name Scogo-Development-AKS --name gpuspot3 --node-count 1 \
--node-vm-size Standard_NC6s_v3 --node-taints sku=gpu:NoSchedule \
--aks-custom-headers UseGPUDedicatedVHD=true,usegen2vm=true \
--enable-cluster-autoscaler --min-count 1 --max-count 10 --priority Spot
  • Standard_NC4as_T4_v3
az aks nodepool add --resource-group Scogo_development_RG \
--cluster-name Scogo-Development-AKS --name spotgput4 --node-count 1 \
--node-vm-size Standard_NC4as_T4_v3 --node-taints sku=gpu:NoSchedule \
--aks-custom-headers UseGPUDedicatedVHD=true,usegen2vm=true \
--enable-cluster-autoscaler --min-count 1 --max-count 10 --priority Spot
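
Optionally, an on-demand (regular priority) GPU pool can be created alongside the spot pools to serve as the fallback described earlier. The command below is illustrative; the pool name, VM size, and autoscaler bounds are assumptions to adapt to your own cluster.

# Same taint and autoscaler settings, but without --priority Spot,
# so this pool uses regular (on-demand) capacity as a fallback
az aks nodepool add --resource-group Scogo_development_RG \
--cluster-name Scogo-Development-AKS --name gpuondemand --node-count 1 \
--node-vm-size Standard_NC4as_T4_v3 --node-taints sku=gpu:NoSchedule \
--aks-custom-headers UseGPUDedicatedVHD=true,usegen2vm=true \
--enable-cluster-autoscaler --min-count 1 --max-count 5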

3. Verifying GPU Nodes on AKS

  • Check Node Status: Commands to verify the status and capacity of GPU nodes.
kubectl get nodes
kubectl describe node <Node name here>
  • Expected Output: Under the Capacity section, the GPU should be listed as
nvidia.com/gpu: 1

Your output should look similar to the following condensed example:

Name:               aks-gpunp-28993262-0
Roles: agent
Labels: accelerator=nvidia

[...]

Capacity:
[...]
nvidia.com/gpu: 1
[...]
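
As a convenience, GPU capacity across all nodes can also be listed with a single jsonpath query (the backslash escapes the dots in the nvidia.com/gpu resource name):

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.nvidia\.com/gpu}{"\n"}{end}'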

4. Deploying GPU Workloads

  • Sample YAML for TensorFlow MNIST Demo: The following manifest deploys a sample TensorFlow training job on the GPU nodes.
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: samples-tf-mnist-demo
  name: samples-tf-mnist2-demo
spec:
  template:
    metadata:
      labels:
        app: samples-tf-mnist-demo
    spec:
      containers:
      - name: samples-tf-mnist-demo
        image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "500"]
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
      nodeSelector:
        accelerator: nvidia
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      - key: "kubernetes.azure.com/scalesetpriority"
        operator: "Equal"
        value: "spot"
        effect: "NoSchedule"
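
Apply the manifest with kubectl (the file name here is illustrative):

kubectl apply -f samples-tf-mnist-demo.yaml
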
  • Checking Job and Pod Status: Commands to check the status of jobs and pods.
kubectl get jobs,pods
kubectl logs samples-tf-mnist-demo-smnr6

....
2019-05-16 16:08:31.396886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120]
Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 2fd7:00:00.0,
compute capability: 3.7)
.....

Conclusion

For organizations looking to optimize their LLM inference workloads on Azure, we recommend:

  1. Containerize LLM inference services for deployment on Kubernetes
  2. Implement auto-scaling policies to adjust to varying inference demand (a minimal example follows this list)
  3. Use a mix of on-demand and spot instances to balance reliability and cost-effectiveness
  4. Regularly monitor and optimize GPU utilization and spot instance usage
  5. Implement robust logging and monitoring to track inference performance and costs
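
As a minimal sketch of recommendation 2, a HorizontalPodAutoscaler can scale an inference Deployment on CPU utilization. The Deployment name llm-inference and the 70% target below are illustrative; scaling on GPU utilization or request latency would additionally require a custom or external metrics adapter.

# Illustrative HPA: scales the (assumed) llm-inference Deployment
# between 1 and 10 replicas based on average CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70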

By leveraging Kubernetes with Azure’s GPU offerings and spot instances, organizations can create a flexible, scalable, and cost-effective infrastructure for LLM inferencing. This approach allows for optimal resource utilization while maintaining the ability to handle varying workloads and minimize operational costs.
