Introduction: Overcoming GPU Management Challenges
In Part 1 of this blog series, we explored the challenges of hosting large language models (LLMs) on CPU-based workloads within an EKS cluster. We discussed the inefficiencies associated with using CPUs for such tasks, primarily due to the large model sizes and slower inference speeds. The introduction of GPU resources offered a significant performance boost, but it also brought about the need for efficient management of these high-cost resources.
In this second part, we will delve deeper into how to optimize GPU usage for these workloads. We will cover the following key areas:
NVIDIA Device Plugin Setup: This section will explain the importance of the NVIDIA device plugin for Kubernetes, detailing its role in resource discovery, allocation, and isolation.
Time Slicing: We’ll discuss how time slicing allows multiple processes to share GPU resources effectively, ensuring maximum utilization.
Node Autoscaling with Karpenter: This section will describe how Karpenter dynamically manages node scaling based on real-time demand, optimizing resource utilization and reducing costs.
Challenges Addressed
Efficient GPU Management: Ensuring GPUs are fully utilized to justify their high cost.
Concurrency Handling: Allowing multiple workloads to share GPU resources effectively.
Dynamic Scaling: Automatically adjusting the number of nodes based on workload demands.
Section 1: Introduction to NVIDIA Device Plugin
The NVIDIA device plugin for Kubernetes is a component that simplifies the management and usage of NVIDIA GPUs in Kubernetes clusters. It allows Kubernetes to recognize and allocate GPU resources to pods, enabling GPU-accelerated workloads.
Why We Need the NVIDIA Device Plugin
Resource Discovery: Automatically detects NVIDIA GPU resources on each node.
Resource Allocation: Manages the distribution of GPU resources to pods based on their requests.
Isolation: Ensures secure and efficient utilization of GPU resources among different pods.
The NVIDIA device plugin simplifies GPU management in Kubernetes clusters. It builds on the NVIDIA driver, container toolkit, and CUDA already present on each node and advertises the GPUs to Kubernetes, so workloads can consume GPU resources without manual per-pod setup.
NVIDIA Driver: Required for nvidia-smi and basic GPU operations, interfacing directly with the GPU hardware. The screenshot below displays the output of the nvidia-smi command, which shows key information such as the driver version, CUDA version, and detailed GPU configuration, confirming that the GPU is properly configured and ready for use.
NVIDIA Container Toolkit: Required for using GPUs with containerd. Below we can see the installed version of the container toolkit on the instance:
#Installed version
rpm -qa | grep -i nvidia-container-toolkit
nvidia-container-toolkit-base-1.15.0-1.x86_64
nvidia-container-toolkit-1.15.0-1.x86_64
CUDA: Required for GPU-accelerated applications and libraries. Below is the output of the nvcc command, showing the version of CUDA installed on the system:
/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Setting Up the NVIDIA Device Plugin
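Before wiring up labels and tolerations, the plugin itself needs to be deployed. One common way to do this (a sketch, not necessarily how our cluster was provisioned) is via the official Helm chart; the version shown is a placeholder:

#Add the official device plugin chart repository
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

#Install the plugin into kube-system (pick a current release version)
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --version 0.15.0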
To ensure the DaemonSet runs exclusively on GPU-based instances, we label the node with the key "nvidia.com/gpu" and the value "true". This is achieved using node affinity, node selectors, and taints and tolerations.
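For example, the label can be applied with kubectl (using the GPU node referenced throughout this post):

kubectl label node ip-10-20-23-199.us-west-1.compute.internal nvidia.com/gpu=true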
Let us now delve into each of these components in detail.
Node Affinity: Node affinity allows pods to be scheduled on nodes based on node labels. With requiredDuringSchedulingIgnoredDuringExecution, the scheduler cannot place the Pod unless the rule is met; here the key is "nvidia.com/gpu", the operator is "In", and the value is "true", as shown below:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-10de.present
              operator: In
              values:
                - "true"
        - matchExpressions:
            - key: feature.node.kubernetes.io/cpu-model.vendor_id
              operator: In
              values:
                - NVIDIA
        - matchExpressions:
            - key: nvidia.com/gpu
              operator: In
              values:
                - "true"
Node Selector: The node selector is the simplest form of node selection constraint; here we use nvidia.com/gpu: "true".
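In the DaemonSet pod spec this takes the following form:

nodeSelector:
  nvidia.com/gpu: "true"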
Taints and Tolerations: Tolerations are added to the DaemonSet to ensure it can be scheduled on the tainted GPU nodes (nvidia.com/gpu=true:NoSchedule).
kubectl taint node ip-10-20-23-199.us-west-1.compute.internal nvidia.com/gpu=true:NoSchedule

kubectl describe node ip-10-20-23-199.us-west-1.compute.internal | grep -i taint
Taints: nvidia.com/gpu=true:NoSchedule
tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
After implementing the node labeling, affinity, node selector, and taints/tolerations, we can ensure the DaemonSet runs exclusively on GPU-based instances. We can verify the deployment of the NVIDIA device plugin using the following command:
kubectl get ds -n kube-system
NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                     AGE
nvidia-device-plugin                      1         1         1       1            1           nvidia.com/gpu=true                               75d
nvidia-device-plugin-mps-control-daemon   0         0         0       0            0           nvidia.com/gpu=true,nvidia.com/mps.capable=true   75d
The challenge, however, is that GPUs are expensive, so we need to ensure they are utilized to the fullest. Let us explore GPU concurrency in more detail.
GPU Concurrency:
GPU concurrency refers to the ability to execute multiple tasks or threads simultaneously on a GPU.
Single Process: In a single process setup, only one application or container uses the GPU at a time. This approach is straightforward but may lead to underutilization of the GPU resources if the application does not fully load the GPU.
Multi-Process Service (MPS): NVIDIA’s Multi-Process Service (MPS) allows multiple CUDA applications to share a single GPU concurrently, improving GPU utilization and reducing the overhead of context switching.
Time Slicing: Time slicing divides GPU time between different processes; in other words, multiple processes take turns on the GPU (round-robin context switching).
Multi-Instance GPU (MIG): MIG is a feature available on NVIDIA A100 (and later) GPUs that allows a single GPU to be partitioned into multiple smaller, isolated instances, each behaving like a separate GPU.
Virtualization: GPU virtualization allows a single physical GPU to be shared among multiple virtual machines (VMs) or containers, providing each with a virtual GPU.
Section 2: Implementing Time Slicing for GPUs
Time-slicing in the context of NVIDIA GPUs and Kubernetes refers to sharing a physical GPU among multiple containers or pods in a Kubernetes cluster. The technology involves partitioning the GPU’s processing time into smaller intervals and allocating those intervals to different containers or pods.
Time Slice Allocation: The GPU scheduler allocates time slices to each vGPU configured on the physical GPU.
Preemption and Context Switching: At the end of a vGPU’s time slice, the GPU scheduler preempts its execution, saves its context, and switches to the next vGPU’s context.
Context Switching: The GPU scheduler ensures smooth context switching between vGPUs, minimizing overhead and ensuring efficient use of GPU resources.
Task Completion: Processes within containers complete their GPU-accelerated tasks within their allocated time slices.
Resource Management and Monitoring
Resource Release: As tasks complete, GPU resources are released back to Kubernetes for reallocation to other pods or containers.
Why We Need Time Slicing
Cost Efficiency: Ensures high-cost GPUs are not underutilized.
Concurrency: Allows multiple applications to use the GPU simultaneously.
Configuration Example for Time Slicing
Let us apply the time-slicing configuration using a ConfigMap, as shown below. Here replicas: 3 specifies the number of replicas for the GPU resource, meaning each GPU can be sliced into 3 shared instances.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 3
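To have the plugin pick up this configuration, one approach (a sketch; the file name is hypothetical, and the exact wiring depends on how the plugin was deployed, for example via the Helm chart's config settings) is to apply the ConfigMap and restart the DaemonSet:

#Apply the time-slicing ConfigMap (file name is hypothetical)
kubectl apply -f nvidia-device-plugin-configmap.yaml

#Restart the device plugin so it reloads the sharing configuration
kubectl rollout restart ds/nvidia-device-plugin -n kube-system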
#We can verify the GPU resources available on your nodes using the following command:
kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) | {name: .metadata.name, capacity: .status.capacity}'
{
  "name": "ip-10-20-23-199.us-west-1.compute.internal",
  "capacity": {
    "cpu": "4",
    "ephemeral-storage": "104845292Ki",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "16069060Ki",
    "nvidia.com/gpu": "3",
    "pods": "110"
  }
}
#The above output shows that the node ip-10-20-23-199.us-west-1.compute.internal has 3 virtual GPUs available.
#We can request GPU resources in pod specifications by setting resource requests and limits:
resources:
  limits:
    cpu: "1"
    memory: 2G
    nvidia.com/gpu: "1"
  requests:
    cpu: "1"
    memory: 2G
    nvidia.com/gpu: "1"
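For context, here is a minimal sketch of how this resources block fits into a full Deployment; the deployment name, labels, and image are illustrative (the TensorFlow image matches the one pre-pulled in the node userData later in this post):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference            # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"   # target GPU-labeled nodes
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
      containers:
        - name: inference
          image: tensorflow/tensorflow:2.12.0-gpu
          resources:
            limits:
              cpu: "1"
              memory: 2G
              nvidia.com/gpu: "1"
            requests:
              cpu: "1"
              memory: 2G
              nvidia.com/gpu: "1"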
In our case, we can host three pods on the single node ip-10-20-23-199.us-west-1.compute.internal, and because of time slicing these three pods can use the 3 virtual GPUs, as shown below.
The GPU has been shared virtually among the pods, and we can see the PIDs assigned to each of the processes below.
Now that we have optimized GPU usage at the pod level, let us focus on optimizing GPU resources at the node level. We can achieve this with a cluster autoscaling solution called Karpenter. This is particularly important because the learning labs may not always have a constant load or user activity, and GPUs are extremely expensive. By leveraging Karpenter, we can dynamically scale GPU nodes up or down based on demand, ensuring cost efficiency and optimal resource utilization.
Section 3: Node Autoscaling with Karpenter
Karpenter is an open-source node lifecycle manager for Kubernetes. It automates the provisioning and deprovisioning of nodes based on the scheduling needs of pods, enabling efficient scaling and cost optimization.
Dynamic Node Provisioning: Automatically scales nodes based on demand.
Optimizes Resource Utilization: Matches node capacity with workload needs.
Reduces Operational Costs: Minimizes unnecessary resource expenses.
Improves Cluster Efficiency: Enhances overall performance and responsiveness.
Why Use Karpenter for Dynamic Scaling
Dynamic Scaling: Automatically adjusts node count based on workload demands.
Cost Optimization: Ensures resources are only provisioned when needed, reducing expenses.
Efficient Resource Management: Tracks pods unable to be scheduled due to lack of resources, reviews their requirements, provisions nodes to accommodate them, schedules the pods, and decommissions nodes when redundant.
Installing Karpenter:
#Install Karpenter using Helm:
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version "${KARPENTER_VERSION}" \
  --namespace "${KARPENTER_NAMESPACE}" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi
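The command above assumes a few environment variables are already exported; a sketch with placeholder values (adjust to your environment):

export KARPENTER_NAMESPACE="kube-system"        # matches where the Karpenter pods run in our cluster
export KARPENTER_VERSION="1.x.x"                # placeholder; use the release you intend to install
export CLUSTER_NAME="nextgen-learninglab-eks"   # our EKS cluster name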
#Verify Karpenter Installation:
kubectl get pod -n kube-system | grep -i karpenter
karpenter-7df6c54cc-rsv8s   1/1   Running   2 (10d ago)   53d
karpenter-7df6c54cc-zrl9n   1/1   Running   0             53d
Configuring Karpenter with NodePools and NodeClasses:
Karpenter can be configured with NodePools and NodeClasses to automate the provisioning and scaling of nodes based on the specific needs of your workloads.
Karpenter NodePool: A NodePool is a custom resource that defines a set of nodes with shared specifications and constraints in a Kubernetes cluster. Karpenter uses NodePools to dynamically manage and scale node resources based on the requirements of running workloads.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: g4-nodepool
spec:
  template:
    metadata:
      labels:
        nvidia.com/gpu: "true"
    spec:
      taints:
        - effect: NoSchedule
          key: nvidia.com/gpu
          value: "true"
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g4dn.xlarge"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: g4-nodeclass
  limits:
    cpu: 1000
  disruption:
    expireAfter: 120m
    consolidationPolicy: WhenUnderutilized
NodeClasses are configurations that define the characteristics and parameters for the nodes that Karpenter can provision in a Kubernetes cluster. A NodeClass specifies the underlying infrastructure details for nodes, such as instance types, launch template configurations and specific cloud provider settings.
Note: The userData section contains scripts to bootstrap the EC2 instance, including pulling a TensorFlow GPU Docker image and configuring the instance to join the Kubernetes cluster.
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: g4-nodeclass
spec:
  amiFamily: AL2
  launchTemplate:
    name: "ack_nodegroup_template_new"
    version: "7"
  role: "KarpenterNodeRole"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "nextgen-learninglab"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "nextgen-learninglab"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 10000
        encrypted: true
        deleteOnTermination: true
        throughput: 125
  tags:
    Name: Learninglab-Staging-Auto-GPU-Node
  userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="//"

    --//
    Content-Type: text/x-shellscript; charset="us-ascii"

    set -ex
    sudo ctr -n=k8s.io image pull docker.io/tensorflow/tensorflow:2.12.0-gpu

    --//
    Content-Type: text/x-shellscript; charset="us-ascii"

    B64_CLUSTER_CA=" "
    API_SERVER_URL=""
    /etc/eks/bootstrap.sh nextgen-learninglab-eks \
      --kubelet-extra-args '--node-labels=eks.amazonaws.com/capacityType=ON_DEMAND --pod-max-pids=32768 --max-pods=110' \
      --b64-cluster-ca $B64_CLUSTER_CA --apiserver-endpoint $API_SERVER_URL --use-max-pods false

    --//
    Content-Type: text/x-shellscript; charset="us-ascii"

    KUBELET_CONFIG=/etc/kubernetes/kubelet/kubelet-config.json
    echo "$(jq ".podPidsLimit=32768" $KUBELET_CONFIG)" > $KUBELET_CONFIG

    --//
    Content-Type: text/x-shellscript; charset="us-ascii"

    systemctl stop kubelet
    systemctl daemon-reload
    systemctl start kubelet
    --//--
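With both resources defined, they can be applied and inspected as follows (the file names are hypothetical):

kubectl apply -f g4-nodepool.yaml
kubectl apply -f g4-nodeclass.yaml

#Confirm the resources were created
kubectl get nodepools
kubectl get ec2nodeclasses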
In this scenario, each node (e.g., ip-10-20-23-199.us-west-1.compute.internal) can accommodate up to three pods. If the deployment is scaled to add another pod, the resources will be insufficient, causing the new pod to remain in a pending state.
Karpenter monitors these unschedulable pods and assesses their resource requirements to act accordingly. A NodeClaim is created that claims capacity from the NodePool, and Karpenter then provisions a node to satisfy the requirement, as shown below.
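The provisioning activity can be observed by listing the NodeClaims and the GPU-labeled nodes; the output will vary with cluster state:

#NodeClaims created by Karpenter in response to pending pods
kubectl get nodeclaims

#Nodes provisioned from the GPU NodePool
kubectl get nodes -l nvidia.com/gpu=true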
Conclusion: Efficient GPU Resource Management in Kubernetes
With the growing demand for GPU-accelerated workloads in Kubernetes, managing GPU resources effectively is essential. The combination of NVIDIA Device Plugin, time slicing, and Karpenter provides a powerful approach to manage, optimize, and scale GPU resources in a Kubernetes cluster, delivering high performance with efficient resource utilization. This solution has been implemented to host pilot GPU-enabled Learning Labs on developer.cisco.com/learning, providing GPU-powered learning experiences.