The Azure Kubernetes Service (AKS) team released a step-by-step guide for using Dynamic Resource Allocation (DRA) with NVIDIA vGPU technology in AKS. The update gives users finer-grained control and more efficient GPU sharing for AI and media workloads.
Dynamic Resource Allocation standardizes GPU management in Kubernetes, replacing static allocation with dynamic assignment through device classes and resource claims. The change simplifies scheduling, especially when combined with NVIDIA vGPU.
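To make that flow concrete, here is a minimal sketch assuming the GA resource.k8s.io/v1 API in Kubernetes 1.34 and the gpu.nvidia.com DeviceClass that the NVIDIA DRA driver registers; the object names, namespace, and container image are illustrative, not taken from the AKS guide.

```bash
# Hedged sketch: a ResourceClaimTemplate requests one device from the gpu.nvidia.com
# DeviceClass, and the pod references the claim instead of a static device-plugin limit.
kubectl apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu                          # illustrative name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com   # DeviceClass registered by the NVIDIA DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-consumer                        # illustrative name
spec:
  restartPolicy: OnFailure
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
  containers:
  - name: app
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative CUDA base image
    command: ["nvidia-smi"]
    resources:
      claims:
      - name: gpu                           # binds the container to the claim above
EOF
```

The scheduler allocates a matching device when the pod is placed, so the same manifest works whether the claim is ultimately satisfied by a full GPU or a vGPU partition.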
The two technologies complement each other: virtual accelerators like NVIDIA vGPU are ideal for smaller tasks, letting a single GPU be shared among many users or applications. That benefits enterprise AI development, fine-tuning, and media processing, while vGPU maintains consistent performance and preserves CUDA support for containers.
On the infrastructure side, the feature uses Azure's NVads A10 v5 virtual machines. Instead of assigning the entire GPU to one VM, vGPU partitions it into several fixed-size slices at the hypervisor level, and Kubernetes treats each VM as having a single GPU. The hypervisor, not software inside the VM, enforces capacity and memory isolation.
Kubernetes 1.34 or later is required to enable DRA features. Teams create a node pool backed by NVads A10 v5 instances, label it so the NVIDIA DRA driver's kubelet plugin runs there, and install the NVIDIA DRA driver using Helm, adjusting key flags for vGPU support, as sketched below.
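The following is a hedged sketch of those steps, assuming a cluster named my-aks in resource group my-rg, the 1/6-GPU Standard_NV6ads_A10_v5 size named in the walkthrough, and that the driver chart is published as nvidia-dra-driver-gpu in NVIDIA's NGC Helm repository; the guide's exact node labels and vGPU-related chart values are not reproduced here.

```bash
# Add an NVads A10 v5 node pool (resource names, size, and count are illustrative).
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name a10vgpu \
  --node-vm-size Standard_NV6ads_A10_v5 \
  --node-count 1

# Install the NVIDIA DRA driver with Helm; the AKS guide additionally adjusts
# chart values for vGPU support, which are omitted in this sketch.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --namespace nvidia-dra-driver-gpu \
  --create-namespace
```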
Once the driver is active, it registers each node's vGPU device with Kubernetes. Operators can check for the gpu.nvidia.com DeviceClass and the ResourceSlices published for each node to verify that the hardware is recognized.
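Because the resource.k8s.io API is GA in Kubernetes 1.34, this check needs only plain kubectl:

```bash
# The gpu.nvidia.com DeviceClass should be listed once the driver is installed.
kubectl get deviceclasses

# Each node publishes ResourceSlices describing the vGPU devices it advertises.
kubectl get resourceslices
kubectl describe resourceslice <slice-name>   # inspect attributes and capacity of one slice
```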
Profiles beyond Standard_NV6ads_A10_v5 include one-third and one-half GPU sizes. Because the hypervisor enforces each partition's capacity, teams can efficiently match GPU allocation to workload needs.
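As an illustration of how the VM size choice maps onto a vGPU fraction (the region below is an assumption), the family can be listed with the Azure CLI; per Azure's published specs, NV6 carries roughly one-sixth of an A10, NV12 one-third, NV18 one-half, and NV36 a full GPU.

```bash
# List the NVads A10 v5 sizes available in a region and pick the fraction that fits.
az vm list-sizes --location eastus --output table | grep -i "ads_A10_v5"
```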
The AKS team positions DRA with vGPU as a significant advance: organizations can now transition from node-level GPU allocation to granular, workload-specific sharing at scale. The development supports production-grade, shared AI workloads and drives infrastructure efficiency, especially for large, regulated, or cost-conscious deployments.
Google Cloud uses DRA for GPU and TPU scheduling in GKE. DRA lets workloads use selectors for device features, and GKE also supports fractional vGPU VMs, with containers bin-packed onto them to improve utilization.
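A hedged sketch of what such a feature selector can look like, using the CEL device-selector field from the DRA API; the attribute key and value are assumptions modeled on what the NVIDIA DRA driver publishes in its ResourceSlices, not GKE-specific syntax.

```bash
kubectl apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: a10-gpu               # illustrative name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          selectors:
          - cel:
              # Attribute key/value are assumptions; the real keys are visible in the
              # ResourceSlices published by the driver on each node.
              expression: device.attributes['gpu.nvidia.com'].productName == 'NVIDIA A10'
EOF
```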
Amazon EKS is taking a different route, using DRA mainly to manage the complexity of its high-end GPU hardware rather than for fractional sharing. DRA became generally available on EKS starting with Kubernetes version 1.33. The feature matters most for P6e-GB200 UltraServer instances, where conventional static GPU scheduling cannot handle the NVLink and IMEX interconnects needed for multi-node workloads. For teams with smaller workloads that want GPU sharing on EKS, DRA now allows structured, attribute-based requests, letting a workload ask for something like a MIG partition with at least 10 GB of memory rather than merely counting GPUs; a sketch of such a claim follows below.
Across all three cloud providers, the move from static device plugins to DRA is accelerating, driven by the need for smarter, topology-aware GPU scheduling as AI infrastructure becomes more complex and expensive.
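To illustrate the kind of request described above, here is a hedged sketch of a claim asking for a MIG partition with at least 10 GB of memory; the mig.nvidia.com DeviceClass name and the capacity key are assumptions modeled on the NVIDIA DRA driver, and the expression relies on the Kubernetes CEL quantity helpers.

```bash
kubectl apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: mig-10gb               # illustrative name
spec:
  spec:
    devices:
      requests:
      - name: mig-slice
        exactly:
          deviceClassName: mig.nvidia.com   # assumed MIG DeviceClass from the NVIDIA DRA driver
          selectors:
          - cel:
              # Ask for any MIG device advertising at least 10Gi of memory,
              # rather than requesting whole GPUs by count.
              expression: device.capacity['gpu.nvidia.com'].memory.compareTo(quantity('10Gi')) >= 0
EOF
```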
Source: Microsoft Adds DRA-Backed NVIDIA vGPU Support to AKS










