As AI models grow more complex, the ability to serve, scale, and swap them instantly has become essential. In March 2026, Google Cloud tackled this challenge with its updated Hyperdisk ML storage. Recent benchmarks show Hyperdisk ML sustaining 500,000 IOPS during model hot swapping, setting a new bar for high-performance AI infrastructure.

For systems architects, that figure matters: it enables “always-on” generative AI applications that can switch between base models, LoRA adapters, and specialized weights in milliseconds.

The Bottleneck: Why IOPS Matter for Model Serving 

Traditional block storage has long been a quiet limit on AI inference performance. Even top-end GPUs and TPUs can sit idle while storage loads large model weights into memory. In production, where “hot swapping” means replacing one active model with another without downtime, IOPS becomes the main bottleneck.

Loading a 70B-parameter model from a conventional disk can take seconds or minutes, causing cold-start delays. Hyperdisk ML uses Titanium offload to move storage processing off the CPU. At 500,000 IOPS, it delivers the random-read performance that G4 and A3 instances require.
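Back-of-the-envelope arithmetic shows why storage speed dominates cold-start time. The sketch below (plain Python, with illustrative numbers rather than benchmark results) estimates how long a 70B-parameter FP16 checkpoint takes to read sequentially at different throughput levels:

```python
def load_time_seconds(params_billion: float,
                      bytes_per_param: int,
                      throughput_gib_s: float) -> float:
    """Estimate sequential load time for a model checkpoint."""
    size_bytes = params_billion * 1e9 * bytes_per_param
    return size_bytes / (throughput_gib_s * 2**30)

# 70B parameters at FP16 (2 bytes each) is roughly 140 GB on disk.
print(round(load_time_seconds(70, 2, 50), 1))  # ~2.6 s at 50 GiB/s
print(round(load_time_seconds(70, 2, 1), 1))   # ~130.4 s at 1 GiB/s
```

The two-orders-of-magnitude gap between a 1 GiB/s disk and a 50 GiB/s volume is the difference between a visible cold start and a near-instant swap.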

Achieving 500,000 IOPS: The Titanium Advantage 

Hyperdisk ML reaches 500,000 IOPS thanks to the Google Cloud Hypercomputer design underneath it. Unlike a classic SAN, it is network-attached yet behaves like a local SSD.

Concurrency limits matter, too. In a typical GKE cluster, many inference nodes read the same weights. Hyperdisk ML supports the ReadOnlyMany access mode, letting up to 2,500 nodes mount a single volume. Google sets the 500,000 IOPS and 50 GiB/s throughput limits at the zonal level, which supports this kind of scale-out.
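The shared-weights pattern above maps to a standard Kubernetes claim. A minimal sketch of the manifest, built as a Python dict (the claim name and storage class name are illustrative, not Google-documented identifiers):

```python
import json

def hyperdisk_ml_pvc(name: str, size_gi: int, storage_class: str) -> dict:
    """Build a PVC manifest dict for a shared read-only weights volume."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadOnlyMany lets many inference nodes mount the same volume.
            "accessModes": ["ReadOnlyMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": f"{size_gi}Gi"}},
        },
    }

pvc = hyperdisk_ml_pvc("llm-weights", 512, "hyperdisk-ml-sc")
print(json.dumps(pvc, indent=2))
```

Because the access mode is ReadOnlyMany, every node sees an identical, immutable copy of the weights, so there is no cache-coherency cost as the cluster scales.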

Enabling Flawless Model Hot Swapping 

Model hot swapping is the next step in AI deployment. For example, a customer service bot may switch from a general language model to a specialized legal or billing model based on user needs. 

With Hyperdisk ML, developers can use “Weight-Streaming.” High IOPS lets the engine load only the required layers or adapters when needed. 

  • Reduced Idle Time: Accelerators, such as GPUs and TPUs, spend more time computing and less time waiting for the “First Token.” 
  • Cost Efficiency: Faster pod startup times let organizations reduce their pool of idle instances, thereby markedly lowering total costs. 
  • Pre-Caching: The GKE Volume Populator moves weights from Cloud Storage to Hyperdisk ML ahead of time, so when a swap is commanded, the data is already in the fast block layer. 
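The weight-streaming idea described above can be sketched in a few lines: instead of reading an entire checkpoint, the serving engine reads only the byte range for the layer or adapter it needs. This is a minimal illustration, assuming a hypothetical index that maps layer names to byte offsets in a packed weights file:

```python
import mmap
import tempfile

# Hypothetical index mapping layer/adapter names to (offset, length)
# byte ranges inside a single packed weights file.
LAYER_INDEX = {"base_layer_0": (0, 8), "lora_adapter": (8, 8)}

def stream_layer(path: str, name: str) -> bytes:
    """Read only one layer's byte range instead of the whole file."""
    offset, length = LAYER_INDEX[name]
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]

# Demo with a tiny stand-in for a packed weights file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"BASEBASELORALORA")
    weights_path = f.name

print(stream_layer(weights_path, "lora_adapter"))  # b'LORALORA'
```

Each such range fetch is a small random read, which is exactly the access pattern where a high-IOPS volume outperforms one tuned only for sequential throughput.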

Performance Tuning for 500,000 IOPS 

To maximize the 500,000 IOPS performance on Hyperdisk ML during active model hot swapping and loading, engineers should focus on three key storage settings: 

  1. Block Size: Use 4 KB I/O blocks for peak IOPS. Larger blocks improve throughput, but 4 KB suits the fast, random reads typical of small LoRA adapters. 
  2. Queue Depth: To fully utilize the Titanium pipeline, set the queue depth to at least 256. This lets the system keep many requests in flight without waiting for each one to finish. 
  3. Instance Machine Series: Hyperdisk ML works best with the C3, C4, and G4 machine families, which have the hardware needed to connect to the Titanium storage offload engine at full speed. 
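The first two settings can be exercised together from user space. The sketch below (illustrative only; a thread pool stands in for a real async I/O queue, and the scratch file stands in for a Hyperdisk ML volume) keeps 256 concurrent 4 KiB random reads in flight:

```python
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096          # 4 KiB I/O size, matching the tuning guidance
QUEUE_DEPTH = 256     # outstanding requests kept in flight

def random_read_benchmark(path: str, num_reads: int = 2048) -> int:
    """Issue many concurrent 4 KiB random reads against one file."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        def one_read(_):
            # Pick a random block-aligned offset inside the file.
            offset = random.randrange(0, size - BLOCK) // BLOCK * BLOCK
            return len(os.pread(fd, BLOCK, offset))
        with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
            return sum(pool.map(one_read, range(num_reads)))
    finally:
        os.close(fd)

# Demo against a 1 MiB scratch file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(1024 * 1024))
    scratch = f.name
print(random_read_benchmark(scratch))  # 8388608 (2048 reads x 4096 bytes)
os.remove(scratch)
```

Production tooling would typically use `fio` or io_uring rather than threads, but the principle is the same: IOPS ratings are only reachable when enough requests are outstanding at once.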

The Future of “Zero-Latency” AI 

By sustaining 500,000 IOPS for model hot swapping, Google Cloud Hyperdisk ML shows AI infrastructure moving from “experimental” to “industrial-grade.” Today, even a 100 ms delay can cost users, so storage can no longer be an afterthought.

With the required throughput and IOPS to make model weights appear “always resident” in memory, Google Cloud is making dynamic, multimodal, and highly personalized AI apps possible. For companies looking to go beyond basic chatbots to real-time, context-aware agents, Hyperdisk ML delivers the speed and reliability needed.

Source: High-performance block storage for any use case 
