NVIDIA’s new Blackwell Ultra architecture introduces Programmatic Dependent Launch, which lets a dependent GPU kernel begin its launch while the current kernel is still executing. In the GB300 NVL72 system, this advance improves GPU utilization and throughput for complex AI workloads, such as agentic AI and advanced reasoning models.
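The benefit of launching the next kernel while the current one still runs can be sketched with some toy timeline arithmetic. All of the timings below are hypothetical illustrations, not measurements of real hardware:

```python
# Toy model of back-to-back kernel timing (all numbers hypothetical).
# Without overlap, every launch pays its full setup latency after the
# previous kernel finishes; with dependent launch, that setup latency
# is hidden behind the tail of the still-running kernel.

def total_time(kernel_ms, launch_ms, n_kernels, overlap):
    """Total wall time for n_kernels back-to-back kernels.

    overlap=True models launch latency being hidden behind the previous
    kernel's execution (only the very first launch is still exposed).
    """
    if overlap:
        return launch_ms + n_kernels * kernel_ms
    return n_kernels * (launch_ms + kernel_ms)

serial = total_time(kernel_ms=2.0, launch_ms=0.5, n_kernels=100, overlap=False)
overlapped = total_time(kernel_ms=2.0, launch_ms=0.5, n_kernels=100, overlap=True)
print(serial, overlapped)  # 250.0 vs 200.5: the per-kernel launch gaps mostly vanish
```

The saving grows with the number of short kernels in the pipeline, which is why the feature matters most for inference workloads that chain many small kernels.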
Highlights of Programmatic Dependent Launch and Blackwell Ultra
- The new launch feature cuts GPU idle time between kernels and maximizes throughput for high-performance AI workloads.
- The GB300 delivers a 1.5x boost in NVFP4 compute throughput and doubles attention-layer processing speed compared to standard Blackwell.
- The platform targets extended context inference and test-time scalability, supporting agentic systems that require deep reasoning.
- Blackwell Ultra supports 800 Gb/s networking (Spectrum-X Ethernet and Quantum-X800 InfiniBand) and works with NVIDIA Dynamo for large-scale multi-node tasks.
- These enhancements are expected to become available through partners in the second half of 2025.
Blackwell Ultra also includes a reliability, availability, and serviceability (RAS) engine that detects faults early and cuts downtime, adding reliability and efficiency.
AI has advanced for years by scaling pre-training with larger models, more data, and greater computing power to unlock new capabilities. Over the past five years, this approach has increased compute needs by 50 million times, but building smarter systems is now about more than just bigger models. The focus is shifting to refining models and enabling them to think.
Refining AI models with post-training scaling boosts performance and conversational ability. Tuning with domain-specific and synthetic data gives models a more nuanced understanding of their target domain and better handling of specialized inputs. Because synthetic data production has no practical upper limit, demand for post-training compute keeps rising.
A new approach called test-time scaling has now emerged to boost AI intelligence.
Also known as long thinking, test-time scaling dynamically increases compute during AI inference to enable deeper reasoning. AI reasoning models don’t just generate responses in a single pass; they actively think, weigh multiple possibilities, and refine their answers in real time.
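One simple form of this idea is best-of-N sampling: spend several inference passes on a hard query and keep the strongest candidate. The sketch below uses a random number as a hypothetical stand-in for a real model and verifier:

```python
import random

def generate(prompt, rng):
    """Hypothetical stand-in for one model inference pass.

    A real system would produce an answer and score it with a verifier;
    here the 'quality score' is just a random draw in [0, 1)."""
    return rng.random()

def best_of_n(prompt, n, seed=0):
    """Spend n inference passes instead of one and keep the best candidate.

    More test-time compute (larger n) raises the expected quality of the
    kept answer, which is the essence of test-time scaling."""
    rng = random.Random(seed)
    return max(generate(prompt, rng) for _ in range(n))

single = best_of_n("prove the lemma", n=1)
scaled = best_of_n("prove the lemma", n=100)
print(single <= scaled)  # the best of 100 candidates is never worse than the first
```

Real reasoning models interleave generation and self-checking rather than sampling independently, but the compute trade-off is the same: more passes per query, better answers.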
This is moving us closer to true agentic intelligence: AI that can think and act independently to tackle more sophisticated tasks and provide more useful answers.
Switching to post-training and test-time scaling greatly increases the need for computational resources. For example, the post-training process may require up to 30 times as much computational power as the original pre-training phase when creating custom AI models. Likewise, the long thinking involved in test-time scaling can demand up to 100 times as much computation as a single inference would for solving especially complex tasks.
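Taking those multipliers at face value, the compute budget shifts heavily toward the later phases. The arithmetic below is purely illustrative, normalizing pre-training and a single inference pass to one unit each:

```python
# Illustrative budget arithmetic using the multipliers quoted above
# (upper bounds from the text, not benchmarks).
pretrain = 1.0                     # pre-training compute, normalized to 1
post_train = 30 * pretrain         # post-training: up to 30x pre-training
single_pass = 1.0                  # one plain inference, normalized separately
long_thinking = 100 * single_pass  # hard query: up to 100x a single pass

post_share = post_train / (pretrain + post_train)
print(post_share)      # ~0.968: post-training can dominate the training budget
print(long_thinking)   # one hard query can cost 100 passes worth of compute
```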
Blackwell Ultra NVIDIA GB300 NVL72
To address these needs, NVIDIA launched Blackwell Ultra, a high-speed computing platform made for advanced AI reasoning. It supports training-time, post-training, and test-time scaling. Blackwell Ultra is built for large-scale AI inference, offering smarter, faster, and more efficient AI while keeping costs down.
Blackwell Ultra powers the NVIDIA GB300 NVL72 systems. These liquid-cooled rack-scale setups connect 36 NVIDIA Grace CPUs and 72 Blackwell Ultra GPUs, all working together as one large GPU. The system offers an NVLink bandwidth of 130 TB/s.
Blackwell Ultra delivers even greater AI inference performance for real-time multi-agent systems and long-context reasoning. Its new Tensor Cores provide 1.5 times more AI compute FLOPS than Blackwell GPUs, and the GB300 NVL72 offers 70 times more AI FLOPS than the HGX H100. Blackwell Ultra also supports several FP4 formats to improve memory efficiency for advanced AI. The coherent memory in each GB300 NVL72 rack opens the door to breakthroughs in AI, research, real-time analytics, and more: it provides the large-scale memory needed to run many large AI models simultaneously and to handle a high volume of complex tasks from many concurrent users, improving performance and reducing latency.
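The memory-efficiency argument for FP4 formats comes down to bits per parameter. A quick footprint calculation, using a hypothetical 70B-parameter model and counting weights only (no KV cache or activations):

```python
def weight_footprint_gb(n_params, bits_per_param):
    """Memory needed to hold model weights, in GB (weights only)."""
    return n_params * bits_per_param / 8 / 1e9

params = 70e9  # hypothetical 70B-parameter model
fp16 = weight_footprint_gb(params, 16)
fp4 = weight_footprint_gb(params, 4)
print(fp16, fp4)  # 140.0 GB vs 35.0 GB: 4x more model fits in the same memory
```

The same 4x factor applies to any model size, which is why narrower formats let a fixed memory pool serve more concurrent models.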
Blackwell Ultra Tensor Cores accelerate attention layers twice as fast as the previous Blackwell system. This enables efficient processing of long context lengths, which is vital for real-time AI handling millions of input tokens at once.
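Long context is where attention speed matters most, because the score matrix in standard attention grows quadratically with sequence length. A simplified FLOP count for one attention layer (ignoring heads, projections, and every other layer) makes the point:

```python
def attention_matmul_flops(seq_len, d_model):
    """Rough FLOPs for the QK^T and scores*V matmuls of one attention layer:
    two (seq_len x seq_len x d_model) products, 2 FLOPs per multiply-add."""
    return 2 * 2 * seq_len * seq_len * d_model

short_ctx = attention_matmul_flops(8_000, 4096)
long_ctx = attention_matmul_flops(1_000_000, 4096)
print(long_ctx / short_ctx)  # 15625.0: 125x more tokens -> ~15,625x more attention FLOPs
```

That quadratic blow-up is why a 2x attention-layer speedup translates into a meaningful real-time gain at million-token context lengths.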
Optimized Large Scale Multi-Node Inference
Efficiently routing AI inference requests across many GPUs is key to keeping costs low and maximizing revenue in AI factories.
Blackwell Ultra pairs PCIe Gen6 connectivity with the ConnectX-8 SuperNIC to raise per-GPU network bandwidth to 800 Gb/s.
With this added network bandwidth, NVIDIA Dynamo, an open-source inference framework, scales AI model serving across nodes. It allocates GPU workers dynamically to reduce traffic bottlenecks.
Dynamo also offers disaggregated serving. This means it separates the context (pre-fill) and generation (decode) steps for large-language-model inference across GPUs. This setup improves performance, making scaling easier and lowering costs.
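Disaggregated serving is also why the fast interconnect matters: after prefill, the context worker must hand its KV cache to a decode worker, and the handoff time scales directly with link speed. A back-of-envelope calculation with a hypothetical cache size:

```python
def kv_transfer_ms(kv_cache_gb, link_gbit_s):
    """Time to ship a prefill KV cache to a decode node over one NIC.
    Note the unit conversion: the link is in gigabits/s, the cache in gigabytes."""
    return kv_cache_gb * 8 * 1000 / link_gbit_s

# Hypothetical 10 GB KV cache for a long prompt, over an 800 Gb/s link.
print(kv_transfer_ms(10, 800))  # 100.0 ms
# Over a 400 Gb/s link, the same handoff takes twice as long.
print(kv_transfer_ms(10, 400))  # 200.0 ms
```

Keeping this handoff well under a token's generation latency is what makes splitting prefill and decode onto different GPUs a net win.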
The GB300 NVL72 supports 800 Gb/s of network bandwidth per GPU and integrates Quantum-X800 InfiniBand and Spectrum-X Ethernet networking. It efficiently scales model size, data, and reasoning capability for AI factories and data centers.
Summary
Blackwell Ultra accelerates AI reasoning, enabling real-time insights, smarter chatbots, better analytics, and productive AI agents in finance, healthcare, and e-commerce. Organizations can run larger models and more demanding AI workloads faster and more efficiently, making advanced AI practical in real life.
Blackwell Ultra products will be available from partners in the second half of 2025, with all major cloud providers and server makers supporting them.