Cloud providers like Microsoft, CoreWeave, and Oracle Cloud Infrastructure are rolling out NVIDIA GB300 NVL72 systems for low-latency and long-context tasks, including agentic coding and coding assistants.
Building on this hardware expansion, leading inference providers like Baseten, DeepInfra, Fireworks AI, and Together AI use the NVIDIA Blackwell platform to reduce token costs by up to 10 times. The new Blackwell Ultra platform extends these gains for agentic AI applications.
AI agents and coding assistants have driven a significant increase in programming-related AI queries, rising from 11% to above 50%, according to OpenRouter's State of Inference report. As a result, these tools need low latency for instant responses and long context to understand and operate over entire codebases, enabling them to handle larger projects and complex workflows effectively.
SemiAnalysis InferenceMAX data shows that NVIDIA's software and Blackwell Ultra platform bring significant advances, with GB300 NVL72 offering much higher throughput per megawatt and lower token costs than the Hopper platform.
NVIDIA's work in chip, system, and software design boosts performance for AI workloads, from agentic coding to interactive assistants, while also lowering costs at scale.
GB300 NVL72 Delivers Up to 50x Better Performance for Low-Latency Workloads
Signal65's recent analysis shows that NVIDIA GB200 NVL72, with its co-designed hardware and software, delivers over 10 times more tokens per watt and cuts token costs to one-tenth compared with the Hopper platform. These gains keep growing as the software stack improves.
Ongoing updates from the NVIDIA TensorRT-LLM, Dynamo, Mooncake, and SGLang teams continue to improve Blackwell NVL72 throughput for mixture-of-experts (MoE) inference across all latency levels. For example, recent TensorRT-LLM updates have made GB200 up to 5 times faster for low-latency tasks than it was four months ago.
- Faster, more efficient GPU kernels help Blackwell fully utilize its compute resources and increase throughput.
- NVIDIA NVLink symmetric memory allows GPUs to access each other's memory directly, improving the efficiency of data exchange between processors.
- Programmatic Dependent Launch reduces idle time by starting the next kernel's setup before the previous kernel finishes.
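The launch-overlap idea behind Programmatic Dependent Launch can be sketched with a simple timing model. This is a conceptual illustration only: the kernel times, the setup cost, and the perfect-overlap assumption are ours, not measured GB300 figures.

```python
# Conceptual timing model of launch overlap (illustrative assumptions only).
# Without overlap, each kernel's launch setup waits for the previous kernel
# to finish; with overlap, the setup of kernel k+1 hides under the tail of
# kernel k, so only the first setup is exposed on the timeline.

def total_time(kernel_ms, setup_ms, overlap):
    if not overlap:
        # Serial: pay setup + run time for every kernel.
        return sum(setup_ms + k for k in kernel_ms)
    # Overlapped: only the first setup is exposed; later setups hide under
    # the previous kernel's execution (assuming setup_ms <= each kernel time).
    return setup_ms + sum(kernel_ms)

kernels = [0.30, 0.25, 0.40, 0.35]   # per-kernel run times in ms (made up)
serial = total_time(kernels, setup_ms=0.05, overlap=False)
overlapped = total_time(kernels, setup_ms=0.05, overlap=True)
print(round(serial - overlapped, 2))  # 0.15: three of the four setups hidden
```

The saving per step is small, but in multi-step agentic workflows where thousands of short kernels run back to back, hiding launch setup adds up to the kind of low-latency gains described above.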
With these software improvements, GB300 NVL72, equipped with Blackwell Ultra GPUs, now achieves 50 times the throughput per megawatt of the Hopper platform.
These improvements result in much lower costs across all latency levels, with the largest savings, up to 35× lower cost per million tokens, coming at low latency.
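The link between throughput per megawatt and cost per million tokens is straightforward arithmetic: at a fixed cost to run a megawatt of capacity, token cost falls in proportion to throughput. The sketch below uses made-up throughput and power-cost figures, not NVIDIA-published numbers, purely to show the relationship.

```python
# Illustrative sketch: cost per million tokens vs. throughput per megawatt.
# All numeric inputs here are assumptions for illustration.

def cost_per_million_tokens(tokens_per_sec_per_mw, dollars_per_mw_hour):
    """Dollar cost to generate one million tokens on 1 MW of capacity."""
    tokens_per_hour = tokens_per_sec_per_mw * 3600
    return dollars_per_mw_hour / tokens_per_hour * 1_000_000

# Hypothetical baseline platform vs. one with 35x the throughput per MW
# at the same hourly cost per megawatt:
base = cost_per_million_tokens(tokens_per_sec_per_mw=100_000,
                               dollars_per_mw_hour=250.0)
faster = cost_per_million_tokens(tokens_per_sec_per_mw=3_500_000,
                                 dollars_per_mw_hour=250.0)
print(round(base / faster, 1))  # 35.0: cost falls in proportion to throughput
```

This is why the article's throughput-per-megawatt and cost-per-million-tokens claims move together: one is essentially the reciprocal of the other at fixed power cost.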
The GB300 NVL72 system and its software stack, including Dynamo and TensorRT-LLM, offer dramatically lower per-token costs than the Hopper platform.
For agentic coding and interactive assistant workloads, where every millisecond counts across multi-step workflows, this ongoing software optimization, paired with next-generation hardware, lets AI platforms scale real-time, interactive experiences to support far more users.
GB300 NVL72 Delivers Superior Economics for Long-Context Workloads
Both GB200 and GB300 deliver low latency (quick response times), but GB300 NVL72 is better suited for tasks that require processing large amounts of information at once (long-context tasks). For example, with large code inputs and outputs, GB300 NVL72 reduces token costs by up to 1.5× versus GB200 NVL72.
Context grows as the agent reads in more of the code. This helps it understand the codebase better, but it also demands much more compute. Blackwell Ultra delivers 1.5× higher NVFP4 compute performance and 2× faster attention processing, enabling agents to efficiently reason over entire codebases.
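Why attention speed matters so much here: self-attention compute grows roughly quadratically with context length, so reading a whole codebase is far more than proportionally expensive. The model below is a standard back-of-the-envelope estimate with assumed model dimensions, not a GB300 specification.

```python
# Rough model (illustrative assumptions, not a hardware spec): self-attention
# FLOPs scale as ~n^2 * d per layer, so compute explodes as context grows.

def attention_flops(n_tokens, d_model, n_layers):
    # ~4 * n^2 * d per layer for the QK^T scores and the attention-weighted V
    # (ignoring the linear projections, which scale only linearly in n).
    return 4.0 * n_tokens**2 * d_model * n_layers

short_ctx = attention_flops(8_000, d_model=8192, n_layers=80)     # small file
long_ctx = attention_flops(128_000, d_model=8192, n_layers=80)    # codebase
print(long_ctx / short_ctx)  # 256.0: 16x more context -> 256x more attention compute
```

Because the attention term dominates at long context, a 2× attention speedup buys disproportionately large gains exactly for the codebase-scale workloads the article describes.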
Infrastructure For Agentic AI
Leading cloud providers and AI innovators have already deployed NVIDIA GB200 NVL72 at scale and are now deploying GB300 NVL72 in production. Microsoft, CoreWeave, and OCI use GB300 NVL72 for low-latency and long-context use cases, such as agentic coding and coding assistants, while reducing token costs. GB300 NVL72 enables a new class of applications that can reason across massive codebases in real time.
"As inference moves to the center of AI production, long-context performance and token efficiency become critical," said Chen Goldberg, Senior Vice President of Engineering at CoreWeave. "Grace Blackwell NVL72 addresses that challenge directly, and CoreWeave's AI cloud, including CKS and SUNK, is designed to translate GB300 systems' gains, building on the success of GB200, into predictable performance and cost efficiency. The result is better token economics and more usable inference for customers running workloads at scale."
NVIDIA Vera Rubin NVL72 To Bring Next-Generation Performance
As adoption grows, ongoing software updates for NVIDIA Blackwell systems will continue to improve performance and reduce costs for all users.
Going forward, the NVIDIA Rubin platform, which combines six new chips into a single AI supercomputer, will deliver even greater performance gains. For MoE inference, Rubin offers up to 10 times more throughput per megawatt than Blackwell, cutting cost per million tokens to one-tenth. Rubin can also train large MoE models with only a quarter as many GPUs as Blackwell.
Source: New SemiAnalysis InferenceMAX Data Shows NVIDIA Blackwell Ultra Delivers up to 50x Better Performance