NVIDIA’s Blackwell Ultra builds on the standard Blackwell architecture with up to 50x higher throughput and 35x lower cost per token for agentic AI. This is thanks to faster HBM3e memory, NVLink switch networking, and improved TensorRT-LLM software. These upgrades cut inference cost per token, making AI agent applications cheaper and faster and putting downward pressure on enterprise AI subscription fees.
Key Differences Between Blackwell and Blackwell Ultra
- Blackwell (GB200) brought major inference improvements with 192 GB of HBM3e memory and FP4 precision, already reducing costs compared with the earlier Hopper architecture.
- Blackwell Ultra is an upgraded, next-generation platform designed for long-context, agentic AI that handles autonomous, multi-step tasks. It uses improved networking with an NVLink switch and better software to cut the cost per token by 35x compared with older Hopper systems.
Why AI Subscription Costs Are Falling
- Big efficiency gains: Blackwell Ultra GB300 NVL72 systems deliver 50x more throughput per megawatt, allowing cloud providers such as Microsoft, Oracle, and CoreWeave to offer lower-cost services.
- Lower cost per token: AI inference costs are dropping because the new hardware is built for faster, more energy-efficient token generation.
- Agentic AI optimization: As applications move from basic chatbots to more complex agents, Blackwell Ultra delivers the speed needed to make these high-compute tasks affordable, lowering the operating expenses that are often passed on to users.
Note: Blackwell Ultra aims to make advanced AI agents more affordable, while the standard Blackwell is already used for high-density, lower-cost inference.
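The link between throughput per megawatt and cost per token can be sketched with back-of-envelope arithmetic. The function below is a minimal model, and the throughput and power-price figures are illustrative assumptions, not NVIDIA’s numbers:

```python
# Illustrative model: how throughput per megawatt drives the energy
# cost of generating tokens. All figures are hypothetical.

def cost_per_million_tokens(tokens_per_sec_per_mw: float,
                            dollars_per_mwh: float) -> float:
    """Energy cost in dollars to generate one million tokens,
    given sustained throughput per megawatt of power draw."""
    tokens_per_hour_per_mw = tokens_per_sec_per_mw * 3600
    mwh_per_million_tokens = 1e6 / tokens_per_hour_per_mw
    return mwh_per_million_tokens * dollars_per_mwh

# Hypothetical baseline vs. a platform with 50x throughput per MW:
baseline = cost_per_million_tokens(tokens_per_sec_per_mw=20_000,
                                   dollars_per_mwh=100.0)
upgraded = cost_per_million_tokens(tokens_per_sec_per_mw=1_000_000,
                                   dollars_per_mwh=100.0)
print(f"baseline: ${baseline:.4f}, upgraded: ${upgraded:.4f}, "
      f"ratio: {baseline / upgraded:.0f}x")
# → baseline: $1.3889, upgraded: $0.0278, ratio: 50x
```

The model covers only energy; real serving costs add hardware amortization, networking, and margin, but the proportionality between efficiency per megawatt and price per token holds.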
Leading inference providers such as Baseten, DeepInfra, Fireworks AI, and Together AI are using the NVIDIA Blackwell platform to reduce token costs by up to 10x. Now the NVIDIA Blackwell Ultra platform is building on this progress for agentic AI.
AI agents and coding assistants have driven a significant increase in software programming AI queries, rising from 11% to about 50% over the past year, according to OpenRouter’s State of Inference report. These tools need low latency for quick responses and long context to handle entire codebases.
Recent SemiAnalysis inference data shows that NVIDIA’s software improvements and the new Blackwell Ultra platform have made major progress: the NVIDIA GB300 NVL72 now offers up to 50x more throughput per megawatt and 35x lower cost per token than the NVIDIA Hopper platform.
NVIDIA’s codesign across chips, systems, and software boosts performance across a wide range of AI workloads, from agentic coding to interactive assistants, while also lowering costs at scale.
GB300 NVL72 Delivers Up to 50x Better Performance for Low-Latency Workloads
Signal65’s recent analysis shows that the NVIDIA GB200 NVL72, with advanced hardware and software design, delivers over 10x more tokens per watt and cuts token costs to one-tenth those of the NVIDIA Hopper platform. These gains keep growing as the technology improves.
Ongoing updates from the NVIDIA TensorRT-LLM, NVIDIA Dynamo, Mooncake, and SGLang teams continue to improve Blackwell NVL72 throughput for mixture-of-experts (MoE) inference across all latency levels. For example, recent updates to the TensorRT-LLM library have made GB200 up to five times faster for low-latency tasks than it was four months ago.
- Faster, more efficient GPU kernels help Blackwell use its full computing power and increase throughput.
- NVLink Symmetric Memory lets GPUs access each other’s memory directly, improving communication efficiency.
- Programmatic dependent launch reduces idle time by starting the next kernel’s setup before the previous one finishes.
Thanks to these software improvements, the GB300 NVL72 with the Blackwell Ultra GPU now delivers 50 times more throughput per MW than the Hopper platform.
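The idle-time reduction from programmatic dependent launch can be illustrated with a toy timing model. The setup and execution durations below are made up; real CUDA behavior is configured through launch attributes, which this sketch does not model:

```python
# Toy timing model of programmatic dependent launch: overlapping the
# next kernel's setup (prologue) with the previous kernel's execution.
# All durations are illustrative, in microseconds.

def serial_time(kernels):
    """Each kernel's setup waits for the previous kernel to finish."""
    return sum(setup + exec_ for setup, exec_ in kernels)

def overlapped_time(kernels):
    """Each kernel's setup is fully hidden under the previous kernel's
    execution, so only execution time (plus the first setup) remains."""
    first_setup = kernels[0][0]
    return first_setup + sum(exec_ for _, exec_ in kernels)

kernels = [(5, 40)] * 4  # (setup_us, exec_us) for four back-to-back kernels
print(serial_time(kernels), overlapped_time(kernels))  # → 180 165
```

The assumption that setup hides completely is the best case; the shorter the kernels, the larger the relative win, which is why this matters most for low-latency serving.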
These performance gains translate into greater cost savings, as the NVIDIA GB300 lowers costs across all latency levels compared with the Hopper platform. The biggest savings are at low latency, where agentic applications run at up to 10x lower cost per million tokens.
For agentic coding and interactive assistants, where every millisecond matters in complex workflows, this mix of ongoing software improvements and new hardware enables AI platforms to deliver real-time experiences to many more users.
GB300 NVL72 Delivers Superior Economics for Long-Context Workloads
Both GB200 NVL72 and GB300 NVL72 offer very low latency, but GB300 NVL72 stands out on long-context workloads, for example those using 128,000-token inputs and 8,000-token outputs, such as AI coding assistants working through codebases. There, GB300 NVL72 can lower the cost per token by up to 1.5x compared with GB200 NVL72.
Context grows as the agent reads in more of the code. This lets it understand the codebase better, but it also requires much more compute. Blackwell Ultra delivers 1.5x higher NVFP4 compute performance and 2x faster attention processing, enabling the agent to efficiently understand entire codebases.
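The compute growth that comes with longer context can be made concrete with a rough self-attention FLOPs estimate, since attention work scales quadratically with sequence length. The model dimension below is an assumption for illustration only:

```python
# Rough estimate of self-attention FLOPs per layer vs. context length.
# The attention-score matmul and the weighted sum over values each cost
# about 2 * seq_len^2 * d_model multiply-adds, so attention work grows
# quadratically with context. d_model here is an illustrative assumption.

def attention_flops(seq_len: int, d_model: int = 8192) -> float:
    """Approximate FLOPs for one attention layer's score and value matmuls."""
    return 4.0 * seq_len**2 * d_model  # 2 matmuls x 2 FLOPs per multiply-add

short = attention_flops(8_000)
long_ctx = attention_flops(128_000)
print(f"128K context needs {long_ctx / short:.0f}x the attention FLOPs of 8K")
# → 128K context needs 256x the attention FLOPs of 8K
```

A 16x longer context costs 16² = 256x the attention compute, which is why faster attention processing matters so much for codebase-scale inputs.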
Infrastructure for Agentic AI
Leading cloud providers and AI innovators have already deployed NVIDIA GB200 NVL72 at scale and are now deploying GB300 NVL72 in production. Microsoft, CoreWeave, and OCI are deploying GB300 NVL72 for low-latency and long-context use cases, such as agentic coding and coding assistants, reducing token costs. GB300 NVL72 enables a new class of applications that can reason over massive codebases in real time.
“As inference moves to the center of AI production, long-context performance and token efficiency become critical,” said Chen Goldberg, senior vice president of engineering at CoreWeave. “NVIDIA Grace Blackwell NVL72 addresses this challenge directly, and CoreWeave’s AI cloud, including CKS and SUNK, is designed to translate the gains of GB300 systems, building on the success of GB200, into uniform performance and cost efficiency. The result is better token economics and more usable inference for customers running workloads at scale.”
NVIDIA Vera Rubin NVL72 to Bring Next-Generation Performance
As NVIDIA Blackwell systems are increasingly deployed, ongoing software updates will continue to improve performance and reduce costs for all users. The upcoming Vera Rubin NVL72 AI supercomputer is poised to deliver another round of massive performance leaps for MoE inference: up to 10x higher throughput per megawatt than Blackwell, translating into one-tenth the cost per million tokens. For the next wave of frontier AI models, Vera Rubin can train large MoE models with just one-fourth the number of GPUs required by Blackwell.