In the past, data centers mainly stored, retrieved, and processed information. Now, with generative and agentic AI, they have become AI token factories. Their main job is running AI inference, delivering intelligence as tokens.  

This change means we need to rethink how we measure the economics of AI infrastructure, including total cost of ownership (TCO). Many companies still focus too much on chip specs, compute costs, and FLOPS per dollar.  

The key difference to focus on is:  

  • Compute cost is the amount companies pay for AI infrastructure, whether they rent it from the cloud or own it themselves.  
  • FLOPS per dollar measures how much raw computing power a company gets for each dollar. But raw compute is not the same as actual token output.  
  • Cost per token is the total amount a company spends to produce each token, usually shown as cost per million tokens.  

The first two are just input metrics. Focusing on inputs when your business depends on outputs is a basic mismatch.  

The cost per token shows whether a company can scale AI profitably. It’s the only TCO metric that directly reflects hardware, software, ecosystem support, and real-world use. NVIDIA offers the lowest cost per token in the industry.  

What Factors Help Lower Token Cost? 

To optimize token costs, we need to examine how the cost per million tokens is calculated.  
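As a rough sketch of that calculation (the function name and sample figures below are illustrative, not from the article), cost per million tokens is simply the hourly GPU cost divided by the millions of tokens that GPU produces per hour:

```python
def cost_per_million_tokens(cost_per_gpu_hour: float,
                            tokens_per_second_per_gpu: float) -> float:
    """Cost per million tokens = hourly GPU cost / millions of tokens per hour."""
    tokens_per_hour = tokens_per_second_per_gpu * 3600
    return cost_per_gpu_hour / (tokens_per_hour / 1_000_000)

# Example (hypothetical): a $2.00/hour GPU sustaining 1,000 tokens/s
# produces 3.6M tokens per hour, so each million tokens costs about $0.56.
print(round(cost_per_million_tokens(2.00, 1000), 2))
```

Note that throughput appears only in the denominator, which is why the rest of the article concentrates there.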

When looking at this equation, many companies focus on the numerator: the cost per GPU per hour, which in the cloud is the provider's hourly rate and on premises is the amortized hourly cost of the infrastructure. But the real way to lower token cost is to maximize the denominator: the number of tokens produced.  

That denominator carries huge business implications.  

  • Minimizing token cost: as you increase token output, the cost per token drops, boosting profit margins for every interaction.  
  • Maximizing revenue: delivering more tokens per second also means more tokens served from the same deployment. This lets you get more intelligence from your AI products and services, increasing revenue from the same infrastructure.  
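Both effects can be seen at once in a hypothetical scenario (all prices and throughputs below are invented for illustration): at a fixed market price per million tokens, higher throughput simultaneously lowers unit cost and raises the revenue the same hardware can generate.

```python
PRICE_PER_M_TOKENS = 2.00  # hypothetical market price ($ per million tokens)
GPU_COST_PER_HOUR = 2.50   # hypothetical infrastructure cost ($ per GPU-hour)

for tokens_per_sec in (500, 1000, 2000):
    m_tokens_per_hour = tokens_per_sec * 3600 / 1_000_000
    cost_per_m = GPU_COST_PER_HOUR / m_tokens_per_hour      # falls as throughput rises
    revenue_per_hour = m_tokens_per_hour * PRICE_PER_M_TOKENS  # rises with throughput
    margin = revenue_per_hour - GPU_COST_PER_HOUR
    print(f"{tokens_per_sec} tok/s: ${cost_per_m:.2f}/M tokens, "
          f"${revenue_per_hour:.2f}/h revenue, ${margin:.2f}/h margin")
```

Doubling throughput halves cost per token and doubles hourly revenue from the same GPU, so margin grows faster than throughput.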

If you only focus on the numerator, you miss what really drives results. It’s like an iceberg. The numerator is visible above the surface, but the denominator is hidden below and holds the key to your unit economics. To evaluate your infrastructure well, you need to look deeper.  

Surface-Level Inquiry 

  • What is the cost per GPU hour?  
  • What are the peak petaflops and high bandwidth memory capacity?  
  • What are the FLOPS per dollar?  

In-Depth Cost Analysis 

  • What is the cost per million tokens? Specifically, what is the cost per million tokens for large-scale mixture-of-experts (MoE) reasoning models, which are the most widely deployed type of AI models?  
  • What is the delivered token output for enterprises deploying this architecture, where capital commitments to land, power, and infrastructure are substantial? Maximizing the intelligence produced from that investment is critical.  
  • Can the scale-up interconnect handle the all-to-all traffic of MoE models?  
  • Is FP4 precision supported? Can the inference stack make use of FP4 while maintaining high accuracy?  
  • Does the inference runtime support speculative decoding to improve multi-token prediction and increase user interactivity?  
  • Does the serving layer support disaggregated serving, KV-cache-aware routing, KV cache offloading, and other optimizations?  
  • Does the platform support the unique workload requirements of AI, including ultra-low latency, high throughput, and long input sequences? Does the platform support the full lifecycle from training and post-training to high-scale inference across all model architectures to ensure infrastructure flexibility and high utilization?  

All of these algorithmic, hardware, and software optimizations need to work together. If they don’t, the denominator drops. A cheaper GPU that produces fewer tokens per second actually raises your cost per token. The best AI infrastructure gets every part right, so each optimization supports the others.  
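A toy comparison (all figures invented for illustration) makes both points concrete: the "bargain" GPU trap, and the way stacked optimizations multiply into the denominator.

```python
def cost_per_m_tokens(hourly_cost, tokens_per_sec):
    # $/hour divided by millions of tokens produced per hour
    return hourly_cost / (tokens_per_sec * 3600 / 1_000_000)

# The trap: a cheaper GPU with low throughput costs MORE per token.
cheap = cost_per_m_tokens(1.00, 400)   # ≈ $0.69 per million tokens
fast = cost_per_m_tokens(2.50, 3000)   # ≈ $0.23 per million tokens

# Optimizations compound multiplicatively: hypothetical speedup factors for
# FP4 precision, speculative decoding, and disaggregated serving stacked
# on a 400 tokens/s baseline.
effective_tps = 400
for speedup in (2.0, 1.8, 1.5):
    effective_tps *= speedup
print(f"cheap: ${cheap:.2f}/M, fast: ${fast:.2f}/M, optimized: {effective_tps:.0f} tok/s")
```

With these made-up numbers the GPU that costs 2.5x more per hour is roughly 3x cheaper per token, and three modest speedups compound into a 5.4x throughput gain.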

Why Is Cost Per Token Much More Important Than FLOPS Per Dollar? 

Data from the DeepSeek R1 AI model shows the gap between theory and real business results.  

If you only look at compute cost, the NVIDIA Blackwell platform seems about twice as expensive as the NVIDIA Hopper platform, but compute cost doesn’t reflect what you get for your money. FLOPS per dollar suggests Blackwell is twice as good as Hopper. In reality, Blackwell delivers over fifty times more token output per watt and nearly thirty-five times lower cost per million tokens.  

| Metric | NVIDIA Hopper (HGX H200) | NVIDIA Blackwell (GB300 NVL72) | Blackwell relative to Hopper |
| --- | --- | --- | --- |
| Cost per GPU per hour ($) | $1.41 | $2.65 | 2x |
| FLOPS per dollar (PFLOPS) | 2.8 | 5.6 | 2x |
| Tokens per second per GPU | 90 | 6,000 | 65x |
| Tokens per second per MW | 54K | 2.8M | 50x |
| Cost per million tokens ($) | $4.20 | $0.12 | 35x lower |

Note: Data is sourced from NVIDIA analysis, as in the InferenceMAX v2 benchmark.  
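The headline multipliers can be sanity-checked from the table's own absolute figures (a back-of-envelope check; the published ratios are rounded):

```python
# Figures copied from the table above; published multipliers are rounded.
hopper = {"cost_hr": 1.41, "tps_gpu": 90, "tps_mw": 54_000, "cost_per_m": 4.20}
blackwell = {"cost_hr": 2.65, "tps_gpu": 6_000, "tps_mw": 2_800_000, "cost_per_m": 0.12}

per_gpu = blackwell["tps_gpu"] / hopper["tps_gpu"]         # ≈ 67x (table rounds to 65x)
per_mw = blackwell["tps_mw"] / hopper["tps_mw"]            # ≈ 52x (table rounds to 50x)
cost_gap = hopper["cost_per_m"] / blackwell["cost_per_m"]  # = 35x lower
print(round(per_gpu), round(per_mw), round(cost_gap))
```

The per-token gap (35x) is far larger than the hourly cost gap (roughly 2x) precisely because throughput sits in the denominator.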

This significant difference shows that NVIDIA Blackwell offers much greater business value than the older Hopper generation, despite any increase in system costs.  

How to Choose the Right AI Infrastructure 

Looking at AI infrastructure only in terms of compute cost or theoretical FLOPS per dollar does not give a true picture of inference economics. To really understand the revenue potential and profitability, it is better to focus on cost per token and the number of tokens delivered.  

NVIDIA offers the lowest token cost and the highest token throughput in the industry by carefully designing its compute, networking, memory, storage, software, and partner technologies to work together. Continuous optimizations to open-source inference software like vLLM, SGLang, NVIDIA TensorRT-LLM, and NVIDIA Dynamo on the NVIDIA platform help increase token output and lower the cost per token over time, even after the infrastructure is in place.  

Top cloud providers and NVIDIA partners are already offering these benefits at scale. Companies like CoreWeave, Nebius, Nscale, and Together AI use NVIDIA Blackwell infrastructure and have optimized their systems to give businesses the lowest token cost available today, backed by NVIDIA hardware, software, and ecosystem working together.

Source: Rethinking AI TCO: Why Cost per Token Is the Only Metric That Matters