Agentic AI systems require models capable of independently solving complex technical problems.  

Multi-agent systems can produce up to 15 times more tokens than standard chats, since they resend history, tool outputs, and reasoning steps at every step of a long task. This context explosion can lead to goal drift, where agents slowly lose track of the main objective. A related problem, the thinking tax, comes from using large reasoning models for every subtask. Together, these issues make such applications costly and slow in real-world use.  
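The arithmetic behind this explosion is easy to sketch. The toy calculation below (hypothetical numbers, not measured figures) shows how resending the full history at every step inflates total token usage:

```python
# Rough illustration of context growth in a multi-step agent loop.
# Numbers are hypothetical; the point is that resending the full history
# at every step makes cumulative token usage grow quadratically.

def cumulative_tokens(steps, tokens_per_step):
    """Tokens processed when each step resends all prior history."""
    total = 0
    history = 0
    for _ in range(steps):
        history += tokens_per_step   # the transcript grows each step
        total += history             # and the whole thing is sent again
    return total

single_pass = 20 * 500                   # a chat that sends each message once
agentic = cumulative_tokens(20, 500)     # an agent that resends history
print(agentic / single_pass)             # 10.5 — over 10x more tokens in 20 steps
```

The multiplier keeps climbing with the number of steps, which is why long-running agents hit token budgets that short chats never see.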

Today, we are announcing Nemotron 3 Super to solve these problems. The new Super model has 120 billion total parameters, with 12 billion active at a time. It is designed for maximum effectiveness and precision in complex multi-agent tasks such as software development and security triage. This release follows our introduction of Nemotron 3 Nano in December.  

Nemotron 3 Super solves the thinking tax problem with its hybrid mixture-of-experts (MoE) design. It offers more than five times the throughput of the previous Nemotron Super. The model also handles context explosion with a built-in 1-million-token context window, providing agents with long-term memory for accurate reasoning. It is fully open, with open weights, datasets, and recipes, so developers can easily customize, optimize, and deploy it on their own systems.  

What Sets Nemotron 3 Super Apart 

Nemotron 3 Super introduces design features to reduce trade-offs between effectiveness and correctness in large reasoning models.  

  • Latent MOE (a type of mixture-of-experts architecture that compresses hidden data representations) uses token compression to activate more experts (specialized submodels) per inference at the same computational cost.  
  • Multi-token prediction accelerates long-sequence generation by predicting several future tokens in a single step, which also enables speculative decoding.  
  • A Hybrid Mamba Transformer backbone means the model has two main types of layers. Mamba layers process long sequences efficiently, while Transformer layers are specialized for exact reasoning. This combination increases the model’s speed and makes it four times more memory- and compute-efficient.  
  • Native NVFP4, used for pre-training, is a special low-memory format built for NVIDIA Blackwell chips. It reduces memory usage and speeds inference (model output) by 4x on NVIDIA B200 compared to FP8 on NVIDIA H100, while maintaining high accuracy.  
  • After initial training, the model uses reinforcement learning (AI learns by trial and error) across 21 environments, running on NVIDIA Nemotron Gem and Nemotron RL and accumulating 1.2 million simulated experiences (environment rollouts).  

These advantages combine to make the model well-suited for long-running autonomous agents. On Pinchbench, a benchmark for evaluating how well LLMs perform as the brain of an open-cloud agent, Nemotron 3 Super scores 85.6% across the full test suite, making it the best open model in its class.  

Delving Deeply Into The Architecture 

Hybrid Mamba Transformer MOE Backbone 

Super uses the same blended approach as Nano, but at a much larger scale. To understand how its architecture supports this, consider the way its backbone combines three types of layers.  

Mamba-2 layers handle most sequence processing. These State-Space Models operate in linear time with respect to sequence length, enabling the practical use of a 1M Token Context Window. Mamba layers efficiently manage memory when processing large code lengths, extended chat histories, or many documents.  
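The linear-time claim follows from how state-space models process sequences: a fixed-size state is updated once per token, so cost grows linearly with length and memory stays constant. A minimal toy recurrence (a simplified diagonal SSM, not the actual Mamba-2 selective-scan kernel) illustrates the shape of the computation:

```python
import numpy as np

# Minimal diagonal state-space recurrence: one pass over the sequence
# with a constant-size state, so cost is linear in sequence length.
# This is a toy illustration, not Mamba-2's actual selective scan.

def ssm_scan(x, A, B, C):
    """x: (seq_len, 1) inputs; A, B, C: (state_dim,) diagonal dynamics."""
    state = np.zeros_like(B)                  # fixed-size hidden state
    outputs = []
    for x_t in x:                             # one update per token: O(seq_len)
        state = A * state + B * x_t           # recurrent state update
        outputs.append((C * state).sum(-1))   # scalar readout per step
    return np.array(outputs)

x = np.random.randn(1000, 1)
y = ssm_scan(x, A=np.full(16, 0.9), B=np.ones(16), C=np.ones(16) / 16)
print(y.shape)  # (1000,) — the state never grew with sequence length
```

Contrast this with full attention, whose cost grows quadratically because every token attends to every earlier token.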

Transformer attention layers are inserted at key points because SSMs may struggle to locate specific facts within a long context. These attention layers preserve the ability to retrieve targeted information within large inputs.  

MOE layers increase the number of effective parameters without needing heavy computation. Only some experts are used for each token, which keeps latency low and throughput high. This is important when many agents run concurrently in a shared system.  
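A few lines of code make the routing idea concrete. This is a generic top-k MoE sketch with made-up dimensions and expert counts, not Nemotron's actual router:

```python
import numpy as np

# Minimal top-k MoE routing sketch: only k of E experts run per token,
# so compute stays roughly constant as total parameter count grows.
# Shapes and the expert count are illustrative.

def moe_layer(x, gate_w, experts, k=2):
    """x: (d,) token vector; gate_w: (E, d) router; experts: list of (d, d)."""
    scores = gate_w @ x                          # one router logit per expert
    top_k = np.argsort(scores)[-k:]              # pick the k best experts
    weights = np.exp(scores[top_k])
    weights /= weights.sum()                     # softmax over the selected k
    # Only the selected experts do any work:
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top_k))

rng = np.random.default_rng(0)
d, E = 8, 16
x = rng.standard_normal(d)
out = moe_layer(x, rng.standard_normal((E, d)),
                [rng.standard_normal((d, d)) for _ in range(E)])
print(out.shape)  # (8,) — 16 experts' worth of parameters, 2 experts' compute
```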

Latent MOE 

In a typical MOE setup, tokens are routed from the full hidden dimension (all the internal data space) to the different experts (specialized sub-networks). This process can slow down computation, increase costs, and limit the number of experts the model can use.  

Super uses latent MOE (working in a reduced dimension before routing). Here, token embeddings (numerical summaries of tokens) are compressed into a simpler, smaller space. The experts perform their tasks in this compressed space, and their results are then expanded to match the full model size. This approach has practical effects:  

More experts can be used at the same compute cost. Compression enables the model designer to support four times as many experts without requiring more computation.  

Finer-grained specialization: when more experts are available, the model can afford highly specialized routing, for example activating distinct experts for Python syntax versus SQL logic only when needed. This granularity is especially valuable in agentic settings, where a single conversation may span tool calls, code generation, data analysis, and conversational reasoning within a few turns.  
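The compress-route-expand flow can be sketched as follows; all dimensions, the expert count, and the routing rule are illustrative assumptions, not Nemotron's actual configuration:

```python
import numpy as np

# Latent-MoE sketch: compress the token embedding into a smaller latent
# space, run the experts there, then project back to the full dimension.
# Because each expert is (latent x latent) instead of (d x d), many more
# experts fit in the same compute budget. All sizes are made up.

rng = np.random.default_rng(0)
d, latent, E, k = 64, 16, 32, 4

down = rng.standard_normal((latent, d)) / np.sqrt(d)       # compression
up = rng.standard_normal((d, latent)) / np.sqrt(latent)    # expansion
experts = [rng.standard_normal((latent, latent)) for _ in range(E)]
gate_w = rng.standard_normal((E, latent))

def latent_moe(x):
    z = down @ x                               # route in the compressed space
    scores = gate_w @ z
    top_k = np.argsort(scores)[-k:]
    w = np.exp(scores[top_k])
    w /= w.sum()
    z_out = sum(wi * (experts[i] @ z) for wi, i in zip(w, top_k))
    return up @ z_out                          # expand back to model width

out = latent_moe(rng.standard_normal(d))
print(out.shape)  # (64,)
```

In this sketch each expert costs latent² = 256 multiply-accumulates instead of d² = 4,096, which is the headroom that lets the design pack in more experts at the same cost.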

Multi-Token Prediction (MTP) 

Standard language models are trained to predict one token at a time, a fundamentally myopic objective. Super is instead trained with MTP, where specialized prediction heads simultaneously forecast seven future tokens at each position.  

This has two concrete benefits:  

This leads to better reasoning during training. Predicting multiple future tokens helps the model learn longer patterns and logical connections, rather than just guessing the next word. The model is trained to predict entire sequences, which improves performance on tasks that require step-by-step logic.  

MTP also speeds up inference by predicting multiple future tokens in a single pass, reducing the time required to generate long outputs. The MTP heads make draft predictions that can be checked in parallel, enabling up to 3x faster generation for tasks such as code and tool calls without requiring a separate draft model. Both benefits come from a single design: shared weights across all MTP heads limit the number of extra parameters and stabilize training by aligning the heads, keeping draft predictions consistent across longer sequences.  

Native NVFP4 Pre-Training 

Most quantized models are first trained in full precision and then compressed, which usually results in some loss of accuracy. Super does things differently: most floating-point operations during pre-training use NVFP4, NVIDIA’s 4-bit floating-point format. This format, optimized for Blackwell, greatly reduces memory usage and speeds up inference compared to FP8 while maintaining high accuracy.  

By training natively in reduced precision (only using 4-bit math from the beginning), the model learns to be accurate even with small numbers, meaning it stays stable and effective while using much less memory.  
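A simulated round-trip through a 4-bit block format shows why per-block scaling keeps the error small. This sketch uses a plain signed-integer grid for clarity; the real NVFP4 format uses an FP4 value grid with its own scale-factor scheme:

```python
import numpy as np

# Simulated 4-bit block quantization: scale each small block so its max
# maps to the top code, round to the grid, and dequantize. A generic
# illustration of low-precision numerics, not the actual NVFP4 format.

def quantize_dequantize_4bit(x, block=16):
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0  # signed 4-bit: -7..7
    scale[scale == 0] = 1.0                             # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -7, 7)             # values fit in 4 bits
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
w_hat = quantize_dequantize_4bit(w)
err = np.abs(w - w_hat).mean()
print(f"mean abs error: {err:.3f}")  # small, thanks to per-block scaling
```

Each weight now needs only 4 bits plus a shared per-block scale, which is where the memory savings come from.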

Training Super: A Three-Stage Process  

  1. Pre-training builds the model’s base knowledge and reasoning ability on a large curated corpus.  
  2. Supervised fine-tuning further trains the model on specific examples relevant to the tasks it will face, so it learns to act appropriately for those tasks.  
  3. Reinforcement learning further improves model behavior by letting it learn from actual outcomes in test scenarios.  

Pre-training: Super is pre-trained on 25 trillion tokens using the NVFP4 format, learning to be accurate with four-bit math throughout pre-training. The data includes 10 trillion unique curated tokens focused on reasoning and coding.  

Supervised Fine-Tuning: Before reinforcement learning, Super is fine-tuned on about 7 million supervised samples. These come from a larger pool of 40 million samples covering reasoning, instruction following, coding, safety, and multi-step agent tasks. This stage lays the foundation for the behavior RL will refine: the model learns to give correct responses across different tasks, so RL starts from a stable base rather than a raw, pre-trained model.  

Multi-environment reinforcement learning makes the model more agent-like by training it across many environments (such as Nemotron, GitHub, NMEDIAS, and the Open RL Library). These scenarios test tasks such as tool use, coding, and complex planning, creating the main dataset for reinforcement training.   

This reinforcement learning step helps the model work reliably in multi-step workflows, reduces reasoning errors, and handles the structured tasks often found in agent pipelines.  

The Super + Nano Deployment Pattern 

Nemotron 3 Nano works well for processing specific targeted steps in an agent-like workflow; however, as multi-agent applications become more complex and involve several steps, a more powerful model is needed for better planning and reasoning. For example, imagine an agent that needs to choose among tools to create a presentation with 10 high-quality slides.  

Nemotron 3 Super is a great fit for these situations in software development. For example, Nemotron 3 Nano can handle single merge requests, while Nemotron 3 Super can handle more complex coding tasks that require a deeper understanding of the codebase. For expert-level coding, proprietary models are best.  
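In application code, this pattern often reduces to a small routing function. The model names and the complexity heuristic below are illustrative placeholders, not an official API:

```python
# Sketch of the Super + Nano split: route cheap, single-step work to the
# small model and multi-step planning to the large one. Model names and
# the thresholds are hypothetical placeholders.

def pick_model(task):
    """Return which model tier should handle an agent task (illustrative)."""
    if task.get("steps", 1) > 3 or task.get("needs_codebase_context"):
        return "nemotron-3-super"      # deep planning / cross-file reasoning
    return "nemotron-3-nano"           # targeted single steps

print(pick_model({"steps": 1}))                                  # nemotron-3-nano
print(pick_model({"steps": 8, "needs_codebase_context": True}))  # nemotron-3-super
```

The cheap model handles the high-volume steps, and the expensive one is invoked only when the heuristic says the task actually needs it.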

Building With Super’s Open Resources 

Nemotron 3 Super is fully open, including model weights, datasets, and architectural recipes, enabling developers to tailor, refine, and deploy the model on their own systems for privacy and security.  

Get Started 

Start using Nemotron 3 Super today. Deploy it on your preferred platform, whether on a workstation or in the cloud. To experience its capabilities, sign up for a Pro subscription on Perplexity. Access it directly via API, use OpenRouter, or visit build.nvidia.com. Explore Nemotron 3 Super now! 

Source: Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning 
