Agentic AI systems are scaling fast, but they come with a hidden cost: context explosion. Every interaction in a multi-agent workflow requires resending full histories, generating up to fifteen times more tokens than standard chat. That volume increases costs and risks goal drift—agents losing track of the original objective. A new model from NVIDIA aims to solve that problem head-on.
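To see why replaying full histories multiplies token usage, consider a toy back-of-the-envelope model (the turn counts and message sizes below are illustrative assumptions, not figures from NVIDIA):

```python
# Toy model of context growth in a multi-agent workflow.
# Every turn resends the entire conversation history, so the total
# tokens processed grow quadratically with the number of turns.

def tokens_processed(turns: int, tokens_per_turn: int) -> int:
    """Total tokens sent when each turn replays the full history."""
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn   # new message appended to history
        total += history             # entire history resent this turn
    return total

# Sending each of 30 turns exactly once would cost 30,000 tokens.
# Replaying the growing history every turn costs far more:
agentic = tokens_processed(30, 1_000)
print(agentic)                        # 465000
print(agentic / (30 * 1_000))         # ~15.5x amplification
```

With these illustrative numbers, a 30-turn agent loop processes roughly fifteen times the tokens of sending each message once, matching the order of magnitude the article cites.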

Nemotron 3 Super is now available as a 120-billion-parameter open model with only 12 billion parameters active at inference. It claims the top spot on efficiency and openness benchmarks, delivering up to five times higher throughput than its predecessor while maintaining leading accuracy for models of its size.

Hybrid Architecture: The Efficiency Boost

The key innovation is a hybrid mixture-of-experts (MoE) architecture that combines Mamba layers for memory and compute efficiency with transformer layers for advanced reasoning. This design allows only 12 billion parameters to be active during inference, while a new latent MoE technique activates four expert specialists at the cost of one, improving accuracy without extra overhead.

  • Hybrid Architecture: Mamba layers deliver four times higher memory and compute efficiency, while transformer layers handle complex reasoning.
  • MoE: Only 12 billion of 120 billion parameters are active at inference.
  • Latent MoE: Activates four experts for the cost of one to generate the next token, boosting accuracy.
  • Multi-Token Prediction: Predicts several future tokens at once instead of one at a time, cutting inference time by up to a factor of three.
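The sparse-activation idea behind MoE can be sketched in a few lines. This is a minimal toy routing layer, not Nemotron's actual implementation; the expert count, top-k value, and dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse mixture-of-experts layer: many experts exist, but only
# the top-k are evaluated per token, so active params << total params.
NUM_EXPERTS, TOP_K, D = 10, 1, 64   # illustrative sizes, not Nemotron's config
experts = [rng.standard_normal((D, D)) for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((D, NUM_EXPERTS))

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route token x to its top-k experts and mix their outputs."""
    logits = x @ router
    chosen = np.argsort(logits)[-TOP_K:]        # indices of selected experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                    # softmax over selected experts
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

x = rng.standard_normal(D)
y = moe_forward(x)
print(y.shape)  # (64,)
# Only TOP_K of NUM_EXPERTS weight matrices are ever touched per token,
# mirroring the 12B-of-120B active-parameter ratio described above.
```

The routing step is why total parameter count and inference cost decouple: capacity scales with the number of experts, while per-token compute scales only with top-k.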

When deployed on NVIDIA’s Blackwell platform in NVFP4 precision, Nemotron 3 Super runs up to four times faster than FP8 on the Hopper architecture without losing accuracy. That performance gap could be a game-changer for developers running large-scale agentic systems.
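Part of the precision advantage is simple arithmetic: 4-bit weights occupy half the memory of 8-bit weights, so more of the model sits in fast memory and more tokens move per unit of bandwidth. A first-order estimate for the active weights (ignoring the per-block scale factors real NVFP4 stores, so actual footprints are slightly larger):

```python
# Rough weight-memory footprint of the 12B active parameters at
# different precisions. Illustrative first-order estimate only.

ACTIVE_PARAMS = 12e9   # active parameters per token, from the model specs

def weight_gb(params: float, bits: int) -> float:
    """Weight storage in GB at the given bits per parameter."""
    return params * bits / 8 / 1e9

fp8_gb = weight_gb(ACTIVE_PARAMS, 8)   # 8-bit: 12.0 GB
fp4_gb = weight_gb(ACTIVE_PARAMS, 4)   # 4-bit: 6.0 GB
print(fp8_gb, fp4_gb)                  # 12.0 6.0
```

Halving the bytes per weight does not by itself explain a 4x speedup; Blackwell's native FP4 tensor-core throughput accounts for the rest.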

Who Benefits—and Who Should Skip?

The model is designed for complex, multi-agent workflows where context retention and tool calling are critical. Software development agents can load entire codebases into memory, enabling end-to-end generation and debugging without segmentation. Financial analysis tools can process thousands of pages of reports in a single session, eliminating the need to restart reasoning mid-conversation.

But those use cases require significant infrastructure. Enterprises and AI-native companies will see the most benefit, while smaller developers may find the model’s scale and cost prohibitive unless they partner with cloud providers or inference services. For now, availability is broad—accessible through platforms like Perplexity, OpenRouter, and Hugging Face—but pricing details remain unclear.

The real question isn’t just whether Nemotron 3 Super works; it’s whether the efficiency gains translate to real-world cost savings for developers. If they do, this could be a turning point in how agentic AI systems are built at scale.