AMD has taken a significant leap forward in AI inference performance, announcing record-breaking results in the MLPerf Inference 6.0 benchmark that push the boundaries of what is possible with large language models (LLMs) at scale.

The company's Instinct MI355X GPUs, built on the CDNA 4 architecture on a 3 nm process, delivered more than faster compute; they set new standards for multinode inference efficiency, model coverage, and ecosystem reproducibility. These advances address three key pain points in AI deployment: scalability beyond single-node configurations, rapid enablement of first-time models, and consistent performance across diverse hardware configurations.

At the heart of AMD's submission is a milestone the industry has long pursued: surpassing 1 million tokens per second at multinode scale. The Instinct MI355X GPUs achieved this on multiple benchmarks, including Llama 2 70B in both Server and Offline modes, as well as GPT-OSS-120B in Offline mode. Beyond raw throughput, the result shows that inference platforms can sustain performance when moving from pilot deployments to production-scale clusters. Crossing this threshold matters for several reasons (a rough sizing sketch follows the list):

  • Higher aggregate throughput for serving large user populations and larger models.
  • Validation that first-time workloads like GPT-OSS can be optimized quickly while maintaining scalability.
  • A foundation for next-generation rack-scale inference deployments, such as AMD's Helios solution.
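
To make the first point concrete, here is a rough sizing sketch. The per-user rate below is an assumed typical streaming speed for an interactive session, not a figure from the submission; the point is simply what a 1-million-tokens-per-second aggregate could support.

```python
# Rough capacity sizing at the 1M tokens/s milestone.
# per_user_tps is an assumed interactive streaming rate, not an MLPerf figure.
aggregate_tps = 1_000_000   # cluster-wide tokens per second (the article's milestone)
per_user_tps = 50           # assumed tokens/s consumed by one interactive stream

concurrent_streams = aggregate_tps // per_user_tps
print(f"~{concurrent_streams:,} concurrent interactive streams")  # ~20,000
```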

The generational leap is equally compelling. Compared to the previous-gen Instinct MI325X GPU, the MI355X delivers 3.1x more throughput on Llama 2 70B Server, reaching 100,282 tokens per second. This jump reflects the power of the full stack: CDNA 4 architecture, high compute density, support for FP4 and FP6 data types, and AMD ROCm software optimizations tailored for modern LLM inference.
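
A quick back-of-envelope check makes the generational claim concrete: dividing the quoted MI355X figure by the 3.1x speedup recovers the implied MI325X baseline. Both inputs come from the paragraph above; the derived baseline is illustrative, not an official MLPerf number.

```python
# Back-of-envelope: implied previous-gen baseline from the quoted speedup.
# Inputs are from the article; the derived figure is illustrative only.
mi355x_tps = 100_282   # Llama 2 70B Server throughput on MI355X (tokens/s)
speedup = 3.1          # quoted gen-over-gen gain vs. the MI325X

implied_mi325x_tps = mi355x_tps / speedup
print(f"Implied MI325X baseline: {implied_mi325x_tps:,.0f} tokens/s")  # ~32,349
```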

AMD's competitiveness extends to head-to-head comparisons as well. On Llama 2 70B, the MI355X platform matched NVIDIA's B200 GPU in Offline mode, delivered 97% of the B200's Server performance, and outperformed the B300 in Interactive mode at 104% of its throughput. This breadth of results, spanning batch throughput, sustained serving, and responsiveness, shows that AMD is no longer playing catch-up; it is setting the pace across multiple inference scenarios.

Perhaps the most exciting development is AMD's ability to bring up first-time models with competitive performance. On GPT-OSS-120B, a workload introduced in MLPerf 6.0, the MI355X platform delivered 111% of the B200's Offline performance and 115% of its single-node Server performance. Against NVIDIA's B300, it reached 91% in Offline and 82% in Server: impressive numbers for a model that had never before been optimized on AMD hardware.

AMD didn't stop at LLMs. It expanded into text-to-video generation with a first-time submission on Wan-2.2-t2v, achieving 93% of B200 single-node performance in Single Stream mode. Post-deadline tuning pushed this further to parity with NVIDIA's B300, though these results were not part of the official MLPerf submission. The takeaway is clear: AMD is expanding its model coverage beyond traditional LLMs into multimodal and generative workloads, keeping pace with an evolving AI landscape.

AMD Expands AI Inference Ecosystem with Record-Breaking MLPerf Performance

Multinode inference is where things get particularly interesting. AMD demonstrated efficient scale-out on Llama 2 70B, scaling from one node to 11 nodes while maintaining near-ideal linear efficiency: 93% in Offline and Server modes, and a remarkable 98% in Interactive mode. Beyond raw tokens per second, this shows that inference clusters can grow without losing efficiency as workloads and model sizes increase; the sketch after the list below makes the arithmetic concrete.

  • Predictable multinode scaling lets inference clusters grow without sacrificing efficiency.
  • Strong server-scale efficiency builds confidence for real-time inference, not just batch processing.
  • Higher scale-out efficiency improves GPU utilization, lowering cost per token.
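
As noted above, here is a minimal sketch of the scaling-efficiency arithmetic: measured multinode throughput divided by ideal linear throughput (node count times single-node throughput). Plugging in the single-node Server figure quoted earlier and the 93% efficiency at 11 nodes lands just above the 1-million-tokens-per-second milestone; the exact submitted numbers may differ, so treat this as illustrative.

```python
def scaling_efficiency(single_node_tps: float, nodes: int, multinode_tps: float) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    return multinode_tps / (nodes * single_node_tps)

# Illustrative projection using figures quoted in the article:
# single-node Server throughput and 93% efficiency at 11 nodes.
single_node_tps = 100_282
nodes = 11
projected_tps = nodes * single_node_tps * 0.93

print(f"Projected 11-node throughput: {projected_tps:,.0f} tokens/s")  # ~1,025,885
print(f"Efficiency check: {scaling_efficiency(single_node_tps, nodes, projected_tps):.0%}")  # 93%
```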

The ecosystem aspect of AMD's submission is equally significant. Nine partners—including Cisco, Dell, HPE, and Oracle—submitted results across four Instinct GPU types (MI300X, MI325X, MI350X, and MI355X). These submissions landed within 4% of AMD's own results, with some matching within 1%. This reproducibility is crucial for customers, as it means the performance demonstrated by AMD can be replicated across real-world systems, reducing deployment risk.

One of the most forward-looking results was a heterogeneous submission by Dell and MangoBoost, using three different Instinct GPU types (MI300X, MI325X, and MI355X) across two geographic locations. More than an exercise in mixing GPU generations, the submission showed that AMD ROCm software can orchestrate inference workloads across systems in different regions, a critical step toward flexible, future-ready AI infrastructure.

AMD ROCm software is the common thread tying these results together. It enabled efficient FP4 execution, optimized GPU-to-GPU communication for multinode scaling, and supported dynamic workload distribution across heterogeneous configurations. For customers, this means more than strong benchmark numbers: it means a platform that can scale with model size, workload diversity, and production deployment requirements.
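
For developers who want to confirm that a ROCm stack is visible to their framework before running inference, the snippet below is a minimal sketch using PyTorch; ROCm builds of PyTorch expose AMD GPUs through the familiar torch.cuda interface, and torch.version.hip distinguishes them from CUDA builds. This is a basic sanity check, not part of AMD's MLPerf harness.

```python
import torch

# ROCm builds of PyTorch expose AMD GPUs via the torch.cuda namespace;
# torch.version.hip is set on ROCm builds and None on CUDA builds.
if torch.version.hip and torch.cuda.is_available():
    print(f"HIP runtime: {torch.version.hip}")
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
else:
    print("No ROCm-visible AMD GPUs found; check the ROCm install and drivers.")
```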

Looking ahead, AMD is building momentum with an annual cadence for Instinct GPUs. The MI300X established a generative AI foothold in 2023, the MI325X extended that foundation in 2024, and the MI350 Series (including the MI355X) pushed boundaries further in 2025. The next generation, the Instinct MI400 Series based on CDNA 5 architecture, is expected to advance this trajectory into rack-scale AI deployments.

The broader context is clear: AMD is not just participating in the generative AI inference transition; it is helping define what production-ready infrastructure looks like. With the Helios rack-scale solution powered by future Instinct generations on the horizon, the MLPerf 6.0 results reinforce a message of consistency and long-term vision.

For developers and system administrators, this means a platform that delivers competitive single-node performance while being ready for cluster-scale deployments. It also means confidence in AMD's ability to keep pace with evolving AI workloads, whether LLMs, multimodal models, or next-generation generative tasks. The question now is not raw performance alone, but how quickly and reliably these advances can be integrated into production environments.