OpenAI Deploys Codex-Spark on Cerebras Chips, Signaling a Strategic Shift in AI Inference Hardware

OpenAI has taken a step that could reshape the AI hardware landscape by running its GPT-5.3-Codex-Spark model on Cerebras Systems’ Wafer Scale Engine 3 (WSE-3), marking the company’s first major production deployment on non-NVIDIA silicon. While NVIDIA remains OpenAI’s primary compute partner, the introduction of Codex-Spark, a model optimized for real-time coding assistance, reveals a strategic diversification in hardware choices, particularly for workloads where latency is critical.

The shift is notable because OpenAI’s new model prioritizes interactive, low-latency performance, a domain where traditional accelerators like NVIDIA’s Blackwell platform, designed for high-throughput batch processing, may not excel. By leveraging the WSE-3’s 21 petabyte-per-second on-chip memory bandwidth and 44GB of SRAM, OpenAI claims a 50% reduction in time-to-first-token compared to standard Codex variants, along with sustained generation of 1,000 tokens per second (TPS), a rate the company describes as approaching the responsiveness of a human pair programmer.
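
How far on-chip bandwidth can lift decode speed is easy to bound with arithmetic. The sketch below estimates the memory-bandwidth ceiling on single-stream token generation; the 20B-parameter size and 8-bit precision are illustrative assumptions (OpenAI has not disclosed Codex-Spark’s dimensions), and the HBM figure is a rough single-GPU comparison point:

```python
# Ceiling on single-stream decode speed when generation is memory-bound:
# each new token requires streaming every weight once, so throughput is
# capped at bandwidth / bytes-per-token. Model size and precision below
# are illustrative assumptions, not disclosed Codex-Spark figures.

def decode_tps_ceiling(bandwidth_bps: float, n_params: float, bytes_per_param: float) -> float:
    """Upper bound on tokens/sec for a weight-streaming decode loop."""
    return bandwidth_bps / (n_params * bytes_per_param)

WSE3_SRAM_BW = 21e15   # 21 PB/s on-wafer SRAM bandwidth (article figure)
GPU_HBM_BW = 8e12      # ~8 TB/s, roughly one HBM3e-class GPU (assumption)

N_PARAMS = 20e9        # hypothetical 20B-parameter model
BYTES_PER_PARAM = 1    # 8-bit weights (assumption)

print(f"WSE-3 ceiling:      {decode_tps_ceiling(WSE3_SRAM_BW, N_PARAMS, BYTES_PER_PARAM):>12,.0f} tok/s")
print(f"Single-GPU ceiling: {decode_tps_ceiling(GPU_HBM_BW, N_PARAMS, BYTES_PER_PARAM):>12,.0f} tok/s")
```

These are upper bounds, not predictions; real systems land far below them once attention KV-cache traffic, scheduling, and network overhead are counted. The point is the roughly three-orders-of-magnitude gap in the ceiling itself, which is what makes a 1,000 TPS interactive target plausible on SRAM and strained on HBM.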

This isn’t just an engineering tweak; it’s a signal. Cerebras’ architecture, a single chip built from a full 300mm wafer with 900,000 AI-optimized cores, addresses a key limitation of modern AI systems: interactive inference is memory-bound. Generating each token means streaming the model’s weights past the compute units, so single-user workloads like coding assistance stall on data movement rather than on arithmetic. NVIDIA’s strengths lie in scaling training and large-batch inference, but for latency-sensitive applications, Cerebras offers an alternative that OpenAI is now exploiting.
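
To make the bottleneck concrete, consider the arithmetic intensity of the matrix-vector products that dominate single-stream decoding. A minimal sketch, with illustrative hardware numbers rather than the specs of any particular chip:

```python
# Arithmetic intensity (FLOPs per byte of weights read) of a decode-step
# matmul. At batch size 1 each weight fetched from memory participates in
# roughly one multiply-add, so intensity is tiny; batching amortizes the
# same weight traffic across many tokens.

def arithmetic_intensity(batch_size: int, bytes_per_param: float) -> float:
    # ~2 FLOPs (multiply + add) per parameter, per sequence in the batch.
    return 2.0 * batch_size / bytes_per_param

for batch in (1, 16, 256):
    print(f"batch={batch:>3}: {arithmetic_intensity(batch, 2.0):7.1f} FLOPs/byte (fp16 weights)")

# Illustrative accelerator: 1 PFLOP/s of compute over 8 TB/s of bandwidth
# is compute-bound only above 1e15 / 8e12 = 125 FLOPs/byte. Batch-1 decode
# delivers ~1 FLOP/byte, so the chip mostly waits on weight traffic; this
# is exactly the regime where enormous on-chip SRAM bandwidth pays off.
```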

Why the move matters: While NVIDIA’s Blackwell platform has slashed per-token training costs by as much as 10x, it is less optimized for the interactive, single-user workflows that Codex-Spark targets. OpenAI’s deployment on the WSE-3 suggests that for certain use cases, particularly those demanding sub-100ms response times, Cerebras provides a viable, and perhaps superior, alternative. This doesn’t mean OpenAI is abandoning NVIDIA; the two remain partners in large-scale training. But the Codex-Spark deployment underscores that even AI labs with deep NVIDIA ties are exploring multi-vendor strategies to future-proof their infrastructure.
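
Sub-100ms claims are also easy to check from the outside. The snippet below measures time-to-first-token and post-first-token chunk rate over the OpenAI Python SDK’s standard streaming interface; the model identifier is hypothetical, since the article doesn’t give Codex-Spark’s API name:

```python
# Measure time-to-first-token (TTFT) and streaming rate for one request.
# The model name is a placeholder; substitute whatever identifier the API
# actually exposes.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="gpt-5.3-codex-spark",  # hypothetical identifier
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1
elapsed = time.perf_counter() - start

assert first_token_at is not None, "stream returned no content"
ttft = first_token_at - start
gen_time = max(elapsed - ttft, 1e-9)
print(f"TTFT: {ttft * 1e3:.0f} ms")
print(f"~{n_chunks / gen_time:.0f} chunks/s after first token (chunks approximate tokens)")
```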

Key technical highlights of Cerebras WSE-3

  • Process Node: TSMC 5nm
  • Transistors: ~4 trillion
  • Compute Cores: 900,000
  • Core Architecture: AI-optimized programmable processing cores
  • On-Chip SRAM: 44GB
  • Memory Bandwidth: 21 PB/s
  • Form Factor: single monolithic die spanning a full 300mm wafer
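
One practical consequence of those figures, worth sanity-checking: 44GB of SRAM bounds what can live entirely on-wafer. The sizes and precisions below are illustrative assumptions, not disclosed Codex-Spark details:

```python
# Which model sizes fit entirely in 44GB of on-wafer SRAM? Ignores the
# additional SRAM needed for activations and KV cache, so real headroom
# is smaller. Sizes/precisions are illustrative assumptions.
SRAM_BYTES = 44e9

for n_params, label in [(8e9, " 8B"), (20e9, "20B"), (70e9, "70B")]:
    for bytes_per_param, prec in [(2.0, "fp16"), (1.0, "int8"), (0.5, "int4")]:
        need_gb = n_params * bytes_per_param / 1e9
        verdict = "fits on-wafer" if need_gb <= 44 else "spills off-wafer"
        print(f"{label} @ {prec}: {need_gb:6.1f} GB -> {verdict}")
```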

The economic rationale is equally compelling. Serving Codex-Spark interactively on NVIDIA’s infrastructure would be cost-prohibitive, because the platform is tuned for batch throughput rather than per-request latency. Cerebras’ monolithic die keeps model weights in on-chip SRAM, eliminating the external memory controllers and high-speed interconnects that add both latency and complexity. For OpenAI, this translates to a hardware-agnostic approach: NVIDIA for scaling, Cerebras for speed.
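
In practice, “NVIDIA for scaling, Cerebras for speed” becomes a routing decision at the serving layer. A minimal sketch of that pattern, with hypothetical pool names and an illustrative 100ms threshold (this is a guess at the pattern, not OpenAI’s actual serving logic):

```python
# Route requests by latency budget: tight interactive SLOs go to a
# low-latency (SRAM-resident) pool, everything else to a GPU batch pool.
# Pool names and the 100 ms cutoff are hypothetical.
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    interactive: bool     # e.g., keystroke-driven IDE completion
    ttft_slo_ms: float    # time-to-first-token budget the caller tolerates

def pick_backend(req: Request) -> str:
    if req.interactive and req.ttft_slo_ms <= 100:
        return "wse3-low-latency-pool"    # hypothetical Cerebras pool
    return "blackwell-batch-pool"         # hypothetical NVIDIA pool

print(pick_backend(Request("fix this bug", interactive=True, ttft_slo_ms=80)))
print(pick_backend(Request("summarize the repo", interactive=False, ttft_slo_ms=2000)))
```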

What’s next? The inference market is entering a phase where specialized architectures—whether from Cerebras, AMD, or emerging ASIC players—could challenge NVIDIA’s near-monopoly. OpenAI’s move isn’t a rejection of NVIDIA but a recognition that no single vendor dominates all AI workloads. For developers and enterprises relying on real-time AI, the emergence of Cerebras as a production-ready option could accelerate innovation in latency-critical applications, from coding assistants to interactive AI agents.

The broader implication is clear: The AI hardware ecosystem is diversifying. While NVIDIA remains the default for most workloads, OpenAI’s deployment of Codex-Spark on Cerebras signals that latency-optimized inference is becoming a distinct category—one where alternatives are no longer just theoretical but operational.