GPT-5.3-Codex-Spark: 1,000 Tokens/s on Cerebras Chips

OpenAI launches GPT-5.3-Codex-Spark on the Cerebras Wafer Scale Engine 3: over 1,000 tokens per second, an 80% cut in roundtrip overhead, and OpenAI's first production model on non-Nvidia silicon. Here's what it means for developers.

GPT-5.3-Codex-Spark: OpenAI's First Model on Cerebras Chips Delivers 1,000 Tokens Per Second

OpenAI just shipped the fastest coding model in production — and it's not running on Nvidia.

GPT-5.3-Codex-Spark launched on February 12, 2026 as a research preview for ChatGPT Pro users. It's a smaller, speed-optimized version of GPT-5.3-Codex, and the first fruit of OpenAI's partnership with Cerebras Systems. The headline number: over 1,000 tokens per second for real-time coding assistance.

For developers who spend their days waiting for AI suggestions, this changes the interaction model entirely. Let's break down what Codex-Spark is, why Cerebras matters, and what this means for AI-native development.

What Is GPT-5.3-Codex-Spark?

Codex-Spark is a lightweight version of GPT-5.3-Codex, purpose-built for real-time interactive coding. While the full GPT-5.3-Codex excels at long-running autonomous tasks — working for hours or days without intervention — Spark is designed for the opposite: fast, iterative collaboration where you're in the driver's seat.

Key specs:

  • Speed: 1,000+ tokens/second on Cerebras hardware
  • Context window: 128K tokens
  • Modality: Text-only (for now)
  • Availability: Research preview for ChatGPT Pro users
  • Platforms: Codex app, CLI, and VS Code extension
  • Rate limits: Separate from standard limits during preview

On SWE-Bench Pro and Terminal-Bench 2.0, two benchmarks for agentic software engineering, Codex-Spark delivers strong scores while completing tasks in a fraction of the time GPT-5.3-Codex needs. It also outperforms the earlier GPT-5.1-Codex-mini on capability.

What Is Cerebras and Why Does It Matter?

Cerebras Systems builds the largest chips in the world. Their Wafer Scale Engine 3 (WSE-3) is literally the size of a dinner plate, packed with 4 trillion transistors. Unlike conventional GPUs that use many small chips networked together, Cerebras puts everything on a single massive wafer — eliminating the communication bottlenecks that slow down inference.

The company has demonstrated up to 3,000 tokens per second on other models, so the comparatively modest 1,000 tok/s for Codex-Spark likely reflects the model's complexity rather than a hardware ceiling.

Cerebras recently raised $1 billion at a $23 billion valuation and is planning an IPO. Their partnership with OpenAI, announced in January 2026, is worth over $10 billion in a multi-year deal.

The Full Speed Story: Not Just the Chip

The 1,000 tok/s headline is only part of the picture. OpenAI also re-engineered its entire inference pipeline:

  • 80% reduction in client/server roundtrip overhead
  • 50% faster time-to-first-token (TTFT)
  • 30% reduction in per-token overhead
  • Persistent WebSocket connections replacing traditional request-response cycles

These infrastructure improvements will roll out to all models, not just Codex-Spark. The WebSocket path is enabled by default for Spark and will become standard across the fleet.
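To see why a persistent connection cuts roundtrip overhead so sharply, here's a back-of-the-envelope latency model. The handshake and roundtrip numbers below are invented for illustration, not OpenAI's measurements: a traditional request/response cycle pays the connection setup cost on every request, while a persistent WebSocket pays it once per session.

```python
def total_latency_ms(n_requests, handshake_ms, roundtrip_ms, persistent):
    """Total client-side connection latency for n_requests.

    With a traditional request/response cycle, each request pays the
    connection handshake; a persistent WebSocket pays it exactly once.
    """
    handshakes = 1 if persistent else n_requests
    return handshakes * handshake_ms + n_requests * roundtrip_ms

# Example: 50 interactive edits, 120 ms connection setup, 40 ms roundtrip.
classic = total_latency_ms(50, 120, 40, persistent=False)  # 8000 ms
socket = total_latency_ms(50, 120, 40, persistent=True)    # 2120 ms
print(f"classic: {classic} ms, persistent: {socket} ms")
print(f"overhead reduction: {1 - socket / classic:.0%}")
```

The more chatty the session, the larger the win, which is exactly the profile of rapid interactive coding.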

OpenAI's Hardware Diversification Strategy

Codex-Spark signals something bigger than one model: OpenAI is systematically reducing its dependence on Nvidia.

The timeline:

  • October 2025: Multi-year chip deal with AMD
  • November 2025: $38 billion cloud computing agreement with Amazon
  • January 2026: $10B+ partnership with Cerebras announced
  • February 2026: Codex-Spark ships as first non-Nvidia production model
  • Ongoing: Custom AI chip design with TSMC

OpenAI isn't abandoning Nvidia — GPUs remain foundational for training and broad inference. But for latency-critical workloads like real-time coding, specialized hardware like Cerebras offers clear advantages. As OpenAI put it: "GPUs and Cerebras can be combined for single workloads to reach the best performance."

What This Means for Developers

Real-Time Pair Programming Becomes Real

At 1,000 tokens per second, the AI stops feeling like a tool you wait for and starts feeling like a collaborator you think with. You can interrupt, redirect, and iterate with near-instant responses. This is the difference between sending an email and having a conversation.

Two Modes of AI Coding

Codex now supports both paradigms:

  1. Long-running autonomy: GPT-5.3-Codex handles complex, multi-hour tasks
  2. Real-time iteration: Codex-Spark handles rapid prototyping and targeted edits

OpenAI's vision: these modes will eventually blend, with Codex keeping you in a tight interactive loop while delegating longer work to sub-agents in the background.

The Speed Competition Intensifies

With Anthropic's Claude Opus 4.6 (February 2026) pushing agent teams and multi-agent coding, and Google doubling AI investment, the coding AI race is accelerating. Speed is becoming the differentiator — a model that codes faster lets developers iterate faster.

The Context Studios Take

From our Berlin studio, we see Codex-Spark as validation of a thesis we've been building on: the future of development isn't about AI replacing developers — it's about AI matching developer speed of thought.

The best AI coding tools disappear into the workflow. When inference takes seconds, you're forced to context-switch. When it takes milliseconds, you stay in flow. Codex-Spark, combined with tools like Claude Code 2.1 and GitHub Agent HQ, points toward a development experience where the bottleneck shifts from "waiting for the AI" to "knowing what to ask."

For teams building AI-native applications — which is increasingly every team — this means:

  • Faster prototyping cycles: Test ideas in seconds, not minutes
  • Lower cost of experimentation: When iteration is cheap, you try more things
  • New interaction patterns: Real-time steering replaces batch-and-wait

Availability and Pricing

Codex-Spark is currently available as a research preview for ChatGPT Pro users ($200/month). It works in:

  • The Codex app (latest version)
  • The Codex CLI
  • The VS Code extension

API access is rolling out to a small set of design partners first, with broader access coming in the weeks ahead. During the preview, usage has separate rate limits that may adjust based on demand.

What's Next

Codex-Spark is explicitly the "first in a family of ultra-fast models." OpenAI has announced plans for:

  • Larger models on Cerebras hardware
  • Longer context windows
  • Multimodal input support
  • Blended autonomous + real-time workflows

As Sean Lie, Cerebras CTO and co-founder, put it: "What excites us most about GPT-5.3-Codex-Spark is partnering with OpenAI and the developer community to discover what fast inference makes possible — new interaction patterns, new use cases, and a fundamentally different model experience."

The inference speed race is just getting started. And for developers, that's unambiguously good news.


Context Studios is an AI development studio based in Berlin, building AI-native applications and sharing insights on the tools shaping modern software development.


Frequently Asked Questions

What is GPT-5.3-Codex-Spark and how does it differ from GPT-5.3-Codex?

Codex-Spark is a lightweight, speed-optimized version of GPT-5.3-Codex designed for real-time interactive coding. While the full Codex excels at long-running autonomous tasks, Spark delivers over 1,000 tokens per second for fast, iterative collaboration.

Why is Codex-Spark running on Cerebras chips instead of Nvidia GPUs?

Cerebras Wafer Scale Engine 3 chips are purpose-built for inference speed. This is the first production model from OpenAI's partnership with Cerebras Systems, representing a strategic diversification away from sole reliance on Nvidia hardware.

Who can access GPT-5.3-Codex-Spark?

It launched as a research preview for ChatGPT Pro users, available through the Codex app, CLI, and VS Code extension with separate rate limits during the preview period.

What is the context window for Codex-Spark?

Codex-Spark supports a 128K token context window, sufficient for most interactive coding sessions, though smaller than some competing models offer.

How does Codex-Spark perform on coding benchmarks?

It delivers competitive scores on SWE-Bench Pro and Terminal-Bench 2.0 — two benchmarks for agentic software engineering — while maintaining its significant speed advantage over larger models.
