What GPT-5.4 Computer Use Actually Does
GPT-5.4 operates a computer in two distinct modes, and understanding the difference matters for system design.
Mode 1: Code generation. In code-gen mode, it writes Playwright, Selenium, or similar automation scripts based on a goal and a screenshot. You pass it a task ("export the Q1 report from this SaaS dashboard"), it generates runnable code, your infrastructure executes it. The model never touches the live system directly — it's a playwright writing the script, not the actor performing it.
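A sketch of the kind of script code-gen mode might emit for that export task. The URL, selectors, and function name here are illustrative stand-ins, not actual model output:

```python
# Hypothetical output of code-gen mode for the task:
# "export the Q1 report from this SaaS dashboard".

def export_q1_report() -> str:
    # Local import so the sketch can be inspected without Playwright
    # installed; a real generated script would import at the top.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://dashboard.example.com/reports")  # hypothetical URL
        page.click("text=Q1 Report")                        # hypothetical selector
        with page.expect_download() as dl:
            page.click("button:has-text('Export CSV')")
        path = str(dl.value.path())
        browser.close()
        return path
```

Your infrastructure runs the script in a sandbox, captures the download, and logs the run. Because the artifact is plain code, every action is reviewable before and after execution.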
Mode 2: Direct interaction. In direct interaction mode, it issues mouse and keyboard events from screenshots in a feedback loop: it sees the screen, decides the next action, executes it, observes the result, and continues. This is closer to how a human VA works: watching the screen, clicking where needed, typing where needed, escalating when stuck.
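That feedback loop can be sketched as a small skeleton. The `Action` type and the injected callables are our own stand-ins for the model call, screenshot capture, and executor:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # "click", "type", "done", "escalate"
    target: str = ""     # e.g. a screen coordinate or element label
    text: str = ""       # payload for "type" actions

def run_direct_loop(take_screenshot, decide, execute, max_steps=20):
    """Skeleton of the screenshot -> decide -> act loop.

    The three callables are injected so the loop stays testable;
    in production, `decide` would be a model call mapping the
    current screenshot to the next UI action."""
    for _ in range(max_steps):
        screen = take_screenshot()
        action = decide(screen)
        if action.kind in ("done", "escalate"):
            return action.kind   # terminal: finished, or hand to a human
        execute(action)
    return "escalate"            # step budget exhausted: stop and escalate
```

The explicit step budget and `escalate` terminal state are the supervision hooks: the loop never runs unbounded, and every exit path is observable.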
Both modes are steerable. Developers can inject guidance via developer messages — think of them as operator-level instructions that override user intent. You can also define custom confirmation policies: "always confirm before submitting a form," "never click delete without a second-pass check." This makes GPT-5.4's computer use auditable and controllable in ways earlier approaches weren't, which is the feature that actually gets it past enterprise security reviews.
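A minimal sketch of what such a confirmation policy can look like on the caller's side. The rule names and patterns below are our own illustration of the idea, not OpenAI's policy API:

```python
import re

# Illustrative policy table: each rule maps a pattern over the
# proposed action's description to a verdict.
POLICY = [
    (re.compile(r"\bsubmit\b.*\bform\b", re.I), "confirm"),      # always confirm form submits
    (re.compile(r"\bdelete\b", re.I), "second_pass"),            # never delete without review
]

def gate(action_description: str) -> str:
    """Return the verdict for a proposed action: 'confirm',
    'second_pass', or 'allow' if no rule matches."""
    for pattern, verdict in POLICY:
        if pattern.search(action_description):
            return verdict
    return "allow"
```

The point is that the gate sits outside the model: verdicts are deterministic, loggable, and reviewable in a security audit regardless of what the model proposes.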
The vision model underneath has also improved substantially. On MMMU-Pro (a multimodal reasoning benchmark), GPT-5.4 scores 81.2% versus 79.5% for GPT-5.2. On OmniDocBench, error rates dropped from 0.140 to 0.109. This matters because computer use lives or dies on visual understanding — a model that misreads a UI element or misidentifies a button can spiral into compounding errors within three steps. Better vision means more reliable execution.
The Benchmark Reality Check
Benchmarks are maps, not terrain. But these particular maps are worth reading carefully because they cover scenarios that previously had no good measurement.
OSWorld-Verified: 75.0% — This is the headline number. OSWorld tests completion of real desktop tasks across operating systems. GPT-5.2 scored 47.3% on the same benchmark. Human performance sits at 72.4%. At 75.0%, the model clears the human baseline on desktop automation, which is a threshold the industry has been eyeing for two years.
WebArena-Verified: 67.3% — Browser-based task completion across realistic web scenarios. Shopping, form submission, information retrieval, account management. 67.3% means roughly two-thirds of browser tasks complete without human rescue. The other third still needs attention.
Online-Mind2Web: 92.8% — Screenshot-based web navigation. This is the highest of the computer-use numbers and reflects its strongest mode: point it at a screenshot, give it a task, and it largely gets there.
BrowseComp: 82.7% — Research browsing with complex multi-step information retrieval. GPT-5.2 was at 65.8% here. A 16.9-point jump in research quality matters for any agent that needs to gather information before acting.
GDPval: 83.0% — This one gets less attention but deserves more. Across 44 occupational domains, the model matches or exceeds professional human performance 83% of the time. Spreadsheet modeling specifically hits 87.3% (up from 68.4% for GPT-5.2). For anyone building agents in finance, ops, or professional services, these numbers define what's now automatable.
The contrarian take: 75% on OSWorld means 25% failure. In a workflow where 10 steps chain together, even modest per-step failure rates compound fast. The right mental model isn't "GPT-5.4 can automate my computer" — it's "GPT-5.4 can handle the bulk of repeatable, well-defined computer tasks, and needs a supervision layer for the rest." Our guide to AI agents covers how to design that supervision layer properly.
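The compounding math is worth running once. Assuming independent per-step success rates (an assumption; real failures often correlate), chained reliability decays fast:

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in an n-step chain succeeds,
    assuming independent per-step success rates."""
    return per_step ** steps

# Even a 97%-reliable step leaves a 10-step workflow finishing
# only ~74% of the time; at 90% per step it drops below 35%.
print(round(chain_success(0.97, 10), 3))  # 0.737
print(round(chain_success(0.90, 10), 3))  # 0.349
```

This is why the supervision layer matters more than the headline score: checkpoints that catch a failed step reset the chain instead of letting errors compound.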
Tool Search: Agents That Find Their Own Tools
One of GPT-5.4's less-discussed upgrades is what OpenAI calls tool search. On 250 Scale MCP Atlas tasks, it uses 47% fewer tokens than GPT-5.2 to find and invoke the right tool for a job.
This matters more than the raw number suggests. Token efficiency in tool selection isn't just a cost story — it's a latency story and an architecture story. When an agent needs to decide which tool to call, token-heavy reasoning slows the loop and burns context budget. A 47% reduction means faster agent cycles, more room in the context window for actual task data, and meaningfully lower API costs at scale.
For developers building MCP-connected agents, this changes the calculus on how many tools you can expose to the model at once. Previously, giving an agent access to a large tool registry was a trade-off: more capability, worse selection efficiency, higher cost. The model shifts that curve. You can expose more tools without paying a proportional attention penalty.
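The registry-narrowing idea can also be approximated on the client side. This keyword-overlap pre-filter is a deliberately crude illustration of the concept, not how OpenAI's tool search actually works:

```python
def prefilter_tools(registry, task, top_k=3):
    """Rank tools by word overlap between the task and each tool's
    description, exposing only the top_k to the model. A crude,
    client-side stand-in for model-native tool search."""
    task_words = set(task.lower().split())

    def score(tool):
        return len(task_words & set(tool["description"].lower().split()))

    return sorted(registry, key=score, reverse=True)[:top_k]

registry = [
    {"name": "export_report", "description": "export a report as csv"},
    {"name": "send_email", "description": "send an email message"},
    {"name": "create_ticket", "description": "create a support ticket"},
]
top = prefilter_tools(registry, "export the quarterly report", top_k=1)
```

With native tool search doing this ranking inside the model, the trade-off the paragraph describes (more tools vs. selection cost) flattens, and hacks like this become unnecessary.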
Combined with the 1M token context window, GPT-5.4's architecture starts to look like it was designed specifically for long-horizon agentic tasks — the kind where an agent needs to hold a large working memory, consult many tools, and execute dozens of steps without losing the thread. The Claude Code loop approach is one pattern for managing this; GPT-5.4 now offers a competitive alternative within the OpenAI ecosystem.
What Changed in 6 Months
| Capability | GPT-5.2 (Sep 2025) | GPT-5.4 (Mar 2026) | Delta |
|---|---|---|---|
| Desktop automation (OSWorld) | 47.3% | 75.0% | +27.7 pts |
| Research browsing (BrowseComp) | 65.8% | 82.7% | +16.9 pts |
| Spreadsheet modeling | 68.4% | 87.3% | +18.9 pts |
| Visual reasoning (MMMU-Pro) | 79.5% | 81.2% | +1.7 pts |
| Document OCR error (OmniDocBench) | 0.140 | 0.109 | -22% |
| False claims | baseline | -33% | significant |
| Errors | baseline | -18% | significant |
| Context window | ~200K | up to 1M tokens | 5× |
| MCP tool search | baseline | -47% tokens | significant |
| Browser tasks (WebArena) | — | 67.3% | new |
| Screenshot navigation (Mind2Web) | — | 92.8% | new |
The 27.7-point jump on OSWorld is the standout. To put it in perspective: six months ago, a 47% desktop automation score meant computer-use agents were interesting research. At 75%, they're production-relevant for structured, repeatable workflows. That shift happened in a single model generation.
Reliability also improved significantly: 33% fewer false claims and 18% fewer errors versus GPT-5.2. For agents that make decisions — not just retrieve information — reliability is as important as raw capability. An agent that's 10% more capable but 15% less reliable is often worse in practice. This version improves both simultaneously, which is harder than it sounds.
Building Agents With GPT-5.4: What's Different Now
Three things changed in practice for teams building agentic systems.
1. Computer use is a first-class primitive. With GPT-5.2 and earlier, computer use required wrapping external APIs, stitching together separate vision and action models, and debugging a system that wasn't designed to be one thing. With this release, the capability is native. One model, one API, one context. That simplification alone reduces the surface area for production failures.
2. Confirmation policies make agents deployable. The ability to define custom confirmation policies — "pause before any write operation," "confirm before navigating away from the current page" — means you can tune the autonomy/safety dial per workflow. A financial reporting agent that reads data can run fully autonomously. One that submits invoices gets a human-in-the-loop gate. This granularity is what turns demos into deployable systems.
3. The 1M context window changes long-horizon task design. Agents that previously needed to summarize and compress their working memory every N steps can now hold longer task histories, more tool outputs, and larger documents in context simultaneously. For workflows like Karpathy-style autoresearch, where the agent needs to hold a research thread across many sources, this is a genuine architectural unlock.
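A back-of-envelope budget check makes the design change concrete. The window size matches the 1M figure above; the reserve size and token counts below are assumptions for illustration:

```python
def fits_in_context(history_tokens, tool_output_tokens, doc_tokens,
                    window=1_000_000, reserve=50_000):
    """Rough check of whether a long-horizon agent can keep its full
    working set (task history, tool outputs, documents) in context,
    rather than summarizing every N steps. `reserve` holds headroom
    for the system prompt and the model's response."""
    used = history_tokens + tool_output_tokens + doc_tokens
    return used + reserve <= window

# A working set that forced constant summarization at ~200K now fits whole.
print(fits_in_context(400_000, 250_000, 200_000))  # True
print(fits_in_context(700_000, 200_000, 100_000))  # False: still over budget
```

The useful habit is running a check like this before each step: when it flips to False, that is the point to compress, not on a fixed schedule.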
The practical starting point for most teams is Playwright-mode computer use (code generation, not direct interaction). It's easier to audit, easier to test, and easier to replay when something goes wrong. Direct screenshot-based interaction is better suited for applications where the target environment doesn't have a programmable API — legacy enterprise software, third-party SaaS dashboards, or anywhere you'd otherwise be screenscraping.
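Because generated scripts are just files, the audit trail can be as simple as archiving each one with a content hash before execution. The directory layout and metadata fields below are our own convention, not an OpenAI feature:

```python
import hashlib
import json
import pathlib
import time

def archive_generated_script(code: str, task: str,
                             out_dir: str = "script_audit") -> pathlib.Path:
    """Store a model-generated automation script plus metadata under a
    content hash before running it, so any failed run can be audited
    and replayed byte-for-byte."""
    digest = hashlib.sha256(code.encode()).hexdigest()[:12]
    root = pathlib.Path(out_dir)
    root.mkdir(parents=True, exist_ok=True)
    script_path = root / f"{digest}.py"
    script_path.write_text(code)
    (root / f"{digest}.json").write_text(json.dumps(
        {"task": task, "sha256_prefix": digest, "ts": time.time()}))
    return script_path
```

Replaying a failure then means re-running the exact archived file — the property that makes code-gen mode easier to debug than a stream of past mouse events.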
The Competitive Picture (Claude, Gemini, Copilot)
GPT-5.4 didn't invent computer-use AI. Anthropic has had computer use since Claude 3.5 Sonnet — now extended and refined in Claude Opus 4.6. Google's Gemini 2.5 Pro has deepening agentic capabilities. Microsoft Copilot is woven into the Office stack in ways that increasingly blur the line between assistant and automation engine.
So what does it actually change competitively?
The key differentiator is the combination of native computer use at this performance level plus a model designed from the start for tool-heavy agentic workflows. Claude's computer use is strong (Anthropic doesn't publish the same OSWorld numbers, which is itself informative), but the MCP tool search efficiency and the 1M context window are GPT-5.4's architectural advantages for multi-tool agent systems.
Gemini 2.5 Pro is competitive on multimodal tasks but lives primarily in Google's ecosystem. For teams not already deep in Google Cloud, the switching cost is real. Microsoft Copilot is powerful for Office workflows specifically — the same-day launch of ChatGPT for Excel is a direct response to that. But Copilot's generalist computer-use capabilities lag the native model approach.
The honest answer: if you're building agents that live in the OpenAI ecosystem or need maximum flexibility across application types, GPT-5.4 is the current best option. If you're building primarily on Anthropic's tooling — where ad agencies are already vibe-coding their own GEO tools with Claude Code — the switch isn't obviously worth it. The gap between the frontrunners is meaningful but not unbridgeable. Architecture decisions matter more than model selection at the margin.
What This Means If You're Building AI Products
Computer use at 75% desktop task completion changes the build/buy calculus for several product categories.
Robotic Process Automation (RPA): Legacy RPA tools like UiPath and Automation Anywhere are built on brittle selector-based automation. The model handles the same workflows using visual understanding — no selectors, no maintenance when the UI changes. The moat around traditional RPA vendors just got shallower.
Browser automation services: Anything that sells "AI-powered browser automation" as a feature is now competing with a capability that ships in the base model. Add differentiation in reliability layers, human escalation UX, and domain-specific training — not the core computer-use capability itself.
Professional services AI: GDPval at 83.0% across 44 occupations means the AI is now more reliable than the median professional on a large swath of structured tasks. That's not a replacement story — it's a leverage story. One professional with AI working at 83% across the task spectrum operates with fundamentally different throughput than one without it. Build tools that amplify that leverage rather than trying to compete with it.
Long-horizon research agents: With the 1M context window and improved BrowseComp performance, research agents that previously needed constant human checkpoints can now run longer unattended. The cost model for deep research automation drops substantially.
If you're evaluating where to apply GPT-5.4 in your stack, start with our services overview — we work through exactly these scoping decisions with teams building on the current generation of models.
FAQ
What is GPT-5.4 and when was it released? GPT-5.4 is OpenAI's latest model, released on March 5, 2026. It's the first general-purpose model with native computer use — able to control browsers, desktop apps, and software via screenshots and instructions.
How does GPT-5.4 computer use compare to human performance? On OSWorld-Verified, GPT-5.4 scores 75.0% versus 72.4% for humans on desktop automation tasks — narrowly exceeding the human baseline. On Online-Mind2Web screenshot navigation, it reaches 92.8%. Humans still outperform it on tasks requiring judgment, context, and exception handling.
Can GPT-5.4 replace RPA tools like UiPath or Automation Anywhere? For structured, repeatable workflows on modern UIs, GPT-5.4 handles a significant share of what traditional RPA covers — without brittle selectors or maintenance overhead when UIs change. For complex enterprise deployments with audit trails, approval workflows, and legacy system integration, RPA tooling still provides value. The two will coexist for 2-3 years, then GPT-5.4's approach wins on most greenfield implementations.
What's the difference between GPT-5.4 Playwright mode and direct screenshot mode? Playwright mode generates automation code (Playwright, Selenium) which your infrastructure executes — the model never touches the live system directly. Screenshot mode issues direct mouse/keyboard events in a feedback loop. Playwright mode is easier to audit, test, and replay; screenshot mode works on any interface, including legacy apps with no programmable API.
How does GPT-5.4 compare to Claude Opus 4.6 for computer use? GPT-5.4 publishes a 75.0% OSWorld score. Anthropic doesn't publish equivalent numbers for Claude Opus 4.6, making direct comparison difficult. GPT-5.4's documented advantages include 47% better MCP tool search efficiency and a 1M token context window — both meaningful for multi-tool agent architectures. Claude's ecosystem advantages (strong tooling, active developer community) remain real.
Is GPT-5.4 available to all API users or only enterprise? GPT-5.4 is available in the standard OpenAI API, in ChatGPT (as GPT-5.4 Thinking), and in Codex. A GPT-5.4 Pro tier with higher rate limits and enterprise SLAs targets professional and enterprise users. Computer-use capabilities are available across tiers, though rate limits and pricing differ.