When David Ondrej posted a clip on April 25 of "Gemini 3.1 Pro" looping ten messages in a row inside OpenClaw — repeating itself, refusing to stop, eventually stalling — it surfaced the question every team using OpenClaw eventually asks: which model actually works in this harness, and which ones quietly fall apart? Marketing benchmarks won't tell you. Leaderboards won't tell you. Only deployment will.
We've shipped OpenClaw across client engagements for six months now, swapping models as new ones land. This is the field guide we wished existed when we started: which models we trust in OpenClaw as of April 2026, which ones we've stopped reaching for, and how to decide for your own workload.
What "Best" Means Inside OpenClaw
OpenClaw isn't a chatbox. It's an agentic harness: tool use, file edits, long-running task loops, persistent context, hooks, and a CLI that runs cron-style automation. A model that scores 90 on coding benchmarks can still be the wrong choice if it doesn't follow OpenClaw's hook conventions, ignores tool-call contracts, or burns through context windows by re-reading the same file four times in a row.
Three traits matter more than benchmark numbers:
- Tool-call discipline — does it call the right tool with the right schema, first try?
- Stop discipline — does it know when the task is done, or does it loop?
- Context economy — does it re-read what's already in context, or trust it?
Almost everything else is downstream of these three. We grade every model in our deployment notes against them. Here's where the major frontier options sit at the end of April 2026.
"The most important thing a model can do in an agentic harness is know when to stop. A model that loops is worse than a model that fails — because it consumes resources, corrupts state, and masks the failure."
— Simon Willison, developer and AI systems researcher
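The three traits can be approximated from a run trace. The following is a minimal sketch, not OpenClaw's own format: the `Step` record, the `read_file` tool name, and the three metrics are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One step of a hypothetical agent trace (not an OpenClaw format)."""
    kind: str          # "tool_call" or "message"
    name: str = ""     # tool name, e.g. "read_file" (assumed name)
    target: str = ""   # main tool argument, e.g. a file path
    ok: bool = True    # did the call validate and succeed as issued?
    text: str = ""     # message text for "message" steps


def grade(trace: list[Step]) -> dict[str, float]:
    calls = [s for s in trace if s.kind == "tool_call"]
    msgs = [s.text.strip().lower() for s in trace if s.kind == "message"]

    # Tool-call discipline: share of tool calls that succeeded as issued.
    tool_ok = sum(s.ok for s in calls) / len(calls) if calls else 1.0

    # Stop discipline: longest run of identical consecutive messages.
    longest, run = (1 if msgs else 0), 1
    for prev, cur in zip(msgs, msgs[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)

    # Context economy: reads of a file that was already read earlier.
    reads = [s.target for s in calls if s.name == "read_file"]
    rereads = len(reads) - len(set(reads))

    return {"tool_ok_rate": tool_ok, "longest_repeat": longest, "rereads": rereads}
```

A high `longest_repeat` is the looping pattern described at the top of this post, and a high `rereads` count is the "same file four times in a row" problem. Exact-match comparison of messages is deliberately crude; a real evaluation would likely need fuzzier matching.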
Sonnet 4.6 — The Default That Earns Its Keep
Anthropic's Claude Sonnet 4.6 is the model we set as the default in nearly every OpenClaw deployment, and it has earned that position through consistent performance. Tool-call discipline is excellent. Stop discipline is the best in class: when a Sonnet 4.6 task is done, it ends. It rarely re-reads files it was just shown. Cost per task for typical agent workflows lands roughly where Haiku's did before the price shift.
Where it falls short: deep multi-step refactors across unfamiliar codebases require sharper analytical reasoning. For code review, intricate architectural decisions, or a debugging trail that needs to hold a long causal chain in mind, Sonnet 4.6 gives up too early. That's exactly when we reach for Opus.
For a deeper take on why agentic work shifted toward this model, see our piece on the agentic work model OpenAI shipped to challenge Claude Mythos.
Opus 4.7 — When Reasoning Depth Actually Pays Off
Opus 4.7 is the heavyweight. We don't run it as a default because the per-task cost adds up fast, but it's our escalation path for three job classes:
- Complex debugging where the cause-effect chain spans multiple files and the symptoms are misleading
- Architectural decisions where the system needs to weigh trade-offs honestly instead of defaulting to the first plausible answer
- High-stakes one-shots like migration scripts, schema changes, or anything that touches production data
Opus 4.7 is also the model we trust most when adaptive thinking matters, that is, when it pays to let the model spend reasoning tokens before committing to a tool call. The cost is real, but the success rate on hard tasks justifies it. Rule of thumb: if a Sonnet run fails twice with similar errors, escalate to Opus instead of retrying.
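The rule of thumb can be sketched as a small wrapper. This is an illustration under stated assumptions, not an OpenClaw feature: the `run` callable, the model labels, the similarity threshold, and the cap on default attempts are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical signature: run(model, task) -> (succeeded, output_or_error_text)
Runner = Callable[[str, str], tuple[bool, str]]


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two error messages as 'the same failure' if they mostly match."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def run_with_escalation(
    task: str,
    run: Runner,
    default: str = "sonnet-4.6",     # placeholder label, not a real model ID
    escalation: str = "opus-4.7",    # placeholder label, not a real model ID
    max_default_attempts: int = 3,
) -> tuple[str, bool, str]:
    last_error = None
    for _ in range(max_default_attempts):
        ok, detail = run(default, task)
        if ok:
            return default, True, detail
        # Two similar failures in a row: stop retrying the default model.
        if last_error is not None and similar(last_error, detail):
            break
        last_error = detail
    # Escalate once; a failure here is reported rather than retried.
    ok, detail = run(escalation, task)
    return escalation, ok, detail
```

Dissimilar errors get another attempt on the default model because they suggest the run is making progress; repeated similar errors suggest it is stuck. The sketch also escalates once the attempt cap is reached, which is a design choice rather than part of the rule.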
GPT-5.5 in OpenClaw — Strong Coder, Wrong Tool For Now
GPT-5.5 in OpenClaw is interesting and frustrating at the same time. As a pure coder it's strong, and OpenAI's confirmation that GPT-5.5 IS Codex (Romain Huet on X, April 25) means there's no longer a "use Codex for coding, GPT-5.5 for general" split. One model, two harnesses.
But OpenClaw isn't its harness. We see two recurring failure modes when we wire GPT-5.5 into OpenClaw:
- Tool-call schema drift — it invents tool fields that don't exist, particularly under longer contexts
- Looser stop discipline — it produces "I'll continue working on this" type filler more often than Sonnet 4.6
For OpenClaw specifically, our current recommendation (April 2026) is: leave GPT-5.5 inside the Codex CLI where its conventions match its training, and keep Anthropic models inside OpenClaw. This will shift as the harness matures around other providers. We're testing on every minor release.
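Schema drift of the kind described above is cheap to detect before a tool call executes. Here is a minimal sketch that assumes each tool declares its arguments as a JSON-Schema-style dict with `properties` and `required` keys; the `edit_file` tool and its fields are hypothetical, not OpenClaw's actual contract.

```python
def check_tool_call(args: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the call matches the schema."""
    allowed = set(schema.get("properties", {}))
    required = set(schema.get("required", []))
    # Invented fields: present in the call but not declared by the tool.
    problems = [f"unknown field: {k}" for k in sorted(set(args) - allowed)]
    # Dropped fields: declared as required but absent from the call.
    problems += [f"missing field: {k}" for k in sorted(required - set(args))]
    return problems


# Hypothetical tool schema, for illustration only.
EDIT_FILE_SCHEMA = {
    "properties": {"path": {"type": "string"}, "content": {"type": "string"}},
    "required": ["path", "content"],
}

drifted = {"path": "a.py", "content": "x = 1", "create_backup": True}
print(check_tool_call(drifted, EDIT_FILE_SCHEMA))
```

Logging the rate of non-empty results per model, bucketed by context length, gives a concrete number to track across minor releases. The check covers field names only; type validation would need a full JSON Schema validator.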
DeepSeek V4 — Cost Disruption That Needs Real Testing
DeepSeek V4 (1.6T parameters, MIT licensed, dramatically cheaper than Opus on equivalent tasks) just shipped. We covered the pricing implications in detail in our DeepSeek V4 pricing earthquake post.
Inside OpenClaw, our early testing shows DeepSeek V4 Flash handles 70-80% of typical Haiku-tier workloads at a fraction of the cost — at $0.14/$0.28 per million input/output tokens, it is roughly 17x cheaper than Claude Haiku 4.5 on output. V4 Pro (at $0.145/$3.48 per million tokens) is genuinely competitive with Opus on isolated reasoning tasks, and reached #1 on LiveCodeBench at 0.935 as of April 2026, though stop discipline lags Anthropic models. We're not yet ready to recommend it as a default in client deployments — too early to know how it behaves in prolonged orchestration loops, and the open-weight version requires self-hosting infrastructure most teams don't have.
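The price comparison is easy to rerun for your own token mix. In the sketch below, the V4 Flash prices are the ones quoted above; the Claude Haiku 4.5 prices ($1 input, $5 output per million tokens) are an assumption not stated in this article, and the 20k-in, 2k-out job is a made-up example.

```python
# USD per million tokens, as (input, output).
V4_FLASH = (0.14, 0.28)   # figures quoted in this article
HAIKU_45 = (1.00, 5.00)   # ASSUMPTION: not stated in this article; verify before relying on it


def job_cost(price: tuple[float, float], input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one job at the given per-million-token prices."""
    inp, out = price
    return (input_tokens * inp + output_tokens * out) / 1_000_000


output_ratio = HAIKU_45[1] / V4_FLASH[1]
input_ratio = HAIKU_45[0] / V4_FLASH[0]

# Hypothetical cron job: 20k input tokens, 2k output tokens.
flash = job_cost(V4_FLASH, 20_000, 2_000)
haiku = job_cost(HAIKU_45, 20_000, 2_000)

print(f"output price ratio: {output_ratio:.1f}x")
print(f"input price ratio:  {input_ratio:.1f}x")
print(f"blended ratio for this job: {haiku / flash:.1f}x")
```

Under these assumptions the output ratio comes out near 17.9x, close to the figure quoted above, while the input ratio is only about 7.1x. For an input-heavy job like this example the blended saving is roughly 8.9x, so the headline multiple overstates what agent workloads that re-send large contexts would see.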
If you're cost-sensitive and willing to invest in evaluation: start testing V4 Flash on lower-stakes OpenClaw cron jobs (intel scans, summarization, content quality checks) and measure stop discipline and tool-call accuracy. Don't deploy to production-touching jobs until you have a multi-week durability track.
Models We've Tested and Don't Recommend
Key findings from six months of deployment work:
- Gemini 2.5 Pro and 3.x-branded variants: Inconsistent in OpenClaw. The Ondrej report matches our own testing. The models are strong in their native environments (Vertex, AI Studio) but do not reliably respect OpenClaw's tool conventions. We've stopped reaching for them.
- Nemotron and Qwen mid-tier: Viable as fallbacks in our cost cascade, but timeout rates climb under longer contexts. Use for short-burst jobs only.
- Older Claude versions (3.5, 4.0, 4.5): Superseded. No reason to run these unless cost forces it.
For broader provider context, see our analysis of agentic compute pricing.
How to Pick: A Decision Matrix
Here's the framework we use when a client asks "which model should we run?":
| Workload | Default | Escalate To | Why |
|---|---|---|---|
| Daily cron jobs (audit, scan, summarize) | Sonnet 4.6 | Opus 4.7 if accuracy critical | Sonnet's stop discipline keeps cost predictable |
| Code generation and review | Sonnet 4.6 | Opus 4.7 for hard bugs | Skip GPT-5.5 in OpenClaw; use it in Codex CLI instead |
| One-shot high-stakes tasks (migrations, prod fixes) | Opus 4.7 | — | Model cost is justified by the cost of a single failure |
| Cost-sensitive bulk work | DeepSeek V4 Flash (testing) | Sonnet 4.6 | Validate stop discipline before scaling |
| Multimodal tasks (vision, audio) | Sonnet 4.6 with vision | Opus 4.7 | Most consistent inside OpenClaw |
The lever we pull most often: escalate from Sonnet to Opus on retry, never the other way. If Sonnet fails twice, Opus usually clears the task on the first try. If Opus fails, retrying Opus rarely helps; the task probably needs different framing.
FAQ
Q: Can I switch models mid-task in OpenClaw? Yes — OpenClaw supports model switching via CLI flag or per-job config. We use this in our cron fallback cascade: if Sonnet times out twice, the next run automatically tries a different provider. Set this up before you need it.
Q: Is "Gemini 3.1 Pro" actually a released model? As of April 26, 2026, we cannot find an official Google announcement for "Gemini 3.1 Pro." The David Ondrej video may be referencing an internal name or a quiet rollout. Treat any "Gemini 3.x" claim as unverified until Google's blog confirms it.
Q: Should I always use the most powerful model "to be safe"? No. Opus 4.7 on a job Sonnet 4.6 handles well costs five to ten times more for the same outcome, and the longer reasoning loops can introduce their own failure modes. Match the model to the workload — escalate on retry, don't escalate by default.
Q: How often should I re-test which model works best? Monthly minimum, weekly if you're running production OpenClaw deployments. Model behavior shifts after every minor release, and harness compatibility changes faster than benchmark scores would suggest.
Bottom Line
For most teams running OpenClaw in 2026, the right default is Sonnet 4.6 with Opus 4.7 as the escalation path. GPT-5.5 belongs in Codex CLI, not OpenClaw. DeepSeek V4 is worth evaluating for cost-sensitive workloads but isn't production-ready in this harness yet. Gemini variants remain inconsistent.
Benchmark figures change constantly. What matters in OpenClaw is tool-call discipline, stop discipline, and context economy, and on those three traits the Anthropic models hold the lead.
If you want help setting up the right model cascade for your OpenClaw deployment — defaults, fallbacks, escalation rules — book a discovery call with Context Studios. We've done this for enough clients to skip the trial-and-error phase.