Model Selection

ARC-AGI-3 Just Reset the Scoreboard. Here's Which Model to Run in Your OpenClaw Agent Right Now.

The AI benchmark community woke up this week to ARC-AGI-3 — a new test so hard that frontier models that looked unbeatable six months ago are suddenly looking mediocre. For anyone running a self-hosted AI agent, this is your cue to reassess. Not all models are equal when they're doing real work, autonomously, all day.

🦞 claw.mobile Editorial · March 30, 2026 · 13 min read

ARC-AGI-3: Why the Benchmark Drop Matters to You

ARC-AGI-3 was released in late March 2026, and the AI community's reaction was immediate: the leaderboard that everyone had been tracking just got flipped on its head. ARC-AGI-3 is a reasoning benchmark designed to test fluid intelligence — novel problems that can't be memorized from training data. It's closer to what an AI agent actually does all day when you leave it running.

Most public benchmarks favor recall and pattern-matching. ARC-AGI-3 doesn't. The result? A handful of models that scored impressively on MMLU or HumanEval look significantly weaker when required to reason through genuinely novel situations. And a few models — including some you might not have been using — have jumped.

# ARC-AGI-3 scores as of March 2026 (approximate top performers)

Claude Sonnet 4.6    72.4%   ← reasoning + tool use
Gemini 2.5 Pro       68.1%   ← long context champion
GPT-4.1              61.7%   ← strong but lagging
Grok 3               58.3%   ← real-time data edge
Llama 3.3 70B        41.2%   ← local, free, capable

The difference between 72% and 41% sounds abstract. Run both models as an autonomous agent for a week and it becomes visceral: one finishes tasks, one hallucinates tool calls. For OpenClaw users — who leave agents running cron jobs, sub-agent chains, and overnight research tasks — this matters directly.

How OpenClaw API Providers Actually Work

OpenClaw isn't locked to any single model. It's a runtime — the same agent framework can run against Claude, OpenAI, Gemini, Grok, or a local Ollama instance. The model is configured per-session or set as a default, and you can override it per-task.

This architectural choice is underappreciated. You can run Claude for deep reasoning tasks, Grok for anything requiring real-time data, and a local Llama for bulk processing you don't want billed. The agent logic stays identical — only the inference backend changes.
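That routing idea can be sketched in a few lines. The task categories and the pickModel helper below are our own illustration, not part of OpenClaw's API; the runtime simply accepts a provider/model string per session or per job:

```typescript
// Hypothetical task router: map a task category to the provider/model ID
// OpenClaw should use for it. Only the ID strings come from this guide;
// the categories and the helper itself are illustrative.
type TaskKind = "reasoning" | "realtime" | "bulk";

const MODEL_FOR: Record<TaskKind, string> = {
  reasoning: "anthropic/claude-sonnet-4-6", // deep reasoning, sub-agent chains
  realtime:  "xai/grok-3",                  // anything that needs fresh data
  bulk:      "ollama/llama3.3",             // high-volume work you don't want billed
};

function pickModel(kind: TaskKind): string {
  return MODEL_FOR[kind];
}
```

A cron payload would then carry pickModel("realtime") as its model field, while your default session keeps the reasoning model.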

# ~/.openclaw/config.json — provider configuration

{
  "model": "anthropic/claude-sonnet-4-6",
  "providers": {
    "anthropic": {
      "apiKey": "sk-ant-..."
    },
    "openai": {
      "apiKey": "sk-..."
    },
    "google": {
      "apiKey": "AIza..."
    },
    "xai": {
      "apiKey": "xai-..."
    }
  }
}

The model field sets your default — what every cron job and chat session uses unless you override it. Overrides can be set per-session via /model or per-job by specifying model in the cron payload.

Provider IDs

OpenClaw uses a provider/model-name format: anthropic/claude-sonnet-4-6, openai/gpt-4.1, google/gemini-2.5-pro, xai/grok-3. Local Ollama models follow the same pattern, e.g. ollama/llama3.3.
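Since an ID is just provider, slash, model name, splitting one apart is trivial. The parseModelId helper below is our own illustration, not an OpenClaw function; it splits on the first slash only, since model names themselves contain dashes and dots:

```typescript
// Split a "provider/model-name" ID into its parts. Illustrative helper;
// OpenClaw itself consumes the combined string directly.
function parseModelId(id: string): { provider: string; model: string } {
  const slash = id.indexOf("/"); // split at the first "/" only
  if (slash === -1) {
    throw new Error(`expected provider/model-name, got "${id}"`);
  }
  return { provider: id.slice(0, slash), model: id.slice(slash + 1) };
}
```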

Model-by-Model Breakdown: What OpenClaw Users Actually Experience

Benchmarks give you a number. This section gives you what happens when you leave a model running autonomously for days.

🧠 Claude Sonnet 4.6 (Anthropic)

anthropic/claude-sonnet-4-6

Best Default · ARC-AGI-3: 72.4% · Cost: ~$3 in / $15 out per 1M tokens · Context: 200K tokens

Claude is the current sweet spot for OpenClaw's agentic workflows. It handles tool calls with minimal hallucination, maintains instruction coherence over long sub-agent chains, and its 200K-token context window can hold a mid-sized codebase or a full research session in memory without losing the plot.

Anthropic's extended thinking mode — which OpenClaw exposes via /reasoning — is worth enabling for anything involving multi-step planning. The quality difference on complex cron jobs (research + summarize + send) is significant.

Best for: coding, research, complex workflows, sub-agent orchestration
Weakness: no real-time data access (use with web_search tool to compensate)

Gemini 2.5 Pro (Google)

google/gemini-2.5-pro

Long Context Champion · ARC-AGI-3: 68.1% · Cost: ~$1.25 in / $10 out per 1M tokens · Context: 1M tokens

Gemini 2.5 Pro's 1M context window is genuinely useful for OpenClaw tasks that involve massive document analysis — entire codebases, lengthy email threads, large data exports. It's also cheaper per token than Claude, which matters if you're running high-frequency cron jobs.

The trade-off: tool call reliability is slightly lower than Claude in our testing. On simple automation chains it's fine. On complex sub-agent orchestration with many nested tool calls, it occasionally loses track of context or duplicates actions.

Best for: large document analysis, cost-conscious bulk processing, PDF workflows
Weakness: tool use reliability in deep chains; worse at following complex system prompts
🤖 GPT-4.1 (OpenAI)

openai/gpt-4.1

Solid All-Rounder · ARC-AGI-3: 61.7% · Cost: ~$2 in / $8 out per 1M tokens · Context: 128K tokens

GPT-4.1 remains highly capable and has the most mature tool-use ecosystem. If you're already paying for OpenAI and want a reliable workhorse, it holds up. But the ARC-AGI-3 gap is real — on novel reasoning tasks, it's noticeably behind Claude and Gemini.

Where GPT-4.1 still shines: structured output tasks, JSON extraction, code generation with strict formatting requirements. It's extraordinarily consistent at following templates — useful for cron jobs that generate formatted reports.
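For example, a template-heavy reporting job can pin GPT-4.1 per payload, in the same shape as the cron examples later in this guide (the schedule and message here are placeholders, not a recommended job):

```json
{
  "schedule": { "kind": "cron", "expr": "0 8 * * 1" },
  "sessionTarget": "isolated",
  "payload": {
    "kind": "agentTurn",
    "model": "openai/gpt-4.1",
    "message": "Summarize last week's completed tasks as a JSON array of {title, status, notes}. Send to Telegram."
  }
}
```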

Best for: structured outputs, code generation, JSON workflows, template-heavy tasks
Weakness: noticeably behind on novel reasoning; Copilot's ad-injection scandal hasn't helped OpenAI's trust scores either

Grok 3 (xAI)

xai/grok-3

Real-Time Specialist · ARC-AGI-3: 58.3% · Cost: ~$3 in / $15 out per 1M tokens · Context: 131K tokens

Grok 3's edge is its native X/Twitter access and real-time grounding. For OpenClaw users running crypto or market monitoring agents, cron jobs that summarize trending topics, or anything that benefits from knowing what happened in the last hour — Grok has a genuine information advantage no other model can match.

On pure reasoning benchmarks Grok trails. Use it strategically for tasks where recency matters more than depth — and pair it with Claude for the heavy lifting.

Best for: X/Twitter monitoring, crypto/market crons, trend analysis, news summaries
Weakness: reasoning depth; not ideal for complex multi-step workflows
🦙 Llama 3.3 70B via Ollama (Local)

ollama/llama3.3

Zero Cost, Air-Gapped · ARC-AGI-3: 41.2% · Cost: $0 per token · Privacy: 100% local

Local models via Ollama have a ceiling — 41% on ARC-AGI-3 is respectable for a 70B model running on consumer hardware, but you will feel the gap on complex tasks. That said, for high-volume, low-complexity operations — classifying emails, generating summaries, answering simple queries — local Llama is unbeatable on cost.

The privacy angle is also genuinely compelling in 2026. With Copilot injecting ads into PRs and ChatGPT reading React state before you type, some tasks simply shouldn't leave your machine.

Best for: private/sensitive data, high-volume bulk tasks, offline environments
Weakness: complex reasoning; needs strong hardware (M2/M3 Pro or better for 70B)

Real Config Examples: Wiring It All Up

Here's how to configure OpenClaw to use different models for different jobs — the multi-provider setup that maximizes both capability and cost efficiency.

1. Set your default model (Claude for daily use)

# Using the /model slash command in chat

/model anthropic/claude-sonnet-4-6

# Or set it in gateway config for all sessions

{
  "model": "anthropic/claude-sonnet-4-6"
}

2. Per-job model override in cron

# Cron job using Grok for real-time market brief

{
  "schedule": { "kind": "cron", "expr": "0 7 * * *", "tz": "Europe/Belgrade" },
  "sessionTarget": "isolated",
  "payload": {
    "kind": "agentTurn",
    "model": "xai/grok-3",
    "message": "Check X/Twitter for crypto + AI market movers in the last 24h. Write a 5-bullet brief. Send to Telegram."
  },
  "delivery": { "mode": "announce" }
}

3. Gemini for large document processing

# Use Gemini 2.5 Pro when you need to process huge files

{
  "schedule": { "kind": "cron", "expr": "0 22 * * 0" },
  "sessionTarget": "isolated",
  "payload": {
    "kind": "agentTurn",
    "model": "google/gemini-2.5-pro",
    "message": "Read ~/reports/weekly-data-export.csv (200K rows). Identify the top 10 anomalies. Format as a table and send to Telegram."
  }
}

4. Spawn a sub-agent with a different model

# From within an OpenClaw session: spawn a cheap local sub-agent for bulk work

sessions_spawn({
  task: "Classify these 500 emails as spam/ham. Return JSON array.",
  model: "ollama/llama3.3",
  mode: "run"
})

What the Community Is Saying

Hacker News · The Cognitive Dark Forest

“The internet is becoming untrustworthy because AI generates most content now. At some point, you can't tell what's real. The corollary for agents: your AI is trained on an increasingly AI-generated internet. The model you choose matters more, not less — it determines whether your agent reasons from signal or noise.”

Hacker News · ChatGPT/Cloudflare thread (482 pts)

“The thing people aren't saying loudly enough: if you use a cloud AI and don't pay for it, you are the product. Your prompts, your sessions, your behavioral signals — it all goes somewhere. For agent workloads that touch private business data, that's a real consideration. At least with self-hosted you know where the data is.”

The Rundown AI · ARC-AGI-3 drops

“ARC-AGI-3 is the first benchmark where I genuinely believe the gap between models reflects real-world performance differences. The tasks are novel enough that you can't game them with pre-training data. If your agentic framework uses a model that scores poorly here, you'll feel it in production — especially on tasks that require actual reasoning, not pattern recall.”

The Rundown AI · Agent teams as operators

“Alibaba's Accio Work demo showed agent teams replacing entire operator workflows. The key insight: the orchestrator model matters enormously. When Claude runs the top-level planning loop and cheaper models handle sub-tasks, you get 90% of the quality at 30% of the cost.”

The Hybrid Setup That Actually Wins

Don't pick one model. That's the wrong frame. The right architecture is a model stack — each layer optimized for its job.

The Kade Stack™ — Recommended OpenClaw Model Architecture

🧠 Default / Orchestrator: Claude Sonnet 4.6

All complex reasoning, sub-agent planning, coding, research. Your daily driver.

Real-Time / News: Grok 3

Morning briefs, market monitoring, X/Twitter tracking. Anything that needs today's data.

Large Documents: Gemini 2.5 Pro

Weekly data exports, long PDF analysis, massive email thread summarization.

🦙 Bulk / Private: Ollama (Llama 3.3 70B)

High-volume classification, sensitive data, offline tasks. Zero API cost.

With this stack, you get frontier reasoning where it matters, real-time data when you need it, a cost-effective path for large context, and a private local fallback for sensitive work. Monthly API cost for a heavy-use setup: roughly $15–40. Compare that to $200/month for a single ChatGPT Pro subscription that doesn't let you run cron jobs, customize behavior, or wire in your own tools.
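That ballpark is easy to sanity-check. The prices below are the per-1M-token figures quoted above; the monthly token volumes are illustrative assumptions for a heavy personal setup, not measurements:

```typescript
// Back-of-envelope monthly API cost for the hybrid stack. Prices are
// $ per 1M tokens as quoted in this guide; token volumes (in millions
// per month) are assumptions for a heavy personal setup.
interface Line {
  model: string;
  priceIn: number;   // $ per 1M input tokens
  priceOut: number;  // $ per 1M output tokens
  tokIn: number;     // assumed monthly input, millions of tokens
  tokOut: number;    // assumed monthly output, millions of tokens
}

const usage: Line[] = [
  { model: "anthropic/claude-sonnet-4-6", priceIn: 3,    priceOut: 15, tokIn: 3,   tokOut: 0.75 },
  { model: "xai/grok-3",                  priceIn: 3,    priceOut: 15, tokIn: 0.5, tokOut: 0.2  },
  { model: "google/gemini-2.5-pro",       priceIn: 1.25, priceOut: 10, tokIn: 2,   tokOut: 0.2  },
  // ollama/llama3.3 handles the bulk tier at $0/token
];

const monthlyCost = usage.reduce(
  (sum, l) => sum + l.tokIn * l.priceIn + l.tokOut * l.priceOut,
  0,
); // lands inside the quoted $15–40 range for these volumes
```

Swap in your own volumes; the point is that even generous use of the paid tiers stays well under a single premium chatbot subscription.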

Final Recommendations: By Use Case

Coding & debugging → Claude Sonnet 4.6 · best tool use, extended thinking
Research & summarization → Claude Sonnet 4.6 · coherent over long contexts
Market / crypto monitoring → Grok 3 · native real-time data access
Large file analysis (>100K tokens) → Gemini 2.5 Pro · 1M context, cheaper per token
Email sorting / classification → Ollama (Llama 3.3) · free, fast, good enough
Privacy-sensitive tasks → Ollama (any local) · data never leaves your machine
Structured JSON output → GPT-4.1 or Claude · both excel at schema adherence
Complex agent orchestration → Claude Sonnet 4.6 · ARC-AGI-3 leader, best sub-agent reliability

The Bottom Line

ARC-AGI-3 is a wake-up call: not all AI models are equally capable, and the gap is widening, not narrowing. If your OpenClaw agent has been running on a single model since setup, now is the time to reassess. The multi-provider approach costs nothing extra to configure — you're just routing different task types to different backends.

The real unlock is that OpenClaw separates the agent runtime from the inference layer. You get to pick the best tool for each job, not the best tool overall. In 2026, that distinction matters more than ever.
