Updated 1 week ago · 2026-05-18

AI Coding Model Leaderboard

Every coding model that matters, ranked by real benchmarks — not lab press releases. SWE-bench Verified is the primary signal; Aider Polyglot is the secondary. Pricing, context, and speed are straight from each provider.

Best overall

Top SWE-bench Verified

Claude Opus 4.7

Anthropic

87.6% SWE-bench · 82.0% Aider

Best value

70%+ SWE-bench at the lowest output price

Gemini 3 Flash (high reasoning)

Google

75.8% SWE-bench · 71.0% Aider

Top on Aider

Polyglot edit leader

GPT-5 (high reasoning)

OpenAI

74.9% SWE-bench · 88.0% Aider
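
The three picks above follow simple selection rules over the table. Here is a minimal sketch of those rules, assuming the 70% floor for "best value" refers to SWE-bench Verified. The `Model` record and the function names are illustrative only; the rows are a small sample copied from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    swe_bench: float  # SWE-bench Verified, percent
    aider: float      # Aider Polyglot, percent
    out_price: float  # USD per 1M output tokens


# A few rows copied from the leaderboard on this page.
MODELS = [
    Model("Claude Opus 4.7", 87.6, 82.0, 75.0),
    Model("GPT-5.2", 80.0, 87.5, 10.0),
    Model("Claude Sonnet 4.6", 79.6, 76.8, 15.0),
    Model("Gemini 3 Flash (high reasoning)", 75.8, 71.0, 2.5),
    Model("GPT-5 (high reasoning)", 74.9, 88.0, 10.0),
    Model("DeepSeek V3.2 (Reasoner)", 65.4, 74.2, 0.55),
]


def best_overall(models):
    """Highest SWE-bench Verified score."""
    return max(models, key=lambda m: m.swe_bench)


def best_value(models, floor=70.0):
    """Cheapest output price among models at or above the SWE-bench floor (assumed rule)."""
    eligible = [m for m in models if m.swe_bench >= floor]
    return min(eligible, key=lambda m: m.out_price)


def top_on_aider(models):
    """Highest Aider Polyglot score."""
    return max(models, key=lambda m: m.aider)


print(best_overall(MODELS).name)  # Claude Opus 4.7
print(best_value(MODELS).name)    # Gemini 3 Flash (high reasoning)
print(top_on_aider(MODELS).name)  # GPT-5 (high reasoning)
```

DeepSeek V3.2 has the lowest output price in the sample but falls below the 70% floor, which is why it does not take the value slot under this rule.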

#1 Anthropic · Claude Opus 4.7 · Reasoning
SWE-bench: 87.6% · Aider: 82.0% · Context: 1M · Speed: 35/s · $/1M in: $15 · $/1M out: $75

Highest SWE-bench Verified ever (87.6%). 7pt jump over 4.6. The agentic coding benchmark king.

Tools: Claude Code, Cursor, Windsurf, Zed

#2 Anthropic · Claude Opus 4.6 · Reasoning
SWE-bench: 80.8% · Aider: 77.5% · Context: 1M · Speed: 32/s · $/1M in: $15 · $/1M out: $75

Superseded by 4.7 but still strong at 80.8%. Some tools haven't added 4.7 yet.

Tools: Claude Code, Cursor, Windsurf, Replit, Zed

#3 OpenAI · GPT-5.2 · Reasoning
SWE-bench: 80.0% · Aider: 87.5% · Context: 400K · Speed: 55/s · $/1M in: $2.5 · $/1M out: $10

OpenAI's current flagship: 0.8pt behind Opus 4.6 on SWE-bench (80.0% vs 80.8%), 10pt ahead on Aider.

Tools: Cursor, Windsurf, GitHub Copilot, Bolt.new, v0

#4 Anthropic · Claude Sonnet 4.6 · Reasoning
SWE-bench: 79.6% · Aider: 76.8% · Context: 1M · Speed: 75/s · $/1M in: $3.0 · $/1M out: $15

Best price-to-performance in the top tier. 1.2pt behind Opus 4.6 on SWE-bench at 1/5 the cost.

Tools: Claude Code, Cursor, Windsurf, Cline, Zed

#5 Google · Gemini 3 Pro (high thinking) · Reasoning
SWE-bench: 78.4% · Aider: 82.5% · Context: 2M · Speed: 70/s · $/1M in: $1.5 · $/1M out: $10

Just-released flagship: 78% SWE-bench at Sonnet-tier pricing. Worth trying.

Tools: Cursor, Windsurf, Cline, Aider

#6 Anthropic · Claude Opus 4.5 · Reasoning
SWE-bench: 76.8% · Aider: 72.0% · Context: 200K · Speed: 30/s · $/1M in: $15 · $/1M out: $75

Superseded by 4.6. Only pick it if a tool does not yet support 4.6.

Tools: Claude Code, Cursor

#7 Google · Gemini 3 Flash (high reasoning) · Reasoning
SWE-bench: 75.8% · Aider: 71.0% · Context: 2M · Speed: 180/s · $/1M in: $0.30 · $/1M out: $2.5

Best speed-to-quality ratio. 2M context lets it hold a whole repo in memory.

Tools: Cursor, Windsurf, Cline, Aider

#8 OpenAI · GPT-5 (high reasoning) · Reasoning
SWE-bench: 74.9% · Aider: 88.0% · Context: 400K · Speed: 48/s · $/1M in: $2.5 · $/1M out: $10

Superseded by 5.2 in Feb 2026 but still the Aider Polyglot leader at 88%.

Tools: Cursor, Windsurf, GitHub Copilot, Bolt.new, v0

#9 OpenAI · O3-Pro (high) · Reasoning
SWE-bench: 73.5% · Aider: 84.9% · Context: 200K · Speed: 18/s · $/1M in: $20 · $/1M out: $80

Strong on Aider Polyglot but $146 per benchmark run. Niche: hard reasoning tasks.

Tools: Cursor, Claude Code, Aider

#10 OpenAI · GPT-5 (medium reasoning) · Reasoning
SWE-bench: 72.1% · Aider: 86.7% · Context: 400K · Speed: 62/s · $/1M in: $2.5 · $/1M out: $10

Daily-driver setting for most people. Default in Cursor Plus.

Tools: Cursor, Windsurf, GitHub Copilot, Bolt.new, v0

#11 Google · Gemini 2.5 Pro (32k thinking) · Reasoning
SWE-bench: 71.0% · Aider: 83.1% · Context: 2M · Speed: 90/s · $/1M in: $1.3 · $/1M out: $10

Older but still top-5 on Aider in this table. Cheap enough for continuous IDE use.

Tools: Cursor, Windsurf, Cline, Aider, Bolt.new

#12 OpenAI · O3 (high) · Reasoning
SWE-bench: 69.0% · Aider: 81.3% · Context: 200K · Speed: 28/s · $/1M in: $2.0 · $/1M out: $8.0

Big value after OpenAI dropped the price 80% in Sept 2025.

Tools: Cursor, Aider, Cline

#13 xAI · Grok-4 (high) · Reasoning
SWE-bench: 68.5% · Aider: 79.6% · Context: 256K · Speed: 52/s · $/1M in: $3.0 · $/1M out: $15

Better than its reputation at coding. Occasional provider outages.

Tools: Cursor, Windsurf

#14 DeepSeek · DeepSeek V3.2 (Reasoner) · Open · Reasoning
SWE-bench: 65.4% · Aider: 74.2% · Context: 128K · Speed: 45/s · $/1M in: $0.14 · $/1M out: $0.55

Open weights, $1.30 per Aider benchmark run. Cheapest path to 74% polyglot.

Tools: Cline, Aider, Cursor, Windsurf

#15 Anthropic · Claude Haiku 4.5
SWE-bench: 63.2% · Aider: 61.5% · Context: 200K · Speed: 110/s · $/1M in: $1.0 · $/1M out: $5.0

Fastest Claude. Use for autocomplete + small edits, not hard refactors.

Tools: Claude Code, Cursor, Zed

#16 Alibaba · Qwen 3 Coder 480B · Open
SWE-bench: 59.8% · Aider: 58.3% · Context: 262K · Speed: 38/s · $/1M in: $0.40 · $/1M out: $1.6

Best open-weights coding model right now. Solid for self-hosting.

Tools: Cline, Aider

What the columns mean

SWE-bench Verified
500 real GitHub issues the model has to solve end-to-end. 70%+ is production-viable.
Aider Polyglot
225 Exercism exercises across C++, Go, Java, JavaScript, Python, Rust. Measures edit accuracy, not pattern matching.
Context
Max tokens the model reads per request. Bigger = can hold more of your codebase at once.
Speed (TPS)
Output tokens per second. Reasoning models look slower because they burn tokens on internal thinking.
Tools
Products that actually expose this model. Changes often; a missing name probably means the tool hasn't shipped it yet.
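
The price and speed columns combine into a rough per-request estimate. The sketch below is illustrative only. It assumes thinking tokens are billed as output tokens and are generated at the listed speed before the visible answer. The token counts are hypothetical; the prices and the 35/s speed are copied from the cards above.

```python
def request_cost(in_tokens, out_tokens, in_price, out_price):
    """USD cost of one request; prices are USD per 1M tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000


def generation_seconds(visible_tokens, thinking_tokens, tps):
    """Seconds to generate a reply when thinking tokens come first (assumption)."""
    return (visible_tokens + thinking_tokens) / tps


# Hypothetical request: 20,000 input tokens, 2,000 output tokens.
opus = request_cost(20_000, 2_000, in_price=15, out_price=75)
flash = request_cost(20_000, 2_000, in_price=0.30, out_price=2.5)
print(f"Claude Opus 4.7: ${opus:.3f}")   # Claude Opus 4.7: $0.450
print(f"Gemini 3 Flash:  ${flash:.3f}")  # Gemini 3 Flash:  $0.011

# Same 500-token answer at 35/s, with and without 3,000 thinking tokens.
print(f"{generation_seconds(500, 0, 35):.1f} s")      # 14.3 s
print(f"{generation_seconds(500, 3_000, 35):.1f} s")  # 100.0 s
```

The last two lines show why a reasoning model can feel slower at the same tokens-per-second figure: the thinking tokens are generated before the answer is.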

How we update this

Every Monday we re-read the public SWE-bench and Aider leaderboards and sync any new rows. When a lab ships a flagship model, we bump it the same week. If a score is "—", that benchmark hasn't tested the model yet; we'd rather say that than fill the cell with a guess.
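
One way to keep untested models honest in a ranking is to sort missing scores last and print the placeholder instead of a number. This is a minimal sketch, not the site's actual pipeline. It assumes "secondary" means Aider Polyglot breaks ties on SWE-bench Verified, and "Hypothetical New Model" is an invented row used only to show the missing-score case.

```python
from typing import Optional


def sort_key(row):
    """Order by SWE-bench, then Aider, both descending; missing scores sort last."""
    _, swe, aider = row
    return (swe is None, -(swe or 0.0), aider is None, -(aider or 0.0))


def fmt(score: Optional[float]) -> str:
    """Render a score, or the placeholder when the benchmark has not tested the model."""
    return "—" if score is None else f"{score:.1f}%"


# (name, SWE-bench Verified, Aider Polyglot); None means "not tested yet".
ROWS = [
    ("Hypothetical New Model", None, 85.0),
    ("GPT-5.2", 80.0, 87.5),
    ("Claude Opus 4.7", 87.6, 82.0),
]

for rank, (name, swe, aider) in enumerate(sorted(ROWS, key=sort_key), start=1):
    print(f"#{rank} {name} · SWE-bench: {fmt(swe)} · Aider: {fmt(aider)}")
```

With these rows, the untested model lands in third place with a placeholder in the SWE-bench column rather than a guessed number.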

Picked a model? Now pick the tool.

Most of these models are available across several tools. The tool shapes your workflow as much as the model shapes your output.