Updated 1 month ago · 2026-05-18

AI Coding Model Leaderboard

Every coding model that matters, ranked by real benchmarks — not lab press releases. SWE-bench Verified is the primary signal; Aider Polyglot is the secondary. Pricing, context, and speed are straight from each provider.

16 models tracked · Sources: SWE-bench Verified, Aider Polyglot, Artificial Analysis

Best overall

Top SWE-bench Verified

Claude Opus 4.7

Anthropic

87.6% SWE-bench · 82.0% Aider

Best value

70%+ at lowest output $

Gemini 3 Flash (high reasoning)

Google

75.8% SWE-bench · 71.0% Aider

Top on Aider

Polyglot edit leader

GPT-5 (high reasoning)

OpenAI

74.9% SWE-bench · 88.0% Aider

#	Model	SWE-bench	Aider	Context	$/1M in	$/1M out	TPS	Tools
1	Claude Opus 4.7 [Reasoning] · Anthropic · released 2026-04-16. Highest SWE-bench Verified ever (87.6%). 6.8pt jump over Opus 4.6. The agentic coding benchmark king.	87.6%	82.0%	1M	$15	$75	35/s	Claude Code, Cursor, Windsurf, Zed
2	Claude Opus 4.6 [Reasoning] · Anthropic · released 2026-02-05. Superseded by 4.7 but still strong at 80.8%. Some tools haven't added 4.7 yet.	80.8%	77.5%	1M	$15	$75	32/s	Claude Code, Cursor, Windsurf, Replit, Zed
3	GPT-5.2 [Reasoning] · OpenAI · released 2026-02-28. OpenAI's current flagship: 0.8pt behind Opus 4.6 on SWE-bench, ahead on Aider.	80.0%	87.5%	400K	$2.5	$10	55/s	Cursor, Windsurf, GitHub Copilot, Bolt.new, v0, +2 more
4	Claude Sonnet 4.6 [Reasoning] · Anthropic · released 2026-02-17. Best price-to-performance in the top tier. 1.2pt behind Opus 4.6 at 1/5 the cost.	79.6%	76.8%	1M	$3.0	$15	75/s	Claude Code, Cursor, Windsurf, Cline, Zed, +3 more
5	Gemini 3 Pro (high thinking) [Reasoning] · Google · released 2026-03-28. Just-released flagship: 78% SWE-bench at Sonnet-tier pricing. Worth trying.	78.4%	82.5%	2M	$1.5	$10	70/s	Cursor, Windsurf, Cline, Aider
6	Claude Opus 4.5 [Reasoning] · Anthropic · released 2025-11-02. Superseded by 4.6. Only pick it if a tool does not yet support 4.6.	76.8%	72.0%	200K	$15	$75	30/s	Claude Code, Cursor
7	Gemini 3 Flash (high reasoning) [Reasoning] · Google · released 2026-03-12. Best speed-to-quality ratio. 2M context lets it hold a whole repo in memory.	75.8%	71.0%	2M	$0.30	$2.5	180/s	Cursor, Windsurf, Cline, Aider
8	GPT-5 (high reasoning) [Reasoning] · OpenAI · released 2025-12-10. Superseded by 5.2 in Feb 2026 but still the Aider Polyglot leader at 88%.	74.9%	88.0%	400K	$2.5	$10	48/s	Cursor, Windsurf, GitHub Copilot, Bolt.new, v0, +1 more
9	O3-Pro (high) [Reasoning] · OpenAI · released 2025-09-18. Strong on Aider Polyglot but $146 per benchmark run. Niche: hard reasoning tasks.	73.5%	84.9%	200K	$20	$80	18/s	Cursor, Claude Code, Aider
10	GPT-5 (medium reasoning) [Reasoning] · OpenAI · released 2025-12-10. Daily-driver setting for most people. Default in Cursor Plus.	72.1%	86.7%	400K	$2.5	$10	62/s	Cursor, Windsurf, Copilot, Bolt.new, v0, +1 more
11	Gemini 2.5 Pro (32k thinking) [Reasoning] · Google · released 2025-06-05. Older but still top-5 on Aider. Cheap enough for continuous IDE use.	71.0%	83.1%	2M	$1.3	$10	90/s	Cursor, Windsurf, Cline, Aider, Bolt.new
12	O3 (high) [Reasoning] · OpenAI · released 2025-04-16. Big value after OpenAI dropped the price 80% in Sept 2025.	69.0%	81.3%	200K	$2.0	$8.0	28/s	Cursor, Aider, Cline
13	Grok-4 (high) [Reasoning] · xAI · released 2025-07-10. Better than its reputation at coding. Occasional provider outages.	68.5%	79.6%	256K	$3.0	$15	52/s	Cursor, Windsurf
14	DeepSeek V3.2 (Reasoner) [Open, Reasoning] · DeepSeek · released 2026-01-22. Open weights, $1.30 per Aider benchmark run. Cheapest path to 74% on Aider Polyglot.	65.4%	74.2%	128K	$0.14	$0.55	45/s	Cline, Aider, Cursor, Windsurf
15	Claude Haiku 4.5 · Anthropic · released 2025-10-01. Fastest Claude. Use for autocomplete and small edits, not hard refactors.	63.2%	61.5%	200K	$1.0	$5.0	110/s	Claude Code, Cursor, Zed
16	Qwen 3 Coder 480B [Open] · Alibaba · released 2025-11-20. Best open-weights coding model right now. Solid for self-hosting.	59.8%	58.3%	262K	$0.40	$1.6	38/s	Cline, Aider
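
The three headline picks at the top of the page can be reproduced from the rows with a few comparisons. The sketch below is illustrative only: the `Row` type and the `pick_*` helper names are hypothetical, only a hand-copied subset of rows is included, and the 70% threshold is taken from the "Best value" label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    swe_bench: float  # SWE-bench Verified, percent
    aider: float      # Aider Polyglot, percent
    price_out: float  # USD per 1M output tokens


# Hand-copied subset of the leaderboard rows (not all 16 models).
ROWS = [
    Row("Claude Opus 4.7", 87.6, 82.0, 75.0),
    Row("GPT-5.2", 80.0, 87.5, 10.0),
    Row("Claude Sonnet 4.6", 79.6, 76.8, 15.0),
    Row("Gemini 3 Flash (high reasoning)", 75.8, 71.0, 2.5),
    Row("GPT-5 (high reasoning)", 74.9, 88.0, 10.0),
    Row("DeepSeek V3.2 (Reasoner)", 65.4, 74.2, 0.55),
]


def pick_best_overall(rows):
    """Highest SWE-bench Verified score."""
    return max(rows, key=lambda r: r.swe_bench)


def pick_best_value(rows, min_swe_bench=70.0):
    """Lowest output price among rows at or above the SWE-bench threshold."""
    eligible = [r for r in rows if r.swe_bench >= min_swe_bench]
    return min(eligible, key=lambda r: r.price_out)


def pick_top_aider(rows):
    """Highest Aider Polyglot score."""
    return max(rows, key=lambda r: r.aider)


if __name__ == "__main__":
    print(pick_best_overall(ROWS).model)
    print(pick_best_value(ROWS).model)
    print(pick_top_aider(ROWS).model)
```

Note that DeepSeek V3.2 has the lowest output price in this subset but falls below the 70% threshold, which is why it is not the "Best value" pick.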

What the columns mean

SWE-bench Verified: 500 real GitHub issues the model has to solve end-to-end. 70%+ is production-viable.
Aider Polyglot: 225 Exercism exercises across C++, Go, Java, JavaScript, Python, Rust. It measures whether the model's edits are correct, not whether it can pattern-match.
Context: Max tokens the model reads per request. Bigger = can hold more of your codebase at once.
TPS: Output tokens per second. Reasoning models look slower because they burn tokens on internal thinking.
Tools: Products that actually expose this model. Changes often; a missing name probably means the tool hasn't shipped it yet.
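
To turn the pricing and TPS columns into per-request numbers, a rough calculation like the following may help. It is a minimal sketch: the function names are hypothetical, the 20,000-in / 2,000-out request size is an assumption chosen for illustration, and the time estimate ignores network latency and time to first token.

```python
def request_cost_usd(tokens_in, tokens_out, price_in_per_m, price_out_per_m):
    """Cost of one request, given prices in USD per 1M tokens."""
    return (tokens_in / 1_000_000) * price_in_per_m + (
        tokens_out / 1_000_000
    ) * price_out_per_m


def output_seconds(tokens_out, tps):
    """Rough generation time: output tokens divided by tokens per second.

    For reasoning models, tokens_out should include internal thinking
    tokens, since those are generated at the same rate.
    """
    return tokens_out / tps


if __name__ == "__main__":
    # Assumed request size, for illustration only.
    tokens_in, tokens_out = 20_000, 2_000

    opus = request_cost_usd(tokens_in, tokens_out, 15.0, 75.0)    # Claude Opus 4.7
    sonnet = request_cost_usd(tokens_in, tokens_out, 3.0, 15.0)   # Claude Sonnet 4.6

    print(f"Opus 4.7:   ${opus:.2f} per request")
    print(f"Sonnet 4.6: ${sonnet:.2f} per request")
    print(f"Sonnet / Opus cost ratio: {sonnet / opus:.2f}")
    print(f"Opus 4.7 output time at 35/s: {output_seconds(tokens_out, 35):.0f} s")
```

Because both the input and output prices differ by the same factor, the Sonnet-to-Opus ratio stays at 1/5 regardless of the request's input/output mix.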

How we update this

Every Monday we re-read the public SWE-bench and Aider leaderboards and sync any new rows. When a lab ships a flagship model, we bump it the same week. If a score is "—", that benchmark hasn't tested the model yet; we'd rather say that than fill the cell with a guess.
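
For anyone consuming the table programmatically, the missing-score convention might be handled along these lines. This is a sketch under assumptions: `parse_score` and `rank_by_score` are hypothetical names, and the input format (a percent string or the dash placeholder) is inferred from the table cells rather than from any published schema.

```python
from typing import Dict, List, Optional

MISSING = "\u2014"  # the dash shown when a benchmark has not tested the model


def parse_score(cell: str) -> Optional[float]:
    """Convert a cell like '87.6%' to 87.6; return None for the dash placeholder."""
    text = cell.strip()
    if text == MISSING:
        return None
    return float(text.rstrip("%"))


def rank_by_score(cells: Dict[str, str]) -> List[str]:
    """Model names sorted by score, highest first; untested models go last."""
    parsed = {name: parse_score(cell) for name, cell in cells.items()}
    return sorted(
        parsed,
        key=lambda name: (parsed[name] is None, -(parsed[name] or 0.0)),
    )


if __name__ == "__main__":
    # "Hypothetical Model X" is a made-up entry to show the untested case.
    cells = {
        "GPT-5.2": "80.0%",
        "Hypothetical Model X": MISSING,
        "Claude Opus 4.7": "87.6%",
    }
    print(rank_by_score(cells))
```

Keeping an untested score as `None` rather than `0.0` preserves the distinction the page draws between "not tested" and "scored poorly".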

Picked a model? Now pick the tool.

Most of these models are available across several tools. The tool shapes your workflow as much as the model shapes your output.
