Models

Open vs closed.
Which task. Which model.

We rank open-weight models next to their closed-source counterparts, by task. SWE is not agentic. Agentic is not reasoning. The picks update automatically from artificialanalysis.ai every 12 hours.

Static snapshot · June 2026 · Source: artificialanalysis.ai · Built June 29, 2026

Head-to-head by task

Purple rows are open-weight, grey are closed APIs. The closed scores are here for scale only. Using them means your data leaves the building. The General column is the verified AA Intelligence Index; the SWE, Agentic and Reasoning splits are estimates anchored to it, since AA does not publish per-task sub-scores.

Scores are on a 0–100 scale (Artificial Analysis Intelligence Index).

Why are scores only 40–60? The 0–100 scale is based on benchmarks so hard that no AI has scored above 60 yet. Today's world record is Claude Fable 5 at 60 (currently unavailable); the top available model is Claude Opus 4.8 at 56.

These benchmarks are designed to find failure modes, not measure everyday capability:

· Humanity's Last Exam: expert-contributed questions specifically designed to stump AI
· GPQA Diamond: PhD-level science problems
· SciCode: writing research-grade scientific code
· Terminal-Bench Hard: shell commands that actually run

For everyday tasks (writing, summarizing, drafting docs), models perform far better than these scores suggest. The benchmarks exist to reveal what's still broken, not what already works.

SWE

Code generation, bug fixes, refactoring

1 · Claude Fable 5 (max) · API only · 59
2 · Claude Opus 4.8 (max) · API only · 55
3 · GPT-5.5 (xhigh) · API only · 55
4 · Gemini 3.5 Flash (high) · API only · 54
5 · Claude Opus 4.7 (max) · API only · 53
6 · GPT-5.5 (high) · API only · 53
7 · Grok 4.3 (high) · API only · 52
8 · Claude Sonnet 4.6 (max) · API only · 51
9 · GLM-5.2 · open · 50
10 · Gemini 3.1 Pro Preview · API only · 45
11 · MiniMax M3 · open · 44
12 · DeepSeek V4 Pro · open · 43
13 · Kimi K2.6 · open · 43
14 · MiMo-V2.5-Pro · open · 40
15 · GLM-5.1 Reasoning · open · 39

Agentic

Multi-step planning, tool use, function calling

1 · Claude Fable 5 (max) · API only · 60
2 · Claude Opus 4.8 (max) · API only · 57
3 · Claude Opus 4.7 (max) · API only · 55
4 · GPT-5.5 (xhigh) · API only · 54
5 · Gemini 3.5 Flash (high) · API only · 53
6 · GPT-5.5 (high) · API only · 53
7 · Grok 4.3 (high) · API only · 53
8 · Claude Sonnet 4.6 (max) · API only · 53
9 · GLM-5.2 · open · 50
10 · Gemini 3.1 Pro Preview · API only · 44
11 · MiniMax M3 · open · 43
12 · DeepSeek V4 Pro · open · 43
13 · Kimi K2.6 · open · 41
14 · MiMo-V2.5-Pro · open · 41
15 · GLM-5.1 Reasoning · open · 39

Reasoning

Chain-of-thought, math, auditable logic

1 · Claude Fable 5 (max) · API only · 59
2 · Claude Opus 4.8 (max) · API only · 56
3 · GPT-5.5 (xhigh) · API only · 54
4 · Gemini 3.5 Flash (high) · API only · 54
5 · Claude Opus 4.7 (max) · API only · 54
6 · Grok 4.3 (high) · API only · 53
7 · GPT-5.5 (high) · API only · 52
8 · GLM-5.2 · open · 50
9 · Claude Sonnet 4.6 (max) · API only · 50
10 · Gemini 3.1 Pro Preview · API only · 47
11 · DeepSeek V4 Pro · open · 44
12 · MiMo-V2.5-Pro · open · 43
13 · Kimi K2.6 · open · 41
14 · GLM-5.1 Reasoning · open · 41
15 · MiniMax M3 · open · 40

General

Overall quality across mixed tasks

1 · Claude Fable 5 (max) · API only · 60
2 · Claude Opus 4.8 (max) · API only · 56
3 · GPT-5.5 (xhigh) · API only · 55
4 · Gemini 3.5 Flash (high) · API only · 55
5 · Claude Opus 4.7 (max) · API only · 54
6 · GPT-5.5 (high) · API only · 53
7 · Grok 4.3 (high) · API only · 53
8 · Claude Sonnet 4.6 (max) · API only · 52
9 · GLM-5.2 · open · 51
10 · Gemini 3.1 Pro Preview · API only · 46
11 · MiniMax M3 · open · 44
12 · DeepSeek V4 Pro · open · 44
13 · Kimi K2.6 · open · 43
14 · MiMo-V2.5-Pro · open · 42
15 · GLM-5.1 Reasoning · open · 40

These scores approximate task-specific benchmarks from artificialanalysis.ai (SWE-bench, GPQA, MMLU-Pro, and coding evals). Closed-source models are shown for scale only: they require API keys and route your data through third-party servers. Open-weight models run 100% on your hardware.

Best open-weight pick per use case

Picked algorithmically from live leaderboard scores: not my opinion, not a static list.

Software engineering

Code generation, bug fixes, refactoring, test writing, single-shot tasks on a specific file or function

GLM-5.2

Z AI

Intelligence 51 Code 50 Math 49 132 t/s

Highest coding benchmark in the open-weight leaderboard. Optimized for correctness on discrete tasks: generate, complete, explain, fix.

Agentic / multi-step

Long-horizon planning, tool use, function calling, multi-turn task completion across many steps

GLM-5.2

Z AI

Intelligence 51 Code 50 Math 49 132 t/s

Best composite for long-horizon work: intelligence 51, 1000K context. Agentic loops need a model that tracks state, calls tools reliably, and recovers across many steps, not just writes code.

Reasoning / Chain-of-thought

Explicit step-by-step logic, audits, diagnostics, structured analysis

DeepSeek V4 Pro

DeepSeek

Intelligence 44 Code 43 Math 44 88 t/s

Top reasoning model in current open-weight rankings (intelligence 44). Reasoning-mode models expose their chain-of-thought, so every conclusion is auditable, which matters for regulated workflows.

Clinical documentation

SOAP notes, visit summaries, referral letters; must stay on-device

GLM-5.2

Z AI

Intelligence 51 Code 50 Math 49 132 t/s

Top intelligence (51) with 1000K context; handles full visit transcripts without truncation. No data leaves the device.

Legal analysis

Contract review, clause extraction, red-lining; precision matters

GLM-5.2

Z AI

Intelligence 51 Code 50 Math 49 132 t/s

Highest reasoning quality (51) among models with sufficient context for full contracts. Hallucination rate at Q4/Q8 is low enough for attorney review loops.

Financial / accounting

Meeting notes → CRM, client memos, regulatory summaries

GLM-5.2

Z AI

Intelligence 51 Code 50 Math 49 132 t/s

Strong math benchmarks + 132 t/s output. Fast enough for live meeting capture; accurate enough for numbers-heavy summaries.

Fast turnaround

Near-real-time tasks: form filling, short summaries, simple Q&A

Granite 4.0 H Small

IBM

Intelligence 5 Code 5 Math 4 374 t/s

At 374 t/s, the fastest open-weight model in current rankings. Sufficient intelligence (5) for structured short-form output.

General purpose

Best single model if you only want to run one

GLM-5.2

Z AI

Intelligence 51 Code 50 Math 49 132 t/s

Highest overall intelligence index (51/100) in the open-weight leaderboard. The go-to when you want one model that handles most tasks well.
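
The selection logic behind these picks is not published here, so the following is only a minimal sketch of how a per-use-case pick could work: each use case gets one ranking key and the top-scoring model wins. The `Model` class, the `RULES` mapping, and the use-case names are hypothetical; the rows are copied from the open-weight leaderboard on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    intelligence: int
    code: int
    math: int
    tokens_per_s: int


# A few rows copied from the open-weight leaderboard on this page.
MODELS = [
    Model("GLM-5.2", 51, 50, 49, 132),
    Model("MiniMax M3", 44, 43, 42, 92),
    Model("DeepSeek V4 Pro", 44, 43, 44, 88),
    Model("Qwen3.6 35B A3B", 32, 31, 31, 161),
    Model("Granite 4.0 H Small", 5, 5, 4, 374),
]

# Hypothetical rules: one ranking key per use case (an assumption,
# not the page's actual algorithm).
RULES = {
    "software_engineering": lambda m: m.code,
    "general_purpose": lambda m: m.intelligence,
    "fast_turnaround": lambda m: m.tokens_per_s,
}


def pick(use_case, models=MODELS):
    """Return the model with the highest value for the use case's key."""
    return max(models, key=RULES[use_case])


picks = {use_case: pick(use_case).name for use_case in RULES}
print(picks)
```

A single-key rule like this reproduces the software engineering, general purpose, and fast turnaround picks above; the composite picks (agentic, clinical, legal) would need extra inputs such as context length.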

Full open-weight leaderboard

Models marked RDMA cluster require the TB5 two-node configuration (244 GB usable). Models whose Min RAM exceeds 244 GB need a larger cluster; the rest run on a single machine.
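
The 244 GB figure follows from simple arithmetic on the setup described in the leaderboard notes: two 128 GB nodes minus 12 GB of overhead. The sketch below assumes that overhead splits evenly at 6 GB per node, which is an inference rather than something the page states; it does also reproduce the ~366 GB quoted for a three-node cluster. The function names are hypothetical, and the check ignores KV cache and other runtime memory.

```python
NODE_RAM_GB = 128          # MacBook Pro M5 Max, per the setup note
OVERHEAD_PER_NODE_GB = 6   # assumption: 12 GB overhead split over 2 nodes


def usable_gb(nodes):
    """Unified memory left for model weights across the cluster."""
    return nodes * (NODE_RAM_GB - OVERHEAD_PER_NODE_GB)


def fits(model_gb, nodes=2):
    """True if the quantized weights fit in the usable memory."""
    return model_gb <= usable_gb(nodes)


print(usable_gb(2))   # two-node cluster
print(fits(150))      # GLM-5.2 at IQ1_S (~150 GB)
print(fits(222))      # GLM-5.2 at IQ2_M (~222 GB)
print(fits(375))      # Ling-1T at Q3_K_M (~375 GB)
```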

Ranked by overall intelligence index (0–100 scale).

Why do scores look so low? This is a 0–100 scale, but no AI has scored above 60 yet. The benchmarks are intentionally brutal:

· Humanity's Last Exam: questions designed to stump AI
· GPQA Diamond: PhD-level biology, chemistry, physics
· SciCode: research-grade scientific code
· Terminal-Bench Hard: shell tasks that actually execute

A score of 51 means 51% of that extremely hard mix. By contrast, a poem request would score 99. The tier legend below shows what each score range means in practice.

All models run locally; no API is required.

# · Model · Provider · Intelligence · Coding (est) · Math (est) · Speed (t/s) · Context · Min RAM · Level
1 · GLM-5.2 (RDMA cluster) · Z AI · 51 · 50 · 49 · 132 · 1000K · 150 GB · Grad Specialist
2 · MiniMax M3 (RDMA cluster) · MiniMax · 44 · 43 · 42 · 92 · 1000K · 160 GB · Graduate
3 · DeepSeek V4 Pro · DeepSeek · 44 · 43 · 44 · 88 · 1000K · 580 GB · Graduate
4 · Kimi K2.6 · Moonshot AI · 43 · 43 · 41 · 83 · 262K · 360 GB · Graduate
5 · MiMo-V2.5-Pro · Xiaomi · 42 · 40 · 43 · 49 · 1000K · 360 GB · Graduate
6 · GLM-5.1 Reasoning (RDMA cluster) · Z AI · 40 · 38 · 41 · 74 · 200K · 150 GB · Graduate
7 · GLM-5 Reasoning (RDMA cluster) · Z AI · 40 · 38 · 40 · 68 · 200K · 150 GB · Graduate
8 · DeepSeek V4 Flash · DeepSeek · 40 · 40 · 39 · 104 · 1000K · 135 GB · Graduate
9 · MiniMax-M2.7 · MiniMax · 38 · 37 · 38 · 43 · 205K · 110 GB · Graduate
10 · Kimi K2.5 · Moonshot AI · 38 · 37 · 36 · 46 · 262K · 360 GB · Graduate
11 · Qwen3.6 27B · Alibaba · 37 · 35 · 36 · 57 · 262K · 17 GB · Graduate
12 · Qwen3.5 397B A17B · Alibaba · 34 · 33 · 34 · 50 · 262K · 150 GB · Smart Undergrad
13 · Qwen3.5 27B · Alibaba · 34 · 32 · 33 · 81 · 262K · 17 GB · Smart Undergrad
14 · Qwen3.6 35B A3B · Alibaba · 32 · 31 · 31 · 161 · 262K · 23 GB · Smart Undergrad
15 · Mistral Medium 3.5 · Mistral · 30 · 29 · 29 · 118 · 131K · 79 GB · Smart Undergrad
16 · Gemma 4 31B · Google · 29 · 27 · 28 · 35 · 131K · 20 GB · Smart Undergrad
17 · Qwen3.5 9B · Alibaba · 25 · 24 · 24 · 56 · 262K · 6 GB · Smart HS
18 · Mistral Small 4 · Mistral · 21 · 20 · 20 · 173 · 262K · 55 GB · Smart HS
19 · DeepSeek-R1 671B (RDMA cluster) · DeepSeek · 20 · 18 · 22 · 8 · 131K · 190 GB · High School
20 · Qwen3.5 4B · Alibaba · 20 · 19 · 19 · 182 · 131K · 3 GB · High School
21 · Ling 2.6 Flash · InclusionAI · 19 · 18 · 18 · 183 · 262K · 49 GB · High School
22 · Ling-1T (RDMA cluster) · InclusionAI · 13 · 12 · 12 · 6 · 128K · 375 GB · High School
23 · Gemma 4 E4B · Google · 12 · 11 · 11 · 290 · 131K · 5 GB · Elementary
24 · Granite 4.0 H Small · IBM · 5 · 5 · 4 · 374 · 128K · 20 GB · Elementary
25 · Phi-4 14B · Microsoft · 5 · 5 · 6 · 36 · 16K · 10 GB · Elementary
26 · Llama 3.2 3B · Meta · 4 · 3 · 3 · 52 · 131K · 2 GB · Elementary
27 · Phi-4-mini 3.8B · Microsoft · 3 · 3 · 4 · 44 · 16K · 3 GB · Elementary

RDMA cluster models (TB5 RDMA cluster required)

Setup: Two MacBook Pro M5 Max 128 GB nodes connected via Thunderbolt 5 RDMA. Set Metal heap manually to leave 12 GB overhead (244 GB usable). ~80 Gb/s inter-node bandwidth at ~3 µs latency.

· GLM-5.2: New #1 open-weight on the AA Intelligence Index v4.1 (51), +11 over GLM-5.1. On real coding and agentic work it beats GPT-5.5 (SWE-bench Pro 62.1 vs 58.6, FrontierSWE 74.4% vs 72.6%, MCP-Atlas 77.0 vs 75.3, GDPval-AA 1524 vs 1514) and lands within a point of the top available closed model, Opus 4.8, at roughly 1/6 the API cost. It trails only on Terminal-Bench 2.1 (81 vs 84) and the composite Index (where GPT-5.5 is 55), and has no image input yet. 744B total / 40B active MoE, MIT license, 1M context. IQ1_S (~150 GB) runs across the 244 GB 2-node cluster at ~15 t/s; IQ2_M (~222 GB) also fits. Verbose (~43k tokens/task). Self-hostable open weights: it beats a cloud frontier model on real work while running air-gapped and cheaply.
· MiniMax M3: First multimodal M-series (text + image + video in), 1M context. 428B total / 23B active MoE; Q2_K (~160 GB) fits inside the 244 GB 2-node window and needs both nodes. Joint-2nd open-weight Intelligence Index (44, AA v4.1, tied with DeepSeek V4 Pro) behind GLM-5.2 (51). Self-hostable open weights, which suits air-gapped deployment.
· GLM-5.1 Reasoning: 744B total / 40B active MoE; IQ1_S (~150 GB) needs the 2-node cluster. Superseded by GLM-5.2.
· GLM-5 Reasoning: 744B total / 40B active MoE; IQ1_S (~150 GB) needs the 2-node cluster. Superseded by GLM-5.1 and GLM-5.2.
· DeepSeek-R1 671B: Q2_K ~190 GB, fits in 244 GB usable across both nodes. The full 671B / 37B-active reasoning model, not a distilled version. Heavily superseded on the v4.1 leaderboard by GLM-5.2 and the V4 line; kept for reference.
· Ling-1T: Q3_K_M ~375 GB, beyond the current 2-node setup (244 GB). Needs a 3-node cluster (~366 GB usable) or a future 512 GB Mac. At Q4: ~500 GB (4 nodes). 1T total / 50B active params, MIT license.

Min RAM:
· ≤ 36 GB: any M4/M5
· 37–64 GB: M4/M5 Pro 48 GB+
· 65–128 GB: M5 Max 128 GB
· 129–244 GB: RDMA cluster
· > 244 GB: bigger cluster needed
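
The Min RAM legend is a threshold lookup, and can be sketched as one. The function name `hardware_class` is hypothetical; the thresholds and labels are taken directly from the legend.

```python
def hardware_class(min_ram_gb):
    """Map a Min RAM value (GB) to the hardware class from the legend."""
    if min_ram_gb <= 36:
        return "any M4/M5"
    if min_ram_gb <= 64:
        return "M4/M5 Pro 48 GB+"
    if min_ram_gb <= 128:
        return "M5 Max 128 GB"
    if min_ram_gb <= 244:
        return "RDMA cluster"
    return "bigger cluster needed"


# Min RAM values copied from the leaderboard.
for name, ram in [("Qwen3.6 27B", 17), ("Mistral Small 4", 55),
                  ("Mistral Medium 3.5", 79), ("GLM-5.2", 150),
                  ("DeepSeek V4 Pro", 580)]:
    print(f"{name}: {hardware_class(ram)}")
```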

Min RAM = minimum unified memory to run at Q3_K_M quantization (large MoE) or Q4_K_M (dense/small). Intelligence Index is artificialanalysis.ai's composite quality score (0–100 scale, AA Intelligence Index v4.1). Scores look low because the benchmarks are hard: Humanity's Last Exam, GPQA Diamond, SciCode, and 6 other graduate-level evals. The current world ceiling is Claude Fable 5 at 60 (currently unavailable; Claude Opus 4.8 at 56 is the top available model). A score of 51 means the model correctly handled 51% of that extremely difficult mix. All models have publicly released weights; no API key is required to run them.
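
As a rough cross-check on the quantized sizes quoted in the leaderboard notes, file size and parameter count give an effective bits-per-weight figure. This sketch assumes 1 GB = 10^9 bytes and that the whole file is weights (it ignores metadata and any tensors kept at higher precision), so the results are approximate. The function name `bits_per_weight` is hypothetical.

```python
def bits_per_weight(size_gb, params_billions):
    """Effective bits per parameter: (GB * 8 bits) / billions of params."""
    return size_gb * 8 / params_billions


# (label, quoted size in GB, total parameters in billions), from this page.
QUOTED = [
    ("GLM-5.2 IQ1_S", 150, 744),
    ("GLM-5.2 IQ2_M", 222, 744),
    ("MiniMax M3 Q2_K", 160, 428),
    ("DeepSeek-R1 671B Q2_K", 190, 671),
    ("Ling-1T Q3_K_M", 375, 1000),
    ("Ling-1T Q4", 500, 1000),
]

for label, size_gb, params_b in QUOTED:
    print(f"{label}: ~{bits_per_weight(size_gb, params_b):.2f} bits/weight")
```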

Intelligence scale

What the scores actually mean

Each level mapped to real-world tasks, with concrete examples

These aren't IQ comparisons: they're a shorthand for the kinds of tasks a model handles reliably. Real-world complexity, not exam performance.

PhD · Frontier research (68+)

Generates novel hypotheses; solves open research problems; produces work that could advance a field.

· Proposes and tests original research hypotheses not in the training corpus
· Identifies unsolved problems at the edge of a scientific field
· Writes grant proposals reviewers cannot distinguish from expert submissions
· Writes poetry with genuine stylistic innovation, not imitation of any existing poet

PhD with distinction (60–67)

Systematic expert-level analysis across multiple disciplines. Current world ceiling: Claude Fable 5 scores 60 (Opus 4.8, the top available model, scores 56).

· Identifies factual errors and methodological flaws in published papers
· Synthesizes across unrelated disciplines to surface non-obvious connections
· Produces research-quality writing that could pass peer review
· Writes poetry with genuine literary merit a reviewer could attribute to a published poet

PhD (53–59)

Tackles novel research questions; comparable to a junior faculty member in a specialized domain.

· Identifies gaps in existing literature and proposes studies to fill them
· Produces publishable draft sections of a scientific paper
· Reviews code and identifies subtle algorithmic inefficiencies across a large codebase
· Writes a poem using meter, imagery, and controlling conceit in a unified way

Graduate specialist + real world (45–52)

Domain expertise applied to messy, real-world inputs: the ambiguity professionals encounter daily.

· Identifies subtle contradictions spread across a 50-page contract
· Writes a technically accurate oncology referral letter directly from raw visit notes
· Flags methodological problems in a clinical trial design narrative
· Writes an original sonnet with correct meter and a genuine emotional argument

Graduate (36–44)

Synthesizes research across papers; writes production code; handles structured professional tasks.

· Reads five research papers and synthesizes their conclusions into a coherent argument
· Builds a working REST API from a spec without scaffolding
· Identifies clause-level issues in a standard commercial contract
· Drafts a clinical SOAP note from a visit transcript
· Analyzes poetic technique using critical theory vocabulary

Smart undergrad (28–35)

Competent at structured academic tasks; writes functional code; reasons across a single domain.

· Writes a literature review with accurate citations and a coherent argument
· Debugs moderately complex code across multiple files
· Drafts a business proposal with coherent financial rationale
· Writes a structured legal argument at a 1L level
· Writes a sonnet with correct rhyme scheme, iambic meter, and a volta

Smart high school (21–27)

Handles multi-step reasoning; writes functional short programs; analyzes texts with modest depth.

· Writes a short story with a plot arc and character motivation
· Writes a Python script to parse a CSV and compute descriptive statistics
· Analyzes a poem's structure, imagery, and central theme
· Solves introductory chemistry and physics word problems

High school (13–20)

Competent at summarization and basic writing; simple code; single-step reasoning.

· Summarizes a newspaper article with accurate main points
· Writes a 5-paragraph essay with a clear thesis and supporting paragraphs
· Writes a loop and handles basic I/O in Python
· Writes a rhyming poem on a given topic (ABAB scheme)

Elementary school (0–12)

Handles simple factual questions and basic instructions; limited on multi-step or abstract tasks.

· Answers simple factual questions ("What is the capital of France?")
· Follows basic instructions: translate a phrase, fill in a blank
· Writes a few sentences about a familiar topic
· Writes a simple rhyming couplet

Closed-source reference: Claude Fable 5 scores 60 (PhD with distinction), the current world ceiling as of June 2026 (currently unavailable; Claude Opus 4.8 at 56 is the top available model). No model, open or closed, has reached the Frontier PhD tier (68+) yet.
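
The tier boundaries above amount to a sorted lookup table. A minimal sketch, assuming integer scores on the 0–100 scale; the `tier` function and `TIERS` table are hypothetical names, while the lower bounds and labels are the ones listed in the scale.

```python
from bisect import bisect_right

# (lower bound, tier name), in ascending order, from the scale above.
TIERS = [
    (0, "Elementary school"),
    (13, "High school"),
    (21, "Smart high school"),
    (28, "Smart undergrad"),
    (36, "Graduate"),
    (45, "Graduate specialist + real world"),
    (53, "PhD"),
    (60, "PhD with distinction"),
    (68, "PhD · Frontier research"),
]
LOWER_BOUNDS = [lower for lower, _ in TIERS]


def tier(score):
    """Return the tier name for an Intelligence Index score (0-100)."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # bisect_right finds the first lower bound above the score;
    # the tier just before it is the one the score falls into.
    return TIERS[bisect_right(LOWER_BOUNDS, score) - 1][1]


print(tier(51))  # GLM-5.2
print(tier(60))  # Claude Fable 5
print(tier(5))   # Granite 4.0 H Small
```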

Next step

Which hardware runs the top models?

The top models need 36–128 GB of unified memory to run. Our hardware guide covers what that means in practice: every Apple Silicon config, with prices.

See hardware configs →