Models

Open vs closed.
Which task. Which model.

We rank open-weight models next to their closed-source counterparts, by task. SWE is not agentic. Agentic is not reasoning. The picks update automatically from artificialanalysis.ai every 12 hours.

● Static snapshot · June 2026 Source: artificialanalysis.ai · Built June 29, 2026

Head-to-head by task

Purple rows are open-weight, grey are closed APIs. The closed scores are here for scale only. Using them means your data leaves the building. The General column is the verified AA Intelligence Index; the SWE, Agentic and Reasoning splits are estimates anchored to it, since AA does not publish per-task sub-scores.

Scores are on a 0–100 scale (Artificial Analysis Intelligence Index).

Why are scores only 40–60? The 0–100 scale is based on benchmarks so hard that no AI has scored above 60 yet. Today's world record is Claude Fable 5 at 60 (currently unavailable); the top available model is Claude Opus 4.8 at 56.

These benchmarks are designed to find failure modes, not measure everyday capability:

· Humanity's Last Exam: expert-contributed questions specifically designed to stump AI
· GPQA Diamond: PhD-level science problems
· SciCode: writing research-grade scientific code
· Terminal-Bench Hard: shell commands that actually run

For everyday tasks (writing, summarizing, drafting docs), models perform far better than these scores suggest. The benchmarks exist to reveal what's still broken, not what already works.

SWE

Code generation, bug fixes, refactoring

Claude Fable 5 (max) API only

Claude Opus 4.8 (max) API only

GPT-5.5 (xhigh) API only

Gemini 3.5 Flash (high) API only

Claude Opus 4.7 (max) API only

GPT-5.5 (high) API only

Grok 4.3 (high) API only

Claude Sonnet 4.6 (max) API only

GLM-5.2 open

Gemini 3.1 Pro Preview API only

MiniMax M3 open

DeepSeek V4 Pro open

Kimi K2.6 open

MiMo-V2.5-Pro open

GLM-5.1 Reasoning open

Agentic

Multi-step planning, tool use, function calling

Claude Fable 5 (max) API only

Claude Opus 4.8 (max) API only

Claude Opus 4.7 (max) API only

GPT-5.5 (xhigh) API only

Gemini 3.5 Flash (high) API only

GPT-5.5 (high) API only

Grok 4.3 (high) API only

Claude Sonnet 4.6 (max) API only

GLM-5.2 open

Gemini 3.1 Pro Preview API only

MiniMax M3 open

DeepSeek V4 Pro open

Kimi K2.6 open

MiMo-V2.5-Pro open

GLM-5.1 Reasoning open

Reasoning

Chain-of-thought, math, auditable logic

Claude Fable 5 (max) API only

Claude Opus 4.8 (max) API only

GPT-5.5 (xhigh) API only

Gemini 3.5 Flash (high) API only

Claude Opus 4.7 (max) API only

Grok 4.3 (high) API only

GPT-5.5 (high) API only

GLM-5.2 open

Claude Sonnet 4.6 (max) API only

Gemini 3.1 Pro Preview API only

DeepSeek V4 Pro open

MiMo-V2.5-Pro open

Kimi K2.6 open

GLM-5.1 Reasoning open

MiniMax M3 open

General

Overall quality across mixed tasks

Claude Fable 5 (max) API only

Claude Opus 4.8 (max) API only

GPT-5.5 (xhigh) API only

Gemini 3.5 Flash (high) API only

Claude Opus 4.7 (max) API only

GPT-5.5 (high) API only

Grok 4.3 (high) API only

Claude Sonnet 4.6 (max) API only

GLM-5.2 open

Gemini 3.1 Pro Preview API only

MiniMax M3 open

DeepSeek V4 Pro open

Kimi K2.6 open

MiMo-V2.5-Pro open

GLM-5.1 Reasoning open

These scores approximate task-specific benchmarks from artificialanalysis.ai (SWE-bench, GPQA, MMLU-Pro, and coding evals). Closed-source models are shown for scale only: they require API keys and route your data through third-party servers. Open-weight models run entirely on your own hardware.

Best open-weight pick per use case

Picked algorithmically from live leaderboard scores: not my opinion, not a static list.

⌨

Software engineering

Code generation, bug fixes, refactoring, test writing, single-shot tasks on a specific file or function

GLM-5.2

Z AI

Intelligence 51 Code 50 Math 49 132 t/s

Highest coding benchmark in the open-weight leaderboard. Optimized for correctness on discrete tasks: generate, complete, explain, fix.

◈

Agentic / multi-step

Long-horizon planning, tool use, function calling, multi-turn task completion across many steps

GLM-5.2

Z AI

Intelligence 51 Code 50 Math 49 132 t/s

Best composite for long-horizon work: intelligence 51, 1000K context. Agentic loops need a model that tracks state, calls tools reliably, and recovers across many steps, not one that just writes code.

◎

Reasoning / Chain-of-thought

Explicit step-by-step logic, audits, diagnostics, structured analysis

DeepSeek V4 Pro

DeepSeek

Intelligence 44 Code 43 Math 44 88 t/s

Top reasoning model in current open-weight rankings (intelligence 44). Reasoning-mode models expose their chain-of-thought, so every conclusion is auditable, which matters for regulated workflows.

♥

Clinical documentation

SOAP notes, visit summaries, referral letters; must stay on-device

GLM-5.2

Z AI

Intelligence 51 Code 50 Math 49 132 t/s

Top intelligence (51) with 1000K context; handles full visit transcripts without truncation. No data leaves the device.

⚖

Legal analysis

Contract review, clause extraction, red-lining; precision matters

GLM-5.2

Z AI

Intelligence 51 Code 50 Math 49 132 t/s

Highest reasoning quality (51) among models with sufficient context for full contracts. Hallucination rate at Q4/Q8 is low enough for attorney review loops.

Financial / accounting

Meeting notes → CRM, client memos, regulatory summaries

GLM-5.2

Z AI

Intelligence 51 Code 50 Math 49 132 t/s

Strong math benchmarks + 132 t/s output. Fast enough for live meeting capture; accurate enough for numbers-heavy summaries.

⚡

Fast turnaround

Near-real-time tasks: form filling, short summaries, simple Q&A

Granite 4.0 H Small

IBM

Intelligence 5 Code 5 Math 4 374 t/s

374 t/s: the fastest open-weight model in current rankings. Sufficient intelligence (5) for structured short-form output.

◆

General purpose

Best single model if you only want to run one

GLM-5.2

Z AI

Intelligence 51 Code 50 Math 49 132 t/s

Highest overall intelligence index (51/100) in the open-weight leaderboard. The go-to when you want one model that handles most tasks well.
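
The page does not publish its selection rule, so the following is only a minimal sketch of how such picks could be computed. It assumes each use case maximizes a single leaderboard column; the `USE_CASE_METRIC` mapping and the `best_pick` helper are hypothetical names, and the rows are a small sample copied from the leaderboard on this page. The reasoning pick is deliberately left out, since DeepSeek V4 Pro is chosen there despite a lower intelligence score, which implies a rule this sketch does not model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    intelligence: int
    code: int
    math: int
    speed_tps: int


# A few rows copied from the open-weight leaderboard on this page.
MODELS = [
    Model("GLM-5.2", 51, 50, 49, 132),
    Model("MiniMax M3", 44, 43, 42, 92),
    Model("DeepSeek V4 Pro", 44, 43, 44, 88),
    Model("Kimi K2.6", 43, 43, 41, 83),
    Model("Granite 4.0 H Small", 5, 5, 4, 374),
]

# Assumption: one metric per use case. The real rule is not documented.
USE_CASE_METRIC = {
    "software_engineering": "code",
    "general_purpose": "intelligence",
    "fast_turnaround": "speed_tps",
}


def best_pick(use_case: str, models=MODELS) -> Model:
    """Return the model with the highest value of the use case's metric."""
    metric = USE_CASE_METRIC[use_case]
    # max() keeps the first maximal element, so ties go to the earlier row.
    return max(models, key=lambda m: getattr(m, metric))


picks = {use_case: best_pick(use_case).name for use_case in USE_CASE_METRIC}
```

With the sample rows, `picks` agrees with the cards above for software engineering, general purpose, and fast turnaround.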

Full open-weight leaderboard

Models marked RDMA cluster require the TB5 two-node configuration (244 GB usable). All others run on a single machine.

Ranked by overall intelligence index (0–100 scale).

Why do scores look so low? This is a 0–100 scale, but no AI has scored above 60 yet. The benchmarks are intentionally brutal:

· Humanity's Last Exam: questions designed to stump AI
· GPQA Diamond: PhD-level biology, chemistry, physics
· SciCode: research-grade scientific code
· Terminal-Bench Hard: shell tasks that actually execute

A score of 51 means 51% of that extremely hard mix. By contrast, a poem request would score 99. The tier legend below shows what each score range means in practice. All models run locally; no API is required.

#	Model	Provider	Intelligence	Coding (est)	Math (est)	Speed (t/s)	Context	Min RAM	Level
1	GLM-5.2 (RDMA cluster)	Z AI	51	50	49	132	1000K	150 GB	Grad Specialist
2	MiniMax M3 (RDMA cluster)	MiniMax	44	43	42	92	1000K	160 GB	Graduate
3	DeepSeek V4 Pro	DeepSeek	44	43	44	88	1000K	580 GB	Graduate
4	Kimi K2.6	Moonshot AI	43	43	41	83	262K	360 GB	Graduate
5	MiMo-V2.5-Pro	Xiaomi	42	40	43	49	1000K	360 GB	Graduate
6	GLM-5.1 Reasoning (RDMA cluster)	Z AI	40	38	41	74	200K	150 GB	Graduate
7	GLM-5 Reasoning (RDMA cluster)	Z AI	40	38	40	68	200K	150 GB	Graduate
8	DeepSeek V4 Flash	DeepSeek	40	40	39	104	1000K	135 GB	Graduate
9	MiniMax-M2.7	MiniMax	38	37	38	43	205K	110 GB	Graduate
10	Kimi K2.5	Moonshot AI	38	37	36	46	262K	360 GB	Graduate
11	Qwen3.6 27B	Alibaba	37	35	36	57	262K	17 GB	Graduate
12	Qwen3.5 397B A17B	Alibaba	34	33	34	50	262K	150 GB	Smart Undergrad
13	Qwen3.5 27B	Alibaba	34	32	33	81	262K	17 GB	Smart Undergrad
14	Qwen3.6 35B A3B	Alibaba	32	31	31	161	262K	23 GB	Smart Undergrad
15	Mistral Medium 3.5	Mistral	30	29	29	118	131K	79 GB	Smart Undergrad
16	Gemma 4 31B	Google	29	27	28	35	131K	20 GB	Smart Undergrad
17	Qwen3.5 9B	Alibaba	25	24	24	56	262K	6 GB	Smart HS
18	Mistral Small 4	Mistral	21	20	20	173	262K	55 GB	Smart HS
19	DeepSeek-R1 671B (RDMA cluster)	DeepSeek	20	18	22	8	131K	190 GB	High School
20	Qwen3.5 4B	Alibaba	20	19	19	182	131K	3 GB	High School
21	Ling 2.6 Flash	InclusionAI	19	18	18	183	262K	49 GB	High School
22	Ling-1T (RDMA cluster)	InclusionAI	13	12	12	6	128K	375 GB	High School
23	Gemma 4 E4B	Google	12	11	11	290	131K	5 GB	Elementary
24	Granite 4.0 H Small	IBM	5	5	4	374	128K	20 GB	Elementary
25	Phi-4 14B	Microsoft	5	5	6	36	16K	10 GB	Elementary
26	Llama 3.2 3B	Meta	4	3	3	52	131K	2 GB	Elementary
27	Phi-4-mini 3.8B	Microsoft	3	3	4	44	16K	3 GB	Elementary

Notes on RDMA cluster models (TB5 RDMA cluster required):

· Setup, common to all of them: Two MacBook Pro M5 Max 128 GB nodes connected via Thunderbolt 5 RDMA. Set Metal heap manually to leave 12 GB overhead (244 GB usable). ~800 GB/s inter-node bandwidth at ~3 µs latency.
· GLM-5.2: New #1 open-weight on the AA Intelligence Index v4.1 (51), +11 over GLM-5.1. On real coding and agentic work it beats GPT-5.5 (SWE-bench Pro 62.1 vs 58.6, FrontierSWE 74.4% vs 72.6%, MCP-Atlas 77.0 vs 75.3, GDPval-AA 1524 vs 1514) and lands within a point of the top closed model, Opus 4.8, at roughly 1/6 the API cost. It trails only on Terminal-Bench 2.1 (81 vs 84) and the composite Index (where GPT-5.5 is 55), and has no image input yet. 744B total / 40B active MoE, MIT license, 1M context. IQ1_S (~150 GB) runs across the 244 GB 2-node cluster at ~15 t/s; IQ2_M (~222 GB) also fits. Verbose (~43k tokens/task). Self-hostable open weights: beats a cloud frontier model on real work, air-gapped and cheap.
· MiniMax M3: First multimodal M-series (text + image + video in), 1M context. 428B total / 23B active MoE, Q2_K (~160 GB) fits inside the 244 GB 2-node window; needs both nodes. Joint-2nd open-weight Intelligence Index (44, AA v4.1, tied with DeepSeek V4 Pro) behind GLM-5.2 (51). Self-hostable open weights, which suits air-gapped deployment.
· GLM-5.1 Reasoning: 744B total / 40B active MoE; IQ1_S (~150 GB) needs the 2-node cluster. Superseded by GLM-5.2.
· GLM-5 Reasoning: 744B total / 40B active MoE; IQ1_S (~150 GB) needs the 2-node cluster. Superseded by GLM-5.1 and GLM-5.2.
· DeepSeek-R1 671B: Q2_K ~190 GB, fits in 244 GB usable across both nodes. The full 671B / 37B-active reasoning model, not a distilled version. Heavily superseded on the v4.1 leaderboard by GLM-5.2 and the V4 line; kept for reference.
· Ling-1T: Q3_K_M ~375 GB, beyond the current 2-node setup (244 GB). Needs 3-node cluster (~366 GB usable) or a future 512 GB Mac. At Q4: ~500 GB (4 nodes). 1T total / 50B active params, MIT license.

Min RAM legend:

· ≤ 36 GB: any M4/M5
· 37–64 GB: M4/M5 Pro 48 GB+
· 65–128 GB: M5 Max 128 GB
· 129–244 GB: RDMA cluster
· > 244 GB: bigger cluster needed

Min RAM = minimum unified memory to run at Q3_K_M quantization (large MoE) or Q4_K_M (dense/small). Intelligence Index is artificialanalysis.ai's composite quality score (0–100 scale, AA Intelligence Index v4.1). Scores look low because the benchmarks are hard: Humanity's Last Exam, GPQA Diamond, SciCode, and 6 other graduate-level evals. The current world ceiling is Claude Fable 5 at 60 (currently unavailable; Claude Opus 4.8 at 56 is the top available model). A score of 51 means the model correctly handled 51% of that extremely difficult mix. All models have publicly released weights; no API key is required to run them.
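
As a rough illustration, the Min RAM legend can be written as a threshold lookup. The function name `hardware_tier` is an assumption made for this sketch; the boundaries and labels are taken from the legend as published, and the function says nothing about whether a given quantization is actually usable at that size.

```python
def hardware_tier(min_ram_gb: float) -> str:
    """Map a model's Min RAM (GB) to the hardware class in the legend."""
    if min_ram_gb <= 0:
        raise ValueError("min_ram_gb must be positive")
    if min_ram_gb <= 36:
        return "any M4/M5"
    if min_ram_gb <= 64:
        return "M4/M5 Pro 48 GB+"
    if min_ram_gb <= 128:
        return "M5 Max 128 GB"
    if min_ram_gb <= 244:
        # 244 GB is the usable memory of the two-node TB5 configuration.
        return "RDMA cluster"
    return "bigger cluster needed"


# Example rows from the leaderboard: (model, Min RAM in GB).
examples = {
    "Qwen3.6 27B": 17,
    "Mistral Medium 3.5": 79,
    "GLM-5.2": 150,
    "DeepSeek V4 Pro": 580,
}
tiers = {name: hardware_tier(ram) for name, ram in examples.items()}
```

By this mapping, DeepSeek V4 Pro (580 GB) falls into "bigger cluster needed", even though it carries no RDMA cluster marker in the table.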

Intelligence scale

What the scores actually mean

Each level mapped to real-world tasks, with concrete examples


These aren't IQ comparisons: they're a shorthand for the kinds of tasks a model handles reliably. Real-world complexity, not exam performance.

PhD · Frontier research 68+

Generates novel hypotheses; solves open research problems; produces work that could advance a field.

· Proposes and tests original research hypotheses not in the training corpus
· Identifies unsolved problems at the edge of a scientific field
· Writes grant proposals reviewers cannot distinguish from expert submissions
· Writes poetry with genuine stylistic innovation, not imitation of any existing poet

PhD with distinction 60–67

Systematic expert-level analysis across multiple disciplines. Current world ceiling: Claude Fable 5 scores 60 (Opus 4.8, the top available model, scores 56).

· Identifies factual errors and methodological flaws in published papers
· Synthesizes across unrelated disciplines to surface non-obvious connections
· Produces research-quality writing that could pass peer review
· Writes poetry with genuine literary merit a reviewer could attribute to a published poet

PhD 53–59

Tackles novel research questions; comparable to a junior faculty member in a specialized domain.

· Identifies gaps in existing literature and proposes studies to fill them
· Produces publishable draft sections of a scientific paper
· Reviews code and identifies subtle algorithmic inefficiencies across a large codebase
· Writes a poem using meter, imagery, and controlling conceit in a unified way

Graduate specialist + real world 45–52

Domain expertise applied to messy, real-world inputs: the ambiguity professionals encounter daily.

· Identifies subtle contradictions spread across a 50-page contract
· Writes a technically accurate oncology referral letter directly from raw visit notes
· Flags methodological problems in a clinical trial design narrative
· Writes an original sonnet with correct meter and a genuine emotional argument

Graduate 36–44

Synthesizes research across papers; writes production code; handles structured professional tasks.

· Reads five research papers and synthesizes their conclusions into a coherent argument
· Builds a working REST API from a spec without scaffolding
· Identifies clause-level issues in a standard commercial contract
· Drafts a clinical SOAP note from a visit transcript
· Analyzes poetic technique using critical theory vocabulary

Smart undergrad 28–35

Competent at structured academic tasks; writes functional code; reasons across a single domain.

· Writes a literature review with accurate citations and a coherent argument
· Debugs moderately complex code across multiple files
· Drafts a business proposal with coherent financial rationale
· Writes a structured legal argument at a 1L level
· Writes a sonnet with correct rhyme scheme, iambic meter, and a volta

Smart high school 21–27

Handles multi-step reasoning; writes functional short programs; analyzes texts with modest depth.

· Writes a short story with a plot arc and character motivation
· Writes a Python script to parse a CSV and compute descriptive statistics
· Analyzes a poem's structure, imagery, and central theme
· Solves introductory chemistry and physics word problems

High school 13–20

Competent at summarization and basic writing; simple code; single-step reasoning.

· Summarizes a newspaper article with accurate main points
· Writes a 5-paragraph essay with a clear thesis and supporting paragraphs
· Writes a loop and handles basic I/O in Python
· Writes a rhyming poem on a given topic (ABAB scheme)

Elementary school 0–12

Handles simple factual questions and basic instructions; limited on multi-step or abstract tasks.

· Answers simple factual questions ("What is the capital of France?")
· Follows basic instructions: translate a phrase, fill in a blank
· Writes a few sentences about a familiar topic
· Writes a simple rhyming couplet

Closed-source reference: Claude Fable 5 scores 60 (PhD with distinction), the current world ceiling as of June 2026 (currently unavailable; Claude Opus 4.8 at 56 is the top available model). No model, open or closed, has reached the Frontier PhD tier (68+) yet.
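
The tier boundaries above amount to a lookup from score to level. A minimal sketch, assuming integer-style scores on the 0–100 scale; the names `level_for`, `_LOWER_BOUNDS`, and `_LEVELS` are illustrative and not part of any published tooling.

```python
from bisect import bisect_right

# Lower bound of each tier, in ascending order, as listed in the legend.
_LOWER_BOUNDS = [0, 13, 21, 28, 36, 45, 53, 60, 68]
_LEVELS = [
    "Elementary school",
    "High school",
    "Smart high school",
    "Smart undergrad",
    "Graduate",
    "Graduate specialist + real world",
    "PhD",
    "PhD with distinction",
    "PhD · Frontier research",
]


def level_for(score: float) -> str:
    """Return the tier whose range contains the given Intelligence Index."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # bisect_right counts how many lower bounds are <= score.
    return _LEVELS[bisect_right(_LOWER_BOUNDS, score) - 1]


glm_level = level_for(51)     # GLM-5.2
fable_level = level_for(60)   # Claude Fable 5
```

For example, `level_for(51)` lands in the graduate specialist tier and `level_for(60)` in PhD with distinction, matching the leaderboard and the closed-source reference above.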

Next step

Which hardware runs the top models?

The top models need 36–128 GB of unified memory to run. Our hardware guide covers what that means in practice: every Apple Silicon config, with prices.

See hardware configs →
