Models
Open vs closed.
Which task. Which model.
We rank open-weight models next to their closed-source counterparts, by task. SWE is not agentic. Agentic is not reasoning. The picks update automatically from artificialanalysis.ai every 12 hours.
Head-to-head by task
Purple rows are open-weight; grey rows are closed APIs. The closed scores are here for scale only: using them means your data leaves the building. The General column is the verified AA Intelligence Index; the SWE, Agentic and Reasoning splits are estimates anchored to it, since AA does not publish per-task sub-scores.
Scores are on a 0–100 scale (Artificial Analysis Intelligence Index).
Why are scores only 40–60?
The 0–100 scale is based on benchmarks so hard that no AI has scored above 60 yet. Today's world record is Claude Fable 5 at 60 (currently unavailable); the top available model is Claude Opus 4.8 at 56.
These benchmarks are designed to find failure modes, not measure everyday capability:
· Humanity's Last Exam: expert-contributed questions specifically designed to stump AI
· GPQA Diamond: PhD-level science problems
· SciCode: writing research-grade scientific code
· Terminal-Bench Hard: shell commands that actually run
For everyday tasks (writing, summarizing, drafting docs), models perform far better than these scores suggest. The benchmarks exist to reveal what's still broken, not what already works.
These scores approximate task-specific benchmarks from artificialanalysis.ai (SWE-bench, GPQA, MMLU-Pro, and coding evals). Closed-source models are shown for scale only: they require API keys and route your data through third-party servers. Open-weight models run 100% on your hardware.
Best open-weight pick per use case
Picked algorithmically from live leaderboard scores: not our opinion, not a static list.
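The selection logic itself is not shown on this page. As a minimal sketch of what an algorithmic pick could look like, the snippet below takes a few rows copied from the open-weight leaderboard and returns the top scorer per column. The `Model` type, the function names, and the optional RAM filter are assumptions made for illustration, not the site's actual code.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    intelligence: int
    coding_est: int
    math_est: int
    min_ram_gb: int


# A few rows copied from the open-weight leaderboard on this page.
MODELS = [
    Model("GLM-5.2", 51, 50, 49, 150),
    Model("MiniMax M3", 44, 43, 42, 160),
    Model("DeepSeek V4 Pro", 44, 43, 44, 580),
    Model("DeepSeek V4 Flash", 40, 40, 39, 135),
    Model("Qwen3.6 27B", 37, 35, 36, 17),
    Model("Qwen3.6 35B A3B", 32, 31, 31, 23),
    Model("Qwen3.5 9B", 25, 24, 24, 6),
]


def best_pick(models, column: str, ram_budget_gb: float) -> Optional[Model]:
    """Highest-scoring model in `column` whose Min RAM fits the budget."""
    # Hypothetical filter: the page does not say whether picks are RAM-aware.
    candidates = [m for m in models if m.min_ram_gb <= ram_budget_gb]
    if not candidates:
        return None
    # max() keeps the first of equal scores, so ties resolve in list order.
    return max(candidates, key=lambda m: getattr(m, column))


def picks_per_use_case(models, ram_budget_gb: float) -> dict:
    """One pick per score column, all under the same RAM budget."""
    columns = ("intelligence", "coding_est", "math_est")
    return {c: best_pick(models, c, ram_budget_gb) for c in columns}
```

With a 128 GB budget, `best_pick(MODELS, "coding_est", 128)` returns the Qwen3.6 27B row; with the 244 GB two-node budget it returns GLM-5.2.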
Full open-weight leaderboard
Models marked RDMA cluster require the TB5 two-node configuration (244 GB usable). All others run on a single machine.
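The 244 GB figure follows from the setup notes in the table: two 128 GB nodes minus 12 GB of overhead. A minimal sketch of the fit check, using the quantized sizes quoted in the leaderboard notes, is below. It compares weight size only and assumes nothing extra is needed for KV cache or context, which is a simplification; the function names are illustrative.

```python
NODE_RAM_GB = 128   # unified memory per node, from the setup notes
NODES = 2           # the TB5 two-node configuration
OVERHEAD_GB = 12    # headroom left outside the Metal heap, from the setup notes

USABLE_GB = NODES * NODE_RAM_GB - OVERHEAD_GB  # 244


def fits_two_node(quant_size_gb: float) -> bool:
    """True if the quantized weights fit in the two-node usable window."""
    return quant_size_gb <= USABLE_GB


def headroom_gb(quant_size_gb: float) -> float:
    """Memory left over after loading the weights (negative if it does not fit)."""
    return USABLE_GB - quant_size_gb


# Quantized sizes as quoted in the leaderboard notes on this page.
QUANTS = {
    "GLM-5.2 IQ1_S": 150,
    "GLM-5.2 IQ2_M": 222,
    "MiniMax M3 Q2_K": 160,
    "DeepSeek-R1 671B Q2_K": 190,
    "Ling-1T Q3_K_M": 375,
}

if __name__ == "__main__":
    for name, size in QUANTS.items():
        status = "fits" if fits_two_node(size) else "does not fit"
        print(f"{name}: {size} GB {status} (headroom {headroom_gb(size)} GB)")
```

Of the five quants listed, only Ling-1T at Q3_K_M exceeds the two-node window.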
Ranked by overall intelligence index (0–100 scale).
Why do scores look so low?
This is a 0–100 scale, but no AI has scored above 60 yet. The benchmarks are intentionally brutal:
· Humanity's Last Exam: questions designed to stump AI
· GPQA Diamond: PhD-level biology, chemistry, physics
· SciCode: research-grade scientific code
· Terminal-Bench Hard: shell tasks that actually execute
A score of 51 means 51% of that extremely hard mix. By contrast, a poem request would score 99. The tier legend below shows what each score range means in practice.
All models run locally, no API required.
| # | Model | Provider | Intelligence | Coding (est) | Math (est) | Speed (t/s) | Context | Min RAM | Level |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GLM-5.2 (RDMA cluster required). New #1 open-weight on the AA Intelligence Index v4.1 (51), +11 over GLM-5.1. On real coding and agentic work it beats GPT-5.5 (SWE-bench Pro 62.1 vs 58.6, FrontierSWE 74.4% vs 72.6%, MCP-Atlas 77.0 vs 75.3, GDPval-AA 1524 vs 1514) and lands within a point of the top available closed model, Opus 4.8, at roughly 1/6 the API cost. It trails only on Terminal-Bench 2.1 (81 vs 84) and the composite Index (where GPT-5.5 is 55), and has no image input yet. 744B total / 40B active MoE, MIT license, 1M context. IQ1_S (~150 GB) runs across the 244 GB 2-node cluster at ~15 t/s; IQ2_M (~222 GB) also fits. Verbose (~43k tokens/task). Self-hostable open weights: beats a cloud frontier model on real work, air-gapped and cheap. Setup: Two MacBook Pro M5 Max 128 GB nodes connected via Thunderbolt 5 RDMA. Set Metal heap manually to leave 12 GB overhead (244 GB usable). ~800 GB/s inter-node bandwidth at ~3 µs latency. | Z AI | 51 | 50 | 49 | 132 | 1000K | 150 GB | Grad Specialist |
| 2 | MiniMax M3 (RDMA cluster required). First multimodal M-series (text + image + video in), 1M context. 428B total / 23B active MoE; Q2_K (~160 GB) fits inside the 244 GB 2-node window but needs both nodes. Joint-2nd open-weight Intelligence Index (44, AA v4.1, tied with DeepSeek V4 Pro) behind GLM-5.2 (51). Self-hostable open weights, which suits air-gapped deployment. Setup as in row 1. | MiniMax | 44 | 43 | 42 | 92 | 1000K | 160 GB | Graduate |
| 3 | DeepSeek V4 Pro | DeepSeek | 44 | 43 | 44 | 88 | 1000K | 580 GB | Graduate |
| 4 | Kimi K2.6 | Moonshot AI | 43 | 43 | 41 | 83 | 262K | 360 GB | Graduate |
| 5 | MiMo-V2.5-Pro | Xiaomi | 42 | 40 | 43 | 49 | 1000K | 360 GB | Graduate |
| 6 | GLM-5.1 Reasoning (RDMA cluster required). 744B total / 40B active MoE; IQ1_S (~150 GB) needs the 2-node cluster. Superseded by GLM-5.2. Setup as in row 1. | Z AI | 40 | 38 | 41 | 74 | 200K | 150 GB | Graduate |
| 7 | GLM-5 Reasoning (RDMA cluster required). 744B total / 40B active MoE; IQ1_S (~150 GB) needs the 2-node cluster. Superseded by GLM-5.1 and GLM-5.2. Setup as in row 1. | Z AI | 40 | 38 | 40 | 68 | 200K | 150 GB | Graduate |
| 8 | DeepSeek V4 Flash | DeepSeek | 40 | 40 | 39 | 104 | 1000K | 135 GB | Graduate |
| 9 | MiniMax-M2.7 | MiniMax | 38 | 37 | 38 | 43 | 205K | 110 GB | Graduate |
| 10 | Kimi K2.5 | Moonshot AI | 38 | 37 | 36 | 46 | 262K | 360 GB | Graduate |
| 11 | Qwen3.6 27B | Alibaba | 37 | 35 | 36 | 57 | 262K | 17 GB | Graduate |
| 12 | Qwen3.5 397B A17B | Alibaba | 34 | 33 | 34 | 50 | 262K | 150 GB | Smart Undergrad |
| 13 | Qwen3.5 27B | Alibaba | 34 | 32 | 33 | 81 | 262K | 17 GB | Smart Undergrad |
| 14 | Qwen3.6 35B A3B | Alibaba | 32 | 31 | 31 | 161 | 262K | 23 GB | Smart Undergrad |
| 15 | Mistral Medium 3.5 | Mistral | 30 | 29 | 29 | 118 | 131K | 79 GB | Smart Undergrad |
| 16 | Gemma 4 31B | Google | 29 | 27 | 28 | 35 | 131K | 20 GB | Smart Undergrad |
| 17 | Qwen3.5 9B | Alibaba | 25 | 24 | 24 | 56 | 262K | 6 GB | Smart HS |
| 18 | Mistral Small 4 | Mistral | 21 | 20 | 20 | 173 | 262K | 55 GB | Smart HS |
| 19 | DeepSeek-R1 671B (RDMA cluster required). Q2_K (~190 GB) fits in the 244 GB usable across both nodes. The full 671B / 37B-active reasoning model, not a distilled version. Heavily superseded on the v4.1 leaderboard by GLM-5.2 and the V4 line; kept for reference. Setup as in row 1. | DeepSeek | 20 | 18 | 22 | 8 | 131K | 190 GB | High School |
| 20 | Qwen3.5 4B | Alibaba | 20 | 19 | 19 | 182 | 131K | 3 GB | High School |
| 21 | Ling 2.6 Flash | InclusionAI | 19 | 18 | 18 | 183 | 262K | 49 GB | High School |
| 22 | Ling-1T (RDMA cluster required). Q3_K_M (~375 GB) is beyond the current 2-node setup (244 GB usable). Needs a 3-node cluster (~366 GB usable) or a future 512 GB Mac. At Q4: ~500 GB (4 nodes). 1T total / 50B active params, MIT license. | InclusionAI | 13 | 12 | 12 | 6 | 128K | 375 GB | High School |
| 23 | Gemma 4 E4B | Google | 12 | 11 | 11 | 290 | 131K | 5 GB | Elementary |
| 24 | Granite 4.0 H Small | IBM | 5 | 5 | 4 | 374 | 128K | 20 GB | Elementary |
| 25 | Phi-4 14B | Microsoft | 5 | 5 | 6 | 36 | 16K | 10 GB | Elementary |
| 26 | Llama 3.2 3B | Meta | 4 | 3 | 3 | 52 | 131K | 2 GB | Elementary |
| 27 | Phi-4-mini 3.8B | Microsoft | 3 | 3 | 4 | 44 | 16K | 3 GB | Elementary |
Min RAM = minimum unified memory to run at Q3_K_M quantization (large MoE) or Q4_K_M (dense/small), unless a row's note names a different quant. Intelligence Index is artificialanalysis.ai's composite quality score (0–100 scale, AA Intelligence Index v4.1). Scores look low because the benchmarks are hard: Humanity's Last Exam, GPQA Diamond, SciCode, and 6 other graduate-level evals. The current world ceiling is Claude Fable 5 at 60 (currently unavailable; Claude Opus 4.8 at 56 is the top available model). A score of 51 means the model correctly handled 51% of that extremely difficult mix. All models have publicly released weights; no API key is required to run them.
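One way to sanity-check a Min RAM figure is to convert it into effective bits per weight using the total parameter count from the row notes. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1B = 10^9 parameters) and that the quoted size is weights only, with no allowance for KV cache or runtime buffers. Both are assumptions, not statements from the page.

```python
def bits_per_weight(size_gb: float, total_params_b: float) -> float:
    """Effective bits stored per parameter for a quantized checkpoint.

    size_gb: quantized size in GB (assumed 1 GB = 1e9 bytes)
    total_params_b: total parameters in billions (MoE total, not active)
    """
    # The 1e9 factors cancel: (GB * 8 bits per byte) / billions of params.
    return size_gb * 8 / total_params_b


# (quant label, size in GB, total params in billions), as quoted on this page.
EXAMPLES = [
    ("GLM-5.2 IQ1_S", 150, 744),
    ("GLM-5.2 IQ2_M", 222, 744),
    ("MiniMax M3 Q2_K", 160, 428),
    ("DeepSeek-R1 671B Q2_K", 190, 671),
    ("Ling-1T Q3_K_M", 375, 1000),
]

if __name__ == "__main__":
    for label, size_gb, params_b in EXAMPLES:
        print(f"{label}: {bits_per_weight(size_gb, params_b):.2f} bits/weight")
```

Under these assumptions, 150 GB for a 744B-parameter model works out to about 1.61 bits per weight, and 375 GB for 1T parameters to 3.00.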
Intelligence scale
What the scores actually mean
Each level mapped to real-world tasks, with concrete examples
These aren't IQ comparisons: they're a shorthand for the kinds of tasks a model handles reliably. Real-world complexity, not exam performance.
Generates novel hypotheses; solves open research problems; produces work that could advance a field.
· Proposes and tests original research hypotheses not in the training corpus
· Identifies unsolved problems at the edge of a scientific field
· Writes grant proposals reviewers cannot distinguish from expert submissions
· Writes poetry with genuine stylistic innovation, not imitation of any existing poet
Systematic expert-level analysis across multiple disciplines. Current world ceiling: Claude Fable 5 scores 60 (Opus 4.8, the top available model, scores 56).
· Identifies factual errors and methodological flaws in published papers
· Synthesizes across unrelated disciplines to surface non-obvious connections
· Produces research-quality writing that could pass peer review
· Writes poetry with genuine literary merit a reviewer could attribute to a published poet
Tackles novel research questions; comparable to a junior faculty member in a specialized domain.
· Identifies gaps in existing literature and proposes studies to fill them
· Produces publishable draft sections of a scientific paper
· Reviews code and identifies subtle algorithmic inefficiencies across a large codebase
· Writes a poem using meter, imagery, and controlling conceit in a unified way
Domain expertise applied to messy, real-world inputs: the ambiguity professionals encounter daily.
· Identifies subtle contradictions spread across a 50-page contract
· Writes a technically accurate oncology referral letter directly from raw visit notes
· Flags methodological problems in a clinical trial design narrative
· Writes an original sonnet with correct meter and a genuine emotional argument
Synthesizes research across papers; writes production code; handles structured professional tasks.
· Reads five research papers and synthesizes their conclusions into a coherent argument
· Builds a working REST API from a spec without scaffolding
· Identifies clause-level issues in a standard commercial contract
· Drafts a clinical SOAP note from a visit transcript
· Analyzes poetic technique using critical theory vocabulary
Competent at structured academic tasks; writes functional code; reasons across a single domain.
· Writes a literature review with accurate citations and a coherent argument
· Debugs moderately complex code across multiple files
· Drafts a business proposal with coherent financial rationale
· Writes a structured legal argument at a 1L level
· Writes a sonnet with correct rhyme scheme, iambic meter, and a volta
Handles multi-step reasoning; writes functional short programs; analyzes texts with modest depth.
· Writes a short story with a plot arc and character motivation
· Writes a Python script to parse a CSV and compute descriptive statistics
· Analyzes a poem's structure, imagery, and central theme
· Solves introductory chemistry and physics word problems
Competent at summarization and basic writing; simple code; single-step reasoning.
· Summarizes a newspaper article with accurate main points
· Writes a 5-paragraph essay with a clear thesis and supporting paragraphs
· Writes a loop and handles basic I/O in Python
· Writes a rhyming poem on a given topic (ABAB scheme)
Handles simple factual questions and basic instructions; limited on multi-step or abstract tasks.
· Answers simple factual questions ("What is the capital of France?")
· Follows basic instructions: translate a phrase, fill in a blank
· Writes a few sentences about a familiar topic
· Writes a simple rhyming couplet
Closed-source reference: Claude Fable 5 scores 60 (PhD with distinction), the current world ceiling as of June 2026 (currently unavailable; Claude Opus 4.8 at 56 is the top available model). No model, open or closed, has reached the Frontier PhD tier (68+) yet.
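Apart from the 68+ threshold for Frontier PhD, the exact score cutoffs between levels are not stated here. As a rough sketch, the observed score range of each level can be recovered from the leaderboard rows themselves. The snippet below groups the Intelligence and Level columns from the table above; the function name is illustrative, and the ranges it reports are only what the current rows show, not official boundaries.

```python
# (Intelligence, Level) pairs copied from the open-weight leaderboard, rows 1-27.
ROWS = [
    (51, "Grad Specialist"),
    (44, "Graduate"), (44, "Graduate"), (43, "Graduate"), (42, "Graduate"),
    (40, "Graduate"), (40, "Graduate"), (40, "Graduate"), (38, "Graduate"),
    (38, "Graduate"), (37, "Graduate"),
    (34, "Smart Undergrad"), (34, "Smart Undergrad"), (32, "Smart Undergrad"),
    (30, "Smart Undergrad"), (29, "Smart Undergrad"),
    (25, "Smart HS"), (21, "Smart HS"),
    (20, "High School"), (20, "High School"), (19, "High School"),
    (13, "High School"),
    (12, "Elementary"), (5, "Elementary"), (5, "Elementary"),
    (4, "Elementary"), (3, "Elementary"),
]


def observed_ranges(rows):
    """Map each level to the (lowest, highest) score seen for it in `rows`."""
    ranges = {}
    for score, level in rows:
        low, high = ranges.get(level, (score, score))
        ranges[level] = (min(low, score), max(high, score))
    return ranges


if __name__ == "__main__":
    for level, (low, high) in observed_ranges(ROWS).items():
        print(f"{level}: {low}-{high}")
```

On the current rows this gives Graduate 37–44, Smart Undergrad 29–34, Smart HS 21–25, High School 13–20 and Elementary 3–12, with GLM-5.2 alone in Grad Specialist at 51. The true cutoffs lie somewhere in the gaps between those ranges.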
Next step
Which hardware runs the top models?
The top models need 36–128 GB of unified memory to run. Our hardware guide covers what that means in practice: every Apple Silicon config, with prices.
See hardware configs →