Research Summary · March 2026
AI Capability Progression 2024–2026
Benchmark Analysis · Frontier Models · Sources: Epoch AI · Stanford HAI · METR · ARC Prize
01 — Overview

Two years of accelerating progress

Frontier AI capability has advanced faster in 2024–2026 than in any prior two-year period. Three headline metrics illustrate the pace, and the acceleration itself is measurable.

SWE-bench Verified: 80%+ (up from 4% in early 2023)
GPQA Diamond: 94% (up from ~36% in early 2024)
Cost per M tokens: ~$1 (down from $30+ in 2023, a 30× drop)
Acceleration confirmed. Epoch AI found that its Capabilities Index grew almost twice as fast over the last two years as over the prior two: a ~90% acceleration in growth rate beginning April 2024, coinciding with the rise of reasoning models and reinforcement learning at frontier labs.
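To make the 30× cost collapse concrete, here is a back-of-the-envelope sketch. The prices and the 50M-tokens/day workload are illustrative assumptions, not actual vendor pricing:

```python
# Illustrative arithmetic for the ~30x drop in cost per million tokens.
# Prices are assumptions for the sketch, not actual vendor rates.
PRICE_2023 = 30.0  # USD per million tokens, GPT-4-era (assumed)
PRICE_2026 = 1.0   # USD per million tokens, frontier-tier today (assumed)

def monthly_cost(tokens_per_day: float, price_per_m: float) -> float:
    """Cost in USD for a 30-day month, given a price per million tokens."""
    return tokens_per_day * 30 / 1_000_000 * price_per_m

# A hypothetical product processing 50M tokens/day:
workload = 50_000_000
print(f"2023: ${monthly_cost(workload, PRICE_2023):,.0f}/month")  # $45,000
print(f"2026: ${monthly_cost(workload, PRICE_2026):,.0f}/month")  # $1,500
print(f"Reduction: {PRICE_2023 / PRICE_2026:.0f}x")               # 30x
```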
02 — Benchmark Trends

Capability across three key metrics

SWE-bench (coding), GPQA Diamond (scientific reasoning), and AIME math competition scores — each showing consistent upward progression from March 2024 to March 2026.

[Chart: Benchmark score progression. Best-in-class scores per benchmark, Mar 2024 – Mar 2026; series: SWE-bench (coding), GPQA Diamond (science), AIME math]
Data sourced from model cards, Stanford HAI AI Index 2025, and Epoch AI Benchmarking Hub. Scores reflect best published results per period across all frontier models.
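A minimal sketch of how a "best published result per period" series can be assembled: take the maximum score per benchmark per period, then carry the best-so-far forward. The records below are placeholders, not the underlying dataset:

```python
from collections import defaultdict

# Placeholder records: (benchmark, period, model, score). Not the actual data.
results = [
    ("SWE-bench", "2024-Q1", "model-a", 18.0),
    ("SWE-bench", "2024-Q3", "model-b", 38.0),
    ("SWE-bench", "2025-Q2", "model-c", 64.0),
    ("GPQA", "2024-Q1", "model-a", 40.0),
    ("GPQA", "2025-Q2", "model-c", 81.0),
]

# Best published score per (benchmark, period) ...
best = defaultdict(float)
for bench, period, model, score in results:
    best[(bench, period)] = max(best[(bench, period)], score)

# ... then a cumulative best-so-far series per benchmark, so the curve
# never dips when a strong model has no newer published result.
series = defaultdict(list)
for (bench, period), score in sorted(best.items()):
    prev = series[bench][-1][1] if series[bench] else 0.0
    series[bench].append((period, max(prev, score)))

print(dict(series))
```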
03 — METR Time Horizon

Task length doubling every 5–7 months

METR measures the length, in human working time, of software engineering tasks an AI can complete at a 50% success rate. The log-scale view below shows the exponential trend as a straight line; the linear view shows how steep that trend has become.
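A simplified sketch of how such a time horizon can be estimated from per-task outcomes, loosely following METR's published approach of fitting a logistic success model over log task length. The task data, fitting procedure, and resulting number here are illustrative:

```python
import math

# Placeholder per-task outcomes: (human_task_minutes, ai_succeeded).
# Illustrative data, not METR's dataset.
tasks = [
    (2, 1), (5, 1), (10, 1), (20, 1), (40, 1),
    (60, 1), (90, 0), (120, 1), (240, 0), (480, 0),
]

# Fit P(success) = sigmoid(a - b * log2(minutes)) by gradient ascent
# on the log-likelihood.
a, b, lr = 0.0, 0.0, 0.1
for _ in range(20_000):
    ga = gb = 0.0
    for minutes, y in tasks:
        x = math.log2(minutes)
        p = 1 / (1 + math.exp(-(a - b * x)))
        ga += (y - p)          # gradient of log-likelihood w.r.t. a
        gb += (y - p) * (-x)   # gradient w.r.t. b
    a += lr * ga / len(tasks)
    b += lr * gb / len(tasks)

# The 50% time horizon is where the logistic crosses 0.5: a - b*log2(t) = 0.
horizon_minutes = 2 ** (a / b)
print(f"Estimated 50% time horizon: {horizon_minutes:.0f} minutes")
```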

[Chart: AI task horizon, log scale. Measured task length vs. a 7-month doubling trend; exponential growth appears as a straight line on a log axis]
The November 2025 data point marks Claude Opus 4.5 completing tasks that take humans roughly five hours. The doubling pace accelerated from every 7 months in 2024 to every ~5 months by mid-2025.
[Chart: AI task horizon, linear scale. The same measured task lengths and 7-month doubling trend without log compression, showing the hockey-stick shape]
The linear scale makes the exponential shape visceral: the last two data points account for most of the total growth. This is what a 9× improvement in 12 months looks like.
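Both views plot the same exponential model, h(t) = h0 · 2^(t/T) with doubling time T. A short sketch evaluating it under the figures quoted in this section (the projection is an extrapolation, not a forecast):

```python
import math

def horizon(h0_hours: float, months: float, doubling_months: float) -> float:
    """Task horizon after `months`, assuming h(t) = h0 * 2^(t / T)."""
    return h0_hours * 2 ** (months / doubling_months)

# Growth over 12 months under the two quoted doubling paces:
for T in (7, 5):
    print(f"T = {T} months: {2 ** (12 / T):.1f}x per year")
# T = 7 months: 3.3x per year
# T = 5 months: 5.3x per year

# Doubling time implied by a 9x gain in 12 months:
print(f"9x in 12 months implies T = {12 / math.log2(9):.1f} months")  # ~3.8

# Projecting forward from a ~5-hour horizon (Nov 2025) at T = 5 months:
print(f"12 months later: ~{horizon(5, 12, 5):.0f}-hour tasks")  # ~26 hours
```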
04 — ARC-AGI-2

The inflection benchmark

ARC-AGI-2 launched in March 2025 with all frontier models below 5%. Within a year, scores surged past the human baseline of 60%. This is the sharpest capability inflection on record.

[Chart: ARC-AGI-2 score progression. Best AI score vs. the 60% human baseline, Mar 2025 – Mar 2026]
ARC-AGI-2 tests abstract reasoning that can't be brute-forced: easy for humans, hard for AI. Progress has been driven by refinement loops and test-time compute. Sources: ARC Prize Foundation, arcprize.org.
What changed: Progress wasn't just larger models — it was application-layer engineering. Refinement loops and structured test-time reasoning pushed scores from near-zero to beyond the human baseline in under 12 months. ARC-AGI-3 (interactive reasoning) launched March 2026 to reset the challenge.
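A minimal sketch of the refinement-loop pattern described above: propose candidate transformations, score them against the task's training pairs, keep the best, and stop on a perfect fit. The grids, the candidate operations, and the random proposal step are hypothetical stand-ins, not any lab's actual method:

```python
import random

# An ARC-style task: training pairs demonstrate a transformation; the goal
# is a rule that maps every training input to its output. Hypothetical data.
Grid = list[list[int]]
train_pairs: list[tuple[Grid, Grid]] = [
    ([[1, 0], [0, 0]], [[0, 0], [0, 1]]),
    ([[0, 2], [0, 0]], [[0, 0], [2, 0]]),
]

def rotate180(g: Grid) -> Grid:
    return [row[::-1] for row in g[::-1]]

def flip_h(g: Grid) -> Grid:
    return [row[::-1] for row in g]

CANDIDATE_OPS = [rotate180, flip_h]

def score(op, pairs) -> float:
    """Fraction of training pairs the candidate transformation reproduces."""
    return sum(op(x) == y for x, y in pairs) / len(pairs)

def refine(pairs, budget: int = 100):
    """Refinement loop: sample candidates, keep the best, stop at a perfect
    fit. Real systems propose candidates with an LLM and mutate the best
    ones; random sampling here is a stand-in for that proposal step."""
    best_op, best_score = None, -1.0
    for _ in range(budget):
        op = random.choice(CANDIDATE_OPS)
        s = score(op, pairs)
        if s > best_score:
            best_op, best_score = op, s
        if best_score == 1.0:
            break
    return best_op, best_score

op, s = refine(train_pairs)
print(op.__name__, s)  # rotate180 fits both training pairs -> score 1.0
```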
05 — Key Findings

What the data shows

Six observations from two years of frontier AI development.

01 — The acceleration is real
Epoch AI confirmed a ~90% acceleration in capability growth rate from April 2024, driven by reasoning models and reinforcement learning becoming standard at frontier labs.
02 — Coding crossed the threshold
SWE-bench went from 4% (2023) to 80%+ (2026). AI can now resolve real GitHub issues at scale. Agentic coding workflows are no longer experimental.
03 — Open-weight models caught up
The gap between closed and open-weight frontier models narrowed from 8.0% in January 2024 to just 1.7% by February 2025, measured as the relative difference in Chatbot Arena Elo scores. DeepSeek, Qwen, and Llama now compete directly.
04 — Competition compressed at the top
The Elo gap between the #1 and #10 ranked models shrank from 11.9% to 5.4%, and the #1 vs. #2 gap from 4.9% to 0.7%. No single lab dominates; the sketch after this list shows what small Elo gaps mean head-to-head.
05 — Benchmarks saturate fast
MMLU, GSM8K, and HumanEval are all effectively saturated at frontier tier. The field shifted to GPQA Diamond, ARC-AGI-2, Humanity's Last Exam, and METR as meaningful measures.
06 — Cost collapsed 30×
GPT-4-level performance cost ~$30/M tokens in 2023. Today it costs under $1/M. Each order-of-magnitude reduction unlocks use cases that were previously economically impossible.
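For intuition on finding 04: under the standard Elo model, a rating gap maps to a head-to-head win probability through a logistic curve with scale 400. The findings above cite percentage differences rather than raw points, so the point gaps below are assumptions for illustration:

```python
def win_probability(elo_gap: float) -> float:
    """P(higher-rated model wins) under the standard Elo model (scale 400)."""
    return 1 / (1 + 10 ** (-elo_gap / 400))

# Illustrative raw gaps (assumed, since the findings cite percentage gaps):
for gap in (10, 50, 100):
    print(f"{gap:>3}-point gap -> {win_probability(gap):.1%} win rate")
# 10-point gap -> 51.4% win rate
# 50-point gap -> 57.1% win rate
# 100-point gap -> 64.0% win rate
```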
06 — Model Timeline

Key releases & inflection points

The models that moved the benchmarks, in order.

Mar 2024
Claude 3 Opus / GPT-4 Turbo
Last generation before the reasoning model era. GPQA ~40%, SWE-bench ~18%.
Jun 2024
GPT-4o
Multimodal as default. Real-time voice. Cost drop cycle begins.
Sep 2024
Claude 3.5 Sonnet
New SWE-bench record. Computer use introduced. SWE-bench ~38%.
Nov 2024
OpenAI o1 + DeepSeek V3
Test-time compute arrives. o1 averages 74% on the AIME (an IMO qualifier) vs. GPT-4o's 12%. DeepSeek V3 signals China's frontier capability.
Mar 2025
ARC-AGI-2 launched
All frontier models score below 5%. Human baseline: 60%. The hockey stick benchmark begins.
May 2025
Claude 4 / GPT-4.5
Reasoning as standard, not premium. GPQA crosses 80%+. SWE-bench ~64%.
Aug–Nov 2025
Gemini 3 / Claude 4.5 / Grok 4
Claude Opus 4.5 completes 5-hour software tasks. ARC-AGI-2 breaks 45%. Gemini 3 hits 100% AIME 2025.
Mar 2026
Claude 4.6 / Gemini 3.1 Pro / GPT-5.4
ARC-AGI-2 at 77–84%+. SWE-bench 80%+. 1M context windows. ARC-AGI-3 (interactive) released.