Research Summary · March 2026
AI Capability Progression 2024–2026
Benchmark Analysis · Frontier Models · Sources: Epoch AI · Stanford HAI · METR · ARC Prize
01 — Overview

Two years of accelerating progress

Frontier AI capability has advanced faster in 2024–2026 than in any prior two-year period. Three headline metrics illustrate the pace, and the acceleration itself is measurable.

SWE-bench Verified: 80%+ (up from 4% in early 2023)
GPQA Diamond: 94% (up from ~36% in early 2024)
Cost per M tokens: ~$1 (down from $30+ in 2023, a 30× drop)
Acceleration confirmed. Epoch AI found that its Capabilities Index grew almost twice as fast over the last two years as over the prior two: a ~90% acceleration in growth rate beginning April 2024, coinciding with the rise of reasoning models and reinforcement learning at frontier labs.
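To make the 30× cost collapse concrete, here is a back-of-the-envelope sketch. The prices and the 50M-tokens/day workload are illustrative assumptions, not actual vendor pricing:

```python
# Illustrative arithmetic for the ~30x drop in cost per million tokens.
# Prices are assumptions for the sketch, not actual vendor rates.
PRICE_2023 = 30.0  # USD per million tokens, GPT-4-era (assumed)
PRICE_2026 = 1.0   # USD per million tokens, frontier-tier today (assumed)

def monthly_cost(tokens_per_day: float, price_per_m: float) -> float:
    """Cost in USD for a 30-day month, given a price per million tokens."""
    return tokens_per_day * 30 / 1_000_000 * price_per_m

# A hypothetical product processing 50M tokens/day:
workload = 50_000_000
print(f"2023: ${monthly_cost(workload, PRICE_2023):,.0f}/month")  # $45,000
print(f"2026: ${monthly_cost(workload, PRICE_2026):,.0f}/month")  # $1,500
print(f"Reduction: {PRICE_2023 / PRICE_2026:.0f}x")               # 30x
```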
02 — Benchmark Trends

Capability across three key metrics

SWE-bench (coding), GPQA Diamond (scientific reasoning), and AIME math competition scores — each showing consistent upward progression from March 2024 to March 2026.

[Chart: Benchmark score progression. Best-in-class scores per benchmark, Mar 2024 – Mar 2026; series: SWE-bench (coding), GPQA Diamond (science), AIME math]
Data sourced from model cards, Stanford HAI AI Index 2025, and Epoch AI Benchmarking Hub. Scores reflect best published results per period across all frontier models.
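A minimal sketch of how a "best published result per period" series can be assembled: take the maximum score per benchmark per period, then carry the best-so-far forward. The records below are placeholders, not the underlying dataset:

```python
from collections import defaultdict

# Placeholder records: (benchmark, period, model, score). Not the actual data.
results = [
    ("SWE-bench", "2024-Q1", "model-a", 18.0),
    ("SWE-bench", "2024-Q3", "model-b", 38.0),
    ("SWE-bench", "2025-Q2", "model-c", 64.0),
    ("GPQA", "2024-Q1", "model-a", 40.0),
    ("GPQA", "2025-Q2", "model-c", 81.0),
]

# Best published score per (benchmark, period) ...
best = defaultdict(float)
for bench, period, model, score in results:
    best[(bench, period)] = max(best[(bench, period)], score)

# ... then a cumulative best-so-far series per benchmark, so the curve
# never dips when a strong model has no newer published result.
series = defaultdict(list)
for (bench, period), score in sorted(best.items()):
    prev = series[bench][-1][1] if series[bench] else 0.0
    series[bench].append((period, max(prev, score)))

print(dict(series))
```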
03 — METR Time Horizon

Task length doubling every 5–7 months

METR measures the length, in human working time, of software engineering tasks an AI can complete at a 50% success rate. The log-scale view below shows the exponential trend as a straight line; the linear view shows how steep that trend has become.
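A simplified sketch of how such a time horizon can be estimated from per-task outcomes, loosely following METR's published approach of fitting a logistic success model over log task length. The task data, fitting procedure, and resulting number here are illustrative:

```python
import math

# Placeholder per-task outcomes: (human_task_minutes, ai_succeeded).
# Illustrative data, not METR's dataset.
tasks = [
    (2, 1), (5, 1), (10, 1), (20, 1), (40, 1),
    (60, 1), (90, 0), (120, 1), (240, 0), (480, 0),
]

# Fit P(success) = sigmoid(a - b * log2(minutes)) by gradient ascent
# on the log-likelihood.
a, b, lr = 0.0, 0.0, 0.1
for _ in range(20_000):
    ga = gb = 0.0
    for minutes, y in tasks:
        x = math.log2(minutes)
        p = 1 / (1 + math.exp(-(a - b * x)))
        ga += (y - p)          # gradient of log-likelihood w.r.t. a
        gb += (y - p) * (-x)   # gradient w.r.t. b
    a += lr * ga / len(tasks)
    b += lr * gb / len(tasks)

# The 50% time horizon is where the logistic crosses 0.5: a - b*log2(t) = 0.
horizon_minutes = 2 ** (a / b)
print(f"Estimated 50% time horizon: {horizon_minutes:.0f} minutes")
```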

[Chart: AI task horizon, log scale. Measured task length vs. a 7-month doubling trend; exponential growth appears as a straight line on a log axis]
The November 2025 data point marks Claude Opus 4.5 completing tasks that take humans roughly five hours. The doubling pace accelerated from every 7 months in 2024 to every ~5 months by mid-2025.
[Chart: AI task horizon, linear scale. The same measured task lengths and 7-month doubling trend without log compression, showing the hockey-stick shape]
The linear scale makes the exponential shape visceral: the last two data points account for most of the total growth. This is what a 9× improvement in 12 months looks like.
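Both views plot the same exponential model, h(t) = h0 · 2^(t/T) with doubling time T. A short sketch evaluating it under the figures quoted in this section (the projection is an extrapolation, not a forecast):

```python
import math

def horizon(h0_hours: float, months: float, doubling_months: float) -> float:
    """Task horizon after `months`, assuming h(t) = h0 * 2^(t / T)."""
    return h0_hours * 2 ** (months / doubling_months)

# Growth over 12 months under the two quoted doubling paces:
for T in (7, 5):
    print(f"T = {T} months: {2 ** (12 / T):.1f}x per year")
# T = 7 months: 3.3x per year
# T = 5 months: 5.3x per year

# Doubling time implied by a 9x gain in 12 months:
print(f"9x in 12 months implies T = {12 / math.log2(9):.1f} months")  # ~3.8

# Projecting forward from a ~5-hour horizon (Nov 2025) at T = 5 months:
print(f"12 months later: ~{horizon(5, 12, 5):.0f}-hour tasks")  # ~26 hours
```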
04 — ARC-AGI-2

The inflection benchmark

ARC-AGI-2 launched in March 2025 with all frontier models below 5%. Within a year, scores surged past the human baseline of 60%. This is the sharpest capability inflection on record.

[Chart: ARC-AGI-2 score progression. Best AI score vs. the 60% human baseline, Mar 2025 – Mar 2026]
ARC-AGI-2 tests abstract reasoning that can't be brute-forced: easy for humans, hard for AI. Progress has been driven by refinement loops and test-time compute. Sources: ARC Prize Foundation, arcprize.org.
What changed: Progress wasn't just larger models — it was application-layer engineering. Refinement loops and structured test-time reasoning pushed scores from near-zero to beyond the human baseline in under 12 months. ARC-AGI-3 (interactive reasoning) launched March 2026 to reset the challenge.
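A minimal sketch of the refinement-loop pattern described above: propose candidate transformations, score them against the task's training pairs, keep the best, and stop on a perfect fit. The grids, the candidate operations, and the random proposal step are hypothetical stand-ins, not any lab's actual method:

```python
import random

# An ARC-style task: training pairs demonstrate a transformation; the goal
# is a rule that maps every training input to its output. Hypothetical data.
Grid = list[list[int]]
train_pairs: list[tuple[Grid, Grid]] = [
    ([[1, 0], [0, 0]], [[0, 0], [0, 1]]),
    ([[0, 2], [0, 0]], [[0, 0], [2, 0]]),
]

def rotate180(g: Grid) -> Grid:
    return [row[::-1] for row in g[::-1]]

def flip_h(g: Grid) -> Grid:
    return [row[::-1] for row in g]

CANDIDATE_OPS = [rotate180, flip_h]

def score(op, pairs) -> float:
    """Fraction of training pairs the candidate transformation reproduces."""
    return sum(op(x) == y for x, y in pairs) / len(pairs)

def refine(pairs, budget: int = 100):
    """Refinement loop: sample candidates, keep the best, stop at a perfect
    fit. Real systems propose candidates with an LLM and mutate the best
    ones; random sampling here is a stand-in for that proposal step."""
    best_op, best_score = None, -1.0
    for _ in range(budget):
        op = random.choice(CANDIDATE_OPS)
        s = score(op, pairs)
        if s > best_score:
            best_op, best_score = op, s
        if best_score == 1.0:
            break
    return best_op, best_score

op, s = refine(train_pairs)
print(op.__name__, s)  # rotate180 fits both training pairs -> score 1.0
```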
05 — Key Findings

What the data shows

Six observations from two years of frontier AI development.

01 — The acceleration is real
Epoch AI confirmed a ~90% acceleration in capability growth rate from April 2024, driven by reasoning models and reinforcement learning becoming standard at frontier labs.
02 — Coding crossed the threshold
SWE-bench went from 4% (2023) to 80%+ (2026). AI can now resolve real GitHub issues at scale. Agentic coding workflows are no longer experimental.
03 — Open-weight models caught up
The gap between closed and open-weight frontier models narrowed from 8.0% in January 2024 to just 1.7% by February 2025, measured as the relative difference in Chatbot Arena Elo scores. DeepSeek, Qwen, and Llama now compete directly.
04 — Competition compressed at the top
The Elo gap between the #1 and #10 ranked models shrank from 11.9% to 5.4%, and the #1 vs. #2 gap from 4.9% to 0.7%. No single lab dominates; the sketch after this list shows what small Elo gaps mean head-to-head.
05 — Benchmarks saturate fast
MMLU, GSM8K, and HumanEval are all effectively saturated at frontier tier. The field shifted to GPQA Diamond, ARC-AGI-2, Humanity's Last Exam, and METR as meaningful measures.
06 — Cost collapsed 30×
GPT-4-level performance cost ~$30/M tokens in 2023. Today it costs under $1/M. Each order-of-magnitude reduction unlocks use cases that were previously economically impossible.
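For intuition on finding 04: under the standard Elo model, a rating gap maps to a head-to-head win probability through a logistic curve with scale 400. The findings above cite percentage differences rather than raw points, so the point gaps below are assumptions for illustration:

```python
def win_probability(elo_gap: float) -> float:
    """P(higher-rated model wins) under the standard Elo model (scale 400)."""
    return 1 / (1 + 10 ** (-elo_gap / 400))

# Illustrative raw gaps (assumed, since the findings cite percentage gaps):
for gap in (10, 50, 100):
    print(f"{gap:>3}-point gap -> {win_probability(gap):.1%} win rate")
# 10-point gap -> 51.4% win rate
# 50-point gap -> 57.1% win rate
# 100-point gap -> 64.0% win rate
```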
06 — Model Timeline

Key releases & inflection points

The models that moved the benchmarks, in order.

Mar 2024
Claude 3 Opus / GPT-4 Turbo
Last generation before the reasoning model era. GPQA ~40%, SWE-bench ~18%.
Jun 2024
GPT-4o
Multimodal as default. Real-time voice. Cost drop cycle begins.
Sep 2024
Claude 3.5 Sonnet
New SWE-bench record. Computer use introduced. SWE-bench ~38%.
Nov 2024
OpenAI o1 + DeepSeek V3
Test-time compute arrives. o1 averages 74% on the AIME (an IMO qualifier) vs. GPT-4o's 12%. DeepSeek V3 signals China's frontier capability.
Mar 2025
ARC-AGI-2 launched
All frontier models score below 5%. Human baseline: 60%. The hockey stick benchmark begins.
May 2025
Claude 4 / GPT-4.5
Reasoning as standard, not premium. GPQA crosses 80%+. SWE-bench ~64%.
Aug–Nov 2025
Gemini 3 / Claude 4.5 / Grok 4
Claude Opus 4.5 completes 5-hour software tasks. ARC-AGI-2 breaks 45%. Gemini 3 hits 100% AIME 2025.
Mar 2026
Claude 4.6 / Gemini 3.1 Pro / GPT-5.4
ARC-AGI-2 at 77–84%+. SWE-bench 80%+. 1M context windows. ARC-AGI-3 (interactive) released.