Can Inorganic Intelligence Do Useful Tasks?

For this research topic, GDPval serves as the framework for defining useful tasks, even though GDP itself may not be the best measure of societal output.

What is GDPval?

GDPval measures real-world, economically valuable professional work, not academic puzzles.

Key properties:

Measures knowledge-work productivity, not raw correctness
Compares AI outputs against those of experienced human professionals
Uses blind expert judging (win / tie / loss)
Focuses on deliverables, not multiple-choice answers

Scale:

44 occupations
9 economic sectors
~1,320 total tasks (~30 per occupation)
Authored by professionals with an average of ~14 years of experience

Occupations Covered (Examples)

GDPval spans representative roles across the economy, including:

Professional & Technical

Software developer
Mechanical engineer
Industrial engineer
Project management specialist
Computer & information systems manager

Finance & Legal

Accountant & auditor
Financial analyst
Financial manager
Personal financial advisor
Lawyer

Healthcare

Registered nurse
Clinical nurse
Healthcare services manager
Medical secretary / admin

Sales, Marketing & Media

Sales manager
Real estate agent / broker
Marketing & sales representatives
Journalist / editor
Producer / director

Operations & Public Services

Compliance officer
Administrative services manager
Manufacturing operations supervisor
Social worker

(44 total roles across 9 sectors)

What GDPval Tasks Look Like

GDPval tasks are real work outputs, not trivia.

Typical deliverables include:

Financial models and forecasts
Accounting spreadsheets and variance analyses
Legal briefs and contract risk summaries
Engineering design documents
Sales decks and pitch presentations
Project plans and operational schedules
Customer support workflows
Editorial articles and media summaries
Healthcare care plans

Tasks often include reference files (spreadsheets, datasets, templates) and require multi-step reasoning.


The Public “Gold Subset”

OpenAI has released a 220-task “gold subset” for public inspection and research.

Purpose:

Transparency
Reproducible evaluation
Representative sampling of the full benchmark

These tasks:

Come from the same 44 occupations
Include full prompts and reference materials
Are used in OpenAI’s Evals framework

Finance & Accounting Example

Auditor Sample Testing & Variance Analysis

Scenario: You are an auditor reviewing an Anti-Financial Crime risk dataset.

Required deliverables:

Select a statistically valid sample (90% confidence)
Perform Q2 vs Q3 variance analysis
Add flags and calculations in new spreadsheet tabs

Skills tested:

Statistics
Spreadsheet manipulation
Professional judgment
Clear documentation
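
The first two deliverables of the auditor task can be sketched in Python as follows. This is a minimal illustration, not the method the benchmark expects: the choice of Cochran's formula with a finite-population correction, the 10% margin of error, the population size, and the metric values are all assumptions made for the example.

```python
import math
from statistics import NormalDist
from typing import Optional, Tuple


def sample_size(population: int, confidence: float = 0.90,
                margin: float = 0.10, p: float = 0.5) -> int:
    """Sample size via Cochran's formula with finite-population correction.

    The formula, the margin of error, and p = 0.5 (the most conservative
    proportion) are assumptions for this sketch.
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Shrink the sample when the population itself is small.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


def quarter_variance(q2: float, q3: float) -> Tuple[float, Optional[float]]:
    """Absolute and percentage change from Q2 to Q3 (None if Q2 is zero)."""
    delta = q3 - q2
    pct = None if q2 == 0 else delta / q2 * 100
    return delta, pct


if __name__ == "__main__":
    # Hypothetical population of 1,500 rows in the risk dataset.
    print("rows to sample:", sample_size(1500))
    # Hypothetical metric that moved from 120 in Q2 to 150 in Q3.
    print("variance:", quarter_variance(120.0, 150.0))
```

A real submission would also need to document why the sampling method was chosen, which is where the professional-judgment part of the task comes in.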

Finance Example

Profit & Loss Report for a Music Tour

Scenario: You are Finance Lead for a touring production company.

Required deliverables:

Consolidate income, costs, and tax data
Build a multi-sheet P&L workbook
Write an executive-level financial summary

Skills tested:

Financial modeling
Data synthesis
Business communication
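
The consolidation step in the tour task can be illustrated with a small Python sketch. The `TourStop` fields, the idea of a withholding tax applied to gross income, and all figures are hypothetical; the real task supplies its own spreadsheets and tax rules.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TourStop:
    city: str
    gross_income: float          # hypothetical: tickets plus merchandise
    costs: float                 # hypothetical: venue, crew, travel
    withholding_tax_rate: float  # assumed: tax withheld on gross income


def profit_and_loss(stops: List[TourStop]) -> Tuple[List[Dict], Dict[str, float]]:
    """Build one P&L row per tour stop plus a totals row."""
    rows = []
    for stop in stops:
        tax = stop.gross_income * stop.withholding_tax_rate
        net = stop.gross_income - tax - stop.costs
        rows.append({
            "city": stop.city,
            "income": stop.gross_income,
            "tax": tax,
            "costs": stop.costs,
            "net": net,
        })
    # Sum each numeric column across all stops.
    totals = {key: sum(row[key] for row in rows)
              for key in ("income", "tax", "costs", "net")}
    return rows, totals


if __name__ == "__main__":
    stops = [
        TourStop("London", 100_000.0, 60_000.0, 0.20),
        TourStop("Paris", 80_000.0, 50_000.0, 0.15),
    ]
    rows, totals = profit_and_loss(stops)
    for row in rows:
        print(row)
    print("totals:", totals)
```

The per-stop rows correspond to the detailed breakdown sheets, and the totals feed the executive summary.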

How Outputs Are Judged

GDPval does not score answers as “correct” or “incorrect”.

Instead:

Expert judges blindly compare AI output vs human output
Judged on:
- Accuracy
- Completeness
- Professional quality
- Usefulness
Results are reported as win / tie rates.

A reported GDPval score of ~70% means:

The model’s output was judged as good as or better than a human professional in ~70% of tasks.
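
The arithmetic behind such a score is simple enough to sketch in Python. The labels `"win"`, `"tie"`, and `"loss"` (from the model's perspective) and the sample judgments are assumptions for illustration; this is not OpenAI's grading code.

```python
from collections import Counter
from typing import Iterable

VALID_LABELS = {"win", "tie", "loss"}


def win_or_tie_rate(judgments: Iterable[str]) -> float:
    """Fraction of tasks where the model's output won or tied."""
    counts = Counter(judgments)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments supplied")
    return (counts["win"] + counts["tie"]) / total


if __name__ == "__main__":
    # Hypothetical blind judgments for ten tasks.
    judgments = ["win"] * 5 + ["tie"] * 2 + ["loss"] * 3
    print(f"win-or-tie rate: {win_or_tie_rate(judgments):.0%}")
```

Counting ties together with wins is what makes the headline number a parity measure: it says how often the model was at least as good as the professional, not how often it was strictly better.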

Why GDPval Matters

GDPval attempts to capture something traditional benchmarks miss:

Real economic value
Professional judgment
Output quality, not token accuracy
Productivity without direct labor input

It’s designed to answer:

Can this model actually do economically useful work?

Summary

GDPval evaluates real-world knowledge work
Covers 44 occupations, 9 sectors
Uses expert-authored tasks and expert judging
Includes 1,320 total tasks, with 220 publicly released
GPT-5.2’s reported ~70% score reflects how often its output was judged as good as or better than a professional’s, not exam performance

Appendix

The example tasks below come from the 220-task gold subset of OpenAI’s GDPval benchmark, the portion that has been open-sourced so researchers can inspect real prompts and reference files.

Finance & Accounting

Auditor Sample Testing & Variance Analysis

Scenario
You’re an auditor reviewing a spreadsheet of Anti-Financial Crime risk metrics.

Deliverables include:

Choose a statistically justified sample at 90% confidence
Perform a Q2 vs Q3 variance analysis
Add sample flags and calculations in new spreadsheet tabs

The task gives clear instructions and expects structured output.

Profit and Loss Report for a Music Tour

Scenario
You’re Finance Lead for a production company’s fall tour.

Deliverables include:

Consolidate income / cost / tax data across multiple sheets
Build a P&L Excel workbook with detailed breakdowns
Summarise results for executive review

Requires both numerical analysis and professional report structure.

Professional Services / Consulting

Many tasks require multi-step deliverables such as slide decks or written outputs based on real data and business context (not all prompts are publicly visible).

Examples include:

Create market competitor landscape presentations
Formulate strategic recommendations based on provided datasets

These combine analysis + narrative + structured formatting.

Where to Explore the Full Gold Subset

OpenAI’s GDPval gold subset (220 tasks with prompts and reference files) is available via:

GDPval is designed to answer: Can this system actually do economically useful work at a professional level?