Open research

AGI Progress Signal Map.

A composite estimate of AGI readiness, computed as a weighted geometric mean across science-first signal categories: reasoning, long-horizon agency, coding and R&D leverage, multimodal grounding, verification, and deployment readiness. The geometric mean is deliberate — weak bottlenecks (agency, reliability) pull the composite down even when benchmark scores are high. This is curated public-signal strength and editorial readiness, not measured AGI attainment or a probability forecast.


AGI progress signal

Percent readiness, not a date forecast.

Public signal strength across AGI bottlenecks. The solid line is evidence through 2026; dashed lines are scenario overlays to 2032.

Current

62%

Uneven readiness

Evidence views: Overview, Steady, R&D acceleration, Verification-gated

General cognitive AGI estimate

General AGI readiness: 63%

Cognitive and digital AGI: broad intellectual work at a strong human level, transfer across domains, tool use, multi-step planning, and enough reliability for real deployment. This is not superintelligence, full human replacement, or embodied robotics.

Readiness

63.4%

Rounded to 63%; uncertainty 58–68%

Why 63%

Reasoning, coding, multimodal work, and economic knowledge tasks are already strong enough to make 55% too low.

Why not 70%+

Long-horizon agency, reliability, calibration, and robust real-world execution remain the binding constraints.

Science-first discount

A science-first AGI estimate falls closer to 53% because verification, reproducibility, scientific autonomy, and lab deployment dominate the weighting.

Readiness method

Weighted geometric mean

The geometric mean keeps weak bottlenecks visible: agency and reliability pull the composite down even when benchmark and coding scores are high.

100 * 0.72^0.18 * 0.71^0.18 * 0.76^0.14 * 0.70^0.10 * 0.55^0.18 * 0.48^0.16 * 0.55^0.06 = 63.4%
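
For readers who want to reproduce that figure, here is a minimal Python sketch of the same calculation. The scores and weights mirror the readiness table below; the variable names are illustrative and not part of any published Scivity tooling.

import math

# Category scores (0-100) and weights from the readiness table below.
# Order: reasoning, economic knowledge work, coding & tool use,
# multimodal/context, long-horizon agency, reliability/calibration, deployment.
scores  = [72, 71, 76, 70, 55, 48, 55]
weights = [0.18, 0.18, 0.14, 0.10, 0.18, 0.16, 0.06]

assert abs(sum(weights) - 1.0) < 1e-9  # weights must sum to 1

# Weighted geometric mean on the 0-1 scale, rescaled to a percentage.
composite = 100 * math.prod((s / 100) ** w for s, w in zip(scores, weights))
print(f"{composite:.1f}%")  # 63.4%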

General reasoning & knowledge

Strong GPQA, AIME, and ARC-AGI-1 progress; ARC-AGI-2 and HLE still leave headroom.

w 18%

72/100

Economic knowledge work

GDPval shows frontier models approaching expert-level work products across many occupations.

w 18%

71/100

Coding & tool use

SWE-bench Verified and related tool-use evals are among the most mature public signals.

w 14%

76/100

Multimodal/context handling

Vision, long-context, document, and screen workflows are much stronger, but still fall short of universal world-modeling.

w 10%

70/100

Long-horizon agency

The main bottleneck: autonomous tasks are still short, scaffolded, or well-specified.

w 18%

55/100

Reliability/calibration/safety

Hallucination, overconfidence, brittle behavior, and hard-to-detect errors still limit delegation.

w 16%

48/100

Deployment / real-world integration

Productivity gains are already widespread, but full autonomous replacement of roles is not.

w 6%

55/100

The percentage reflects curated public-signal strength and editorial readiness, not measured AGI attainment or a probability.

Benchmark signals

The evaluations behind the bars.

Each readiness bar above aggregates several public benchmark signals. The records below name the evaluations actually used for scoring — what they measure, the latest tracked result, and why they matter for autonomous science.

Sources for each benchmark are linked inline. Scores update with the dataset; see the methodology section for weighting.

Apr 2026

General model intelligence

Artificial Analysis Intelligence Index

Composite score across agents, coding, science, reasoning, knowledge, and instruction following

Provides a production-oriented view of which frontier models are strong enough to act as reasoning engines inside scientific agents.

Current signal

Live leaderboard score, provider, price, speed, latency, and context-window tracking

Apr 2026

Capability trends

Epoch AI Capabilities

Benchmark results across 40+ evaluations, with internal and external result provenance

Useful for seeing whether scientific reasoning, agentic work, math, coding, and multimodal capabilities are improving fast enough to change lab workflows.

Current signal

Tracks frontier model progress by benchmark, model, organization, and task category

Mar 2026

Mathematics

FrontierMath

Accuracy on extremely difficult math problems and open-problem variants

High-end mathematical reasoning is one of the cleanest proxies for whether models can contribute to formal scientific discovery.

Current signal

Used by Epoch AI as a frontier reasoning signal

Apr 2026

Scientific reasoning

GPQA Diamond

Expert-level graduate science multiple-choice accuracy

Directly probes PhD-level physics, chemistry, and biology reasoning, though it remains a static question-answer benchmark.

Current signal

Tracked as a core scientific reasoning benchmark

Mar 2026

Scientific coding

SciCode

Pass rate on scientific programming tasks

Measures whether models can turn scientific specifications into executable code, a core dependency for autonomous analysis and simulation.

Current signal

Included in Artificial Analysis Intelligence Index v4

Apr 2026

Agentic software engineering

SWE-bench Verified

Resolution rate on real GitHub issues

Software-engineering agents are a leading indicator for whether models can operate long-horizon scientific toolchains and repair failed experiments.

Current signal

Tracked by Epoch AI as a hard agentic benchmark

Mar 2026

Long-horizon agents

APEX-Agents

Pass@1 task completion in realistic multi-application workflows

Lab automation requires agents that coordinate files, tools, state, and multi-step objectives rather than answering isolated prompts.

Current signal

Added to the Epoch Capabilities Index in 2026

Mar 2026

Cross-domain expert reasoning

Humanity's Last Exam

Accuracy on hard expert-written questions

A broad stress test for frontier models, useful only when interpreted alongside domain-specific science benchmarks and tool-use evaluations.

Current signal

Used as an unsaturated frontier capability signal

Mar 2025

Bioinformatics agents

BixBench

Open-answer accuracy on real-world bioinformatics analysis scenarios

Measures whether agents can explore biological datasets, run multi-step analyses, and interpret results rather than only answer static science questions.

Current signal

Public benchmark with 53 analysis scenarios and 296 questions for agentic computational-biology workflows

May 2026

Biology research tasks

LABBench2

Performance across 1,892 practical biology-research tasks spanning literature, databases, sequences, protocols, patents, trials, and source quality

Moves biology-agent evaluation toward practical research work, retrieval, file handling, and tool use rather than short-form knowledge recall.

Current signal

Open dataset and harness with published model comparisons across 11 task families

Apr 2026

Bioinformatics agents

BioMysteryBench

Accuracy and reliability on 99 expert-level bioinformatics tasks with objective ground truth from real experimental data

Tests whether frontier agents can produce reproducible scientific conclusions from messy biological data, including problems not solved by expert panels.

Current signal

Anthropic reports Claude-family and expert baselines, including separate human-solvable and human-difficult task sets

Apr 2026

Genomics and quantitative biology agents

GeneBench

Pass rate on 103 multi-stage scientific data analysis tasks across 10 genomics and quantitative biology domains

Probes whether agents can clean assay or clinical data, run exploratory analysis, select statistical models, and produce conclusions that inform downstream scientific decisions.

Current signal

Frontier models score roughly 25–33% on tasks that often correspond to multi-day projects for expert computational biologists

Jul 2024

Biology

LAB-Bench

Task accuracy and open-response accuracy

Measures biology research-agent skills across literature QA, table/figure reasoning, protocols, databases, sequences, and cloning scenarios.

Current signal

Human experts score 0.70–1.00 across many subtasks; frontier models remain uneven.

Sep 2024

Scientific literature

LitQA2 / PaperQA2

Precision, accuracy, DOI recall, and contradiction-detection AUC

Evaluates retrieval-grounded scientific literature QA, synthesis, and contradiction detection against human experts.

Current signal

PaperQA2 precision 85.2% and accuracy 66.0% on LitQA2.

Dec 2025

Biology

FoldBench

DockQ AUC, DockQ success rate, LDDT, ligand success rate

Independent benchmark of all-atom biomolecular structure prediction across protein-ligand, protein-protein, antibody, and nucleic-acid tasks.

Current signal

AlphaFold 3 leads most measured all-atom structure-prediction tasks.

Nov 2023

Materials

GNoME materials discovery benchmark

New stable crystal structures on the updated convex hull

Measures AI-assisted inorganic materials discovery at the scale of DFT-verified stable structures.

Current signal

GNoME reports 381,000 newly stable convex-hull entries from 2.2M candidate structures.

May 2025

Algorithmic science

AlphaEvolve algorithm discovery tasks

Objective-specific best-known construction score

Measures autonomous code-evolution workflows on verifiable mathematical and scientific optimization objectives.

Current signal

AlphaEvolve found a 48-multiplication algorithm for 4×4 complex matrix multiplication, improving on Strassen's 49.

Jun 2025

Genomics

AlphaGenome regulatory genomics benchmarks

Number of benchmark tasks with state-of-the-art performance

Aggregates genome-track prediction and regulatory variant-effect prediction measurements against external genomics baselines.

Current signal

AlphaGenome achieved SOTA on 22 of 24 track-prediction tasks and 25 of 26 variant-effect tasks.

Methodology

How the composite is built.

Weighting

Each signal category is scored 0–100 and weighted by its contribution to general cognitive AGI: reasoning 18%, economic knowledge work 18%, coding and tool use 14%, multimodal/context 10%, long-horizon agency 18%, reliability and calibration 16%, and deployment integration 6%. Weights sum to 100. The science-first view discounts the composite further because verification, reproducibility, and lab deployment dominate trustworthy autonomy.

Aggregation

Categories combine as a weighted geometric mean, not an arithmetic average. A weak bottleneck cannot be hidden by a strong score elsewhere: 90 in reasoning and 30 in agency is not the same as 60 in both. Geometric mean keeps the binding constraint visible.
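
A toy two-category comparison makes that property concrete. Equal weights and made-up values are assumed here, so this is an illustration rather than part of the published seven-category composite.

import math

# Arithmetic vs geometric mean for two equally weighted categories.
# Values are illustrative only, not dataset scores.
def arithmetic_mean(a, b):
    return (a + b) / 2

def geometric_mean(a, b):
    return math.sqrt(a * b)

print(arithmetic_mean(90, 30), round(geometric_mean(90, 30), 1))  # 60.0 52.0
print(arithmetic_mean(60, 60), round(geometric_mean(60, 60), 1))  # 60.0 60.0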

Score derivation

Each category score is curated from public benchmark results, agent evaluation suites, deployment metrics, and qualitative evidence. The benchmarks above list the specific evaluations behind each signal. Scenario overlays (Steady, R&D acceleration, Verification-gated) are illustrative, not probabilistic.

Verification sits lowest in the composite. That gap is the one Scivity exists to close — see Verification for what we ship against it.

Source policy

Every benchmark record carries explicit source citations. Source tiers (A regulatory/peer-reviewed, B official/preprint, C industry/media, D rumor/social) and editorial policy are documented in the Landscape methodology.

What this is not

This map is not a probability that AGI arrives by a given date. It does not predict capabilities, take a position on intelligence definitions, or claim measured attainment. Treat it as a structured way to read the field, not a forecast.

Open data

Free to reuse, with attribution.

License

AGI Progress Signal Map is published by Scivity Labs under CC BY 4.0. You may reuse, remix, and republish with attribution.

Cite as

Scivity Labs (2026, May). AGI Progress Signal Map [Dataset]. scivity.org/agi-progress
BibTeX
@misc{scivity_agi_progress_2026,
  author       = {{Scivity Labs}},
  title        = {AGI Progress Signal Map},
  year         = {2026},
  month        = may,
  howpublished = {\url{https://scivity.org/agi-progress}},
  note         = {Dataset. CC BY 4.0}
}