The Life Sciences AI Handbook: Steering Frontier Models in Biology

Tegomoh, Bryan

Benchmarks for Bio AI

Published

July 7, 2026

Benchmarks are social infrastructure for scientific claims. A good benchmark narrows the space of plausible claims; it does not settle all uses of a model. CASP, CAMEO, PoseBusters, MoleculeNet, scIB, OpenProblems, Therapeutics Data Commons, and the Arc Institute Virtual Cell Challenge together illustrate distinct benchmark roles: blinded community assessment, continuous server evaluation, validity-aware metrics, shared splits, multi-task evaluation, and prospective evaluation. Dangerous-capability evaluations add a second question: whether a model materially lowers the barrier to misuse. The shared discipline is that a leaderboard is a filter, not a validation plan.

Learning Objectives

Use this chapter to:

See how benchmarks convert broad AI-biology claims into measurable tasks that can be compared, criticized, and improved.
Check whether a benchmark's split, metric, leakage controls, task definition, and biological endpoint match the claim being made; a benchmark is only useful when they do.

Prerequisites: Evaluation Principles for Life Sciences AI for the credibility hierarchy.

Chapter Summary (TL;DR)

Summary: This chapter shows how benchmarks convert broad AI-biology claims into measurable tasks that can be compared, criticized, and improved. Some benchmark cultures are mature, especially structure prediction, while newer areas still need better blind tests, prospective evaluation, and dangerous-capability evaluation.

Key point: A benchmark is only useful if its split, metric, leakage controls, task definition, and biological endpoint match the claim. Open question: whether benchmark results predict prospective experimental value rather than leaderboard rank.

Bottom line: Benchmarks connect methods across the handbook by giving molecular, cellular, therapeutic, and automation claims a common evidence discipline.

Field Guide

What is this field trying to solve? Benchmarking aims to convert broad AI-biology claims into measurable tasks that can be compared, criticized, and improved.

What is the core idea? A benchmark is only useful if its split, metric, leakage controls, task definition, and biological endpoint match the claim.

What is the current state of the field? Some benchmark cultures are mature, especially structure prediction, while newer areas still need better blind tests and prospective evaluation.

What do we know, and what remains open? Known reference points include CASP, CAMEO, PoseBusters, MoleculeNet, Therapeutics Data Commons, scIB, OpenProblems, the Virtual Cell Challenge, DOME, WMDP, LAB-Bench, uplift studies, and calibration metrics. What remains open is whether benchmark results predict prospective experimental value rather than only leaderboard rank, and whether safety evaluations track real access to unsafe capability.

Why does this matter? Benchmarks connect methods across the handbook by giving molecular, cellular, therapeutic, and automation claims a common evidence discipline.

Introduction

CASP, CAMEO, MoleculeNet, and PoseBusters each illustrate a different benchmark role: blinded community assessment, continuous server evaluation, shared molecular datasets, and physical validity checks (Kryshtafovych et al., 2024; Haas et al., 2018; Wu et al., 2018; Buttenschoen et al., 2024). The single-cell side adds scIB and OpenProblems; the therapeutic-discovery side adds the Therapeutics Data Commons; the cell-perturbation side adds the Virtual Cell Challenge, described in a peer-reviewed Cell article as a benchmark frame for virtual-cell claims (Roohani et al., 2025). AI-bio safety evaluation adds another class: benchmarks and human-uplift studies that test whether a model changes access to hazardous knowledge or practical biological capability.

What is demonstrated?

Demonstrated capability includes benchmark-driven progress in protein structure prediction and increasingly strict evaluation of molecular docking and generation. CASP documents categories beyond single-chain structure, including complexes, RNA, and ligand binding (Kryshtafovych et al., 2024). CAMEO complements biennial CASP cycles with continuous blind server evaluation against newly released protein structures (Haas et al., 2018). PoseBusters demonstrated that physically invalid poses can pass simpler docking metrics (Buttenschoen et al., 2024). GuacaMol standardised goal-directed and distribution-learning benchmarks for de novo molecular design (Brown et al., 2019). scIB and OpenProblems demonstrated community-driven benchmarking for single-cell AI (Luecken et al., 2022; Luecken et al., 2025). Therapeutics Data Commons demonstrated multi-task therapeutic discovery benchmarking (Huang et al., 2022).
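The PoseBusters point, that a pose can pass a distance metric while being physically invalid, can be made concrete with a toy calculation. The sketch below is not the PoseBusters implementation. The `Pose` record, the 2 Å RMSD cutoff, and the single `physically_valid` flag are illustrative assumptions standing in for a real suite of geometry and clash checks, and the poses are made up.

```python
from dataclasses import dataclass


@dataclass
class Pose:
    """Hypothetical record for one predicted ligand pose."""
    rmsd_angstrom: float     # RMSD to the reference pose
    physically_valid: bool   # stand-in for a full set of validity checks


def success_rate(poses, rmsd_cutoff=2.0, require_valid=False):
    """Fraction of poses within the RMSD cutoff, optionally also valid."""
    if not poses:
        raise ValueError("no poses to score")
    hits = 0
    for pose in poses:
        ok = pose.rmsd_angstrom <= rmsd_cutoff
        if require_valid:
            ok = ok and pose.physically_valid
        hits += int(ok)
    return hits / len(poses)


# Made-up poses for illustration only.
poses = [
    Pose(1.2, True),
    Pose(1.8, False),  # near the reference, but fails the validity stand-in
    Pose(0.9, True),
    Pose(3.5, True),   # valid geometry, but too far from the reference
]

rmsd_only = success_rate(poses)
rmsd_and_valid = success_rate(poses, require_valid=True)
print(f"RMSD-only success:       {rmsd_only:.2f}")
print(f"RMSD + validity success: {rmsd_and_valid:.2f}")
```

On these invented poses the RMSD-only rate is 0.75 and the validity-aware rate is 0.50. The gap between the two numbers is the quantity a single-metric leaderboard would hide.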

Evidence Anchor	What It Supports	Practical Constraint
CASP and CAMEO	Structure prediction assessment	Tasks evolve as methods improve
MoleculeNet and TDC	Molecular and therapeutic benchmarks	Dataset splits shape conclusions
GuacaMol	Generative molecule evaluation	Benchmark reward functions can be optimised without improving make-test value
PoseBusters	Physical validity in docking evaluation	One metric can hide failure
scIB and OpenProblems	Single-cell community benchmarks	Coverage extends as more tasks land
Virtual Cell Challenge	Blinded cell-response benchmark	Recent; community discipline is emerging
WMDP	Proxy hazardous-knowledge benchmark for biosecurity, cybersecurity, and chemical security	Public proxy; not an end-to-end misuse demonstration
LAB-Bench	Practical language-agent tasks for biology research	Useful biology capability can overlap with dual-use capability
RAND / OpenAI uplift studies	Human-with-model versus baseline evaluation pattern	Early results are model- and task-specific; monitoring must continue

Safety and dangerous-capability evaluation

Beneficial-capability benchmarks ask whether a model helps with a scientific task. Dangerous-capability evaluations ask a different question: whether a model, model scaffold, or tool connection materially lowers the barrier to harmful biological use. The Weapons of Mass Destruction Proxy (WMDP) benchmark is a public proxy for hazardous knowledge in biosecurity, cybersecurity, and chemical security, filtered to avoid directly releasing sensitive operational content (Li et al., 2024, preprint). LAB-Bench measures practical biology-research capabilities such as literature reasoning, protocol planning, database navigation, and sequence manipulation; those are useful research skills, but they become safety-relevant when they overlap with cloning, protocol execution, or agentic lab workflows (Laurent et al., 2024, preprint).

Uplift studies test the marginal effect of model access by comparing human performance with and without the model. RAND’s red-team study found no statistically significant difference in the viability of biological attack plans generated with or without LLM assistance for the systems tested, while noting that future capability movement still requires monitoring (Mouton et al., 2024). OpenAI’s early-warning study found at most mild GPT-4 uplift on biological threat-creation tasks and framed the result as a starting point for continued evaluation rather than a final risk estimate (OpenAI, 2024).
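The with-versus-without comparison behind an uplift study can be sketched as a two-sided permutation test on the difference in mean scores. This is a minimal illustration, not the analysis used by RAND or OpenAI. The scores, the group names, and the choice of a permutation test are all assumptions, and a real study would also need pre-registration, blinded graders, and safety review.

```python
import random
from statistics import mean


def permutation_test(control, treated, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = mean(treated) - mean(control)
    pooled = list(control) + list(treated)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle group labels and recompute the difference in means.
        rng.shuffle(pooled)
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up grader scores on a benign proxy task (hypothetical data).
baseline_scores = [3.0, 4.0, 2.5, 3.5, 4.0, 3.0]    # participants without the model
with_model_scores = [3.5, 4.0, 3.0, 4.5, 3.5, 3.5]  # participants with the model

uplift, p_value = permutation_test(baseline_scores, with_model_scores)
print(f"mean uplift: {uplift:.3f}, permutation p-value: {p_value:.3f}")
```

The estimate is specific to the task, the participants, and the model tested. A non-significant result at one point in time is not evidence that later systems provide no uplift.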

For life-sciences teams, the practical rule is direct: do not treat scientific benchmark performance as a release-safety decision. A model can perform well on CASP, TDC, or LAB-Bench and still require separate evaluation for misuse uplift, sensitive protocol assistance, model-weight security, and tool-mediated lab access. Safe-proxy TEVV (test, evaluation, verification, and validation) belongs in the benchmark toolkit because it tests capability boundaries without publishing operational recipes for misuse; the information-hazards chapter explains that disclosure layer.

What is theoretical?

Theoretical capability includes prospective discovery benchmarks where models choose experiments and are judged by cost-adjusted learning. This is the right direction for many life sciences tasks, but it is more expensive than static benchmark release. Cost-aware benchmarks that rank methods by expected discovery yield per dollar are emerging in materials and chemistry; broader adoption in biology requires institutional and economic alignment.
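As a rough illustration of ranking by discovery yield per dollar, the sketch below compares two hypothetical screening campaigns. The `Campaign` fields, the method names, and every number are invented for the example, and "confirmed hits per 1,000 USD" is only one of many possible cost-adjusted scores.

```python
from dataclasses import dataclass


@dataclass
class Campaign:
    """Hypothetical summary of one prospective screening campaign."""
    method: str
    confirmed_hits: int
    experiments_run: int
    cost_per_experiment_usd: float

    @property
    def total_cost_usd(self):
        return self.experiments_run * self.cost_per_experiment_usd

    @property
    def hits_per_1000_usd(self):
        return 1000 * self.confirmed_hits / self.total_cost_usd


# Made-up campaigns: the larger screen finds more hits but costs far more.
campaigns = [
    Campaign("model_guided", confirmed_hits=12, experiments_run=96,
             cost_per_experiment_usd=50.0),
    Campaign("random_screen", confirmed_hits=15, experiments_run=384,
             cost_per_experiment_usd=50.0),
]

ranked = sorted(campaigns, key=lambda c: c.hits_per_1000_usd, reverse=True)
for c in ranked:
    print(f"{c.method}: {c.confirmed_hits} hits, "
          f"${c.total_cost_usd:,.0f}, {c.hits_per_1000_usd:.2f} hits per $1,000")
```

In this invented case the random screen wins on raw hit count (15 versus 12), while the model-guided campaign wins on yield per dollar (2.50 versus about 0.78 hits per 1,000 USD). Which ranking matters depends on the decision the benchmark is meant to inform.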

Theoretical capability also includes durable bio-uplift benchmarks that stay diagnostic as models become better at tool use, long-horizon planning, and protocol execution. Static multiple-choice tests saturate or leak. Human-uplift studies are slower and require careful safety review. Agentic evaluations need sandboxed tools, safe biological proxies, and pre-registered red lines for what cannot be disclosed publicly.

What is beyond current capability?

A universal biological benchmark that ranks all models is beyond current capability. Biological tasks differ too much in ground truth, cost, and acceptable error. A single ranking that bridges protein structure, cellular perturbation, drug discovery, and clinical translation is incompatible with the heterogeneity of biological evaluation.

What would make this more promising?

Benchmarks become more promising when results predict prospective experimental value, not only leaderboard rank. Stronger benchmark evidence would connect blinded results, biology-aware splits, failure categories, and cost-adjusted learning to the decision the model will change.
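One simple form of a biology-aware split keeps every member of a related group on the same side of the train/test boundary, so that close relatives of test items never appear in training. The sketch below assumes each record already carries a group label. The `family` field here is hypothetical and stands in for whatever grouping fits the claim, such as a sequence-identity cluster, a chemical scaffold, a donor, or a cell line.

```python
import random
from collections import defaultdict


def group_holdout_split(records, group_key, test_fraction=0.25, seed=0):
    """Assign whole groups to train or test so that no group spans both."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    target_test_size = test_fraction * len(records)
    train, test = [], []
    for group_id in group_ids:
        # Fill the test set group by group until it reaches the target size.
        bucket = test if len(test) < target_test_size else train
        bucket.extend(groups[group_id])
    return train, test


# Made-up records with a hypothetical family label.
records = [
    {"id": "p1", "family": "kinase"},
    {"id": "p2", "family": "kinase"},
    {"id": "p3", "family": "protease"},
    {"id": "p4", "family": "protease"},
    {"id": "p5", "family": "gpcr"},
    {"id": "p6", "family": "gpcr"},
    {"id": "p7", "family": "transporter"},
    {"id": "p8", "family": "transporter"},
]

train, test = group_holdout_split(records, group_key="family")
train_families = {r["family"] for r in train}
test_families = {r["family"] for r in test}
print("train families:", sorted(train_families))
print("test families: ", sorted(test_families))
print("overlap:", train_families & test_families)
```

A random per-record split on the same data would usually place members of one family on both sides, which tends to inflate apparent performance when the intended use is generalization to unseen families.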

What should researchers, biotech teams, funders, and program leaders do with this?

Use benchmarks to reject claims, not only to support them.
Prefer splits that reflect intended use.
Report failure categories beside average metrics.
Hold back prospective tests when the field is likely to overfit public leaderboards.
Separate beneficial-capability benchmarks from dangerous-capability evaluations before releasing weights, agents, protocols, or connected lab tools.
Use safe-proxy TEVV for dual-use design claims instead of evaluating operationally sensitive sequences or procedures directly.
Cite the specific benchmark (CASP15, MoleculeNet, scIB, TDC), and include the version or round whenever results differ between versions.
Cross-validate vendor claims against the relevant community benchmark before integrating a tool.
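The recommendation above to report failure categories beside average metrics can be sketched in a few lines. The result records and the category names are hypothetical; a real benchmark would define its own failure taxonomy before looking at results.

```python
from collections import Counter


def summarize(results):
    """Return the overall pass rate and a count of failures by category."""
    if not results:
        raise ValueError("no results to summarize")
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    failures = Counter(r["failure"] for r in results if not r["passed"])
    return pass_rate, failures


# Made-up evaluation results with hypothetical failure categories.
results = [
    {"case": "t1", "passed": True, "failure": None},
    {"case": "t2", "passed": False, "failure": "steric_clash"},
    {"case": "t3", "passed": True, "failure": None},
    {"case": "t4", "passed": False, "failure": "steric_clash"},
    {"case": "t5", "passed": False, "failure": "wrong_binding_site"},
]

pass_rate, failures = summarize(results)
print(f"pass rate: {pass_rate:.2f}")
for category, count in failures.most_common():
    print(f"  {category}: {count}")
```

Two methods with the same pass rate can fail in very different ways, and the category counts are often what determines whether a tool is acceptable for a given use.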