The Life Sciences AI Handbook: Steering Frontier Models in Biology

Tegomoh, Bryan

Evaluation Principles for Life Sciences AI

Published

July 7, 2026

A biology AI claim is only as strong as the evaluation that produced it. Most disappointments in the field, including ones that have cost programs and shaped public skepticism, trace to evaluations that look rigorous in a paper but do not match the conditions under which the model is then used. The fix is not more metrics; it is a small number of evaluation choices that respect biology rather than treat it as generic tabular data.

Learning Objectives

Use this chapter to:

Separate real biological capability from benchmark artifacts, weak splits, leakage, poor calibration, and retrospective fit.
Match the evaluation design to the biological question, recognizing that sequence-family splits, donor splits, scaffold splits, tissue splits, time splits, and prospective validation answer different claims.

Chapter Summary (TL;DR)

Summary: Separate real biological capability from benchmark artifacts, weak splits, leakage, poor calibration, and retrospective fit. Blind benchmarks and biology-aware splits are widely accepted in some areas, but many newer model classes still rely on narrow or self-reported evaluations.

Key point: The evaluation design has to match the biological question: sequence-family splits, donor splits, scaffold splits, tissue splits, time splits, and prospective validation answer different claims. Open question: whether prospective, decision-aware validation becomes routine outside a few mature benchmark cultures.

Bottom line: Evaluation is the bridge between all domains in the handbook because every model class eventually has to survive a decision-relevant test.

Field Guide

What is this field trying to solve? Separate real biological capability from benchmark artifacts, weak splits, leakage, poor calibration, and retrospective fit.

What is the core idea? The evaluation design has to match the biological question: sequence-family splits, donor splits, scaffold splits, tissue splits, time splits, and prospective validation answer different claims.

What is the current state of the field? Blind benchmarks and biology-aware splits are widely accepted in some areas, but many newer model classes still rely on narrow or self-reported evaluations.

What do we know, and what remains open? Known reference points include CASP, CAMEO, PoseBusters, MoleculeNet, Therapeutics Data Commons, OpenProblems, scIB, DOME, calibration metrics, and prospective validation studies. What remains open is whether prospective, decision-aware validation becomes routine outside a few mature benchmark cultures.

Why does this matter? Evaluation is the bridge between all domains in the handbook because every model class eventually has to survive a decision-relevant test.

Introduction

Evaluation in life sciences AI must respect biology, not only machine-learning convention. A model can score well on a random hold-out from a public dataset and fail completely the first time it sees an unfamiliar protein family, an unfamiliar scaffold, or an unfamiliar assay protocol. The cost of that failure is not abstract: it is a months-long internal program betting on an external claim, or a regulatory submission whose performance claims fall apart under external audit.

The discipline of evaluation in biology has been worked out over thirty years by specific communities solving specific problems. The structural-biology community built CASP starting in 1994; the docking community built PoseBusters in 2024 because RMSD had been over-trusted for two decades; the molecular-property-prediction community built MoleculeNet in 2018 to standardise scaffold splits across a long-fragmented field; the broader biology-ML community published DOME in 2021 as a reporting standard. The principles those communities converged on are remarkably consistent: blind beats self-reported, biology-aware splits beat random splits, validity checks beat single metrics, calibration beats accuracy alone, prospective beats retrospective, reproducibility beats reputation.

The rest of this chapter walks through each principle and then assembles them into an operational checklist. The intent is to give a non-specialist reader, a vendor evaluator, or a program lead a framework that travels across subdomains, so that the same questions apply whether the model in question is for protein structure, small-molecule generation, single-cell perturbation, or clinical-trial enrolment.

What is demonstrated?

Biology-aware splits

The single most common evaluation failure in life sciences ML is using a random train/test split when a biology-aware split would be appropriate. Random splits assume independent and identically distributed examples; biological data are rarely either.

Scaffold splits group molecules by their Bemis-Murcko scaffold (the ring systems and the linkers that connect them, with side chains removed) and assign whole scaffolds to either train or test. MoleculeNet (Wu et al., 2018) made scaffold splits the de facto standard for benchmarking molecular property prediction, after a decade in which random splits had systematically inflated reported performance because closely related analogues sat on both sides of the split.

This is not only a cheminformatics convention. Wallach and Heifets showed that many ligand-based classification benchmarks favor models that recognize chemotypes already present in training rather than models that generalize to new chemical series (Wallach and Heifets, 2018). Any claim about compound generalization should therefore disclose the split axis and the nearest-neighbor relationship between test compounds and the training set.
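
One way to disclose the nearest-neighbor relationship is to report, for each test compound, its highest similarity to any training compound. The sketch below is a minimal illustration, assuming fingerprints have already been computed by a cheminformatics toolkit and are represented as sets of "on" bit indices. The function names, the bit-set representation, and the toy fingerprints are all hypothetical.

```python
from typing import Dict, FrozenSet, Iterable

# Assumed representation: a fingerprint is the set of its "on" bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two bit sets."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_train_similarity(
    test: Dict[str, Fingerprint], train: Iterable[Fingerprint]
) -> Dict[str, float]:
    """For each test compound, the highest similarity to any training compound."""
    train_list = list(train)
    return {
        name: max((tanimoto(fp, t) for t in train_list), default=0.0)
        for name, fp in test.items()
    }


# Toy, made-up fingerprints for illustration only.
train_fps = [frozenset({1, 2, 3, 4}), frozenset({10, 11, 12})]
test_fps = {
    "close_analogue": frozenset({1, 2, 3, 5}),  # shares 3 bits with a training compound
    "new_series": frozenset({20, 21, 22}),      # shares no bits with training
}
report = nearest_train_similarity(test_fps, train_fps)
```

A distribution of these values skewed toward 1.0 suggests the test set mostly probes chemotypes already seen in training; reporting performance separately for low-similarity test compounds makes the generalization claim easier to judge.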

Time splits assign examples by date, with older data in training and newer data in test, simulating the prospective situation where a model must extrapolate to future compounds, new structural classes, or new assay reagents.
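
A time split can be sketched in a few lines. The example below assumes each record carries a measurement date; the `Record` type, its field names, and the cutoff date are hypothetical choices for illustration.

```python
from datetime import date
from typing import List, NamedTuple, Sequence, Tuple


class Record(NamedTuple):
    compound_id: str   # hypothetical identifier
    measured_on: date  # date the measurement was made


def time_split(
    records: Sequence[Record], cutoff: date
) -> Tuple[List[Record], List[Record]]:
    """Records measured before the cutoff train the model; the rest test it."""
    train = [r for r in records if r.measured_on < cutoff]
    test = [r for r in records if r.measured_on >= cutoff]
    return train, test


# Made-up records for illustration only.
records = [
    Record("cpd-001", date(2019, 3, 1)),
    Record("cpd-002", date(2020, 7, 15)),
    Record("cpd-003", date(2022, 1, 10)),
    Record("cpd-004", date(2023, 5, 20)),
]
train_set, test_set = time_split(records, cutoff=date(2021, 1, 1))
```

The cutoff should be fixed before any model selection takes place, so that nothing measured after it influences training or hyperparameter choices.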

Sequence-family or homology splits assign protein sequences to train or test so that no test sequence shares more than a chosen identity threshold (commonly 30%) with any training sequence. Without this discipline, a model trained on one member of a protein family is being tested on near-relatives, which inflates reported generalisation.
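
One common way to enforce an identity threshold is to cluster sequences first and then assign whole clusters to train or test. The sketch below is a minimal single-linkage clustering with union-find, assuming pairwise identities have already been computed by an alignment tool and are supplied as `(id_a, id_b, identity)` tuples. The function name, the input format, and the toy values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def cluster_by_identity(
    ids: List[str],
    pairs: Iterable[Tuple[str, str, float]],
    threshold: float = 0.30,
) -> List[Set[str]]:
    """Group sequences so that any pair above the threshold shares a cluster."""
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, identity in pairs:
        if identity > threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters: Dict[str, Set[str]] = defaultdict(set)
    for i in ids:
        clusters[find(i)].add(i)
    return list(clusters.values())


# Made-up identities for illustration only.
ids = ["A1", "A2", "A3", "B1", "C1"]
pairs = [
    ("A1", "A2", 0.85),
    ("A2", "A3", 0.40),
    ("A1", "A3", 0.25),  # below threshold, but linked through A2
    ("A1", "B1", 0.18),
    ("B1", "C1", 0.22),
]
clusters = cluster_by_identity(ids, pairs)
```

Because every pair above the threshold ends up in the same cluster, assigning whole clusters to one side means no supplied train/test pair exceeds the threshold. The guarantee only covers pairs that were actually compared, so the upstream identity search needs to be sensitive enough for the threshold chosen.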

Cell-line splits, target splits, and assay splits apply the same logic for cellular models, target-based drug discovery, and high-throughput screens: whichever biological axis the deployed model must generalise across should be the axis it has not seen at training time.

The operational rule: pick the split that matches the distribution the model will face in deployment. A model evaluated under a strictly weaker split has not been evaluated for its intended use; it has been evaluated for an easier version of its intended use.
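
The splits described above share one mechanism: examples are grouped by a biological key, and whole groups go to one side. The sketch below is a minimal generic version, assuming the group key (scaffold, sequence cluster, cell line, target, or assay) has already been computed for each example. The function name, parameters, and toy scaffold labels are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def group_split(
    groups: Sequence[Hashable], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) with no group on both sides."""
    by_group: Dict[Hashable, List[int]] = defaultdict(list)
    for index, key in enumerate(groups):
        by_group[key].append(index)

    keys = sorted(by_group, key=str)   # fixed order before shuffling
    random.Random(seed).shuffle(keys)  # reproducible for a given seed

    target = test_fraction * len(groups)
    train: List[int] = []
    test: List[int] = []
    for key in keys:
        # Fill the test side with whole groups until it reaches the target size.
        bucket = test if len(test) < target else train
        bucket.extend(by_group[key])
    return sorted(train), sorted(test)


# Made-up scaffold labels for illustration only.
scaffolds = ["benzene", "benzene", "indole", "indole", "indole",
             "pyridine", "quinoline", "quinoline"]
train_idx, test_idx = group_split(scaffolds, test_fraction=0.25, seed=0)
```

Because groups are indivisible, the realized test fraction can overshoot the target when groups are large; reporting the realized sizes alongside the split axis keeps the comparison honest.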

Validity checks beyond geometric metrics

Geometric and statistical metrics are necessary but not sufficient. PoseBusters showed this for docking. The same lesson applies across the field.

Physical and chemical validity for docked poses, generated molecules, and predicted structures: realistic bond lengths, valid stereochemistry, no severe clashes, reasonable ring geometry.

Biological plausibility for generated sequences and structures: the protein folds, the codon usage is consistent with the host organism, the binder is the right size, the predicted activity sits in a meaningful range.

Manufacturability and synthetic accessibility for designed molecules: the compound can be made via a published synthetic route or by a competent medicinal chemist using available reagents.

GuacaMol (Brown et al., 2019) extended this principle into the generative-molecule literature with a multi-objective benchmark covering distribution-learning quality, goal-directed generation, and several validity dimensions; goal-directed scores that ignore validity collapse to optimising whatever metric the generator can game.

The rule for evaluators follows directly: any benchmark that omits validity checks for its modality (structure, pose, generated molecule, predicted phenotype) risks overstating capability.
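
The gap between a geometric metric and a validity-gated metric can be made explicit by reporting both. The sketch below is illustrative only: it assumes each docked pose already has an RMSD to the reference and a pass/fail flag from an upstream validity checker. The `PoseResult` type, the flag, and the 2.0 angstrom cutoff (a common convention, treated here as an assumption) are not taken from any specific tool's API.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class PoseResult:
    rmsd_angstrom: float   # RMSD of the predicted pose to the reference pose
    passes_validity: bool  # hypothetical flag from an upstream validity checker


def success_rates(
    results: Sequence[PoseResult], rmsd_cutoff: float = 2.0
) -> Dict[str, float]:
    """Success rate by RMSD alone, and by RMSD plus validity."""
    if not results:
        raise ValueError("no poses to score")
    n = len(results)
    rmsd_ok = sum(r.rmsd_angstrom <= rmsd_cutoff for r in results)
    rmsd_and_valid = sum(
        r.rmsd_angstrom <= rmsd_cutoff and r.passes_validity for r in results
    )
    return {"rmsd_only": rmsd_ok / n, "rmsd_and_valid": rmsd_and_valid / n}


# Made-up poses for illustration only.
poses = [
    PoseResult(1.2, True),
    PoseResult(1.8, False),  # close to the reference but physically implausible
    PoseResult(0.9, True),
    PoseResult(3.5, True),
]
rates = success_rates(poses)
```

The difference between the two rates is the share of apparent successes that would not survive a plausibility check.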

Calibration and uncertainty

A model that reports its uncertainty correctly is often more useful than a slightly more accurate model that does not. Calibration is the property that, when a model reports 80% confidence, it is right roughly 80% of the time. Modern deep models in biology are often poorly calibrated by default and require explicit calibration (Platt scaling, isotonic regression, temperature scaling, or conformal prediction) before their confidence outputs are trustworthy.
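
Calibration can be summarized numerically with the expected calibration error (ECE), which bins predictions by confidence and averages the gap between stated confidence and observed accuracy. The sketch below is a minimal equal-width-bin version; the function name, the bin count, and the toy inputs are illustrative choices.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted mean of |mean confidence - accuracy| over equal-width bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equally sized, non-empty inputs")
    n = len(confidences)
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece


# Made-up predictions for illustration only.
well_calibrated = expected_calibration_error([0.8] * 10, [True] * 8 + [False] * 2)
overconfident = expected_calibration_error([0.9] * 10, [True] * 6 + [False] * 4)
```

In this toy case the first model says 80% and is right 80% of the time, so its ECE is essentially zero; the second says 90% and is right 60% of the time, giving an ECE of about 0.3. ECE depends on the binning scheme, so the bin count should be reported with the number.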

The structure-prediction community made this concrete with per-residue confidence outputs. AlphaFold 2’s pLDDT score (Jumper et al., 2021) and AlphaFold 3’s confidence outputs (Abramson et al., 2024) are widely used as decision metrics: high-pLDDT regions are treated as reliable, low-pLDDT regions are treated as either disordered or unreliable. The discipline of always reading the confidence map before using a prediction is what makes structure prediction useful in drug discovery and not just a publication metric.

The general rule: an evaluation that reports only point-prediction metrics is incomplete; calibration curves, uncertainty intervals, and failure-mode breakdowns are part of the minimum reporting set.
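
For regression-style outputs, split conformal prediction is one way to turn held-out errors into an uncertainty interval. The sketch below is a minimal version, assuming a separate calibration set that is exchangeable with future data; the function name and the toy residuals are hypothetical.

```python
import math
from typing import Sequence


def conformal_half_width(
    calibration_residuals: Sequence[float], alpha: float = 0.1
) -> float:
    """Half-width h such that prediction +/- h targets (1 - alpha) coverage."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return scores[k - 1]


# Made-up residuals (observed minus predicted) on a held-out calibration set.
residuals = [0.1, -0.4, 0.3, -0.2, 0.5, -0.6, 0.8, -0.7, 0.9]
half_width = conformal_half_width(residuals, alpha=0.2)
interval_for_new_prediction = (5.0 - half_width, 5.0 + half_width)
```

The coverage guarantee is marginal and rests on exchangeability. If the calibration set comes from a random split while deployment crosses scaffolds or protein families, the interval is likely to be too narrow, which is another reason to calibrate on a biology-aware hold-out.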

Prospective experimental validation

When a model is used to select experiments, the only adequate validation is to run those experiments and report results. Retrospective evaluation, on data that already exists, can never fully simulate the situation where the model’s predictions determine which biology gets tested.

This matters most for generative methods and for active-learning loops. A generator that produces 1,000 candidate binders is interesting only insofar as some non-trivial fraction validate experimentally. The RFdiffusion programme (treated in detail in the Protein Design and Engineering chapter) is methodologically notable in part because the papers reported experimental hit rates, not just in-silico scores.

For variant interpretation, the analogue is functional validation. AlphaMissense (Cheng et al., 2023) is positioned as a research tool partly because the large-scale functional evidence to back its classifications at clinical-decision quality does not yet exist for most variants; using it in care without that evidence would be premature.

The rule: for any AI system that drives experimental decisions, ask for the prospective evidence and the corresponding hit rate. Self-reported retrospective performance is necessary but not sufficient.
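
A prospective hit rate is more informative when it comes with an interval, because experimental batches are usually small. The sketch below computes a Wilson score interval for a binomial proportion; the function name and the example counts are illustrative, not taken from any cited study.

```python
import math
from typing import Tuple


def wilson_interval(hits: int, tested: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a hit rate (z = 1.96 gives roughly 95%)."""
    if tested <= 0 or not 0 <= hits <= tested:
        raise ValueError("need 0 <= hits <= tested and tested > 0")
    p = hits / tested
    denom = 1 + z * z / tested
    centre = (p + z * z / (2 * tested)) / denom
    half = z * math.sqrt(p * (1 - p) / tested + z * z / (4 * tested * tested)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Made-up campaign: 5 confirmed hits out of 50 candidates tested.
low, high = wilson_interval(hits=5, tested=50)
```

For this toy campaign the observed hit rate is 10%, but the interval runs from roughly 4% to 21%. A comparison between two methods on batches of this size would need overlapping intervals to be taken seriously.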

Reproducibility floor: DOME and adjacent standards

The DOME (Data, Optimization, Model, Evaluation) recommendations (Walsh et al., 2021) are the community baseline for supervised-ML reporting in biology. A DOME-compliant report tells the reader, at minimum:

Data: where the data came from, how it was preprocessed, how it was split, what was held out
Optimization: how hyperparameters were chosen, what was searched, what budget was used
Model: the architecture, the training objective, the loss function, the regularisation
Evaluation: which metrics, which baselines, which significance tests, which failure modes are surfaced

A paper that does not report the DOME elements is hard for an outside reader to trust because there are too many unstated degrees of freedom. The Reproducibility and Open Science chapter treats the reproducibility question at the institutional level (open weights, open data, model cards, protocol records); DOME is the per-paper analogue.

The rule: in 2026, DOME-compliant reporting is a minimum reporting standard. Evaluations that do not meet it should be treated as preliminary regardless of who published them.
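
The four DOME sections listed above can be turned into a simple completeness check. The sketch below is one possible encoding: the field names mirror the items named in this chapter and are hypothetical, not an official DOME schema.

```python
from typing import Dict, List

# Hypothetical field names derived from the four DOME sections described above.
DOME_FIELDS: Dict[str, List[str]] = {
    "data": ["source", "preprocessing", "split", "held_out"],
    "optimization": ["hyperparameter_selection", "search_space", "budget"],
    "model": ["architecture", "training_objective", "loss_function", "regularisation"],
    "evaluation": ["metrics", "baselines", "significance_tests", "failure_modes"],
}


def missing_dome_items(report: Dict[str, Dict[str, str]]) -> List[str]:
    """List every section.field that is absent or blank in the report."""
    missing = []
    for section, fields in DOME_FIELDS.items():
        provided = report.get(section, {})
        for field in fields:
            if not str(provided.get(field, "")).strip():
                missing.append(f"{section}.{field}")
    return missing


# Made-up report for illustration only.
report = {
    "data": {"source": "public assay database", "preprocessing": "deduplicated",
             "split": "scaffold split", "held_out": "20% of scaffolds"},
    "optimization": {"hyperparameter_selection": "validation-set search",
                     "search_space": "learning rate, depth", "budget": "50 trials"},
    "model": {"architecture": "gradient-boosted trees",
              "training_objective": "binary classification",
              "loss_function": "log loss", "regularisation": "early stopping"},
    "evaluation": {"metrics": "AUROC, ECE", "baselines": "nearest neighbour"},
}
gaps = missing_dome_items(report)
```

Here the check flags the two evaluation items that the toy report leaves out. A check like this only confirms that something was written in each field; whether the content is adequate still needs a reader.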

Summary table

Evidence Anchor	What It Supports	Practical Constraint
CASP	Blinded community assessment for structure prediction	Targets and categories change across rounds
CAMEO	Continuous server evaluation	Automated evaluation depends on target release and criteria
PoseBusters	Physical and chemical plausibility for docking	RMSD-only evaluation rewards incomplete success
MoleculeNet	Standard scaffold-split benchmarks for molecular property prediction	Datasets age; absolute numbers are not always comparable across years
GuacaMol	Multi-objective generative-molecule evaluation	Goal-directed metrics can be gamed if validity is not enforced
DOME	Reporting standard for supervised ML in biology	Adoption is uneven across journals
Prospective wet-lab validation	The only adequate test for experiment-selecting models	Expensive; requires institutional and experimental commitment

What is theoretical?

Several evaluation regimes are plausible but not yet routine, and would change the field’s quality bar if adopted.

Cost-aware evaluation. Current benchmarks rank methods by accuracy or by a single quality metric. A more useful evaluation would rank methods by expected discovery yield per dollar, integrating model accuracy, prospective hit rate, and the cost of the experimental steps the model selects. This is the metric an R&D programme actually cares about; it is rarely the metric a paper reports.
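
The arithmetic behind a yield-per-dollar ranking is simple, which is part of the argument for reporting it. The sketch below is a deliberately simplified model that assumes a constant hit rate and a constant cost per experiment; the function name and every number in the example are hypothetical.

```python
def yield_per_dollar(
    hit_rate: float,
    n_experiments: int,
    cost_per_experiment: float,
    fixed_cost: float = 0.0,
) -> float:
    """Expected confirmed hits per dollar spent on a selection campaign."""
    total_cost = n_experiments * cost_per_experiment + fixed_cost
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    expected_hits = hit_rate * n_experiments
    return expected_hits / total_cost


# Made-up comparison: method B has the higher hit rate but selects costlier experiments.
method_a = yield_per_dollar(hit_rate=0.10, n_experiments=100,
                            cost_per_experiment=500.0, fixed_cost=10_000.0)
method_b = yield_per_dollar(hit_rate=0.15, n_experiments=100,
                            cost_per_experiment=2_000.0, fixed_cost=10_000.0)
hits_per_100k_a = method_a * 100_000
hits_per_100k_b = method_b * 100_000
```

In this toy comparison, method A yields about 16.7 expected hits per $100,000 and method B about 7.1, so the ranking reverses relative to hit rate alone.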

Cross-modality benchmarks. A model that simultaneously handles sequence, structure, and small-molecule chemistry should be evaluated on tasks that require all three. The benchmark suite for biology-wide foundation models is still emerging; current evaluations mostly stitch together per-modality benchmarks rather than testing integrated reasoning.

Active-learning benchmarks. Many real workflows use the model to choose the next experiment, retrain, repeat. Benchmarks that evaluate the full loop, not just the model in isolation, would better reflect deployment. The single-cell perturbation literature is closer to this than most subfields, and the Perturbation Prediction and Virtual Cells chapter discusses the current state.
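
A full-loop benchmark scores what the loop finds per round instead of a static accuracy number. The sketch below is a toy version under strong assumptions: candidates have a single numeric feature, the "model" is a 1-nearest-neighbour rule that is effectively retrained each time new labels arrive, and an `oracle` function stands in for the wet-lab experiment. All names and data are hypothetical.

```python
from typing import Callable, Dict, List


def run_selection_loop(
    pool: Dict[str, float],         # candidate id -> feature value
    oracle: Callable[[str], bool],  # stands in for running the experiment
    seed_ids: List[str],
    rounds: int,
    batch_size: int,
) -> List[int]:
    """Return cumulative hits found after each round (seed results excluded)."""
    if not seed_ids:
        raise ValueError("need at least one seed experiment")
    labels = {cid: oracle(cid) for cid in seed_ids}
    cumulative: List[int] = []
    hits = 0
    for _ in range(rounds):
        unlabeled = [cid for cid in pool if cid not in labels]
        if not unlabeled:
            break

        def score(cid: str) -> float:
            # 1-nearest-neighbour "model": copy the label of the closest tested candidate.
            nearest = min(labels, key=lambda other: abs(pool[other] - pool[cid]))
            return 1.0 if labels[nearest] else 0.0

        ranked = sorted(unlabeled, key=lambda cid: (-score(cid), cid))
        for cid in ranked[:batch_size]:
            labels[cid] = oracle(cid)  # run the selected experiment
            hits += int(labels[cid])
        cumulative.append(hits)
    return cumulative


# Toy pool: 20 candidates; the hidden truth is that feature values >= 14 are hits.
pool = {f"c{i:02d}": float(i) for i in range(20)}
curve = run_selection_loop(pool, lambda cid: pool[cid] >= 14,
                           seed_ids=["c00", "c19"], rounds=3, batch_size=3)
```

The resulting curve, compared against a random-selection baseline run with the same experimental budget, is the quantity a full-loop benchmark would report.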

External-validation registries. A registry that tracks vendor-reported performance against independently reproduced performance, in the spirit of the FDA’s adverse-event reporting system, would change the negotiation between buyers and sellers of AI in life sciences. Several professional societies are discussing this in 2026; none has launched.

The theoretical-section rule: an evaluation regime is plausible if a small group of competent researchers can imagine running it and the data exist. Most of the gaps above are organisational and economic, not technical.

What is beyond current capability?

Two evaluation goals remain beyond current capabilities.

A universal score across all of biology. Structure prediction, compound screening, cellular response, and clinical translation require different ground truth and different error costs. A model that excels on CASP targets says nothing about its cellular-perturbation performance, and a model that excels at single-cell perturbation prediction says nothing about its docking accuracy. The dream of a single benchmark that orders methods across the entire field is incompatible with the heterogeneity of biological evaluation.

Fully simulated deployment-equivalence evaluation. Even the best retrospective evaluation is a model of deployment, not deployment itself. A simulator that fully predicts how a method will behave in a real laboratory, with real assay variability, real reagent batches, and real human handlers, would be tantamount to solving the underlying biology. It is the kind of capability that, if it existed, would itself transform the field.

The rule: claims that promise either of these should be treated with deep skepticism. They are conceptually beyond what the current generation of methods and infrastructure can support, regardless of training compute.

What would make this more promising?

Evaluation practice would become more promising if the field adopted prospective, cost-aware, and externally reproduced assessments as routine reporting rather than special studies. Better evidence would include registries that compare vendor-reported performance with independent reproduction, active-learning benchmarks that score the full experiment-selection loop, and calibration reports tied to the decision the model supports. A universal biology score would require evidence that one metric predicts deployment behavior across molecules, structures, cells, organisms, and translation, and that evidence does not currently exist.

What should researchers, biotech teams, funders, and program leaders do with this?

For practitioners evaluating any AI system in the life sciences:

Demand the blind-benchmark result. Self-reported test-set numbers are necessary but not sufficient. Where is the CASP, CAMEO, PoseBusters, MoleculeNet, GuacaMol, or equivalent number? If none exists for this modality, what is the strongest available evidence and what would it take to produce a blind-benchmark equivalent?
Match the split to the use. Ask which biology-aware split was used (scaffold, time, sequence-family, cell-line, target, assay) and whether the held-out test set differs from the training set along the same axis the deployed model will have to generalise across.
Require validity checks, not only geometric metrics. For docking, PoseBusters-style physical and chemical validity. For generative molecules, synthetic accessibility and goal-aligned validity. For predicted structures, confidence-aware geometry, not raw RMSD.
Read the calibration plot. A model with reliable uncertainty is more useful for decision-making than a sharper model that does not know when to abstain. Per-residue or per-prediction confidence is not optional in 2026.
Insist on prospective validation when the model selects experiments. Hit-rate evidence from running the model’s predicted experiments beats any retrospective number.
Apply the DOME checklist. Data provenance, optimisation procedure, model details, evaluation procedure: if any of the four is missing, treat the result as preliminary.
Look for failure-mode breakdowns. Predictable failure (in disordered regions, novel scaffolds, induced fit, large complexes) is a workable constraint to design around. Unreported failure modes are not; they surface as deployment surprises.
Treat vendor performance figures with the same standard as published ones. Performance reported in a slide deck, marketing site, or webinar is not stronger than performance reported in a paper; usually it is weaker because there is no manuscript to scrutinise.
Verify by independent reproduction wherever possible. A capability is real to the field when an independent group reproduces it. The Benchmarks for Bio AI and Reproducibility and Open Science chapters detail what to look for.
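
The failure-mode breakdown recommended in the checklist above amounts to reporting the headline metric per stratum. The sketch below is a minimal illustration, assuming each evaluated example has been tagged with a stratum label; the function name, the labels, and the outcomes are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def accuracy_by_stratum(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Accuracy computed separately for each stratum label."""
    totals: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # [correct, total]
    for stratum, correct in records:
        totals[stratum][0] += int(correct)
        totals[stratum][1] += 1
    return {stratum: c / n for stratum, (c, n) in totals.items()}


# Made-up evaluation outcomes for illustration only.
outcomes = [
    ("seen_scaffold", True), ("seen_scaffold", True),
    ("seen_scaffold", True), ("seen_scaffold", False),
    ("novel_scaffold", True), ("novel_scaffold", False),
    ("novel_scaffold", False), ("novel_scaffold", False),
]
breakdown = accuracy_by_stratum(outcomes)
overall = sum(ok for _, ok in outcomes) / len(outcomes)
```

In this toy case the overall accuracy of 0.5 hides a split of 0.75 on seen scaffolds and 0.25 on novel ones, which is the kind of gap a single headline number conceals.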