The Life Sciences AI Handbook: Steering Frontier Models in Biology

Tegomoh, Bryan

Appendix B — Quick Reference: All Chapter Summaries

Published

July 7, 2026

TL;DR summaries from every chapter, organised by handbook part. Use this page as a fast reference index, as a single-pass review of the handbook’s evidence-tier assessments, or as the lookup point before reading a full chapter.

Part I: Foundations

AI for the Life Sciences

Life sciences AI is not one field. It is a set of modelling practices that share three constraints: biological data is structured for measurement (not for computation), every model output is a hypothesis (not an endpoint), and the cost of being wrong is paid at the bench, in the clinic, or by a program that does not exist. Discovery AI is distinct from clinical AI and public health AI; each has a different evidence standard. The three-tier framework (Demonstrated, Theoretical, Beyond current capabilities) is used throughout the handbook to place every capability claim.

History of AI in the Life Sciences

The current AI wave in biology is the latest chapter in a five-decade arc, not a discontinuity. The arc runs from sequence alignment in the 1970s, through statistical learning in the 1990s and deep learning entering biology around 2012-2015, to AlphaFold 2 at CASP14 in 2020 as the watershed (Jumper et al., 2021) and foundation models for biology from 2021 onwards. Reading current systems in continuity with this lineage prevents overclaiming.

Biological Data Infrastructure

Better models do not rescue poorly specified biological data. AI-ready data require provenance, assay context, versioning, licensing, negative-example handling, and a durable distinction between measured and computed records. FAIR data principles define reuse expectations (Wilkinson et al., 2016); dataset datasheets make collection and maintenance visible (Gebru et al., 2021); Bridge2AI frames the problem as infrastructure rather than a file drop (NIH Common Fund Bridge2AI, 2026). A dataset card is the minimum standard for any internal data product.
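
As a minimal sketch of what such a dataset card could record, the Python dataclass below mirrors the requirements listed above. The class name `DatasetCard`, its field names, and the `missing_fields` helper are illustrative assumptions, not a schema defined by FAIR, Bridge2AI, or the datasheet literature.

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetCard:
    """Hypothetical minimal dataset card; field names are illustrative."""
    name: str
    version: str            # versioning
    provenance: str         # where the records came from
    assay_context: str      # assay, platform, conditions
    license: str            # licensing terms
    negative_examples: str  # how negatives were defined and handled
    record_origin: str      # "measured" or "computed"


def missing_fields(card: DatasetCard) -> list:
    """Return names of empty fields, plus record_origin if its value is invalid."""
    problems = [f.name for f in fields(card) if not getattr(card, f.name).strip()]
    valid_origin = card.record_origin in ("measured", "computed")
    if not valid_origin and "record_origin" not in problems:
        problems.append("record_origin")
    return problems


# Example values are invented for illustration only.
card = DatasetCard(
    name="example-binding-screen",
    version="1.0.0",
    provenance="internal screen, plate set A",
    assay_context="",
    license="internal use only",
    negative_examples="inactive wells retained",
    record_origin="measured",
)
print(missing_fields(card))  # ['assay_context']
```

A check like this could run before a data product is registered, so that an incomplete card blocks release rather than surfacing later.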

Biomedical Knowledge Graphs and Literature AI

Biomedical knowledge graphs and literature AI are the evidence layer beneath target scans, portfolio diligence, competitive intelligence, repurposing hypotheses, and source-grounded biomedical search. BioBERT, PubMedBERT, SciBERT, SemMedDB, Hetionet, PrimeKG, and Open Targets represent different pieces of that stack. The professional standard is not a polished answer. It is source provenance, typed relations, retrieval date, corpus boundary, negative-evidence review, and claim-to-source mapping.

Foundation Models for Biology

Foundation models for biology share one defining pattern: pretrain on a large biological corpus with a self-supervised objective, then adapt to many downstream tasks. The pattern has produced protein models, cell and spatial models, genome and RNA models, and structure-prediction systems. Each modality has its own training data, architectures, evaluation conventions, and failure modes. Pretraining works for representation and selected transfer tasks; the extent of useful transfer depends on the biological question and the validation discipline applied.

Evaluation Principles for Life Sciences AI

The credibility hierarchy: blinded benchmark > prospective wet-lab validation > biology-aware held-out split > random hold-out > self-reported numbers with no public split. Always run the strongest classical baseline (PCA plus linear regression for cells; sequence-homology baselines for proteins; k-mer baselines for genomes). Apply validity checks alongside geometric metrics (the PoseBusters lesson; Buttenschoen et al., 2024). Demand calibration alongside accuracy. DOME-compliant reporting is the per-paper baseline (Walsh et al., 2021).
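
One way to make "calibration alongside accuracy" concrete is a binned expected calibration error (ECE) for a binary classifier. The sketch below is a common textbook formulation, not a procedure prescribed by the handbook or by DOME. The function name, the equal-width bins, and the default of 10 bins are assumptions.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for binary predictions.

    probs: predicted probability of the positive class, each in [0, 1].
    labels: observed outcomes, each 0 or 1.
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        # Weight each bin's gap by the share of predictions it holds.
        ece += (len(members) / len(probs)) * abs(mean_conf - frac_pos)
    return ece


probs = [0.9, 0.9, 0.9, 0.9]
labels = [1, 1, 1, 0]
print(round(expected_calibration_error(probs, labels), 2))  # 0.15
```

In this toy case the model claims 90% confidence but is right 75% of the time, so the gap of 0.15 is the calibration error. Accuracy alone would not reveal it.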

Part II: Molecular Discovery and Design

Protein Structure Prediction

Structure models are now routine inputs to biology but are not substitutes for experiments. AlphaFold 2 effectively solved single-chain prediction (Jumper et al., 2021); AlphaFold 3 extended to biomolecular interactions (Abramson et al., 2024); the open-source response (Boltz, Chai) provides AF3-class capability with permissive licenses (Wohlwend et al., 2024, preprint; Chai Discovery et al., 2024, preprint). Confidence, conformational state, ligand geometry, and biological context determine whether a predicted structure supports a downstream decision. pLDDT and PAE are not optional reading.
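
As a small illustration of reading pLDDT before trusting a region, the sketch below counts residues per confidence band. The cut-offs at 90, 70, and 50 follow the colour bands commonly shown for AlphaFold predictions. The band names, the handling of values exactly on a boundary, and the function names are assumptions. PAE, which describes relative placement between residues, needs a separate check that is not shown here.

```python
from collections import Counter

BANDS = ("very high", "confident", "low", "very low")


def plddt_band(score):
    """Assign one per-residue pLDDT value (0-100) to a confidence band."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("pLDDT must lie between 0 and 100")
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"


def summarize_plddt(scores):
    """Count residues per band so low-confidence regions are visible at a glance."""
    counts = Counter(plddt_band(s) for s in scores)
    return {band: counts.get(band, 0) for band in BANDS}


# Invented per-residue scores for illustration.
print(summarize_plddt([95.0, 85.0, 65.0, 40.0, 91.0]))
# {'very high': 2, 'confident': 1, 'low': 1, 'very low': 1}
```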

Protein Design and Engineering

Protein design is an experimental discipline supported by generative models, not in-silico generation alone. RFdiffusion (backbones) (Watson et al., 2023), ProteinMPNN and LigandMPNN (sequences) (Dauparas et al., 2022; Dauparas et al., 2025), Chroma / ESM3 / ProteinGenerator / ProGen2 / EvoDiff (generative), and RFantibody (antibody-specific) form the current stack. EVOLVEpro and COMPSS show the value of assay-linked active learning and experimental score calibration (Jiang et al., 2025; Johnson et al., 2024). AlphaProteo is restricted-release; treat it as a preprint. Every designed protein still needs expression, characterisation, functional assay, and developability review before counting as a candidate.

Antibody and Biologic Design

RFantibody is the current peer-reviewed reference for de novo antibody design (Bennett et al., 2026). IgFold and AbLang handle adjacent tasks, while PLM-guided antibody evolution and structure-informed complex optimization are demonstrated for bounded optimization (Hie et al., 2024; Shanker et al., 2024). The chapter treats antibody anatomy, CDR loop geometry, epitope and paratope specificity, format choice, and the developability cascade as first-order content. Binding is necessary, but aggregation, viscosity, expression, immunogenicity, polyspecificity, pharmacokinetics, and manufacturability decide whether a design becomes a biologic candidate.

Nucleic Acid and Genome Models

Genome models move life sciences AI from protein sequence toward regulatory sequence, RNA, genome organisation, and cellular context. Evo and Evo 2 reason across DNA, RNA, and protein (Nguyen et al., 2024; Brixi et al., 2026); Semantic design with Evo demonstrates generated genes in tested prokaryotic systems (Merchant et al., 2026); AlphaGenome, GET, and Orthrus cover regulatory variant prediction, transcriptional state, and mature RNA representations (Avsec et al., 2026; Fu et al., 2025; Fradkin et al., 2026). The validation ladder is MPRA, CRISPRi, eQTL, RNA assays, and perturbation data.

Variant Effect Prediction

Variant predictors (AlphaMissense, EVE, ESM-1v, PrimateAI, SpliceAI, REVEL, CADD, DeepSEA, Enformer, Nucleotide Transformer, GPN-MSA, AlphaGenome) are useful as research and triage tools (Cheng et al., 2023; Frazer et al., 2021; Benegas et al., 2025; Zhan et al., 2025). The ACMG/AMP framework remains the clinical standard for classification (Richards et al., 2015); predictor scores are PP3/BP4 evidence inputs, not classifications. Per-laboratory calibration is mandatory before clinical use.
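
The distinction between an evidence input and a classification can be sketched in a few lines. The function below returns an evidence code or nothing, never a pathogenicity call. It assumes a predictor where higher scores mean more deleterious, and it takes both thresholds as arguments because they must come from the laboratory's own calibration. The function name and the 0.2 and 0.8 values in the usage lines are placeholders, not recommended cut-offs for any real predictor.

```python
def predictor_evidence(score, benign_max, pathogenic_min):
    """Map a predictor score to an ACMG/AMP evidence code, or None.

    benign_max and pathogenic_min are lab-calibrated thresholds supplied
    by the caller; this sketch ships no defaults on purpose.
    """
    if benign_max >= pathogenic_min:
        raise ValueError("benign_max must be below pathogenic_min")
    if score >= pathogenic_min:
        return "PP3"   # computational evidence supporting a deleterious effect
    if score <= benign_max:
        return "BP4"   # computational evidence suggesting no impact
    return None        # indeterminate zone: the score contributes no evidence


# Placeholder thresholds for illustration only.
print(predictor_evidence(0.95, benign_max=0.2, pathogenic_min=0.8))  # PP3
print(predictor_evidence(0.50, benign_max=0.2, pathogenic_min=0.8))  # None
```

The returned code would then be combined with the other evidence criteria by a qualified reviewer. The function itself never produces a final classification.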

Part III: Cells, Tissues, and Systems Biology

Single-Cell Foundation Models

scGPT, Geneformer, scFoundation, scBERT, and Nicheformer produce useful cell-state or spatial-context representations (Cui et al., 2024; Theodoris et al., 2023; Hao et al., 2024; Yang et al., 2022; Tejada-Lapuerta et al., 2025). They perform competitively on cell-atlas search (SCimilarity), integration, and selected spatial tasks. Independent evaluations (Ahlmann-Eltze et al., 2025; Boiarsky et al., 2024) show that on perturbation prediction and several other tasks, the deep approaches do not consistently outperform PCA plus linear regression. Single-cell foundation models (scFMs) are useful for some tasks and unverified for others.

Spatial Omics and Tissue Models

Spatial omics adds tissue position to molecular measurement. Platforms (Visium, MERFISH, Xenium, CosMx, Stereo-seq) give different resolution and sensitivity tradeoffs. SpatialData preserves platform-aware data structure, while Nicheformer and Novae are current peer-reviewed anchors for spatially aware representation learning (Marconato et al., 2024; Tejada-Lapuerta et al., 2025; Blampey et al., 2025). Spatial AI is most useful when it links molecular signals to tissue structure with clear resolution limits and explicit platform-aware comparisons. Assigning cell-type labels to spots is necessary but not sufficient.

Cell Painting and Image-Based Phenotyping

Cell Painting (Bray et al., 2016; Cimini et al., 2023) and CellProfiler (Carpenter et al., 2006; McQuin et al., 2018) form the canonical image-based phenotyping stack. JUMP morphology maps and genome-wide morphology atlases now provide stronger public benchmark layers for matched chemical, genetic, and expression perturbations (Chandrasekaran et al., 2024; Chandrasekaran et al., 2025; Ramezani et al., 2025). Mechanism inference from morphology alone remains beyond current capabilities.

Histopathology AI

Histopathology AI turns whole-slide tissue images into research representations for tumour microenvironment analysis, tissue biomarker discovery, weakly supervised slide classification, and nuclei-level quantification. UNI, CONCH, Virchow, Prov-GigaPath, CHIEF, MUSK, TITAN, and HoVer-Net form the current reference stack (Chen et al., 2024; Wang et al., 2024; Xiang et al., 2025; Ding et al., 2025). The credibility test is not a polished heatmap. It is site-held-out, scanner-held-out, stain-held-out, and cohort-held-out validation with orthogonal evidence for biomarker claims.
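
A held-out split of this kind can be sketched as a split by group rather than by slide. The function below keeps every record from the held-out groups out of training. The record layout (dictionaries with `site` and `scanner` keys) and the function name are assumptions for illustration. The same call with a different `key` would give a scanner-held-out, stain-held-out, or cohort-held-out split.

```python
def group_held_out_split(records, key, held_out):
    """Split records so that no value of `key` appears on both sides."""
    held = set(held_out)
    train = [r for r in records if r[key] not in held]
    test = [r for r in records if r[key] in held]
    if not train or not test:
        raise ValueError("split left one side empty; check the held_out values")
    return train, test


# Invented slide records for illustration.
slides = [
    {"slide_id": "s1", "site": "A", "scanner": "x"},
    {"slide_id": "s2", "site": "A", "scanner": "y"},
    {"slide_id": "s3", "site": "B", "scanner": "x"},
    {"slide_id": "s4", "site": "C", "scanner": "y"},
]
train, test = group_held_out_split(slides, key="site", held_out=["C"])
print([r["slide_id"] for r in train])  # ['s1', 's2', 's3']
print([r["slide_id"] for r in test])   # ['s4']
```

A random slide-level split of the same data could place slides from site C on both sides, which is the leakage this design avoids.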

Microscopy and Cryo-EM AI

Microscopy and cryo-EM AI covers the image-processing layer beyond Cell Painting: segmentation, denoising, super-resolution, label-free prediction, particle picking, motion estimation, and heterogeneous reconstruction. Cellpose3, Segment Anything for Microscopy, DynaMight, and tomoDRGN extend the recent peer-reviewed reference stack (Stringer et al., 2025; Schwab et al., 2024; Powell et al., 2024). The core rule is measurement validity: visual improvement does not prove biological truth.

Perturbation Prediction and Virtual Cells

Perturbation prediction asks a counterfactual question: what would this cell state do under a genetic, chemical, dose, time, or combination perturbation? The field is grounded in Perturb-seq, CRISPRi single-cell screens, genetic-interaction manifolds, genome-scale Perturb-seq, Perturb-CITE-seq, GEARS, neural optimal transport, and the Virtual Cell Challenge. Independent evaluation still shows that strong linear baselines can match deep methods on key tasks (Ahlmann-Eltze et al., 2025).
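
To show how little machinery a strong simple baseline needs, the sketch below predicts a double-perturbation expression profile by adding the two single-perturbation effects to the control profile. This additive rule is offered as an illustrative baseline of the simple kind that deep methods should be compared against. It is not reproduced from the cited evaluation, and the function name and toy numbers are assumptions.

```python
def additive_baseline(control, single_a, single_b):
    """Predict expression under A+B as control + (A - control) + (B - control).

    Each argument is a list of per-gene mean expression values in the
    same gene order.
    """
    if not (len(control) == len(single_a) == len(single_b)):
        raise ValueError("all profiles must cover the same genes")
    return [a + b - c for c, a, b in zip(control, single_a, single_b)]


# Toy three-gene profiles, invented for illustration.
control = [1.0, 2.0, 3.0]
pert_a = [1.5, 2.0, 2.0]
pert_b = [1.0, 3.0, 3.5]
print(additive_baseline(control, pert_a, pert_b))  # [1.5, 3.0, 2.5]
```

If a deep model cannot beat this on held-out combinations, the added complexity has not yet earned its place in the workflow.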

Microbiome and Multi-Omics AI

Multi-omics AI is useful when each modality has clear provenance and the validation endpoint is explicit. Integration hides weak measurements when missingness and batch effects are not tracked. Microbiome data adds compositional and methodological complexity that standard ML handles badly without explicit corrections. The ESM Metagenomic Atlas expanded the predicted-structure layer for uncultured microbial proteins, while global microbiome mining shows that ML can prioritize antimicrobial peptide candidates for experimental testing (Santos-Júnior et al., 2024).
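
One widely used correction for compositional data is the centered log-ratio (CLR) transform, sketched below. It is offered as one example of an explicit correction, not as the method the chapter prescribes. The pseudocount default of 0.5 for handling zero counts is an assumption, and the right choice depends on the data.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's taxon counts."""
    if not counts:
        raise ValueError("counts must be non-empty")
    if any(c + pseudocount <= 0 for c in counts):
        raise ValueError("every count plus pseudocount must be positive")
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [v - mean_log for v in logs]


# Invented counts; with no pseudocount, scaling the sample leaves CLR unchanged.
shallow = clr([1, 10, 100], pseudocount=0)
deep = clr([2, 20, 200], pseudocount=0)
print(all(abs(x - y) < 1e-9 for x, y in zip(shallow, deep)))  # True
```

The usage lines show why the transform helps: two samples with the same proportions but different sequencing depth map to the same values, so depth is not mistaken for biology.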

Systems Biology and Multiscale Modeling

Systems biology is the control layer between cell-state representation and organism-level claims. GRN inference, pathway modeling, mechanistic simulation, and multiscale models help name what a system is predicted to do and what perturbation would test it. Single-cell multi-omics strengthens regulatory inference, but observational network edges remain hypotheses without perturbation or orthogonal evidence (Badia-i-Mompel et al., 2023). Whole-cell modeling in Mycoplasma genitalium shows both the value and the scale burden of mechanistic biology (Karr et al., 2012).

Part IV: Organismal and Environmental Biology

Neuroscience AI and Brain Foundation Models

Neuroscience AI covers neural recordings, connectomics, neuroimaging, neural decoding, and foundation models for brain data. Brain foundation models can learn transferable structure from large neural and behavioral datasets, but the evidence standard changes when a representation is used to make claims about cognition, disease mechanism, or intervention response. The strongest current work is useful for representation, decoding, and hypothesis generation, not general brain understanding (Wang et al., 2025).

Aging and Longevity Biology AI

Aging AI is strongest as biomarker modeling and weakest when it becomes intervention prediction without longitudinal evidence. Epigenetic clocks and related biomarkers can estimate biological age or mortality-related signal in defined cohorts (Horvath, 2013; Bell et al., 2019). The hard question is whether a model can identify mechanisms and interventions that improve healthspan, not only whether it can fit an age-associated molecular pattern.

Plant, Crop, and Agricultural AI

Agriculture is life sciences at organism, population, environment, and breeding-program scale. Current work includes plant genome foundation models, plant RNA models, high-throughput phenotyping, and genotype-to-phenotype prediction. AgroNT and PlantRNA-FM show the foundation-model direction for plant molecular biology (Mendoza-Revilla et al., 2024; Zhang et al., 2024). Field performance still requires site, season, management, and environment validation.

Environmental and Ecological AI

Environmental and ecological AI covers biodiversity monitoring, environmental DNA, camera-trap inference, species distribution, conservation biology, and ecosystem-scale modeling. The key risk is confusing observation density with ecological truth. AI can improve detection, classification, and prioritization, but ecological claims need sampling design, uncertainty reporting, and field validation. Current ecological AI reviews emphasize opportunity and measurement discipline rather than replacement of ecological expertise (Rafiq et al., 2025; Guillera-Arroita et al., 2025).

Virtual Organisms and Digital Biology

Virtual organisms are not just larger virtual cells. They require multiscale coupling across tissues, development, physiology, behavior, and environment. Whole-cell modeling and virtual-cell programs provide important anchors, but organism-scale digital biology remains mostly theoretical (Karr et al., 2012; Bunne et al., 2024; Roohani et al., 2025). The durable rule is to name the scale and the falsifying measurement.

Part V: Therapeutic Discovery and Translation

Target Identification and Prioritization

Target identification is a decision under uncertainty, not a ranking contest. Open Targets organizes the evidence stack (Buniello et al., 2024); genetics is often the strongest prior, but direction of effect, modality, tissue expression, tractability, and safety genetics decide whether the target is actionable. Recent Nature and Nature Genetics work tightens the target-to-clinic connection by showing how genetic evidence and trial stoppage should inform target review (Trajanoska et al., 2023; Minikel et al., 2024; Razuvayevskaya et al., 2024). The winning target is the one with a falsifiable mechanism and a rejected-target record.

Small Molecule Generation and ADMET

Small molecule discovery has multiple AI layers: generative chemistry (REINVENT lineage), structure-based docking (DiffDock, EquiBind, Pocket2Mol, with PoseBusters validity discipline), property prediction (Chemprop, ADMET-AI, ADMETlab 3.0), and benchmarks (MoleculeNet, TDC). Insilico Medicine’s ISM001-055 has peer-reviewed randomized Phase 2a evidence, but no AI-discovered small molecule has been approved (Ren et al., 2025; Xu et al., 2025). The value lies in validation and in decisions that shift stage gates, not in the generative step.

Chemical Biology and Target Engagement

Chemical biology is the translational testbed between molecular generation and mechanism. AI can help select probes, prioritize binding hypotheses, infer mechanism of action from profiles, and design degrader or molecular-glue campaigns, but target engagement remains an experimental claim. Target 2035 frames chemical probes as infrastructure for testing biological function, not as decorative follow-on chemistry (Edwards et al., 2025).

Drug Repurposing and Combination Therapy

Repurposing and combination therapy AI helps rank old assets, mechanisms, signatures, graphs, networks, and drug pairs. Connectivity Map/L1000, DeepSynergy, SynergyFinder, COVID-19 repurposing evidence, network medicine, and TxGNN define the current reference frame (Subramanian et al., 2017; Preuer et al., 2018; Huang et al., 2024). The core rule is humility: a signature match, graph path, or synergy score is a hypothesis until disease-relevant assays, dose logic, toxicity, and clinical or translational evidence support it.

mRNA, RNA, and Vaccine Design

RNA and vaccine design combine sequence, structure, immunology, delivery, manufacturing, and population biology. mRNA construct anatomy and immune response should be separated: coding sequence, UTRs, cap, poly(A), nucleoside chemistry, secondary structure, purification, formulation, and delivery all matter. LinearDesign shows that algorithmic mRNA design can improve stability and immunogenicity in tested settings (Zhang et al., 2023), while viral language models support immune-escape surveillance (Hie et al., 2021). Protection still depends on delivery, dosing, safety, and human evidence.

Cell and Gene Therapy AI

Cell and gene therapy AI brings design into living therapeutics: engineered immune cells, guide and vector selection, delivery constraints, potency assays, and manufacturing analytics. The FDA regulates cellular and gene therapy products as biologics, which means AI claims have to land in product quality, potency, safety, or clinical evidence rather than only in sequence design (FDA, 2026). A model that improves a construct or cell state is useful only if the evidence survives delivery, manufacturing, and patient-level translation.

Diagnostics and Biomarker Translation

Diagnostics and biomarkers convert discovery signal into decisions. AI can help discover candidate markers, design assays, extract features, or define composite signatures, but qualification depends on context of use: what decision the biomarker supports, in which population, with which measurement process. FDA biomarker qualification materials make that context explicit (FDA, 2026). Discovery performance is not diagnostic validity.

Clinical Trial AI for Translational Research

Clinical trial AI covers operational analytics, eligibility matching, real-world data curation, endpoint extraction, monitoring, synthetic controls, and adaptive design. Each context of use carries a different evidentiary burden. FDA, EMA, JAMA, and Nature Medicine sources converge on the same discipline: context of use, lifecycle oversight, and regulator engagement when AI affects evidence (FDA, 2026; Warraich et al., 2025; Zhang et al., 2025). Write the context of use before choosing metrics.

Real-World Evidence and Biomarker AI

Real-world evidence and biomarker AI connect EHR-derived data, claims, registries, wearables, genomics, pathology, imaging, and multi-omic evidence to therapeutic development decisions. The first artifact should be a target-trial protocol, not a model, because eligibility, treatment strategy, time zero, endpoint, follow-up, causal contrast, and analysis plan determine credibility (Hernán et al., 2022; Hubbard et al., 2024). FDA’s RWE and biomarker programs make context of use central. AI improves curation and feature discovery, but it does not remove confounding, missingness, endpoint drift, or qualification requirements.
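
The "protocol before model" rule can be enforced with a very small check. The sketch below lists the seven protocol elements named above and reports which ones a draft leaves unspecified. The key names, the dictionary layout, and the function name are illustrative assumptions, not a standard target-trial schema.

```python
REQUIRED_ELEMENTS = (
    "eligibility",
    "treatment_strategy",
    "time_zero",
    "endpoint",
    "follow_up",
    "causal_contrast",
    "analysis_plan",
)


def unspecified_elements(protocol):
    """Return the required protocol elements that are missing or blank."""
    return [
        name for name in REQUIRED_ELEMENTS
        if not str(protocol.get(name, "")).strip()
    ]


# Invented draft for illustration; time_zero is absent and analysis_plan is blank.
draft = {
    "eligibility": "adults with a first recorded diagnosis",
    "treatment_strategy": "initiate drug X versus no initiation",
    "endpoint": "hospitalisation within 12 months",
    "follow_up": "12 months from time zero",
    "causal_contrast": "intention-to-treat analogue",
    "analysis_plan": "   ",
}
print(unspecified_elements(draft))  # ['time_zero', 'analysis_plan']
```

A team could refuse to start feature engineering until this list is empty, which keeps the causal question fixed before any model is fitted.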

Translational Evidence and Failure Modes

The dominant failure mode is mismatch between model endpoint and biological mechanism, assay system, or program decision. A model that improves a proxy endpoint can harm the program if the proxy is poorly linked to disease biology or developability. PoseBusters is the canonical case for validity beyond geometric metrics (Buttenschoen et al., 2024). Ahlmann-Eltze et al. (2025) is the canonical case for linear-baseline discipline. Failure analysis belongs near the start of the workflow.

Part VI: Research Systems, Practice, and Governance

Self-Driving Laboratories

A self-driving laboratory closes the experimental loop. Coscientist is the canonical LLM-planned chemistry example (Boiko et al., 2023). Virtual Lab demonstrates multi-agent biology (Swanson et al., 2025). A-Lab reported autonomous production of inorganic materials (Szymanski et al., 2023) and drew a PRX Energy critique that underlines how easily novelty claims can run ahead of validation (Leeman et al., 2024). The capability is real for bounded optimisation; open-ended autonomous biology remains beyond current capabilities.

Robotic Lab Automation and Cloud Labs

Cloud labs, software-defined biology, open-source liquid handling, and the mobile robotic chemist form the hardware-and-execution layer (Burger et al., 2020). Protocol languages and liquid-handling interfaces such as PyLabRobot make experiments computational artifacts, but calibration, deck layout, reagent lots, simulator output, and run logs remain part of the scientific record (Wierenga et al., 2023). Cross-cloud-lab portability requires explicit cross-validation. ARPA-H IGoR frames automation as research infrastructure (ARPA-H IGoR, 2026).

Synthetic Biology Design Tools

Synbio design joins sequence design (Evo, Evo 2, Nucleotide Transformer, ProGen2), protein design (RFdiffusion, ProteinMPNN, LigandMPNN), pathway engineering, and strain design. Semantic design with Evo is now a peer-reviewed example of function-guided generated genes with experimental tests (Merchant et al., 2026). The DBTL cycle is the engineering frame. AI accelerates design and learn steps without replacing build and test. DNA synthesis screening (IGSC) and IBC review are part of the workflow, not optional add-ons.

AI for Biomanufacturing

AI for biomanufacturing moves from discovery design to reproducible production: cell-line development, media optimisation, fermentation control, digital twins, PAT, critical process parameters, critical quality attributes, scale-up, and quality monitoring. The core distinction is research optimization versus validated manufacturing control. A useful model must be tied to product quality, scale, equipment, sensor validity, and change-control discipline, not only to a convenient process variable.

Agentic Science Workflows

ChemCrow, Coscientist, Virtual Lab, and CellVoyager are the canonical published agentic systems (M. Bran et al., 2024; Boiko et al., 2023; Swanson et al., 2025; Alber et al., 2026). Agentic systems should be read as tool-permission architectures: literature agents, computational agents, procurement agents, and lab-execution agents have different risk profiles. The discipline is bounded tasks, source control, tool validation, replayable logs, separated permissions, and explicit human authorization for biological actions.
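
Reading agents as tool-permission architectures suggests a simple gate in front of every tool call. The sketch below is a hypothetical illustration: the action names, the `Agent` class, and the `authorize` function are assumptions and do not describe how any of the cited systems are built. It shows three of the listed disciplines together: separated permissions, explicit human authorization for biological actions, and a replayable log.

```python
from dataclasses import dataclass

# Hypothetical action names; a real system would define its own.
BIOLOGICAL_ACTIONS = frozenset({"order_reagent", "run_protocol"})


@dataclass(frozen=True)
class Agent:
    name: str
    allowed_actions: frozenset


def authorize(agent, action, human_approved=False, log=None):
    """Allow an action only if it is in the agent's scope and, for
    biological actions, a human has explicitly approved it.
    Every decision is appended to the log when one is supplied."""
    allowed = action in agent.allowed_actions
    if action in BIOLOGICAL_ACTIONS and not human_approved:
        allowed = False
    if log is not None:
        log.append((agent.name, action, human_approved, allowed))
    return allowed


literature = Agent("literature", frozenset({"search_papers"}))
lab = Agent("lab_execution", frozenset({"run_protocol"}))
audit_log = []
print(authorize(literature, "run_protocol", log=audit_log))                # False
print(authorize(lab, "run_protocol", log=audit_log))                       # False
print(authorize(lab, "run_protocol", human_approved=True, log=audit_log))  # True
```

The design choice is that human approval can only unlock an action already in the agent's scope. It never widens the scope itself.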

Toolkit for AI-Augmented Bio Research

The durable workflow problem is not which tool is best in the abstract. It is which tool matches the research object, data sensitivity, reproducibility requirement, cost structure, and team capacity. The toolkit chapter gives a practical selection framework, current tool inventory, workflow patterns, and a 90-day adoption plan for teams that need to use AI without turning tool evaluation into the project.

Benchmarks for Bio AI

Benchmarks are social infrastructure for scientific claims. CASP and CAMEO for structure; PoseBusters for docking validity; MoleculeNet and TDC for chemistry; scIB and OpenProblems for single-cell; Virtual Cell Challenge for perturbation (Kryshtafovych et al., 2024; CAMEO, 2026; Buttenschoen et al., 2024; Wu et al., 2018; Huang et al., 2022; Luecken et al., 2022; Luecken et al., 2025; Roohani et al., 2025). The credibility hierarchy: blinded benchmark > prospective validation > biology-aware split > random hold-out > self-reported numbers. A leaderboard is a filter, not a validation plan.
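
The credibility hierarchy above is an ordering, so it can be encoded directly and used to tag each reported result. The sketch below is one possible encoding. The enum name, member names, and integer values are assumptions; only the order between members carries meaning.

```python
from enum import IntEnum


class Evidence(IntEnum):
    """Evidence tiers in ascending order of credibility."""
    SELF_REPORTED = 1
    RANDOM_HOLD_OUT = 2
    BIOLOGY_AWARE_SPLIT = 3
    PROSPECTIVE_VALIDATION = 4
    BLINDED_BENCHMARK = 5


def strongest(evidence):
    """Return the highest tier among the evidence offered, or None if empty."""
    return max(evidence, default=None)


# Invented example: a method reported with two kinds of evaluation.
reported = [Evidence.RANDOM_HOLD_OUT, Evidence.BIOLOGY_AWARE_SPLIT]
print(strongest(reported).name)  # BIOLOGY_AWARE_SPLIT
```

Tagging results this way makes it harder for a leaderboard position built on a random hold-out to be read as if it were prospective validation.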

Reproducibility and Open Science

Reproducibility in AI biology is both computational and experimental. The DOME framework is the per-paper reporting standard (Walsh et al., 2021). Model cards and dataset cards are the documentation layer. The AlphaFold 3 restricted release and the Boltz/Chai open-source response form the case study for how open-source releases can fill capability gaps when access is restricted (Abramson et al., 2024; Wohlwend et al., 2024, preprint; Chai Discovery et al., 2024, preprint).

Information Hazards in Capability Research

The standard is not secrecy by default. The standard is deliberate disclosure: enough detail for scientific verification without unnecessary operational detail that raises misuse risk. The 2024 NIH and HHS DURC/PEPP framework provides institutional context (NIH, 2024). Recent Science and Nature Biotechnology pieces emphasize proportional governance, DNA synthesis screening, logging, access controls, and built-in safeguards (Bloomfield et al., 2024; Baker and Church, 2024; Wang et al., 2025).

Workforce, Compute, and Institutional Readiness

The minimum viable AI biology team is cross-disciplinary: biologist, data engineer, ML engineer, wet-lab partner, regulatory engagement when relevant, and governance owner. Compute without experimental judgement creates expensive noise. Bridge2AI and ARPA-H IGoR treat workforce and infrastructure as part of the AI-ready biomedical research agenda (NIH Common Fund Bridge2AI, 2026; ARPA-H IGoR, 2026). Plan compute together with storage, curation, validation experiments, governance, and workforce.

Emerging Frontiers in AI for the Life Sciences

The future-facing claims that matter most are virtual cells, autonomous discovery loops, multimodal biology foundation models, AI-designed therapeutics through approval, population-scale personalised medicine, global access, workforce transition, and the possibility of universal biology models. LucaOne and ChatNT show that nucleic-acid/protein multimodality is advancing, but task support is not the same as universal biology reasoning (He et al., 2025; de Almeida et al., 2025). The useful question is what would have to be true for each frontier to become routine practice.