The Life Sciences AI Handbook: Steering Frontier Models in Biology

Tegomoh, Bryan

Emerging Frontiers in AI for the Life Sciences

Published

July 7, 2026

The hard part of writing about AI’s future in the life sciences is separating accumulated evidence from forecast. This chapter organizes seven directions by evidence status: demonstrated in partial form, theoretical but plausible, and beyond what current methods support. The goal is not prediction; it is a record of what the field could credibly claim in 2026.

Learning Objectives

Use this chapter to:

Separate near-term frontiers from forecast-driven claims.
Track the evidence needed for virtual cells, autonomous labs, multimodal models, and AI-designed interventions.
Read frontier work against the handbook’s demonstrated, theoretical, and beyond-current-capability frame.

Chapter Summary (TL;DR)

Summary: The frontiers in this chapter are not moving at the same speed. Virtual cells, autonomous discovery loops, multimodal biology models, AI-designed therapeutics, population-scale personalization, global access, and workforce change each need a different evidence standard.

Key point: A frontier is not a conclusion. The useful question is what would have to be true for the frontier to become routine practice, and whether that evidence exists today.

Bottom line: Track frontiers with the same discipline used for current systems: named conditions, prospective evidence, independent evaluation, and clear limits. Do not treat a forecast as a result.

Introduction

The genre of “future of AI in [domain]” is unusually dangerous because it rewards confident prediction over calibrated uncertainty, and confident prediction about AI timelines has a poor recent track record. The autonomous-driving timeline of the 2010s, the early clinical-AI deployments of the 2020s, and the recurring “drug discovery in two years” claims of every venture cycle should make any practitioner cautious about specific date predictions.

The structure is simple: separate what is happening, what could happen, and what should not yet be assumed. The chapter follows the project’s standard tier convention used in the Evaluation Principles chapter:

Demonstrated frontiers have partial working evidence, not full delivery. They are the first areas to track.
Theoretical frontiers are plausible from current evidence but have not yet been shown end to end. They are the most contested zone, where credible disagreement is healthy.
Beyond-current-capability frontiers should be treated with deep scepticism. They may eventually materialise, but capability claims in this tier today are claims, not results.

The aim is calibration. A practitioner in 2026 should be able to read this chapter and know which frontiers to track, which to engage with, which to defer, and which to ignore. A practitioner in 2030 should be able to read this chapter and see, item by item, where the calibration was right and where it was wrong.

What is demonstrated?

Virtual cells (cell-type-specific)

A virtual cell is a computational model that predicts how a cell will respond to perturbations: drugs, genetic edits, signalling inputs, environmental changes. The community roadmap (Bunne et al., 2024) frames the virtual cell not as a single model but as a multi-scale, multi-modal predictive system: gene expression, protein activity, metabolism, signalling, structure, behaviour, all coupled. The Virtual Cell Challenge puts that ambition into a benchmark frame rather than leaving it as a slogan (Roohani et al., 2025).

What exists in 2026: cell-type-specific perturbation models (Geneformer, scGPT, scFoundation lineage), pathway-specific simulation models, and several single-cell foundation models that capture useful representations across cell types. What does not exist: a general-purpose virtual cell that handles arbitrary perturbations on arbitrary cells with the reliability that AlphaFold 2 brought to structure prediction. The Perturbation Prediction and Virtual Cells chapter treats the current state in detail.

What would have to be true for a general virtual cell to materialise: orders-of-magnitude more perturbation data than currently exists, a foundation-model architecture that integrates the multi-scale data without losing biology, and an evaluation regime as rigorous as CASP was for structure. None of these is impossible; none is close to in place.

End-to-end autonomous discovery loops

A closed-loop discovery system proposes experiments via a model, executes them via robotic systems, analyses results via a model, and retrains. Materials science has working examples; biology has narrower ones (microbiome optimisation, enzyme engineering, certain antibody affinity maturation campaigns). The Self-Driving Laboratories and Agentic Science Workflows chapters track the current evidence.

What exists: autonomous loops in well-defined biological optimisation tasks. What does not exist: autonomous loops that produce clinically relevant biology without ongoing scientist supervision. The bottleneck is partly the integration of robotic execution with biological assays at clinical complexity, and partly the unsolved question of whether agentic systems make the kind of judgment calls that determine whether a piece of biology is interesting.

The evidence base justifies a narrow claim: closed loops can optimise constrained experimental spaces. Ada (MacLeod et al., 2020), the mobile robotic chemist (Burger et al., 2020), Coscientist (Boiko et al., 2023), and Virtual Lab (Swanson et al., 2025) each show a bounded version of that pattern, with different hardware and model layers. A 2022 review of autonomous chemical experimentation makes the constraint explicit: chemical production, characterisation, and exception handling remain central even when the optimisation policy is automated (Seifrid et al., 2022).
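The propose, execute, analyse, and retrain cycle described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any system cited here: `simulated_assay` is a hypothetical stand-in for robotic execution, the nearest-neighbour predictor stands in for a real surrogate model, and the experimental space is a made-up set of 21 discrete conditions.

```python
import random


def simulated_assay(x, rng):
    """Hypothetical stand-in for robotic execution: a noisy response that peaks near x = 0.7."""
    return -(x - 0.7) ** 2 + rng.gauss(0, 0.01)


def fit_model(observations):
    """'Retrain' step: predict a candidate's response from its nearest tested neighbour."""
    def predict(x):
        nearest = min(observations, key=lambda obs: abs(obs[0] - x))
        return nearest[1]
    return predict


def propose(model, candidates, tested, rng, explore=0.2):
    """'Propose' step: usually pick the best-scoring untested candidate, sometimes a random one."""
    untested = [c for c in candidates if c not in tested]
    if rng.random() < explore:
        return rng.choice(untested)
    return max(untested, key=model)


def run_closed_loop(rounds=15, seed=0):
    rng = random.Random(seed)
    # Constrained experimental space (assumed): 21 discrete conditions between 0 and 1.
    candidates = [i / 20 for i in range(21)]
    # Two seed experiments at the edges of the space.
    history = [(x, simulated_assay(x, rng)) for x in (candidates[0], candidates[-1])]
    for _ in range(rounds):
        model = fit_model(history)                    # analyse and retrain
        tested = {x for x, _ in history}
        x = propose(model, candidates, tested, rng)   # propose
        history.append((x, simulated_assay(x, rng)))  # execute
    best = max(history, key=lambda obs: obs[1])
    return best, history


if __name__ == "__main__":
    best, history = run_closed_loop()
    print(f"best condition {best[0]:.2f} after {len(history)} experiments")
```

The sketch works only because the candidate set is small, the assay returns a single number, and nothing goes wrong during execution. Those are the constraints the narrow claim depends on, and the ones that clinically relevant biology does not yet satisfy.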

What would have to be true: reliable, scalable assays for the biology in question; agents whose reasoning is auditable enough that scientists trust their experimental choices; institutional and regulatory frameworks that handle agent-led decisions at the lab level. The first is biology-bound; the second is AI-bound; the third is governance-bound.

Multi-modal biology foundation models

Current foundation models for biology mostly handle one modality at a time: sequence (ESM-2, ESM-3, Evo), structure (AlphaFold lineage), cells (Geneformer and successors), small molecules (various). AlphaFold 3 (Abramson et al., 2024), Boltz-2, and Chai-1 combine sequence, structure, and small-molecule chemistry; ESM-3 combines sequence, structure, and function annotations. LucaOne and ChatNT illustrate the nucleic-acid/protein bridge, with LucaOne framing a unified nucleic-acid and protein language model and ChatNT framing a conversational agent over DNA, RNA, and protein tasks (He et al., 2025; de Almeida et al., 2025).

What exists: working two- and three-modality models. What does not exist: a model that handles sequence, structure, cell, tissue, organism, and behaviour in one representation. Whether such a model is even desirable is contested; biology may benefit more from specialised models that interoperate than from one universal model.

What would have to be true: pre-training data that span the modalities with adequate coverage, architectures that handle scale heterogeneity across modalities, and evaluations that test integrated reasoning rather than per-modality performance. Some progress on all three is visible; convergence is not.

What is theoretical?

AI-designed therapeutics through regulatory approval

AI-discovered, AI-optimised, and AI-designed candidates are in clinical trials in 2026. Several pharmaceutical companies report AI involvement at multiple steps of the discovery pipeline. A first analysis of AI-discovered drugs in clinical trials found early clinical-stage activity but also cautioned that the evidence base is young and subject to selection effects (Jayatunga et al., 2024). No such candidate has yet completed a Phase 3 pivotal trial and received FDA approval primarily on the strength of an AI-generated molecule.

The bottleneck is not AI capability. The bottleneck is the same translational gap that limits all of drug discovery: efficacy in the right patient population, safety at therapeutic doses, manufacturability at scale, and a commercial model that works. AI improves the early stages (hit discovery, lead optimisation, structure prediction); the later stages (clinical efficacy, manufacturing, regulatory) are governed by biology and economics that no model accelerates by much. The Translational Evidence and Failure Modes chapter treats the gap in detail.

What would have to be true: enough AI-derived candidates moving through trials to produce statistical evidence of improved success rates, regulatory frameworks that accept AI-derived data as primary evidence, and patient stratification approaches that exploit AI-derived precision in selecting who benefits. All three are plausible; none is yet established.

Personalised medicine at population scale

The vision is variant-level interpretation for every person in a population, integrated with phenotype, used to inform care. Partial evidence exists: AlphaMissense (Cheng et al., 2023) classifies approximately 89% of human missense variants as likely benign or likely pathogenic at research-grade quality. Polygenic risk scores are now used in some clinical settings, but responsible-use reviews warn that clinical utility, communication, and ancestry performance must be assessed before deployment (Polygenic Risk Score Task Force of the International Common Disease Alliance, 2021). Pharmacogenomic guidance is increasingly automated.

What does not exist: clinical-quality interpretation for most variants of unknown significance, integration of variant-level prediction with phenotype at the level of routine care, and the equity infrastructure to ensure that personalised medicine is not personalised only for those whose ancestry is over-represented in training data. The third is the under-discussed risk; AI personalisation built on a non-representative reference panel will entrench inequities, not narrow them (Sirugo et al., 2019; Martin et al., 2019).

What would have to be true: regulatory and clinical pathways for AI-generated variant interpretations to inform care; functional-validation data at scale to ground the predictions; reference data that represent global genetic diversity. Movement is visible on all three, slowly.

Global access narrowing or widening

In 2026, the question of whether AI in life sciences narrows global health inequities or widens them is genuinely open. The narrowing case: open-weight foundation models, accessible cloud compute, and shared benchmarks let well-resourced research groups in low- and middle-income countries (LMICs) operate near the global frontier on questions of local relevance (neglected diseases, regional pathogen surveillance, climate-driven health). The widening case: compute and data concentration in a small number of institutions creates a dependence on infrastructure most countries cannot replicate, and AI-personalised medicine built on under-representative reference data exports inequities that take decades to correct.

Both trajectories are visible. The outcome depends on policy and funding decisions more than on AI capability: open-data norms, compute access programmes, training pipelines for LMIC AI-in-biology researchers, and the way frontier labs handle data sovereignty. The roadmap for responsible ML in healthcare (Wiens et al., 2019) and WHO’s health AI guidance (WHO, 2021) frame the equity question for clinical AI and apply, with adjustments, to the broader life-sciences case.

What would have to be true for narrowing to win: sustained policy commitment to open-data and open-weight norms; targeted funding for LMIC AI-in-biology capacity; reference datasets that represent global diversity. None is automatic.

Workforce transition

Routine hypothesis generation, literature search, structural modelling, candidate ranking, and basic data analysis are increasingly AI-assisted. Wet-lab execution, experimental design judgment, and biological interpretation remain human-led. The net effect on the biological research workforce is uncertain. Plausible outcomes range from the most productive ten percent of researchers becoming substantially more productive (and the field consolidating around them) to a broader uplift in productivity that absorbs more biologists into AI-augmented roles than it displaces.

What would have to be true for the broader uplift outcome: education pathways that make AI tools second nature for the next generation of bench scientists; institutional reward structures that credit hybrid wet-dry-AI work; mid-career retraining that does not require leaving the field. The Workforce, Compute, and Institutional Readiness chapter covers the current state.

What is beyond current capability?

A single universal biology foundation model

The aspiration of one foundation model that handles sequence, structure, cell, tissue, organism, and behaviour in one representation is theoretically interesting but not a credible near-term direction. Biology’s scale heterogeneity is more extreme than language’s: nanometre-to-metre length scales, microsecond-to-decade time scales, and qualitatively different physics across scales. Universal-model claims should be evaluated against whether they handle this heterogeneity or whether they handle a narrower problem under a generous label.

Fully autonomous wet-lab science without human judgment

Demos exist; reliable production science does not. The genre of “AI ran an experiment and discovered” headlines mostly describes constrained optimisation in materials or microbiology, not open-ended scientific discovery. A general AI scientist that proposes meaningful biology, runs the right experiments, and interprets the results without scientist supervision is beyond current capabilities. Claims to the contrary should be evaluated against what an independent group reproduced, not against what a single team demonstrated under controlled conditions.

Ten-year predictions of which AI architecture will dominate

The current architectures (transformers in various forms, diffusion models for generation) have been dominant for less than a decade. The history chapter shows that each decade’s dominant architecture was largely unanticipated by practitioners in the decade before it. Claims that the current generation will still be dominant in 2036 should be evaluated against this prior, not on the strength of present momentum.

What would make this more promising?

Frontiers become more promising when they move from isolated demonstrations to repeated prospective results under independent evaluation. For this chapter, the strongest signal would be a frontier satisfying its named conditions: virtual-cell perturbation prediction across cell types, autonomous loops that reproduce across labs, AI-designed therapeutics with Phase 3 success, or population-scale variant interpretation with equitable performance.

What should researchers, biotech teams, funders, and program leaders do with this?

Decisions today that hold up across most futures:

Build on durable foundations, not on the current generation’s architectures. The functions (structure prediction, perturbation modelling, sequence search, generative design) will outlast the named systems. Invest in the workflow patterns, not the brand-name tool.
Choose frontiers to track explicitly. A laboratory cannot engage with every frontier. Pick two or three that match your scientific question and your roadmap for the next five years, and track them with the same discipline you apply to your own field.
Engage at the partially demonstrated frontier, not at the theoretical or beyond-current-capability frontier. Working with virtual-cell methods for your cell type, autonomous loops for your assay class, or multi-modal models for your modality is reasonable. Building a programme around fully autonomous science or universal foundation models is not.
Insist on the same evidence discipline for forward-looking claims as for retrospective ones. The Evaluation Principles chapter applies to “we are about to do X” as much as to “we have done X.” Blind benchmarks, biology-aware splits, prospective validation, and DOME-compliant reporting are the test.
Address the equity question explicitly. Personalised medicine, global access, and workforce questions are not subsidiary to the technical frontiers; they are the questions that determine whether the technical frontiers produce broad benefit or concentrate it.
Use this chapter as a timestamp. In 2030, note which frontiers materialised, which stalled, and which collapsed into hype. The exercise calibrates judgment about the next round.
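Of the evidence tests named in the list above, the biology-aware split is the easiest to make concrete. The sketch below is one simple form of it, a grouped split, and makes no claim to be the procedure used in the Evaluation Principles chapter. The `family` label, the toy records, and the function name `group_split` are all hypothetical; in practice the grouping might come from sequence-identity clusters, cell lines, or donors.

```python
import random
from collections import defaultdict


def group_split(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that no group appears in both train and test."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)
    # test_fraction applies to groups, not to individual records.
    n_test = max(1, round(test_fraction * len(group_ids)))
    test = [r for g in group_ids[:n_test] for r in groups[g]]
    train = [r for g in group_ids[n_test:] for r in groups[g]]
    return train, test


# Hypothetical toy data: 10 sequences in 5 made-up families.
records = [{"id": f"seq{i}", "family": f"fam{i % 5}"} for i in range(10)]
train, test = group_split(records, "family")

# Leakage check: no family is shared between the two sets.
shared = {r["family"] for r in train} & {r["family"] for r in test}
print(f"{len(train)} train, {len(test)} test, shared families: {len(shared)}")
```

A random per-record split would place members of the same family on both sides and overstate performance on new biology. Holding out whole groups is the minimum that a forward-looking claim should be tested against.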