The Life Sciences AI Handbook: Steering Frontier Models in Biology

Tegomoh, Bryan

Foundation Models for Biology

Published

July 7, 2026

Foundation models for biology share one defining pattern: pretrain on a large biological corpus with a self-supervised objective, then adapt to many downstream tasks. The pattern has produced ESM-2 and ESM3 for proteins; scGPT, Geneformer, and scFoundation for cells; Evo, Evo 2, and Nucleotide Transformer for genomes; and the AlphaFold lineage for structures. Each modality has its own training data, architectures, evaluation conventions, and failure modes. The evidence supports pretraining for representation and selected transfer tasks; the extent of useful transfer depends on the biological question and the validation discipline applied to the claim.

Learning Objectives

Use this chapter to:

Explain how pretraining on biological sequences, structures, cells, genomes, and multimodal records changes representation learning across biology.
Recognise that a foundation model is not automatically a reasoning system; its value depends on pretraining data, task fit, baselines, calibration, and validation outside the training distribution.

Prerequisites: AI for the Life Sciences for the chapter reference and utility standard; Evaluation Principles for Life Sciences AI for the split discipline that applies equally to every foundation model class.

Chapter Summary (TL;DR)

Summary: Pretraining on biological sequences, structures, cells, genomes, and multimodal records changes representation learning across biology. Protein and structure models have the strongest evidence base, while cell, genome, tissue, and multimodal biology models are moving quickly but remain more context-bound.

Key point: A foundation model is not automatically a reasoning system; its value depends on pretraining data, task fit, baselines, calibration, and validation outside the training distribution. Open question: the boundary between useful representation learning and dependable biological prediction across contexts.

Bottom line: Foundation models sit underneath molecular design, genomics, single-cell work, spatial biology, therapeutics, and agentic laboratory systems.

Field Guide

What is this field trying to solve? How pretraining on biological sequences, structures, cells, genomes, and multimodal records changes representation learning across biology.

What is the core idea? A foundation model is not automatically a reasoning system; its value depends on pretraining data, task fit, baselines, calibration, and validation outside the training distribution.

What is the current state of the field? Protein and structure models have the strongest evidence base, while cell, genome, tissue, and multimodal biology models are moving quickly but remain more context-bound.

What do we know, and what remains open? Known reference points include ESM, ESMFold, AlphaFold, Evo, Nucleotide Transformer, scGPT, Geneformer, scFoundation, AlphaGenome, UniRef, PDB, CELLxGENE, GenBank, and modality-specific benchmarks. What remains open is the boundary between useful representation learning and dependable biological prediction across contexts.

Why does this matter? Foundation models sit underneath molecular design, genomics, single-cell work, spatial biology, therapeutics, and agentic laboratory systems.

Introduction

The biology foundation model wave started with proteins. ESM-1b (Rives et al., 2021) showed that pretraining a transformer on 250 million protein sequences with a masked language modelling objective produced representations that transferred to many downstream tasks (secondary structure, contact prediction, fitness). ESM-2 (Lin et al., 2023) scaled this to 15 billion parameters and demonstrated that the sequence-only language model could predict atomic-resolution structures without an explicit alignment step (ESMFold). ESM3 (Hayes et al., 2025) extended the lineage to multimodal generation across sequence, structure, and function tokens.

The single-cell side followed. scGPT (Cui et al., 2024) and Geneformer (Theodoris et al., 2023) trained on tens of millions of single cells with the explicit goal of producing general-purpose cell representations. scFoundation (Hao et al., 2024) extended the scale further. Nicheformer then moved the corpus from dissociated cells alone toward joint single-cell and spatial transcriptomics (Tejada-Lapuerta et al., 2025).

The genome side came next. Nucleotide Transformer (Dalla-Torre et al., 2025) trained foundation-style transformers on diverse human and multi-species genome data. Evo (Nguyen et al., 2024) trained a 7-billion-parameter model on prokaryotic and phage genomes at single-nucleotide resolution. Evo 2 (Brixi et al., 2026) extended this to a 40-billion-parameter model trained on more than 9 trillion nucleotides across all domains of life. GET focused the regulatory grammar problem on transcription across human cell types (Fu et al., 2025), while Orthrus brought RNA-specific evolutionary contrastive pretraining into the peer-reviewed foundation-model literature (Fradkin et al., 2026).

The structure side has its own lineage covered in Protein Structure Prediction.

Each modality has its own evaluation conventions, its own failure modes, and its own pace of maturation. The shared concept (pretrain on a large biological corpus, then transfer) is real and useful. The shared overclaiming pattern (every modality has versions of “our foundation model achieves SOTA on every downstream task we tried”) is also real and requires the evaluation discipline that this handbook returns to repeatedly.

What is demonstrated?

Protein language models

The ESM lineage, begun at Meta FAIR and continued at EvolutionaryScale with ESM3, is the clearest demonstrated success of the biology foundation model pattern.

ESM-1b (Rives et al., 2021) trained on 250 million UniRef50 sequences with masked language modelling. The PNAS paper demonstrated that the learned representations captured secondary structure, contact, and fitness signal without explicit supervision, establishing that biological structure emerges from self-supervised sequence pretraining.
ESM-2 (Lin et al., 2023) scaled the same recipe to 15 billion parameters. The Science paper demonstrated ESMFold, which predicts atomic-resolution protein structures from sequence alone with far less compute than MSA-based methods. ESMFold is now widely used for fast structure prediction across large proteomes.
ESM3 (Hayes et al., 2025) extended the lineage to multimodal generation. The Science paper introduced a model that represents sequence, structure, and function tokens in one architecture, and demonstrated experimental synthesis and validation of esmGFP, a designed green fluorescent protein with substantial sequence divergence from natural GFPs.
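The masked language modelling objective named above can be illustrated with a minimal sketch of the data side only: hide a fraction of residues and keep the originals as prediction targets. The 15% mask fraction follows the common BERT convention and is an assumption here, as are the `<mask>` token name and the function name `mask_sequence`; the example sequence is arbitrary and the real ESM training pipeline is not reproduced.

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical token name, for illustration only


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Return (masked_tokens, targets) for one protein sequence.

    targets maps each masked position to the residue the model must recover.
    mask_fraction=0.15 is an assumed, BERT-style default.
    """
    rng = random.Random(seed)
    tokens = list(sequence)
    # Mask at least one residue, but never more than the sequence holds.
    n_masked = min(len(tokens), max(1, round(len(tokens) * mask_fraction)))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]   # remember the true residue
        tokens[pos] = MASK_TOKEN     # hide it from the model input
    return tokens, targets


example = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence
masked, targets = mask_sequence(example)
```

During pretraining the model sees `masked` and is scored only on how well it recovers the residues in `targets`; no structural or functional label is involved, which is why the representations count as self-supervised.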

Adjacent work:

ProtBERT / ProtTrans (Elnaggar et al., 2022) is the BERT-style protein language model from the Rostlab and collaborators; it is widely used as a baseline for downstream protein property prediction.
ProGen2 (Nijkamp et al., 2023) is an autoregressive protein language model trained at scale by Salesforce Research, useful for sequence-conditioned generation.

Single-cell foundation models

The single-cell side has produced several large pretrained models that share architectural similarities and differ in pretraining recipe and downstream interface.

scGPT (Cui et al., 2024) trained on over 33 million single cells with a generative pretraining objective and demonstrated transfer to cell-type annotation, perturbation prediction, multi-batch integration, and gene-network inference.
Geneformer (Theodoris et al., 2023) trained on approximately 30 million single cells with a rank-order gene encoding and demonstrated transfer to network biology tasks including dosage-sensitivity prediction.
scFoundation (Hao et al., 2024) extended the scale to roughly 50 million cells with a different pretraining recipe.
Nicheformer (Tejada-Lapuerta et al., 2025) trained on SpatialCorpus-110M, combining dissociated and spatially resolved transcriptomics. The Nature Methods paper supports spatially aware cell representations for niche and spatial-composition tasks, not a general virtual tissue simulator.
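The rank-order gene encoding mentioned for Geneformer can be sketched roughly as follows, assuming the commonly described recipe: scale each gene's count by a corpus-wide typical value for that gene, then represent the cell as gene names sorted from highest to lowest scaled expression. The gene medians, the example cell, the `max_genes` cap, and the function name `rank_encode` are all illustrative assumptions, not values or code from the paper.

```python
def rank_encode(cell_counts, gene_medians, max_genes=2048):
    """Encode one cell as gene names ordered by normalised expression.

    cell_counts:  dict gene -> raw count in this cell
    gene_medians: dict gene -> corpus-wide median for that gene
                  (hypothetical values in this sketch)
    max_genes:    assumed cap on the encoded sequence length
    """
    scaled = {
        gene: count / gene_medians[gene]
        for gene, count in cell_counts.items()
        if count > 0 and gene_medians.get(gene, 0) > 0  # drop undetected genes
    }
    # Highest scaled expression first; ties broken by name for determinism.
    ordered = sorted(scaled, key=lambda g: (-scaled[g], g))
    return ordered[:max_genes]


cell = {"GAPDH": 120, "ACTB": 90, "TP53": 6, "MYC": 0, "GATA1": 4}
medians = {"GAPDH": 100.0, "ACTB": 60.0, "TP53": 2.0, "MYC": 3.0, "GATA1": 1.0}
tokens = rank_encode(cell, medians)
# tokens == ['GATA1', 'TP53', 'ACTB', 'GAPDH']
```

In this toy cell the highly expressed housekeeping genes (GAPDH, ACTB) fall to the end of the sequence because their counts are unremarkable relative to their own medians, while genes expressed well above their typical level move to the front.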

The scFM lineage is treated in depth in Single-Cell Foundation Models, which also covers SCimilarity as an atlas-search foundation model (Heimberg et al., 2025) and the independent evaluation literature showing that, on some perturbation and transfer tasks, the deep approach does not consistently beat simple baselines (Boiarsky et al., 2024; Ahlmann-Eltze et al., 2025).

Genome foundation models

The genome side is the most recent and currently the most active.

DNABERT (Ji et al., 2021) is the early BERT-style DNA language model reference. It established the pattern of k-mer tokenization and self-supervised pretraining for genome sequence tasks before the current long-context genome foundation model wave.
Nucleotide Transformer (Dalla-Torre et al., 2025) trained foundation-style transformers on diverse human and multi-species genome data, with downstream applications across variant effect prediction, regulatory annotation, and chromatin features.
Evo (Nguyen et al., 2024) is a 7-billion-parameter genomic foundation model trained on prokaryotic and phage genomes at single-nucleotide resolution. The Science paper demonstrated transfer across DNA, RNA, and protein modalities through the underlying genome, including zero-shot prediction and generation tasks.
Evo 2 (Brixi et al., 2026) is the 40-billion-parameter successor trained on more than 9 trillion nucleotides spanning all domains of life, with a 1-megabase context window and open model parameters, training code, inference code, and OpenGenome2 data.
GET (Fu et al., 2025) models transcriptional regulation across human cell types. It belongs beside regulatory sequence models, not beside generic cell-state embedding models, because the output is regulatory grammar and expression prediction.
Orthrus (Fradkin et al., 2026) is an RNA foundation model trained with evolutionary and isoform-based contrastive objectives. Its evidence supports property prediction and representation learning for mature (spliced) RNA, not end-to-end therapeutic RNA design.
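The k-mer tokenization pattern that DNABERT established can be sketched in a few lines: slide a window of length k along the sequence and emit each window as a token. The default of k=6 with stride 1 (fully overlapping windows) and the function name `kmer_tokenize` are assumptions for illustration, not the library's API.

```python
def kmer_tokenize(sequence, k=6, stride=1):
    """Split a DNA sequence into k-mers taken every `stride` bases.

    stride=1 gives fully overlapping k-mers. Sequences shorter than k
    yield no tokens.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


tokens = kmer_tokenize("ATGCGTAC", k=3)
# tokens == ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
```

With overlapping windows, a sequence of length n becomes n - k + 1 tokens, so context length in tokens stays close to context length in bases. The single-nucleotide models listed above (Evo, Evo 2) avoid this step by treating each base as its own token.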

These are early-stage models in the sense that the evaluation conventions for genome-scale foundation models are still being established. Independent reproduction across teams and across organism families will determine which capability claims hold.

Protein structure as a foundation model

The AlphaFold lineage (Jumper et al., 2021; Abramson et al., 2024) fits the foundation model pattern: pretraining on a large structural corpus (PDB plus distilled training data), then adaptation to many downstream tasks (single-chain prediction, complex prediction, ligand-aware prediction, variant effect prediction via AlphaMissense). The AlphaFold-specific architecture (triangle attention, structure module, diffusion head in AF3) carries strong structural inductive biases; the pretrain-then-transfer reading is nevertheless apt. Detailed treatment is in Protein Structure Prediction and Variant Effect Prediction.

Evidence anchor summary

Evidence Anchor	What It Supports	Practical Constraint
ESM-1b	Self-supervised sequence pretraining captures biological signal	Per-residue confidence not native to early ESM
ESM-2	15B-parameter protein language model with fast structure prediction	Speed traded for some accuracy vs MSA-based methods
ESM3	Multimodal generative protein language model with esmGFP demonstration	Generalisation under external reproduction still being mapped
ProtBERT / ProtTrans	Widely used protein-language-model baseline	Older recipe; superseded for many tasks
ProGen2	Autoregressive protein sequence generation	Sequence-only; structural realism is downstream
scGPT	33M-cell generative pretraining with multi-task transfer	Linear baselines often match for perturbation tasks
Geneformer	30M-cell rank-order pretraining	Useful for network biology; same caveats as scGPT
scFoundation	Larger-scale single-cell pretraining recipe	Same baseline caveats as scGPT
DNABERT	Early DNA language-model pattern	Shorter context and older recipe
Nucleotide Transformer	Genome foundation model with broad task evaluation	Evaluation conventions still maturing
Evo	7B-parameter genome foundation model spanning DNA, RNA, protein	Prokaryotic and phage pretraining; eukaryotic transfer partial
Evo 2	40B-parameter all-domains-of-life model	Peer reviewed; independent reproduction still pending
GET	Transcription foundation model across human cell types	Regulatory grammar, not whole-cell prediction
Orthrus	Foundation model for mature (spliced) RNA	RNA properties, not complete RNA therapeutic behavior

What is theoretical?

Several capabilities are plausible but not yet routine.

Cross-modality biology foundation models. A model that simultaneously handles sequence, structure, cells, and tissues would change downstream workflows. ESM3 combines sequence, structure, and function for proteins. Evo combines sequence-level inference across DNA, RNA, and protein for genomes. AlphaFold 3 combines proteins, nucleic acids, and ligands for structures. A model that genuinely spans the molecule-cell-organism scale is plausible but not demonstrated.

Foundation models with true out-of-distribution transfer at production quality. Most published transfer is near-distribution. Out-of-distribution transfer (novel protein families, novel tissues, novel species, novel perturbation classes) is partial and often weak. Improving this is an active research area; the framing for now is that pretraining is a meaningful prior, not a guarantee.

Production-grade inference at organism-scale context. Evo 2’s 1-megabase context window is the current frontier for genome-context models. Inference at full-genome context with production-grade latency is plausible but not routine; current deployments focus on local or regional context.

Foundation models as standard scientific infrastructure. SCimilarity for atlas search, AlphaFold Database for structures, and AlphaMissense outputs for variant interpretation are concrete examples of foundation models functioning as infrastructure rather than as research projects. Generalising this pattern to other modalities (genome foundation models as standard variant-interpretation tools, single-cell foundation models as standard atlas-search tools) is plausible and in active deployment.

Causal and mechanistic interpretability. Work on attention patterns, probing studies, and mechanistic interpretability for biology foundation models is in its early phase. Whether these models can be made interpretable enough to generate testable mechanistic hypotheses is an open research question.

What is beyond current capability?

A few framing claims are not supported by current evidence.

A single foundation model solves all of biology. No such model exists, and the heterogeneity of biological evaluation makes one unlikely. Different modalities (sequence, structure, cells, tissues) have different ground truth, different cost structures, and different error tolerances. A model that excels at protein structure prediction tells you little about its single-cell perturbation performance, and vice versa.

Pretraining replaces task-specific evaluation. It does not. Every transfer claim requires its own biology-aware split, its own baseline comparison, and its own validation. The published “foundation model achieves SOTA” pattern often hides weak baselines or near-distribution evaluation.

Foundation models replace experimental biology. They do not. Pretraining captures patterns in the training distribution; biological measurement is the ground truth. The cells, proteins, and genomes the model has not seen are the biology you do not yet know.

Cross-modality multi-task models will consistently outperform single-modality specialists. They have not yet done so on published benchmarks. Most successful current models are modality-specialised; multimodal foundation models are an active research frontier rather than a default best choice.

What would make this more promising?

Foundation-model claims would strengthen if independent groups reproduced transfer gains on biology-aware splits across protein families, donors, tissues, species, and genome regions while beating simple baselines on the same tasks. Cross-modality claims would need prospective experimental evidence that a shared representation improves decisions that single-modality systems miss. Infrastructure claims would need stable model versions, training-corpus documentation, open evaluation sets, and failure reports that make performance comparable across releases.

What should researchers, biotech teams, funders, and program leaders do with this?

For researchers and program leaders working with biology foundation models:

Match the modality to the task. Protein language models are mature for protein tasks. Single-cell foundation models are only partly validated for cell tasks. Genome foundation models are early for genome tasks. AlphaFold and successors are mature for structure tasks. Generic “foundation model” framing is less useful than modality-specific framing.
Always run the baseline. PCA plus linear regression for cells. Sequence-homology baselines for proteins. K-mer baselines for genomes. A published comparison against other deep methods is not enough on its own.
Read the pretraining distribution. A genome foundation model trained on prokaryotes transfers differently to eukaryotes. A single-cell foundation model trained mostly on blood transfers differently to liver. Pretraining coverage propagates to every downstream claim.
Apply the right biology-aware split. Sequence-family for proteins. Donor-level or tissue-level for cells. Time-split and species-split for genomes. Random splits inflate performance everywhere.
Track institutional lineage. ESM, scGPT, Geneformer, Evo, Nucleotide Transformer, AlphaFold each come from specific groups with specific evaluation conventions. The institutional context shapes which claims to trust and which to investigate further.
Treat preprint releases as Theoretical until peer review. Evo 2 has now cleared peer review in Nature. AlphaProteo remains an arXiv preprint with restricted code. Preprint releases are signals; peer review and independent reproduction move them into the Demonstrated tier.
Separate capability from release gating. Frontier general-purpose models used as biology assistants are increasingly evaluated under safety frameworks that connect biological or CBRN capability thresholds to deployment mitigations, access controls, and model-weight security. OpenAI’s Preparedness Framework tracks Biological and Chemical capabilities; Anthropic’s Responsible Scaling Policy and ASL-3 safeguards address chemical and biological weapons misuse; Google DeepMind’s Frontier Safety Framework uses tracked and critical capability levels for CBRN risk (OpenAI, 2025; Anthropic, 2026; Google DeepMind, 2026). A gated model is not necessarily scientifically weaker; it may reflect a release-safety decision rather than an ordinary biology benchmark result.
Document version and provenance. The pretraining corpus, the model version, the fine-tuning recipe, and the random seed all matter for reproducibility. Foundation model claims that omit these details should be treated as preliminary.
Compose foundation models with task-specific evidence. A foundation model output is one piece of evidence in a larger decision (variant classification, candidate selection, atlas search). The framework into which the model output feeds is at least as important as the foundation model itself.
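The donor-level split recommended in the list above can be sketched as a grouped split: assign whole donors, not individual cells, to the test set, so that no donor contributes cells to both sides. The record layout (`cell_id`, `donor`), the toy data, and the function name `donor_level_split` are assumptions for illustration; scikit-learn's `GroupShuffleSplit` offers a library version of the same idea.

```python
import random


def donor_level_split(cells, test_fraction=0.25, seed=0):
    """Split cell records so every donor lands wholly in train or in test.

    cells: list of dicts, each with a "donor" key (hypothetical layout).
    """
    donors = sorted({cell["donor"] for cell in cells})
    if len(donors) < 2:
        raise ValueError("need at least two donors for a donor-level split")
    rng = random.Random(seed)
    rng.shuffle(donors)
    # Hold out at least one donor, and always leave at least one for training.
    n_test = min(len(donors) - 1, max(1, round(len(donors) * test_fraction)))
    test_donors = set(donors[:n_test])
    train = [c for c in cells if c["donor"] not in test_donors]
    test = [c for c in cells if c["donor"] in test_donors]
    return train, test


# Toy data: 20 cells spread evenly over 4 donors.
cells = [{"cell_id": f"c{i}", "donor": f"D{i % 4}"} for i in range(20)]
train_cells, test_cells = donor_level_split(cells)

# Leakage check: the two sides must share no donor.
train_donors = {c["donor"] for c in train_cells}
held_out_donors = {c["donor"] for c in test_cells}
assert train_donors.isdisjoint(held_out_donors)
```

The same pattern applies to the other splits named above by swapping the grouping key: sequence family for proteins, species or collection date for genomes. A random split over cells would place cells from the same donor on both sides, which is the inflation the list warns about.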