Foundation Models for Biology

Published: May 24, 2026

Biological foundation models learn reusable representations from large collections of biological measurements. The practical question is whether those representations transfer to a specific biological decision.

Learning Objectives
  • Distinguish sequence, structure, cellular, and multimodal foundation models.
  • Read claims about scale without confusing scale with external validity.
  • Use task-specific evaluation before adopting a general representation.
TL;DR

A foundation model is useful when pretraining improves a downstream biological task under realistic validation. Model size, modality count, and dataset volume matter less than task transfer, assay fidelity, and external testing.

Introduction

Biology is well suited to representation learning because many of its data types are sequences or otherwise structured: DNA, RNA, and protein sequences, molecular graphs, images, and expression matrices. ESMFold showed that protein language model representations could support atomic-level structure prediction from sequence (Lin et al., 2023). Geneformer and scGPT applied the same pretrain-then-transfer approach to single-cell transcriptomics (Theodoris et al., 2023; Cui et al., 2024).

Demonstrated

Demonstrated capabilities include reusable embeddings and task transfer in published benchmark settings. ESMFold demonstrated fast protein structure prediction from protein language model representations, without requiring a multiple sequence alignment (MSA) as input (Lin et al., 2023). Geneformer demonstrated transfer to gene network tasks when task-specific data are limited (Theodoris et al., 2023). scGPT demonstrated pretraining over large single-cell repositories, with downstream use in cell type annotation, perturbation response prediction, and batch integration (Cui et al., 2024).

| Evidence Anchor | What It Supports | Practical Constraint |
|-----------------|------------------|----------------------|
| ESMFold | Protein language representations linked to structure prediction | Accuracy and confidence differ from MSA-based methods |
| Geneformer | Single-cell transfer learning for gene network tasks | Training data and cell context shape transfer |
| scGPT | Single-cell pretraining across large repositories | Benchmark selection determines apparent gains |

Theoretical

A theoretical capability is a single model that supports molecular design, cell-state forecasting, tissue interpretation, and experiment planning. Current models usually specialize by modality or task family. Multimodal systems are expanding, but whether their representations hold up across biological contexts remains an empirical question.

Beyond Current Capabilities

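Foundation models that infer causal biology from observational pretraining alone remain beyond current capabilities. Observational data can show that two measurements move together, but not whether changing one would change the other. Perturbational data, experimental design, and mechanistic testing remain necessary for causal claims.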
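To make the distinction concrete, here is a minimal toy simulation, not a model of any real dataset. It assumes a hidden regulator `z` that drives two hypothetical genes, A and B, while A has no effect on B. The function names, noise levels, and thresholds are illustrative assumptions. Observationally, A and B look tightly linked; once A is set by intervention, the apparent effect on B disappears.

```python
import random
from statistics import fmean


def simulate(n=5000, set_a=None, seed=0):
    """Draw n toy cells as (a, b) pairs.

    A hidden factor z drives both gene A and gene B. If set_a is given,
    A is fixed to that value (a stand-in for a perturbation), which cuts
    the link between z and A. B never depends on A in this toy world.
    """
    rng = random.Random(seed)
    cells = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)        # unobserved shared regulator
        noise_a = rng.gauss(0.0, 0.5)
        noise_b = rng.gauss(0.0, 0.5)
        a = z + noise_a if set_a is None else set_a
        b = z + noise_b                # B depends on z only
        cells.append((a, b))
    return cells


def pearson(pairs):
    """Pearson correlation of the two coordinates in a list of pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


observed = simulate()
r_observed = pearson(observed)

# Observational contrast: mean B in cells where A happens to be high vs. low.
high = [b for a, b in observed if a > 1.0]
low = [b for a, b in observed if a < -1.0]
observational_gap = fmean(high) - fmean(low)

# Interventional contrast: mean B when A is forced high vs. forced low.
forced_high = [b for _, b in simulate(set_a=2.0, seed=1)]
forced_low = [b for _, b in simulate(set_a=-2.0, seed=2)]
interventional_gap = fmean(forced_high) - fmean(forced_low)

print(f"observational correlation of A and B: {r_observed:.2f}")
print(f"observational gap in B (high A vs. low A): {observational_gap:.2f}")
print(f"interventional gap in B (A forced high vs. low): {interventional_gap:.2f}")
```

A model trained only on the `observed` cells has no information that separates "A regulates B" from "something else regulates both". Only the perturbed samples do.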

Practice Notes

  • Ask what was masked or predicted during pretraining.
  • Compare foundation-model features against simple baselines.
  • Evaluate by biological split, not random row split, when testing transfer.
  • Track whether the model saw related cell types, homologs, assays, or structures during training.
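
The last two notes can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not a prescribed protocol: `split_by_group`, `split_randomly`, and `shared_groups` are hypothetical helper names, and the protein family labels are made up. The idea is to hold out whole biological groups (families, cell types, donors, assays) and then check whether any group appears on both sides of the split.

```python
import random
from collections import defaultdict


def split_by_group(groups, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so that no group label is in both."""
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)
    names = sorted(by_group)               # fixed order before shuffling
    random.Random(seed).shuffle(names)
    n_test = max(1, round(len(names) * test_fraction))
    test_names, train_names = names[:n_test], names[n_test:]
    train_idx = sorted(i for g in train_names for i in by_group[g])
    test_idx = sorted(i for g in test_names for i in by_group[g])
    return train_idx, test_idx


def split_randomly(n_rows, test_fraction=0.25, seed=0):
    """Row-level split that ignores group membership, for comparison."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(n_rows * test_fraction))
    return sorted(idx[n_test:]), sorted(idx[:n_test])


def shared_groups(groups, train_idx, test_idx):
    """Group labels that occur in both train and test (possible leakage)."""
    return {groups[i] for i in train_idx} & {groups[i] for i in test_idx}


# Toy metadata: one hypothetical family label per protein.
families = (["kinase"] * 4 + ["protease"] * 4
            + ["gpcr"] * 4 + ["transporter"] * 4)

g_train, g_test = split_by_group(families)
r_train, r_test = split_randomly(len(families))

print("group split, shared families: ", shared_groups(families, g_train, g_test))
print("random split, shared families:", shared_groups(families, r_train, r_test))
```

By construction the group split never shares a family between train and test, while a random row split usually does. The same `shared_groups` check can be pointed at whatever is known about a model's pretraining corpus, with the caveat that exact label overlap is a weak proxy: homologs or closely related cell states carry different labels and need a similarity-based check instead.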