The Life Sciences AI Handbook: AI for Biomedical Discovery, Biotechnology, and Translational Research

Author: Bryan Tegomoh

Tegomoh, Bryan; [Bryan Tegomoh, MD, MPH](https://bryantegomoh.com/)

Single-Cell Foundation Models

Published

May 24, 2026

Single-cell foundation models learn representations of cells and genes from large expression atlases. Their value depends on transfer to the cell types, perturbations, and assays that matter for a biological decision.

Learning Objectives

Compare single-cell foundation model tasks.
Identify leakage and batch effects in single-cell evaluation.
Use cell context when reading perturbation claims.

TL;DR

Single-cell foundation models are useful representation systems, not general virtual cells. Evaluation must account for cell type, donor, batch, disease state, and perturbation split.
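One way to make the split requirement concrete is to hold out whole groups, such as donors, so that no donor contributes cells to both training and evaluation. The sketch below is illustrative only: the helper name `split_by_group`, the metadata keys, and the toy cells are assumptions, not part of any particular library or of the models discussed here.

```python
import random


def split_by_group(cells, group_key, test_fraction=0.25, seed=0):
    """Hold out whole groups (e.g. donors) so no group spans train and test.

    `cells` is a list of metadata dicts; `group_key` names the field to split on.
    Both are hypothetical conventions chosen for this sketch.
    """
    groups = sorted({cell[group_key] for cell in cells})
    rng = random.Random(seed)
    rng.shuffle(groups)
    # Always hold out at least one group.
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [c for c in cells if c[group_key] not in test_groups]
    test = [c for c in cells if c[group_key] in test_groups]
    return train, test


# Toy metadata: 12 cells from 4 donors processed in 2 batches.
cells = [
    {"cell_id": f"c{i}", "donor": f"d{i % 4}", "batch": f"b{i % 2}"}
    for i in range(12)
]

train, test = split_by_group(cells, group_key="donor")
train_donors = {c["donor"] for c in train}
test_donors = {c["donor"] for c in test}
print("held-out donors:", sorted(test_donors))
print("donor overlap:", sorted(train_donors & test_donors))
```

Splitting on one factor does not separate the others. In this toy data the held-out donor still shares a batch with training donors, so a batch-level claim would need a second split with `group_key="batch"`.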

Introduction

Geneformer was pretrained on large collections of single-cell gene-expression profiles to support gene-network tasks (Theodoris et al., 2023). scGPT was pretrained on tens of millions of cells and evaluated on cell annotation, perturbation prediction, integration, and related tasks (Cui et al., 2024).

Demonstrated

Demonstrated capabilities include representation learning for cell type annotation, batch integration, perturbation-related tasks, and gene network analysis, all in benchmark settings. Geneformer showed that pretrained representations transfer to downstream tasks when task-specific data are limited (Theodoris et al., 2023). scGPT showed that generative pretraining supports single-cell multi-omics tasks (Cui et al., 2024).

Evidence Anchor	What It Supports	Practical Constraint
Geneformer	Gene and cell representation learning	Transfer depends on biological and dataset proximity
scGPT	Single-cell multi-omics pretraining	Evaluation split determines credibility
GEARS	Perturbation prediction	Graph priors and context affect generalization

Theoretical

Theoretical capabilities include cell-state models that forecast responses to novel perturbations across donors, tissues, and disease states. Reaching this would require perturbation data and validation that go beyond atlas pretraining.
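Whether a perturbation counts as novel depends on what the training set already contained. The following sketch shows one possible audit that grades each test perturbation by how much of it was seen in training. The function name, the `GENE1+GENE2` label convention, and the gene labels in the toy lists are all assumptions made for illustration. Control conditions are not handled.

```python
def audit_perturbation_overlap(train_perts, test_perts):
    """Grade each test perturbation by its overlap with training perturbations.

    Perturbations are strings such as 'GENE' or 'GENE1+GENE2' (an assumed
    convention). Gene order within a combination is ignored.
    """
    def canonical(pert):
        return frozenset(pert.split("+"))

    train_set = {canonical(p) for p in train_perts}
    train_genes = set().union(*train_set)

    report = {}
    for pert in test_perts:
        genes = canonical(pert)
        seen = genes & train_genes
        if genes in train_set:
            report[pert] = "exact perturbation seen"
        elif seen == genes:
            report[pert] = "all target genes seen"
        elif seen:
            report[pert] = "some target genes seen"
        else:
            report[pert] = "unseen"
    return report


# Toy labels for illustration only; not drawn from any real screen.
train_perts = ["KLF1", "GATA1", "KLF1+GATA1", "TP53"]
test_perts = ["KLF1", "GATA1+TP53", "KLF1+MYC", "MYC+CEBPA"]

report = audit_perturbation_overlap(train_perts, test_perts)
for pert, status in report.items():
    print(f"{pert}: {status}")
```

Reporting performance separately for each overlap grade keeps results on fully unseen perturbations from being averaged together with results on perturbations the model has effectively seen.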

Beyond Current Capabilities

A general virtual cell that reliably predicts full cellular behavior across all contexts remains beyond current capabilities. Current systems usually model measured expression outputs, not the full set of cellular mechanisms.

Practice Notes

Use donor, batch, tissue, and time splits where relevant.
Benchmark against simple gene-level and cell-type baselines.
Audit whether target genes or similar perturbations were present in training.
Report uncertainty and failure cases for rare cell populations.
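The baseline and rare-population notes above can be combined in a small report: compare a model's per-cell-type recall against a majority-label baseline, so that a failure on a rare population is not hidden by overall accuracy. This is a minimal sketch. The labels and the `model_preds` list are made-up placeholders standing in for a real model's output.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return the most frequent training label, to be predicted for every cell."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(true_labels, predicted_labels):
    """Fraction of all cells labeled correctly."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


def per_class_recall(true_labels, predicted_labels):
    """Fraction of cells of each true type that were labeled correctly."""
    total = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / total[label] for label in total}


# Toy labels: a common type, a less common type, and one rare type.
train_labels = ["T cell"] * 8 + ["B cell"] * 3 + ["pDC"] * 1
true_labels = ["T cell"] * 6 + ["B cell"] * 3 + ["pDC"] * 1
# Hypothetical model output: misses one B cell and the single pDC.
model_preds = ["T cell"] * 6 + ["B cell"] * 2 + ["T cell"] + ["T cell"]

baseline_label = majority_baseline(train_labels)
baseline_preds = [baseline_label] * len(true_labels)

model_recall = per_class_recall(true_labels, model_preds)
baseline_recall = per_class_recall(true_labels, baseline_preds)

print("overall accuracy (model, baseline):",
      accuracy(true_labels, model_preds), accuracy(true_labels, baseline_preds))
for label in model_recall:
    print(f"{label}: model={model_recall[label]:.2f} baseline={baseline_recall[label]:.2f}")
```

In this toy case the model beats the baseline on overall accuracy yet recovers none of the rare population, which is the kind of failure case the per-type breakdown is meant to surface.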