Single-Cell Foundation Models
Single-cell foundation models learn representations of cells and genes from large expression atlases. Their value depends on transfer to the cell types, perturbations, and assays that matter for a biological decision.
- Compare single-cell foundation model tasks.
- Identify leakage and batch effects in single-cell evaluation.
- Use cell context when reading perturbation claims.
Single-cell foundation models are useful representation systems, not general virtual cells. Evaluation must account for cell type, donor, batch, disease state, and perturbation split.
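A minimal sketch of a donor-held-out split follows, to make concrete what it means for an evaluation to account for donor. The record fields (`donor`, `batch`, `cell_type`) and the helper names are assumptions made for illustration, not part of any particular toolkit. The toy data deliberately nests batches within donors, so the second check shows how a batch can still be shared across the split after donors have been separated.

```python
import random

def donor_holdout_split(cells, test_fraction=0.25, seed=0):
    """Split cell records so that no donor contributes to both train and test.

    `cells` is a list of dicts with a hypothetical "donor" key.
    """
    donors = sorted({cell["donor"] for cell in cells})
    rng = random.Random(seed)
    rng.shuffle(donors)
    n_test = max(1, round(len(donors) * test_fraction))
    test_donors = set(donors[:n_test])
    train = [cell for cell in cells if cell["donor"] not in test_donors]
    test = [cell for cell in cells if cell["donor"] in test_donors]
    return train, test

def shared_values(train, test, key):
    """Return the values of `key` that occur on both sides of the split."""
    return {cell[key] for cell in train} & {cell[key] for cell in test}

# Toy metadata: 12 cells, 4 donors, 2 batches (each donor sits in one batch).
cells = [
    {
        "cell_id": f"c{i}",
        "donor": f"D{i % 4}",
        "batch": f"B{i % 2}",
        "cell_type": "T cell" if i % 3 else "B cell",
    }
    for i in range(12)
]

train, test = donor_holdout_split(cells)
print("donors on both sides:", shared_values(train, test, "donor"))
print("batches on both sides:", shared_values(train, test, "batch"))
```

The same grouping idea applies to tissue, disease state, or time point: choose the grouping variable that matches the transfer claim being tested, then check the remaining covariates for overlap.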
Introduction
Geneformer was pretrained on large collections of single-cell gene-expression profiles to support gene-network tasks (Theodoris et al., 2023). scGPT was pretrained on tens of millions of cells and evaluated on cell annotation, perturbation prediction, batch integration, and related tasks (Cui et al., 2024).
Demonstrated
Capabilities demonstrated in benchmark settings include representation learning for cell type annotation, batch integration, perturbation-related tasks, and gene network analysis. Geneformer demonstrated transfer learning in settings with limited task-specific data (Theodoris et al., 2023). scGPT applied generative pretraining to single-cell multi-omics tasks (Cui et al., 2024).
| Evidence Anchor | What It Supports | Practical Constraint |
|---|---|---|
| Geneformer | Gene and cell representation learning | Transfer depends on biological and dataset proximity |
| scGPT | Single-cell multi-omics pretraining | Evaluation split determines credibility |
| GEARS | Perturbation prediction | Graph priors and context affect generalization |
Theoretical
Theoretically possible capabilities include cell-state models that forecast responses to novel perturbations across donors, tissues, and disease states. Reaching this would require perturbational data and validation that go beyond atlas pretraining.
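One way to make "novel" concrete is to stratify test perturbations by how many of their target genes were perturbed in training. The sketch below assumes perturbations are represented as tuples of target gene symbols. The function names, stratum labels, and toy gene lists are illustrative assumptions, not a published protocol.

```python
from collections import defaultdict

def novelty_category(perturbation, train_genes):
    """Label a perturbation by the share of its target genes seen in training."""
    seen = sum(gene in train_genes for gene in perturbation)
    return f"{seen}/{len(perturbation)} target genes seen in training"

def stratify_test_set(train_perturbations, test_perturbations):
    """Group test perturbations into novelty strata relative to the training set."""
    train_genes = {gene for p in train_perturbations for gene in p}
    train_exact = {frozenset(p) for p in train_perturbations}
    strata = defaultdict(list)
    for p in test_perturbations:
        if frozenset(p) in train_exact:
            # The identical perturbation was trained on: not a novelty test.
            strata["exact perturbation in training (leak)"].append(p)
        else:
            strata[novelty_category(p, train_genes)].append(p)
    return dict(strata)

# Toy example; gene symbols are placeholders chosen for illustration.
train_perturbations = [("KLF1",), ("GATA1",), ("KLF1", "GATA1")]
test_perturbations = [("KLF1", "TAL1"), ("TAL1",), ("GATA1",), ("TAL1", "LMO2")]

strata = stratify_test_set(train_perturbations, test_perturbations)
for label, members in sorted(strata.items()):
    print(label, members)
```

This sketch checks only target-gene identity. A claim about transfer across donors, tissues, or disease states would need the same stratification applied to those context variables.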
Beyond Current Capabilities
A general virtual cell that reliably predicts full cellular behavior across all contexts remains beyond current capabilities. Current systems usually model measured expression outputs, not all cellular mechanisms.
Practice Notes
- Use donor, batch, tissue, and time splits where relevant.
- Benchmark against simple baselines, such as per-gene mean expression and per-cell-type mean profiles.
- Audit whether target genes or similar perturbations were present in training.
- Report uncertainty and failure cases for rare cell populations.
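The baseline and rare-population notes above can be sketched as a per-cell-type comparison. The baseline here predicts the mean training response for each cell type and falls back to the global mean for unseen types. The data layout, function names, and toy numbers are assumptions for illustration, and the "model" predictions are hand-written placeholders, not output from any real model.

```python
from collections import defaultdict
from statistics import mean

def mean_profile(profiles):
    """Element-wise mean of equal-length expression-change vectors."""
    return [mean(column) for column in zip(*profiles)]

def mse(pred, truth):
    """Mean squared error between two equal-length vectors."""
    return mean([(p - t) ** 2 for p, t in zip(pred, truth)])

def compare_to_baseline(train, test):
    """Report model and mean-baseline error per cell type, with cell counts.

    `train` holds (cell_type, response) pairs; `test` holds
    (cell_type, true_response, model_prediction) triples.
    """
    by_type = defaultdict(list)
    for cell_type, response in train:
        by_type[cell_type].append(response)
    type_means = {ct: mean_profile(rs) for ct, rs in by_type.items()}
    global_mean = mean_profile([response for _, response in train])

    errors = defaultdict(lambda: {"model": [], "baseline": []})
    for cell_type, truth, model_pred in test:
        # Fall back to the global mean for cell types absent from training.
        baseline_pred = type_means.get(cell_type, global_mean)
        errors[cell_type]["model"].append(mse(model_pred, truth))
        errors[cell_type]["baseline"].append(mse(baseline_pred, truth))

    return {
        ct: {
            "n": len(e["model"]),
            "model_mse": mean(e["model"]),
            "baseline_mse": mean(e["baseline"]),
        }
        for ct, e in errors.items()
    }

# Toy data: three genes; the "pDC" test cell stands in for a rare population.
train = [
    ("T cell", [1.0, 0.0, -1.0]),
    ("T cell", [0.8, 0.2, -1.2]),
    ("B cell", [0.0, 1.0, 0.0]),
]
test = [
    ("T cell", [0.9, 0.1, -1.1], [1.0, 0.0, -1.0]),
    ("pDC", [0.0, 0.0, 2.0], [0.5, 0.5, 0.0]),
]

report = compare_to_baseline(train, test)
for cell_type, row in report.items():
    print(cell_type, row)
```

Reporting `n` alongside each error keeps rare populations visible: an estimate from a single cell should be read as a failure case to inspect, not as a stable score.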