AI for the Life Sciences
Life sciences AI is the application of machine learning systems to biological objects: sequences, structures, cells, tissues, organisms, and experiments. Its center of gravity is upstream of clinical care: molecules generated before they are synthesized, cells represented before they are perturbed, experiments planned before they are run. The AlphaFold Protein Structure Database makes over 214 million AlphaFold 2 predicted structures freely available to researchers (Varadi et al., 2023). RFdiffusion designs proteins that bind, fold, and function in laboratory experiments (Watson et al., 2023). Coscientist plans and executes chemistry on cloud robotics with minimal human input (Boiko et al., 2023). None of this removes the experiment. The work is choosing the right one.
This chapter is the scope and orientation entry point for the handbook. You will learn to:
- Distinguish discovery AI from clinical AI and public health AI, and identify which evidence standard applies
- Apply the three-tier evidence framework (Demonstrated, Theoretical, Beyond current capabilities) used throughout the handbook
- Map the major model classes (structure prediction, design, foundation models, perturbation prediction, agentic systems) to the biological objects they represent
- Identify the audience this handbook is written for, and the role-specific reading paths
- Recognize the infrastructure layer (NIH Bridge2AI, ARPA-H, CZ Biohub, Arc Institute) that shapes what data and benchmarks exist
- Read frontier-lab announcements (DeepMind, Isomorphic Labs, Anthropic, OpenAI, Recursion, Insilico) with appropriate skepticism about evidence type
- Frame any AI claim around the falsifying experiment
Prerequisites: none for orientation. The rest of the handbook assumes you have read this chapter and the Executive Summary.
Introduction: What This Handbook Is For
October 2024, Stockholm: The Royal Swedish Academy of Sciences awards the Nobel Prize in Chemistry, one half to David Baker for computational protein design and the other half jointly to Demis Hassabis and John Jumper for protein structure prediction with AlphaFold. The prize recognizes a decade of work in which deep learning moved from a tool researchers tried, to a tool researchers depended on, to (in the Committee’s framing) a tool that changed what was possible in structural biology.
May 2024: AlphaFold 3 extends biomolecular prediction beyond proteins to nucleic acids, ions, and ligands (Abramson et al., 2024). The initial release is server-only, with restricted commercial use. The open-source response (Boltz, Chai) arrives within six months.
October 2025: Anthropic launches Claude for Life Sciences with integrations into Benchling, 10x Genomics, PubMed, and Synapse (Anthropic, October 2025). About a year earlier, OpenAI’s o1 had reached 77-78% on GPQA Diamond, exceeding the 69.7% PhD-expert baseline on graduate-level biology, chemistry, and physics questions (OpenAI, September 2024).
May 2026: ARPA-H launches the IGoR program (Intelligent Generator of Research) to “deliver gold-standard biomedical science faster through an AI-powered research ecosystem” focused on Alzheimer’s, Parkinson’s, and autoimmune disease (HHS press release, May 2026).
These are not unrelated events. They mark a change in the default operating model of biomedical research: AI is no longer an experiment to run; it is infrastructure to use. The question for working researchers is no longer “can I use AI” but “which model, for which decision, with what validation.”
This handbook is written for those questions. It is not a tour of every model release. It is a framework for reading model outputs as research inputs, choosing systems by biological question rather than brand, and designing the experiment that turns a prediction into a program decision.
What Life Sciences AI Is, and What It Is Not
The Discovery Layer
Life sciences AI sits in the discovery layer of biomedical research:
Population health
│
▼
Public Health AI ── Patient populations, surveillance, forecasting
│
▼
Clinical practice
│
▼
Clinical AI ── Diagnosis, treatment, workflow, liability
│
▼
Therapeutic development
│
▼
[LIFE SCIENCES AI] ── Molecules, cells, experiments, candidates
│
▼
Fundamental biology
Each layer above depends on inputs from the layer below. A drug candidate exists before a clinical trial; a clinical trial exists before a regulatory decision; a regulatory decision exists before a population-scale deployment. Life sciences AI operates at the molecule-cell-experiment level. Its outputs are inputs to clinical and public health programs.
The Three Adjacent Domains
| Domain | Object of Study | Failure Cost | Evidence Standard |
|---|---|---|---|
| Life sciences AI | Molecules, cells, experiments, research decisions | A failed experiment, a discontinued program, a non-validated paper | Benchmarks + prospective experimental validation |
| Clinical AI | Individual patients, diagnostic and therapeutic decisions | Misdiagnosis, mistreatment, liability | Prospective clinical trials, FDA clearance, real-world performance |
| Public health AI | Populations, surveillance, intervention design | Missed outbreaks, misallocated resources, eroded trust | Deployment-context evaluation, proper scoring rules, equity analysis |
The three are complementary but distinct. A model that predicts a binding interaction is a life sciences AI claim. A model that recommends a treatment for a specific patient is a clinical AI claim. A model that forecasts hospitalizations is a public health AI claim. The same architectural family (transformers, diffusion, graph networks) underlies all three, but the evidence standard depends on the decision the model informs, not the math under the hood.
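The point that the evidence standard follows the decision, not the architecture, can be made concrete with a small lookup. This is a minimal sketch: the evidence standards are copied from the table above, while the decision labels (`"research decision"`, `"patient decision"`, `"population decision"`) and the function name `evidence_standard` are hypothetical names chosen for illustration.

```python
from enum import Enum


class Domain(Enum):
    LIFE_SCIENCES = "life sciences AI"
    CLINICAL = "clinical AI"
    PUBLIC_HEALTH = "public health AI"


# Evidence standards copied from the table above.
EVIDENCE_STANDARD = {
    Domain.LIFE_SCIENCES: "Benchmarks + prospective experimental validation",
    Domain.CLINICAL: "Prospective clinical trials, FDA clearance, real-world performance",
    Domain.PUBLIC_HEALTH: "Deployment-context evaluation, proper scoring rules, equity analysis",
}

# Hypothetical decision labels, introduced only for this illustration.
DECISION_TO_DOMAIN = {
    "research decision": Domain.LIFE_SCIENCES,
    "patient decision": Domain.CLINICAL,
    "population decision": Domain.PUBLIC_HEALTH,
}


def evidence_standard(decision: str, architecture: str = "transformer") -> str:
    """Return the evidence standard for the decision a model informs.

    `architecture` is accepted but deliberately ignored: the standard is
    looked up from the decision type alone.
    """
    return EVIDENCE_STANDARD[DECISION_TO_DOMAIN[decision]]
```

Calling `evidence_standard("patient decision", "diffusion")` and `evidence_standard("patient decision", "graph network")` returns the same string, which is the design choice the sketch is meant to show.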
Companion handbooks in this series cover the clinical, population, and biosecurity layers explicitly. Links and descriptions appear on the welcome page under “Explore the Handbook Series.”
The Three-Tier Evidence Framework
Every chapter in this handbook places capability claims in one of three tiers:
Demonstrated: Supported by published evidence in peer-reviewed venues, official documentation, or reproducible benchmark results. The evidence must be specific to a defined task and dataset. “AlphaFold 2 predicts single-chain protein structures at near-experimental accuracy for the majority of well-folded domains evaluated in CASP14” is a demonstrated claim (Jumper et al., 2021).
Theoretical: Plausible given current methods but not yet established for routine use. The capability has been shown in selected systems, narrow tasks, or controlled settings without proven generalization. “Single-cell foundation models can transfer to new tissues and species” is currently a theoretical claim: published evidence supports transfer in some settings (Cui et al., 2024; Theodoris et al., 2023), but the boundary of useful transfer is an open research question.
Beyond current capabilities: Not supported by credible evidence with current systems. The capability is either aspirational, has been demonstrated in toy settings that do not generalize, or requires evidence that has not been produced. “Fully autonomous drug discovery without experimental validation” is beyond current capabilities. Coscientist demonstrates autonomous chemistry execution in bounded settings (Boiko et al., 2023), not autonomous discovery without measurement.
The point of the tiers is not to be conservative: it is to be specific. A claim that cannot be placed in a tier is not a claim about a system; it is marketing.
Reading Claims with the Framework
When a press release, a paper title, or a vendor pitch makes a capability claim, ask:
1. What biological object is the model representing? Sequence? Structure? Cell state? Tissue? Experiment? Reaction?
2. What is the specific task on which the claim is made? Property prediction? Generation? Ranking? Classification? Planning?
3. What is the evidence? Held-out benchmark performance? Prospective experiment? Cross-laboratory replication? Vendor-reported internal evaluation?
4. What experiment would falsify the claim? And has that experiment been done?
A claim that survives steps 1-4 is at minimum demonstrated for the specific task and dataset. A claim that fails step 4 (no falsifying experiment, or the experiment has not been done) is either theoretical or beyond current capabilities.
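The four questions above can be sketched as a small triage routine. This is an illustration under stated assumptions, not the handbook's formal procedure: the `Claim` fields, the `shown_in_narrow_settings` flag used to separate theoretical from beyond-current-capabilities claims, and the `UNPLACEABLE` outcome for claims with no named object or task are all hypothetical choices made for this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    DEMONSTRATED = "Demonstrated"
    THEORETICAL = "Theoretical"
    BEYOND = "Beyond current capabilities"
    UNPLACEABLE = "Cannot be placed in a tier"


@dataclass
class Claim:
    biological_object: Optional[str]      # question 1
    task: Optional[str]                   # question 2
    evidence: Optional[str]               # question 3
    falsifying_experiment: Optional[str]  # question 4, first half
    experiment_done: bool = False         # question 4, second half
    # Assumption: whether the capability has been shown in selected
    # systems or controlled settings, without proven generalization.
    shown_in_narrow_settings: bool = False


def triage(claim: Claim) -> Tier:
    """Place a claim in a tier using the four questions, in order."""
    # Without a named object and task there is nothing specific to evaluate.
    if not (claim.biological_object and claim.task):
        return Tier.UNPLACEABLE
    # Survives all four questions: evidence exists and the falsifying
    # experiment has been run.
    if claim.evidence and claim.falsifying_experiment and claim.experiment_done:
        return Tier.DEMONSTRATED
    # Fails question 4: either theoretical or beyond current capabilities.
    if claim.shown_in_narrow_settings:
        return Tier.THEORETICAL
    return Tier.BEYOND


# Example built from the CASP14 claim quoted in this chapter.
af2 = Claim(
    biological_object="protein structure",
    task="single-chain structure prediction",
    evidence="CASP14 results (Jumper et al., 2021)",
    falsifying_experiment="blind prediction against experimental structures",
    experiment_done=True,
)
print(triage(af2).value)
```

The routine returns a tier only for the specific task and dataset recorded in the `Claim`; a different task for the same model needs its own record.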
The Capability Landscape
The handbook organizes life sciences AI into six parts. Each part contains several chapters that apply the evidence framework to a specific model class.
Part I: Foundations (this part)
- AI for the Life Sciences (this chapter): Scope and framework
- Biological Data Infrastructure: Bridge2AI, CZ CELLxGENE, Tabula Sapiens, and why AI-ready data is its own problem
- Foundation Models for Biology: Protein language models (ESM), single-cell foundation models (scGPT, Geneformer), genomic foundation models (Evo, Evo 2)
- Evaluation Principles for Biomedical Discovery AI: Held-out benchmarks, prospective validation, distribution shift, calibration
Part II: Molecular AI
- Protein Structure Prediction: AlphaFold lineage and confidence interpretation
- Protein Design and Engineering: RFdiffusion, ProteinMPNN, the design-validate-iterate loop
- Antibody and Biologic Design: Antibody-specific limitations and developability
- Nucleic Acid and Genome Models: Evo, RNA structure, splicing
- Variant Effect Prediction: AlphaMissense and clinical interpretation boundaries
Part III: Therapeutics AI
Part IV: Cellular and Systems Biology
- Single-Cell Foundation Models: scGPT, Geneformer, and what foundation means at single-cell resolution
- Spatial Omics and Tissue Models
- Cell Painting and Image-Based Phenotyping: Bray-2016 protocol and current AI methods
- Perturbation Prediction and Virtual Cells: GEARS, CZI Virtual Cells Platform, and the gap between annotation and prediction
- Microbiome and Multi-Omics AI
Part V: Engineering and Automation
- Self-Driving Laboratories: Closed-loop experimentation
- Robotic Lab Automation and Cloud Labs
- Synthetic Biology Design Tools
- Agentic Science Workflows: Coscientist, Virtual Lab, and what an agent actually owns
Part VI: Practice and Governance
- Benchmarks for Bio AI
- Reproducibility and Open Science
- Information Hazards in Capability Research: Dual-use review
- Workforce, Compute, and Institutional Readiness
The Infrastructure Layer
Models are visible; infrastructure is decisive. The capability gaps in life sciences AI are often data gaps, benchmark gaps, or compute gaps before they are architecture gaps.
Public Programs
- NIH Bridge2AI (NIH Common Fund, Bridge2AI Consortium): Four grand-challenge data generation projects (CHORUS for AI/ML in clinical care, CM4AI for functional genomics, VOICE for precision public health, AI-READI for salutogenesis). The program’s premise: AI-ready datasets are themselves an infrastructure problem that requires metadata, ethics review, quality control, and workforce development, not only more storage.
- ARPA-H IGoR (ARPA-H programs page; HHS press release, May 2026): Intelligent Generator of Research, focused on Alzheimer’s, Parkinson’s, and autoimmune disease. ARPA-H also funds adjacent AI programs: ADVOCATE (cardiovascular AI agents), RAPID (rare-disease AI diagnostics), CATALYST (ADME-tox modeling), ADAPT (precision cancer therapy).
- NCI Cancer Research Data Commons (NCI CRDC): Data infrastructure spanning genomics, proteomics, and imaging that AI work depends on, even when not formally an “AI program.”
Non-Profit and Foundation Programs
- CZ Biohub and CZ CELLxGENE (CZ CELLxGENE Discover): Roughly 100 million curated single-cell observations in a standardized, queryable platform. The Tabula Sapiens collection (1.1M cells from 28 organs, 24 donors) is a benchmark first-draft human cell atlas.
- CZI Virtual Cells Platform (CZI): An active program to build and benchmark foundation models for cell biology.
- Arc Institute: Co-developer (with Stanford, UC Berkeley, UCSF, and NVIDIA) of the Evo and Evo 2 genomic foundation models (Nguyen et al., 2024; Brixi et al., 2025, preprint).
Frontier Labs
The major AI labs each have life-sciences programs at varying degrees of openness:
| Lab | Visible Life-Sciences Work | Evidence Type |
|---|---|---|
| Google DeepMind / Isomorphic Labs | AlphaFold 2/3, AlphaMissense, AlphaProteo; Eli Lilly and Novartis drug-discovery partnerships (Isomorphic Labs, January 2024) | Peer-reviewed for AlphaFold lineage; AlphaProteo is arXiv preprint (Zambaldi et al., 2024, preprint); partnerships are factual but not efficacy evidence |
| Anthropic | Claude for Life Sciences (October 2025), AI for Science Program | Company announcement; no peer-reviewed life-sciences paper as of this writing |
| OpenAI | Color Health cancer-screening copilot (OpenAI, June 2024); Moderna ChatGPT Enterprise deployment (OpenAI, April 2024); o1 model GPQA Diamond performance (OpenAI, September 2024) | Verified partnerships and benchmark results; no peer-reviewed biology paper |
| Meta FAIR / EvolutionaryScale | ESM-2 protein language model (Lin et al., 2023); ESM-3 multimodal (Hayes et al., 2024, preprint) | ESM-2 peer-reviewed; ESM-3 preprint |
AI-Native Drug Discovery Companies
- Recursion Pharmaceuticals (Recursion mission): High-content imaging plus ML. In 2025 the company reported its first AI-enabled clinical proof of concept; clinical candidates include REC-617 (CDK7) and REC-4881. The company also disclosed a pipeline contraction in May 2025.
- Insilico Medicine: Generative chemistry for IPF target TNIK; ISM001-055 reported positive Phase IIa topline (Insilico Medicine, November 2024). Company-reported efficacy; not yet peer-reviewed in a journal.
- Insitro: Machine-learning models for metabolic disease and neuroscience; expanded Eli Lilly small-molecule collaboration in September 2025.
Read these with the framework: a partnership announcement is factual evidence of the partnership; it is not evidence that the AI-discovered molecule will read out positively, advance to Phase III, or change a patient’s outcome.
Who This Handbook Is For
The handbook is written for several overlapping audiences. The shared question is: when does an AI output deserve experimental attention?
| Role | What You Need From This Handbook |
|---|---|
| Computational biologist | Capability tier for each model class; what the failure modes are; how to design a benchmark that reflects your actual question |
| Biotechnology team lead | Build-vs-buy framing; license diligence; what the open-source alternatives are when a frontier release is restricted |
| Drug discovery scientist | Where AI shifts a stage gate vs. where it does not; how to read a vendor pitch against published evidence |
| Physician-scientist | Translation between bench AI and clinical decision-making; what makes a discovery-stage AI claim relevant to the clinic |
| Synthetic biologist | Design tools, autonomous lab integration, dual-use considerations |
| Graduate student | Conceptual entry points into model classes; canonical citations; how to read benchmark results |
| Research program leader | Capital allocation framing; which capabilities are infrastructure-grade vs. research-grade; how to evaluate proposals that invoke AI |
If you have read this far, the handbook is also written for you.
How to Read the Rest of the Handbook
If you have 20 minutes
- Executive Summary: Handbook-wide conclusions
- Protein Structure Prediction: The landmark capability and its limits
- Evaluation Principles for Biomedical Discovery AI: The framework that turns capability into decision
If you have an hour
Add:
- Single-Cell Foundation Models: The capability frontier in cell biology
- Self-Driving Laboratories: The autonomous laboratory frontier
- Information Hazards in Capability Research: Dual-use considerations for design and generation tools
If you are doing a deep program review
Read the relevant Part end-to-end. Each chapter is self-contained but cross-references the others.
Practice Notes
- Name the biological object first. Sequence, structure, ligand, cell state, tissue, experiment, or clinical endpoint. The right model class depends on the object.
- Name the validation object second. What experiment, benchmark, or independent dataset would change your decision if the model were wrong?
- Do not equate a model score with biological truth. A high pLDDT, a low Tanimoto similarity, a strong attention weight: these are model outputs, not measurements.
- Treat every vendor claim as a claim about a specific data distribution until proven otherwise. A model that works on one cell line, one species, or one assay does not work on all of them.
- Read the license before scoping the project. A model you cannot run on your infrastructure is, for your program, not the state of the art.
- Cite by version and venue. “AlphaFold” without a version is ambiguous; an announcement is not a paper; a partnership is not an outcome.
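The last practice note lends itself to a mechanical check. The sketch below is one hypothetical way to encode it: the `Citation` fields, the venue categories, and the `used_as` labels are assumptions introduced for illustration, not a format the handbook prescribes.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical venue categories for this sketch.
VENUE_TYPES = {"peer-reviewed", "preprint", "announcement", "partnership"}


@dataclass
class Citation:
    system: str             # e.g. "AlphaFold"
    version: Optional[str]  # e.g. "2" or "3"; None if not stated
    venue_type: str         # one of VENUE_TYPES
    used_as: str            # what the citation is offered as evidence of,
                            # e.g. "capability" or "outcome"


def citation_problems(c: Citation) -> List[str]:
    """Return the practice-note violations found in one citation."""
    problems = []
    if not c.version:
        problems.append(f"'{c.system}' cited without a version")
    if c.venue_type not in VENUE_TYPES:
        problems.append(f"unknown venue type: {c.venue_type}")
    if c.venue_type == "announcement" and c.used_as == "capability":
        problems.append("an announcement is not a paper")
    if c.venue_type == "partnership" and c.used_as == "outcome":
        problems.append("a partnership is not an outcome")
    return problems
```

For example, `citation_problems(Citation("AlphaFold", None, "peer-reviewed", "capability"))` flags only the missing version, while a fully specified peer-reviewed citation returns an empty list.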
Common Questions
Is life sciences AI the same as “AI for science”?
Overlapping but not identical. “AI for science” is a broader phrase that includes physics, chemistry, materials, climate, mathematics, and other domains. This handbook focuses on the biomedical subset: proteins, cells, genomes, organisms, and the research workflows that connect them.
Do I need to know machine learning to use this handbook?
No. The handbook is written for working biologists, drug discovery scientists, and program leaders. It treats ML as a research instrument the same way it would treat a sequencer or a microscope: you should understand what the output means and what its limits are; you do not need to build the instrument from scratch. The Foundation Models for Biology and Evaluation Principles chapters provide enough ML context to read the rest of the handbook.
How often does the handbook update?
Continuously, as landmark papers and benchmark results appear in Nature, Science, Cell, and adjacent venues, and as frontier-lab and FDA actions warrant. The publication and modification dates are visible in each chapter’s metadata.
Why is dual-use treated as a separate chapter rather than woven through?
Because the questions in dual-use review (what to publish, what to release, how to communicate capability without enabling misuse) apply to many model classes at once. The dedicated chapter (Information Hazards in Capability Research) keeps the framework in one place. Cross-references in individual chapters point back to it.
What’s the difference between a foundation model and a task-specific model in biology?
A foundation model is pretrained on broad data (sequences, structures, cells) and adapted to many downstream tasks. A task-specific model is trained for one task. In life sciences, the foundation-model wave is real: ESM-2 / ESMFold for proteins, scGPT and Geneformer for single cells, Evo and Evo 2 for genomes. The wave is also recent and the generalization claims are still being validated. Foundation Models for Biology treats this in depth.
Are AI-discovered drugs real yet?
In the sense that some AI-discovered molecules have reached and progressed in clinical trials, yes: Insilico’s ISM001-055 (Phase IIa, company-reported) is a frequently cited example. In the sense that an AI-discovered molecule has been approved and has changed the standard of care, not yet. Read the evidence with the framework: a Phase IIa topline is a meaningful signal, not a registration-grade outcome. Translational Evidence and Failure Modes covers this in depth.
Cross-References
- Executive Summary: Handbook-wide conclusions
- Biological Data Infrastructure: The data layer underneath the models
- Foundation Models for Biology: Architectural lineage of current systems
- Evaluation Principles for Biomedical Discovery AI: The framework that turns capability into decision
- Protein Structure Prediction: The landmark demonstrated capability
- Information Hazards in Capability Research: Dual-use review
- Companion handbooks in this series: see “Explore the Handbook Series” on the welcome page